
CN111816157B - Music score intelligent sight-singing method and system based on speech synthesis - Google Patents


Info

Publication number
CN111816157B
CN111816157B
Authority
CN
China
Prior art keywords: music score, audio, abc, splicing, notes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN202010590726.0A
Other languages
Chinese (zh)
Other versions
CN111816157A (en)
Inventor
刘昆宏
吴清强
吴苏悦
张敬峥
詹旺平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen University
Original Assignee
Xiamen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen University filed Critical Xiamen University
Priority to CN202010590726.0A priority Critical patent/CN111816157B/en
Publication of CN111816157A publication Critical patent/CN111816157A/en
Application granted granted Critical
Publication of CN111816157B publication Critical patent/CN111816157B/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The invention provides a speech-synthesis-based method and system for intelligent music score sight-singing. The method comprises: step one, data preparation, inputting and parsing an abc score to obtain the pitch and duration of every note in that score; step two, parameter training, generating note sequences of length at most 5 when building the training data, i.e. dividing all notes of a complete abc score into groups of 5 when the score is processed; step three, synthesized-audio splicing, comprising three sub-steps of score segmentation, segment splicing, and waveform alignment with blank-segment filling; and step four, visually displaying the synthesized audio. The invention addresses the heavy computation of training, the audible seams and noise of direct concatenation, and related technical problems; comparing the generated audio with the original data, the difference is hard to distinguish by ear.

Description

A Method and System for Intelligent Music Score Sight-Singing Based on Speech Synthesis

Technical Field

The invention belongs to the field of computers and, in particular, relates to a speech-synthesis-based method and system for intelligent music score sight-singing.

Background Art

In recent years, with the continuous development of the Internet and the constant iteration of mobile and desktop applications, online education has grown steadily. Art subjects, however, are special: they rely heavily on communication and mutual feedback between teacher and student, and a student must receive timely feedback to recognize mistakes correctly. In music learning, for example, each student runs into different problems when simply reading and singing a score, and a teacher cannot attend to every student at the same time. Music is best learned with the eyes reading, the ears listening, and the mouth singing together; engaging multiple senses in this way leads to better learning of such sensory subjects.

The present invention concerns the sight-singing module of an online music teaching platform. The aim is to move score sight-singing from the traditional approach of having a professional teacher sing each score to having a computer simulate a human voice singing the score. This not only lets students practice any score flexibly, but also greatly helps teachers, or the platform, expand their database. The problem studied here is therefore to have a computer perform the score sight-singing task.

Score sight-singing is a problem similar to speech synthesis, and speech synthesis technology has long reached market maturity. A score can be viewed as a special language in which the computer must learn the pronunciation of each note. A score gives the tempo and other settings in its opening header, so this information must be carried into every note of the body; the note string then encodes the duration and pitch of each note. In practice, however, training data that is too large becomes untrainable, direct concatenation leaves clearly audible seams, and splicing introduces noise.

Summary of the Invention

The present invention provides a speech-synthesis-based method and system for intelligent music score sight-singing. It translates the score into a new language that fits the <text, pronunciation> mapping of a spoken language, and solves the technical problems of heavy training computation, audible seams from direct concatenation, and splicing noise; comparing the generated audio with the original data, the difference is hard to distinguish by ear.

To solve the above problems, the present invention provides a speech-synthesis-based method for intelligent music score sight-singing, the method comprising:

Step 1, data preparation: input and parse an abc score to obtain the pitch and duration of every note in that score;

Step 2, parameter training: when building the training data, generate note sequences of length at most 5; that is, when processing a complete abc score, divide all of its notes into groups of 5;

Step 3, synthesized-audio splicing, comprising three sub-steps: score segmentation, segment splicing, and waveform alignment with blank-segment filling;

Step 4: visually display the synthesized audio.

In a second aspect, an embodiment of the present application provides a speech-synthesis-based system for intelligent music score sight-singing, the system comprising:

a data preparation module for inputting and parsing an abc score to obtain the pitch and duration of every note in that score;

a parameter training module which, when building the training data, generates note sequences of length at most 5; that is, when processing a complete abc score, it divides all of the notes into groups of 5;

a synthesized-audio splicing module, containing a score segmentation sub-module, a segment splicing sub-module, and a waveform alignment and blank-segment filling sub-module;

a visualization module for visually displaying the synthesized audio.

In a third aspect, an embodiment of the present application provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method described in the embodiments of the present application.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the method described in the embodiments of the present application is implemented.

Brief Description of the Drawings

Fig. 1 is a flowchart of the speech-synthesis-based intelligent score sight-singing of the present invention;

Fig. 2 shows the result of audio synthesized from a score according to the present invention.

Detailed Description of Embodiments

To state the purpose, technical process, and innovations of the invention more clearly, the invention is described in further detail below with reference to the drawings and examples. It should be understood that the specific embodiments described here only explain the invention and do not limit it. Furthermore, the technical features of the embodiments described below may be combined with one another as long as they do not conflict.

To achieve the above purpose, the present invention provides a speech-synthesis-based method for intelligent music score sight-singing. The main flow is shown in Fig. 1; the method comprises:

Step 1, data preparation: input and parse an abc score to obtain the pitch and duration of every note in that score;

The data to be processed by this application is an abc file, which contains two parts: the first five lines give the tempo, key, and other score settings, and the remaining content gives the notes. Unlike a spoken language, in which every word has independent pronunciation rules, a score specifies at its head how the whole score is to be sung. Therefore, when processing an abc score, a similar treatment is needed: the header information must be carried into every note.
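The header-to-note propagation described above can be sketched as follows. This is a minimal illustrative parser, not the patent's implementation; the helper names (`parse_abc`, `annotate_notes`) and the default field values are assumptions, and the header fields follow standard abc notation.

```python
def parse_abc(text):
    """Split an abc score into header fields and a note string.

    Assumed layout: header lines like "X:", "T:", "M:", "L:", "K:",
    followed by one or more lines of note symbols.
    """
    header, notes = {}, []
    for line in text.strip().splitlines():
        if len(line) > 1 and line[1] == ":" and line[0].isalpha():
            header[line[0]] = line[2:].strip()
        else:
            notes.append(line.strip())
    return header, "".join(notes)

def annotate_notes(header, notes):
    """Attach the header's unit note length and key to every note symbol,
    mimicking the step of 'carrying the header info into each note'."""
    unit = header.get("L", "1/8")   # default unit note length (assumed)
    key = header.get("K", "C")
    return [(sym, unit, key) for sym in notes if sym.isalpha() or sym == "r"]

score = "X:1\nT:demo\nM:4/4\nL:1/8\nK:C\nCDEF GABc"
hdr, note_str = parse_abc(score)
print(hdr["K"])                           # "C"
print(annotate_notes(hdr, note_str)[:2])  # [('C','1/8','C'), ('D','1/8','C')]
```

Each note symbol now carries the global header settings, so later stages can treat a note like a word whose pronunciation rules are self-contained.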

Step 2, parameter training: when building the training data, generate note sequences of length at most 5; that is, when processing a complete abc score, divide all of its notes into groups of 5;

Step 3, synthesized-audio splicing, comprising three sub-steps: score segmentation, segment splicing, and waveform alignment with blank-segment filling;

Step 4: visually display the synthesized audio.

Preferably, the data preparation step further comprises: each note is specified by three parts, duration, note name, and pitch. The shortest duration unit is 1/8 beat, and note durations are 1/8, 1/4, 3/8, 1/2, 3/4, and 1 beat; the pitch range is f3 to f#5. Because collecting human-voice data is genuinely difficult, Vocaloid software is used to synthesize audio in place of a human voice. A flat is represented as the sharp of the note one semitone below, so that, for accurate labeling, all "black-key" tones are uniformly written as sharps. This combination carries the score header information into every note and thus solves the abc score translation problem.
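A quick enumeration of the label space implied by these settings: six durations crossed with the chromatic pitches from f3 to f#5, black keys written as sharps. The encoding below is illustrative only; the patent does not disclose its exact label format.

```python
DURATIONS = ["1/8", "1/4", "3/8", "1/2", "3/4", "1"]   # in beats, per the text
NAMES = ["f", "f#", "g", "g#", "a", "a#", "b",
         "c", "c#", "d", "d#", "e"]                    # sharps only, no flats

def pitch_range(lo=("f", 3), hi=("f#", 5)):
    """All chromatic pitches from lo to hi inclusive, black keys as sharps."""
    out, octave, i = [], lo[1], NAMES.index(lo[0])
    while True:
        name = NAMES[i]
        out.append(f"{name}{octave}")
        if (name, octave) == hi:
            return out
        i += 1
        if i == len(NAMES):
            i = 0
        if NAMES[i] == "c":
            octave += 1        # octave number increments at c

labels = [(d, p) for d in DURATIONS for p in pitch_range()]
print(len(pitch_range()))      # 26 semitones from f3 through f#5
print(len(labels))             # 6 durations x 26 pitches = 156
```

A label space this small is what makes the 5-note grouping tractable: each training utterance is a short sequence over roughly 156 symbols rather than an arbitrary score.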

Preferably, the training process further comprises: this application used a script to generate 11,000 samples in total, split into 10,000 training samples and 1,000 test samples used to evaluate the model; testing is done by human listening, judging whether durations and pronunciation are accurate. Every 1,000 training steps a checkpoint and the current encoder-decoder alignment are produced, and interrupted training can be resumed from a checkpoint. Both '#' and Arabic numerals are replaced by English letters; the measure length remains fixed, data of variable length is used when the pronounced word is very short, and the 'r' sound is added directly into the synthesized audio during later splicing.
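The split and the letters-only substitution might look like the following sketch. The substitution table here is invented for illustration, since the patent does not list the actual letter mapping, and the corpus is a placeholder.

```python
import random

# Hypothetical mapping: '#' and each digit get an arbitrary letter so that
# every token in a label is alphabetic.
SUBST = {"#": "s", "0": "z", "1": "o", "2": "t", "3": "h",
         "4": "f", "5": "v", "6": "x", "7": "n", "8": "e", "9": "q"}

def letters_only(label):
    """Replace '#' and Arabic numerals with English letters."""
    return "".join(SUBST.get(ch, ch) for ch in label)

data = [f"note_{i}" for i in range(11000)]   # placeholder for the 11,000 samples
rng = random.Random(0)                       # fixed seed for repeatability
rng.shuffle(data)
train, test = data[:10000], data[10000:]

print(len(train), len(test))     # 10000 1000
print(letters_only("f#5 1/2"))   # "fsv o/t"
```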

An 80-band mel-scale spectrogram is used as the decoder's target output, with a sample rate of 20,000 Hz, a frame length of 50 ms, and a frame shift of 12.5 ms. The optimizer is standard Adam with a batch size of 32; decaying the learning rate effectively reduces overfitting.
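The spectrogram settings above translate directly into STFT parameters at the 20,000 Hz sample rate; a quick arithmetic check, with no audio library needed:

```python
SAMPLE_RATE = 20000           # Hz, per the text
FRAME_MS, HOP_MS = 50, 12.5   # frame length and frame shift in milliseconds
N_MELS = 80                   # mel bands targeted by the decoder

frame_len = int(SAMPLE_RATE * FRAME_MS / 1000)   # samples per frame
hop_len = int(SAMPLE_RATE * HOP_MS / 1000)       # samples per hop

def n_frames(n_samples):
    """Frame count under a center-padded STFT convention (assumed)."""
    return 1 + n_samples // hop_len

print(frame_len, hop_len)     # 1000 250
print(n_frames(SAMPLE_RATE))  # 81 frames for one second of audio
```

So each decoder target frame covers 1000 samples with a 250-sample hop, i.e. 75% overlap between adjacent frames.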

Preferably, the synthesized-audio splicing specifically comprises:

Score segmentation: an abc file is split into measures of 5 notes each, with an additional split at every 'r' symbol; the remaining notes form the final measure. Each measure is then processed by the trained model to synthesize audio, producing one wav file per measure;
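The segmentation rule above can be sketched as a small function: cut a segment every 5 notes or at each 'r' rest, and keep whatever remains as the final measure. Each resulting segment would then be fed to the trained model.

```python
def segment(notes, group=5):
    """Split a note sequence into measures of `group` notes,
    also cutting at every 'r' rest symbol."""
    segments, cur = [], []
    for n in notes:
        if n == "r":
            if cur:
                segments.append(cur)
            cur = []                 # the rest itself is handled at splice time
        else:
            cur.append(n)
            if len(cur) == group:
                segments.append(cur)
                cur = []
    if cur:
        segments.append(cur)         # leftover notes form the last measure
    return segments

notes = list("CDEFG") + ["r"] + list("ABCDEFG")
print(segment(notes))
# [['C','D','E','F','G'], ['A','B','C','D','E'], ['F','G']]
```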

Segment splicing: the per-measure synthesized audio segments are concatenated;

Waveform alignment and blank-segment filling: for a continuously sung score, rough concatenation leaves obvious seams. Simple concatenation in an audio editor reveals two basic problems. The first is head and tail trimming: the synthesized audio has a short blank interval at its beginning and end that must be trimmed; since this blank time is almost identical across segments, it can be handled uniformly. The second is the popping sound of rough concatenation, which arises mainly when a voiced tone follows another note, i.e. one of the four tones "do", "re", "mi", and "la"; zooming into the waveform shows that the noise is caused by an abrupt jump in the waveform at the joint. To solve these problems, this application adopts the following scheme: in an abc score, 'r' denotes an unvoiced blank segment; when synthesizing the audio, a blank-duration audio clip is inserted directly as a buffer band where needed, and sox is used to assemble the measures. An 'r' sound consists of two parts, its duration and 'r' itself; the corresponding 'r' audio is selected according to the abc score, and each joint is processed and output with a very short 'r' sound that removes the noise without affecting the perceived duration. This method removes the noise surprisingly well. After testing, as shown in Fig. 2, zooming into the synthesized result reveals the blank 'r' segments added at the joints, with no audible artifact; the result is indistinguishable from directly synthesized audio.
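A minimal sketch of this splice, operating on raw sample lists rather than sox files: trim the fixed head/tail blank from each synthesized segment, then join segments through a very short silent 'r' buffer to remove the click caused by the waveform jump at the joint. The 50 ms trim and 10 ms buffer lengths are assumed values, not taken from the patent.

```python
SR = 20000                    # sample rate used by the model above
TRIM = int(0.05 * SR)         # assumed fixed head/tail blank: 1000 samples
BUFFER = int(0.01 * SR)       # very short silent 'r' buffer: 200 samples

def splice(segments):
    """Trim each segment's head/tail blank, then join with silent buffers."""
    silence = [0.0] * BUFFER
    trimmed = [s[TRIM:len(s) - TRIM] for s in segments]
    out = []
    for i, s in enumerate(trimmed):
        out += s
        if i < len(trimmed) - 1:
            out += silence    # buffer band between adjacent measures
    return out

a = [1.0] * SR                # two dummy 1-second segments
b = [1.0] * SR
out = splice([a, b])
print(len(out))               # 2 * (20000 - 2000) + 200 = 36200
```

In the actual pipeline the same steps would run over wav files via sox; operating on sample arrays here just makes the trim/buffer arithmetic explicit.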

As another aspect, the present application also provides a speech-synthesis-based system for intelligent music score sight-singing, the system comprising:

a data preparation module for inputting and parsing an abc score to obtain the pitch and duration of every note in that score;

a parameter training module which, when building the training data, generates note sequences of length at most 5; that is, when processing a complete abc score, it divides all of the notes into groups of 5;

a synthesized-audio splicing module, containing a score segmentation sub-module, a segment splicing sub-module, and a waveform alignment and blank-segment filling sub-module;

a visualization module for visually displaying the synthesized audio.

Preferably, the data preparation module further comprises: each note is specified by three parts, duration, note name, and pitch. The shortest duration unit is 1/8 beat, and note durations are 1/8, 1/4, 3/8, 1/2, 3/4, and 1 beat; the pitch range is f3 to f#5. Because collecting human-voice data is genuinely difficult, Vocaloid software is used to synthesize audio in place of a human voice; a flat is represented as the sharp of the note one semitone below, so that, for accurate labeling, all "black-key" tones are uniformly written as sharps.

Preferably, the parameter training module is further configured to:

produce a checkpoint and the current encoder-decoder alignment every 1,000 training steps, so that interrupted training can be resumed from a checkpoint;

replace both '#' and Arabic numerals with English letters; the measure length remains fixed, data of variable length is used when the pronounced word is very short, and the 'r' sound is added directly into the synthesized audio during later splicing.

Preferably, the synthesized-audio splicing module specifically comprises:

a score segmentation sub-module for splitting an abc file into measures of 5 notes each, with an additional split at every 'r' symbol, the remaining notes forming the final measure; each measure is then processed by the trained model to synthesize audio, producing one wav file per measure;

a segment splicing sub-module for concatenating the per-measure synthesized audio segments;

a waveform alignment and blank-segment filling sub-module: in an abc score, 'r' denotes an unvoiced blank segment; when synthesizing the audio, a blank-duration audio clip is inserted directly as a buffer band where needed, and sox is used to assemble the measures. An 'r' sound consists of two parts, its duration and 'r' itself; the corresponding 'r' audio is selected according to the abc score, and each joint is processed and output with a very short 'r' sound that removes the noise without affecting the perceived duration.

As another aspect, the present application also provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method described in the embodiments of the present application.

As another aspect, the present application also provides a computer-readable storage medium, which may be the computer-readable storage medium contained in the apparatus of the above embodiments, or may exist separately without being assembled into a device. The computer-readable storage medium stores one or more programs, which are used by one or more processors to execute the method described in the embodiments of the present application.

It should be understood that each part of the present application may be realized in hardware, software, firmware, or a combination thereof. In the above embodiments, steps or methods may be implemented by software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any of the following techniques known in the art, or a combination of them, may be used: discrete logic circuits with logic gates implementing logic functions on data signals, application-specific integrated circuits with suitable combinational logic gates, programmable gate arrays (PGA), field-programmable gate arrays (FPGA), and so on.

A person of ordinary skill in the art will understand that all or part of the steps of the above method embodiments can be completed by instructing the relevant hardware through a program; the program may be stored in a computer-readable storage medium and, when executed, includes one of the steps of the method embodiments or a combination thereof.

In addition, the functional units in the embodiments of the present application may be integrated into one processing module, may exist physically as separate units, or two or more units may be integrated into one module. The integrated module may be implemented in hardware or as a software functional module; if implemented as a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like. Although embodiments of the present application have been shown and described above, it should be understood that they are exemplary and not limiting; a person of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present application.

Claims (8)

1. A speech-synthesis-based method for intelligent music score sight-singing, comprising the following steps:
step one, data preparation: inputting and parsing an abc score to obtain the pitch and duration of every note in that score;
step two, parameter training: generating note sequences of length at most 5 when building the training data, i.e. dividing all notes of a complete abc score into groups of 5 when the complete abc score is processed;
step three, synthesized-audio splicing, comprising three sub-steps: score segmentation, segment splicing, and waveform alignment with blank-segment filling;
step four, visually displaying the synthesized audio;
wherein the synthesized-audio splicing specifically comprises:
score segmentation: splitting the abc file into measures of 5 notes each, additionally splitting at every 'r' symbol, the remaining notes forming a final measure; processing each measure with a trained model to synthesize audio, and generating a wav file for each measure;
segment splicing: concatenating the per-measure synthesized audio;
waveform alignment and blank-segment filling: 'r' in the abc score denotes an unvoiced blank segment; when synthesizing the audio, a blank-duration audio clip is directly added as a buffer band where needed, sox is used to assemble the measures, the 'r' sound consists of two parts, namely the pronunciation length and 'r', the corresponding 'r' audio is selected according to the abc score for processing, and each joint is processed and output with a very short 'r' sound that eliminates the noise without affecting the perceived duration.
2. The method of claim 1, wherein the data preparation step further comprises: each note is specified by three parts, duration, note name, and pitch; the shortest duration unit is 1/8 beat; note durations are 1/8, 1/4, 3/8, 1/2, 3/4, and 1 beat; the pitch range is f3 to f#5; because human-voice data is difficult to collect, Vocaloid software is used to synthesize audio in place of a human voice; a flat is expressed as the sharp of the note one semitone below, and, for accurate labeling, the tones of all black keys are uniformly expressed as sharps.
3. The method of claim 1, wherein the training process further comprises:
generating a checkpoint and the current encoder-decoder alignment every 1,000 training steps, so that interrupted training can be resumed from the checkpoint;
replacing both '#' and Arabic numerals with English letters, the measure length remaining fixed; if the pronounced word is short, data of variable length is used, and the 'r' sound is added directly into the synthesized audio in subsequent splicing.
4. A speech-synthesis-based system for intelligent music score sight-singing, the system comprising:
a data preparation module for inputting and parsing an abc score to obtain the pitch and duration of every note in that score;
a parameter training module which generates note sequences of length at most 5 when building the training data, i.e. divides all notes of a complete abc score into groups of 5 when the complete abc score is processed;
a synthesized-audio splicing module comprising a score segmentation sub-module, a segment splicing sub-module, and a waveform alignment and blank-segment filling sub-module;
a visualization module for visually displaying the synthesized audio;
wherein the synthesized-audio splicing module specifically comprises:
the score segmentation sub-module, for splitting the abc file into measures of 5 notes each, additionally splitting at every 'r' symbol, the remaining notes forming a final measure; processing each measure with a trained model to synthesize audio, and generating a wav file for each measure;
the segment splicing sub-module, for concatenating the per-measure synthesized audio;
the waveform alignment and blank-segment filling sub-module, wherein 'r' in the abc score denotes an unvoiced blank segment; when synthesizing the audio, a blank-duration audio clip is directly added as a buffer band where needed, sox is used to assemble the measures, the 'r' sound consists of two parts, namely the pronunciation length and 'r', the corresponding 'r' audio is selected according to the abc score for processing, and each joint is processed and output with a very short 'r' sound that eliminates the noise without affecting the perceived duration.
5. The system of claim 4, wherein the data preparation module further comprises: each note is specified by three parts, duration, note name, and pitch; the shortest duration unit is 1/8 beat; note durations are 1/8, 1/4, 3/8, 1/2, 3/4, and 1 beat; the pitch range is f3 to f#5; because human-voice data is difficult to collect, Vocaloid software is used to synthesize audio in place of a human voice; a flat is expressed as the sharp of the note one semitone below, and, for accurate labeling, the tones of all black keys are uniformly expressed as sharps.
6. The system of claim 4, wherein the training parameter module is further configured to:
the alignment condition of the check point and the current encoder and decoder can be generated every 1000 training steps, the training is interrupted in the midway, and the training is recovered from the check point again;
the number # and Arabic numerals are replaced by English letters, the lengths of the bars are still fixed, if the pronunciation words are short, data with unfixed lengths are used, and the 'r' sound is directly added into the synthesized audio in the subsequent synthesis.
7. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 3 when executing the computer program.
8. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1 to 3.
CN202010590726.0A 2020-06-24 2020-06-24 Music score intelligent video-singing method and system based on voice synthesis Expired - Fee Related CN111816157B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010590726.0A CN111816157B (en) 2020-06-24 2020-06-24 Music score intelligent video-singing method and system based on voice synthesis

Publications (2)

Publication Number Publication Date
CN111816157A CN111816157A (en) 2020-10-23
CN111816157B true CN111816157B (en) 2023-01-31

Family

ID=72854997

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010590726.0A Expired - Fee Related CN111816157B (en) 2020-06-24 2020-06-24 Music score intelligent video-singing method and system based on voice synthesis

Country Status (1)

Country Link
CN (1) CN111816157B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111816148B (en) * 2020-06-24 2023-04-07 厦门大学 Virtual human voice and video singing method and system based on generation countermeasure network
CN114078474A (en) * 2021-11-09 2022-02-22 京东科技信息技术有限公司 Voice conversation processing method and device based on multi-modal characteristics and electronic equipment

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN2101314U (en) * 1991-05-09 1992-04-08 北京电子专科学校 Numbered musical notation recorded automatically playing music apparatus
CN103902647A (en) * 2013-12-27 2014-07-02 上海斐讯数据通信技术有限公司 Music score identifying method used on intelligent equipment and intelligent equipment
CN104978884A (en) * 2015-07-18 2015-10-14 呼和浩特职业学院 Teaching system of preschool education profession student music theory and solfeggio learning
US10134300B2 (en) * 2015-10-25 2018-11-20 Commusicator Ltd. System and method for computer-assisted instruction of a music language
JP6610715B1 (en) * 2018-06-21 2019-11-27 カシオ計算機株式会社 Electronic musical instrument, electronic musical instrument control method, and program
CN110148394B (en) * 2019-04-26 2024-03-01 平安科技(深圳)有限公司 Singing voice synthesizing method, singing voice synthesizing device, computer equipment and storage medium
CN110738980A (en) * 2019-09-16 2020-01-31 平安科技(深圳)有限公司 Singing voice synthesis model training method and system and singing voice synthesis method



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20230131