
CN114420084A - Voice generation method, device and storage medium - Google Patents


Info

Publication number
CN114420084A
CN114420084A
Authority
CN
China
Prior art keywords
text
information
target
symbol
prosody
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111493700.5A
Other languages
Chinese (zh)
Inventor
杨扬
邹一新
Current Assignee
Zebred Network Technology Co Ltd
Original Assignee
Zebred Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Zebred Network Technology Co Ltd
Priority to CN202111493700.5A
Publication of CN114420084A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 - Prosody rules derived from text; Stress or intonation
    • G10L2013/021 - Overlap-add techniques
    • G10L2013/083 - Special characters, e.g. punctuation marks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to a speech generation method, device, and storage medium. The method includes: performing text analysis on a text to be processed to obtain text information for each text unit in the text; determining target prosody information for each text unit according to the text information and the symbol information of the punctuation marks contained in the text; and generating target speech according to the text information and the target prosody information. By combining the symbol information of the punctuation marks with the text information, first, the punctuation marks in the text to be processed can influence the generated target speech, making it better match the emotional context in which the text was written; second, introducing the symbol information of punctuation makes the target speech better fit the context of the text, so that it sounds more natural and realistic and closer to natural spoken expression.

Description

Voice generation method, device and storage medium

Technical Field

The present application relates to the field of computer technology, and in particular to a speech generation method, device, and storage medium.

Background Art

Voice interaction is the most direct, natural, and effective way for people to convey information. With the rapid development of electronic devices such as mobile phones in recent years, new voice interaction methods have become a research focus in computer science, linguistics, and communications. Voice interaction is a human-machine dialogue technology based on speech recognition, natural language understanding, and speech synthesis; speech synthesis is one of its cores, and its purpose is to enable electronic devices to produce natural-sounding speech like a human. Therefore, how to make mobile phones, computers, and other electronic devices "speak" just like people, broadcasting natural and fluent speech with an emotional tone so that the broadcast speech better matches the tone and context of the input text, is a problem that urgently needs to be solved.

Summary of the Invention

To overcome the problems in the related art, the present application provides a speech generation method, device, and storage medium.

According to a first aspect of the embodiments of the present application, a speech generation method is provided, including:

performing text analysis on a text to be processed to obtain text information for each text unit in the text to be processed;

determining target prosody information for each text unit according to the text information and the symbol information of the punctuation marks contained in the text to be processed; and

generating target speech according to the text information and the target prosody information.

In some embodiments, the text information includes text phonemes and text tones, and determining the target prosody information of each text unit according to the text information and the symbol information of the punctuation marks contained in the text to be processed includes:

determining text prosody information for each text unit according to its text phonemes and text tones;

determining symbol prosody information for each text unit according to the symbol information; and

determining the target prosody information according to the text prosody information and the symbol prosody information.

In some embodiments, the symbol information includes a symbol position, and determining the symbol prosody information of each text unit according to the symbol information includes:

determining, according to the symbol position of each punctuation mark, the text units on which that punctuation mark acts, where one punctuation mark acts on at least one text unit; and

determining the symbol prosody information of the text units on which the punctuation marks act.

In some embodiments, the symbol information includes a symbol type and a symbol count, and determining the symbol prosody information of each text unit according to the symbol information includes:

determining a tone type according to the symbol type of the punctuation marks;

determining a tone degree according to the number of punctuation marks; and

determining the symbol prosody information according to the tone type and/or the tone degree.

In some embodiments, generating the target speech according to the text information and the target prosody information includes:

matching the text information and the target prosody information against preset text information and preset prosody information in a preset speech library, where the preset speech library stores audio segments and the mapping between the preset text information and preset prosody information and the audio segments; and

generating the target speech according to the text information and the target prosody information using the speech generation strategy corresponding to the matching result, where different matching results correspond to different speech generation strategies.

In some embodiments, generating the target speech according to the text information and the target prosody information using the speech generation strategy corresponding to the matching result includes:

when the matching result indicates that the text information and the target prosody information exist in the preset speech library, determining from the preset speech library the audio segments corresponding to the text information and the target prosody information; and

splicing the audio segments to generate the target speech.

In some embodiments, generating the target speech according to the text information and the target prosody information using the speech generation strategy corresponding to the matching result includes:

when the matching result indicates that the text information and the target prosody information do not exist in the preset speech library, performing feature conversion on the text information and the target prosody information of each text unit to obtain acoustic features; and

decoding the acoustic features to generate the target speech.
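The two matching-dependent strategies can be sketched as one dispatch routine. The library keys, the stand-in audio bytes, and the `synthesize` fallback below are illustrative assumptions rather than the patent's implementation:

```python
# Dispatch between the two speech generation strategies described above.
# Keys pair text information with prosody information; values stand in
# for stored audio segments.
SPEECH_LIBRARY = {
    ("ni3 hao3", "interrogative"): b"\x01\x02",
    ("zai4 jian4", "neutral"): b"\x03\x04",
}

def synthesize(text_info, prosody_info):
    """Fallback strategy: feature conversion + decoding (stand-in only)."""
    return b"synth:" + text_info.encode()

def generate(units):
    """units: list of (text_info, prosody_info) pairs, one per text unit."""
    segments = []
    for text_info, prosody_info in units:
        segment = SPEECH_LIBRARY.get((text_info, prosody_info))
        if segment is None:            # no match: fall back to acoustic model
            segment = synthesize(text_info, prosody_info)
        segments.append(segment)
    return b"".join(segments)          # splice the segments together

audio = generate([("ni3 hao3", "interrogative"), ("shi4 jie4", "neutral")])
```

Here the first unit is found in the library and reused directly, while the second falls through to synthesis; concatenating the segments corresponds to the splicing step.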

According to a second aspect of the embodiments of the present application, a speech generation apparatus is provided, including:

an analysis module, configured to perform text analysis on a text to be processed to obtain text information for each text unit in the text to be processed;

a determination module, configured to determine target prosody information for each text unit according to the text information and the symbol information of the punctuation marks contained in the text to be processed; and

a generation module, configured to generate target speech according to the text information and the target prosody information.

In some embodiments, the text information includes text phonemes and text tones, and the determination module is configured to:

determine text prosody information for each text unit according to its text phonemes and text tones;

determine symbol prosody information for each text unit according to the symbol information; and

determine the target prosody information according to the text prosody information and the symbol prosody information.

In some embodiments, the symbol information includes a symbol position, and the determination module is configured to:

determine, according to the symbol position of each punctuation mark, the text units on which that punctuation mark acts, where one punctuation mark acts on at least one text unit; and

determine the symbol prosody information of the text units on which the punctuation marks act.

In some embodiments, the symbol information includes a symbol type and a symbol count, and the determination module is configured to:

determine a tone type according to the symbol type of the punctuation marks;

determine a tone degree according to the number of punctuation marks; and

determine the symbol prosody information according to the tone type and/or the tone degree.

In some embodiments, the generation module is configured to:

match the text information and the target prosody information against preset text information and preset prosody information in a preset speech library, where the preset speech library stores audio segments and the mapping between the preset text information and preset prosody information and the audio segments; and

generate the target speech according to the text information and the target prosody information using the speech generation strategy corresponding to the matching result, where different matching results correspond to different speech generation strategies.

In some embodiments, the generation module is configured to:

when the matching result indicates that the text information and the target prosody information exist in the preset speech library, determine from the preset speech library the audio segments corresponding to the text information and the target prosody information; and

splice the audio segments to generate the target speech.

In some embodiments, the generation module is configured to:

when the matching result indicates that the text information and the target prosody information do not exist in the preset speech library, perform feature conversion on the text information and the target prosody information of each text unit to obtain acoustic features; and

decode the acoustic features to generate the target speech.

According to a third aspect of the embodiments of the present application, a speech generation apparatus is provided, including:

a processor; and

a memory configured to store processor-executable instructions;

wherein the processor is configured to implement, upon execution, the steps of any of the speech generation methods of the first aspect.

According to a fourth aspect of the embodiments of the present application, a non-transitory computer-readable storage medium is provided; when the instructions in the storage medium are executed by a processor of a speech generation apparatus, the apparatus is enabled to perform the steps of any of the speech generation methods of the first aspect.

The technical solutions provided by the embodiments of the present application may have the following beneficial effects:

The present application determines the text information of each text unit in a text to be processed and the symbol information of the punctuation marks contained in that text, uses both together to determine the target prosody information of each text unit, and then generates the target speech from the text information and the target prosody information. By combining the symbol information of the punctuation marks with the text information, first, the punctuation marks in the text to be processed can influence the generated target speech, making it better match the emotional context in which the text was written; second, introducing the symbol information of punctuation makes the target speech better fit the context of the text, so that it sounds more natural and realistic and closer to natural spoken expression.

It should be understood that the above general description and the following detailed description are merely exemplary and explanatory and do not limit the present application.

Brief Description of the Drawings

The accompanying drawings, which are incorporated into and constitute a part of this specification, illustrate embodiments consistent with the present application and, together with the description, serve to explain the principles of the present application.

FIG. 1 is a flowchart of a speech generation method according to an exemplary embodiment of the present application.

FIG. 2 is a schematic diagram of a speech synthesis system according to an exemplary embodiment of the present application.

FIG. 3 is a block diagram of a speech generation apparatus according to an exemplary embodiment of the present application.

FIG. 4 is a block diagram of the hardware structure of a speech generation apparatus according to an exemplary embodiment of the present application.

FIG. 5 is a block diagram of the hardware structure of a speech generation apparatus according to an exemplary embodiment of the present application.

Detailed Description

Exemplary embodiments will be described in detail here, with examples illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application; rather, they are merely examples of apparatuses and methods consistent with some aspects of the present application as detailed in the appended claims.

FIG. 1 is a flowchart of a speech generation method according to an exemplary embodiment. As shown in FIG. 1, the method mainly includes the following steps:

In step 101, text analysis is performed on a text to be processed to obtain text information for each text unit in the text to be processed.

In step 102, target prosody information for each text unit is determined according to the text information and the symbol information of the punctuation marks contained in the text to be processed.

In step 103, target speech is generated according to the text information and the target prosody information.
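The three steps can be sketched as a toy pipeline. All function names, data structures, and the whitespace-based segmentation below are illustrative assumptions, not the patent's implementation:

```python
# Illustrative sketch of steps 101-103 (names and structures are assumed).

def analyze_text(text):
    """Step 101: split the text into units and attach per-unit info."""
    units = text.replace("?", "").replace("!", "").split()
    return [{"unit": u} for u in units]

def predict_prosody(text_info, symbol_info):
    """Step 102: derive per-unit target prosody from text + punctuation."""
    boost = 1.5 if symbol_info.get("type") == "?" else 1.0
    return [{"unit": ti["unit"], "pitch_scale": boost} for ti in text_info]

def generate_speech(text_info, prosody):
    """Step 103: stand-in for waveform generation."""
    return [(ti["unit"], p["pitch_scale"]) for ti, p in zip(text_info, prosody)]

text = "are you in class?"
info = analyze_text(text)
speech = generate_speech(info, predict_prosody(info, {"type": "?"}))
```

The point of the sketch is the data flow: symbol information enters only at step 102, where it modifies the prosody that step 103 then consumes together with the text information.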

In some embodiments, the speech generation method of the present application may be applied to an electronic device. Here, the electronic device may include a terminal device, such as a mobile terminal, a fixed terminal, or a vehicle-mounted terminal. A mobile terminal may include a mobile phone, a tablet computer, a notebook computer, or a wearable device, and may also include smart home devices such as smart speakers. A fixed terminal may include a desktop computer or a smart TV. A vehicle-mounted terminal may include the front-end device of a vehicle monitoring and management system, also called a Telematics Control Unit (TCU) terminal, such as an in-vehicle head unit. A vehicle-mounted terminal can integrate Global Positioning System (GPS) technology, mileage positioning technology, and vehicle black-box technology, and can be used for modern vehicle management, including driving safety monitoring, operations management, service quality management, intelligent centralized dispatching, and electronic stop-sign control.

In the embodiments of the present application, the text to be processed may refer to the text currently requiring processing; the text may contain text content and punctuation marks. A text may be a single sentence or a combination of sentences with complete, coherent meaning and punctuation; it may be a sentence, a paragraph, or a passage, such as a short text (a short sentence, proverb, aphorism, or title) or a long text (an article or document). The text to be processed may be represented in forms such as a vector sequence (for example, but not limited to, [I do not like the story of the movie, very much!]), which is not specifically limited in the embodiments of the present application. A text unit may refer to a unit contained in the text to be processed that has its own attributes; for example, a text unit may be a single character or word of the text. For example, if the text to be processed is "张三乘高铁前往某市。" ("Zhang San takes the high-speed train to a certain city."), the text units it contains are, in order, "张三" (Zhang San), "乘" (takes), "高铁" (the high-speed train), "前往" (travels to), and "某市" (a certain city).

Text information may refer to information representing the attributes of a text unit, such as information identifying its semantics, pronunciation, and other attributes. Text information may include at least one of the following: phoneme information, tone information, position information, grammatical information, pronunciation duration, and frequency information. For example, if the text to be processed is Chinese text (e.g., 有一个道理..., "there is a principle..."), the electronic device may perform phoneme conversion on the Chinese text to obtain a phoneme sequence (i.e., a continuous pinyin string, such as [you yi ge dao li...]), and then perform tone annotation on the phoneme sequence to obtain a tone-annotated phoneme sequence (such as [you3 yi1 ge4 dao4 li3...]). The electronic device may use the tone-annotated phoneme sequence as the text information.
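As an illustration of the phoneme conversion and tone annotation described above, a hand-written lookup covering only this example might look as follows; a real system would use a full pronunciation lexicon with polyphone disambiguation, so the dictionary here is purely an assumption:

```python
# Toy pinyin lexicon for the example 有一个道理 ("there is a principle").
# A production system would use a complete lexicon, not this dictionary.
LEXICON = {"有": "you3", "一": "yi1", "个": "ge4", "道": "dao4", "理": "li3"}

def to_phonemes(text):
    """Phoneme conversion: characters -> toneless pinyin syllables."""
    return [LEXICON[ch][:-1] for ch in text if ch in LEXICON]

def annotate_tones(text):
    """Tone annotation: keep the tone digit attached to each syllable."""
    return [LEXICON[ch] for ch in text if ch in LEXICON]

print(to_phonemes("有一个道理"))    # ['you', 'yi', 'ge', 'dao', 'li']
print(annotate_tones("有一个道理"))  # ['you3', 'yi1', 'ge4', 'dao4', 'li3']
```

The tone-annotated output corresponds to the [you3 yi1 ge4 dao4 li3...] sequence in the text above.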

In some embodiments, the electronic device may obtain the text information of each text unit in the text to be processed by performing text analysis on it. Text analysis (also called linguistic analysis, which mainly simulates how humans understand natural language) may refer to feature extraction or feature conversion performed on the text in order to obtain text information, and may mainly include text normalization, word segmentation, part-of-speech tagging, and polyphone disambiguation. Taking the sentence as the processing unit, the electronic device may analyze the text sentence by sentence at the lexical, grammatical, and semantic levels to determine the underlying structure of each sentence (for example, a syllable-based or character-based structure) and the phonemic composition of each character. Text analysis may also include sentence segmentation, character and word splitting, and the handling of numbers, characters, and abbreviations, yielding text information that the electronic device can understand.

Punctuation marks may refer to written symbols used to mark sentence boundaries and tone. They are auxiliary symbols for recording language in writing and an integral part of written language; they can indicate pauses, tone, and the nature and function of words, and may include marks and stops. In the embodiments of the present application, the text to be processed may include text content and punctuation marks, and the symbol information of the punctuation marks may act on the text content. Symbol information may refer to information characterizing the attributes of punctuation marks; for example, it may include the type, position, and other attributes of a punctuation mark, such as the symbol name, symbol position, and symbol count. Taking the text to be processed "在上课？" ("In class?") as an example, the symbol name of the punctuation mark can be determined to be a question mark and its position to be after the text unit "上课" ("in class"); the electronic device can therefore determine that the text unit "上课" should be broadcast in an interrogative tone.
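The symbol information discussed here (for example the symbol name and symbol position) could be extracted along the following lines; the record format is an assumption for illustration only:

```python
# Extract punctuation symbol information (symbol, name, position) from text.
PUNCT_NAMES = {"？": "question mark", "?": "question mark",
               "！": "exclamation mark", "!": "exclamation mark",
               "。": "full stop", ".": "full stop"}

def extract_symbol_info(text):
    """Return one record per punctuation mark found in the text."""
    info = []
    for i, ch in enumerate(text):
        if ch in PUNCT_NAMES:
            info.append({"symbol": ch, "name": PUNCT_NAMES[ch], "position": i})
    return info

marks = extract_symbol_info("在上课？")
# one record: the question mark at character index 3, after "上课"
```

The position field is what lets a later step decide which text units a mark acts on.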

Acoustically, the prosody of speech is manifested in pauses, pitch, duration, intensity, and similar characteristics, also known as suprasegmental features. When an electronic device broadcasts speech, the prosody is conveyed by speaking different text units with different degrees of emphasis, pacing, and intonation. Prosody information (also called prosodic parameters or a prosodic sequence) may refer to attribute information that characterizes these prosodic features; it may be represented in forms such as sequences or numerical values, and each text unit may correspond to one piece of target prosody information. The target prosody information represents the prosody information corresponding to each individual text unit in the text to be processed, as distinguished from prosody information corresponding to the text to be processed as a whole.

In some embodiments, the electronic device may determine the prosody information by performing prosody prediction on the text to be processed. Prosody prediction may be understood as processing the intonation, stress, duration, and pauses of the text to determine, for each text unit, information such as its specific pronunciation, pause positions, and stress patterns. When performing prosody prediction, the electronic device may process the text at different prosodic levels and obtain the prosody information corresponding to each level, which may include the prosodic word (PW) level, the prosodic phrase (PP) level, and the intonation phrase (IP) level; this is not specifically limited in this application.

In the embodiments of the present application, the electronic device may determine the target prosody information of each text unit according to the text information and the symbol information of the punctuation marks contained in the text to be processed; that is, the prosody information is determined jointly from the text information of the text units and the symbol information of the punctuation marks. In one possible embodiment, the electronic device may first generate first prosody information based on the text units, then generate second prosody information based on the symbol information, and then combine the two (for example, by adding the first prosody information and the second prosody information) to obtain the target prosody information. The electronic device may also generate the target prosody information directly from the text information and the symbol information; this is not specifically limited in this application.
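A minimal sketch of the additive combination mentioned above, assuming prosody information is represented as one small feature vector per text unit (the feature layout is invented for illustration):

```python
# Combine text-derived (first) and punctuation-derived (second) prosody
# by element-wise addition, one feature vector per text unit.
def combine_prosody(first, second):
    return [[a + b for a, b in zip(f, s)] for f, s in zip(first, second)]

first_prosody  = [[0.25, 1.0], [0.5, 0.75]]   # e.g. [pause, pitch] per unit
second_prosody = [[0.0, 0.25], [0.25, 0.25]]  # contribution of a question mark
target = combine_prosody(first_prosody, second_prosody)
# -> [[0.25, 1.25], [0.75, 1.0]]
```

Addition is only one of the combination schemes the text allows; the same structure would admit weighting or concatenation instead.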

In one possible embodiment, the electronic device may input the text units into a prosody prediction model to obtain the first prosody information. The prosody prediction model may refer to a trained neural network model for generating the first prosody information from text units. Taking a prosody prediction model based on a bidirectional long short-term memory (BiLSTM) network as an example, the electronic device may first determine the text units of the text to be processed, convert each text unit into a word embedding vector, input the word embedding vectors into the neural network model to obtain a predicted label for each text unit, and determine the first prosody information from the predicted labels.

In some embodiments, the electronic device may preset a correspondence between symbol information and second prosody information; after the symbol information is determined, the second prosody information may be determined from the symbol information and the correspondence. For example, the electronic device may preset that a question mark corresponds to an interrogative tone and an exclamation mark corresponds to an exclamatory tone, with the second prosody information corresponding to the interrogative tone being a and that corresponding to the exclamatory tone being b. If the electronic device recognizes a question mark in the text to be processed, it may determine the second prosody information to be a, where the second prosody information a may act only on the text unit before or after the question mark, or on the sentence around the question mark; the scope over which the second prosody information acts is not specifically limited and may be configured according to actual usage requirements.
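The preset correspondence described above can be illustrated as a simple lookup table (the tag strings below are placeholders standing in for the abstract values a and b in the text, not values from the application):

```python
# Hypothetical punctuation -> second-prosody correspondence; both the
# ASCII and full-width forms of each mark are listed.
SYMBOL_PROSODY = {
    "?": "interrogative", "？": "interrogative",
    "!": "exclamatory",   "！": "exclamatory",
}

def second_prosody(text, default="declarative"):
    # Scan the text for the first recognised mark and return its tag;
    # texts with no tonal punctuation fall back to a default.
    for ch in text:
        if ch in SYMBOL_PROSODY:
            return SYMBOL_PROSODY[ch]
    return default
```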

After the electronic device determines the target prosody information of each text unit, it may generate the target speech according to the text information and the target prosody information. The target speech may be audio data broadcast on the basis of the text to be processed, and the audio data may include sound signals, speech waveform data, and the like. Methods by which an electronic device generates the target speech can be divided, by their underlying design, into rule-driven methods and data-driven methods. The main idea of rule-driven methods is to formulate rules from the physical process of human articulation so as to simulate and reproduce that process; data-driven methods instead use the data in a speech corpus to complete speech synthesis through statistical modeling, depend more heavily on the quality, minimal unit, and scale of the speech corpus, and yield more natural and fluent synthesized speech. Examples include formant synthesis, waveform concatenation, harmonic-plus-noise models, neural networks, and end-to-end deep neural network models; the specific manner of generating the target speech is not limited in this application.

In the present application, the text information of each text unit in the text to be processed and the symbol information of the punctuation marks contained in the text to be processed are determined, the target prosody information of each text unit is determined jointly from both, and the target speech is then generated from the text information and the target prosody information. In this way, by combining the symbol information of the punctuation marks with the text information, first, the punctuation marks in the text to be processed are made to influence the generated target speech, so that the generated speech better matches the emotional scene intended when the text was written; second, by introducing the symbol information of the punctuation marks, the target speech better fits the context of the text and is more natural, realistic, and close to natural spoken expression.

In some embodiments, the text information includes text phonemes and text tones, and determining the target prosody information of each text unit according to the text information and the symbol information of the punctuation marks contained in the text to be processed includes:

determining text prosody information of each text unit according to the text phonemes and the text tones of the text unit;

determining symbol prosody information of each text unit according to the symbol information; and

determining the target prosody information according to the text prosody information and the symbol prosody information.

In the embodiments of the present application, the text information may include at least text phonemes and text tones. A phoneme is the smallest unit of speech divided according to the natural properties of speech; analyzed in terms of the articulatory actions within a syllable, one action constitutes one phoneme. For example, the Chinese syllable ā ("ah") has one phoneme, ài ("love") has two, and dài ("substitute") has three. The perceived height of a sound's frequency is called pitch, one of the three main subjective attributes of sound, namely loudness, pitch, and timbre. Pitch indicates how high or low a listener perceives a sound to be; it is determined mainly by the frequency of the sound and is also related to its intensity. A text phoneme refers to a phoneme determined from the text content of the text to be processed, and a text tone refers to a tone determined from that content; the terms distinguish them from phonemes and tones determined from other content such as punctuation marks.

In a possible embodiment, the electronic device may convert the text to be processed into text phonemes through phoneme conversion. For example, the electronic device may use a trained phoneme conversion model to convert the text to be processed, "明天去学校" ("going to school tomorrow"), into text phonemes of the form [ming tian qu xue xiao]. The electronic device may then label the text phonemes with tones; for example, it can take the syllable as the processing unit, look up the tone of each syllable in a preset dictionary, and obtain the text tones [2, 1, 4, 2, 4]. The electronic device may represent the text phonemes and text tones jointly in a single sequence, for example [ming2 tian1 qu4 xue2 xiao4].
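The phoneme-plus-tone representation above can be sketched as follows (the lookup table below covers only the document's example sentence; a deployed system would use a trained grapheme-to-phoneme model and a full pronunciation dictionary):

```python
# Toy grapheme-to-phoneme-and-tone table for "明天去学校" only
# (hypothetical; real systems use a trained G2P model).
G2P = {"明": ("ming", 2), "天": ("tian", 1), "去": ("qu", 4),
       "学": ("xue", 2), "校": ("xiao", 4)}

def text_to_phoneme_tone(text):
    # Combine each syllable's phoneme and tone into one sequence,
    # e.g. "ming2 tian1 qu4 xue2 xiao4", as described above.
    return " ".join(f"{p}{t}" for p, t in (G2P[c] for c in text if c in G2P))
```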

After determining the text phonemes and text tones, the electronic device may perform prosody prediction on them to obtain the text prosody information of each text unit. For example, the electronic device may use a machine learning method with a pre-trained prosody prediction model to predict from the text phonemes and text tones, thereby obtaining text prosody information such as a pause prediction result for the text unit (for example, the pause position, the pause type such as long pause or short pause, and the probability value of each pause type).

After determining the symbol information, the electronic device may determine the symbol prosody information of each text unit from it. The electronic device may preset a correspondence between symbol information and symbol prosody information and then determine the symbol prosody information from the specific symbol information and that correspondence. For example, the electronic device presets that first symbol information corresponds to first symbol prosody information, second symbol information corresponds to second symbol prosody information, and so on; the correspondence may be stored in the memory of the electronic device, so that after the first symbol information is determined, the first symbol prosody information can be obtained by querying the memory. In another possible embodiment, the electronic device may obtain the symbol prosody information from the symbol information through a symbol prosody prediction model. The symbol prosody prediction model may be a trained neural network model that takes symbol information as input and outputs the corresponding symbol prosody information; for example, the electronic device can obtain specific symbol prosody information by feeding the symbol information into the trained model. During training, the electronic device may input labeled symbol information and symbol prosody information into an initialized symbol prosody prediction model to train its parameters, obtaining the trained symbol prosody prediction model.

After determining the text prosody information and the symbol prosody information, the electronic device may determine the target prosody information from them. In an optional embodiment, the electronic device may fuse the text prosody information and the symbol prosody information by addition to obtain the target prosody information. For example, the electronic device determines the text prosody information to be a sequence such as [0.3, 0.2, 0.7], where the values represent attribute values of different prosodic attributes (for example, pitch 0.3, duration 0.2, and intensity 0.7). If the electronic device determines the symbol prosody information to be [0.6, 0.5, 0.1], it may determine the target prosody information to be [0.9, 0.7, 0.8].
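The additive fusion of the two prosody vectors reduces to an element-wise sum, reproducing the worked example above:

```python
def fuse_add(text_prosody, symbol_prosody):
    # Element-wise addition of the text and symbol prosody vectors
    # (pitch, duration, intensity in the document's example).
    return [t + s for t, s in zip(text_prosody, symbol_prosody)]
```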

In a possible embodiment, the electronic device may instead compute a weighted mean of the text prosody information and the symbol prosody information. For example, the electronic device first sets the weight of the text prosody information to 0.4 and the weight of the symbol prosody information to 0.6, and then computes their weighted mean using these weights. The weights of the text prosody information and the symbol prosody information may be the same or different for different text units. The electronic device may determine the weights according to the part of speech of the text unit; for example, if the text unit is the subject, the weights of the text prosody information and the symbol prosody information may be 0.3 and 0.7 respectively, and if the text unit is the predicate, they may be 0.4 and 0.6 respectively.
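The part-of-speech-dependent weighting can be sketched as below, using the weight pairs from the example above (the fallback of equal weights for unlisted parts of speech is an assumption, not stated in the application):

```python
# Weights (text_weight, symbol_weight) from the document's example;
# other parts of speech fall back to equal weighting (assumed).
POS_WEIGHTS = {"subject": (0.3, 0.7), "predicate": (0.4, 0.6)}

def fuse_weighted(text_prosody, symbol_prosody, pos):
    # Weighted combination of the two prosody vectors per text unit.
    wt, ws = POS_WEIGHTS.get(pos, (0.5, 0.5))
    return [wt * t + ws * s for t, s in zip(text_prosody, symbol_prosody)]
```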

In the embodiments of the present application, the text prosody information of each text unit is determined from its text phonemes and text tones, the symbol prosody information of each text unit is determined from the symbol information, and the target prosody information is determined from the text prosody information and the symbol prosody information. In the related art, only the text information of the text to be processed is considered when determining prosody information, and other information of the text is ignored. In the embodiments of the present application, by contrast, text prosody information and symbol prosody information can be derived from several different kinds of content, so that more accurate and richer target prosody information is determined.

In some embodiments, the symbol information includes a symbol position, and determining the symbol prosody information of each text unit according to the symbol information includes:

determining, according to the symbol positions of the punctuation marks, the text units on which each punctuation mark acts, wherein one punctuation mark acts on at least one text unit; and

determining the symbol prosody information of the text units on which the punctuation marks act.

In the embodiments of the present application, the symbol information of a punctuation mark may include at least its symbol position. The electronic device may determine the symbol position of each punctuation mark by performing string recognition on the text to be processed; for example, the electronic device determines that a first punctuation mark is a comma at symbol position 10 and that a second punctuation mark is a period at symbol position 23. Since punctuation marks are not read aloud when the electronic device broadcasts speech, the text units on which each punctuation mark acts must be determined; after determining the symbol positions, the electronic device can determine those text units, wherein one punctuation mark acts on at least one text unit.

In some embodiments, the electronic device may preset the scope of action: the text unit immediately before the punctuation mark, the text unit immediately after it, or one or more text units between punctuation marks may be taken as the units acted on. For example, for the text to be processed "在上课？" ("In class?"), the electronic device may determine that the text unit acted on by the question mark "？" is "上课". For the text "？？张三是说真的吗" ("?? Is Zhang San serious"), the unit acted on by "？？" may be "张三". For the text "李四昨天！逃课看电影去了" ("Li Si yesterday! skipped class to see a movie"), the units acted on by "！" may be "昨天" and "逃课".
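A minimal sketch of the preset-scope idea, covering the "one unit before" and "one unit after" cases from the examples above (the interface is hypothetical; `mark_index` denotes the gap in the unit sequence where the mark sits):

```python
def affected_units(units, mark_index, scope="before"):
    # `units` is the segmented text; the mark sits between
    # units[mark_index - 1] and units[mark_index].
    # scope "before": the one unit preceding the mark;
    # scope "after":  the one unit following it.
    if scope == "before":
        return units[max(mark_index - 1, 0):mark_index]
    return units[mark_index:mark_index + 1]
```

For "在上课？" the mark follows both units, so `affected_units(["在", "上课"], 2, "before")` yields `["上课"]`; for "？？张三…" the marks precede everything, so the "after" scope yields `["张三"]`.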

After determining the text units on which each punctuation mark acts, the electronic device may determine the symbol prosody information of those units individually; for example, the first, second, and third text units of the text to be processed may have no symbol prosody information while the fourth has symbol prosody information [0.2, 0.5, 0.7]. When broadcasting a sentence-level span composed of multiple text units, the electronic device may use the symbol prosody information to render the corresponding tone, or it may use the symbol prosody information only when broadcasting some of the text units of the text to be processed. In a possible embodiment, the electronic device may also use combinations of different types of punctuation marks to jointly determine the text units acted on. For example, if the text to be processed is "今天老师朗读了《乡愁！》" ("Today the teacher read aloud Nostalgia!"), the electronic device may determine that the exclamatory tone of the exclamation mark "！" acts on the text inside the book-title marks, "乡愁". The electronic device may likewise use quotation marks (e.g. "你还好吗？"。) or runs of consecutive identical punctuation marks (e.g. ？明天不上班？。) to determine the text units acted on.

In the embodiments of the present application, the text units on which each punctuation mark acts are determined according to the symbol positions of the punctuation marks, wherein one punctuation mark acts on at least one text unit, and the symbol prosody information of those units is determined. This allows the symbol prosody information corresponding to the text to be processed to be determined accurately; compared with determining symbol prosody information for every text unit, it effectively reduces the workload and makes the broadcast tone more natural and fluent when the text to be processed is read aloud.

In some embodiments, the symbol information includes a symbol type and a symbol quantity, and determining the symbol prosody information of each text unit according to the symbol information includes:

determining a tone type according to the symbol type of the punctuation mark;

determining a tone degree according to the quantity of the punctuation marks; and

determining the symbol prosody information according to the tone type and/or the tone degree.

In the embodiments of the present application, the symbol information includes at least a symbol type and a symbol quantity. Symbol types may include pause marks, unread marks, and tone marks. For example, pause marks may include commas, enumeration commas, periods, semicolons, and dashes; unread marks may include book-title marks and brackets; tone marks may include question marks and exclamation marks. Pause marks and unread marks may also be called tone-free types. Tone types may include at least declarative, imperative, exclamatory, and interrogative tones, and the electronic device may preset correspondences between punctuation marks and tone types: for example, a period corresponds to a declarative tone, an exclamation mark to an exclamatory tone, and a question mark to an interrogative tone. For punctuation marks with no obvious tonal character, the electronic device may assign a default tone type; for example, commas and semicolons may be set to the default tone type.

The electronic device may determine the tone degree according to the quantity of punctuation marks. For example, a single question mark "?" may indicate an ordinary interrogative degree, while two or more consecutive question marks "??" may indicate a strong interrogative degree; a single exclamation mark "!" may indicate an ordinary exclamatory degree, while two or more consecutive exclamation marks "!!" may indicate a strong exclamatory degree. In a possible embodiment, punctuation marks of different types may also be combined, such as "?!" or "??!!", to express strong doubt together with surprise; the text unit before the question mark may then be labeled with a surprised-interrogative degree, or a strong surprised-interrogative degree if there are two or more marks.

After determining the symbol type and symbol quantity, the electronic device may determine the symbol prosody information according to the tone type and/or the tone degree. For example, if the electronic device determines that the text to be processed has five text units with a question mark after the last one, it may determine that the symbol prosody information of the fifth text unit is an ordinary interrogative tone.
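The type-plus-quantity rule above can be sketched as follows (tag strings are hypothetical; only runs of identical marks are handled, not mixed combinations like "?!"):

```python
# Hypothetical symbol-type -> tone-type correspondence from the text.
TONE_TYPE = {"?": "interrogative", "？": "interrogative",
             "!": "exclamatory",   "！": "exclamatory",
             "。": "declarative"}

def symbol_prosody(marks):
    # `marks` is a run of identical punctuation marks, e.g. "??".
    # The type of mark gives the tone type; the count gives the degree.
    tone = TONE_TYPE.get(marks[0], "default")
    degree = "strong" if len(marks) >= 2 else "normal"
    return tone, degree
```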

In the embodiments of the present application, the tone type is determined according to the symbol type of the punctuation mark, the tone degree is determined according to the quantity of punctuation marks, and the symbol prosody information is determined according to the tone type and/or tone degree. This allows the symbol prosody information to be determined simply and effectively and improves the operating efficiency of the electronic device.

In some embodiments, generating the target speech according to the text information and the target prosody information includes:

matching the text information and the target prosody information against preset text information and preset prosody information in a preset speech library, wherein the preset speech library is used to store audio segments and the mapping relationships between the preset text information and preset prosody information and the audio segments; and

generating the target speech according to the text information and the target prosody information using the speech generation strategy corresponding to the matching result, wherein different matching results correspond to different speech generation strategies.

In the embodiments of the present application, the preset speech library may store audio segments together with the mapping relationships between preset text information and preset prosody information and the audio segments. For example, first preset text information and first preset prosody information correspond to a first audio segment, and second preset text information and second preset prosody information correspond to a second audio segment. An audio segment may be the audio data corresponding to a short piece of text (for example a text unit such as a character or word), and the electronic device may combine multiple audio segments into the complete target speech corresponding to the text to be processed.

The electronic device may match the text information and the target prosody information against the preset text information and preset prosody information in the preset speech library. The matching result may include at least a successful match and a failed match. If the preset speech library contains both preset text information corresponding to the text information and preset prosody information corresponding to the target prosody information, the electronic device may determine that the match succeeds. If the library does not contain both at the same time (for example, it contains only the preset text information corresponding to the text information, or only the preset prosody information corresponding to the target prosody information, or neither), the electronic device may determine that the match fails. For example, if the text information successfully matches first preset text information and the target prosody information successfully matches first preset prosody information, the electronic device may determine that the matching result is a successful match; if the text information successfully matches second preset text information but the target prosody information matches none of the preset prosody information, the electronic device may determine that the matching result is a failed match.

After determining the matching result, the electronic device may generate the target speech from the text information and the target prosody information using the speech generation strategy corresponding to the matching result, wherein different matching results correspond to different speech generation strategies; for example, a successful match may correspond to a first speech generation strategy and a failed match to a second. A speech generation strategy may refer to the manner or rules by which the electronic device generates the target speech, such as a formant synthesis strategy, an articulatory synthesis strategy, a waveform concatenation strategy, or a neural network model strategy. For example, if the matching result is a successful match, the electronic device may generate the target speech using the first speech generation strategy; if the matching result is a failed match, it may generate the target speech using the second speech generation strategy.
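The match-then-dispatch logic can be sketched as below. The library contents, key format, and byte payloads are placeholders, and the parametric-synthesis fallback is a stub standing in for whatever second strategy (e.g. a neural vocoder) a real system would use:

```python
# Hypothetical library: (text_info, prosody_info) key -> audio bytes.
LIBRARY = {("ming2", "normal"): b"\x01", ("tian1", "normal"): b"\x02"}

def synthesize(units):
    # Stub for the fallback strategy used on a failed match
    # (a real system would run parametric / neural synthesis here).
    return b"\x00" * len(units)

def generate(units):
    # Successful match: every unit is in the library -> concatenation
    # strategy. Failed match: fall back to model-based synthesis.
    if all(u in LIBRARY for u in units):
        return b"".join(LIBRARY[u] for u in units)
    return synthesize(units)
```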

In the embodiments of the present application, the text information and the target prosody information are matched against the preset text information and preset prosody information in the preset speech library, and the target speech is generated from the text information and the target prosody information using the speech generation strategy corresponding to the matching result. Multiple different strategies can thus be used to ensure, accurately and effectively, that a complete target speech is obtained.

In some embodiments, generating the target speech according to the text information and the target prosody information using the speech generation strategy corresponding to the matching result includes:

when the matching result indicates that the text information and the target prosody information exist in the preset speech library, determining, from the preset speech library, the audio segments corresponding to the text information and the target prosody information; and

concatenating the audio segments to generate the target speech.

In the embodiments of the present application, when the matching result indicates that the text information and the target prosody information exist in the preset speech library, that is, when the match succeeds, the electronic device may determine from the preset speech library the audio segments corresponding to the text information and the target prosody information. For example, the electronic device determines that the text information and target prosody information of a first text unit correspond to a first audio segment, and those of a second text unit correspond to a second audio segment. The electronic device may then concatenate the audio segments to generate the target speech.

The electronic device may determine the concatenation order from the position information of the text units. For example, if the electronic device determines that the first text information belongs to the first text unit of the text to be processed and the second text information to the second text unit, it may determine that the first audio segment is joined to the second audio segment with the first segment in front. When concatenating the audio segments, the electronic device may also scale, filter, or enhance them so that the generated target speech is natural and fluent.
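Ordering the segments by the position of their text units before joining can be sketched as follows (the position-keyed mapping is a hypothetical interface; post-processing such as filtering and enhancement is omitted):

```python
def splice(segments):
    # `segments` maps each text unit's position in the text to its
    # audio bytes; join them in ascending position order.
    return b"".join(audio for _, audio in sorted(segments.items()))
```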

In this embodiment of the present application, when the matching result indicates that the text information and the target prosody information exist in the preset speech library, the audio segments corresponding to the text information and the target prosody information are determined from the preset speech library and spliced to generate the target speech. The audio segments can thus be determined simply and accurately, directly from the preset speech library, which yields a relatively stable synthesis result and improves the operating efficiency of the electronic device.

In some embodiments, generating the target speech according to the text information and the target prosody information by using the speech generation strategy corresponding to the matching result includes:

when the matching result indicates that the text information and the target prosody information do not exist in the preset speech library, performing feature conversion on the text information and the target prosody information of each of the text units to obtain acoustic features; and

decoding each of the acoustic features to generate the target speech.

In this embodiment of the present application, when the matching result indicates that the text information and the target prosody information do not exist in the preset speech library, that is, when the matching fails, the electronic device can perform feature conversion on the text information and the target prosody information of each text unit to obtain acoustic features. Acoustic features (also called acoustic parameters or speech parameters) are physical quantities describing the acoustic characteristics of speech, a collective term for the acoustic manifestation of its elements: for timbre, the energy concentration regions, formant frequencies, formant intensities, and bandwidths; for prosody, duration, fundamental frequency, average speech power, and the like. For example, the electronic device can convert the text information and the target prosody information into acoustic features through a trained acoustic model.

After obtaining the acoustic features, the electronic device can decode them to generate the target speech. For example, the electronic device can assemble the acoustic features of the text units, in the order of their positions, into an acoustic feature sequence or set and feed it into a trained vocoder to obtain the target speech (which may also be called the target waveform corresponding to the text to be processed). The vocoder here is a neural network model that converts the previously obtained high-level features (that is, the acoustic features) into a sound waveform.
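The parametric back end just described (acoustic model, then vocoder) can be sketched as a pipeline. The callable interfaces below are hypothetical; the embodiment does not fix any particular model API.

```python
def parametric_synthesis(text_units, acoustic_model, vocoder):
    """Parametric back end: (text, prosody) -> acoustic features -> waveform.

    text_units: list of (text_info, prosody_info) pairs in sentence order.
    acoustic_model: callable returning a list of feature frames per unit
                    (assumed interface, stand-in for a trained model).
    vocoder: callable decoding a feature sequence into a waveform
             (assumed interface, stand-in for a trained vocoder).
    """
    per_unit = [acoustic_model(text, prosody) for text, prosody in text_units]
    # Assemble the per-unit features, in text order, into one sequence.
    sequence = [frame for unit in per_unit for frame in unit]
    return vocoder(sequence)
```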

In a possible embodiment, the electronic device can also generate the target speech with an end-to-end strategy: an end-to-end synthesis strategy takes the text to be processed (or its phonetic transcription) directly as input and outputs the audio waveform (that is, the target speech) directly. The electronic device can implement this with an end-to-end model (for example, the TACOTRON model), feeding the text information and the target prosody information into the trained model and obtaining the target speech as output. Training such an end-to-end model requires labeled text information, prosody information, and speech data. In this embodiment of the present application, generating the target speech with an end-to-end strategy lowers the demands on linguistic expertise, makes it easy to replicate the approach across languages so that synthesis for dozens of languages or more can be built in batches, and offers rich pronunciation styles and strong prosodic expressiveness.

In a possible embodiment, the electronic device can also first generate a first speech according to the text prosody information and then adjust the first speech according to the symbol prosody information to generate the target speech. The electronic device presets an adjustment rule for each kind of symbol prosody information; for example, if the text to be processed contains an exclamation mark, the rule might be to scale the amplitude of the first speech down by 10%, thereby obtaining the target speech. In this way, the electronic device can reuse an existing speech synthesis strategy to generate the first speech from the text prosody information, and the target speech is obtained by adding only a small adjustment module to the existing processing flow, reducing the developers' workload.
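The post-hoc adjustment step can be sketched as below. The specific table entry (exclamation mark scaling amplitude by 0.9, i.e. a 10% reduction) follows the example in the text; any other entries and the list-of-samples representation are assumptions for illustration.

```python
# Hypothetical adjustment table: punctuation mark -> amplitude scale factor.
# The 10% reduction for "!" mirrors the example given in the embodiment.
ADJUSTMENTS = {"!": 0.9, "！": 0.9}

def adjust_first_voice(samples, punctuation):
    """Scale the first speech's amplitude per the symbol prosody rule."""
    scale = ADJUSTMENTS.get(punctuation, 1.0)  # default: leave unchanged
    return [s * scale for s in samples]
```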

In a possible embodiment, the speech generation method of the present application can be applied in a speech synthesis system, as shown in FIG. 2, which is a schematic diagram of such a system. As shown in FIG. 2, the speech synthesis system can include a front-end processing module 201 and a back-end processing module, where the back-end processing module can include a first back-end module and/or a second back-end module: the first back-end module can include an acoustic model 202 and a vocoder 203, and the second back-end module can include a preset speech library 204. During processing, the electronic device determines the text to be processed, which here includes both text content and punctuation marks. The electronic device can obtain the text to be processed in the form of articles from the Internet through crawler processing, or receive text uploaded by a user through a mouse, keyboard, or similar component. The electronic device then feeds the text to be processed into the front-end processing module 201 for processing to obtain the target prosody information.

Here, the speech synthesis system is mainly divided into two modules: the front-end processing module 201 and the back-end processing module. The front-end processing module 201 mainly performs text normalization, word segmentation, part-of-speech tagging, polyphone disambiguation, prosody prediction, and similar stages (sub-modules). In the prosody prediction stage, this embodiment adds a punctuation detection sub-module, which is used to determine the symbol prosody information according to the symbol information of the punctuation marks in the text to be processed.

The front-end processing module 201 parses the text to be processed that is input into the speech synthesis system. During parsing, the front end first normalizes the text, marking special characters according to fixed rules (for example, a string of Arabic numerals may be read as a number or as a telephone number). Word segmentation is completed within this process: the text to be processed is split according to the rules and phrases of the natural language, also drawing on the special characters and punctuation information recorded during normalization. After segmentation comes part-of-speech tagging, whose role is to assign each word in the segmentation result its correct part of speech (noun, verb, adjective, or another class).

Next comes polyphone disambiguation, which takes the segmented words as its processing units (that is, the text units). Each word is compared against the vocabulary in the pronunciation library under the longest-match principle to determine its phonemes, tones, and related information. After matching, the electronic device can also record, for each word, phoneme-level information (previous phoneme, next phoneme), syllable-level information (which syllable of the word), word-level information (part of speech, position in the sentence), and so on, for use in subsequent prosody prediction. If no match is found at the "word" level in the pronunciation library, the electronic device can fall back to the "character" level, a unit smaller than the word; since a single character may be polyphonic, it can simply select the reading with the highest statistical probability in the pronunciation library to obtain the text information. The electronic device can then perform prosody prediction. Besides predicting prosody for each word to obtain the text prosody information, it can process the punctuation of each sentence, taking the "sentence" as the unit, to obtain the symbol prosody information. For example, when a punctuation mark is encountered during parsing, the front-end processing module 201 can identify and record its type and scope. The electronic device can then determine the target prosody information according to the text prosody information and the symbol prosody information.
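The longest-match lookup with a character-level fallback can be sketched as follows. The lexicon formats are illustrative assumptions: a word lexicon mapping a word to its readings, and a character lexicon mapping a character to (probability, reading) candidates from which the most probable is chosen, as the embodiment describes for polyphonic characters.

```python
def lookup_pronunciation(unit, word_lexicon, char_lexicon):
    """Word-level lookup first; fall back to character level, picking the
    highest-probability reading for each (possibly polyphonic) character.
    """
    if unit in word_lexicon:
        return word_lexicon[unit]
    readings = []
    for ch in unit:
        # Candidates are (probability, reading) pairs; max() compares
        # the probability first, selecting the statistically likeliest.
        prob, reading = max(char_lexicon[ch])
        readings.append(reading)
    return readings
```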

In a possible embodiment, taking Chinese as an example, the speech synthesis system supports the punctuation marks specified in the national standard. The pause class includes the following. Comma ("，"): the comma marks a pause within a sentence and generally carries no particular tone, so the default tone level is used; the processing logic is simple, recording a "comma pause time" in the attributes of the word immediately before the break. Enumeration comma ("、"): similar to the comma, it marks a pause between words within a sentence with the default tone level; an "enumeration-comma pause time" is recorded on the word before the break, generally slightly shorter than the comma pause time. Semicolon ("；"): similar to the comma, it marks a pause between clauses with the default tone level; a "semicolon pause time" is recorded on the word before the break, generally slightly longer than the comma pause time. Period ("。"): it marks the pause at the end of a sentence with the default tone level; a "period pause time" is recorded on the word before the break, and this pause is generally the longest, exceeding that of the other punctuation marks. Hyphen ("-"): similar to the enumeration comma, it connects two words with the default tone level; a "hyphen pause time" is recorded on the word before the break, generally equal to the enumeration-comma pause time. Ellipsis ("……"): it marks an omission with the default tone level; an "ellipsis pause time" is recorded on the word before the break, generally slightly longer than the comma pause time. Dash ("—"): it marks an explanation or a change of topic with the default tone level; a "dash pause time" is recorded on the word before the break, generally slightly longer than the comma pause time. There is one special use of the dash, sometimes indicating a prolonged sound; because this requires analyzing the text content, it is not supported for now.
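The pause-time rules above amount to a lookup table plus an annotation on the word before the break. The millisecond values below are invented for illustration; the embodiment only fixes their relative ordering (enumeration comma shortest among the short pauses, semicolon/ellipsis/dash slightly longer than the comma, period longest).

```python
# Illustrative pause durations in milliseconds (values are assumptions;
# only their ordering follows the embodiment).
PAUSE_MS = {
    "、": 150,   # enumeration comma: slightly shorter than the comma
    "-": 150,   # hyphen: same as the enumeration comma
    "，": 200,   # comma: the default sentence-internal pause
    "；": 250,   # semicolon: slightly longer than the comma
    "……": 250,  # ellipsis: slightly longer than the comma
    "—": 250,   # dash: slightly longer than the comma
    "。": 400,   # period: the longest pause
}

def annotate_pause(units, punctuation):
    """Record the pause time on the segmented word before the break."""
    if units and punctuation in PAUSE_MS:
        units[-1]["pause_ms"] = PAUSE_MS[punctuation]
    return units
```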

The silent class includes the following. Parentheses ("()"): the text inside parentheses is simply ignored and not processed. Book title marks ("《》", "<>"): the content inside them also receives no special tone treatment; the default tone level is used and no pause time is added.

The tone class includes the following. Question mark ("?"): the question mark ends an interrogative sentence, so the word before it is tagged with the "ordinary interrogative tone level"; with two or more question marks ("???"), it is tagged with the "strong interrogative tone level". Exclamation mark ("!"): the exclamation mark ends an exclamatory sentence, so the word before it is tagged with the "ordinary exclamatory tone level"; with two or more exclamation marks ("!!!"), it is tagged with the "strong exclamatory tone level". Combined question and exclamation marks ("?!" or "???!!!"): this special combination generally expresses strong doubt mixed with surprise, so the word before the marks is tagged with the interrogative-surprise tone level, or the strong interrogative-surprise tone level when there are two or more of the marks.

The tone degree in the symbol prosody information can be divided into levels, for example: no tone, ordinary interrogative, strong interrogative, ordinary exclamatory, strong exclamatory, and so on (more and finer-grained tone types can be added). By detecting the final punctuation mark of the text to be processed, the system decides what tone to use when reading it out; the tone acts on the last text unit after word segmentation. For example, if the text to be processed is "在上课？" ("In class?"), the interrogative tone is used: the segmentation result is "在/上课/？", and the word "上课" is read with an interrogative tone. For "在上课！" ("In class!"), "上课" is read with an exclamatory tone.
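The mapping from trailing punctuation to tone type and degree can be sketched as follows. It is a minimal illustration: the label strings are made up, and the "two or more marks means strong" rule follows the repetition rule in the text.

```python
def classify_tone(sentence):
    """Map the trailing punctuation of a sentence to (tone type, degree)."""
    trailing = ""
    for ch in reversed(sentence):
        if ch in "?？!！":
            trailing = ch + trailing  # collect the run of final tone marks
        else:
            break
    if not trailing:
        return ("none", "none")
    nq = sum(1 for c in trailing if c in "?？")   # question marks
    ne = sum(1 for c in trailing if c in "!！")   # exclamation marks
    if nq and ne:
        tone = "interrogative-surprise"  # the "?!" combination
    elif nq:
        tone = "interrogative"
    else:
        tone = "exclamatory"
    degree = "strong" if max(nq, ne) >= 2 else "ordinary"
    return (tone, degree)
```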

After the front-end processing module 201 produces the target prosody information, the electronic device can feed it into the back-end processing module for back-end processing. The role of the back-end processing module is to generate the target speech from the target prosody information; it can include a first back-end module and/or a second back-end module, where the first can include the acoustic model 202 and the vocoder 203 and the second can include the preset speech library 204. Back-end processing can include speech synthesis based on statistical parametric modeling (the parametric method) and speech synthesis based on unit selection and waveform concatenation (the splicing method). If the electronic device adopts the parametric method to generate the target speech, it first generates acoustic features through the acoustic model 202 and then processes them with the vocoder 203 to obtain the target speech. If the electronic device adopts the splicing method, it obtains the audio segment corresponding to each text unit from the preset speech library 204 and then splices the segments to generate the target speech.
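The choice between the two back ends can be sketched as a dispatch on the library match. The interfaces are illustrative assumptions: the library is modeled as a dict keyed by (text, prosody) pairs, and the parametric back end as a callable fallback.

```python
def generate_speech(units, library, parametric_backend):
    """Use the preset library's clips when every (text, prosody) pair is
    found (the splicing method); otherwise fall back to the parametric
    pipeline. All interfaces here are hypothetical.
    """
    if all((text, prosody) in library for text, prosody in units):
        clips = [library[(text, prosody)] for text, prosody in units]
        return [sample for clip in clips for sample in clip]  # splice in order
    return parametric_backend(units)
```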

The present application determines the text information of each text unit in the text to be processed together with the symbol information of the punctuation marks the text contains, uses both to determine the target prosody information of each text unit, and then generates the target speech from the text information and the target prosody information. By combining the symbol information of the punctuation marks with the text information, first, the punctuation marks in the text to be processed influence the generated target speech, so that it better matches the emotional context in which the text was written; and second, introducing the symbol information makes the target speech better fit the context of the text, sounding more natural and realistic and closer to natural spoken language.

Fig. 3 is a block diagram of a speech generation apparatus according to an exemplary embodiment. As shown in FIG. 3, the speech generation apparatus 300 mainly includes:

an analysis module 301, configured to perform text analysis on the text to be processed to obtain text information of each text unit in the text to be processed;

a determining module 302, configured to determine target prosody information of each of the text units according to the text information and the symbol information of the punctuation marks contained in the text to be processed; and

a generating module 303, configured to generate a target speech according to the text information and the target prosody information.

In some embodiments, the text information includes text phonemes and text tones, and the determining module 302 is configured to:

determine text prosody information of each of the text units according to the text phonemes and the text tones of each of the text units;

determine symbol prosody information of each of the text units according to the symbol information; and

determine the target prosody information according to the text prosody information and the symbol prosody information.

In some embodiments, the symbol information includes a symbol position, and the determining module 302 is configured to:

determine, according to the symbol positions of the punctuation marks, the text units on which each of the punctuation marks acts, where one punctuation mark acts on at least one text unit; and

determine the symbol prosody information of the text units on which the punctuation marks act.

In some embodiments, the symbol information includes a symbol type and a symbol quantity, and the determining module 302 is configured to:

determine a tone type according to the symbol type of the punctuation marks;

determine a tone degree according to the symbol quantity of the punctuation marks; and

determine the symbol prosody information according to the tone type and/or the tone degree.

In some embodiments, the generating module 303 is configured to:

match the text information and the target prosody information against preset text information and preset prosody information in a preset speech library, where the preset speech library stores audio segments and the mapping relationships between the preset text information, the preset prosody information, and the audio segments; and

generate the target speech according to the text information and the target prosody information by using a speech generation strategy corresponding to the matching result, where different matching results correspond to different speech generation strategies.

In some embodiments, the generating module 303 is configured to:

when the matching result indicates that the text information and the target prosody information exist in the preset speech library, determine the audio segments corresponding to the text information and the target prosody information from the preset speech library; and

splice each of the audio segments to generate the target speech.

In some embodiments, the generating module 303 is configured to:

when the matching result indicates that the text information and the target prosody information do not exist in the preset speech library, perform feature conversion on the text information and the target prosody information of each of the text units to obtain acoustic features; and

decode each of the acoustic features to generate the target speech.

With regard to the apparatus in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and will not be elaborated here.

Fig. 4 is a block diagram of the hardware structure of a speech generation apparatus according to an exemplary embodiment. For example, the apparatus 400 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, fitness equipment, a personal digital assistant, or the like.

Referring to FIG. 4, the apparatus 400 may include one or more of the following components: a processing component 402, a memory 404, a power supply component 406, a multimedia component 408, an audio component 410, an input/output (I/O) interface 412, a sensor component 414, and a communication component 416.

The processing component 402 generally controls the overall operation of the apparatus 400, such as operations associated with display, telephone calls, data communication, camera operation, and recording. The processing component 402 may include one or more processors 420 to execute instructions to complete all or some of the steps of the method described above. In addition, the processing component 402 may include one or more modules that facilitate interaction between the processing component 402 and other components; for example, it may include a multimedia module to facilitate interaction between the multimedia component 408 and the processing component 402.

The memory 404 is configured to store various types of data to support operation of the apparatus 400. Examples of such data include instructions for any application or method operated on the apparatus 400, contact data, phone book data, messages, pictures, videos, and the like. The memory 404 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.

The power supply component 406 provides power to the various components of the apparatus 400. It may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 400.

The multimedia component 408 includes a screen that provides an output interface between the apparatus 400 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, it may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the panel; the touch sensors may sense not only the boundary of a touch or swipe action but also the duration and pressure associated with it. In some embodiments, the multimedia component 408 includes a front camera and/or a rear camera. When the apparatus 400 is in an operating mode such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each front or rear camera may be a fixed optical lens system or have focal length and optical zoom capability.

The audio component 410 is configured to output and/or input audio signals. For example, the audio component 410 includes a microphone (MIC) configured to receive external audio signals when the apparatus 400 is in an operating mode such as a call mode, a recording mode, or a speech recognition mode. The received audio signal may be further stored in the memory 404 or sent via the communication component 416. In some embodiments, the audio component 410 also includes a loudspeaker for outputting audio signals.

The I/O interface 412 provides an interface between the processing component 402 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. The buttons may include, but are not limited to, a home button, volume buttons, a start button, and a lock button.

The sensor component 414 includes one or more sensors for providing status assessments of various aspects of the apparatus 400. For example, the sensor component 414 may detect the on/off state of the apparatus 400 and the relative positioning of components (for example, the display and keypad of the apparatus 400), and may also detect a change in the position of the apparatus 400 or one of its components, the presence or absence of user contact with the apparatus 400, the orientation or acceleration/deceleration of the apparatus 400, and temperature changes of the apparatus 400. The sensor component 414 may include a proximity sensor configured to detect nearby objects without any physical contact, and may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 414 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

Communication component 416 is configured to facilitate wired or wireless communication between device 400 and other devices. Device 400 may access a wireless network based on a communication standard, such as Wi-Fi, 4G, or 5G, or a combination thereof. In one exemplary embodiment, communication component 416 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, communication component 416 also includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio-frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, device 400 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above method.

In an exemplary embodiment, there is also provided a non-transitory computer-readable storage medium including instructions, such as memory 404 including instructions, executable by processor 420 of device 400 to perform the above method. For example, the non-transitory computer-readable storage medium may be a ROM, a random-access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.

A non-transitory computer-readable storage medium is provided such that, when instructions in the storage medium are executed by a processor of a speech generation device, the speech generation device is enabled to perform a speech generation method comprising:

performing text analysis processing on text to be processed to obtain text information of each text unit in the text to be processed;

determining target prosody information of each text unit according to the text information and symbol information of punctuation marks contained in the text to be processed; and

generating target speech according to the text information and the target prosody information.
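The three-step method above (text analysis, prosody determination, speech generation) can be sketched as a minimal pipeline. All function names, the clause-based notion of a "text unit", and the punctuation-to-prosody mapping values below are illustrative assumptions for exposition, not part of the disclosed implementation:

```python
import re

def analyze_text(text):
    """Text analysis step: split the input into text units (here, clauses
    delimited by punctuation) and record each unit's trailing punctuation."""
    units = []
    for match in re.finditer(r"([^,.!?;]+)([,.!?;]*)", text):
        content, punct = match.group(1).strip(), match.group(2)
        if content:
            units.append({"text": content, "punct": punct})
    return units

def determine_prosody(unit):
    """Prosody determination step: derive target prosody for a unit from the
    punctuation acting on it (symbol type -> tone type, symbol count -> tone
    degree). The specific mapping values are hypothetical."""
    punct = unit["punct"]
    tone_type = {"?": "interrogative", "!": "exclamatory"}.get(punct[:1], "declarative")
    tone_degree = max(len(punct), 1)  # e.g. "!!" marks a stronger tone than "!"
    return {"tone_type": tone_type, "tone_degree": tone_degree}

def generate_speech(units):
    """Speech generation step: combine each unit's text information with its
    target prosody; a real system would emit audio, this sketch emits
    annotated (text, prosody) segments."""
    return [(u["text"], determine_prosody(u)) for u in units]

segments = generate_speech(analyze_text("Really?! That is great news."))
```

On this input, "Really" is tagged as interrogative with a tone degree of 2 (two punctuation marks act on it), while the closing clause stays declarative with degree 1, illustrating how symbol type and symbol count drive the target prosody.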

FIG. 5 is a block diagram of the hardware structure of an apparatus 500 for speech generation according to an exemplary embodiment. For example, apparatus 500 may be provided as a server. Referring to FIG. 5, apparatus 500 includes a processing component 522, which further includes one or more processors, and memory resources, represented by memory 532, for storing instructions executable by processing component 522, such as an application program. The application program stored in memory 532 may include one or more modules, each corresponding to a set of instructions. In addition, processing component 522 is configured to execute the instructions to perform a speech generation method comprising:

performing text analysis processing on text to be processed to obtain text information of each text unit in the text to be processed;

determining target prosody information of each text unit according to the text information and symbol information of punctuation marks contained in the text to be processed; and

generating target speech according to the text information and the target prosody information.

Apparatus 500 may also include a power component 526 configured to perform power management of apparatus 500, a wired or wireless network interface 550 configured to connect apparatus 500 to a network, and an input/output (I/O) interface 558. Apparatus 500 may operate based on an operating system stored in memory 532, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.

Other embodiments of the present application will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application that follow its general principles and include common knowledge or conventional techniques in the technical field not disclosed herein. The specification and examples are to be regarded as exemplary only, with the true scope and spirit of the application being indicated by the following claims.

It is to be understood that the present application is not limited to the precise structures described above and illustrated in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the application is limited only by the appended claims.

Claims (10)

1. A speech generation method, characterized in that the method comprises:
performing text analysis processing on text to be processed to obtain text information of each text unit in the text to be processed;
determining target prosody information of each text unit according to the text information and symbol information of punctuation marks contained in the text to be processed; and
generating target speech according to the text information and the target prosody information.

2. The method according to claim 1, characterized in that the text information comprises text phonemes and text tones, and determining the target prosody information of each text unit according to the text information and the symbol information of the punctuation marks contained in the text to be processed comprises:
determining text prosody information of each text unit according to the text phonemes and the text tones of that text unit;
determining symbol prosody information of each text unit according to the symbol information; and
determining the target prosody information according to the text prosody information and the symbol prosody information.

3. The method according to claim 2, characterized in that the symbol information comprises a symbol position, and determining the symbol prosody information of each text unit according to the symbol information comprises:
determining, according to the symbol position of each punctuation mark, the text unit on which that punctuation mark acts, wherein one punctuation mark acts on at least one text unit; and
determining the symbol prosody information of the text unit on which the punctuation mark acts.

4. The method according to claim 2, characterized in that the symbol information comprises a symbol type and a symbol count, and determining the symbol prosody information of each text unit according to the symbol information comprises:
determining a tone type according to the symbol type of the punctuation marks;
determining a tone degree according to the number of the punctuation marks; and
determining the symbol prosody information according to the tone type and/or the tone degree.

5. The method according to claim 1, characterized in that generating the target speech according to the text information and the target prosody information comprises:
matching the text information and the target prosody information against preset text information and preset prosody information in a preset voice library, wherein the preset voice library stores audio fragments and mapping relationships between the preset text information, the preset prosody information, and the audio fragments; and
generating the target speech according to the text information and the target prosody information using a speech generation strategy corresponding to the matching result, wherein different matching results correspond to different speech generation strategies.

6. The method according to claim 5, characterized in that generating the target speech according to the text information and the target prosody information using the speech generation strategy corresponding to the matching result comprises:
when the matching result indicates that the text information and the target prosody information exist in the preset voice library, determining, from the preset voice library, the audio fragments corresponding to the text information and the target prosody information; and
splicing the audio fragments to generate the target speech.

7. The method according to claim 5, characterized in that generating the target speech according to the text information and the target prosody information using the speech generation strategy corresponding to the matching result comprises:
when the matching result indicates that the text information and the target prosody information do not exist in the preset voice library, performing feature conversion processing on the text information and the target prosody information of each text unit to obtain acoustic features; and
decoding the acoustic features to generate the target speech.

8. A speech generation apparatus, characterized by comprising:
an analysis module configured to perform text analysis processing on text to be processed to obtain text information of each text unit in the text to be processed;
a determination module configured to determine target prosody information of each text unit according to the text information and symbol information of punctuation marks contained in the text to be processed; and
a generation module configured to generate target speech according to the text information and the target prosody information.

9. A speech generation apparatus, characterized by comprising:
a processor; and
a memory configured to store processor-executable instructions;
wherein the processor is configured to implement, when executing the instructions, the steps of the speech generation method of any one of claims 1 to 7.

10. A non-transitory computer-readable storage medium, wherein, when instructions in the storage medium are executed by a processor of a speech generation apparatus, the apparatus is enabled to perform the steps of the speech generation method of any one of claims 1 to 7.
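Claims 5 to 7 describe two generation strategies selected by a library match: splice stored audio fragments when the text information and target prosody information exist in the preset voice library, and otherwise fall back to feature conversion and decoding. A minimal sketch of that dispatch follows; the library contents, the per-unit granularity of the match, and the placeholder synthesis function are hypothetical assumptions, not the disclosed implementation:

```python
# Hypothetical preset voice library: maps (text info, prosody info) -> audio fragment.
PRESET_LIBRARY = {
    ("hello", "declarative"): b"frag-hello",
    ("goodbye", "declarative"): b"frag-goodbye",
}

def synthesize_from_features(text_info, prosody_info):
    """Fallback path (claim 7): feature conversion followed by decoding.
    Placeholder standing in for an acoustic model plus vocoder."""
    return f"synth({text_info}|{prosody_info})".encode()

def generate_target_speech(units):
    """Strategy dispatch per claims 5-7: use a stored fragment when the
    (text, prosody) pair is found in the library, otherwise synthesize
    the unit from features, then splice the fragments together."""
    fragments = []
    for text_info, prosody_info in units:
        fragment = PRESET_LIBRARY.get((text_info, prosody_info))
        if fragment is None:  # library miss -> feature conversion + decoding
            fragment = synthesize_from_features(text_info, prosody_info)
        fragments.append(fragment)
    return b" ".join(fragments)  # splicing step: concatenate the fragments

audio = generate_target_speech([("hello", "declarative"), ("world", "declarative")])
```

The design point is that the match result, not the caller, picks the strategy: concatenative output where recorded fragments exist, model-based synthesis everywhere else, so both paths feed the same splicing stage.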
CN202111493700.5A 2021-12-08 2021-12-08 Voice generation method, device and storage medium Pending CN114420084A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111493700.5A CN114420084A (en) 2021-12-08 2021-12-08 Voice generation method, device and storage medium


Publications (1)

Publication Number Publication Date
CN114420084A true CN114420084A (en) 2022-04-29

Family

ID=81265655

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111493700.5A Pending CN114420084A (en) 2021-12-08 2021-12-08 Voice generation method, device and storage medium

Country Status (1)

Country Link
CN (1) CN114420084A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116189663A (en) * 2023-02-23 2023-05-30 支付宝(杭州)信息技术有限公司 Prosodic prediction model training method and device, human-computer interaction method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101000765A (en) * 2007-01-09 2007-07-18 黑龙江大学 Speech synthetic method based on rhythm character
US20100332224A1 (en) * 2009-06-30 2010-12-30 Nokia Corporation Method and apparatus for converting text to audio and tactile output
CN106205602A (en) * 2015-05-06 2016-12-07 上海汽车集团股份有限公司 Speech playing method and system
CN110797006A (en) * 2020-01-06 2020-02-14 北京海天瑞声科技股份有限公司 End-to-end speech synthesis method, device and storage medium
CN111226275A (en) * 2019-12-31 2020-06-02 深圳市优必选科技股份有限公司 Voice synthesis method, device, terminal and medium based on rhythm characteristic prediction
CN111292719A (en) * 2020-02-07 2020-06-16 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment
CN112466277A (en) * 2020-10-28 2021-03-09 北京百度网讯科技有限公司 Rhythm model training method and device, electronic equipment and storage medium



Similar Documents

Publication Publication Date Title
US11514886B2 (en) Emotion classification information-based text-to-speech (TTS) method and apparatus
US11380300B2 (en) Automatically generating speech markup language tags for text
US8027837B2 (en) Using non-speech sounds during text-to-speech synthesis
EP3387646B1 (en) Text-to-speech processing system and method
WO2020098269A1 (en) Speech synthesis method and speech synthesis device
CN118675519A (en) Engaging an automated assistant with pre-event and post-event input streams
CN107516511A (en) A text-to-speech learning system for intent recognition and emotion
US20240420678A1 (en) Method, apparatus, computer readable medium, and electronic device of speech sythesis
CN107077841A (en) Hyperstructured Recurrent Neural Networks for Text-to-Speech
EP3844605A1 (en) Dynamic adjustment of story time special effects based on contextual data
US12001260B1 (en) Preventing inadvertent wake in a speech-controlled device
JP2024508033A (en) Instant learning of text-speech during dialogue
US11810556B2 (en) Interactive content output
US11783824B1 (en) Cross-assistant command processing
JPWO2018034169A1 (en) Dialogue control apparatus and method
US20150254238A1 (en) System and Methods for Maintaining Speech-To-Speech Translation in the Field
CN113761268B (en) Audio program content playback control method, device, equipment and storage medium
US20230360633A1 (en) Speech processing techniques
EP3550449A1 (en) Search method and electronic device using the method
US12243511B1 (en) Emphasizing portions of synthesized speech
CN112151072B (en) Voice processing method, device and medium
CN114420084A (en) Voice generation method, device and storage medium
US20190088258A1 (en) Voice recognition device, voice recognition method, and computer program product
CN115249472B (en) Speech synthesis method and device for realizing accent overall planning by combining with above context
Trouvain et al. Speech synthesis: text-to-speech conversion and artificial voices

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: Room 1001, 10th Floor, No. 18 Longyao Road, Xuhui District, Shanghai 200000

Applicant after: Zebra Network Technology Co.,Ltd.

Address before: Building D1, 2nd Floor, No. 55 Huaihai West Road, Xuhui District, Shanghai

Applicant before: ZEBRED NETWORK TECHNOLOGY Co.,Ltd.

Country or region before: China