CN116778900A

CN116778900A - Speech synthesis method, electronic device and storage medium

Info

Publication number: CN116778900A
Application number: CN202310831506.6A
Authority: CN
Inventors: 岳振; 马金燚; 佘志强
Original assignee: China Mobile Communications Group Co Ltd; MIGU Digital Media Co Ltd; MIGU Culture Technology Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; MIGU Digital Media Co Ltd; MIGU Culture Technology Co Ltd
Priority date: 2023-07-07
Filing date: 2023-07-07
Publication date: 2023-09-19

Abstract

The application provides a voice synthesis method, electronic equipment and a storage medium, and relates to the technical field of voice, wherein the method comprises the following steps: acquiring a first text, wherein the first text comprises N dialogue texts, each dialogue text corresponds to dialogue content of a role, and N is an integer greater than 1; determining azimuth information of each dialogue text according to target information of each dialogue text in the first text, wherein the target information comprises position information of a role corresponding to each dialogue text in an associated movie fragment of the dialogue text or position keywords associated with each dialogue text in the first text; and determining the sound source position of the audio to be synthesized of each dialogue text according to the azimuth information of each dialogue text, and generating the synthesized voice of each dialogue text according to the sound source position of the audio to be synthesized of each dialogue text. The embodiment of the application can convert the dialogue text into the dialogue voice with different sound source positions, thereby effectively improving the voice playing effect of the dialogue text.

Description

Speech synthesis method, electronic device and storage medium

技术领域Technical field

本申请涉及语音技术领域，尤其涉及一种语音合成方法、电子设备及存储介质。The present application relates to the field of speech technology, and in particular, to a speech synthesis method, electronic device and storage medium.

背景技术Background technique

“听电子书”是被广泛应用的一种技术。目前，通常可通过语音合成(Text toSpeech，TTS)技术，将文本实时合成为语音文件并播放，从而让用户“听”到电子书的内容。现有的电子书朗读通常采用单一发音人的音库，仅将电子书文本转化为单声道音频文件播放。"Listening to e-books" is a widely used technology. At present, text to speech (Text to Speech, TTS) technology can usually be used to synthesize text into a speech file in real time and play it back, allowing users to "hear" the content of e-books. Existing e-book readings usually use the sound library of a single speaker, and only convert the e-book text into a mono audio file for playback.

很多电子书中通常会充满大量的人物对话，现有的电子书朗读仅将电子书文本转化为单声道音频文件播放，这导致音频文件较为单调，音频播放效果较差。Many e-books are usually filled with a lot of character dialogue. The existing e-book reading only converts the e-book text into a mono audio file for playback, which results in a monotonous audio file and poor audio playback effect.

发明内容Contents of the invention

本申请实施例提供了一种语音合成方法、电子设备及存储介质，以解决现有的电子书朗读仅将电子书文本转化为单声道音频文件播放，而导致音频文件较为单调，音频播放效果较差的问题。Embodiments of the present application provide a speech synthesis method, electronic equipment, and storage media to solve the problem that existing e-book reading only converts e-book text into monophonic audio files for playback, resulting in relatively monotonous audio files and poor audio playback effects. Poor question.

为了解决上述技术问题，本申请是这样实现的；In order to solve the above technical problems, this application is implemented as follows;

第一方面，本申请实施例提供了一种语音合成方法，所述方法包括：In a first aspect, embodiments of the present application provide a speech synthesis method, which method includes:

获取第一文本，其中，所述第一文本中包括N个对话文本，每个对话文本对应一个角色的对话内容，N为大于1的整数；Obtain the first text, wherein the first text includes N dialogue texts, each dialogue text corresponds to the dialogue content of a character, and N is an integer greater than 1;

根据所述第一文本中各对话文本的目标信息，确定各对话文本的方位信息，其中，所述目标信息包括各对话文本对应的角色在该对话文本的关联影视片段中的位置信息，或所述第一文本中与各对话文本关联的位置关键词；Determine the orientation information of each dialogue text according to the target information of each dialogue text in the first text, wherein the target information includes the position information of the character corresponding to each dialogue text in the associated film and television clip of the dialogue text, or the location information of each dialogue text. Describe the positional keywords associated with each dialogue text in the first text;

根据所述各对话文本的方位信息，确定所述各对话文本的待合成音频的声源位置，并根据所述各对话文本的待合成音频的声源位置，生成各对话文本的合成语音。According to the orientation information of each dialogue text, the sound source position of the audio to be synthesized of each dialogue text is determined, and based on the sound source position of the audio to be synthesized of each dialogue text, a synthesized speech of each dialogue text is generated.

可选地，所述目标信息包括各对话文本对应的角色在该对话文本的关联影视片段中的位置信息；Optionally, the target information includes position information of the character corresponding to each dialogue text in the film and television clip associated with the dialogue text;

所述根据所述第一文本中各对话文本的目标信息，确定各对话文本的方位信息，包括：Determining the orientation information of each dialogue text based on the target information of each dialogue text in the first text includes:

在包含所述第一文本的电子书存在关联影视资源的情况下，根据所述各对话文本的内容和所述关联影视资源中各视频帧的影视内容，确定所述关联影视资源中分别与各对话文本匹配的关联影视片段；In the case where the e-book containing the first text has associated film and television resources, it is determined based on the content of each dialogue text and the film and television content of each video frame in the associated film and television resources, that each of the associated film and television resources is determined to be associated with each other. Related film and television clips matched with dialogue text;

确定所述各对话文本的关联影视片段中与对应对话文本的角色匹配的目标影视角色；Determine the target film and television character in the film and television clip associated with each dialogue text that matches the role of the corresponding dialogue text;

根据所述各对话文本的关联影视片段中的目标影视角色在该关联影视片段的影视画面中的位置信息，确定各对话文本的方位信息。The orientation information of each dialogue text is determined based on the position information of the target film and television character in the associated film and television segment of each dialogue text in the film and television frame of the associated film and television segment.

可选地，所述确定所述各对话文本的关联影视片段中与对应对话文本的角色匹配的目标影视角色，包括：Optionally, the determining the target film and television character in the associated film and television clip of each dialogue text that matches the role of the corresponding dialogue text includes:

对第一关联影视片段进行目标检测，确定所述第一关联影视片段中发出与第一对话文本匹配的对话语音的目标图像，其中，所述目标图像所指示的影视角色为与所述第一对话文本的角色匹配的目标影视角色，所述第一关联影视片段为所述第一对话文本的关联影视片段，所述第一对话文本为所述第一文本中的任一对话文本。Perform target detection on the first associated film and television segment, and determine the target image in the first associated film and television segment that emits a dialogue voice that matches the first dialogue text, wherein the film and television character indicated by the target image is the same as the first dialogue text. The role of the dialogue text matches the target film and television character, the first associated film and television segment is an associated film and television segment of the first dialogue text, and the first dialogue text is any dialogue text in the first text.

可选地，所述根据所述各对话文本的内容和所述关联影视资源中各视频帧的影视内容，确定所述关联影视资源中分别与各对话文本匹配的关联影视片段，包括：Optionally, determining the associated film and television segments in the associated film and television resources that respectively match each dialogue text based on the content of each dialogue text and the film and television content of each video frame in the associated film and television resources includes:

根据第一对话文本的内容和所述关联影视资源中各视频帧的字幕内容或音频内容，确定所述关联影视资源中与所述第一对话文本匹配的第一关联影视片段，所述第一对话文本为所述第一文本中的任一对话文本；According to the content of the first dialogue text and the subtitle content or audio content of each video frame in the associated film and television resources, the first associated film and television segment matching the first dialogue text in the associated film and television resources is determined, and the first The dialogue text is any dialogue text in the first text;

对所述第一关联影视片段进行场景识别，确定所述第一关联影视片段中的目标场景元素；Perform scene recognition on the first associated film and television clip to determine the target scene element in the first associated film and television clip;

确定所述第一文本中所述第一对话文本的上下文文本中的场景关键词；Determine scene keywords in the context text of the first dialogue text in the first text;

在所述目标场景元素与所述场景关键词匹配的情况下，对所述第一关联影视片段进行目标检测。When the target scene element matches the scene keyword, target detection is performed on the first associated film and television clip.

可选地，所述对所述第一关联影视片段进行场景识别，确定所述第一关联影视片段中的目标场景元素，包括：Optionally, the step of performing scene recognition on the first associated film and television clip and determining the target scene element in the first associated film and television clip includes:

对所述第一关联影视片段进行场景识别，确定所述第一关联影视片段中的各个场景元素以及各个场景元素的可信度；Perform scene recognition on the first associated film and television clip, and determine each scene element and the credibility of each scene element in the first associated film and television clip;

将所述第一关联影视片段中可信度超过预设阈值的场景元素确定为所述目标场景元素。Scene elements whose credibility exceeds a preset threshold in the first associated video clip are determined as the target scene elements.

可选地，所述目标信息包括所述第一文本中与各对话文本关联的位置关键词；Optionally, the target information includes position keywords associated with each dialogue text in the first text;

遍历所述第一文本中的角色名称；Traverse the character names in the first text;

确定所述第一文本中出现各角色名称的位置的上下文文本中的位置关键词；Determine the position keywords in the contextual text of the position where each character name appears in the first text;

根据所述第一文本中各角色名称对应的对话文本，确定所述各对话文本关联的位置关键词。According to the dialogue text corresponding to each character name in the first text, the position keyword associated with each dialogue text is determined.

可选地，所述确定所述第一文本中出现各角色名称的位置的上下文文本中的位置关键词，包括：Optionally, determining the location keywords in the context text where each character name appears in the first text includes:

在所述第一文本中对第一角色名称的限定描述满足预设判别模式的情况下，根据所述限定描述，确定所述第一角色名称对应的第一处所词，其中，所述限定描述至少包括处所词，所述第一角色名称为所述第一文本中的任一角色名称；When the qualified description of the first character name in the first text satisfies the preset discrimination mode, the first local word corresponding to the first character name is determined according to the qualified description, wherein the qualified description At least including a local word, the first character name is any character name in the first text;

根据所述第一角色名称对应的第一处所词，确定所述第一角色名称对应的第一方位；Determine the first location corresponding to the first character name according to the first location word corresponding to the first character name;

将所述第一角色名称、所述第一处所词和所述第一方位对应记录至方位表，其中，在第二角色名称对应的第二处所词与所述第一处所词相同的情况下，所述方位表中记录的所述第二角色名称对应的第二方位与所述第一方位相同，所述第二角色名称为所述第一文本中除所述第一角色名称外的任一角色名称；The first character name, the first location word and the first location are correspondingly recorded in a location table, where, when the second location word corresponding to the second character name is the same as the first location word , the second orientation corresponding to the second character name recorded in the orientation table is the same as the first orientation, and the second character name is any character in the first text except the first character name. a role name;

根据所述方位表中记录的各角色名称对应的处所词和方位，确定各角色名称对应的对话文本的方位。According to the location words and directions corresponding to each character name recorded in the location table, the location of the dialogue text corresponding to each character name is determined.

可选地，在所述根据所述各对话文本的待合成音频的声源位置，生成各对话文本的合成语音之前，所述方法还包括：Optionally, before generating the synthesized speech of each dialogue text according to the sound source position of the audio to be synthesized of each dialogue text, the method further includes:

确定所述第一文本中各对话文本的情绪属性；Determine the emotional attributes of each dialogue text in the first text;

根据所述各对话文本的情绪属性，确定所述各对话文本的待合成音频的音频参数；Determine the audio parameters of the audio to be synthesized for each dialogue text according to the emotional attributes of each dialogue text;

所述根据所述各对话文本的待合成音频的声源位置，生成各对话文本的合成语音，包括：Generating the synthesized speech of each dialogue text according to the sound source position of the audio to be synthesized of each dialogue text includes:

根据所述各对话文本的待合成音频的声源位置和音频参数，生成各对话文本的合成语音。According to the sound source position and audio parameters of the audio to be synthesized of each dialogue text, the synthesized speech of each dialogue text is generated.

可选地，所述确定所述第一文本中各对话文本的情绪属性，包括：Optionally, determining the emotional attributes of each dialogue text in the first text includes:

对第一对话文本和所述第一对话文本的上下文进行分词处理，得到多个情绪词组，其中，所述第一对话文本为所述第一文本中的任一对话文本；Perform word segmentation processing on the first dialogue text and the context of the first dialogue text to obtain multiple emotional phrases, wherein the first dialogue text is any dialogue text in the first text;

通过预先构建的情绪词典，查询所述多个词组中每个情绪词组对应的情绪属性；Query the emotional attributes corresponding to each emotional phrase in the multiple phrases through the pre-constructed emotional dictionary;

根据所述每个情绪词组对应的情绪属性，确定所述第一对话文本的情绪属性。According to the emotional attribute corresponding to each emotional phrase, the emotional attribute of the first dialogue text is determined.

可选地，所述根据所述每个情绪词组对应的情绪属性，确定所述第一对话文本的情绪属性，包括：Optionally, determining the emotional attribute of the first dialogue text based on the emotional attribute corresponding to each emotional phrase includes:

在所述第一对话文本和所述第一对话文本的上下文中至少两个情绪词组对应的情绪属性不同的情况下，确定所述每个情绪词组对应的情绪属性的权重；When the emotional attributes corresponding to at least two emotional phrases in the first dialogue text and the context of the first dialogue text are different, determine the weight of the emotional attribute corresponding to each emotional phrase;

将权重最高的情绪属性确定为所述第一对话文本的情绪属性。The emotional attribute with the highest weight is determined as the emotional attribute of the first dialogue text.

可选地，所述根据所述各对话文本的待合成音频的声源位置，生成各对话文本的合成语音，包括：Optionally, generating the synthesized speech of each dialogue text according to the sound source position of the audio to be synthesized of each dialogue text includes:

根据所述各对话文本的待合成音频的声源位置，确定所述各对话文本的待合成音频的第一声道与第二声道的时延或者音量差，生成所述各对话文本的具有所述时延或者所述音量差的双声道合成语音。According to the sound source position of the audio to be synthesized in each dialogue text, the time delay or volume difference between the first channel and the second channel of the audio to be synthesized in each dialogue text is determined, and a sound source of each dialogue text is generated. The time delay or the two-channel synthesized speech with the volume difference.

可选地，在所述根据所述各对话文本的待合成音频的声源位置，生成所述各对话文本的合成语音之前，所述方法还包括：Optionally, before generating the synthesized speech of each dialogue text according to the sound source position of the audio to be synthesized of each dialogue text, the method further includes:

根据预设的人物字典中标注的所述第一文本中各角色的名称信息，确定所述各对话文本的角色属性；Determine the character attributes of each dialogue text according to the name information of each character in the first text marked in the preset character dictionary;

所述根据所述各对话文本的待合成音频的声源位置，生成所述各对话文本的合成语音，包括：Generating the synthesized speech of each dialogue text according to the sound source position of the audio to be synthesized of each dialogue text includes:

根据所述各对话文本的角色属性，确定所述各对话文本对应的发音人；Determine the speaker corresponding to each dialogue text according to the role attribute of each dialogue text;

根据所述各对话文本的待合成音频的声源位置，以所述各对话文本对应的发音人的音色生成各对话文本的合成语音。According to the sound source position of the audio to be synthesized in each dialogue text, the synthesized speech of each dialogue text is generated with the voice color of the speaker corresponding to each dialogue text.

第二方面，本申请实施例还提供了一种语音合成装置，包括：In a second aspect, embodiments of the present application also provide a speech synthesis device, including:

第一获取模块，用于获取第一文本，其中，所述第一文本中包括N个对话文本，每个对话文本对应一个角色的对话内容，N为大于1的整数；The first acquisition module is used to acquire the first text, wherein the first text includes N dialogue texts, each dialogue text corresponds to the dialogue content of a character, and N is an integer greater than 1;

第一确定模块，用于根据所述第一文本中各对话文本的目标信息，确定各对话文本的方位信息，其中，所述目标信息包括各对话文本对应的角色在该对话文本的关联影视片段中的位置信息，或所述第一文本中与各对话文本关联的位置关键词；The first determination module is used to determine the orientation information of each dialogue text according to the target information of each dialogue text in the first text, wherein the target information includes the associated film and television clips of the characters corresponding to each dialogue text in the dialogue text. The location information in the first text, or the location keywords associated with each dialogue text in the first text;

第二确定模块，用于根据所述各对话文本的方位信息，确定所述各对话文本的待合成音频的声源位置，并根据所述各对话文本的待合成音频的声源位置，生成各对话文本的合成语音。The second determination module is configured to determine the sound source position of the audio to be synthesized for each dialogue text based on the orientation information of each dialogue text, and generate each sound source position of the audio to be synthesized for each dialogue text based on the sound source location of the audio to be synthesized. Synthetic speech for conversational text.

所述第一确定模块，包括：The first determination module includes:

第一确定单元，用于在包含所述第一文本的电子书存在关联影视资源的情况下，根据所述各对话文本的内容和所述关联影视资源中各视频帧的影视内容，确定所述关联影视资源中分别与各对话文本匹配的关联影视片段；A first determining unit configured to determine, when the e-book containing the first text has associated film and television resources, based on the content of each dialogue text and the film and television content of each video frame in the associated film and television resources, Related film and television clips in related film and television resources that match each dialogue text respectively;

第二确定单元，用于确定所述各对话文本的关联影视片段中与对应对话文本的角色匹配的目标影视角色；a second determination unit configured to determine the target film and television character in the film and television clip associated with each dialogue text that matches the role of the corresponding dialogue text;

第三确定单元，用于根据所述各对话文本的关联影视片段中的目标影视角色在该关联影视片段的影视画面中的位置信息，确定各对话文本的方位信息。The third determination unit is configured to determine the orientation information of each dialogue text based on the position information of the target film and television character in the associated film and television segment of each dialogue text in the film and television frame of the associated film and television segment.

可选地，所述第二确定单元，具体用于：Optionally, the second determining unit is specifically used to:

可选地，所述第一确定单元，包括：Optionally, the first determining unit includes:

第一确定子单元，用于根据第一对话文本的内容和所述关联影视资源中各视频帧的字幕内容或音频内容，确定所述关联影视资源中与所述第一对话文本匹配的第一关联影视片段，所述第一对话文本为所述第一文本中的任一对话文本；The first determining subunit is configured to determine the first dialogue text in the associated film and television resource that matches the first dialogue text based on the content of the first dialogue text and the subtitle content or audio content of each video frame in the associated film and television resource. Associated with film and television clips, the first dialogue text is any dialogue text in the first text;

第二确定子单元，用于对所述第一关联影视片段进行场景识别，确定所述第一关联影视片段中的目标场景元素；The second determination subunit is used to perform scene recognition on the first associated film and television clip and determine the target scene element in the first associated film and television clip;

第三确定子单元，用于确定所述第一文本中所述第一对话文本的上下文文本中的场景关键词；A third determination subunit, configured to determine scene keywords in the context text of the first dialogue text in the first text;

第一检测子单元，用于在所述目标场景元素与所述场景关键词匹配的情况下，对所述第一关联影视片段进行目标检测。The first detection subunit is configured to perform target detection on the first associated film and television clip when the target scene element matches the scene keyword.

可选地，所述第二确定子单元，具体用于：Optionally, the second determination subunit is specifically used for:

所述第一确定模块，还包括：The first determination module also includes:

第一遍历单元，用于遍历所述第一文本中的角色名称；The first traversal unit is used to traverse the character names in the first text;

第四确定单元，用于确定所述第一文本中出现各角色名称的位置的上下文文本中的位置关键词；The fourth determination unit is used to determine the position keywords in the contextual text of the position where each character name appears in the first text;

第五确定单元，用于根据所述第一文本中各角色名称对应的对话文本，确定所述各对话文本关联的位置关键词。The fifth determination unit is used to determine the position keyword associated with each dialogue text according to the dialogue text corresponding to each character name in the first text.

可选地，所述第四确定单元，具体用于：Optionally, the fourth determining unit is specifically used to:

第六确定单元，用于根据所述方位表中记录的各角色名称对应的处所词和方位，确定各角色名称对应的对话文本的方位。The sixth determination unit is used to determine the location of the dialogue text corresponding to each character name based on the location word and location corresponding to each character name recorded in the location table.

可选地，所述语音合成装置还包括：Optionally, the speech synthesis device further includes:

第三确定模块，用于确定所述第一文本中各对话文本的情绪属性；The third determination module is used to determine the emotional attributes of each dialogue text in the first text;

第四确定模块，用于根据所述各对话文本的情绪属性，确定所述各对话文本的待合成音频的音频参数；The fourth determination module is used to determine the audio parameters of the audio to be synthesized of each dialogue text according to the emotional attributes of each dialogue text;

所述第二确定模块，包括：The second determination module includes:

第一生成单元，用于根据所述各对话文本的待合成音频的声源位置和音频参数，生成各对话文本的合成语音。The first generation unit is configured to generate the synthesized speech of each dialogue text according to the sound source position and audio parameters of the audio to be synthesized of each dialogue text.

可选地，所述第三确定模块，包括：Optionally, the third determination module includes:

第一处理模块，用于对第一对话文本和所述第一对话文本的上下文进行分词处理，得到多个情绪词组，其中，所述第一对话文本为所述第一文本中的任一对话文本；The first processing module is used to perform word segmentation processing on the first dialogue text and the context of the first dialogue text to obtain multiple emotional phrases, wherein the first dialogue text is any dialogue in the first text. text;

第一查询单元，通过预先构建的情绪词典，查询所述多个情绪词组中每个词组对应的情绪属性；The first query unit queries the emotional attributes corresponding to each of the plurality of emotional phrases through a pre-constructed emotion dictionary;

第七确定单元，用于根据所述每个情绪词组对应的情绪属性，确定所述第一对话文本的情绪属性。A seventh determination unit is configured to determine the emotional attribute of the first dialogue text according to the emotional attribute corresponding to each emotional phrase.

可选地，所述第七确定单元，具体用于：Optionally, the seventh determining unit is specifically used to:

在所述第一对话文本中至少两个情绪词组对应的情绪属性不同的情况下，确定所述每个情绪词组对应的情绪属性的权重；When the emotional attributes corresponding to at least two emotional phrases in the first dialogue text are different, determine the weight of the emotional attribute corresponding to each emotional phrase;

可选地，所述第一生成单元，具体用于：Optionally, the first generation unit is specifically used for:

第五确定模块，用于根据预设的人物字典中标注的所述第一文本中各角色的名称信息，确定所述各对话文本的角色属性；The fifth determination module is used to determine the character attributes of each dialogue text according to the name information of each character in the first text marked in the preset character dictionary;

所述第一生成单元，具体还用于：The first generation unit is specifically also used for:

第三方面，本申请实施例还提供了一种电子设备，包括：收发机、存储器、处理器及存储在所述存储器上并可在所述处理器上运行的计算机程序；In a third aspect, embodiments of the present application also provide an electronic device, including: a transceiver, a memory, a processor, and a computer program stored on the memory and executable on the processor;

所述处理器，用于读取存储器中的程序实现如第一方面所述的语音合成方法的步骤。The processor is configured to read the program in the memory to implement the steps of the speech synthesis method described in the first aspect.

第四方面，本申请实施例还提供了一种计算机可读存储介质，用于存储计算机程序，所述计算机程序被处理器执行时实现如第一方面所述的语音合成方法的步骤。In a fourth aspect, embodiments of the present application further provide a computer-readable storage medium for storing a computer program that implements the steps of the speech synthesis method as described in the first aspect when executed by a processor.

本申请实施例中，通过确定文本中的各角色对话文本的方位信息，并基于方位信息确定各对话文本的待合成音频的声源位置，进而按照各对话文本的待合成音频的声源位置，生成各对话文本的合成语音，能够使合成的语音具有立体声播放效果，可见，本申请实施例合成的语音文件内容丰富，能够有效提高对话文本的语音播放效果。In the embodiment of the present application, by determining the orientation information of each character dialogue text in the text, and determining the sound source position of the audio to be synthesized for each dialogue text based on the orientation information, and then based on the sound source location of the audio to be synthesized for each dialogue text, Generating the synthesized speech of each dialogue text can make the synthesized speech have a stereo playback effect. It can be seen that the synthesized speech file according to the embodiment of the present application is rich in content and can effectively improve the speech playback effect of the dialogue text.

附图说明Description of drawings

为了更清楚地说明本申请实施例的技术方案，下面将对本申请实施例描述中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图仅仅是本申请的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动性的前提下，还可以根据这些附图获得其他的附图。In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments of the present application will be briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application. For those of ordinary skill in the art, other drawings can be obtained based on these drawings without exerting any creative effort.

图1为本申请实施例提供的一种语音合成方法的流程图之一；Figure 1 is one of the flow charts of a speech synthesis method provided by an embodiment of the present application;

图2为本申请实施例提供的一种声源位置示意图之一；Figure 2 is a schematic diagram of a sound source position provided by an embodiment of the present application;

图3为本申请实施例提供的一种声源位置示意图之二；Figure 3 is a second schematic diagram of the position of a sound source provided by an embodiment of the present application;

图4为本申请实施例提供的一种确定对话文本方位信息的流程图；Figure 4 is a flow chart for determining the location information of dialogue text provided by an embodiment of the present application;

图5为本申请实施例提供的一种确定情绪属性的权值的流程图；Figure 5 is a flow chart for determining the weight of emotional attributes provided by an embodiment of the present application;

图6为本申请实施例提供的一种语音合成方法的流程图之二；Figure 6 is a second flow chart of a speech synthesis method provided by an embodiment of the present application;

图7为本申请实施例提供的一种语音合成装置的结构图；Figure 7 is a structural diagram of a speech synthesis device provided by an embodiment of the present application;

图8为本申请实施例提供的一种电子设备的结构图。Figure 8 is a structural diagram of an electronic device provided by an embodiment of the present application.

具体实施方式Detailed ways

下面将结合本申请实施例中的附图，对本申请实施例中的技术方案进行清楚、完整地描述，显然，所描述的实施例是本申请一部分实施例，而不是全部的实施例。基于本申请中的实施例，本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例，都属于本申请保护的范围。The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are part of the embodiments of the present application, rather than all of the embodiments. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative efforts fall within the scope of protection of this application.

参见图1，图1是本申请实施例提供的一种语音合成方法的流程图，如图1所示，该方法包括以下步骤：Referring to Figure 1, Figure 1 is a flow chart of a speech synthesis method provided by an embodiment of the present application. As shown in Figure 1, the method includes the following steps:

步骤101、获取第一文本，其中，第一文本中包括N个对话文本，每个对话文本对应一个角色的对话内容，N为大于1的整数；Step 101: Obtain the first text, where the first text includes N dialogue texts, each dialogue text corresponds to the dialogue content of a character, and N is an integer greater than 1;

步骤102、根据第一文本中各对话文本的目标信息，确定各对话文本的方位信息，其中，目标信息包括各对话文本对应的角色在该对话文本的关联影视片段中的位置信息，或第一文本中与各对话文本关联的位置关键词；Step 102: Determine the orientation information of each dialogue text according to the target information of each dialogue text in the first text, where the target information includes the position information of the character corresponding to each dialogue text in the associated film and television clip of the dialogue text, or the first Positional keywords in the text associated with each dialogue text;

步骤103、根据各对话文本的方位信息，确定各对话文本的待合成音频的声源位置，并根据各对话文本的待合成音频的声源位置，生成各对话文本的合成语音。Step 103: Determine the sound source location of the audio to be synthesized for each dialogue text based on the orientation information of each dialogue text, and generate a synthesized speech for each dialogue text based on the sound source location of the audio to be synthesized for each dialogue text.

在步骤101中，通信设备获取第一文本，第一文本用于合成语音，第一文本为电子书中的文本，电子书的类型包括小说等。在获取第一文本之后，对第一文本进行识别，以识别出第一文本包括的N个对话文本，其中，N为大于1的整数，且N个对话文本中每个对话文本对应一个角色的对话内容。需要说明的是，在N个对话文本中，可能每个对话文本对应的角色均不相同，也可能存在两个或两个以上的对话文本对应的角色为同一角色。In step 101, the communication device obtains a first text. The first text is used to synthesize speech. The first text is a text in an e-book. The type of e-book includes novels, etc. After obtaining the first text, the first text is identified to identify N dialogue texts included in the first text, where N is an integer greater than 1, and each of the N dialogue texts corresponds to a character. Conversation content. It should be noted that among the N dialogue texts, each dialogue text may correspond to a different role, or there may be two or more dialogue texts corresponding to the same role.

在步骤102中，方位信息例如“左边、右边以及中间”等。各对话文本的方位信息可反映各对话文本对应的角色所处的方位。第一文本中，在各角色所处的方位不同的情况下，各角色对应的对话文本的方位信息也不同。In step 102, the orientation information is such as "left, right and center". The orientation information of each dialogue text can reflect the orientation of the character corresponding to each dialogue text. In the first text, when the orientation of each character is different, the orientation information of the dialogue text corresponding to each character is also different.

本申请实施例中，可通过两种方案确定对话文本的方位信息。该两种方案分别为：根据各对话文本对应的角色在该对话文本的关联影视片段中的位置信息，来确定第一文本中各对话文本的方位信息；或者，根据第一文本中各对话文本关联的位置关键词，来确定第一文本中的各对话文本的方位信息。In the embodiment of the present application, the orientation information of the dialogue text can be determined through two solutions. The two solutions are: determining the position information of each dialogue text in the first text based on the position information of the character corresponding to each dialogue text in the film and television clip associated with the dialogue text; or, determining the position information of each dialogue text in the first text based on The associated position keywords are used to determine the location information of each dialogue text in the first text.

其中，对话文本的关联影视片段是指关联影视片段中存在与该对话文本匹配的字幕或者音频等。在第一文本中各对话文本不存在关联影视片段的情况下，可直接根据第一文本中各对话文本关联的位置关键词，来确定各对话文本的方位信息。在各对话文本存在关联影视片段的情况下，可根据需求择一选择以何种方式来确定各对话文本的方位信息。例如：可预先对上述两种确定对话文本的方位信息的方案进行测试，以确定上述两种方案的方位信息确定准确率，再根据准确率优先的规则择一选择对应的方案。Among them, the associated film and television clips of the dialogue text refer to the existence of subtitles or audio that match the dialogue text in the associated film and television clips. When there are no associated film and television clips for each dialogue text in the first text, the location information of each dialogue text can be determined directly based on the location keywords associated with each dialogue text in the first text. When each dialogue text has associated film and television clips, you can choose a method to determine the location information of each dialogue text according to your needs. For example, the above two schemes for determining the orientation information of the dialogue text can be tested in advance to determine the accuracy of the above two schemes for determining the orientation information, and then one of the corresponding schemes can be selected according to the accuracy priority rule.

在步骤103中，各对话文本的待合成音频的声源位置与各对话文本的方位信息关联。具体的，若两个对话文本的方位信息不同，则该两个对话文本的声源位置也不同。根据各对话文本的待合成音频的声源位置，生成各对话文本的合成语音，从而各对话文本的合成语音能够反馈出声源位置，以达到立体播放效果。In step 103, the sound source position of the audio to be synthesized in each dialogue text is associated with the orientation information of each dialogue text. Specifically, if the orientation information of the two dialogue texts is different, the sound source positions of the two dialogue texts are also different. According to the sound source position of the audio to be synthesized in each dialogue text, the synthesized voice of each dialogue text is generated, so that the synthesized voice of each dialogue text can feed back the sound source location to achieve a stereoscopic playback effect.

需要说明的是，本申请中可预先设置播放合成语音的声道数量，根据播放合成语音的声道数量所能够反馈的声源位置数量，来预先确定所有可能的声源位置。例如，若合成语音为双声道语音，则合成语音能够反馈三种声源位置，双声道即包含左边声道和右边声道。双声道语音可包含的三种声源位置分别为“左边、右边以及中间”。如图2所示，声源位置为中间，则代表双声道中的左边声道和右边声道音频同步且无音量差，从而听者会认为声源在正前方；如图3所示，声源位置为左边，则代表右声道的音频相较于左声道存在延迟，或者右声道的音频的音量更低，从而听者感知到的声源位置就会偏左侧。基于上述，若合成语音为双声道语音，则合成语音可包含的三种声源位置分别为“左边、右边以及中间”，从而在确定文本中的各对话文本的方位信息时，可确定各对话文本的方位信息为“左边”、“右边”或者“中间”，进一步地，根据方位信息为“左边”、“右边”或者“中间”确定对应的声源位置。It should be noted that in this application, the number of vocal channels for playing synthetic speech can be preset, and all possible sound source positions can be predetermined based on the number of sound source positions that can be fed back by the number of vocal channels for playing synthetic speech. For example, if the synthesized speech is a binaural speech, the synthesized speech can feedback three sound source positions, and the binaural includes the left channel and the right channel. The three sound source positions that binaural speech can include are "left, right and center". As shown in Figure 2, if the sound source is in the middle, it means that the left and right channels in the two-channel audio are synchronized and there is no volume difference, so the listener will think that the sound source is directly in front; as shown in Figure 3, If the sound source position is on the left, it means that the audio in the right channel is delayed compared to the left channel, or the volume of the audio in the right channel is lower, so the position of the sound source perceived by the listener will be to the left. Based on the above, if the synthesized speech is a binaural speech, the three sound source positions that the synthesized speech can include are "left, right and middle". Therefore, when determining the orientation information of each dialogue text in the text, each The orientation information of the dialogue text is "left", "right" or "middle". Further, the corresponding sound source position is determined based on the orientation information being "left", "right" or "middle".

其中，设置合成语音为双声道语音只是示例性说明，还可设置合成语音为三声道语音或者四声道语音等。Among them, setting the synthesized voice to be a two-channel voice is only an exemplary description, and the synthesized voice can also be set to be a three-channel voice or a four-channel voice, etc.

本申请实施例中，还可确定文本中的除各对话文本以外的旁白的方位信息，旁白的方位信息可均为同一的方位信息，例如旁白的方位信息可均为上述三种方位信息的“中间”，根据旁白的方位信息，可确定旁白的待合成音频的声源位置，根据旁白的待合成音频的声源位置可生成旁白的合成语音。In the embodiment of the present application, the position information of the narration in the text other than each dialogue text can also be determined. The position information of the narration can all be the same position information. For example, the position information of the narration can be "all of the above three kinds of position information". "Middle", according to the orientation information of the narration, the sound source position of the narration's audio to be synthesized can be determined, and the synthesized speech of the narration can be generated based on the sound source position of the narration's audio to be synthesized.

可选地，目标信息包括各对话文本对应的角色在该对话文本的关联影视片段中的位置信息；Optionally, the target information includes position information of the character corresponding to each dialogue text in the film and television clip associated with the dialogue text;

根据第一文本中各对话文本的目标信息，确定各对话文本的方位信息，包括：According to the target information of each dialogue text in the first text, the orientation information of each dialogue text is determined, including:

在包含第一文本的电子书存在关联影视资源的情况下，根据各对话文本的内容和关联影视资源中各视频帧的影视内容，确定关联影视资源中分别与各对话文本匹配的关联影视片段；When the e-book containing the first text has associated film and television resources, determine the associated film and television clips in the associated film and television resources that match each dialogue text respectively based on the content of each dialogue text and the film and television content of each video frame in the associated film and television resources;

确定各对话文本的关联影视片段中与对应对话文本的角色匹配的目标影视角色；Determine the target film and television character in the film and television clip associated with each dialogue text that matches the role of the corresponding dialogue text;

根据各对话文本的关联影视片段中的目标影视角色在该关联影视片段的影视画面中的位置信息，确定各对话文本的方位信息。The orientation information of each dialogue text is determined based on the position information of the target film and television character in the associated film and television segment of each dialogue text in the film and television frame of the associated film and television segment.

该实施方式中，包含第一文本的电子书存在关联影视资源，关联影视资源即该电子书的影视作品，关联影视资源包括电子书关联的电视剧资源、动漫资源以及电影资源等。In this embodiment, the e-book containing the first text has associated film and television resources. The associated film and television resources are the film and television works of the e-book. The associated film and television resources include TV drama resources, animation resources, and movie resources associated with the e-book.

在包含第一文本的电子书存在关联影视资源的情况下，根据各对话文本的内容和关联影视资源中各视频帧的影视内容，确定关联影视资源中分别与各对话文本匹配的关联影视片段。关联影视资源中各视频帧的影视内容包括但不限于各视频帧的语音信息、背景信息、人物信息以及字幕信息等。在多个相邻视频帧的影视内容与第一文本中的任意对话文本匹配的情况下，例如在多个相邻视频帧中的字幕信息与任意对话文本匹配的情况下，确定该多个相邻视频帧形成的影视片段即为与该对话文本匹配的关联影视片段。When the e-book containing the first text has associated film and television resources, the associated film and television segments in the associated film and television resources that match each dialogue text are determined based on the content of each dialogue text and the film and television content of each video frame in the associated film and television resources. The film and television content of each video frame in the associated film and television resources includes but is not limited to the voice information, background information, character information, subtitle information, etc. of each video frame. When the video content of multiple adjacent video frames matches any dialogue text in the first text, for example, when the subtitle information in multiple adjacent video frames matches any dialogue text, it is determined that the multiple adjacent video frames match any dialogue text. The video clip formed by adjacent video frames is the associated video clip matching the dialogue text.

在确定各对话文本关联的影视片段之后，确定各对话文本的关联影视片段中与对应对话文本的角色匹配的目标影视角色。每个对话文本的关联影视片段中存在至少一个影视角色，本申请实施例在至少一个影视角色中确定目标影视角色，关联影视片段中的目标影视角色与关联影视片段所对应的对话文本的角色匹配，例如关联影视片段A对应的对话文本为对话文本A，对话文本A的角色为B，则所确定的关联影视片段A中的目标影视角色也为B。After determining the film and television clips associated with each dialogue text, determine the target film and television character in the film and television clips associated with each dialogue text that matches the role of the corresponding dialogue text. There is at least one film and television character in the associated film and television segment of each dialogue text. The embodiment of the present application determines the target film and television character in the at least one film and television character, and the target film and television character in the associated film and television segment matches the role of the dialogue text corresponding to the associated film and television segment. , for example, the dialogue text corresponding to the associated film and television segment A is dialogue text A, and the character of dialogue text A is B, then the determined target film and television character in the associated film and television segment A is also B.

在确定各关联影视片段中的目标影视角色之后，确定目标影视角色在对应关联视频片段的影视画面中的位置信息。每个关联影视片段中的目标影视角色均与对应的对话文本的角色匹配，从而目标影视角色在关联影视片段的影视画面中的位置信息能够反应出该关联影视片段所对应的对话文本的角色的位置信息，故可根据各对话文本的关联影视片段中的目标影视角色在该关联影视片段的影视画面中的位置信息，确定各对话文本的方位信息。After determining the target film and television character in each associated film and television segment, the position information of the target film and television character in the film and television frame corresponding to the associated video segment is determined. The target film and television character in each associated film and television clip matches the role of the corresponding dialogue text, so that the position information of the target film and television character in the film and television screen of the associated film and television clip can reflect the character of the dialogue text corresponding to the associated film and television clip. Position information, so the orientation information of each dialogue text can be determined based on the position information of the target film and television character in the associated film and television clip of each dialogue text in the film and television frame of the associated film and television clip.

作为示例，目标影视角色在影视画面中的位置信息，可为目标影视角色在影视画面中的坐标信息。根据目标影视角色在影视画面中的坐标信息来确定各对话文本的方位信息，例如可预先确定多个坐标范围，每个坐标范围关联一方位信息，若设置合成语音为双声道语音，方位信息例如“左边、右边以及中间”。当目标影视角色在影视画面中的坐标在左边方位信息所关联的坐标范围内，则确定该目标影视角色的方位信息为左边。As an example, the position information of the target film and television character in the film and television screen can be the coordinate information of the target film and television character in the film and television screen. The orientation information of each dialogue text is determined based on the coordinate information of the target film and television character in the film and television screen. For example, multiple coordinate ranges can be predetermined, and each coordinate range is associated with a piece of orientation information. If the synthesized voice is set to a two-channel voice, the orientation information For example, "left, right, and center." When the coordinates of the target film and television character in the film and television screen are within the coordinate range associated with the left orientation information, the orientation information of the target film and television character is determined to be the left.

该实施方式中，各对话文本的关联影视片段中的目标影视角色在该关联影视片段的影视画面中的位置信息，能够反映出各对话文本的角色的位置信息，通过确定各对话文本的关联影视片段中的目标影视角色在该关联影视片段的影视画面中的位置信息，来确定各对话文本的方位信息，能够提高所确定的各对话文本的方位信息的准确度。In this embodiment, the position information of the target film and television character in the film and television segment associated with each dialogue text in the film and television screen of the associated film and television segment can reflect the location information of the character in each dialogue text. By determining the associated film and television character of each dialogue text The position information of the target film and television character in the clip in the film and television frame of the associated film and television clip is used to determine the position information of each dialogue text, which can improve the accuracy of the determined position information of each dialogue text.

可选地，确定各对话文本的关联影视片段中与对应对话文本的角色匹配的目标影视角色，包括：Optionally, determine the target film and television character in the film and television clip associated with each dialogue text that matches the role of the corresponding dialogue text, including:

对第一关联影视片段进行目标检测，确定第一关联影视片段中发出与第一对话文本匹配的对话语音的目标图像，其中，目标图像所指示的影视角色为与第一对话文本的角色匹配的目标影视角色，第一关联影视片段为第一对话文本的关联影视片段，第一对话文本为第一文本中的任一对话文本。Perform target detection on the first associated film and television clip, and determine the target image in the first associated film and television clip that emits a dialogue voice that matches the first dialogue text, wherein the film and television character indicated by the target image is a character that matches the first dialogue text. For the target film and television character, the first related film and television clip is a related film and television clip of the first dialogue text, and the first dialogue text is any dialogue text in the first text.

该实施方式中，对第一关联影视片段进行目标检测，确定第一关联影视片段中发出与第一对话文本匹配的对话语音的目标图像，从而该目标图像指示的影视角色与第一对话文本的角色匹配，即该目标图像指示的影视角色为与第一对话文本的角色匹配的目标影视角色，其中，目标图像例如人脸图像或者动漫角色图像等。In this embodiment, target detection is performed on the first associated film and television segment, and the target image in the first associated film and television segment that emits a dialogue voice that matches the first dialogue text is determined, so that the film and television character indicated by the target image matches the first dialogue text. Role matching means that the film and television character indicated by the target image is a target film and television character that matches the role of the first dialogue text, where the target image is, for example, a human face image or an animation character image.

确定发出与第一对话文本匹配的对话语音的目标图像，需对第一关联影视片段进行目标检测，若检测到多个人物图像，则将多个人物图像中图像信息指示嘴型处于对话状态，且嘴型处于对话状态时音频也为上述对话语音的确定为目标图像。To determine the target image that emits a dialogue voice that matches the first dialogue text, target detection needs to be performed on the first associated film and television clip. If multiple character images are detected, the image information in the multiple character images indicates that the mouth shape is in a dialogue state. And when the mouth shape is in a conversational state, the audio is also determined to be the above-mentioned conversational voice as the target image.

需要说明的是，本申请实施例中第一对话文本与对话语音匹配，并不要求第一对话文本与对话语音完全匹配，可设置第一对话文本与对话语音的相似值超出某一阈值则确定第一对话文本与对话语音匹配。It should be noted that in the embodiment of the present application, the matching of the first dialogue text and the dialogue voice does not require that the first dialogue text and the dialogue voice completely match. It can be set that the similarity value between the first dialogue text and the dialogue voice exceeds a certain threshold and is determined. The first dialogue text matches the dialogue speech.

本申请实施例中确定第一关联影视片段中发出与第一对话文本匹配的对话语音的目标图像，来确定与第一对话文本的角色匹配的目标影视角色，能够使所确定的目标影视角色较为准确，进而能够使根据目标影视角色的位置信息所确定的对话文本的方位信息较为准确。In the embodiment of the present application, the target image in the first associated film and television clip that emits the dialogue voice that matches the first dialogue text is determined to determine the target film and television character that matches the role of the first dialogue text, so that the determined target film and television character can be compared Accurate, thereby making the position information of the dialogue text determined based on the position information of the target film and television character more accurate.

可选地，根据各对话文本的内容和关联影视资源中各视频帧的影视内容，确定关联影视资源中分别与各对话文本匹配的关联影视片段，包括：Optionally, based on the content of each dialogue text and the film and television content of each video frame in the associated film and television resources, determine the associated film and television segments in the associated film and television resources that match each dialogue text, including:

根据第一对话文本的内容和关联影视资源中各视频帧的字幕内容或音频内容，确定关联影视资源中与第一对话文本匹配的第一关联影视片段，第一对话文本为第一文本中的任一对话文本；According to the content of the first dialogue text and the subtitle content or audio content of each video frame in the associated film and television resources, the first associated film and television segment in the associated film and television resources that matches the first dialogue text is determined, and the first dialogue text is the Any dialogue text;

对第一关联影视片段进行场景识别，确定第一关联影视片段中的目标场景元素；Perform scene recognition on the first associated film and television clip to determine the target scene element in the first associated film and television clip;

确定第一文本中第一对话文本的上下文文本中的场景关键词；Determining scene keywords in the context text of the first dialogue text in the first text;

在目标场景元素与场景关键词匹配的情况下，对第一关联影视片段进行目标检测。When the target scene element matches the scene keyword, target detection is performed on the first associated film and television clip.

该实施方式中，若关联影视资源中各视频帧存在字幕内容，则可直接根据字幕内容来确定关联影视资源中与第一对话文本匹配的第一关联影视片段；若关联影视资源中各视频帧仅存在音频内容，则可将音频内容转换为文本内容之后，再根据文本内容确定关联影视资源中与第一对话文本匹配的第一关联影视片段。In this implementation, if subtitle content exists in each video frame in the associated film and television resource, the first associated film and television segment in the associated film and television resource that matches the first dialogue text can be determined directly based on the subtitle content; if each video frame in the associated film and television resource If only audio content exists, the audio content can be converted into text content, and then the first associated film and television clip in the associated film and television resource that matches the first dialogue text can be determined based on the text content.

电子书关联的影视资源通常会进行一定改编，以根据字幕内容来确定关联影视资源中与第一对话文本匹配的第一关联影视片段为例，多个相邻视频帧的字幕内容与第一对话文本的文本相似度超过一定阈值(例如70％)，则可确定该多个相邻视频帧形成的视频片段为与第一对话文本匹配的第一关联影视片段。Film and television resources associated with e-books are usually subject to certain adaptations. For example, the first associated film and television clip in the associated film and television resource that matches the first dialogue text is determined based on the subtitle content. The subtitle content of multiple adjacent video frames is consistent with the first dialogue text. If the text similarity of the text exceeds a certain threshold (for example, 70%), it can be determined that the video segment formed by the plurality of adjacent video frames is the first associated film and television segment that matches the first dialogue text.

其中，上述文本相似度可根据杰卡德相似度来确定。具体的，先对多个视频帧的字幕和第一对话文本进行分词，假设对第一对话文本分词后形成词语集合A，对多个视频帧的字幕分词后形成词语集合B：Among them, the above text similarity can be determined based on Jaccard similarity. Specifically, first segment the subtitles of multiple video frames and the first dialogue text. It is assumed that the first dialogue text is segmented to form a word set A, and the subtitles of multiple video frames are segmented to form a word set B:

词语集合A＝{请/问/卫生间/在/哪里/？}Word set A={please/ask/the bathroom/is/where/? }

词语集合B＝{请/问/厕所/在/哪里/？}Word set B={please/ask/the toilet/is/where/? }

杰卡德相似度＝词语集合A与词语集合B相同分词数÷词语集合A与词语集合B所有分词数，从而词语集合A与词语集合B的杰卡德相似度为：Jaccard similarity = the number of the same word segments between word set A and word set B ÷ the number of all word segments between word set A and word set B, so the Jaccard similarity between word set A and word set B is:

长度({请/问/在/哪里/？})÷长度({请/问/卫生间/在/哪里/？/厕所}＝5/7＝71.43％Length ({please/ask/at/where/?})÷length ({please/ask/toilet/at/where/?/toilet}=5/7=71.43%

可见，上述词语集合A与词语集合B的文本相似度为71.43％。It can be seen that the text similarity between the above word set A and word set B is 71.43%.

在确定第一对话文本的关联影视片段之后，进一步地，对第一关联影视片段进行场景识别，确定关联影视片段中的场景元素，场景元素包括人物元素、物品元素以及风景元素等。作为示例，场景元素例如：树木、街道以及房屋等。After determining the associated film and television clips of the first dialogue text, further, perform scene recognition on the first associated film and television clips, and determine the scene elements in the associated film and television clips. The scene elements include character elements, object elements, scenery elements, etc. As examples, scene elements include trees, streets, and houses.

其中，人物元素可包括具体的角色元素。可基于电子书的关联影视资源构建人物库，人物库中包括主要角色以及主要角色对应的角色图像，若关联影视资源为电视剧资源或者电影资源，则人物库中可包括主要角色以及主要角色对应的人脸图像。在构建人物库之后，对关联影视片段进行场景识别，若关联影视片段的影视画面中存在人物库中角色图像，则可基于人物库确定该角色图像对应的角色元素。Among them, character elements may include specific role elements. A character library can be constructed based on the associated film and television resources of the e-book. The character library includes the main characters and the character images corresponding to the main characters. If the related film and television resources are TV drama resources or movie resources, the character library can include the main characters and the corresponding images of the main characters. Face images. After the character library is constructed, scene recognition is performed on the associated film and television clips. If there is a character image in the character library in the film and television screen of the related film and television clip, the character element corresponding to the character image can be determined based on the character library.

在确定第一关联影视片段的目标场景元素之后，再确定第一对话文本的上下文文本中的场景关键词，并在目标场景元素与场景关键词匹配的情况下，才对第一关联影视片段进行目标检测。其中，目标场景元素与场景关键词匹配可理解为目标场景元素命中率，即场景关键词中是否包含目标场景元素。After determining the target scene element of the first associated film and television clip, the scene keywords in the context text of the first dialogue text are determined, and only when the target scene element matches the scene keyword, the first associated film and television clip is processed Target Detection. Among them, the matching between the target scene element and the scene keyword can be understood as the hit rate of the target scene element, that is, whether the scene keyword contains the target scene element.

本申请实施例在确定第一对话文本的第一关联影视片段之后，再将第一关联影视片段的目标场景元素与第一对话文本的上下文场景关键词进行匹配，若目标场景元素与场景关键词匹配，则能够更进一步确定第一关联影视片段的内容中包含与第一对话文本匹配的内容，也即能够更进一步确定第一关联影视片段中包含发出与第一对话文本匹配的对话语音的目标图像。可见，本申请在目标场景元素与场景关键词匹配的情况下，才对第一关联影视片段进行目标检测，有利于提高目标检测的可靠性。In this embodiment of the present application, after determining the first associated film and television segment of the first dialogue text, the target scene element of the first associated film and television segment is matched with the contextual scene keyword of the first dialogue text. If the target scene element matches the scene keyword matching, it can be further determined that the content of the first associated film and television segment contains content that matches the first dialogue text, that is, it can be further determined that the first associated film and television segment contains a target that emits dialogue voice that matches the first dialogue text. image. It can be seen that this application only performs target detection on the first associated film and television clip when the target scene element matches the scene keyword, which is beneficial to improving the reliability of target detection.

可选地，对第一关联影视片段进行场景识别，确定第一关联影视片段中的目标场景元素，包括：Optionally, perform scene recognition on the first associated film and television clip to determine the target scene element in the first associated film and television clip, including:

对第一关联影视片段进行场景识别，确定第一关联影视片段中的各个场景元素以及各个场景元素的可信度；Perform scene recognition on the first associated film and television clip, and determine each scene element and the credibility of each scene element in the first associated film and television clip;

将第一关联影视片段中可信度超过预设阈值的场景元素确定为目标场景元素。Scene elements whose credibility exceeds a preset threshold in the first associated video clip are determined as target scene elements.

每个场景元素的可信度用于反映第一关联影视片段中存在该场景元素的可信度。将第一关联影视片段中可信度超过预设阈值的场景元素确定为目标场景元素，例如，若第一关联影视片段中，场景元素树木的可信度为80％、场景元素天空的可信度为60％，场景元素月亮的可信度为30％，预设阈值为50％，则可确定场景元素树木以及场景元素天空为目标场景元素。The credibility of each scene element is used to reflect the credibility of the existence of the scene element in the first associated film and television clip. Determine the scene element whose credibility exceeds the preset threshold in the first associated film and television clip as the target scene element. For example, if in the first associated film and television clip, the credibility of the scene element tree is 80%, and the credibility of the scene element sky is 80%. If the degree is 60%, the credibility of the scene element moon is 30%, and the preset threshold is 50%, then the scene element trees and scene element sky can be determined as the target scene elements.

本申请实施例通过将第一关联影视片段中可信度超过预设阈值的场景元素确定为目标场景元素，从而能够提高目标场景元素的可靠性。Embodiments of the present application determine the scene elements whose credibility exceeds a preset threshold in the first associated video clip as target scene elements, thereby improving the reliability of the target scene elements.

可选地，目标信息包括第一文本中与各对话文本关联的位置关键词；Optionally, the target information includes position keywords associated with each dialogue text in the first text;

遍历第一文本中的角色名称；Traverse the character names in the first text;

确定第一文本中出现各角色名称的位置的上下文文本中的位置关键词；Determining positional keywords in the contextual text where each character name appears in the first text;

根据第一文本中各角色名称对应的对话文本，确定各对话文本关联的位置关键词。According to the dialogue text corresponding to each character name in the first text, the position keyword associated with each dialogue text is determined.

包含第一文本的电子书中同一角色可能包括角色名称以及角色别称，基于此，可预先构建人物字典，人物词典中一角色的角色名称与该角色的别称相关联，人物字典可如下表所示：The same character in the e-book containing the first text may include a character name and a character nickname. Based on this, a character dictionary can be constructed in advance. The character name of a character in the character dictionary is associated with the character's nickname. The character dictionary can be as shown in the following table :

表1人物词典Table 1 Character Dictionary

角色名称Role Name 别称nickname 爱玲Ai Ling 张爱玲Zhang Ailing 兰成Lan Cheng 胡兰成Hu Lancheng 鲁迅Lu Xun 周树人、周先生、豫才、我Zhou Shuren, Mr. Zhou, Yucai, me 永吉Yongji 关永吉Guan Yongji 蕊生Ruisheng 胡兰成Hu Lancheng 小周Xiao Zhou 周训德Zhou Xunde 爱珍Aizhen 佘爱珍She Aizhen 秀美beautiful 范秀美Fan Xiumei 慧文Huiwen 全慧文Quan Huiwen 启武Qiwu 沈启武Shen Qiwu

在检测到第一文本中各角色名称对应的角色别称时，将该角色别称标注为对应的角色名称，需要说明的是，在电子书为以第一人称的角度撰写的情况下，角色别称还还包括“我”。例如，如表1所示，在以鲁迅为第一人称的电子书中，鲁迅对应的别称包括周树人、周先生、豫才和我，当识别到文本中包含周树人、周先生、豫才和我时，将周树人、周先生、豫才和我均标注为鲁迅。When the character nickname corresponding to each character name in the first text is detected, the character nickname is marked as the corresponding character name. It should be noted that when the e-book is written from the first-person perspective, the character nickname is still including me". For example, as shown in Table 1, in the e-book with Lu Xun as the first person, Lu Xun’s corresponding aliases include Zhou Shuren, Mr. Zhou, Yucai and me. When it is recognized that the text contains Zhou Shuren, Mr. Zhou, Yucai and me , labeling Zhou Shuren, Mr. Zhou, Yucai and me as Lu Xun.

在遍历第一文本的角色名称前，可对第一文本进行分词并标注各个词组的词性，从而根据各个词组的词性确定出第一文本中的各角色名称、角色别称以及位置关键词等相关信息，其中各个词组的词性包括但不限于下表所示：Before traversing the character names of the first text, the first text can be segmented and the part-of-speech of each phrase can be marked, so that relevant information such as each character name, character alias, and position keywords in the first text can be determined based on the part-of-speech of each phrase. , the parts of speech of each phrase include but are not limited to the following table:

表2词性表格Table 2 Parts of speech table

nn 普通名词common noun mm 数量词Quantifier ff 方位名词location noun qq 量词quantifier ss 所处名词location noun rr 代词pronoun nznz 其他专名Other proper names pp 介词preposition nwnw 作品名Title cc 连词conjunction vv 普通动词common verb uu 助词particle vdvd 动副词verb adverb xcxc 其他虚词Other function words vnvn 名动词noun verb ww 标点符号Punctuation aa 形容词adjective PERPER 人名name adad 副形词adverb LOCLOC 地名Place name anan 名形词noun ORGORG 机构名Organization name dd 副词adverb TIMETIME 时间time

在识别出第一文本包括角色别称的情况下，基于人物词典将角色别称标注为该角色别称对应的角色名称。在将角色别称标注为该角色别称对应的角色名称之后，遍历第一文本的角色名称，并确定第一文本中出现各角色名称的位置的上下文文本中的位置关键词，根据出现各角色名称的位置的上下文文本中的位置关键词，确定各角色名称对应的角色是否靠近某个角色、某个物品或者某个处所，从而确定各角色名称所关联的位置信息。在确定各角色名称所关联的位置关键词之后，根据各角色名称所关联的位置关键词以及各角色名称对应的对话文本，确定各对话文本关联的位置信息，例如角色名称张三关联的位置关键词包含“沙发上”，则角色名称张三对应的对话文本关联的位置关键词也包含“沙发上”。When it is recognized that the first text includes a character alias, the character alias is marked with a character name corresponding to the character alias based on the character dictionary. After marking the role alias as the role name corresponding to the role alias, traverse the role names of the first text, and determine the position keywords in the context text where each role name appears in the first text, and determine the location keywords in the context text where each role name appears according to the location where each role name appears. The location keyword in the contextual text of the location determines whether the character corresponding to each character name is close to a certain character, an object, or a certain place, thereby determining the location information associated with each character name. After determining the location keywords associated with each role name, determine the location information associated with each dialogue text based on the location keywords associated with each role name and the dialogue text corresponding to each role name, such as the location key associated with the role name Zhang San. If the word contains "on the sofa", then the position keyword associated with the dialogue text corresponding to the character name Zhang San also contains "on the sofa".

其中，第一文本中各个对话文本对应于一个角色的对话内容，可根据各个对话文本的上下文确定各个对话文本对应的角色名称，例如：第一文本包含：丽香静静地立在沙发旁，沉默良久，一字一顿地说道：“我们走吧，就现在。“可见，根据对话文本“我们走吧，就现在。“的上下文中的各个词组的词性，可识别出其对应的角色名称为丽香。Among them, each dialogue text in the first text corresponds to the dialogue content of a character, and the character name corresponding to each dialogue text can be determined according to the context of each dialogue text. For example: the first text contains: Lixiang stands quietly next to the sofa, After a long silence, he said word by word: "Let's go, now." It can be seen that according to the part of speech of each phrase in the context of the dialogue text "Let's go, now.", its corresponding character name can be identified For Lixiang.

本申请实施例中通过确定第一文本中出现各角色名称的位置的上下文文中的位置关键词，来确定各角色名称关联的位置关键词，从而根据各角色名称关联的位置关键词以及各角色名称对应的对话文本，能够确定各对话文本关联的位置关键词。可见，本申请实施例所确定的各对话文本关联的位置关键词能够较为准确反映出各对话文本对应的角色的位置，基于此，能够使各对话文本关联的位置关键词所确定的各对话文本的待合成音频的声源位置更为准确，从而有利于提高合成语言的播放效果。In the embodiment of the present application, the position keywords associated with each character name are determined by determining the position keywords in the context of the position where each character name appears in the first text, so that according to the position keywords associated with each character name and each character name The corresponding dialogue text can determine the position keywords associated with each dialogue text. It can be seen that the location keywords associated with each dialogue text determined in the embodiment of the present application can more accurately reflect the location of the character corresponding to each dialogue text. Based on this, each dialogue text determined by the location keywords associated with each dialogue text can be The sound source position of the audio to be synthesized is more accurate, which is beneficial to improving the playback effect of the synthesized language.

可选地，确定第一文本中出现各角色名称的位置的上下文文本中的位置关键词，包括：Optionally, determine the location keywords in the contextual text where each character name appears in the first text, including:

在第一文本中对第一角色名称的限定描述满足预设判别模式的情况下，根据限定描述，确定第一角色名称对应的第一处所词，其中，限定描述至少包括处所词，第一角色名称为第一文本中的任一角色名称；When the qualified description of the first character name in the first text satisfies the preset discrimination mode, determine the first local word corresponding to the first character name according to the limited description, wherein the qualified description at least includes the local word, and the first role The name is the name of any character in the first text;

根据第一角色名称对应的第一处所词，确定第一角色名称对应的第一方位；Determine the first location corresponding to the first character name according to the first location word corresponding to the first character name;

将第一角色名称、第一处所词和第一方位对应记录至方位表，其中，在第二角色名称对应的第二处所词与第一处所词相同的情况下，方位表中记录的第二角色名称对应的第二方位与第一方位相同，第二角色名称为第一文本中除第一角色名称外的任一角色名称；The first character name, the first location word and the first location are correspondingly recorded in the location table, where, when the second location word corresponding to the second character name is the same as the first location word, the second location word recorded in the location table The second position corresponding to the character name is the same as the first position, and the second character name is any character name in the first text except the first character name;

根据方位表中记录的各角色名称对应的处所词和方位，确定各角色名称对应的对话文本的方位。According to the location words and directions corresponding to each character name recorded in the location table, the location of the dialogue text corresponding to each character name is determined.

该实施方式中，处所词用于反映空间位置，处所词例如里屋、门口以及屋顶等，可预先构建处所词表，限定表述中除了包括处所词以外还可以包括方位词，方位词表示空间的相对位置，如：沙发上、大树下、上、下、左、右、旁、前、后以及边等，可预先构建方位词列表。In this embodiment, place words are used to reflect spatial positions. Place words are such as back rooms, doorways, roofs, etc. A place word list can be constructed in advance. In addition to place words, the limited expression can also include location words. The location words represent relative positions in space. Positions, such as: on the sofa, under a big tree, up, down, left, right, next to, front, back, side, etc., a list of location words can be built in advance.

如图4所示，在第一文本中对第一角色名称的限定描述满足预设判别模式的情况下，根据限定描述中的处所词，确定第一角色名称对应的第一处所词，其中，上述限定描述至少包括处所词。As shown in Figure 4, when the qualified description of the first character name in the first text satisfies the preset discrimination mode, the first local word corresponding to the first character name is determined according to the local word in the qualified description, where, The above limited description includes at least the word of location.

为便于理解，以下结合具体示例对预设判别模式进行说明。For ease of understanding, the preset discrimination mode is described below with specific examples.

判别模式可如下(中括号内为可选内容)：The discrimination mode can be as follows (optional content in square brackets):

角色名称+[介词]+处所词+[方位词]Character name + [preposition] + place word + [location word]

若第一文本为：“楚明瘫倒在沙发上，面无表情，呆呆地望着窗外。丽香静静地立在沙发旁，沉默良久，一字一顿地说道：‘我们走吧，就现在。’昌文斜倚在门口，不耐烦地催促：‘快点决定吧，别总是磨磨蹭蹭的！’”。该段文字中“楚明”、“丽香”、“昌文”为角色名称，“沙发”、“门口”为处所名，满足上面预设判别模式：楚明+在+沙发+上；丽香+在+沙发+旁，因此，两人都与“沙发”相关联。昌文+在+门口，符合判别模式，昌文与“门”关联。If the first text is: "Chu Ming collapsed on the sofa, expressionless, staring out the window blankly. Lixiang stood quietly next to the sofa, silent for a long time, and said word by word: 'Let's go. , right now.' Changwen leaned at the door and urged impatiently: 'Make a decision quickly, don't keep dilly-dallying!'" In this paragraph of text, "Chu Ming", "Li Xiang" and "Chang Wen" are the character names, and "sofa" and "doorway" are the place names, which meet the above default judgment mode: Chu Ming + on + sofa +; Li Xiang +Beside the +sofa, therefore, both people are associated with "sofa". Changwen + is at the door, which conforms to the judgment model. Changwen is related to "door".

在确定第一角色名称对应的第一处所词之后，确定第一角色名称对应的第一方位。其中，在第二角色名称对应的第二处所词与第一处所词相同的情况下，方位表中记录的第二角色名称对应的第二方位与第一方位相同，第二角色名称为第一文本中除第一角色名称外的任一角色名称，可见，方位表中，与同一处所词关联的角色名称对应的方位相同。以下提供一方位表作为示例：After determining the first location word corresponding to the first character name, the first location corresponding to the first character name is determined. Wherein, when the second location word corresponding to the second character name is the same as the first location word, the second location corresponding to the second role name recorded in the location table is the same as the first location, and the second role name is the first location word. For any character name in the text except the first character name, it can be seen that in the position table, the position corresponding to the character name associated with the same local word is the same. An orientation table is provided below as an example:

表3方位表Table 3 Orientation table

角色名称Role Name 处所词word of place 方位position 楚明Chu Ming 沙发sofa 左Left 丽香Lixiang 沙发sofa 左Left 昌文Changwen 门口doorway 右right

如表3所示，楚明和丽香关联的处所词均为沙发，故楚明与丽香的方位相同，若合成语音为双声道语音，方位可包括“左边、右边和中间”，楚明与丽香的方位相同，故楚明与丽香的方位可为“左边、右边和中间”中的其中一种方位，昌文与楚明和丽香关联的处所词不同，故昌文的方位与楚明和丽香的方位不同，若楚明与丽香的方位为“左”，则可确定昌文的方位为与“左”不同的方位，例如昌文的方位可为“右”。基于此，可确定楚明与丽香各自对应的对话文本的方位信息相同，故楚明与丽香各自对应的对话文本的待合成音频的声源位置也相同，声源位置可均为左声源，昌文对应的对话文本的待合成音频的声源位置可为右声源。As shown in Table 3, the location words related to Chu Ming and Li Xiang are both couches, so the directions of Chu Ming and Li Xiang are the same. If the synthesized speech is binaural speech, the directions can include "left, right and middle". Chu Ming The orientation of Chuming and Lixiang is the same, so the orientation of Chuming and Lixiang can be one of the orientations of "left, right and middle". Changwen has different place words related to Chuming and Lixiang, so the orientation of Changwen is the same as that of Chuming and Lixiang. The orientation of Lixiang is different. If the orientations of Chuming and Lixiang are "left", then the orientation of Changwen can be determined to be different from "left". For example, the orientation of Changwen can be "right". Based on this, it can be determined that the orientation information of the dialogue texts corresponding to Chu Ming and Li Xiang is the same. Therefore, the sound source positions of the audio to be synthesized for the dialogue texts corresponding to Chu Ming and Li Xiang are also the same. The sound source positions can both be left. source, the sound source position of the audio to be synthesized corresponding to the dialogue text of Changwen can be the right sound source.

本申请实施例中通过设置方位表，方位表中关联同一处所词的角色名称对应的方位相同，关联不同处所词的角色名称对应的方位不相同，从而关联同一处所词的角色名称对应的对话文本的待合成音频的声源位置相同，关联不同处所词的角色名称对应的对话文本的待合成音频的声源位置不同，基于此，在角色名称关联的处所词不同的情况下，根据声源位置所确定的合成语音能够反馈出角色处于不同方位，从而能够提高合成语音的播放效果。In the embodiment of this application, a position table is set up. In the position table, the character names associated with the same place word have the same position, and the character names associated with different place words have different positions, so that the dialogue text corresponding to the character name with the same place word is associated. The sound source positions of the audio to be synthesized are the same, and the sound source positions of the dialogue texts corresponding to the character names associated with different local words are different. Based on this, when the local words associated with the character names are different, according to the sound source position The determined synthesized voice can feedback that the character is in different directions, thereby improving the playback effect of the synthesized voice.

可选地，在根据各对话文本的待合成音频的声源位置，生成各对话文本的合成语音之前，方法还包括：Optionally, before generating the synthesized speech of each dialogue text according to the sound source position of the audio to be synthesized of each dialogue text, the method further includes:

确定第一文本中各对话文本的情绪属性；Determine the emotional attributes of each dialogue text in the first text;

根据各对话文本的情绪属性，确定各对话文本的待合成音频的音频参数；According to the emotional attributes of each dialogue text, determine the audio parameters of the audio to be synthesized for each dialogue text;

根据各对话文本的待合成音频的声源位置，生成各对话文本的合成语音，包括：According to the sound source position of the audio to be synthesized in each dialogue text, the synthesized speech of each dialogue text is generated, including:

根据各对话文本的待合成音频的声源位置和音频参数，生成各对话文本的合成语音。According to the sound source position and audio parameters of the audio to be synthesized for each dialogue text, the synthesized speech of each dialogue text is generated.

该实施方式中，情绪属性为预先设定的情绪分类。作为示例，可将情绪属性分为五种属性，分别为兴奋、平静、激动、悲哀和紧张。其中，每种情绪属性与对应的音频参数关联，从而通过确定各对话文本的情绪属性，即可确定各对话文本的待合成音频的音频参数，音频参数包括音频响度、音频语速和音频音调。In this implementation, the emotion attributes are preset emotion categories. As an example, emotional attributes can be divided into five attributes, namely excitement, calmness, agitation, sadness, and tension. Each emotional attribute is associated with a corresponding audio parameter, so by determining the emotional attribute of each dialogue text, the audio parameters of the audio to be synthesized for each dialogue text can be determined. The audio parameters include audio loudness, audio speaking speed, and audio pitch.

每种情绪属性与对应的音频参数关联，具体的，每种情绪属性关联的音频参数可如下表所示：Each emotional attribute is associated with a corresponding audio parameter. Specifically, the audio parameters associated with each emotional attribute are as shown in the following table:

表4音频参数表Table 4 Audio parameter table

其中，第一文本中，除各对话文本以外的旁白文本的情绪属性可均为“平静”，根据旁白的待合成音频的声源位置和“平静”对应的音频参数，从而可生成旁白的合成语音。Among them, in the first text, the emotional attribute of the narration text except each dialogue text can all be "calm". According to the sound source position of the narration audio to be synthesized and the audio parameters corresponding to "calm", a synthesis of the narration can be generated. voice.

本申请实施例根据各对话文本的待合成音频的声源位置和音频参数，生成各对话文本的合成语音，从而能够使合成语音既能够反馈出对话文本的情绪，也能够反馈出对话文本的方位，可见，本申请实施例能够使合成语音的内容较为丰富，合成语音的播放效果较佳。Embodiments of the present application generate synthetic speech for each dialogue text based on the sound source position and audio parameters of the audio to be synthesized for each dialogue text, so that the synthesized speech can feed back not only the emotion of the dialogue text, but also the orientation of the dialogue text. , it can be seen that the embodiments of the present application can make the content of the synthesized speech richer and the playback effect of the synthesized speech better.

可选地，确定第一文本中各对话文本的情绪属性，包括：Optionally, determining the emotional attributes of each dialogue text in the first text includes:

对第一对话文本和第一对话文本的上下文进行分词处理，得到多个情绪词组，其中，第一对话文本为第一文本中的任一对话文本；Perform word segmentation processing on the first dialogue text and the context of the first dialogue text to obtain multiple emotional phrases, wherein the first dialogue text is any dialogue text in the first text;

通过预先构建的情绪词典，查询多个情绪词组中每个词组对应的情绪属性；Query the emotional attributes corresponding to each of multiple emotional phrases through the pre-built emotional dictionary;

根据每个情绪词组对应的情绪属性，确定第一对话文本的情绪属性。According to the emotional attribute corresponding to each emotional phrase, the emotional attribute of the first dialogue text is determined.

该实施方式中，可预先构建情绪词典，情绪词典中包含多个情绪属性，每个情绪属性下存在多个词组。情绪词典并非专门为一本电子书的词组所构建的情绪词典，情绪词典可为所有电子书所通用的情绪词典，作为示例，情绪词典中可包含五种情绪属性，该五种情绪属性分别为兴奋、平静、激动、悲哀和紧张，情绪词典的部分内容可如下表所示：In this implementation, an emotion dictionary can be constructed in advance. The emotion dictionary contains multiple emotion attributes, and there are multiple phrases under each emotion attribute. The emotion dictionary is not an emotion dictionary specifically built for the phrases of an e-book. The emotion dictionary can be an emotion dictionary common to all e-books. As an example, the emotion dictionary can contain five emotion attributes. The five emotion attributes are respectively Excited, calm, excited, sad and nervous, part of the emotional dictionary can be shown in the following table:

表4情绪词典Table 4 Emotion Dictionary

情绪词组emotional phrases 词性part of speech 情绪属性emotional attributes 强度strength 不知所措at a loss 形容词adjective 紧张nervous 77 语无伦次incoherent 形容词adjective 紧张nervous 55 痛哭cry bitterly 动词verb 悲哀sorrow 66 笑声laughter 名词noun 兴奋excited 33

通过对第一对话文本和第一对话文本的上下文进行分词处理，从而能够得到多个词组，该多个词组中包含情绪词典中记载的情绪词组以及其他词组，基于情绪词典确定多个词组中的多个情绪词组，并确定每个情绪词组对应的情绪属性，从而根据每个情绪词组对应的情绪属性可确定第一对话文本的情绪属性。By performing word segmentation processing on the first dialogue text and the context of the first dialogue text, multiple phrases can be obtained. The multiple phrases include the emotion phrases recorded in the emotion dictionary and other phrases. Based on the emotion dictionary, the words in the multiple phrases are determined. Multiple emotional phrases, and the emotional attributes corresponding to each emotional phrase are determined, so that the emotional attributes of the first dialogue text can be determined based on the emotional attributes corresponding to each emotional phrase.

本申请实施例通过构建情绪词典，从而能够基于情绪词典查询对话文本和对话文本上下文中的情绪词组，通过确定情绪词组对应的情绪属性，从而能够快速确定对话文本的情绪属性。By constructing an emotion dictionary, the embodiments of the present application can query the dialogue text and the emotion phrases in the context of the dialogue text based on the emotion dictionary, and by determining the emotion attributes corresponding to the emotion phrases, the emotion attributes of the dialogue text can be quickly determined.

可选地，根据每个词组对应的情绪属性，确定第一对话文本的情绪属性，包括：Optionally, determine the emotional attributes of the first dialogue text based on the emotional attributes corresponding to each phrase, including:

在第一对话文本和第一对话文本的上下文中至少两个情绪词组对应的情绪属性不同的情况下，确定每个情绪词组对应的情绪属性的权重；When the emotional attributes corresponding to at least two emotional phrases in the first dialogue text and the context of the first dialogue text are different, determine the weight of the emotional attribute corresponding to each emotional phrase;

将权重最高的情绪属性确定为第一对话文本的情绪属性。The emotional attribute with the highest weight is determined as the emotional attribute of the first dialogue text.

一种实施方式中，如图5所示，每个情绪词组对应的情绪属性的权重，可以每个情绪词组对应的强度值为基础。如表4所示，情绪词典中的情绪词组还包括强度值。需要说明的是，属于相同情绪属性的各个情绪词组可设置为强度值不相同，例如表4中属于相同情绪属性的“不知所措”和“语无伦次”的强度值不同，因“不知所措”较于“语无伦次”，其紧张程度更高，故其强度值更大。In one implementation, as shown in Figure 5, the weight of the emotional attribute corresponding to each emotional phrase can be based on the intensity value corresponding to each emotional phrase. As shown in Table 4, the emotion phrases in the emotion dictionary also include intensity values. It should be noted that each emotional phrase belonging to the same emotional attribute can be set to have different intensity values. For example, in Table 4, the intensity values of "overwhelmed" and "incoherent" belonging to the same emotional attribute are different. Compared with "incoherent", it has a higher degree of tension, so its intensity value is greater.

本申请实施例在第一对话文本和第一对话文本的上下文中至少两个情绪词组对应的情绪属性不同的情况下，通过将权重最高的情绪属性确定为第一对话文本的情绪属性，能够提高第一对话文本的情绪属性的可靠性。In the embodiment of the present application, when the emotional attributes corresponding to at least two emotional phrases in the first dialogue text and the context of the first dialogue text are different, by determining the emotional attribute with the highest weight as the emotional attribute of the first dialogue text, it is possible to improve Reliability of emotional properties of first conversational texts.

另一种实施方式中，在第一对话文本和第一对话文本的上下文中至少两个情绪词组对应的情绪属性不同的情况下，可以确定每个情绪词组对应的情绪属性，并根据每个情绪词组对应的情绪属性确定第一对话文本的多重情绪属性，根据第一对话文本的多重情绪属性调整第一对话文本中对应文本的待合成音频的音频参数，对第一对话文本生成包含多种情绪的合成语音，能够丰富第一对话文本的合成语音的情感，有效提高第一对话文本的合成语音播放效果。In another implementation, when the emotional attributes corresponding to at least two emotional phrases in the first dialogue text and the context of the first dialogue text are different, the emotional attributes corresponding to each emotional phrase can be determined, and according to each emotion The emotional attribute corresponding to the phrase determines the multiple emotional attributes of the first dialogue text, adjusts the audio parameters of the audio to be synthesized corresponding to the text in the first dialogue text according to the multiple emotional attributes of the first dialogue text, and generates a plurality of emotions for the first dialogue text The synthetic voice can enrich the emotion of the synthetic voice of the first dialogue text and effectively improve the playback effect of the synthetic voice of the first dialogue text.

可选地，根据各对话文本的待合成音频的声源位置，生成各对话文本的合成语音，包括：Optionally, generate synthetic speech for each dialogue text based on the sound source position of the audio to be synthesized for each dialogue text, including:

根据各对话文本的待合成音频的声源位置，确定各对话文本的待合成音频的第一声道与第二声道的时延或者音量差，生成各对话文本的具有时延或者音量差的双声道合成语音。According to the sound source position of the audio to be synthesized for each dialogue text, the time delay or volume difference between the first channel and the second channel of the audio to be synthesized for each dialogue text is determined, and a time delay or volume difference for each dialogue text is generated. Binaural synthesized speech.

该实施方式中，设置最终合成的语音为双声道语音，各对话文本的待合成音频的声源位置可能包括“左边、右边以及中间”。In this embodiment, the final synthesized speech is set to be a binaural speech, and the sound source positions of the audio to be synthesized in each dialogue text may include "left, right and middle".

本申请实施例基于哈斯效应生成各对话文本的具有时延的双声道合成语音。哈斯效应为：同一声音从两个不同声源发出，如果这两个声波到达听音者的时差在5到35毫秒之间，人耳无法辨别出这两个声音的方位。人先听到的那个声音，就会认为是全部声音来自那个方位。利用哈斯效应，将左右两声道的声音不同步播放，人为加入一定延迟：M毫秒，通常M处于5到35之间。The embodiment of the present application generates binaural synthetic speech with time delay for each dialogue text based on the Haas effect. The Haas effect is: the same sound is emitted from two different sound sources. If the time difference between the two sound waves reaching the listener is between 5 and 35 milliseconds, the human ear cannot distinguish the location of the two sounds. The sound that people hear first will be considered as the direction from which all sounds come. Using the Haas effect, the left and right channels of sound are played asynchronously, and a certain delay is added artificially: M milliseconds, usually M is between 5 and 35.

本申请实施例中的待合成音频的第一声道与第二声道分别对应于左声道和右声道。在第一声道与第二声道无音量差时，如果声源位置为“右边”，则第一声道(左声道)比第二声道(右声道)延迟M毫秒；如果声源方向为“左边”，则第二声道比第一声道延迟M毫秒；如果声源方向为“中间”，则左右两声道同步，无延迟。In the embodiment of the present application, the first channel and the second channel of the audio to be synthesized correspond to the left channel and the right channel respectively. When there is no volume difference between the first channel and the second channel, if the sound source position is "right", the first channel (left channel) is delayed by M milliseconds compared to the second channel (right channel); if If the source direction is "left", the second channel is delayed by M milliseconds from the first channel; if the sound source direction is "middle", the left and right channels are synchronized without delay.

在第一声道与第二声道无时延时，根据各对话文本的待合成音频的第一声道与第二声道的音量差，生成各对话文本的具有音量差的双声道合成语音。第一声道和第二声道存在音量差时，人耳感知到的声源位置偏向于音量高的一侧。在第一声道与第二声道无时延时，如果声源位置为“右边”，则第二声道(右声道)的音量高于第一声道(左声道)；如果声源方向为“左边”，则第一声道的音量高于比第二声道；如果声源方向为“中间”，则左右两声道无音量差。There is no time delay between the first channel and the second channel. Based on the volume difference between the first channel and the second channel of the audio to be synthesized for each dialogue text, a two-channel synthesis with a volume difference for each dialogue text is generated. voice. When there is a volume difference between the first channel and the second channel, the position of the sound source perceived by the human ear is biased towards the side with higher volume. There is no time delay between the first channel and the second channel. If the sound source position is "right", the volume of the second channel (right channel) is higher than the first channel (left channel); if If the source direction is "left", the volume of the first channel is higher than that of the second channel; if the sound source direction is "middle", there is no volume difference between the left and right channels.

本申请实施中，通过生成各对话文本的具有时延或者音量差的双声道合成语音，从而用户配合耳机或两通道立体声音箱收听合成语音时，能够感受到合成语音中声源位置，并且，现有耳机以及其他收听设备通常为双声道播放，将会话文本合成为双声道合成语音，能够有效提高用户基于收听设备能够感受到合成语音中声源位置的概率，从而提高语音播放效果。In the implementation of this application, binaural synthetic speech with time delay or volume difference is generated for each dialogue text, so that when the user listens to the synthesized speech with headphones or a two-channel stereo speaker, the user can feel the position of the sound source in the synthesized speech, and, Existing headphones and other listening devices usually play back in two channels. Synthesizing conversational text into two-channel synthetic speech can effectively improve the probability that users can feel the location of the sound source in the synthesized speech based on the listening device, thereby improving the speech playback effect.

根据预设的人物字典中标注的第一文本中各角色的名称信息，确定各对话文本的角色属性；Determine the character attributes of each dialogue text according to the name information of each character in the first text marked in the preset character dictionary;

根据各对话文本的角色属性，确定各对话文本对应的发音人；According to the role attributes of each dialogue text, determine the speaker corresponding to each dialogue text;

根据各对话文本的待合成音频的声源位置，以各对话文本对应的发音人的音色生成各对话文本的合成语音。According to the sound source position of the audio to be synthesized in each dialogue text, the synthesized speech of each dialogue text is generated with the voice of the speaker corresponding to each dialogue text.

该实施方式中根据预设的人物字典中标注的第一文本中各角色的名称信息，确定各对话文本的角色属性，各对话文本的角色属性包括角色的性别以及年龄等。根据各对话文本的角色属性，可从发音人音库中确定各对话文本对应的发音人。发音人音库中，每个发音人与对应的音色特征关联，音色特征例如年轻男性、年轻女性，老年、孩童等。对话文本的角色属性与发音人的音色特征匹配的情况下，则确定该发音人为该对话文本对应的发音人。In this embodiment, the character attributes of each dialogue text are determined based on the name information of each character in the first text marked in the preset character dictionary. The character attributes of each dialogue text include the gender and age of the character. According to the role attributes of each dialogue text, the speaker corresponding to each dialogue text can be determined from the speaker's voice library. In the speaker's voice library, each speaker is associated with the corresponding timbre characteristics, such as young men, young women, old people, children, etc. If the role attribute of the dialogue text matches the timbre characteristics of the speaker, the speaker is determined to be the speaker corresponding to the dialogue text.

本申请实施例通过以各对话文本对应的发音人的音色生成各对话文本的合成语音，从而能够进一步丰富合成语音的内容，以进一步提高合成语音的播放效果。Embodiments of the present application generate synthetic speech for each dialogue text based on the voice of the speaker corresponding to each dialogue text, thereby further enriching the content of the synthetic speech and further improving the playback effect of the synthetic speech.

为了更清楚理解本申请实施例的技术方案，以下结合图6对生成各对话文本的合成语音的流程进行说明。In order to have a clearer understanding of the technical solutions of the embodiments of the present application, the process of generating synthetic speech for each dialogue text will be described below with reference to FIG. 6 .

在生成各对话文本的合成语音之前，确定各对话文本的方位信息、情绪属性以及发音人，作为示例，以《罗密欧与朱丽叶》中的一段文本为例，各对话文本的方位信息、情绪属性以及发音人可如下：Before generating the synthesized speech of each dialogue text, the orientation information, emotional attributes and speaker of each dialogue text are determined. As an example, taking a text in "Romeo and Juliet", the orientation information, emotional attributes and Pronounced by:

罗密欧：[中，旁白声，平静]Romeo: [medium, voiceover, calm]

相信我，爱人，在我的眼中你也是这样；忧伤吸干了我们的血液。再会！再会！[左，磁性男声，悲哀]Believe me, love, you are the same in my eyes; sorrow drains our blood. Goodbye! Goodbye! [Left, magnetic male voice, sad]

朱丽叶：[中，旁白声，平静]Juliet: [center, voiceover, calm]

命运啊命运！谁都说你反复无常；要是你真的反复无常，那么你怎样对待一个忠贞不贰的人呢？愿你不要改变你的轻浮的天性，因为这样也许你会早早打发他回来。[右，温柔女声，激动]Destiny, destiny! Everyone says you are capricious; if you are really capricious, how do you treat a loyal person? May you not change your frivolous nature, for then perhaps you will send him back sooner. [Right, gentle female voice, excited]

凯普莱特夫人：[中，旁白声，平静]Lady Capulet: [medium, voiceover, calm]

喂，女儿！你起来了吗？[中，知性小妍，紧张]。Hello, daughter! Are you up? [In the middle, intellectual Xiaoyan, nervous].

其中，本申请实施例中除确定各对话文本的信息之外，还可确定除对外文本以外的旁白的信息，以使听音者可区别出对话语音和旁白语音，上述[中，旁白声，平静]即为除对外文本以外的旁白的属性。Among them, in the embodiment of the present application, in addition to determining the information of each dialogue text, the information of the narration other than the external text can also be determined, so that the listener can distinguish the dialogue voice and the narration voice. In the above [in, narration voice, Calm] is the attribute of the narration other than the external text.

在确定各文本的方位信息、情绪属性以及发音人之后，生成合成语言。如图6所示：以对话文本的发音人的音色通过语音合成(Text to Speech，TTS)技术生成待合成音频文件A，将待合成音频文件A转换为双声道音频文件A1。根据对话文本的方位信息确定双声道音频文件A1的声源位置，根据双声道音频文件A1的声源位置生成音频文件A2，根据对话文本的情绪属性以及情绪属性对应的音频参数对音频文件A2的音调、响度和语速进行处理以生成最终的合成语音音频文件A3。After determining the orientation information, emotional attributes and speaker of each text, a synthetic language is generated. As shown in Figure 6: the audio file A to be synthesized is generated using the voice of the speaker of the dialogue text through speech synthesis (Text to Speech, TTS) technology, and the audio file A to be synthesized is converted into a two-channel audio file A1. Determine the sound source position of the two-channel audio file A1 based on the orientation information of the dialogue text, generate the audio file A2 based on the sound source location of the two-channel audio file A1, and update the audio file based on the emotional attributes of the dialogue text and the audio parameters corresponding to the emotional attributes. The pitch, loudness and speaking speed of A2 are processed to generate the final synthesized speech audio file A3.

参见图7，图7是本申请实施例提供的一种语音合成装置的结构图，如图7所示，语音合成装置200，包括：Referring to Figure 7, Figure 7 is a structural diagram of a speech synthesis device provided by an embodiment of the present application. As shown in Figure 7, a speech synthesis device 200 includes:

第一获取模块201，用于获取第一文本，其中，第一文本中包括N个对话文本，每个对话文本对应一个角色的对话内容，N为大于1的整数；The first acquisition module 201 is used to acquire the first text, where the first text includes N dialogue texts, each dialogue text corresponds to the dialogue content of a character, and N is an integer greater than 1;

第一确定模块202，用于根据第一文本中各对话文本的目标信息，确定各对话文本的方位信息，其中，目标信息包括各对话文本对应的角色在该对话文本的关联影视片段中的位置信息，或第一文本中与各对话文本关联的位置关键词；The first determination module 202 is used to determine the orientation information of each dialogue text according to the target information of each dialogue text in the first text, where the target information includes the position of the character corresponding to each dialogue text in the associated film and television clip of the dialogue text. Information, or positional keywords in the first text associated with each dialogue text;

第二确定模块203，用于根据各对话文本的方位信息，确定各对话文本的待合成音频的声源位置，并根据各对话文本的待合成音频的声源位置，生成各对话文本的合成语音。The second determination module 203 is configured to determine the sound source location of the audio to be synthesized for each dialogue text based on the orientation information of each dialogue text, and generate a synthesized speech for each dialogue text based on the sound source location of the audio to be synthesized for each dialogue text. .

第一确定模块202，包括：The first determination module 202 includes:

第一确定单元，用于在包含第一文本的电子书存在关联影视资源的情况下，根据各对话文本的内容和关联影视资源中各视频帧的影视内容，确定关联影视资源中分别与各对话文本匹配的关联影视片段；The first determination unit is configured to determine, when the e-book containing the first text has associated film and television resources, based on the content of each dialogue text and the film and television content of each video frame in the associated film and television resources, that the associated film and television resources are associated with each dialogue. Text matching related film and television clips;

第二确定单元，用于确定各对话文本的关联影视片段中与对应对话文本的角色匹配的目标影视角色；The second determination unit is used to determine the target film and television character in the film and television clip associated with each dialogue text that matches the role of the corresponding dialogue text;

第三确定单元，用于根据各对话文本的关联影视片段中的目标影视角色在该关联影视片段的影视画面中的位置信息，确定各对话文本的方位信息。The third determination unit is used to determine the orientation information of each dialogue text based on the position information of the target film and television character in the associated film and television clip of each dialogue text in the film and television frame of the associated film and television clip.

可选地，第二确定单元，具体用于：Optionally, the second determination unit is specifically used for:

可选地，第一确定单元，包括：Optionally, the first determining unit includes:

第一确定子单元，用于根据第一对话文本的内容和关联影视资源中各视频帧的字幕内容或音频内容，确定关联影视资源中与第一对话文本匹配的第一关联影视片段，第一对话文本为第一文本中的任一对话文本；The first determination subunit is used to determine the first associated film and television segment in the associated film and television resource that matches the first dialogue text based on the content of the first dialogue text and the subtitle content or audio content of each video frame in the associated film and television resource, the first The dialogue text is any dialogue text in the first text;

第二确定子单元，用于对第一关联影视片段进行场景识别，确定第一关联影视片段中的目标场景元素；The second determination subunit is used to perform scene recognition on the first associated film and television clip and determine the target scene element in the first associated film and television clip;

第三确定子单元，用于确定第一文本中第一对话文本的上下文文本中的场景关键词；The third determination subunit is used to determine the scene keywords in the context text of the first dialogue text in the first text;

第一检测子单元，用于在目标场景元素与场景关键词匹配的情况下，对第一关联影视片段进行目标检测。The first detection subunit is used to perform target detection on the first associated film and television clip when the target scene element matches the scene keyword.

可选地，第二确定子单元，具体用于：Optionally, the second determination subunit is specifically used for:

第一确定模块202，还包括：The first determination module 202 also includes:

第一遍历单元，用于遍历第一文本中的角色名称；The first traversal unit is used to traverse the character names in the first text;

第四确定单元，用于确定第一文本中出现各角色名称的位置的上下文文本中的位置关键词；The fourth determination unit is used to determine the position keywords in the contextual text of the position where each character name appears in the first text;

第五确定单元，用于根据第一文本中各角色名称对应的对话文本，确定各对话文本关联的位置关键词。The fifth determination unit is used to determine the position keywords associated with each dialogue text based on the dialogue text corresponding to each character name in the first text.

可选地，第四确定单元，具体用于：Optionally, the fourth determination unit is specifically used for:

第六确定单元，用于根据方位表中记录的各角色名称对应的处所词和方位，确定各角色名称对应的对话文本的方位。The sixth determination unit is used to determine the location of the dialogue text corresponding to each character name based on the local word and location corresponding to each character name recorded in the location table.

可选地，语音合成装置200还包括：Optionally, the speech synthesis device 200 also includes:

第三确定模块，用于确定第一文本中各对话文本的情绪属性；The third determination module is used to determine the emotional attributes of each dialogue text in the first text;

第四确定模块，用于根据各对话文本的情绪属性，确定各对话文本的待合成音频的音频参数；The fourth determination module is used to determine the audio parameters of the audio to be synthesized for each dialogue text based on the emotional attributes of each dialogue text;

第二确定模块203，包括：The second determination module 203 includes:

第一生成单元，用于根据各对话文本的待合成音频的声源位置和音频参数，生成各对话文本的合成语音。The first generation unit is used to generate the synthesized speech of each dialogue text according to the sound source position and audio parameters of the audio to be synthesized of each dialogue text.

可选地，第三确定模块，包括：Optionally, the third determination module includes:

第一处理模块，用于对第一对话文本和第一对话文本的上下文进行分词处理，得到多个情绪词组，其中，第一对话文本为第一文本中的任一对话文本；The first processing module is used to perform word segmentation processing on the first dialogue text and the context of the first dialogue text to obtain multiple emotional phrases, wherein the first dialogue text is any dialogue text in the first text;

第一查询单元，通过预先构建的情绪词典，查询多个情绪词组中每个词组对应的情绪属性；The first query unit queries the emotional attributes corresponding to each of the multiple emotional phrases through a pre-built emotional dictionary;

第七确定单元，用于根据每个情绪词组对应的情绪属性，确定第一对话文本的情绪属性。The seventh determination unit is used to determine the emotional attributes of the first dialogue text based on the emotional attributes corresponding to each emotional phrase.

可选地，第七确定单元，具体用于：Optionally, the seventh determination unit is specifically used for:

在第一对话文本中至少两个情绪词组对应的情绪属性不同的情况下，确定每个情绪词组对应的情绪属性的权重；When the emotional attributes corresponding to at least two emotional phrases in the first dialogue text are different, determine the weight of the emotional attribute corresponding to each emotional phrase;

可选地，第一生成单元，具体用于：Optionally, the first generation unit is specifically used for:

可选地，语音合成装置还包括：Optionally, the speech synthesis device also includes:

第五确定模块，用于根据预设的人物字典中标注的第一文本中各角色的名称信息，确定各对话文本的角色属性；The fifth determination module is used to determine the character attributes of each dialogue text based on the name information of each character in the first text marked in the preset character dictionary;

第一生成单元，具体还用于：The first generation unit is also specifically used for:

需要说明的是本申请实施例提供的语音合成装置，可以执行上述方法实施例，其实现原理和技术效果类似，本实施例此处不再赘述。It should be noted that the speech synthesis device provided by the embodiments of the present application can execute the above method embodiments, and its implementation principles and technical effects are similar, and will not be described again in this embodiment.

本申请实施例还提供了一种电子设备。由于电子设备解决问题的原理与本申请实施例中语音合成方法相似，因此该电子设备的实施可以参见方法的实施，重复之处不再赘述。如图8所示，本申请实施例的电子设备，包括：处理器300，用于读取存储器320中的程序，执行下列过程：An embodiment of the present application also provides an electronic device. Since the problem-solving principle of the electronic device is similar to the speech synthesis method in the embodiment of the present application, the implementation of the electronic device can be referred to the implementation of the method, and repeated details will not be repeated. As shown in Figure 8, the electronic device according to the embodiment of the present application includes: a processor 300, configured to read the program in the memory 320 and perform the following processes:

通过收发机310获取第一文本，其中，第一文本中包括N个对话文本，每个对话文本对应一个角色的对话内容，N为大于1的整数；The first text is obtained through the transceiver 310, where the first text includes N dialogue texts, each dialogue text corresponds to the dialogue content of a character, and N is an integer greater than 1;

根据第一文本中各对话文本的目标信息，确定各对话文本的方位信息，其中，目标信息包括各对话文本对应的角色在该对话文本的关联影视片段中的位置信息，或第一文本中与各对话文本关联的位置关键词；Determine the orientation information of each dialogue text according to the target information of each dialogue text in the first text, where the target information includes the position information of the character corresponding to each dialogue text in the film and television clip associated with the dialogue text, or the location information in the first text corresponding to the dialogue text. Positional keywords associated with each dialogue text;

根据各对话文本的方位信息，确定各对话文本的待合成音频的声源位置，并根据各对话文本的待合成音频的声源位置，生成各对话文本的合成语音。Based on the orientation information of each dialogue text, the sound source location of the audio to be synthesized for each dialogue text is determined, and based on the sound source location of the audio to be synthesized for each dialogue text, a synthesized speech for each dialogue text is generated.

收发机310，用于在处理器300的控制下接收和发送数据。Transceiver 310 for receiving and transmitting data under the control of processor 300.

可选地，目标信息包括各对话文本对应的角色在该对话文本的关联影视片段中的位置信息，处理器300还用于读取存储器320中的程序，执行如下步骤：Optionally, the target information includes the position information of the character corresponding to each dialogue text in the associated film and television clips of the dialogue text. The processor 300 is also configured to read the program in the memory 320 and perform the following steps:

可选地，处理器300还用于读取存储器320中的程序，执行如下步骤：Optionally, the processor 300 is also used to read the program in the memory 320 and perform the following steps:

处理器300还用于读取存储器320中的程序，执行如下步骤：The processor 300 is also used to read the program in the memory 320 and perform the following steps:

通过预先构建的情绪词典，查询多个词组中每个情绪词组对应的情绪属性；Query the emotional attributes corresponding to each emotional phrase in multiple phrases through the pre-built emotion dictionary;

在本申请所提供的几个实施例中，应该理解到，所揭露方法和装置，可以通过其它的方式实现。例如，以上所描述的装置实施例仅仅是示意性的，例如，单元的划分，仅仅为一种逻辑功能划分，实际实现时可以有另外的划分方式，例如多个单元或组件可以结合或者可以集成到另一个系统，或一些特征可以忽略，或不执行。另一点，所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口，装置或单元的间接耦合或通信连接，可以是电性，机械或其它的形式。In the several embodiments provided in this application, it should be understood that the disclosed methods and devices can be implemented in other ways. For example, the device embodiments described above are only illustrative. For example, the division of units is only a logical function division. In actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated. to another system, or some features can be ignored, or not implemented. On the other hand, the coupling or direct coupling or communication connection between each other shown or discussed may be through some interfaces, and the indirect coupling or communication connection of the devices or units may be in electrical, mechanical or other forms.

另外，在本申请各个实施例中的各功能单元可以集成在一个处理单元中，也可以是各个单元单独物理包括，也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现，也可以采用硬件加软件功能单元的形式实现。In addition, each functional unit in various embodiments of the present application may be integrated into one processing unit, each unit may be physically included separately, or two or more units may be integrated into one unit. The above integrated unit can be implemented in the form of hardware or in the form of hardware plus software functional units.

上述以软件功能单元的形式实现的集成的单元，可以存储在一个计算机可读取存储介质中。上述软件功能单元存储在一个存储介质中，包括若干指令用以使得一台计算机设备(可以是个人计算机，服务器，或者网络设备等)执行本申请各个实施例收发方法的部分步骤。而前述的存储介质包括：U盘、移动硬盘、只读存储器(Read-Only Memory，ROM)、随机存取存储器(Random Access Memory，RAM)、磁碟或者光盘等各种可以存储程序代码的介质。The above-mentioned integrated unit implemented in the form of a software functional unit can be stored in a computer-readable storage medium. The above-mentioned software functional unit is stored in a storage medium and includes a number of instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute some steps of the transceiver method in various embodiments of the present application. The aforementioned storage media include: U disk, mobile hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk or optical disk and other media that can store program code. .

以上是本申请的优选实施方式，应当指出，对于本技术领域的普通技术人员来说，在不脱离本申请原理的前提下，还可以作出若干改进和润饰，这些改进和润饰也应视为本申请的保护范围。The above are the preferred embodiments of the present application. It should be noted that for those of ordinary skill in the art, several improvements and modifications can be made without departing from the principles of the present application, and these improvements and modifications should also be regarded as the present invention. The scope of protection applied for.

Claims

1. A speech synthesis method, characterized in that the method includes:

Obtain the first text, wherein the first text includes N dialogue texts, each dialogue text corresponding to the dialogue content of a character;

Determine the orientation information of each dialogue text according to the target information of each dialogue text in the first text, wherein the target information includes the position information of the character corresponding to each dialogue text in the associated film and television clip of the dialogue text, or the location information of each dialogue text. Describe the positional keywords associated with each dialogue text in the first text;

According to the orientation information of each dialogue text, the sound source position of the audio to be synthesized of each dialogue text is determined, and based on the sound source position of the audio to be synthesized of each dialogue text, a synthesized speech of each dialogue text is generated.

2. The method according to claim 1, wherein the target information includes position information of the character corresponding to each dialogue text in the film and television clip associated with the dialogue text;

Determining the orientation information of each dialogue text based on the target information of each dialogue text in the first text includes:

In the case where the e-book containing the first text has associated film and television resources, based on the content of each dialogue text and the film and television content of each video frame in the associated film and television resources, it is determined that the associated film and television resources are associated with each dialogue text. Matching related film and television clips;

Determine the target film and television character in the film and television clip associated with each dialogue text that matches the role of the corresponding dialogue text;

The orientation information of each dialogue text is determined based on the position information of the target film and television character in the associated film and television segment of each dialogue text in the film and television frame of the associated film and television segment.

3. The method according to claim 2, wherein determining the target film and television character matching the character of the corresponding dialogue text in the associated film and television segments of each dialogue text includes:

Perform target detection on the first associated film and television segment, and determine the target image in the first associated film and television segment that emits a dialogue voice that matches the first dialogue text, wherein the film and television character indicated by the target image is the same as the first dialogue text. The role of the dialogue text matches the target film and television character, the first associated film and television segment is an associated film and television segment of the first dialogue text, and the first dialogue text is any dialogue text in the first text.

4. The method according to claim 2 or 3, characterized in that, according to the content of each dialogue text and the film and television content of each video frame in the associated film and television resources, it is determined that each dialogue in the associated film and television resources is Related film and television clips for text matching, including:

According to the content of the first dialogue text and the subtitle content or audio content of each video frame in the associated film and television resources, the first associated film and television segment matching the first dialogue text in the associated film and television resources is determined, and the first The dialogue text is any dialogue text in the first text;

Perform scene recognition on the first associated film and television clip to determine the target scene element in the first associated film and television clip;

Determine scene keywords in the context text of the first dialogue text in the first text;

When the target scene element matches the scene keyword, target detection is performed on the first associated film and television clip.

5. The method according to claim 4, wherein the step of performing scene recognition on the first associated film and television clip and determining the target scene element in the first associated film and television clip includes:

Perform scene recognition on the first associated film and television clip, and determine each scene element and the credibility of each scene element in the first associated film and television clip;

Scene elements whose credibility exceeds a preset threshold in the first associated video clip are determined as the target scene elements.

6. The method according to claim 1, characterized in that the target information includes position keywords associated with each dialogue text in the first text;

Traverse the character names in the first text;

Determine the position keywords in the contextual text of the position where each character name appears in the first text;

According to the dialogue text corresponding to each character name in the first text, the position keyword associated with each dialogue text is determined.

7. The method according to claim 6, characterized in that determining the position keywords in the contextual text of the position where each character name appears in the first text includes:

When the qualified description of the first character name in the first text satisfies the preset discrimination mode, the first local word corresponding to the first character name is determined according to the qualified description, wherein the qualified description At least including a local word, the first character name is any character name in the first text;

Determine the first location corresponding to the first character name according to the first location word corresponding to the first character name;

The first character name, the first location word and the first location are correspondingly recorded in a location table, where, when the second location word corresponding to the second character name is the same as the first location word , the second orientation corresponding to the second character name recorded in the orientation table is the same as the first orientation, and the second character name is any character in the first text except the first character name. a role name;

According to the location words and directions corresponding to each character name recorded in the location table, the location of the dialogue text corresponding to each character name is determined.

8. The method according to claim 1, characterized in that, before generating the synthesized speech of each dialogue text according to the sound source position of the audio to be synthesized of each dialogue text, the method further includes:

Determine the emotional attributes of each dialogue text in the first text;

Determine the audio parameters of the audio to be synthesized for each dialogue text according to the emotional attributes of each dialogue text;

Generating the synthesized speech of each dialogue text according to the sound source position of the audio to be synthesized of each dialogue text includes:

According to the sound source position and audio parameters of the audio to be synthesized of each dialogue text, the synthesized speech of each dialogue text is generated.

9. An electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor; characterized in that the processor is used to read the program in the memory The steps of implementing the speech synthesis method according to any one of claims 1 to 8.

10. A computer-readable storage medium for storing a computer program, characterized in that, when the computer program is executed by a processor, the steps of the speech synthesis method according to any one of claims 1 to 8 are implemented.