CN117672211A

CN117672211A - Dialogue processing method and device, electronic equipment and computer storage medium

Info

Publication number: CN117672211A
Application number: CN202311369891.3A
Authority: CN
Inventors: 刘亚龙; 王佳
Original assignee: Jingdong Technology Information Technology Co Ltd
Current assignee: Jingdong Technology Information Technology Co Ltd
Priority date: 2023-10-20
Filing date: 2023-10-20
Publication date: 2024-03-08
Also published as: WO2025082042A1; WO2025082042A9

Abstract

The disclosure provides a dialogue processing method, a dialogue processing device, electronic equipment and a computer readable storage medium, and relates to the technical field of voice processing. The method comprises the following steps: when the machine voice is played, user voice input of a user is stopped, the machine voice is played, the user voice and the machine voice are input into a recovery model, when the intention of the user for actually interrupting the machine voice is determined through the recovery model, timestamp information corresponding to the machine voice when the user voice is input is obtained, a recovery play node corresponding to the machine voice is determined according to the timestamp information, for example, the recovery play node is arranged at a preset pause position corresponding to the machine voice when the user voice is input, and the machine voice is played from the recovery play node; the embodiment of the disclosure can confirm the intention of the user, correct the situation that the user interrupts the voice by mistake, ensure the smoothness and continuity of voice recovery and improve the voice interaction effect.

Description

Conversation processing method, device, electronic equipment and computer storage medium

技术领域Technical field

本公开涉及语音处理技术领域，尤其涉及一种对话处理方法、装置、电子设备及计算机可读存储介质。The present disclosure relates to the field of speech processing technology, and in particular, to a dialogue processing method, device, electronic device and computer-readable storage medium.

背景技术Background technique

随着智能语音对话系统应用越来越广泛，智能语音对话的智能化体验越来越成为系统性能的重要指标，能否有效地支持用户打断机器是最有效提升体验的方法之一。As the application of intelligent voice dialogue systems becomes more and more widespread, the intelligent experience of intelligent voice dialogue has increasingly become an important indicator of system performance. Whether it can effectively support users to interrupt the machine is one of the most effective ways to improve the experience.

相关技术中，智能语音对话系统识别到用户的输入后，不能检测用户是否有真实打断语音的意图，用户可能只是随口附和，或有背景人在说话，系统也会停止播报并会有明显的突兀感，系统也存在明显的卡顿现象；且系统恢复播报的时机不确定，有可能从一个单词的一半的时候开始恢复，比如“我们”，系统从“们”字开始播报，更糟糕的情况，有可能从“们”的后半个音节开始播报，导致语音交互效果差，用户体验不佳。In related technologies, after the intelligent voice dialogue system recognizes the user's input, it cannot detect whether the user actually intends to interrupt the speech. The user may just agree casually, or there may be people in the background talking, and the system will also stop broadcasting and there will be obvious interruptions. There is a sense of abruptness, and the system also has obvious lags; and the timing of the system to resume broadcasting is uncertain. It may start to resume from halfway through a word, such as "we", the system starts broadcasting from the word "we", and even worse In this case, it is possible to start broadcasting from the second half of the syllable of "人", resulting in poor voice interaction effect and poor user experience.

需要说明的是，在上述背景技术部分公开的信息仅用于加强对本公开的背景的理解，因此可以包括不构成对本领域普通技术人员已知的现有技术的信息。It should be noted that the information disclosed in the above background section is only used to enhance understanding of the background of the present disclosure, and therefore may include information that does not constitute prior art known to those of ordinary skill in the art.

发明内容Contents of the invention

本公开提供一种对话处理方法、装置、电子设备及计算机可读存储介质，至少在一定程度上克服相关技术中语音交互效果差的问题。The present disclosure provides a dialogue processing method, device, electronic device and computer-readable storage medium, which at least to a certain extent overcomes the problem of poor voice interaction effect in related technologies.

本公开的其他特性和优点将通过下面的详细描述变得显然，或部分地通过本公开的实践而习得。Additional features and advantages of the disclosure will be apparent from the following detailed description, or, in part, may be learned by practice of the disclosure.

根据本公开的一个方面，提供一种对话处理方法，包括：当机器语音播放时有用户的用户语音输入，且所述用户无真实打断所述机器语音的意图时，停止播放所述机器语音，并获取所述机器语音在所述用户语音输入时对应的时间戳信息；根据所述时间戳信息确定所述机器语音对应的恢复播放节点；从所述恢复播放节点播放所述机器语音。According to one aspect of the present disclosure, a dialogue processing method is provided, including: when there is a user voice input from a user while the machine voice is playing, and the user has no real intention to interrupt the machine voice, stopping the playback of the machine voice , and obtain the timestamp information corresponding to the machine voice when the user's voice is input; determine the recovery playback node corresponding to the machine voice according to the timestamp information; and play the machine voice from the recovery playback node.

在本公开的一个实施例中，所述当机器语音播放时有用户的用户语音输入，且所述用户无真实打断所述机器语音的意图时，停止播放所述机器语音，并获取所述机器语音在所述用户语音输入时对应的时间戳信息包括：将所述用户语音及所述机器语音输入至恢复模型，以便通过所述恢复模型确定所述用户是否有真实打断所述机器语音的意图。In one embodiment of the present disclosure, when there is a user voice input from the user while the machine voice is playing, and the user has no real intention to interrupt the machine voice, the machine voice is stopped playing, and the machine voice is obtained. The timestamp information corresponding to the machine voice when the user's voice is input includes: inputting the user's voice and the machine voice to the recovery model, so as to determine whether the user actually interrupted the machine voice through the recovery model. intention of.

在本公开的一个实施例中，所述恢复模型的训练数据包括：对话音频特征数据、对话上下文数据、对话轮次数据、和/或对话语义数据。In one embodiment of the present disclosure, the training data of the recovery model includes: dialogue audio feature data, dialogue context data, dialogue turn data, and/or dialogue semantic data.

在本公开的一个实施例中，所述将所述用户语音及所述机器语音输入至恢复模型，以便通过所述恢复模型确定所述用户是否有真实打断所述机器语音的意图包括：所述恢复模型根据语音数据、所述用户语音及所述机器语音确定所述用户是否有真实打断所述机器语音的意图；其中，所述语音数据包括以下至少之一：引导话术内容数据、所述机器语音的播放进度数据、语音置信度数据。In one embodiment of the present disclosure, inputting the user voice and the machine voice to a recovery model, so as to determine whether the user has a real intention to interrupt the machine voice through the recovery model includes: The recovery model determines whether the user has a real intention to interrupt the machine voice based on the voice data, the user voice and the machine voice; wherein the voice data includes at least one of the following: guided speech content data, The playback progress data and voice confidence data of the machine voice.

在本公开的一个实施例中，所述恢复播放节点包括以下至少之一：In one embodiment of the present disclosure, the resuming playback node includes at least one of the following:

所述机器语音的起始位置；The starting position of the machine voice;

所述机器语音在所述用户语音输入时的位置；The position of the machine voice at the time of the user's voice input;

所述机器语音在所述用户语音输入时对应的断句位置；The corresponding sentence segment position of the machine voice during the user's voice input;

所述机器语音在所述用户语音输入时对应的预设停顿位置。The preset pause position corresponding to the machine voice when the user's voice is input.

在本公开的一个实施例中，所述机器语音包括：语音音频、所述语音音频对应的文字、和/或所述语音音频对应的时间戳信息，其中，所述时间戳信息包括：标点符号、断句位置、和/或预设停顿位置。In one embodiment of the present disclosure, the machine voice includes: voice audio, text corresponding to the voice audio, and/or timestamp information corresponding to the voice audio, wherein the timestamp information includes: punctuation marks , sentence break position, and/or preset pause position.

在本公开的一个实施例中，所述当机器语音播放时有用户的用户语音输入，且所述用户无真实打断所述机器语音的意图时，停止播放所述机器语音，并获取所述机器语音在所述用户语音输入时对应的时间戳信息包括：当机器语音播放时有用户的用户语音输入，且所述用户无真实打断所述机器语音的意图时，立即停止播放所述机器语音或缓慢降低所述机器语音的音量。In one embodiment of the present disclosure, when there is a user voice input from the user while the machine voice is playing, and the user has no real intention to interrupt the machine voice, the machine voice is stopped playing, and the machine voice is obtained. The timestamp information corresponding to the machine voice when the user's voice is input includes: when there is a user's voice input while the machine voice is playing, and the user has no real intention to interrupt the machine voice, immediately stop playing the machine voice Voice or slowly lower the volume of the machine voice.

在本公开的一个实施例中，还包括：当所述用户语音的停顿超过第一时间阈值时，表征用户语音输入完成。In one embodiment of the present disclosure, the method further includes: when the pause of the user's voice exceeds a first time threshold, it indicates that the user's voice input is completed.

在本公开的一个实施例中，还包括：当所述用户有真实打断所述机器语音的意图时，生成与所述用户语音对应的回复语音。In one embodiment of the present disclosure, the method further includes: when the user has a real intention to interrupt the machine voice, generating a reply voice corresponding to the user's voice.

在本公开的一个实施例中，还包括：过滤所述用户语音中的静音数据及噪声数据。In one embodiment of the present disclosure, the method further includes: filtering silence data and noise data in the user's voice.

根据本公开的另一个方面，还提供了一种对话处理装置，包括：According to another aspect of the present disclosure, a dialogue processing device is also provided, including:

获取模块，当机器语音播放时有用户的用户语音输入，且所述用户无真实打断所述机器语音的意图时，停止播放所述机器语音，并获取所述机器语音在所述用户语音输入时对应的时间戳信息；The acquisition module, when there is user voice input from the user when the machine voice is playing, and the user has no real intention to interrupt the machine voice, stops playing the machine voice, and obtains the user voice input of the machine voice. The corresponding timestamp information;

确定模块，根据所述时间戳信息确定所述机器语音对应的恢复播放节点；A determination module that determines the resumption playback node corresponding to the machine voice according to the timestamp information;

恢复模块，从所述恢复播放节点播放所述机器语音。A recovery module that plays the machine voice from the recovery play node.

根据本公开的另一个方面，还提供了一种电子设备，包括：处理器；以及存储器，用于存储所述处理器的可执行指令；其中，所述处理器配置为经由执行所述可执行指令来执行上述任意一项所述对话处理方法。According to another aspect of the present disclosure, an electronic device is also provided, including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to execute the executable instructions via Instructions to execute any of the above conversation processing methods.

根据本公开的另一个方面，还提供了一种计算机可读存储介质，其上存储有计算机程序，所述计算机程序被处理器执行时实现上述任意一项所述的对话处理方法。According to another aspect of the present disclosure, a computer-readable storage medium is also provided, on which a computer program is stored. When the computer program is executed by a processor, it implements any one of the above conversation processing methods.

本公开的实施例所提供的对话处理方法、装置、电子设备及计算机可读存储介质，当机器语音播放时有用户的用户语音输入，停止播放机器语音，将用户语音及机器语音输入至恢复模型，且通过恢复模型确定用户无真实打断机器语音的意图时，获取机器语音在用户语音输入时对应的时间戳信息，根据时间戳信息确定机器语音对应的恢复播放节点，例如将恢复播放节点设置在机器语音在用户语音输入时对应的预设停顿位置，从恢复播放节点播放机器语音；能够对用户意图进行确认，纠正用户误打断语音的情况，并保证语音恢复的流畅和连贯，提高语音交互效果。The dialogue processing method, device, electronic device and computer-readable storage medium provided by the embodiments of the present disclosure, when the machine voice is played and there is a user voice input from the user, the machine voice is stopped, and the user voice and the machine voice are input to the recovery model , and when it is determined through the recovery model that the user has no real intention to interrupt the machine voice, the timestamp information corresponding to the machine voice when the user's voice is input is obtained, and the recovery playback node corresponding to the machine voice is determined based on the timestamp information, for example, the recovery playback node is set Play the machine voice from the resume playback node at the preset pause position corresponding to the user's voice input; it can confirm the user's intention, correct the situation where the user accidentally interrupts the voice, and ensure the smoothness and coherence of voice recovery, improving voice quality. interactive effects.

应当理解的是，以上的一般描述和后文的细节描述仅是示例性和解释性的，并不能限制本公开。It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only, and do not limit the present disclosure.

附图说明Description of drawings

此处的附图被并入说明书中并构成本说明书的一部分，示出了符合本公开的实施例，并与说明书一起用于解释本公开的原理。显而易见地，下面描述中的附图仅仅是本公开的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动的前提下，还可以根据这些附图获得其他的附图。The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. Obviously, the drawings in the following description are only some embodiments of the present disclosure. For those of ordinary skill in the art, other drawings can be obtained based on these drawings without exerting creative efforts.

图1示出本公开实施例中一种对话处理方法流程图；Figure 1 shows a flow chart of a dialogue processing method in an embodiment of the present disclosure;

图2示出本公开实施例中一种用户意图判断方法流程图；Figure 2 shows a flow chart of a method for determining user intention in an embodiment of the present disclosure;

图3示出本公开实施例中又一种对话处理方法流程图；Figure 3 shows a flow chart of yet another dialogue processing method in an embodiment of the present disclosure;

图4示出本公开实施例中一种对话恢复示意图；Figure 4 shows a schematic diagram of conversation recovery in an embodiment of the present disclosure;

图5示出本公开实施例中一种对话处理装置示意图；Figure 5 shows a schematic diagram of a dialogue processing device in an embodiment of the present disclosure;

图6示出本公开实施例中一种对话处理系统示意图；Figure 6 shows a schematic diagram of a dialogue processing system in an embodiment of the present disclosure;

图7示出了可以应用于本公开实施例的对话处理方法或对话处理装置的示例性系统架构的示意图；和7 shows a schematic diagram of an exemplary system architecture that can be applied to a dialogue processing method or dialogue processing device according to embodiments of the present disclosure; and

图8示出本公开实施例中一种电子设备的结构框图。Figure 8 shows a structural block diagram of an electronic device in an embodiment of the present disclosure.

具体实施方式Detailed ways

现在将参考附图更全面地描述示例实施方式。然而，示例实施方式能够以多种形式实施，且不应被理解为限于在此阐述的范例；相反，提供这些实施方式使得本公开将更加全面和完整，并将示例实施方式的构思全面地传达给本领域的技术人员。所描述的特征、结构或特性可以以任何合适的方式结合在一个或更多实施方式中。Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in various forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concepts of the example embodiments. To those skilled in the art. The described features, structures or characteristics may be combined in any suitable manner in one or more embodiments.

此外，附图仅为本公开的示意性图解，并非一定是按比例绘制。图中相同的附图标记表示相同或类似的部分，因而将省略对它们的重复描述。附图中所示的一些方框图是功能实体，不一定必须与物理或逻辑上独立的实体相对应。可以采用软件形式来实现这些功能实体，或在一个或多个硬件模块或集成电路中实现这些功能实体，或在不同网络和/或处理器装置和/或微控制器装置中实现这些功能实体。Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings represent the same or similar parts, and thus their repeated description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software form, or implemented in one or more hardware modules or integrated circuits, or implemented in different networks and/or processor devices and/or microcontroller devices.

下面结合附图及实施例对本示例实施方式进行详细说明。This exemplary implementation will be described in detail below with reference to the drawings and examples.

首先，本公开实施例中提供了一种对话处理方法，该方法可以由任意具备计算处理能力的电子设备执行。First, embodiments of the present disclosure provide a dialogue processing method, which can be executed by any electronic device with computing processing capabilities.

图1示出本公开实施例中一种对话处理方法流程图，如图1所示，本公开实施例中提供的对话处理方法包括如下步骤：Figure 1 shows a flow chart of a dialogue processing method in an embodiment of the present disclosure. As shown in Figure 1, the dialogue processing method provided in an embodiment of the present disclosure includes the following steps:

S102，当机器语音播放时有用户的用户语音输入，且用户无真实打断机器语音的意图时，停止播放机器语音并获取机器语音在用户语音输入时对应的时间戳信息。S102. When there is user voice input by the user when the machine voice is playing, and the user has no real intention to interrupt the machine voice, stop playing the machine voice and obtain the timestamp information corresponding to the machine voice when the user voice is input.

在一个实施例中，机器语音包括但不限于：语音音频、语音音频对应的文字、和/或语音音频对应的时间戳信息等，在合成语音音频时同步生成对应的文字、时间戳信息等；其中，时间戳信息包括但不限于：打断播放语音的位置、标点符号、断句位置、和/或预设停顿位置，断句位置可为根据语义等确定需要断句的位置，预设停顿位置为人工或自动设置的需要停顿的位置。In one embodiment, the machine voice includes but is not limited to: voice audio, text corresponding to the voice audio, and/or timestamp information corresponding to the voice audio, etc., and the corresponding text, timestamp information, etc. are generated synchronously when synthesizing the voice audio; Among them, the timestamp information includes but is not limited to: the position of interrupting the played voice, punctuation mark, sentence segmentation position, and/or preset pause position. The sentence segmentation position can be the position where the sentence needs to be segmented based on semantics, etc., and the preset pause position is artificial Or automatically set the required pause position.

在一个实施例中，过滤用户语音中的静音数据及噪声数据等数据。In one embodiment, data such as silence data and noise data in the user's voice are filtered.

在一个实施例中，将用户语音及机器语音输入至恢复模型，以便通过恢复模型确定用户是否有真实打断机器语音的意图；无真实打断机器语音的意图包括但不限于：用户误打断、用户不改变对话内容等。In one embodiment, the user's voice and the machine's voice are input to the recovery model, so that the recovery model determines whether the user has a real intention to interrupt the machine's voice; the lack of real intention to interrupt the machine's voice includes but is not limited to: the user accidentally interrupts , the user does not change the conversation content, etc.

在一个实施例中，将过滤静音数据及噪声数据等数据后的用户语音、及机器语音输入至恢复模型，以便通过恢复模型确定用户是否有真实打断机器语音的意图。In one embodiment, the user's voice and machine voice after filtering data such as silence data and noise data are input to the recovery model, so that the recovery model can determine whether the user has a real intention to interrupt the machine voice.

在一个实施例中，当机器语音播放时有用户的用户语音输入，且用户无真实打断机器语音的意图时，立即停止播放机器语音，或缓慢降低机器语音的音量，直到用户语音的停顿超过第一时间阈值表征用户语音输入完成。In one embodiment, when there is user voice input from the user while the machine voice is playing, and the user has no real intention to interrupt the machine voice, the machine voice is immediately stopped playing, or the volume of the machine voice is slowly reduced until the pause in the user's voice exceeds The first time threshold represents the completion of the user's voice input.

在一个实施例中，根据使用场景进行选择和配置立即停止机器语音播报行为，或降低机器语音播报音量。In one embodiment, select and configure according to the usage scenario to immediately stop the machine voice broadcasting behavior or reduce the machine voice broadcasting volume.

在一个实施例中，当用户语音的停顿超过第一时间阈值时，表征用户语音输入完成，第一时间阈值可根据应用场景等自动或手动进行设置，例如第一时间阈值可设置为600ms-800ms。In one embodiment, when the pause of the user's voice exceeds the first time threshold, it indicates that the user's voice input is completed. The first time threshold can be set automatically or manually according to the application scenario. For example, the first time threshold can be set to 600ms-800ms. .

在一个实施例中，恢复模型根据语音数据、用户语音及机器语音等确定用户是否有真实打断机器语音的意图；In one embodiment, the recovery model determines whether the user has a real intention to interrupt the machine voice based on voice data, user voice, machine voice, etc.;

其中，语音数据包括但不限于以下至少之一：引导话术内容数据、机器语音的播放进度数据、语音置信度数据；引导话术内容数据为机器语音中包括的引导用户交互的数据；语音置信度数据为用户语音及机器语音对应的识别置信度。Among them, the voice data includes but is not limited to at least one of the following: guidance speech content data, machine voice playback progress data, and voice confidence data; guidance speech content data is data included in machine speech to guide user interaction; voice confidence The degree data is the recognition confidence corresponding to the user's voice and the machine's voice.

S104，根据时间戳信息确定机器语音对应的恢复播放节点。S104: Determine the resumption playback node corresponding to the machine voice according to the timestamp information.

在一个实施例中，恢复播放节点包括但不限于以下至少之一：In one embodiment, the recovery playback node includes but is not limited to at least one of the following:

机器语音的起始位置；The starting position of the machine voice;

机器语音在用户语音输入时的位置；The position of the machine voice during user voice input;

机器语音在用户语音输入时对应的断句位置；The sentence segment position corresponding to the machine voice during user voice input;

机器语音在用户语音输入时对应的预设停顿位置。The preset pause position corresponding to the machine voice during user voice input.

在一个实施例中，可根据场景等自动或手动设置恢复播放节点，例如，可将恢复播放节点设置为机器语音在用户语音输入时对应的断句位置，则建立时间戳与恢复播放节点的映射表，根据时间戳信息确定对应的恢复播放节点为断句位置；可将恢复播放节点设置为机器语音在用户语音输入时对应的预设停顿位置，则建立时间戳与恢复播放节点的映射表，根据时间戳信息A确定对应的恢复播放节点为预设停顿位置B。In one embodiment, the resume play node can be set automatically or manually according to the scene. For example, the resume play node can be set to the sentence segment position corresponding to the machine voice when the user's voice input, and then a mapping table between the timestamp and the resume play node is established. , determine the corresponding resume playback node as the sentence segment position based on the timestamp information; the resume playback node can be set to the preset pause position corresponding to the machine voice during the user's voice input, and then a mapping table between the timestamp and the resume playback node is established. According to the time The stamp information A determines that the corresponding resume play node is the preset pause position B.

S106，从恢复播放节点播放机器语音。S106, play the machine voice from the recovery playback node.

在一个实施例中，在第一轮次用户输入用户语音结束，还可进一步检测收集下一轮次的用户语音，对第一轮次用户输入的用户语音做进一步验证。In one embodiment, after the first round of user voice input ends, the user voice of the next round can be further detected and collected to further verify the user voice input of the first round.

在一个实施例中，在恢复播放机器语音时，可直接恢复机器语音的音量，也可采用淡入淡出的方式进行平滑恢复，保证语音恢复的流畅和连贯，恢复音量的方式及平滑过渡时间可根据需要和场景进行自动或手动配置。In one embodiment, when resuming the playback of the machine voice, the volume of the machine voice can be restored directly, or the fade-in and fade-out method can be used for smooth recovery to ensure smooth and coherent voice recovery. The method of restoring the volume and the smooth transition time can be based on Requires automatic or manual configuration with the scenario.

在一个实施例中，当用户有真实打断机器语音的意图时，生成与用户语音对应的回复语音，回复语音包括但不限于：“嗯”，“收到”等，再根据用户语音播报对应真实意图的系统话术。In one embodiment, when the user truly intends to interrupt the machine voice, a reply voice corresponding to the user's voice is generated. The reply voice includes but is not limited to: "um", "received", etc., and then the corresponding voice is broadcast based on the user's voice. Systematic rhetoric with true intentions.

上述实施例中，基于恢复模型，对用户意图进行确认，纠正在智能语音对话中，误打断比较严重的问题，将时间戳作为语音对话的上下文维护，在处理用户打断时，将机器语音播放时有用户的用户语音输入对应的时间戳记录下来，通过时间戳可以进行多种恢复策略的选择，能确定合适的恢复播放节点，保证语音恢复的流畅和连贯，既能灵活打断，又能及时检测误打断，并能友好恢复机器语音，提高智能语音对话的拟人化体验。In the above embodiment, based on the recovery model, the user's intention is confirmed, the serious problem of accidental interruptions is corrected in the intelligent voice dialogue, the timestamp is maintained as the context of the voice dialogue, and the machine voice is used when handling user interruptions. During playback, the timestamp corresponding to the user's voice input is recorded. Multiple recovery strategies can be selected through the timestamp, and the appropriate recovery playback node can be determined to ensure smooth and coherent voice recovery, which can not only flexibly interrupt, but also It can detect false interruptions in time and restore machine voice in a friendly manner, improving the anthropomorphic experience of intelligent voice dialogue.

图2示出本公开实施例中一种用户意图判断方法流程图，如图2所示，本公开实施例中提供的用户意图判断方法包括如下步骤：Figure 2 shows a flow chart of a user intention determination method in an embodiment of the present disclosure. As shown in Figure 2, the user intention determination method provided in an embodiment of the present disclosure includes the following steps:

S202，对恢复模型进行训练。S202, train the recovery model.

在一个实施例中，恢复模型的训练数据包括但不限于：对话音频特征数据、对话上下文数据、对话轮次数据、语音数据、和/或对话语义数据等；对话音频特征包括但不限于短时过零率、短时能量、短时自相关函数、短时平均幅度、频谱差分幅度、频谱质心、频谱宽度、梅尔频率倒谱系数等；对话上下文为机器语音与用户语音交互的上下文信息等；对话轮次数据包括但不限于：发出信号、确认结束、正式告别等轮次数据；对话语义数据为数据所对应的现实世界中的事物所代表的概念的含义等。In one embodiment, the training data of the recovery model includes but is not limited to: dialogue audio feature data, dialogue context data, dialogue turn data, speech data, and/or dialogue semantic data, etc.; the dialogue audio features include but is not limited to short-term Zero-crossing rate, short-term energy, short-term autocorrelation function, short-term average amplitude, spectrum difference amplitude, spectrum centroid, spectrum width, Mel frequency cepstrum coefficient, etc.; the dialogue context is the context information of the interaction between machine voice and user voice, etc. ; Dialogue turn data includes but is not limited to: turn data such as signaling, confirming the end, and formal farewell; dialogue semantic data refers to the meaning of concepts represented by things in the real world corresponding to the data, etc.

S204，将用户语音及机器语音输入至恢复模型。S204: Input user voice and machine voice to the recovery model.

在一个实施例中，恢复模型根据语音数据、用户语音及机器语音等确定用户是否有真实打断机器语音的意图。In one embodiment, the recovery model determines whether the user has a real intention to interrupt the machine voice based on voice data, user voice, machine voice, etc.

S206，根据恢复模型判断结果，确定是否恢复机器语音的播放。S206: Determine whether to resume the playback of the machine voice according to the recovery model judgment result.

恢复模型判断结果一种是确认了用户的打断意图，此时本轮对话结束，对用户语音进行完整的语义分析和对话处理，产生新的话术做进一步的交互。One of the recovery model judgment results is to confirm the user's interruption intention. At this time, the current round of dialogue ends, and the user's voice is completely semantically analyzed and dialogue processed to generate new words for further interaction.

恢复模型判断结果另一种是否认了用户的打断意图，此时需要恢复本轮对话，将机器语音向前滚动到合适的恢复播放节点，在恢复播放节点处开始恢复机器语音的播放。Another judgment result of the recovery model is that the user's intention to interrupt is denied. At this time, the current round of dialogue needs to be resumed, the machine voice is scrolled forward to the appropriate resume playback node, and the playback of the machine voice is resumed at the resume playback node.

上述实施例中，基于恢复模型，对用户意图进行确认，根据确认结果，确定是进一步做意图识别还是进行对话恢复，纠正在智能语音对话中，误打断比较严重的问题，在进行对话恢复时，通过增加时间戳的方法，从打断位置向前滚动到合适的恢复播放节点，这种策略声音比较流畅和连贯，表达比较清晰，更接近真实的人人对话体验。In the above embodiment, based on the recovery model, the user's intention is confirmed, and based on the confirmation result, it is determined whether to further perform intention recognition or restore the conversation, correcting the serious problem of mistaken interruption in the intelligent voice conversation, when resuming the conversation. , by adding timestamps and scrolling forward from the interruption position to the appropriate resume playback node, this strategy has smoother and more coherent sounds, clearer expressions, and is closer to the real everyone-to-people dialogue experience.

图3示出本公开实施例中又一种对话处理方法流程图，如图3所示，本公开实施例中提供的对话处理方法包括如下步骤：Figure 3 shows a flow chart of yet another dialogue processing method in an embodiment of the present disclosure. As shown in Figure 3, the dialogue processing method provided in an embodiment of the present disclosure includes the following steps:

S302，合成及播放机器语音。S302, synthesize and play machine voice.

在一个实施例中，机器语音由语音合成引擎合成，也可以是提前制作的录音；机器语音音频生成时同步生成了对应文字的时间戳，由中央控制模块作为上下文维护。In one embodiment, the machine voice is synthesized by a speech synthesis engine, or may be a recording made in advance; when the machine voice audio is generated, a timestamp of the corresponding text is generated simultaneously, and is maintained by the central control module as a context.

S304，当检测到机器语音播放时有用户的用户语音输入时，获取机器语音在用户语音输入时对应的时间戳信息。S304: When it is detected that there is user voice input by the user when the machine voice is played, obtain the timestamp information corresponding to the machine voice during the user voice input.

打断功能可以结合中央控制模块中的语音端点检测模块和/或降噪模块检测启动，在检测到用户的用户语音输入后，中央控制模块可立即实施打断或降低机器语音的音量，并将检测到用户的用户语音输入对应的打断时间戳记录在上下文中，通过时间戳记录打断的位置，用于在恢复时还原打断位置。The interrupt function can be activated in combination with the voice endpoint detection module and/or noise reduction module in the central control module. After detecting the user's voice input, the central control module can immediately interrupt or reduce the volume of the machine voice, and The interruption timestamp corresponding to the user's voice input is detected in the context, and the interruption position is recorded through the timestamp, which is used to restore the interruption position during recovery.

在一个实施例中，实时处理用户语音数据，分别调用语音端点检测模块和降噪模块过滤用户语音数据，如果语音端点检测模块检测到用户语音数据中包含声音，再由降噪模块进一步检测，如果声音中包含的是噪音则过滤掉，如果包含的不是噪音，则确定为是用户输入的，提高用户语音的准确性。In one embodiment, the user voice data is processed in real time, and the voice endpoint detection module and the noise reduction module are respectively called to filter the user voice data. If the voice endpoint detection module detects that the user voice data contains sound, the noise reduction module further detects it. If If the sound contains noise, it will be filtered out. If it does not contain noise, it will be determined to be input by the user, thereby improving the accuracy of the user's voice.

语音端点检测模块和降噪模块是本地集成的轻量模块，响应速度相对较快，使用语音端点检测模块和降噪模块处理打断可以更快的响应用户，但它们的准确性容易受到背景噪音的影响，因此在用户输入用户语音结束，还需进一步收集到足够的信息，对用户输入的用户语音做进一步验证。The voice endpoint detection module and noise reduction module are locally integrated lightweight modules with relatively fast response times. Using the voice endpoint detection module and noise reduction module to handle interruptions can respond to users faster, but their accuracy is susceptible to background noise. Therefore, after the user inputs the user voice, it is necessary to further collect enough information to further verify the user input of the user voice.

S306，通过恢复模型检测用户是否有真实打断机器语音的意图。S306: Use the recovery model to detect whether the user has a real intention to interrupt the machine voice.

在一个实施例中，恢复模型此时结合上下文信息，综合语音和语义信息，检测用户是否有真实打断机器语音意图。In one embodiment, the recovery model combines contextual information, integrates speech and semantic information, and detects whether the user has a real intention to interrupt the machine's speech.

S308，若是，则本轮对话结束，中央控制模块调用语义识别引擎和对话管理服务进行完整的语义分析和对话处理，产生新的话术做进一步的交互。S308, if so, this round of dialogue ends, and the central control module calls the semantic recognition engine and dialogue management service to perform complete semantic analysis and dialogue processing, and generate new words for further interaction.

S310，否则，恢复本轮对话，中央控制模块综合语音合成引擎的时间戳，将语音向前滚动到合适的停顿位置，在停顿处开始恢复机器语音的播放。S310, otherwise, resume the current round of dialogue. The central control module integrates the timestamp of the speech synthesis engine, rolls the speech forward to a suitable pause position, and starts to resume the playback of the machine voice at the pause.

上述实施例中，提供了一种快速打断，意图确认，错误恢复的方案，使用快速的语音断点检测及时打断机器语音，在用户发言结束后，使用快速的恢复模型进行确认，根据确认结果，确定是进一步做意图识别还是进行对话恢复，在进行对话恢复时，通过增加时间戳的方法，从打断位置向前滚动到合适的恢复播放节点，这种策略声音比较流畅和连贯，表达比较清晰，更接近真实的人人对话体验。In the above embodiment, a solution for rapid interruption, intent confirmation, and error recovery is provided. Fast voice breakpoint detection is used to interrupt the machine voice in time. After the user finishes speaking, a fast recovery model is used for confirmation. According to the confirmation As a result, it is determined whether to further perform intent recognition or resume dialogue. When resuming dialogue, scroll forward from the interruption position to the appropriate resume playback node by adding a timestamp. This strategy makes the sound smoother, more coherent, and more expressive. It is clearer and closer to the real conversation experience between people.

图4示出本公开实施例中一种对话恢复示意图，坐标轴代表机器语音实时生成的方向，每个方格代表一个时间片，机器语音询问用户姓名时，用户听到自己的姓名时可能会立即作出确认，此时机器语音会被立即打断或缓慢降低机器语音的音量；例如，机器语音可以为：“可以，您是王先生吗”，用户在播放至“王”时进行答复，即对应用户语音输入位置。Figure 4 shows a schematic diagram of dialogue recovery in an embodiment of the present disclosure. The coordinate axis represents the direction of real-time generation of machine voice. Each square represents a time slice. When the machine voice asks for the user's name, the user may hear his or her own name. Confirm immediately. At this time, the machine voice will be interrupted immediately or the volume of the machine voice will be slowly reduced; for example, the machine voice can be: "Okay, are you Mr. Wang", and the user will reply when "Wang" is played, that is Corresponds to the user's voice input position.

如果用户答复内容为“是的”，恢复模型结合引导话术内容数据，播放进度数据，语音置信度等参数，将判断为用户有真实打断机器语音的意图。If the user's reply is "yes", the recovery model will determine that the user has a real intention to interrupt the machine's voice based on the guidance content data, playback progress data, voice confidence and other parameters.

如果用户答复内容为“啊～”，恢复模型将判断为用户无真实打断机器语音的意图，此时中央控制模块根据时间戳信息回滚语音到恢复播放节点位置“您”，并从此处开始重复引导。If the user's reply is "Ah~", the recovery model will judge that the user has no real intention to interrupt the machine's voice. At this time, the central control module rolls back the voice to the resume playback node position "you" based on the timestamp information, and starts from here. Repeat boot.

上述实施例中，从打断位置向前滚动到合适的恢复播放节点，这种策略声音比较流畅和连贯，表达比较清晰，更接近真实的人人对话体验，解决在智能语音对话系统中出现的误打断恢复问题。In the above embodiment, scroll forward from the interruption position to the appropriate resume playback node. This strategy makes the sound smoother and more coherent, the expression is clearer, and is closer to the real everyone dialogue experience, solving the problems that occur in the intelligent voice dialogue system. Accidental interrupt recovery problem.

基于同一发明构思，本公开实施例中还提供了一种对话处理装置，如下面的实施例。由于该装置实施例解决问题的原理与上述方法实施例相似，因此该装置实施例的实施可以参见上述方法实施例的实施，重复之处不再赘述。Based on the same inventive concept, embodiments of the present disclosure also provide a dialogue processing device, such as the following embodiments. Since the problem-solving principle of this device embodiment is similar to that of the above-mentioned method embodiment, the implementation of this device embodiment can refer to the implementation of the above-mentioned method embodiment, and repeated details will not be repeated.

图5示出本公开实施例中一种对话处理装置示意图，如图5所示，该对话处理装置5包括：获取模块501、确定模块502及恢复模块503；Figure 5 shows a schematic diagram of a dialogue processing device in an embodiment of the present disclosure. As shown in Figure 5, the dialogue processing device 5 includes: an acquisition module 501, a determination module 502 and a recovery module 503;

获取模块501，当机器语音播放时有用户的用户语音输入，且用户无真实打断机器语音的意图时，停止播放机器语音，并获取机器语音在用户语音输入时对应的时间戳信息；Acquisition module 501, when there is user voice input by the user when the machine voice is playing, and the user has no real intention to interrupt the machine voice, stops playing the machine voice, and obtains the timestamp information corresponding to the machine voice when the user voice is input;

确定模块502，根据时间戳信息确定机器语音对应的恢复播放节点；The determination module 502 determines the resumption playback node corresponding to the machine voice according to the timestamp information;

恢复模块503，从恢复播放节点播放机器语音。The recovery module 503 plays the machine voice from the recovery play node.

基于同一发明构思，本公开实施例中还提供了一种对话处理系统，如下面的实施例。由于该系统实施例解决问题的原理与上述方法实施例相似，因此该系统实施例的实施可以参见上述方法实施例的实施，重复之处不再赘述。Based on the same inventive concept, embodiments of the present disclosure also provide a dialogue processing system, such as the following embodiments. Since the problem-solving principle of this system embodiment is similar to that of the above-mentioned method embodiment, the implementation of this system embodiment can be referred to the implementation of the above-mentioned method embodiment, and repeated details will not be repeated.

图6示出本公开实施例中一种对话处理系统示意图，如图6所示，该系统包括：中央控制模块601、语音合成引擎602；语音识别引擎603；语义识别引擎604；对话管理服务605；中央控制模块601集成各项技术，控制对话逻辑；Figure 6 shows a schematic diagram of a dialogue processing system in an embodiment of the present disclosure. As shown in Figure 6, the system includes: a central control module 601, a speech synthesis engine 602; a speech recognition engine 603; a semantic recognition engine 604; and a dialogue management service 605 ;The central control module 601 integrates various technologies and controls the dialogue logic;

语音合成引擎602合成机器语音，机器语音音频生成时同步生成了对应文字的时间戳；通过语音识别引擎603对用户语音及机器语音进行识别；通过语义识别引擎604和对话管理服务605进行完整的语义分析和对话处理，产生新的话术做进一步的交互。The speech synthesis engine 602 synthesizes machine speech. When the machine speech audio is generated, the timestamp of the corresponding text is synchronously generated; the user speech and the machine speech are recognized through the speech recognition engine 603; the complete semantics is performed through the semantic recognition engine 604 and the dialogue management service 605. Analysis and dialogue processing to generate new words for further interaction.

中央控制模块601包括语音端点检测模块6011、降噪模块6012、撤销恢复模块6013；The central control module 601 includes a voice endpoint detection module 6011, a noise reduction module 6012, and an undo recovery module 6013;

语音端点检测模块6011，用于从声音信号里识别出静音期和语音期的开始和结束，应用可以只处理有效的语音，过滤掉多余的静音，节省传输带宽或运算量。The voice endpoint detection module 6011 is used to identify the beginning and end of the silence period and the speech period from the sound signal. The application can only process valid speech, filter out excess silence, and save transmission bandwidth or computing power.

降噪模块6012，用于从声音信号里识别出噪音和语音的开始和结束。Noise reduction module 6012 is used to identify the beginning and end of noise and speech from the sound signal.

撤销恢复模块6013，机器语音被打断后，继续采集数据，根据更丰富的数据进一步检查打断是否正确或合理，不正确的打断进行恢复。The undo recovery module 6013 continues to collect data after the machine voice is interrupted, further checks whether the interruption is correct or reasonable based on richer data, and restores incorrect interruptions.

通过中央控制模块601中的语音合成引擎602合成机器语音，通过中央控制模块601中的语音端点检测模块6011或降噪模块6012检测到有人声输入的情况后，系统可立即停止系统的播报行为，也可降低系统的播报音量，产生淡入淡出的效果，给用户一个反馈说明系统已经知晓他正在说话，同时，当用户继续说话超过第二时间阈值，说明用户在表达一些完整的意图，系统会停止播报等待用户表达完成，当用户停顿超过第一时间阈值，系统会认为用户表达完成，并将检测到用户的用户语音输入对应的打断时间戳记录在上下文中，通过时间戳记录打断的位置，用于在恢复时还原打断位置。Machine speech is synthesized by the speech synthesis engine 602 in the central control module 601. After detecting voice input through the voice endpoint detection module 6011 or noise reduction module 6012 in the central control module 601, the system can immediately stop the system's broadcasting behavior. You can also reduce the system's broadcast volume to produce a fade-in and fade-out effect, giving the user a feedback indicating that the system has learned that he is speaking. At the same time, when the user continues to speak beyond the second time threshold, it means that the user is expressing some complete intentions and the system will stop. The broadcast waits for the user's expression to be completed. When the user's pause exceeds the first time threshold, the system will consider the user's expression to be completed, and record the interruption timestamp corresponding to the user's voice input in the context, and record the location of the interruption through the timestamp. , used to restore the interrupted position during recovery.

在一个实施例中，可根据场景及用户习惯等自动或手动设置第一时间阈值、第二时间阈值，第二时间阈值一般为1s左右，第一时间阈值一般600ms-800ms。In one embodiment, the first time threshold and the second time threshold can be set automatically or manually according to the scene and user habits. The second time threshold is generally about 1 second, and the first time threshold is generally 600ms-800ms.

在一个实施例中，根据使用场景进行选择和配置立即停止系统的机器语音播报行为，或降低系统的机器语音播报音量。In one embodiment, select and configure according to the usage scenario to immediately stop the system's machine voice broadcasting behavior, or reduce the system's machine voice broadcasting volume.

通过对话音频特征数据、对话上下文数据、对话轮次数据等对恢复模型进行训练，根据恢复模型判断用户是否有真实的打断意图，如果用户有真实打断机器语音的意图时，系统会增加一个语气助词，如“嗯”，“收到”，再根据用户语音播报对应真实意图的系统话术；如果用户无真实打断意图，系统会走入自然恢复播报的流程。The recovery model is trained through dialogue audio feature data, dialogue context data, dialogue turn data, etc., and the recovery model is used to determine whether the user has a real intention to interrupt. If the user has a real intention to interrupt the machine voice, the system will add a The modal particles, such as "um" and "received", are used to broadcast the system's words corresponding to the true intention based on the user's voice; if the user has no real intention to interrupt, the system will naturally resume the broadcasting process.

通过语音合成引擎602的时间戳信息，系统可以计算出自上次打断后，系统播报的一个恢复播放节点，并从恢复播放节点开始恢复播报，使系统更加拟人化。Through the timestamp information of the speech synthesis engine 602, the system can calculate a resume playback node reported by the system since the last interruption, and resume broadcasting from the resume playback node, making the system more anthropomorphic.

在一个实施例中，在播报对应真实意图的系统话术或自然恢复播报的流程中时，可恢复至播放音量，恢复过程可采用直接恢复，也可采用淡入淡出的方式进行平滑恢复，平滑过渡时间根据需要和场景进行自动或手动配置。In one embodiment, when broadcasting the system speech corresponding to the true intention or naturally resuming the broadcasting process, the playback volume can be restored. The restoration process can be a direct restoration or a smooth restoration and smooth transition using a fade-in and fade-out method. The time is automatically or manually configured according to needs and scenarios.

上述实施例中，应用在支持打断的智能语音对话场景中，对话的一方是真人，另一方是语音机器人，真人和机器交替发言，机器发言时，允许真人打断机器，获得话语权，此时机器等待真人发言结束，并通过恢复模型判断先前的打断意图是否正确或者真人的发言是否改变对话内容，如果判断结果为误打断或不改变对话内容，则进行对话恢复，既能灵活打断，又能及时检测误打断，并能友好恢复机器语音，保证语音恢复的流畅和连贯，改进语音交互中打断场景的用户体验，提高智能语音对话的拟人化体验，保证该系统能及时响应又能正确理解用户意图，智能化体验比较好。In the above embodiment, it is used in an intelligent voice dialogue scenario that supports interruption. One party to the conversation is a real person and the other is a voice robot. The real person and the machine speak alternately. When the machine speaks, the real person is allowed to interrupt the machine and obtain the right to speak. The time machine waits for the end of the real person's speech, and uses the recovery model to judge whether the previous interruption intention is correct or whether the real person's speech changes the conversation content. If the judgment result is that the interruption is mistaken or the conversation content does not change, the conversation will be restored, which can flexibly open the conversation. It can detect false interruptions in time, and can restore the machine voice in a friendly manner, ensuring smooth and coherent voice recovery, improving the user experience of interruption scenes in voice interaction, improving the anthropomorphic experience of intelligent voice dialogue, and ensuring that the system can be timely The response can correctly understand the user's intention, and the intelligent experience is better.

图7示出了可以应用于本公开实施例的对话处理方法或对话处理装置的示例性系统架构的示意图。FIG. 7 shows a schematic diagram of an exemplary system architecture that can be applied to a dialogue processing method or dialogue processing apparatus according to embodiments of the present disclosure.

如图7所示，系统架构700可以包括终端设备701、702、703，网络704和服务器705。As shown in Figure 7, the system architecture 700 may include terminal devices 701, 702, 703, a network 704 and a server 705.

网络704用以在终端设备701、702、703和服务器705之间提供通信链路的介质，可以是有线网络，也可以是无线网络。The network 704 is a medium used to provide communication links between the terminal devices 701, 702, 703 and the server 705, and may be a wired network or a wireless network.

可选地，上述的无线网络或有线网络使用标准通信技术和/或协议。网络通常为因特网、但也可以是任何网络，包括但不限于局域网(Local Area Network，LAN)、城域网(Metropolitan Area Network，MAN)、广域网(Wide Area Network，WAN)、移动、有线或者无线网络、专用网络或者虚拟专用网络的任何组合)。在一些实施例中，使用包括超文本标记语言(Hyper Text Mark-up Language，HTML)、可扩展标记语言(ExtensibleMarkupLanguage，XML)等的技术和/或格式来代表通过网络交换的数据。此外还可以使用诸如安全套接字层(Secure Socket Layer，SSL)、传输层安全(Transport Layer Security，TLS)、虚拟专用网络(Virtual Private Network，VPN)、网际协议安全(InternetProtocolSecurity，IPsec)等常规加密技术来加密所有或者一些链路。在另一些实施例中，还可以使用定制和/或专用数据通信技术取代或者补充上述数据通信技术。Optionally, the above-mentioned wireless network or wired network uses standard communication technologies and/or protocols. The network is usually the Internet, but can also be any network, including but not limited to Local Area Network (LAN), Metropolitan Area Network (MAN), Wide Area Network (WAN), mobile, wired or wireless network, private network, or virtual private network). In some embodiments, technologies and/or formats including Hyper Text Mark-up Language (HTML), Extensible Markup Language (XML), etc. are used to represent data exchanged through the network. In addition, you can also use conventional methods such as Secure Socket Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (Virtual Private Network, VPN), Internet Protocol Security (Internet Protocol Security, IPsec), etc. Encryption technology to encrypt all or some links. In other embodiments, customized and/or dedicated data communication technologies may also be used in place of or in addition to the above-described data communication technologies.

终端设备701、702、703可以是各种电子设备，包括但不限于智能手机、平板电脑、膝上型便携计算机、台式计算机等，可用于显示机器语音的播放进度、恢复模型的判断结果等。Terminal devices 701, 702, and 703 can be various electronic devices, including but not limited to smartphones, tablet computers, laptop computers, desktop computers, etc., and can be used to display the playback progress of the machine voice, the judgment results of the recovery model, etc.

可选地，不同的终端设备701、702、703中安装的应用程序的客户端是相同的，或基于不同操作系统的同一类型应用程序的客户端。基于终端平台的不同，该应用程序的客户端的具体形态也可以不同，比如，该应用程序客户端可以是手机客户端、PC客户端等。Optionally, the clients of the application programs installed in different terminal devices 701, 702, and 703 are the same, or the clients of the same type of application program based on different operating systems. Based on different terminal platforms, the specific form of the application client can also be different. For example, the application client can be a mobile phone client, a PC client, etc.

服务器705可以是提供各种服务的服务器，例如对用户利用终端设备701、702、703所进行操作的装置提供支持的后台管理服务器。后台管理服务器可以对接收到的请求等数据进行分析等处理，并将处理结果反馈给终端设备；例如，可获取机器语音在用户语音输入时对应的时间戳信息；根据时间戳信息确定机器语音对应的恢复播放节点；从恢复播放节点播放机器语音；对恢复模型进行训练等。The server 705 may be a server that provides various services, such as a background management server that provides support for devices operated by users using the terminal devices 701, 702, and 703. The background management server can analyze and process the received request and other data, and feed the processing results back to the terminal device; for example, it can obtain the timestamp information corresponding to the machine voice when the user inputs the voice; determine the machine voice correspondence based on the timestamp information resume playback node; play machine voice from the resume playback node; train the recovery model, etc.

可选地，服务器可以是独立的物理服务器，也可以是多个物理服务器构成的服务器集群或者分布式系统，还可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、CDN(Content Delivery Network，内容分发网络)、以及大数据和人工智能平台等基础云计算服务的云服务器。终端可以是智能手机、平板电脑、笔记本电脑、台式计算机等，但并不局限于此。终端以及服务器可以通过有线或无线通信方式进行直接或间接地连接，本申请在此不做限制。Optionally, the server can be an independent physical server, or a server cluster or distributed system composed of multiple physical servers, or it can provide cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, Cloud communications, middleware services, domain name services, security services, CDN (Content Delivery Network, content distribution network), and cloud servers for basic cloud computing services such as big data and artificial intelligence platforms. The terminal can be a smartphone, a tablet, a laptop, a desktop computer, etc., but is not limited thereto. The terminal and the server can be connected directly or indirectly through wired or wireless communication methods, which is not limited in this application.

本领域技术人员可以知晓，图7中的终端设备、网络和服务器的数量仅仅是示意性的，根据实际需要，可以具有任意数目的终端设备、网络和服务器。本公开实施例对此不作限定。Those skilled in the art will know that the number of terminal devices, networks and servers in Figure 7 is only illustrative, and there can be any number of terminal devices, networks and servers according to actual needs. The embodiments of the present disclosure do not limit this.

所属技术领域的技术人员能够理解，本公开的各个方面可以实现为系统、方法或程序产品。因此，本公开的各个方面可以具体实现为以下形式，即：完全的硬件实施方式、完全的软件实施方式(包括固件、微代码等)，或硬件和软件方面结合的实施方式，这里可以统称为“电路”、“模块”或“系统”。Those skilled in the art will understand that various aspects of the present disclosure may be implemented as systems, methods, or program products. Therefore, various aspects of the present disclosure may be embodied in the following forms, namely: a complete hardware implementation, a complete software implementation (including firmware, microcode, etc.), or an implementation combining hardware and software aspects, which may be collectively referred to herein as "Circuits", "modules" or "systems".

下面参照图8来描述根据本公开的这种实施方式的电子设备800。图8显示的电子设备800仅仅是一个示例，不应对本公开实施例的功能和使用范围带来任何限制。An electronic device 800 according to this embodiment of the present disclosure is described below with reference to FIG. 8 . The electronic device 800 shown in FIG. 8 is only an example and should not bring any limitations to the functions and usage scope of the embodiments of the present disclosure.

如图8所示，电子设备800以通用计算设备的形式表现。电子设备800的组件可以包括但不限于：上述至少一个处理单元810、上述至少一个存储单元820、连接不同系统组件(包括存储单元820和处理单元810)的总线830。As shown in Figure 8, electronic device 800 is embodied in the form of a general computing device. The components of the electronic device 800 may include, but are not limited to: the above-mentioned at least one processing unit 810, the above-mentioned at least one storage unit 820, and a bus 830 connecting different system components (including the storage unit 820 and the processing unit 810).

其中，所述存储单元存储有程序代码，所述程序代码可以被所述处理单元810执行，使得所述处理单元810执行本说明书上述“示例性方法”部分中描述的根据本公开各种示例性实施方式的步骤。Wherein, the storage unit stores program code, and the program code can be executed by the processing unit 810, so that the processing unit 810 performs various exemplary methods according to the present disclosure described in the "Example Method" section of this specification. Implementation steps.

例如，所述处理单元810可以执行上述方法实施例的如下步骤：当机器语音播放时有用户的用户语音输入，停止播放机器语音，将用户语音及机器语音输入至恢复模型，且通过恢复模型确定用户无真实打断机器语音的意图时，获取机器语音在用户语音输入时对应的时间戳信息，根据时间戳信息确定机器语音对应的恢复播放节点，例如将恢复播放节点设置在机器语音在用户语音输入时对应的预设停顿位置，从恢复播放节点播放机器语音。For example, the processing unit 810 can perform the following steps of the above method embodiment: when the machine voice is played and there is a user voice input of the user, stop playing the machine voice, input the user voice and the machine voice to the recovery model, and determine through the recovery model When the user has no real intention to interrupt the machine voice, obtain the timestamp information corresponding to the machine voice when the user's voice is input, and determine the resume playback node corresponding to the machine voice based on the timestamp information. For example, set the resume playback node when the machine voice is in the user's voice input. Play the machine voice from the resume playback node at the corresponding preset pause position during input.

存储单元820可以包括易失性存储单元形式的可读介质，例如随机存取存储单元(RAM)8201和/或高速缓存存储单元8202，还可以进一步包括只读存储单元(ROM)8203。The storage unit 820 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 8201 and/or a cache storage unit 8202, and may further include a read-only storage unit (ROM) 8203.

存储单元820还可以包括具有一组(至少一个)程序模块8205的程序/实用工具8204，这样的程序模块8205包括但不限于：操作系统、一个或者多个应用程序、其它程序模块以及程序数据，这些示例中的每一个或某种组合中可能包括网络环境的实现。Storage unit 820 may also include a program/utility 8204 having a set of (at least one) program modules 8205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, Each of these examples, or some combination, may include the implementation of a network environment.

总线830可以为表示几类总线结构中的一种或多种，包括存储单元总线或者存储单元控制器、外围总线、图形加速端口、处理单元或者使用多种总线结构中的任意总线结构的局域总线。Bus 830 may be a local area representing one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, a graphics acceleration port, a processing unit, or using any of a variety of bus structures. bus.

电子设备800也可以与一个或多个外部设备840(例如键盘、指向设备、蓝牙设备等)通信，还可与一个或者多个使得用户能与该电子设备800交互的设备通信，和/或与使得该电子设备800能与一个或多个其它计算设备进行通信的任何设备(例如路由器、调制解调器等等)通信。这种通信可以通过输入/输出(I/O)接口850进行。并且，电子设备800还可以通过网络适配器860与一个或者多个网络(例如局域网(LAN)，广域网(WAN)和/或公共网络，例如因特网)通信。如图所示，网络适配器860通过总线830与电子设备800的其它模块通信。应当明白，尽管图中未示出，可以结合电子设备800使用其它硬件和/或软件模块，包括但不限于：微代码、设备驱动器、冗余处理单元、外部磁盘驱动阵列、RAID系统、磁带驱动器以及数据备份存储系统等。Electronic device 800 may also communicate with one or more external devices 840 (e.g., keyboard, pointing device, Bluetooth device, etc.), may also communicate with one or more devices that enable a user to interact with electronic device 800, and/or with Any device that enables the electronic device 800 to communicate with one or more other computing devices (eg, router, modem, etc.). This communication may occur through input/output (I/O) interface 850. Furthermore, the electronic device 800 may also communicate with one or more networks (eg, a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) through a network adapter 860. As shown, network adapter 860 communicates with other modules of electronic device 800 via bus 830. It should be understood that, although not shown in the figures, other hardware and/or software modules may be used in conjunction with electronic device 800, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives And data backup storage system, etc.

通过以上的实施方式的描述，本领域的技术人员易于理解，这里描述的示例实施方式可以通过软件实现，也可以通过软件结合必要的硬件的方式来实现。因此，根据本公开实施方式的技术方案可以以软件产品的形式体现出来，该软件产品可以存储在一个非易失性存储介质(可以是CD-ROM，U盘，移动硬盘等)中或网络上，包括若干指令以使得一台计算设备(可以是个人计算机、服务器、终端装置、或者网络设备等)执行根据本公开实施方式的方法。Through the above description of the embodiments, those skilled in the art can easily understand that the example embodiments described here can be implemented by software, or can be implemented by software combined with necessary hardware. Therefore, the technical solution according to the embodiment of the present disclosure can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, U disk, mobile hard disk, etc.) or on the network , including several instructions to cause a computing device (which may be a personal computer, a server, a terminal device, a network device, etc.) to execute a method according to an embodiment of the present disclosure.

在本公开的示例性实施例中，还提供了一种计算机可读存储介质，该计算机可读存储介质可以是可读信号介质或者可读存储介质。其上存储有能够实现本公开上述方法的程序产品。在一些可能的实施方式中，本公开的各个方面还可以实现为一种程序产品的形式，其包括程序代码，当所述程序产品在终端设备上运行时，所述程序代码用于使所述终端设备执行本说明书上述“示例性方法”部分中描述的根据本公开各种示例性实施方式的步骤。In an exemplary embodiment of the present disclosure, a computer-readable storage medium is also provided, and the computer-readable storage medium may be a readable signal medium or a readable storage medium. Program products capable of implementing the above methods of the present disclosure are stored thereon. In some possible implementations, various aspects of the present disclosure can also be implemented in the form of a program product, which includes program code. When the program product is run on a terminal device, the program code is used to cause the The terminal device performs the steps according to various exemplary embodiments of the present disclosure described in the above "Example Method" section of this specification.

例如，本公开实施例中的程序产品被处理器执行时实现如下步骤的方法：当机器语音播放时有用户的用户语音输入，停止播放机器语音，将用户语音及机器语音输入至恢复模型，且通过恢复模型确定用户无真实打断机器语音的意图时，获取机器语音在用户语音输入时对应的时间戳信息，根据时间戳信息确定机器语音对应的恢复播放节点，例如将恢复播放节点设置在机器语音在用户语音输入时对应的预设停顿位置，从恢复播放节点播放机器语音。For example, when the program product in the embodiment of the present disclosure is executed by a processor, the method implements the following steps: when the machine voice is played, there is a user voice input from the user, the machine voice is stopped, the user voice and the machine voice are input to the recovery model, and When it is determined through the recovery model that the user has no real intention to interrupt the machine voice, the timestamp information corresponding to the machine voice when the user inputs the voice is obtained, and the recovery playback node corresponding to the machine voice is determined based on the timestamp information. For example, the recovery playback node is set on the machine. The voice plays the machine voice from the resume play node at the preset pause position corresponding to the user's voice input.

本公开中的计算机可读存储介质的更具体的例子可以包括但不限于：具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、随机访问存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑磁盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。More specific examples of computer-readable storage media in this disclosure may include, but are not limited to: electrical connections having one or more wires, portable computer disks, hard drives, random access memory (RAM), read only memory (ROM), Erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above.

在本公开中，计算机可读存储介质可以包括在基带中或者作为载波一部分传播的数据信号，其中承载了可读程序代码。这种传播的数据信号可以采用多种形式，包括但不限于电磁信号、光信号或上述的任意合适的组合。可读信号介质还可以是可读存储介质以外的任何可读介质，该可读介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。In this disclosure, a computer-readable storage medium may include a data signal propagated in baseband or as part of a carrier wave carrying readable program code therein. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above. A readable signal medium may also be any readable medium other than a readable storage medium that can send, propagate, or transport the program for use by or in connection with an instruction execution system, apparatus, or device.

可选地，计算机可读存储介质上包含的程序代码可以用任何适当的介质传输，包括但不限于无线、有线、光缆、RF等等，或者上述的任意合适的组合。Alternatively, program code embodied on a computer-readable storage medium may be transmitted using any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the above.

在具体实施时，可以以一种或多种程序设计语言的任意组合来编写用于执行本公开操作的程序代码，所述程序设计语言包括面向对象的程序设计语言—诸如Java、C++等，还包括常规的过程式程序设计语言—诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算设备上执行、部分地在用户设备上执行、作为一个独立的软件包执行、部分在用户计算设备上部分在远程计算设备上执行、或者完全在远程计算设备或服务器上执行。在涉及远程计算设备的情形中，远程计算设备可以通过任意种类的网络，包括局域网(LAN)或广域网(WAN)，连接到用户计算设备，或者，可以连接到外部计算设备(例如利用因特网服务提供商来通过因特网连接)。When implemented, program code for performing operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, C++, etc., and Includes conventional procedural programming languages—such as "C" or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server execute on. In situations involving remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device, such as provided by an Internet service. (business comes via Internet connection).

应当注意，尽管在上文详细描述中提及了用于动作执行的设备的若干模块或者单元，但是这种划分并非强制性的。实际上，根据本公开的实施方式，上文描述的两个或更多模块或者单元的特征和功能可以在一个模块或者单元中具体化。反之，上文描述的一个模块或者单元的特征和功能可以进一步划分为由多个模块或者单元来具体化。It should be noted that although several modules or units of equipment for action execution are mentioned in the above detailed description, this division is not mandatory. In fact, according to embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided into being embodied by multiple modules or units.

此外，尽管在附图中以特定顺序描述了本公开中方法的各个步骤，但是，这并非要求或者暗示必须按照该特定顺序来执行这些步骤，或是必须执行全部所示的步骤才能实现期望的结果。附加的或备选的，可以省略某些步骤，将多个步骤合并为一个步骤执行，以及/或者将一个步骤分解为多个步骤执行等。Furthermore, although various steps of the methods of the present disclosure are depicted in the drawings in a specific order, this does not require or imply that the steps must be performed in that specific order, or that all of the illustrated steps must be performed to achieve the desired results. result. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step for execution, and/or one step may be decomposed into multiple steps for execution, etc.

通过以上实施方式的描述，本领域的技术人员易于理解，这里描述的示例实施方式可以通过软件实现，也可以通过软件结合必要的硬件的方式来实现。因此，根据本公开实施方式的技术方案可以以软件产品的形式体现出来，该软件产品可以存储在一个非易失性存储介质(可以是CD-ROM，U盘，移动硬盘等)中或网络上，包括若干指令以使得一台计算设备(可以是个人计算机、服务器、移动终端、或者网络设备等)执行根据本公开实施方式的方法。Through the description of the above embodiments, those skilled in the art can easily understand that the example embodiments described here can be implemented by software, or can be implemented by software combined with necessary hardware. Therefore, the technical solution according to the embodiment of the present disclosure can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, U disk, mobile hard disk, etc.) or on the network , including several instructions to cause a computing device (which may be a personal computer, a server, a mobile terminal, a network device, etc.) to execute a method according to an embodiment of the present disclosure.

本领域技术人员在考虑说明书及实践这里公开的发明后，将容易想到本公开的其它实施方案。本公开旨在涵盖本公开的任何变型、用途或者适应性变化，这些变型、用途或者适应性变化遵循本公开的一般性原理并包括本公开未公开的本技术领域中的公知常识或惯用技术手段。说明书和实施例仅被视为示例性的，本公开的真正范围和精神由所附的权利要求指出。Other embodiments of the disclosure will be readily apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. The present disclosure is intended to cover any variations, uses, or adaptations of the disclosure that follow the general principles of the disclosure and include common common sense or customary technical means in the technical field that are not disclosed in the disclosure. . It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims

1. A dialogue processing method, characterized by including:

When there is user voice input from the user when the machine voice is playing, and the user has no real intention to interrupt the machine voice, stop playing the machine voice, and obtain the corresponding value of the machine voice when the user voice is input. timestamp information;

Determine the resumption playback node corresponding to the machine voice according to the timestamp information;

Play the machine voice from the resume play node.

2. The dialogue processing method according to claim 1, characterized in that when there is user voice input from the user when the machine voice is played, and the user has no real intention to interrupt the machine voice, the playback of the machine voice is stopped. Describing the machine voice and obtaining the timestamp information corresponding to the machine voice when the user's voice is input includes:

The user's voice and the machine voice are input to the recovery model, so that the recovery model determines whether the user has a real intention to interrupt the machine voice.

3. The dialogue processing method according to claim 2, characterized in that the training data of the recovery model includes: dialogue audio feature data, dialogue context data, dialogue turn data, and/or dialogue semantic data.

4. The dialogue processing method according to claim 2, characterized in that the user's voice and the machine voice are input to a recovery model, so as to determine whether the user has truly interrupted the process through the recovery model. The intentions of the machine voice include:

The recovery model determines whether the user has a real intention to interrupt the machine voice based on the voice data, the user voice and the machine voice;

Wherein, the voice data includes at least one of the following: guided speech content data, playback progress data of the machine voice, and voice confidence data.

5. The dialogue processing method according to claim 1, characterized in that the resume playback node includes at least one of the following:

The starting position of the machine voice;

The position of the machine voice at the time of the user's voice input;

The corresponding sentence segment position of the machine voice during the user's voice input;

The preset pause position corresponding to the machine voice when the user's voice is input.

6. The dialogue processing method according to claim 1, characterized in that the machine voice includes: voice audio, text corresponding to the voice audio, and/or timestamp information corresponding to the voice audio, wherein the The timestamp information includes: punctuation marks, sentence segmentation positions, and/or preset pause positions.

7. The dialogue processing method according to claim 2, characterized in that when there is user voice input from the user when the machine voice is played, and the user has no real intention to interrupt the machine voice, the playback of the machine voice is stopped. Describing the machine voice and obtaining the timestamp information corresponding to the machine voice when the user's voice is input includes:

When there is user voice input from the user while the machine voice is playing, and the user has no real intention to interrupt the machine voice, the machine voice is immediately stopped playing or the volume of the machine voice is slowly reduced.

8. The dialogue processing method according to claim 1, further comprising:

When the pause of the user's voice exceeds the first time threshold, it indicates that the user's voice input is completed.

9. The dialogue processing method according to claim 1, further comprising:

When the user has a real intention to interrupt the machine voice, a reply voice corresponding to the user's voice is generated.

10. The dialogue processing method according to claim 7 or 8, further comprising: filtering silence data and noise data in the user's voice.

11. A dialogue processing device, characterized in that it includes:

The acquisition module, when there is user voice input from the user when the machine voice is playing, and the user has no real intention to interrupt the machine voice, stops playing the machine voice, and obtains the user voice input of the machine voice. The corresponding timestamp information;

A determination module that determines the resumption playback node corresponding to the machine voice according to the timestamp information;

A recovery module that plays the machine voice from the recovery play node.

12. An electronic device, characterized in that it includes:

processor; and

memory for storing executable instructions for the processor;

Wherein, the processor is configured to execute the dialog processing method according to any one of claims 1 to 10 by executing the executable instructions.

13. A computer-readable storage medium on which a computer program is stored, characterized in that when the computer program is executed by a processor, the dialogue processing method according to any one of claims 1 to 10 is implemented.