WO2024099046A1

WO2024099046A1 - Voice interaction method, server and computer-readable storage medium

Info

Publication number: WO2024099046A1
Application number: PCT/CN2023/125464
Authority: WO
Inventors: 樊骏锋; 宁洪珂; 丁鹏傑; 郭梦雪; 赵群
Original assignee: Guangzhou Xiaopeng Motors Technology Co Ltd
Current assignee: Guangzhou Xiaopeng Motors Technology Co Ltd
Priority date: 2022-11-08
Filing date: 2023-10-19
Publication date: 2024-05-16
Anticipated expiration: 2025-05-08
Also published as: CN115457959A; CN115457959B

Abstract

The present application discloses a voice interaction method, comprising: receiving a voice request forwarded by a vehicle; processing the voice request to extract intent information and slot information of the voice request, and confirming that a target position and/or a target operation object cannot be directly obtained from the semantics; determining the target position and the target operation object of the voice request according to the intent information and the slot information; generating a vehicle control instruction corresponding to the voice request according to the target position and the target operation object; and forwarding the vehicle control instruction to the vehicle to complete the voice interaction.

Description

Voice interaction method, server and computer readable storage medium

This application claims priority to Chinese patent application No. 202211389565.4, filed on November 8, 2022, the entire contents of which are incorporated herein by reference.

Technical Field

The present application relates to the field of in-vehicle voice technology, and in particular to a voice interaction method, a server, and a computer-readable storage medium.

Background

At present, in-vehicle voice technology allows users to interact by voice within the vehicle cabin, for example to control vehicle components or to interact with components in the user interface of the in-vehicle system, such as opening the music player control in that user interface by voice. In actual interaction scenarios, users usually have to phrase voice requests strictly according to prescribed sentence patterns for voice interaction to work. When users use relatively free or everyday expressions, the voice assistant may fail to recognize the voice request, so the voice interaction cannot proceed smoothly, which impairs its fluency and convenience.

Technical Problem

The present application provides a voice interaction method, a server, and a computer-readable storage medium.

Technical Solution

The voice interaction method of the present application includes:

receiving a voice request forwarded by a vehicle;

processing the voice request, extracting intent information and slot information of the voice request, and confirming that a target position and/or a target operation object cannot be directly obtained from the semantics, wherein the intent information includes an action type, and the slot information includes a reference point, relative position information, and/or an operation object;

determining the target position and the target operation object of the voice request according to the intent information and the slot information;

generating a vehicle control instruction corresponding to the voice request according to the target position and the target operation object; and

forwarding the vehicle control instruction to the vehicle to complete the voice interaction.

Thus, in the present application, when a user interacts with the user interface of the in-vehicle system by voice and the server, after extracting the intent information and slot information of the voice request, cannot directly obtain the target position and target operation object from the semantics, the server can still determine them through a series of methods and finally generate a vehicle control instruction. The voice interaction method of the present application can recognize colloquial voice requests and locate the target position and target operation object without requiring multiple rounds of clarification from the user, improving the fluency and convenience of voice interaction.

Determining the target position and the target operation object of the voice request according to the intent information and the slot information includes:

normalizing the reference point in the slot information so as to map the reference point to an absolute position in the vehicle cabin.

In this way, the reference point slot extracted from the voice request can be normalized so that the reference point corresponds to an absolute position in the vehicle cabin, which can later be combined with the relative position information to determine the position range of the target operation object.

The target position is determined according to the absolute position and the relative position information.

In this way, the absolute position in the vehicle cabin corresponding to the reference point can be combined with the relative position information to determine the target position range. The subsequent search for the target operation object is then limited to the target position, making the process more accurate and efficient.

The method further includes:

when the reference point is missing from the slot information, confirming the reference point according to historical dialogue information of the voice request.

In this way, when the reference point information is ambiguous, the server searches the historical dialogue and takes the reference point of the previous voice request as the reference point of the current voice request, making the voice interaction more coherent.

The method further includes:

when the reference point is missing from the slot information, confirming the reference point according to voice zone information of the voice request.

In this way, when the reference point information is missing, the server determines the voice zone of the voice request and uses the voice zone in which the user is located as the reference point, making the voice interaction more coherent.

A candidate operation object is determined according to the relative position information.

In this way, the server can take all objects within the target position determined from the relative position information as candidate operation objects. The subsequent screening for the target operation object is thereby confined to the target position range, which makes the screening step more efficient.

A first screening process is performed on the candidate operation objects according to the operation object in the slot information;

a second screening process is performed, according to the operation type in the intent information, on the candidate operation objects that passed the first screening process, to obtain the target operation object.

In this way, based on the intent information in the user's voice request, candidate operation objects can first be screened out within the target area, and a second screening can then be performed among them to select the operable objects as the target operation objects, so that they can be combined into an instruction that the in-vehicle system can recognize and execute.

The method further includes:

when the operation object is missing from the slot information, determining the operation object according to the voice zone information of the voice request.

In this way, when the operation object information is missing, the server performs fuzzy matching: it determines the voice zone of the voice request, takes the range of the user's voice zone as the position range of the operation object, and determines the operation object accordingly, making the voice interaction more coherent.

Generating the vehicle control instruction corresponding to the voice request according to the target position and the target operation object includes:

determining the operation authority for the target operation object according to state information of the vehicle, the target position, and the target operation object;

generating the vehicle control instruction according to the operation authority.

In this way, the operation authority for the target operation object can be determined from the vehicle state information, the target position, and the target operation object, and the vehicle control instruction can be generated according to the result of the authority check. The voice interaction process and its results are thus better adapted to the driving state of the vehicle, helping to ensure driving safety.

The method further includes:

storing the intent information and slot information of the voice request.

In this way, the intent information and slot information of the voice request can be stored so that, when executing the next round of tasks, the vehicle can retrieve the information from previous rounds and obtain more reliable voice interaction results.

The server of the present application includes a processor and a memory. The memory stores a computer program which, when executed by the processor, implements the above method.

The computer-readable storage medium of the present application stores a computer program which, when executed by one or more processors, implements the above method.

Additional aspects and advantages of the embodiments of the present application will be set forth in part in the following description, and in part will become apparent from that description or be learned through practice of the embodiments of the present application.

Brief Description of the Drawings

The above and/or additional aspects and advantages of the present application will become apparent and readily understood from the following description of the embodiments in conjunction with the accompanying drawings, in which:

FIG. 1 is a first schematic flowchart of the voice interaction method of the present application;

FIG. 2 is a second schematic flowchart of the voice interaction method of the present application;

FIG. 3 is a third schematic flowchart of the voice interaction method of the present application;

FIG. 4 is a fourth schematic flowchart of the voice interaction method of the present application;

FIG. 5 is a fifth schematic flowchart of the voice interaction method of the present application.

Embodiments of the Present Invention

The embodiments of the present application are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals denote the same or similar elements, or elements having the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the embodiments of the present application and shall not be construed as limiting them.

Referring to FIG. 1, FIG. 2, and FIG. 3, the present application provides a voice interaction method, including:

01: receiving a voice request forwarded by a vehicle;

02: processing the voice request, extracting intent information and slot information of the voice request, and confirming that the target position and/or target operation object cannot be directly obtained from the semantics;

03: determining the target position and the target operation object of the voice request according to the intent information and the slot information;

04: generating a vehicle control instruction corresponding to the voice request according to the target position and the target operation object;

05: forwarding the vehicle control instruction to the vehicle to complete the voice interaction.

The present application also provides a server including a memory and a processor. The voice interaction method of the present application can be implemented by this server. Specifically, a computer program is stored in the memory, and the processor is configured to: receive a voice request forwarded by a vehicle; process the voice request, extract its intent information and slot information, and confirm that the target position and/or target operation object cannot be directly obtained from the semantics; determine the target position and target operation object of the voice request according to the intent information and slot information; generate a vehicle control instruction corresponding to the voice request according to the target position and target operation object; and finally forward the vehicle control instruction to the vehicle to complete the voice interaction.

The voice interaction function of the in-vehicle system lets users control the vehicle. At present, it supports voice interaction within the vehicle cabin. In the related art, a voice request can usually be recognized only if the user phrases it strictly according to a prescribed sentence pattern. As shown in FIG. 2, in the scenario of controlling window opening and closing by voice, if the user's voice request conforms to the expected phrasing, such as "open the driver's window", the voice assistant can recognize it accurately: through natural language processing, an intent classification model and a slot extraction model are used to generate a vehicle control instruction with a clearly identified control object. However, when the user issues a relatively free or more everyday voice request, such as "open the window on my left", the request cannot be recognized directly and no corresponding control instruction can be generated. Usually the final target can be confirmed, and the corresponding control instruction generated, only after multiple rounds of clarification by the user; otherwise the user receives feedback such as "I don't understand".

As shown in FIG. 3, for the above scenario, when the user issues a voice request such as "open the window on my left" from the example above, the server, after receiving the request forwarded by the vehicle, extracts the intent information and slot information from it. The intent classification model classifies the content of the voice request and predicts the intent information "open". The intent information here differs from that of traditional natural language understanding models: it has fewer categories and concerns only the user's action, not the object on which the action is performed. Examples of action categories include "open", "close", "click", and "switch".

The slot extraction model extracts the position-locating information from the voice request "open the window on my left", including a reference point slot, a relative position information slot, and/or an operation object slot. The reference point serves as the reference position for interpreting the relative position information and may be, for example, "driver's seat", "back row", or "screen". In real scenarios, users may use more everyday language, in which case natural language processing according to predetermined rules is needed to obtain the absolute position in the vehicle cabin corresponding to the reference point.

Relative position information is the regional position information in the voice request that is described relative to the reference point, such as "left", "right side", or "above".

The operation object is the natural language information in the voice request that describes an in-vehicle component or a user interface component or area capable of performing the action described by the intent information, such as "window" or "volume setting button". Existing natural language understanding models cannot use position information to distinguish between operation objects. For the voice request "open the window on my left", the slot extraction model can extract the following slot information: reference point slot "me", relative position information slot "left", and operation object slot "window". The same applies to controlling voice-interactive elements in the user interface of the in-vehicle system, with requests such as "click the button in the middle of the large screen" and "turn on the function below the navigation settings".
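The split between intent information and slot information described above can be pictured as a small record type. The following is only a minimal sketch: the class name `ParsedRequest`, its field names, and the `missing_slots` helper are hypothetical and do not appear in the application, which does not prescribe any data format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ParsedRequest:
    """Hypothetical container for the output of the intent and slot models."""
    action: str                               # intent: action type only, e.g. "open"
    reference_point: Optional[str] = None     # e.g. "me", "driver's seat"
    relative_position: Optional[str] = None   # e.g. "left"
    operation_object: Optional[str] = None    # e.g. "window"

    def missing_slots(self) -> List[str]:
        """Names of the slots that could not be taken directly from the semantics."""
        slot_names = ("reference_point", "relative_position", "operation_object")
        return [name for name in slot_names if getattr(self, name) is None]


# "open the window on my left": all three slots are present.
full = ParsedRequest("open", "me", "left", "window")
# "open the left window": the reference point is missing.
partial = ParsedRequest("open", relative_position="left", operation_object="window")

print(full.missing_slots())
print(partial.missing_slots())
```

Keeping the action separate from the object mirrors the statement above that the intent categories cover only the action, while the object is carried in the slots.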

Understandably, in actual voice interaction scenarios, users may not supply all four key pieces of information (intent, reference point, relative position, and operation object) completely and accurately. For example, instead of "open the window on my left", a user in the driver's seat may, out of speaking habit, actually say "open the left window" or "open the window", leaving the reference point or relative position unclear. Or, after "open the driver's window" has been carried out, the user in the driver's seat may go on to say "close the one in the back too"; in this request, none of the reference point slot, relative position slot, or operation object slot can be obtained directly from the semantics.

In these scenarios, the in-vehicle system server can use methods such as fuzzy matching, authority identification, and information inheritance to pin down the key information of the voice request, that is, to determine the target position and target operation object. The server then combines the target position and target operation object with the intent information of the voice request to generate a control instruction that the vehicle can recognize, including control instructions for in-vehicle components and for user interface components or areas. Finally, the control instruction is sent to the vehicle, which executes the instructed action.

With the voice interaction method of the present application, even when it is confirmed that the target position and target operation object cannot be determined directly from the semantics of the user's voice request, they can still be obtained, and a control instruction that the vehicle can recognize is generated and sent to the vehicle so that the vehicle can carry out the voice request. The voice assistant can thus handle more everyday, colloquial phrasing in voice requests, giving in-vehicle voice interaction a smoother experience.

In summary, in the present application, when a user interacts with the user interface of the in-vehicle system by voice and the server, after extracting the intent information and slot information of the voice request, cannot directly obtain the target position and target operation object from the semantics, the server can still determine them through a series of methods and finally generate a vehicle control instruction. The voice interaction method of the present application can recognize colloquial voice requests and locate the target position and target operation object without requiring multiple rounds of clarification from the user, improving the fluency and convenience of voice interaction.

Referring to FIG. 4, step 03 includes:

031: normalizing the reference point in the slot information so as to map the reference point to an absolute position in the vehicle cabin.

The processor is configured to normalize the reference point in the slot information so as to map the reference point to an absolute position in the vehicle cabin.

Specifically, when the target position and target operation object cannot be directly obtained from the semantics, the server can normalize the reference point in the extracted slot information; that is, the reference point in the slot information of the user's voice request is entity-normalized to an absolute position in the vehicle cabin according to predetermined semantic rules. The predetermined semantic rules are not limited here.

In one example, when the user's voice request is "open the window on my left", the part that needs normalization is the reference point "me" in the slot information. With "me" as the reference point, the position in the vehicle cabin of the user who issued the request is located by identifying the sound source. For example, if the request is issued by the user in the driver's seat and the reference point slot is "me", the slot value "me" is normalized to the absolute in-vehicle position "driver's seat".
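The normalization in step 031 can be sketched as a table lookup. The application leaves the predetermined semantic rules open, so everything concrete below is an assumption made for illustration: the zone names, the alias table, the position identifiers, the function name `normalize_reference_point`, and the left-hand-drive layout in which the front-left voice zone is the driver's seat.

```python
from typing import Optional

# Assumed mapping from voice zone to seat (left-hand-drive layout, hypothetical names).
ZONE_TO_SEAT = {
    "front_left": "driver_seat",
    "front_right": "front_passenger_seat",
    "rear_left": "rear_left_seat",
    "rear_right": "rear_right_seat",
}

# Assumed aliases for reference points that name a cabin position directly.
ALIAS_TO_POSITION = {
    "driver": "driver_seat",
    "driver's seat": "driver_seat",
    "back row": "rear_row",
    "screen": "center_screen",
}

# Words that refer to the speaker and must be resolved through the sound source.
SPEAKER_WORDS = {"me", "my", "i"}


def normalize_reference_point(reference_point: str, voice_zone: str) -> Optional[str]:
    """Map a spoken reference point to an absolute cabin position, or None if unknown."""
    key = reference_point.strip().lower()
    if key in SPEAKER_WORDS:
        # "me" is located by the voice zone the request came from.
        return ZONE_TO_SEAT.get(voice_zone)
    return ALIAS_TO_POSITION.get(key)


print(normalize_reference_point("me", "front_left"))
```

Returning `None` for an unknown reference point leaves room for the fallbacks described elsewhere in the application, rather than guessing a position.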

Step 03 includes:

032: determining the target position according to the absolute position and the relative position information.

The processor is configured to determine the target position according to the absolute position and the relative position information.

Referring to FIG. 4, specifically, the position range corresponding to the target operation object, i.e., the target position, can be obtained by combining the absolute position of the reference point obtained by normalization with the relative position information. By default, relative position information is expressed as a position in three-dimensional space. When the user's voice request is directed at the user interface of the in-vehicle system, which does not support three-dimensional position information, the expression automatically falls back to two-dimensional position information.

In one example, when the user in the driver's seat issues the voice request "open the window on my left", normalization yields "driver's seat" as the absolute in-vehicle position of the reference point "me". The relative position information "left" is extracted from the voice request. Since the request is not directed at the user interface of the in-vehicle system, the target position range is determined to be the three-dimensional space on the "left side" of the reference point "driver's seat".
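Step 032 combines the normalized reference point with the relative position and chooses between a three-dimensional and a two-dimensional interpretation. A minimal sketch follows; the `TargetRegion` type, the direction alias table, and the `targets_ui` flag are hypothetical names introduced only to illustrate the fallback rule described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetRegion:
    """Hypothetical description of a target position range."""
    anchor: str       # absolute position of the reference point
    direction: str    # canonical direction relative to the anchor
    dimensions: int   # 3 for cabin space, 2 for the user interface


# Assumed canonical forms for spoken relative positions.
DIRECTION_ALIASES = {
    "left": "left",
    "left side": "left",
    "left hand side": "left",
    "right": "right",
    "right side": "right",
    "above": "above",
    "top": "above",
    "below": "below",
}


def resolve_target_region(absolute_position: str, relative_position: str,
                          targets_ui: bool) -> TargetRegion:
    """Build the target region; fall back to 2D when the request addresses the UI."""
    key = relative_position.strip().lower()
    if key not in DIRECTION_ALIASES:
        raise ValueError(f"unsupported relative position: {relative_position!r}")
    dimensions = 2 if targets_ui else 3  # 3D by default, 2D for the user interface
    return TargetRegion(absolute_position, DIRECTION_ALIASES[key], dimensions)


region = resolve_target_region("driver_seat", "left", targets_ui=False)
print(region)
```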

Referring to FIG. 4 and FIG. 5, the method further includes:

07: when the reference point is missing from the slot information, confirming the reference point according to the historical dialogue information of the voice request.

The processor is configured to confirm the reference point according to the historical dialogue information of the voice request when the reference point is missing from the slot information.

Specifically, because users phrase voice requests freely, the reference point may be missing. In a multi-round voice request scenario, for example, the semantics of the previous request can be inherited. Information inheritance applies when the reference point slot extracted from the voice request is ambiguous, containing a pronoun such as "it" or "this" that stands for a reference point already mentioned in an earlier round. In this case, the server searches the historical dialogue and uses it to confirm the reference point to which the ambiguous pronoun refers.

In one example, the central control display is showing a shopping list. In the first round the user says "order product A for me", and in the second round "I also want the one to the left of it". From the second-round request, the server extracts slot information containing the two pronouns "it" and "the one". The search of the historical dialogue shows that the reference point "product A" already appeared in the previous round, so "it" in the second round can be confirmed to refer to "product A" from the previous round. Similarly, the historical dialogue confirms that the purpose of "I also want the one to the left of it" is also to purchase a product, so "the one" refers to the product located to the "left" of "product A" in the shopping list.
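The information inheritance in step 07 amounts to walking back through earlier rounds until a concrete reference point is found. The sketch below assumes the history is stored as a list of per-round dictionaries, oldest first, with a `reference_point` key; that storage format, the pronoun set, and the function name are assumptions, not part of the application.

```python
from typing import Dict, List, Optional

# Assumed set of pronouns that mark an ambiguous reference point.
PRONOUNS = {"it", "this", "that"}


def resolve_reference_from_history(reference_point: Optional[str],
                                   history: List[Dict[str, str]]) -> Optional[str]:
    """Return a concrete reference point, inheriting from earlier rounds if needed."""
    if reference_point is not None and reference_point.lower() not in PRONOUNS:
        return reference_point  # already concrete, nothing to inherit
    # Search the most recent rounds first.
    for turn in reversed(history):
        candidate = turn.get("reference_point")
        if candidate and candidate.lower() not in PRONOUNS:
            return candidate
    return None


# Round 1: "order product A for me"; round 2: "I also want the one to the left of it".
history = [{"action": "click", "reference_point": "product A"}]
print(resolve_reference_from_history("it", history))
```

Searching newest-first matches the example above, where the pronoun is resolved against the immediately preceding round.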

Referring to FIG. 5, the method further includes:

08: when the reference point is missing from the slot information, confirming the reference point according to the voice zone information of the voice request.

The processor is configured to confirm the reference point according to the voice zone information of the voice request when the reference point is missing from the slot information.

Specifically, because users phrase voice requests freely, the reference point may be missing. In this case, the server confirms the reference point according to the voice zone information of the voice request.

In an actual scenario, the user says "open the left window", and the voice request contains no reference point slot for the relative position slot "left". The seat of the user who issued the request is then determined from the voice zone information of the request and used as the reference point. For example, when the request comes from the user in the driver's seat, the reference point is determined to be the driver's seat, and the request can be understood as "open the window to the left of the driver's seat".
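The voice zone fallback in step 08 can be sketched as filling the empty slot from the zone that the request came from. The zone names, seat identifiers, dictionary-based slot format, and function name below are assumptions for illustration, with a left-hand-drive layout assumed for the driver's seat.

```python
from typing import Dict, Optional

# Assumed mapping from voice zone to seat (left-hand-drive layout, hypothetical names).
ZONE_TO_SEAT = {
    "front_left": "driver_seat",
    "front_right": "front_passenger_seat",
    "rear_left": "rear_left_seat",
    "rear_right": "rear_right_seat",
}


def fill_reference_from_voice_zone(slots: Dict[str, Optional[str]],
                                   voice_zone: str) -> Dict[str, Optional[str]]:
    """Return a copy of the slots with a missing reference point taken from the zone."""
    filled = dict(slots)  # do not mutate the caller's slots
    if not filled.get("reference_point"):
        filled["reference_point"] = ZONE_TO_SEAT.get(voice_zone)
    return filled


# "open the left window" spoken from the driver's seat.
slots = {"reference_point": None, "relative_position": "left", "operation_object": "window"}
print(fill_reference_from_voice_zone(slots, "front_left"))
```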

Referring to FIG. 4, step 03 includes:

033: determining candidate operation objects according to the relative position information.

The processor is configured to determine the candidate operation objects according to the relative position information.

Specifically, within the three-dimensional region indicated by default by the relative position information, all operable objects capable of performing the action intended by the voice request are searched for and taken as candidate operation objects. Relative position information is expressed as a position in three-dimensional space by default. When the user's voice request targets the user interface of the in-vehicle system, which does not support three-dimensional position information, the expression is automatically reduced to two-dimensional position information.

In one example, the relative position information in the slot information of the voice request is "on the left-hand side", so the operable objects within the range to the left of the reference point are selected as candidate operation objects. If the reference point is the driver's seat, which is not a button on the user interface, the three-dimensional space to the left of the driver's seat is determined as the target position, and all operable objects within the target position are selected as candidate operation objects. If the reference point is a button in the user interface, the planar region to the left of that button is determined as the target position, and all operable objects within it are selected as candidate operation objects.
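The candidate search can be sketched as below, under stated assumptions: every operable object carries a coordinate tuple (three values in the cabin, two on the user interface), the first coordinate grows to the right, and "left of" means a smaller first coordinate. The `Operable` type and all coordinates are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Operable:
    name: str
    position: Tuple[float, ...]  # (x, y, z) in the cabin, (x, y) on the UI
    on_ui: bool


def candidates_left_of(reference: Operable, objects: List[Operable]) -> List[Operable]:
    """Select objects in the same space as `reference` that lie to its left."""
    return [
        o for o in objects
        if o.on_ui == reference.on_ui          # 3D cabin vs. 2D user interface
        and o != reference
        and o.position[0] < reference.position[0]
    ]


driver_seat = Operable("driver's seat", (0.4, 1.0, 0.5), on_ui=False)
left_window = Operable("front left window", (0.0, 1.0, 0.8), on_ui=False)
right_window = Operable("front right window", (1.6, 1.0, 0.8), on_ui=False)
play_button = Operable("play button", (200.0, 50.0), on_ui=True)
prev_button = Operable("previous button", (100.0, 50.0), on_ui=True)
objects = [driver_seat, left_window, right_window, play_button, prev_button]

cabin_candidates = [o.name for o in candidates_left_of(driver_seat, objects)]
ui_candidates = [o.name for o in candidates_left_of(play_button, objects)]
```

Keeping cabin objects and interface controls in separate coordinate spaces mirrors the fall-back from three-dimensional to two-dimensional position information described above.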

Referring to FIG. 4, step 03 further includes:

034: performing a first screening process on the candidate operation objects according to the operation object in the slot information;

035: performing a second screening process on the candidate operation objects that have passed the first screening process, according to the operation type in the intent information, to obtain the target operation object.

The processor is configured to perform the first screening process on the candidate operation objects according to the operation object in the slot information, and to perform the second screening process on the candidate operation objects that have passed the first screening process, according to the operation type in the intent information, to obtain the target operation object.

Specifically, after the server obtains all candidate operation objects selected within the target position, it may perform the first screening process on them according to the operation object information in the slot information of the voice request. The first screening process uses a semantic similarity model to obtain several candidate operation objects with relatively high similarity; for example, the ten candidates with the highest semantic similarity may be selected. The number of candidates retained by the first screening process may be any number up to the total number of candidate operation objects and is not limited here.

Further, according to the intent information of the voice request, the second screening process may be performed on the high-similarity candidates retained by the first screening process, finally yielding the target operation object. The second screening process selects, according to the intent information of the voice request, an operation object that is capable of carrying out the intent of the voice request as the final target operation object.

For example, the intent "open" can be applied to operation objects such as "window", whereas the intent "switch" cannot be applied to "window". The finally determined target operation object is therefore one that is capable of carrying out the intent of the voice request.
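The two screening stages can be sketched as follows. Note the substitution: the application calls for a semantic similarity model, whereas this sketch uses the character-level ratio from Python's `difflib.SequenceMatcher` purely as a stand-in. The `CAPABILITIES` table and the object names are likewise hypothetical.

```python
from difflib import SequenceMatcher
from typing import List

# Hypothetical table of the actions each operable object supports.
CAPABILITIES = {
    "front left window": {"open", "close"},
    "front left window lock": {"lock", "unlock"},
    "audio source": {"switch"},
}


def similarity(a: str, b: str) -> float:
    """Stand-in for a semantic similarity model (string similarity only)."""
    return SequenceMatcher(None, a, b).ratio()


def first_screening(candidates: List[str], slot_object: str, top_k: int = 10) -> List[str]:
    """Keep the `top_k` candidates most similar to the operation object slot."""
    ranked = sorted(candidates, key=lambda c: similarity(c, slot_object), reverse=True)
    return ranked[:top_k]


def second_screening(candidates: List[str], action: str) -> List[str]:
    """Keep only candidates able to carry out the requested action."""
    return [c for c in candidates if action in CAPABILITIES.get(c, set())]


candidates = ["front left window", "front left window lock", "audio source"]
shortlist = first_screening(candidates, "window", top_k=2)
targets_open = second_screening(shortlist, "open")
targets_switch = second_screening(shortlist, "switch")
```

With this table, "open" leaves the window as the only target, while "switch" leaves none, matching the observation that "switch" cannot be applied to a window.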

09: when the operation object is missing from the slot information, determining the operation object according to the voice zone information of the voice request.

The processor is configured to determine the operation object according to the voice zone information of the voice request when the operation object is missing from the slot information.

Specifically, owing to the randomness of spoken input, the operation object information may be missing from a user's voice request. In that case, the server uses fuzzy matching: it locates the voice zone from which the request originated, identifies the user's position, and takes that position as the target position.

In one example, the user inputs the voice request "play a movie", and the slot information of the request contains no operation object. Because the vehicle has both a front central control display and a rear central control display, the server can determine the position range of the operation object by identifying the voice zone from which the request was issued. For example, if the request was issued by the driver, the operation object is determined to be the front central control display.
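A minimal sketch of this rule follows. The zone labels, the `FRONT_ZONES` set, and the slot keys are hypothetical; the application only states that the voice zone determines which display is meant.

```python
# Hypothetical labels for the voice zones located in the front row.
FRONT_ZONES = {"driver_zone", "front_passenger_zone"}


def default_display(voice_zone: str) -> str:
    """Pick the display in the same row as the speaker."""
    if voice_zone in FRONT_ZONES:
        return "front central control display"
    return "rear central control display"


def fill_operation_object(slots: dict, voice_zone: str) -> dict:
    """Return a copy of `slots` with the operation object filled in if missing."""
    filled = dict(slots)
    if not filled.get("operation_object"):
        filled["operation_object"] = default_display(voice_zone)
    return filled


# "Play a movie" carries no operation object slot.
from_driver = fill_operation_object({}, "driver_zone")
from_rear = fill_operation_object({}, "rear_left_zone")
```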

Step 04 includes:

041: determining the operation authority over the target operation object according to the vehicle's status information, the target position, and the target operation object;

042: generating the vehicle control instruction according to the operation authority.

The processor is configured to determine the operation authority over the target operation object according to the vehicle's status information, the target position, and the target operation object, and to generate the vehicle control instruction according to the operation authority.

The vehicle's status information describes the state the vehicle is in, including its current gear. For example, some automatic-transmission vehicles have a park gear.

The operation authority over the target operation object reflects the fact that the functions of some vehicle components may be restricted by the vehicle's state. For example, to ensure safety while driving, some entertainment-related interactive functions associated with the driver's seat are restricted when the vehicle is in motion.

Specifically, in one example, if the user's voice request is "play a movie" and its intent information is "play movie", the target position is determined to be the front-row or rear-row user interface of the in-vehicle system, and the target operation object is the control in that user interface that controls video playback. Further, because video playback in the in-vehicle system may pose a safety hazard to a moving vehicle, once the video playback control is determined to be the target operation object, a permission identification method is applied and the preset permission restriction is triggered.

In some examples, the permission restriction may work as follows. When the vehicle is in motion, that is, not in the park gear, and a "play a movie" voice request is issued from the front row, the target operation object on which the video is to be opened and played is determined to be the front central control display, which has the video playback control function. The safe-driving restriction then takes effect, and a voice message or a text pop-up in the user interface can be presented before the vehicle control instruction is generated, to remind the user to drive safely. When the vehicle is in the park gear, no safety warning is shown and the vehicle control instruction is generated. If the voice zone information indicates that the user who issued the "play a movie" request is in the rear row, the vehicle state need not be checked, and the rear central control display is controlled to play the movie directly.
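The restriction described above can be sketched as a small decision function. Several details are assumptions: the gear is encoded as a string with "P" for park, the zone labels are hypothetical, and the reminder text is invented for illustration. The text leaves open whether playback proceeds after the reminder; this sketch always returns the target display and simply attaches a reminder when the restriction applies.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical labels for the voice zones located in the rear row.
REAR_ZONES = {"rear_left_zone", "rear_right_zone"}


@dataclass
class PlaybackDecision:
    display: str             # target operation object
    reminder: Optional[str]  # safe-driving reminder, if one must be shown first


def decide_movie_playback(gear: str, voice_zone: str) -> PlaybackDecision:
    """Apply the gear-based restriction to a "play a movie" request."""
    if voice_zone in REAR_ZONES:
        # Rear-row requests skip the vehicle state check entirely.
        return PlaybackDecision("rear central control display", None)
    if gear != "P":
        # Front-row request while the vehicle is not in park.
        return PlaybackDecision("front central control display", "Please drive safely.")
    return PlaybackDecision("front central control display", None)


driving_front = decide_movie_playback("D", "driver_zone")
parked_front = decide_movie_playback("P", "driver_zone")
driving_rear = decide_movie_playback("D", "rear_left_zone")
```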

Referring to FIG. 4, the method further includes:

storing the intent information and the slot information of the voice request.

The processor is configured to store the intent information and the slot information of the voice request.

Specifically, the span from the moment the user inputs a voice request, through the series of voice processing steps, until the vehicle receives a recognizable control instruction and finishes executing the action, is called one round of dialogue. At the end of a round of dialogue, the server may store the intent information and slot information of the voice request as part of the historical rounds, so that the next round of voice interaction can draw on these historical results.
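One possible way to keep such per-round records is sketched below. The `DialogueHistory` class, its bounded length, and the slot keys are hypothetical; the application does not say how many rounds are retained or how they are stored.

```python
from collections import deque
from typing import Optional


class DialogueHistory:
    """Bounded store of the intent and slot information of past rounds."""

    def __init__(self, max_rounds: int = 5) -> None:
        # Oldest rounds are dropped automatically once the limit is reached.
        self._rounds = deque(maxlen=max_rounds)

    def store(self, intent: str, slots: dict) -> None:
        """Record one finished round of dialogue."""
        self._rounds.append({"intent": intent, "slots": dict(slots)})

    def latest(self, slot_name: str) -> Optional[str]:
        """Return the most recent value recorded for `slot_name`, if any."""
        for round_ in reversed(self._rounds):
            if slot_name in round_["slots"]:
                return round_["slots"][slot_name]
        return None


history_store = DialogueHistory(max_rounds=2)
history_store.store("purchase", {"operation_object": "product A"})
history_store.store("open", {"relative_position": "left"})
last_object = history_store.latest("operation_object")
```

A later round that lacks a reference point can query such a store before falling back to the voice zone information.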

The computer-readable storage medium of the present application stores a computer program which, when executed by one or more processors, implements the method described above.

In the description of this specification, descriptions using terms such as "above", "specifically", "further", and "understandably" mean that the specific features, structures, materials, or characteristics described in connection with an embodiment or example are included in at least one embodiment or example of the present application. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, provided they do not contradict one another, those skilled in the art may combine the different embodiments or examples described in this specification, as well as the features of different embodiments or examples.

Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing the steps of a specific logical function or process. The scope of the preferred embodiments of the present application includes additional implementations in which functions may be performed out of the order shown or discussed, including substantially simultaneously or in the reverse order depending on the functions involved, as will be understood by those skilled in the art to which the embodiments of the present application belong.

Although embodiments of the present application have been shown and described above, it should be understood that these embodiments are exemplary and are not to be construed as limiting the present application. Those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present application.

Claims

A voice interaction method, wherein the method comprises:

receiving a voice request forwarded by a vehicle;

processing the voice request, extracting intent information and slot information of the voice request, and confirming that a target position and/or a target operation object cannot be directly obtained according to semantics, wherein the intent information includes an action type, and the slot information includes a reference point, relative position information, and/or an operation object;

determining a target position and a target operation object of the voice request according to the intent information and the slot information;

generating a vehicle control instruction corresponding to the voice request according to the target position and the target operation object; and

forwarding the vehicle control instruction to the vehicle to complete the voice interaction.

The voice interaction method according to claim 1, wherein determining the target position and the target operation object of the voice request according to the intent information and the slot information comprises:

normalizing the reference point in the slot information so that the reference point corresponds to an absolute position in the vehicle cabin.

The voice interaction method according to claim 2, wherein determining the target position and the target operation object of the voice request according to the intent information and the slot information comprises:

determining the target position according to the absolute position and the relative position information.

The voice interaction method according to claim 2, wherein the method further comprises:

when the reference point is missing from the slot information, confirming the reference point according to historical dialogue information of the voice request.

In the case where the reference point is missing in the slot information, the reference point is confirmed according to the voice zone information of the voice request.

A candidate operation object is determined according to the relative position information.

The voice interaction method according to claim 6, wherein determining the target position and the target operation object of the voice request according to the intent information and the slot information comprises:

performing a first screening process on the candidate operation objects according to the operation object in the slot information;

performing a second screening process on the candidate operation objects that have passed the first screening process, according to the operation type in the intent information, to obtain the target operation object.

The voice interaction method according to claim 7, wherein the method further comprises:

when the operation object is missing from the slot information, determining the operation object according to the voice zone information of the voice request.

The voice interaction method according to claim 1, wherein the step of generating a vehicle control instruction corresponding to the voice request according to the target position and the target operation object comprises:

determining an operation authority over the target operation object according to state information of the vehicle, the target position, and the target operation object;

generating the vehicle control instruction according to the operation authority.

The voice interaction method according to claim 1, wherein the method further comprises:

storing the intent information and the slot information of the voice request.

A server, wherein the server comprises a memory and a processor, wherein a computer program is stored in the memory, and when the computer program is executed by the processor, the method described in any one of claims 1 to 10 is implemented.

A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and when the computer program is executed by one or more processors, the method according to any one of claims 1 to 10 is implemented.