
CN117894308A - Voice interaction method and electronic equipment - Google Patents

Voice interaction method and electronic equipment

Info

Publication number
CN117894308A
CN117894308A (application CN202311861406.4A)
Authority
CN
China
Prior art keywords
text
voice
event
control
elements
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311861406.4A
Other languages
Chinese (zh)
Inventor
韦力诚
赵敬霄
张宁
杨竟成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Unicom Smart Connection Technology Ltd
Original Assignee
China Unicom Smart Connection Technology Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Unicom Smart Connection Technology Ltd filed Critical China Unicom Smart Connection Technology Ltd
Priority to CN202311861406.4A
Publication of CN117894308A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 - Sound input; Sound output
    • G06F 3/167 - Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 - Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

An embodiment of the application provides a voice interaction method and an electronic device. The method is applied to the electronic device and comprises the following steps: labeling text labels corresponding to event elements according to the binding relation between the event elements and text information on a control interface; acquiring a control voice; analyzing the control voice to obtain a voice text; determining the event element corresponding to the voice text based on the text labels; and performing, according to the voice text, a control operation on the event element corresponding to the voice text. Because the elements in the control interface are labeled before voice control, the method improves the accuracy of page element recognition during voice control and the efficiency of matching repeated page elements.

Description

Voice interaction method and electronic equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a voice interaction method and an electronic device.
Background
With the development of the intelligent cockpit, the central control screen of the in-vehicle infotainment (In-Vehicle Infotainment, IVI) system has become increasingly common and increasingly large. Meanwhile, the infotainment applications of the intelligent cockpit keep growing richer, and various in-vehicle functions can be controlled by touching the central control screen, providing users with a more intelligent and convenient operating experience.
Voice control keeps both hands free and improves driving safety, so the human-machine interface (Human Machine Interface, HMI) of the IVI provides a voice interaction function. "Visible and speakable" voice interaction has become one of the core functions of the HMI: a user can control the visible elements of a cockpit page purely through voice instructions, giving the user a more intelligent, convenient and safe driving experience.
Recognition of visible elements usually relies on the system's accessibility service to monitor the screen in real time and parse the text information of the current page elements; a received voice instruction is matched against the page elements, a simulated click is triggered if the match succeeds, and other voice flows continue if it fails. This requires not only accurately recognizing the text information of the visible elements but also correctly obtaining the triggerable events.
However, because the HMI content is rich, many functions and applications are provided by third parties, and a growing number of operational services and personalized ecosystem applications are brought onboard quickly through applets or hypertext markup language (Hyper Text Markup Language, HTML) web pages. Voice control that relies only on the IVI's own services has limitations: the page elements of HTML web pages and applets cannot be fully restored; the player, the most important element in music and video web pages, cannot be recognized or controlled; and for combined elements, the text information sometimes does not match the clickable attribute, so control cannot be simulated. As a result, the coverage and control efficiency of "visible and speakable" struggle to meet requirements.
Disclosure of Invention
To improve the coverage and control efficiency of voice control, the present application provides a voice interaction method and an electronic device, and also provides a computer-readable storage medium.
The embodiment of the application adopts the following technical scheme:
In a first aspect, the present application provides a voice interaction method, where the method is applied to an electronic device and includes:
labeling text labels corresponding to event elements according to the binding relation between the event elements and text information on a control interface;
acquiring a control voice;
analyzing the control voice to obtain a voice text;
determining the event element corresponding to the voice text based on the text labels;
and performing, according to the voice text, a control operation on the event element corresponding to the voice text.
Because the elements in the control interface are labeled before voice control, the method improves the accuracy of page element recognition during voice control and the efficiency of matching repeated page elements.
In one implementation manner of the first aspect, the method further includes:
generating a first interface according to a control interface, and displaying the first interface, wherein the display content of the first interface is the same as that of the control interface, and the elements in the first interface do not trigger the execution of events corresponding to the control interface;
and recognizing a click operation performed by the user on an element of the first interface, and acquiring the binding relation according to the element clicked by the user.
In one implementation manner of the first aspect, the method further includes:
and adding an auxiliary label to the text label.
In one implementation manner of the first aspect, the auxiliary tag includes: the generalized word segmentation of the text label and/or the auxiliary pinyin of the text label.
In an implementation manner of the first aspect, the determining an event element corresponding to the voice text includes:
determining a first target element according to the voice text match;
determining a downward sub-element of the first target element under the condition that the first target element cannot process an event corresponding to the voice text;
judging whether the sub-element can process the event corresponding to the voice text and the sub-element does not contain text information;
and determining the sub-element as an event element corresponding to the voice text under the condition that the sub-element can process the event corresponding to the voice text and the sub-element does not contain text information.
In an implementation manner of the first aspect, the determining an event element corresponding to the voice text further includes:
determining a parent container upwards of the first target element under the condition that the child element cannot process an event corresponding to the voice text and/or the child element contains text information;
judging whether the father container can process the event corresponding to the voice text and the father container does not contain text information;
and determining the parent container as an event element corresponding to the voice text under the condition that the parent container can process the event corresponding to the voice text and the parent container does not contain text information.
In an implementation manner of the first aspect, the determining an event element corresponding to the voice text further includes:
determining a second target element which does not contain text information and is downward by the parent container under the condition that the parent container can not process the event corresponding to the voice text and/or the parent container contains text information;
judging whether the second target element can process the event corresponding to the voice text or not;
and under the condition that the second target element can process the event corresponding to the voice text, determining that the second target element is the event element corresponding to the voice text.
In one implementation manner of the first aspect, the method further includes:
starting a player voice control module under the condition that the control object corresponding to the voice text is a player;
according to the voice text, performing control operation on the player voice control module, and controlling the player through the player voice control module, wherein:
aiming at the application of the original player, the player voice control module realizes play control by simulating a media session control event and preempting an audio focus;
and/or,
for a hypertext markup language web page player, the player voice control module obtains the web page audio and video elements by executing a command script, and adapts the play and pause operations of the web page audio and video elements.
In a second aspect, the present application provides an electronic device comprising a memory for storing computer program instructions and a processor for executing the computer program instructions, wherein the computer program instructions, when executed by the processor, trigger the electronic device to perform the method steps as described in the first aspect.
In a third aspect, the present application provides a computer readable storage medium having a computer program stored therein, which when run on a computer causes the computer to perform the method according to the first aspect.
Drawings
FIG. 1 is a schematic diagram of a control interface according to an embodiment of the present application;
FIG. 2 is a flow chart of a voice interaction method according to an embodiment of the present application;
FIG. 3 is a partial process flow diagram according to an embodiment of the present application;
FIG. 4 is a partial process flow diagram according to an embodiment of the present application;
FIG. 5 is a flow chart of a method of voice interaction according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a voice interaction device according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
To make the purposes, technical solutions and advantages of the present application clearer, the technical solutions of the present application will be described clearly and completely below with reference to specific embodiments of the present application and the corresponding drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art without creative effort based on the present disclosure fall within the scope of protection of the present disclosure.
The terminology used in the description section of the present application is for the purpose of describing particular embodiments of the present application only and is not intended to be limiting of the present application.
As third-party applications, applets, HTML web pages and other applications with different functions are continuously brought onboard, there are more and more cases in which the text information and the event elements of combined elements in a display page (also called a control panel) do not match.
In embodiments of the present description, elements in a page may be divided into event elements and non-event elements, depending on whether the elements may trigger event execution. Event elements may also be referred to as clickable elements, which refer to elements on a control interface that a user can manipulate and trigger a corresponding event. Such as clickable text, buttons, switches, arrows, etc. Non-event elements refer to elements on the control interface that cannot be manipulated by a user and that cannot trigger execution of an event. Such as text, pictures, tables, etc. for content presentation only.
In the embodiment of the present specification, the information in a page may be divided into text information and non-text information according to the content presented. Text information refers to recognizable text (e.g., words). Non-text information refers to information other than text information (e.g., lines, pictures).
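For illustration only (not part of the claimed method), the following minimal Kotlin sketch models this classification; the type and field names (UiElement, triggersEvent, textInfo) are assumptions introduced here.

```kotlin
// Minimal sketch of the element/information classification described above.
// All names here are illustrative assumptions, not part of the embodiment.
data class UiElement(
    val id: Int,                 // reference number on the control interface, e.g. 101
    val triggersEvent: Boolean,  // true -> event element (clickable), false -> non-event element
    val textInfo: String? = null // text information carried by the element, if any
)

fun isEventElement(e: UiElement) = e.triggersEvent
fun hasTextInfo(e: UiElement) = !e.textInfo.isNullOrEmpty()

fun main() {
    val arrow = UiElement(id = 101, triggersEvent = true)                        // event element, no text
    val label = UiElement(id = 111, triggersEvent = false, textInfo = "text 1")  // non-event text information
    val button = UiElement(id = 104, triggersEvent = true, textInfo = "low")     // event element with text
    println(isEventElement(arrow) && !hasTextInfo(arrow))   // true: such elements need a text label for voice control
    println(isEventElement(label))                          // false: content-only text
    println(hasTextInfo(button))                            // true
}
```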
An event element may contain text information. For example, a certain control panel contains a button A that triggers event execution when clicked, and button A carries the text information "OK". Button A is therefore an event element containing the text information "OK".
For the case where an event element contains text information, during voice control the text information indicating the control object is obtained by parsing the voice content, and the event element serving as the control object is then determined by matching that text information to the text information contained in the event elements.
For example, by parsing the voice content "click the OK button", the text information ("OK") contained on button A can be matched, so the control object of the current voice operation is determined to be button A.
However, in some control panels, the event elements do not include text information, and therefore, in the case of voice control, the event elements to be controlled cannot be located directly by parsing text information corresponding to voice content.
For example, FIG. 1 is a schematic diagram of a control interface according to an embodiment of the present application.
As shown in fig. 1, the control interface contains the elements arrow 101, switch 102 and switch 103. The arrow 101, the switch 102 and the switch 103 each trigger execution of a corresponding event when clicked, so the arrow 101, the switch 102 and the switch 103 are event elements.
The arrow 101, the switch 102 and the switch 103 do not contain text information. During voice control, text information cannot simply be matched directly to the arrow 101, the switch 102 or the switch 103, so these elements cannot be operated by voice instructions.
In an embodiment, an event element that does not contain a text information element may be voice controlled by describing its shape (e.g., arrow, switch) or performing an action (e.g., enter, turn on, turn off). However, when the control interface includes a plurality of event elements having the same appearance, one event element cannot be located by describing the shape or performing the action.
For example, for the control interface shown in fig. 1, control of the arrow 101 may be performed with the voice instruction "click the arrow". However, because the switch 102 and the switch 103 look exactly the same, when control is attempted with the voice instruction "click the switch", it is impossible to distinguish whether the control object is the switch 102 or the switch 103.
Further, for an event element containing text information, voice control can be performed by the text information contained therein, but in some control panels, there is a case where a plurality of event elements contain the same text information.
For example, as shown in fig. 1, the control panel further includes buttons 104, 105, 106, 107, 108 and 109, which contain the text information "low" 114, "medium" 115, "high" 116, "low" 117, "medium" 118 and "high" 119, respectively. The text information contained in the buttons 104, 105 and 106 is therefore the same as that contained in the buttons 107, 108 and 109, respectively. When control is attempted with the voice instruction "click the high button", it is impossible to distinguish whether the control object is the button 106 or the button 109.
In order to accurately and rapidly locate an event element as a control object in a voice control process, an embodiment of the application provides a voice interaction method, in which accuracy of page element identification in voice control is improved through test calibration, and matching efficiency of multiple repeated elements of a page in voice control is improved.
Fig. 2 is a flowchart of a voice interaction method according to an embodiment of the present application.
The electronic device performs the method flow shown in fig. 2 to achieve test calibration.
S200, labeling elements in the control interface, including labeling text labels corresponding to event elements according to the binding relation between the event elements and text information.
The binding relation between the event element and the text information comprises a binding relation between the event element and the text information contained in the event element and a binding relation between the event element and the text information outside the event element.
The text labels corresponding to the event elements can be all texts of the text information bound by the event elements; or part of text of the text information bound by the event element; or text generated from text of the text information bound to the event element.
For example, as shown in FIG. 1, the control interface also contains the text information "text 1" 111, "text 2" 112 and "text 3" 113. The text information "text 1" 111, "text 2" 112 and "text 3" 113 is text information outside the event elements arrow 101, switch 102, switch 103 and buttons 104 to 109.
Labeling for event elements of the control interface shown in fig. 1:
binding the text information "text 1" 111 to the arrow 101, and labeling the text label corresponding to the arrow 101 as "text 1";
binding the text information "text 2" 112 to the switch 102, and labeling the text label corresponding to the switch 102 as "text 2";
binding the text information "text 3" 113 to the switch 103, and labeling the text label corresponding to the switch 103 as "text 3";
binding the text information of the event element buttons 104, 105 and 106 with "text 2" 112, and labeling the text labels corresponding to the buttons 104, 105 and 106 as "text 2 low", "text 2 medium" and "text 2 high", respectively;
binding the text information of the event element buttons 107, 108 and 109 with "text 3" 113, and labeling the text labels corresponding to the buttons 107, 108 and 109 as "text 3 low", "text 3 medium" and "text 3 high", respectively.
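For illustration only, the bindings listed above could be stored as a simple annotation table mapping each event element to its text label. This Kotlin sketch is an assumption about one possible data layout, not a format prescribed by the embodiment.

```kotlin
// Sketch: text labels annotated for the event elements of FIG. 1.
// Element ids and label strings follow the example above; the map structure is an assumption.
val textLabels: Map<Int, String> = mapOf(
    101 to "text 1",        // arrow 101 bound to "text 1" 111
    102 to "text 2",        // switch 102 bound to "text 2" 112
    103 to "text 3",        // switch 103 bound to "text 3" 113
    104 to "text 2 low",    // buttons 104-106 combine "text 2" with their own text
    105 to "text 2 medium",
    106 to "text 2 high",
    107 to "text 3 low",    // buttons 107-109 combine "text 3" with their own text
    108 to "text 3 medium",
    109 to "text 3 high",
)
```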
S210, acquiring control voice.
S220, analyzing the control voice to obtain a voice text.
S230, determining the event element corresponding to the voice text based on the text label corresponding to the event element.
S240, according to the voice text, performing control operation on the event element corresponding to the voice text.
For example, the voice text of the control voice input by the user is "click text 1 to enter the next page". In S230, because the text label corresponding to the arrow 101 is "text 1", matching "text 1" in the voice text to the arrow 101 determines that the control object of the current voice operation is the arrow 101. In S240, according to "click ... to enter the next page" in the voice text, a control operation is performed on the arrow 101, namely clicking the arrow 101 to enter the next page.
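For illustration only, the following Kotlin sketch shows one possible way to implement S230 and S240 against such an annotation table. The map textLabels is the sketch introduced above, and performClick stands in for whatever simulated-click mechanism the platform provides; both are assumptions.

```kotlin
// Sketch of S230/S240: resolve the voice text to a labeled event element and act on it.
fun handleVoiceText(voiceText: String, textLabels: Map<Int, String>, performClick: (Int) -> Unit) {
    // S230: pick the labeled element whose text label appears in the voice text,
    // preferring the longest (most specific) match, e.g. "text 2 high" over "text 2".
    val target = textLabels.entries
        .filter { voiceText.contains(it.value) }
        .maxByOrNull { it.value.length }
    if (target != null) {
        performClick(target.key)   // S240: control operation on the matched event element
    } else {
        // no labeled element matches; fall back to other voice flows
    }
}
```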
Because the elements in the control interface are labeled before voice control, the method improves the accuracy of page element recognition during voice control and the efficiency of matching repeated page elements.
Further, in practical application scenarios, when the control voice is parsed, the pronunciation characteristics of the user (for example, dialect and accent) and the way the user segments words when speaking affect the parsing result. In addition, because driving scenarios are complex and changeable, the actual network and data-traffic conditions of the vehicle head unit are subject to various uncertainties, and accurate interactive control of the "visible and speakable" function needs to be achieved under a weak network or with offline voice recognition.
In order to accurately and rapidly locate the event element as the control object in the voice control process, in one implementation manner of S200, the method further includes adding an auxiliary label to the text label corresponding to the event element, so that the text label combined with the auxiliary label is more in line with the actual language habit of the user. For example, the auxiliary label comprises a special pronunciation mode of the dialect used by the user, and the text label is more in line with the pronunciation habit of the dialect used by the user after being combined with the auxiliary label. For another example, the auxiliary label comprises word segmentation habit of the user, and after the text label is combined with the auxiliary label, the auxiliary label is more in line with actual language habit of the user.
Specifically, in one implementation of S230, the auxiliary label includes generalized word segmentation and/or auxiliary pinyin of the text label, so that the text label combined with the auxiliary label better conforms to the actual language habit of the user. Further, in S230, a bi-directional matching method is adopted to match the voice text and the text label corresponding to the event element in combination with the text label corresponding to the event element, the generalized word segmentation of the text label and the auxiliary pinyin, so as to determine the event element targeted by voice control, thereby improving the matching accuracy and avoiding the matching error caused by inaccurate accents, sentence breaks and the like of the user.
Specifically, in one embodiment, the auxiliary pinyin is stored in a local hotword store, and the hotword store is designed mainly to handle ambiguous words and polyphonic characters, so as to avoid recognition deviations caused by users' different accents and sentence breaks.
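For illustration only, the following Kotlin sketch shows how a text label, its generalized segments and its auxiliary pinyin might be kept together and used in a tolerant match. The field names and the matching rule are assumptions; a real implementation would draw the pinyin from the local hotword store described above.

```kotlin
// Sketch of auxiliary labels: each text label carries generalized segments and an
// auxiliary pinyin string so matching tolerates accents and segmentation differences.
data class LabeledElement(
    val elementId: Int,
    val textLabel: String,
    val segments: List<String>,   // generalized word segmentation of the label
    val pinyin: String            // auxiliary pinyin of the label
)

fun matches(voiceText: String, voicePinyin: String, label: LabeledElement): Boolean {
    // tolerant matching: exact label, any generalized segment, or pinyin fallback
    return voiceText.contains(label.textLabel) ||
        label.segments.any { voiceText.contains(it) } ||
        voicePinyin.contains(label.pinyin)
}
```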
Further, in practical application scenarios, different event elements may have different triggering manners. For example, the triggering manner of a button or a switch is clicking, while the triggering manner of a slider is sliding. In order to correctly control the event element serving as the control object during voice control, in one implementation of S200, labeling the elements in the control interface further includes labeling the triggering manner corresponding to each event element. Further, in S240, the control operation is performed on the event element corresponding to the control object according to the operation content, in combination with the previously labeled triggering manner of that event element.
Further, in the practical application scenario, some elements on the control interface have triggerability, for example, an event element (button) is clicked to trigger the execution of an event. However, some elements on the control interface are not triggerable, e.g., clicking on text, picture elements that are independent of the event elements does not trigger event execution.
During voice control, the control object is directed only at elements having triggerability (event elements). Therefore, in order to distinguish controllable event elements from other, non-controllable elements during voice control and thus correctly control the event element serving as the control object, in one implementation of S200, labeling the elements in the control interface further includes labeling the triggerable attribute of the event elements. Further, in S230, whether an element is an event element is determined based on whether the element is annotated with the triggerable attribute.
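For illustration only, the following Kotlin sketch extends the annotation with the triggerable attribute and the triggering manner and dispatches the control operation accordingly. The enum values and handler parameters are assumptions.

```kotlin
// Sketch of the extended annotation: besides the text label, each element records whether
// it is triggerable at all and, if so, how it is triggered. Names are illustrative.
enum class TriggerManner { CLICK, SLIDE }

data class ElementAnnotation(
    val elementId: Int,
    val textLabel: String,
    val triggerable: Boolean,              // distinguishes event elements from other elements
    val triggerManner: TriggerManner? = null
)

// S240 then performs the control operation according to the annotated trigger manner:
fun performControl(a: ElementAnnotation, click: (Int) -> Unit, slide: (Int) -> Unit) {
    if (!a.triggerable) return             // not an event element; ignore for voice control
    when (a.triggerManner) {
        TriggerManner.SLIDE -> slide(a.elementId)
        else -> click(a.elementId)         // buttons, switches, arrows default to click
    }
}
```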
Further, those skilled in the art may design the implementation manner of S200 according to practical situations, which is not specifically limited in this application.
For example, in one implementation, the electronic device identifies text information and event elements in the control interface, obtains a binding relationship between the event elements and the text information according to the position distribution of the text information and the event elements in the control interface, and performs element labeling according to the binding relationship between the event elements and the text information.
For another example, in another implementation, the electronic device identifies text information and event elements in the control interface, presents to the user, and inputs a binding relationship between the text information and the event elements by the user. The electronic device performs element labeling according to user input.
Specifically, fig. 3 is a partial flow chart of a method according to an embodiment of the present application.
The electronic device performs the following procedure shown in fig. 3 to annotate the event element.
S300, starting a labeling mode.
Those skilled in the art may design the triggering manner of S300 according to practical situations, which is not specifically limited in this application.
For example, in one embodiment, the user clicks the test calibration button, triggering the electronic device to execute S300.
S310, generating a first interface according to the control interface, and displaying the first interface.
The display content of the first interface is the same as that of the control interface, and in the first interface, the elements corresponding to the event elements in the control interface are not associated with the corresponding events, that is, the elements in the first interface do not trigger the execution of the events corresponding to the control interface.
For example, in one embodiment, in S310, the interface content shown in fig. 1 is displayed. However, based on the interface content presented in S310, the user clicks the switch 202 and does not trigger execution of the corresponding on or off operation.
S320, recognizing clicking operation of a user on the first interface, and acquiring a binding relation between the event element and the text information according to the element clicked by the user.
S321, labeling text labels corresponding to the event elements according to binding relations between the event elements and the text information.
For example, for the content shown in FIG. 2, the user successively clicks "text 2" 212 and the switch 202. The electronic device binds "text 2" 212 to the switch 202 and labels the switch 202 with the text label "text 2".
In one implementation of S320, a single text or a combination of multiple texts may be selected when the user selects text information. That is, a single text message may be bound to an event element, or a combination of a plurality of texts may be bound to an event element.
In one implementation of S320, when the user selects an element, the electronic device identifies the selected element and displays its existing annotation information, so that the user can decide from the displayed information whether the element needs to be re-labeled. For example, for an event element, the text label of the event element, its triggerable attribute, the generalized word segmentation of its text label and the auxiliary pinyin of its text label are presented.
S330, exiting the labeling mode.
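For illustration only, the following Kotlin sketch shows one possible way to collect bindings in the labeling mode of S300 to S330: the first interface swallows events, and a click on a text element followed by a click on an event element is recorded as a binding. The class and callback names are assumptions.

```kotlin
// Sketch of the labeling (test calibration) mode: consecutive clicks on a text element
// and an event element are paired into a binding while the first interface swallows events.
class LabelingSession {
    private var pendingText: String? = null
    val bindings = mutableMapOf<Int, String>()   // event element id -> bound text information

    // called for every element the user clicks while the labeling mode is active (S320)
    fun onElementClicked(elementId: Int, isEventElement: Boolean, text: String?) {
        if (!isEventElement && text != null) {
            pendingText = text                   // remember the text information, e.g. "text 2"
        } else if (isEventElement && pendingText != null) {
            bindings[elementId] = pendingText!!  // S321: bind it to the next event element clicked
            pendingText = null
        }
    }
}
```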
Furthermore, in the actual application scenario, the text information and the event element on the control interface are not invariable.
In one embodiment, elements on the control interface are re-annotated when text information and/or event elements on the control interface change.
For example, when the text content of the text information on the control interface changes, the generalized word segmentation of the text label and the auxiliary pinyin are processed according to the change of the text content.
For another example, when the event corresponding to an event element on the control interface changes, the text information is re-bound according to the event change, and the event element is re-labeled according to the newly bound text information.
Further, in practical application scenarios, for dynamic pages such as applets and HTML web pages (for example, HTML5 web pages, abbreviated as H5 web pages), the elements in the page change dynamically, so the elements in the page cannot be labeled in advance. Therefore, in an embodiment of the present application, for dynamic pages such as applets and H5 web pages, dynamic calibration is achieved by dynamically combining element attributes within a parent container using a responsibility chain mechanism.
Specifically, fig. 4 is a partial flow chart of a method according to an embodiment of the present application.
In one embodiment, after detecting that the control interface is a dynamic page (e.g., an H5 web page), the electronic device executes the following procedure shown in fig. 4 to implement voice control.
S400, acquiring control voice.
S410, analyzing the control voice to obtain a voice text.
S420, determining a target element corresponding to the voice text.
S430, judging whether the target element can process the event corresponding to the voice text.
In one embodiment, cases in which the target element cannot process the event corresponding to the voice text include:
the target element cannot trigger event execution (the target element is a non-event element);
or the triggering mode of the target element is not matched with the voice text. For example, the target element is a slider, the trigger mode is sliding, and the trigger mode included in the voice text is clicking.
If the target element can process the event corresponding to the voice text, S431 is performed.
S431, distributing the event corresponding to the voice text to the target element for processing, namely controlling the target element to trigger the event to execute according to the voice text.
If the target element cannot process the event corresponding to the voice text, S440 is performed.
S440, determining the downward sub-elements of the target element.
S450, judging whether the sub-element can process the event corresponding to the voice text and the sub-element does not contain text information.
If the determination result is yes, that is, the sub-element may process an event corresponding to the voice text and the sub-element does not contain text information, S451 is performed.
S451, distributing the event corresponding to the voice text to the sub-element processing, namely controlling the sub-element to trigger the event to execute according to the voice text.
If the determination result is no, that is, the sub-element cannot process the event corresponding to the voice text and/or the sub-element contains text information, S460 is executed.
S460, determining the parent container of the target element upwards.
S470, judging whether the parent container can process the event corresponding to the voice text and the parent container does not contain text information.
If the determination result is yes, that is, the parent container may process the event corresponding to the voice text and the parent container does not contain text information, S471 is performed.
And S471, distributing the event corresponding to the voice text to the parent container for processing, namely controlling the parent container to trigger the event to execute according to the voice text.
If the determination result is no, that is, the parent container may not process the event corresponding to the voice text and/or the parent container contains text information, S480 is performed.
S480, determining, below the parent container, an element other than the target element that does not contain text information, taking this element as the new target element, and returning to S430.
Further, in an embodiment, after an event is successfully triggered and executed according to the voice text, the responsibility chain from the original target element to the element that finally executed the event is recorded and used as a success template. In subsequent voice control, the element that triggers event execution is preferentially determined based on the success template.
Further, in an embodiment, when it is determined that an element cannot process the event corresponding to the voice text and/or the element contains text information (S450, S470), the number of failures is recorded; when the number of failures exceeds a preset value, no further new elements are tried.
Further, in the embodiment shown in fig. 4, when the target element cannot process the event corresponding to the voice text, the child elements below the target element are tried first. In another embodiment, when the target element cannot process the event corresponding to the voice text, the parent container above the target element is tried first.
Further, in the embodiment shown in fig. 4, when the target element cannot process the event corresponding to the voice text, the number of layers tried downward for child elements and upward for parent containers is 1. In another embodiment, a different number of layers may be set. For example, if the number of layers is set to 2, when the target element cannot process the event corresponding to the voice text, the child elements one layer below the target element are tried first; when those child elements cannot process the event, the child elements two layers below the target element are tried (and similarly when trying parent containers upward).
Further, in another embodiment, the range of layers tried upward and/or downward (e.g., layers 1 to 3) is automatically adjusted for different page addresses based on aggregated data about the levels of successful links.
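For illustration only, the following Kotlin sketch condenses the responsibility-chain dispatch of fig. 4: if the matched target cannot handle the event, a text-free child is tried, then the parent container, then a text-free element below the parent. The node structure and handler names are assumptions, and the single-layer search depth mirrors the default described above.

```kotlin
// Sketch of the FIG. 4 responsibility chain for dynamic pages.
data class Node(
    val canHandle: Boolean,          // can this element process the event from the voice text?
    val hasText: Boolean,            // does it contain text information?
    val children: List<Node> = emptyList(),
    val parent: Node? = null
)

fun dispatchEvent(target: Node, execute: (Node) -> Unit): Boolean {
    if (target.canHandle) { execute(target); return true }                        // S430 / S431
    target.children.firstOrNull { it.canHandle && !it.hasText }                   // S440 - S451
        ?.let { execute(it); return true }
    val parent = target.parent ?: return false                                    // S460
    if (parent.canHandle && !parent.hasText) { execute(parent); return true }     // S470 / S471
    parent.children.firstOrNull { it !== target && !it.hasText && it.canHandle }  // S480 -> S430
        ?.let { execute(it); return true }
    return false   // give up, or count the failure against the preset limit
}
```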
Further, in one embodiment, in order to correctly voice-control a specific application (e.g., a player), a dedicated voice control module is configured for that application. When voice control is performed for the specific application, the voice control module is operated according to the voice text instead of performing voice control on the application's original control interface, and the voice control module then controls the application. This effectively avoids control failures caused by element recognition or matching errors on the original control interface.
Specifically, in one embodiment, a player voice control module corresponding to voice control is configured for a player, considering that the player is an application with a high frequency of use on IVI.
Fig. 5 is a flowchart illustrating a voice interaction method according to an embodiment of the present application.
The electronic device performs the method flow shown in fig. 5 to implement voice control for the player.
S500, acquiring control voice.
S510, analyzing the control voice to obtain a voice text.
S520, when the control object corresponding to the voice text is a player, starting a player voice control module.
S530, according to the voice text, performing control operation on the player voice control module, and controlling the player through the player voice control module.
Specifically, in one embodiment, for native player applications, the player voice control module implements playback control by simulating media session control (Session) events and preempting the audio focus. In an embodiment of the present application, a native player application refers to a player application installed on the IVI that can run independently within the system framework.
In one embodiment, for an HTML web page player (e.g., an H5 web page player), the player voice control module obtains the web page audio and video elements by executing a command script, adapting the play and pause operations of those elements.
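For illustration only, and assuming an Android-based IVI platform (the embodiment does not state the exact framework calls), the following Kotlin sketch shows the two control paths just described: a simulated media key event plus transient audio focus for a native player, and a command script driving the page's audio/video elements for an H5 web player. The class and method names are assumptions.

```kotlin
import android.media.AudioManager
import android.view.KeyEvent
import android.webkit.WebView

// Sketch of a player voice control module; not the embodiment's actual code.
class PlayerVoiceControl(private val audioManager: AudioManager) {

    fun pauseNativePlayer() {
        // Transient audio focus nudges well-behaved players to pause; the media key event
        // covers players that only listen for media-button / session callbacks.
        @Suppress("DEPRECATION")
        audioManager.requestAudioFocus(
            AudioManager.OnAudioFocusChangeListener { },
            AudioManager.STREAM_MUSIC,
            AudioManager.AUDIOFOCUS_GAIN_TRANSIENT
        )
        audioManager.dispatchMediaKeyEvent(KeyEvent(KeyEvent.ACTION_DOWN, KeyEvent.KEYCODE_MEDIA_PAUSE))
        audioManager.dispatchMediaKeyEvent(KeyEvent(KeyEvent.ACTION_UP, KeyEvent.KEYCODE_MEDIA_PAUSE))
    }

    fun pauseWebPlayer(webView: WebView) {
        // Command script: find the page's audio/video elements and pause them.
        webView.evaluateJavascript(
            "document.querySelectorAll('video,audio').forEach(m => m.pause());", null
        )
    }
}
```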
According to the method, the test calibration scheme calibrates data for all application pages on the head unit, including third-party applications, without any code intrusion. Meanwhile, compatible adaptation of control for both native players and mainstream HTML web page players is achieved through audio focus and command scripts, reducing per-application adaptation and customization work and improving the coverage and accuracy of visible elements.
According to the method, auxiliary labels are added automatically while the data are calibrated; their content includes parsing and generalization of word segmentation, auxiliary pinyin and bidirectional matching rules. The auxiliary labels can improve the fuzzy matching rate of visible elements in a weak-network or offline environment.
According to the method, for dynamic content such as applets and HTML web pages, a dynamic chain and event distribution mechanism adaptively adjusts text information and events, reducing matching failures on web pages and improving the matching efficiency of multiple repeated web page elements.
The method provided by the embodiment of the application is built on a non-intrusive, decoupled design, and improves the "visible and speakable" voice interaction coverage and execution efficiency of each scenario on the vehicle.
Furthermore, according to the voice interaction method of the embodiment of the application, an embodiment of the application further provides a voice interaction device.
Fig. 6 is a schematic structural diagram of a voice interaction device according to an embodiment of the present application.
As shown in fig. 6, the voice interaction apparatus includes:
an annotation module 610 for annotating elements in the control interface. The annotation module 610 may refer to the implementation of S200.
The voice parsing module 620 is configured to obtain a control voice, parse the control voice, and obtain a voice text.
A control module 630 for: determining an event element corresponding to the voice text based on the text label corresponding to the event element; and according to the voice text, performing control operation on the event element corresponding to the voice text. The control module 630 may refer to the implementation of S230 and S240.
Optionally, in an embodiment, the labeling module 610 includes a hotword engine module 611. The hotword engine module 611 is configured to add an auxiliary tag to the text label corresponding to the event element. Specifically, in an embodiment, the hotword engine module 611 is configured to perform word segmentation and generalization on the text labels corresponding to the event elements, so as to obtain generalized word segmentation of the text labels corresponding to the event elements, and generate auxiliary pinyin for the text labels corresponding to the event elements.
Optionally, in an embodiment, the control module 630 includes a dynamic chain module 631. The dynamic chain module 631 is used for dynamically combining element attributes within a parent container using a responsibility chain mechanism for dynamic pages such as applets and H5 web pages, so as to achieve dynamic calibration. The dynamic chain module 631 may refer to the embodiment illustrated in FIG. 4.
Optionally, in an embodiment, the control module 630 comprises a player voice control module 632. The player voice control module 632 is used to control the player. The player voice control module 632 may refer to the embodiment shown in fig. 5.
In the description of the embodiments of the present application, for convenience of description, the apparatus is described as being divided into various modules by functions, where the division of each module is merely a division of a logic function, and the functions of each module may be implemented in one or more pieces of software and/or hardware when the embodiments of the present application are implemented.
In particular, the apparatus according to the embodiments of the present application may be fully or partially integrated into one physical entity or may be physically separated when actually implemented. And these modules may all be implemented in software in the form of calls by the processing element; or can be realized in hardware; it is also possible that part of the modules are implemented in the form of software called by the processing element and part of the modules are implemented in the form of hardware. For example, the detection module may be a separately established processing element or may be implemented integrated in a certain chip of the electronic device. The implementation of the other modules is similar. In addition, all or part of the modules can be integrated together or can be independently implemented. In implementation, each step of the above method or each module above may be implemented by an integrated logic circuit of hardware in a processor element or an instruction in a software form.
For example, the modules above may be one or more integrated circuits configured to implement the methods above, such as: one or more application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), or one or more digital signal processors (Digital Signal Processor, DSP), or one or more field programmable gate arrays (Field Programmable Gate Array, FPGA), etc. For another example, the modules may be integrated together and implemented in the form of a System-On-a-Chip (SOC).
An embodiment of the application also provides electronic equipment.
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
As shown in fig. 7, the electronic device 700 comprises a memory 702 for storing computer program instructions and a processor 701 for executing the program instructions, wherein the computer program instructions, when executed by the processor 701, trigger the electronic device 700 to perform the method steps as described in the embodiments of the present application.
Specifically, in an embodiment of the present application, the one or more computer programs are stored in the memory 702, where the one or more computer programs include instructions, which when executed by the electronic device 700, cause the electronic device 700 to perform the method steps described in the embodiments of the present application.
It is to be understood that the structural description of the electronic device 700 in the embodiments of the present application does not constitute a specific limitation on the electronic device 700. In other embodiments of the present application, electronic device 700 may include other components besides processor 701 and memory 702.
The processor 701 may be a device on chip SOC, and the processor 701 may include a central processing unit (Central Processing Unit, CPU) and may further include other types of processors.
The processor 701 may include, for example, a CPU, DSP, microcontroller, or digital signal processor, and may further include a GPU, an embedded Neural network processor (Neural-network Process Units, NPU), and an image signal processor (Image Signal Processing, ISP), and may further include a necessary hardware accelerator or logic processing hardware circuit, such as an ASIC, or one or more integrated circuits for controlling program execution of the present application, and the like. Further, the processor may have a function of operating one or more software programs, which may be stored in a storage medium.
The processor 701 may include one or more processing units. For example: the processors may include application processors (application processor, AP), modem processors, graphics processors (graphics processing unit, GPU), image signal processors (image signal processor, ISP), controllers, video codecs, digital signal processors (digital signal processor, DSP), baseband processors, and/or neural network processors (neural-network processing unit, NPU), etc. Wherein the different processing units may be separate components or may be integrated in one or more processors. In some embodiments, the electronic device 700 may also include one or more processors 701. The controller can generate operation control signals according to the instruction operation codes and the time sequence signals to finish the control of instruction fetching and instruction execution.
In some embodiments, the processor 701 may include one or more interfaces. The interfaces may include inter-integrated circuit (inter-integrated circuit, I2C) interfaces, inter-integrated circuit audio (integrated circuit sound, I2S) interfaces, pulse code modulation (pulse code modulation, PCM) interfaces, universal asynchronous receiver transmitter (universal asynchronous receiver/transmitter, UART) interfaces, mobile industry processor interfaces (mobile industry processor interface, MIPI), general-purpose input/output (GPIO) interfaces, SIM card interfaces, and/or USB interfaces, among others. The USB interface is an interface conforming to the USB standard specification, and specifically may be a Mini USB interface, a Micro USB interface, a USB Type C interface, or the like. The USB interface can be used for connecting a charger to charge the electronic equipment and can also be used for transmitting data between the electronic equipment and the peripheral equipment.
The electronic device 700 may also include an external memory interface for interfacing with an external memory card, such as a Micro SD card, to enable expansion of the memory capabilities of the electronic device. The external memory card communicates with the processor 701 through an external memory interface to implement a data storage function. For example, files such as music, video, etc. are stored in an external memory card.
Memory 702 may include a code storage area and a data storage area. Wherein the code storage area may store an operating system. The data store may store data and the like created during use of the electronic device 700. In addition, memory 702 may include high-speed random access memory, and may also include non-volatile memory, such as one or more disk storage units, flash memory units, universal flash memory (universal flash storage, UFS), and the like.
The memory 702 may be a read-only memory (ROM), other type of static storage device that can store static information and instructions, a random access memory (random access memory, RAM) or other type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (electrically erasable programmable read-only memory, EEPROM), a compact disc read-only memory (compact disc read-only memory) or other optical disk storage, optical disk storage (including compact disc, laser disc, optical disc, digital versatile disc, blu-ray disc, etc.), magnetic disk storage media, or other magnetic storage devices, or any computer readable medium that can be utilized to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The processor 701 and the memory 702 may be combined into one processing device, more commonly as separate components.
Further, the devices, apparatuses, modules illustrated in the embodiments of the present application may be implemented by a computer chip or entity, or by a product having a certain function.
It will be apparent to those skilled in the art that embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media having computer-usable program code embodied therein.
In several embodiments provided herein, any of the functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application.
Specifically, in an embodiment of the present application, there is further provided a computer readable storage medium, where a computer program is stored, when the computer program is executed on a computer, to cause the computer to perform the method provided in the embodiment of the present application.
An embodiment of the present application also provides a computer program product comprising a computer program which, when run on a computer, causes the computer to perform the method provided by the embodiments of the present application.
The description of embodiments herein is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices), and computer program products according to embodiments herein. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In the embodiments of the present application, the term "at least one" refers to one or more, and the term "a plurality" refers to two or more. "and/or", describes an association relation of association objects, and indicates that there may be three kinds of relations, for example, a and/or B, and may indicate that a alone exists, a and B together, and B alone exists. Wherein A, B may be singular or plural. The character "/" generally indicates that the context-dependent object is an "or" relationship. "at least one of the following" and the like means any combination of these items, including any combination of single or plural items. For example, at least one of a, b and c may represent: a, b, c, a and b, a and c, b and c or a and b and c, wherein a, b and c can be single or multiple.
In the present embodiments, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments in this application are described in a progressive manner; identical or similar parts of the embodiments may be referred to across embodiments, and each embodiment focuses on its differences from the other embodiments. In particular, the device embodiments are described relatively briefly because they are substantially similar to the method embodiments, and reference may be made to the description of the method embodiments for the relevant parts.
Those of ordinary skill in the art will appreciate that the various elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, for the specific working processes of the apparatuses and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated herein.
The foregoing is merely a specific embodiment of the present application; any change or substitution that a person skilled in the art could readily conceive of within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. The protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A voice interaction method, applied to an electronic device, the method comprising:
labeling event elements with corresponding text labels according to a binding relationship between the event elements and text information on a control interface;
acquiring a control voice;
parsing the control voice to obtain a voice text;
determining, based on the text labels, an event element corresponding to the voice text;
and performing, according to the voice text, a control operation on the event element corresponding to the voice text.
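For orientation only, the claim-1 pipeline can be read as a label-indexed dispatch: text labels bound to event elements are indexed in advance, the recognized voice text is matched against that index, and the matched element receives the control operation. The Kotlin sketch below illustrates this reading; the names (EventElement, VoiceDispatcher, handle) are hypothetical and are not defined by this application.

// Minimal sketch of the claim-1 pipeline. All names are hypothetical illustrations,
// not APIs defined by this application.
data class EventElement(val id: String, val onEvent: (String) -> Unit)

class VoiceDispatcher(private val labelIndex: Map<String, EventElement>) {
    // labelIndex stands in for the binding between text labels and event elements
    // on the control interface.
    fun handle(voiceText: String) {
        // Determine the event element whose text label appears in the recognized text.
        val target = labelIndex.entries.firstOrNull { voiceText.contains(it.key) }?.value
        // Perform the control operation on the matched element, if any.
        target?.onEvent?.invoke(voiceText)
    }
}

fun main() {
    val index = mapOf(
        "play" to EventElement("btn_play") { println("play control triggered by: $it") },
        "next" to EventElement("btn_next") { println("next control triggered by: $it") }
    )
    VoiceDispatcher(index).handle("play the song")
}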
2. The method according to claim 1, wherein the method further comprises:
generating a first interface according to the control interface and displaying the first interface, wherein the display content of the first interface is the same as that of the control interface, and elements in the first interface do not trigger execution of events corresponding to the control interface;
and recognizing a click operation performed by a user on an element of the first interface, and acquiring the binding relationship according to the element clicked by the user.
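As a rough illustration of claim 2, the "first interface" can be thought of as a read-only replica of the control interface whose elements, when tapped, record a label-to-element binding instead of firing the real event. The sketch below assumes a simple recorder object; all names are illustrative only.

// Hypothetical sketch of the claim-2 binding capture; names are illustrative only.
class BindingRecorder {
    private val bindings = mutableMapOf<String, String>() // text information -> element id

    // Invoked by the replica ("first interface") in place of the element's real handler,
    // so the click does not trigger the event of the control interface.
    fun onReplicaClick(elementId: String, visibleText: String) {
        bindings[visibleText] = elementId
    }

    // The captured binding relationship, later used for text labeling.
    fun snapshot(): Map<String, String> = bindings.toMap()
}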
3. The method according to claim 1 or 2, wherein the method further comprises:
adding an auxiliary label to the text label.
4. The method according to claim 3, wherein the auxiliary label comprises: a generalized word segmentation of the text label and/or an auxiliary pinyin of the text label.
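Claims 3-4 add auxiliary labels so that a voice text that does not exactly match the on-screen wording can still be resolved. A minimal sketch follows; segment and toPinyin are placeholders for whatever tokenizer and pinyin converter an implementation would actually use, not real library calls.

// Sketch of auxiliary labels per claims 3-4; segment() and toPinyin() are placeholders.
data class LabeledElement(val label: String, val aliases: MutableSet<String> = mutableSetOf())

fun segment(label: String): List<String> = label.split(" ")   // placeholder word segmentation
fun toPinyin(label: String): String? = null                   // placeholder pinyin conversion

fun addAuxiliaryLabels(element: LabeledElement) {
    element.aliases += segment(element.label)                  // generalized word segmentation
    toPinyin(element.label)?.let { element.aliases += it }     // auxiliary pinyin, if available
}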
5. The method according to any one of claims 1-4, wherein the determining the event element corresponding to the voice text comprises:
determining a first target element by matching against the voice text;
in a case that the first target element cannot process an event corresponding to the voice text, searching downward for a child element of the first target element;
determining whether the child element can process the event corresponding to the voice text and does not contain text information;
and in a case that the child element can process the event corresponding to the voice text and does not contain text information, determining the child element as the event element corresponding to the voice text.
6. The method according to claim 5, wherein the determining the event element corresponding to the voice text further comprises:
in a case that the child element cannot process the event corresponding to the voice text and/or the child element contains text information, searching upward for a parent container of the first target element;
determining whether the parent container can process the event corresponding to the voice text and does not contain text information;
and in a case that the parent container can process the event corresponding to the voice text and does not contain text information, determining the parent container as the event element corresponding to the voice text.
7. The method according to claim 6, wherein the determining the event element corresponding to the voice text further comprises:
in a case that the parent container cannot process the event corresponding to the voice text and/or the parent container contains text information, searching downward from the parent container for a second target element that does not contain text information;
determining whether the second target element can process the event corresponding to the voice text;
and in a case that the second target element can process the event corresponding to the voice text, determining the second target element as the event element corresponding to the voice text.
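Claims 5-7 describe a fallback search around the text-matched element: first its text-free children, then its parent container, then other text-free elements under that parent. One way to read that order is sketched below; UiNode and canHandle() are illustrative stand-ins for whatever view or accessibility tree the implementation uses, and "can process the event" is simplified to a clickable flag.

// Hypothetical sketch of the claims 5-7 lookup order; not an API of this application.
class UiNode(
    val text: String? = null,
    val clickable: Boolean = false,
    val parent: UiNode? = null,
    val children: MutableList<UiNode> = mutableListOf()
) {
    fun canHandle(): Boolean = clickable          // "can process the event", simplified
    fun hasText(): Boolean = !text.isNullOrEmpty()
}

fun resolveEventElement(firstTarget: UiNode): UiNode? {
    if (firstTarget.canHandle()) return firstTarget
    // Claim 5: search downward for a child that can handle the event and has no text.
    firstTarget.children.firstOrNull { it.canHandle() && !it.hasText() }?.let { return it }
    // Claim 6: otherwise try the parent container, if it can handle the event and has no text.
    val parent = firstTarget.parent
    if (parent != null && parent.canHandle() && !parent.hasText()) return parent
    // Claim 7: otherwise search downward from the parent for another text-free element.
    return parent?.children?.firstOrNull { it !== firstTarget && it.canHandle() && !it.hasText() }
}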
8. The method according to any one of claims 1-7, further comprising:
starting a player voice control module in a case that the control object corresponding to the voice text is a player;
and performing, according to the voice text, a control operation on the player voice control module, and controlling the player through the player voice control module, wherein:
for a native player application, the player voice control module implements playback control by simulating a media session control event and preempting the audio focus;
and/or,
for a hypertext markup language (HTML) web page player, the player voice control module obtains web page audio and video elements by executing a command script, and adapts play and pause of the web page audio and video elements.
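Claim 8 distinguishes a native-player path (simulated media session control events plus audio-focus handling) from an HTML page path (driving the page's audio/video element via script). The Android-flavoured sketch below shows one plausible shape of each path; the AudioManager and WebView calls are standard framework APIs, but how the claimed player voice control module actually combines them is not specified here, and the audio-focus preemption is omitted for brevity.

// Rough sketch of the two claim-8 paths; one possible arrangement, not the claimed module itself.
import android.content.Context
import android.media.AudioManager
import android.view.KeyEvent
import android.webkit.WebView

// Native player path: simulate a media session control event by dispatching a media key.
fun togglePlayPauseNative(context: Context) {
    val audioManager = context.getSystemService(Context.AUDIO_SERVICE) as AudioManager
    audioManager.dispatchMediaKeyEvent(KeyEvent(KeyEvent.ACTION_DOWN, KeyEvent.KEYCODE_MEDIA_PLAY_PAUSE))
    audioManager.dispatchMediaKeyEvent(KeyEvent(KeyEvent.ACTION_UP, KeyEvent.KEYCODE_MEDIA_PLAY_PAUSE))
}

// HTML web page player path: run a script that finds the page's media element and toggles it.
fun togglePlayPauseWeb(webView: WebView) {
    val script = """
        (function () {
          var m = document.querySelector('video, audio');
          if (m) { m.paused ? m.play() : m.pause(); }
        })();
    """.trimIndent()
    webView.evaluateJavascript(script, null)
}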
9. An electronic device comprising a memory for storing computer program instructions and a processor for executing the computer program instructions, wherein the computer program instructions, when executed by the processor, trigger the electronic device to perform the method steps of any of claims 1-8.
10. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program which, when run on a computer, causes the computer to perform the method according to any of claims 1-8.
CN202311861406.4A 2023-12-29 2023-12-29 Voice interaction method and electronic equipment Pending CN117894308A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311861406.4A CN117894308A (en) 2023-12-29 2023-12-29 Voice interaction method and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311861406.4A CN117894308A (en) 2023-12-29 2023-12-29 Voice interaction method and electronic equipment

Publications (1)

Publication Number Publication Date
CN117894308A true CN117894308A (en) 2024-04-16

Family

ID=90646878

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311861406.4A Pending CN117894308A (en) 2023-12-29 2023-12-29 Voice interaction method and electronic equipment

Country Status (1)

Country Link
CN (1) CN117894308A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120580994A (en) * 2025-08-06 2025-09-02 深圳市友杰智新科技有限公司 Off-line speech recognition entry expansion method, system, equipment and storage medium

Similar Documents

Publication Publication Date Title
US11194448B2 (en) Apparatus for vision and language-assisted smartphone task automation and method thereof
TWI684881B (en) Method, system and non-transitory machine-readable medium for generating a conversational agentby automatic paraphrase generation based on machine translation
US11302337B2 (en) Voiceprint recognition method and apparatus
US9263037B2 (en) Interactive manual, system and method for vehicles and other complex equipment
TWI510965B (en) Input method editor integration
US10460731B2 (en) Apparatus, method, and non-transitory computer readable storage medium thereof for generating control instructions based on text
TWI519968B (en) Input method editor user profiles
EP2610724A1 (en) A system and method for online user assistance
TW200900967A (en) Multi-mode input method editor
JP2002544584A (en) System and method for dynamic assistance in a software application using behavioral and host application models
CN112286485B (en) Method and device for controlling application through voice, electronic equipment and storage medium
CN117894308A (en) Voice interaction method and electronic equipment
CN116700662A (en) Voice control method and device, storage medium and vehicle
CN115858601A (en) Conducting collaborative search sessions through automated assistant
KR20200034660A (en) Facilitated user interaction
CN112380871A (en) Semantic recognition method, apparatus, and medium
CN117520490A (en) Man-machine conversation method, system and related device
JP2002268667A (en) Presentation system and control method thereof
Gruen et al. NuiVend-Next Generation Vending Machine
Griol et al. A framework to develop adaptive multimodal dialog systems for android-based mobile devices
CN113641408A (en) Method and device for generating shortcut entrance
CN112017487A (en) Flat Flash learning system based on artificial intelligence
CN113722467B (en) Processing method, system, device and storage medium for user search intention
CN119336217A (en) Task execution method, device, electronic device, medium and program product
CN119864028A (en) Method, device, equipment, vehicle and storage medium for vehicle voice interaction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination