Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with aspects of the present application as detailed in the appended claims.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a/an ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element. Unless otherwise defined, terms appearing in different embodiments of the present application have the same meaning as explained in a particular embodiment or as further determined by reference to the context of that embodiment.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information and, similarly, second information may also be referred to as first information, without departing from the scope herein. The term "if", as used herein, may be interpreted as "upon", "when", or "in response to a determination", depending on the context. Furthermore, as used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context indicates otherwise. It will be further understood that the terms "comprises," "comprising," "includes," and/or "including" specify the presence of stated features, steps, operations, elements, components, items, categories, and/or groups, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, items, categories, and/or groups. The terms "or", "and/or", "including at least one of", and the like, as used herein, may be construed as inclusive, or mean any one or any combination. For example, "including at least one of A, B, and C" means "any of: A; B; C; A and B; A and C; B and C; or A and B and C", and, as another example, "A, B or C" or "A, B and/or C" means "any of: A; B; C; A and B; A and C; B and C; or A and B and C". An exception to this definition occurs only when a combination of elements, functions, steps, or operations is in some way inherently mutually exclusive.
It should be understood that, although the steps in the flowcharts of the embodiments of the present application are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in the figures may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and their order of execution is not necessarily sequential; they may be performed in turns or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
The words "if", as used herein, may be interpreted as "at" or "when" or "in response to a determination" or "in response to a detection", depending on the context. Similarly, the phrase "if determined" or "if detected (stated condition or event)" may be interpreted as "when determined" or "in response to determination" or "when detected (stated condition or event)" or "in response to detection (stated condition or event), depending on the context.
It should be noted that, in this document, step numbers such as S22 and S23 are used to describe the corresponding content more clearly and briefly and do not constitute a substantial limitation on the sequence; those skilled in the art may execute S23 first and then S22 when implementing the present application, which remains within the scope of protection of the present application.
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In the following description, suffixes such as "module", "part", or "unit" used to represent elements are adopted only to facilitate the description of the present application and have no specific meaning per se. Thus, "module", "part", and "unit" may be used interchangeably.
The intelligent terminal may be implemented in various forms. For example, the intelligent terminals described in the present application may include mobile terminals such as mobile phones, tablet computers, notebook computers, palmtop computers, personal digital assistants (Personal Digital Assistant, PDA), portable media players (Portable Media Player, PMP), navigation devices, wearable devices, smart bracelets, and pedometers, as well as fixed terminals such as digital TVs and desktop computers.
The following description takes a mobile terminal as an example, and those skilled in the art will understand that, apart from elements specifically used for mobile purposes, the configuration according to the embodiments of the present application can also be applied to fixed terminals.
Referring to fig. 1, which is a schematic diagram of a hardware structure of a mobile terminal for implementing various embodiments of the present application, the mobile terminal 100 may include an RF (Radio Frequency) unit 101, a WiFi module 102, an audio output unit 103, an A/V (audio/video) input unit 104, a sensor 105, a display unit 106, a user input unit 107, an interface unit 108, a memory 109, a processor 110, and a power supply 111. Those skilled in the art will appreciate that the mobile terminal structure shown in fig. 1 does not limit the mobile terminal, and the mobile terminal may include more or fewer components than shown, may combine certain components, or may have a different arrangement of components.
The following describes the components of the mobile terminal in detail with reference to fig. 1:
The radio frequency unit 101 may be used for receiving and transmitting signals during information transmission and reception or during a call; specifically, it receives downlink information from a base station and delivers it to the processor 110 for processing, and transmits uplink data to the base station. Typically, the radio frequency unit 101 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like. In addition, the radio frequency unit 101 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including, but not limited to, GSM (Global System for Mobile communications), GPRS (General Packet Radio Service), CDMA2000 (Code Division Multiple Access 2000), WCDMA (Wideband Code Division Multiple Access), TD-SCDMA (Time Division-Synchronous Code Division Multiple Access), FDD-LTE (Frequency Division Duplexing-Long Term Evolution), TDD-LTE (Time Division Duplexing-Long Term Evolution), 5G, 6G, and the like.
WiFi is a short-range wireless transmission technology. Through the WiFi module 102, the mobile terminal can help the user send and receive e-mails, browse web pages, access streaming media, and the like, providing the user with wireless broadband Internet access. Although fig. 1 shows the WiFi module 102, it is understood that it is not an essential component of the mobile terminal and may be omitted as required within a scope that does not change the essence of the invention.
The audio output unit 103 may convert audio data received by the radio frequency unit 101 or the WiFi module 102 or stored in the memory 109 into an audio signal and output as sound when the mobile terminal 100 is in a call signal reception mode, a talk mode, a recording mode, a voice recognition mode, a broadcast reception mode, or the like. Also, the audio output unit 103 may also provide audio output (e.g., a call signal reception sound, a message reception sound, etc.) related to a specific function performed by the mobile terminal 100. The audio output unit 103 may include a speaker, a buzzer, and the like.
The A/V input unit 104 is used to receive an audio or video signal. The A/V input unit 104 may include a graphics processor (Graphics Processing Unit, GPU) 1041 and a microphone 1042. The graphics processor 1041 processes image data of still pictures or video obtained by an image capturing device (e.g., a camera) in a video capturing mode or an image capturing mode. The processed image frames may be displayed on the display unit 106. The image frames processed by the graphics processor 1041 may be stored in the memory 109 (or other storage medium) or transmitted via the radio frequency unit 101 or the WiFi module 102. The microphone 1042 can receive sound (audio data) in a phone call mode, a recording mode, a voice recognition mode, and the like, and can process such sound into audio data. In the case of a telephone call mode, the processed audio (voice) data may be converted into a format that can be transmitted to a mobile communication base station via the radio frequency unit 101 for output. The microphone 1042 may implement various types of noise cancellation (or suppression) algorithms to cancel (or suppress) noise or interference generated while receiving and transmitting the audio signal.
The mobile terminal 100 also includes at least one sensor 105, such as a light sensor, a motion sensor, and other sensors. Optionally, the light sensor includes an ambient light sensor and a proximity sensor; the ambient light sensor may adjust the brightness of the display panel 1061 according to the brightness of ambient light, and the proximity sensor may turn off the display panel 1061 and/or the backlight when the mobile terminal 100 moves close to the ear. As a motion sensor, the accelerometer sensor can detect acceleration in all directions (generally three axes), can detect the magnitude and direction of gravity when static, and can be used to identify the posture of the mobile phone (such as landscape/portrait switching, related games, and magnetometer posture calibration) and for vibration-recognition-related functions (such as a pedometer and tapping). The mobile terminal may also be configured with other sensors such as a fingerprint sensor, a pressure sensor, an iris sensor, a molecular sensor, a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which are not described in detail herein.
The display unit 106 is used to display information input by the user or information provided to the user. The display unit 106 may include a display panel 1061, and the display panel 1061 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like.
The user input unit 107 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the mobile terminal. Optionally, the user input unit 107 may include a touch panel 1071 and other input devices 1072. The touch panel 1071, also referred to as a touch screen, may collect touch operations on or near it by a user (e.g., operations of the user on or near the touch panel 1071 using any suitable object or accessory such as a finger or a stylus) and drive the corresponding connection device according to a predetermined program. The touch panel 1071 may include two parts: a touch detection device and a touch controller. Optionally, the touch detection device detects the touch orientation of the user, detects a signal caused by the touch operation, and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device, converts it into touch point coordinates, and then transmits the coordinates to the processor 110, and can receive and execute commands sent by the processor 110. Further, the touch panel 1071 may be implemented in various types such as resistive, capacitive, infrared, and surface acoustic wave. The user input unit 107 may include other input devices 1072 in addition to the touch panel 1071. Optionally, the other input devices 1072 may include, but are not limited to, one or more of a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like, which is not specifically limited herein.
Optionally, the touch panel 1071 may overlay the display panel 1061. When the touch panel 1071 detects a touch operation on or near it, the touch operation is transmitted to the processor 110 to determine the type of touch event, and the processor 110 then provides a corresponding visual output on the display panel 1061 according to the type of touch event. Although in fig. 1 the touch panel 1071 and the display panel 1061 are two independent components for implementing the input and output functions of the mobile terminal, in some embodiments the touch panel 1071 may be integrated with the display panel 1061 to implement the input and output functions of the mobile terminal, which is not limited herein.
The interface unit 108 serves as an interface through which at least one external device can be connected with the mobile terminal 100. For example, the external devices may include a wired or wireless headset port, an external power (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device having an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and the like. The interface unit 108 may be used to receive input (e.g., data information, power, etc.) from an external device and transmit the received input to one or more elements within the mobile terminal 100 or may be used to transmit data between the mobile terminal 100 and an external device.
Memory 109 may be used to store software programs as well as various data. The memory 109 may mainly include a storage program area that may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), etc., and a storage data area that may store data created according to the use of the cellular phone (such as audio data, a phonebook, etc.), etc. In addition, memory 109 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
The processor 110 is a control center of the mobile terminal, connects various parts of the entire mobile terminal using various interfaces and lines, and performs various functions of the mobile terminal and processes data by running or executing software programs and/or modules stored in the memory 109 and calling data stored in the memory 109, thereby performing overall monitoring of the mobile terminal. The processor 110 may include one or more processing units, and preferably the processor 110 may integrate an application processor and a modem processor, the application processor optionally processing primarily an operating system, user interface and application programs, etc., the modem processor processing primarily wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 110.
The mobile terminal 100 may further include a power source 111 (e.g., a battery) for supplying power to the respective components, and preferably, the power source 111 may be logically connected to the processor 110 through a power management system, so as to perform functions of managing charging, discharging, and power consumption management through the power management system.
Although not shown in fig. 1, the mobile terminal 100 may further include a bluetooth module or the like, which is not described herein.
In order to facilitate understanding of the embodiments of the present application, a communication network system on which the mobile terminal of the present application is based will be described below.
Referring to fig. 2, fig. 2 is a schematic diagram of a communication network system according to an embodiment of the present application. The communication network system is an LTE system of the universal mobile communication technology, and the LTE system includes a UE (User Equipment) 201, an E-UTRAN (Evolved UMTS Terrestrial Radio Access Network) 202, an EPC (Evolved Packet Core) 203, and an operator's IP service 204, which are sequentially connected in communication.
Alternatively, the UE201 may be the terminal 100 described above, which is not described here again.
The E-UTRAN202 includes an eNodeB2021, other eNodeBs2022, and the like. Optionally, the eNodeB2021 may connect with the other eNodeBs2022 over a backhaul (e.g., an X2 interface), the eNodeB2021 is connected to the EPC203, and the eNodeB2021 may provide the UE201 with access to the EPC203.
The EPC203 may include an MME (Mobility Management Entity) 2031, an HSS (Home Subscriber Server) 2032, other MMEs 2033, an SGW (Serving Gateway) 2034, a PGW (PDN Gateway, packet data network gateway) 2035, a PCRF (Policy and Charging Rules Function) 2036, and so on. Optionally, the MME2031 is a control node that handles signaling between the UE201 and the EPC203 and provides bearer and connection management. The HSS2032 is used to provide some register functions, such as the functions of a home location register (not shown), and to hold user-specific information about service characteristics, data rates, and the like. All user data may be sent through the SGW2034; the PGW2035 may provide IP address allocation and other functions for the UE201; and the PCRF2036 is a policy and charging control policy decision point for traffic data flows and IP bearer resources, which selects and provides available policy and charging control decisions for a policy and charging enforcement function (not shown).
The IP services 204 may include the Internet, intranets, an IMS (IP Multimedia Subsystem), or other IP services.
Although the LTE system is described above as an example, it should be understood by those skilled in the art that the present application is not limited to LTE systems, but may be applied to other wireless communication systems, such as GSM, CDMA2000, WCDMA, TD-SCDMA, 5G, and future new network systems (e.g., 6G), etc.
Based on the above-mentioned mobile terminal hardware structure and communication network system, various embodiments of the present application are presented.
First embodiment
Fig. 3 is a schematic diagram of an interaction method according to a first embodiment of the present application. The execution subject of the interaction method may be at least one of an intelligent terminal such as a mobile phone or a wearable device, a voice assistant with a voice interaction function, a processor, a storage medium, and the like. The voice assistant includes, but is not limited to, operating-system-based voice interaction programs of various intelligent terminals, such as Siri on Apple-type intelligent terminals, the built-in voice assistant on Android-type intelligent terminals, the built-in voice assistant on HarmonyOS (Hongmeng)-type intelligent terminals, and the Zebra voice assistant on various domestic in-vehicle systems, as well as voice interaction programs of various APPs installed on intelligent terminals.
Optionally, the voice assistant of each APP is used only for interaction between the user and that APP, or it links voice interaction with the voice assistants of other APPs according to user settings. For example, the user sets a linkage relationship between the voice assistants of a first APP and a second APP, so that the user can directly issue, through the voice assistant of the first APP (referred to as the "first voice assistant"), a voice instruction for controlling the second APP to the voice assistant of the second APP (referred to as the "second voice assistant"). As shown in fig. 4 (a) and fig. 4 (b), the first APP can directly display the icon of the "second voice assistant", and the "second voice assistant" can receive the voice instruction issued by the user in the first APP; while the icon of the "second voice assistant" is displayed, the icon of the "first voice assistant" may or may not be displayed at the same time, and/or the "first voice assistant" may or may not continue to monitor and recognize voice instructions. For example, when the first APP and the second APP are respectively installed on two intelligent terminals, the user may issue a voice instruction through only one of the intelligent terminals. As shown in fig. 4 (b), the icons of the "first voice assistant" and the "second voice assistant" may be displayed simultaneously on the operation interface of the first APP. When the voice instruction includes content that needs to be executed by the first APP and content that needs to be executed by the second APP, the first APP on one intelligent terminal performs the corresponding operation and the second APP on the other intelligent terminal performs the corresponding operation.
The user can establish a linkage relationship between two intelligent terminals by setting a preset relationship between them, where the preset relationship includes at least one of the following: a connection relationship established based on a near field communication technology such as Bluetooth, a connection relationship established based on a far field communication technology such as a cellular network, a connection relationship established based on authentication, being located in the same local area network, belonging to the same Internet-of-Things connection group, and a master-slave control or information transceiving relationship formed according to a user-defined rule.
The voice command may be embodied as a sound uttered by a user speaking, or as a sound in the form of an analog signal uttered by a tool such as an electronic device, for example, a synthetic sound.
For ease of illustration and understanding, unless specifically indicated otherwise, the present application is described below in terms of a scenario in which a user interacts with an intelligent terminal based on a voice assistant built into the operating system.
As shown in fig. 3, the interaction method includes step S11.
S11: in response to a preset event, determining a target role of the voice assistant and/or a response voice.
In an example, the preset event may be represented as one or more events, and may be a dedicated event for implementing the present application that is different from existing events used to trigger the intelligent terminal, so that at least conflicts with existing events that perform other functions (including personalized events set by the user) can be avoided. For example, step S11 may be performed only when a preset mode is entered and/or a preset function option is turned on, so as to avoid conflicts with existing events that perform other functions; for example, the preset event is considered valid and the corresponding step is triggered only when the preset mode has been enabled via the setting interface. In this case, the user may also execute different functions more conveniently and rapidly by performing the same event, that is, by multiplexing an event used to execute other functions. For example, after the preset mode is entered, an incoming call (i.e., the preset event) may trigger the execution of step S11; when the preset mode is not entered, the operation is performed only in the conventional manner, for example, the call is automatically hung up after ringing for a long time without being answered, and the function associated with the preset event is not implemented.
For example, for a scenario applicable to an intelligent terminal, this example is equivalent to adding a quick message viewing function to the intelligent terminal. Implementations of this function include, but are not limited to, the following. Taking a smartphone as an example, a script or an application program may be written in advance and installed in the operating system of the smartphone, so that a quick message viewing option is added to the setting interface of the operating system and/or the setting interface of the voice assistant, and the function can then be enabled or disabled by sliding a slider to turn the option on or off. Alternatively, the intelligent terminal may provide a separate APP on the main interface, and this APP can turn the corresponding control function on or off and perform detailed settings for any one or more events, including the preset event. After the function option is turned on, the intelligent terminal may allow the user to set the event on the display interface, so that it can subsequently be determined whether a detected event is consistent with the set event. If they are consistent, the execution of step S11 is triggered; if they are inconsistent, the execution of step S11 is not triggered.
In another example, the preset event is not an existing event that performs other functions but is merely a dedicated event of the present application. Optionally, the present application combines the preset event with the current interface to determine whether to trigger the execution of step S11. For example, if the current interface is an already opened interface of the voice assistant (referred to simply as the "operation interface"), the execution of step S11 is triggered once the preset event is detected, even if it is an existing event for executing other functions; if the current interface is not the operation interface of the voice assistant, the execution of step S11 is not triggered even if such a preset event is detected, and the other functions corresponding to the existing event are executed instead; and if the detected preset event is a dedicated event of the present application, the execution of step S11 is triggered directly, regardless of whether the current interface is the operation interface of the voice assistant.
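Purely for illustration, the following Python sketch shows one possible way such trigger gating might be organized; the function name, event descriptor, and mode flags are assumptions introduced for this example and do not limit the embodiments.

```python
from dataclasses import dataclass

# Hypothetical event descriptor; "dedicated" marks events defined solely for step S11,
# while non-dedicated events are existing events that also serve other functions.
@dataclass
class Event:
    name: str
    dedicated: bool

def should_trigger_s11(event: Event, preset_mode_on: bool, on_assistant_interface: bool) -> bool:
    """Decide whether a detected event triggers step S11 (assumed gating rules)."""
    if not preset_mode_on:
        # Preset mode not enabled: the event is handled only in the conventional way.
        return False
    if event.dedicated:
        # A dedicated event triggers S11 regardless of the current interface.
        return True
    # A multiplexed (existing) event triggers S11 only on the voice assistant's operation interface.
    return on_assistant_interface

if __name__ == "__main__":
    print(should_trigger_s11(Event("incoming_call", dedicated=False), True, False))      # False
    print(should_trigger_s11(Event("incoming_call", dedicated=False), True, True))       # True
    print(should_trigger_s11(Event("quick_view_gesture", dedicated=True), True, False))  # True
```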
The specific form of any event, including the preset event, is not limited herein. The following description takes, as an example, a preset event that includes at least one of events 1 to 4.
Event 1: a stop point (pause) occurs in the text content of the user voice.
The stop point can be understood as a sentence-break identifier of a sentence, and the present application can identify the stop point through natural language processing (Natural Language Processing, NLP) technology. In an actual scenario, each time a stop point is detected, it may indicate that the user is recognized to have finished a sentence, which may trigger the determination of the target role of the voice assistant and/or the response voice.
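A minimal sketch of stop-point detection is given below; it uses simple punctuation rules as a stand-in for a full NLP model, and the regular expression and function name are illustrative assumptions only.

```python
import re

# Punctuation marks (Chinese and English) treated here as sentence-break identifiers.
_STOP_PATTERN = re.compile(r"[。！？.!?]")

def detect_stop_point(transcribed_text: str) -> bool:
    """Return True if the incrementally transcribed user voice ends at a stop point."""
    text = transcribed_text.rstrip()
    return bool(text) and bool(_STOP_PATTERN.search(text[-1]))

if __name__ == "__main__":
    print(detect_stop_point("I want to decorate a house."))  # True -> trigger role/response determination
    print(detect_stop_point("I want to decorate"))           # False -> keep listening
```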
Event 2: the user voice switches from being uttered by a first user to being uttered by a second user.
The present application can judge whether the speaking user has changed by identifying the voiceprint feature information of the current voice; if the speaking user has changed, the determination of the target role of the voice assistant and/or the response voice is triggered.
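As a hedged example, the sketch below compares voiceprint feature vectors of consecutive utterances with cosine similarity to judge whether the speaking user has changed; the voiceprint embeddings are assumed to be produced elsewhere, and the threshold value is illustrative only.

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def speaker_changed(prev_voiceprint: Sequence[float],
                    curr_voiceprint: Sequence[float],
                    threshold: float = 0.75) -> bool:
    """Assume voiceprints are embeddings from an upstream model; a low similarity
    between consecutive utterances suggests the speaking user has changed."""
    return cosine_similarity(prev_voiceprint, curr_voiceprint) < threshold

if __name__ == "__main__":
    first_user = [0.9, 0.1, 0.3]
    second_user = [0.1, 0.8, 0.5]
    print(speaker_changed(first_user, second_user))  # True -> event 2 is triggered
```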
Event 3: a voice call event.
The voice call event includes, but is not limited to, at least one of an incoming call event and a called event of a call-type or chat-type APP. For example, as shown in fig. 5a and 5b, when the mobile phone receives an incoming call and the user does not answer within a preset time period, the intelligent terminal automatically activates and runs the voice assistant and triggers the determination of the target role of the voice assistant and/or the response voice. Optionally, after the call is answered, the user voice (i.e., the speech of the calling party) and the text content of the response voice are displayed, in conversational time order, on the interface of the intelligent terminal (i.e., the call interface) shown in fig. 5a and 5b. If the user actively answers the call within the preset time period, the call proceeds in the traditional manner. Optionally, during the call, the voice assistant may also run and monitor the call in real time and intervene according to user instructions; for example, the voice assistant may query other information such as a schedule and broadcast it.
The method can operate in an offline mode. Based on an offline large model, functions such as answering on the user's behalf and replying automatically can be realized through the voice assistant, which significantly improves call efficiency. Meanwhile, offline processing can ensure the privacy and security of users and avoid privacy issues that may arise in real-time calls, providing good practice for the on-device application of voice assistants based on offline large models and ensuring the stability and scalability of the terminal system.
Event 4: a target scene and/or a target object having a preset association relationship with the user voice is detected.
The interaction of the voice assistant can be associated with different scenes through the target scene, and the interaction of the voice assistant can be associated with different users through the target object, so that the accuracy of the voice interaction is improved.
For example, the text content of the user voice is "xxx buying house now ready for decoration" a few days before, although two words of "decoration" appear, the voice assistant has difficulty in determining whether the user is about to ask for a decoration class problem, at this time, a target scene is introduced through the event 4, for example, the intelligent terminal can automatically start the 3D radar scanning function of the rear camera to scan for the current scene, if the current position in the house is detected, which indicates that the user is about to ask for a decoration class problem, the voice assistant is triggered to determine the corresponding target role, namely, a decoration advisor.
For another example, when a segment of sound (this segment of sound being the "user voice") is detected, the voiceprint feature information of the sound may be checked to determine whether it was uttered by a target user (i.e., the "target object"). If so, the voice assistant is triggered to determine the target role and/or the response voice; if not, for example, if it is identified that the sound was emitted by a television or was uttered by user A rather than target user B, the segment of sound is ignored. In this way, the accuracy and privacy security of voice interaction can be improved.
Optionally, the preset association relationship includes at least one of the following:
Relationship 1: having a textual logical relationship with the text content of the user voice.
For example, the text content of the user's voice is "how planning is needed to decorate a house," decoration "two words have a text logical relationship with the house scene, the intelligent terminal can automatically start the 3D radar scanning function of the rear camera to scan to obtain a current scene, and if the current position in a room is detected, the voice assistant determines a corresponding target role, namely a decoration consultant, if the current position is detected, and the voice assistant indicates that the user wants to inquire about a decoration class.
Unlike relationship 1, which is automatically determined based on natural language processing technology, the following relationships 2 to 5 may be manually preset relationships.
Relationship 2: a preset mapping relationship between the text content of the user voice and a target scene.
Relationship 3: a preset mapping relationship between the text content of the user voice and a target object.
For example, the text content of the user voice is "tuesday goes home to eat" and the voice assistant detects that the user voice is "father" and the father has a preset mapping relationship with the furniture, and the voice assistant determines the target role and/or the response voice.
Relationship 4: a preset mapping relationship between the user and a scene and/or between the user and an object.
Relationship 5: a preset mapping relationship between the user and a target object.
For example, when a user voice is detected, if the user who uttered the user voice is a target object (i.e., the preset mapping relationship between the user and the target object is satisfied), and/or if the 3D radar scan of the rear camera determines that the current scene is the house in which the user who uttered the user voice lives (i.e., the preset mapping relationship between the user and the scene is satisfied), the voice assistant is automatically triggered and run, and the target role and/or the response voice are determined.
It should be noted that, for a given technical feature, the embodiments of the present application include multiple cases and multiple possible implementations; unless otherwise specified, the corresponding technical feature may be implemented by combining any subset of these modes. For example, relationship 1 above may be combined with any one of relationships 2 to 5. Through such combinations, the corresponding technical feature can be implemented more accurately and/or more intelligently, thereby improving the accuracy of the implementation and the user experience.
It should be appreciated that the preset event does not necessarily contain a user voice, i.e., the target role of the voice assistant and/or the response voice need not be determined based on a detected user voice. For example, when the preset event is the called event, once the called event is detected, the voice assistant can determine the target role and/or the response voice according to the calling party of the called event. Taking an incoming call from a delivery courier as an example, the voice assistant can automatically answer the call and determine, according to historical interaction information (i.e., the previously recognized call voice of the present intelligent terminal user, the identification of the incoming number, and the like), that the target role is the present intelligent terminal user; the target role may of course also be a common intelligent robot. Note that the courier does not need to utter any user voice indicating identity at this time, and the voice assistant can automatically reply, for example, "Hello, please put it in the delivery locker on the first floor." Taking an incoming call from "father" as another example, when the call has not been answered after a preset duration, the voice assistant can automatically answer the call, determine, according to historical interaction information (i.e., the previously recognized call voice of the present intelligent terminal user), that the target role is the present intelligent terminal user, and reply in a personalized manner according to the call habits, tone, and language habits of the calling party (i.e., father). Such personalized replies may refer to more mature dialogue techniques in the field of voice interaction and are not described in detail here.
Each role of the voice assistant has expertise and knowledge corresponding to that role and can provide operations and dialogues matching the role. The voice assistant may be an AI (Artificial Intelligence) assistant, and the roles it contains may include existing professional roles, such as a traditional decoration consultant, a decoration designer, or an intellectual property technician, and may also include custom-added roles, such as the present intelligent terminal user synthesized by the voice assistant through imitation learning, a love expert, or a fairy-tale teller. In step S11, the manner of determining the target role of the voice assistant may include at least one of the following:
Mode 1: determining the role with the highest matching degree as the target role according to the text content of the user voice.
The user voice is analyzed through natural language processing technology to obtain the text content, which is mainly used to determine the user's intention; keywords are therefore extracted from the text content so that corresponding roles can be determined from the keywords as candidate roles.
When only one candidate role is determined, that candidate role is directly taken as the target role.
When multiple candidate roles are determined, for example, when the text content is "the house to be decorated has three bedrooms and one living room, and I usually like watching movies and playing games at home", three candidate roles, namely a decoration consultant, a movie expert, and an e-sports player, are determined according to the keywords "decoration", "watching movies", and "playing games"; the candidate role with the highest matching degree can then be determined according to a preset rule. The preset rule may be the degree of association with the context of the user voice; for example, if the previous sentence exchanged between the user and the voice assistant contains "decoration", it is determined that "watching movies" and "playing games" merely describe the leisure and entertainment requirements to be considered during decoration, and the decoration consultant then has the highest matching degree, i.e., the decoration consultant is the target role.
In other examples, when multiple candidate roles are determined, they can be taken in turn as target roles of the voice assistant in descending order of matching degree; optionally, the target roles can play their response voices in descending order of matching degree. Still taking the foregoing three roles as an example, after the decoration consultant finishes speaking, the movie expert can feed back "the movie xxx has recently been released; would you like to watch it", and so on (an illustrative sketch combining this mode with Mode 3 is provided after Mode 3 below).
Mode 2: determining the target role of the voice assistant according to a user instruction.
When multiple candidate roles are determined, the user selects at least one of them as the target role.
Mode 3: determining the target role of the voice assistant according to current scene information and/or historical interaction information.
For example, when the determined candidate roles are a decoration consultant and a decoration designer, the intelligent terminal can automatically start the 3D radar scanning function of the rear camera; if the current indoor layout information can be scanned, "decoration designer" is selected as the target role of the voice assistant, and if it is not scanned, "decoration consultant" is selected as the target role of the voice assistant.
For another example, when an incoming call from a delivery courier is detected, the target role of the voice assistant may be determined based on historical interaction information (e.g., the tone of the last call with the courier). As yet another example, if the historical interaction information shows that the intelligent robot role was used in the last call with the courier, the current target role remains the intelligent robot; if the present intelligent terminal user was the target role last time, the current target role remains the present intelligent terminal user.
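To make the above modes concrete, the sketch below shows one assumed way of scoring candidate roles from keywords (Mode 1) and using scene information as a tie-breaker (Mode 3); the role names, keyword lists, and scoring rule are illustrative assumptions only and do not represent a definitive implementation.

```python
from typing import Dict, List, Optional

# Assumed keyword table mapping roles to trigger words (illustrative only).
ROLE_KEYWORDS = {
    "decoration consultant": ["decorate", "decoration", "house"],
    "movie expert": ["movie", "film"],
    "e-sports player": ["game", "play games"],
}

def match_candidate_roles(user_text: str) -> Dict[str, int]:
    """Mode 1: score each role by counting its keywords in the user voice text."""
    text = user_text.lower()
    scores = {role: sum(text.count(kw) for kw in kws) for role, kws in ROLE_KEYWORDS.items()}
    return {role: s for role, s in scores.items() if s > 0}

def select_target_role(user_text: str, context_keywords: List[str], scene: Optional[str]) -> str:
    scores = match_candidate_roles(user_text)
    if not scores:
        return "default assistant"
    # Association with the context of the user voice (e.g., "decoration" appearing in the
    # previous sentence) raises the score of the matching role.
    for role, kws in ROLE_KEYWORDS.items():
        if role in scores and any(kw in context_keywords for kw in kws):
            scores[role] += 1
    # Mode 3: current scene information (e.g., an indoor layout found by the 3D radar scan)
    # acts as a further tie-breaker.
    if scene == "indoor" and "decoration consultant" in scores:
        scores["decoration consultant"] += 1
    return max(scores, key=scores.get)

if __name__ == "__main__":
    text = "the house to decorate has three rooms; I like watching movies and playing games at home"
    print(select_target_role(text, context_keywords=["decoration"], scene="indoor"))
```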
Based on the above, the voice assistant can automatically switch roles in the interaction process by triggering the preset event, so that personalized interaction of voice interaction is improved, and user experience is improved.
In one scenario, the intelligent terminal starts the default assistant: "I am your assistant, what do you need help with?" The user voice is "I want to decorate a house, how should it be planned", and the target role is determined to be a decoration consultant at this time, with the response voice "I am a decoration consultant; how big is the house, and can you describe your requirements". The user voice is then "wait a moment, it is noon, first help me order braised chicken delivered to the company", and the target role is determined to be a food-ordering assistant at this time, which places the takeout order and feeds back to the user "the takeout has been ordered". Then the user voice is "three bedrooms and one living room, and I usually like watching movies and playing games at home", the target role is determined to be the decoration consultant again at this time, and the response voice is "xxxx".
In another scenario, the intelligent terminal starts the default assistant: "I am your assistant, what do you need help with?" At this time, two target roles of the voice assistant can be determined: one determined target role is a love expert, whose response voice is "it is suggested to keep a gentle mood and then look for a solution", and the other determined target role is a story-generation assistant, whose response voice is "how about a short story to cheer you up".
The voice assistant can automatically switch roles and execute the corresponding operations and dialogues without the user manually selecting roles or providing detailed switching instructions, thereby providing the user with a more personalized and efficient experience and more appropriate services. In addition, the user can conveniently chat with multiple target roles in a dialogue manner, which further improves the personalization and efficiency of the interaction and yields higher user satisfaction.
In addition, unlike the conventional method, in which only the text content is determined and then converted into a response voice with fixed voice feature information (i.e., fixed speech speed, intonation, emotion, and the like), in step S11 the feature information of the response voice changes adaptively.
In an example, as shown in fig. 6, the method for determining the answer speech includes:
S111: determining feature information of the response voice according to at least one of a working mode of the intelligent terminal, a user role, historical interaction information, and a preset feature of the user voice;
S112: determining the text content of the response; and
S113: generating the response voice according to the feature information and the text content of the response.
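A skeletal sketch of this three-step flow is shown below; the feature fields, rule choices, and the placeholder text generator and synthesizer are assumptions for illustration, not a definitive implementation.

```python
from dataclasses import dataclass

@dataclass
class VoiceFeatures:
    speed: float      # relative speech speed
    intonation: str   # e.g. "gentle", "formal"
    emotion: str      # e.g. "respectful", "cheerful"

def determine_features(work_mode: str, user_role: str, history: dict, user_voice_features: dict) -> VoiceFeatures:
    """S111: derive response-voice features (illustrative rules only; history is
    available for further personalization but unused in this sketch)."""
    if work_mode == "conference":
        return VoiceFeatures(speed=1.0, intonation="formal", emotion="neutral")
    if user_role == "son":
        return VoiceFeatures(speed=1.0, intonation="gentle", emotion="respectful")
    return VoiceFeatures(speed=user_voice_features.get("speed", 1.0), intonation="neutral", emotion="neutral")

def determine_text(user_text: str, target_role: str) -> str:
    """S112: produce the response text (placeholder for a large-model call)."""
    return f"[{target_role}] reply to: {user_text}"

def generate_response_voice(features: VoiceFeatures, text: str) -> bytes:
    """S113: synthesize audio from features and text (placeholder for a TTS engine)."""
    return f"{features.intonation}/{features.emotion}/{features.speed}:{text}".encode("utf-8")
```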
In step S111, the preset feature of the user voice may include at least one factor such as the text content, voiceprint feature information, speech speed, intonation, and emotion feature information of the user voice. The user voice is analyzed through natural language processing technology to extract the preset feature, and the feature information of the response voice is determined accordingly.
The working mode of the intelligent terminal includes, but is not limited to, at least one of a conference mode, a do-not-disturb mode, an outdoor mode, and the like. Taking the conference mode as an example, when an incoming call is detected, the intelligent terminal can answer automatically, and the voice assistant determines the target role to be an intelligent robot of a preset gender, so that the voice feature information corresponding to the intelligent robot is used as the feature information of the response voice. The preset gender may be identified from historical interaction information (i.e., the gender of the present intelligent terminal user learned in advance).
The user role may be the role of the user who utters the user voice, for example, the role of the present intelligent terminal user. For instance, when an incoming call from "father" is detected, the user role is determined to be "son", and the feature information of the response voice is accordingly determined to lean toward being respectful, playful, and the like.
The preset features of the user's voice include, but are not limited to, at least one factor of text content, voiceprint feature information, speech speed, intonation, emotion feature information, etc. of the user's voice.
In an example, in step S111, an industry spoken auxiliary word of the target role may be obtained through a preset AI model, and the text content of the response may be generated according to the industry spoken auxiliary word.
Taking the speech speed as an example of the preset feature, the intelligent terminal can detect the current speech speed of the user voice, analyze in real time the difference between the current speech speed and the speech speed of the AI voice assistant (i.e., the target role), and compare the difference with corresponding preset thresholds. If the difference exceeds a first preset threshold, indicating that the speech speed of the AI voice assistant is too slow, the speech speed of the AI voice assistant is increased; if the difference is smaller than a second preset threshold that is smaller than the first preset threshold, indicating that the speech speed of the AI voice assistant is too fast, the speech speed of the AI voice assistant may be reduced, or the AI voice assistant may insert common spoken auxiliary words into the dialogue to reduce its speech speed, thereby forming the text content of the response determined in step S112. The industry spoken auxiliary words include, but are not limited to, spoken auxiliary words commonly used in the industry of the target role, which can be learned through an AI model from daily communication between industry personnel and consumer groups, for example, "boss", "sister", and "tastes great".
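The threshold comparison described above might look roughly as follows; the threshold values and the adjustment step are illustrative assumptions.

```python
def adjust_assistant_speed(user_speed: float, assistant_speed: float,
                           first_threshold: float = 0.3, second_threshold: float = -0.3,
                           step: float = 0.1) -> float:
    """Compare the speed difference against two preset thresholds (second < first)."""
    diff = user_speed - assistant_speed
    if diff > first_threshold:         # assistant noticeably slower than the user
        return assistant_speed + step
    if diff < second_threshold:        # assistant noticeably faster than the user
        return assistant_speed - step  # alternatively, insert spoken auxiliary words instead
    return assistant_speed

if __name__ == "__main__":
    print(adjust_assistant_speed(user_speed=1.5, assistant_speed=1.0))  # speeds up
    print(adjust_assistant_speed(user_speed=0.6, assistant_speed=1.0))  # slows down
```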
The manner of generating the text content of the response based on the industry spoken auxiliary word may include at least one of the following:
Mode 1: inserting the industry spoken auxiliary word at a stop point of the text content of the response.
For example, the AI voice assistant may first identify the stop points of the text content of the response and insert the industry spoken auxiliary word at the corresponding stop point. The AI voice assistant may determine the stop point by recognizing punctuation marks in the text content, for example, inserting an industry spoken auxiliary word after a period or exclamation mark when one is recognized, or by learning, through a preset AI model, the pauses or sentence intervals that occur when the text content is converted into speech.
Mode 2: learning, through a preset AI model, the sentence intervals that occur when the text content of the response is converted into speech, and inserting the industry spoken auxiliary word into sentence intervals exceeding a preset threshold.
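A minimal sketch of Mode 1 (punctuation-based insertion) is given below; the choice of punctuation marks and the auxiliary word are illustrative assumptions.

```python
import re

def insert_auxiliary_word(response_text: str, auxiliary_word: str = "boss") -> str:
    """Insert an industry spoken auxiliary word after each period or exclamation mark.
    The stop point is identified here by punctuation; a preset AI model could instead
    learn pauses from the synthesized speech (Mode 2)."""
    return re.sub(r"([。！.!])", r"\1 " + auxiliary_word + ",", response_text)

if __name__ == "__main__":
    print(insert_auxiliary_word("Your order is ready. It will arrive in ten minutes"))
```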
In this way, the feature information of the response voice can change adaptively; for example, as the working mode of the intelligent terminal, the user role, the historical interaction information, and/or the user voice differ, the voice assistant responds with an adapted intonation and emotion, which improves the personalization of voice interaction and the user experience.
Second embodiment
Referring to fig. 7, the interaction method of the second embodiment includes the following steps:
S21: in response to a preset event, determining a target role of the voice assistant and/or a response voice;
S22: backing up and storing the output result of the target role and/or the response voice, together with the user voice; and
S23: labeling the user according to the text content of the user voice and/or historical interaction information between the user and the intelligent terminal.
Step S21 may correspond to step S11 of the foregoing embodiment, so for features of this embodiment that are the same as those of the foregoing embodiment, reference may be made to the foregoing description, which is not repeated here. In addition, only one of steps S22 and S23 may be performed, and when both are performed, their order may be interchanged.
In step S23, as shown in fig. 5a and 5b, a corresponding AI mark may be set for the output result of the voice assistant. For example, after the user comes out of a meeting, the user may check the backed-up missed call records through the AI mark and click a record to play back the call recording of the corresponding party.
In step S23, for example, when the call content of a caller with an unknown number (i.e., the text content of the user voice) is "you have won 500w; please pay attention to xxx", the unknown number is automatically marked as suspected fraud. When the call content of a caller with a fixed number (i.e., the text content of the user voice) is "we have a children's insurance product; if you need it, let me introduce it", and the historical interaction information between the user and the intelligent terminal, obtained by learning the user's habits, leads to the conclusion that the call is unwanted, the fixed number and the current voice interaction event are marked as harassment; if the historical interaction information, obtained by learning the user's habits, shows that the user has recently been researching related insurance products, the fixed number and the current voice interaction event are not marked, or are marked as normal.
By identifying nuisance calls and the latest fraud scenarios, the present application can build a zero-harassment, anti-fraud call environment, which helps improve the privacy of the user's terminal use.
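A hedged sketch of the labeling in step S23 follows; the keyword lists, label names, and history field are assumptions introduced for illustration.

```python
from typing import Optional

FRAUD_KEYWORDS = ["won", "prize", "transfer immediately"]    # illustrative only
MARKETING_KEYWORDS = ["insurance product", "promotion"]      # illustrative only

def label_caller(call_text: str, history: dict) -> Optional[str]:
    """S23: label the caller from the call text and learned user habits (assumed rules)."""
    text = call_text.lower()
    if any(kw in text for kw in FRAUD_KEYWORDS):
        return "suspected fraud"
    if any(kw in text for kw in MARKETING_KEYWORDS):
        # Learned user habits decide between harassment and a normal call.
        return "normal" if history.get("recently_researched_insurance") else "harassment"
    return None  # no label

if __name__ == "__main__":
    print(label_caller("Congratulations, you have won 500w", {}))  # suspected fraud
    print(label_caller("We have a children's insurance product",
                       {"recently_researched_insurance": True}))   # normal
```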
After step S22, as shown in fig. 8, the method may further include S24: creating a schedule. That is, the intelligent terminal recognizes, through natural language processing technology, text information in the backed-up result from which a schedule can be created, and the schedule can be automatically extracted and created. For example, "xx asks you to come home at 12 o'clock to eat noodles" in the backup can be extracted as key information, and "xx" can be the related party of the schedule.
Through backup storage, the present application can provide users with an intelligent full-scenario experience, including risk identification, summary generation, and further derived personalized services.
In other examples, the schedule may be created directly, without backup storage, in response to recognizing schedule information from the output result of the target role and/or the response voice and from the user voice. For example, when the working mode of the intelligent terminal is the conference mode and an incoming call from "father" is received, the voice assistant automatically answers and replies that the user is in a meeting; the father reminds the user not to forget to come home for dinner on a certain evening of the week, and the voice assistant automatically replies "okay". Optionally, after the call is hung up, the voice assistant may display, on a preset interface of the intelligent terminal (such as the current conversation interface), that a schedule for that dinner has been created, or play a response voice indicating that the schedule for that dinner has been created.
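The schedule creation of step S24 might be sketched as follows; the regular expression and the schedule fields are assumptions for illustration, not a definitive implementation.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class ScheduleItem:
    related_party: str
    time_expression: str
    content: str

# Illustrative pattern for sentences of the form "<party> ... at <time> <content>".
_SCHEDULE_PATTERN = re.compile(r"(?P<party>\w+) .*?(?P<time>\d{1,2} o'clock|\w+day evening)\s*(?P<content>.+)")

def extract_schedule(backup_text: str) -> Optional[ScheduleItem]:
    """S24: recognize schedule-worthy text in the backed-up result and build a schedule entry."""
    match = _SCHEDULE_PATTERN.search(backup_text)
    if not match:
        return None
    return ScheduleItem(match.group("party"), match.group("time"), match.group("content").strip())

if __name__ == "__main__":
    print(extract_schedule("xx asks you to come home at 12 o'clock to eat noodles"))
```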
Third embodiment
On the basis of the description of any one of the foregoing embodiments, the difference is that the preset event of the third embodiment includes receiving a control instruction and/or a user voice from an associated device. For convenience of description, the associated device is referred to as a second intelligent terminal, and the present intelligent terminal is referred to as a first intelligent terminal.
The user can establish a linkage relationship between the two intelligent terminals by setting a preset relationship between them, where the preset relationship includes at least one of the following: a connection relationship established based on a near field communication technology such as Bluetooth, a connection relationship established based on a far field communication technology such as a cellular network, a connection relationship established based on authentication, being located in the same local area network, belonging to the same Internet-of-Things connection group, and a master-slave control or information transceiving relationship formed according to a user-defined rule.
In an example, after the linkage is set, the user may issue a user voice through the voice assistant of the first intelligent terminal (referred to as the "first voice assistant"), and the user voice is also received by the voice assistant of the second intelligent terminal (referred to as the "second voice assistant"). As shown in fig. 9 (a) and 9 (b), the first intelligent terminal may directly display the icon of the "second voice assistant", and the "second voice assistant" can receive the voice instruction issued by the user on the first intelligent terminal; while the icon of the "second voice assistant" is displayed, the icon of the "first voice assistant" may or may not be displayed at the same time, and/or the "first voice assistant" may or may not continue to monitor and recognize voice instructions. For example, the user may issue a user voice through only the first intelligent terminal; as shown in fig. 9 (b), the icons of the "first voice assistant" and the "second voice assistant" may be displayed simultaneously on the operation interface of the first intelligent terminal. When the user voice includes content that needs to be executed by the first intelligent terminal and content that needs to be executed by the second intelligent terminal, the first intelligent terminal and the second intelligent terminal each execute the corresponding operation. Here, the finally determined second voice assistant has the same target role as the first voice assistant; the text content of their response voices differs, but the feature information of the response voices may be the same.
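For illustration, the sketch below shows one assumed way a user voice received on the first intelligent terminal might be split between the two linked voice assistants; the routing keywords, class names, and sentence-splitting rule are assumptions only.

```python
from dataclasses import dataclass, field

@dataclass
class LinkedAssistant:
    name: str
    keywords: list                         # words indicating content this terminal should execute
    received: list = field(default_factory=list)

    def maybe_execute(self, sentence: str) -> bool:
        if any(kw in sentence for kw in self.keywords):
            self.received.append(sentence)  # stand-in for executing the operation locally
            return True
        return False

def dispatch_user_voice(user_voice_text: str, first: "LinkedAssistant", second: "LinkedAssistant") -> None:
    """Split the user voice into sentences and route each part to the appropriate assistant."""
    for sentence in filter(None, (s.strip() for s in user_voice_text.split(";"))):
        first.maybe_execute(sentence) or second.maybe_execute(sentence)

if __name__ == "__main__":
    first = LinkedAssistant("first voice assistant", ["navigate"])
    second = LinkedAssistant("second voice assistant", ["air conditioner"])
    dispatch_user_voice("navigate to the company; turn on the air conditioner", first, second)
    print(first.received, second.received)
```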
The embodiment of the application also provides an intelligent terminal which comprises a memory and a processor, wherein the memory stores an interaction program, and the interaction program is executed by the processor to realize the steps of the interaction method in any embodiment.
The embodiment of the application also provides a storage medium, and the storage medium stores an interaction program which realizes the steps of the interaction method in any embodiment when being executed by a processor.
The embodiments of the intelligent terminal and the storage medium provided by the present application may contain all the technical features of any of the embodiments of the interaction method and thus have the corresponding beneficial effects; the expansion and explanation of those features are substantially the same as in each embodiment of the method and are not repeated here.
Embodiments of the present application also provide a computer program product comprising computer program code which, when run on a computer, causes the computer to perform the method as in the various possible embodiments described above.
The embodiment of the application also provides a chip, which comprises a memory and a processor, wherein the memory is used for storing a computer program, and the processor is used for calling and running the computer program from the memory, so that the device provided with the chip executes the method in the various possible implementation manners.
It can be understood that the above scenario is merely an example, and does not constitute a limitation on the application scenario of the technical solution provided by the embodiment of the present application, and the technical solution of the present application may also be applied to other scenarios. For example, as one of ordinary skill in the art can know, with the evolution of the system architecture and the appearance of new service scenarios, the technical solution provided by the embodiment of the present application is also applicable to similar technical problems.
The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
The steps in the method of the embodiment of the application can be sequentially adjusted, combined and deleted according to actual needs.
The units in the device of the embodiment of the application can be combined, divided and deleted according to actual needs.
In the present application, the same or similar term concept, technical solution and/or application scenario description will be generally described in detail only when first appearing and then repeatedly appearing, and for brevity, the description will not be repeated generally, and in understanding the present application technical solution and the like, reference may be made to the previous related detailed description thereof for the same or similar term concept, technical solution and/or application scenario description and the like which are not described in detail later.
In the present application, each embodiment is described with its own emphasis; for parts not detailed or described in one embodiment, reference may be made to the related descriptions of the other embodiments.
The technical features of the technical solutions of the present application can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, the combination shall be considered to fall within the scope of the description of the present application.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) as above, comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a controlled terminal, or a network device, etc.) to perform the method of each embodiment of the present application.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices. The computer instructions may be stored in a storage medium or transmitted from one storage medium to another storage medium, for example, from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line) or wireless (e.g., infrared, radio, microwave) means. The storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center that contains an integration of one or more available media. The available media may be magnetic media (e.g., floppy disks, storage disks, magnetic tape), optical media (e.g., DVD), or semiconductor media (e.g., a Solid State Disk (SSD)), and the like.
The foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the application, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.