WO2018117608A1

WO2018117608A1 - Electronic device, method for determining utterance intention of user thereof, and non-transitory computer-readable recording medium

Info

Publication number: WO2018117608A1
Application number: PCT/KR2017/015041
Authority: WO
Inventors: 이동현; 양해훈; 양희정; 이정섭; 전희식; 최형탁
Original assignee: Samsung Electronics Co., Ltd. (삼성전자 주식회사)
Priority date: 2016-12-20
Filing date: 2017-12-19
Publication date: 2018-06-28

Abstract

An electronic device, a method for determining an utterance intention of a user thereof, and a non-transitory computer-readable recording medium are provided. An electronic device according to an embodiment of the present disclosure may comprise: a microphone for receiving a user voice uttered by a user; and a processor for determining an utterance intention of a user on the basis of at least one word included in a user voice while the user voice is being input, providing response information corresponding to the determined utterance intention, and updating the response information while providing the response information, on the basis of an additional word uttered after the at least one word is input.

Description

Electronic device, method for determining user utterance intention and non-transitory computer readable recording medium

The present disclosure relates to an electronic device, a method of determining a user's utterance intention thereof, and a non-transitory computer-readable recording medium. More particularly, it relates to an electronic device capable of providing response information even before a user completes an utterance, a method of determining a user's utterance intention thereof, and a non-transitory computer-readable recording medium.

In addition, the present disclosure relates to an artificial intelligence (AI) system that simulates functions such as cognition and judgment of the human brain by using a machine learning algorithm such as deep learning, and an application thereof.

Recently, as functions of mobile devices, voice recognition devices, home network hub devices, and the like have been improved, the number of users using these devices has increased. In particular, such an electronic device provides a function of a virtual personal assistant (VPA) that recognizes a user's voice and provides corresponding information or performs an operation.

The existing virtual personal assistant starts speech recognition only after the user's utterance has ended. In addition, a plurality of voice inputs are required to execute an operation corresponding to the user's intention. Accordingly, the existing virtual personal assistant has been neglected by users because of its slow response time, since it is much more convenient for the user to execute the operation by means other than voice.

Meanwhile, such a virtual personal assistant may be implemented as an artificial intelligence system. An artificial intelligence system is a computer system that implements human-level intelligence. Unlike conventional rule-based smart systems, an artificial intelligence system is a system in which the machine learns, judges, and becomes smarter by itself. The more an artificial intelligence system is used, the more its recognition rate improves and the more accurately it understands the user's preferences, so existing rule-based smart systems are gradually being replaced by deep learning-based artificial intelligence systems.

Artificial intelligence technology consists of machine learning (e.g., deep learning) and element technologies that utilize machine learning.

Machine learning is an algorithm technology that classifies and learns the characteristics of input data by itself, and the element technologies are technologies that simulate functions of the human brain, such as cognition and judgment, by using machine learning algorithms such as deep learning. The element technologies cover technical areas such as linguistic understanding, visual understanding, inference/prediction, knowledge representation, and motion control.

The various fields in which artificial intelligence technology is applied are as follows. Linguistic understanding is a technology for recognizing, applying, and processing human language and characters, and includes natural language processing, machine translation, dialogue systems, question answering, and speech recognition/synthesis. Visual understanding is a technology for recognizing and processing objects as human vision does, and includes object recognition, object tracking, image retrieval, person recognition, scene understanding, spatial understanding, and image enhancement. Inference/prediction is a technology for judging information and logically inferring and predicting from it, and includes knowledge/probability-based inference, optimization prediction, preference-based planning, and recommendation. Knowledge representation is a technology for automatically processing human experience information into knowledge data, and includes knowledge construction (data generation/classification) and knowledge management (data utilization). Motion control is a technology for controlling the autonomous driving of a vehicle and the movement of a robot, and includes movement control (navigation, collision, driving) and operation control (action control).

The present disclosure is intended to solve the above-described problems, and its purpose is to provide an electronic device capable of providing a virtual personal assistant function that responds to the user's utterance in real time, a method of determining the user's utterance intention, and a non-transitory computer-readable recording medium.

To achieve the above object, an electronic device according to an embodiment of the present disclosure may include a microphone for receiving a user voice spoken by a user, and a processor configured to determine a speech intent of the user based on at least one word included in the user voice while the user voice is being input, provide response information corresponding to the determined speech intent, and, while the response information is being provided, update the response information based on an additional word spoken after the at least one word is input.

The processor may determine the reliability of a plurality of speech intents based on the input at least one word, and, when a speech intent having a reliability equal to or greater than a predetermined value is detected among the plurality of speech intents, may determine the detected speech intent as the user's speech intent.

The processor may initiate an execution preparation operation of an application for performing an operation corresponding to a speech intent having the highest reliability among the plurality of speech intents.

The electronic device may further include a display, and the processor may control the display to display an execution screen of an application for performing an operation corresponding to the detected speech intent when the speech intent having a reliability equal to or greater than the predetermined value is detected.

In addition, the processor may control the display to display a UI for inducing the user to speak additional information necessary to perform an operation corresponding to the detected speech intent.

The electronic device may further include a display. When the reliability of each of the determined plurality of speech intents is less than a preset value, the processor may control the display to display a list UI including the determined plurality of speech intents. When a user input selecting one of the displayed speech intents is received, the processor may provide response information corresponding to the selected speech intent.

When the speech intent newly determined based on the additional word differs from the speech intent determined based on the at least one word, the processor may update the provided response information so that response information corresponding to the newly determined speech intent is provided.

The electronic device may further include a display, and the processor may control the display to display the provided response information. The response information may include an entity name and an intention.

Meanwhile, to achieve the above object, a method of determining a user's utterance intention of an electronic device according to an embodiment of the present disclosure may include: receiving a user voice spoken by a user; determining a speech intent of the user based on at least one word included in the user voice while the user voice is being input; providing response information corresponding to the determined speech intent of the user; and, while the response information is being provided, updating the response information based on an additional word spoken after the at least one word is input.

The determining may include: determining the reliability of a plurality of speech intents based on the input at least one word; detecting a speech intent having a reliability equal to or greater than a predetermined value among the plurality of speech intents; and determining the detected speech intent as the speech intent of the user.

The method may further include initiating an execution preparation operation of an application for performing an operation corresponding to a speech intent having the highest reliability among the plurality of speech intents.

The method may further include displaying an execution screen of an application for performing an operation corresponding to the detected speech intent when the speech intent having the reliability equal to or greater than the preset value is detected.

The method may further include displaying a UI for inducing the user to utter additional information necessary to perform an operation corresponding to the detected speech intent.

The determining may further include displaying a list UI including the determined plurality of speech intents when the reliability of the determined plurality of speech intents is less than a preset value. When a user input for selecting one of the displayed utterance intentions is received, response information corresponding to the selected utterance intention may be provided.

The updating may include, when the utterance intention newly determined based on the additional word differs from the utterance intention determined based on the at least one word, updating the provided response information so that response information corresponding to the newly determined utterance intention is provided.

Meanwhile, to achieve the above object, a non-transitory computer-readable recording medium according to an embodiment of the present disclosure includes a program for executing a method of determining a user's utterance intention of an electronic device, the method including: receiving a user voice spoken by a user; determining a speech intent of the user based on at least one word included in the user voice while the user voice is being input; providing response information corresponding to the determined speech intent of the user; and, while the response information is being provided, updating the response information based on an additional word spoken after the at least one word is input.

Providing the response information may include displaying the response information. The response information may include the entity name and intention.

According to various embodiments of the present disclosure as described above, the response speed of the virtual personal assistant function may be improved, and an operation corresponding to a user utterance intention may be performed with a minimum conversation pattern.

FIG. 1 is a conceptual diagram illustrating a virtual personal assistant system according to an embodiment of the present disclosure;

FIG. 2 is a schematic block diagram illustrating a configuration of an electronic device according to an embodiment of the present disclosure;

FIG. 3 is a block diagram illustrating in detail a configuration of an electronic device according to an embodiment of the present disclosure;

FIG. 4 is a block diagram of a processor in accordance with some embodiments of the present disclosure;

FIG. 5A is a block diagram of a data learning unit according to some embodiments of the present disclosure;

FIG. 5B is a block diagram of a data recognizer according to some embodiments of the present disclosure;

FIG. 6 is a diagram illustrating an embodiment of providing / recognizing response information using reliability of a word representing intention;

FIG. 7 is a diagram illustrating a screen provided according to the embodiment of FIG. 6;

FIG. 8 is a diagram illustrating an embodiment in which response information is changed by recognizing an additional spoken voice of a user;

FIG. 9 is a diagram illustrating an embodiment of displaying a UI for selecting among a plurality of speech intents corresponding to an entity name;

FIG. 10 is a diagram illustrating a screen provided according to the embodiment of FIG. 9;

FIG. 11 is a diagram illustrating an embodiment of displaying a UI for inducing a user to utter additional information;

FIGS. 12 to 15 are flowcharts illustrating a method for determining user utterance intention of an electronic device according to various embodiments of the present disclosure;

FIG. 16 is a sequence diagram illustrating a method of building a data recognition model by a system including an electronic device and a server according to an embodiment of the present disclosure; and

FIG. 17 is a sequence diagram illustrating a method of recognizing data by a system including an electronic device and a server according to an embodiment of the present disclosure.

Hereinafter, exemplary embodiments of the present disclosure will be described in detail with reference to the accompanying drawings. In describing the present disclosure, when it is determined that a detailed description of a related known function or configuration may unnecessarily obscure the subject matter of the present disclosure, the detailed description thereof will be omitted. The terms to be described below are terms defined in consideration of functions in the present disclosure, and may vary according to a user, an operator, or a custom. Therefore, the definition should be made based on the contents throughout the specification.

Terms including ordinal numbers such as first and second may be used to describe various components, but the components are not limited by the terms. The terms are only used to distinguish one component from another. For example, without departing from the scope of the present invention, the first component may be referred to as the second component, and similarly, the second component may also be referred to as the first component. The term and / or includes any one of a plurality of related items or a combination of a plurality of related items.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit and/or restrict the invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as 'including' or 'having' are intended to indicate that a feature, number, step, operation, component, part, or combination thereof described in the specification exists, and should be understood not to exclude in advance the possibility of the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In an embodiment, a 'module' or 'unit' performs at least one function or operation, and may be implemented in hardware, in software, or in a combination of hardware and software. In addition, a plurality of 'modules' or a plurality of 'units' may be integrated into at least one module and implemented as at least one processor, except for 'modules' or 'units' that need to be implemented by specific hardware.

Hereinafter, the present disclosure will be described in detail with reference to the accompanying drawings.

FIG. 1 is a conceptual diagram illustrating a virtual personal assistant system 1000 according to an embodiment of the present disclosure. As shown in FIG. 1, the virtual personal assistant system 1000 may include an electronic device 100 and a server 200. The electronic device 100 and the server 200 may interwork to provide a virtual personal assistant function to a user.

As used herein, the term 'virtual personal assistant' refers to a software application that understands a user's language and performs instructions desired by the user through a combination of artificial intelligence and voice recognition technology. For example, a virtual personal assistant may perform artificial intelligence functions such as machine learning including deep learning, speech recognition, sentence analysis, and situational awareness. The virtual personal assistant may provide a personalized service for the individual by learning a user's habits or patterns. Examples of virtual personal assistants include S Voice and Bixby.

The electronic device 100 may be a mobile device such as a smartphone or a tablet PC, but this is only an example. The electronic device 100 may be implemented as any device capable of recognizing a user's voice and performing a corresponding operation, such as a voice recognition device, a hub of a home network, an electronic picture frame, a humanoid robot, an audio device, a navigation device, or a smart TV.

The electronic device 100 may recognize a user voice spoken by a user and understand a language. In addition, the electronic device 100 may manage a conversation with a user and generate a response.

The server 200 may provide information required when the electronic device 100 manages a conversation with a user and generates a response. In addition, the server 200 may provide and update a language model used by the electronic device 100.

As shown in the embodiment of FIG. 1, the electronic device 100 and the server 200 may provide a virtual personal assistant function in cooperation with each other, but the virtual personal assistant function may also be provided by the operation of the electronic device 100 alone. Alternatively, the electronic device 100 may be implemented merely as an input/output device for receiving a user's voice and providing response information, with the server 200 processing most of the virtual personal assistant function.

FIG. 2 is a schematic block diagram illustrating a configuration of an electronic device 100 according to an embodiment of the present disclosure. Referring to FIG. 2, the electronic device 100 may include a microphone 110 and a processor 120.

The microphone 110 may receive a user voice spoken by the user. For example, the microphone 110 may be formed integrally with the upper side, the front side, a lateral side, or the like of the electronic device 100, or may be provided as a separate unit and connected to the electronic device 100 through a wired or wireless interface.

In addition, the microphone 110 may include a plurality of microphones, and may generate a plurality of voice signals by receiving voices from different locations. Using the plurality of voice signals, the electronic device 100 may generate a single voice signal that is enhanced in a pre-processing process before performing the voice recognition function.
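As a minimal sketch of how several microphone signals might be merged into one enhanced signal, the following Python example averages time-aligned channels sample by sample. The disclosure does not specify the enhancement method; the averaging approach (a delay-and-sum combination with zero delays and equal weights) and the function name `combine_channels` are assumptions made purely for illustration.

```python
from typing import List, Sequence


def combine_channels(channels: Sequence[Sequence[float]]) -> List[float]:
    """Average time-aligned microphone channels into a single signal.

    Assumption: the channels are already time-aligned, so the speech
    component adds coherently while uncorrelated noise is attenuated.
    """
    if not channels:
        raise ValueError("at least one channel is required")
    # Use the shortest channel so every output sample has all inputs.
    length = min(len(channel) for channel in channels)
    count = len(channels)
    return [sum(channel[i] for channel in channels) / count for i in range(length)]
```

A practical device would typically estimate per-microphone delays before summing; that step is omitted here.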

The processor 120 may recognize the input user voice. The processor 120 may perform preprocessing on the input user voice before performing the voice recognition function. For example, preprocessing may include operations such as noise removal and feature extraction. The preprocessing process may be performed by the processor 120 or may be performed through a separate component.

The processor 120 may perform an operation corresponding to the determined utterance intention if the user's utterance intention can be determined even while the user is still speaking. In detail, the processor 120 may measure the reliability of the recognition result for the user voice spoken so far. If an intention having at least a predetermined reliability is found, the processor 120 may provide response information corresponding to that intention even before the user's utterance has ended.

In addition, the processor 120 may update the response information using the additional voice spoken after the user voice used to provide the response information. The processor 120 may newly determine a user's intention to speak based on the entire user's voice plus the added voice. If it is determined that the intention is the same as the determined intention of the user, the processor 120 may provide more accurate response information. In contrast, if it is determined that the user's intention is different from the determined intention, the processor 120 may replace the response information provided with the response information corresponding to the newly determined intention.
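The refine-or-replace behavior described above can be sketched as follows. This is only an illustration under stated assumptions: the `Response` type, the `update_response` function, and the keyword-based `toy_classifier` are hypothetical names that do not appear in the disclosure, and a real device would use its learned models to re-determine the intention.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class Response:
    intent: str              # intention the response was built for
    words: Tuple[str, ...]   # all words recognized so far


def update_response(
    current: Response,
    additional_words: Iterable[str],
    classify: Callable[[Tuple[str, ...]], str],
) -> Tuple[Response, str]:
    """Re-determine the intention from the entire utterance so far.

    Returns the updated response and "refine" when the intention is
    unchanged (more precise information) or "replace" when it differs.
    """
    all_words = current.words + tuple(additional_words)
    new_intent = classify(all_words)
    action = "refine" if new_intent == current.intent else "replace"
    return Response(new_intent, all_words), action


def toy_classifier(words: Tuple[str, ...]) -> str:
    """Hypothetical stand-in for a learned intent model."""
    return "show_weather" if "weather" in words else "show_map"
```

For instance, starting from `Response("show_map", ("seocho-gu",))`, adding `["gangnam-daero"]` yields a "refine" action, while adding `["weather", "tomorrow"]` yields a "replace" action.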

For example, by using the result of recognizing the additional voice, the processor 120 may provide more precise response information corresponding to the determined intention of the user. For example, when it has been determined that the user intends to be shown a map of 'Seoul-si, Seoul', and the processor 120 then recognizes the additional user voice 'Gangnam-daero', the processor 120 may provide a map of higher resolution (that is, a map magnified only around Gangnam-daero).

As another example, using the result of recognizing the additional voice, the processor 120 may replace the provided response information with response information corresponding to the newly determined intention of the user. For example, when it has been determined that the user intends to be shown a map of Seocho-gu, Seoul, and the processor 120 then recognizes the additional user voice 'weather tomorrow', the processor 120 may provide an execution screen of an application that provides weather information instead of the map information already provided.

The processor 120 may induce the user to speak additional information necessary to perform an operation corresponding to the determined speech intent. By allowing the user to utter all the information necessary to perform the operation within a small number of talk turns, the processor 120 may prevent the occurrence of additional talk turns and increase the response speed.

For example, when it is determined that the user's utterance intention is to set an alarm, the processor 120 may provide a screen for inducing the user to utter information necessary for setting the alarm, such as the alarm time and whether the alarm repeats.
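One way to picture the inducement of additional information is as a check for required items that the user has not yet spoken. The sketch below assumes a hypothetical table `REQUIRED_SLOTS` and hypothetical item names ("time", "repeat"); the disclosure only states that the alarm time and repetition are examples of necessary information.

```python
from typing import Dict, List, Optional

# Hypothetical table of the information each intention needs.
REQUIRED_SLOTS: Dict[str, tuple] = {
    "set_alarm": ("time", "repeat"),
}


def missing_slots(intent: str, filled: Dict[str, str]) -> List[str]:
    """List required items that the user has not provided yet."""
    return [slot for slot in REQUIRED_SLOTS.get(intent, ()) if slot not in filled]


def prompt_for(intent: str, filled: Dict[str, str]) -> Optional[str]:
    """Build guidance text for the UI, or None when nothing is missing."""
    missing = missing_slots(intent, filled)
    if not missing:
        return None
    return "Please say: " + ", ".join(missing)
```

Prompting for every missing item at once, rather than one item per turn, reflects the stated goal of keeping the number of talk turns small.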

FIG. 3 is a block diagram illustrating in detail a configuration of an electronic device 100 according to an exemplary embodiment. Referring to FIG. 3, the electronic device 100 may include a microphone 110, a processor 120, a display 130, an input unit 140, a communication unit 150, a memory 160, and a speaker 170. In addition to the components illustrated in the embodiment of FIG. 3, the electronic device 100 may include various components such as an image receiving unit (not shown), an image processing unit (not shown), a power supply unit (not shown), and a wired interface (not shown). In addition, the electronic device 100 does not necessarily include all of the components shown in FIG. 3.

The microphone 110 is implemented in various forms to perform a function of receiving a user's voice. The microphone 110 may include various acoustic filters to remove noise.

The display 130 may display various image contents, information, UI, etc. provided from the electronic device 100. For example, the display 130 may display a response information providing screen corresponding to a user voice.

The display 130 may be implemented as a liquid crystal display (LCD), an organic light emitting diode (OLED) display, a plasma display panel (PDP), or the like, and may display the various screens that the electronic device 100 can provide.

The display 130 may display an image corresponding to the voice recognition result of the processor 120. For example, the display 130 may display response information corresponding to a speech intent determined by the user's voice as text or an image. As another example, the display 130 may display a UI for guiding additional information required for an operation corresponding to the intention of speaking. In addition, the display 130 may display a UI for displaying a plurality of speech intent lists, text indicating a user voice recognized to date, an application execution screen for performing an operation corresponding to speech intent, and the like.

The input unit 140 receives various user commands for controlling the electronic device 100. For example, the input unit 140 may receive a user command for selecting one of a plurality of speech intents displayed on the UI. The input unit 140 may be implemented as a button, a motion recognition device, a touch pad, or the like. In the embodiment of FIG. 3, the microphone 110 performs the voice input function, and the input unit 140 may receive user commands other than voice input. In addition, when the input unit 140 is implemented as a touch pad, the input unit 140 may be implemented in the form of a touch screen combined with the display 130 to form a mutual layer structure. The touch screen may detect the position, area, and pressure of a touch input.

The communication unit 150 communicates with an external device. For example, the external device may be implemented as a server 200, a cloud storage, a network, or the like. The communication unit 150 may transmit a voice recognition result to an external device and receive corresponding information from the external device. The communication unit 150 may receive a language model for speech recognition from an external device.

To this end, the communication unit 150 may include various communication modules such as a short range wireless communication module (not shown), a wireless communication module (not shown), and the like. Here, the short range wireless communication module is a module for communicating with an external device located in a short range according to a short range wireless communication scheme such as Bluetooth, Zigbee, or the like. In addition, the wireless communication module is a module that is connected to an external network and performs communication according to a wireless communication protocol such as WiFi, WiFi direct, or IEEE. In addition, the wireless communication module may further include a mobile communication module that connects to a mobile communication network and performs communication according to various mobile communication standards such as 3G (3rd Generation), 3GPP (3rd Generation Partnership Project), LTE (Long Term Evolution), and LTE-A (LTE Advanced).

The memory 160 may store various modules, software, and data for driving the electronic device 100. For example, the memory 160 may store an acoustic model (AM) and a language model (LM) that may be used to recognize a user's voice.

The memory 160 is a storage medium that stores various programs necessary for operating the electronic device 100. The memory 160 may be implemented in the form of a flash memory, a hard disk drive (HDD), a solid state drive (SSD), or the like. For example, the memory 160 may include a ROM for storing a program for performing an operation of the electronic device 100 and a RAM for temporarily storing data for performing an operation of the electronic device 100.

The memory 160 may store programs and data for configuring various screens to be displayed on the display 130. In addition, the memory 160 may store a program, an application, and data for performing a specific service. For example, the memory 160 may store a map application, a transportation reservation application, a music application, a schedule management application, and the like.

The memory 160 may store in advance various response messages corresponding to the user's voice as voice or text data. The electronic device 100 may read at least one of the voice and text data corresponding to the received user voice (in particular, a user control command) from the memory 160 and output it through the display 130 or the speaker 170.

The speaker 170 may output voice. For example, the speaker 170 may output not only various audio data but also a notification sound or a voice message. The electronic device 100 according to an embodiment of the present disclosure may include a speaker 170 as an output unit for providing an interactive voice recognition function. Through the speaker, the electronic device 100 may provide the user with a user experience that is like talking with the electronic device 100. The speaker 170 may be built in the electronic device 100 or may be implemented in the form of an output port such as a jack.

The processor 120 may control the above-described components of the electronic device 100. For example, the processor 120 may control the display 130 to display an execution screen of an application that performs an operation corresponding to the determined speech intent of the user.

The processor 120 may be implemented as a single CPU to perform a voice recognition operation, a language understanding operation, a conversation management operation, a response generation operation, or the like, or may be implemented with a plurality of processors and IPs performing specific functions. The processor 120 may perform speech recognition based on a traditional hidden Markov model (HMM), or may perform deep learning-based speech recognition using, for example, a deep neural network (DNN).

In addition, the processor 120 may use big data and user-specific history data for speech recognition and reliability measurement. In this manner, the processor 120 may use the speech recognition model learned from the big data and personalize the speech recognition model. For example, the processor 120 may determine the reliability of the entity name using the learned acoustic model AM, and determine the reliability of the intention using the learned language model LM.

The processor 120 may recognize a user's voice in real time. In addition, the processor 120 may determine the user's intention of speaking using the intermediate recognition result recognized to date. For example, the processor 120 may determine the user's intention of speaking based on at least one word (core word) included in the user's voice.

Subsequently, the processor 120 may perform an operation corresponding to the determined speech intent. For example, the processor 120 may provide response information corresponding to the determined speech intent. As another example, the processor 120 may execute an application for performing an operation corresponding to the determined speech intent.

In addition, the processor 120 may update the response information based on the user voice spoken after the intermediate recognition of the user voice has been performed. That is, the processor 120 may recognize the additionally input user voice together with the previously input user voice while the operation corresponding to the speech intent is being performed (for example, while the response information is being provided). Therefore, the processor 120 may determine whether the speech intent determined through the intermediate recognition is correct.

For example, the processor 120 may update the response information based on the additional words spoken after the at least one word is input while the response information is provided. When the user's speech intent determined based on the additional word matches the determined speech intent, the processor 120 may provide more precise information. On the contrary, when the intention of speech does not match, the processor 120 may provide response information corresponding to the intention of the user, which is determined based on additional words instead of the existing response information.

The processor 120 may recognize the voice of the user in real time and select a plurality of candidate speech intentions from the recognized user voice. If one of the candidate speech intentions has a value equal to or greater than the predetermined reliability, the processor 120 may determine that the intention having a value equal to or greater than the predetermined reliability is the user's intention to speak. By monitoring in real time whether the user's speech intent can be determined using only the recognized user voice, the processor 120 may reduce the time required to respond to the user.

The processor 120 may statistically determine what information the user wants to search or what the user wants to execute when a specific word is input, using the big data and the voice data received and stored from the user. Reliability is the quantification of these statistical judgments. For example, the processor 120 may determine the reliability of the entity name using the learned acoustic model AM, and determine the reliability of the intention using the learned language model LM.
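One simple way to quantify such a statistical judgment is a relative frequency computed over stored history. The sketch below is only an assumption for illustration: the disclosure states that reliability is derived using the learned acoustic model AM and language model LM and does not specify this estimator. The function name `intent_reliability` and the `(word, intent)` history format are hypothetical.

```python
from collections import Counter


def intent_reliability(history, word):
    """Estimate, for one word, the share of past utterances per intent.

    history: iterable of (word, intent) pairs (hypothetical format).
    Returns a dict mapping intent -> relative frequency, or {} if unseen.
    """
    counts = Counter(intent for w, intent in history if w == word)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {intent: n / total for intent, n in counts.items()}
```

For example, if a word was followed by a restaurant search in three of four stored utterances, this sketch would assign that intent a reliability of 0.75.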

In order to provide a quick response, the processor 120 may prepare in advance an operation corresponding to the user's speech intent. The processor 120 may initiate an execution preparation operation of an application for performing an operation corresponding to the speech intent having the highest reliability among the plurality of candidate speech intents. For example, if the speech intent having the highest reliability is to search the map for the location of an office located in Seocho-gu, Seoul, the processor 120 may execute a map application or activate the GPS function of the electronic device 100.

If the reliability of one of the candidate speech intentions is greater than or equal to a predetermined value, the processor 120 may determine the speech intention as the user's speech intent. In addition, the processor 120 may control the display 130 to display an execution screen of an application for performing an operation corresponding to the determined speech intent.

In addition, when there is additional information necessary to perform an operation corresponding to the determined speech intent, the processor 120 may control the display 130 to display a UI that prompts the user to utter a voice including the additional information. Through this, the processor 120 may prevent additional talk turns from occurring, and may induce the user to utter all the information in the current talk turn.

If the reliability of all of the plurality of candidate speech intents is less than a predetermined value, the processor 120 may display a UI including the plurality of candidate speech intents so that the user may directly select a speech intent. In addition, the processor 120 may perform an operation corresponding to the speech intent selected by the user. Such an embodiment may be particularly useful when only a simple entity name, rather than a sentence from which an intent can be determined, is recognized from the user's voice.
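The choice between executing the most reliable candidate and asking the user to pick one could be sketched as follows. The function name `decide`, the returned tuples, and the default threshold of 0.8 (borrowed from the embodiment of FIG. 6) are assumptions made only for this illustration.

```python
def decide(candidates, threshold=0.8):
    """Pick an action from candidate intents and their reliabilities.

    candidates: dict mapping intent name -> reliability in [0, 1].
    Returns ("execute", intent) if the best candidate reaches the threshold,
    otherwise ("ask_user", intents ranked from most to least reliable).
    """
    ranked = sorted(candidates, key=candidates.get, reverse=True)
    if ranked and candidates[ranked[0]] >= threshold:
        return ("execute", ranked[0])
    return ("ask_user", ranked)
```

Returning the ranked list in the second case lets a UI present the candidates in order of reliability, which is one possible design and not a requirement of the disclosure.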

More specific operation of the processor 120 will be described below with reference to the accompanying drawings.

FIG. 4 is a block diagram of a processor 120 in accordance with some embodiments of the present disclosure. Referring to FIG. 4, a processor 120 according to some embodiments may include a data learner 121 and a data recognizer 122. The processor 120 may be included in the electronic device 100 or included in the server 200.

According to an exemplary embodiment, at least a part of the data learner 121 and at least a part of the data recognizer 122 may be implemented as a software module, or may be manufactured in the form of a hardware chip and mounted on the electronic device 100 or the server 200.

The data learner 121 may learn criteria for speech recognition, language understanding, and determination of the user's speech intent. The processor 120 may analyze the input user voice according to the learned criteria to determine the user's speech intent, and generate corresponding response information. The data learner 121 may determine what data to use to recognize the user's voice. In addition, the data learner 121 may determine what data to use to understand the recognized user voice and determine the user's intent. The data learner 121 may acquire data to be used for learning and apply the acquired data to a data recognition model, which will be described later, to learn criteria for speech recognition and for determining the user's speech intent. In more detail, the data learner 121 may acquire data to be used for learning from another external server or electronic device.

The data recognizer 122 may recognize a situation from predetermined data by using the learned data recognition model. The data recognizer 122 may obtain predetermined data according to a criterion preset by learning, and use the data recognition model with the acquired data as an input value. For example, the data recognizer 122 may recognize the input user voice by using the learned acoustic model and language model. The data recognizer 122 may determine the user's speech intent based on the recognized user voice. The data recognizer 122 may update the data recognition model by using, as input values again, the data acquired as the speech recognition and speech intent result values for each user. As such, the data recognizer 122 may use big data and user-specific history data to measure the reliability of speech recognition and of the speech intent. The processor 120 may personalize the speech recognition model while using the speech recognition model learned from the big data.

At least one of the data learner 121 and the data recognizer 122 may be manufactured in the form of one or a plurality of hardware chips and mounted on the electronic device 100. For example, at least one of the data learner 121 and the data recognizer 122 may be manufactured in the form of a dedicated hardware chip for artificial intelligence (AI), or may be manufactured as a part of a conventional general-purpose processor (e.g., a CPU or an application processor) or a graphics-dedicated processor (e.g., a GPU) and mounted on the aforementioned various electronic devices 100. Here, the dedicated hardware chip for artificial intelligence is a dedicated processor specialized in probability calculation; it has higher parallel processing performance than a conventional general-purpose processor and is therefore suited to processing arithmetic tasks in the field of artificial intelligence, such as machine learning.

In the embodiment of FIG. 4, the data learner 121 and the data recognizer 122 are both mounted on the electronic device 100, but they may be mounted on separate devices. For example, one of the data learner 121 and the data recognizer 122 may be included in the electronic device 100, and the other may be included in the server 200. In addition, the data learner 121 and the data recognizer 122 may be connected to each other by wire or wirelessly, so that model information constructed by the data learner 121 may be provided to the data recognizer 122, and the data input to the data recognizer 122 may be provided to the data learner 121 as additional learning data.

For example, the electronic device 100 may include the data recognizer 122, and the external server 200 may include the data learner 121. The server 200 may learn a criterion for determining the intention of the user, and the electronic device 100 may determine the intention of the voice spoken by the user based on the learning result by the server 200.

The data learner 121 of the server 200 may learn a criterion about what data is used to determine the user intention and how to determine the user intention using the data. The data learner 121 acquires data to be used for learning and applies the acquired data to a data recognition model to be described later, thereby learning a criterion for determining user intention.

However, this is only an example, and the electronic device 100 may include the data learner 121, and an external device such as a server may include the data recognizer 122.

Meanwhile, at least one of the data learner 121 and the data recognizer 122 may be implemented as a software module. When at least one of the data learner 121 and the data recognizer 122 is implemented as a software module (or a program module including instructions), the software module may be stored in a non-transitory computer readable recording medium. At least one software module may be provided by an operating system (OS) or by a predetermined application. Alternatively, some of the at least one software module may be provided by the OS, and some of the at least one software module may be provided by a predetermined application.

FIG. 5A is a block diagram of a data learner 121 according to some embodiments of the present disclosure. Referring to FIG. 5A, the data learner 121 according to an exemplary embodiment may include a data acquirer 121-1, a preprocessor 121-2, a training data selector 121-3, a model learner 121-4, and a model evaluator 121-5.

The data acquirer 121-1 may acquire data necessary for determining a situation. For example, the data acquirer 121-1 may obtain voice data by converting a user voice signal input through the microphone 110 into a digital signal. The data acquirer 121-1 may receive learning voice data from the server 200 or a network such as the Internet.

The preprocessor 121-2 may preprocess the acquired data so that the acquired data can be used in learning for situation determination. The preprocessor 121-2 may process the acquired data into a predetermined format so that the model learner 121-4, which will be described later, can use the acquired data in learning for situation determination.

For example, the preprocessor 121-2 may extract a section that is a recognition target for the input user voice. The preprocessor 121-2 may generate voice data by performing noise removal, feature extraction, and the like.

As another example, the preprocessor 121-2 may generate voice data to be suitable for speech recognition by analyzing a frequency component of an input user voice, reinforcing some frequency components, and suppressing other frequency components.
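The disclosure does not name a specific filter for reinforcing some frequency components and suppressing others. One widely used technique of this kind in speech front ends is a first-order pre-emphasis filter, y[n] = x[n] - alpha * x[n-1], which attenuates slowly varying (low-frequency) content relative to rapidly varying (high-frequency) content. The following is a minimal sketch under that assumption; the function name `pre_emphasis` and the coefficient 0.97 are illustrative choices, not values taken from the disclosure.

```python
def pre_emphasis(samples, alpha=0.97):
    """Apply y[n] = x[n] - alpha * x[n-1] to a list of audio samples.

    The first sample is passed through unchanged because it has no
    predecessor.
    """
    if not samples:
        return []
    out = [samples[0]]
    for prev, cur in zip(samples, samples[1:]):
        out.append(cur - alpha * prev)
    return out
```

With this filter, a constant signal is reduced to values near zero after the first sample, while a signal that alternates in sign is amplified.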

The training data selector 121-3 may select data necessary for learning from the preprocessed data. The selected data may be provided to the model learner 121-4. The training data selector 121-3 may select data required for learning from the preprocessed data according to a predetermined criterion for situation determination. In addition, the training data selector 121-3 may select data according to a criterion preset through learning by the model learner 121-4, which will be described later.

For example, in the early stage of learning, the training data selector 121-3 may remove voice data having high similarity from among the preprocessed voice data. That is, for initial learning, the training data selector 121-3 may select voice data having low similarity so that criteria that are easy to distinguish are learned first.
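As a hedged illustration of removing highly similar voice data, the sketch below greedily keeps a feature vector only if its cosine similarity to every vector already kept is below a limit. The use of cosine similarity, the limit of 0.95, and the names `cosine` and `select_dissimilar` are assumptions made for this example; the disclosure does not specify a similarity measure. The sketch also assumes non-zero feature vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_dissimilar(features, max_similarity=0.95):
    """Keep each vector only if it is not too similar to any kept vector."""
    kept = []
    for f in features:
        if all(cosine(f, k) < max_similarity for k in kept):
            kept.append(f)
    return kept
```

Because the selection is greedy, the result depends on the order of the input; a different ordering may keep a different representative of each group of similar samples.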

As another example, the training data selector 121-3 may select only voice data spoken in a specific language. Since speech characteristics differ from language to language, by selecting a set of voice data spoken in a specific language, the training data selector 121-3 may allow the model learner 121-4 to learn a criterion suitable for the selected language.

On the contrary, the training data selector 121-3 may select voice data in which characteristics of each language are reflected. Through this, the model learner 121-4 may learn a criterion for determining which language the voice data corresponds to.

For example, the training data selector 121-3 may select only voice data of a specific user so that the model learner 121-4 learns a criterion for speaker-dependent or speaker-adapted recognition.

In addition, the training data selector 121-3 may select preprocessed voice data that satisfies one of the criteria preset by learning. In this way, the model learner 121-4 may learn a criterion different from the previously learned criterion.

The model learner 121-4 may learn a criterion about how to determine a situation based on the training data. In addition, the model learner 121-4 may learn a criterion about what learning data should be used for situation determination.

For example, the model learner 121-4 may learn physical features that distinguish phonemes, syllables, vowels, etc. by comparing the plurality of voice data. In this way, the model learner 121-4 may construct an acoustic model AM for classifying sound units such as phonemes. In addition, the model learner 121-4 may learn a word or lexical use by comparing a plurality of voice data. Through this, the model learning unit 121-4 may build a language model LM.

The model learner 121-4 may train the data recognition model used for situation determination using the training data. In this case, the data recognition model may be a pre-built model. For example, the data recognition model may be a model built in advance by receiving basic training data (e.g., sample voice data). As another example, the data recognition model may be an acoustic model AM or a language model LM pre-built using big data. The model learner 121-4 may also learn the voice data of a specific user to develop the speaker-independent pre-built acoustic model AM or language model LM into a personalized acoustic model AM or language model LM.

The data recognition model may be constructed in consideration of the application field of the recognition model, the purpose of learning, or the computing performance of the device. The data recognition model may be designed to simulate the structure of the human brain on a computer. The data recognition model may include a plurality of weighted network nodes that simulate neurons in a human neural network. The plurality of network nodes may form connection relationships with one another so as to simulate the synaptic activity of neurons sending and receiving signals through synapses. The data recognition model may include, for example, a neural network model or a deep learning model developed from the neural network model. In the deep learning model, a plurality of network nodes may be located at different depths (or layers) and exchange data according to a convolutional connection relationship. The data recognition model may include, for example, a model such as a deep neural network (DNN), a recurrent neural network (RNN), or a bidirectional recurrent deep neural network (BRDNN), but is not limited to these examples.

According to the present disclosure, when there are a plurality of pre-built data recognition models, the model learner 121-4 may determine a data recognition model whose basic training data is highly relevant to the input training data as the data recognition model to be trained. In this case, the basic training data may be pre-classified for each type of data, and the data recognition model may be built in advance for each type of data. For example, the basic training data may be pre-classified based on various criteria such as the region where the training data was generated, the time at which the training data was generated, the size of the training data, the genre of the training data, the creator of the training data, and the types of objects in the training data.

In addition, the model learner 121-4 may train the data recognition model using, for example, a learning algorithm including an error back-propagation method or a gradient descent method.
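To make the idea of gradient descent concrete, the following toy sketch fits a single weight w of the model y = w * x by repeatedly stepping against the gradient of the mean squared error. It is deliberately much simpler than the data recognition model of the disclosure: the one-weight linear model, the learning rate, the step count, and the name `fit_weight` are all assumptions chosen only to keep the example short.

```python
def fit_weight(xs, ys, learning_rate=0.05, steps=200):
    """Fit w in y = w * x by gradient descent on the mean squared error."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # d/dw of mean((w*x - y)^2) is 2 * mean((w*x - y) * x).
        grad = 2.0 * sum((w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad
    return w
```

Error back-propagation applies the same update rule to every weight of a multi-layer network, using the chain rule to obtain each weight's gradient.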

For example, the model learner 121-4 may train the data recognition model through supervised learning using the training data as an input value. As another example, the model learner 121-4 may train the data recognition model through unsupervised learning, which discovers a criterion for situation determination by learning, without separate supervision, the kind of data necessary for situation determination. As another example, the model learner 121-4 may train the data recognition model through reinforcement learning using feedback on whether the result of the situation determination according to the learning is correct.

In addition, when the data recognition model is trained, the model learner 121-4 may store the trained data recognition model. In this case, the model learner 121-4 may store the learned data recognition model in the memory 160 of the electronic device 100. Alternatively, the model learner 121-4 may store the learned data recognition model in a memory of the server 200 connected to the electronic device 100 through a wired or wireless network.

In this case, the memory 160 in which the learned data recognition model is stored may also store commands or data related to at least one other element of the electronic device 100. The memory 160 may also store software and / or programs. For example, the program may include a kernel, middleware, an application programming interface (API) and / or an application program (or “application”), and the like.

The model evaluator 121-5 may input evaluation data into the data recognition model and, if the recognition result output for the evaluation data does not satisfy a predetermined criterion, may cause the model learner 121-4 to learn again. In this case, the evaluation data may be preset data for evaluating the data recognition model.

In the initial recognition model construction step, the evaluation data may be voice data including phonemes with clearly different physical characteristics. The evaluation data may then be replaced with voice data sets of gradually increasing similarity. Through this, the model evaluator 121-5 may gradually verify the performance of the data recognition model.

For example, the model evaluator 121-5 may evaluate that the predetermined criterion is not satisfied when the number or ratio of evaluation data for which the recognition result is inaccurate, among the recognition results of the trained data recognition model for the evaluation data, exceeds a preset threshold. For example, when the predetermined criterion is defined as a ratio of 2% and the trained data recognition model outputs incorrect recognition results for more than 20 of a total of 1000 evaluation data, the model evaluator 121-5 may determine that the trained data recognition model is not suitable.
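The 2% example above reduces to a single comparison, sketched here in Python. The function name `needs_retraining` and its parameters are hypothetical; only the 2% ratio and the figures of 20 and 1000 come from the description.

```python
def needs_retraining(num_wrong, num_total, max_error_ratio=0.02):
    """Return True when the share of wrong results exceeds the allowed ratio."""
    return num_wrong / num_total > max_error_ratio
```

With 1000 evaluation data, 20 incorrect results sit exactly at the 2% limit and pass, while 21 incorrect results exceed it and trigger relearning.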

On the other hand, when there are a plurality of trained data recognition models, the model evaluator 121-5 may evaluate whether each trained data recognition model satisfies the predetermined criterion, and may determine a model satisfying the predetermined criterion as the final data recognition model. In this case, when a plurality of models satisfy the predetermined criterion, the model evaluator 121-5 may determine, as the final data recognition model, any one model or a preset number of models in descending order of evaluation score.
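Selecting the final model or models from several trained candidates could be sketched as follows. The name `pick_final_models`, the use of a minimum score as the predetermined criterion, and the `top_k` parameter are assumptions introduced for illustration only.

```python
def pick_final_models(scores, min_score, top_k=1):
    """Return up to top_k model names that meet min_score, best first.

    scores: dict mapping model name -> evaluation score (higher is better).
    """
    passing = {name: s for name, s in scores.items() if s >= min_score}
    return sorted(passing, key=passing.get, reverse=True)[:top_k]
```

An empty list in this sketch would mean that no candidate satisfies the criterion, in which case relearning would be the natural next step.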

Meanwhile, at least one of the data acquirer 121-1, the preprocessor 121-2, the training data selector 121-3, the model learner 121-4, and the model evaluator 121-5 in the data learner 121 may be manufactured in the form of at least one hardware chip and mounted on the electronic device. For example, at least one of the data acquirer 121-1, the preprocessor 121-2, the training data selector 121-3, the model learner 121-4, and the model evaluator 121-5 may be manufactured in the form of a dedicated hardware chip for artificial intelligence (AI), or may be manufactured as a part of an existing general-purpose processor (e.g., a CPU or an application processor) or an IP for a specific function, and mounted on the aforementioned various electronic devices 100.

In addition, the data acquirer 121-1, the preprocessor 121-2, the training data selector 121-3, the model learner 121-4, and the model evaluator 121-5 may be mounted on one electronic device, or may be mounted on separate electronic devices, respectively. For example, some of the data acquirer 121-1, the preprocessor 121-2, the training data selector 121-3, the model learner 121-4, and the model evaluator 121-5 may be included in the electronic device 100, and the others may be included in the server 200.

Meanwhile, at least one of the data acquirer 121-1, the preprocessor 121-2, the training data selector 121-3, the model learner 121-4, and the model evaluator 121-5 may be implemented as a software module. When at least one of the data acquirer 121-1, the preprocessor 121-2, the training data selector 121-3, the model learner 121-4, and the model evaluator 121-5 is implemented as a software module (or a program module including instructions), the software module may be stored in a non-transitory computer readable recording medium. At least one software module may be provided by an operating system (OS) or by a predetermined application. Alternatively, some of the at least one software module may be provided by the OS, and the others may be provided by a predetermined application.

FIG. 5B is a block diagram of a data recognizer 122 according to some embodiments of the present disclosure.

Referring to FIG. 5B, the data recognizer 122 according to some embodiments may include a data acquirer 122-1, a preprocessor 122-2, a recognition data selector 122-3, a recognition result provider 122-4, and a model updater 122-5.

The data acquirer 122-1 may acquire data necessary for situation determination, and the preprocessor 122-2 may preprocess the acquired data so that the acquired data may be used for situation determination. The preprocessing unit 122-2 may process the acquired data into a predetermined format so that the recognition result providing unit 122-4 to be described later can use the obtained data for determining the situation.

The recognition data selector 122-3 may select data required for situation determination from among the preprocessed data. The selected data may be provided to the recognition result provider 122-4. The recognition data selector 122-3 may select some or all of the preprocessed data according to a predetermined criterion for situation determination. In addition, the recognition data selector 122-3 may select data according to a criterion preset through learning by the model learner 121-4 described above.

The recognition result provider 122-4 may determine the situation by applying the selected data to the data recognition model. The recognition result providing unit 122-4 may provide a recognition result according to the recognition purpose of the data. The recognition result providing unit 122-4 may apply the selected data to the data recognition model by using the data selected by the recognition data selecting unit 122-3 as an input value. In addition, the recognition result may be determined by the data recognition model.

For example, the recognition result providing unit 122-4 may recognize the input user utterance according to a division criterion determined in the data recognition model. The processor 120 may determine the user's intention to speak based on the recognized user voice. As another example, the recognition result provider 122-4 may recognize a key word from a user speech input using a data recognition model. Based on the recognized key word, the processor 120 may perform an operation corresponding to the user's intention of speaking. In addition, the processor 120 may induce a user to utter a keyword including additional information necessary to perform an operation.

The model updater 122-5 may cause the data recognition model to be updated based on the evaluation of the recognition result provided by the recognition result provider 122-4. For example, the model updater 122-5 may provide the recognition result provided by the recognition result provider 122-4 to the model learner 121-4 so that the model learner 121-4 updates the data recognition model.

Meanwhile, at least one of the data acquirer 122-1, the preprocessor 122-2, the recognition data selector 122-3, the recognition result provider 122-4, and the model updater 122-5 in the data recognizer 122 may be manufactured in the form of at least one hardware chip and mounted on the electronic device. For example, at least one of the data acquirer 122-1, the preprocessor 122-2, the recognition data selector 122-3, the recognition result provider 122-4, and the model updater 122-5 may be manufactured in the form of a dedicated hardware chip for artificial intelligence (AI), or may be manufactured as a part of an existing general-purpose processor (e.g., a CPU or an application processor) or an IP for a specific function, and mounted on the aforementioned various electronic devices 100.

In addition, the data acquirer 122-1, the preprocessor 122-2, the recognition data selector 122-3, the recognition result provider 122-4, and the model updater 122-5 may be mounted on one electronic device, or may be mounted on separate electronic devices, respectively. For example, some of the data acquirer 122-1, the preprocessor 122-2, the recognition data selector 122-3, the recognition result provider 122-4, and the model updater 122-5 may be included in the electronic device 100, and the others may be included in the server 200.

Meanwhile, at least one of the data acquirer 122-1, the preprocessor 122-2, the recognition data selector 122-3, the recognition result provider 122-4, and the model updater 122-5 may be implemented as a software module. When at least one of the data acquirer 122-1, the preprocessor 122-2, the recognition data selector 122-3, the recognition result provider 122-4, and the model updater 122-5 is implemented as a software module (or a program module including instructions), the software module may be stored in a non-transitory computer readable recording medium. At least one software module may be provided by an operating system (OS) or by a predetermined application. Alternatively, some of the at least one software module may be provided by the OS, and the others may be provided by a predetermined application.

Hereinafter, the operation of the processor 120 will be described in more detail with reference to the accompanying drawings.

According to an embodiment of the present disclosure, the processor 120 may determine the user's speech intent based on at least one word included in the user voice while the user voice is being input. In detail, the at least one word included in the user voice may include a word representing an intent and a word representing an entity name (slot). An entity name is a word that conveys information such as place, type, time, origin, and destination. For example, in the user voice "I'm hungry. Is there a good steak house near Seoul Station?", 'hungry' may be classified as a word representing an intent. Also, words such as 'Seoul Station' and 'steak' may be classified as words representing entity names.

The processor 120 may determine an operation corresponding to the user's voice based on the reliability of each word representing the intention and the entity name. For example, if the reliability of the words representing the intention and the entity name are both less than the predetermined value, the processor 120 may wait for additional input of the user's voice.

If the reliability of a particular intention is greater than or equal to a predetermined value, the processor 120 may start preparing to execute an operation corresponding to that intention. In addition, the processor 120 may control the display 130 to display an entity name additionally required to execute the operation.

If the confidence level for the specific entity name is greater than or equal to a preset value, the processor 120 may control the display 130 to display a plurality of candidate intentions related to the specific entity name.

When the response information is displayed on the display 130 because the reliability of the intention and the entity name is greater than or equal to a predetermined value, the processor 120 may update the response information by using the additional spoken user voice. If there is no change in the reliability value, the processor 120 may maintain the response information currently displayed on the display 130. Conversely, if there is a change in the reliability value, the processor 120 may update the response information currently displayed on the display 130. That is, the processor 120 may control the display 130 to display updated response information.
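The four cases described above (neither reliability sufficient, only the intent sufficient, only the entity name sufficient, and both sufficient) can be summarized in a small decision function. This is a sketch under stated assumptions: a single shared threshold of 0.8 is used for simplicity, and the function name and returned action labels are hypothetical.

```python
def choose_action(intent_reliability, entity_reliability, threshold=0.8):
    """Map intent and entity-name reliabilities to one of four actions."""
    intent_ok = intent_reliability >= threshold
    entity_ok = entity_reliability >= threshold
    if intent_ok and entity_ok:
        # Both known: display (and later update) response information.
        return "show_response"
    if intent_ok:
        # Intent known: start preparing and show which entity name is needed.
        return "prepare_and_request_entity"
    if entity_ok:
        # Entity name known: show candidate intents related to it.
        return "show_candidate_intents"
    # Neither is reliable yet: wait for more of the user's voice.
    return "wait_for_more_speech"
```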

FIG. 6 is a diagram illustrating an embodiment in which a word representing an intention is first recognized, and a word representing an entity name is additionally recognized to update response information. In the embodiment of FIG. 6, the predetermined value of reliability, which is the threshold for displaying response information, is assumed to be 0.8. In addition, the threshold for preparing the corresponding operation is assumed to be 0.7. As described above, two thresholds are set in the embodiment of FIG. 6, but only one threshold may be set. In addition, when the corresponding operation can be prepared in several steps, the processor 120 may set a plurality of thresholds and use each as a trigger for performing the respective step.

The processor 120 may generate a reliability measurement model for determining reliability based on the big data and the user history data. For example, the data learning unit 121 and the data recognizing unit 122 may be used to generate the reliability measurement model.

Referring to FIG. 6, the processor 120 may extract a plurality of speech intents from the user voice “doubled now” that the user has uttered so far. For example, the processor 120 may extract the intention to ask the time (Search.Time) from the word "now". In addition, the processor 120 may extract an intention to search for a hospital (Search.Hospital) and an intention to search for a restaurant (Search.Restaurant) from the word "doubled". The processor 120 may determine the reliability of each of the extracted speech intents. Since all of the determined reliabilities are less than the predetermined value of 0.8, the processor 120 may wait until additional user voice is input.

The processor 120 may then determine, based on the user voice additionally uttered up to "I'm hungry now," that the user's speech intent is closest to the intention to find a restaurant (Search.Restaurant). The processor 120 may determine the reliability from the user voice uttered so far. Since the determined reliability is 0.7, the processor 120 may start preparing to process an operation of finding a restaurant. For example, the processor 120 may execute a map application or activate a GPS function.

Before the user's speech is terminated, the processor 120 may re-recognize the intermediate speech of the user and the additional speech after the intermediate speech. Based on the re-recognized user voice, the processor 120 may find the intention and the entity name and again determine the reliability of each.

Based on the user's voice input up to "Now I'm hungry near Seoul Station", the processor 120 may adjust the reliability of the intention to find a restaurant (Search.Restaurant) to 0.8, and may extract the entity name "Seoul Station", which is related to location. In the embodiment of FIG. 6, since response information is set to be provided when the reliability of the intention is 0.8 or more, the processor 120 may control the display 130 to display response information corresponding to the determined user speech intent. For example, the processor 120 may execute a map application to find a restaurant, and may set a search area of the map application to 'Seoul station' using the extracted entity name 'Seoul station'. In addition, the processor 120 may control the display 130 to display an application execution screen that provides map information near Seoul station.

Based on the user's voice input up to "a steak house near Seoul Station, which is hungry now," the processor 120 may further extract 'steak house', which is an entity name related to the type of restaurant. The processor 120 may update the response information based on the additionally spoken word. For example, the processor 120 may update the application execution screen that displays map information near Seoul Station, which is being provided as response information, to a screen in which arrows are displayed at the positions corresponding to steak houses.

Based on the user's voice input up to "I'm hungry but is there a steak house near Seoul Station," the processor 120 may determine that no additional intention or entity name has been extracted and that the reliability of the previously determined intention and entity name has not changed. Since the determined utterance intention of the user is the same and no additional information has been input, the processor 120 may maintain the provided response information. As such, the processor 120 may determine the user's utterance intention before the user's utterance is completed. Therefore, the processor 120 may reduce the time required to provide response information to the user.
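The FIG. 6 walk-through can be read as a decision over two reliability thresholds. The following is only a minimal sketch under stated assumptions, not the disclosed implementation: the function name `decide_action`, the preparation threshold of 0.7 (the description only says that preparation began at a reliability of 0.7), and the reliability values in `steps` are hypothetical. Only the response threshold of 0.8 and the intention labels come from the description.

```python
from typing import Dict, Optional, Tuple

RESPOND_THRESHOLD = 0.8  # from the FIG. 6 description
PREPARE_THRESHOLD = 0.7  # assumed; the text only says preparation began at 0.7


def decide_action(scores: Dict[str, float]) -> Tuple[str, Optional[str]]:
    """Return ("wait" | "prepare" | "respond", best intention or None)."""
    if not scores:
        return "wait", None
    intent, reliability = max(scores.items(), key=lambda kv: kv[1])
    if reliability >= RESPOND_THRESHOLD:
        return "respond", intent   # display response information
    if reliability >= PREPARE_THRESHOLD:
        return "prepare", intent   # e.g. launch the map application early
    return "wait", None            # wait for additional user voice


# Hypothetical reliabilities after each successive partial utterance.
steps = [
    {"Search.Time": 0.3, "Search.Hospital": 0.3, "Search.Restaurant": 0.4},
    {"Search.Restaurant": 0.7},
    {"Search.Restaurant": 0.8},
]
actions = [decide_action(s) for s in steps]
```

Keeping the preparation threshold below the response threshold is what lets the device start the map application before it is confident enough to show a result.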

FIG. 7 illustrates a screen provided through the display 130 according to the exemplary embodiment of FIG. 6. The illustrated example is one in which the electronic device 100 is implemented as a smartphone. FIG. 7 illustrates the electronic device 100 from left to right in chronological order.

In the embodiment of FIG. 7, the processor 120 controls the display 130 to display the recognized user voice as text on the top of the display 130. In addition, the processor 120 controls the display 130 to display an image indicating that the voice recognition function is activated at the bottom of the display 130. The processor 120 controls the display 130 to display the response information in the center of the display 130. Of course, all embodiments of the present disclosure are not limited to those having the same configuration and composition as the screen layout of FIG. 7.

The electronic device 100 shown on the left side of FIG. 7 is preparing for the corresponding operation processing. Based on the user's voice recognized to date, "I'm hungry," the processor 120 may determine that the intention is to find a restaurant. Since the reliability of the intention of finding a restaurant is determined to be greater than or equal to a threshold for preparing for a corresponding operation, the processor 120 may execute a map application to prepare for providing response information.

The electronic device 100 shown second from the left of FIG. 7 has determined an utterance intention with a reliability that is greater than or equal to the threshold for displaying response information. Text corresponding to the utterance intention and the entity name having a reliability greater than or equal to a preset value may be displayed at the bottom of the map application execution screen. By showing the determined intention and entity name to the user, the processor 120 can receive feedback from the user. The processor 120 may provide map information around Seoul Station as the response screen corresponding to the determined intention and entity name.

The electronic device 100 shown third from the left of FIG. 7 updates the response information by using words included in the user's additionally spoken voice. The processor 120 may recognize the additionally spoken user voice and extract information on the type of restaurant to be searched for. The processor 120 may update the map information around Seoul Station, which is being provided, to map information indicating the positions of steak houses.

Although the electronic device 100 shown at the far right of FIG. 7 recognizes the user's additionally spoken voice, there is no change in the reliability and no additionally recognized entity name. Therefore, the electronic device 100 maintains the previously provided response information screen. Thus, the response information that the user wants to receive has in substance already been provided at the third point in time from the left of FIG. 7. That is, the electronic device 100 according to an embodiment of the present disclosure may recognize the user's voice in real time even before the user's voice input is completed, and may provide and update the response information through reliability verification.

FIG. 8 is a diagram illustrating an embodiment in which response information is changed by recognizing an additional spoken voice of a user. In the leftmost diagram of FIG. 8, the processor 120 determines that the intention is to find a restaurant based on a user voice "now hungry" recognized so far. Since the reliability of the intention of finding a restaurant is determined to be greater than or equal to a threshold for preparing for a corresponding operation, the processor 120 may execute a map application to prepare for providing response information.

Referring to the second figure from the left of FIG. 8, the processor 120 may further detect the entity name 'steak' from a user voice of “steak hungry now”. Since the reliability of the intention to find a restaurant is determined to be greater than or equal to the threshold for providing response information, the processor 120 may control the display 130 to display a map application execution screen for searching for steak houses near the user's current location.

Based on an additionally recognized word, the processor 120 may confirm that a newly determined speech intent differs from the previously determined speech intent, which is the restaurant search intention. The processor 120 may update the response information screen so that response information corresponding to the newly determined speech intent is provided.

Referring to the third figure from the left of FIG. 8, based on a user voice of “I'm hungry now, a steak recipe,” the processor 120 may newly detect an intention of finding a recipe. The processor 120 may determine the reliability of both the existing intention of finding a restaurant and the new intention of finding a recipe. Based on the word 'recipe' uttered after the intermediately recognized user voice, the processor 120 may determine that the reliability of the intention of finding a recipe is greater than or equal to a preset value and that the reliability of the intention of finding a restaurant is less than the preset value. Accordingly, the processor 120 may control the display 130 to display response information about the steak recipe search result.

Referring to the drawing on the far right of FIG. 8, the processor 120 may maintain the previously provided response information screen. This is because recognizing the user's additionally spoken voice produced no change in reliability and no additionally recognized entity name. In fact, the response information that the user wants to receive has already been provided at the third point in time from the left of FIG. 8.
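The behaviour described for FIG. 8 amounts to three outcomes each time the utterance is re-recognized: replace the response when a different intention crosses the threshold, refine it when the same intention gains an entity name, or keep it otherwise. The sketch below illustrates this under assumptions: `Response`, `update_response`, the label `Search.Recipe`, the threshold of 0.8, and all reliability values are hypothetical and do not appear in the description.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Optional, Tuple

THRESHOLD = 0.8  # assumed preset value for providing response information


@dataclass(frozen=True)
class Response:
    intent: str
    entities: FrozenSet[str] = frozenset()


def update_response(
    current: Optional[Response],
    scores: Dict[str, float],
    entities: Iterable[str],
) -> Tuple[Optional[Response], str]:
    """Return the response to show and one of "replace", "refine", "keep"."""
    found = frozenset(entities)
    if not scores:
        return current, "keep"
    best = max(scores, key=scores.get)
    if scores[best] < THRESHOLD:
        return current, "keep"
    if current is None or best != current.intent:
        return Response(best, found), "replace"   # intention changed
    if found - current.entities:
        return Response(best, current.entities | found), "refine"
    return current, "keep"                        # nothing new was recognized


state: Optional[Response] = None
log = []
for scores, entities in [
    ({"Search.Restaurant": 0.85}, ["steak"]),                        # "... steak"
    ({"Search.Restaurant": 0.4, "Search.Recipe": 0.9}, ["steak"]),   # "... recipe"
    ({"Search.Restaurant": 0.4, "Search.Recipe": 0.9}, ["steak"]),   # rest of utterance
]:
    state, change = update_response(state, scores, entities)
    log.append((state.intent, change))
```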

FIG. 9 is a diagram illustrating an embodiment in which a word representing an entity name is first recognized and a UI for selecting a plurality of speech intents corresponding to the entity name is displayed.

The processor 120 may predict a plurality of speech intents based on words included in a user voice input to date. The processor 120 may determine the reliability of the predicted plurality of speech intents.

If the reliability of the determined plurality of speech intents is less than a preset value, the processor 120 may control the display 130 to display a list UI including the determined plurality of speech intents. When a user input for selecting one of the displayed speech intents is received, the processor 120 may control the display 130 to display response information corresponding to the selected speech intent.

Referring to FIG. 9, the processor 120 may predict a speech intent from the partial user voice “west”. For example, the processor 120 may predict the intention of asking the time (Search.Time) by using history data indicating that utterances beginning with this phoneme were often requests for the time. However, since the reliability of the predicted intention is as low as 0.1, the processor 120 may wait for a further utterance of the user without performing any operation.

Subsequently, based on the user's additional speech up to "Seoul", the processor 120 may extract 'Seoul', which is the entity name associated with the location. The processor 120 may predict a plurality of speech intents associated with the extracted 'Seoul'. For example, the processor 120 may predict the intention of searching for weather (Search.Weather), the intention of finding a road (Find.Path), and the intention of searching for city information (Search.Cityinfo) as a speech intent.

Because the reliabilities of the predicted plurality of speech intents are all less than a preset value, the processor 120 may control the display 130 to list and display the predicted plurality of speech intents. If a user input for selecting one of the displayed utterance intentions is received, the processor 120 may provide response information corresponding to the selected utterance intention. For example, when a user input for selecting the intention of finding a road is received, the processor 120 may control the display 130 to display a navigation execution screen.

In contrast, if a user input for selecting one of the plurality of utterance intentions displayed is not received, the processor 120 may wait for a further utterance of the user to be input. In the embodiment of FIG. 9, a case in which there is no user input will be described.

Based on the user's voice input up to "going to Seoul," the processor 120 may adjust the confidence of the intention to find the route to 0.9. In addition, the processor 120 may search for a way to Seoul by executing a navigation application using an intention and an entity name having a reliability that is greater than or equal to a preset value.

The processor 120 may determine that there is no change in reliability based on the user's voice input up to “tell me the way to Seoul”. Since the determined intention of the user is the same and no additional information is input, the processor 120 may maintain the provided response information. In this way, the processor 120 may determine the utterance intention of the user before the user's utterance is completed and provide the response information quickly.

FIG. 10 illustrates a screen provided through the display 130 according to the exemplary embodiment of FIG. 9. However, FIG. 10 illustrates an embodiment in which there is a user input for selecting a speech intent. FIG. 10 illustrates the electronic device 100 from left to right in chronological order.

Referring to the leftmost diagram of FIG. 10, the processor 120 may be unable to determine an intention or an entity name based on the user voice "We" recognized so far, and may wait for additional speech to be recognized.

The processor 120 may recognize 'Seoul', which is an entity name indicating a location, based on the user's voice input up to "Seoul." The processor 120 may predict a plurality of speech intents related to 'Seoul'. When the reliability of each of the predicted speech intents is determined to be less than a predetermined value, the processor 120 may display the plurality of speech intents and receive a user selection. As shown in the second drawing from the left of FIG. 10, the processor 120 may control the display 130 to display a list UI for selecting among 'weather search', 'navigation', and 'city information', which correspond to the plurality of speech intents associated with the entity name 'Seoul'.

In the embodiment of FIG. 10, it will be described on the assumption that the user selects 'navigation' corresponding to the intention of speaking. Of course, as described with reference to FIG. 9, the processor 120 may determine the speech intent by using the additional speech.

The processor 120 may provide a response screen corresponding to the selected speech intent. As shown in the third drawing from the left of FIG. 10, the processor 120 may execute a navigation application to search for a route from the current location of the user to Seoul.

The processor 120 may know the exact speech intent by the user's selection. Therefore, the processor 120 may use the speech data of the user who has been uttered as learning data for reinforcing the speech recognition model as shown in the rightmost diagram of FIG. 10.

FIG. 11 is a diagram illustrating an embodiment of displaying a UI for inducing a user to utter additional information required to perform an operation corresponding to the determined speech intent.

Referring to the leftmost drawing of FIG. 11, the voice of the user input so far is “going to Busan”. From the recognized user voice, the processor 120 may determine an intention of finding a route (Find.Path), a train reservation intention (Book.Train), and the like. If the reliability of each determined intention is less than the predetermined value, the processor 120 may wait for a further utterance of the user, as shown in FIG. 11.

Based on the user voice up to "Reserving a train going to Busan," the processor 120 may determine that the user's utterance intention is a train reservation intention. In addition, the processor 120 may perform an operation corresponding to the determined speech intent. The information required for the train reservation operation is 'departure', 'arrival' and 'time'. From the user voice spoken so far, the processor 120 can determine only one item of the necessary information, namely that the arrival place is Busan.

In a conventional virtual personal assistant function, when additional information is needed, the user is asked about each item of necessary information and an answer is received for each. Thus, there has been a problem that as many additional conversation turns occur as there are items of necessary additional information.

As shown in the second figure from the left of FIG. 11, the electronic device 100 according to an embodiment of the present disclosure may display a guide UI that indicates what information is required for the corresponding operation and whether each item of necessary information has been input. By displaying the guide UI, the processor 120 may induce the user to speak the necessary information without an additional conversation turn.

As illustrated in FIG. 11, the processor 120 may determine whether necessary information has been input based on the additional speech spoken by the user. In addition, as necessary information is input, the processor 120 may control the display 130 to display the input content in the guide UI. For example, the processor 120 may add an indication of '6 o'clock tomorrow' to the 'time' field from the user's additional speech of "6 o'clock tomorrow". In addition, the processor 120 may add an indication of 'Suwon' to the 'departure' field from the user's additional speech of "departing from Suwon".
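The guide UI of FIG. 11 can be thought of as a checklist over the information items the operation requires. The following is a minimal sketch of that idea; the text rendering, the names `REQUIRED_SLOTS`, `missing_slots`, and `render_guide`, and the example time value are assumptions made for illustration, while the three required items are those named for the train reservation example.

```python
from typing import Dict, List

# Required items named in the train reservation example.
REQUIRED_SLOTS = ("departure", "arrival", "time")


def missing_slots(slots: Dict[str, str]) -> List[str]:
    """Items the user still has to speak."""
    return [name for name in REQUIRED_SLOTS if not slots.get(name)]


def render_guide(slots: Dict[str, str]) -> str:
    """Text stand-in for the guide UI: one line per required item."""
    lines = []
    for name in REQUIRED_SLOTS:
        value = slots.get(name, "")
        mark = "[x]" if value else "[ ]"
        lines.append(f"{mark} {name}: {value}".rstrip())
    return "\n".join(lines)


slots = {"arrival": "Busan"}          # known from "a train going to Busan"
before = render_guide(slots)
slots["time"] = "tomorrow at 6"       # hypothetical additional speech
after = render_guide(slots)
still_missing = missing_slots(slots)
```

Because every unfilled item stays visible, the user can supply the remaining information in the same turn instead of answering one question per item.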

As described above, the electronic device 100 according to various embodiments of the present disclosure may improve a response speed of the virtual personal assistant. In addition, the electronic device 100 may perform an operation intended by the user with a minimum number of conversation turns. This allows the user to acquire a user experience for a fast and accurate virtual personal assistant function.

Although the electronic device 100 has been described above as determining the user's utterance intention and generating the corresponding response information, the server 200 may instead play the central role in performing the virtual personal assistant function. That is, the electronic device 100 may perform only input and output operations, and the other functions may be implemented by the server 200.

Since the user speech intent determination method according to an embodiment of the present disclosure determines the intention by processing the user speech in real time, a fast communication speed between the server 200 and the electronic device 100 is required. For example, when the virtual personal assistant function is executed in the electronic device 100, a dedicated communication channel may be established with the server 200. The electronic device 100 may transmit the received user voice to the server 200. The server 200 may determine the intention to speak from the user's voice, generate corresponding response information, and transmit the corresponding response information to the electronic device 100. The electronic device 100 may output the received response information and provide it to the user.

FIGS. 12 to 15 are flowcharts illustrating a method of determining a user utterance intention of the electronic device 100 according to various embodiments of the present disclosure.

Referring to FIG. 12, the electronic device 100 may receive a user voice spoken by a user in operation S1210. While the user's voice is being received, the electronic device 100 may determine the user's utterance intention based on at least one word included in the user's voice (S1220). That is, the electronic device 100 may determine the utterance intention in real time based on the user's voice input so far, even before the user's utterance is completed. In addition, since a verification process of measuring the reliability of the determined speech intent is performed, the electronic device 100 may avoid providing a result that is completely different from what the user intended.

The electronic device 100 may provide response information corresponding to the determined speech intent of the user in operation S1230. The electronic device 100 may display a screen of a result of performing an operation corresponding to the intention of speaking. For example, the electronic device 100 may display a screen of a result of searching for today's weather. Also, the electronic device 100 may perform an operation corresponding to the intention of speaking. For example, the electronic device 100 may set an alarm in response to user speech.

Based on the additional words spoken after the at least one word is input while the response information is provided, the electronic device 100 may update the response information (S1240). The electronic device 100 may provide response information based on the recognized user voice, and may subsequently determine the user utterance intention based on the entire user voice including the additionally uttered user voice. Therefore, as the user speaks further, the electronic device 100 may update and provide the response information in real time.

For example, if the user's speech intention is correctly determined, the electronic device 100 may provide more accurate and detailed response information based on the contents recognized from the additional speech. On the contrary, the electronic device 100 may recognize that the user's intention of speaking is wrongly determined based on the contents recognized from the additional speech. In addition, the electronic device 100 may update and provide response information corresponding to a newly determined user's utterance intention.

FIG. 13 is a flowchart illustrating an embodiment in which a user's speech intent is determined through a verification process of reliability measurement. Referring to FIG. 13, the electronic device 100 may receive a user voice spoken by a user. In operation S1310, the electronic device 100 may recognize the input user voice.

The electronic device 100 may estimate a plurality of speech intentions based on the user voice recognized so far (S1320). For example, the electronic device 100 may extract a keyword corresponding to an intention or an entity name from the user voice recognized so far. The electronic device 100 may estimate a plurality of speech intentions based on the at least one extracted keyword.

Subsequently, the electronic device 100 may measure reliability of each of the estimated plurality of speech intents (S1330). If the reliability of all the utterance intentions is lower than a predetermined criterion (S1330-N), the electronic device 100 may wait for an additional utterance of the user to be input. In contrast, when a speech intent having a reliability equal to or greater than a predetermined value is detected among the plurality of speech intents (S1330-Y), the electronic device 100 may determine the detected speech intent as the user's speech intent. In operation S1340, the electronic device 100 may provide response information corresponding to the determined speech intent.

Subsequently, the electronic device 100 may determine whether the user's speech has ended (S1350). If there is additionally spoken user voice (S1350-N), the electronic device 100 may estimate the plurality of speech intentions again based on the entire user voice including the additionally spoken user voice. If it is determined that the user's utterance has ended (S1350-Y), the electronic device 100 may maintain the response information provision state and wait until the next user's utterance is input.
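The loop of FIG. 13 can be sketched as follows. This is an illustration under assumptions rather than the disclosed method: `run_incremental` and `toy_estimate` are hypothetical names, the keyword rules merely stand in for a real intention estimator, and the threshold of 0.8 is borrowed from the FIG. 6 example.

```python
from typing import Callable, Dict, Iterable, List, Optional


def run_incremental(
    partials: Iterable[str],
    estimate: Callable[[str], Dict[str, float]],
    threshold: float = 0.8,
) -> List[str]:
    """Return the intentions for which response information was provided."""
    provided: List[str] = []
    current: Optional[str] = None
    for text in partials:                 # each item is the whole utterance so far
        scores = estimate(text)           # S1320: estimate utterance intentions
        best = max(scores, key=scores.get, default=None)
        if best is not None and scores[best] >= threshold:   # S1330-Y
            if best != current:
                current = best
                provided.append(best)     # S1340: provide response information
        # S1330-N: otherwise wait for the next partial utterance
    return provided                       # S1350-Y: the last response is kept


def toy_estimate(text: str) -> Dict[str, float]:
    """Hypothetical keyword rules standing in for a real estimator."""
    scores: Dict[str, float] = {}
    if "now" in text:
        scores["Search.Time"] = 0.3
    if "hungry" in text:
        scores["Search.Restaurant"] = 0.7
    if "near" in text:
        scores["Search.Restaurant"] = 0.8
    return scores


partials = ["now", "I'm hungry now", "I'm hungry now, near Seoul Station"]
provided = run_incremental(partials, toy_estimate)
```

Note that each iteration re-estimates from the entire utterance so far, not only from the newly added words, which matches the description of S1350-N.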

FIG. 14 is a flowchart illustrating in more detail a method of determining a user utterance intention of the electronic device 100 according to an embodiment of the present disclosure. Referring to FIG. 14, the electronic device 100 may receive a user voice spoken by a user. In operation S1410, the electronic device 100 may recognize the input user voice.

The electronic device 100 may estimate a plurality of speech intentions based on the recognized user voices (S1420). In operation S1430, the electronic device 100 may measure the reliability of each of the estimated utterance intentions. For example, the electronic device 100 may statistically analyze what the intention of the user is when a specific word is input using the big data and the history data of the user. Reliability may be a numerical value representing the result of statistical analysis. For example, reliability can be defined as a value between 0 and 1.
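The description does not specify which statistical analysis yields the reliability. As one possible reading, offered purely as an assumption, the reliability of an intention for a given word could be its relative frequency in the history data, which naturally falls between 0 and 1. The function name, the history records, and the label `Search.Recipe` below are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def reliability_from_history(
    word: str, history: Iterable[Tuple[str, str]]
) -> Dict[str, float]:
    """Relative frequency of each intention among past uses of `word`.

    `history` holds (word, intention) pairs; this record format is an assumption.
    """
    counts = Counter(intent for w, intent in history if w == word)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {intent: n / total for intent, n in counts.items()}


# Hypothetical history: "hungry" led to a restaurant search 3 times out of 4.
history = [("hungry", "Search.Restaurant")] * 3 + [("hungry", "Search.Recipe")]
reliabilities = reliability_from_history("hungry", history)
```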

The electronic device 100 may prepare for a corresponding operation during the utterance of the user in order to provide a quick response to the user's voice. The electronic device 100 may prepare for an operation corresponding to the speech intent having the highest reliability among the plurality of speech intents (S1440). For example, the electronic device 100 may execute, in the background, an application used to perform an operation corresponding to the speech intent. As another example, the electronic device 100 may activate a component of the electronic device 100 used to perform an operation corresponding to the speech intent.

If the reliability of all the utterance intentions is lower than the predetermined criterion (S1450-N), the electronic device 100 may wait for the additional utterance of the user to be input. On the contrary, when a speech intent having a reliability equal to or greater than a predetermined value is detected among the plurality of speech intents (S1450-Y), the electronic device 100 may determine the detected speech intent as the user's speech intent. In operation S1460, the electronic device 100 may execute an application for an operation corresponding to the determined speech intent.

Subsequently, the electronic device 100 may check whether all information necessary for performing an operation corresponding to the detected speech intent has been determined from the user's voice (S1470). For example, if the utterance intention is a food delivery order, the electronic device 100 needs information such as the food type, the delivery company, and the payment means. As another example, if the utterance intention is a train reservation, the electronic device 100 needs information such as a departure place, an arrival place, a reservation time, and a payment means.

If additional information is needed to perform the operation (S1470-N), the electronic device 100 may display a UI for inducing the user to utter the additional information (S1480). By inducing the user to speak all the information in the current conversation turn, the electronic device 100 may prevent additional conversation turns from occurring. When all the information for performing the operation is collected (S1470-Y), the electronic device 100 may perform the operation corresponding to the utterance intention (S1490).
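The branch at S1470 can be sketched as a lookup of the information each intention requires. The required items below follow the two examples in the description; the intention label `Order.Delivery`, the key names, the function `next_step`, and the returned action strings are hypothetical.

```python
from typing import Dict, List, Tuple

# Required information per utterance intention, following the two examples.
REQUIRED_INFO: Dict[str, Tuple[str, ...]] = {
    "Order.Delivery": ("food_type", "delivery_company", "payment"),  # label assumed
    "Book.Train": ("departure", "arrival", "time", "payment"),
}


def next_step(intent: str, collected: Dict[str, str]) -> Tuple[str, List[str]]:
    """Decide between showing the guide UI and performing the operation."""
    missing = [key for key in REQUIRED_INFO[intent] if key not in collected]
    if missing:
        return "show_guide_ui", missing   # S1470-N -> S1480
    return "perform_operation", []        # S1470-Y -> S1490


partial_result = next_step("Book.Train", {"arrival": "Busan"})
complete_result = next_step(
    "Book.Train",
    {"departure": "Seoul", "arrival": "Busan", "time": "18:00", "payment": "card"},
)
```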

FIG. 15 is a flowchart illustrating a method of determining a user utterance intention of the electronic device 100 according to another embodiment of the present disclosure. Referring to FIG. 15, the electronic device 100 may receive a user voice spoken by a user. In operation S1510, the electronic device 100 may recognize the input user voice.

The electronic device 100 may estimate a plurality of speech intentions based on the recognized user voices (S1520). In operation S1530, the electronic device 100 may measure reliability of each of the estimated utterance intentions. When there is a speech intent in which the measured reliability is greater than or equal to a preset criterion (S1530-Y), the electronic device 100 may provide response information corresponding to the corresponding speech intent (S1560).

If the reliability of all the utterance intentions is less than the predetermined criterion (S1530-N), the electronic device 100 may display a list UI including the plurality of utterance intentions (S1540). Displaying a list of utterance intentions and receiving a user selection may be particularly useful when only keywords corresponding to an entity name have been extracted and no keyword corresponding to an intention has been extracted. If no user selection is input (S1550-N), the electronic device 100 may wait for additional speech of the user.

When a user input for selecting one of the displayed utterance intentions is received (S1550-Y), the electronic device 100 may provide response information corresponding to the selected utterance intention (S1560). When the user's speech intent is determined through the user input, the electronic device 100 may store the user voice data and the determined speech intent and use them to train the data recognition model.

Subsequently, the electronic device 100 may determine whether the user's speech is finished (S1570). If there is additionally spoken user voice (S1570-N), the electronic device 100 may estimate a plurality of speech intentions again based on the entire user voice including the additionally spoken user voice. If it is determined that the utterance of the user is ended (S1570-Y), the electronic device 100 may maintain the response information provision state and wait until the next utterance of the user is input.
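One pass through the FIG. 15 flow can be sketched as below. The function `handle_partial`, the dictionary it returns, the threshold of 0.8, and the reliabilities assigned to the three 'Seoul' intentions are assumptions for illustration; the intention labels are those of the FIG. 9 example.

```python
from typing import Dict, List, Optional, Tuple


def handle_partial(
    text: str,
    scores: Dict[str, float],
    user_choice: Optional[str] = None,
    threshold: float = 0.8,
    training_log: Optional[List[Tuple[str, str]]] = None,
) -> Dict[str, object]:
    """Respond directly, respond to a list selection, or show the list UI."""
    best = max(scores, key=scores.get, default=None)
    if best is not None and scores[best] >= threshold:
        return {"action": "respond", "intent": best}           # S1530-Y -> S1560
    candidates = sorted(scores, key=scores.get, reverse=True)  # S1540: list UI
    if user_choice in candidates:                              # S1550-Y
        if training_log is not None:
            # Keep the utterance with the confirmed intention for later training.
            training_log.append((text, user_choice))
        return {"action": "respond", "intent": user_choice}    # S1560
    return {"action": "show_list", "candidates": candidates}   # S1550-N


# Hypothetical reliabilities for the entity name 'Seoul'.
seoul_scores = {"Search.Weather": 0.3, "Find.Path": 0.4, "Search.Cityinfo": 0.2}
log: List[Tuple[str, str]] = []
shown = handle_partial("Seoul", seoul_scores)
chosen = handle_partial("Seoul", seoul_scores, user_choice="Find.Path", training_log=log)
direct = handle_partial("going to Seoul", {"Find.Path": 0.9})
```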

FIG. 16 is a sequence diagram illustrating a method of building a data recognition model by a system including an electronic device and a server according to an embodiment of the present disclosure. In this case, the system for building a data recognition model may include a first component 1601 and a second component 1621.

For example, the first component 1601 may be the electronic device 100 and the second component 1621 may be the server 200. Alternatively, the first component 1601 may be a general purpose processor, and the second component 1621 may be an artificial intelligence dedicated processor. Alternatively, the first component 1601 may be at least one application, and the second component 1621 may be an operating system (OS).

That is, the second component 1621 may be a component that is more integrated, more dedicated, has a smaller delay, has higher performance, or has more resources than the first component 1601, and that can process the operations required to generate, update, or apply the data recognition model more quickly and effectively than the first component 1601.

In this case, an interface for transmitting / receiving data (voice data) between the first component 1601 and the second component 1621 may be defined.

For example, an application program interface (API) function having training data to be applied to the data recognition model as an argument value (or a parameter value or a transfer value) may be defined. In this case, when the first component 1601 calls the API function and inputs voice data as a data parameter value, the API function may deliver the voice data to the second component 1621 as training data to be applied to the data recognition model.

The first component 1601 may receive a user voice spoken by the user in operation S1603. The first component 1601 may transmit voice data about the user's voice to the second component 1621.

The second component 1621 may train the data recognition model using the received voice data (S1605).

The second component 1621 may store the learned data recognition model (S1607).

Meanwhile, in the above-described embodiment, the second component 1621 stores the learned data recognition model. However, this is only an example, and the second component 1621 may transmit the learned data recognition model to the first component 1601 so that the first component 1601 stores the data recognition model.
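The interface between the two components in FIG. 16 can be pictured as a single function that takes voice data as its argument. The sketch below is hypothetical throughout: `SecondComponent`, `submit_training_data`, and the version counter are illustrative names, and appending the sample to a list merely stands in for actual model training.

```python
from typing import List


class SecondComponent:
    """Stand-in for the second component 1621 (for example, a server)."""

    def __init__(self) -> None:
        self.samples: List[bytes] = []
        self.model_version = 0

    def train(self, voice_data: bytes) -> int:
        self.samples.append(voice_data)  # S1605: placeholder for real training
        self.model_version += 1          # S1607: the updated model is stored
        return self.model_version


def submit_training_data(second: SecondComponent, voice_data: bytes) -> int:
    """Hypothetical API function with training data as its argument value."""
    if not voice_data:
        raise ValueError("voice_data must not be empty")
    return second.train(voice_data)


server = SecondComponent()
version = submit_training_data(server, b"\x00\x01")  # called by the first component
```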

FIG. 17 is a sequence diagram illustrating a method of recognizing data by a system including an electronic device and a server according to an embodiment of the present disclosure. In this case, the system for recognizing data using the data recognition model may include a first component 1701 and a second component 1721.

As an example, the first component 1701 may be an electronic device and the second component 1721 may be a server. Alternatively, the first component 1701 may be a general purpose processor, and the second component 1721 may be an artificial intelligence dedicated processor. Alternatively, the first component 1701 may be at least one application, and the second component 1721 may be an operating system.

In this case, an interface for transmitting / receiving data (eg, a moving image, a composite image, or a moving image recognition result) between the first component 1701 and the second component 1721 may be defined.

For example, an API function having recognition data to be applied to the learned data recognition model as an argument value (or a parameter value or a transfer value), and having a recognition result of the data recognition model as an output value, may be defined. In this case, when the first component 1701 calls the API function and inputs voice data as a data parameter value, the API function may deliver the voice data to the second component 1721 as recognition data to be applied to the data recognition model. When a recognition result is received from the second component 1721, the first component 1701 may provide response information corresponding to the user's utterance intention as an output value of the API function.

The first component 1701 may receive the user's voice spoken by the user (S1703). The first component 1701 may transmit voice data for at least one word included in the voice spoken by the user to the second component 1721 while the user's voice is received.

The second component 1721 may determine speech intent of the user by applying the speech data of the received at least one word to the speech recognition model (S1705).

The second component 1721 provides response information corresponding to the determined speech intent (S1707), and transmits the response information to the first component 1701.

The first component 1701 may display a screen of a result of performing an operation corresponding to the intention of speaking. Also, the first component 1701 may perform an operation corresponding to the intention of speaking.

The first component 1701 may update the response information based on the additional words spoken after the at least one word is input while the response information is provided (S1709).

Meanwhile, in the above-described embodiment, the first component 1701 has been described as generating the voice data. However, this is only an example, and the second component 1721 may receive the input voice and generate voice data including at least one word.
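The recognition interface of FIG. 17 can be pictured as a function that takes the words received so far and returns response information. The following sketch is an assumption-laden illustration: `SecondComponentStub`, `recognize`, the keyword rules, and the shape of the returned dictionary are all hypothetical, and the rules stand in for a trained data recognition model.

```python
from typing import Dict, Iterable, List


class SecondComponentStub:
    """Stand-in for the second component 1721; keyword rules replace a model."""

    RULES = {"weather": "Search.Weather", "way": "Find.Path"}

    def determine(self, words: List[str]) -> Dict[str, str]:
        intent = "unknown"
        for word in words:
            intent = self.RULES.get(word.lower(), intent)   # S1705
        return {"intent": intent, "query": " ".join(words)}  # S1707


def recognize(second: SecondComponentStub, words: Iterable[str]) -> Dict[str, str]:
    """Hypothetical API function: recognition data in, response information out."""
    return second.determine(list(words))


second = SecondComponentStub()
# The first component calls the function again as more words arrive (S1709).
first_response = recognize(second, ["tell", "me"])
updated_response = recognize(second, ["tell", "me", "the", "way", "to", "Seoul"])
```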

Some embodiments may be implemented as software (S/W) programs that include instructions stored on computer-readable storage media.

For example, a computer is a device capable of calling stored instructions from a storage medium and operating according to the disclosed embodiments in response to the called instructions, and may include a device according to the disclosed embodiments or an external server connected to the device.

The computer-readable storage medium may be provided in the form of a non-transitory storage medium. Here, 'non-transitory' means that the storage medium does not include a signal or a current and is tangible; it does not distinguish whether data is stored in the storage medium semi-permanently or temporarily. For example, non-transitory storage media may include not only non-transitory readable recording media such as a CD, DVD, hard disk, Blu-ray disc, USB, internal memory, memory card, ROM, or RAM, but also media in which data is stored temporarily, such as registers, caches, and buffers.

In addition, the method according to the disclosed embodiments may be provided as a computer program product.

The computer program product may include an S/W program, a computer-readable storage medium on which the S/W program is stored, or a product traded between a seller and a buyer.

For example, a computer program product may include a product (e.g., a downloadable app) in the form of an S/W program distributed electronically through a device manufacturer or an electronic market (e.g., Google Play Store, App Store). For electronic distribution, at least a part of the S/W program may be stored in a storage medium or may be generated temporarily. In this case, the storage medium may be a storage medium of a server of the manufacturer, a server of the electronic market, or a relay server.

Although the present disclosure has been described with reference to the limited embodiments and the drawings, the present disclosure is not limited to the above embodiments, and those skilled in the art to which the present disclosure pertains may make various modifications and variations based on these descriptions. Therefore, the scope of the present disclosure should not be limited to the described embodiments, but should be determined not only by the claims below but also by the equivalents of the claims.

Claims

1. An electronic device comprising:

a microphone configured to receive a user voice uttered by a user; and

a processor configured to determine, while the user voice is being input, an utterance intention of the user based on at least one word included in the user voice, provide response information corresponding to the determined utterance intention, and update the response information, while the response information is being provided, based on an additional word uttered after the at least one word is input.
2. The electronic device of claim 1, wherein the processor determines the reliability of each of a plurality of utterance intentions based on the input at least one word and, when an utterance intention having a reliability equal to or greater than a predetermined value is detected among the plurality of utterance intentions, determines the detected utterance intention to be the utterance intention of the user.
3. The electronic device of claim 2, wherein the processor initiates an execution preparation operation of an application for performing an operation corresponding to the utterance intention having the highest reliability among the plurality of utterance intentions.
4. The electronic device of claim 2, further comprising a display, wherein the processor, when the utterance intention having a reliability equal to or greater than the predetermined value is detected, controls the display to display an execution screen of an application for performing an operation corresponding to the detected utterance intention.
5. The electronic device of claim 4, wherein the processor controls the display to display a UI for inducing the user to utter additional information necessary to perform the operation corresponding to the detected utterance intention.
6. The electronic device of claim 2, further comprising a display, wherein the processor, when the reliabilities of the determined plurality of utterance intentions are less than the predetermined value, controls the display to display a list UI including the determined plurality of utterance intentions and, when a user input selecting one of the displayed plurality of utterance intentions is received, provides response information corresponding to the selected utterance intention.
7. The electronic device of claim 1, wherein the processor, when an utterance intention newly determined based on the additional word differs from the utterance intention determined based on the at least one word, updates the provided response information so as to provide response information corresponding to the newly determined utterance intention.
8. The electronic device of claim 1, further comprising a display, wherein the processor controls the display to display the provided response information.
9. The electronic device of claim 8, wherein the response information includes an entity name and an intention.
10. A method for determining an utterance intention of a user of an electronic device, the method comprising:

receiving a user voice uttered by the user;

determining, while the user voice is being input, an utterance intention of the user based on at least one word included in the user voice;

providing response information corresponding to the determined utterance intention of the user; and

updating the response information, while the response information is being provided, based on an additional word uttered after the at least one word is input.
11. The method of claim 10, wherein the determining comprises:

determining the reliability of each of a plurality of utterance intentions based on the input at least one word;

detecting an utterance intention having a reliability equal to or greater than a predetermined value among the plurality of utterance intentions; and

determining the detected utterance intention to be the utterance intention of the user.
12. The method of claim 11, further comprising initiating an execution preparation operation of an application for performing an operation corresponding to the utterance intention having the highest reliability among the plurality of utterance intentions.
13. The method of claim 11, further comprising, when the utterance intention having a reliability equal to or greater than the predetermined value is detected, displaying an execution screen of an application for performing an operation corresponding to the detected utterance intention.
14. The method of claim 13, further comprising displaying a UI for inducing the user to utter additional information necessary to perform the operation corresponding to the detected utterance intention.
15. The method of claim 11, wherein:

the determining further comprises, when the reliabilities of the determined plurality of utterance intentions are less than the predetermined value, displaying a list UI including the determined plurality of utterance intentions; and

the providing comprises, when a user input selecting one of the displayed utterance intentions is received, providing response information corresponding to the selected utterance intention.