
WO2025046064A1 - A controller for a conversational system using emotion context and method for operating the same - Google Patents

A controller for a conversational system using emotion context and method for operating the same Download PDF

Info

Publication number
WO2025046064A1
Authority
WO
WIPO (PCT)
Prior art keywords
emotion
user
controller
context
estimated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/EP2024/074266
Other languages
French (fr)
Inventor
Shanmuga Sundaram KARTHIKEYANI
Purvish KHALPADA
Arvind Devarajan SANKRUTHI
Ravisankar SWETHA SHANKAR
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Robert Bosch GmbH
Bosch Global Software Technologies Pvt Ltd
Original Assignee
Robert Bosch GmbH
Bosch Global Software Technologies Pvt Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Robert Bosch GmbH, Bosch Global Software Technologies Pvt Ltd filed Critical Robert Bosch GmbH
Publication of WO2025046064A1 publication Critical patent/WO2025046064A1/en
Pending legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques specially adapted for particular use, for comparison or discrimination
    • G10L 25/63 - Speech or voice analysis techniques specially adapted for comparison or discrimination, for estimating an emotional state


Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The conversational system (100) comprises at least one input signal (128), to detect user characteristics, selected from a group comprising speech, texts in the speech, physiological parameters, and a facial image. The controller (110) connected to the at least one means (132), and configured to estimate an emotion profile (116), by an emotion model (112), using input signals (128) from each of the at least one means (132). The emotion profile (116) comprises an estimated emotion, an intensity of the estimated emotion and a confidence score of the estimated emotion. The controller (110), characterized in that, while a speech input is available, configured to determine a context, by a context model (114), of an ongoing conversation detected in the speech input, and store the emotion profile (116) as baseline for the emotion and for the user (130) if the estimated emotion and the emotion detected in the context are same.

Description

Title of the invention:
A CONTROLLER FOR A CONVERSATIONAL SYSTEM USING EMOTION CONTEXT AND METHOD FOR OPERATING THE SAME
Complete Specification:
The following specification describes and ascertains the nature of this invention and the manner in which it is to be performed.
Field of the invention:
[0001] The present invention relates to a controller for a conversational system using emotion context and a method for operating the same.
Background of the invention:
[0002] Emotions play a vital role in effective communication. Emotions often provide important context and insight into what a person is really saying. If emotions are not considered, there is a high chance of missing the nuances of the message. By understanding the emotions of the other person, an effective communication style can be tailored. For example, if someone is feeling stressed, a more calming and reassuring approach must be used.
[0003] The existing conversational systems do not have emotion awareness. For sure, if you say, "I am stressed," you might get a generic response on how stress is bad and what you can do. However, you must explicitly express your emotion every time, which is unlike human-to-human conversation. That context is lost, and your future communication is not influenced by it. It is as if the conversational system did not really care about your emotions.
[0004] According to patent literature WO2023113717, a smart vehicle assistant with artificial intelligence is disclosed. The invention relates to a smart car assistant with artificial intelligence designed for use in automobiles, VIP vehicles, commercial vehicles and smart home systems. The assistant recognizes the user's face, detects the user's emotional state, offers suggestions according to the detected emotion, and has a three-dimensional holographic face to create an emotional bond between the user and the vehicle. It has the ability to read news, give weather information, send e-mails, create notes and record alarms, and allows voice control of the equipment in the vehicle by speaking in Turkish or in any desired language. It makes it possible to perceive commands given in a daily and natural speaking language and to answer the questions asked, can translate between different languages, can report malfunctions that may occur in the vehicle and transmit range information audibly, and connects to the phone and allows the user to use the features of the phone.
Brief description of the accompanying drawings:
[0005] An embodiment of the disclosure is described with reference to the following accompanying drawings,
[0006] Fig. 1 illustrates a block diagram of a controller for a conversational system for a user, according to an embodiment of the present invention, and
[0007] Fig. 2 illustrates a method of operating the controller for the conversational system, according to the present invention.
Detailed description of the embodiments:
[0008] Fig. 1 illustrates a block diagram of a controller for a conversational system for a user, according to an embodiment of the present invention. The conversational system 100 facilitates contextual conversation with the user 130. The conversational system 100 comprises the controller 110 with an input interface 122 and an output interface 124. The conversational system 100 comprises at least one input signal 128, to detect user characteristics, selected from a group comprising speech, texts in the speech, physiological parameters, and a facial image. The at least one input signal 128 is received from at least one means 132 selected from a group comprising a microphone 102, an Automatic Speech Recognition (ASR) module 104 or Speech-to-Text module, a wearable device 106 and at least one camera 108. The controller 110 is connected to the at least one means 132 and configured to estimate an emotion profile 116, by an emotion model 112, using the input signals 128 from each of the at least one means 132. The emotion profile 116 comprises an estimated emotion, an intensity of the estimated emotion and a confidence score of the estimated emotion. The controller 110 is characterized in that, while a speech input is available, it is configured to determine a context, by a context model 114, of an ongoing conversation detected in the speech input, and store the emotion profile 116 as baseline for the emotion and for the user 130 if the estimated emotion and the emotion detected in the context are the same. Alternatively, while the speech input is unavailable, the controller 110 is configured to prompt the user 130, through an output means 120 (such as a speaker or display), for a response to validate the estimated emotion, and store the emotion profile 116 as baseline if validated by the user 130 with a high confidence score. Otherwise, the controller 110 stores the emotion profile 116 as baseline with the existing confidence score or a lower score.
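For illustration only, the baselining decision described in this paragraph can be sketched in Python as below. The names EmotionProfile, decide_baseline and ask_user, as well as the 0.9 and 0.5 confidence values, are assumptions made for the sketch and are not part of the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass
class EmotionProfile:
    emotion: str       # estimated emotion, e.g. "happiness"
    intensity: float   # 0..100
    confidence: float  # 0..1

def decide_baseline(profile: EmotionProfile,
                    context_emotion: Optional[str],
                    speech_available: bool,
                    ask_user: Callable[[str], bool]) -> Tuple[bool, float]:
    """Decide whether to store the profile as baseline and with which confidence."""
    if speech_available:
        # Store as baseline only when the estimated emotion matches the
        # emotion detected in the conversational context.
        return (context_emotion == profile.emotion, profile.confidence)
    # No speech input: validate via an output means (speaker/display prompt).
    if ask_user(profile.emotion):
        return (True, 0.9)                       # illustrative "high" confidence
    return (True, min(profile.confidence, 0.5))  # existing or lower score (illustrative)

# Example: no speech available, the user confirms "happiness" when prompted.
store, conf = decide_baseline(EmotionProfile("happiness", 72.0, 0.6),
                              context_emotion=None,
                              speech_available=False,
                              ask_user=lambda emotion: True)   # -> (True, 0.9)
```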
[0009] According to an embodiment of the present invention, the conversational system 100 as explained above comprises the use of at least two input signals 128 from respective means 132, instead of at least one input signal 128, to ensure that multimodal sensor signals are considered. However, the conversational system 100 is also adaptable, usable or implementable with one input signal 128 from a respective means 132 as well. The wearable device 106 is a health monitoring device which is worn by the user 130 and has the ability to measure vitals of the user 130 such as, but not limited to, heart rate, blood pressure, oxygen level, etc. Further, the means 132 mentioned are not limited to the above list but may comprise other devices known in the art.
[0010] According to an embodiment of the present invention, the controller 110 is either for the conversational system 100 which uses the emotion profile 116 of the user 130 for emotion based contextual conversation, or the controller 110 is for the standalone emotion profiling system, in which case the output of the emotion profiling system is used by the conversational system 100 for the emotion aware conversation with the user 130.
[0011] According to another embodiment of the present invention, while the baseline emotion is already stored in a memory element 118 of the controller 110, the controller 110 is configured to normalize the estimated emotion with the baseline emotion. The controller 110 is then configured to use the normalized emotion of the user 130 and perform an action corresponding to the estimated emotion. The action comprises at least one of a response to the user 130 through an output means 120 and assisting the user 130 in a task.
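As a minimal sketch of this normalization and the follow-on action selection, assuming a simple intensity difference and an illustrative threshold of 20 (neither of which is specified in the disclosure):

```python
def normalize_emotion(estimated_intensity: float, baseline_intensity: float) -> float:
    """Deviation of the estimated intensity from the user's stored baseline (0..100 scale)."""
    return estimated_intensity - baseline_intensity

def choose_action(emotion: str, estimated_intensity: float, baseline_intensity: float) -> str:
    """Map the normalized emotion to an action (respond to the user or assist in a task)."""
    deviation = normalize_emotion(estimated_intensity, baseline_intensity)
    if emotion in ("anger", "stress") and deviation > 20:   # illustrative threshold
        return "defer non-critical prompts such as service reminders"
    return "respond normally and assist with the requested task"
```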
[0012] According to an embodiment of the present invention, the controller 110 is configured to monitor the emotion profile 116 of the user 130 for a predefined time period before the emotion profile 116 is stored as the baseline. Further, the baseline is set for each type of emotion for each user 130.
[0013] According to an embodiment of the present invention, the controller 110 is configured to adjust the weightage of each of the input signals 128 received from the respective means 132 (based on availability). The controller 110 monitors the emotion profile 116 (or the emotions from each of the at least one signal 128) estimated by the emotion model 112 in comparison to the context, determines that the context is more aligned with one of the input signals 128, and thus increases the weightage of the at least one input signal 128 which aligns with the context. Thus, the controller 110 adjusts the weightage of each of the at least one input signal 128 with higher allocation to that at least one input signal 128 which is close to the determined context. For example, if there are two input signals 128 in the conversational system 100, i.e. one microphone 102 and one camera 108, and the input signal 128 from the camera 108 estimates the emotion with higher confidence while the input signal 128 from the at least one microphone 102 is neutral, then the controller 110 allocates higher weightage to the input signal 128 from the camera 108 than to the at least one microphone 102 for estimating the emotion profile 116 for the user 130.
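A minimal sketch of such a re-weighting step follows; the adjustment step of 0.1 and the renormalization are illustrative assumptions, the disclosure only requires that weights stay between 0 and 1 (or 0% and 100%).

```python
def adjust_weights(weights: dict, per_signal_emotions: dict,
                   context_emotion: str, step: float = 0.1) -> dict:
    """Raise the weightage of input signals whose estimated emotion aligns with
    the determined context, then renormalize so the weights stay within 0..1."""
    updated = dict(weights)
    for signal, emotion in per_signal_emotions.items():
        if emotion == context_emotion:
            updated[signal] = updated[signal] + step
    total = sum(updated.values())
    return {signal: weight / total for signal, weight in updated.items()}

# Example: the camera agrees with the context ("sad"), the microphone reports neutral.
new_weights = adjust_weights({"camera": 0.5, "microphone": 0.5},
                             {"camera": "sad", "microphone": "neutral"},
                             context_emotion="sad")
# new_weights -> {"camera": ~0.545, "microphone": ~0.455}
```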
[0014] In simple words, the controller 110, over a time period, finetunes the weightage. For example, if the facial emotion is sad but the voice is neutral, and there is already an established context of the user having lost her phone, the controller 110 determines that the user 130 expresses emotion facially more than vocally. So, slightly more weightage is given to the emotion estimated from the camera 108 than to the emotion estimated from the microphone 102. Further, when the adjustment happens over multiple instances, over a period of time, the controller 110 learns the characteristic of the user's expression of emotion by gradually finetuning the weightages.
[0015] Further, each user 130 (or person) has a different medium of expressing their emotion. Hence, the equal weight given to all the means 132 is finetuned to subjective values for the user 130, based on the understanding of the baseline profile. As the emotion model 112 builds the emotion profile 116 of the user 130, the controller 110 monitors the differential emotion value. The controller 110 adjusts the weights based on the feedback response from the emotion model 112. As the emotion history is built, the conversational and assistive system uses the emotion. This allows the conversational system 100 not only to understand the conversational context better, but also to change the course of actions and conversation based on the user emotion (for example, not prompting for vehicle servicing).
[0016] According to an embodiment of the present invention, the conversational system 100 may use an always-on microphone 102 to listen to, capture or monitor the conversations within an environment. The at least one microphone 102 collects the speech data and passes it to the controller 110, where a speech module or speech processor (not shown) continuously analyzes and processes the incoming speech data. The speech module performs speech processing such as speaker diarization, recognition, tonal and emotion analysis, and speech-to-text conversion. The emotion model 112 estimates the emotion profile 116 of the user 130 based on the same, followed by formation of the baseline or execution of the action. Alternatively, the at least one microphone 102 is selectively switched ON by the user 130 before the conversation.
[0017] The speech input refers to dialogue or utterances in the environment with one or more users 130. The context model 114 uses at least one of a rule-based and a learning-based model to process the incoming text data, along with the conversation history and/or emotion history, to estimate the context of the ongoing conversation.
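For the rule-based variant, a toy sketch could look as follows; the keyword rules, the fallback to the latest emotion in the history, and the function name estimate_context are all illustrative assumptions rather than disclosed details.

```python
import re
from typing import List, Optional

# Hypothetical keyword rules; the disclosure also allows a learning-based model here.
CONTEXT_RULES = {
    r"\b(interview|exam|deadline)\b": "stress",
    r"\blost (my|her|his) phone\b": "sadness",
    r"\b(promotion|birthday|holiday)\b": "happiness",
}

def estimate_context(utterances: List[str],
                     emotion_history: List[str]) -> Optional[str]:
    """Estimate the emotional context of the ongoing conversation from recent
    utterances, falling back to the emotion history when no rule matches."""
    text = " ".join(utterances).lower()
    for pattern, emotion in CONTEXT_RULES.items():
        if re.search(pattern, text):
            return emotion
    return emotion_history[-1] if emotion_history else None

# Example: estimate_context(["I lost my phone on the bus"], []) -> "sadness"
```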
[0018] According to an embodiment of the present invention, the task is at least one of, but not limited to, scheduling a meeting, rescheduling a service of an equipment or appliance, postponing a reminder, setting the reminder, booking tickets for an event such as a movie or theater, playing a song, and the like in a smart environment. The controller 110 is configured to perform an action in relation to the determined emotion and the application domain. The application domain corresponds to the environment in which the conversational system 100 is deployed, such as the automotive domain, home, office, hospital, hospitality, etc. Further, the user 130 in the environment is not just one, but one or more users 130 who are in proximity to the at least one microphone 102.
[0019] It is important to understand some aspects of Artificial Intelligence (AI)/Machine Learning (ML) technology and AI/ML based devices/systems (such as the conversational system 100), which can be explained as follows. Depending on the architecture of the implementation, AI/ML devices/systems may include many components. One such component is an AI/ML model or AI/ML module. Different modules are described later in this disclosure. The AI/ML model can be defined as a reference or an inference set of data, which uses different forms of correlation matrices. Using these AI/ML models and the data from these AI/ML models, correlations can be established between different types of data to arrive at some logical understanding of the data. A person skilled in the art would be aware of the different types of AI/ML models such as linear regression, naive Bayes classifier, support vector machine, neural networks and the like. It must be understood that this disclosure is not specific to the type of model being executed and can be applied to any AI/ML module irrespective of the AI/ML model being executed. A person skilled in the art will also appreciate that the AI/ML model may be implemented as a set of software instructions, a combination of software and hardware, or any combination of the same.
[0020] Some of the typical tasks performed by AI/ML systems are classification, clustering, regression, etc. The majority of classification tasks depend upon labeled datasets; that is, the datasets are labelled manually in order for a neural network to learn the correlation between labels and data. This is known as supervised learning. Some of the typical applications of classification are face recognition, object identification, gesture recognition, voice recognition, etc. In a regression task, the model is trained on labeled datasets where the target labels are numeric values. Some of the typical applications of regression are weather forecasting, stock price prediction, house price estimation, energy consumption forecasting, etc. Clustering or grouping is the detection of similarities in the inputs. Clustering techniques do not require labels to detect similarities.
[0021] In accordance with an embodiment of the present invention, the controller 110 is provided with the necessary signal detection, acquisition, and processing circuits. The controller 110 comprises the input interface 122 and output interface 124 having pins or ports, the memory element 118 such as Random Access Memory (RAM) and/or Read Only Memory (ROM), an Analog-to-Digital Converter (ADC) and a Digital-to-Analog Converter (DAC), clocks, timers, counters and at least one processor (capable of implementing machine learning) connected with each other and to other components through communication bus channels. The memory element 118 is pre-stored with logics or instructions or programs or applications or modules/models and/or threshold values/ranges, reference values, predefined/predetermined criteria/conditions, and predetermined lists, which is/are accessed by the at least one processor as per the defined routines. The internal components of the controller 110 are not explained for being state of the art, and the same must not be understood in a limiting manner. The controller 110 may also comprise communication units such as transceivers to communicate through wireless or wired means such as Global System for Mobile Communications (GSM), 3G, 4G, 5G, Wi-Fi, Bluetooth, Ethernet, serial networks, and the like. The controller 110 is implementable in the form of a System-in-Package (SiP) or System-on-Chip (SoC) or any other known type. Examples of the controller 110 comprise, but are not limited to, a microcontroller, microprocessor, microcomputer, etc.
[0022] Further, the processor may be implemented as any or a combination of one or more microchips or integrated circuits interconnected using a parent board, hardwired logic, software stored in the memory element 118 and executed by a microprocessor, firmware, an application specific integrated circuit (ASIC), and/or a field programmable gate array (FPGA). The processor is configured to exchange and manage the processing of various AI models.
[0023] According to an embodiment of the present invention, the controller 110 is part of at least one of an infotainment unit of the vehicle, a smartphone, a wearable device 106, or a cloud computer. Alternatively, the conversational system 100 is at least one of the infotainment unit of the vehicle, the smartphone, the wearable device, the cloud computer, a smart speaker, or a smart display and the like. In other words, the controller 110 is part of an internal device of the vehicle, or part of an external device which is connected to the vehicle through known wired or wireless means as described earlier, or an external device to be used in a non-automotive environment such as home, office, hospitals, etc. In the vehicle, the conversational system 100 can be distributed, for example with multiple cameras 108 and microphones 102 spread across a cabin of the vehicle. In the case of the cloud computer, the controller 110 is in the cloud and receives the signals from the means 132.
[0024] In accordance with an embodiment of the present invention, the controller 110 to enable conversation with emotion context with the user 130 is disclosed. A block diagram 126 illustrates the same. The controller 110 is configured to determine/receive an estimated emotion of the user 130 from an emotion history or the memory element 118, characterized in that the controller 110 is configured to perform the action corresponding to the estimated emotion. The action comprises at least one of a response to the user 130 through the output means 120 and assisting the user 130 in the task.
[0025] According to the present invention, a working of the controller 110 of the conversational system 100 is explained. Fig. 1 is an abstract view of the collection of emotion from different sources/means 132 by the conversational system 100 (or multimodal emotion understanding system). The conversational system 100 collects the weighted data of the emotion predictions, intensity, and confidence from various means 132, such as emotion detection from voice, emotion detection from vitals (or physiological parameters) through wearable devices 106, emotion detected from the text of what the user speaks, and emotion detection from the user's face through the camera 108. The emotion models 112 in the controller 110 are neural models (or other AI or ML models) that individually predict emotions from the respective input signals 128 (voice, text, image stream, etcetera). The conversational system 100 initially considers all these predictions with equal weight. The aggregated data is considered as the emotion profile 116 of the user 130 and is ready for further processing, i.e. for baselining.
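A minimal sketch of this equally-weighted multimodal aggregation is given below; the per-modality tuple layout (emotion, intensity, confidence) and the choice of the final label by weighted confidence are assumptions made for illustration, not disclosed specifics.

```python
from typing import Dict, Optional, Tuple

def aggregate_emotion_profile(
        predictions: Dict[str, Tuple[str, float, float]],
        weights: Optional[Dict[str, float]] = None) -> dict:
    """Weighted aggregation of per-modality (emotion, intensity, confidence)
    predictions into a single emotion profile; equal weights are used initially."""
    if weights is None:
        weights = {modality: 1.0 / len(predictions) for modality in predictions}
    intensity = sum(weights[m] * p[1] for m, p in predictions.items())
    confidence = sum(weights[m] * p[2] for m, p in predictions.items())
    # Choose the emotion label carrying the largest weighted confidence.
    label_scores: Dict[str, float] = {}
    for modality, (emotion, _, conf) in predictions.items():
        label_scores[emotion] = label_scores.get(emotion, 0.0) + weights[modality] * conf
    return {"emotion": max(label_scores, key=label_scores.get),
            "intensity": intensity,
            "confidence": confidence}

# Example with voice, text, face and wearable predictions, equally weighted:
profile = aggregate_emotion_profile({
    "voice":    ("happiness", 70.0, 0.8),
    "text":     ("neutral",   40.0, 0.5),
    "face":     ("happiness", 90.0, 0.9),
    "wearable": ("happiness", 60.0, 0.6),
})  # -> emotion "happiness" with averaged intensity and confidence
```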
[0026] The controller 110 checks if the user's 130 baseline profile is already established and is confident enough. If so, the controller 110 normalizes the feature matrix (or the emotion profile 116) and stores it in the user's emotion history inside the memory element 118. The emotion profile 116 is usable later for any action to be performed. If the baseline is not present, then the controller 110 checks if there is a known context which the user 130 mentioned in ongoing utterances or conversation, such as an interview or a meeting with a friend. The controller 110 stores the estimated emotion profile 116 (or emotion matrix) as the baseline value for the detected emotion. This indicates that the emotion profile 116 is estimated for each emotion type and stored under the user's emotion history in the memory element 118.
[0027] In case the context is not present and the emotion is intense (the multimodal system has marked high intensity with high confidence), the controller 110 is configured to randomly send a prompt through an output means 120 of the conversational system 100. Depending on the ongoing conversation and context, the controller 110 may or may not prompt the user 130, for example with "Hey, you seem happy! Anything special you would like to share?" In response, the user 130 might either confirm or decline the estimated emotion prompt (or choose not to respond at all), in an indirect or direct manner. If validated, the controller 110 sends the emotion back to be stored in the emotion profile 116 of the user 130. Alternatively, the controller 110 saves/stores the estimated emotion profile 116 as the baseline in the memory element 118.
[0028] According to the present invention, the technical effect of the controller 110 is envisaged with an example. A smiling person has a higher baseline emotion of happiness, let us say 55 (out of 100). Hence, if the emotion model 112 estimates the emotion as happiness with an intensity of 80, the controller 110 determines a small variation, compared to someone whose baseline emotion for happiness is 0. This process allows the controller 110 to understand the variation in subjective expression of the emotion.
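Plugging the numbers of this example into the assumed baseline-relative normalization makes the comparison explicit (illustrative arithmetic only, not a disclosed formula):

```python
# Deviation from each user's own baseline (illustrative arithmetic only).
smiling_user_deviation = 80 - 55   # 25: a small variation for the habitual smiler
neutral_user_deviation = 80 - 0    # 80: a large variation for a baseline of 0
```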
[0029] According to the present invention, the controller 110 builds a baseline emotion profile 116 from multiple means 132, understands the variation from the baseline and starts modulating the conversation or actions based on the user emotion, for example, a virtual assistant in the car not prompting for service reminders if the user 130 is stressed or angry.
[0030] Fig. 2 illustrates a method of operating the controller for the conversational system, according to the present invention. The method comprises a plurality of steps, of which a step 202 comprises receiving, by the controller 110, at least one input signal 128 from at least one means 132 for a user characteristic selected from a group comprising speech, texts in the speech, physiological parameters and a facial image. The at least one means 132 comprises at least one microphone 102, the ASR module 104, the wearable device 106, and at least one camera 108, respectively connected to the controller 110. A step 204 comprises estimating the emotion profile 116, by the emotion model 112 of the controller 110, using the input signals 128 from each of the at least one means 132. The emotion profile 116 comprises the estimated emotion, the intensity of the estimated emotion and the confidence score of the estimated emotion. The method is characterized by, while the speech input signal 128 is available, a step 206 which comprises determining the context, by the context model 114 of the controller 110, of the ongoing conversation detected in the speech input signal 128. A step 208 comprises storing, by the controller 110, the emotion profile 116 as baseline for the emotion and for the user 130 if the estimated emotion and the emotion detected in the context are the same. However, while the speech input signal 128 is unavailable, a step 210 comprises prompting, by the controller 110, the user 130 for the response to validate the estimated emotion through the output means 120. A step 212 comprises storing, by the controller 110, the emotion profile 116 as baseline if validated by the user 130 with a high confidence score; otherwise, i.e. if not validated by the user 130, the method comprises storing the estimated emotion profile 116 as the baseline with the existing score or a lower score. The method is executed by the controller 110.
[0031] According to the method, while the baseline emotion is stored in the memory element 118, the method comprises a step 214 which comprises normalizing the estimated emotion with the baseline emotion. A step 216 comprises using the normalized emotion of the user 130, and performing the action corresponding to the estimated emotion. The action comprises at least one of the response to the user 130 through the output means 120 and assisting the user 130 in the task. According to the present invention, the method also comprises monitoring the emotion profile 116 of the user 130 for the predefined time period before the emotion profile 116 is stored as the baseline.
[0032] According to the present invention, once the baseline is established/set, the method performs a step 218 (periodically) which comprises monitoring the estimated emotion of the at least one input signal 128 received from the at least one means 132 in comparison to the context, and adjusting the weightage of each of the at least one input signal 128 with higher allocation to that at least one input signal 128 which is determined to be close to the determined context. The weightage is adjusted between 0 and 1, or 0% and 100%.
[0033] According to the present invention, a method for enabling conversation with emotion context with the user 130 is disclosed. The method comprises a plurality of steps, of which a step 220 comprises determining the estimated emotion of the user 130. The method is characterized by a step 222 which comprises performing the action corresponding to the estimated emotion. The action comprises at least one of the response to the user 130 through the output means 120 and assisting the user 130 in the task.
[0034] According to an embodiment of the present invention, the conversational system 100 is preferably used for the vehicle to provide more convenience to the driver or passengers. The conversational system 100 may also be referred to as a digital companion or virtual companion, which is more than a digital assistant in the sense that the conversational system 100 is able to extract/derive and give more information for a detected or asked query. Again, as indicated above, the automatic conversational system 100 is applicable to different domains and environments such as home, office, hospital, airports, the hospitality industry and the like, and is not just limited to the vehicle.
[0035] According to the present invention, an emotion aware personal companion is provided through the controller 110 and the method. The present invention uses multiple means 132 and processes these expressions to create a multimodal emotional awareness. In other words, the controller 110 analyses speech, the user's words, the user's face, and the user's wearable to understand the user's emotional state. Every person's intensity of feeling and manner of expression is different. Someone might show a happiness intensity of 60 on their face while they are feeling 100. On the other hand, many people have a "smiling" face, and a facial emotion recognition system will always mark such a person as happy, even if the person is neutral, or sometimes angry. So, standard emotion analysis models might not work well for everyone. The present invention uses the common practice from psychology of establishing a baseline. As humans, we subconsciously profile the baseline of our near and dear ones. The present invention ticks both essential conditions. Unlike existing emotion detection systems, in addition to the emotion history, the controller 110 considers a weighted aggregation of the sensory information from the means 132, along with the baseline profile, to understand the current emotion of the user 130. On top of that, the understanding of the emotion is non-intrusive and happens without the user 130 mentioning it explicitly. Understanding the emotion as context and modulating future conversation or the course of action in a conversational and assistive system makes the conversational system 100 a personal companion which is perceivably more intelligent and, more importantly, sensible, which is a far-fetched aim for the existing conversational systems.
[0036] The personal companion seamlessly understands the user's emotion and influences the conversation to fit the situation. For example, in the evening, while going home, if the user 130 is stressed and a little furious, the personal companion avoids the service reminder; or if the car fuel drops below the reserve level but is more than sufficient for the user 130 to comfortably reach home (as per navigation information), the personal companion refrains from alerting about the low fuel. On the other hand, if the user 130 is driving fast, the personal companion understands that stress and anger may reduce reaction time, and it may prompt the user 130 about the speed in a caring way, for which it would not have prompted otherwise. State-of-the-art virtual assistants in modern cars cannot perform this. This is provided just as an example to understand the invention in a better way, and the invention is in no sense limited by the same.
[0037] It should be understood that the embodiments explained in the description above are only illustrative and do not limit the scope of this invention. Many such embodiments and other modifications and changes in the embodiment explained in the description are envisaged. The scope of the invention is only limited by the scope of the claims.

Claims

We claim:
1. A controller (110) of a conversational system (100), said conversational system (100) comprises at least one means (132) to receive input signals (128), to detect user characteristics, selected from a group comprising speech, texts in said speech, a physiological parameters and facial image, and said controller (110) connected to said at least one means (132), and configured to estimate an emotion profile (116), by an emotion model (112), using said input signals (128) from each of said at least one means (132), said emotion profile (116) comprises an estimated emotion, an intensity and confidence score, characterized in that, while a speech input is available, determine a context, by a context model (114), of an ongoing conversation detected in said speech input, and store said emotion profile (116) as baseline for said emotion and for said user (130) if said estimated emotion and the emotion detected in said context are same, and while a speech input is unavailable, prompt said user (130) for a response to validate said estimated emotion, and store said emotion profile (116) as baseline if validated by said user (130) with high confidence score.
2. The controller (110) as claimed in claim 1, wherein while a baseline emotion is stored in a memory element (118), said controller (110) configured to normalize said estimated emotion with said baseline emotion.
3. The controller (110) as claimed in claim 1 configured to monitor said emotion profile (116) of said user (130) for a predefined time period before said emotion profile (116) is stored as the baseline emotion.
4. The controller (110) as claimed in claim 1 configured to monitor said estimated emotion of said at least one input signal (128) received from said at least one means (132) in comparison to said context, and adjust weightage of each of said at least one input signal (128) with higher allocation to said at least one input signal (128) which is close to said context.
5. A controller (110) to enable conversation with emotion context with a user (130), said controller (110) configured to determine an estimated emotion of said user (130), characterized in that, and perform action corresponding to said estimated emotion, said action comprises at least one of a response to said user (130) through an output means (120) and assist said user (130) in a task.
6. A method for a conversational system (100), said method comprising the steps of: receiving at least one input signal (128) from at least one means (132) for user characteristic selected from a group comprising speech, texts in said speech, a physiological parameters and facial image, and estimating an emotion profile (116), by an emotion model (112), using said input signal (128) from each of said at least one means (132), said emotion profile (116) comprises an estimated emotion, an intensity and confidence score, characterized by, while a speech input is available, determining a context, by a context model (114), of an ongoing conversation detected in said speech input, and storing said emotion profile (116) as baseline for said emotion and for said user (130) if said estimated emotion and the emotion detected in said context are same, and while a speech input is unavailable, prompting said user (130) for a response to validate said estimated emotion, and storing said emotion profile (116) as baseline if validated by said user (130) with high confidence score.
7. The method as claimed in claim 6, while a baseline emotion is stored in a memory element (118), said method comprises normalizing said estimated emotion with said baseline emotion.
8. The method as claimed in claim 6, comprises monitoring said emotion profile (116) of said user (130) for a predefined time period before said emotion profile (116) is stored as the baseline.
9. The method as claimed in claim 6 comprises, monitoring said estimated emotion of said at least one input signal (128) received from said at least one means (132) in comparison to said context, and adjusting weightage of each of the at least one input signal (128) with higher allocation to said at least one input signal (128) which is close to said context.
10. A method for enabling conversation with emotion context with a user (130), said method comprising the steps of determining an estimated emotion of said user (130), characterized by, and performing an action corresponding to said estimated emotion, said action comprises at least one of a response to said user (130) through an output means (132) and assist said user (130) in a task.
PCT/EP2024/074266 2023-09-01 2024-08-30 A controller for a conversational system using emotion context and method for operating the same Pending WO2025046064A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN202341058659 2023-09-01
IN202341058659 2023-09-01

Publications (1)

Publication Number Publication Date
WO2025046064A1 true WO2025046064A1 (en) 2025-03-06

Family

ID=92672177

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2024/074266 Pending WO2025046064A1 (en) 2023-09-01 2024-08-30 A controller for a conversational system using emotion context and method for operating the same

Country Status (1)

Country Link
WO (1) WO2025046064A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120245934A1 (en) * 2011-03-25 2012-09-27 General Motors Llc Speech recognition dependent on text message content
US20190378515A1 (en) * 2018-06-12 2019-12-12 Hyundai Motor Company Dialogue system, vehicle and method for controlling the vehicle
US20200279553A1 (en) * 2019-02-28 2020-09-03 Microsoft Technology Licensing, Llc Linguistic style matching agent
WO2023113717A1 (en) 2021-12-13 2023-06-22 Di̇zaynvi̇p Teknoloji̇ Bi̇li̇şi̇m Ve Otomoti̇v Sanayi̇ Anoni̇m Şi̇rketi̇ Smart vehicle assistant with artificial intelligence



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24765552

Country of ref document: EP

Kind code of ref document: A1