CN119180335A - Intelligent network-connected automobile digital virtual person dialogue method, device, vehicle and medium - Google Patents
Intelligent network-connected automobile digital virtual person dialogue method, device, vehicle and medium
- Publication number
- CN119180335A (application CN202411242022.9A)
- Authority
- CN
- China
- Prior art keywords
- dialogue
- digital virtual
- user
- voice
- person
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
- G06N5/041—Abduction
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/40—3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/225—Feedback of the input speech
Abstract
The invention relates to the technical field of automobile digital virtual persons, and discloses an intelligent network-connected automobile digital virtual person dialogue method, device, vehicle and medium. The method comprises: generating a corresponding digital virtual person in response to a user's character setting operation; collecting the user's voice data and the vehicle's state data; recognizing the voice data to determine the user intention; identifying the current scene of the vehicle and the dialogue service corresponding to that scene from the user intention and the state data; and determining control instructions based on the digital virtual person, the current scene and the dialogue service, then controlling the digital virtual person to execute corresponding actions to complete the dialogue with the user. By generating a personalized digital virtual person through the character setting operation, determining the current scene and its dialogue service from the user's voice data and the vehicle's state data, and acquiring control instructions that drive the digital virtual person's actions, the invention improves the user's voice-interaction service experience.
Description
Technical Field
The invention relates to the technical field of automobile digital virtual persons, and in particular to an intelligent network-connected automobile digital virtual person dialogue method, device, vehicle and medium.
Background
Currently, as intelligent network-connected automobiles undergo digital and intelligent upgrading and transformation, the anthropomorphic design of voice products such as in-vehicle assistants has become increasingly important, and anthropomorphic intelligent voice interaction has become a differentiating battleground for automakers. With the rapid development of computer graphics, graphics rendering, motion capture, deep learning, speech synthesis and related technologies, virtual digital humans are often used to fuse multi-modal interaction modes and provide anthropomorphic intelligent interaction on each terminal.
A virtual digital person is a digital character, created with digital technology and close to the human image, that has the capacity for appearance, expression, perception and interaction. Through an intelligent system it automatically reads, analyzes and recognizes external input information, decides the digital person's subsequent output text according to the analysis result, and drives the character model to generate the corresponding voice and actions so that the digital person can interact with the user.
However, in the related art the digital virtual person has a single, fixed setting, with only one voice or one character pose, ignoring the personalized needs of different users; research on driving scenes and emotional states during interaction is also often neglected, so the dialogue system's scene perception and emotional expression are poor, greatly reducing the user's voice-interaction service experience.
Disclosure of Invention
In view of the above, the present invention provides an intelligent network-connected automobile digital virtual person dialogue method, device, vehicle and medium, to solve the problems described in the background: the existing dialogue system ignores users' personalized needs and lacks in-depth study of scene perception and emotional expression, which causes multiple defects and seriously affects user experience.
In a first aspect, the present invention provides a method for intelligent internet-connected automobile digital virtual person dialogue, the method comprising:
responding to the character setting operation of the user to generate a corresponding digital virtual person;
collecting voice data of a user and state data of a vehicle;
identifying the voice data and determining user intention;
identifying a current scene of the vehicle and a dialogue service corresponding to the current scene according to the user intention and the state data;
and determining control instructions based on the digital virtual person, the current scene and the dialogue service, and controlling the digital virtual person to execute corresponding actions based on the control instructions so as to complete the dialogue with the user.
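By way of illustration only, the five steps above can be sketched in Python; all function names, keyword tables and scene/service labels below are assumptions of this sketch, not part of the claimed method:

```python
from dataclasses import dataclass

# Hypothetical, simplified stand-ins for the claimed components; names are illustrative.
INTENT_KEYWORDS = {"navigate": "navigation", "play": "music", "weather": "weather"}
SCENE_SERVICE = {
    "navigation": ("navigation_scene", "multi_turn_dialogue"),
    "music":      ("entertainment_scene", "query_service"),
    "weather":    ("query_scene", "qa_chat"),
}

@dataclass
class ControlInstruction:
    animation: str   # action the avatar should perform
    reply_text: str  # text to be spoken via emotional TTS

def recognize_intent(utterance: str) -> str:
    """Crude keyword matching standing in for real speech recognition plus NLU."""
    for keyword, intent in INTENT_KEYWORDS.items():
        if keyword in utterance.lower():
            return intent
    return "chat"

def identify_scene(intent: str, vehicle_state: dict):
    """Map user intent (plus, in a real system, vehicle state) to scene and service."""
    return SCENE_SERVICE.get(intent, ("chat_scene", "qa_chat"))

def dialogue_turn(avatar_name: str, utterance: str, vehicle_state: dict) -> ControlInstruction:
    intent = recognize_intent(utterance)
    scene, service = identify_scene(intent, vehicle_state)
    return ControlInstruction(animation="nod",
                              reply_text=f"[{avatar_name}|{scene}|{service}] On it!")

print(dialogue_turn("Xiaozhi", "Help me navigate home", {"speed_kph": 60}))
```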
The invention generates the corresponding digital virtual person in response to the user's character setting operation, determines the current scene and the corresponding dialogue service from the collected user voice data and vehicle state data, and acquires control instructions to make the digital virtual person execute the corresponding actions. This solves problems such as weak scene perception and insufficient emotional expressiveness in intelligent network-connected automobile digital virtual person dialogue, and greatly improves the anthropomorphic, intelligent interaction experience between the user and the digital virtual person.
In an alternative embodiment, generating a corresponding digital virtual person in response to a user's character setting operation includes:
Responding to the character setting operation of a user on preset human setting elements and preset image elements, and correspondingly generating target human setting and target images;
generating a digital virtual person according to the target person setting and the target image;
And performing image optimization on the digital virtual man according to a preset optimization mode to obtain the optimized digital virtual man.
According to the invention, the target person setting and the target image are obtained in response to the person setting operation of the user, the digital virtual person is obtained based on the target person setting and the target image, and the image is optimized, so that the personalized requirements of the user are met, the corresponding digital virtual person is generated, the quality of the generated digital virtual person is ensured, and the intelligent and anthropomorphic image of the digital virtual person is improved.
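As an illustration only, the target persona ("human setting") and target image could be carried as simple data structures before the optimization pass; every field name below is an assumption of this sketch:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Persona:                    # target "human setting" assembled from preset elements
    name: str
    character: str                # e.g. "warm", "witty"
    hobbies: List[str] = field(default_factory=list)
    catchphrase: str = ""

@dataclass
class Appearance:                 # target image assembled from preset image elements
    style: str                    # e.g. "2D" or "3D"
    hairstyle: str
    outfit: str

def generate_avatar(persona: Persona, look: Appearance) -> dict:
    """Combine persona and appearance, then run a (stubbed) optimization pass."""
    avatar = {"persona": persona, "appearance": look, "optimized": False}
    avatar["optimized"] = True    # placeholder for grooming / lip-sync optimization
    return avatar

avatar = generate_avatar(Persona("Xiaozhi", "warm", ["music"]),
                         Appearance("3D", "short", "festival outfit"))
print(avatar["persona"].name, avatar["optimized"])
```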
In an alternative embodiment, identifying a current scene of a vehicle and a dialog service corresponding to the current scene based on user intent and status data includes:
determining corresponding scenes and emotion voices according to the user intention and the state data;
and determining the current scene of the vehicle and the dialogue service corresponding to the current scene based on the scene and the emotion voice.
According to the method and device, the corresponding scene and emotion voice are determined from the user intention and the state data, and the current scene of the vehicle and its corresponding dialogue service are determined based on that scene and emotion voice, so that the driving scene and emotional state during the dialogue are taken into account, improving scene perception and emotional expressiveness and meeting the user's intelligent dialogue needs.
In an alternative embodiment, the control instructions include action instructions and/or voice instructions; determining the control instructions based on the digital virtual person, the current scene and the dialogue service, and controlling the digital virtual person to execute corresponding actions based on the control instructions to complete the dialogue with the user, includes:
matching a reply language, an action and/or emotion voice of the dialogue service based on the digital virtual person and the current scene;
configuring animation instructions and/or voice instructions based on the reply language and the action and/or emotion voice;
And controlling the digital virtual person to execute corresponding actions according to the animation instructions and/or the voice instructions so as to complete the dialogue with the user.
According to the invention, the reply language, action and/or emotion voice of the dialogue service are matched to the digital virtual person and the current scene, and the animation instructions and/or voice instructions are determined based on them, so that interaction between the user and the digital virtual person is more anthropomorphic and intelligent, greatly improving the quality and efficiency of the dialogue and meeting the user's intelligent dialogue needs.
In an alternative embodiment, configuring voice instructions based on a reply language and emotion voice includes:
and inputting the reply language and the emotion voice into a preset acoustic model to output target emotion voice, and determining the target emotion voice as a voice instruction.
According to the invention, the current reply language and emotion voice are synthesized through the preset acoustic model, and the voice command corresponding to the emotion feature is output, so that more accurate emotion voice dialogue can be realized, and the service experience of the user on voice interaction is improved.
In a second aspect, the present invention provides an intelligent network-connected automobile digital virtual person dialogue device, which includes:
The character setting module is used for responding to character setting operation of a user to generate a corresponding digital virtual person;
the data collection module is used for collecting voice data of a user and state data of a vehicle;
the intention recognition module is used for recognizing voice data and determining the intention of a user;
The scene engine module is used for identifying the current scene of the vehicle and the dialogue service corresponding to the current scene according to the user intention and the state data;
and the dialogue service module is used for determining a control instruction based on the digital virtual person, the current scene and the dialogue service and controlling the digital virtual person to execute corresponding actions based on the control instruction so as to complete the dialogue with the user.
With the intelligent network-connected automobile digital virtual person dialogue device described above, the corresponding digital virtual person is generated through the character setting operation, the current scene and its dialogue service are determined from the user's voice data and the vehicle's state data, and control instructions are acquired to make the digital virtual person execute the corresponding actions. This alleviates problems such as weak scene perception and insufficient emotional expressiveness in intelligent network-connected automobile digital virtual person dialogue and, to a certain extent, improves the anthropomorphic, intelligent interaction experience between the user and the digital virtual person.
In an alternative embodiment, the character setting module includes:
The person setting unit is used for providing different person settings comprising preset person setting elements so as to enable a user to correspondingly generate target person settings after performing person setting operation;
The image setting unit is used for providing different images containing preset image elements so as to correspondingly generate a target image after the user performs character setting operation;
The character generating unit is used for generating a digital virtual person according to the target person setting and the target image;
And the image optimization unit is used for performing image optimization on the digital virtual person according to a preset optimization mode to obtain the optimized digital virtual person.
The invention provides the target person setting and the target image obtained after the person setting operation through the person setting unit and the image setting unit, generates the digital virtual person through the person generating unit, and optimizes the image of the digital virtual person by combining the image optimizing unit, thereby not only meeting the personalized requirements of users and generating the corresponding digital virtual person, but also guaranteeing the quality of the generated digital virtual person and improving the intelligent and anthropomorphic image thereof.
In an alternative embodiment, the scene engine module includes:
the data candidate unit is used for providing different scenes and emotion voices corresponding to different people;
The scene recognition unit is used for determining corresponding scenes and emotion voices from the data candidate unit according to the user intention and the state data, and determining the current scenes of the vehicle and dialogue services corresponding to the current scenes based on the scenes and the emotion voices;
And the scene output unit is used for outputting the current scene of the vehicle and the dialogue service corresponding to the current scene.
According to the invention, the scene recognition unit determines the corresponding scene and emotion voice from the data candidate unit according to the user intention and the state data, and determines the current scene of the vehicle and the dialogue service corresponding to the current scene based on the scene and emotion state, so that the driving scene and emotion state in the dialogue process can be considered, the scene perception and emotion expressive force can be improved, and the intelligent dialogue requirement of the user can be further met.
In an alternative embodiment, the control instructions include action instructions and/or voice instructions, and the dialog service module includes:
The dialogue design unit is used for defining reply languages corresponding to different scenes and actions and/or emotion voices corresponding to different people;
The dialogue configuration unit is used for matching the reply language, the action and/or the emotion voice of the dialogue service from the dialogue design unit based on the digital virtual person and the current scene, and configuring the animation instruction and/or the voice instruction based on the reply language, the action and/or the emotion voice;
And the dialogue executing unit is used for controlling the digital virtual person to execute corresponding actions according to the animation instructions and/or the voice instructions so as to complete dialogue with the user.
The dialogue configuration unit of the invention matches the reply language, action and/or emotion voice of dialogue service from the dialogue design unit based on the digital virtual person and the current scene, and determines the animation instruction and/or voice instruction based on the reply language, action and/or emotion voice, so that the interaction between the user and the digital virtual person is more personified and intelligent, the quality and efficiency of dialogue can be greatly improved, and the intelligent dialogue requirement of the user is met.
In a third aspect, the invention provides a vehicle, the vehicle comprising a controller, the controller comprising a memory and a processor, the memory and the processor being in communication with each other, the memory having stored therein computer instructions, the processor executing the computer instructions to thereby perform an intelligent network-connected automotive digital virtual human conversation method according to the first aspect or any of its corresponding embodiments.
In a fourth aspect, the present invention provides a computer readable storage medium, where computer instructions are stored on the computer readable storage medium, where the computer instructions are configured to cause a computer to perform an intelligent network-connected automobile digital virtual person dialogue method according to the first aspect or any one of the corresponding embodiments of the first aspect.
According to the intelligent network-connected automobile digital virtual person dialogue method and device, the corresponding digital virtual person is generated in response to the user's character setting operation, the current scene and its dialogue service are determined from the collected user voice data and vehicle state data, and control instructions are acquired to make the digital virtual person execute the corresponding actions. The user's personalized needs are thus fully considered while problems such as weak scene perception and insufficient emotional expressiveness are solved, which benefits the anthropomorphic, intelligent interaction experience between the user and the digital virtual person.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of an intelligent network-connected car digital virtual person dialogue method according to an embodiment of the invention;
FIG. 2 is a flow chart of another intelligent networked automobile digital virtual person dialogue method according to an embodiment of the invention;
FIG. 3 is a schematic diagram of the architecture of a speech design system;
FIG. 4 is a schematic diagram of an emotion dialog design;
FIG. 5 is a schematic diagram of a scene dialog configuration;
FIG. 6 is a schematic representation of prediction of synthesized audio;
FIG. 7 is a block diagram of an intelligent networked car digital virtual human dialog device in accordance with an embodiment of the present invention;
Fig. 8 is a schematic structural view of a controller of a vehicle according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In an embodiment of the present invention, an embodiment of an intelligent network-connected automobile digital virtual human dialogue method is provided, and it should be noted that, the steps illustrated in the flowchart of the drawings may be performed in a computer system such as a set of computer executable instructions, and, although a logical order is illustrated in the flowchart, in some cases, the steps illustrated or described may be performed in an order different from that illustrated herein.
In this embodiment, a method for intelligent network-connected automobile digital virtual human dialogue is provided, fig. 1 is a schematic flow chart of the intelligent network-connected automobile digital virtual human dialogue method according to an embodiment of the invention, as shown in fig. 1, the flow includes the following steps:
step S101, generating a corresponding digital dummy in response to a user' S character setting operation.
It should be noted that the specific manner of the character setting operation is not limited here and may be adapted to actual requirements. For example, the user selects the corresponding character by touch operation on the in-car screen, or selects it by voice. Specifically, the user can wake up the digital virtual person setting function with a preset voice wake-up word; the related speech uttered by the user is recognized, and the digital virtual person is designed based on the recognition result to meet the user's personalized needs. This is only an exemplary illustration.
Step S102, voice data of a user and state data of a vehicle are collected.
It should be noted that, in this embodiment, the specific contents and acquisition manners of the voice data and the state data are not limited here and may be determined according to actual requirements. For example, the user's voice data is collected in real time by microphones mounted on the vehicle, while the state data of the vehicle includes, but is not limited to, vehicle status signals, position signals, traffic signals and weather signals, each acquired by conventional data-acquisition means in the art, such as acquiring the vehicle's position signal in real time from the on-board navigation system. This is only an exemplary illustration.
Step S103, recognizing voice data and determining user intention.
In this embodiment, the user intention mainly reflects the user's current usage needs for the vehicle. For example, after the user says "help me navigate home", voice recognition yields the intention to use the navigation function. This is only an example, not a limitation; other user intentions in the automotive field can be adapted to the actual driving scene.
Step S104, identifying the current scene of the vehicle and the dialogue service corresponding to the current scene according to the user intention and the state data.
It should be noted that, the current scene of the vehicle is intended to indicate the current use environment and condition of the vehicle, and the specific scene of the vehicle in this embodiment may be adaptively set based on the actual project requirement, such as a navigation scene (e.g. using a navigation function), an entertainment scene (e.g. listening to music), a vehicle control scene (e.g. controlling the vehicle to travel at a constant speed), a query scene (e.g. querying stocks), and a chat scene (e.g. chatting with a vehicle assistant), which are merely illustrative.
Step S105, determining control instructions based on the digital virtual person, the current scene and the dialogue service, and controlling the digital virtual person to execute corresponding actions based on the control instructions so as to complete the dialogue with the user.
In this embodiment, the control instruction is mainly used to control the digital virtual person to execute a corresponding action to implement dialogue interaction with the user, and the specific content and expression form are not limited herein, and are adaptively adjusted based on actual requirements. For example, the control instructions are voice instructions by which a digital virtual person may interact with the user in voice, just as an exemplary illustration.
According to the intelligent network-connected automobile digital virtual person dialogue method described above, the corresponding digital virtual person is generated in response to the user's character setting operation, the current scene and its dialogue service are determined from the collected user voice data and vehicle state data, and control instructions are acquired to make the digital virtual person execute the corresponding actions, so that the user's personalized needs are taken into account, problems such as weak scene perception and insufficient emotional expressiveness are alleviated, and the anthropomorphic, intelligent interaction experience between the user and the digital virtual person is greatly improved.
In this embodiment, a method for intelligent internet-connected automobile digital virtual person dialogue is provided, fig. 2 is a schematic flow chart of another intelligent internet-connected automobile digital virtual person dialogue method according to an embodiment of the invention, as shown in fig. 2, the flow includes the following steps:
step S201, a corresponding digital dummy is generated in response to a user' S character setting operation.
Specifically, the step S201 includes:
in step S2011, in response to the user' S character setting operation on the preset human setting element and the preset image element, the target human setting and the target image are correspondingly generated.
In this embodiment, the specific contents of the preset persona elements and preset image elements are not limited here and can be adapted to actual project requirements. For example, appearance, figure, stature, clothing, character traits, personality traits and preferences are set according to the user's social and personality attributes and serve as the specific persona elements. This is merely exemplary.
Step S2012, generating a digital virtual person according to the target person setting and the target image.
And step S2013, performing image optimization on the digital virtual person according to a preset optimization mode to obtain the optimized digital virtual person.
In this embodiment, the specific preset optimization mode is set according to actual requirements. For example, the preset optimization modes include a decoration optimization mode, i.e., dressing the digital virtual person according to holidays and user preferences, and a broadcast optimization mode, i.e., using lip-sync technologies such as Wav2Lip, DeepFake or PaddleGAN to give the digital virtual person more vivid and realistic lip movements, so that it presents a more natural and fluent voice expression during dialogue interaction with the user.
According to the embodiment of the invention, the target person setting and the target image are obtained in response to the person setting operation of the user, the digital virtual person is obtained based on the target person setting and the target image, and the image is optimized, so that the personalized requirements of the user are met, the corresponding digital virtual person is generated, the quality of the generated digital virtual person is ensured, and the intelligent and anthropomorphic image of the digital virtual person is improved.
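By way of illustration, the broadcast optimization mode could drive the open-source Wav2Lip inference script from Python; the repository layout and checkpoint path below are assumptions about a local Wav2Lip checkout, not part of this disclosure:

```python
import subprocess

def lipsync_broadcast(face_video: str, tts_audio: str, out_path: str) -> None:
    """Invoke the Wav2Lip reference inference script to align the avatar's
    lip movements with synthesized TTS audio (paths are assumptions)."""
    subprocess.run(
        ["python", "inference.py",
         "--checkpoint_path", "checkpoints/wav2lip_gan.pth",  # pretrained weights
         "--face", face_video,    # rendered avatar video or still image
         "--audio", tts_audio,    # emotional TTS output to lip-sync against
         "--outfile", out_path],
        check=True,
    )

# Example (assumes Wav2Lip is checked out in the working directory):
# lipsync_broadcast("avatar.mp4", "reply.wav", "avatar_lipsynced.mp4")
```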
Step S202, collecting voice data of a user and status data of a vehicle. Please refer to step S102 in the embodiment shown in fig. 1 in detail, which is not described herein.
Step S203, the voice data is recognized, and the user intention is determined. Please refer to step S103 in the embodiment shown in fig. 1 in detail, which is not described herein.
Step S204, identifying the current scene of the vehicle and the dialogue service corresponding to the current scene according to the user intention and the state data.
Specifically, the step S204 includes:
step S2041, corresponding scenes and emotion voices are determined according to the user intention and the state data.
It should be noted that emotional voice helps make voice interaction warmer and more anthropomorphic. In this embodiment, after the corresponding scene is determined from the user intention and the state data, the emotion voice is determined from the designed scene-emotion orchestration correspondence.
Step S2042, determining a current scene of the vehicle and a dialogue service corresponding to the current scene based on the scene and the emotion voice.
In the present embodiment, the specific content of the dialogue service is not limited herein, and for example, the dialogue service includes a multi-turn dialogue service, a question-and-answer chat service, a query class service, and the like, which are only illustrative.
According to this embodiment of the invention, the corresponding scene and emotion voice are determined from the user intention and the state data, and the current scene of the vehicle and its corresponding dialogue service are determined based on that scene and emotion voice, so that the driving scene and emotional state during the dialogue are taken into account, improving scene perception and emotional expressiveness and meeting the user's intelligent dialogue needs.
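Purely as a sketch, the scene-emotion correspondence can be treated as a lookup keyed on user intent and vehicle state; the table entries and service names below are assumptions of this illustration:

```python
# Illustrative orchestration table: (intent, weather) -> (scene, emotion voice).
# A production system would load this from the designed scene-emotion orchestration.
SCENE_EMOTION = {
    ("navigation", "raining"): ("navigation_scene", "comforting"),
    ("navigation", "clear"):   ("navigation_scene", "happy"),
    ("music",      "clear"):   ("entertainment_scene", "lively"),
}

def current_scene_and_service(intent: str, vehicle_state: dict):
    """Determine the scene and emotion voice from intent and state, then the
    dialogue service that the scene maps to (service names are illustrative)."""
    weather = vehicle_state.get("weather", "clear")
    scene, emotion = SCENE_EMOTION.get((intent, weather), ("chat_scene", "neutral"))
    service = "multi_turn_dialogue" if scene == "navigation_scene" else "qa_chat"
    return scene, emotion, service

print(current_scene_and_service("navigation", {"weather": "raining"}))
```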
Step S205, determining a control instruction based on the digital virtual person, the current scene and the dialogue service, and controlling the digital virtual person to execute corresponding actions based on the control instruction so as to complete the dialogue with the user.
In this embodiment, the control instruction includes an action instruction and/or a voice instruction, which are used to control the digital virtual person to execute a corresponding action according to user preference, so as to achieve improvement of user experience and satisfaction, where the action instruction is intended to achieve action response (such as eye communication) of the digital virtual person to the user, and the voice instruction is intended to achieve natural language interaction of the digital virtual person and the user. Specifically, the step S205 includes:
step S2051 matches the reply language and action and/or emotion voice of the dialogue service based on the digital dummy and the current scene.
In the present embodiment, emotion voice (also referred to as emotional TTS) is speech carrying emotional characteristics; the specific type of emotional characteristic, such as happiness, is set according to actual needs. This is only an exemplary illustration.
TTS (Text-to-Speech) is a technology involving multiple disciplines such as acoustics, linguistics, natural language understanding, signal processing and pattern recognition, which converts text information into natural, fluent speech output. It is an important component of human-machine dialogue and aims to let the machine "speak".
Step S2052, the animation instructions and/or voice instructions are configured based on the reply language and the action and/or emotion voice.
It should be noted that, in this embodiment, configuring the voice instruction based on the reply language and the emotion voice includes inputting the reply language and the emotion voice into a preset acoustic model to output a target emotion voice, and determining the target emotion voice as the voice instruction. The preset acoustic model is a trained deep-learning model for emotional speech synthesis, whose specific type is selected according to the actual data. By synthesizing the current reply language and emotion voice through the preset acoustic model and outputting a voice instruction carrying the corresponding emotional characteristics, a more accurate emotional voice dialogue can be realized, improving the user's voice-interaction service experience.
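A minimal sketch of the voice-instruction interface follows; it assumes nothing about the real acoustic model and simply writes a tone whose pitch depends on the emotion label, only to show the shape of the payload (the reply text is ignored by the stub):

```python
import math
import struct
import wave

def synthesize_emotional_speech(reply: str, emotion: str, out_path: str) -> str:
    """Stand-in for the preset acoustic model. A real system would condition a
    trained neural TTS model on (reply text, emotion label); this stub ignores
    the reply and writes one second of a tone whose pitch tracks the emotion."""
    pitch_hz = {"happy": 440.0, "comforting": 220.0}.get(emotion, 330.0)
    with wave.open(out_path, "wb") as f:
        f.setnchannels(1)          # mono
        f.setsampwidth(2)          # 16-bit samples
        f.setframerate(16000)      # 16 kHz
        for i in range(16000):     # one second of audio
            sample = int(12000 * math.sin(2 * math.pi * pitch_hz * i / 16000))
            f.writeframes(struct.pack("<h", sample))
    return out_path                # file path used as the voice-instruction payload

voice_instruction = synthesize_emotional_speech("On our way home!", "happy", "reply.wav")
```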
In this embodiment, the animation instruction instructs the digital virtual person to execute a preset animation. The animation includes specific actions (such as waving, clapping, spinning or jumping) and effects (also called special effects, such as flames, explosions or shadows, which add visual impact and expressiveness to the digital virtual person's actions and make the animated scene more vivid and attractive); the specific animation content can be adapted to actual project requirements.
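For illustration, pairing an animation instruction with a voice instruction might look as follows; the data structures and the happy-implies-sparkle rule are assumptions of this sketch:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class AnimationInstruction:
    action: str                   # preset action, e.g. "wave", "clap", "spin", "jump"
    effect: Optional[str] = None  # optional special effect, e.g. "sparkle", "flame"

def configure_instructions(reply: str, action: str, emotion: str) -> Tuple[AnimationInstruction, dict]:
    """Pair an animation instruction with a voice instruction (structures illustrative)."""
    anim = AnimationInstruction(action=action,
                                effect="sparkle" if emotion == "happy" else None)
    voice = {"text": reply, "emotion": emotion}  # later fed to the acoustic model
    return anim, voice

print(configure_instructions("On our way home!", "wave", "happy"))
```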
Step S2053, controlling the digital virtual person to execute corresponding actions according to the animation instructions and/or voice instructions, so as to complete the dialogue with the user.
In this embodiment of the invention, the reply language, action and/or emotion voice of the dialogue service are matched to the digital virtual person and the current scene, and the animation instructions and/or voice instructions are determined based on them, so that interaction between the user and the digital virtual person is more anthropomorphic and intelligent, greatly improving the quality and efficiency of the dialogue and meeting the user's intelligent dialogue needs.
In a specific embodiment, an intelligent network-connected automobile digital virtual person emotional voice system, i.e., an emotional and anthropomorphic voice design system, is designed according to the above dialogue method and includes the following:
1. Constructing an anthropomorphic digital virtual person.
In this embodiment, the digital virtual person design covers persona guidance and image linkage.
1.1 Persona: the requirements that guide character setting.
1.1.1 Character setting defines the overall frame of the character and directly affects the user's impression of the digital virtual person. The character settings include elements such as name, background story, occupation, personality, hobbies, age, design style (e.g., 2D/3D, body proportions, hairstyle, clothing), signature expressions/actions/catchphrases, and photos or videos uploaded by the user.
1.1.2 All emotional voice interactions, image actions, scene-specific actions, reply language, TTS, scene emotions and the like must be designed around the persona.
1.1.3 Different personas should adopt different schemes; combined with the user's preferences, the persona's name can be customized as the voice wake-up word.
1.2 Image linkage: combining roles or images from different fields to fit users' individual needs.
1.2.1 In real scenes, emotional expression is realized through targeted, scene-specific visual animation, grounded in full-scene interaction. For example, navigation scenes, music scenes, and scenes with emotional touchpoints such as holiday scenes and proactive-service scenes can use visual linkage to strengthen the sense of companionship.
1.2.2 The image refers to the virtual digital human's figure, usually with humanoid features such as hair, eyes, mouth, nose, ears, hands and feet, or in the form of an animal or object. For example, images in the style of the Michelin Man, Talking Tom, Pop Mart figures or Ai Yun can represent the car's central-control digital virtual person: a digital virtual human that chats with the owner, executes commands, provides intelligent services such as reminders and assistance, and can resonate with the owner.
1.2.3 Virtual digital human images are generated using AI techniques such as Stable Diffusion or Midjourney.
1.2.4 Based on the voice-interaction cloud, a local dynamic behavior database and intelligent recommendation services, the virtual human image is driven to expand its usage scenes.
1.2.5 When the virtual digital person broadcasts a dialogue, the synthesized TTS audio is input to a lip-sync model (e.g., Wav2Lip, which makes the spoken audio and text more anthropomorphic) that matches the speech phonemes to lip movements, so that the virtual digital person's lip movements accurately express the speech information.
1.2.6 Peripheral digital-asset materials are provided, supporting digital virtual human dressing and holiday outfit matching, in order to create an anthropomorphic digital virtual person with high affinity and a high level of intelligence.
2. Scene design.
In this embodiment, the scene design contains various orchestration design requirements and design compliance principles.
2.1 Orchestration design requirements.
2.1.1 Emotional voice helps make voice interaction more anthropomorphic. The persona's emotional voice is determined in the scene design and expressed through image animation and emotional dialogue design, where emotional dialogue is divided into reply-language design and TTS design (covering timbre and emotional style).
2.1.2 A persona model trained on the various characteristics the user inputs yields an intelligent digital assistant with specific traits and self-learning ability; effective information gathered while accompanying the owner, such as conversations, vehicle-usage characteristics and habits, raises its level of intelligence and enriches its persona. In effect, the digital virtual person is like the owner's "good friend": every scene design must be effective and match the owner's preferences to enhance the user's human-machine interaction experience.
2.1.3 High-frequency user scenes such as navigation, music, vehicle control, phone, weather, stock market and video require more elaborate design and orchestration, with targeted designs for each scene.
2.1.4 A visual editing platform module receives and parses the user's voice information and, based on preset definition and orchestration strategies, defines and orchestrates the input, processing and output flows of the scene module, so that users can view and modify scene-related information at any time, with the clear advantages of intuitiveness and flexibility.
2.2 Design principles.
2.2.1 Ultimate yet restrained: a design concept that pursues ultimate performance while staying restrained and simple. "Ultimate" means the reply-language design should be as good as it can be, leaving no regrets, while "restrained" emphasizes that the design follows conciseness, clarity and realism and avoids redundancy. Generic task-type scenes can use a generic design, while special scenes need focused, dedicated design.
2.2.2 Select scenes with high frequency and emotional touchpoints. The architecture of the voice design system is shown schematically in FIG. 3. In FIG. 3, ASR (Automatic Speech Recognition) converts human speech into text; NLU (Natural Language Understanding) lets a computer understand and use natural human language such as Chinese or English, enabling natural-language communication between humans and machines; DM (Dialogue Management) derives the task from the user input, determines the information the task requires, interfaces with the service platform to complete the task, and returns the execution result to the user; and NLG (Natural Language Generation) enables the computer to generate natural-language text. During voice interaction, ASR lets the machine hear and NLU lets the machine understand, while dialogue management acts as the brain of the interaction, controlling the machine to perform the corresponding processing once the input is understood.
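Purely as an interface sketch, the four stages chain as below; every function body is a stub, and the domain, intent and slot names are assumptions of this illustration:

```python
def asr(audio: bytes) -> str:
    """ASR: 'lets the machine hear' (stub transcript instead of a real recognizer)."""
    return "switch to highway priority"

def nlu(text: str) -> dict:
    """NLU: 'lets the machine understand', resolving domain, intent and slots."""
    return {"domain": "navi", "intent": "set_route_preference", "slot": "highway"}

def dm(semantics: dict) -> dict:
    """DM: pick the task, call the service platform (stubbed), return the result."""
    return {"task": semantics["intent"], "status": "ok"}

def nlg(result: dict) -> str:
    """NLG: turn the execution result into a natural-language reply."""
    return "Switched to highway priority." if result["status"] == "ok" else "Sorry, that failed."

print(nlg(dm(nlu(asr(b"<pcm audio>")))))
```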
3. Dialogue design.
In this embodiment, the dialogue design covers emotional dialogue design, dynamic cloud-platform configuration of scene dialogue design, and a prediction model.
3.1 Emotion dialogue design.
In this embodiment, the emotional dialogue design includes reply-language design and TTS design, both of which follow the persona rules. FIG. 4 is a schematic diagram of the emotional dialogue design.
3.1.1 Reply-language design: match the goal of the user's dialogue. Task-type scenes emphasize simplicity, directness and efficiency, while entertainment-type scenes emphasize fun and interaction. The dialogue design also follows the two scene-design principles above. For example, in a scene where the interaction cannot proceed, such as the user initiating "navigate home" without having set a home address, an emotionally designed reply might be: "You haven't told me where home is yet, which makes it hard for me."
3.1.2 TTS design: style, timbre, intonation, mood, speaking rate and the like are set around the digital virtual person's persona, and corpora for different moods/emotions together with the persona's emotional voice audio are provided (obtained from models that generate different emotional voices after deep-learning training). Emotional TTS carries emotional characteristics such as happiness, sadness, anger, playfulness, cuteness and comfort.
3.2 Dynamic configuration of the cloud platform for scene dialogue design.
In this embodiment, emotion tags can be configured online in the cloud: when a dialogue is designed in the scene design module, an emotion tag is added to the command semantics delivered in response to the vehicle end. FIG. 5 is a schematic diagram of the scene dialogue configuration; after the user selects the scene tag, reply language and emotion type provided by the cloud platform, the platform automatically completes the corresponding configuration.
In a specific embodiment, the user wakes up the voice assistant and queries "help me navigate home". The user's audio is recognized in real time and NLU resolves it into the "navi" navigation domain ("navi" stands for the navigation service in the automotive field) with a navigate-home intent. The cloud platform is accessed and intelligently matches the pre-orchestrated atomized reply for the navigation scene, and matches a "cute" or "happy" emotion tag according to the day's weather. Before the semantic atomic instruction is issued, the emotion tag is added to the semantic protocol in the format of Table 1 below; after fusion, the semantics are delivered to the registered service party, which parses the instruction and calls back to the voice system for emotional TTS broadcasting linked with avatar animation.
TABLE 1

| Field | Meaning | Required | Remarks |
| --- | --- | --- | --- |
| status | Status | Yes | - |
| text | Reply language | Yes | - |
| emotion | Visual action / emotion tag | Yes | - |
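As a sketch of the fused payload (the field names follow Table 1; the JSON encoding and field values are assumptions of this illustration):

```python
import json

def fuse_emotion_tag(reply_text: str, emotion: str) -> str:
    """Build a Table-1-style semantic payload for the registered service party."""
    payload = {
        "status": "success",   # status
        "text": reply_text,    # reply language
        "emotion": emotion,    # visual action / emotion tag
    }
    return json.dumps(payload, ensure_ascii=False)

print(fuse_emotion_tag("Navigating home now, lovely weather today!", "happy"))
```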
For example, when the user's voice input is "switch to highway priority", the corresponding orchestrated reply is matched before the DM semantics are issued.
It should be noted that the domain scenes in this embodiment are developed by combining the semantic skill tree with various scenes, for example navigation, music, video, vehicle control, playback control, weather, phone, ticket inquiry and casual chat. In addition, the virtual human persona obtained by fusion training on the various characteristics input by the user brings the expression style closer to the user's needs; in practical applications, feeding the persona's feature tags into a large model such as ChatGPT yields richer and more intelligent reply-language designs.
3.3 Predictive model.
In this embodiment, text information and the phoneme timestamps of audio are extracted to form a training set. The trained timestamp prediction model predicts the phoneme timestamps and pronunciation frame lengths of the text's pronunciation audio to generate a phoneme sequence, which is then input into the acoustic model to obtain synthesized audio for the digital virtual person's personalized voice broadcast. FIG. 6 is a schematic diagram of synthesized-audio prediction.
It should be noted that the prediction model determines the voice characteristics of the original text according to the target emotion and emotional timbre, and synthesizes the target emotional voice from those characteristics and the target timbre. When voices with different target timbres and different target emotions need to be synthesized, only voice data of one or a few timbres covering the different target emotions needs to be collected. This reduces the cost of emotional voice synthesis, reduces the dependence of voice quality on the emotional expressiveness of speakers with the corresponding timbres, determines voice characteristics more accurately, improves the quality of the synthesized voice, and improves the voice-interaction service experience.
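As an illustrative sketch of this pipeline shape, with both models replaced by stubs (the fixed onsets, frame lengths and dummy PCM are assumptions):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Phoneme:
    symbol: str
    onset_ms: int     # predicted phoneme timestamp
    dur_frames: int   # predicted pronunciation frame length

def predict_timestamps(text: str) -> List[Phoneme]:
    """Stand-in for the trained timestamp prediction model: assigns each
    character a fixed onset and frame length to form a phoneme sequence."""
    return [Phoneme(ch, i * 80, 8) for i, ch in enumerate(text) if not ch.isspace()]

def acoustic_model(phonemes: List[Phoneme]) -> bytes:
    """Stand-in for the acoustic model that renders the phoneme sequence;
    returns dummy PCM whose length tracks the predicted frame lengths."""
    return b"\x00" * (sum(p.dur_frames for p in phonemes) * 256)

sequence = predict_timestamps("welcome home")
audio = acoustic_model(sequence)
print(len(sequence), "phonemes ->", len(audio), "bytes of audio")
```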
In summary, the intelligent networking automobile digital virtual man dialogue method provided by the embodiment of the invention has the following advantages:
1. The digital virtual person can intuitively display rich and varied anthropomorphic emotional states; combined with the orchestrated emotional voice dialogue scenes and the lip-movement model, the digital human shows rich expressions and body movements with a humanized warmth during linked playback.
2. For synthesized audio, a timestamp prediction model predicts the phoneme timestamps and pronunciation frame lengths of the text audio to generate a phoneme sequence, which is then input into the acoustic model to synthesize the audio, enabling more accurate emotional voice dialogue.
3. In the reply-language design during scene editing of this embodiment, a large model such as ChatGPT is trained in the cloud according to the digital virtual person's persona and image attributes, such as age, character and preferences, so that a model conforming to the digital virtual person's attributes automatically orchestrates customized replies. Combined with emotion tagging, this gives more pertinent emotional responses, overcoming many shortcomings of manually designed replies, quickening the dialogue rhythm and improving the interaction experience.
This embodiment further provides an intelligent network-connected automobile digital virtual person dialogue device, which implements the foregoing embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, implementations in hardware, or in a combination of software and hardware, are also possible and contemplated.
The invention provides an intelligent network connection automobile digital virtual man dialogue device, as shown in figure 7, the device comprises:
The character setting module 701 is configured to generate a corresponding digital virtual person in response to the user's character setting operation.
The data collection module 702 is configured to collect voice data of a user and status data of a vehicle.
The intention recognition module 703 is used for recognizing voice data and determining the intention of the user.
The scene engine module 704 is configured to identify a current scene of the vehicle and a dialogue service corresponding to the current scene according to the user intention and the state data.
The session service module 705 is configured to determine a control instruction based on the digital virtual person, the current scene, and the session service, and control the digital virtual person to perform a corresponding action based on the control instruction, so as to complete a session with the user.
In some alternative embodiments, the character setting module comprises: a persona setting unit for providing different personas containing the preset persona elements, so that a target persona is generated after the user performs the character setting operation; an image setting unit for providing different images containing the preset image elements, so that a target image is generated after the user performs the character setting operation; a character generation unit for generating the digital virtual person from the target persona and the target image; and an image optimization unit for optimizing the digital virtual person's image according to the preset optimization mode to obtain the optimized digital virtual person.
According to the character setting module provided by the embodiment of the invention, the character setting unit and the character setting unit are used for providing the target person setting and the target image obtained after the character setting operation, the character generating unit is used for generating the digital virtual person, and the image optimizing unit is combined for carrying out image optimization on the digital virtual person, so that the personalized requirements of a user are met, the corresponding digital virtual person is generated, the quality of the generated digital virtual person is ensured, and the intelligent and anthropomorphic image of the digital virtual person is improved.
In some alternative embodiments, the scene engine module comprises a data candidate unit, a scene recognition unit and a scene output unit, wherein the data candidate unit is used for providing different scenes and emotion voices corresponding to different people, the scene recognition unit is used for determining corresponding scenes and emotion voices from the data candidate unit according to user intention and state data and determining the current scenes of the vehicle and dialogue services corresponding to the current scenes based on the scenes and the emotion voices, and the scene output unit is used for outputting the current scenes of the vehicle and the dialogue services corresponding to the current scenes.
According to the scene engine module provided by the embodiment of the invention, the scene recognition unit is used for determining the corresponding scene and emotion voice from the data candidate unit according to the user intention and the state data, and determining the current scene of the vehicle and the dialogue service corresponding to the current scene based on the scene and emotion state, so that the driving scene and emotion state in the dialogue process can be considered, the scene perception and emotion expressive force can be improved, and the intelligent dialogue requirement of the user can be further met.
In some optional embodiments, the dialogue service module comprises a dialogue design unit, a dialogue configuration unit and a dialogue execution unit, wherein the dialogue design unit is used for defining reply languages corresponding to different scenes and actions and/or emotion voices corresponding to different people, the dialogue configuration unit is used for matching the reply languages, the actions and/or the emotion voices of the dialogue service from the dialogue design unit based on the digital virtual people and the current scenes, and configuring animation instructions and/or voice instructions based on the reply languages, the actions and/or the emotion voices, and the dialogue execution unit is used for controlling the digital virtual people to execute corresponding actions according to the animation instructions and/or voice instructions so as to complete dialogue with a user.
According to the dialogue service module of this embodiment, the dialogue configuration unit matches the reply language, action and/or emotion voice of the dialogue service from the dialogue design unit based on the digital virtual person and the current scene, and determines the animation instructions and/or voice instructions based on them, so that interaction between the user and the digital virtual person is more anthropomorphic and intelligent, greatly improving the quality and efficiency of the dialogue and meeting the user's intelligent dialogue needs.
In some optional embodiments, the device further comprises a visual editing platform module for defining and arranging the input process, processing process and output process of the scene engine module based on a preset definition and arrangement strategy.
With the visual editing platform module provided by the embodiment of the invention, dialogue-related information can be viewed and modified at any time, which offers clear advantages in intuitiveness and flexibility.
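One way such a definition and arrangement strategy might be represented is as a small declarative pipeline that the visual editor renders and lets the user rearrange; the structure below is purely a hypothetical example, not a format given in this embodiment.

```python
# Hypothetical declarative description of the scene engine's three processes
scene_engine_pipeline = {
    "input":      ["user_intent", "vehicle_state"],
    "processing": ["match_scene", "match_emotion_voice"],
    "output":     ["current_scene", "dialogue_service"],
}
```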
In some optional embodiments, the device further comprises an emotion voice synthesis module for synthesizing emotion voice with a preset acoustic model and providing the dialogue service module with voice instructions corresponding to the emotion features.
According to the emotion voice synthesis module provided by the embodiment of the invention, the current reply language and emotion voice are synthesized by the preset acoustic model, and a voice instruction corresponding to the emotion features is output, enabling more accurate emotional voice dialogue and improving the user's experience of voice interaction.
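A minimal sketch of this synthesis step follows; `AcousticModel` is a stand-in for the preset acoustic model, and no real text-to-speech library API is implied.

```python
class AcousticModel:
    """Stand-in for the preset acoustic model assumed by this sketch."""
    def synthesize(self, text: str, emotion: str) -> bytes:
        # A real model would return a waveform conditioned on `emotion`;
        # placeholder bytes keep the sketch runnable end to end.
        return f"<audio emotion={emotion}>{text}</audio>".encode()

def build_voice_instruction(model: AcousticModel, reply: str, emotion: str) -> dict:
    """Synthesize the reply with the requested emotion and wrap it as a voice instruction."""
    return {"audio": model.synthesize(reply, emotion), "emotion": emotion}
```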
Further functional descriptions of the above modules are the same as those of the corresponding method embodiments above and are not repeated here.
The intelligent network-connected car digital virtual person dialogue device in this embodiment is presented in the form of functional units, where a unit may be an ASIC (Application-Specific Integrated Circuit), a processor and memory executing one or more pieces of software or firmware, and/or another device capable of providing the above functions.
The intelligent network-connected car digital virtual person dialogue device provided by the embodiment of the invention takes the personalized demands of users into account, alleviates problems such as weak scene perception and insufficient emotional expressiveness in the digital virtual person dialogue of intelligent network-connected cars, and improves, to a certain extent, the anthropomorphic and intelligent quality of the interaction between users and the digital virtual person.
The embodiment of the invention also provides a vehicle comprising a controller. The controller in this embodiment is a vehicle controller configured to perform operations such as power-up/power-down and sleep/wake-up on the sub-controllers and network nodes attached to it; in addition, each of its power supply interfaces can collect and output real-time current. Other controllers having the above functions are also applicable.
Fig. 8 is a schematic structural diagram of the controller according to an optional embodiment of the present invention. As shown in Fig. 8, the controller includes one or more processors 10, a memory 20, and interfaces for connecting the components, including a high-speed interface and a low-speed interface. The various components are communicatively coupled to each other using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executing within the controller, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display apparatus coupled to the interface. In some optional embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Likewise, multiple controllers may be connected, each providing part of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 10 is illustrated in Fig. 8.
The processor 10 may be a central processing unit, a network processor, or a combination thereof. The processor 10 may further include a hardware chip, which may be an application-specific integrated circuit, a programmable logic device, or a combination thereof. The programmable logic device may be a complex programmable logic device, a field-programmable gate array, generic array logic, or any combination thereof.
The memory 20 stores instructions executable by the at least one processor 10 to cause the at least one processor 10 to perform the method of the embodiments described above.
The memory 20 may include a storage program area, which may store an operating system and application programs required for at least one function, and a storage data area, which may store data created according to the use of the controller. In addition, the memory 20 may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or another non-transitory solid-state storage device. In some optional embodiments, the memory 20 may optionally include memory located remotely from the processor 10, which may be connected to the controller via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The memory 20 may comprise volatile memory, such as random access memory; nonvolatile memory, such as flash memory, a hard disk, or a solid-state disk; or a combination of the above types of memory.
The controller also includes a communication interface 30 through which the master control chip communicates with other devices or communication networks.
An embodiment of the present invention also provides a computer-readable storage medium. The method according to the embodiments of the present invention described above can be implemented in hardware or firmware, or as computer code that can be recorded on a storage medium, or as computer code originally stored in a remote storage medium or a non-transitory machine-readable storage medium, downloaded over a network, and stored in a local storage medium, so that the method described herein can be executed from software stored on a storage medium by a general-purpose computer, a special-purpose processor, or programmable or special-purpose hardware. The storage medium may be a magnetic disk, an optical disk, a read-only memory, a random-access memory, a flash memory, a hard disk, a solid-state disk, or the like; the storage medium may also comprise a combination of the above types of memory. It will be appreciated that a computer, a processor, a microprocessor master chip, or programmable hardware includes a storage component that can store or receive software or computer code which, when accessed and executed by the computer, processor, or hardware, implements the methods illustrated by the embodiments described above.
Although embodiments of the present invention have been described with reference to the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.
Claims (11)
1. An intelligent network-connected car digital virtual person dialogue method, characterized in that the method comprises the following steps:
responding to a character setting operation of a user to generate a corresponding digital virtual person;
collecting voice data of the user and state data of the vehicle;
recognizing the voice data and determining a user intention;
identifying a current scene of the vehicle and a dialogue service corresponding to the current scene according to the user intention and the state data;
and determining a control instruction based on the digital virtual person, the current scene and the dialogue service, and controlling the digital virtual person to execute corresponding actions based on the control instruction so as to complete the dialogue with the user.
2. The intelligent network-connected car digital virtual person dialogue method according to claim 1, wherein generating a corresponding digital virtual person in response to a character setting operation of a user comprises:
responding to a character setting operation of the user on preset person setting elements and preset image elements, and correspondingly generating a target person setting and a target image;
generating a digital virtual person according to the target person setting and the target image;
and performing image optimization on the digital virtual person according to a preset optimization mode to obtain an optimized digital virtual person.
3. The intelligent network-connected car digital virtual person dialogue method according to claim 1, wherein identifying the current scene of the vehicle and the dialogue service corresponding to the current scene according to the user intention and the state data comprises:
determining a corresponding scene and emotion voice according to the user intention and the state data;
and determining the current scene of the vehicle and the dialogue service corresponding to the current scene based on the scene and the emotion voice.
4. The intelligent network-connected car digital virtual person dialogue method according to claim 1, wherein the control instruction comprises an action instruction and/or a voice instruction, and wherein determining a control instruction based on the digital virtual person, the current scene and the dialogue service, and controlling the digital virtual person to perform a corresponding action based on the control instruction to complete the dialogue with the user, comprises:
matching a reply language and an action and/or emotion voice of the dialogue service based on the digital virtual person and the current scene;
configuring an animation instruction and/or a voice instruction based on the reply language and the action and/or emotion voice;
and controlling the digital virtual person to execute a corresponding action according to the animation instruction and/or the voice instruction so as to complete the dialogue with the user.
5. The intelligent network-connected car digital virtual person dialogue method according to claim 4, wherein configuring a voice instruction based on the reply language and the emotion voice comprises:
inputting the reply language and the emotion voice into a preset acoustic model to output a target emotion voice, and determining the target emotion voice as the voice instruction.
6. An intelligent network-connected car digital virtual person dialogue device, characterized in that the device comprises:
a character setting module, configured to generate a corresponding digital virtual person in response to a character setting operation of a user;
a data collection module, configured to collect voice data of the user and state data of the vehicle;
an intention recognition module, configured to recognize the voice data and determine a user intention;
a scene engine module, configured to identify a current scene of the vehicle and a dialogue service corresponding to the current scene according to the user intention and the state data;
and a dialogue service module, configured to determine a control instruction based on the digital virtual person, the current scene and the dialogue service, and to control the digital virtual person to execute corresponding actions based on the control instruction so as to complete the dialogue with the user.
7. The intelligent network-connected car digital virtual person dialogue device according to claim 6, wherein the character setting module comprises:
a person setting unit, configured to provide different person settings comprising preset person setting elements, so that a target person setting is correspondingly generated after the user performs a character setting operation;
an image setting unit, configured to provide different images comprising preset image elements, so that a target image is correspondingly generated after the user performs a character setting operation;
a character generating unit, configured to generate a digital virtual person according to the target person setting and the target image;
and an image optimization unit, configured to perform image optimization on the digital virtual person according to a preset optimization mode to obtain an optimized digital virtual person.
8. The intelligent network-connected car digital virtual person dialogue device according to claim 6, wherein the scene engine module comprises:
a data candidate unit, configured to provide different scenes and emotion voices corresponding to different person settings;
a scene recognition unit, configured to determine a corresponding scene and emotion voice from the data candidate unit according to the user intention and the state data, and to determine a current scene of the vehicle and a dialogue service corresponding to the current scene based on the scene and emotion voice;
and a scene output unit, configured to output the current scene of the vehicle and the dialogue service corresponding to the current scene.
9. The intelligent network-connected car digital virtual person dialogue device according to claim 6, wherein the control instruction comprises an action instruction and/or a voice instruction, and the dialogue service module comprises:
a dialogue design unit, configured to define reply languages corresponding to different scenes and actions and/or emotion voices corresponding to different person settings;
a dialogue configuration unit, configured to match a reply language and an action and/or emotion voice of the dialogue service from the dialogue design unit based on the digital virtual person and the current scene, and to configure an animation instruction and/or a voice instruction based on the reply language and the action and/or emotion voice;
and a dialogue execution unit, configured to control the digital virtual person to execute corresponding actions according to the animation instruction and/or the voice instruction so as to complete the dialogue with the user.
10. A vehicle comprising a controller, the controller comprising a memory and a processor communicatively coupled to each other, the memory storing computer instructions, and the processor executing the computer instructions to perform the intelligent network-connected car digital virtual person dialogue method of any one of claims 1 to 5.
11. A computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the intelligent network-connected car digital virtual person dialogue method of any one of claims 1 to 5.
Priority Applications (1)
- CN202411242022.9A (filed 2024-09-05, priority date 2024-09-05): Intelligent network connection automobile digital virtual man dialogue method device, vehicle and medium
Publications (1)
- CN119180335A, published 2024-12-24
Family
- ID: 93900731
- Family Applications (1): CN202411242022.9A (filed 2024-09-05; status: active, pending)
- Country Status (1): CN — CN119180335A (en)
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination