CN119920246A - In-vehicle voice dialogue method, system and vehicle computer - Google Patents
- Publication number
- CN119920246A (application CN202311421253.1A)
- Authority
- CN
- China
- Prior art keywords
- vehicle
- audio
- voice
- wake
- user
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Fittings On The Vehicle Exterior For Carrying Loads, And Devices For Holding Or Mounting Articles (AREA)
Abstract
The invention provides an in-vehicle voice dialogue method, an in-vehicle voice dialogue system, and a vehicle machine. In the method, a central processing unit judges whether user voice input received by a microphone provided on the vehicle includes a wake-up word and, if so, wakes up the in-vehicle voice dialogue process; an audio digital signal processing chip performs echo cancellation and noise reduction and generates the processed input audio; a graphics processor performs speech recognition on the input audio and generates speech recognition text; the central processing unit performs semantic understanding, determines the user intention based on the current semantics and dialogue history information, and obtains reply information for that intention; and an embedded neural network processor performs speech synthesis based on the reply information, generates reply audio, and transmits it to a speaker provided in the vehicle. The load of CPU computation is reduced, and a better voice interaction experience is provided under limited hardware conditions.
Description
Technical Field
The embodiments of the invention relate to the technical field of in-vehicle information systems, and in particular to an in-vehicle voice dialogue method, an in-vehicle voice dialogue system, and a vehicle machine.
Background
The hardware performance of the head unit of the in-vehicle infotainment system (hereinafter abbreviated as the "vehicle machine") is limited. To provide users with rich, multi-scenario intelligent services, a large amount of service software often competes for the vehicle machine's hardware resources.
The typical speech-engine solution in an in-vehicle infotainment system depends entirely on the computing power of the central processing unit (CPU). As the functions of the infotainment system grow richer and more complex, the CPU becomes overloaded. During development, the CPU load was observed to exceed 90% for long periods under certain conditions, making the voice dialogue system slow to respond and visibly stutter.
Therefore, the hardware resources the vehicle machine allocates to the intelligent voice dialogue system cannot always meet the needs of the speech engine, which degrades the actual performance of the voice system; an optimization scheme is needed to solve the problem of insufficient system resources.
Disclosure of Invention
In order to solve the above problems in the prior art, in a first aspect, an embodiment of the present invention provides an in-vehicle voice dialogue method. The method includes: judging, by a central processing unit, whether a wake-up word is included in user voice input received by a microphone provided on the vehicle, and waking up an in-vehicle voice dialogue process when the user voice input includes the wake-up word; performing, by an audio digital signal processing chip, echo cancellation and noise reduction on the user voice input received via the microphone, generating echo-cancelled and noise-reduced input audio, and transmitting it to a graphics processor; performing, by the graphics processor, speech recognition on the echo-cancelled and noise-reduced input audio, generating speech recognition text, and transmitting the text to the central processing unit; performing, by the central processing unit, semantic understanding on the speech recognition text to recognize the current semantics of the current dialogue turn; determining, by the central processing unit, the user intention based on the current semantics and dialogue history information, invoking an application corresponding to the user intention to obtain reply information for that intention, and transmitting the reply information to an embedded neural network processor; and performing, by the embedded neural network processor, speech synthesis based on the reply information, generating reply audio, and transmitting the reply audio to a speaker provided in the vehicle so that the speaker plays it.
In some embodiments, the method further comprises receiving, by the central processor, a wake-up indication signal, wherein the wake-up indication signal is from a voice wake-up hard key or a voice wake-up soft key on the vehicle, and waking, by the central processor, a vehicle-mounted voice conversation process in response to receiving the wake-up indication signal.
In some embodiments, performing echo cancellation and noise reduction on user speech input received via the microphone by an audio digital signal processing chip includes generating, by the audio digital signal processing chip, an echo signal estimate by adjusting parameters of an adaptive filter, simulating a channel environment of echoes in the user speech input, and subtracting, by the audio digital signal processing chip, the echo signal estimate from signals of the user speech input, generating an echo cancelled signal.
In some embodiments, performing echo cancellation and noise reduction by an audio digital signal processing chip on user speech input received via the microphone includes determining, by the audio digital signal processing chip, a direction and magnitude of an external noise source from the user speech input, and controlling, by the audio digital signal processing chip, a sound cancellation speaker to emit sound waves opposite in direction and same magnitude as the external noise source.
In a second aspect, embodiments of the present invention provide an in-vehicle voice dialogue system that includes a central processor, an audio digital signal processing chip, a graphics processor, and an embedded neural network processor. The central processor is configured to judge whether user voice input received by a microphone provided on the vehicle includes a wake-up word, wake up the in-vehicle voice dialogue process when it does, perform semantic understanding on the speech recognition text transmitted by the graphics processor, recognize the current semantics of the current dialogue turn, determine the user intention based on the current semantics and dialogue history information, invoke an application corresponding to the user intention to obtain reply information for that intention, and transmit the reply information to the embedded neural network processor. The audio digital signal processing chip is configured to perform echo cancellation and noise reduction on the user voice input received via the microphone, generate echo-cancelled and noise-reduced input audio, and transmit it to the graphics processor. The graphics processor is configured to perform speech recognition on the echo-cancelled and noise-reduced input audio, generate speech recognition text, and transmit the text to the central processor. The embedded neural network processor is configured to perform speech synthesis based on the reply information, generate reply audio, and transmit the reply audio to a speaker provided in the vehicle so that the speaker plays it.
In some embodiments, the central processor is further configured to receive a wake-up indication signal, wherein the wake-up indication signal is from a voice wake-up hard key or a voice wake-up soft key on the vehicle, and wake-up an on-board voice conversation process in response to receiving the wake-up indication signal.
In some embodiments, the audio digital signal processing chip is further configured to simulate a channel environment of an echo in a user speech input by adjusting parameters of an adaptive filter to generate an echo signal estimate, and to subtract the echo signal estimate from the signal of the user speech input to generate an echo cancelled signal.
In some embodiments, the audio digital signal processing chip is further configured to determine a direction and an amplitude of an external noise source from the user voice input, and to control the sound attenuating speaker to emit sound waves having a direction opposite to the external noise source and the same amplitude.
In some embodiments, the central processor, the audio digital signal processing chip, the graphics processor, and the embedded neural network processor are integrated on the same system-on-chip.
In a third aspect, embodiments of the present invention provide a vehicle machine, including the vehicle-mounted voice dialogue system described in any of the above embodiments.
In the existing speech-engine solution, all modules of the speech engine run their computation on the CPU and depend entirely on the CPU's performance and computing-power resources. The heterogeneous scheme provided by the embodiments of the invention splits the speech engine into several parts: the wake-up, dialogue management, and semantic understanding modules compute on the CPU; the Audio DSP chip handles the echo cancellation/noise reduction module; the GPU computes the speech recognition module; and the NPU computes the speech synthesis (TTS) module, forming a CPU-GPU-NPU-DSP heterogeneous solution for the in-vehicle speech engine.
At present the GPU is mainly used to render the navigation map; its utilization in in-vehicle infotainment development is low, so its load is low under normal conditions. Today's mainstream automotive-grade system chips carry GPUs with fairly strong processing capability. Compared with a CPU, a GPU has many cores and few control units, which makes it better suited to compute-intensive, data-parallel programs; deep learning models exhibit both characteristics, so the ASR module is implemented with a deep learning model.
Speech synthesis (TTS) is implemented by a neural network model-based deep learning algorithm and is therefore well suited for NPU chip processing.
To guarantee Bluetooth call quality, an Audio DSP chip is often already integrated in the in-vehicle infotainment system and provides professional echo cancellation and noise reduction; the voice dialogue function can reuse this capability, thereby reducing the load on the other hardware.
Splitting the modules in this way, rather than using only the CPU and the GPU, goes further toward balancing the load across the chips as much as possible.
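The module-to-processor split described above can be sketched as a simple dispatch table. This is a hypothetical illustration only; the module and unit names below are assumptions, not an actual vendor interface:

```python
# Hypothetical sketch of the CPU-GPU-NPU-DSP split described above.
# Module and unit names are illustrative, not a real vendor API.
MODULE_TO_UNIT = {
    "wake_word": "CPU",
    "semantic_understanding": "CPU",
    "dialogue_management": "CPU",
    "echo_cancellation_noise_reduction": "DSP",
    "speech_recognition": "GPU",   # ASR: deep model, data-parallel
    "speech_synthesis": "NPU",     # TTS: neural network inference
}

def assigned_unit(module: str) -> str:
    """Return the processing unit that runs a given speech-engine module."""
    return MODULE_TO_UNIT[module]
```

A scheduler built on such a table would keep the three lightweight control-flow modules on the CPU while offloading the signal-processing and neural-inference modules.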
The heterogeneous scheme provided by the embodiments of the invention fully exploits the capabilities of the host SoC's other processing units and removes the bottleneck of the voice system's performance being limited by system resources. Compared with existing processing that runs the in-vehicle voice dialogue entirely on the CPU, it reduces the CPU's computational load to a range the CPU can bear. It can therefore better cope with the complex resource environment of the in-vehicle infotainment system and provide a better voice interaction experience under limited hardware conditions.
Drawings
The above, as well as additional purposes, features, and advantages of embodiments of the present invention will become apparent in the following detailed written description and claims upon reference to the accompanying drawings. Several embodiments of the present invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
FIG. 1 shows a flow chart of a method of vehicle-mounted voice conversation in accordance with an embodiment of the present invention;
Fig. 2 shows a schematic diagram of an in-vehicle voice dialog system according to an embodiment of the invention.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present invention will be described below with reference to several exemplary embodiments. It should be understood that these embodiments are presented merely to enable those skilled in the art to better understand and practice the invention and are not intended to limit the scope of the invention in any way.
In one aspect, embodiments of the present invention provide a vehicle-mounted voice conversation method. Referring to fig. 1, a flow chart of an in-vehicle voice conversation method 100 in accordance with an embodiment of the present invention is shown. The method 100 includes S101-S106.
In step S101, a central processing unit (CPU) judges whether a wake-up word, for example "hello, XX" (the car system's name), is included in the user voice input received by a microphone provided on the vehicle, and wakes up the in-vehicle voice dialogue process when the input includes the wake-up word. Typically, the CPU is the core processor responsible for computation and control in a computer system.
As an embodiment of the invention, wake-up can also occur when the central processing unit receives a wake-up indication signal from a voice wake-up hard key or a voice wake-up soft key on the vehicle; in response to receiving the signal, the central processing unit wakes up the in-vehicle voice dialogue process.
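The two wake-up paths above (a wake word in the utterance, or a hard/soft key press) can be sketched as follows. The wake word and parameter names are placeholders, not the system's actual interface:

```python
def should_wake(user_text: str, wake_word: str = "hello, xx",
                key_pressed: bool = False) -> bool:
    """Wake the in-vehicle voice dialogue process when the utterance
    contains the wake word, or when a voice wake-up hard/soft key was
    pressed.  The wake word here is a placeholder for the real one."""
    return key_pressed or (wake_word in user_text.lower())
```

In a real system the wake-word check would run on a small always-on acoustic model rather than substring matching; the sketch only shows the control flow.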
In step S102, an Audio digital signal processing (DSP) chip performs echo cancellation and noise reduction (ECNR) on the user voice input received via the microphone, generates echo-cancelled and noise-reduced input audio, and transmits it to the graphics processor.
Echo arises in duplex operation when the microphone records the loudspeaker's own output; for example, when the vehicle machine plays music, the in-vehicle voice system may recognize the lyrics as input.
As an embodiment of the present invention, echo cancellation may be performed by an audio digital signal processing chip generating an echo signal estimate by adjusting parameters of an adaptive filter to simulate a channel environment of an echo in a user's voice input, and by an audio digital signal processing chip subtracting the echo signal estimate from the signal of the user's voice input to generate an echo cancelled signal.
Echo cancellation is typically implemented with an adaptive filter: a filter with adjustable parameters is designed, an adaptive algorithm adjusts those parameters to simulate the channel through which the echo travels and estimate the echo signal, and the estimate is then subtracted from the received signal to cancel the echo. For example, when the in-car audio system is playing a song, the song's waveform should be removed from the recorded voice-command audio so that the playing song does not interfere with recognizing the correct voice command.
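A textbook least-mean-squares (LMS) adaptive filter illustrates this estimate-and-subtract scheme. This is a sketch only; the Audio DSP chip's actual algorithm is not specified in the text:

```python
import numpy as np

def lms_echo_cancel(mic: np.ndarray, ref: np.ndarray,
                    taps: int = 4, mu: float = 0.01) -> np.ndarray:
    """Cancel the echo of the reference (speaker) signal from the
    microphone signal using an LMS adaptive filter."""
    w = np.zeros(taps)                     # adjustable filter parameters
    out = np.zeros_like(mic, dtype=float)
    padded = np.concatenate([np.zeros(taps - 1), ref])
    for n in range(len(mic)):
        x = padded[n:n + taps][::-1]       # most recent reference samples
        echo_est = w @ x                   # simulated echo-channel output
        e = mic[n] - echo_est              # echo-cancelled sample
        w += 2 * mu * e * x                # adapt toward the echo path
        out[n] = e
    return out
```

With a synthetic echo path, the residual after the filter has adapted is far smaller than the recorded signal, which is the behavior the paragraph describes.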
As an embodiment of the present invention, noise reduction may be performed by the audio digital signal processing chip determining the direction and amplitude of an external noise source from the user voice input, and controlling a noise-cancelling speaker to emit sound waves opposite in direction to, and of the same amplitude as, the external noise source. The two waves superimpose and the noise is reduced. For example, in-vehicle voice systems generally need to reduce the impact of tire noise, engine noise, wind noise, and the like on speech recognition accuracy.
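In idealized form, the active noise reduction described above amounts to emitting the anti-phase of the sensed noise so the two waves cancel when superimposed (real systems must also compensate for propagation delay and amplitude error, which this sketch ignores):

```python
import numpy as np

def anti_noise(noise_wave: np.ndarray) -> np.ndarray:
    """Cancelling wave: same amplitude as the sensed noise, opposite
    phase, so that noise + anti-noise superimpose to (ideally) zero."""
    return -noise_wave
```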
In step S103, a graphics processor (GPU) performs speech recognition (Automatic Speech Recognition, ASR) on the echo-cancelled and noise-reduced input audio, generates speech recognition text, and transmits the text to the central processor. Generally, the GPU is a processor mainly responsible for image- and graphics-related operations, and it can also perform general-purpose computation. Speech recognition is a technique that converts human speech into text: through recognition and understanding, the machine converts the speech signal into corresponding text or commands. For example, when the user speaks an instruction, speech recognition converts the audio correctly into the corresponding text.
In step S104, the CPU performs semantic understanding (Natural Language Understanding, NLU) on the speech recognition text and recognizes the current semantics of the current dialogue turn. Semantic understanding simulates the human language interaction process so that the computer can understand and use natural human languages such as Chinese and English, enabling natural-language communication between human and computer. In an in-vehicle voice system, the NLU classifies the ASR result text of the user's speech into sub-domains, such as querying the weather or booking flights, and recognizes the user's specific intention. For example, for an instruction to book a flight for tomorrow to a given city, semantic understanding recognizes that the functional domain is ticket booking, the time is tomorrow, and the destination is that city.
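A toy rule-based sketch shows the kind of domain classification and slot extraction NLU performs here. The domains, patterns, slot names, and the booking example are hypothetical placeholders; a production NLU would use a trained classifier rather than regular expressions:

```python
import re

def understand(text: str) -> dict:
    """Toy semantic understanding: classify the utterance into a
    sub-domain and extract slots (all names are illustrative)."""
    if "weather" in text:
        return {"domain": "weather_query"}
    m = re.search(r"book .*ticket to (\w+) (?:for|on) (\w+)", text)
    if m:
        return {"domain": "ticket_booking",
                "destination": m.group(1),
                "time": m.group(2)}
    return {"domain": "unknown"}
```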
In step S105, the user intention is determined by the CPU based on the current semantic and dialogue history information, that is, dialogue management (Dialogue Management, DM) is performed, and then an application corresponding to the user intention is invoked to obtain reply information to the user intention, and the reply information is transmitted to the embedded neural network processor.
Dialogue management decides how to respond to the user based on the dialogue history. The most common application is task-driven multi-turn dialogue: the user has an explicit goal such as ordering food or booking tickets, but the requirement is complex, has many constraints, and may need several turns to state. On one hand, the user can continuously modify or refine the requirement during the dialogue; on the other hand, when the stated requirement is not specific or clear enough, the machine can help the user reach a satisfactory result by querying, clarifying, or confirming. For example, after the user asks about today's weather in Beijing and then asks about tomorrow, the dialogue management system can automatically complete the query to the weather in Beijing tomorrow from the context.
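The context completion described above can be sketched as slot inheritance from the previous turn; slot names and the turn representation are illustrative assumptions:

```python
def manage_dialogue(history: list, turn: dict) -> dict:
    """Toy dialogue management: slots missing from the current turn are
    inherited from the most recent turn, so a follow-up like 'what
    about tomorrow' keeps the earlier city.  Slot names are
    illustrative placeholders."""
    completed = dict(turn)
    if history:
        for slot, value in history[-1].items():
            completed.setdefault(slot, value)  # fill only missing slots
    history.append(completed)
    return completed
```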
In step S106, speech synthesis (TTS) is performed by the embedded neural network processor (Neural-network Processing Unit, NPU) according to the reply information, reply audio is generated, and the reply audio is transmitted to a speaker provided in the vehicle so that the speaker plays the reply audio. The NPU is a specially designed processor for accelerating neural network model operations.
The purpose of speech generation is to transform text into natural, fluent human speech audio. The process has three main steps: (1) decompose the text into phonemes to obtain each phoneme's duration and frequency variation; (2) combine characters into words, pause naturally, and determine the pronunciation of polyphonic characters at the phrase level; (3) combine the speaker's speaking habits, pronunciation characteristics, accent, and so on to form speech with distinct human characteristics. For example, once the voice dialogue system has generated reply text for the user's instruction, the speech generation module converts that text into natural, fluent synthesized audio for playback.
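The three steps can be laid out as a pipeline skeleton. Every stage below is a stub returning placeholder data, purely to show the flow, not a real synthesizer:

```python
def synthesize(text: str, speaker_style: dict) -> list:
    """Skeleton of the three-step TTS flow described above; each stage
    is a stub (names and the 80 ms duration are arbitrary)."""
    # (1) decompose the text into phonemes with duration/pitch targets
    phonemes = [{"phoneme": ch, "duration_ms": 80}
                for ch in text if ch.strip()]
    # (2) group phonemes into words/phrases: place natural pauses and
    #     resolve polyphonic readings at the phrase level (stubbed)
    phrases = [phonemes]
    # (3) render audio coloured by the speaker's habits, accent, and
    #     pronunciation characteristics (stubbed as silence, one value
    #     per millisecond of audio)
    total_ms = sum(p["duration_ms"] for ph in phrases for p in ph)
    return [0.0] * total_ms
```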
Referring to Table 1 below, the computing-power test results after optimization of one in-vehicle voice system project are shown. The data in the table represent CPU computing power in DMIPS. Taking the Qualcomm 8155 chip as an example, the total CPU computing power is about 95k DMIPS.
Table 1: Example CPU computing-power test results before and after optimization
It can be seen that the heterogeneous scheme provided by the embodiments of the invention fully exploits the capabilities of the host SoC's other processing units and removes the bottleneck of the voice system's performance being limited by system resources. Compared with existing processing that runs the in-vehicle voice dialogue entirely on the CPU, it reduces the CPU's computational load to a range the CPU can bear. It can therefore better cope with the complex resource environment of the in-vehicle infotainment system and provide a better voice interaction experience under limited hardware conditions.
On the other hand, the embodiment of the invention provides a vehicle-mounted voice dialogue system. Referring to fig. 2, a schematic diagram of an in-vehicle voice dialog system is shown, according to an embodiment of the present invention. The system comprises a central processing unit, an audio digital signal processing chip, a graphic processor and an embedded neural network processor.
The central processor is used for judging whether the user voice input received by the microphone arranged on the vehicle comprises a wake-up word or not, waking up a vehicle-mounted voice dialogue process when the user voice input comprises the wake-up word, carrying out semantic understanding on the voice recognition text transmitted by the graphic processor, recognizing the current semantic of the voice recognition text of the current dialogue, determining the user intention according to the current semantic and dialogue history information, calling an application corresponding to the user intention to obtain reply information aiming at the user intention, and transmitting the reply information to the embedded neural network processor.
The audio digital signal processing chip is configured to perform echo cancellation and noise reduction on user speech input received via the microphone, generate echo cancelled and noise reduced input audio, and transmit the echo cancelled and noise reduced input audio to the graphics processor.
The graphics processor is configured to perform speech recognition on the echo cancelled and denoised input audio, generate speech recognition text, and transmit the speech recognition text to the central processor.
The embedded neural network processor is configured to perform speech synthesis based on the reply information, generate reply audio, and transmit the reply audio to a speaker provided in the vehicle so that the speaker plays the reply audio.
The central processor can also be used for receiving a wake-up indication signal, wherein the wake-up indication signal is from a voice wake-up hard key or a voice wake-up soft key on the vehicle, and waking up the vehicle-mounted voice conversation process in response to receiving the wake-up indication signal.
The audio digital signal processing chip can be further used for generating an echo signal estimated value by adjusting parameters of the adaptive filter and simulating the channel environment of the echo in the user voice input and generating an echo cancelled signal by subtracting the echo signal estimated value from the signal of the user voice input.
As one embodiment of the invention, the audio digital signal processing chip can be further used for determining the direction and the amplitude of an external noise source from the voice input of a user and controlling the silencing loudspeaker to emit sound waves with the opposite direction and the same amplitude as the external noise source.
As one embodiment of the present invention, the central processor, the audio digital signal processing Chip, the graphic processor, and the embedded neural network processor are integrated on the same System on Chip (SoC).
In yet another aspect, embodiments of the present invention provide a vehicle machine including a vehicle-mounted voice dialog system as described in any of the embodiments above.
According to the in-vehicle voice dialogue method, system, and vehicle machine provided by the embodiments of the invention, the speech engine that originally depended entirely on CPU computing power is split into several parts through a heterogeneous design, and the capabilities of the host SoC's other processing units, such as the GPU, NPU, and DSP, are fully exploited. This solves the bottleneck of the voice system's performance being limited by system resources and provides a better voice interaction experience under limited hardware conditions.
The foregoing description of embodiments of the invention has been presented for the purpose of illustration and is not intended to be exhaustive or to limit the invention to the precise form disclosed. It will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from its essential scope. Therefore, it is intended that the invention not be limited to the particular embodiments disclosed, but that the invention include all embodiments falling within the scope of the appended claims.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311421253.1A CN119920246A (en) | 2023-10-30 | 2023-10-30 | In-vehicle voice dialogue method, system and vehicle computer |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311421253.1A CN119920246A (en) | 2023-10-30 | 2023-10-30 | In-vehicle voice dialogue method, system and vehicle computer |
Publications (1)
Publication Number | Publication Date |
---|---|
CN119920246A true CN119920246A (en) | 2025-05-02 |
Family
ID=95509551
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311421253.1A Pending CN119920246A (en) | 2023-10-30 | 2023-10-30 | In-vehicle voice dialogue method, system and vehicle computer |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN119920246A (en) |
- 2023-10-30: CN application CN202311421253.1A filed (publication CN119920246A, status Pending)
Legal Events
Date | Code | Title | Description
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |