
CN119920246A - In-vehicle voice dialogue method, system and vehicle computer - Google Patents

In-vehicle voice dialogue method, system and vehicle computer

Info

Publication number
CN119920246A
Authority
CN
China
Prior art keywords
vehicle
audio
voice
wake
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311421253.1A
Other languages
Chinese (zh)
Inventor
王月
王可
高雪健
赵嵩
刘鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
FAW Volkswagen Automotive Co Ltd
Original Assignee
FAW Volkswagen Automotive Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by FAW Volkswagen Automotive Co Ltd
Priority to CN202311421253.1A
Publication of CN119920246A
Legal status: Pending

Landscapes

  • Fittings On The Vehicle Exterior For Carrying Loads, And Devices For Holding Or Mounting Articles (AREA)

Abstract

The invention provides a vehicle-mounted voice dialogue method, a vehicle-mounted voice dialogue system and a vehicle machine. In the method, a central processing unit judges whether user voice input received by a microphone provided on a vehicle includes a wake-up word and, if it does, wakes up the vehicle-mounted voice dialogue process; an audio digital signal processing chip performs echo cancellation and noise reduction to generate echo-cancelled and noise-reduced input audio; a graphics processor performs voice recognition on the input audio to generate voice recognition text; the central processing unit performs semantic understanding, determines the user intention based on the current semantics and dialogue history information, and obtains reply information for the user intention; and an embedded neural network processor performs voice synthesis based on the reply information to generate reply audio and transmits the reply audio to a speaker provided in the vehicle. The CPU load is reduced, and a better voice interaction experience is provided under limited hardware conditions.

Description

Vehicle-mounted voice dialogue method, system and vehicle machine
Technical Field
Embodiments of the invention relate to the technical field of vehicle-mounted information systems, and in particular to a vehicle-mounted voice dialogue method, a vehicle-mounted voice dialogue system and a vehicle machine.
Background
The hardware performance of the host of the in-vehicle infotainment system (hereinafter the vehicle machine) is limited. Because the system must offer users rich, intelligent services across many scenarios, a large amount of service software often competes for the vehicle machine's hardware resources.
The usual speech-engine solution in an in-vehicle infotainment system relies entirely on the computing power of the central processing unit (CPU). As the system's functions grow richer and more complex, the CPU becomes overloaded. During development, the CPU load stays above 90% for long periods under certain conditions, so the voice dialogue system responds slowly and stutters.
As a result, the hardware resources that the vehicle machine allocates to the intelligent voice dialogue system cannot always meet the speech engine's needs, which degrades the actual performance of the voice system. An optimization scheme is therefore needed to solve the problem of insufficient system resources.
Disclosure of Invention
In order to solve the above problems in the prior art, in a first aspect, an embodiment of the present invention provides a vehicle-mounted voice dialogue method. The method includes: judging, by a central processing unit, whether a wake-up word is included in a user voice input received by a microphone provided on a vehicle, and waking up a vehicle-mounted voice dialogue process in the case that the user voice input includes the wake-up word; performing, by an audio digital signal processing chip, echo cancellation and noise reduction on the user voice input received via the microphone, generating echo-cancelled and noise-reduced input audio, and transmitting that input audio to a graphics processor; performing, by the graphics processor, voice recognition on the echo-cancelled and noise-reduced input audio, generating a voice recognition text, and transmitting the voice recognition text to the central processing unit; performing, by the central processing unit, semantic understanding on the voice recognition text and recognizing the current semantics of the voice recognition text of the current dialogue; determining, by the central processing unit, a user intention based on the current semantics and dialogue history information, invoking an application corresponding to the user intention to obtain reply information for the user intention, and transmitting the reply information to an embedded neural network processor; and performing, by the embedded neural network processor, voice synthesis based on the reply information, generating reply audio, and transmitting the reply audio to a speaker provided in the vehicle so that the speaker plays the reply audio.
In some embodiments, the method further comprises receiving, by the central processor, a wake-up indication signal, wherein the wake-up indication signal is from a voice wake-up hard key or a voice wake-up soft key on the vehicle, and waking, by the central processor, a vehicle-mounted voice conversation process in response to receiving the wake-up indication signal.
In some embodiments, performing echo cancellation and noise reduction on user speech input received via the microphone by an audio digital signal processing chip includes generating, by the audio digital signal processing chip, an echo signal estimate by adjusting parameters of an adaptive filter, simulating a channel environment of echoes in the user speech input, and subtracting, by the audio digital signal processing chip, the echo signal estimate from signals of the user speech input, generating an echo cancelled signal.
In some embodiments, performing echo cancellation and noise reduction by an audio digital signal processing chip on user speech input received via the microphone includes determining, by the audio digital signal processing chip, a direction and magnitude of an external noise source from the user speech input, and controlling, by the audio digital signal processing chip, a sound cancellation speaker to emit sound waves opposite in direction and same magnitude as the external noise source.
In a second aspect, embodiments of the present invention provide an in-vehicle voice dialog system that includes a central processor, an audio digital signal processing chip, a graphics processor, and an embedded neural network processor. The central processor is used for judging whether a user voice input received by a microphone arranged on a vehicle comprises a wake-up word or not, waking up a vehicle-mounted voice conversation process when the user voice input comprises the wake-up word, carrying out semantic understanding on voice recognition texts transmitted by the graphic processor, recognizing current semantics of voice recognition texts of a current conversation, determining user intention according to the current semantics and conversation history information, calling an application corresponding to the user intention to obtain reply information aiming at the user intention, and transmitting the reply information to the embedded neural network processor. The audio digital signal processing chip is configured to perform echo cancellation and noise reduction on user speech input received via the microphone, generate echo cancelled and noise reduced input audio, and transmit the echo cancelled and noise reduced input audio to a graphics processor. The graphics processor is configured to perform speech recognition on the echo cancelled and noise reduced input audio, generate speech recognition text, and transmit the speech recognition text to the central processor. The embedded neural network processor is configured to perform speech synthesis based on the reply information, generate reply audio, and transmit the reply audio to a speaker provided in the vehicle so that the speaker plays the reply audio.
In some embodiments, the central processor is further configured to receive a wake-up indication signal, wherein the wake-up indication signal is from a voice wake-up hard key or a voice wake-up soft key on the vehicle, and wake-up an on-board voice conversation process in response to receiving the wake-up indication signal.
In some embodiments, the audio digital signal processing chip is further configured to simulate a channel environment of an echo in a user speech input by adjusting parameters of an adaptive filter to generate an echo signal estimate, and to subtract the echo signal estimate from the signal of the user speech input to generate an echo cancelled signal.
In some embodiments, the audio digital signal processing chip is further configured to determine a direction and an amplitude of an external noise source from the user voice input, and to control the sound attenuating speaker to emit sound waves having a direction opposite to the external noise source and the same amplitude.
In some embodiments, the central processor, the audio digital signal processing chip, the graphics processor, and the embedded neural network processor are integrated on the same system-on-chip.
In a third aspect, embodiments of the present invention provide a vehicle machine, including the vehicle-mounted voice dialogue system described in any of the above embodiments.
In the existing speech-engine solution, all modules of the speech engine run on the CPU and depend on its performance and computing resources. The heterogeneous scheme provided by the embodiment of the invention splits the speech engine into several parts: the wake-up, dialogue management and semantic understanding modules run on the CPU; the echo cancellation/noise reduction module runs on the Audio DSP chip; the speech recognition module runs on the GPU; and the speech synthesis (TTS) module runs on the NPU. Together these form a CPU-GPU-NPU-DSP heterogeneous solution for the vehicle-mounted speech engine.
At present the GPU is mainly used to render navigation maps; it is lightly used in in-vehicle infotainment systems, so its load is normally low. Current mainstream automotive-grade system chips include GPUs with considerable processing capability. Compared with a CPU, a GPU has many cores and comparatively few control units, which makes it better suited to computation-intensive, data-parallel programs. Deep learning models have both characteristics, and the ASR module is implemented with a deep learning model.
Speech synthesis (TTS) is implemented by a neural network model-based deep learning algorithm and is therefore well suited for NPU chip processing.
To ensure Bluetooth call quality, an Audio DSP chip is often integrated in the in-vehicle infotainment system. It provides dedicated echo cancellation and noise reduction functions, and the voice dialogue function can reuse this capability, reducing the load on the other hardware.
Splitting the modules across these processors, rather than using only a CPU and a GPU, balances the load across the chips as far as possible.
The heterogeneous scheme provided by the embodiment of the invention makes full use of the other processing units of the host SoC and removes the bottleneck in which voice-system performance is limited by system resources. Compared with existing approaches that run the whole vehicle-mounted voice dialogue on the CPU, it reduces the CPU load and brings the computation the CPU must perform down to a level the CPU can sustain. The scheme can therefore cope better with the complex resource environment of an in-vehicle infotainment system and provide a better voice interaction experience under limited hardware conditions.
Drawings
The above, as well as additional purposes, features, and advantages of embodiments of the present invention will become apparent in the following detailed written description and claims upon reference to the accompanying drawings. Several embodiments of the present invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
FIG. 1 shows a flow chart of a method of vehicle-mounted voice conversation in accordance with an embodiment of the present invention;
FIG. 2 shows a schematic diagram of an in-vehicle voice dialogue system according to an embodiment of the invention.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present invention will be described below with reference to several exemplary embodiments. It should be understood that these embodiments are presented merely to enable those skilled in the art to better understand and practice the invention and are not intended to limit the scope of the invention in any way.
In one aspect, embodiments of the present invention provide a vehicle-mounted voice dialogue method. FIG. 1 shows a flow chart of an in-vehicle voice dialogue method 100 in accordance with an embodiment of the present invention. The method 100 includes steps S101 to S106.
In step S101, a central processing unit (Central Processing Unit, CPU) determines whether a user voice input received by a microphone provided on the vehicle includes a wake-up word, for example "hello, XX (car system name)", and, if the user voice input includes the wake-up word, wakes up the vehicle-mounted voice dialogue process. The CPU is typically the core processor responsible for computation and control in a computer system.
As one embodiment of the invention, the dialogue process can also be woken up by a wake-up indication signal: the central processing unit receives a wake-up indication signal from a voice wake-up hard key or a voice wake-up soft key on the vehicle, and wakes up the vehicle-mounted voice dialogue process in response to receiving that signal.
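The two wake-up paths in step S101 (wake-up word and wake-up key) can be sketched as a small state holder. This is only an illustration: the `DialogueSession` class, its method names, and the `WAKE_WORD` value are hypothetical, and matching on already-transcribed text is a simplification, since a production wake-word detector would normally work on audio features.

```python
from dataclasses import dataclass

# Hypothetical wake-up word; the text only gives the pattern "hello, XX".
WAKE_WORD = "hello, xx"


@dataclass
class DialogueSession:
    """Tracks whether the in-vehicle voice dialogue process is awake."""

    awake: bool = False

    def on_transcript(self, text: str) -> bool:
        """Wake the session if the utterance contains the wake-up word."""
        if WAKE_WORD in text.lower():
            self.awake = True
        return self.awake

    def on_wake_key(self) -> bool:
        """Wake the session on a hard-key or soft-key wake-up signal."""
        self.awake = True
        return self.awake


session = DialogueSession()
heard_other = session.on_transcript("turn left ahead")
heard_wake = session.on_transcript("Hello, XX, what's the weather?")

key_session = DialogueSession()
woke_by_key = key_session.on_wake_key()
```

Both paths end in the same awake state, which mirrors the text: the key press is an alternative trigger, not a different dialogue process.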
In step S102, an audio digital signal processing (Digital Signal Processor, DSP) chip performs echo cancellation and noise reduction (Echo Cancellation & Noise Reduction, ECNR) on the user voice input received via the microphone, generates echo-cancelled and noise-reduced input audio, and transmits that input audio to the graphics processor.
Echo arises in duplex mode, when the microphone records the loudspeaker's signal; for example, while the in-car player is playing music, the vehicle voice system may recognize the lyrics.
As an embodiment of the present invention, echo cancellation may be performed by an audio digital signal processing chip generating an echo signal estimate by adjusting parameters of an adaptive filter to simulate a channel environment of an echo in a user's voice input, and by an audio digital signal processing chip subtracting the echo signal estimate from the signal of the user's voice input to generate an echo cancelled signal.
Echo cancellation is typically implemented using an adaptive filter, i.e., a filter with adjustable parameters is designed, the parameters of the filter are adjusted by an adaptive algorithm, the channel environment generated by the echo is simulated, the size of the echo signal is estimated, and then the estimated value is subtracted from the received signal to cancel the echo. For example, when the in-car audio system is playing a song, the sound wave data of the song should be eliminated from the recorded voice command audio data, so as to avoid the influence of the played song on the identification of the correct voice command.
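The adaptive-filter idea described above can be sketched with a normalized LMS (NLMS) update. NLMS is an assumption here: the text does not name the adaptive algorithm, and the function name, tap count, step size, and synthetic echo path below are all made up for illustration.

```python
import random


def nlms_echo_cancel(mic, far_end, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptive estimate of the loudspeaker echo from the mic signal."""
    weights = [0.0] * taps   # adjustable filter parameters
    history = [0.0] * taps   # most recent far-end (loudspeaker) samples
    cleaned = []
    for mic_sample, far_sample in zip(mic, far_end):
        history = [far_sample] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = mic_sample - echo_estimate  # echo-cancelled output sample
        power = sum(h * h for h in history) + eps
        # Adapt the parameters so the filter mimics the echo channel.
        weights = [w + mu * error * h / power for w, h in zip(weights, history)]
        cleaned.append(error)
    return cleaned


# Synthetic demo: noise-like "music" reaches the microphone through a short,
# made-up echo path. There is no near-end speech in this demo.
rng = random.Random(0)
far_end = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
echo_path = [0.6, 0.3, -0.1]
mic = [
    sum(echo_path[k] * far_end[n - k] for k in range(len(echo_path)) if n - k >= 0)
    for n in range(len(far_end))
]

cleaned = nlms_echo_cancel(mic, far_end)
mic_power = sum(s * s for s in mic[-200:]) / 200
residual_power = sum(s * s for s in cleaned[-200:]) / 200
```

After the filter has adapted, `residual_power` is far below `mic_power`, which is the "subtract the estimate from the received signal" step in action. With real speech present, the adaptation would usually be paused or slowed while the user talks.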
As an embodiment of the present invention, noise reduction may be performed as follows: the audio digital signal processing chip determines the direction and amplitude of an external noise source from the user voice input, and controls a sound-cancelling speaker to emit sound waves with the opposite direction and the same amplitude as the external noise source. The two sound waves superpose, and the noise is thereby reduced. For example, in-vehicle voice systems generally need to reduce the impact of tire noise, engine noise, wind noise and the like on voice recognition accuracy.
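The superposition argument can be checked numerically. The sketch below reads "opposite direction, same amplitude" as an inverted-phase copy of the noise, which is an interpretation; the function names, the 100 Hz tone, and the sample rate are illustrative assumptions.

```python
import math


def anti_noise(noise, gain=1.0):
    """Return the cancelling wave: inverted phase, scaled by `gain`."""
    return [-gain * sample for sample in noise]


def superpose(wave_a, wave_b):
    """Add two sound waves sample by sample."""
    return [a + b for a, b in zip(wave_a, wave_b)]


SAMPLE_RATE = 8000  # Hz, illustrative
# One period of a 100 Hz tone standing in for steady engine noise.
noise = [0.8 * math.sin(2 * math.pi * 100 * n / SAMPLE_RATE) for n in range(80)]

# Ideal case: same amplitude, opposite phase.
ideal_peak = max(abs(s) for s in superpose(noise, anti_noise(noise)))

# Imperfect case: the cancelling wave is 10% too quiet.
mismatched_peak = max(abs(s) for s in superpose(noise, anti_noise(noise, gain=0.9)))
```

The ideal case cancels completely, while a 10% amplitude mismatch leaves 10% of the original peak, which is why the text requires the amplitudes to match.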
In step S103, a graphics processor (Graphics Processing Unit, GPU) performs speech recognition (ASR) on the echo-cancelled and noise-reduced input audio, generates speech recognition text, and transmits the speech recognition text to the central processor. A GPU is a processor primarily responsible for image and graphics-related operations, and it can also perform general-purpose computation. Speech recognition is a technique that converts human speech into text: through recognition and understanding, it enables a machine to convert the speech signal into the corresponding text or command. For example, the user speaks an instruction such as "help me" and speech recognition converts the audio instruction correctly into the corresponding text.
In step S104, the CPU performs semantic understanding (Natural Language Understanding, NLU) on the speech recognition text and recognizes the current semantics of the speech recognition text of the current dialogue. Semantic understanding simulates the human language interaction process so that the computer can understand and use natural human languages, such as Chinese and English, and so that people and computers can communicate in natural language. In the vehicle-mounted voice system, the NLU can classify the ASR result text of the user's voice input into sub-domains, such as weather queries or air ticket booking, and recognize the user's specific intention. For example, for the instruction "help me to get to the ocean tomorrow", semantic understanding recognizes that the functional domain is ticket booking, the time is tomorrow, and the destination is the ocean.
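The domain-plus-slots output described for the NLU step can be illustrated with a toy rule-based parser. A real NLU module would normally use a trained model; the keyword table, regular expressions, and the `Semantics` type below are hypothetical and exist only to show the shape of the result.

```python
import re
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical keyword rules mapping phrases to functional domains.
DOMAIN_KEYWORDS = {
    "weather": ("weather", "rain", "temperature"),
    "ticket_booking": ("ticket", "flight", "get to"),
}


@dataclass
class Semantics:
    """Result of semantic understanding: a domain plus extracted slots."""

    domain: str
    slots: Dict[str, str] = field(default_factory=dict)


def understand(text: str) -> Semantics:
    """Classify the ASR text into a domain and pull out time/destination."""
    lowered = text.lower()
    domain = "unknown"
    for name, keywords in DOMAIN_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            domain = name
            break
    slots: Dict[str, str] = {}
    time_match = re.search(r"\b(today|tomorrow)\b", lowered)
    if time_match:
        slots["time"] = time_match.group(1)
    destination_match = re.search(r"\bget to (?:the )?([a-z]+)", lowered)
    if destination_match:
        slots["destination"] = destination_match.group(1)
    return Semantics(domain, slots)


result = understand("Help me to get to the ocean tomorrow")
```

For the example instruction from the text, `result` carries the ticket-booking domain with a time slot and a destination slot, which is the structure the dialogue management step consumes next.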
In step S105, the CPU determines the user intention based on the current semantics and the dialogue history information, that is, it performs dialogue management (Dialogue Management, DM). It then invokes an application corresponding to the user intention to obtain reply information for the user intention, and transmits the reply information to the embedded neural network processor.
Dialogue management decides how to respond to the user in the current turn based on the dialogue history information. Its most common application is the task-driven multi-turn dialogue: the user has an explicit goal, such as ordering or booking, but the request is complex, carries many constraints, and may need to be stated over several turns. On the one hand, the user can keep modifying or refining the request during the dialogue; on the other hand, when the stated request is not specific or clear enough, the machine can help the user reach a satisfactory result by asking, clarifying or confirming. For example, after the user has asked about the weather in Beijing, the dialogue management system can use the context to automatically complete a follow-up weather query as one about Beijing.
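Context completion from dialogue history can be sketched as slot inheritance between turns. The `DialogueManager` class and its `resolve` method are hypothetical, and the follow-up turn ("tomorrow", with no city) is an invented example in the spirit of the Beijing weather case.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DialogueManager:
    """Keeps dialogue history and fills gaps in the current turn from it."""

    history: List[Tuple[str, Dict[str, str]]] = field(default_factory=list)

    def resolve(self, domain: str, slots: Dict[str, str]) -> Dict[str, str]:
        """Complete missing slots from the latest turn in the same domain."""
        merged = dict(slots)
        for past_domain, past_slots in reversed(self.history):
            if past_domain == domain:
                for key, value in past_slots.items():
                    merged.setdefault(key, value)  # current turn wins
                break
        self.history.append((domain, merged))
        return merged


dm = DialogueManager()
first_turn = dm.resolve("weather", {"city": "Beijing", "time": "today"})
# Follow-up turn that names no city; the city is inherited from the context.
follow_up = dm.resolve("weather", {"time": "tomorrow"})
```

Slots stated in the current turn override inherited ones, so the user can still modify an earlier constraint, as the paragraph above describes.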
In step S106, speech synthesis (TTS) is performed by the embedded neural network processor (Neural-network Processing Unit, NPU) according to the reply information, reply audio is generated, and the reply audio is transmitted to a speaker provided in the vehicle so that the speaker plays the reply audio. The NPU is a specially designed processor for accelerating neural network model operations.
The purpose of speech synthesis is to turn text into natural, fluent human speech audio. The process has three main steps: (1) the text is broken down to obtain the duration and frequency variation of each phoneme; (2) characters are combined into words, pauses are placed naturally, and the pronunciation of polyphonic characters is determined from the phrase context; (3) speech with recognizably human characteristics is produced by incorporating the speaker's speaking habits, pronunciation characteristics, accent and the like. For example, once the voice dialogue system has generated a reply text responding to the user's instruction, the speech synthesis module converts that text into natural, fluent synthetic audio for playback.
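Steps S101 to S106 can be sketched as a chain of stages, each tagged with the processor the method assigns it to. The stage functions below are trivial string stand-ins, not real signal processing or models, and all names (`build_stages`, `run_dialogue`) are hypothetical.

```python
from typing import Callable, List, Tuple

# (step label, processor, stage function)
Stage = Tuple[str, str, Callable]


def build_stages() -> List[Stage]:
    """Placeholder stages for S101-S106 with their assigned processors."""
    return [
        ("S101 wake-up", "CPU", lambda audio: audio),
        ("S102 echo cancellation and noise reduction", "DSP",
         lambda audio: audio.replace("+echo", "")),
        ("S103 speech recognition", "GPU", lambda audio: audio.split(":", 1)[1]),
        ("S104 semantic understanding", "CPU", lambda text: {"text": text}),
        ("S105 dialogue management", "CPU", lambda sem: "reply to " + sem["text"]),
        ("S106 speech synthesis", "NPU", lambda reply: "audio:" + reply),
    ]


def run_dialogue(mic_audio: str, stages: List[Stage]):
    """Pass the payload through each stage and record where it ran."""
    payload, trace = mic_audio, []
    for step, processor, stage_fn in stages:
        payload = stage_fn(payload)
        trace.append((step, processor))
    return payload, trace


reply_audio, trace = run_dialogue("audio:hello+echo", build_stages())
processors = [processor for _, processor in trace]
```

The trace makes the hand-offs explicit: the payload leaves the CPU for the DSP and GPU, returns to the CPU for understanding and dialogue management, and finishes on the NPU.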
In the existing speech-engine solution, all modules of the speech engine run on the CPU and depend on its performance and computing resources. The heterogeneous scheme provided by the embodiment of the invention splits the speech engine into several parts: the wake-up, dialogue management and semantic understanding modules run on the CPU; the echo cancellation/noise reduction module runs on the Audio DSP chip; the speech recognition module runs on the GPU; and the speech synthesis (TTS) module runs on the NPU. Together these form a CPU-GPU-NPU-DSP heterogeneous solution for the vehicle-mounted speech engine.
At present the GPU is mainly used to render navigation maps; it is lightly used in in-vehicle infotainment systems, so its load is normally low. Current mainstream automotive-grade system chips include GPUs with considerable processing capability. Compared with a CPU, a GPU has many cores and comparatively few control units, which makes it better suited to computation-intensive, data-parallel programs. Deep learning models have both characteristics, and the ASR module is implemented with a deep learning model.
Speech synthesis (TTS) is implemented by a neural network model-based deep learning algorithm and is therefore well suited for NPU chip processing.
To ensure Bluetooth call quality, an Audio DSP chip is often integrated in the in-vehicle infotainment system. It provides dedicated echo cancellation and noise reduction functions, and the voice dialogue function can reuse this capability, reducing the load on the other hardware.
Splitting the modules across these processors, rather than using only a CPU and a GPU, balances the load across the chips as far as possible.
Table 1 below shows example CPU computing-power test results before and after optimization for one vehicle-mounted voice system project. The data in the table are CPU computing power in DMIPS. Taking the Qualcomm Snapdragon 8155 chip as an example, the total CPU computing power is about 95k DMIPS.
TABLE 1 Example CPU computing-power test results before and after optimization
It can be seen that the heterogeneous scheme provided by the embodiment of the invention makes full use of the other processing units of the host SoC and removes the bottleneck in which voice-system performance is limited by system resources. Compared with existing approaches that run the whole vehicle-mounted voice dialogue on the CPU, it reduces the CPU load and brings the computation the CPU must perform down to a level the CPU can sustain. The scheme can therefore cope better with the complex resource environment of an in-vehicle infotainment system and provide a better voice interaction experience under limited hardware conditions.
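As a rough illustration of scale, the roughly 95k DMIPS total can be combined with the "above 90%" load figure from the Background. Applying that load figure to this particular chip is an assumption made here for the arithmetic only; the source does not state on which hardware the 90% load was observed, and the constant names are hypothetical.

```python
TOTAL_CPU_DMIPS = 95_000  # approximate total quoted for the example chip
LOAD_PERCENT = 90         # sustained load mentioned in the Background (assumed to apply)

# Integer arithmetic avoids floating-point rounding in the illustration.
used_dmips = TOTAL_CPU_DMIPS * LOAD_PERCENT // 100
headroom_dmips = TOTAL_CPU_DMIPS - used_dmips
```

Under that assumption about 85.5k DMIPS is already consumed and only about 9.5k DMIPS remains, which shows why moving speech modules off the CPU matters.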
In another aspect, an embodiment of the invention provides a vehicle-mounted voice dialogue system. FIG. 2 shows a schematic diagram of an in-vehicle voice dialogue system according to an embodiment of the present invention. The system comprises a central processing unit, an audio digital signal processing chip, a graphics processor and an embedded neural network processor.
The central processor is used for judging whether the user voice input received by the microphone arranged on the vehicle comprises a wake-up word or not, waking up a vehicle-mounted voice dialogue process when the user voice input comprises the wake-up word, carrying out semantic understanding on the voice recognition text transmitted by the graphic processor, recognizing the current semantic of the voice recognition text of the current dialogue, determining the user intention according to the current semantic and dialogue history information, calling an application corresponding to the user intention to obtain reply information aiming at the user intention, and transmitting the reply information to the embedded neural network processor.
The audio digital signal processing chip is configured to perform echo cancellation and noise reduction on user speech input received via the microphone, generate echo cancelled and noise reduced input audio, and transmit the echo cancelled and noise reduced input audio to the graphics processor.
The graphics processor is configured to perform speech recognition on the echo cancelled and denoised input audio, generate speech recognition text, and transmit the speech recognition text to the central processor.
The embedded neural network processor is configured to perform speech synthesis based on the reply information, generate reply audio, and transmit the reply audio to a speaker provided in the vehicle so that the speaker plays the reply audio.
The central processor can also be used for receiving a wake-up indication signal, wherein the wake-up indication signal is from a voice wake-up hard key or a voice wake-up soft key on the vehicle, and waking up the vehicle-mounted voice conversation process in response to receiving the wake-up indication signal.
The audio digital signal processing chip can be further used for generating an echo signal estimated value by adjusting parameters of the adaptive filter and simulating the channel environment of the echo in the user voice input and generating an echo cancelled signal by subtracting the echo signal estimated value from the signal of the user voice input.
As one embodiment of the invention, the audio digital signal processing chip can be further used for determining the direction and the amplitude of an external noise source from the voice input of a user and controlling the silencing loudspeaker to emit sound waves with the opposite direction and the same amplitude as the external noise source.
As one embodiment of the present invention, the central processor, the audio digital signal processing Chip, the graphic processor, and the embedded neural network processor are integrated on the same System on Chip (SoC).
In yet another aspect, embodiments of the present invention provide a vehicle machine including a vehicle-mounted voice dialog system as described in any of the embodiments above.
According to the vehicle-mounted voice dialogue method, system and vehicle machine provided by the embodiments of the invention, the speech engine, which originally depended entirely on CPU computing power, is split into several parts through a heterogeneous design, and the other processing units of the host SoC, such as the GPU, NPU and DSP, are fully used. This removes the bottleneck in which voice-system performance is limited by system resources and provides a better voice interaction experience under limited hardware conditions.
The foregoing description of embodiments of the invention has been presented for the purpose of illustration and is not intended to be exhaustive or to limit the invention to the precise form disclosed. It will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from the essential scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed as the best mode contemplated for carrying out this invention, but that the invention will include all embodiments falling within the scope of the appended claims.

Claims (10)

1. A vehicle-mounted voice dialogue method, characterized in that the method comprises:
determining, by a central processing unit, whether a user voice input received by a microphone disposed on a vehicle includes a wake-up word, and waking up a vehicle-mounted voice dialogue process if the user voice input includes the wake-up word;
performing, by an audio digital signal processing chip, echo cancellation and noise reduction on the user voice input received via the microphone, generating echo-cancelled and noise-reduced input audio, and transmitting the echo-cancelled and noise-reduced input audio to a graphics processor;
performing, by the graphics processor, speech recognition on the echo-cancelled and noise-reduced input audio, generating speech recognition text, and transmitting the speech recognition text to the central processing unit;
performing, by the central processing unit, semantic understanding on the speech recognition text to identify the current semantics of the speech recognition text of the current dialogue;
determining, by the central processing unit, a user intention according to the current semantics and dialogue history information, calling an application corresponding to the user intention to obtain reply information for the user intention, and transmitting the reply information to an embedded neural network processor;
performing, by the embedded neural network processor, speech synthesis according to the reply information, generating reply audio, and transmitting the reply audio to a speaker disposed in the vehicle so that the speaker plays the reply audio.
2. The method according to claim 1, characterized in that the method further comprises:
receiving, by the central processing unit, a wake-up indication signal, wherein the wake-up indication signal comes from a voice wake-up hard key or a voice wake-up soft key on the vehicle computer;
in response to receiving the wake-up indication signal, waking up, by the central processing unit, the vehicle-mounted voice dialogue process.
3. The method according to claim 1, characterized in that performing, by the audio digital signal processing chip, echo cancellation and noise reduction on the user voice input received via the microphone comprises:
simulating, by the audio digital signal processing chip, the channel environment of the echo in the user voice input by adjusting the parameters of an adaptive filter, to generate an echo signal estimate;
subtracting, by the audio digital signal processing chip, the echo signal estimate from the signal of the user voice input, to generate an echo-cancelled signal.
4. The method according to claim 1, characterized in that performing, by the audio digital signal processing chip, echo cancellation and noise reduction on the user voice input received via the microphone comprises:
determining, by the audio digital signal processing chip, the direction and amplitude of an external noise source from the user voice input;
controlling, by the audio digital signal processing chip, a noise-cancelling speaker to emit a sound wave that is opposite in direction to, and equal in amplitude to, that of the external noise source.
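The adaptive-filter echo cancellation recited in claim 3 can be illustrated with a minimal sketch. The claims do not name a particular adaptation rule. The normalized least-mean-squares (NLMS) update used here is an assumption chosen for illustration, as are the function names, the tap count, the step size and the simulated echo path. The sketch also assumes the speaker playback signal is available as a reference and omits near-end speech, so double-talk handling is not covered.

```python
import random


def nlms_echo_cancel(mic, reference, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of `reference` from `mic`.

    NLMS is an illustrative choice; the claims only require an adaptive
    filter whose parameters are adjusted to model the echo channel.
    """
    weights = [0.0] * taps   # adaptive filter parameters (echo path model)
    history = [0.0] * taps   # most recent reference samples, newest first
    cleaned = []
    for mic_sample, ref_sample in zip(mic, reference):
        history = [ref_sample] + history[:-1]
        # Echo signal estimate: reference passed through the current model.
        echo_estimate = sum(w * x for w, x in zip(weights, history))
        # Echo-cancelled sample: microphone signal minus the estimate.
        error = mic_sample - echo_estimate
        # Normalized step: adjust the parameters to reduce the residual.
        power = sum(x * x for x in history) + eps
        step = mu * error / power
        weights = [w + step * x for w, x in zip(weights, history)]
        cleaned.append(error)
    return cleaned, weights


def simulate(n=4000, seed=0):
    """Build a toy microphone signal that contains only echo."""
    rng = random.Random(seed)
    reference = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    echo_path = [0.6, 0.3, -0.2]  # hypothetical cabin echo path
    mic = [
        sum(h * reference[i - k] for k, h in enumerate(echo_path) if i - k >= 0)
        for i in range(n)
    ]
    return mic, reference, echo_path
```

Calling `nlms_echo_cancel(*simulate()[:2])` returns the residual signal and the learned filter weights. In this noise-free toy setup the leading weights approach the simulated echo path, which is the sense in which the filter "simulates the channel environment" of the echo.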
5. A vehicle-mounted voice dialogue system, characterized in that the system comprises a central processing unit, an audio digital signal processing chip, a graphics processor and an embedded neural network processor, wherein:
the central processing unit is configured to: determine whether a user voice input received by a microphone disposed on a vehicle includes a wake-up word, and wake up a vehicle-mounted voice dialogue process if the user voice input includes the wake-up word; perform semantic understanding on the speech recognition text transmitted by the graphics processor to identify the current semantics of the speech recognition text of the current dialogue; determine a user intention according to the current semantics and dialogue history information, call an application corresponding to the user intention to obtain reply information for the user intention, and transmit the reply information to the embedded neural network processor;
the audio digital signal processing chip is configured to: perform echo cancellation and noise reduction on the user voice input received via the microphone, generate echo-cancelled and noise-reduced input audio, and transmit the echo-cancelled and noise-reduced input audio to the graphics processor;
the graphics processor is configured to: perform speech recognition on the echo-cancelled and noise-reduced input audio, generate speech recognition text, and transmit the speech recognition text to the central processing unit;
the embedded neural network processor is configured to: perform speech synthesis according to the reply information, generate reply audio, and transmit the reply audio to a speaker disposed in the vehicle so that the speaker plays the reply audio.
6. The system according to claim 5, characterized in that the central processing unit is further configured to:
receive a wake-up indication signal, wherein the wake-up indication signal comes from a voice wake-up hard key or a voice wake-up soft key on the vehicle computer;
in response to receiving the wake-up indication signal, wake up the vehicle-mounted voice dialogue process.
7. The system according to claim 5, characterized in that the audio digital signal processing chip is further configured to:
simulate the channel environment of the echo in the user voice input by adjusting the parameters of an adaptive filter, to generate an echo signal estimate;
subtract the echo signal estimate from the signal of the user voice input, to generate an echo-cancelled signal.
8. The system according to claim 5, characterized in that the audio digital signal processing chip is further configured to:
determine the direction and amplitude of an external noise source from the user voice input;
control a noise-cancelling speaker to emit a sound wave that is opposite in direction to, and equal in amplitude to, that of the external noise source.
9. The system according to claim 5, characterized in that the central processing unit, the audio digital signal processing chip, the graphics processor and the embedded neural network processor are integrated on the same system-on-chip.
10. A vehicle computer, characterized by comprising the vehicle-mounted voice dialogue system according to any one of claims 5-9.
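The division of work among the four processing units in claims 1 and 5 can be sketched as a chain of stage functions. Everything below is a hypothetical stand-in: utterances are modelled as plain strings rather than audio, the wake-up word, the application registry and every function name are invented for illustration, and each stage body is a placeholder for processing that would run on the named unit. The sketch only shows the order of the stages and which unit owns each one.

```python
from dataclasses import dataclass, field

WAKE_WORD = "hello car"  # hypothetical wake-up word


@dataclass
class Session:
    awake: bool = False
    history: list = field(default_factory=list)  # past (intent, reply) pairs
    trace: list = field(default_factory=list)    # (unit, stage) log


def cpu_check_wake_word(utterance, session):
    session.trace.append(("CPU", "wake-word check"))
    if WAKE_WORD in utterance.lower():
        session.awake = True
    return session.awake


def dsp_clean(utterance, session):
    session.trace.append(("DSP", "echo cancellation + noise reduction"))
    return utterance.strip()  # placeholder for real signal processing


def gpu_recognize(audio, session):
    session.trace.append(("GPU", "speech recognition"))
    return audio.lower().replace(WAKE_WORD, "").strip(" ,")


def cpu_understand(text, session):
    session.trace.append(("CPU", "semantic understanding"))
    return {"topic": "weather" if "weather" in text else "unknown", "text": text}


# Hypothetical application registry keyed by intent.
APPS = {"weather": lambda semantics: "It is sunny today."}


def cpu_resolve_intent(semantics, session):
    session.trace.append(("CPU", "intent + application call"))
    intent = semantics["topic"]
    if intent == "unknown" and session.history:
        intent = session.history[-1][0]  # fall back on dialogue history
    app = APPS.get(intent, lambda s: "Sorry, I did not catch that.")
    reply = app(semantics)
    session.history.append((intent, reply))
    return reply


def npu_synthesize(reply, session):
    session.trace.append(("NPU", "speech synthesis"))
    return reply.encode("utf-8")  # placeholder for synthesized audio


def run_turn(utterance, session):
    """Run one dialogue turn; return reply 'audio' or None if not awake."""
    if not session.awake and not cpu_check_wake_word(utterance, session):
        return None
    audio = dsp_clean(utterance, session)
    text = gpu_recognize(audio, session)
    semantics = cpu_understand(text, session)
    reply = cpu_resolve_intent(semantics, session)
    return npu_synthesize(reply, session)
```

A turn without the wake-up word returns `None` after the CPU check alone. Once the session is awake, `session.trace` records the CPU, DSP, GPU, CPU, CPU, NPU order described in the claims, and a follow-up utterance with no recognizable topic reuses the previous intent from the dialogue history.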
CN202311421253.1A 2023-10-30 2023-10-30 In-vehicle voice dialogue method, system and vehicle computer Pending CN119920246A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311421253.1A CN119920246A (en) 2023-10-30 2023-10-30 In-vehicle voice dialogue method, system and vehicle computer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311421253.1A CN119920246A (en) 2023-10-30 2023-10-30 In-vehicle voice dialogue method, system and vehicle computer

Publications (1)

Publication Number Publication Date
CN119920246A true CN119920246A (en) 2025-05-02

Family

ID=95509551

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311421253.1A Pending CN119920246A (en) 2023-10-30 2023-10-30 In-vehicle voice dialogue method, system and vehicle computer

Country Status (1)

Country Link
CN (1) CN119920246A (en)

Similar Documents

Publication Publication Date Title
US11875775B2 (en) Voice conversion system and training method therefor
JP6852006B2 (en) Voice-enabled system with domain disambiguation
US7490042B2 (en) Methods and apparatus for adapting output speech in accordance with context of communication
WO2020171868A1 (en) End-to-end speech conversion
KR20200023456A (en) Speech sorter
US20100178956A1 (en) Method and apparatus for mobile voice recognition training
CN114365216A (en) Target speech separation for speech recognition by speaker
EP3616085A1 (en) Processing natural language using machine learning to determine slot values based on slot descriptors
CN111883135A (en) Voice transcription method and device and electronic equipment
US11308946B2 (en) Methods and apparatus for ASR with embedded noise reduction
JP4943335B2 (en) Robust speech recognition system independent of speakers
CN114385800A (en) Voice dialogue method and device
USH2187H1 (en) System and method for gender identification in a speech application environment
CN111833875A (en) Embedded voice interaction system
JP2022111977A (en) Voice recognition system and method
CN117219069A (en) Man-machine interaction method and device, computer readable storage medium and terminal equipment
US9218807B2 (en) Calibration of a speech recognition engine using validated text
JP5235187B2 (en) Speech recognition apparatus, speech recognition method, and speech recognition program
CN118613864A (en) Adaptation and training of neural speech synthesis
WO2025148929A1 (en) "say what you see" implementation method and apparatus, and vehicle
CN111326159A (en) Voice recognition method, device and system
CN119920246A (en) In-vehicle voice dialogue method, system and vehicle computer
CN118197295A (en) In-vehicle voice privacy protection method, system, equipment and storage medium
JP2000172291A (en) Speech recognition device
CN115641874B (en) Audio processing method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination