[go: up one dir, main page]

CN113409805B - Man-machine interaction method and device, storage medium and terminal equipment

Info

Publication number
CN113409805B
CN113409805B (application CN202011202667.1A)
Authority
CN
China
Prior art keywords
voice
component
voice interaction
signal
service
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011202667.1A
Other languages
Chinese (zh)
Other versions
CN113409805A (en)
Inventor
胡孝波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202011202667.1A
Publication of CN113409805A
Application granted
Publication of CN113409805B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01S RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S 5/00 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S 5/18 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • G01S 5/22 Position of source determined by co-ordinating a plurality of position lines defined by path-difference measurements
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/14 Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • G06F 17/141 Discrete Fourier transforms
    • G06F 17/142 Fast Fourier transforms, e.g. using a Cooley-Tukey type algorithm
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G10L 2021/02082 Noise filtering the noise being echo, reverberation of the speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G10L 21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L 2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L 2021/02166 Microphone arrays; Beamforming

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Quality & Reliability (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application discloses a man-machine interaction method and device, a storage medium, and a terminal device, and belongs to the technical field of artificial intelligence. The method is applied to a terminal device that integrates a voice interaction component, N business components, and a custom acoustic model provided by an access party. The voice interaction component encapsulates the SDKs related to voice interaction, and the N business components are selected by the access party from a set of business components provided by the developer according to the access party's product requirements, each business component providing at least one service for the terminal device. The method includes: receiving, through the voice interaction component, audio data collected by the custom acoustic model; sending the audio data to a server through the voice interaction component, the audio data instructing the server to perform audio processing and generate response data; and delivering the response data returned by the server to a first business component through the voice interaction component. The application makes flexible and simple intelligent voice interaction possible for the access party.

Description

Man-machine interaction method and device, storage medium and terminal equipment
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a human-computer interaction method, a device, a storage medium, and a terminal device.
Background
Man-machine interaction refers to the process of information exchange between a person and a computer, in which a certain dialogue language and interaction mode are used to complete a determined task. With the rapid development of internet technology and the internet of things, intelligent voice interaction has become one of the mainstream man-machine interaction modes.
Intelligent voice interaction involves artificial intelligence techniques. Artificial intelligence (AI) is the theory, method, technique, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. With the support of AI technology, users can interact with devices directly based on voice technology.
In the context of intelligent voice interaction, flexibility and portability of the solution have long been goals pursued by those skilled in the art. That is, how to implement intelligent voice interaction flexibly is a problem to be solved by those skilled in the art.
Disclosure of Invention
The embodiment of the application provides a man-machine interaction method and device, a storage medium, and a terminal device, which make flexible and convenient intelligent voice interaction possible. The technical scheme is as follows:
In one aspect, a human-computer interaction method is provided and applied to a terminal device, wherein the terminal device integrates a voice interaction component, N business components, and a custom acoustic model provided by an access party; the voice interaction component encapsulates a software development kit (SDK) related to voice interaction; the N business components are selected by the access party from a business component set provided by a developer according to the product requirements of the access party; one business component is used for providing at least one service for the terminal device, and N is a positive integer;
The method comprises the following steps:
Receiving audio data collected by the custom acoustic model through the voice interaction component, wherein the audio data is input by a user voice;
Transmitting, by the voice interaction component, the audio data to a server, the audio data being for instructing the server to perform audio processing and generating response data matching the audio data;
The response data returned by the server is delivered to a first business component through the voice interaction component; in response to the user voice input being a task-type question, the response data is used to trigger the first business component to perform the target operation indicated by the user voice input.
On the other hand, a man-machine interaction apparatus is provided and applied to a terminal device, wherein the terminal device integrates a voice interaction component, N business components, and a custom acoustic model provided by an access party; the voice interaction component encapsulates a software development kit (SDK) related to voice interaction; the N business components are selected by the access party from a business component set provided by a developer according to the product requirements of the access party; one business component is used for providing at least one service for the terminal device, and N is a positive integer;
the custom acoustic model is configured to collect audio data, wherein the audio data is input by user voice;
The voice interaction component is configured to receive audio data collected by the custom acoustic model;
the voice interaction component is further configured to send the audio data to a server, wherein the audio data is used for instructing the server to execute audio processing and generate response data matched with the audio data;
The voice interaction component is further configured to deliver the response data returned by the server to a first business component; in response to the user voice input being a task-type question, the response data is used to trigger the first business component to perform the target operation indicated by the user voice input.
In one possible implementation, the SDK includes: a speech recognition SDK, a speech synthesis SDK, and a text recognition SDK;
The path of the custom acoustic model is arranged under the voice interaction component, and the voice interaction component provides an audio data receiving interface for waking up the terminal device.
In one possible implementation, the audio data is used to instruct the server to perform the following audio processing:
Performing semantic analysis on the audio data, and acquiring semantic skill data of the audio data based on a semantic analysis result, wherein the semantic skill data comprises: question intent, knowledge domain to which the question belongs, question text and the response data.
In one possible implementation, the first business component is configured to display the response data in a non-voice form in response to the user voice input not being a task-type question;
The voice interaction component is further configured to respond to the fact that the user voice input is not a task question, and play the response data in a voice form;
The voice interaction component is further configured to display the response data in a non-voice form in response to the user voice input not being a task question and the terminal device not integrating the first business component.
In one possible implementation, a long connection is established between the voice interaction component and the server;
the voice interaction component is further configured to receive a push message issued by the server based on the long connection; and notify a second business component to receive the push message in a directional broadcast manner, wherein the second business component has registered a callback function or a broadcast receiver with the voice interaction component in advance.
In a possible implementation manner, the voice interaction component is configured to notify the first business component to receive the response data in a directional broadcast manner, and the first business component has registered a callback function or a broadcast receiver with the voice interaction component in advance.
In one possible implementation, the apparatus further includes:
The sound source positioning module is configured to acquire a first voice signal acquired by the first microphone, wherein the first voice signal comprises a first sound source signal and a first noise signal; acquiring a second voice signal acquired by a second microphone, wherein the second voice signal comprises a second sound source signal and a second noise signal; acquiring cross power spectrums of the first voice signal and the second voice signal on a frequency domain; transforming the cross-power spectrum from a frequency domain to a time domain to obtain a cross-correlation function; determining a time value corresponding to the maximum cross-correlation value as a propagation delay, wherein the propagation delay is an arrival time difference of a voice signal between the first microphone and the second microphone; and performing sound source positioning based on the propagation delay, wherein the first microphone and the second microphone come from a microphone array of the terminal equipment.
In one possible implementation, the apparatus further includes:
An echo cancellation module configured to perform echo cancellation processing on a voice signal received by the microphone array based on the first filter; wherein the filter function of the first filter is infinitely close to the impulse response of the loudspeaker to the microphone array; the voice signal received by the microphone array is determined according to a sound source signal, a noise signal, a voice signal played by the loudspeaker and the impulse response.
In one possible implementation, the apparatus further includes:
The reverberation cancellation module is configured to transform the voice signal received by the microphone array from the time domain to the frequency domain to obtain a frequency domain signal; and perform inverse filtering processing on the frequency domain signal based on a second filter to recover the sound source signal; wherein the voice signal received by the microphone array is determined according to the sound source signal, the noise signal, and the room impulse response of the sound source.
In another aspect, a terminal device is provided, the device including a processor and a memory, the memory storing at least one program code, the at least one program code being loaded and executed by the processor to implement the above-described human-machine interaction method.
In another aspect, a computer readable storage medium having stored therein at least one program code loaded and executed by a processor to implement the above-described human-machine interaction method is provided.
In another aspect, a computer program product or a computer program is provided, the computer program product or computer program comprising computer program code stored in a computer readable storage medium, the computer program code being read from the computer readable storage medium by a processor of a terminal device, the computer program code being executed by the processor, causing the terminal device to perform the above-described human-machine interaction method.
The technical scheme provided by the embodiment of the application has the beneficial effects that:
For intelligent voice interaction, the developer implements the voice interaction component and various business components in advance, so that an access party can conveniently access any free combination of business components according to its own product requirements and thus form its own terminal device; that is, the access party can freely combine business components into its own product scheme, and can freely choose to access or not access each of the business components provided by the developer. In other words, the access party can define the functions of its terminal device according to its own product requirements.
This modular, voice-interaction-based solution can be conveniently applied to various terminal devices, such as IoT (Internet of Things) devices and screen-free devices; access is simple, devices are highly customizable, the access cycle of an access party can be shortened as much as possible, development cost is saved, and flexibility is strong. In summary, the embodiment of the application makes flexible and simple intelligent voice interaction possible for the access party.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a voice interaction system according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an implementation environment related to a man-machine interaction method according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a target frame platform according to an embodiment of the present application;
FIG. 4 is a flowchart of a man-machine interaction method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of the basic architecture of an acoustic front-end acquisition system provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of an acoustic signal and microphone array according to an embodiment of the present application;
FIG. 7 is a schematic diagram of echo cancellation according to an embodiment of the present application;
FIG. 8 is an interaction schematic diagram of a terminal device and a background server according to an embodiment of the present application;
FIG. 9 is a schematic diagram of message pushing performed by a background server according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a man-machine interaction device according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
It is to be understood that the terms "first," "second," and the like, as used herein, may be used to describe various concepts, but these concepts are not limited by the terms unless otherwise specified; the terms are only used to distinguish one concept from another. Herein, "at least one" means one or more; for example, at least one user may be any integer number of users greater than or equal to one, such as one user, two users, or three users. "A plurality" means two or more; for example, a plurality of users may be any integer number of users greater than or equal to two, such as two users or three users.
The embodiment of the application provides a man-machine interaction method and device, a storage medium, and an electronic device. The method relates to the field of artificial intelligence (AI) and the field of cloud technology.
AI is the theory, method, technology, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
In detail, artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning, and other directions.
The key technologies of speech technology (Speech Technology, ST) are automatic speech recognition (Automatic Speech Recognition, ASR), speech synthesis (Text To Speech, TTS), and voiceprint recognition. Enabling a computer to listen, see, speak, and feel is the development direction of human-computer interaction in the future, and voice is expected to become one of the best human-computer interaction modes in the future.
Natural language processing (Natural Language Processing, NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between a person and a computer in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, that is, the language people use daily, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robot question answering, knowledge graph techniques, and the like.
In addition, the method relates to the field of cloud technology. Cloud technology is a hosting technology that unifies a series of resources such as hardware, software, and networks in a wide area network or a local area network to realize the computation, storage, processing, and sharing of data. Cloud technology can also be understood as a general term for network technology, information technology, integration technology, management platform technology, application technology, and the like based on the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support. Background services of technical network systems require a large amount of computing and storage resources, such as video websites, picture websites, and more portal websites. With the development and application of the internet industry, each article may have its own identification mark in the future, which needs to be transmitted to a background system for logical processing; data of different levels will be processed separately, and all kinds of industry data need strong system backing support, which can only be realized through cloud computing.
Illustratively, the method relates to the artificial intelligence cloud service in the field of cloud technology. The artificial intelligence cloud service is also generally called AIaaS (AI as a Service). This is a currently mainstream service mode of artificial intelligence platforms; in detail, an AIaaS platform splits several common AI services and provides independent or packaged services in the cloud. This service mode is similar to an AI-themed mall: all developers can access one or more artificial intelligence services provided by the platform through an API interface, and some experienced developers can also use the AI framework and AI infrastructure provided by the platform to deploy, operate, and maintain their own dedicated cloud artificial intelligence services.
Some terms or abbreviations involved in the embodiments of the present application will be described first.
Voice interaction: the method refers to a new generation interaction mode based on user voice input, and a user can obtain feedback results given by a machine through speaking. A typical application scenario is a voice assistant.
Componentization: also referred to as modularization, it refers to isolating or splitting code belonging to the same function/service into independent modules located at the service layer. Illustratively, the components include, but are not limited to: desktop components, setting components, notification bar components, music components, video call components, video components, account components, and components formed by various business applications.
Fig. 1 is a schematic structural diagram of a voice interaction system according to an embodiment of the present application.
Referring to fig. 1, a complete voice interaction system comprises: hardware 101, software systems 102, acoustic models (also known as front-end acoustic acquisition models or front-end acoustic acquisition systems) 103, voice AI assistants 104, and upper layer business Applications (APPs) 105.
The hardware 101 may be terminal equipment hardware of an access party, and the software system 102 may be an operating system of the terminal equipment.
The first point to be noted is that the acoustic model 103 may be customized, and may be an acoustic model provided by a developer or an acoustic model customized by an access party. In other words, the embodiment of the application supports the access party to use the customized acoustic model besides providing the acoustic model for the access party, namely, provides the access mode of the customized acoustic model. That is, since the acoustic models required by different access parties may be different according to different hardware, the present solution provides the capability of accessing the acoustic model customized by the access party to the target frame platform, and realizes that the audio acquired by the acoustic model 103 is transferred to the target frame platform for subsequent audio processing.
A second point to be noted is that the above-described target framework platform, also referred to herein as a voice interaction component, provides the voice interaction function and is the core for implementing the voice-interaction-based terminal componentization solution. Illustratively, the voice AI assistant 104 is packaged in the above target framework platform; the voice interaction development kits include, but are not limited to: a speech recognition SDK (Software Development Kit), a speech synthesis SDK, a recording recognition SDK, a text recognition SDK, and the like.
Wherein the upper various business applications 105 may be referred to herein as a business component or a functional module. The functional implementation of these business components or functional modules also relies on the target framework platform described above.
Fig. 2 is a schematic diagram of an implementation environment related to a man-machine interaction method according to an embodiment of the present application.
Illustratively, referring to FIG. 2, the implementation environment includes: a user 201, a terminal device 202 and a server 203. Wherein the target framework platform 2021 is integrated in the terminal device 202.
In one possible implementation, after the acoustic model on the terminal device collects the audio data, the audio data is transmitted to the target frame platform 2021, and the target frame platform 2021 is responsible for transmitting the audio data to the server 203 through the gateway interface for audio processing; further, after the processing is completed, the server 203 returns a processing result to the target frame platform 2021; the target framework platform 2021, after receiving the processing result returned by the server 203, further distributes the processing result to the service component or the functional module for data processing.
In one possible implementation, the server 203 may be a stand-alone physical server, or may be a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content distribution network (Content Delivery Network, CDN), and basic cloud computing services such as big data and an artificial intelligence platform. Terminal device 202 may be, but is not limited to, a smart phone, tablet, notebook, desktop, smart box, smart watch, etc. The terminal device 202 and the server 203 may be directly or indirectly connected through wired or wireless communication, and the present application is not limited herein.
The following describes possible application scenarios of the man-machine interaction method provided by the embodiment of the application.
In a broad sense, the scheme can be applied to any product that needs to interact based on voice AI, such as access control machines, smart homes, household appliances, and the like. That is, these products can realize customizable and quick access through the scheme; in other words, the embodiment of the application proposes the concept of modular access.
With this scheme, after the target framework platform and the various business components have been implemented, any free combination of business components can be conveniently accessed according to the access party's own product requirements; that is, the access party can freely combine business components to form its own product solution. Any access party is required to access the target framework platform, while common components such as the desktop component, the setting component, the account component, and the video call component are accessed or not accessed according to the access party's own product requirements. This kind of access is called combined access. The target framework platform, also referred to herein as a voice interaction component, provides the voice interaction function and is the core for implementing the voice-interaction-based terminal componentization solution. Illustratively, the voice AI assistant is packaged in the target framework platform, that is, the software development kits related to voice interaction are packaged in the target framework platform; the software development kits include, but are not limited to: a speech recognition SDK, a speech synthesis SDK, a recording recognition SDK, a text recognition SDK, and the like.
Combined access means that each access party, when implementing its own terminal device, can freely select which business components to access according to its own product requirements. This self-defined, self-combined access mode is called combined access; a minimal configuration sketch is given below.
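The following is a minimal, purely illustrative Python sketch of what combined access amounts to in configuration terms; the component names and the build_product function are assumptions made for illustration and are not the actual interface of the target framework platform.

    # Combined access: the voice interaction component is always carried, while
    # business components are freely selected from the developer-provided set.
    DEVELOPER_COMPONENT_SET = {
        "desktop", "settings", "notification_bar", "account",
        "video_call", "video", "music", "iot",
    }

    def build_product(selected: set[str]) -> list[str]:
        unknown = selected - DEVELOPER_COMPONENT_SET
        if unknown:
            raise ValueError(f"components not provided by the developer: {unknown}")
        # The target framework platform (voice interaction component) is mandatory.
        return ["voice_interaction_component", *sorted(selected)]

    # For example, a screen-free smart speaker might only access music and IOT control:
    print(build_product({"music", "iot"}))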
Fig. 3 is a schematic structural diagram of a target frame platform according to an embodiment of the present application.
In the embodiment of the application, after the terminal equipment is awakened through the front-end acoustic model, the target frame platform can be adopted to carry out audio processing on the audio data acquired by the front-end acoustic model.
With this scheme, the various functional packages of the background are interfaced through the target framework platform, so that the access party does not need to care about the specific implementation details of voice interaction, and the target framework platform transparently passes the functions through. That is, the target framework platform is the intermediary for interaction between the background part and the terminal device. The portion of fig. 3 above the target framework platform belongs to the background part. As shown in fig. 3, the left portion 301 above the target framework platform is the portion strongly related to speech and semantics, including but not limited to: the skill configuration service, ASR (Automatic Speech Recognition), NLP, and TTS (Text To Speech). The right portion 302 above the target framework platform provides various skill services. As shown in fig. 3, the skill services provided in this portion include, but are not limited to: account service, PUSH service, weather service, alarm clock service, operation service, and the like, which are not particularly limited in the embodiment of the present application.
Wherein the skill configuration service is used to configure skills strongly related to speech and semantics. After the voice conversion, the target framework platform can acquire various service messages through various provided skill services, such as weather messages through weather services, and then send the received weather messages to a weather component accessed by the terminal device.
Referring to fig. 3, in addition to providing a target framework platform for an access party, the scheme provides various self-contained service components 303, so that each access party can freely select the service components according to own product requirements and access the service components to own terminal equipment. Illustratively, the business components provided by the present solution include, but are not limited to: desktop components, setup components, notification bar components, account components, video call components, video components, music components, IOT components, and the like.
In addition, the target framework platform also receives various service messages 304 issued by the background server by interacting with the background server. Illustratively, these service messages include, but are not limited to: TTS messages, PUSH messages, voice recognition messages, weather messages, account messages, video call messages, authorization messages, wake-up messages, broadcast control messages, etc., to which embodiments of the present application are not specifically limited.
Based on the target framework platform shown in fig. 3, the scheme can support three docking schemes of an access party.
1. Modular access mode: the developer of the scheme has already implemented various business components, such as the desktop component, the setting component, the notification bar component, and the video call component. The access party only needs to select whichever business components match its own product requirements; the target framework platform is the one component that must be carried, with which the complete AI voice interaction flow can be realized and hardware-related characteristics are shielded. The work of the access party is thus reduced to configuring business components according to its own product requirements.
2. SDK access mode: this mode is also supported for an access party that has development capability or needs to implement various interactions itself; the SDK access mode is not described in detail herein.
3. Whole-machine access mode: for an access party that has no development experience, or does not want to invest in development but still wants to realize voice interaction quickly, the scheme realizes a full-flow solution covering the front-end acoustic model, ASR, NLP, TTS, various skill configuration services, and the various AI voice interaction flows, so that the access party does not need to do any development. In other words, the scheme also supports the developer providing a whole machine, such as a smart speaker, for the access party to use. In some embodiments, the developer may select different functional components to integrate into the whole machine based on various application scenarios of the terminal device, so as to provide different access parties with terminal devices offering different functional services, which is not particularly limited in the embodiment of the present application.
It should be noted that the implementation of any of the docking schemes involves speech acquisition, speech recognition, TTS, semantic understanding, and semantic skill execution. In particular, the modular access scheme relates to message communication and state management techniques between multiple components; UI (User Interface) and audio/video dual-channel synchronization with the background; TTS, ASR, NLP, and semantic conversion skills; App long connection and PUSH channel technology; audio acquisition and echo cancellation techniques; and the like.
The man-machine interaction method provided by the embodiment of the application is explained in detail below by taking a modularized access mode as an example.
Fig. 4 is a flowchart of a man-machine interaction method according to an embodiment of the present application. The execution main body of the method is terminal equipment, and the terminal equipment is access party equipment. The terminal equipment is integrated with a voice interaction component (namely the target framework platform), N service components and a custom acoustic model provided by an access party. Wherein, this voice interaction component is used for providing voice interaction function. For example, the voice interaction component is packaged with a software development kit associated with voice interactions. Illustratively, the software development kit includes at least: speech recognition SDK, speech synthesis SDK, and text recognition SDK. In addition, N business components are selected from a business component set provided by a developer according to the product requirements of the access party; wherein, a business component is used for providing a service for the terminal equipment; for example, the video call module provides a video call function for the terminal device. Referring to fig. 4, the method provided by the embodiment of the application includes:
401. The voice interaction component of the terminal equipment receives audio data collected by the self-defined acoustic model, wherein the audio data is input by voice of a user.
In the embodiment of the application, the acoustic model of the terminal equipment is responsible for picking up the sound. And the audio data collected by the acoustic model is passed to the voice interaction component. That is, the audio data received by the voice interaction component is collected by an acoustic model.
In one possible implementation, the audio data is a user voice input, and may be a task-type question, such as asking to play a piece of music; it may also be a question-answer type question, such as asking what the weather is today; it may also be a casual chat utterance without any particular purpose, which is not specifically limited in the embodiment of the application.
It should be noted that, the acoustic model may be a custom acoustic model provided by the access party in addition to the acoustic model provided by the developer for the access party. The path of the self-defined acoustic model is arranged under the voice interaction component, and the voice interaction component provides an audio data receiving interface for waking up the terminal equipment. In other words, the scheme supports access of the access party custom acoustic model besides providing the acoustic model of the whole machine, and the detailed process is as follows: the access party prepares a custom acoustic model; setting a path of the custom acoustic model under a target frame platform; the target frame platform provides an interface when in wake-up, and is realized by an access party; based on such an interface implementation, audio data collected through the custom acoustic model can be transferred to the target framework platform.
Based on the above steps, the terminal device is woken up based on the custom acoustic model, and the user can wake up the terminal device by speaking the wake-up keyword into the microphone of the terminal device. The embodiment of the application thus provides a flexible and diverse front-end acoustic model access scheme; a conceptual sketch of this access flow is given below.
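The custom acoustic model access flow described above can be illustrated with the following conceptual Python sketch; all names here (VoiceInteractionComponent, set_acoustic_model_path, on_audio_data, CustomAcousticModel) are hypothetical stand-ins, since the real interface of the target framework platform is not specified in this document.

    class VoiceInteractionComponent:
        """Stand-in for the target framework platform."""
        def __init__(self) -> None:
            self.acoustic_model_path: str | None = None
            self._audio_buffer: list[bytes] = []

        def set_acoustic_model_path(self, path: str) -> None:
            # The access party places the path of its custom acoustic model
            # under the voice interaction component.
            self.acoustic_model_path = path

        def on_audio_data(self, pcm_frames: bytes) -> None:
            # Audio data receiving interface used at wake-up time: the custom
            # acoustic model hands the collected audio over to the framework.
            self._audio_buffer.append(pcm_frames)

    class CustomAcousticModel:
        """Stand-in for the access party's own front-end acoustic model."""
        def __init__(self, sink: VoiceInteractionComponent) -> None:
            self.sink = sink

        def on_wakeup(self, pcm_frames: bytes) -> None:
            # After the wake-up keyword is detected, pass the picked-up audio on.
            self.sink.on_audio_data(pcm_frames)

    component = VoiceInteractionComponent()
    component.set_acoustic_model_path("/vendor/acoustic/custom_model.bin")
    CustomAcousticModel(component).on_wakeup(b"\x00\x01" * 160)  # simulated frames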
In another possible implementation, the basic architecture of the acoustic front-end acquisition system is shown in fig. 5, including multi-microphone pickup, echo cancellation, microphone array beamforming, far-field reverberation control, nonlinear noise suppression, and keyword wake-up. In addition, in far-field voice interaction, the voice undergoes transmission attenuation during propagation and is accompanied by noise, interference signals, and the like, so a microphone array is generally adopted to enhance the target voice, suppress interfering voice, and improve recognition accuracy. Fig. 6 shows 6 microphones, namely microphone 1, microphone 2, microphone 3, microphone 4, microphone 5, and microphone 6, which form a microphone array. In addition, as shown in fig. 6, besides the sound signal emitted from the sound source, there are usually reflected sound of the sound source signal, microphone echo, environmental noise, ambient background noise, and the like in the environment.
In another possible implementation, the acoustic model further includes sound source localization, echo cancellation, and reverberation cancellation steps before delivering the acquired audio data to the target frame platform.
For example, a generalized cross-correlation method may be used for sound source localization, which is essentially a TDOA (Time Difference of Arrival) calculation method: the cross power spectrum of two voice signals is calculated in the frequency domain and then converted from the frequency domain to the time domain by an inverse Fourier transform, and the propagation delay is determined by finding the delay corresponding to the maximum cross-correlation value.
For the embodiment of the application, the detailed process of sound source localization can be as follows: acquiring a first voice signal acquired by a first microphone of a microphone array of a terminal device, wherein the first voice signal comprises a first sound source signal and a first noise signal; acquiring a second voice signal acquired by a second microphone of the microphone array, wherein the second voice signal comprises a second sound source signal and a second noise signal; acquiring cross power spectrums of the first voice signal and the second voice signal on a frequency domain; transforming the cross-power spectrum from a frequency domain to a time domain to obtain a cross-correlation function; determining a time value corresponding to the maximum cross-correlation value as a propagation delay, wherein the propagation delay is an arrival time difference of the voice signal between the first microphone and the second microphone; sound source localization is performed based on the propagation delay. Illustratively, after determining the propagation delay, the sound source position may be determined according to the speed of sound and the distance between the two microphones, which is not particularly limited by the embodiments of the present application.
For example, the sound signals acquired by two microphones spaced a distance L apart from a far-field sound source may be expressed as: x1(t) = s1(t) + z1(t), and x2(t) = s2(t + D) + z2(t).
Here s1(t) and s2(t + D) are the sound source signals received by the two microphones, respectively, and z1(t) and z2(t) are the noise signals received by the two microphones, respectively. As shown in fig. 7, after the sound signals x1(t) and x2(t) are collected, they are filtered through filters H1(t) and H2(t), and an FFT (Fast Fourier Transform) and power spectrum calculation are performed to obtain X1(ω) and X2(ω); the cross power spectrum of the two voice signals is then calculated in the frequency domain, and the cross power spectrum is converted from the frequency domain to the time domain through an IFFT (Inverse Fast Fourier Transform), so as to obtain a propagation delay d that approximates the real time delay D, where Φ(ω) in fig. 7 is a phase transformation function that can be selected according to the actual situation.
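The generalized cross-correlation computation described above can be sketched in Python with numpy as follows; this is a minimal illustration of the TDOA estimate (with a PHAT-style phase transform as one possible choice of Φ(ω)), not the actual implementation used by the terminal device.

    import numpy as np

    def gcc_delay(x1: np.ndarray, x2: np.ndarray, fs: int) -> float:
        """Estimate the arrival-time difference between two microphone signals."""
        n = 2 * max(len(x1), len(x2))              # zero-pad to avoid circular wrap-around
        X1 = np.fft.rfft(x1, n=n)
        X2 = np.fft.rfft(x2, n=n)
        cross_spectrum = X1 * np.conj(X2)          # cross power spectrum in the frequency domain
        cross_spectrum /= np.abs(cross_spectrum) + 1e-12   # phase transform weighting
        cc = np.fft.irfft(cross_spectrum, n=n)     # back to the time domain: cross-correlation
        max_shift = n // 2
        cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
        shift = int(np.argmax(np.abs(cc))) - max_shift      # lag of the cross-correlation peak
        return shift / fs                          # propagation delay in seconds

    # Simulated example: the same source reaches microphone 2 five samples later.
    fs = 16000
    rng = np.random.default_rng(0)
    s = rng.standard_normal(fs)
    x1 = s + 0.05 * rng.standard_normal(fs)
    x2 = np.roll(s, 5) + 0.05 * rng.standard_normal(fs)
    print(gcc_delay(x1, x2, fs))   # about -5 / 16000 s; the sign convention depends on ordering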
For echo cancellation: in a voice-interaction-based terminal device, a voice interaction instruction input by the user sometimes needs to be responded to while the terminal device is playing voice or video. In this case, the voice signal played by the local loudspeaker needs to be removed from the voice signal received by the microphone array, so that the terminal device can correctly recognize the voice interaction instruction input by the user. In one possible implementation, the sound signal played by the local loudspeaker may be modeled using the impulse response from the loudspeaker to the microphone array, namely x(t) = n(t) * r(t) + s(t) + z(t).
Here n(t) is the voice signal played by the loudspeaker, r(t) is the impulse response from the loudspeaker to the microphone array, s(t) is the real sound source signal, z(t) is the noise signal, and x(t) is the voice signal finally received by the microphone array. The process of echo cancellation is, for example, to obtain a filter H(t) that is infinitely close to the real impulse response r(t):
f(t) = x(t) - n(t) * H(t) = n(t) * r(t) - n(t) * H(t) + s(t) + z(t)
The above filter H(t) is also referred to herein as the first filter; that is, echo cancellation processing is performed on the voice signal received by the microphone array based on the first filter, wherein the filter function of the first filter is infinitely close to the impulse response from the loudspeaker to the microphone array, and the voice signal x(t) received by the microphone array is determined according to the sound source signal s(t), the noise signal z(t), the voice signal n(t) played by the loudspeaker, and the impulse response r(t) from the loudspeaker to the microphone array. A minimal adaptive-filtering sketch is given below.
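As a minimal sketch of how the first filter H(t) could be obtained in practice, the following Python example uses an NLMS adaptive filter; the patent itself does not prescribe a specific adaptation algorithm, so the algorithm choice, tap count, and step size below are illustrative assumptions.

    import numpy as np

    def nlms_echo_cancel(mic: np.ndarray, ref: np.ndarray, taps: int = 128,
                         mu: float = 0.5, eps: float = 1e-6) -> np.ndarray:
        """Adapt H toward r and output f(t) = x(t) - n(t) * H(t)."""
        h = np.zeros(taps)                   # current estimate of the echo path r(t)
        buf = np.zeros(taps)                 # most recent loudspeaker (reference) samples
        out = np.zeros_like(mic)
        for t in range(len(mic)):
            buf = np.roll(buf, 1)
            buf[0] = ref[t]
            echo_estimate = h @ buf          # n(t) * H(t)
            err = mic[t] - echo_estimate     # f(t): near-end speech plus residual echo
            h += mu * err * buf / (buf @ buf + eps)    # NLMS update toward r(t)
            out[t] = err
        return out

    # Simulated example: the microphone hears the loudspeaker through a short echo path.
    rng = np.random.default_rng(1)
    far = rng.standard_normal(16000)                    # n(t), signal played by the loudspeaker
    true_r = np.array([0.6, 0.3, 0.1])                  # r(t), unknown echo path
    near = 0.1 * rng.standard_normal(16000)             # s(t) + z(t)
    mic = np.convolve(far, true_r)[:16000] + near       # x(t)
    cleaned = nlms_echo_cancel(mic, far)
    print(float(np.mean(cleaned[8000:] ** 2)), float(np.mean(mic[8000:] ** 2)))  # residual drops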
For reverberation cancellation: reverberation refers to the phenomenon that a sound signal meets obstacles such as the floor and walls, forms reflected sound, and is superimposed on the real sound source signal. By way of example, reverberation can be modeled by a room impulse response (RIR, Room Impulse Response).
The voice signal finally received by the microphone array is determined according to the sound source signal, the noise signal, and the room impulse response of the sound source, namely x(t) = r(t) * s(t) + z(t), where x(t) is the voice signal finally received by the microphone array, s(t) is the real sound source signal, r(t) is the room impulse response of the sound source, and z(t) is the noise signal.
Converting x(t) to the frequency domain gives X(ω) = R(ω)S(ω) + Z(ω). Illustratively, the method of reverberation cancellation may be to estimate a filter G(ω) that approximates the inverse of R(ω), so that G(ω)X(ω) approximates S(ω), thereby achieving the aim of eliminating the reverberant sound.
The above method of removing reverberation is called the inverse filtering method. The filter G(ω) is also referred to herein as the second filter; after the voice signal received by the microphone array is transformed from the time domain to the frequency domain, the embodiment of the application performs inverse filtering processing on the obtained frequency domain signal based on the second filter, so as to recover the real sound source signal s(t). A minimal sketch of this inverse filtering is given below.
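The inverse filtering described above can be sketched in Python as follows; in this illustration the room impulse response is assumed to be known, whereas in practice the second filter has to be estimated, so the example only shows the frequency-domain inversion itself.

    import numpy as np

    def inverse_filter_dereverb(x: np.ndarray, rir: np.ndarray, eps: float = 1e-3) -> np.ndarray:
        """Transform x(t) to the frequency domain and apply a regularized inverse of R(w)."""
        n = len(x)
        X = np.fft.rfft(x, n=n)
        R = np.fft.rfft(rir, n=n)
        G = np.conj(R) / (np.abs(R) ** 2 + eps)    # second filter: regularized inverse of R(w)
        return np.fft.irfft(G * X, n=n)            # approximation of the source signal s(t)

    # Simulated example: a source convolved with a toy room impulse response plus noise.
    rng = np.random.default_rng(2)
    s = rng.standard_normal(4096)
    rir = np.zeros(256)
    rir[0], rir[40], rir[120] = 1.0, 0.5, 0.25
    x = np.convolve(s, rir)[:4096] + 0.01 * rng.standard_normal(4096)   # x(t) = r(t)*s(t) + z(t)
    s_hat = inverse_filter_dereverb(x, rir)
    print(float(np.corrcoef(s, s_hat)[0, 1]))      # close to 1 when the inversion works well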
It should be noted that, through the above method steps, the user can wake up the terminal device by speaking the wake-up keyword into the microphone array.
402. The voice interaction component of the terminal device transmits audio data to the server, the audio data being used to instruct the server to perform audio processing and to generate response data matching the audio data.
The voice interaction component of the terminal equipment acquires response data matched with the received audio data through interaction with the background server.
Wherein the response data includes a speech form and a non-speech form. Illustratively, for non-voice forms, the response data may be UI data, which is not specifically limited by the embodiments of the present application.
In the embodiment of the present application, as shown in fig. 8, audio data received by the acoustic model of the terminal device 801 is transferred to the voice interaction component, and the voice interaction component transfers the audio data to the background server 802 for audio processing through the gateway interface of the terminal device. That is, the terminal device transmits audio data received by the acoustic model to the background server through the voice interaction component, and the audio data is used for instructing the background server to perform audio processing and generating response data matched with the audio data.
In one possible implementation, FIG. 8 illustrates a specific architecture of the background server 802. As shown in FIG. 8, the background server 802 includes, but is not limited to: AIProxy (agent), an ASR module, a TTS module, a target service module, a TSKM platform (skill configuration platform), and a skill service module.
Based on the background server architecture shown in fig. 8, the background server may perform the following audio processing on the received audio:
4021. The voice interaction component of the terminal device delivers the received audio data to AIproxy through the gateway interface.
4022. AIproxy transmitting the received audio data to an ASR module, and performing speech-to-text processing on the audio data by the ASR module to obtain text data, namely obtaining ASR text.
4023. AIproxy transmits the received audio data to a target service module, and further obtains semantic skill data based on the TSKM platform and the skill service module. Aiming at the step, semantic analysis, namely voice-to-semantic processing, is carried out on the audio data; illustratively, the speech-to-semantic processing may be performed by a target service module, which is not specifically limited in embodiments of the present application.
Illustratively, the TSKM platform is also referred to as a skill configuration platform for configuring skills that are strongly related to semantics, such as weather skills, alarm clock skills, music skills, etc.; yet further, the acquisition of skill data such as weather messages requires the skill service to provide support, i.e., TSKM platforms pull specific skill data based on semantic parsing results by interacting with the skill service module. For example, if the semantic analysis result is weather searching, the weather message of the current position of the user is pulled through interaction with weather service.
As one example, the semantic skill data includes: the question intent, the knowledge domain to which the question belongs, the question text, and the response data in non-voice form. For example, assuming that the received audio data is "weather in city S", the corresponding semantic skill data may include: domain - weather, intent - general search, question - weather in city S, response data - city S has a thunder and lightning warning today. As shown above, the semantic skill data includes the question intent (intent), the knowledge domain (domain) to which the question belongs, the question text (Query), and the response data in non-voice form, namely "city S has a thunder and lightning warning today".
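To make the structure concrete, the semantic skill data for the "weather in city S" example above could be represented as in the following sketch; the field names and the wire format are illustrative assumptions, not the actual protocol between the background server and the voice interaction component.

    semantic_skill_data = {
        "domain": "weather",                   # knowledge domain the question belongs to
        "intent": "general_search",            # question intent
        "query": "weather in city S",          # recognized question text
        "response": {
            "text": "City S has a thunder and lightning warning today",  # non-voice form
            "tts_audio": None,                 # voice-form response filled in by the TTS module
        },
    }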
4024. The TTS module performs text-to-speech processing on the response data in non-voice form to obtain TTS data, namely the response data in voice form.
In the embodiment of the application, the background server can return the ASR text, the TTS data and the response data to the voice interaction component of the terminal equipment. The voice interaction component, upon receiving the data, can respond to user voice input based upon the data.
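The processing flow in steps 4021 to 4024 can be summarized with the following schematic Python sketch; the asr, parse_semantics, and tts functions are trivial stand-ins for the ASR module, the target service module with the TSKM platform and skill services, and the TTS module shown in fig. 8, and their return values are placeholders.

    def asr(audio: bytes) -> str:
        return "weather in city S"                   # placeholder speech-to-text result (step 4022)

    def parse_semantics(text: str) -> dict:
        return {"domain": "weather", "intent": "general_search", "query": text,
                "response": "City S has a thunder and lightning warning today"}   # step 4023

    def tts(text: str) -> bytes:
        return text.encode("utf-8")                  # placeholder synthesized audio (step 4024)

    def process_audio(audio: bytes) -> dict:
        asr_text = asr(audio)
        skill_data = parse_semantics(asr_text)
        tts_data = tts(skill_data["response"])
        # The ASR text, TTS data, and response data are then returned to the
        # voice interaction component of the terminal device.
        return {"asr_text": asr_text, "semantic_skill_data": skill_data, "tts_data": tts_data}

    print(process_audio(b"simulated wake-up audio")["semantic_skill_data"]["response"])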
Referring to fig. 8, three callback manners are provided in the embodiment of the present application for the interaction between the voice interaction component and each business component of the terminal device. That is, the embodiment of the present application provides three inter-component interaction schemes for use after the voice interaction component receives the data returned by the background server 802.
For the cross-process message distribution interaction manner, if the data returned by the background server 802 needs to be delivered to a certain business component integrated in the terminal device, that business component needs to register a callback function or a broadcast receiver when it is started, so that the voice interaction component performs a remote callback after receiving the data issued by the background server. That is, the embodiment of the present application further includes the following step 403.
403. The voice interaction component of the terminal device delivers the response data to the first business component; in response to the user voice input being a task-type question, the response data is used to trigger the first business component to perform the target operation indicated by the user voice input.
That is, the response data is used to trigger the first business component to respond to the user voice input. For example, the music component is instructed to play a certain piece of music or the alarm clock component rings at a target time, which is not particularly limited in the embodiment of the present application.
In one possible implementation, delivering the response data to the first service component through the voice interaction component includes, but is not limited to: notifying the first service component to receive the response data in a directional broadcasting mode through the voice interaction component; wherein the first business component has registered a callback function (callback) or a listener with the voice interaction component in advance.
A callback function is a function invoked through a function pointer. If a pointer to a function is passed as a parameter to another function, and that pointer is later used to invoke the function it points to, the function it points to is referred to as a callback function. A callback function is not called directly by its implementer; instead, it is called by another party when a specific event or condition occurs, in order to respond to that event or condition. In most cases, a callback is a method that is invoked when certain events happen. For example, a user visits a store to buy an item that happens to be out of stock, so the user leaves contact information with a store clerk; when the item arrives a few days later, the clerk contacts the user using the contact information provided, and the user then goes to the store to pick up the item. In this example, the user's contact information acts as the callback function, leaving the contact information with the clerk corresponds to registering the callback function, the item arriving in stock is the event that triggers the callback, the clerk notifying the user corresponds to calling the callback function, and the user picking up the item corresponds to responding to the callback event.
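Mapped onto the components in this embodiment, the register-then-call-back pattern can be sketched as follows; the interface and method names are illustrative assumptions and do not correspond to the actual interfaces of the voice interaction component.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Callback interface a business component implements to receive server data.
interface ResponseCallback {
    void onResponse(String responseData);
}

// Simplified registry inside the voice interaction component: business components
// register a callback at startup; the component calls back when server data arrives.
class VoiceInteractionRegistry {
    private final Map<String, ResponseCallback> callbacks = new ConcurrentHashMap<>();

    // Called by a business component (e.g. the music or alarm clock component) on startup.
    public void register(String componentName, ResponseCallback callback) {
        callbacks.put(componentName, callback);
    }

    // Called after the background server returns response data; dispatches it to
    // the business component that should handle it.
    public void dispatch(String targetComponent, String responseData) {
        ResponseCallback cb = callbacks.get(targetComponent);
        if (cb != null) {
            cb.onResponse(responseData);
        }
    }
}
```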
In the embodiment of the application, the user voice input may be directed at a task-type scene, where a task-type dialogue is used to complete a specific task; for example, the user needs to query ticket availability and have the corresponding action executed. In addition, the user voice input may also be directed at non-task scenes, such as question-answering scenes and chit-chat scenes. A question-answering dialogue is mainly used to answer user questions and is equivalent to an encyclopedic knowledge base, for example how to refund a train ticket or what to pay attention to when taking a flight; usually only an answer is required and no task needs to be executed. A chit-chat dialogue is open-ended and generally has neither a task goal nor a rigidly defined answer.
Wherein the manner in which the response data triggers the first business component to respond to the user voice input is different for different types of user voice inputs. In another possible implementation, in response to the user voice input not being a task-type question, the response data is used to trigger the first business component to display non-voice form of the response data through the UI template; it should be noted that, in addition to the callback manner of the above-mentioned cross-process message distribution, the embodiment of the present application further includes a callback manner based on the original thread, and the detailed procedure is as follows.
In another possible implementation, the voice interaction component of the terminal device plays the response data in voice form.
In this callback manner, the callback is a playback callback on the voice interaction component itself; in other words, because the voice interaction component encapsulates various voice interaction development kits, it supports voice playback, so the response data in voice form can be played directly through the voice interaction component. For example, assuming the user voice input is not a task-type question, the response data in voice form may be played through the voice interaction component.
Illustratively, the response data is typically returned to the terminal device by the background server as the response to a network request, so this callback manner can invoke the callback directly on the thread that initiated the network request.
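A minimal sketch of this manner is shown below: the request is issued synchronously, so once the response body has been read, the callback is invoked on the very thread that initiated the network request. The URL handling and the callback type are assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.function.Consumer;

// Sketch: synchronous request, so the callback runs on the requesting thread.
class OriginalThreadCallbackDemo {
    static void requestAndCallback(String serverUrl, Consumer<String> callback) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(serverUrl).openConnection();
        conn.setRequestMethod("GET");
        StringBuilder body = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line);
            }
        }
        // Still on the thread that initiated the network request:
        // hand the response data to the callback.
        callback.accept(body.toString());
    }
}
```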
It should be noted that, in addition to the above callback manner of cross-process message distribution, the embodiment of the present application further includes a callback manner targeting the UI thread, which is described in detail below.
In another possible implementation, the voice interaction component of the terminal device displays the response data in a non-voice form through the UI template.
In the embodiment of the application, for data that needs to be displayed based on the UI template, such as response data in the form of UI data, the callback can be made on the UI thread of the voice interaction component, and the interactive presentation is then carried out through the UI template provided by the voice interaction component. For example, assuming that the user voice input is not a task-type question and the terminal device does not integrate a corresponding business component, the response data in non-voice form may be displayed through the UI template of the voice interaction component.
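On an Android-style terminal, one common way to return to the UI thread is to post to a Handler bound to the main looper; the sketch below illustrates that pattern under this assumption, and the renderTemplate method is a hypothetical stand-in for the voice interaction component's UI-template entry point.

```java
import android.os.Handler;
import android.os.Looper;

// Sketch: post the non-voice response data to the UI thread of the voice
// interaction component so it can be rendered with the built-in UI template.
class UiThreadCallbackDemo {
    private final Handler mainHandler = new Handler(Looper.getMainLooper());

    void showOnUiTemplate(final String responseData) {
        mainHandler.post(new Runnable() {
            @Override
            public void run() {
                // renderTemplate(...) is a placeholder for the UI-template
                // rendering entry point of the voice interaction component.
                renderTemplate(responseData);
            }
        });
    }

    private void renderTemplate(String data) {
        // Hypothetical rendering logic; a real component would bind the data
        // to the UI template provided by the voice interaction component.
    }
}
```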
With the above scheme, the data issued by the background server is called back to the corresponding target, which provides greater convenience for the access party.
The method provided by the embodiment of the application has at least the following beneficial effects:
For intelligent voice interaction, the developer can implement the voice interaction component and various service components in advance, so that an access party can conveniently combine and access service components freely according to its own product requirements, thereby forming its own terminal device; that is, the access party can freely combine business components to form its own product solution. Faced with the various service components provided by the developer, the access party may freely choose which to access and which not to access. In other words, the access party can define the functions of the terminal device according to its own product requirements. This componentized solution based on voice interaction can be conveniently applied to various intelligent terminal devices, such as IoT devices and screen-less devices; access is simple, device customization is strong, the access cycle of an access party can be shortened as much as possible, development cost is saved, and flexibility is high. In summary, the embodiment of the application makes flexible and simple intelligent voice interaction possible for the access party.
In another embodiment, in addition to the active request-and-response interaction between the voice interaction component and the background server, the embodiments of the present application also support a PUSH mechanism. Illustratively, the present solution provides a registration-based, centralized PUSH sending and receiving mode. As shown in fig. 9, a PUSH management channel is set in the voice interaction component and is used for receiving, in real time, messages pushed to the terminal device by the background server. The voice interaction component maintains a long connection with the background server; as an example, PushManager is used in this scenario to establish the long connection with the background server, and PushManager provides the ability to receive push messages from the background server.
In the embodiment of the application, the PUSH functions required by each service component, the voice interaction component, and the like integrated on the terminal device are all supported by the PushManager component; this leaves the system with only one unified PUSH channel, which is convenient to manage and saves memory and network traffic. Based on the above description, the method provided by the embodiment of the application further includes: the terminal device receives, through the voice interaction component, push messages issued by the server based on the long connection; and notifies a second service component to receive the push message in a directional broadcasting mode through the voice interaction component; wherein the second business component has registered a callback function or a listener with the voice interaction component in advance.
In other words, in order for each service component integrated on the terminal device to receive push messages from the background server, the service components also need to register with the voice interaction component; in this way, after receiving a push message, the voice interaction component can notify the service component that needs to receive it by directional broadcast. The service component receiving the push message does not need to be resident, which saves memory and CPU (Central Processing Unit) overhead and improves performance.
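As a hedged sketch of this registration-based push distribution, the class below keeps a single dispatch table fed by the long connection and forwards each message only to the components that registered for it; PushListener, the topic strings, and the method names are assumptions made for this example, not the actual PushManager interface.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Listener a business component registers to receive push messages.
interface PushListener {
    void onPushMessage(String topic, String payload);
}

// Sketch of a unified push channel: one long connection to the background server,
// messages dispatched only to components registered for the matching topic.
class PushManagerSketch {
    private final Map<String, List<PushListener>> listeners = new ConcurrentHashMap<>();

    // Business components (e.g. the video call or alarm clock component) register here
    // instead of each keeping its own resident push process.
    public void register(String topic, PushListener listener) {
        listeners.computeIfAbsent(topic, k -> new CopyOnWriteArrayList<>()).add(listener);
    }

    // Called when a message arrives over the long connection with the server.
    public void onServerPush(String topic, String payload) {
        for (PushListener l : listeners.getOrDefault(topic, Collections.emptyList())) {
            l.onPushMessage(topic, payload);
        }
    }
}
```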
The componentized access scheme based on intelligent voice interaction provided by the embodiment of the application is specifically illustrated below.
After the developer has implemented the voice interaction component and various service components, the access party can freely combine and access the service components according to its own product requirements. Illustratively, a developer may implement components such as a desktop component, a settings component, an account component, and a video call component. The access party can freely combine these service components to form its own product solution. In addition, the embodiment of the application also supports the access party customizing service components based on the external interfaces provided by the developer, which is not specifically limited in the embodiment of the application.
The following illustrates the componentized access scheme by taking access to the video call component as an example. If the access party wants its own terminal device to have a video call function, it can access the video call component (videocall) and the voice interaction component (target framework platform) provided by the developer, that is, integrate these two components on its terminal device. Illustratively, the video call procedure based on these two components is as follows:
Step a. The acoustic model of the terminal device collects the user voice, such as "call ...".
Step b. The acoustic model transmits the collected user voice to the voice interaction component.
Step c. The voice interaction component transmits the user voice to the background server.
Step d. The background server performs speech-to-semantics processing and skill service matching on the received user voice to obtain response data, and sends the response data to the voice interaction component.
Step e. The voice interaction component notifies the video call component to receive the response data by way of broadcast.
Step f. The response data triggers the video call component to initiate the call interaction flow; that is, the video call component initiates a video call and waits for the counterpart to answer, after which the call is connected.
The broadcast here is a directional broadcast, and with directional broadcast the receiving process does not need to be resident, which saves system resources; that is, the current broadcast implementation is a directional broadcast, i.e., only the designated process can receive it. Other service components are accessed with the same logic. This combined access scheme can quickly assemble a minimal set of functions from atomic requirements, thereby meeting the access party's needs and bringing convenience.
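The receiving side of the directional broadcast in steps e and f might look like the sketch below, placed in the video call component; the action string and the extra carrying the callee are hypothetical and only illustrate the pattern, with the sender restricting delivery (for example via Intent.setPackage) so that only the designated process receives the broadcast.

```java
import android.content.BroadcastReceiver;
import android.content.Context;
import android.content.Intent;

// Sketch of the video call component's receiver for the directional broadcast sent
// by the voice interaction component; action and extra names are placeholders.
public class VideoCallResponseReceiver extends BroadcastReceiver {
    static final String ACTION_VOICE_RESPONSE = "com.example.voice.ACTION_RESPONSE"; // hypothetical

    @Override
    public void onReceive(Context context, Intent intent) {
        if (ACTION_VOICE_RESPONSE.equals(intent.getAction())) {
            String callee = intent.getStringExtra("callee"); // hypothetical extra carrying the contact
            if (callee != null) {
                startVideoCall(callee);
            }
        }
    }

    private void startVideoCall(String callee) {
        // Placeholder: kick off the video call flow and wait for the counterpart to answer.
    }
}
```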
Fig. 10 is a schematic structural diagram of a man-machine interaction device according to an embodiment of the present application. Applied to a terminal device, and referring to fig. 10, the terminal device integrates a voice interaction component 1001, N service components 1002, and a custom acoustic model 1003 provided by an access party; the voice interaction component 1001 encapsulates a software development kit (SDK) related to voice interaction; the N service components 1002 are selected by the access party from a set of service components provided by the developer according to the access party's product requirements; a service component 1002 is used to provide at least one service for the terminal device, and N is a positive integer;
A custom acoustic model 1003 configured to collect audio data, the audio data being user speech input;
a voice interaction component 1001 configured to receive audio data collected by the custom acoustic model;
A voice interaction component 1001 further configured to send the audio data to a server, the audio data being for instructing the server to perform audio processing and generating response data matching the audio data;
The voice interaction component 1001 is further configured to send the response data returned by the server to the first service component; in response to the user voice input being a task-type question, the response data is used to trigger the first business component to execute the target operation indicated by the user voice input.
For intelligent voice interaction, the developer can implement the voice interaction component and various service components in advance, so that an access party can conveniently combine and access service components freely according to its own product requirements, thereby forming its own terminal device; that is, the access party can freely combine business components to form its own product solution. Faced with the various business components provided by the developer, the access party may freely choose which to access and which not to access. In other words, the access party can define the functions of the terminal device according to its own product requirements. This componentized solution based on voice interaction can be conveniently applied to various intelligent terminal devices, such as IoT devices and screen-less devices; access is simple, device customization is strong, the access cycle of an access party can be shortened as much as possible, development cost is saved, and flexibility is high. In summary, the embodiment of the application makes flexible and simple intelligent voice interaction possible for the access party.
In one possible implementation, the SDK includes: a speech recognition SDK, a speech synthesis SDK, and a text recognition SDK;
The path of the acoustic model is arranged under the voice interaction component, and the voice interaction component provides an audio data receiving interface for waking up the terminal equipment.
In one possible implementation, the audio data is used to instruct the server to perform the following audio processing:
Performing semantic analysis on the audio data, and acquiring semantic skill data of the audio data based on a semantic analysis result, wherein the semantic skill data comprises: question intent, knowledge domain to which the question belongs, question text and the response data.
In one possible implementation, the first business component is configured to display the response data in a non-voice form in response to the user voice input not being a task-type question;
The voice interaction component is further configured to, in response to the user voice input not being a task-type question, play the response data in voice form;
The voice interaction component is further configured to display the response data in a non-voice form in response to the user voice input not being a task question and the terminal device not integrating the first business component.
In one possible implementation, a long connection is established between the voice interaction component and the server;
the voice interaction component is further configured to receive a push message issued by the server based on the long connection; and notify a second service component to receive the push message in a directional broadcasting mode, wherein the second service component registers a callback function or a listener with the voice interaction component in advance.
In a possible implementation manner, the voice interaction component is configured to notify the first service component to receive the response data in a directional broadcast manner, and the first service component registers a callback function or a listener with the voice interaction component in advance.
In one possible implementation, the apparatus further includes:
The sound source positioning module is configured to acquire a first voice signal acquired by the first microphone, wherein the first voice signal comprises a first sound source signal and a first noise signal; acquiring a second voice signal acquired by a second microphone, wherein the second voice signal comprises a second sound source signal and a second noise signal; acquiring cross power spectrums of the first voice signal and the second voice signal on a frequency domain; transforming the cross-power spectrum from a frequency domain to a time domain to obtain a cross-correlation function; determining a time value corresponding to the maximum cross-correlation value as a propagation delay, wherein the propagation delay is an arrival time difference of a voice signal between the first microphone and the second microphone; and performing sound source positioning based on the propagation delay, wherein the first microphone and the second microphone come from a microphone array of the terminal equipment.
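For illustration only, the delay-estimation step can be sketched as below. For brevity the cross-correlation is computed directly in the time domain, which for real signals is equivalent to transforming the frequency-domain cross power spectrum back to the time domain as described; the frame lengths, the maximum lag, and all names are assumptions made for this example.

```java
// Sketch: estimate the arrival-time difference (in samples) between two microphone
// signals by locating the peak of their cross-correlation. Equivalent to the
// frequency-domain cross-power-spectrum method described above for real signals.
public final class TdoaEstimator {
    // Returns the lag (in samples) at which the cross-correlation is maximal;
    // a positive lag means the signal reaches mic2 later than mic1.
    public static int estimateDelaySamples(double[] mic1, double[] mic2, int maxLag) {
        double best = Double.NEGATIVE_INFINITY;
        int bestLag = 0;
        for (int lag = -maxLag; lag <= maxLag; lag++) {
            double sum = 0.0;
            for (int n = 0; n < mic1.length; n++) {
                int m = n + lag;
                if (m >= 0 && m < mic2.length) {
                    sum += mic1[n] * mic2[m];
                }
            }
            if (sum > best) {
                best = sum;
                bestLag = lag;
            }
        }
        return bestLag;
    }

    // Convert the sample delay into seconds given the sampling rate; combined with
    // the microphone spacing, this yields the direction for sound source localization.
    public static double delaySeconds(int lagSamples, double sampleRateHz) {
        return lagSamples / sampleRateHz;
    }
}
```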
In one possible implementation, the apparatus further includes:
An echo cancellation module configured to perform echo cancellation processing on a voice signal received by the microphone array based on the first filter; wherein the filter function of the first filter approximates, as closely as possible, the impulse response from the loudspeaker to the microphone array; the voice signal received by the microphone array is determined according to a sound source signal, a noise signal, a voice signal played by the loudspeaker and the impulse response.
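The embodiment does not specify the exact filter; as one common realization, the following hedged sketch uses a normalized LMS (NLMS) adaptive filter whose weights converge toward the loudspeaker-to-microphone impulse response, so the predicted echo can be subtracted from the microphone signal. All names and the suggested step-size range are assumptions made for illustration.

```java
// Sketch: NLMS adaptive echo canceller. The filter weights converge toward the
// loudspeaker-to-microphone impulse response; the predicted echo is subtracted
// from the microphone signal so that the near-end speech remains.
public final class NlmsEchoCanceller {
    private final double[] weights;   // filter estimate of the echo path
    private final double[] history;   // recent loudspeaker (reference) samples
    private final double mu;          // step size, e.g. 0.1 ... 0.5
    private final double eps = 1e-8;  // regularization to avoid division by zero

    public NlmsEchoCanceller(int taps, double stepSize) {
        this.weights = new double[taps];
        this.history = new double[taps];
        this.mu = stepSize;
    }

    // farEndSample: sample played by the loudspeaker; micSample: sample captured by
    // the microphone (speech + noise + echo). Returns the echo-suppressed sample.
    public double process(double farEndSample, double micSample) {
        System.arraycopy(history, 0, history, 1, history.length - 1);
        history[0] = farEndSample;

        double echoEstimate = 0.0, power = eps;
        for (int i = 0; i < weights.length; i++) {
            echoEstimate += weights[i] * history[i];
            power += history[i] * history[i];
        }
        double error = micSample - echoEstimate;    // residual: near-end speech + noise
        double g = mu * error / power;
        for (int i = 0; i < weights.length; i++) {
            weights[i] += g * history[i];           // NLMS weight update
        }
        return error;
    }
}
```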
In one possible implementation, the apparatus further includes:
The reverberation elimination module is configured to transform the voice signals received by the microphone array from a time domain to a frequency domain to obtain frequency domain signals; performing inverse filtering processing on the frequency domain signal based on a second filter to recover the sound source signal; wherein the speech signal received by the microphone array is determined from the sound source signal, the noise signal and the room impulse response from the sound source.
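As a simplified illustration of the inverse filtering step, the sketch below performs a regularized spectral division, assuming an estimate of the room impulse response is available; it uses a direct DFT for brevity and is not the embodiment's actual second filter.

```java
// Sketch: frequency-domain inverse filtering for dereverberation, assuming an
// estimate of the room impulse response h is available. The received frame is
// divided by H(f) with regularization, then transformed back to the time domain.
public final class InverseFilterDereverb {
    // Direct DFT of a real signal; returns {re, im}. O(N^2), sketch only.
    private static double[][] dft(double[] x, int n) {
        double[] re = new double[n], im = new double[n];
        for (int k = 0; k < n; k++) {
            for (int t = 0; t < x.length; t++) {
                double a = -2.0 * Math.PI * k * t / n;
                re[k] += x[t] * Math.cos(a);
                im[k] += x[t] * Math.sin(a);
            }
        }
        return new double[][] { re, im };
    }

    // frame: one frame of the microphone signal; roomIr: estimated room impulse
    // response; eps: regularization that limits noise amplification.
    public static double[] recover(double[] frame, double[] roomIr, double eps) {
        int n = frame.length;
        double[][] Y = dft(frame, n);
        double[][] H = dft(roomIr, n);
        double[] sRe = new double[n], sIm = new double[n];
        for (int k = 0; k < n; k++) {
            double denom = H[0][k] * H[0][k] + H[1][k] * H[1][k] + eps;
            // S(f) = Y(f) * conj(H(f)) / (|H(f)|^2 + eps)
            sRe[k] = (Y[0][k] * H[0][k] + Y[1][k] * H[1][k]) / denom;
            sIm[k] = (Y[1][k] * H[0][k] - Y[0][k] * H[1][k]) / denom;
        }
        // Inverse DFT (real part only) to get the recovered source frame.
        double[] out = new double[n];
        for (int t = 0; t < n; t++) {
            for (int k = 0; k < n; k++) {
                double a = 2.0 * Math.PI * k * t / n;
                out[t] += (sRe[k] * Math.cos(a) - sIm[k] * Math.sin(a)) / n;
            }
        }
        return out;
    }
}
```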
Any combination of the above-mentioned optional solutions may be adopted to form an optional embodiment of the present disclosure, which is not described herein in detail.
It should be noted that: in the man-machine interaction device provided in the above embodiment, only the division of each service component is used for illustration during man-machine interaction, in practical application, the above function allocation can be completed by different service components according to needs, that is, the internal structure of the device is divided into different service components, so as to complete all or part of the functions described above. In addition, the man-machine interaction device provided in the above embodiment and the man-machine interaction method embodiment belong to the same concept, and specific implementation processes thereof are detailed in the method embodiment and are not described herein again.
Fig. 11 shows a block diagram of a terminal device 1100 according to an exemplary embodiment of the present application. The terminal device 1100 may be a portable mobile terminal such as: a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. Terminal device 1100 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, desktop terminal, and the like.
In general, the terminal apparatus 1100 includes: a processor 1101 and a memory 1102.
The processor 1101 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 1101 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). The processor 1101 may also include a main processor and a coprocessor; the main processor is a processor for processing data in an awake state, also called a CPU (Central Processing Unit), and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 1101 may be integrated with a GPU (Graphics Processing Unit) for rendering and drawing the content to be displayed by the display screen. In some embodiments, the processor 1101 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 1102 may include one or more computer-readable storage media, which may be non-transitory. Memory 1102 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1102 is used to store at least one program code for execution by processor 1101 to implement the human-machine interaction method provided by the method embodiments of the present application.
In some embodiments, the terminal device 1100 may further optionally include: a peripheral interface 1103 and at least one peripheral. The processor 1101, memory 1102, and peripheral interface 1103 may be connected by a bus or signal lines. The individual peripheral devices may be connected to the peripheral device interface 1103 by buses, signal lines or circuit boards. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1104, a display screen 1105, a camera assembly 1106, audio circuitry 1107, and a power supply 1109.
A peripheral interface 1103 may be used to connect I/O (Input/Output) related at least one peripheral device to the processor 1101 and memory 1102. In some embodiments, the processor 1101, memory 1102, and peripheral interface 1103 are integrated on the same chip or circuit board; in some other embodiments, any one or both of the processor 1101, memory 1102, and peripheral interface 1103 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 1104 is used to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 1104 communicates with a communication network and other communication devices via electromagnetic signals. The radio frequency circuit 1104 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 1104 includes: an antenna system, an RF transceiver, one or more amplifiers, tuners, oscillators, digital signal processors, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 1104 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to: the world wide web, metropolitan area networks, intranets, mobile communication networks of each generation (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1104 may further include NFC (Near Field Communication) related circuits, which is not limited by the present application.
The display screen 1105 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display 1105 is a touch display, the display 1105 also has the ability to collect touch signals at or above the surface of the display 1105. The touch signal may be input to the processor 1101 as a control signal for processing. At this time, the display screen 1105 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display 1105, disposed on the front panel of the terminal device 1100; in other embodiments, there may be at least two displays 1105, disposed on different surfaces of the terminal device 1100 or in a folded design; in other embodiments, the display 1105 may be a flexible display disposed on a curved surface or a folded surface of the terminal device 1100. Furthermore, the display 1105 may even be arranged in a non-rectangular irregular pattern, i.e., a special-shaped screen. The display screen 1105 may be made of materials such as an LCD (Liquid Crystal Display) and an OLED (Organic Light-Emitting Diode).
The camera assembly 1106 is used to capture images or video. Optionally, the camera assembly 1106 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera is disposed on the rear surface of the terminal. In some embodiments, the at least two rear cameras are any one of a main camera, a depth camera, a wide-angle camera and a tele camera, so as to realize that the main camera and the depth camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize a panoramic shooting and Virtual Reality (VR) shooting function or other fusion shooting functions. In some embodiments, the camera assembly 1106 may also include a flash. The flash lamp can be a single-color temperature flash lamp or a double-color temperature flash lamp. The dual-color temperature flash lamp refers to a combination of a warm light flash lamp and a cold light flash lamp, and can be used for light compensation under different color temperatures.
The audio circuit 1107 may include a microphone and a speaker. The microphone is used for collecting sound waves of users and environments, converting the sound waves into electric signals, and inputting the electric signals to the processor 1101 for processing, or inputting the electric signals to the radio frequency circuit 1104 for voice communication. For the purpose of stereo acquisition or noise reduction, a plurality of microphones may be provided at different portions of the terminal device 1100, respectively. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is used to convert electrical signals from the processor 1101 or the radio frequency circuit 1104 into sound waves. The speaker may be a conventional thin film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to humans, but also the electric signal can be converted into a sound wave inaudible to humans for ranging and other purposes. In some embodiments, the audio circuit 1107 may also include a headphone jack.
The power supply 1109 is used to supply power to the respective components in the terminal device 1100. The power source 1109 may be an alternating current, a direct current, a disposable battery, or a rechargeable battery. When the power source 1109 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, terminal device 1100 also includes one or more sensors 1110. The one or more sensors 1110 include, but are not limited to: acceleration sensor 1111, gyroscope sensor 1112, pressure sensor 1113, optical sensor 1115, and proximity sensor 1116.
The acceleration sensor 1111 can detect the magnitudes of accelerations on three coordinate axes of the coordinate system established in the terminal apparatus 1100. For example, the acceleration sensor 1111 may be configured to detect components of gravitational acceleration in three coordinate axes. The processor 1101 may control the display screen 1105 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal acquired by the acceleration sensor 1111. Acceleration sensor 1111 may also be used for the acquisition of motion data of a game or a user.
The gyro sensor 1112 may detect a body direction and a rotation angle of the terminal device 1100, and the gyro sensor 1112 may collect a 3D motion of the user on the terminal device 1100 in cooperation with the acceleration sensor 1111. The processor 1101 may implement the following functions based on the data collected by the gyro sensor 1112: motion sensing (e.g., changing UI according to a tilting operation by a user), image stabilization at shooting, game control, and inertial navigation.
The pressure sensor 1113 may be disposed at a side frame of the terminal device 1100 and/or at a lower layer of the display screen 1105. When the pressure sensor 1113 is provided at a side frame of the terminal apparatus 1100, a grip signal of the terminal apparatus 1100 by a user can be detected, and the processor 1101 performs left-right hand recognition or quick operation based on the grip signal collected by the pressure sensor 1113. When the pressure sensor 1113 is disposed at the lower layer of the display screen 1105, the processor 1101 realizes control of the operability control on the UI interface according to the pressure operation of the user on the display screen 1105. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The optical sensor 1115 is used to collect the ambient light intensity. In one embodiment, the processor 1101 may control the display brightness of the display screen 1105 based on the intensity of ambient light collected by the optical sensor 1115. Specifically, when the intensity of the ambient light is high, the display luminance of the display screen 1105 is turned up; when the ambient light intensity is low, the display luminance of the display screen 1105 is turned down. In another embodiment, the processor 1101 may also dynamically adjust the shooting parameters of the camera assembly 1106 based on the intensity of ambient light collected by the optical sensor 1115.
A proximity sensor 1116, also referred to as a distance sensor, is typically provided on the front panel of the terminal device 1100. The proximity sensor 1116 is used to collect a distance between the user and the front surface of the terminal device 1100. In one embodiment, when the proximity sensor 1116 detects that the distance between the user and the front face of the terminal device 1100 gradually decreases, the processor 1101 controls the display 1105 to switch from the bright screen state to the off screen state; when the proximity sensor 1116 detects that the distance between the user and the front surface of the terminal apparatus 1100 gradually increases, the processor 1101 controls the display screen 1105 to switch from the off-screen state to the on-screen state.
It will be appreciated by those skilled in the art that the structure shown in fig. 11 is not limiting and that terminal device 1100 may include more or fewer components than shown, or may combine certain components, or may employ a different arrangement of components.
In an exemplary embodiment, a computer readable storage medium, e.g. a memory comprising program code, executable by a processor in a terminal to perform the man-machine interaction method of the above embodiments, is also provided. For example, the computer readable storage medium may be Read-Only Memory (ROM), random-access Memory (Random Access Memory, RAM), compact disc Read-Only Memory (Compact Disc Read-Only Memory, CD-ROM), magnetic tape, floppy disk, optical data storage device, and the like.
In an exemplary embodiment, a computer program product or a computer program is also provided, which computer program product or computer program comprises computer program code, which computer program code is stored in a computer readable storage medium, from which computer readable storage medium a processor of a terminal device reads the computer program code, which computer program code is executed by a processor, which computer program code causes the terminal device to perform the above-mentioned human-machine interaction method.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description of the preferred embodiments of the present application is not intended to limit the application; the scope of protection is defined by the appended claims.

Claims (14)

1. The man-machine interaction method is characterized by being applied to terminal equipment, wherein the terminal equipment is integrated with a voice interaction component, N service components and a custom acoustic model provided by an access party; the path of the acoustic model is arranged under the voice interaction component, and the voice interaction component externally provides an audio data receiving interface for waking up the terminal equipment; the voice interaction component is encapsulated with a Software Development Kit (SDK) related to voice interaction; the N business components are selected from a business component set provided by a developer according to the product requirements of the access party; one business component is used for providing at least one service for the terminal equipment, and N is a positive integer;
The method comprises the following steps:
receiving audio data acquired by the self-defined acoustic model and subjected to sound source positioning, echo cancellation and reverberation cancellation through the voice interaction component, wherein the audio data is input by a user voice;
Transmitting the audio data to a server through the voice interaction component; the server comprises an agent, a text-to-speech (TTS) module, a target service module, a skill configuration platform and a skill service module; the audio data is transmitted to the target service module after being received by the agent; the target service module is used for carrying out semantic analysis on the audio data, calling the skill configuration platform and acquiring semantic skill data of the audio data from the skill service module based on semantic analysis results; the semantic skill data includes non-voice form response data that matches the audio data; the TTS module is used for performing text-to-speech processing on the response data in a non-speech form to obtain the response data in a speech form;
Receiving the response data in a non-voice form and the response data in a voice form returned by the server through the voice interaction component;
notifying a first service component in the N service components to receive the response data in a non-voice form in a directional broadcasting mode through the voice interaction component; the first business component registers a callback function or a listener with the voice interaction component in advance;
Responding to the user voice input as a task type question, wherein the response data in a non-voice form is used for triggering the first service component to execute target operation indicated by the user voice input;
Responsive to the user voice input not being a task question, displaying the response data in a non-voice form through a user interface UI template of the first business component; or, in response to the user voice input not being a task-type question, playing, by the voice interaction component, the response data in voice form; or, in response to the user voice input not being a task question and the terminal device not integrating the first service component, displaying the response data in a non-voice form through a UI template of the voice interaction component;
the voice interaction component is connected with the server in a long way;
the method further comprises the steps of:
Receiving a push message issued by the server based on the long connection through the voice interaction component;
Notifying a second service component in the N service components to receive the push message in a directional broadcasting mode through the voice interaction component; the second business component registers a callback function or a listener with the voice interaction component in advance.
2. The method of claim 1, wherein the SDK comprises: speech recognition SDK, speech synthesis SDK, and text recognition SDK.
3. The method of claim 1, wherein the semantic skill data further comprises: question intents, knowledge domain to which the question belongs, and question text.
4. The method of claim 1, wherein during the collection of audio data, the method further comprises:
Acquiring a first voice signal acquired by a first microphone, wherein the first voice signal comprises a first sound source signal and a first noise signal;
Acquiring a second voice signal acquired by a second microphone, wherein the second voice signal comprises a second sound source signal and a second noise signal;
acquiring cross power spectrums of the first voice signal and the second voice signal on a frequency domain;
transforming the cross-power spectrum from a frequency domain to a time domain to obtain a cross-correlation function;
determining a time value corresponding to the maximum cross-correlation value as a propagation delay, wherein the propagation delay is an arrival time difference of a voice signal between the first microphone and the second microphone;
And performing sound source positioning based on the propagation delay, wherein the first microphone and the second microphone come from a microphone array of the terminal equipment.
5. The method of claim 1, wherein during the collection of audio data, the method further comprises:
Performing echo cancellation processing on the voice signals received by the microphone array based on the first filter;
wherein the filter function of the first filter approximates, as closely as possible, the impulse response from the loudspeaker to the microphone array; the voice signal received by the microphone array is determined according to a sound source signal, a noise signal, a voice signal played by the loudspeaker and the impulse response.
6. The method of claim 1, wherein during the collection of audio data, the method further comprises:
transforming the voice signal received by the microphone array from a time domain to a frequency domain to obtain a frequency domain signal;
Performing inverse filtering processing on the frequency domain signal based on a second filter to recover the sound source signal;
Wherein the speech signal received by the microphone array is determined from the sound source signal, the noise signal and the room impulse response from the sound source.
7. The man-machine interaction device is characterized by being applied to terminal equipment, wherein the terminal equipment is integrated with a voice interaction component, N service components and a custom acoustic model provided by an access party; the path of the acoustic model is arranged under the voice interaction component, and the voice interaction component externally provides an audio data receiving interface for waking up the terminal equipment; the voice interaction component is encapsulated with a Software Development Kit (SDK) related to voice interaction; the N business components are selected from a business component set provided by a developer according to the product requirements of the access party; one business component is used for providing at least one service for the terminal equipment, and N is a positive integer;
the custom acoustic model is configured to collect audio data, wherein the audio data is input by user voice;
The voice interaction component is configured to receive audio data collected by the custom acoustic model;
The voice interaction component is further configured to send the audio data to a server; the server comprises an agent, a text-to-speech (TTS) module, a target service module, a skill configuration platform and a skill service module; the audio data is transmitted to the target service module after being received by the agent; the target service module is used for carrying out semantic analysis on the audio data, calling the skill configuration platform and obtaining semantic skill data of the audio data from the skill service module based on the semantic analysis result; the semantic skill data includes non-voice form response data that matches the audio data; the TTS module is used for performing text-to-speech processing on the response data in a non-speech form to obtain the response data in a speech form;
The voice interaction component is further configured to receive the response data in a non-voice form and the response data in a voice form returned by the server;
The voice interaction component is further configured to notify a first service component of the N service components to receive the response data in a non-voice form in a directional broadcasting mode; the first business component registers a callback function or a listener with the voice interaction component in advance;
The first business component is configured to respond to the fact that the user voice input is a task question, and execute target operation indicated by the user voice input after receiving the response data in a non-voice form;
the first business component is further configured to respond to the user voice input not being a task question and display the response data in a non-voice form through a User Interface (UI) template;
The voice interaction component is further configured to, in response to the user voice input not being a task-type question, play the response data in voice form;
The voice interaction component is further configured to, in response to the user voice input not being a task-type question and the terminal device not integrating the first service component, display the response data in a non-voice form through a UI template;
the voice interaction component is connected with the server in a long way;
the voice interaction component is further configured to receive a push message issued by the server based on the long connection; and notify a second service component in the N service components to receive the push message in a directional broadcasting mode, wherein the second service component registers a callback function or a listener with the voice interaction component in advance.
8. The apparatus of claim 7, wherein the SDK comprises: speech recognition SDK, speech synthesis SDK, and text recognition SDK.
9. The apparatus of claim 7, wherein the semantic skill data further comprises: question intents, knowledge domain to which the question belongs, and question text.
10. The apparatus of claim 7, wherein the apparatus further comprises:
The sound source positioning module is configured to acquire a first voice signal acquired by the first microphone, wherein the first voice signal comprises a first sound source signal and a first noise signal; acquiring a second voice signal acquired by a second microphone, wherein the second voice signal comprises a second sound source signal and a second noise signal; acquiring cross power spectrums of the first voice signal and the second voice signal on a frequency domain; transforming the cross-power spectrum from a frequency domain to a time domain to obtain a cross-correlation function; determining a time value corresponding to the maximum cross-correlation value as a propagation delay, wherein the propagation delay is an arrival time difference of a voice signal between the first microphone and the second microphone; and performing sound source positioning based on the propagation delay, wherein the first microphone and the second microphone come from a microphone array of the terminal equipment.
11. The apparatus of claim 7, wherein the apparatus further comprises:
An echo cancellation module configured to perform echo cancellation processing on a voice signal received by the microphone array based on the first filter; wherein the filter function of the first filter approximates, as closely as possible, the impulse response from the loudspeaker to the microphone array; the voice signal received by the microphone array is determined according to a sound source signal, a noise signal, a voice signal played by the loudspeaker and the impulse response.
12. The apparatus of claim 7, wherein the apparatus further comprises:
The reverberation elimination module is configured to transform the voice signals received by the microphone array from a time domain to a frequency domain to obtain frequency domain signals; performing inverse filtering processing on the frequency domain signal based on a second filter to recover the sound source signal; wherein the speech signal received by the microphone array is determined from the sound source signal, the noise signal and the room impulse response from the sound source.
13. A terminal device, characterized in that it comprises a processor and a memory, in which at least one program code is stored, which is loaded and executed by the processor to implement the human-machine interaction method according to any of claims 1-6.
14. A computer readable storage medium, characterized in that at least one program code is stored in the storage medium, which is loaded and executed by a processor to implement the human-machine interaction method of any one of claims 1 to 6.
CN202011202667.1A 2020-11-02 2020-11-02 Man-machine interaction method and device, storage medium and terminal equipment Active CN113409805B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011202667.1A CN113409805B (en) 2020-11-02 2020-11-02 Man-machine interaction method and device, storage medium and terminal equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011202667.1A CN113409805B (en) 2020-11-02 2020-11-02 Man-machine interaction method and device, storage medium and terminal equipment

Publications (2)

Publication Number Publication Date
CN113409805A CN113409805A (en) 2021-09-17
CN113409805B true CN113409805B (en) 2024-06-07

Family

ID=77677452

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011202667.1A Active CN113409805B (en) 2020-11-02 2020-11-02 Man-machine interaction method and device, storage medium and terminal equipment

Country Status (1)

Country Link
CN (1) CN113409805B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114047900A (en) * 2021-10-12 2022-02-15 中电金信软件有限公司 Business processing method, apparatus, electronic device, and computer-readable storage medium
CN114633752A (en) * 2022-03-04 2022-06-17 阿波罗智能技术(北京)有限公司 Mode switching method and device of automatic driving system and electronic equipment
CN114898746A (en) * 2022-04-12 2022-08-12 青岛海尔科技有限公司 Interaction method and device, storage medium and electronic device
CN119512371B (en) * 2024-11-05 2025-07-04 杭州傲雪睿视科技有限公司 An intelligent control system for MR head display
CN120523058A (en) * 2025-07-23 2025-08-22 宁波财经学院 Intelligent household robot control system based on multi-terminal data analysis

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101878637A (en) * 2007-11-29 2010-11-03 艾利森电话股份有限公司 Methods and arrangements for echo cancellation of speech signals
CN102693729A (en) * 2012-05-15 2012-09-26 北京奥信通科技发展有限公司 Customized voice reading method, system, and terminal possessing the system
DE102012219019A1 (en) * 2011-10-21 2013-04-25 GM Global Technology Operations LLC (n.d. Ges. d. Staates Delaware) Mobile voice platform for user speech interface used for providing e.g. cloud service to driver of e.g. sports utility vehicle, receives service result from desired service, and provides text-based service response to user
CN103095325A (en) * 2011-10-21 2013-05-08 通用汽车环球科技运作有限责任公司 Mobile voice platform architecture with remote service interfaces
CN108632653A (en) * 2018-05-30 2018-10-09 腾讯科技(深圳)有限公司 Voice management-control method, smart television and computer readable storage medium
CN109887505A (en) * 2019-03-11 2019-06-14 百度在线网络技术(北京)有限公司 Method and apparatus for wake-up device
CN110265012A (en) * 2019-06-19 2019-09-20 泉州师范学院 Interactive intelligent voice home control device and control method based on open source hardware
CN110856023A (en) * 2019-11-15 2020-02-28 四川长虹电器股份有限公司 System and method for realizing customized broadcast of smart television based on TTS
CN111031141A (en) * 2019-12-24 2020-04-17 苏州思必驰信息科技有限公司 Method and server for realizing customized configuration of voice skills
CN111063353A (en) * 2019-12-31 2020-04-24 苏州思必驰信息科技有限公司 Client processing method allowing user-defined voice interactive content and user terminal
CN111142833A (en) * 2019-12-26 2020-05-12 苏州思必驰信息科技有限公司 Method and system for developing voice interactive products based on scenario mode
CN111341345A (en) * 2020-05-21 2020-06-26 深圳市友杰智新科技有限公司 Control method and device of voice equipment, voice equipment and storage medium
CN111816168A (en) * 2020-07-21 2020-10-23 腾讯科技(深圳)有限公司 A model training method, voice playback method, device and storage medium
CN111833857A (en) * 2019-04-16 2020-10-27 阿里巴巴集团控股有限公司 Voice processing method and device and distributed system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050261907A1 (en) * 1999-04-12 2005-11-24 Ben Franklin Patent Holding Llc Voice integration platform

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101878637A (en) * 2007-11-29 2010-11-03 艾利森电话股份有限公司 Methods and arrangements for echo cancellation of speech signals
DE102012219019A1 (en) * 2011-10-21 2013-04-25 GM Global Technology Operations LLC (n.d. Ges. d. Staates Delaware) Mobile voice platform for user speech interface used for providing e.g. cloud service to driver of e.g. sports utility vehicle, receives service result from desired service, and provides text-based service response to user
CN103095325A (en) * 2011-10-21 2013-05-08 通用汽车环球科技运作有限责任公司 Mobile voice platform architecture with remote service interfaces
CN102693729A (en) * 2012-05-15 2012-09-26 北京奥信通科技发展有限公司 Customized voice reading method, system, and terminal possessing the system
CN108632653A (en) * 2018-05-30 2018-10-09 腾讯科技(深圳)有限公司 Voice management-control method, smart television and computer readable storage medium
CN109887505A (en) * 2019-03-11 2019-06-14 百度在线网络技术(北京)有限公司 Method and apparatus for wake-up device
CN111833857A (en) * 2019-04-16 2020-10-27 阿里巴巴集团控股有限公司 Voice processing method and device and distributed system
CN110265012A (en) * 2019-06-19 2019-09-20 泉州师范学院 Interactive intelligent voice home control device and control method based on open source hardware
CN110856023A (en) * 2019-11-15 2020-02-28 四川长虹电器股份有限公司 System and method for realizing customized broadcast of smart television based on TTS
CN111031141A (en) * 2019-12-24 2020-04-17 苏州思必驰信息科技有限公司 Method and server for realizing customized configuration of voice skills
CN111142833A (en) * 2019-12-26 2020-05-12 苏州思必驰信息科技有限公司 Method and system for developing voice interactive products based on scenario mode
CN111063353A (en) * 2019-12-31 2020-04-24 苏州思必驰信息科技有限公司 Client processing method allowing user-defined voice interactive content and user terminal
CN111341345A (en) * 2020-05-21 2020-06-26 深圳市友杰智新科技有限公司 Control method and device of voice equipment, voice equipment and storage medium
CN111816168A (en) * 2020-07-21 2020-10-23 腾讯科技(深圳)有限公司 A model training method, voice playback method, device and storage medium

Also Published As

Publication number Publication date
CN113409805A (en) 2021-09-17

Similar Documents

Publication Publication Date Title
CN113409805B (en) Man-machine interaction method and device, storage medium and terminal equipment
CN110379430B (en) Animation display method and device based on voice, computer equipment and storage medium
CN111986691B (en) Audio processing method, device, computer equipment and storage medium
CN110503959B (en) Voice recognition data distribution method and device, computer equipment and storage medium
CN113380275B (en) Voice processing method and device, intelligent equipment and storage medium
CN111445901B (en) Audio data acquisition method and device, electronic equipment and storage medium
CN110198484A (en) Message pushing method, device and equipment
CN112742024B (en) Virtual object control method, device, equipment and storage medium
CN111343346B (en) Incoming call pickup method and device based on man-machine conversation, storage medium and equipment
CN111739517B (en) Speech recognition method, device, computer equipment and medium
CN110798327B (en) Message processing method, device and storage medium
CN113190307A (en) Control adding method, device, equipment and storage medium
CN111681655A (en) Voice control method and device, electronic equipment and storage medium
CN111681654A (en) Voice control method and device, electronic equipment and storage medium
CN111428079A (en) Text content processing method and device, computer equipment and storage medium
CN111554314B (en) Noise detection method, device, terminal and storage medium
CN114333774A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN111245629B (en) Conference control method, device, equipment and storage medium
KR20150029197A (en) Mobile terminal and operation method thereof
CN115331689B (en) Training method, device, equipment, storage medium and product of speech noise reduction model
CN112750449B (en) Echo cancellation method, device, terminal, server and storage medium
HK40051737A (en) Human-computer interaction method, device, storage medium and terminal equipment
CN113763932A (en) Voice processing method and device, computer equipment and storage medium
CN117711410B (en) Voice wake-up method and related equipment
CN111524533B (en) Voice operation method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40051737

Country of ref document: HK

SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant