
CN120600029A - Intelligent agent dialogue system and method - Google Patents

Intelligent agent dialogue system and method

Info

Publication number
CN120600029A
Authority
CN
China
Prior art keywords
audio
data
module
text
transmission
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202511093085.7A
Other languages
Chinese (zh)
Inventor
文一然
庞文刚
任奕
邹西山
潘东玮
周静纯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Unicom WO Music and Culture Co Ltd
Original Assignee
China Unicom WO Music and Culture Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Unicom WO Music and Culture Co Ltd filed Critical China Unicom WO Music and Culture Co Ltd
Priority to CN202511093085.7A priority Critical patent/CN120600029A/en
Publication of CN120600029A publication Critical patent/CN120600029A/en
Pending legal-status Critical Current

Landscapes

  • Telephonic Communication Services (AREA)

Abstract


The present invention provides an intelligent agent dialogue system and method, relating to the field of audio conversion technology, comprising: an acquisition module for acquiring user audio data; an audio conversion module for setting the transmission mode of the audio data based on the data network signal strength of the user end, and, when the transmission mode is communication signal transmission, encoding the audio data to obtain an audio stream; a speech recognition processing module for performing real-time analysis based on the audio stream to obtain analysis results, and optimizing the analysis results using a language acoustic model to obtain text information; an intention recognition and response module for generating the response text of the audio data based on the text information; and a speech synthesis module for performing speech synthesis on the response text to obtain audio data of the response text. The present invention uses text transmission instead of audio streaming, significantly reducing the data transmission volume, improving transmission efficiency, reducing delays and errors caused by network congestion, and improving the stability of the intelligent agent audio dialogue.

Description

Intelligent agent dialogue system and method
Technical Field
The invention relates to the technical field of audio conversion, in particular to an agent dialogue system and method.
Background
In existing intelligent dialogue scenarios, users typically interact with an intelligent agent by entering text through a mobile application or a web page. However, when the user cannot conveniently type, for example while driving or exercising, it is difficult to communicate with the agent. In such cases, the user generally uses an APP on a terminal device such as a mobile phone to hold an audio call with the agent.
In the related art, most agent APPs rely on a data network to carry the dialogue, so existing voice interaction depends heavily on data network signal quality. In scenes with a poor network signal, the conversation with the agent becomes choppy, and the user's requirement for instant feedback from the agent cannot be met.
Disclosure of Invention
The invention solves the problem of how to improve the stability of the agent audio dialogue.
In order to solve the above problems, the present invention provides an agent dialogue system and method.
In a first aspect, the present invention provides an agent dialogue system, including a user side and a server side, where an output of the user side is connected with the server side in a communication manner;
The user terminal comprises an acquisition module, an audio conversion module, a voice recognition processing module and a voice synthesis module which are sequentially connected in a communication way, and the server terminal comprises an intention recognition and answer module;
The output end of the voice recognition processing module of the user end is connected with the input end of the intention recognition and answer module of the server end, and the output end of the intention recognition and answer module is connected with the input end of the voice synthesis module of the user end;
The acquisition module is used for acquiring audio data of a user;
The audio conversion module is used for setting a transmission mode of the audio data according to the data network signal strength of the user side, and when the transmission mode is communication signal transmission, the audio data is encoded according to a transmission protocol of communication push stream to obtain an audio stream corresponding to the audio data;
the voice recognition processing module is used for carrying out real-time analysis through deep learning according to the audio stream to obtain an analysis result, and optimizing the analysis result through a language acoustic model to obtain text information corresponding to the audio data;
The intention recognition and answer module is used for determining the field and intention of the text information according to the text information corresponding to the audio data and combining context information, and obtaining answer text of the audio data according to the field and the intention and combining a logical reasoning algorithm and a language generation algorithm;
The voice synthesis module is used for performing voice synthesis on the answer text to obtain an audio stream corresponding to the answer text, and performing coding processing on the audio stream corresponding to the answer text according to the transmission protocol of the communication push stream to obtain audio data of the answer text.
Optionally, the audio conversion module is specifically configured to:
acquiring the data network signal strength of the user side;
setting the transmission mode of the audio data according to the magnitude relation between the signal intensity of the data network and a preset signal intensity threshold;
When the signal intensity of the data network is smaller than the preset signal intensity threshold, setting the transmission mode of the audio data as the communication signal transmission;
And setting the transmission mode of the audio data to be data network transmission when the signal strength of the data network is greater than or equal to the preset signal strength threshold.
Optionally, the voice recognition processing module is specifically configured to:
Carrying out framing treatment on the audio stream, and dividing continuous audio signals in the audio stream into audio frames of a plurality of time periods;
Extracting acoustic features of each audio frame to obtain acoustic features of the audio frames;
inputting the acoustic features into a deep learning model for nonlinear mapping, and generating a preliminary voice recognition result;
and carrying out grammar and semantic constraint on the preliminary voice recognition result through the language acoustic model to obtain the text information corresponding to the audio data.
Optionally, the intention recognition and answer module is specifically configured to:
performing word segmentation and part-of-speech tagging on the text information to obtain a basic structure and grammar characteristics of the text information;
carrying out semantic understanding and classification on the text information according to the basic structure and the grammar characteristics and by combining the context information through an intention classification model to obtain the field and the intention of the text information;
according to the domain and the intention, combining the logic reasoning algorithm to generate a logic reasoning result corresponding to the domain and the intention;
and generating the answer text according to the logical reasoning result by utilizing the language generation algorithm.
Optionally, the voice synthesis module is specifically configured to:
Converting the answer text into a voice signal according to preset acoustic parameters through a voice synthesis model;
performing noise suppression and echo cancellation processing on the voice signal to obtain a processed voice signal;
and encoding the processed voice signal according to a transmission protocol of communication push stream, and generating the audio stream corresponding to the answer text.
Optionally, the audio conversion module is specifically further configured to:
And when the transmission mode is the data network transmission, the audio data is encoded according to a user datagram protocol, so as to obtain network transmission data of the audio data.
Optionally, the voice synthesis module is specifically configured to:
when the transmission mode of the audio data is the data network transmission, determining a speech rate parameter, an intonation parameter, a volume parameter and a tone parameter of a speech synthesis model according to the signal strength of the data network;
Optimizing preset acoustic parameters of a speech synthesis model according to the speech speed parameter, the intonation parameter, the volume parameter and the tone parameter to obtain the optimized speech synthesis model;
and performing dynamic speech synthesis on the answer text by combining the optimized speech synthesis model with a forward error correction and redundancy coding strategy to obtain the audio stream corresponding to the answer text.
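A minimal sketch of how these two optional behaviours could look in practice, offered purely as an assumption on top of the description: the mapping from signal strength to synthesis parameters and the XOR parity scheme below are illustrative placeholders, not the patent's actual algorithm.

```python
# Sketch only: derive synthesis parameters from signal strength and add simple redundancy.
def synthesis_params(rssi_dbm: float) -> dict:
    """Weaker signal -> slower speech and slightly higher volume (assumed mapping)."""
    weak = rssi_dbm < -85
    return {"rate": 0.9 if weak else 1.0, "intonation": 1.0,
            "volume": 1.1 if weak else 1.0, "tone": "default"}

def add_redundancy(packets: list[bytes], group: int = 4) -> list[bytes]:
    """Append an XOR parity packet per group as a toy forward-error-correction step."""
    out = []
    for i in range(0, len(packets), group):
        chunk = packets[i:i + group]
        parity = bytearray(max(len(p) for p in chunk))
        for p in chunk:
            for j, b in enumerate(p):
                parity[j] ^= b
        out.extend(chunk + [bytes(parity)])
    return out
```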
Optionally, the intention recognition and answer module is specifically configured to:
according to the data network signal intensity, determining network fluctuation data of the user side;
predicting according to the network fluctuation data to obtain future network fluctuation data of the answer text in the transmission period;
and sending the answer text to the user terminal according to the future network fluctuation data.
Optionally, the intention recognition and answer module is specifically further configured to:
determining the transmission frequency of the answer text according to the future network fluctuation data, and dividing the answer text into a plurality of text fragments;
generating a sending plan of the answer text according to the transmission frequency of the answer text and the text segment;
and sending the text fragments of the answer text to the user side according to the sending plan.
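The segmentation and send-plan logic recited above could be prototyped roughly as follows; the fragment size and the way predicted fluctuation widens the sending interval are assumptions for illustration, not values from the patent.

```python
# Illustrative only: split the answer text and schedule sends from predicted fluctuation.
from dataclasses import dataclass

@dataclass
class SendPlan:
    interval_s: float          # transmission frequency derived from predicted fluctuation
    segments: list[str]        # answer text divided into fragments

def build_send_plan(answer: str, predicted_fluctuation: float,
                    base_interval_s: float = 0.2, seg_len: int = 40) -> SendPlan:
    """Worse predicted fluctuation -> smaller, more widely spaced fragments (assumed policy)."""
    factor = 1.0 + max(0.0, predicted_fluctuation)   # e.g., a normalized jitter estimate
    segments = [answer[i:i + seg_len] for i in range(0, len(answer), seg_len)]
    return SendPlan(interval_s=base_interval_s * factor, segments=segments)

plan = build_send_plan("今天北京晴，气温18到27摄氏度，适合穿短袖。", predicted_fluctuation=0.5)
print(plan.interval_s, plan.segments)
```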
In a second aspect, the present invention provides an agent dialogue method, applied to any one of the above agent dialogue systems, where the agent dialogue system includes a user end and a server end, and an output of the user end is connected with the server end in a communication manner;
The user terminal comprises an acquisition module, an audio conversion module, a voice recognition processing module and a voice synthesis module which are sequentially connected in a communication way, and the server terminal comprises an intention recognition and answer module;
The output end of the voice recognition processing module of the user end is connected with the input end of the intention recognition and answer module of the server end, and the output end of the intention recognition and answer module is connected with the input end of the voice synthesis module of the user end;
the agent dialogue method comprises the following steps:
acquiring audio data of a user through the acquisition module;
Setting a transmission mode of the audio data according to the data network signal strength of the user side through the audio conversion module, and when the transmission mode is communication signal transmission, carrying out coding processing on the audio data according to a transmission protocol of communication push stream to obtain an audio stream corresponding to the audio data;
Real-time analysis is carried out through deep learning according to the audio stream by the voice recognition processing module to obtain an analysis result, and the analysis result is optimized through a language acoustic model to obtain text information corresponding to the audio data;
Determining the field and the intention of the text information according to the text information corresponding to the audio data and combining context information through the intention recognition and answer module, and obtaining answer text of the audio data according to the field and the intention and combining a logical reasoning algorithm and a language generation algorithm;
And performing voice synthesis on the answer text through the voice synthesis module to obtain an audio stream corresponding to the answer text, and performing coding processing on the audio stream corresponding to the answer text according to the transmission protocol of the communication push stream to obtain the audio data of the answer text.
In the intelligent agent dialogue system and method, after the voice recognition processing module converts the audio stream into text information, only the text information is transmitted to the server side. Because the volume of text data is far smaller than that of audio data, the data load in transmission is greatly reduced. This lightweight transmission lowers the bandwidth requirement on the data network, so the system can still communicate efficiently when the data network signal is unstable or the bandwidth is limited. Specifically, the system divides the whole conversation process into several functional modules: an acquisition module, an audio conversion module, a voice recognition processing module and a voice synthesis module at the user side, and an intention recognition and answer module at the server side. The acquisition module captures the user's voice instruction, converts it into audio data and provides the original input for subsequent processing. The audio conversion module judges and switches the transmission mode of the audio data according to the data network signal strength monitored in real time; when the transmission mode is communication signal transmission, it encodes the audio data with the communication push stream protocol to generate an audio stream. This effectively reduces the dependence on the data network, keeps the communication continuous and stable in areas with poor signal, and greatly expands the application scenarios of the system: the user can keep a smooth dialogue with the intelligent agent in remote mountain areas, underground parking lots, or vehicles travelling at high speed. After receiving the audio stream, the voice recognition processing module parses it in real time with deep learning, extracts the key features of the audio signal, converts them into text information, and then optimizes the preliminary parsing result with the language acoustic model to improve the accuracy and consistency of the text. The optimized text information is sent to the intention recognition and answer module at the server side, which determines the domain and intention of the text by combining the context information and uses a logical reasoning algorithm and a language generation algorithm to generate answer text matched with the user's intention. This process takes the coherence and consistency of the dialogue fully into account and ensures that the answer is reasonable and accurate. After receiving the answer text, the voice synthesis module converts it into a natural, fluent voice signal, which is then processed again by the audio conversion module according to the transmission protocol of the communication push stream to generate the audio data of the answer. By optimizing the communication process and using text transmission instead of audio streaming, the invention significantly reduces the amount of data transmitted, improves transmission efficiency, and reduces the delay and errors caused by network congestion.
Through the close cooperation of these modules, the system gives full play to the advantages of text transmission, intelligent selection of the transmission mode, efficient voice recognition and processing, intelligent intention recognition and answer generation, and optimized voice synthesis and transmission. It solves the problem in the prior art that dialogue becomes unstable because voice interaction depends strongly on data network signals, and provides users with a smoother, more natural and more accurate voice interaction experience.
Drawings
FIG. 1 is a schematic diagram of an intelligent agent dialogue system according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for intelligent agent dialogue according to another embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the invention may be more readily understood, a more particular description of the invention is given below with reference to specific embodiments illustrated in the appended drawings. While certain embodiments of the invention are shown in the drawings, it should be understood that the invention may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the invention. It should be understood that the drawings and embodiments of the invention are for illustration purposes only and are not intended to limit the scope of the present invention.
It should be understood that the various steps recited in the method embodiments of the present invention may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the invention is not limited in this respect.
The term "comprising" and variations thereof as used herein is meant to be open-ended, i.e., "including but not limited to," based at least in part on, "one embodiment" means "at least one embodiment," another embodiment "means" at least one additional embodiment, "some embodiments" means "at least some embodiments," and "optional" means "optional embodiment. Related definitions of other terms will be given in the description below. It should be noted that the concepts of "first", "second", etc. mentioned in this disclosure are only used to distinguish between different devices, modules or units, and are not intended to limit the order or interdependence of functions performed by these devices, modules or units.
It should be noted that references to "a" or "an" in this disclosure are illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the devices in the embodiments of the present invention are for illustrative purposes only and are not intended to limit the scope of such messages or information.
In view of the above problems associated with the related art, the present embodiment provides an agent dialogue system and method.
The intelligent agent dialogue system provided by the embodiment of the invention comprises a user end and a server end, wherein the output of the user end is in communication connection with the server end, the user end comprises a collecting module, an audio conversion module, a voice recognition processing module and a voice synthesis module which are in communication connection in sequence, the server end comprises an intention recognition and answer module, the output end of the voice recognition processing module of the user end is connected with the input end of the intention recognition and answer module of the server end, and the output end of the intention recognition and answer module is connected with the input end of the voice synthesis module of the user end.
Specifically, at the user end the acquisition module, the audio conversion module, the voice recognition processing module and the voice synthesis module communicate in sequence and form the front-end link of voice processing. The acquisition module, as the starting point, is responsible for acquiring the user's audio data and provides the raw voice material for subsequent processing. The audio conversion module then takes on the key role of judging the transmission mode according to the data network signal strength at the user side: when the data network signal is poor, it can flexibly switch to communication signal transmission and encode the audio data according to the transmission protocol of the communication push stream to generate an audio stream, ensuring stable transmission of the audio data under different network conditions. The voice recognition processing module receives the audio stream, parses it in real time with deep learning technology, extracts the key features of the audio signal and converts them into text information, then optimizes the preliminary parsing result with a language acoustic model to improve the accuracy and consistency of the text. The output of this module is connected to the input of the intention recognition and answer module at the server end, realizing seamless docking between the user end and the server end. After the intention recognition and answer module at the server end receives the text information, it determines the domain and intention of the text in combination with the context information and generates answer text matched with the user's intention using a logical reasoning algorithm and a language generation algorithm. The output of the intention recognition and answer module is connected to the input of the voice synthesis module at the user end, and the voice synthesis module converts the answer text into a natural, fluent voice signal and generates the audio data.
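To make the module chain easier to follow, here is a purely illustrative end-to-end sketch in Python; every function body is a stub standing in for the corresponding module described above, and the threshold value is an assumption rather than a figure from the patent.

```python
# Illustrative pipeline of the described module chain; every function body is a stub.
RSSI_THRESHOLD_DBM = -85  # assumed preset threshold, see the audio conversion module below

def audio_convert(raw: bytes, rssi_dbm: float) -> bytes:
    """Pick the transport and encode for the communication push stream when the signal is weak."""
    return b"AMR" + raw if rssi_dbm < RSSI_THRESHOLD_DBM else raw

def speech_recognize(stream: bytes) -> str:
    return "what is the weather in Beijing today"       # stand-in for ASR output

def intent_and_answer(text: str) -> str:
    return "Beijing is sunny today, 18 to 27 degrees."  # stand-in for the server-side module

def speech_synthesize(answer_text: str) -> bytes:
    return answer_text.encode("utf-8")                  # stand-in for synthesized audio

def agent_dialogue_turn(raw_audio: bytes, rssi_dbm: float) -> bytes:
    stream = audio_convert(raw_audio, rssi_dbm)          # user side: choose transport, encode
    text = speech_recognize(stream)                      # user side: audio stream -> text
    answer_text = intent_and_answer(text)                # server side: only text crosses the network
    return speech_synthesize(answer_text)                # user side: answer text -> playable audio
```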
The acquisition module is used for acquiring the audio data of the user.
Specifically, as the initial module at the user side, the acquisition module captures the user's audio data through a device such as a microphone. It is the forefront link of the whole intelligent agent dialogue system: it ensures that the voice information sent by the user is accurately collected and provides the original data input for the subsequent processing modules, and its performance directly influences the effectiveness and usability of the entire system.
The audio conversion module is used for setting the transmission mode of the audio data according to the data network signal intensity of the user side, and when the transmission mode is communication signal transmission, the audio data is encoded according to a communication push stream transmission protocol to obtain an audio stream corresponding to the audio data.
Specifically, the audio conversion module is located at the user side. First, it monitors the data network signal strength at the user side in real time, for example by invoking the Application Programming Interface (API) related to the network interface of the device to obtain the signal strength value; in an Android system, the network signal strength can be obtained with classes such as ConnectivityManager and NetworkInfo. It then judges the transmission mode against the preset signal strength threshold and selects the communication signal transmission mode when the data network signal strength is below the threshold. The transmission protocol of the communication push stream may be the Real-time Transport Protocol (RTP) or the like; the audio data is encoded according to this protocol, including packetizing the audio data and adding protocol header information (such as a sequence number and a timestamp) to ensure the order and synchronization of the audio data in network transmission, thereby obtaining the audio stream corresponding to the audio data and facilitating its subsequent stable transmission in a communication channel.
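As a rough, non-authoritative sketch of the packetization step just described (the payload type, SSRC and frame contents below are placeholders, not values from the patent), an RTP-style header carrying a sequence number and timestamp could be built like this:

```python
import struct

def build_rtp_packet(payload: bytes, seq: int, timestamp: int,
                     ssrc: int = 0x12345678, payload_type: int = 97) -> bytes:
    """Prepend a minimal RTP fixed header (RFC 3550, no CSRC/extension) to an encoded frame."""
    byte0 = 0x80                      # version=2, no padding/extension, CC=0
    byte1 = payload_type & 0x7F       # marker bit cleared
    header = struct.pack("!BBHII", byte0, byte1, seq & 0xFFFF,
                         timestamp & 0xFFFFFFFF, ssrc)
    return header + payload

# Example: packetize 20 ms audio frames produced by an encoder (placeholder frame bytes).
frames = [b"\x00" * 160, b"\x01" * 160]
packets = [build_rtp_packet(f, seq=i, timestamp=i * 160) for i, f in enumerate(frames)]
```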
The voice recognition processing module is used for carrying out real-time analysis through deep learning according to the audio stream to obtain an analysis result, and optimizing the analysis result through a language acoustic model to obtain text information corresponding to the audio data.
Specifically, the voice recognition processing module is located at the user end. After receiving the audio stream, it analyzes the stream in real time using a deep learning algorithm (such as a convolutional neural network (CNN) or a recurrent neural network (RNN)). The deep learning model is trained in advance on a large amount of labelled voice data and learns the mapping between voice signals and text. During parsing, the model extracts features from the audio stream (such as Mel-frequency cepstral coefficients, MFCC), predicts the corresponding text content from those features, and obtains a preliminary parsing result. The parsing result is then optimized with the language acoustic model, which can be based on a statistical or neural network language model; it adjusts and corrects words and sentences in the preliminary result according to the grammar rules of the language, word collocation habits and the like, filters out unreasonable content, and finally yields more accurate text information corresponding to the audio data.
The intention recognition and answer module is used for determining the field and intention of the text information according to the text information corresponding to the audio data and combining the context information, and obtaining answer text of the audio data according to the field and the intention and combining a logical reasoning algorithm and a language generation algorithm.
Specifically, the intention recognition and answer module is located at the server end. It receives the text information transmitted by the voice recognition processing module and determines the domain and intention of that text in combination with context information. The context information may include the user's previous dialogue content, the dialogue scene (for example, a driving scenario or a customer service scenario), and so on. For example, if the user previously asked about the weather during the conversation and now asks "is it suitable to wear short sleeves today", the module can determine from the context that the domain is weather and that the intention is to ask how the weather conditions relate to clothing. After the domain and intention are determined, a logical reasoning algorithm (such as a rule-based reasoning algorithm or a Bayesian network reasoning algorithm) and a language generation algorithm (such as a sequence-to-sequence model, Seq2Seq) are used to generate the corresponding answer text: the logical reasoning algorithm performs logical analysis and deduction on the question, and the language generation algorithm generates natural, fluent answer text from the deduction result. In addition, if the user side is a device connected through Bluetooth, the answer text is transmitted to the Bluetooth module of the device using the Bluetooth protocol; if the user side is connected through a mobile data network or a Wi-Fi network, the answer text is transmitted through the corresponding network protocol.
The voice synthesis module is used for performing voice synthesis on the answer text to obtain an audio stream corresponding to the answer text, and performing coding processing on the audio stream corresponding to the answer text according to the transmission protocol of the communication push stream to obtain audio data of the answer text.
Specifically, the voice synthesis module is located at the user end. Voice synthesis can use a concatenative (splicing) method: a large number of voice phonemes and a corpus are stored in advance, and appropriate phonemes are selected from the corpus and spliced according to the content of the answer text to generate the corresponding voice signal. A parametric synthesis method may also be used, generating the voice signal by controlling parameters of the speech synthesizer (such as pitch frequency and timbre parameters) according to the content of the answer text. After digital signal processing (such as filtering and gain control), the generated voice signal yields the audio stream corresponding to the answer text, so that the audio matches the user's listening habits in terms of sound quality, speaking rate, tone and so on. Finally, playable answer audio data is obtained. The voice synthesis module can also adjust parameters such as the speaking rate, intonation and timbre of the synthesized voice for different scenes and requirements, making the synthesized voice more natural, fluent and expressive and improving the user's acceptance of and satisfaction with the agent's answers.
Optionally, the audio conversion module is specifically configured to:
acquiring the data network signal strength of the user side;
setting the transmission mode of the audio data according to the magnitude relation between the signal intensity of the data network and a preset signal intensity threshold;
When the signal intensity of the data network is smaller than the preset signal intensity threshold, setting the transmission mode of the audio data as the communication signal transmission;
And setting the transmission mode of the audio data to be data network transmission when the signal strength of the data network is greater than or equal to the preset signal strength threshold.
Specifically, on the user's smart device (such as a smart phone or tablet), the audio conversion module obtains the data network signal strength by calling an Application Programming Interface (API) provided by the operating system of the device; in an Android system, for example, the TelephonyManager class can be used to obtain signal strength information. In this way the signal strength of the data network (such as 4G or 5G) to which the device is currently connected can be acquired in real time, usually measured by parameters such as the Received Signal Strength Indicator (RSSI). To reflect changes in the network signal in a timely and accurate way, the audio conversion module may acquire the data network signal strength periodically (for example, every 1-2 seconds), or acquire the updated value immediately when it detects that the network state may have changed (for example, triggered by an event such as the user switching networks or entering or leaving a coverage area). This ensures that the system can quickly perceive changes in the network signal in different network environments.
In a software configuration file at the user side of the agent dialogue system, a signal strength threshold is determined based on a large amount of network test data and actual user experience. Generally, the threshold is set at a value that ensures the normal conversation does not suffer noticeable stalling, delay and so on while the audio data is being transmitted over the data network. For example, for a 4G network the threshold may be set at a certain RSSI value (e.g., around -85 dBm): when the signal strength is higher than this value, the data network is considered to have good transmission quality and audio data can be transmitted normally; when the signal strength is lower than this value, the data network may be unstable and it is necessary to switch to the communication signal transmission mode. It should be noted that communication signal transmission relies mainly on conventional telecommunication network infrastructure, with circuit-switched and packet-switched technologies as common implementations; circuit switching, such as the public switched telephone network, establishes a dedicated physical circuit for both parties to ensure stable transmission of voice signals. Data network transmission, by contrast, relies mainly on the internet and achieves efficient transmission of the audio data through network protocols such as the real-time transport protocol or the real-time messaging protocol; it fully utilizes existing internet infrastructure, provides efficient and flexible communication service, and is particularly suitable for scenes with good network conditions.
When the acquired data network signal strength is greater than or equal to the preset signal strength threshold, the audio conversion module sets the transmission mode of the audio data to data network transmission. In this case, the audio data can be sent directly through an existing data network channel (for example, the user's mobile data connection). For instance, if the user is in an area with good 5G coverage and strong signal, the high bandwidth and low latency of the 5G network allow the audio data to be transmitted quickly from the user terminal to the server, and the answer audio data processed by the server can likewise be returned quickly over the 5G network. During data network transmission, the audio data is encapsulated and transmitted according to a common network transmission protocol (such as TCP/IP). When the data network signal strength is smaller than the preset threshold, the audio conversion module switches the transmission mode of the audio data to communication signal transmission. Communication signal transmission typically uses a conventional communication network (for example, a circuit-switched network), such as the communication channel the handset relies on for voice calls, to transmit the audio data. In an implementation, the audio data is encoded according to the transport protocol of the communication push stream. This coding is typically implemented by specific speech coding algorithms (such as AMR-NB/WB) that compress the audio data into an audio stream format suitable for transmission over a communication channel while keeping the audio quality within an acceptable range. The encoded audio stream is then transmitted over the communication network using the device's communication module (such as a mobile phone's baseband processor).
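A minimal sketch of the switching decision described above, under the assumption of the -85 dBm example threshold; the helper names and the tagging "encoder" are placeholders rather than the patent's actual implementation:

```python
RSSI_THRESHOLD_DBM = -85   # example threshold from the description above

def choose_transport(rssi_dbm: float) -> str:
    """Pick the transmission mode from the measured data-network signal strength."""
    if rssi_dbm >= RSSI_THRESHOLD_DBM:
        return "data_network"        # good signal: send over the data network (e.g., TCP/IP)
    return "communication_signal"    # weak signal: encode and push over the voice channel

def encode_for_push_stream(frame: bytes) -> bytes:
    """Placeholder for a real speech codec (e.g., AMR-NB/WB); here the frame is only tagged."""
    return b"AMR" + frame

def transmit(audio_frame: bytes, rssi_dbm: float) -> bytes:
    if choose_transport(rssi_dbm) == "communication_signal":
        return encode_for_push_stream(audio_frame)
    return audio_frame               # sent as-is over the data network channel
```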
In the embodiment of the invention, the transmission mode is selected flexibly according to the data network signal strength, which effectively avoids interruption, stalling or loss of audio data caused by a poor data network signal. For example, when the user is in a place with poor data network coverage, such as a basement or a corner of a building, the system can automatically switch to the communication signal transmission mode and use the communication signal to ensure stable transmission of the audio data. Compared with the traditional approach of relying only on a data network for voice interaction, this greatly improves the stability of the audio conversation, so that the user can keep a smooth dialogue with the intelligent agent in a variety of complex network environments. When the data network signal is poor, communication signal transmission, as a relatively stable transmission mode, ensures that the audio data is fully transmitted from the user terminal to the server and that the answer audio data processed by the server is fully returned to the user terminal. This avoids incomplete dialogue content or misunderstanding of the user's intention caused by data loss, improving the reliability and user experience of the intelligent dialogue system. When the data network signal is good, transmitting the audio data over the data network saves communication signal resources. At the same time, the intelligent switching mechanism helps reduce the user's data traffic consumption, because the data network is used to transmit audio data only when its signal is good enough, avoiding traffic wasted on repeated retransmission when the signal is bad, while the transmission quality is maintained. For communication operators, the loads of the data network and the communication network can also be balanced to a certain extent, improving the resource utilization efficiency of the whole communication system.
Optionally, the voice recognition processing module is specifically configured to:
Carrying out framing treatment on the audio stream, and dividing continuous audio signals in the audio stream into audio frames of a plurality of time periods;
Extracting acoustic features of each audio frame to obtain acoustic features of the audio frames;
inputting the acoustic features into a deep learning model for nonlinear mapping, and generating a preliminary voice recognition result;
and carrying out grammar and semantic constraint on the preliminary voice recognition result through the language acoustic model to obtain the text information corresponding to the audio data.
In particular, in the agent dialogue system, the voice recognition processing module is responsible for converting the audio stream into text information, a process that is critical to the efficiency and stability of the whole system. The module first frames the audio stream, dividing the continuous audio signal into audio frames of several time periods, typically 20-30 ms per frame, based on the short-time stationarity of the speech signal, i.e. the speech signal can be regarded as stationary over a short interval. Taking an audio stream with a sampling rate of 16 kHz as an example, every 320-480 sampling points constitute an audio frame. In software, framing can be realized by writing the received audio stream into a buffer, cyclically reading from the buffer, and extracting an audio frame once the amount of data reaches the required length; a dedicated audio processing chip can also frame and process the audio signal directly in hardware. Extracting acoustic features from each audio frame is a key step in speech recognition. Common acoustic features include Mel-frequency cepstral coefficients (MFCC), filter bank energies, and linear predictive cepstral coefficients (LPCC). Taking MFCC as an example, the extraction process mainly includes pre-emphasis, framing and windowing, the fast Fourier transform (FFT), Mel filter bank processing, and the discrete cosine transform (DCT): pre-emphasis boosts the high-frequency part to flatten the spectrum; framing and windowing (for example with a Hamming window) reduce spectral leakage; the FFT converts the time-domain signal into the frequency domain, with the number of points usually 256 or 512; the Mel filter bank processes the spectrum to simulate the hearing characteristics of the human ear; and the DCT extracts the MFCC coefficients from the filter bank output, with the first 12-13 coefficients used as acoustic features. Deep learning models such as recurrent neural networks (RNN), convolutional neural networks (CNN) and long short-term memory networks (LSTM) are used to map acoustic features to text. In the training phase, the model is trained with a large amount of voice data carrying text labels, for example using the TensorFlow or PyTorch framework. The LSTM is well suited to the temporal characteristics of the speech signal: its input is the acoustic feature sequence of the audio frames, and its output is the conditional probability distribution over words at each time step. In actual use, the extracted acoustic features are fed into the trained model, which outputs a preliminary speech recognition result consisting of word candidates and their probability sequences. The preliminary result is then input into the language acoustic model, which applies grammar and semantic constraints and optimizes it into text information that conforms to language norms and semantic logic. This model is built on a statistical or neural network language model and adjusts the preliminary result according to grammar rules and vocabulary collocation habits, for example replacing word combinations that do not conform to grammar, ensuring the accuracy of the output text information.
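As an illustration of the framing and MFCC step (not the patent's code; the frame and hop lengths below are one choice within the 20-30 ms range described, and librosa is assumed to be available as the feature extractor):

```python
import numpy as np
import librosa   # assumed available; any MFCC implementation would do

SR = 16000                  # 16 kHz sampling rate, as in the description
FRAME_LEN = 400             # 25 ms -> 400 samples (within the 20-30 ms range)
HOP_LEN = 160               # 10 ms hop between successive frames

def frame_signal(signal: np.ndarray) -> np.ndarray:
    """Split a continuous waveform into overlapping short-time frames with a Hamming window."""
    n_frames = 1 + (len(signal) - FRAME_LEN) // HOP_LEN
    idx = np.arange(FRAME_LEN)[None, :] + HOP_LEN * np.arange(n_frames)[:, None]
    return signal[idx] * np.hamming(FRAME_LEN)     # windowing reduces spectral leakage

def extract_mfcc(signal: np.ndarray) -> np.ndarray:
    """13 MFCCs per frame; pre-emphasis, FFT, Mel filter bank and DCT happen inside librosa."""
    return librosa.feature.mfcc(y=signal, sr=SR, n_mfcc=13,
                                n_fft=512, win_length=FRAME_LEN, hop_length=HOP_LEN).T

# features = extract_mfcc(waveform)  ->  shape (n_frames, 13), fed to the acoustic model
```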
In the embodiment of the invention, framing allows the speech signal to be analyzed more effectively, and acoustic feature extraction captures information that characterizes the speech well. With a suitable framing method and feature extraction algorithm, the discriminability of speech signals can be improved; for example, MFCC features reflect characteristics such as timbre and pitch well, making homophones easier to distinguish. Feeding accurate acoustic features into the deep learning model improves its recognition performance: the model can fully mine the latent mapping between acoustic features and text and convert features into text more accurately than traditional rule-based speech recognition. The language acoustic model corrects errors in the preliminary recognition result through grammatical and semantic constraints; for example, if the preliminary result contains a sentence that does not conform to grammar, the model can select a more suitable word from the candidate words according to grammar rules and semantic knowledge to replace the wrong one, yielding more accurate text, effectively reducing the word error rate and improving the overall recognition accuracy. Meanwhile, an efficient feature extraction algorithm can extract useful features in a short time and reduce the computation of the whole recognition process, and an optimized deep learning model can process the input features quickly and output the preliminary result, improving the real-time performance of speech recognition; in voice assistant applications on some smartphones, for example, recognition is completed and a response is given within seconds after the user finishes speaking, which depends largely on the efficient inference of the deep learning model. By dividing the continuous audio signal into frames and extracting acoustic features, the speech recognition processing module also effectively reduces the amount of data: compared with transmitting the original audio data directly, the data volume of the text information is significantly smaller, lowering the demand on network bandwidth, reducing delay and errors during transmission, and improving system stability.
Optionally, the intention recognition and answer module is specifically configured to:
performing word segmentation and part-of-speech tagging on the text information to obtain a basic structure and grammar characteristics of the text information;
carrying out semantic understanding and classification on the text information according to the basic structure and the grammar characteristics and by combining the context information through an intention classification model to obtain the field and the intention of the text information;
according to the domain and the intention, combining the logic reasoning algorithm to generate a logic reasoning result corresponding to the domain and the intention;
and generating the answer text according to the logical reasoning result by utilizing the language generation algorithm.
Specifically, the continuous text information is segmented into words or word sequences by string matching against a pre-built dictionary. For Chinese word segmentation, methods such as longest matching or forward maximum matching can be used. Taking forward maximum matching as an example, starting from the beginning of the sentence, the longest matching word in the dictionary is found in turn until the sentence ends. For the sentence "我爱自然语言处理" ("I love natural language processing"), forward maximum matching may first match "我" (I), then "爱" (love), then "自然语言" (natural language, assuming it is in the dictionary), and finally "处理" (processing). Words can also be segmented automatically by training a statistical model, such as a hidden Markov model (HMM) or a conditional random field (CRF); during training the model learns the relationship between word occurrence probabilities and the context, and during segmentation word boundaries are determined from these probability relations. For example, a CRF model can consider the information of each character's left and right neighbours to decide whether it belongs to the preceding word or the following one. Part-of-speech tagging of the segmented words can be done from the tagged parts of speech and grammar rules recorded for each word in the dictionary: in a simple rule dictionary, "我" is tagged as a pronoun, "爱" as a verb, and "自然语言处理" as a noun. According to grammar rules, when a word appears in different positions in a sentence, its part of speech is determined from the collocation relations and the sentence structure; for example, "自然语言处理" serves as the object of "爱" in the sentence, so its part of speech remains a noun.
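A toy version of the forward maximum matching segmenter described above (the dictionary and maximum word length are illustrative, and unknown characters fall back to single-character tokens):

```python
# Toy forward-maximum-matching segmenter; dictionary and max word length are illustrative.
DICTIONARY = {"我", "爱", "自然语言", "自然", "语言", "处理"}
MAX_WORD_LEN = 4

def forward_max_match(sentence: str) -> list[str]:
    """Greedily take the longest dictionary word starting at the current position."""
    tokens, i = [], 0
    while i < len(sentence):
        for length in range(min(MAX_WORD_LEN, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in DICTIONARY:
                tokens.append(candidate)
                i += length
                break
    return tokens

print(forward_max_match("我爱自然语言处理"))   # -> ['我', '爱', '自然语言', '处理']
```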
A machine learning model (such as an HMM or CRF) can also be trained on a corpus annotated with parts of speech. The model learns the part-of-speech probabilities of each word in different contexts, and in actual tagging the parts of speech are assigned according to the surrounding context and the probabilities computed by the model. For example, in the sentence "he plays basketball", "play" is tagged as a verb, while in a different collocation of the same word the tag would be decided by the context and the model's learned probabilities, and could be a verb or another part of speech.
Taking a convolutional neural network (CNN) as an example, features of the text are extracted automatically. During training, the segmented and part-of-speech tagged words are converted into a word vector sequence (obtainable from a pre-trained word vector model such as Word2Vec or GloVe) and fed into the CNN model; the convolution layers extract local features, the pooling layers reduce the feature dimensionality, and a fully connected layer outputs the intent classification result. For intent classification in a customer service scenario, for example, the model can learn the characteristic expressions of intents such as consultation, complaint or suggestion. A large corpus annotated with intent labels is prepared, covering text from various domains together with the corresponding domain and intent labels. The corpus is fed into the intent classification model, and the difference between the predicted intent and the labelled intent is minimized by adjusting the model parameters (weights, biases and so on), for example with a cross-entropy loss function. After several rounds of iterative training, the model learns to classify intent accurately from the structural features of the text (the basic structure and grammatical features obtained from word segmentation and part-of-speech tagging) and from the context information.
In a dialogue system, the content of multiple dialogue turns between the user and the agent is recorded, for example by storing the text, intent and domain of previous turns in a queue or list. When the current text needs semantic understanding and classification, the content of the previous dialogue can be consulted to determine its domain and intent more reliably. In the intent classification model, the current text can be concatenated or fused with the context information: the word vector sequence of the current sentence can be concatenated with that of the preceding sentence and fed into the model together, or an attention mechanism can be used to let the model focus on the parts of the context relevant to the current intent. For example, in a multi-turn conversation, if the user previously mentioned consulting about a trip, the model will be more inclined to focus on travel-related content when determining the intent of later turns.
For domains with clear rules, reasoning can be based on predefined knowledge rules; in the field of mathematical computation, for instance, answers are derived from the rules of arithmetic, so if the intent of the text is to evaluate a mathematical expression, the logical reasoning algorithm calculates the result according to those rules. A Bayesian network can represent probabilistic relationships between variables. After the domain and intent of the text are determined, the corresponding reasoning method is invoked from the logical reasoning rule base or model base; in travel planning, for example, if the user asks for an optimal route, the logical reasoning algorithm can plan the route from rules such as travel time, location and the distance between attractions, and the input information is then reasoned over with these rules or models. For example, in the question-answering domain, if the intent is to query the time of an event, the logical reasoning algorithm can look up the event record in an existing knowledge base and produce the corresponding time as the result.
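A compact, non-authoritative sketch of the TextCNN-style intent classifier and the cross-entropy training step described above, using PyTorch; the vocabulary size, kernel sizes and intent labels are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TextCNNIntentClassifier(nn.Module):
    """Minimal TextCNN: word-vector sequence -> convolution -> max-pooling -> intent logits."""
    def __init__(self, vocab_size: int, embed_dim: int = 128,
                 n_filters: int = 64, kernel_sizes=(2, 3, 4), n_intents: int = 5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, n_filters, k) for k in kernel_sizes)
        self.fc = nn.Linear(n_filters * len(kernel_sizes), n_intents)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embedding(token_ids).transpose(1, 2)          # (batch, embed, seq_len)
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))               # (batch, n_intents)

# One training step: adjust weights/biases to reduce the gap between predicted
# and labelled intents via cross entropy (token ids and labels are random stand-ins).
model = TextCNNIntentClassifier(vocab_size=10000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

batch_tokens = torch.randint(0, 10000, (8, 20))   # 8 utterances, 20 tokens each
batch_labels = torch.randint(0, 5, (8,))          # intent ids (e.g., consult/complain/...)
loss = loss_fn(model(batch_tokens), batch_labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```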
In some domains, answer templates may be predefined. For example, in the weather-query domain a template such as "the weather in [place] today is [weather conditions], with a temperature range of [temperature range]" can be designed, and the specific information in the logical reasoning result (place, weather conditions, temperature and so on) is filled into the template to generate the answer text. Taking the Seq2Seq model as an example, it consists of an encoder and a decoder: the encoder encodes the input logical reasoning result and other relevant information (such as the context) into a fixed-length vector, and the decoder generates the answer text from this vector. During training, a large number of question-answer pairs are used so that the model learns how to generate appropriate answers for different inputs. The logical reasoning results and any required context information are provided as inputs to the language generation algorithm; in a multi-turn dialogue, for example, the content of the previous turns must be supplied in addition to the reasoning result so that a coherent answer can be generated. From these inputs, the language generation algorithm produces the answer text according to the trained model structure and parameters; in a deep learning based model, the decoder generates the answer word by word until a complete sentence is produced.
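A trivial sketch of the template-filling path (the template wording and slot names are assumptions for illustration, not the patent's templates):

```python
# Illustrative template filling for the weather-query domain; field names are assumptions.
WEATHER_TEMPLATE = "The weather in {place} today is {condition}, with temperatures from {low} to {high}."

def fill_answer(template: str, reasoning_result: dict) -> str:
    """Insert slots produced by the logical-reasoning step into a predefined answer template."""
    return template.format(**reasoning_result)

print(fill_answer(WEATHER_TEMPLATE,
                  {"place": "Beijing", "condition": "sunny", "low": "18°C", "high": "27°C"}))
```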
In the embodiment of the invention, the word segmentation and the part-of-speech tagging can clearly analyze the basic structure and the grammar characteristics of the text information, and provide an accurate basis for subsequent semantic understanding. The intention classification model combines the structural features and the context information, so that semantic understanding can be performed on the text information more accurately, and wrong intention judgment caused by misunderstanding of grammar or semantics is avoided. Considering context information enables intent recognition to take into account the consistency and overall semantics of the dialog. For example, in a multi-turn conversation, the user may omit some information, and through fusion of context information, the intent classification model may accurately determine the intent of the current text information in conjunction with the content of the previous conversation. For example, the user firstly inquires "what scenic spots are in Beijing" and then inquires "ticket prices of the scenic spots", and the model can judge "the scenic spots" refer to the scenic spots in Beijing by combining the context, so that the intention of accurately identifying the second problem is to inquire the ticket prices of the scenic spots in Beijing. The rule-based logic reasoning algorithm can provide accurate and reliable reasoning results in the field of rule definition, and the probability reasoning algorithm based on Bayesian networks and the like can provide reasonable reasoning results in the field of uncertainty. The template-based language generation algorithm can ensure the accuracy of the format and content of the answer, especially in the structured answer scene such as information query. The language generation algorithm based on deep learning can generate more natural and flexible answers, and can generate the answers which are consistent and matched with the requirements of users according to different contexts. For example, for an open question asking the travel experience, a deep learning-based language generation algorithm may generate rich and attractive answers based on logical reasoning results (e.g., characteristics of travel location, user interest preferences, etc.), improving user satisfaction with the answers. Meanwhile, the interactive method and the interactive system interact with the user side in the form of answer words, data volume is reduced by using word transmission, transmission efficiency and stability are improved, and continuity and accuracy of multiple rounds of conversations are ensured.
Optionally, the voice synthesis module is specifically configured to:
Converting the answer text into a voice signal according to preset acoustic parameters through a voice synthesis model;
performing noise suppression and echo cancellation processing on the voice signal to obtain a processed voice signal;
and encoding the processed voice signal according to the transmission protocol of the communication push stream, and generating the audio stream corresponding to the answer text.
In particular, common speech synthesis models include concatenative synthesis, parametric synthesis, and deep-learning-based methods (e.g., Tacotron, WaveNet). Among them, deep-learning-based models currently deliver the best synthesis quality. The Tacotron model, for example, is an end-to-end speech synthesis system that converts text directly into spectral features of speech, which are then converted into the raw speech waveform by a vocoder such as WaveNet. The model is trained with a large amount of text-annotated speech data, typically covering different speakers, speaking rates, and intonations. The preset acoustic parameters include the speaker's acoustic characteristics (such as fundamental frequency and timbre), speaking rate, and intonation. During training, the model learns to map text to speech signals that match the preset acoustic parameters; the quality of the synthesized speech is improved by tuning the model's hyper-parameters (learning rate, hidden layer size, and so on) and by optimizing the objective (for example, minimizing the difference between generated and real speech). The answer text is first preprocessed, including text normalization (converting numbers and special symbols into word form) and word segmentation; for example, "20%" is converted to "twenty percent" so that the speech synthesis model can interpret the content correctly. The preprocessed text is then fed into the trained speech synthesis model, which generates the corresponding voice signal according to the preset acoustic parameters: during generation the model controls the playback speed according to the preset speaking-rate parameter and shapes the intonation according to the preset intonation parameter.
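A minimal sketch of the text normalization step mentioned above is shown below; the rules cover only percentages and small bare numbers, and the function names and number range are assumptions, since a production synthesis front end would implement a much fuller normalization grammar.

import re

_ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
         "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
         "seventeen", "eighteen", "nineteen"]
_TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]

def number_to_words(n: int) -> str:
    """Spell out 0-99; a real front end would cover the full numeric grammar."""
    if n < 20:
        return _ONES[n]
    tens, ones = divmod(n, 10)
    return _TENS[tens] + ("-" + _ONES[ones] if ones else "")

def normalize_text(text: str) -> str:
    """Expand percentages and bare numbers so the synthesis model sees plain words."""
    text = re.sub(r"(\d{1,2})%", lambda m: number_to_words(int(m.group(1))) + " percent", text)
    text = re.sub(r"\b(\d{1,2})\b", lambda m: number_to_words(int(m.group(1))), text)
    return text

print(normalize_text("The discount is 20% for 3 days"))
# -> "The discount is twenty percent for three days"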
For noise suppression and echo cancellation, the voice signal is first transformed with a short-time Fourier transform (STFT) to convert the time-domain signal into a spectrum. In the spectral domain, the noise power spectrum is estimated and then subtracted from the power spectrum of the voice signal to obtain a cleaner speech spectrum; for example, a spectrum of clean speech recorded in a quiet environment can serve as a reference, and when the voice signal contains noise, the noise is suppressed by subtracting the noise spectrum. Finally, the processed spectrum is converted back into a time-domain signal with the inverse short-time Fourier transform (ISTFT). Noise can also be suppressed with deep learning models (e.g., LSTM or CRNN networks) whose input is the spectrum of the noisy voice signal and whose output is the spectrum of the clean signal; such models are trained on large amounts of paired noisy and clean speech, and a well-trained model can effectively remove interference such as background music and environmental noise. Echo typically arises when the signal played by the loudspeaker is picked up again by the microphone. An adaptive filter can model the echo signal received by the microphone: its parameters are adjusted continuously so that its output approximates the actual echo, and this output is then subtracted from the echo-contaminated signal captured by the microphone to obtain a cleaner voice signal. In a full-duplex call system, for example, the adaptive filter updates its coefficients in real time from the loudspeaker signal and the microphone signal, cancelling the echo effectively. As with noise suppression, echo cancellation can also be implemented with a deep learning model whose input is the echo-contaminated signal and whose output is the de-echoed signal; with sufficient training data the model learns the difference between echo and clean speech and removes the echo.
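The spectral-subtraction path described above can be sketched as follows, assuming SciPy and NumPy; estimating the noise spectrum from the leading 0.3 seconds of the signal, and the frame size, are assumptions made for illustration only.

import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(speech: np.ndarray, sr: int = 16000,
                         noise_seconds: float = 0.3) -> np.ndarray:
    """Basic spectral subtraction: estimate the noise power spectrum from the
    leading (assumed speech-free) frames, subtract it, and reconstruct with the ISTFT."""
    f, t, spec = stft(speech, fs=sr, nperseg=512)
    power = np.abs(spec) ** 2
    phase = np.angle(spec)

    noise_frames = max(1, int(noise_seconds * sr / 256))     # hop size = nperseg // 2
    noise_power = power[:, :noise_frames].mean(axis=1, keepdims=True)

    clean_power = np.maximum(power - noise_power, 0.0)        # floor negative values at zero
    clean_spec = np.sqrt(clean_power) * np.exp(1j * phase)    # keep the noisy phase

    _, clean = istft(clean_spec, fs=sr, nperseg=512)
    return clean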
Common audio coding formats include PCM (pulse-code modulation), MP3, and Opus. In a speech synthesis system, an efficient coding format is usually chosen with the characteristics of the speech signal and the transmission efficiency in mind. Opus, for example, is a low-delay, high-compression audio codec well suited to real-time voice communication: it reduces the size of the voice data and improves transmission efficiency while preserving speech quality. The noise-suppressed, echo-cancelled voice signal is encoded according to the specification of the selected format; with Opus, the signal is divided into small blocks, each block undergoes spectral analysis and quantization, and the quantized data is packaged into an audio stream in Opus format. Depending on the actual communication environment and requirements, a suitable push-stream transport protocol is selected, such as RTP (Real-time Transport Protocol) or RTMP (Real-Time Messaging Protocol). Taking RTP as an example, it is a transport protocol for carrying real-time data such as audio and video over existing networks: before the audio stream is sent to the user side, it is packetized according to RTP and the necessary header fields (sequence number, timestamp, and so on) are added so that the receiving side can decode and play the audio correctly.
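A minimal sketch of the RTP packetization step is given below; the 12-byte header layout follows RFC 3550, while payload type 111 is only a commonly negotiated dynamic value for Opus, not a fixed assignment, and the example frame and SSRC are placeholders.

import struct

def rtp_packet(payload: bytes, seq: int, timestamp: int, ssrc: int,
               payload_type: int = 111) -> bytes:
    """Build a minimal 12-byte RTP header (version 2, no padding/extension/CSRC)
    and prepend it to one encoded audio frame."""
    first_byte = 2 << 6                              # version = 2
    second_byte = payload_type & 0x7F                # marker bit left clear
    header = struct.pack("!BBHII", first_byte, second_byte,
                         seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc)
    return header + payload

# Example: wrap a 20 ms Opus frame; the RTP timestamp advances in 48 kHz sample units (960 per 20 ms).
pkt = rtp_packet(b"\x00" * 40, seq=1, timestamp=960, ssrc=0x12345678)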
In the embodiment of the invention, converting the answer text into a voice signal through the voice synthesis module and transmitting it efficiently significantly improves the stability of the agent dialogue system and the user experience. Speech synthesis models based on deep learning, such as Tacotron and WaveNet, convert text into speech signals efficiently. During model training, a large amount of text-annotated speech data covering different speakers, speaking rates, and intonations is used so that natural speech can be generated. The preset acoustic parameters ensure that the timbre and intonation of the voice signal meet the user's expectations and improve speech quality. Before synthesis, the system preprocesses the answer text, including text normalization and word segmentation, so that the model interprets the content correctly. The generated voice signal undergoes noise suppression and echo cancellation, improving clarity and intelligibility: a short-time Fourier transform (STFT) converts the time-domain signal into a spectrum from which the noise power spectrum is estimated and subtracted, and deep learning models can likewise remove background noise and echo. The processed voice signal is then encoded according to the push-stream transport protocol, generating the audio stream corresponding to the answer text. The system selects an efficient audio coding format such as Opus, which greatly reduces the size of the voice data and improves transmission efficiency. Combined with the text transmission mode, the system reduces the data volume and the bandwidth requirement, lowers delay and error rates, and keeps the conversation fluent. In low-bandwidth or unstable network environments, the efficient coding format and protocol keep the voice signal transmitted stably and avoid interruptions or delays. The text transmission mode streamlines the communication process, reduces the data volume, and improves transmission efficiency and system stability. Through the efficient processing and transmission mechanism of the voice synthesis module, the user side receives the agent's answer promptly and clearly, the naturalness and fluency of the dialogue are enhanced, and a more stable and efficient agent dialogue experience is provided for the user.
Optionally, the audio conversion module is specifically further configured to:
when the transmission mode is data network transmission, encoding the audio data according to the User Datagram Protocol (UDP) to obtain network transmission data of the audio data.
Specifically, the audio conversion module receives raw audio data, typically in PCM (pulse-code modulation) format, from an audio acquisition device (e.g., a microphone) or from an audio processing module, and preprocesses the PCM data, including but not limited to removing silence segments, adjusting gain, and converting the audio format, in order to optimize audio quality and reduce unnecessary transmission. The processed PCM data is divided into blocks of suitable size, typically around 1000 bytes each, to suit network transmission. A UDP header is added for each block; it contains the source port, destination port, data length, and checksum fields that ensure correct routing and integrity checking in the network. The audio data is encoded with a specific audio coding algorithm (e.g., AAC or MP3) to reduce the data volume and improve transmission efficiency; the algorithm is chosen by weighing audio quality, compression ratio, computational complexity, and similar factors. The encoded audio data and the UDP header are combined into complete UDP datagrams, which constitute the network transmission data of the audio and can be sent to the designated destination. The intention recognition and answer module then processes the decoded audio data, extracts its semantic information, and recognizes the user's intent and requirements. Based on the recognition result, the module generates a corresponding answer, normally in text form, because text is more convenient for semantic generation and logical reasoning and expresses the content and logical structure of the answer more accurately.
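The block-splitting and datagram-sending step can be sketched as follows; the block size, the 4-byte application-level sequence number, and the host/port values are assumptions made for illustration (the UDP header fields themselves are filled in by the operating system's socket layer).

import socket

CHUNK_SIZE = 1000   # roughly the block size mentioned above; an assumption, not a requirement

def send_audio_over_udp(encoded_audio: bytes, host: str, port: int) -> None:
    """Split encoded audio into ~1000-byte blocks and send each as one UDP datagram.
    A 4-byte sequence number is prepended so the receiver can detect loss or reordering."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for seq, offset in enumerate(range(0, len(encoded_audio), CHUNK_SIZE)):
            block = encoded_audio[offset:offset + CHUNK_SIZE]
            sock.sendto(seq.to_bytes(4, "big") + block, (host, port))
    finally:
        sock.close()

# send_audio_over_udp(encoded_bytes, "198.51.100.10", 5004)   # host and port are placeholders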
In the embodiment of the invention, UDP, as a connectionless transport protocol, has low delay and can send the audio data quickly, so the audio signal can be transmitted over the network in real time and the fluency of the conversation is preserved. The UDP header overhead is only 8 bytes, so network bandwidth is used more efficiently than with other protocols (e.g., TCP). Meanwhile, the audio coding algorithm further reduces the data volume and improves transmission efficiency. The audio conversion module can thus effectively convert the audio data into network transmission data suitable for data network transmission, improving the stability and efficiency of the audio conversation.
Optionally, the voice synthesis module is specifically configured to:
when the transmission mode of the audio data is the data network transmission, determining a speech rate parameter, an intonation parameter, a volume parameter and a tone parameter of a speech synthesis model according to the signal strength of the data network;
Optimizing preset acoustic parameters of a speech synthesis model according to the speech speed parameter, the intonation parameter, the volume parameter and the tone parameter to obtain the optimized speech synthesis model;
and performing dynamic speech synthesis on the answer text by combining the optimized speech synthesis model with a forward error correction and redundancy coding strategy to obtain the audio stream corresponding to the answer text.
In particular, the signal strength of a data network is obtained in real time through a network interface of a device, and is generally measured by parameters such as signal strength indication (RSSI) or Received Signal Strength (RSS). These values may be obtained through an API or network driver provided by the operating system. And comparing the signal intensity of the data network with a preset intensity interval, and mapping the signal intensity to the speech speed, intonation, volume and tone parameters of the speech synthesis model. For example, when the signal strength is high, the network condition is good, the user may be in a quiet and stable environment, the speech speed can be properly increased, the speech synthesis is smoother, the volume can be properly reduced, the intonation can be smoother, and the tone can be softer. Conversely, when the signal strength is low, the user may be in a noisy or network unstable environment, the speech speed may be appropriately reduced, the volume is increased, the tone is clearer, and the tone is brighter, so as to ensure that the user can clearly hear the speech synthesis content.
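A sketch of the mapping from signal strength to synthesis parameters is shown below; the RSSI thresholds and the specific parameter values are illustrative assumptions, since the disclosure only fixes the direction of the adjustment, not the numbers.

def synthesis_params_for_rssi(rssi_dbm: float) -> dict:
    """Map data-network signal strength to speech-synthesis parameters.
    Thresholds and values are illustrative, not prescribed by the system."""
    if rssi_dbm >= -65:      # strong signal: assume a quiet, stable environment
        return {"rate": 1.1, "pitch_steps": 0.0, "volume_gain_db": -2.0, "timbre": "soft"}
    if rssi_dbm >= -80:      # moderate signal: keep defaults
        return {"rate": 1.0, "pitch_steps": 0.0, "volume_gain_db": 0.0, "timbre": "neutral"}
    # weak signal: slow down, raise volume, use a brighter timbre for intelligibility
    return {"rate": 0.9, "pitch_steps": 1.0, "volume_gain_db": 4.0, "timbre": "bright"}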
And according to the mapping result, adjusting the speech speed parameters of the speech synthesis model. In the process of speech synthesis, different speech speeds are realized by controlling the playing speed of a speech signal. For example, in a deep learning-based speech synthesis model, the hyper-parameters controlling the speech rate in the model may be adjusted, or the speech signal may be compressed or expanded in the time axis during speech generation to change the speech rate. The adjustment of the pitch parameters may be achieved by changing the fundamental frequency of the speech signal. In the speech synthesis model, the fundamental frequency determines the pitch of speech, thereby affecting intonation. The intonation may be changed by adjusting parameters associated with the fundamental frequency in the model, or by modifying the fundamental frequency after speech generation. For example, the fundamental frequency of the speech signal is shifted up or down in the frequency domain to achieve different intonation effects. The volume parameter is mainly adjusted to control the amplitude of the voice signal. In a speech synthesis model, the volume may be changed by adjusting the magnitude value of the model output. Or after the voice is generated, the voice signal is amplified or reduced to achieve the purpose of adjusting the volume. The adjustment of the tone color parameters relates to the spectral characteristics of the speech signal. The tone color may be changed by changing parameters related to the spectrum in the speech synthesis model, or by filtering the spectrum after speech generation, or the like. For example, different vocoders or filters are used to shape different timbres.
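As a rough illustration of adjusting speed, intonation, and volume after the waveform has been generated, the following sketch assumes the librosa library; applying these effects as post-processing (rather than inside the synthesis model) is itself an assumption made to keep the example short.

import numpy as np
import librosa

def apply_prosody(y: np.ndarray, sr: int, rate: float,
                  pitch_steps: float, gain_db: float) -> np.ndarray:
    """Post-process a synthesized waveform: time-stretch for speech rate,
    shift the fundamental frequency for intonation, and scale amplitude for volume."""
    y = librosa.effects.time_stretch(y, rate=rate)             # rate > 1.0 speaks faster
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=pitch_steps)
    y = y * (10.0 ** (gain_db / 20.0))                          # dB gain to linear amplitude
    return np.clip(y, -1.0, 1.0)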
And updating the preset acoustic parameters of the speech synthesis model according to the adjusted speech speed, intonation, volume and tone parameters. Within the model, these parameters may affect various links of speech generation, such as spectral generation of the speech signal, fundamental frequency control, amplitude adjustment, etc. For example, in Tacotron-2 speech synthesis models, optimization of speech speed, intonation, volume, and timbre may be achieved by adjusting parameters associated with mel spectrum generation modules, fundamental frequency prediction modules, etc. in the model. In a preferred embodiment, the optimized model is further trained using a small amount of adaptation data to better accommodate current network conditions and user requirements. The adaptive data may be speech data collected in a network-like environment, and the model is enabled to better generate a satisfactory speech signal by fine-tuning parameters of the model.
In the speech synthesis process, a forward error correction coding technique is used to add redundant information to the generated speech data. With a coding scheme such as Reed-Solomon codes, even if part of the data is lost or corrupted during transmission, the receiving end can recover the original speech data from the redundant information. In the speech synthesis module, the speech data can be divided into multiple blocks during the encoding stage, and a corresponding error correction code is generated for each block and appended to it. In addition to forward error correction, a redundancy coding strategy can be adopted, i.e., encoding or backing up important speech data multiple times; for example, the key frames or important characteristic parameters of the speech signal are encoded redundantly, so that the receiving end can still reconstruct the speech signal even if part of the data is lost in transit. In implementation, the level and manner of redundancy coding can be chosen according to the data network signal strength and the importance of the speech data. The answer text is converted into a voice signal by the optimized speech synthesis model, and during synthesis the speech data is processed according to the forward error correction and redundancy coding strategies; for example, error correction codes and redundant data are added to the generated speech data stream according to the predetermined coding rules. The processed speech data is then packaged according to the transmission protocol of the communication push stream, generating the audio stream corresponding to the answer text, ready to be sent to the user terminal.
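A minimal sketch of the Reed-Solomon protection step follows; it assumes the third-party reedsolo package (whose recent versions return a (message, full codeword, errata) tuple from decode), and the block and parity sizes are illustrative choices, not values fixed by this disclosure.

from reedsolo import RSCodec   # third-party library, used here purely for illustration

BLOCK = 200     # payload bytes per block (assumption)
PARITY = 20     # parity bytes per block: up to 10 corrupted bytes per block are correctable

rsc = RSCodec(PARITY)

def protect(audio_bytes: bytes) -> list[bytes]:
    """Split encoded speech into blocks and append Reed-Solomon parity to each block."""
    return [bytes(rsc.encode(audio_bytes[i:i + BLOCK]))
            for i in range(0, len(audio_bytes), BLOCK)]

def recover(block: bytes) -> bytes:
    """Correct in-block byte errors and strip parity at the receiving end."""
    return bytes(rsc.decode(block)[0])   # [0]: the decoded message part of the returned tuple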
In the embodiment of the invention, the parameters of the voice synthesis model are dynamically adjusted according to the signal intensity of the data network, so that the voice synthesis can adapt to different network environments and user scenes. The method provides smoother and natural voice synthesis under the environment of strong signals and stable network, ensures the voice synthesis to be clear and intelligible under the environment of weak signals and unstable network, meets the requirements of users under various conditions, and improves the adaptability and reliability of the intelligent dialogue system. The adjusted speech speed, intonation, volume and tone parameters can better match the hearing requirements and usage scenarios of the user. For example, increasing volume and clarity in a noisy environment allows a user to more easily hear speech synthesis content, providing softer, natural speech in a quiet environment, and increasing the user's auditory comfort. In this way, the user experience is optimized, making the user more willing to use the agent dialog system. The optimized voice synthesis model can generate voice signals which more meet the current network conditions and the requirements of users. The forward error correction and redundancy coding strategy can effectively cope with the loss and damage of voice data in the transmission process, and ensure the integrity and accuracy of voice signals. For example, under the condition of network fluctuation or data packet loss, the receiving end can recover the original voice signal by using error correction codes and redundant data, thereby reducing voice interruption or distortion caused by data loss and improving the overall quality of voice synthesis. The forward error correction and redundancy coding strategies increase the fault tolerance of the voice data and reduce the requirements on the reliability of network transmission. Even under the condition of poor network condition, stable transmission of voice data can be ensured, conversation interruption or delay caused by network problems is reduced, and stability and usability of the intelligent conversation system are improved.
Optionally, the intention recognition and answer module is specifically configured to:
according to the data network signal intensity, determining network fluctuation data of the user side;
predicting according to the network fluctuation data to obtain future network fluctuation data of the answer text in the transmission period;
and sending the answer text to the user terminal according to the future network fluctuation data.
Specifically, the signal strength of the data network is continuously monitored, and the change of the signal strength with time is recorded. In addition to signal strength, other network related metrics such as packet loss rate, delay, etc. are collected. Such data may be obtained through a network interface API or network diagnostic tool of the device. The monitoring time period is divided into a plurality of small time windows, for example, a window is formed every 10 seconds, and statistical indexes such as signal intensity average value, variance, packet loss rate, delay and the like in each window are calculated, wherein the indexes form network fluctuation data. The rate of change of signal strength between different time windows is calculated, e.g., the difference in signal strength between adjacent time windows divided by the signal strength of the previous window, to yield the relative rate of change of signal strength. Meanwhile, the variance of the signal intensity is calculated to measure the stability of the signal intensity. And counting the packet loss rate and delay change condition in each time window. For example, the standard deviation of the packet loss rate and the statistical indexes such as the maximum value, the minimum value, the average value and the like of the delay are calculated so as to comprehensively reflect the fluctuation condition of the network.
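A sketch of the per-window statistics described above is given below; the 10-second window, the specific statistics kept, and the dictionary keys are illustrative assumptions.

import numpy as np

def fluctuation_features(rssi, loss, delay_ms, window: int = 10) -> list[dict]:
    """Summarize raw per-second measurements into per-window fluctuation statistics
    (10-second windows here, matching the example above)."""
    feats = []
    prev_mean = None
    for start in range(0, len(rssi) - window + 1, window):
        sl = slice(start, start + window)
        mean_rssi = float(np.mean(rssi[sl]))
        feats.append({
            "rssi_mean": mean_rssi,
            "rssi_var": float(np.var(rssi[sl])),
            # relative change of mean signal strength versus the previous window
            "rssi_change": 0.0 if prev_mean is None else (mean_rssi - prev_mean) / abs(prev_mean),
            "loss_std": float(np.std(loss[sl])),
            "delay_mean": float(np.mean(delay_ms[sl])),
            "delay_max": float(np.max(delay_ms[sl])),
        })
        prev_mean = mean_rssi
    return feats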
A suitable time-series prediction model is selected, such as an autoregressive integrated moving average (ARIMA) model, exponential smoothing, or a long short-term memory (LSTM) network. Taking LSTM as an example, it can capture both long- and short-term dependencies in time-series data and is therefore well suited to predicting network fluctuation data. The model is trained with the collected historical network fluctuation data: the history is divided into a training set and a test set, and the difference between predicted and actual values (e.g., a mean squared error (MSE) loss) is minimized by adjusting the model parameters, such as the LSTM hidden layer size and the learning rate. The latest network fluctuation data (statistics of signal strength, packet loss rate, delay, and so on) serve as the model input and are preprocessed, for example normalized, to meet the model's input requirements. From this input the model predicts the network fluctuation data for a coming period (e.g., the next 30 seconds), including the signal strength trend and the predicted packet loss rate and delay. These predictions are then used to plan the transmission of the answer text.
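The LSTM predictor could be sketched as follows, assuming PyTorch; the number of input features, hidden size, history length, and three-value output are illustrative placeholders rather than a prescribed architecture.

import torch
import torch.nn as nn

class FluctuationLSTM(nn.Module):
    """One-step-ahead predictor: a sequence of per-window feature vectors in,
    the next window's (signal strength trend, loss rate, delay) out."""
    def __init__(self, num_features=6, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(num_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 3)

    def forward(self, x):                  # x: (batch, windows, num_features), normalized
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])    # prediction for the next window

model = FluctuationLSTM()
history = torch.randn(1, 12, 6)            # last 12 windows of normalized statistics
predicted = model(history)                 # [signal trend, loss rate, delay] for the next window
loss_fn = nn.MSELoss()                     # trained against the observed next-window values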
In the embodiment of the invention, the influence of network fluctuation on audio transmission can be dealt with in advance by predicting future network fluctuation data and making a transmission plan according to the future network fluctuation data. For example, when the network packet loss rate is predicted to increase, the transmission frequency is reduced and the transmission plan is adjusted, so that the audio interruption or the blocking phenomenon caused by packet loss is reduced. The self-adaptive transmission strategy improves the transmission stability of the answer text under the condition of network fluctuation, and ensures that users can smoothly receive audio content.
Optionally, the intention recognition and answer module is specifically further configured to:
determining the transmission frequency of the answer text according to the future network fluctuation data, and dividing the answer text into a plurality of text fragments;
generating a sending plan of the answer text according to the transmission frequency of the answer text and the text segment;
and sending the text fragments of the answer text to the user side according to the sending plan.
Specifically, the transmission frequency of the answer text is determined from the predicted future network fluctuation data. If large fluctuations are predicted (e.g., falling signal strength and rising packet loss rate), the transmission frequency is lowered to reduce data loss and retransmission caused by the fluctuation; conversely, if the network is predicted to be stable, the transmission frequency can be raised appropriately to speed up delivery of the answer text. The frequency can be computed dynamically from the predicted packet loss rate and delay: assuming a basic rate of 20 text segments per second in a stable network environment, a predicted 10% increase in packet loss reduces the rate by 10%, i.e., to 18 segments per second, and the rate is also reduced when the predicted delay exceeds a certain threshold. The answer text is divided into multiple text segments according to the determined transmission frequency, and a sending plan is generated from the frequency and the order of the segments; the plan records the sending time, sending order, and data volume of each segment, with each segment scheduled at the corresponding point in time in sequence so that the continuity and integrity of the answer text are preserved. The plan is further optimized using the fluctuation prediction: for example, if large fluctuations are expected within a certain period, key text segments can be sent ahead of time, or buffer time can be reserved in the plan to absorb possible network delays. The text segments of the answer text are then sent to the user side according to the generated plan, while the network condition is monitored in real time; if the actual network deviates significantly from the prediction, the sending plan is adjusted dynamically, for example by raising the transmission frequency when the actual fluctuation turns out to be smaller than predicted.
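A minimal sketch of deriving the rate and the sending plan follows; the base rate of 20 segments per second comes from the example above, while the segment length, the 10% delay penalty, and the plan fields are illustrative assumptions.

from datetime import datetime, timedelta

BASE_RATE = 20.0   # text segments per second on a stable network (example value from above)

def plan_transmission(answer_text: str, predicted_loss_increase: float,
                      delay_exceeds_threshold: bool, segment_chars: int = 30) -> list[dict]:
    """Derive the transmission frequency from predicted fluctuation, split the answer
    into segments, and schedule a send time for each segment in order."""
    rate = BASE_RATE * (1.0 - predicted_loss_increase)      # e.g. +10% predicted loss -> 18 seg/s
    if delay_exceeds_threshold:
        rate *= 0.9                                          # further back off on high delay
    rate = max(rate, 1.0)

    segments = [answer_text[i:i + segment_chars]
                for i in range(0, len(answer_text), segment_chars)]
    start = datetime.now()
    return [{"index": i, "text": seg, "send_at": start + timedelta(seconds=i / rate)}
            for i, seg in enumerate(segments)]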
According to the method and the device for sending the text fragments, the sending sequence and the sending time of the text fragments are reasonably arranged according to the network fluctuation prediction result, and the risk of losing the answer text in the transmission process can be reduced. For example, when the network condition is poor, the key text segments are preferentially sent, or the sending frequency and the buffering time are adjusted, so that the receiving end can better reorganize the answer text, the audio quality problem caused by data loss is reduced, and the reliability of audio transmission is improved. When the network condition is stable, the transmission frequency is properly increased according to the prediction result, and the transmission speed of the answer text can be increased. For example, when the future network fluctuation is predicted to be small and the bandwidth is sufficient, the number of text fragments transmitted per second is increased, so that the user can receive complete audio content more quickly, the waiting time of the user is shortened, and the response speed of the intelligent agent dialogue system is improved. Dynamically adjusting the transmission frequency and the transmission schedule enables more rational utilization of network resources. Under different network environments, the sending strategy of the answer words is adjusted according to actual conditions, so that the situation that the network bandwidth is excessively occupied when the network condition is good and resource waste caused by blind sending when the network fluctuates is avoided. The optimization strategy improves the utilization efficiency of network resources, and enables the dialogue with the intelligent agent to be smoother and more stable.
Referring to fig. 2, the method for intelligent agent dialogue is applied to the intelligent agent dialogue system, wherein the intelligent agent dialogue system comprises a user end and a server end, and the output of the user end is in communication connection with the server end;
The user terminal comprises an acquisition module, an audio conversion module, a voice recognition processing module and a voice synthesis module which are sequentially connected in a communication way, and the server terminal comprises an intention recognition and answer module;
The output end of the voice recognition processing module of the user end is connected with the input end of the intention recognition and answer module of the server end, and the output end of the intention recognition and answer module is connected with the input end of the voice synthesis module of the user end;
the agent dialogue method comprises the following steps:
acquiring audio data of a user through the acquisition module;
Setting a transmission mode of the audio data according to the data network signal strength of the user side through the audio conversion module, and when the transmission mode is communication signal transmission, carrying out coding processing on the audio data according to a transmission protocol of communication push stream to obtain an audio stream corresponding to the audio data;
Real-time analysis is carried out through deep learning according to the audio stream by the voice recognition processing module to obtain an analysis result, and the analysis result is optimized through a language acoustic model to obtain text information corresponding to the audio data;
Determining the field and the intention of the text information according to the text information corresponding to the audio data and combining context information through the intention recognition and answer module, and obtaining answer text of the audio data according to the field and the intention and combining a logical reasoning algorithm and a language generation algorithm;
And performing voice synthesis on the answer text through the voice synthesis module to obtain the audio stream corresponding to the answer text, and encoding the audio stream corresponding to the answer text according to the transmission protocol of the communication push stream to obtain the audio data of the answer text.
The advantages of the agent dialogue method of the present invention compared with the prior art are the same as those of the agent dialogue system described above compared with the prior art, and will not be described here again.
Although the present disclosure is disclosed above, the scope of the present disclosure is not limited thereto. Various changes and modifications may be made by one skilled in the art without departing from the spirit and scope of the disclosure, and these changes and modifications will fall within the scope of the disclosure.

Claims (10)

1. An agent dialogue system, characterized by comprising a user end and a server end in communication connection; the user end comprises an acquisition module, an audio conversion module, a speech recognition processing module and a speech synthesis module which are communicatively connected in sequence, and the server end comprises an intention recognition and answer module; the output end of the speech recognition processing module of the user end is connected to the input end of the intention recognition and answer module of the server end, and the output end of the intention recognition and answer module is connected to the input end of the speech synthesis module of the user end; the acquisition module is configured to obtain audio data of a user; the audio conversion module is configured to set a transmission mode of the audio data according to the data network signal strength of the user end, and, when the transmission mode is communication signal transmission, to encode the audio data according to a transmission protocol of communication push stream to obtain an audio stream corresponding to the audio data; the speech recognition processing module is configured to perform real-time analysis on the audio stream through deep learning to obtain an analysis result, and to optimize the analysis result through a language acoustic model to obtain text information corresponding to the audio data; the intention recognition and answer module is configured to determine the domain and intent of the text information according to the text information corresponding to the audio data in combination with context information, and to obtain answer text of the audio data according to the domain and the intent in combination with a logical reasoning algorithm and a language generation algorithm; the speech synthesis module is configured to perform speech synthesis on the answer text to obtain an audio stream corresponding to the answer text, and to encode the audio stream corresponding to the answer text according to the transmission protocol of the communication push stream to obtain audio data of the answer text.

2. The agent dialogue system according to claim 1, wherein the audio conversion module is specifically configured to: obtain the data network signal strength of the user end; set the transmission mode of the audio data according to the magnitude relationship between the data network signal strength and a preset signal strength threshold; wherein, when the data network signal strength is less than the preset signal strength threshold, the transmission mode of the audio data is set to the communication signal transmission; and when the data network signal strength is greater than or equal to the preset signal strength threshold, the transmission mode of the audio data is set to data network transmission.

3. The agent dialogue system according to claim 1, wherein the speech recognition processing module is specifically configured to: perform frame processing on the audio stream to divide the continuous audio signal in the audio stream into audio frames of multiple time periods; extract acoustic features from each audio frame to obtain the acoustic features of the audio frame; input the acoustic features into a deep learning model for nonlinear mapping to generate a preliminary speech recognition result; and apply grammatical and semantic constraints to the preliminary speech recognition result through the language acoustic model to obtain the text information corresponding to the audio data.

4. The agent dialogue system according to claim 1, wherein the intention recognition and answer module is specifically configured to: perform word segmentation and part-of-speech tagging on the text information to obtain the basic structure and grammatical features of the text information; use an intent classification model to semantically understand and classify the text information according to the basic structure and the grammatical features in combination with context information, so as to obtain the domain and the intent of the text information; generate, according to the domain and the intent in combination with the logical reasoning algorithm, a logical reasoning result corresponding to the domain and the intent; and use the language generation algorithm to generate the answer text according to the logical reasoning result.

5. The agent dialogue system according to claim 1, wherein the speech synthesis module is specifically configured to: convert the answer text into a voice signal according to preset acoustic parameters through a speech synthesis model; perform noise suppression and echo cancellation processing on the voice signal to obtain a processed voice signal; and encode the processed voice signal according to the transmission protocol of the communication push stream to generate the audio stream corresponding to the answer text.

6. The agent dialogue system according to claim 2, wherein the audio conversion module is further configured to: when the transmission mode is the data network transmission, encode the audio data according to the User Datagram Protocol to obtain network transmission data of the audio data.

7. The agent dialogue system according to claim 6, wherein the speech synthesis module is specifically configured to: when the transmission mode of the audio data is the data network transmission, determine a speech rate parameter, an intonation parameter, a volume parameter and a timbre parameter of a speech synthesis model according to the data network signal strength; optimize preset acoustic parameters of the speech synthesis model according to the speech rate parameter, the intonation parameter, the volume parameter and the timbre parameter to obtain an optimized speech synthesis model; and perform dynamic speech synthesis on the answer text through the optimized speech synthesis model in combination with forward error correction and redundancy coding strategies to obtain the audio stream corresponding to the answer text.

8. The agent dialogue system according to claim 1, wherein the intention recognition and answer module is specifically configured to: determine network fluctuation data of the user end according to the data network signal strength; perform prediction according to the network fluctuation data to obtain future network fluctuation data during transmission of the answer text; and send the answer text to the user end according to the future network fluctuation data.

9. The agent dialogue system according to claim 8, wherein the intention recognition and answer module is further configured to: determine a transmission frequency of the answer text according to the future network fluctuation data, and divide the answer text into a plurality of text segments; generate a sending plan for the answer text according to the transmission frequency of the answer text and the text segments; and send the text segments of the answer text to the user end according to the sending plan.

10. An agent dialogue method, applied to the agent dialogue system according to any one of claims 1 to 9, wherein the agent dialogue system comprises a user end and a server end, and the output of the user end is communicatively connected to the server end; the user end comprises an acquisition module, an audio conversion module, a speech recognition processing module and a speech synthesis module which are communicatively connected in sequence, and the server end comprises an intention recognition and answer module; the output end of the speech recognition processing module of the user end is connected to the input end of the intention recognition and answer module of the server end, and the output end of the intention recognition and answer module is connected to the input end of the speech synthesis module of the user end; the agent dialogue method comprises: obtaining audio data of a user through the acquisition module; setting, through the audio conversion module, a transmission mode of the audio data according to the data network signal strength of the user end, and, when the transmission mode is communication signal transmission, encoding the audio data according to a transmission protocol of communication push stream to obtain an audio stream corresponding to the audio data; performing, through the speech recognition processing module, real-time analysis on the audio stream through deep learning to obtain an analysis result, and optimizing the analysis result through a language acoustic model to obtain text information corresponding to the audio data; determining, through the intention recognition and answer module, the domain and intent of the text information according to the text information corresponding to the audio data in combination with context information, and obtaining answer text of the audio data according to the domain and the intent in combination with a logical reasoning algorithm and a language generation algorithm; and performing, through the speech synthesis module, speech synthesis on the answer text to obtain an audio stream corresponding to the answer text, and encoding the audio stream corresponding to the answer text according to the transmission protocol of the communication push stream to obtain audio data of the answer text.
CN202511093085.7A 2025-08-06 2025-08-06 Intelligent body dialogue system and method Pending CN120600029A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202511093085.7A CN120600029A (en) 2025-08-06 2025-08-06 Intelligent body dialogue system and method


Publications (1)

Publication Number Publication Date
CN120600029A 2025-09-05

Family

ID=96891246


Country Status (1)

Country Link
CN (1) CN120600029A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination