CN120600029A

CN120600029A - Intelligent body dialogue system and method

Info

Publication number: CN120600029A
Application number: CN202511093085.7A
Authority: CN
Inventors: 文一然; 庞文刚; 任奕; 邹西山; 潘东玮; 周静纯
Original assignee: China Unicom WO Music and Culture Co Ltd
Current assignee: China Unicom WO Music and Culture Co Ltd
Priority date: 2025-08-06
Filing date: 2025-08-06
Publication date: 2025-09-05

Abstract

The present invention provides an intelligent agent dialogue system and method, relating to the field of audio conversion technology, comprising: an acquisition module for acquiring user audio data; an audio conversion module for setting the transmission mode of the audio data based on the data network signal strength of the user end; when the transmission mode is communication signal transmission, encoding the audio data to obtain an audio stream; a speech recognition processing module for performing real-time analysis based on the audio stream to obtain analysis results, and optimizing the analysis results using a language acoustic model to obtain text information; an intention recognition and response module for generating response text based on the audio data; and a speech synthesis module for performing speech synthesis on the response text to obtain audio data of the response text. The present invention uses text transmission instead of audio streaming, significantly reducing the data transmission volume, improving transmission efficiency, reducing delays and errors caused by network congestion, and improving the stability of the intelligent agent audio dialogue.

Description

Intelligent body dialogue system and method

Technical Field

The invention relates to the technical field of audio conversion, in particular to an agent dialogue system and method.

Background

In the existing intelligent dialogue interaction scene, a common mode is that a user interacts with an intelligent agent by inputting text information through a mobile application or a webpage end. However, when the user is in a scene inconvenient to manually input, such as driving or sports, it is difficult to conveniently communicate with the intelligent agent. Therefore, in general, a user uses APP on a terminal device such as a mobile phone to realize an audio call with an agent.

In the related art, most of the APP of the intelligent agent adopts a data network to realize dialogue, and the existing voice interaction has the characteristic of strong dependence on data network signals, so that the phenomenon of unsmooth dialogue with the intelligent agent can occur in the scene of poor network signals, and further the requirement of users on instant feedback of the intelligent agent cannot be met.

Disclosure of Invention

The invention solves the problem of how to improve the stability of the agent audio dialogue.

In order to solve the above problems, the present invention provides an agent dialogue system and method.

In a first aspect, the present invention provides an agent dialogue system, including a user side and a server side, where an output of the user side is connected with the server side in a communication manner;

The user terminal comprises an acquisition module, an audio conversion module, a voice recognition processing module and a voice synthesis module which are sequentially connected in a communication way, and the server terminal comprises an intention recognition and answer module;

The output end of the voice recognition processing module of the user end is connected with the input end of the intention recognition and answer module of the server end, and the output end of the intention recognition and answer module is connected with the input end of the voice synthesis module of the user end;

The acquisition module is used for acquiring audio data of a user;

The audio conversion module is used for setting a transmission mode of the audio data according to the data network signal strength of the user side, and when the transmission mode is communication signal transmission, the audio data is encoded according to a transmission protocol of communication push stream to obtain an audio stream corresponding to the audio data;

the voice recognition processing module is used for carrying out real-time analysis through deep learning according to the audio stream to obtain an analysis result, and optimizing the analysis result through a language acoustic model to obtain text information corresponding to the audio data;

The intention recognition and answer module is used for determining the field and intention of the text information according to the text information corresponding to the audio data and combining context information, and obtaining answer text of the audio data according to the field and the intention and combining a logical reasoning algorithm and a language generation algorithm;

The voice synthesis module is used for performing voice synthesis on the answer characters to obtain audio streams corresponding to the answer characters, and performing coding processing on the audio streams corresponding to the answer characters according to the transmission protocol of the communication plug flow to obtain audio data of the answer characters.

Optionally, the audio conversion module is specifically configured to:

acquiring the data network signal strength of the user side;

setting the transmission mode of the audio data according to the magnitude relation between the signal intensity of the data network and a preset signal intensity threshold;

When the signal intensity of the data network is smaller than the preset signal intensity threshold, setting the transmission mode of the audio data as the communication signal transmission;

And setting the transmission mode of the audio data to be data network transmission when the signal strength of the data network is greater than or equal to the preset signal strength threshold.

Optionally, the voice recognition processing module is specifically configured to:

Carrying out framing treatment on the audio stream, and dividing continuous audio signals in the audio stream into audio frames of a plurality of time periods;

Extracting acoustic features of each audio frame to obtain acoustic features of the audio frames;

inputting the acoustic features into a deep learning model for nonlinear mapping, and generating a preliminary voice recognition result;

and carrying out grammar and semantic constraint on the preliminary voice recognition result through the language acoustic model to obtain the text information corresponding to the audio data.

Optionally, the intention recognition and answer module is specifically configured to:

performing word segmentation and part-of-speech tagging on the text information to obtain a basic structure and grammar characteristics of the text information;

carrying out semantic understanding and classification on the text information according to the basic structure and the grammar characteristics and by combining the context information through an intention classification model to obtain the field and the intention of the text information;

according to the domain and the intention, combining the logic reasoning algorithm to generate a logic reasoning result corresponding to the domain and the intention;

and generating the answer text according to the logical reasoning result by utilizing the language generation algorithm.

Optionally, the voice synthesis module is specifically configured to:

Converting the answer text into a voice signal according to preset acoustic parameters through a voice synthesis model;

performing noise suppression and echo cancellation processing on the voice signal to obtain a processed voice signal;

and encoding the processed voice signal according to a transmission protocol of communication plug flow, and generating the audio stream corresponding to the answer text.

Optionally, the audio conversion module is specifically further configured to:

And when the transmission mode is the data network transmission, the audio data is encoded according to a user datagram protocol, so as to obtain network transmission data of the audio data.

Optionally, the voice synthesis module is specifically configured to:

when the transmission mode of the audio data is the data network transmission, determining a speech rate parameter, a intonation parameter, a volume parameter and a tone parameter of a speech synthesis model according to the signal strength of the data network;

Optimizing preset acoustic parameters of a speech synthesis model according to the speech speed parameter, the intonation parameter, the volume parameter and the tone parameter to obtain the optimized speech synthesis model;

and performing dynamic speech synthesis on the answer words by combining the optimized speech synthesis model with a forward error correction and redundancy coding strategy to obtain the audio stream corresponding to the answer words.

according to the data network signal intensity, determining network fluctuation data of the user side;

predicting according to the network fluctuation data to obtain future network fluctuation data of the answer text in the transmission period;

and sending the answer text to the user terminal according to the future network fluctuation data.

Optionally, the intention recognition and answer module is specifically further configured to:

determining the transmission frequency of the answer text according to the future network fluctuation data, and dividing the answer text into a plurality of text fragments;

generating a sending plan of the answer text according to the transmission frequency of the answer text and the text segment;

and sending the text fragments of the answer text to the user side according to the sending plan.

In a second aspect, the present invention provides an agent dialogue method, applied to any one of the above agent dialogue systems, where the agent dialogue system includes a user end and a server end, and an output of the user end is connected with the server end in a communication manner;

the agent dialogue method comprises the following steps:

acquiring audio data of a user through the acquisition module;

Setting a transmission mode of the audio data according to the data network signal strength of the user side through the audio conversion module, and when the transmission mode is communication signal transmission, carrying out coding processing on the audio data according to a transmission protocol of communication push stream to obtain an audio stream corresponding to the audio data;

Real-time analysis is carried out through deep learning according to the audio stream by the voice recognition processing module to obtain an analysis result, and the analysis result is optimized through a language acoustic model to obtain text information corresponding to the audio data;

Determining the field and the intention of the text information according to the text information corresponding to the audio data and combining context information through the intention recognition and answer module, and obtaining answer text of the audio data according to the field and the intention and combining a logical reasoning algorithm and a language generation algorithm;

And performing voice synthesis on the answer characters through the voice synthesis module to obtain audio streams corresponding to the answer characters, and performing coding processing on the audio streams corresponding to the answer characters according to the transmission protocol of the communication plug flow to obtain the audio data of the answer characters.

In the intelligent agent dialogue system and method, after the voice recognition processing module converts the audio stream into the text information, the text information is only transmitted to the server side. The text data volume is far smaller than the audio data volume, so that the data load in the transmission process is greatly reduced. The lightweight transmission mode reduces the requirement on the bandwidth of the data network, so that the system can still efficiently communicate under the condition that the data network signal is unstable or the bandwidth is limited. Specifically, the system divides the whole conversation process into a plurality of functional modules, including a collection module, an audio conversion module, a voice recognition processing module, a voice synthesis module and an intention recognition and answer module at the server side. The acquisition module is responsible for acquiring a user voice instruction, converting the user voice instruction into audio data and providing an original input for subsequent processing. The audio conversion module intelligently judges and switches the transmission mode of the audio data according to the signal strength of the data network monitored in real time, and when the transmission mode is communication signal transmission, the audio data is encoded and processed by utilizing a communication plug flow protocol to generate an audio flow, so that the dependence on a data network is effectively reduced, the continuity and stability of communication can be maintained in a poor signal area, the application scene of the system is greatly expanded, and a user can keep smooth dialogue with an intelligent body no matter in remote mountain areas, underground parking lots or vehicles running at high speed. After receiving the audio stream, the voice recognition processing module analyzes in real time by utilizing the deep learning technology, extracts key characteristics of the audio signal and converts the key characteristics into text information. Meanwhile, the preliminary analysis result is optimized through the language acoustic model, and the accuracy and consistency of the text information are improved. The optimized text information is sent to an intention recognition and response module at the server side. The intention recognition and answer module is used for determining the field and intention of the text information by combining the context information after receiving the text information. The module uses a logical reasoning algorithm and a language generation algorithm to generate an answer text matched with the user intention. The process fully considers the consistency and consistency of the dialogue, and ensures the rationality and accuracy of the answer. After receiving the answer text, the voice synthesis module converts the answer text into a natural and smooth voice signal. The converted voice signal is processed by the audio conversion module again according to the transmission protocol of the communication plug flow to generate audio data of the answer audio. The invention adopts text transmission to replace audio stream by optimizing communication process, thereby remarkably reducing data transmission quantity, improving transmission efficiency and reducing delay and error caused by network congestion. The intelligent voice interaction system has the advantages of close cooperation of the modules, full play of the advantages of text transmission, intelligent transmission mode selection, efficient voice recognition and processing, intelligent intention recognition and answer generation, optimized voice synthesis and transmission and the like, is constructed, solves the problem of unstable conversation caused by strong dependence of voice interaction on data network signals in the prior art, and provides a smoother, natural and accurate voice interaction experience for users.

Drawings

FIG. 1 is a schematic diagram of a system for intelligent dialogue according to an embodiment of the present invention;

FIG. 2 is a flow chart of a method for intelligent agent dialogue according to another embodiment of the present invention.

Detailed Description

In order that the above objects, features and advantages of the invention will be readily understood, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. While the invention is susceptible of embodiment in the drawings, it is to be understood that the invention may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided to provide a more thorough and complete understanding of the invention. It should be understood that the drawings and embodiments of the invention are for illustration purposes only and are not intended to limit the scope of the present invention.

It should be understood that the various steps recited in the method embodiments of the present invention may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the invention is not limited in this respect.

The term "comprising" and variations thereof as used herein is meant to be open-ended, i.e., "including but not limited to," based at least in part on, "one embodiment" means "at least one embodiment," another embodiment "means" at least one additional embodiment, "some embodiments" means "at least some embodiments," and "optional" means "optional embodiment. Related definitions of other terms will be given in the description below. It should be noted that the concepts of "first", "second", etc. mentioned in this disclosure are only used to distinguish between different devices, modules or units, and are not intended to limit the order or interdependence of functions performed by these devices, modules or units.

It should be noted that references to "a" and "an" in this disclosure are intended to be illustrative rather than limiting, and those of ordinary skill in the art will appreciate that "one or more" is intended to be understood as "one or more" unless the context clearly indicates otherwise.

The names of messages or information interacted between the devices in the embodiments of the present invention are for illustrative purposes only and are not intended to limit the scope of such messages or information.

In view of the above problems associated with the related art, the present embodiment provides an agent dialogue system and method.

The intelligent agent dialogue system provided by the embodiment of the invention comprises a user end and a server end, wherein the output of the user end is in communication connection with the server end, the user end comprises a collecting module, an audio conversion module, a voice recognition processing module and a voice synthesis module which are in communication connection in sequence, the server end comprises an intention recognition and answer module, the output end of the voice recognition processing module of the user end is connected with the input end of the intention recognition and answer module of the server end, and the output end of the intention recognition and answer module is connected with the input end of the voice synthesis module of the user end.

Specifically, the user end sequentially communicates with the acquisition module, the audio conversion module, the voice recognition processing module and the voice synthesis module to form a front-end link of voice processing. The acquisition module is used as a starting point and is responsible for acquiring audio data of a user and providing original voice materials for subsequent processing. The audio conversion module takes on the key role of judging the transmission mode according to the data network signal strength of the user immediately after that, when the data network signal is not good, the audio conversion module can flexibly switch to communication signal transmission, and encodes the audio data according to the transmission protocol of the communication push stream to generate an audio stream, thereby ensuring the stable transmission of the audio data under different network conditions. The voice recognition processing module receives the processing task of the audio stream, analyzes the processing task in real time by using the deep learning technology, extracts key features of the audio signal and converts the key features into text information, optimizes the preliminary analysis result by means of a language acoustic model, and improves the accuracy and consistency of the text information. The output end of the module is connected with the input end of the intention recognition and answer module of the server end, so that seamless butt joint between the user end and the server end is realized. After the intention recognition and answer module at the server end receives the text information, the field and intention of the text information are determined by combining the context information, and an answer text matched with the intention of the user is generated by using a logical reasoning algorithm and a language generation algorithm. The output end of the intention recognition and answer module is connected with the input end of the voice synthesis module of the user end, and the voice synthesis module converts the answer words into natural and smooth voice signals and generates audio data.

The acquisition module is used for acquiring the audio data of the user.

Specifically, as an initial module at the user side, the function is to acquire audio data of the user, such as a microphone. The system is the forefront link of the whole intelligent agent dialogue system, ensures that voice information sent by a user can be accurately collected, provides original data input for a subsequent processing module, and the performance of the system directly influences the effectiveness and usability of the whole intelligent agent dialogue system.

The audio conversion module is used for setting the transmission mode of the audio data according to the data network signal intensity of the user side, and when the transmission mode is communication signal transmission, the audio data is encoded according to a communication push stream transmission protocol to obtain an audio stream corresponding to the audio data.

Specifically, the audio conversion module is located at the user side, and first, monitors the signal strength of the data network at the user side in real time, for example, invokes an Application Programming Interface (API) related to the network interface of the device to obtain the signal strength value, for example, in an android system, the signal strength of the network can be obtained by using the types ConnectivityManager, networkInfo, and the like. And then judging a transmission mode according to the set signal strength threshold value, and selecting a communication signal transmission mode when the signal strength of the data network is lower than the threshold value. The transmission protocol of the communication push stream may adopt a real-time transmission protocol (RTP) or the like, and the audio data is encoded according to the protocol, including packetizing the audio data, adding protocol header information (such as a serial number, a timestamp or the like) to ensure the sequence and synchronization of the audio data in network transmission, thereby obtaining an audio stream corresponding to the audio data, and facilitating the subsequent stable transmission in a communication channel.

The voice recognition processing module is used for carrying out real-time analysis through deep learning according to the audio stream to obtain an analysis result, and optimizing the analysis result through a language acoustic model to obtain text information corresponding to the audio data.

Specifically, the voice recognition processing module is located at the user end, and after receiving the audio stream, the voice recognition processing module analyzes the audio stream in real time by using a deep learning algorithm (such as a convolutional neural network CNN, a cyclic neural network RNN, etc.). The deep learning model is trained in advance through a large amount of voice data with labels, and can learn the mapping relation between voice signals and characters. In the parsing process, the model extracts features (such as mel frequency cepstrum coefficient MFCC) in the audio stream, predicts corresponding text content according to the features, and obtains a preliminary parsing result. Then, the analysis result is optimized using the language acoustic model. The language acoustic model can be based on a statistical language model or a neural network language model, and can adjust and correct words and sentences in the preliminary analysis result according to grammar rules of the language, collocation habits of words and the like, unreasonable contents are filtered out, and finally text information corresponding to more accurate audio data is obtained.

The intention recognition and answer module is used for determining the field and intention of the text information according to the text information corresponding to the audio data and combining the context information, and obtaining answer text of the audio data according to the field and the intention and combining a logical reasoning algorithm and a language generation algorithm.

Specifically, the intention recognition and answer module is located at the server end and is responsible for receiving the text information transmitted by the voice recognition processing module, and the field and intention of the text information are further determined by combining the context information. Specifically, first, the field and intention of the text information are determined in conjunction with the context information. The context information may include dialogue content prior to the user, dialogue scenes (e.g., whether in a driving scenario, a customer service scenario, etc.), and so forth. For example, if the user previously inquires about weather information during a conversation, if the conversation refers to "today is fit for short sleeves," the conversation can determine that the field is weather in combination with the context, and the intention is to inquire about the relationship between the weather condition and wearing clothes. After determining the domain and intent, a logical inference algorithm (e.g., rule-based inference algorithm, bayesian network inference algorithm, etc.) and a language generation algorithm (e.g., sequence-to-sequence generation model Seq2Seq, etc.) are utilized to generate the corresponding answer text. The logical reasoning algorithm can carry out logical analysis and deduction on the questions, and the language generation algorithm generates natural and fluent answer text contents according to the deduction result. In addition, if the user side is a device connected through Bluetooth, the Bluetooth protocol is utilized to transmit the reply text to the Bluetooth module of the device, and if the user side is connected through a mobile data network or a Wi-Fi network, the reply text is transmitted through a corresponding network protocol.

Specifically, the voice synthesis module is located at the user end, the voice synthesis can adopt a splicing synthesis method, a large number of voice phonemes and a corpus are stored in advance, and appropriate phonemes are selected from the corpus to splice according to the content of the answer text, so that corresponding voice signals are generated. A parameter synthesis method may also be used to generate a corresponding speech signal according to the content of the answer text by controlling parameters (such as pitch frequency, tone parameters, etc.) of the speech synthesizer. After the generated voice signal is processed (such as filtering, gain control and the like) by digital signals, an audio stream corresponding to the answer text is obtained, so that the audio stream better accords with the hearing habit of a user in the aspects of tone quality, speech speed, tone and the like. Finally, the answer audio data which can be used for playing is obtained, and the voice synthesis module can adjust parameters such as speech speed, intonation, tone and the like of the synthesized voice according to different scenes and requirements, so that the synthesized voice is more natural, smooth and emotional, and the acceptance and satisfaction of users on answer of the intelligent body are improved.

Optionally, the audio conversion module is specifically configured to:

acquiring the data network signal strength of the user side;

Specifically, on a user's smart device (e.g., smart phone, tablet, etc.), the audio conversion module obtains the data network signal strength by calling an Application Programming Interface (API) provided by the operating system of the device. For example, in an android system, telephonyManager classes can be used to obtain signal strength information. The method for acquiring the mobile signal strength can acquire the signal strength value of the data network (such as 4G, 5G and the like) to which the equipment is currently connected in real time, and is usually measured by parameters such as signal strength indication (RSSI, received Signal Strength Indicator) and the like. In order to timely and accurately reflect the change situation of the network signal, the audio conversion module may acquire the signal strength of the data network periodically (for example, every 1-2 seconds), or immediately acquire the updated signal strength when detecting that the network state may change (for example, when the user switches the network or enters or leaves the signal coverage area, etc. triggered by an event). This ensures that the system is able to quickly perceive changes in the network signal under different network environments.

In a software configuration file at the user side of the agent dialogue system, a signal strength threshold is determined based on a large amount of network test data and the actual user experience. Generally, the threshold is set at a value that ensures that no significant settling, delay, etc. of the signal strength of the normal conversation occurs while the audio data is being transmitted over the data network. For example, for a 4G network, the threshold is set at a certain RSSI value (e.g., about 85 dBm), when the signal strength is higher than this value, the data network is considered to have better transmission quality, and the audio data transmission can be performed normally, and when the signal strength is lower than this value, the data network may be unstable, and it is necessary to switch to the communication signal transmission mode. It should be noted that communication signaling relies mainly on conventional telecommunication network infrastructure, and common implementations include circuit-switched and packet-switched technologies. Circuit switching technology, such as public switched telephone network, establishes a special physical circuit for both parties of communication to ensure stable transmission of voice signals. The data network transmission mainly depends on the internet, and the high-efficiency transmission of the audio data is realized through a network protocol such as a real-time transmission protocol or a real-time message transmission protocol. The data network transmission has the advantages of fully utilizing the existing internet infrastructure, providing efficient and flexible communication service, and being particularly suitable for being used in a scene with good network conditions.

When the acquired data network signal strength is greater than or equal to a preset signal strength threshold, the audio conversion module sets a transmission mode of audio data to be transmitted by a data network. In this case, the audio data may be sent directly through an existing data network channel (e.g., a user's mobile data network connection). For example, if the user covers a good area in the 5G network, the signal strength is strong, the audio data can be quickly transmitted from the user terminal to the server terminal by using the high bandwidth and low delay characteristics of the 5G network, and the answer audio data processed by the server terminal can also be quickly returned to the user terminal through the 5G network. During data network transmission, audio data is encapsulated and transmitted according to a common network data transmission protocol (such as a TCP/IP protocol). When the signal intensity of the data network is smaller than a preset signal intensity threshold value, the audio conversion module switches the transmission mode of the audio data into communication signal transmission. Communication signaling typically utilizes a conventional communication network (e.g., a circuit-switched network), such as a communication channel upon which the handset relies for voice call functions, to transmit audio data. In an implementation, the audio data is encoded according to a transport protocol of the communication push stream. This coding process is typically implemented by specific speech coding algorithms (e.g., AMR-NB/WB, etc.) that compress audio data into an audio stream format suitable for transmission over a communication channel while ensuring that the audio quality is within an acceptable range. The encoded audio stream is then transmitted over a communication network using a communication module of the device (e.g., a baseband processor of a cell phone, etc.).

In the embodiment of the invention, the transmission mode is flexibly selected according to the signal strength of the data network, so that the situations of interruption, blocking or loss and the like of audio data transmission caused by signal difference when the data network signal is poor are effectively avoided. For example, when a user is in a place where the coverage of data network signals is poor, such as a basement or a corner of a building, the system can automatically switch to a communication signal transmission mode, and stable transmission of audio data is ensured by using the communication signals. Compared with the traditional mode of performing voice interaction only by depending on a data network, the mode greatly improves the stability of audio conversation, so that a user can keep smooth conversation with an intelligent body in various complex network environments. When the data network signal is poor, the communication signal transmission is used as a relatively stable transmission mode, so that the audio data can be ensured to be completely transmitted from the user terminal to the server terminal, and the answer audio data processed by the server terminal can be completely returned to the user terminal. Therefore, the situation that dialogue content is incomplete or user intention is misunderstood caused by data loss can be avoided, and the reliability and user experience of the intelligent dialogue system are improved. Under the condition that the data network signal is good, the communication signal resource can be saved by utilizing the data network to transmit the audio data. Meanwhile, the intelligent switching mechanism is also beneficial to reducing the flow consumption of a user in the use process of the data network, because the data network can be used for transmitting audio data only when the data network signal is good enough, and the situation that the flow is wasted due to repeated retransmission of the data when the data network signal is bad is avoided, so that the transmission quality is ensured. For the communication operators, the loads of the data network and the communication network can be balanced to a certain extent, and the resource utilization efficiency of the whole communication system is improved.

In particular, in an agent dialogue system, a voice recognition processing module is responsible for converting an audio stream into text information, and this process is critical to the efficiency and stability of the whole system. The module first frames the audio stream and divides the continuous audio signal into audio frames of a plurality of time periods, typically 20-30 ms per frame, depending on the short-term nature of the speech signal, i.e. the speech signal may be considered stationary in a short time. Taking an audio stream with a sampling rate of 16kHz as an example, every 320-480 sampling points constitute an audio frame. The method is characterized in that the method comprises the steps of receiving audio stream data from a buffer area, and storing the audio stream data into the buffer area to realize framing, wherein the audio stream data is circularly read in software, and the audio frame is extracted after the data quantity reaches the standard, and the special audio processing chip can also directly framing and processing the audio signal in hardware. Extracting acoustic features from each audio frame is a key step in speech recognition. Common acoustic features include mel-frequency cepstral coefficients (MFCCs), filter bank energies, linear Predictive Cepstral Coefficients (LPCCs), and the like. Taking MFCC as an example, the extraction process mainly includes pre-emphasis, framing and windowing, fast Fourier Transform (FFT), mel-filter bank processing, and Discrete Cosine Transform (DCT). Pre-emphasis can emphasize high frequency parts to make the frequency spectrum flatter, framing and windowing (such as hamming window) to reduce frequency spectrum leakage, FFT to convert time domain signals into frequency domain signals, the number of points is usually 256 or 512, mel filter bank processes the frequency spectrum to simulate human ear hearing characteristics, DCT (discrete cosine transform) extracts MFCC coefficients from the filter bank output, and the first 12-13 coefficients are used as acoustic characteristics. Deep learning models, such as Recurrent Neural Networks (RNNs), convolutional Neural Networks (CNNs), long short term memory networks (LSTM), are used to map acoustic features to speech text. In the training phase, a large number of voice data with text labels are utilized to train the model. For example, a TensorFlow or PyTorch framework is used to build the model. The LSTM is adapted to process the timing characteristics of the speech signal, its input being the acoustic signature sequence of the audio frame, and its output being the conditional probability distribution of the corresponding words for each time step. In actual use, the extracted acoustic features are input into a trained model, and the model outputs a preliminary speech recognition result which comprises word candidates and probability sequences thereof. The preliminary speech recognition result is then input into a language acoustic model, grammar and semantic constraint are carried out, and the language acoustic model is optimized into text information conforming to language specifications and semantic logic. The model is constructed based on a statistical or neural network language model, and the preliminary result is adjusted according to grammar rules and vocabulary collocation habits. For example, the combination of words which are not in accordance with grammar is replaced, so that the accuracy of outputting the text information is ensured.

In the embodiment of the invention, the framing process enables the voice signal to be better analyzed, and the acoustic feature extraction can extract important information capable of effectively representing the voice characteristic. The distinguishing capability of the voice signals can be improved through a proper framing method and an acoustic feature extraction algorithm. For example, MFCC features can well reflect the timbre, pitch, etc. characteristics of speech, making homophones more accurate to distinguish. Accurate acoustic features are input into the deep learning model, so that the recognition performance of the model can be improved. The deep learning model can fully mine potential mapping rules between acoustic features and characters, and can more accurately convert the acoustic features into characters compared with the traditional rule-based voice recognition method. The language acoustic model corrects some errors in the preliminary speech recognition result through grammatical and semantic constraints. For example, if the preliminary recognition result is a sentence which does not conform to grammar, the language acoustic model can select a more proper vocabulary from possible alternative vocabularies to replace wrong vocabularies according to grammar rules and semantic knowledge, so that more accurate text information is obtained, the vocabulary error rate in voice recognition can be effectively reduced, and the overall recognition accuracy is improved. Meanwhile, the efficient acoustic feature extraction algorithm can extract useful features in a short time, and the calculated amount of the whole voice recognition process is reduced. The optimized deep learning model can process the input acoustic characteristics in a short time and output a preliminary voice recognition result, so that the instantaneity of voice recognition is improved. For example, in a voice assistant application on some smartphones, voice recognition can be completed and a response given within a few seconds after the user speaks, which depends largely on the efficient reasoning capabilities of the deep learning model. By dividing the continuous audio signal into audio frames and extracting acoustic features, the speech recognition processing module can effectively reduce the amount of data. Compared with the direct transmission of the original audio data, the data volume of the text information is obviously reduced, the requirement on network bandwidth is reduced, delay and errors in the transmission process are reduced, and the system stability is improved.

Specifically, the continuous text information is segmented into words or word sequences by means of character string matching by utilizing a pre-built dictionary. For example, for chinese segmentation, a longest matching method, a forward maximum matching method, or the like may be used. Taking a forward maximum matching method as an example, starting from the beginning of a sentence, searching the longest matching word in the dictionary in turn until the sentence ends. For example, for the sentence "i love natural language processing", using the forward maximum matching method, matching is performed according to the words in the dictionary, and it is possible to match to "i", then "love", then "natural language" (assuming that there is a word in the dictionary), and finally "processing". The word or term is automatically segmented by training a statistical model. Such as using Hidden Markov Models (HMMs) or Conditional Random Fields (CRFs), etc. During the training process, the model learns the relationship between the occurrence probability of the word and the context. In word segmentation, word or word boundaries are determined from these probability relationships. For example, the CRF model may consider information about each word's left and right neighbors to determine whether the word belongs to a preceding word or a following word. And marking the part of speech for the word after word segmentation according to the marking part of speech and grammar rule of each word in the dictionary. For example, in a simple rule dictionary, "me" is labeled as a pronoun, "love" is labeled as a verb, and "natural language processing" is labeled as a noun. According to grammar rules, when words appear at different positions in sentences, the parts of speech are determined according to collocation relation and sentence structure. For example, "natural language processing" is used as an object of "love" in a sentence, and its part of speech is still a noun.

Training a machine learning model (such as HMM, CRF and the like) by using the training corpus marked with the parts of speech. The model learns the sexual word probabilities of each word under different context. In actual tagging, parts of speech are tagged according to the context of the context and the probability calculated by the model. For example, in the sentence "he plays basketball," the "play" is labeled as a verb, while in the "play water" the "play" would be labeled as a verb or other part of speech depending on the context, depending on the learning result of the model.

Taking Convolutional Neural Network (CNN) as an example, features of text are automatically extracted. In the training process, word information after Word segmentation and part-of-speech tagging is converted into a Word vector sequence (which can be obtained through a pre-trained Word vector model such as Word2Vec and Glove) and then input into a CNN model. The convolution layer in the CNN can extract local features, the pooling layer can reduce the dimension of the features, and finally, the intention classification result is output through the full connection layer. For example, for the intent classification of a customer service scenario, the model may learn the characteristics of different expressions corresponding to the intentions of consultation, complaints, advice, etc. A large number of training corpora with intention labels are prepared, and the corpora comprise text information of various fields and corresponding field and intention labels. The training corpus is input into the intent classification model, and the differences between the predicted intent and the actual intent are minimized by adjusting parameters of the model (e.g., weights, biases), etc. (e.g., by cross entropy loss functions, etc.). Through multiple rounds of iterative training, the model can learn how to accurately classify the intention according to the structural characteristics of the text information (such as basic structure and grammar characteristics obtained after word segmentation and part-of-speech tagging) and the context information. In a dialog system, multiple rounds of dialog content between the user and the agent are recorded. For example, text information, intentions, fields, etc. of the previous call are stored by a queue or list. When semantic understanding and classification of the current text information is required, the content of the previous dialog can be referenced to better determine its domain and intent. In the intent classification model, the current text information may be spliced or fused with the context information. For example, a word vector sequence of a current sentence and a word vector sequence of a preceding sentence are spliced and then input together into a model. Or use Attention mechanisms (Attention) to let the model focus on the context parts related to the current intent. For example, in a multi-round conversation, if the user previously mentioned a consultation about a trip, then in a subsequent conversation the model would be more inclined to focus on the trip-related content to determine intent. For some rule-specific domains, reasoning can be based on predefined knowledge rules. For example, in the field of mathematical computation, answers are derived according to four rules of operation. If the intent of the textual information is to evaluate a mathematical expression, the logical inference algorithm may calculate and generate a result according to the rules of operation. The bayesian network may represent probabilistic relationships between variables. And after the field and the intention of the text information are determined, calling a corresponding reasoning method from a corresponding logical reasoning rule base or model base. For example, in the area of travel planning, if an optimal travel route is being interrogated, a logical inference algorithm may plan the route based on rules such as travel time, location, distance between attractions, etc. And reasoning the input information by using a reasoning rule or model. For example, in the field of questions and answers, if the intention is to query the time of an event, a logical inference algorithm may find the corresponding time from a record of the event in an existing knowledge base and generate a result.

In some fields, answer templates may be predefined. For example, in the query weather field, a template may be designed for a response to query weather such as "weather today [ place ] is [ weather conditions ], and temperature range is [ temperature range ]". And filling specific information (such as places, weather conditions, temperatures and the like) in the logical reasoning result into the template to generate the answer text. Taking the Seq2Seq model as an example, it consists of an encoder and a decoder. The encoder encodes the input logical reasoning result and other relevant information (such as context, etc.) into a fixed-length vector, and the decoder generates answer words according to the vector. During the training process, a large number of question-answer pairs are used to train a model that learns how to generate appropriate answers based on different inputs. The logical reasoning results, possibly needed context information, etc. are provided as inputs to the language generation algorithm. For example, in a multi-round dialog, it is necessary to provide the content of the previous dialog in addition to the logical reasoning results in order to generate a coherent answer. According to the input information, the language generating algorithm generates answer words according to the trained model structure and parameters. For example, in a deep learning based model, each word of the answer is generated step by a decoder until a complete sentence is generated as the answer.

In the embodiment of the invention, the word segmentation and the part-of-speech tagging can clearly analyze the basic structure and the grammar characteristics of the text information, and provide an accurate basis for subsequent semantic understanding. The intention classification model combines the structural features and the context information, so that semantic understanding can be performed on the text information more accurately, and wrong intention judgment caused by misunderstanding of grammar or semantics is avoided. Considering context information enables intent recognition to take into account the consistency and overall semantics of the dialog. For example, in a multi-turn conversation, the user may omit some information, and through fusion of context information, the intent classification model may accurately determine the intent of the current text information in conjunction with the content of the previous conversation. For example, the user firstly inquires "what scenic spots are in Beijing" and then inquires "ticket prices of the scenic spots", and the model can judge "the scenic spots" refer to the scenic spots in Beijing by combining the context, so that the intention of accurately identifying the second problem is to inquire the ticket prices of the scenic spots in Beijing. The rule-based logic reasoning algorithm can provide accurate and reliable reasoning results in the field of rule definition, and the probability reasoning algorithm based on Bayesian networks and the like can provide reasonable reasoning results in the field of uncertainty. The template-based language generation algorithm can ensure the accuracy of the format and content of the answer, especially in the structured answer scene such as information query. The language generation algorithm based on deep learning can generate more natural and flexible answers, and can generate the answers which are consistent and matched with the requirements of users according to different contexts. For example, for an open question asking the travel experience, a deep learning-based language generation algorithm may generate rich and attractive answers based on logical reasoning results (e.g., characteristics of travel location, user interest preferences, etc.), improving user satisfaction with the answers. Meanwhile, the interactive method and the interactive system interact with the user side in the form of answer words, data volume is reduced by using word transmission, transmission efficiency and stability are improved, and continuity and accuracy of multiple rounds of conversations are ensured.

Optionally, the voice synthesis module is specifically configured to:

In particular, common speech synthesis models include splice-based methods (Concatenative Synthesis), parameter-based methods (PARAMETRIC SYNTHESIS), and deep learning-based methods (e.g., tacotron, waveNet, etc.). Among them, the deep learning-based model is currently excellent in speech synthesis quality. For example, tacotron model is an end-to-end speech synthesis system that can directly convert text to spectral features of speech, which are then converted to the original speech waveform by a vocoder (Vocoder) such as WaveNet. The model is trained using a large number of text-tagged speech data, which typically includes speech segments of different speakers (Speaker), different speech rates, and different intonation. The preset acoustic parameters include acoustic characteristics (such as pitch frequency, tone, etc.), speech speed, intonation, etc. of the speaker. During the training process, the model learns how to map text to speech signals that meet preset acoustic parameters. For example, the quality of the synthesized speech is improved by adjusting the super-parameters of the model (e.g., learning rate, hidden layer size, etc.) and optimizing the objective (e.g., minimizing the difference between the generated speech and the real speech). The answer words are preprocessed, including operations such as standardization of the text (e.g. converting numbers and special symbols into text forms), word segmentation, etc. For example, "20%" is converted to "twenty percent" so that the speech synthesis model can better understand text content. And inputting the preprocessed text into the trained speech synthesis model. The model generates a corresponding voice signal according to preset acoustic parameters. For example, in the generating process, the model controls the playing speed of the voice according to the preset speech speed parameter, and generates a voice signal with a proper intonation according to the preset intonation parameter.

Noise suppression and echo cancellation processing are performed on a speech signal, and a short-time fourier transform (STFT) is first performed on the speech signal to convert a time-domain signal into a frequency spectrum. In the spectrum domain, the power spectrum of noise is estimated, and then the estimated noise power spectrum is subtracted from the power spectrum of the voice signal to obtain a purer voice spectrum. For example, in a quiet environment, a spectrum of clean speech is acquired as a reference, and when noise is contained in the speech signal, the noise is suppressed by subtracting the noise spectrum. Finally, the processed spectrum is converted back into a time domain signal by an Inverse Short Time Fourier Transform (ISTFT). Noise is suppressed using deep learning models (e.g., LSTM, CRNN, etc.). The input of the model is the spectrum of the noisy speech signal and the output is the spectrum of the clean speech signal. During the training process, a model is trained using a large amount of noisy speech and corresponding clean speech data. For example, a trained deep learning noise suppression model can effectively remove noise and improve speech quality when interference such as background music, environmental noise and the like exists in a speech signal. Echoes are typically generated as a result of the voice signal emitted by the speaker being picked up by a microphone. The adaptive filter may simulate an echo signal received by the microphone. By constantly adjusting the parameters of the adaptive filter, its output signal is made as close as possible to the actual echo signal. Then, the output signal of the adaptive filter is subtracted from the echo-containing speech signal received by the microphone to obtain a cleaner speech signal. For example, in a full duplex call system, the adaptive filter can adjust the filter coefficients in real time according to the voice signal played by the speaker and the signal received by the microphone, so as to effectively cancel the echo. Similar to noise suppression, echo cancellation may be implemented using a deep learning model. The input of the model is an echo-containing speech signal, and the output is a de-echoed speech signal. Through extensive data training, the model can learn the difference between the echo and the clean speech, thereby effectively removing the echo.

Common audio coding formats include PCM (pulse code modulation), MP3, opus, or the like. In a speech synthesis system, an efficient coding format is generally selected in consideration of characteristics and transmission efficiency of a speech signal. For example, opus coding is a low-delay and high-compression-ratio audio coding format, is suitable for real-time voice communication, and can reduce the size of voice data and improve the transmission efficiency while guaranteeing the voice quality. The speech signal after noise suppression and echo cancellation processing is encoded according to the specifications of the selected encoding format. For example, in Opus coding, a speech signal is divided into small blocks, each of the small blocks is subjected to spectral analysis, quantization, and the like, and the quantized data is packaged to generate an audio stream conforming to the Opus format. Depending on the actual communication environment and requirements, a suitable communication push stream transmission protocol is selected, such as RTP (real time transmission protocol), RTMP (real time messaging protocol), etc. The RTP protocol is taken as an example of a network control protocol for transmitting real-time data, such as audio, video, over existing networks. Before the audio stream is sent to the user side, the audio stream is packetized according to the RTP protocol, and necessary header information (such as a sequence number, a time stamp, etc.) is added so that the audio signal can be correctly decoded and played at the receiving side.

In the embodiment of the invention, the answer text is converted into the voice signal through the voice synthesis module and is efficiently transmitted, so that the stability and the user experience of the intelligent agent dialogue system are obviously improved. Speech synthesis models, such as Tacotron and WaveNet based on deep learning, efficiently convert text to speech signals. During model training, a large amount of voice data with text labels is used to cover different pronounciators, speech speeds and intonation so as to generate natural voice. The preset acoustic parameters ensure that the tone and intonation of the voice signal meet the user's expectations and improve the voice quality. During speech synthesis, the system pre-processes the answer words, including text normalization and word segmentation, so that the model better understands the text content. The generated voice signal is subjected to noise suppression and echo cancellation processing, so that the voice definition and the voice intelligibility are improved. For example, short-time fourier transform (STFT) converts a time-domain signal into a frequency spectrum, estimates and subtracts the noise power spectrum, and deep learning models can also effectively remove background noise and echoes. The processed voice signal is encoded according to a communication push stream transmission protocol, and an audio stream corresponding to the answer text is generated. The system selects a high-efficiency audio coding format such as Opus, so that the size of voice data is greatly reduced, and the transmission efficiency is improved. By combining with a text transmission mode, the system reduces the data volume and the requirement on network bandwidth, reduces delay and error rate and ensures smooth conversation. Under the environment of low bandwidth or unstable network, the efficient coding format and protocol ensure stable transmission of voice signals and avoid interruption or delay. The text transmission mode optimizes the communication process, reduces the data volume and improves the transmission efficiency and the system stability. Through the efficient processing and transmission mechanism of the voice synthesis module, the user side can timely and clearly receive the answer of the intelligent agent, the naturalness and fluency of the dialogue are enhanced, and more stable and efficient intelligent agent dialogue experience is provided for the user.

Optionally, the audio conversion module is specifically further configured to:

Specifically, the audio conversion module receives raw audio data, typically represented in PCM (Pulse-Code Modulation) format, from an audio acquisition device (e.g., microphone) or audio processing module, and pre-processes the PCM data, including but not limited to removing silence segments, gain adjustments, and audio format conversion, to optimize audio quality and reduce unnecessary data transmission. The PCM data after processing is divided into data blocks with proper size, and the size of each data block is about 1000 bytes generally so as to adapt to the requirement of network transmission. UDP protocol header information is added for each data block. The UDP header includes fields for source port, destination port, data length, and checksum, etc., which ensure proper routing and integrity checking of the data in the network. Audio data is encoded using a specific audio coding algorithm (e.g., AAC, MP3, etc.) to reduce the amount of data and improve transmission efficiency. The coding algorithm is selected by comprehensively considering the factors such as audio quality, compression rate, computational complexity and the like. The encoded audio data and the UDP header information are combined together to form complete UDP datagrams, which are network transmission data of the audio data and can be sent to a designated destination. The intention recognition and answer module processes the decoded audio data, extracts semantic information in the decoded audio data, and recognizes intention and requirement of a user. According to the recognition result, the module generates corresponding answers, and the text forms are more convenient for semantic generation and logic reasoning, so that the contents and the logic structure of the answers can be more accurately expressed, and the answers are usually text.

In the embodiment of the invention, UDP is used as a connectionless transmission protocol, has lower delay and can rapidly send out the audio data, so that the audio signal can be transmitted in a network in real time, and the fluency of conversation is ensured. The UDP protocol has a small header overhead of only 8 bytes, and can more efficiently utilize network bandwidth than other protocols (e.g., TCP). Meanwhile, the audio coding algorithm can further reduce the data volume and improve the transmission efficiency. The audio conversion module can effectively convert the audio data into network transmission data suitable for communication signal transmission, and stability and efficiency of audio conversation are improved.

Optionally, the voice synthesis module is specifically configured to:

In particular, the signal strength of a data network is obtained in real time through a network interface of a device, and is generally measured by parameters such as signal strength indication (RSSI) or Received Signal Strength (RSS). These values may be obtained through an API or network driver provided by the operating system. And comparing the signal intensity of the data network with a preset intensity interval, and mapping the signal intensity to the speech speed, intonation, volume and tone parameters of the speech synthesis model. For example, when the signal strength is high, the network condition is good, the user may be in a quiet and stable environment, the speech speed can be properly increased, the speech synthesis is smoother, the volume can be properly reduced, the intonation can be smoother, and the tone can be softer. Conversely, when the signal strength is low, the user may be in a noisy or network unstable environment, the speech speed may be appropriately reduced, the volume is increased, the tone is clearer, and the tone is brighter, so as to ensure that the user can clearly hear the speech synthesis content.

And according to the mapping result, adjusting the speech speed parameters of the speech synthesis model. In the process of speech synthesis, different speech speeds are realized by controlling the playing speed of a speech signal. For example, in a deep learning-based speech synthesis model, the hyper-parameters controlling the speech rate in the model may be adjusted, or the speech signal may be compressed or expanded in the time axis during speech generation to change the speech rate. The adjustment of the pitch parameters may be achieved by changing the fundamental frequency of the speech signal. In the speech synthesis model, the fundamental frequency determines the pitch of speech, thereby affecting intonation. The intonation may be changed by adjusting parameters associated with the fundamental frequency in the model, or by modifying the fundamental frequency after speech generation. For example, the fundamental frequency of the speech signal is shifted up or down in the frequency domain to achieve different intonation effects. The volume parameter is mainly adjusted to control the amplitude of the voice signal. In a speech synthesis model, the volume may be changed by adjusting the magnitude value of the model output. Or after the voice is generated, the voice signal is amplified or reduced to achieve the purpose of adjusting the volume. The adjustment of the tone color parameters relates to the spectral characteristics of the speech signal. The tone color may be changed by changing parameters related to the spectrum in the speech synthesis model, or by filtering the spectrum after speech generation, or the like. For example, different vocoders or filters are used to shape different timbres.

And updating the preset acoustic parameters of the speech synthesis model according to the adjusted speech speed, intonation, volume and tone parameters. Within the model, these parameters may affect various links of speech generation, such as spectral generation of the speech signal, fundamental frequency control, amplitude adjustment, etc. For example, in Tacotron-2 speech synthesis models, optimization of speech speed, intonation, volume, and timbre may be achieved by adjusting parameters associated with mel spectrum generation modules, fundamental frequency prediction modules, etc. in the model. In a preferred embodiment, the optimized model is further trained using a small amount of adaptation data to better accommodate current network conditions and user requirements. The adaptive data may be speech data collected in a network-like environment, and the model is enabled to better generate a satisfactory speech signal by fine-tuning parameters of the model.

In the speech synthesis process, a forward error correction coding technology is adopted to add redundant information to the generated speech data. For example, in the process of voice data transmission, even if part of data is lost or damaged, the receiving end can recover the original voice data according to the redundant information by using a Reed-Solomon code (Reed-Solomon Codes) and other coding modes. In the speech synthesis module, the speech data may be divided into a plurality of data blocks during the encoding phase of the speech signal, and then a corresponding error correction code is generated for each data block and appended to the data block. In addition to forward error correction coding, a redundant coding strategy may be employed, i.e., multiple encodings or backups of important speech data. For example, the key frames or important characteristic parameters in the voice signal are redundantly encoded, so that the receiving end can reconstruct the voice signal according to the redundant data even if part of the data is lost in the data transmission process. In implementation, the level and manner of redundancy coding may be determined based on the data network signal strength and the importance of the voice data. And converting the answer text into a voice signal through the optimized voice synthesis model. In the speech synthesis process, speech data is processed according to forward error correction and redundancy coding strategies. For example, when generating a voice data stream, error correction codes and redundant data are added in accordance with a predetermined encoding rule. And then, packaging the processed voice data according to a transmission protocol of the communication plug flow, generating an audio stream corresponding to the answer text, and preparing for sending to a user terminal.

In the embodiment of the invention, the parameters of the voice synthesis model are dynamically adjusted according to the signal intensity of the data network, so that the voice synthesis can adapt to different network environments and user scenes. The method provides smoother and natural voice synthesis under the environment of strong signals and stable network, ensures the voice synthesis to be clear and intelligible under the environment of weak signals and unstable network, meets the requirements of users under various conditions, and improves the adaptability and reliability of the intelligent dialogue system. The adjusted speech speed, intonation, volume and tone parameters can better match the hearing requirements and usage scenarios of the user. For example, increasing volume and clarity in a noisy environment allows a user to more easily hear speech synthesis content, providing softer, natural speech in a quiet environment, and increasing the user's auditory comfort. In this way, the user experience is optimized, making the user more willing to use the agent dialog system. The optimized voice synthesis model can generate voice signals which more meet the current network conditions and the requirements of users. The forward error correction and redundancy coding strategy can effectively cope with the loss and damage of voice data in the transmission process, and ensure the integrity and accuracy of voice signals. For example, under the condition of network fluctuation or data packet loss, the receiving end can recover the original voice signal by using error correction codes and redundant data, thereby reducing voice interruption or distortion caused by data loss and improving the overall quality of voice synthesis. The forward error correction and redundancy coding strategies increase the fault tolerance of the voice data and reduce the requirements on the reliability of network transmission. Even under the condition of poor network condition, stable transmission of voice data can be ensured, conversation interruption or delay caused by network problems is reduced, and stability and usability of the intelligent conversation system are improved.

Specifically, the signal strength of the data network is continuously monitored, and the change of the signal strength with time is recorded. In addition to signal strength, other network related metrics such as packet loss rate, delay, etc. are collected. Such data may be obtained through a network interface API or network diagnostic tool of the device. The monitoring time period is divided into a plurality of small time windows, for example, a window is formed every 10 seconds, and statistical indexes such as signal intensity average value, variance, packet loss rate, delay and the like in each window are calculated, wherein the indexes form network fluctuation data. The rate of change of signal strength between different time windows is calculated, e.g., the difference in signal strength between adjacent time windows divided by the signal strength of the previous window, to yield the relative rate of change of signal strength. Meanwhile, the variance of the signal intensity is calculated to measure the stability of the signal intensity. And counting the packet loss rate and delay change condition in each time window. For example, the standard deviation of the packet loss rate and the statistical indexes such as the maximum value, the minimum value, the average value and the like of the delay are calculated so as to comprehensively reflect the fluctuation condition of the network.

Suitable time series prediction models are selected, such as autoregressive moving average (ARIMA), exponential smoothing (Exponential Smoothing), or long-term memory network (LSTM), etc. Taking LSTM as an example, the method can process long-short-term dependency relationship in time sequence data, and is suitable for predicting network fluctuation data. The model is trained using the collected historical network fluctuation data. The historical data is divided into training and testing sets, and the difference between the predicted value and the actual value, such as a Mean Square Error (MSE) loss function, is minimized by adjusting parameters of the model, such as the hidden layer size of the LSTM, the learning rate, and the like. The latest network fluctuation data (including statistical indexes such as signal strength, packet loss rate, delay and the like) are used as the input of the model. And carrying out preprocessing operations such as normalization and the like on the input data so as to enable the input data to meet the input requirements of the model. The model predicts network fluctuation data for a future period of time (e.g., for the next 30 seconds) from the input data, including predicted values of signal strength variation trend, packet loss rate, delay, and the like. These predictive data will be used for subsequent answer text transmission planning.

In the embodiment of the invention, the influence of network fluctuation on audio transmission can be dealt with in advance by predicting future network fluctuation data and making a transmission plan according to the future network fluctuation data. For example, when the network packet loss rate is predicted to increase, the transmission frequency is reduced and the transmission plan is adjusted, so that the audio interruption or the blocking phenomenon caused by packet loss is reduced. The self-adaptive transmission strategy improves the transmission stability of the answer text under the condition of network fluctuation, and ensures that users can smoothly receive audio content.

Specifically, the transmission frequency of the answer text is determined according to the predicted future network fluctuation data. If the network fluctuation is predicted to be large (such as the signal strength is reduced and the packet loss rate is increased), the transmission frequency is reduced so as to reduce the data loss and retransmission caused by the network fluctuation. Conversely, if the predicted network condition is stable, the transmission frequency can be appropriately increased to increase the transmission speed of the answer text. For example, the transmission frequency may be dynamically calculated based on the predicted packet loss rate and delay. It is assumed that the basic transmission frequency is 20 text segments per second in a stable network environment. When the predicted packet loss rate increases by 10%, the transmission frequency is reduced by 10%, namely 18 text fragments are transmitted per second, and when the predicted delay increases beyond a certain threshold value, the transmission frequency is also reduced. And dividing the answer text into a plurality of text fragments according to the determined transmission frequency. And generating a transmission plan of the answer text according to the transmission frequency and the sequence of the text fragments. The transmission schedule includes information such as transmission time, transmission order, and transmission data amount of each text segment. For example, each segment is arranged in turn to be sent at a corresponding point in time in the order of the text segments, ensuring continuity and integrity of the answer text. And optimizing the transmission plan by considering the network fluctuation prediction result. For example, if network fluctuations are predicted to be large within a certain period of time, some key text segments may be sent in advance, or a certain buffer time may be reserved in the transmission plan to cope with possible network delays. And the audio transmission module transmits the answer text of the answer audio to the user side according to the generated transmission plan. In the process of sending, the network condition is monitored in real time, and if the actual network condition has larger deviation from the predicted result, the sending plan can be dynamically adjusted. For example, if the actual network fluctuation is smaller than the predicted one, the transmission frequency can be appropriately increased to increase the transmission speed of the answer text.

According to the method and the device for sending the text fragments, the sending sequence and the sending time of the text fragments are reasonably arranged according to the network fluctuation prediction result, and the risk of losing the answer text in the transmission process can be reduced. For example, when the network condition is poor, the key text segments are preferentially sent, or the sending frequency and the buffering time are adjusted, so that the receiving end can better reorganize the answer text, the audio quality problem caused by data loss is reduced, and the reliability of audio transmission is improved. When the network condition is stable, the transmission frequency is properly increased according to the prediction result, and the transmission speed of the answer text can be increased. For example, when the future network fluctuation is predicted to be small and the bandwidth is sufficient, the number of text fragments transmitted per second is increased, so that the user can receive complete audio content more quickly, the waiting time of the user is shortened, and the response speed of the intelligent agent dialogue system is improved. Dynamically adjusting the transmission frequency and the transmission schedule enables more rational utilization of network resources. Under different network environments, the sending strategy of the answer words is adjusted according to actual conditions, so that the situation that the network bandwidth is excessively occupied when the network condition is good and resource waste caused by blind sending when the network fluctuates is avoided. The optimization strategy improves the utilization efficiency of network resources, and enables the dialogue with the intelligent agent to be smoother and more stable.

Referring to fig. 2, the method for intelligent agent dialogue is applied to the intelligent agent dialogue system, wherein the intelligent agent dialogue system comprises a user end and a server end, and the output of the user end is in communication connection with the server end;

the agent dialogue method comprises the following steps:

acquiring audio data of a user through the acquisition module;

The advantages of the agent dialogue method of the present invention compared with the prior art are the same as those of the agent dialogue system described above compared with the prior art, and will not be described here again.

Although the present disclosure is disclosed above, the scope of the present disclosure is not limited thereto. Various changes and modifications may be made by one skilled in the art without departing from the spirit and scope of the disclosure, and these changes and modifications will fall within the scope of the disclosure.

Claims

1. An agent dialogue system, comprising a user end and a server end, wherein the user end and the server end are in communication connection;

The user end includes a collection module, an audio conversion module, a speech recognition processing module and a speech synthesis module which are connected in sequence; the server end includes an intention recognition and answering module;

The output end of the speech recognition processing module of the user end is connected to the input end of the intention recognition and answering module of the server end, and the output end of the intention recognition and answering module is connected to the input end of the speech synthesis module of the user end;

The acquisition module is used to obtain the user's audio data;

The audio conversion module is configured to set a transmission mode for the audio data according to the data network signal strength of the user terminal, and when the transmission mode is communication signal transmission, encode the audio data according to the transmission protocol of the communication push stream to obtain an audio stream corresponding to the audio data;

The speech recognition processing module is used to perform real-time analysis on the audio stream through deep learning to obtain an analysis result, and optimize the analysis result through a language acoustic model to obtain text information corresponding to the audio data;

The intention recognition and response module is used to determine the domain and intent of the text information corresponding to the audio data in combination with context information; and to obtain the response text of the audio data based on the domain and intent in combination with a logical reasoning algorithm and a language generation algorithm;

The speech synthesis module is used to perform speech synthesis on the answer text to obtain the audio stream corresponding to the answer text, and encode the audio stream corresponding to the answer text according to the transmission protocol of the communication push stream to obtain the audio data of the answer text.

2. The agent dialogue system according to claim 1, wherein the audio conversion module is specifically configured to:

Obtaining the data network signal strength of the user terminal;

Setting the transmission mode of the audio data according to a magnitude relationship between the data network signal strength and a preset signal strength threshold;

Wherein, when the data network signal strength is less than the preset signal strength threshold, the transmission mode of the audio data is set to the communication signal transmission;

When the data network signal strength is greater than or equal to the preset signal strength threshold, the transmission mode of the audio data is set to data network transmission.

3. The intelligent agent dialogue system according to claim 1, wherein the speech recognition processing module is specifically configured to:

Performing frame processing on the audio stream to divide the continuous audio signal in the audio stream into audio frames of multiple time periods;

Extracting acoustic features from each of the audio frames to obtain acoustic features of the audio frame;

Inputting the acoustic features into a deep learning model for nonlinear mapping to generate preliminary speech recognition results;

The preliminary speech recognition result is subjected to grammatical and semantic constraints by means of the language acoustic model to obtain the text information corresponding to the audio data.

4. The intelligent agent dialogue system according to claim 1, wherein the intention recognition and response module is specifically configured to:

Performing word segmentation and part-of-speech tagging on the text information to obtain the basic structure and grammatical features of the text information;

Using an intent classification model, based on the basic structure and the grammatical features, combined with contextual information, semantically understand and classify the text information to obtain the domain and intent of the text information;

According to the domain and the intention, combined with the logical reasoning algorithm, generate a logical reasoning result corresponding to the domain and the intention;

Then, the language generation algorithm is used to generate the answer text according to the logical reasoning result.

5. The intelligent agent dialogue system according to claim 1, wherein the speech synthesis module is specifically configured to:

Converting the answer text into a speech signal according to preset acoustic parameters using a speech synthesis model;

Performing noise suppression and echo cancellation processing on the speech signal to obtain the processed speech signal;

The processed voice signal is encoded according to the transmission protocol of the communication push stream to generate the audio stream corresponding to the answer text.

6. The agent dialogue system according to claim 2, wherein the audio conversion module is further configured to:

When the transmission mode is the data network transmission, the audio data is encoded according to the User Datagram Protocol to obtain network transmission data of the audio data.

7. The intelligent agent dialogue system according to claim 6, wherein the speech synthesis module is specifically configured to:

When the transmission mode of the audio data is the data network transmission, determining the speech rate parameter, intonation parameter, volume parameter and timbre parameter of the speech synthesis model according to the data network signal strength;

Optimizing preset acoustic parameters of a speech synthesis model according to the speech rate parameter, the intonation parameter, the volume parameter, and the timbre parameter to obtain the optimized speech synthesis model;

The answer text is dynamically speech synthesized by using the optimized speech synthesis model in combination with forward error correction and redundant coding strategies to obtain the audio stream corresponding to the answer text.

8. The intelligent agent dialogue system according to claim 1, wherein the intention recognition and response module is specifically configured to:

determining network fluctuation data of the user terminal according to the data network signal strength;

Predicting based on the network fluctuation data to obtain future network fluctuation data during the transmission of the answer text;

The answer text is sent to the user terminal according to the future network fluctuation data.

9. The intelligent agent dialogue system according to claim 8, wherein the intention recognition and response module is further configured to:

determining a transmission frequency of the reply text according to the future network fluctuation data, and dividing the reply text into a plurality of text segments;

generating a sending plan for the reply text according to the transmission frequency of the reply text and the text fragment;

The text segment of the answer text is sent to the user terminal according to the sending plan.

10. An agent dialogue method, characterized in that it is applied to the agent dialogue system according to any one of claims 1 to 9, wherein the agent dialogue system includes a user end and a server end, and the output of the user end is communicatively connected to the server end;

The agent dialogue method comprises:

Acquire the user's audio data through the acquisition module;

The audio conversion module sets a transmission mode for the audio data according to the data network signal strength of the user terminal. When the transmission mode is communication signal transmission, the audio data is encoded according to the transmission protocol of the communication push stream to obtain an audio stream corresponding to the audio data.

The speech recognition processing module performs real-time analysis based on the audio stream through deep learning to obtain an analysis result, and optimizes the analysis result through a language acoustic model to obtain text information corresponding to the audio data;

The intention recognition and answer module determines the domain and intention of the text information according to the text information corresponding to the audio data in combination with context information; and obtains the answer text of the audio data in combination with the logical reasoning algorithm and the language generation algorithm based on the domain and the intention;

The answer text is speech-synthesized by the speech synthesis module to obtain an audio stream corresponding to the answer text, and the audio stream corresponding to the answer text is encoded according to the transmission protocol of the communication push stream to obtain audio data of the answer text.