CN120434333A

CN120434333A - System and method for telephone robot interaction acceleration based on large model

Info

Publication number: CN120434333A
Application number: CN202510359840.5A
Authority: CN
Inventors: 胡家鹰; 李全忠; 何国涛; 蒲瑶
Original assignee: Puqiang Times Zhuhai Hengqin Information Technology Co ltd
Current assignee: Puqiang Times Zhuhai Hengqin Information Technology Co ltd
Priority date: 2025-03-25
Filing date: 2025-03-25
Publication date: 2025-08-05

Abstract

The invention provides a system and a method for accelerating telephone robot interaction based on a large model. The system comprises: a call platform IVR (interactive voice response), which provides automated communication services over the telephone, controls telephone communication in the telephone robot dialogue system, and forms a complete telephone robot system by interfacing with a speech synthesis engine and a session management service; a speech synthesis engine TTS (text to speech), which converts text information into natural-sounding speech; and a large model LLM, a deep learning model with a large number of parameters and a complex structure, used for processing large amounts of data in the fields of natural language processing and speech recognition. The IVR is connected with the TTS, and the LLM is connected with the TTS. In the invention, the large model, the speech synthesis engine and the call platform playback work simultaneously: conversation prompt fragments are synthesized one by one and the resulting voice fragments are played one by one. This optimizes the delay of each link of the telephone robot dialogue system, accelerates both the synthesis link and the playback link, and achieves a low-delay human-computer interaction experience.

Description

System and method for telephone robot interaction acceleration based on large model

Technical Field

The invention relates to the technical field of intelligent telephone voice robots, in particular to a telephone robot interactive acceleration system and method based on a large model.

Background

In outbound-call robot projects based on a large model, the large model understands user input better than a traditional semantic understanding engine, but returning its complete result takes a long time, usually 5 to 10 seconds. A traditional telephone robot starts playback only after receiving the complete prompt result, so the low-delay requirement of human-computer interaction cannot be met.

In other words, although the semantic understanding capability of a telephone robot system based on a large model is far superior to that of a traditional semantic understanding engine, the very long delay of large model calls makes it difficult to introduce a large model into a telephone voice robot system.

Disclosure of Invention

In view of the above, the present invention aims to provide a system and a method for interactive acceleration of a phone robot based on a large model, which optimize the delay of each link of a dialogue system of the phone robot, directly communicate with a speech synthesis engine by using a session management service, and transfer sentences to be synthesized generated by the large model to the speech synthesis engine in a sentence-by-sentence transfer manner.

The invention provides a telephone robot interactive acceleration system based on a large model, which comprises:

the call platform IVR is used for providing automated communication services over the telephone and controlling telephone communication in the telephone robot dialogue system, implementing telephone operations such as answering calls, making outbound calls, transferring calls, playing announcements and collecting digits, and hanging up, and forming a complete telephone robot system by interfacing with the speech synthesis engine and the session management service;

a speech synthesis engine TTS for converting text information into natural sounding speech;

specifically, the speech synthesis engine TTS converts the text answers of the telephony robot dialog system into speech output, enabling the user to receive information audibly.

The technical options of the speech synthesis engine TTS include:

Concatenative TTS: pre-recorded voice fragments are spliced together.

Parameter-based TTS: neural network models such as WaveNet and Tacotron.

The implementation steps of the speech synthesis engine TTS include:

Text analysis: preprocessing the text, for example word segmentation and prosody prediction.

Speech synthesis: generating speech waveforms from the text features.

Voice optimization: adjusting the speaking rate, pitch and the like to improve naturalness.

The large model LLM is a deep learning model with a large number of parameters and a complex structure and is used for processing a large amount of data in the fields of natural language processing and voice recognition;

the calling platform IVR is connected with the speech synthesis engine TTS, and the large model LLM is connected with the speech synthesis engine TTS.

Further, the system for phone robot interaction acceleration based on the large model further comprises:

the session management service DM is used for designing the human-machine dialogue flow, configuring parameters such as the prompt, pause time, hot words and dynamic model for each node in the flow, and configuring the rules and models for semantic understanding, thereby controlling the flow of multi-round human-machine interaction;

the call platform IVR is connected with the session management service DM which is respectively connected with the large model LLM and the speech synthesis engine TTS.

In particular, the session management service DM is the core of the dialog system for maintaining dialog states and deciding on the next action.

The technical options of the session management service DM include:

Rule-based systems: using predefined rules and state machines.

Machine-learning-based systems: using reinforcement learning or policy networks.

The implementation steps of the session management service DM include:

State maintenance: tracking the dialog history and the current state.

Decision making: deciding the system response based on the user intent and the context information.

Action execution: calling the corresponding service or API to generate an answer.
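The three steps above (state maintenance, decision making, action execution) can be illustrated with a minimal rule-based sketch. The node names, intents, transition table and prompt texts below are hypothetical and do not come from the patent; a real session management service would load them from its flow configuration.

```python
from dataclasses import dataclass, field


@dataclass
class DialogState:
    """State maintenance: the current node plus the dialog history."""
    node: str = "greeting"
    history: list = field(default_factory=list)


# Hypothetical transition table: (current node, user intent) -> next node.
TRANSITIONS = {
    ("greeting", "confirm"): "collect_address",
    ("greeting", "deny"): "goodbye",
    ("collect_address", "confirm"): "goodbye",
}

# Hypothetical prompt text for each node.
PROMPTS = {
    "greeting": "Hello, is now a good time to talk?",
    "collect_address": "Could you tell me your address?",
    "goodbye": "Thank you, goodbye.",
}


def step(state: DialogState, intent: str) -> str:
    """Decide the next node from the intent and context, then return its prompt."""
    state.history.append((state.node, intent))
    # Unknown intents keep the dialog on the current node, so it is re-prompted.
    state.node = TRANSITIONS.get((state.node, intent), state.node)
    return PROMPTS[state.node]
```

In this sketch the returned prompt text stands in for the "action"; in the system described here it would be handed to the speech synthesis engine.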

A speech recognition engine ASR for converting human speech into machine-understandable text;

The speech recognition engine ASR is connected with the call platform IVR.

Specifically, the speech recognition engine ASR is the first step in converting a user's speech input into text information. ASR is based on deep learning technology, and can realize high-accuracy speech-to-text conversion.

Technical options for the speech recognition engine ASR include:

End-to-end models: such as CTC (Connectionist Temporal Classification) and attention-based models.

Open-source tools: Mozilla DeepSpeech, Kaldi, etc.

The implementation steps of the speech recognition engine ASR include:

Audio acquisition: using a microphone or other audio input device.

Preprocessing: noise reduction, gain control and feature extraction.

Model training: training a model with a large amount of labeled speech data.

Recognition: converting speech to text in real time or in non-real time.
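As a rough illustration of the preprocessing step only, the following sketch applies a simple peak-based gain control and computes one log-energy feature per frame. It is an assumption-laden toy: the function names, the target peak, and the frame and hop sizes are hypothetical, and production ASR front ends typically use richer features and dedicated signal-processing libraries.

```python
import math


def normalize_gain(samples, target_peak=0.9):
    """Scale samples so the loudest one reaches target_peak (simple gain control)."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence: nothing to scale
    return [s * target_peak / peak for s in samples]


def frame_log_energy(samples, frame_len=4, hop=2):
    """Split the signal into overlapping frames and return each frame's log energy."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame)
        feats.append(math.log(energy + 1e-10))  # small offset avoids log(0)
    return feats
```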

The invention also provides a method for the telephone robot interaction acceleration based on the large model, which is applied to the telephone robot interaction acceleration system based on the large model, and comprises the following steps:

The speech synthesis engine TTS synthesizes conversation prompt fragments one by one and the call platform IVR plays them one by one, so that the large model LLM, the speech synthesis engine TTS and the call platform IVR playback work simultaneously. As soon as the large model LLM returns a sentence to be synthesized, the speech synthesis engine TTS is called; as soon as the speech synthesis engine TTS finishes synthesizing a conversation prompt fragment, the call platform IVR starts playing it. A conversation prompt fragment is one sentence, generally 3-5 words when short and 20-30 words when long.

When the large-model-based intelligent telephone voice robot makes an outbound call, the session management service DM passes the conversation prompt fragments to the speech synthesis engine sentence by sentence, so that the speech synthesis engine starts synthesizing the prompt immediately and the call platform starts playing immediately, achieving a lower-delay human-computer interaction experience.
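The overlap of generation, synthesis and playback described above can be sketched with two queues and two worker threads. This is only an illustrative sketch: `fake_llm_stream`, `fake_tts` and the `play` callback are hypothetical stand-ins for the large model, the speech synthesis engine and the call platform, none of whose real interfaces are given in the source.

```python
import queue
import threading


def fake_llm_stream():
    """Hypothetical stand-in for a streaming large-model call: one sentence at a time."""
    yield from ["Hello.", "Your order has shipped.", "Anything else?"]


def fake_tts(sentence: str) -> bytes:
    """Hypothetical stand-in for the speech synthesis engine."""
    return sentence.encode("utf-8")


def run_pipeline(llm_stream, tts, play):
    """Overlap generation, synthesis and playback using two queues."""
    sentences = queue.Queue()
    clips = queue.Queue()
    done = object()  # sentinel marking the end of this round of prompts

    def synthesize():
        while (s := sentences.get()) is not done:
            clips.put(tts(s))  # synthesize each fragment as soon as it arrives
        clips.put(done)

    def playback():
        while (c := clips.get()) is not done:
            play(c)  # play each fragment as soon as it has been synthesized

    workers = [threading.Thread(target=synthesize), threading.Thread(target=playback)]
    for w in workers:
        w.start()
    for sentence in llm_stream:  # the model keeps generating while the workers run
        sentences.put(sentence)
    sentences.put(done)
    for w in workers:
        w.join()
```

Because each queue has a single consumer, fragments are played in the order in which the model produced them. Calling `run_pipeline(fake_llm_stream(), fake_tts, print)` prints the three byte strings in that order.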

Further, the method for accelerating interaction of the telephone robot based on the large model further comprises the following steps:

In the open question-answering scenario of the telephone robot, the session management service DM calls the large model LLM in streaming mode to return the answer text. Each time the session management service DM receives a sentence generated by the large model LLM, it immediately notifies the speech synthesis engine TTS to synthesize it. The call platform IVR calls the speech synthesis engine TTS and plays the recording fragments one by one. Because the synthesis speed of the speech synthesis engine TTS exceeds the playback speed of the call platform IVR, playback proceeds without interruption until the last recording fragment has been played, which ends the playback of this round of prompts.
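The source does not say how the session management service recognizes that a complete sentence has arrived from the streamed answer text. One plausible approach, shown here purely as an assumption, is to buffer the incoming text chunks and cut at sentence-ending punctuation. The punctuation set and the function name are hypothetical.

```python
from typing import Iterable, Iterator

# Assumed sentence terminators (ASCII and full-width forms).
SENTENCE_END = set(".!?。！？")


def sentences_from_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield each complete sentence as soon as its terminator has been received."""
    buffer = ""
    for chunk in chunks:
        for ch in chunk:
            buffer += ch
            if ch in SENTENCE_END:
                if buffer.strip():
                    yield buffer.strip()
                buffer = ""
    if buffer.strip():  # flush a trailing fragment that has no terminator
        yield buffer.strip()
```

For example, `list(sentences_from_stream(["Hel", "lo. How ", "are you", "? Fine"]))` yields `["Hello.", "How are you?", "Fine"]`. A splitter this simple also cuts at decimal points and abbreviations, so a deployed system would need more careful rules.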

Further, the speech synthesis speed of the speech synthesis engine TTS is higher than the speed at which a human normally speaks.

Preferably, the speech synthesis speed of the speech synthesis engine TTS is greater than 3 words/sec.
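The role of the synthesis-speed condition can be checked with a small timing model. The sketch below is a simplification built on stated assumptions: all fragments are available for synthesis from time zero, they are synthesized back to back, a fragment can be played only once it is fully synthesized, and speeds are expressed in words per second. The function name and the sample numbers are hypothetical.

```python
def playback_gaps(fragment_words, synth_rate, speak_rate):
    """Return the silence (in seconds) before each fragment after the first.

    fragment_words: number of words in each fragment, in playback order.
    synth_rate:     synthesis speed in words per second.
    speak_rate:     playback (speaking) speed in words per second.
    """
    gaps = []
    synth_done = 0.0  # time at which the current fragment finishes synthesis
    play_end = 0.0    # time at which the previous fragment finishes playing
    for i, words in enumerate(fragment_words):
        synth_done += words / synth_rate
        start = max(synth_done, play_end)  # wait for both synthesis and the line
        if i > 0:
            gaps.append(start - play_end)
        play_end = start + words / speak_rate
    return gaps
```

With fragments of 6, 12 and 6 words, a synthesis speed of 6 words/second and a speaking speed of 3 words/second, `playback_gaps([6, 12, 6], 6, 3)` returns `[0.0, 0.0]`, so playback is continuous. Lowering the synthesis speed to 2 words/second gives `[4.0, 0.0]`, a four-second silence before the second fragment. In this simplified model a very short fragment followed by a much longer one can still produce a gap even when synthesis is faster than speech, which is one reason to keep fragment lengths within a moderate range.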

Embodiments of the present invention compress the human-computer interaction delay of the telephone robot system to an acceptable level (the industry standard is typically within 1.8 to 2.2 seconds).

The invention also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the method of large model based telephonic robot interaction acceleration as described above.

The invention also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method of large model based telephony robot interaction acceleration as described above when executing the program.

Compared with the prior art, the invention has the beneficial effects that:

The invention provides a system and a method for accelerating telephone robot interaction based on a large model, in which the large model, the speech synthesis engine and the call platform playback work simultaneously. The existing approach of waiting for the complete prompt result before sending it to the speech synthesis engine is replaced by synthesizing conversation prompt fragments one by one and playing them one by one. The session management service communicates directly with the speech synthesis engine and passes the sentences to be synthesized, generated by the large model, to the speech synthesis engine sentence by sentence. As soon as the speech synthesis engine finishes synthesizing a conversation prompt fragment, the call platform plays the synthesized voice fragment. As a result, while the large model is still returning sentences to be synthesized, the speech synthesis engine is synthesizing conversation prompt fragments and the call platform is playing conversation prompt voice fragments. This optimizes the delay of each link of the telephone robot dialogue system, accelerates both the synthesis link and the playback link, meets the requirement of reducing human-computer interaction delay, and provides a practical and effective solution for applying large models in production telephone robot projects.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention.

In the drawings:

FIG. 1 is a flow diagram of the session management service calling TTS in quasi-streaming mode to synthesize recording clips according to an embodiment of the present invention;

FIG. 2 is a flow chart of the large model returning results in streaming mode according to an embodiment of the present invention;

FIG. 3 is a schematic flow chart of the session management service calling TTS to synthesize recording clips in quasi-streaming mode according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a computer device according to an embodiment of the invention.

Detailed Description

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and products consistent with some aspects of the disclosure as detailed in the appended claims.

The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, third, etc. may be used in this disclosure to describe various information, this information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. The term "if" as used herein may be interpreted as "at the time of", "when" or "in response to a determination", depending on the context.

Embodiments of the present invention are described in further detail below.

The embodiment of the invention provides a system for accelerating telephone robot interaction based on a large model, which comprises:

The speech synthesis engine TTS converts the text answers of the telephony robot dialog system into speech output enabling the user to receive information audibly.

The technical options of the speech synthesis engine TTS include:

Parameter-based TTS: neural network models such as WaveNet and Tacotron.

The implementation steps of the speech synthesis engine TTS include:

A large model LLM (large artificial intelligence model) which is a deep learning model with a large number of parameters and complex structures and is used for processing a large amount of data in the fields of natural language processing and voice recognition;

The call platform IVR is connected with the speech synthesis engine TTS; the large model LLM is connected with the speech synthesis engine TTS;

the system for telephone robot interaction acceleration based on the large model further comprises:

The session management service in the telephone robot of the embodiment can realize the design of a man-machine conversation process, the configuration of parameters such as a prompt, a pause time, a hot word, a dynamic model and the like of each node in the process, and the configuration of rules and models for semantic understanding, thereby realizing the process control of man-machine multi-round interaction. The following are examples of parameters:

Prompt: "Welcome to a certain travel website. How can I help you?";

Pause time: 800 ms;

Hot words: building, address;

Dynamic parameters: cmn/yue/eng, where cmn denotes the Mandarin model, yue the Cantonese model, and eng the English model;

Rules for semantic understanding: rule models, including regular expressions, used for intent judgment; for example, the expression (yes)|(confirm)|(right) represents a confirmation intent;

Semantic understanding model: semantic understanding can use a small or a large neural network model; the trained model returns the corresponding intent judgment for the text input by the customer.
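The parameter examples above could be represented in code roughly as follows. This is a hedged sketch: the `NodeConfig` class, its field names and the `match_intent` helper are hypothetical, the prompt text is a paraphrase, and the English alternatives in the regular expression are illustrative rather than the patent's actual rule.

```python
import re
from dataclasses import dataclass


@dataclass
class NodeConfig:
    """Hypothetical container for the per-node parameters listed above."""
    prompt: str
    pause_ms: int
    hotwords: tuple
    language_model: str  # one of "cmn", "yue", "eng"


greeting = NodeConfig(
    prompt="Welcome to a certain travel website. How can I help you?",
    pause_ms=800,
    hotwords=("building", "address"),
    language_model="cmn",
)

# Rule model: a regular expression whose alternatives all signal confirmation.
CONFIRM_RULE = re.compile(r"\b(yes|confirm|right)\b", re.IGNORECASE)


def match_intent(text: str) -> str:
    """Return "confirm" when the rule matches, otherwise "unknown"."""
    return "confirm" if CONFIRM_RULE.search(text) else "unknown"
```

A rule like this is cheap to evaluate, so it could plausibly be tried before falling back to a neural semantic understanding model for inputs it does not match.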

The speech recognition engine ASR is connected with the call platform IVR.

The speech recognition engine ASR is the first step in converting the user's speech input into text information. ASR is based on deep learning technology, and can realize high-accuracy speech-to-text conversion.

Technical options for the speech recognition engine ASR include:

Open-source tools: Mozilla DeepSpeech, Kaldi, etc.

The implementation steps of the speech recognition engine ASR include:

Audio acquisition: using a microphone or other audio input device.

Preprocessing: noise reduction, gain control and feature extraction.

Recognition: converting speech to text in real time or in non-real time.

The embodiment of the invention also provides a method for accelerating the interaction of the telephone robot based on the large model, which is applied to the system for accelerating the interaction of the telephone robot based on the large model, and comprises the following steps:

The speech synthesis engine TTS synthesizes conversation prompt fragments one by one and the call platform IVR plays them one by one, so that the large model LLM, the speech synthesis engine TTS and the call platform IVR playback work simultaneously. As soon as the large model LLM returns a sentence to be synthesized, the speech synthesis engine TTS is called; as soon as the speech synthesis engine TTS finishes synthesizing a conversation prompt fragment, the call platform IVR starts playing it. Fig. 1 shows the flow of the session management service calling TTS in quasi-streaming mode to synthesize recording clips in this embodiment.

In the open question-answering scenario of the telephone robot, the session management service DM calls the large model LLM in streaming mode to return the answer text. Each time the session management service DM receives a sentence generated by the large model LLM, it immediately notifies the speech synthesis engine TTS to synthesize it. The call platform IVR calls the speech synthesis engine TTS and plays the recording fragments one by one. Because the synthesis speed of the speech synthesis engine TTS exceeds the playback speed of the call platform IVR, playback proceeds without interruption until the last recording fragment has been played, which ends the playback of this round of prompts. Fig. 2 shows the flow of the large model returning results in streaming mode in this embodiment, and Fig. 3 shows the flow of the session management service calling TTS to synthesize recording clips in quasi-streaming mode in this embodiment.

The speech synthesis speed of the speech synthesis engine TTS is higher than the speed at which a human normally speaks. In this embodiment, the speech synthesis speed of the speech synthesis engine TTS is greater than 3 words/second.

When the large-model-based intelligent telephone voice robot makes an outbound call, the session management service DM passes conversation prompt fragments to the speech synthesis engine sentence by sentence, so that the speech synthesis engine can start synthesizing the prompt as early as possible and the call platform can start playing as early as possible, thereby achieving a lower-delay human-computer interaction experience.

In the system and the method for accelerating telephone robot interaction based on a large model, the large model, the speech synthesis engine and the call platform playback work simultaneously. The existing approach of waiting for the complete prompt result before sending it to the speech synthesis engine is replaced by synthesizing conversation prompt fragments one by one and playing them one by one. The session management service communicates directly with the speech synthesis engine and passes the sentences to be synthesized, generated by the large model, to the speech synthesis engine sentence by sentence. As soon as the speech synthesis engine finishes synthesizing a conversation prompt fragment, the call platform plays the synthesized voice fragment. As a result, while the large model is still returning sentences to be synthesized, the speech synthesis engine is synthesizing conversation prompt fragments and the call platform is playing conversation prompt voice fragments. This optimizes the delay of each link of the telephone robot dialogue system, accelerates both the synthesis link and the playback link, and meets the requirement of reducing human-computer interaction delay.
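The benefit described above can be made concrete by comparing the time until the caller hears the first audio in the two modes. The sketch below is a back-of-the-envelope model under stated assumptions: per-sentence generation and synthesis times are known, playback start-up cost is ignored, and all numbers in the usage note are hypothetical rather than measurements from the patent.

```python
def batch_first_audio(sentence_gen_s, sentence_tts_s):
    """Existing mode: wait for every sentence, synthesize the whole prompt, then play."""
    return sum(sentence_gen_s) + sum(sentence_tts_s)


def pipelined_first_audio(sentence_gen_s, sentence_tts_s):
    """Fragment mode: playback starts once the first sentence is generated and synthesized."""
    return sentence_gen_s[0] + sentence_tts_s[0]
```

For a three-sentence answer with hypothetical generation times of 1.0, 2.0 and 3.0 seconds and synthesis times of 0.5, 0.5 and 1.0 seconds, `batch_first_audio` gives 8.0 seconds while `pipelined_first_audio` gives 1.5 seconds. In this model the first-audio delay of the fragment mode depends only on the first sentence, not on the length of the whole answer.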

An embodiment of the present invention further provides a computer device. FIG. 4 is a schematic structural diagram of the computer device provided by the embodiment of the present invention. Referring to FIG. 4, the computer device includes an input system 23, an output system 24, a memory 22 and a processor 21. The memory 22 is used to store one or more programs; when the one or more programs are executed by the one or more processors 21, the one or more processors 21 implement the method for accelerating telephone robot interaction based on a large model as provided by the foregoing embodiment. The input system 23, the output system 24, the memory 22 and the processor 21 may be connected by a bus or in another manner; FIG. 4 takes connection via a bus as an example.

The memory 22 is a computer-readable storage medium that may be used to store software programs, computer-executable programs, and the program instructions corresponding to the method for accelerating telephone robot interaction based on a large model according to embodiments of the present invention. The memory 22 may mainly include a program storage area and a data storage area: the program storage area may store an operating system and the application programs required for at least one function, and the data storage area may store data created through use of the device. The memory 22 may further include high-speed random access memory and nonvolatile memory, such as at least one disk storage device, a flash memory device, or another nonvolatile solid-state storage device. In some examples, the memory 22 may further include memory located remotely from the processor 21, which may be connected to the device through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input system 23 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the device, and the output system 24 may include a display device such as a display screen.

The processor 21 executes various functional applications of the device and data processing, i.e. implements the above-described method of large model-based telephony robot interaction acceleration, by running software programs, instructions and modules stored in the memory 22.

The computer equipment provided by the embodiment can be used for executing the method for accelerating the interaction of the telephone robot based on the large model, and has corresponding functions and beneficial effects.

Embodiments of the present invention also provide a storage medium containing computer-executable instructions which, when executed by a computer processor, perform the method for accelerating telephone robot interaction based on a large model as provided by the above embodiments. The storage medium may be any of various types of memory devices or storage devices, including: installation media, such as a CD-ROM, floppy disk or tape system; computer system memory or random access memory, such as DRAM, DDR RAM, SRAM, EDO RAM or Rambus RAM; non-volatile memory, such as flash memory, magnetic media (e.g., a hard disk) or optical storage; and registers or other similar types of memory elements. The storage medium may also include other types of memory or combinations thereof. In addition, the storage medium may be located in a first computer system in which the program is executed, or in a different second computer system connected to the first computer system through a network such as the internet; the second computer system may provide program instructions to the first computer system for execution. The storage medium may include two or more storage media that reside in different locations (e.g., in different computer systems connected by a network). The storage medium may store program instructions (e.g., embodied as a computer program) executable by one or more processors.

Of course, the computer-executable instructions contained in the storage medium provided in the embodiments of the present invention are not limited to the operations of the method for accelerating telephone robot interaction based on a large model described in the above embodiments, and may also perform the related operations in the method for accelerating telephone robot interaction based on a large model provided in any embodiment of the present invention.

Thus far, the technical solution of the present invention has been described in connection with the preferred embodiments, but it is easily understood by those skilled in the art that the scope of protection of the present invention is not limited to these specific embodiments. Equivalent modifications and substitutions for related technical features may be made by those skilled in the art without departing from the principles of the present invention, and such modifications and substitutions will be within the scope of the present invention.

The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, and various modifications and variations of the present invention will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A system for accelerating telephone robot interaction based on a large model, characterized by comprising:

Call platform IVR: used for providing automated communication services via telephone, controlling telephone communication in the telephone robot dialogue system, and implementing telephone operations such as answering calls, making outbound calls, transferring calls, playing announcements and collecting digits, and hanging up; a complete telephone robot system is realized by interfacing with the speech synthesis engine and the session management service;

Speech synthesis engine TTS: used to convert text information into natural-sounding speech;

Large Model (LLM): A deep learning model with a large number of parameters and complex structure, used to process large amounts of data in the fields of natural language processing and speech recognition.

The call platform IVR is connected to the speech synthesis engine TTS; the large model LLM is connected to the speech synthesis engine TTS.

2. The system for accelerating telephone robot interaction based on a large model according to claim 1, further comprising:

Session management service DM: used to design the human-computer dialogue flow, configure the prompt, pause time, hot words and dynamic model parameters for each node in the flow, and configure the rules and models for semantic understanding, thereby achieving flow control for multi-round human-computer interaction;

The call platform IVR is connected to the session management service DM; the session management service DM is connected to the large model LLM and the speech synthesis engine TTS respectively.

3. The system for accelerating telephone robot interaction based on a large model according to claim 1, further comprising:

Speech recognition engine ASR: used to convert human speech into machine-understandable text;

The speech recognition engine ASR is connected to the call platform IVR.

4. A method for accelerating telephone robot interaction based on a large model, applied to the system for accelerating telephone robot interaction based on a large model as described in any one of claims 1 to 3, characterized by comprising:

The speech synthesis engine TTS is used to synthesize conversation prompt segments one by one, and the call platform IVR plays the conversation prompt segments one by one. The large model LLM, speech synthesis engine TTS and call platform IVR playback work simultaneously; when the large model LLM returns a sentence result to be synthesized, the speech synthesis engine TTS is immediately called; when the speech synthesis engine TTS completes synthesizing a conversation prompt segment, the call platform IVR immediately starts playing.

5. The method for accelerating telephone robot interaction based on a large model according to claim 4, further comprising:

In the open question-and-answer scenario of the telephone robot, the session management service DM calls the large model LLM to stream the answer text. Every time the session management service DM receives a sentence generated by the large model LLM, it immediately notifies the speech synthesis engine TTS to synthesize it. The call platform IVR calls the speech synthesis engine TTS to play the recording segments one by one; the speech synthesis speed of the speech synthesis engine TTS exceeds the playback speed of the call platform IVR, so that the playback proceeds normally until the last recording segment is played, ending the playback of this round of prompts.

6. The method for accelerating telephone robot interaction based on a large model according to claim 5, characterized in that the speech synthesis speed of the speech synthesis engine TTS is higher than the normal speaking speed of humans.

7. A computer-readable storage medium having a computer program stored thereon, characterized in that when the program is executed by a processor, the steps of the method for accelerating telephone robot interaction based on a large model as described in any one of claims 4 to 6 are implemented.

8. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the program, the steps of the method for accelerating telephone robot interaction based on a large model as described in any one of claims 4 to 6 are implemented.