
CN118428343B - Full-media interactive intelligent customer service interaction method and system - Google Patents

Full-media interactive intelligent customer service interaction method and system

Info

Publication number
CN118428343B
CN118428343B
Authority
CN
China
Prior art keywords
input
user
voice
text
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410885227.2A
Other languages
Chinese (zh)
Other versions
CN118428343A
Inventor
苏焕杰
宁学军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Xunhong Network Technology Co ltd
Original Assignee
Guangzhou Xunhong Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Xunhong Network Technology Co ltd filed Critical Guangzhou Xunhong Network Technology Co ltd
Priority to CN202410885227.2A priority Critical patent/CN118428343B/en
Publication of CN118428343A publication Critical patent/CN118428343A/en
Application granted granted Critical
Publication of CN118428343B publication Critical patent/CN118428343B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • G06F40/186Templates
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/01Customer relationship services
    • G06Q30/015Providing customer assistance, e.g. assisting a customer within a business location or via helpdesk
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Marketing (AREA)
  • Finance (AREA)
  • Medical Informatics (AREA)
  • Strategic Management (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Signal Processing (AREA)
  • Development Economics (AREA)
  • Psychiatry (AREA)
  • Hospice & Palliative Care (AREA)
  • Databases & Information Systems (AREA)
  • Child & Adolescent Psychology (AREA)
  • Economics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Acoustics & Sound (AREA)

Abstract

The application relates to the technical field of natural language processing and discloses a full-media interactive intelligent customer service interaction method and system. The method comprises the following steps: S1: receiving an input of a user; S2: evaluating the user input to judge the emotional state of the user; S3: adjusting the response strategy in the adaptive learning system according to the evaluated emotional attitude of the user, that is, selecting a corresponding reply template based on whether the user's emotional state is anger, sadness, happiness or neutral; S4: filling the placeholders in the selected reply template using natural language generation and outputting the reply to the user in text, voice, image or video multimedia form. When evaluating and judging the user's emotional attitude, the application takes the text, voice and image conditions as well as the user's historical data into account and adopts a hierarchical dynamic activation function for emotion prediction, which greatly improves the accuracy and efficiency of user emotion judgment and greatly improves the user experience.

Description

Full-media interactive intelligent customer service interaction method and system
Technical Field
The invention relates to the technical field of natural language processing, in particular to a full-media interactive intelligent customer service interaction method and system.
Background
In the field of modern communication and customer service, full-media interactive intelligent customer service systems have gradually become an important technical means for improving user experience and enterprise efficiency. With the rapid development of information technology and artificial intelligence, these systems can provide more personalized and accurate services by integrating text, voice, image and video input, so as to meet the needs of different users. However, as application scenarios diversify and user requirements grow more complex, traditional intelligent customer service systems face new challenges and demands, and technical innovation and breakthroughs are urgently needed.
Natural language processing technology is the core of intelligent customer service systems; it allows machines to understand and generate human language. Although NLP technology can already support basic dialog management and intent recognition, limitations remain in understanding complex contexts, multi-turn dialog and fine-grained emotion. Speech recognition technology has evolved from early rule-based systems to highly accurate systems using deep learning; although the accuracy of speech recognition has improved significantly, accurately capturing emotion and intent from a user's intonation and speech rate remains a technical challenge. The processing of image and video inputs involves advanced functions such as facial recognition and emotion analysis, which require large amounts of data and high computational power; current challenges include how to accurately and quickly extract the user's emotion and intent from images and video, and how to handle the high resource requirements of real-time video data.
In addition, as input types diversify, effectively integrating data from text, voice, images and video and performing comprehensive emotion and intent analysis is key to improving the response quality of an intelligent customer service system. Multi-modal data processing requires not only an efficient data fusion strategy but also intelligent algorithms capable of cross-modal understanding and reasoning. Intelligent customer service systems need to process and respond to user input in a very short time, which demands a high degree of real-time performance and adaptability. Furthermore, the system should also be able to learn and optimize its performance based on the user's feedback and behavior.
Existing intelligent customer service replies are monotonous, and the customer service cannot reply in a targeted way according to the user's emotional state. Current emotional state judgment is performed only on a single input text, a single voice condition, or a single image or video; no method comprehensively judges all three kinds of information, so the emotion judgment is inaccurate. The existing emotion judgment process also ignores the current user's historical data, which makes the judgment inaccurate: for example, a certain user may habitually speak with a higher pitch, which is normal for that user, but a method that does not compare against the user's historical average will misjudge the emotion. Because algorithm designs do not consider the various input data together with the historical data, the prediction accuracy of the user's emotional state is greatly reduced. In view of these limitations of existing prediction and judgment, a new, highly accurate automatic method for judging the customer's emotional state is urgently needed to improve processing efficiency and accuracy.
Disclosure of Invention
Aiming at the problems mentioned in the prior art, the application provides a full-media interactive intelligent customer service interaction method and system. The method first receives the input of a user and then evaluates and judges the user's emotional state according to that input; it adjusts the response strategy in the adaptive learning system according to the evaluated emotional attitude of the user, that is, it selects a corresponding reply template based on whether the user's emotional state is anger, sadness, happiness or neutral; finally, it fills the placeholders in the selected reply template using natural language generation and outputs the reply to the user in text, voice, image or video multimedia form. When evaluating and judging the user's emotional attitude, the text, voice and image conditions as well as the user's historical data are taken into account, and a hierarchical dynamic activation function is adopted for emotion prediction, which greatly improves the accuracy and efficiency of user emotion judgment and greatly improves the user experience.
The application provides a full-media interactive intelligent customer service interaction method, which comprises the following steps:
S1: receiving user input including text and/or voice and/or images and/or video;
S2: evaluating the user input to judge the emotional state of the user;
S21: if there is text input, performing word segmentation and part-of-speech tagging on the input text; if there is voice input, converting the voice into text; calculating the importance of each word in the input text and the converted text using the TF-IDF algorithm, and converting the input text and the converted text into a numerical feature vector T1;
S22: if there is voice input, obtaining the pitch, intensity, speed and rhythm of the voice to form a feature vector T2;
S23: if there is image or video input, performing face detection using a Haar cascade classifier and recognizing facial key points using the Dlib 68-point model, the facial key points including eyebrows, eyes, nose, mouth and chin; calculating the position changes of the facial key points, including the raising speed and amplitude of the eyebrows and the raising and drooping speed and amplitude of the mouth corners, and forming a feature vector T3;
S24: inputting the feature vector T1, the feature vector T2 and the feature vector T3 into the improved convolutional neural network, and calculating the emotional attitude of the user; wherein the improved convolutional neural network employs a hierarchical dynamic activation function f(x, m):
wherein x represents the input of the network layer; w_m represents the initial weight of the corresponding multimedia modality m, where m=1 represents text input, m=2 represents voice input, m=3 represents image input, and m=4 represents video input; δ(h_m, x) represents the similarity adjustment weight between the user history data h_m and the input x of the current network layer; D(h_m, x) represents the cosine similarity between the user history data h_m and the input x of the current network layer; α is a first learned parameter and β is a second learned parameter; m represents the different multimedia modalities;
S3: adjusting the response strategy in the adaptive learning system according to the evaluated emotional attitude of the user, that is, selecting a corresponding reply template based on whether the user's emotional state is anger, sadness, happiness or neutral;
S4: based on the selected reply template, filling the placeholders in the template using natural language generation, and outputting the reply to the user in text, voice, image or video multimedia form.
Preferably, the step S22: if there is voice input, obtaining the pitch, intensity, speed and rhythm of the voice to form a feature vector T2 comprises: framing the voice; calculating the fundamental frequency of each frame using cepstrum analysis to obtain the pitch of the voice; obtaining the intensity of the voice by calculating the amplitude value of each frame; measuring the ratio of the number of syllables or words in the voice to the elapsed time to obtain the speech rate; and characterizing the rhythm of the speech by counting the frequency and duration of speech pauses.
Preferably, the step S21: if there is text input, performing word segmentation and part-of-speech tagging on the input text; if there is voice input, converting the voice into text; calculating the importance of each word in the input text and the converted text using the TF-IDF algorithm, where the higher the calculated TF-IDF value of a word, the greater its importance; and converting the input text and the converted text into a numerical feature vector T1, that is, calculating the TF-IDF value of each word in the document using the TF-IDF algorithm and then combining the calculated TF-IDF values into a vector, with each TF-IDF value corresponding to a unique word.
Preferably, the step S23: if there is image or video input, performing face detection using a Haar cascade classifier and recognizing facial key points using the Dlib 68-point model, the facial key points including eyebrows, eyes, nose, mouth and chin; calculating the position changes of the facial key points, including the raising speed and amplitude of the eyebrows and the raising and drooping speed and amplitude of the mouth corners, and forming a feature vector T3; and if the input is an image, comparing the key point positions of the current image with a preset neutral expression position to obtain the change amplitude.
Preferably, D(h_m, x) represents the cosine similarity between the user history data h_m and the input x of the current network layer. When the input is text, K-means clustering is performed on the historical text data, and the cosine similarity is computed between the input x of the current network layer and the cluster-center vectors of the historical text data; when the input is voice, the cosine similarity is computed between the input x of the current network layer and the historical mean vector of pitch, intensity, speech rate and rhythm of the historical voice data; when the input is an image or video, the cosine similarity is computed between the input x of the current network layer and the historical mean feature vector formed by the raising speed and amplitude of the eyebrows and the raising and drooping speed and amplitude of the mouth corners in the historical image or video data.
The application also provides a full-media interactive intelligent customer service interaction system, which comprises:
an input module for receiving user input, including text and/or voice and/or image and/or video;
The evaluation judging module is used for evaluating and judging the emotion state of the user according to the input of the user;
The text vector construction module is used for: if there is text input, performing word segmentation and part-of-speech tagging on the input text; if there is voice input, converting the voice into text; calculating the importance of each word in the input text and the converted text using the TF-IDF algorithm, and converting the input text and the converted text into a numerical feature vector T1;
The voice vector construction module is used for: if there is voice input, obtaining the pitch, intensity, speed and rhythm of the voice to form a feature vector T2;
The image or video vector construction module is used for: if there is image or video input, performing face detection using a Haar cascade classifier and recognizing facial key points using the Dlib 68-point model, the facial key points including eyebrows, eyes, nose, mouth and chin; calculating the position changes of the facial key points, including the raising speed and amplitude of the eyebrows and the raising and drooping speed and amplitude of the mouth corners, and forming a feature vector T3;
The user emotion attitude calculation module is used for inputting the feature vector T1, the feature vector T2 and the feature vector T3 into the improved convolutional neural network and calculating the emotional attitude of the user; wherein the improved convolutional neural network employs a hierarchical dynamic activation function f(x, m):
wherein x represents the input of the network layer; w_m represents the initial weight of the corresponding multimedia modality m, where m=1 represents text input, m=2 represents voice input, m=3 represents image input, and m=4 represents video input; δ(h_m, x) represents the similarity adjustment weight between the user history data h_m and the input x of the current network layer; D(h_m, x) represents the cosine similarity between the user history data h_m and the input x of the current network layer; α is a first learned parameter and β is a second learned parameter; m represents the different multimedia modalities;
The reply template calculation module is used for adjusting the response strategy in the adaptive learning system according to the evaluated emotional attitude of the user, that is, selecting a corresponding reply template based on whether the user's emotional state is anger, sadness, happiness or neutral;
and the user reply module is used for filling the placeholders in the selected reply template using natural language generation and outputting the reply to the user in text, voice, image or video multimedia form.
Preferably, the voice vector construction module, if there is voice input, obtains the pitch, intensity, speed and rhythm of the voice to form a feature vector T2, including: framing the voice; calculating the fundamental frequency of each frame using cepstrum analysis to obtain the pitch of the voice; obtaining the intensity of the voice by calculating the amplitude value of each frame; measuring the ratio of the number of syllables or words in the voice to the elapsed time to obtain the speech rate; and characterizing the rhythm of the speech by counting the frequency and duration of speech pauses.
Preferably, the text vector construction module, if there is text input, performs word segmentation and part-of-speech tagging on the input text; if there is voice input, converts the voice into text; calculates the importance of each word in the input text and the converted text using the TF-IDF algorithm, where the higher the calculated TF-IDF value of a word, the greater its importance; and converts the input text and the converted text into a numerical feature vector T1, that is, calculates the TF-IDF value of each word in the document using the TF-IDF algorithm and then combines the calculated TF-IDF values into a vector, with each TF-IDF value corresponding to a unique word.
Preferably, the image or video vector construction module, if there is image or video input, performs face detection using a Haar cascade classifier and recognizes facial key points using the Dlib 68-point model, the facial key points including eyebrows, eyes, nose, mouth and chin; calculates the position changes of the facial key points, including the raising speed and amplitude of the eyebrows and the raising and drooping speed and amplitude of the mouth corners, and forms a feature vector T3; and if the input is an image, compares the key point positions of the current image with a preset neutral expression position to obtain the change amplitude.
Preferably, D(h_m, x) represents the cosine similarity between the user history data h_m and the input x of the current network layer. When the input is text, K-means clustering is performed on the historical text data, and the cosine similarity is computed between the input x of the current network layer and the cluster-center vectors of the historical text data; when the input is voice, the cosine similarity is computed between the input x of the current network layer and the historical mean vector of pitch, intensity, speech rate and rhythm of the historical voice data; when the input is an image or video, the cosine similarity is computed between the input x of the current network layer and the historical mean feature vector formed by the raising speed and amplitude of the eyebrows and the raising and drooping speed and amplitude of the mouth corners in the historical image or video data.
The invention provides a full-media interactive intelligent customer service interaction method and system, which can realize the following beneficial technical effects:
1. The present application first receives input from a user; secondly, it evaluates the user input to judge the user's emotional state; it then adjusts the response strategy in the adaptive learning system according to the evaluated emotional attitude of the user, that is, it selects a corresponding reply template based on whether the user's emotional state is anger, sadness, happiness or neutral; finally, it fills the placeholders in the selected reply template using natural language generation and outputs the reply to the user in text, voice, image or video multimedia form. When evaluating and judging the user's emotional attitude, the text, voice and image conditions as well as the user's historical data are taken into account, and a hierarchical dynamic activation function is adopted for emotion prediction, which greatly improves the accuracy and efficiency of user emotion judgment and greatly improves the user experience.
2. According to the invention, the user emotion attitude calculation module inputs the text input vector, the voice input vector and the image or video input vector together into the improved convolutional neural network and calculates the user's emotional attitude. The improved convolutional neural network adopts a hierarchical dynamic activation function f(x, m): different input modalities correspond to different weights, with image and video inputs weighted higher than voice input, and the weights are further adjusted according to the similarity between the user history data h_m and the input x of the current network layer. This makes full use of both the historical data and the current data, realizing full utilization of the rich existing data, and greatly improves the accuracy of user emotion judgment.
3. If there is voice input in the emotion judgment process, the voice is converted into text, and the importance of each word in the input text and the converted text is calculated using the TF-IDF algorithm, where the higher the calculated TF-IDF value of a word, the greater its importance. The input text and the converted text are converted into a numerical feature vector T1, that is, the TF-IDF value of each word in the document is calculated using the TF-IDF algorithm and the calculated TF-IDF values are then combined into a vector, with each TF-IDF value corresponding to a unique word. The user's emotional attitude is judged both from the text converted from the voice and from parameters of the voice itself such as pitch, making full use of the dimensions of the voice data, so the accuracy of user emotion judgment is greatly improved.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of steps of a full media interactive intelligent customer service interaction method of the present invention;
FIG. 2 is a schematic diagram of a full media interactive intelligent customer service interaction system of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1:
To solve the above-mentioned problems mentioned in the prior art, as shown in fig. 1: the application provides a full-media interactive intelligent customer service interaction method, which comprises the following steps:
S1: receiving user input including text and/or voice and/or images and/or video; in one embodiment, the intelligent customer service system is deployed in an online retail store system to help users solve problems concerning products, purchasing flows, payment methods, after-sales services and the like. A user queries product details through an online chat window, entering: "Do you have size-42 sports shoes in stock?" The system receives the message through a text input interface, performs word segmentation and part-of-speech tagging on it, and then uses a pre-trained natural language processing model to understand the intention and key information of the query, as in the sketch below.
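A minimal sketch of the segmentation and part-of-speech tagging step using jieba, one of the libraries the description names later; the sample query and the keyword filter are illustrative assumptions, not part of the patented method.

```python
# Word segmentation and part-of-speech tagging with jieba (illustrative example).
import jieba.posseg as pseg

query = "请问42码的运动鞋还有库存吗"  # "Do you have size-42 sports shoes in stock?"

# pseg.cut yields (word, part-of-speech flag) pairs, e.g. nouns ('n'), verbs ('v')
tokens = [(word, flag) for word, flag in pseg.cut(query)]
print(tokens)

# Keep nouns and verbs as candidate intent/keyword terms for downstream matching
keywords = [word for word, flag in tokens if flag.startswith(("n", "v"))]
print(keywords)
```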
In one embodiment, the user asks about order status through the voice recognition function of the mobile application, saying: "What is the current status of my order A12345?" The system converts the speech input into text using speech recognition techniques; acoustic features are captured using an acoustic model and converted to the corresponding text data by a language model.
In one embodiment, a user uploads a product picture through the customer service system to ask whether the product is in stock; the user uploads the product picture through the image upload function; the system receives the image file, uses an image recognition algorithm to identify the product in the picture, matches it against the product information in the database and determines the product's inventory status.
In one embodiment, a user shows customer service the damaged goods they received through the video call function to seek help; the user uploads a video showing the damaged product through the video upload function of the intelligent customer service system; the system receives the video file, analyzes the video content using video processing technology, recognizes the product and the damage in the video using image and video recognition techniques, automatically records the damage evidence and provides support for further after-sales service.
S2: judging the emotional state of the user according to the user input evaluation; in one embodiment, the intelligent customer service system is deployed in a home electronics service provider system, and a user can interact with the system through text, voice, image or video, query billing information, fault reporting or upgrade services, etc., and input through a chat interface user input: "your service is true and bad, I have waited for an hour-! The system analyzes the text by natural language processing techniques, identifies keywords such as "bad" and "one hour", which are associated with negative emotions, determines the emotion of the user as "anger" using an emotion analysis algorithm (e.g., emotion dictionary based analysis or machine learning model or convolutional neural network model), selects an appropriate reply template based on the determined emotion state, for example, provides immediate sorry information, and quickly switches to manual service to further address the problem.
In one embodiment, a user makes a voice input through the telephone service system: "Why have I not received my refund?" The system converts the voice into text and extracts characteristics of the voice, such as a high pitch and a fast speech rate. It analyzes the voice characteristics and the converted text content, uses the pitch and rate data to determine the user's anxiety or dissatisfaction, provides a soothing reply according to the emotional state, and gives a detailed description of the refund progress.
In one embodiment, the user uploads, through the mobile application, a selfie in which the face appears unhappy and asks why the service is so slow. The system uses facial recognition techniques to detect the expression in the image; analysis of the facial key points indicates the user's displeasure. Based on the facial expression analysis results, the system recognizes the user's dissatisfaction, generates a reassuring reply explaining the cause of the delay, and attempts to ease the user's discontent. For natural language processing, pre-trained models such as BERT or GPT are used for text emotion analysis. The speech processing tool uses, for example, iFlytek services to perform speech-to-text and acoustic feature analysis tools to analyze pitch and speech rate, and libraries such as OpenCV and Dlib are used for facial recognition and emotion analysis.
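Since the description mentions emotion-dictionary-based analysis as one option, the following is a minimal sketch of that idea; the keyword lists and the priority order are illustrative assumptions, not the patented algorithm.

```python
# Illustrative emotion-dictionary scoring mapped onto the four template classes.
NEGATIVE_CUES = {"bad", "terrible", "waiting", "slow", "angry", "refund"}
SAD_CUES = {"disappointed", "sad", "missed"}
HAPPY_CUES = {"like", "love", "great", "thanks"}

def classify_emotion(tokens):
    """Map a tokenized user utterance to anger / sadness / happiness / neutral."""
    toks = {t.lower() for t in tokens}
    if toks & NEGATIVE_CUES:
        return "anger"
    if toks & SAD_CUES:
        return "sadness"
    if toks & HAPPY_CUES:
        return "happiness"
    return "neutral"

print(classify_emotion(["Your", "service", "is", "really", "bad"]))  # -> anger
```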
S21: if there is text input, performing word segmentation and part-of-speech tagging on the input text; if there is voice input, converting the voice into text; calculating the importance of each word in the input text and the converted text using the TF-IDF algorithm, and converting the input text and the converted text into a numerical feature vector T1. In one embodiment, a user enters a question via the online chat platform: "My network connection is very slow, what should I do?" The system first performs word segmentation on the input text, i.e., breaks the whole sentence into separate words, for example "my", "network", "connection", "very", "slow", "what", "should", "I", "do". Part-of-speech tagging is performed on the segmented result to identify nouns, adjectives, verbs and so on; for example, "network" and "connection" are labeled as nouns, and "slow" is labeled as an adjective. The importance of each word in the segmented text is calculated using the TF-IDF algorithm; in this example, "network" and "slow" receive higher TF-IDF values because they are critical to describing the problem. The calculated TF-IDF values are converted into a numerical feature vector for the subsequent machine learning model.
In one embodiment, the user issues a voice instruction through the customer service hotline: "I want to know the status of my order." The system converts the user's voice input into text; the text content "I want to know my order status" is extracted using a speech recognition API or a self-built model. Word segmentation and part-of-speech tagging are then executed on the converted text; for example, the word segmentation results are "I", "want", "know", "my", "order", "status". TF-IDF values are also calculated for these words to evaluate their importance in the whole sentence; in this example, "order" and "status" have higher TF-IDF values because they are the keywords of the query. These TF-IDF values are converted into a numerical feature vector for subsequent processing by the system, such as querying the order database and providing a response. Word segmentation and part-of-speech tagging of text use natural language processing libraries such as NLTK, spaCy or jieba (for Chinese processing); the TF-IDF algorithm uses TfidfVectorizer in Python's scikit-learn library; and speech recognition uses iFlytek or other similar services to convert speech to text.
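Because the description names scikit-learn's TfidfVectorizer, the sketch below shows how the numerical feature vector T1 could be built with it; the small corpus of historical queries is an illustrative assumption.

```python
# Building the TF-IDF feature vector T1 with scikit-learn's TfidfVectorizer.
from sklearn.feature_extraction.text import TfidfVectorizer

history_corpus = [
    "I want to know my order status",
    "my network connection is very slow",
    "how do I cancel my order",
]

vectorizer = TfidfVectorizer()
vectorizer.fit(history_corpus)                      # learn vocabulary and IDF weights

query = "I want to know my order status"
T1 = vectorizer.transform([query]).toarray()[0]     # one TF-IDF value per vocabulary word

# Inspect the highest-weighted words, e.g. "order" and "status"
top = sorted(zip(vectorizer.get_feature_names_out(), T1), key=lambda p: -p[1])[:3]
print(top)
```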
S22: if there is voice input, obtaining the pitch, intensity, speed and rhythm of the voice to form a feature vector T2. The pitch reflects the fundamental frequency of the speech signal and is a key element in expressing emotion changes; it is estimated using autocorrelation, cepstrum, or Fourier-transform-based techniques, which can effectively extract the periodic components from the speech signal to determine the fundamental frequency. The speech signal is preprocessed, including framing and window-function processing; a Fourier transform is applied to each frame to obtain frequency-domain data; autocorrelation or cepstrum analysis is applied to estimate the fundamental frequency of each frame; and the fundamental frequency variation across frames is smoothed to obtain a consistent pitch contour.
Intensity refers to the energy level of the speech and is related to the emotional intensity of the speaker. It is calculated by measuring the power or energy of the sound signal: the voice signal is framed, the squared amplitude of each frame is computed to obtain the energy, and the energy values are converted to a logarithmic scale to obtain an intensity metric that is more perceptually relevant.
The speech rate reflects how quickly the user speaks and helps in understanding the speaker's emotional state. It is calculated by measuring the ratio of the number of syllables or words in the speech to the elapsed time: speech segments are determined using voice activity detection (VAD) techniques; the number of syllables or words within the speech segments is computed by automatic speech recognition (ASR) or syllable segmentation; and the speech rate is determined from the ratio of the total syllable count to the total duration of the speech segments.
Rhythm relates to the regularity and variation of the speech beat and to the fluency of the speech. The degree of variation of syllable duration (such as the standard deviation) and the length and frequency of pauses are analyzed: the duration of each syllable is analyzed using syllable segmentation techniques; the degree of variation and the mean of the syllable durations are calculated; and the location, frequency and duration of pauses are analyzed and integrated into the rhythm feature. After these features are combined into a feature vector, it can be used to drive an emotion recognition model or a speech recognition system to achieve more accurate analysis of user intent and emotion.
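The following is a rough sketch of extracting the four prosodic features that make up T2. librosa is not named in the patent and is an assumed tool choice; the file name, thresholds and the voiced-time proxy for speech rate are simplifying assumptions.

```python
# Sketch of pitch, intensity, speech-rate and rhythm features for T2 (assumed librosa-based).
import numpy as np
import librosa

y, sr = librosa.load("user_utterance.wav", sr=16000)   # illustrative file name

# Pitch: per-frame fundamental frequency (pYIN); mean over voiced frames
f0, voiced_flag, _ = librosa.pyin(y, fmin=65, fmax=400, sr=sr)
pitch = float(np.nanmean(f0))

# Intensity: log of per-frame RMS energy
rms = librosa.feature.rms(y=y)[0]
intensity = float(np.mean(np.log(rms + 1e-8)))

# Speech-rate proxy: voiced time over total time (a real system would count
# syllables or ASR words per second)
intervals = librosa.effects.split(y, top_db=30)          # non-silent segments
voiced_time = sum((e - s) for s, e in intervals) / sr
speech_rate = voiced_time / (len(y) / sr)

# Rhythm: number and mean duration of pauses between non-silent segments
pauses = [(intervals[i + 1][0] - intervals[i][1]) / sr for i in range(len(intervals) - 1)]
rhythm = np.array([len(pauses), np.mean(pauses) if pauses else 0.0])

T2 = np.concatenate([[pitch, intensity, speech_rate], rhythm])
print(T2)
```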
S23: if there is image or video input, performing face detection using a Haar cascade classifier and recognizing facial key points using the Dlib 68-point model, the facial key points including eyebrows, eyes, nose, mouth and chin; calculating the position changes of the facial key points, including the raising speed and amplitude of the eyebrows and the raising and drooping speed and amplitude of the mouth corners, and forming a feature vector T3.
The first step of face processing is to locate faces in the image or video frame. Face detection is performed using a Haar cascade classifier: the input image or video frame is first converted to grayscale and histogram-equalized to improve image quality and make it suitable for face detection; a pre-trained Haar feature classifier, trained on a large number of frontal and non-frontal images, is loaded for fast detection of faces in the image; and the face detection classifier is applied to search for faces, with the Haar classifier detecting faces at multiple scales using a sliding-window technique so that faces of different sizes can be recognized. Once a face is detected, facial key points are identified using Dlib's 68-point model: the pre-trained 68-point facial landmark model provided by the Dlib library is loaded. This model can identify facial key points including the eyebrows, eyes, nose, mouth and chin; it is applied to precisely locate the 68 facial key points within the identified face region.
The locations of these points provide detailed information about the facial expression, and analyzing how the facial key points change over time can reveal the user's emotional change, such as happiness, sadness or anger. The key point locations are tracked across the video sequence; if the input is a single image, the key point locations of the current image are compared with a preset neutral expression position. The amount of change in the position of particular key points (e.g., the eyebrows and the corners of the mouth) is calculated, including the speed and amplitude of the change, such as the raising speed and amplitude of the eyebrows and the raising or drooping speed and amplitude of the mouth corners. The change data of all key points are encoded into a feature vector used for subsequent emotion analysis. For the voice features, pitch is the fundamental frequency of the speech signal, and pitch changes can reflect the level and intensity of emotion; intensity is the loudness level of the sound and can be determined by measuring the energy of the speech signal, a high intensity indicating that the user is agitated; speech rate refers to the speed of speech per unit time and can be measured by analyzing the duration of speech and pauses within speech paragraphs; and rhythm relates to the regularity and variation of the speech beat, assessed by analyzing the variation of syllable durations.
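A minimal sketch of the detection-plus-landmark pipeline using OpenCV's Haar cascade and Dlib's 68-point predictor, both of which the description names. The model file name is the one commonly distributed with Dlib, and the eyebrow/mouth-corner indices follow the standard 68-point layout; the two-frame comparison is illustrative.

```python
# Haar cascade face detection + Dlib 68-point landmarks, then eyebrow/mouth-corner deltas for T3.
import cv2
import dlib
import numpy as np

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def landmarks(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)                        # histogram equalization
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    shape = predictor(gray, dlib.rectangle(x, y, x + w, y + h))
    return np.array([(p.x, p.y) for p in shape.parts()], dtype=float)

# Compare eyebrow and mouth-corner positions between two frames (or against a
# stored neutral-expression template) to approximate change speed and amplitude.
prev = landmarks(cv2.imread("frame_t0.png"))
curr = landmarks(cv2.imread("frame_t1.png"))
brow = (curr[17:27] - prev[17:27]).mean(axis=0)          # points 17-26: eyebrows
mouth_corners = curr[[48, 54]] - prev[[48, 54]]          # points 48 and 54: mouth corners
T3 = np.concatenate([brow, mouth_corners.ravel()])
print(T3)
```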
S24: inputting the feature vector T1, the feature vector T2 and the feature vector T3 into the improved convolutional neural network, and calculating the emotional attitude of the user; wherein the improved convolutional neural network employs a hierarchical dynamic activation function f(x, m):
wherein x represents the input of the network layer; w_m represents the initial weight of the corresponding multimedia modality m, where m=1 represents text input, m=2 represents voice input, m=3 represents image input, and m=4 represents video input; δ(h_m, x) represents the similarity adjustment weight between the user history data h_m and the input x of the current network layer; D(h_m, x) represents the cosine similarity between the user history data h_m and the input x of the current network layer; α is a first learned parameter and β is a second learned parameter; m represents the different multimedia modalities.
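The activation function itself is rendered as an equation image in the original publication and is not reproduced above. The sketch below only implements the components the text defines explicitly: the cosine similarity D(h_m, x) between the user's history vector and the current input, and a per-modality weight adjustment δ driven by it through learned parameters α and β. The sigmoid combination of α·D + β is an assumed form, not the published equation.

```python
# Components of the hierarchical dynamic activation defined in the text; the
# exact combination (sigmoid of alpha*D + beta) is an assumption.
import torch

def cosine_similarity_D(h_m: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """D(h_m, x): cosine similarity between history vector and current input."""
    return torch.nn.functional.cosine_similarity(h_m, x, dim=-1)

def delta(h_m, x, alpha, beta):
    """delta(h_m, x): similarity-based weight adjustment with learned alpha, beta."""
    return torch.sigmoid(alpha * cosine_similarity_D(h_m, x) + beta)

# Example: modality m with initial weight w_m, adjusted by history similarity
w_m = torch.tensor(0.8)
alpha = torch.nn.Parameter(torch.tensor(1.0))
beta = torch.nn.Parameter(torch.tensor(0.0))
x = torch.randn(64)          # current layer input for modality m
h_m = torch.randn(64)        # user's historical mean / cluster-centre vector
adjusted_weight = w_m * delta(h_m, x, alpha, beta)
print(float(adjusted_weight))
```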
In one embodiment, the intelligent customer service system is deployed in an online customer service center for processing customer queries and complaints; users interact with the system through text, voice, images or video. A user submits a question through the chat interface: "I am confused about my latest bill; the charges appear higher than expected." and expresses concern through voice input: "Why can't I reach your customer service? I have an urgent problem to solve." The user uploads a picture of the bill showing an unexplained high charge, and in addition uploads a video through the video upload function in which the user's expression is serious and the speech rate is fast. Keywords and phrases such as "confused" and "higher than expected" are extracted from the text input and converted into a numerical feature vector; features such as pitch, intensity and speech rate are extracted from the voice input and encoded into a vector; the uploaded bill picture is analyzed with an image recognition algorithm, and the digit and text information is recognized and converted into a feature vector; the facial expression and voice features in the video are analyzed and integrated into a comprehensive feature vector or kept as separate, parallel feature vectors. These feature vectors are input into the convolutional neural network, whose structure can process multiple types of data input and can recognize and analyze emotional signals from the different media. The hierarchical dynamic activation function used in the network dynamically adjusts the processing weights according to the type of input data (text, voice, image or video) and its similarity to the user's historical data, which helps the system understand the user's current emotional state more accurately. The output of the CNN is used to judge the user's emotional attitude, such as confusion, urgency or dissatisfaction, and this emotional information guides the subsequent response strategy. The improved convolutional neural network (CNN) accepts feature vectors for text, speech, image and video data and performs efficient feature integration and emotion recognition through the hierarchical dynamic activation function.
The improved convolutional neural network input layer receives feature vectors of multimodal data from different sources, including features extracted from text, speech, images, and video, and the data for each media type is preprocessed and converted into fixed length vectors. The improved convolutional neural network convolutional layer designs independent convolutional processing paths for each type of input data, for example, text data uses a one-dimensional convolutional layer to process sequence data, voice and image data uses a two-dimensional convolutional layer to process time-frequency features and spatial features, and video data uses three-dimensional convolution to process space-time features. Each path has a specific filter size and activation function to optimize processing of the corresponding type of data.
Dynamic weight adjustment in the hierarchical dynamic activation layer adjusts the activation weights based on the type of input data and the similarity to the user's historical data; for example, if the text information is more important in the current situation, the weight of the text data processing path is increased. The dynamic adjustment of the weights is realized by introducing an attention mechanism or a gating mechanism, so that the network can adaptively focus on the more critical information as the context changes.
The convolutional neural network pooling layer, after processing various data, uses a max-pooling or average pooling layer to merge features from different processing paths, which helps extract and preserve the most important features from various inputs.
The convolutional neural network full-connection layer, the fused features are sent to one or more full-connection layers to further integrate information and classify emotion. The output of the fully connected layer is a probability distribution of the user's emotional state, such as anger, happiness, sadness, etc. The final output layer of the improved convolutional neural network output layer uses a softmax activation function to convert the output of the network into a probability distribution which represents the possibility of different emotion states; the network is implemented using TensorFlow or PyTorch deep learning frameworks that support custom convolutional layers, dynamic activation functions, and complex data structures; training of the network is performed using the labeled emotion dataset, applying, for example, cross entropy loss functions and Adam optimizers, etc., to optimize model performance.
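The description names TensorFlow or PyTorch as possible frameworks; the PyTorch skeleton below sketches the multi-path architecture just described (1-D convolution for text, 2-D for speech and image, 3-D for video, pooling, fully connected layers and a softmax over the four emotion classes). All layer sizes, input shapes and the way the dynamic modality weights enter the fusion step are illustrative assumptions rather than the patented network.

```python
# Skeleton of a multi-path CNN with per-modality branches and weighted fusion.
import torch
import torch.nn as nn

class MultiModalEmotionCNN(nn.Module):
    def __init__(self, n_classes=4, feat_dim=32):
        super().__init__()
        self.text_path = nn.Sequential(                    # 1-D conv for token sequences
            nn.Conv1d(1, feat_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1))
        self.audio_path = nn.Sequential(                   # 2-D conv for time-frequency maps
            nn.Conv2d(1, feat_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool2d(1))
        self.image_path = nn.Sequential(                   # 2-D conv for image features
            nn.Conv2d(3, feat_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool2d(1))
        self.video_path = nn.Sequential(                   # 3-D conv for spatio-temporal clips
            nn.Conv3d(3, feat_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool3d(1))
        self.head = nn.Sequential(
            nn.Linear(4 * feat_dim, 64), nn.ReLU(), nn.Linear(64, n_classes))

    def forward(self, text, audio, image, video, modality_weights):
        feats = [
            self.text_path(text).flatten(1),
            self.audio_path(audio).flatten(1),
            self.image_path(image).flatten(1),
            self.video_path(video).flatten(1),
        ]
        # modality_weights: one dynamically adjusted weight per modality (w_m * delta)
        fused = torch.cat([w * f for w, f in zip(modality_weights, feats)], dim=1)
        return torch.softmax(self.head(fused), dim=1)      # P(anger, sadness, happiness, neutral)

model = MultiModalEmotionCNN()
probs = model(torch.randn(1, 1, 128),          # text feature sequence
              torch.randn(1, 1, 64, 64),       # spectrogram
              torch.randn(1, 3, 64, 64),       # image
              torch.randn(1, 3, 8, 32, 32),    # short video clip
              modality_weights=torch.tensor([0.9, 0.7, 1.0, 1.0]))
print(probs)
```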
S3: adjusting the response strategy in the adaptive learning system according to the evaluated emotional attitude of the user, that is, selecting a corresponding reply template based on whether the user's emotional state is anger, sadness, happiness or neutral; in one embodiment, the intelligent customer service system is deployed on an e-commerce platform for processing customer queries and complaints regarding orders, product information, return policies and the like.
In one embodiment, the user encounters a problem during a return and enters angrily: "This is the third time I have contacted you! I demand an immediate refund!"; a reply template designed with empathy and an urgent-handling prompt is selected, and then a specific reply is given: "We are very sorry for the inconvenience. We understand your frustration and are giving priority to handling your refund request. Please provide your order number and we will process it immediately." In one embodiment, the user expresses sadness at failing to purchase a limited-edition product in time: "I am really disappointed; I can never buy what I want."; a reply template that provides comfort and additional help options is selected, and then a specific reply is given: "We understand your disappointment and are truly sorry you missed this purchase. We can register you and notify you immediately when the product becomes available again. Thank you for your understanding and patience." In one embodiment, the user gives positive feedback after a purchase: "I just received my package; I really love this purchase!"; a reply template that encourages positive feedback and further interaction is selected, and then a specific reply is given: "That's wonderful, we are happy you like your new purchase! If convenient, please share your shopping experience or leave a review on our website; thank you very much for your support and feedback!" In one embodiment, the user asks: "What are your store hours?"; a standard reply template that provides direct information is selected, and then a specific reply is given: "Thank you for asking. Our store opens at 9 a.m. and closes at 8 p.m. every day. We look forward to your visit!" Multiple reply templates are preset in the reply template management system, each designed for different emotions and situations, and the automatic response system is integrated with automation tools and the CRM system to ensure that the reply strategy is executed promptly and accurately.
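A minimal sketch of the emotion-to-template mapping described above; the template texts paraphrase the example replies and the dictionary structure is an illustrative assumption, not a fixed part of the method.

```python
# Emotion-to-reply-template selection (illustrative mapping).
REPLY_TEMPLATES = {
    "anger": "We are very sorry for the inconvenience. Your {issue} is being handled "
             "with priority; please provide your order number.",
    "sadness": "We understand your disappointment about {issue}. We can notify you "
               "as soon as it becomes available again.",
    "happiness": "We are glad you are happy with {issue}! Feel free to share your "
                 "experience or leave a review.",
    "neutral": "Thank you for asking about {issue}. {answer}",
}

def select_template(emotion: str) -> str:
    """Pick the reply template matching the predicted emotional state."""
    return REPLY_TEMPLATES.get(emotion, REPLY_TEMPLATES["neutral"])

print(select_template("anger"))
```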
S4: placeholders in the filling templates are generated using natural language based on the selected reply templates and the replies are output to the user in text or voice or image or video multimedia form.
In one embodiment, the intelligent customer service system is deployed at an airline's customer service center to assist passengers in resolving issues concerning flights, baggage, reservations and the like. A passenger enters through text chat: "Will my flight take off on time?" The system selects a standard reply template for flight status: "Hello, your flight [flight number] is expected to depart on time at [departure time]."; the system queries the latest flight information database, obtains the specific flight number and extracts the estimated departure time; the information is filled into the template using NLG technology and output as text in the chat window: "Hello, your flight BA123 is expected to depart on time at 18:00." The user asks by voice: "How much baggage can I carry?"; the system selects a reply template for the baggage policy: "Hello, your free baggage allowance is [baggage allowance]."; to fill the placeholder, the system looks up the baggage policy according to the passenger's ticket class and route; the system then generates a voice response using text-to-speech (TTS) technology and delivers the voice output via telephone or the mobile application: "Hello, your free baggage allowance is two pieces, each not exceeding 23 kg." The user asks about visa requirements by uploading a passport photo; the system selects a reply template for providing visa information, prepares an image containing the visa requirement information, generates the visa requirements in image form and sends it to the user through email or the mobile application. The user asks how to check in online; the system selects a video reply template providing an online check-in tutorial, plays the pre-made online check-in tutorial video and displays the video tutorial through the user interface.
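A minimal sketch of the placeholder-filling step for the flight-status example above; the flight lookup is a stub standing in for the real flight-information database.

```python
# Fill template placeholders from (stubbed) flight data before outputting the reply.
def lookup_flight(flight_no: str) -> dict:
    return {"flight_no": flight_no, "departure_time": "18:00"}   # stubbed data

template = "Hello, your flight {flight_no} is expected to depart on time at {departure_time}."

info = lookup_flight("BA123")
reply_text = template.format(**info)
print(reply_text)   # the text could then be passed to a TTS engine for voice output
```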
In some embodiments, S22: if there is voice input, obtaining the pitch, intensity, speed and rhythm of the voice to form a feature vector T2 comprises: framing the voice; calculating the fundamental frequency of each frame using cepstrum analysis to obtain the pitch of the voice; obtaining the intensity of the voice by calculating the amplitude value of each frame; measuring the ratio of the number of syllables or words in the voice to the elapsed time to obtain the speech rate; and characterizing the rhythm of the speech by counting the frequency and duration of speech pauses.
In some embodiments, the intelligent customer service system is deployed in the customer service center system of an insurance company, and users can ask questions about insurance policies, claim procedures and the like through the telephone system. A customer asks by telephone: "I want to know the progress of my claim report; can it be processed faster?" The system first frames the user's speech signal, dividing the continuous signal into short frames of approximately 20-30 milliseconds each, using a digital-signal-processing window function (a Hanning or Hamming window) and maintaining an overlap of around 50% between frames to preserve data continuity. Cepstrum analysis is performed on each frame of speech data to extract the pitch (fundamental frequency); changes in pitch can reveal the speaker's emotional state, such as tension or relaxation; a fast Fourier transform (FFT) is used to obtain the spectrum, and the cepstrum method is used to find the fundamental frequency in the speech signal. The intensity or energy level of each frame is calculated; the intensity level can reflect the emotional intensity or tension of the speaker; the squared signal values of each frame are summed to obtain the energy, which is then log-transformed to obtain the intensity. The speech rate is measured by calculating the ratio of the number of speech segments or words to time; a fast speech rate indicates urgency or anxiety; the speech is converted to text using an automatic speech recognition (ASR) system and the number of words within a given time is counted. The rhythm is characterized by counting the frequency and duration of pauses in the speech; an irregular rhythm or frequent pauses indicate hesitation or uncertainty; silent segments in the speech signal are detected, and the duration and frequency of the pauses are calculated.
In some embodiments, S21: if there is text input, performing word segmentation and part-of-speech tagging on the input text; if there is voice input, converting the voice into text; calculating the importance of each word in the input text and the converted text using the TF-IDF algorithm, where the higher the calculated TF-IDF value of a word, the greater its importance; converting the input text and the converted text into a numerical feature vector T1, that is, calculating the TF-IDF value of each word in the document using the TF-IDF algorithm and then combining the calculated TF-IDF values into a vector, with each TF-IDF value corresponding to a unique word. In some embodiments, the intelligent customer service system is deployed on an e-commerce website system, and customers can ask questions through online chat or voice calls about product information, order status, return policies and so on. The user inputs through the chat window: "I want to know the battery life of this camera." The system uses spaCy or jieba to segment the input sentence, obtaining words such as "I", "want", "know", "camera", "battery", "life"; part-of-speech tagging is performed on the segmentation result to identify nouns, verbs, adjectives and so on, for example "camera" and "battery" as nouns and "know" as a verb. The TF-IDF algorithm is used to calculate the importance of each word; important words such as "battery" and "life" obtain higher TF-IDF values, and the calculated TF-IDF values are converted into a numerical feature vector that can be used for subsequent query processing and information retrieval. The user also speaks through voice input: "My order number is 12345; can I cancel it?" The user's voice is converted into text using a speech recognition service such as iFlytek; word segmentation and part-of-speech tagging are performed on the converted text, recognizing keywords such as "order number 12345" and "cancel"; TF-IDF values are calculated for the words in the converted text to determine the weights of the key information, with "order number" and "cancel" obtaining higher weights; and a numerical feature vector is created from the TF-IDF values for further processing of the user's request.
In some embodiments, the calculation of the position changes of the facial key points includes the lifting speed and amplitude of the eyebrows and the lifting speed and amplitude of the mouth corners, which together form a feature vector T3; if a single image is input, the key point positions of the current image are compared with a preset neutral-expression position to obtain the variation amplitude.
The Haar cascade classifier is an effective object detection method, particularly for face detection. Haar features reflect pixel intensity variations between different regions of an image and are computed at different sizes and positions to capture structural information such as edges, lines and textures. To compute these features quickly, the image is first converted into an integral image; with this transformation the pixel sum of any image region can be calculated in constant time, which greatly speeds up feature computation (a small sketch of the integral image is given below). The Haar features are used to train a series of cascaded AdaBoost classifiers: each stage judges whether a window contains a face, and if so the window is passed on to the next stage, otherwise it is rejected immediately, which effectively reduces the amount of computation. A sliding window is applied over the whole image and the cascade classifier decides at each window position whether a face is present. In a video call service, the customer communicates in real time through a camera, and the system detects and recognizes the customer's facial expressions to analyze emotion.
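The integral-image trick mentioned above can be shown in a few lines: after one cumulative-sum pass, the pixel sum of any rectangle needs only four array lookups, which is what makes the Haar features cheap to evaluate. The padding convention below is one common choice, not the only one.

```python
import numpy as np

def integral_image(img):
    """Cumulative sum over rows and columns, padded so that index (0, 0) is zero."""
    ii = np.cumsum(np.cumsum(img.astype(np.int64), axis=0), axis=1)
    return np.pad(ii, ((1, 0), (1, 0)), mode="constant")

def rect_sum(ii, top, left, height, width):
    """Sum of pixels in a rectangle from four corner lookups (constant time)."""
    b, r = top + height, left + width
    return ii[b, r] - ii[top, r] - ii[b, left] + ii[top, left]

# A two-rectangle Haar-like edge feature is then just the difference of two such sums:
# feature = rect_sum(ii, y, x, h, w) - rect_sum(ii, y, x + w, h, w)
```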
The Dlib 68-point facial key point detector is a machine learning based method that locates 68 specific points on a detected face, covering the main facial areas such as the eyes, eyebrows, nose, mouth and chin. It is trained on a large number of labeled face images, each training sample consisting of a face image and the positions of its 68 key points; the model learns to extract features from the raw pixels and uses these features to predict the key point locations. Once the Haar classifier has detected a face, the Dlib model can be used to locate the 68 facial key points precisely within that region, as sketched below.
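A minimal detection sketch combining OpenCV's pretrained frontal-face Haar cascade with Dlib's 68-point shape predictor. The predictor file shape_predictor_68_face_landmarks.dat is an external model that must be downloaded separately, and the landmark indices assumed for the eyebrows (17-26) and the mouth corners (48 and 54) follow the usual 68-point numbering; for video, dividing the displacements by the inter-frame interval would give the lifting speeds in addition to the amplitudes.

```python
import cv2
import dlib
import numpy as np

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # external model file

def facial_keypoints(gray_frame):
    """Detect a face with the Haar cascade, then locate the 68 Dlib landmarks inside it."""
    faces = face_cascade.detectMultiScale(gray_frame, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    shape = predictor(gray_frame, dlib.rectangle(x, y, x + w, y + h))
    return np.array([[p.x, p.y] for p in shape.parts()], dtype=float)

def expression_features(current, neutral):
    """Feature vector T3: eyebrow lift and mouth-corner displacement relative to a neutral frame."""
    d = neutral - current                            # positive y-difference = point moved up
    eyebrow_lift = d[17:27, 1].mean()                # landmarks 17-26: both eyebrows
    left_corner, right_corner = d[48, 1], d[54, 1]   # landmarks 48 and 54: mouth corners
    return np.array([eyebrow_lift, left_corner, right_corner])

# gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
# T3 = expression_features(facial_keypoints(gray), facial_keypoints(neutral_gray))
```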
In some embodiments, D (h m, x) represents cosine similarity between the user history data h m and the input x of the current network layer, and when the input is text, K-means clustering is performed on the history text data, and cosine similarity calculation is performed on the input x of the current network layer and the history text data clustering center vector; when the input is voice, performing cosine similarity calculation on the input x of the current network layer and the pitch, intensity, speech speed and rhythm history mean vector of the history voice data; when the input is an image or video, the input of the current network layerAnd performing cosine similarity calculation with a historical average value characteristic vector formed by the lifting speed and amplitude of the eyebrows and the lifting speed and amplitude of the mouth corners of the historical image or video data.
How the cosine similarity between the current input and the historical data is calculated for the different input types (text, voice, image or video) is described in detail below. First, for cosine similarity of text input: a customer previously asked about the product return policy and now raises a related question again through the online chat platform. The historical text queries are grouped with the K-means clustering algorithm, each class having a cluster center vector that represents the core content of that class of questions. The question currently entered by the user is first processed by word segmentation and TF-IDF and converted into a feature vector, and the similarity between this feature vector and each cluster center vector is calculated with the cosine similarity formula; from the similarity scores, the system judges the relevance of the current question to the user's historical questions and selects the most suitable answering strategy. A sketch of this branch is given below.
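A sketch of this text branch, assuming scikit-learn for both clustering and similarity; character n-grams are used here purely to keep the example self-contained, whereas in practice the jieba-based TF-IDF vectors from the earlier sketch would be clustered.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

history = [
    "退货政策是什么",        # earlier return-policy questions ...
    "退货需要运费吗",
    "订单什么时候发货",      # ... and earlier shipping questions
    "物流信息怎么查询",
]
current_query = ["你们的退货流程怎么走"]

# Character n-grams avoid a segmenter in this self-contained sketch.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 2))
history_tfidf = vectorizer.fit_transform(history)
query_vec = vectorizer.transform(current_query)

# Cluster the historical queries; each centre stands for one class of past questions.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(history_tfidf)

# Cosine similarity between the current input x and each cluster centre;
# the best match supplies D(h_m, x) for the text modality (m = 1).
similarities = cosine_similarity(query_vec, kmeans.cluster_centers_)[0]
D_text = float(similarities.max())
best_cluster = int(similarities.argmax())
print(D_text, best_cluster)
```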
Second, for cosine similarity of voice input: a user asks about account security issues over the telephone, and the system has stored the historical mean vector of pitch, intensity, speech rate and rhythm of the user's previous voice queries. The new voice input is transcribed into text and its pitch, intensity, speech rate and rhythm features are extracted, and the cosine similarity between the current voice feature vector and the historical mean vector is calculated. This helps identify whether the user is expressing similar concerns or problems in similar situations, allowing the system to adjust its response to match the user's emotion and needs.
Finally, for cosine similarity of image or video input: the user uploads a video of the damaged product through the application. The system analyzes the images and videos previously uploaded by the user, extracts key facial expression features such as the lifting speed and amplitude of the eyebrows and the lifting or sagging speed and amplitude of the mouth corners, and computes the historical mean vector of these features. The same facial expression analysis is performed on the newly uploaded video to extract the current feature vector, and the cosine similarity between the facial expression feature vector of the current video and the historical mean vector is computed. A small numerical sketch of this comparison is given below.
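For the voice and the image or video branches, the comparison reduces to a plain cosine similarity between the current feature vector and the stored historical mean. The NumPy sketch below uses made-up feature values purely for illustration; in practice the components would be normalized first, since they are measured in different units.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-10):
    """D(h_m, x): cosine of the angle between the historical mean vector and the current input."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

# Voice modality (m = 2): [pitch Hz, intensity dB, speech rate words/s, pauses per minute]
voice_history_mean = np.array([185.0, 62.0, 3.1, 4.0])
voice_current      = np.array([240.0, 70.0, 4.2, 7.5])   # higher pitch, louder, faster: likely agitated
D_voice = cosine_similarity(voice_history_mean, voice_current)

# Image/video modality (m = 3 or 4): [eyebrow lift speed, eyebrow amplitude,
#                                      mouth-corner speed, mouth-corner amplitude]
face_history_mean = np.array([0.8, 2.0, 0.5, 1.2])
face_current      = np.array([1.6, 3.5, 0.2, 0.4])
D_face = cosine_similarity(face_history_mean, face_current)

print(D_voice, D_face)
```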
Example 2:
The application also provides a full-media interactive intelligent customer service interaction system, which, as shown in figure 2, comprises a server, a data storage device, network devices and terminal devices. The server is the core of the system and is responsible for all data analysis, storage and response generation; a high-performance server, such as one with a multi-core processor and large RAM capacity, can be used to support parallel processing and big-data workloads. The data storage device stores historical interaction data, user data, cluster center vectors, training models and the like, and may include high-speed SSDs, high-capacity hard disk drives (HDDs) and external cloud storage services. The network devices ensure high-speed data transmission and network connectivity, and include routers, switches and load balancers supporting both local area and wide area network connections. Audio and video processing hardware, such as sound cards and video capture cards, handles the audio and video data inputs and supports high-quality audio and video input and output. The terminal devices provide the interface through which users interact with the system, such as a computer, a smartphone or a dedicated client device, including personal computers, mobile devices, microcomputers and so on.
The server is connected to the storage device through high-speed Ethernet or a more advanced network technology (such as Fibre Channel), ensuring the speed and security of data transmission, and is connected to the network devices through a high-bandwidth network interface card to ensure efficient external communication. The server and the terminal devices are connected through the Internet or a dedicated network to support remote user access, and the server connects to cloud services through a secure VPN or a direct Internet link for additional data processing and backup. The audio/video processing hardware is connected directly to the server through a USB interface or a PCIe slot for capturing and processing audio/video data in real time. For system monitoring and management, the server and the network devices can be monitored through network management protocols such as SNMP to maintain the stability and performance of the system.
The application discloses a full-media interactive intelligent customer service interaction system, which specifically comprises the following modules:
an input module for receiving user input, including text and/or voice and/or image and/or video;
The evaluation judging module is used for evaluating and judging the emotion state of the user according to the input of the user;
The text vector construction module is used for performing word segmentation and part-of-speech tagging on the input text if there is text input, converting the voice into text if there is voice input, calculating the importance of each word in the input text and the converted text with the TF-IDF algorithm, and converting the input text and the converted text into a numerical feature vector T1;
The voice vector construction module obtains the pitch, intensity, speech rate and rhythm of the voice to form a feature vector T2 if there is voice input;
The image or video vector construction module is used for performing face recognition with a Haar cascade classifier if there is image or video input, and recognizing facial key points with the Dlib 68-point model, the facial key points including the eyebrows, eyes, nose, mouth and chin; it calculates the position changes of the facial key points, including the lifting speed and amplitude of the eyebrows and the lifting and sagging speed and amplitude of the mouth corners, and forms a feature vector T3;
the user emotion attitude calculation module inputs the feature vector T1, the feature vector T2 and the feature vector T3 into the improved convolutional neural network to calculate the user's emotional attitude; wherein the improved convolutional neural network employs a hierarchical dynamic activation function f(x, m):
f(x, m) = ReLU(x) × (wm + δ(hm, x))
wherein x represents the input of the network layer; wm represents the initial weight of the corresponding multimedia modality m, with m=1 representing text input, m=2 speech input, m=3 image input and m=4 video input; δ(hm, x) represents the similarity adjustment weight between the user history data hm and the input x of the current network layer; D(hm, x) represents the cosine similarity between the user history data hm and the input x of the current network layer; α is a first learned parameter and β is a second learned parameter; and m represents the different multimedia modalities (a sketch of this activation function is given after the module list);
the reply template calculation module adjusts the reply strategy in the adaptive learning system according to the evaluated emotional attitude of the user, that is, it selects the corresponding reply template according to whether the user's emotional state is anger, sadness, happiness or neutral;
and a user reply module, which fills the placeholders in the selected reply template using natural language generation and outputs the reply to the user in the form of text, voice, image or video multimedia.
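The hierarchical dynamic activation function can be sketched compactly. The PyTorch version below is an illustration only: the framework, the 0-based modality index, the illustrative initial weights (chosen so that video and image outweigh voice, which outweighs text, as described later), and in particular the affine form δ(hm, x) = α·D(hm, x) + β are assumptions, since the text only states that α and β are learned parameters and that D(hm, x) is a cosine similarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalDynamicActivation(nn.Module):
    """f(x, m) = ReLU(x) * (w_m + delta(h_m, x)).

    delta is taken here as alpha * D(h_m, x) + beta, an assumed affine form;
    the text only states that alpha and beta are learned parameters.
    """

    def __init__(self, num_modalities=4, init_weights=(0.15, 0.20, 0.30, 0.35)):
        super().__init__()
        self.w = nn.Parameter(torch.tensor(init_weights, dtype=torch.float32))  # w_m per modality
        self.alpha = nn.Parameter(torch.tensor(1.0))   # first learned parameter
        self.beta = nn.Parameter(torch.tensor(0.0))    # second learned parameter

    def forward(self, x, m, h_m):
        """x: layer input; m: modality index 0..3 (text, speech, image, video); h_m: history vector."""
        D = F.cosine_similarity(h_m.flatten(), x.flatten(), dim=0)   # D(h_m, x)
        delta = self.alpha * D + self.beta                           # assumed form of delta
        return F.relu(x) * (self.w[m] + delta)

# act = HierarchicalDynamicActivation()
# y = act(torch.randn(1, 64), m=1, h_m=torch.randn(1, 64))
```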
In some embodiments, the voice vector construction module obtains the pitch, intensity, speech rate and rhythm of the voice to form a feature vector T2 if there is voice input, which includes framing the voice; calculating the fundamental frequency of each frame with cepstrum analysis to obtain the pitch of the voice; obtaining the intensity of the voice by calculating the amplitude of each frame; measuring the ratio of the number of syllables or words in the voice to elapsed time to obtain the speech rate; and characterizing the rhythm of the speech by counting the frequency and duration of speech pauses.
In some embodiments, the text vector construction module performs word segmentation and part-of-speech tagging on the input text if there is text input, converts speech into text if there is speech input, calculates the importance of each word in the input text and the converted text with the TF-IDF algorithm (the higher the calculated TF-IDF value of a word, the greater its importance), and converts the input text and the converted text into a numerical feature vector T1; that is, the TF-IDF value of each word in the document is calculated with the TF-IDF algorithm and the calculated TF-IDF values are combined into a vector, each TF-IDF value corresponding to a unique word.
In some embodiments, the image or video vector construction module performs face recognition with a Haar cascade classifier if there is image or video input, and recognizes facial key points with the Dlib 68-point model, the facial key points including the eyebrows, eyes, nose, mouth and chin; it calculates the position changes of the facial key points, including the lifting speed and amplitude of the eyebrows and the upward and downward speed and amplitude of the mouth corners, and forms a feature vector T3; if a single image is input, the key point positions of the current image are compared with a preset neutral-expression position to obtain the variation amplitude.
In some embodiments, D(hm, x) represents the cosine similarity between the user history data hm and the input x of the current network layer. When the input is text, K-means clustering is performed on the historical text data and the cosine similarity is calculated between the input x of the current network layer and the cluster center vectors of the historical text data; when the input is voice, the cosine similarity is calculated between the input x of the current network layer and the historical mean vector of pitch, intensity, speech rate and rhythm of the historical voice data; when the input is an image or video, the cosine similarity is calculated between the input x of the current network layer and a historical mean feature vector formed by the lifting speed and amplitude of the eyebrows and the lifting speed and amplitude of the mouth corners of the historical image or video data.
The invention provides a full-media interactive intelligent customer service interaction method and system, which can realize the following beneficial technical effects:
1. The present application first receives input from the user; it then evaluates and judges the user's emotional state from that input, and adjusts the response strategy in the adaptive learning system according to the evaluated emotional attitude, that is, it selects the corresponding reply template according to whether the user's emotional state is anger, sadness, happiness or neutral; finally, based on the selected reply template, it fills the placeholders in the template using natural language generation and outputs the reply to the user in the form of text, voice, image or video multimedia. When evaluating and judging the user's emotional attitude, the text, voice and image conditions of the user as well as the historical data are all taken into account, and a hierarchical dynamic activation function is adopted for emotion prediction, which greatly improves the accuracy and efficiency of the emotion judgment and thereby greatly improves the user experience.
2. The user emotion attitude calculation module feeds the text input vector, the voice input vector and the image or video input vector together into the improved convolutional neural network to calculate the user's emotional attitude, where the improved convolutional neural network adopts the hierarchical dynamic activation function
f(x, m) = ReLU(x) × (wm + δ(hm, x))
in which m=1 denotes text input, m=2 speech input, m=3 image input and m=4 video input. Different input modalities correspond to different weights, the weights of image and video being higher than that of voice and the weight of voice higher than that of text input, and the similarity between the user history data hm and the input x of the current network layer is taken into account to adjust the weights. In this way both the rich current input data and the historical data are fully exploited, which greatly improves the accuracy of the user emotion judgment.
3. If there is voice input during emotion judgment, the voice is converted into text, the importance of each word in the input text and the converted text is calculated with the TF-IDF algorithm (the higher the calculated TF-IDF value of a word, the greater its importance), and the input text and the converted text are converted into a numerical feature vector T1; that is, the TF-IDF value of each word in the document is calculated with the TF-IDF algorithm and the calculated values are combined into a vector, each TF-IDF value corresponding to a unique word. The user's emotional attitude is thus judged from the transcribed text while emotion is also judged from parameters such as the pitch of the voice, so that the dimensions of the voice data are fully exploited and the accuracy of the user emotion judgment is greatly improved.
The full-media interactive intelligent customer service interaction method and system have been described in detail above, and specific examples have been used to illustrate the principle and implementation of the present invention; the above description of the embodiments is only intended to help understand the core idea of the present invention. As will be apparent to those skilled in the art in light of the present teaching, the present disclosure should not be limited to the specific embodiments and applications described herein.

Claims (6)

1. The full-media interactive intelligent customer service interaction method is characterized by comprising the following steps of:
S1: receiving user input including text and/or voice and/or images and/or video;
S2: evaluating and judging the emotional state of the user according to the user input;
S21: if there is text input, performing word segmentation and part-of-speech tagging on the input text; if there is voice input, converting the voice into text; calculating the importance of each word in the input text and the converted text with the TF-IDF algorithm, a higher calculated TF-IDF value of a word indicating greater importance; converting the input text and the converted text into a numerical feature vector T1, that is, calculating the TF-IDF value of each word in the document with the TF-IDF algorithm and then combining the calculated TF-IDF values into a vector, each TF-IDF value corresponding to a unique word;
S22: if there is voice input, obtaining the pitch, intensity, speech rate and rhythm of the voice to form a feature vector T2;
S23: if there is image or video input, performing face recognition with a Haar cascade classifier and recognizing facial key points with the Dlib 68-point model, the facial key points including the eyebrows, eyes, nose, mouth and chin; calculating position changes of the facial key points, including the lifting speed and amplitude of the eyebrows and the lifting and dropping speed and amplitude of the mouth corners, to form a feature vector T3;
S24: inputting the feature vector T1, the feature vector T2 and the feature vector T3 into the improved convolutional neural network, and calculating the emotional attitude of the user; wherein the improved convolutional neural network employs a hierarchical dynamic activation function f(x, m):
f(x, m) = ReLU(x) × (wm + δ(hm, x))
wherein x represents the input of the network layer; wm represents the initial weight of the corresponding multimedia modality m, with m=1 representing text input, m=2 speech input, m=3 image input and m=4 video input; δ(hm, x) represents the similarity adjustment weight between the user history data hm and the input x of the current network layer; D(hm, x) represents the cosine similarity between the user history data hm and the input x of the current network layer; α is a first learned parameter and β is a second learned parameter; and m represents the different multimedia modalities;
S3: adjusting the response strategy in the adaptive learning system according to the evaluated emotional attitude of the user, that is, selecting the corresponding reply template according to whether the user's emotional state is anger, sadness, happiness or neutral;
S4: based on the selected reply template, filling the placeholders in the template using natural language generation, and outputting the reply to the user in the form of text, voice, image or video multimedia;
wherein D(hm, x) represents the cosine similarity between the user history data hm and the input x of the current network layer; when the input is text, K-means clustering is performed on the historical text data and the cosine similarity is calculated between the input x of the current network layer and the cluster center vectors of the historical text data; when the input is voice, the cosine similarity is calculated between the input x of the current network layer and the historical mean vector of pitch, intensity, speech rate and rhythm of the historical voice data; when the input is an image or video, the cosine similarity is calculated between the input x of the current network layer and a historical mean feature vector formed by the lifting speed and amplitude of the eyebrows and the lifting speed and amplitude of the mouth corners of the historical image or video data.
2. The full-media interactive intelligent customer service interaction method as claimed in claim 1, wherein said S22: if there is voice input, obtaining the pitch, intensity, speech rate and rhythm of the voice to form a feature vector T2 comprises: framing the voice; calculating the fundamental frequency of each frame with cepstrum analysis to obtain the pitch of the voice; obtaining the intensity of the voice by calculating the amplitude of each frame; measuring the ratio of the number of syllables or words in the voice to elapsed time to obtain the speech rate; and characterizing the rhythm of the speech by counting the frequency and duration of speech pauses.
3. The full-media interactive intelligent customer service interaction method according to claim 1, wherein the calculation of the position changes of the facial key points includes the lifting speed and amplitude of the eyebrows and the lifting speed and amplitude of the mouth corners, forming a feature vector T3; and if a single image is input, the key point positions of the current image are compared with a preset neutral-expression position to obtain the variation amplitude.
4. A full media interactive intelligent customer service interaction system, comprising:
an input module for receiving user input, including text and/or voice and/or image and/or video;
The evaluation judging module is used for evaluating and judging the emotion state of the user according to the input of the user;
The text vector construction module is used for performing word segmentation and part-of-speech tagging on the input text if there is text input, converting the voice into text if there is voice input, and calculating the importance of each word in the input text and the converted text with the TF-IDF algorithm, a higher calculated TF-IDF value of a word indicating greater importance; it converts the input text and the converted text into a numerical feature vector T1, that is, it calculates the TF-IDF value of each word in the document with the TF-IDF algorithm and then combines the calculated TF-IDF values into a vector, each TF-IDF value corresponding to a unique word;
The voice vector construction module obtains the pitch, intensity, speech rate and rhythm of the voice to form a feature vector T2 if there is voice input;
The image or video vector construction module is used for performing face recognition with a Haar cascade classifier if there is image or video input, and recognizing facial key points with the Dlib 68-point model, the facial key points including the eyebrows, eyes, nose, mouth and chin; it calculates the position changes of the facial key points, including the lifting speed and amplitude of the eyebrows and the lifting and sagging speed and amplitude of the mouth corners, and forms a feature vector T3;
the user emotion attitude calculation module inputs the feature vector T1, the feature vector T2 and the feature vector T3 into the improved convolutional neural network to calculate the user's emotional attitude; wherein the improved convolutional neural network employs a hierarchical dynamic activation function f(x, m):
f(x, m) = ReLU(x) × (wm + δ(hm, x))
wherein x represents the input of the network layer; wm represents the initial weight of the corresponding multimedia modality m, with m=1 representing text input, m=2 speech input, m=3 image input and m=4 video input; δ(hm, x) represents the similarity adjustment weight between the user history data hm and the input x of the current network layer; D(hm, x) represents the cosine similarity between the user history data hm and the input x of the current network layer; α is a first learned parameter and β is a second learned parameter; and m represents the different multimedia modalities;
the reply template calculation module adjusts the reply strategy in the adaptive learning system according to the evaluated emotional attitude of the user, that is, it selects the corresponding reply template according to whether the user's emotional state is anger, sadness, happiness or neutral;
a user reply module, which fills the placeholders in the selected reply template using natural language generation and outputs the reply to the user in the form of text, voice, image or video multimedia;
wherein D(hm, x) represents the cosine similarity between the user history data hm and the input x of the current network layer; when the input is text, K-means clustering is performed on the historical text data and the cosine similarity is calculated between the input x of the current network layer and the cluster center vectors of the historical text data; when the input is voice, the cosine similarity is calculated between the input x of the current network layer and the historical mean vector of pitch, intensity, speech rate and rhythm of the historical voice data; when the input is an image or video, the cosine similarity is calculated between the input x of the current network layer and a historical mean feature vector formed by the lifting speed and amplitude of the eyebrows and the lifting speed and amplitude of the mouth corners of the historical image or video data.
5. The full-media interactive intelligent customer service interaction system according to claim 4, wherein the voice vector construction module obtains the pitch, intensity, speech rate and rhythm of the voice to form a feature vector T2 if there is voice input, which comprises: framing the voice; calculating the fundamental frequency of each frame with cepstrum analysis to obtain the pitch of the voice; obtaining the intensity of the voice by calculating the amplitude of each frame; measuring the ratio of the number of syllables or words in the voice to elapsed time to obtain the speech rate; and characterizing the rhythm of the speech by counting the frequency and duration of speech pauses.
6. The full-media interactive intelligent customer service interaction system according to claim 4, wherein the image or video vector construction module performs face recognition with a Haar cascade classifier if there is image or video input and recognizes facial key points with the Dlib 68-point model, the facial key points including the eyebrows, eyes, nose, mouth and chin; it calculates position changes of the facial key points, including the lifting speed and amplitude of the eyebrows and the lifting speed and amplitude of the mouth corners, and forms a feature vector T3; and if a single image is input, the key point positions of the current image are compared with a preset neutral-expression position to obtain the variation amplitude.