
CN120496580A - Voice response processing method, device, electronic equipment and storage medium - Google Patents

Voice response processing method, device, electronic equipment and storage medium

Info

Publication number
CN120496580A
CN120496580A
Authority
CN
China
Prior art keywords
voice
task
voiceprint
semantic
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202510659095.6A
Other languages
Chinese (zh)
Inventor
贾哲
张玉龙
刘帆
宋沉蔓
曹亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Agricultural Bank of China
Original Assignee
Agricultural Bank of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Agricultural Bank of China filed Critical Agricultural Bank of China
Priority to CN202510659095.6A priority Critical patent/CN120496580A/en
Publication of CN120496580A publication Critical patent/CN120496580A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a voice response processing method and apparatus, an electronic device, and a storage medium. The method comprises: determining a first voiceprint feature of first voice data, wherein the first voice data comprises voice data generated by executing a first voice interaction task in an interactive voice response phase; determining voiceprint blacklist information, wherein the voiceprint blacklist information comprises a plurality of second voiceprint features and semantic features associated with the second voiceprint features; and determining, from a plurality of task processing resources according to the first voiceprint feature and the voiceprint blacklist information, a task processing resource matched with the first voice interaction task, the task processing resource being used to assist in processing the first voice interaction task that continues to be executed after the interactive voice response phase ends. The scheme can improve the efficiency of resolving customer appeals and avoid unnecessary interference with, or loss to, the customer service center.

Description

Voice response processing method, device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a method and apparatus for processing a speech response, an electronic device, and a storage medium.
Background
In a customer service center operation system, agent personnel are the core resource, and their working state directly determines the overall performance of the customer service center. Frequent complaints and harassment from incoming callers can seriously dampen agents' enthusiasm and lower morale, which in turn weakens the operational stability and efficiency of the customer service center. At the same time, mishandled customer appeals can also reduce customer satisfaction.
At present, blacklisted customers are identified and filtered by technical means: agent personnel manually record and mark problem customers, and information such as the customer's incoming call number and identity card number is entered into a blacklist information base. When a customer calls the interactive voice response system, the system judges whether the blacklist is hit by comparing the incoming call number and certificate number. This method has obvious shortcomings: a customer can easily bypass blacklist verification by changing the calling device number or logging in with another person's card information, so blacklisted customers cannot be effectively intercepted, and the practical requirements of efficient management and risk prevention and control at a customer service center are difficult to meet.
Disclosure of Invention
The invention provides a voice response processing method and apparatus, an electronic device, and a storage medium, which are used to improve the efficiency of resolving customer appeals and to avoid unnecessary interference with, or loss to, the customer service center.
According to an aspect of the present invention, there is provided a voice response processing method, wherein the method includes:
determining a first voiceprint feature of first voice data, the first voice data comprising voice data generated by executing a first voice interaction task in an interactive voice response phase;
determining voiceprint blacklist information, wherein the voiceprint blacklist information comprises a plurality of second voiceprint features and semantic features associated with the second voiceprint features, the second voiceprint features are voiceprint features of second voice data generated when a second voice interaction task is executed, each second voice interaction task is a voice interaction task that was executed before the first voice interaction task, and the semantic features associated with each second voiceprint feature are used to indicate the appeal information conveyed through speech expression during execution of the second voice interaction task and the emotion features contained when the appeal information is conveyed through speech expression; and
determining, from a plurality of task processing resources according to the first voiceprint feature and the voiceprint blacklist information, a task processing resource matched with the first voice interaction task, wherein the task processing resource is used to assist in processing the first voice interaction task that continues to be executed after the interactive voice response phase ends.
According to another aspect of the present invention, there is provided a voice response processing apparatus, wherein the apparatus includes:
a first determining module, configured to determine a first voiceprint feature of first voice data, wherein the first voice data comprises voice data generated by executing a first voice interaction task in an interactive voice response phase;
a second determining module, configured to determine voiceprint blacklist information, wherein the voiceprint blacklist information comprises a plurality of second voiceprint features and semantic features associated with the second voiceprint features, the second voiceprint features are voiceprint features of second voice data generated when a second voice interaction task is executed, each second voice interaction task is a voice interaction task that was executed before the first voice interaction task, and the semantic features associated with each second voiceprint feature are used to indicate the appeal information conveyed through speech expression during execution of the second voice interaction task and the emotion features contained when the appeal information is conveyed through speech expression; and
a processing module, configured to determine, from a plurality of task processing resources according to the first voiceprint feature and the voiceprint blacklist information, a task processing resource matched with the first voice interaction task, and to assist in processing the first voice interaction task that continues to be executed after the interactive voice response phase ends.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor, wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the voice response processing method according to any one of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer readable storage medium storing computer instructions for causing a processor to execute a voice response processing method according to any one of the embodiments of the present invention.
According to the technical scheme, when the first voice interaction task is executed in the interactive voice response phase, the source of the first voice data can be accurately identified from the uniqueness of its voiceprint by determining the first voiceprint feature of the first voice data. The voiceprint blacklist information contains not only voiceprint features but also associated semantic features: the semantic features indicate the appeal information expressed during execution of a voice interaction task, and the emotion features reflect the emotional state when the appeal was expressed through language. This association makes it possible to find, from the voiceprint feature, the semantic features relevant to the first voice data, i.e., the appeal and emotion information reflected in previous voice interaction tasks from the same sound source. A more suitable resource can then be selected from the plurality of task processing resources according to those semantic features, to assist in processing the first voice interaction task that continues to be executed after the interactive voice response phase ends. This improves the utilization efficiency of task processing resources; by allocating them reasonably, the appeal raised during voice interaction can be resolved quickly and effectively, waiting time is shortened, and the interaction experience is improved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for processing a voice response according to an embodiment of the present invention;
FIG. 2 is a diagram of a voice response processing architecture applicable to an embodiment of the present invention;
FIG. 3 is a flow chart of another voice response processing method according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a voice response processing device according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an electronic device implementing a voice response processing method according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Fig. 1 is a flowchart of a voice response processing method provided in an embodiment of the present invention. The embodiment is applicable to the case where, while a first voice interaction task is being executed in an interactive voice response phase, appropriate task processing resources are allocated in time so that voice response processing can continue in a manual response phase after the interactive voice response phase ends.
As shown in fig. 1, the voice response processing method provided in the present embodiment may include the following procedures:
S110, determining a first voiceprint feature of first voice data, wherein the first voice data comprises voice data generated by executing a first voice interaction task in an interactive voice response phase.
In the initial stage of call access, the method immediately enters the Interactive Voice Response (IVR) phase and synchronously executes the following first voice interaction tasks: automatically playing a standardized voice navigation menu; guiding the caller to select a service category through key presses or voice instructions; recognizing and parsing the voice information input by the user in real time; intelligently matching the corresponding service flow; and completing interaction tasks such as information inquiry, business handling, or transfer to a manual agent according to the operation instructions, thereby realizing efficient human-machine voice dialogue.
Interactive Voice Response (IVR) is an automated voice service system that performs voice interaction through prerecorded voice prompts and speech recognition technology. It guides the caller to complete a series of operations, such as querying information, transacting business, and selecting service types, through key presses or voice instructions; no human needs to participate directly in the whole process, and the caller can autonomously complete the relevant operations according to the system's voice prompts.
The first voice data may be the audio information input by voice to the IVR system while the first voice interaction task is performed with the IVR system in the interactive voice response phase. For example, the first voice data may include at least one of a number, a keyword, a phrase, or a complete sentence generated during voice interaction with the IVR system. The first voice data may be a response to a prompt of the IVR system in the interactive voice response phase, or a question or request entered into the IVR system in that phase, for example, response information, triggered by a prompt issued by the IVR system, indicating a request to perform a particular task.
Voiceprint features are unique and stable features contained in voice data. They can describe at least one of the frequency, amplitude, and pitch of the voice data, and by extracting and analyzing this information with acoustic analysis and signal processing techniques they can serve as an important basis for identity verification and speech recognition. Voiceprint features are a set of acoustic parameters extracted from voice data that reflect the physiological and behavioral characteristics of the sounds, thereby distinguishing different sound sources. The first voiceprint feature is the acoustic parameter feature extracted from the first voice data that reflects the physiological and behavioral characteristics exhibited by the sounds in the first voice data.
As an alternative but non-limiting implementation, determining a first voiceprint feature of first voice data includes the steps of:
and extracting first voiceprint features from the first voice data by adopting a feature extraction mode of a Mel frequency cepstrum coefficient.
While the first voice interaction task is performed with the interactive voice response system in the interactive voice response phase, the first voice data input by voice to the interactive voice response system is recorded.
The Mel Frequency Cepstrum Coefficient (MFCC) is a feature extraction algorithm commonly used in speech signal processing. It has good speech discrimination ability, can reflect important characteristics of speech such as pitch and timbre, and is widely applied in fields such as speech recognition and identification of the sound source to which speech belongs.
The MFCC feature extraction procedure is as follows. First, the first voice data is divided into frames: the first voice data is a time-varying signal, and framing allows it to be processed as a series of short-time stationary signals. Next, a Fast Fourier Transform (FFT) is applied to each frame of the first voice data to convert it into the frequency domain and obtain its spectrum. The spectrum of each frame is then filtered on the Mel frequency scale through a bank of Mel filters to obtain the Mel spectrum. Finally, the logarithm of the Mel spectrum is taken and a Discrete Cosine Transform (DCT) is applied, yielding the Mel frequency cepstrum coefficients. These coefficients effectively describe the first voiceprint feature of the first voice data and strongly characterize the timbre, pronunciation style, and so on of the speech.
Voiceprints are, like fingerprints, speech features unique to each person, with individual variability. After the first voice data is processed by the MFCC method, the resulting MFCC features contain speaker-related information that can be used to characterize the speaker's voiceprint. Specifically, because of physiological differences such as vocal cord length and thickness and oral cavity shape, and behavioral differences such as pronunciation habits, different people produce different Mel frequency cepstrum coefficients even when uttering the same speech. By extracting these coefficients, the differences can be quantized into voiceprint features for subsequent applications such as voiceprint recognition and voice authentication.
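As a concrete illustration, the pipeline above can be sketched in Python. The following is a minimal sketch assuming the librosa library; the sampling rate, frame and hop sizes, number of coefficients, and frame averaging are illustrative choices, not values specified by this embodiment.

```python
# Minimal sketch of MFCC-based voiceprint extraction (assumes librosa).
# All parameter values below are illustrative, not from this embodiment.
import librosa
import numpy as np

def extract_mfcc_voiceprint(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
    """Return a fixed-length voiceprint vector for one utterance."""
    signal, sr = librosa.load(wav_path, sr=16000)  # sample-rate normalization
    mfcc = librosa.feature.mfcc(
        y=signal, sr=sr, n_mfcc=n_mfcc,
        n_fft=400, hop_length=160,  # 25 ms frames, 10 ms hop at 16 kHz
    )  # framing + FFT + Mel filter bank + log + DCT; shape (n_mfcc, n_frames)
    return mfcc.mean(axis=1)  # average over frames: one vector per utterance
```

Averaging over frames is one simple way to turn frame-level MFCCs into a per-utterance voiceprint vector that can later be compared by cosine similarity or Euclidean distance.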
As an alternative but non-limiting implementation, before extracting the first voiceprint feature from the first voice data, the method further comprises the steps of:
a first preprocessing operation is performed on the first speech data, the first preprocessing operation including at least one of noise reduction, gain, framing, windowing, pre-emphasis, endpoint detection, sample rate normalization.
Noise reduction on the first voice data removes background noise contained in it, making the first voice data clearer. For example, voice data recorded in a noisy environment may contain various environmental noises such as wind and machine sounds, and the noise reduction operation reduces the interference of these noises with the voice signal as much as possible. Noise reduction improves the quality and intelligibility of the voice data, reduces the influence of noise on the voice processing task, and improves recognition accuracy.
Gain adjustment of the first voice data adjusts the amplitude of its voice signal, either increasing or decreasing the signal strength. If the voice signal of the first voice data is weak overall, the gain operation can amplify it to a suitable amplitude range, ensuring that the energy of the voice signal lies in an appropriate range, avoiding information loss from a signal that is too weak or distortion from a signal that is too strong, and improving the performance of the voice processing system.
Framing divides the continuous voice signal of the first voice data into several shorter frames, typically tens of milliseconds each. Because the voice signal has short-time stationarity, its characteristics changing little over a short time, framing facilitates more detailed analysis and processing; for example, voice feature extraction is generally performed frame by frame, which improves the accuracy of feature extraction.
Windowing multiplies each frame of the voice signal of the first voice data by a window function. The window function smoothly tapers the voice signal at both ends of the frame toward zero, reducing the boundary effects introduced by framing. For example, the window function may be at least one of a Hamming window and a Hanning window. Windowing the first voice data reduces spectral leakage, makes spectral analysis more accurate, and improves the stability and reliability of the voice features.
Pre-emphasis boosts the energy of the high-frequency part of the voice signal of the first voice data. Because the high-frequency portion of a voice signal is relatively weak and subject to attenuation during transmission, pre-emphasis enhances the high-frequency information. Pre-emphasizing the first voice data highlights the high-frequency details in the voice signal, improving the clarity and intelligibility of the voice; in particular, phonemes with strong high-frequency components, such as fricatives, can be better recognized in tasks like speech recognition.
Endpoint detection determines the start and end points of the voice signal of the first voice data and distinguishes speech from silence, so that the silent segments before and after the speech can be removed and unnecessary data processing reduced. Endpoint detection improves the efficiency of voice processing, reduces the processing of invalid data, improves the accuracy of tasks such as speech recognition, and avoids interference of silent segments with the recognition result.
Sample-rate normalization converts voice signals with different sample rates to a uniform sample rate. Different recording devices or acquisition environments may produce voice signals with different sampling rates, while subsequent voice processing algorithms typically require a fixed sampling rate. Normalizing the sample rate of the first voice data allows the voice data to be processed and analyzed uniformly and avoids algorithm errors or performance degradation caused by inconsistent sampling rates.
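Three of these operations, pre-emphasis, framing, and windowing, are simple enough to sketch directly. The following numpy sketch uses illustrative parameter values and assumes the input signal is at least one frame long.

```python
# Sketch of pre-emphasis, framing, and windowing with numpy.
# frame_len/hop correspond to ~25 ms / ~10 ms at 16 kHz (illustrative).
import numpy as np

def pre_emphasis(x: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    # y[n] = x[n] - alpha * x[n-1]: boosts the weaker high-frequency part
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_and_window(x: np.ndarray, frame_len: int = 400,
                     hop: int = 160) -> np.ndarray:
    # Split the time-varying signal into overlapping short-time frames,
    # then taper each frame with a Hamming window to reduce boundary effects.
    n_frames = 1 + (len(x) - frame_len) // hop  # assumes len(x) >= frame_len
    frames = np.stack([x[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)
```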
S120, determining voiceprint blacklist information, wherein the voiceprint blacklist information comprises a plurality of second voiceprint features and semantic features associated with the second voiceprint features, the second voiceprint features are voiceprint features of second voice data generated when a second voice interaction task is executed, each second voice interaction task is a voice interaction task that was executed before the first voice interaction task, and the semantic features associated with each second voiceprint feature are used to indicate the appeal information conveyed through speech expression during execution of the second voice interaction task and the emotion features contained when the appeal information is conveyed through speech expression.
The voiceprint blacklist can be thought of as a database or collection of records storing specific information: a plurality of second voiceprint features and the semantic features associated with them. A second voiceprint feature included in the voiceprint blacklist information is a unique acoustic feature of voice data that has been identified as requiring restriction or special handling in the voice interaction scenario. The second voiceprint features in the voiceprint blacklist are used to identify, and to restrict or otherwise specially handle, voice data exhibiting those features during voice interaction.
A second voice interaction task may be any voice interaction task that was executed before the first voice interaction task, including voice interaction tasks executed in an interactive voice response phase and in the manual response phase after the interactive voice response phase ends. The second voiceprint feature is the voiceprint feature extracted from the voice data generated when the second voice interaction task was executed.
The appeal information conveyed through speech expression refers to the purpose the speaker wishes to achieve or the information the speech conveys, with emphasis on the subjective wishes and demands in the verbal expression. For example, the appeal information conveyed through speech may be asking about weather conditions, seeking a suggestion, expressing an emotion, issuing an instruction, and so on.
The emotion features contained when the appeal information is conveyed through speech expression reflect the urgency of the appeal and the degree to which the speaker expects it to be satisfied. These emotion features can be characterized by intonation, speech rate, volume, and tone; for example, they may be characterized as urgent, calm, angry, and so on.
The voiceprint blacklist information is established from previous voice interactions. When the first voice interaction task is executed, the voiceprint and semantic features generated by previous second voice interaction tasks can be consulted to judge whether the blacklist conditions are met; for example, certain combinations of voiceprint features and semantic features can serve as the basis for deciding whether to add a voiceprint to the blacklist, or for applying special handling to specific voiceprint and semantic conditions when the first voice interaction task is executed. A sketch of what one blacklist record might contain is given below.
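The following Python sketch pictures one record of the voiceprint blacklist information as a small data structure; the field names and types are illustrative assumptions, not the embodiment's actual schema.

```python
# Hypothetical shape of one voiceprint blacklist record (field names assumed).
from dataclasses import dataclass
import numpy as np

@dataclass
class BlacklistEntry:
    voiceprint: np.ndarray  # second voiceprint feature (e.g. an MFCC vector)
    appeal: str             # appeal information conveyed through speech
    emotion: str            # emotion feature, e.g. "angry", "urgent", "calm"
    category: str           # semantic feature category, e.g. "complaint"
    hit_count: int = 0      # times matched/accessed during query operations
```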
S130, determining, from a plurality of task processing resources according to the first voiceprint feature and the voiceprint blacklist information, a task processing resource matched with the first voice interaction task, wherein the task processing resource is used to assist in processing the first voice interaction task that continues to be executed after the interactive voice response phase ends.
The interactive voice response phase is the initial phase of execution of the first voice interaction task, in which the Interactive Voice Response (IVR) system operates, performing voice interaction through preset voice prompts. Sometimes the IVR system cannot meet the requirements of the first voice interaction task; in that case, a manual response phase can be entered, in which communication takes place directly with agent personnel, providing more flexible and personalized service, handling complex problems that the IVR system finds difficult, and effectively resolving the user's issue.
The manual response phase can be entered from the interactive voice response phase in several ways. The caller may find, while interacting with the IVR system, that its automated service cannot resolve the issue, and then choose to transfer to manual customer service. For some complicated business requirements or special cases, the IVR system may be unable to accurately understand or process the user's instructions, in which case the manual response phase is entered automatically. In some cases, the caller may be emotionally agitated or need emotional support that the IVR system cannot provide in a humanized way, and the manual response phase is likewise entered automatically.
The first voiceprint feature is the voiceprint information extracted from the first voice interaction task currently in progress, while the voiceprint blacklist information is a set of second voiceprint features typically associated with situations requiring special handling; for example, voiceprint features associated with past bad behavior or rule violations are recorded in the blacklist. By comparing the first voiceprint feature with each second voiceprint feature recorded in the voiceprint blacklist information, it can be judged whether the first voiceprint feature belongs to the voiceprint features recorded there.
Task processing resources are the various resources capable of processing voice interaction tasks, such as different processing algorithms, specific processing devices, and different agent personnel. Different task processing resources have different characteristics and capabilities, and differ in task processing performance and effect when handling the first voice interaction task. For example, some task processing resources may handle a particular type of voice interaction task better, while others may handle that type less effectively.
Each second voiceprint feature in the voiceprint blacklist information is associated with semantic features, which indicate the appeal information conveyed through speech expression during execution of the second voice interaction task and the emotion features contained when that appeal information was conveyed. When the first voiceprint feature hits a second voiceprint feature in the voiceprint blacklist information, the first voice interaction task may involve a special situation, so task processing resources cannot be selected at random when the manual response phase is entered; an appropriate task processing resource must be chosen. Specifically, the task processing resource matched with the first voice interaction task can be determined, from a plurality of task processing resources pre-associated with the voiceprint blacklist information, according to the semantic features associated with the second voiceprint feature hit by the first voiceprint feature.
The task processing resource matched with the first voice interaction task is used to assist in processing the first voice interaction task that continues to be executed after the interactive voice response phase ends. The interactive voice response phase is the voice interaction phase in which the Interactive Voice Response (IVR) system gives its response results. When the interactive voice response phase ends, the unfinished first voice interaction task continues into the manual response phase, where the matched task processing resource can continue to assist in processing it. A suitable task processing resource improves the efficiency and accuracy of task processing and better satisfies the voice interaction requirements.
According to the technical scheme, when the first voice interaction task is executed in the interactive voice response phase, the source of the first voice data can be accurately identified from the uniqueness of its voiceprint by determining the first voiceprint feature of the first voice data. The voiceprint blacklist information contains not only voiceprint features but also associated semantic features: the semantic features indicate the appeal information expressed during execution of a voice interaction task, and the emotion features reflect the emotional state when the appeal was expressed through language. This association makes it possible to find, from the voiceprint feature, the semantic features relevant to the first voice data, i.e., the appeal and emotion information reflected in previous voice interaction tasks from the same sound source. A more suitable resource can then be selected from the plurality of task processing resources according to those semantic features, to assist in processing the first voice interaction task that continues to be executed after the interactive voice response phase ends. This improves the utilization efficiency of task processing resources; by allocating them reasonably, the appeal raised during voice interaction can be resolved quickly and effectively, waiting time is shortened, and the interaction experience is improved.
On the basis of the above embodiment, optionally, the voiceprint blacklist information is generated in the following manner:
Second voice data is obtained, and semantic recognition is performed on it to obtain the semantic features it contains; these semantic features include the appeal information conveyed through speech expression during execution of the second voice interaction task and the emotion features contained when the appeal information was conveyed. When the semantic features contained in the second voice data hit a semantic feature category associated with the voiceprint blacklist information, a second voiceprint feature is extracted from the second voice data, and the semantic features contained in the second voice data are determined to be the semantic features associated with that second voiceprint feature, for use in generating the voiceprint blacklist information.
The voiceprint blacklist information is associated with the voiceprint features of voice data identified as requiring restriction or special handling in the voice interaction scenario. Although different voiceprint features may all fall within the range requiring restriction or special handling, the semantic features exhibited by their voice data will differ, and so will the appropriate task processing resources for voice interaction tasks involving them. An appropriate task processing resource therefore needs to be selected according to the semantic feature category contained in the voice data to which a voiceprint feature belongs.
The semantic feature categories associated with the voiceprint blacklist information are preset semantic feature categories of voice data identified as requiring restriction or special handling in the voice interaction scenario. When the semantic features contained in the second voice data hit a semantic feature category associated with the voiceprint blacklist information, the second voice data is identified as voice data requiring restriction or special handling, so its voiceprint feature needs to be saved. The next time the same voiceprint feature appears, a task processing resource screened from those associated with the voiceprint blacklist information can be used to handle the voice interaction task in the manual response phase, avoiding the poor processing performance that results from randomly selecting task processing resources in that phase.
When the semantic features contained in the second voice data hit a semantic feature category associated with the voiceprint blacklist information, voiceprint features are extracted from the second voice data, and the second voiceprint feature extracted from the second voice data is stored in association with the semantic features of the second voice data. Optionally, extracting the second voiceprint feature from the second voice data comprises extracting it using the Mel frequency cepstrum coefficient feature extraction method.
By recognizing and associating the semantic features of voice data, the voiceprint blacklist is not judged by voiceprint features alone: the appeal information conveyed through speech expression and the emotion features contained when it was conveyed are also considered, so the blacklist records semantic features of bad behavior or potential risk more accurately. When the voice interaction system detects that the semantic features of voice data hit a category associated with the blacklist, it extracts the voiceprint features and updates the blacklist, enabling the system to discover and guard against potential security threats in time. This generation flow is sketched below.
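The generation flow just described might look like the following sketch, which reuses BlacklistEntry and extract_mfcc_voiceprint from the earlier sketches; recognize_semantics stands in for the semantic recognition model and is a hypothetical helper, and the category set is an assumed example.

```python
# Assumed example set of semantic feature categories tied to the blacklist.
BLACKLIST_CATEGORIES = {"complaint", "harassment", "fraud"}

def maybe_add_to_blacklist(wav_path: str, blacklist: list) -> None:
    """Sketch: store a voiceprint only when the semantic features of the
    second voice data hit a blacklist-associated semantic feature category."""
    appeal, emotion, category = recognize_semantics(wav_path)  # hypothetical helper
    if category in BLACKLIST_CATEGORIES:  # semantic feature category hit
        blacklist.append(BlacklistEntry(
            voiceprint=extract_mfcc_voiceprint(wav_path),  # MFCC sketch above
            appeal=appeal, emotion=emotion, category=category,
        ))
```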
Referring to fig. 2, the voice processing unit includes a semantic understanding unit and a voiceprint feature extractor, and operates in two flows. In one flow, for the second voice data, the semantic understanding unit performs semantic recognition to obtain the semantic features contained in the second voice data, including the appeal information conveyed through speech expression during execution of the second voice interaction task and the emotion features contained when the appeal information was conveyed. When these semantic features hit a semantic feature category associated with the voiceprint blacklist information, a second voiceprint feature is extracted from the second voice data, the semantic features contained in the second voice data are determined to be the semantic features associated with that second voiceprint feature, and both are transmitted to the information management unit to generate the voiceprint blacklist information for management. In the other flow, the voiceprint feature extractor directly extracts the first voiceprint feature of the first voice data and likewise transmits it to the information management unit.
Optionally, the first voice data is recording data generated by the recording system of the telephone customer service center while the first voice interaction task is performed with the interactive voice response system in the interactive voice response phase; the second voice data is recording data generated by the recording system of the telephone customer service center while a second voice interaction task is performed with the interactive voice response system; and each second voice interaction task is a historical voice interaction task that was executed before the first voice interaction task.
As an optional but non-limiting implementation scheme, performing semantic recognition on the second voice data to obtain semantic features contained in the second voice data, including but not limited to the following steps:
Text data of the second voice data is determined, and the text data is input into a semantic recognition model for semantic recognition; the semantic recognition model is a multi-task neural network model for parsing and recognizing the semantic information expressed by a speech signal, and the semantic features contained in the second voice data are output by the semantic recognition model.
A multi-task neural network model is a neural network architecture that can handle several related or unrelated tasks simultaneously; it is applied in fields such as image recognition, segmentation, and natural language processing, for example performing image classification and object detection at the same time, or text classification and emotion analysis at the same time. Semantic recognition refers to the process of parsing and recognizing the semantic information expressed by a speech signal; through semantic recognition, information such as intent, emotion, and knowledge in the speech can be recognized and understood. The multi-task neural network model used as the semantic recognition model can be built on a recurrent neural network (RNN) framework.
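A minimal PyTorch sketch of such a multi-task model follows: a shared RNN encoder (a GRU here) feeding two heads, one for the appeal category and one for the emotion feature. The architecture and layer sizes are illustrative assumptions, not the embodiment's actual network.

```python
# Illustrative multi-task RNN: shared encoder, two classification heads.
import torch
import torch.nn as nn

class MultiTaskSemanticModel(nn.Module):
    def __init__(self, vocab_size: int, n_appeals: int, n_emotions: int,
                 embed_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden, batch_first=True)
        self.appeal_head = nn.Linear(hidden, n_appeals)    # appeal information
        self.emotion_head = nn.Linear(hidden, n_emotions)  # emotion feature

    def forward(self, token_ids: torch.Tensor):
        _, h = self.rnn(self.embed(token_ids))  # h: (1, batch, hidden)
        h = h.squeeze(0)                        # final hidden state per sample
        return self.appeal_head(h), self.emotion_head(h)
```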
As an alternative but non-limiting implementation, before inputting the text data of the second speech data into the semantic recognition model for semantic recognition, the method further comprises the following steps:
a second preprocessing operation is performed on text data of the second speech data, the second preprocessing operation including at least one of text cleansing and text segmentation.
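A minimal sketch of such a second preprocessing operation is shown below; the cleaning rules are illustrative, and for Chinese text a dedicated tokenizer (such as jieba) would replace the naive whitespace split.

```python
# Illustrative text cleansing + segmentation (whitespace split as a stand-in).
import re

def clean_and_segment(text: str) -> list:
    text = re.sub(r"[^\w\s]", " ", text)      # strip punctuation and symbols
    text = re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace
    return text.split(" ")
```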
As an alternative but non-limiting implementation, the semantic recognition model is generated in the following way:
Training sample data used by the semantic recognition training task is obtained, the training sample data carrying semantic feature labels. The training sample data is input into the semantic recognition model for semantic feature recognition prediction to obtain a semantic feature prediction result; a cross-entropy loss function is used to calculate the loss difference between the semantic feature prediction result and the semantic feature labels; and the gradients of the parameters in the semantic recognition model are back-propagated according to the loss difference to update the parameters of the semantic recognition model.
Referring to fig. 2, text data of third voice data is obtained; a data annotator of the model training unit labels the text data of the third voice data to obtain its semantic feature labels, and the text data of the third voice data together with its semantic feature labels forms the training sample data. After the semantic recognition model is built with a recurrent neural network (RNN) framework, the network trainer of the model training unit extracts sample data from the training sample data and inputs it into the semantic recognition model for forward propagation, computing and outputting a semantic feature prediction result; the cross-entropy loss function then calculates the loss difference between the output prediction result and the semantic feature label. The gradients of the model parameters are back-propagated according to this loss difference, the parameters of the semantic recognition model are updated, and the process is repeated until the semantic recognition model meets the requirements.
Referring to fig. 2, the model training unit includes the data annotator and the network trainer. The data annotator receives the text data of the third voice data maintained by the information management unit and annotates its semantic feature labels, for example labeling behaviors such as harassment, malicious complaints, and fraud that occurred while semantic interaction tasks were performed. The network trainer trains the multi-task neural network model used for semantic recognition, collecting text data of third voice data that falls within the set of semantic feature categories associated with the voiceprint blacklist information and preprocessing it for use as training sample data.
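One training step of this loop, sketched with PyTorch for the two-head model above; summing the losses of the two heads is one reasonable multi-task choice, assumed here rather than specified by the embodiment.

```python
# One optimization step: forward, summed cross-entropy, backprop, update.
import torch.nn as nn

def train_step(model, optimizer, token_ids, appeal_labels, emotion_labels):
    criterion = nn.CrossEntropyLoss()
    appeal_logits, emotion_logits = model(token_ids)
    loss = (criterion(appeal_logits, appeal_labels)
            + criterion(emotion_logits, emotion_labels))
    optimizer.zero_grad()
    loss.backward()   # back-propagate the loss difference through the model
    optimizer.step()  # update the semantic recognition model's parameters
    return loss.item()
```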
Fig. 3 is a schematic flowchart of another voice response processing method according to an embodiment of the present invention. On the basis of the technical solutions of the foregoing embodiments, this embodiment further optimizes the process of determining, from a plurality of task processing resources according to the first voiceprint feature and the voiceprint blacklist information, a task processing resource matched with the first voice interaction task; this embodiment may be combined with each of the alternatives in one or more of the embodiments above.
As shown in fig. 3, the voice response processing method according to the embodiment of the present invention may include the following procedures:
S310, determining a first voiceprint feature of first voice data, wherein the first voice data comprises voice data generated by executing a first voice interaction task in an interactive voice response phase.
S320, determining voiceprint blacklist information, wherein the voiceprint blacklist information comprises a plurality of second voiceprint features and semantic features associated with the second voiceprint features, the second voiceprint features are voiceprint features of second voice data generated when a second voice interaction task is executed, each second voice interaction task is a voice interaction task that was executed before the first voice interaction task, and the semantic features associated with each second voiceprint feature are used to indicate the appeal information conveyed through speech expression during execution of the second voice interaction task and the emotion features contained when the appeal information is conveyed through speech expression.
S330, under the condition that third voiceprint features matched with the first voiceprint features exist in the voiceprint blacklist information, determining semantic features associated with the third voiceprint features according to the voiceprint blacklist information, wherein the third voiceprint features are second voiceprint features which are screened from the voiceprint blacklist information and matched with the first voiceprint features.
The currently acquired first voiceprint feature is compared with the voiceprint features in the voiceprint blacklist information. If a third voiceprint feature matching the first voiceprint feature is found in the voiceprint blacklist information, it is screened out. The third voiceprint feature already exists in the voiceprint blacklist information: it is a second voiceprint feature recorded there whose feature similarity to the first voiceprint feature is greater than that of any other second voiceprint feature in the voiceprint blacklist information. The feature similarity between voiceprint features can be calculated using cosine similarity or Euclidean distance.
The voiceprint blacklist information contains not only second voiceprint features but also the semantic features associated with them. Once a matching third voiceprint feature is found, the semantic features associated with it are obtained from the voiceprint blacklist information. For example, the semantic features corresponding to the third voiceprint feature may include the appeal information conveyed through speech expression during execution of the second voice interaction task and the emotion features contained when the appeal information was conveyed. The second voice interaction task corresponding to the third voiceprint feature and the first voice interaction task come from the same sound source.
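A matching step of this kind, using cosine similarity over the blacklist records sketched earlier, might look as follows; the similarity threshold is an illustrative assumption.

```python
# Find the 'third voiceprint feature': the blacklist entry most similar
# to the first voiceprint feature, if it clears an assumed threshold.
import numpy as np

def find_third_voiceprint(first_vp: np.ndarray, blacklist: list,
                          threshold: float = 0.85):
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    best = max(blacklist, key=lambda e: cosine(first_vp, e.voiceprint),
               default=None)
    if best is not None and cosine(first_vp, best.voiceprint) >= threshold:
        best.hit_count += 1  # record the hit for later routing decisions
        return best
    return None              # no voiceprint in the blacklist matches
```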
S340, determining, from a plurality of task processing resources according to reference information of the third voiceprint feature, a task processing resource matched with the first voice interaction task, wherein the task processing resource is used to assist in processing the first voice interaction task that continues to be executed after the interactive voice response phase ends. The reference information of the third voiceprint feature comprises the semantic feature category to which the semantic features associated with the third voiceprint feature belong, and the hit count of the third voiceprint feature; the hit count indicates how many times the third voiceprint feature in the voiceprint blacklist information has been matched or accessed during query operations. The task processing resource matched with the first voice interaction task is associated with the reference information of the third voiceprint feature.
The reference information of the third voiceprint feature includes its semantic feature category and the number of times it was matched or accessed during previous query operations. From this reference information, the task processing resource matched with the first voice interaction task can be found among the alternative task processing resources (such as different agent personnel or different processing flows) pre-associated with the voiceprint blacklist information. For example, if the semantic feature category associated with the third voiceprint feature is "complaint" and the hit count is high (indicating that the speaker has complained multiple times), an agent more experienced in handling complaints is determined to be the matching task processing resource.
The task processing resource matched with the first voice interaction task is associated with the reference information of the third voiceprint feature: the determination of the task processing resource is based on that reference information. Different reference information may lead to different task processing resources being selected so as to process the first voice interaction task better. For example, for a user who is emotionally agitated and has complained multiple times, a task processing resource more adept at handling such situations can be selected, improving processing efficiency and user satisfaction.
Through voiceprint matching and semantic feature analysis, the user's behavior pattern and appeal characteristics can be understood more accurately, so the most appropriate task processing resource is matched to the first voice interaction task on the basis of this information. Determining task processing resources according to the reference information of the third voiceprint feature enables reasonable allocation of resources, prevents resources from being wasted on unmatched tasks, and improves resource utilization efficiency. The voiceprint blacklist information also helps identify potentially risky users, who can be managed and monitored more effectively by matching the corresponding task processing resources.
Referring to fig. 2, the information management unit is mainly configured to process and manage the semantic feature information, associated with the second voiceprint features, transmitted by the voice processing unit, and to perform route allocation of traffic resources according to the first voiceprint feature and the voiceprint blacklist information, so as to determine the task processing resource matched with the first voice interaction task from the plurality of task processing resources. The information management unit comprises a semantic memory, a semantic manager, a voiceprint memory, a voiceprint manager, a voiceprint comparator, and a route allocator.
As an alternative but non-limiting implementation, determining a task processing resource matched with the first voice interaction task from a plurality of task processing resources according to the reference information of the third voiceprint feature includes, but is not limited to, the following steps A1-A2:
Step A1, if it is determined from the reference information of the third voiceprint feature that the hit count of the third voiceprint feature is greater than a preset hit count, determining a task processing resource matched with the first voice interaction task from the plurality of task processing resources, wherein the task processing performance and/or task processing effect of the matched task processing resource, when assisting in processing the first voice interaction task, is greater than that of the remaining task processing resources other than the matched one.
Step A2, selecting, from the plurality of task processing resources according to the semantic feature category to which the semantic features associated with the third voiceprint feature belong in the reference information of the third voiceprint feature, a task processing resource matched with that semantic feature category, and determining it as the task processing resource matched with the first voice interaction task, wherein, in at least some of the plurality of task processing resources, different semantic feature categories are associated with different task processing resources.
The number of hits for the third voiceprint feature (i.e., the number of times the third voiceprint feature was matched or accessed in a previous query operation) is compared to a predetermined number of hits (which may be determined based on factors such as business needs and experience). If the hit number of the third voiceprint feature exceeds the preset hit number, it is indicated that the high-frequency record exists in the voiceprint blacklist information in the sound source corresponding to the third voiceprint feature, and a task processing resource matched with the first voice interaction task is selected from a plurality of alternative task processing resources (for example, the plurality of alternative task processing resources can be at least one of different seat personnel, different processing algorithms or different processing flows for processing customer requirements).
When the task processing resource matched with the first voice interaction task assists in processing the first voice interaction task that continues after the interactive voice response phase ends, its processing performance (such as at least one of processing speed, resource utilization, and response time) and/or processing effect (such as at least one of problem resolution rate and user satisfaction) is superior to that of the unselected task processing resources. For example, for high-frequency interactive users, the system may select more experienced agent personnel with stronger processing capability as the matched task processing resource, so as to better meet the user's needs.
The category to which the semantic feature associated with the third voiceprint feature belongs (such as a complaint category, consultation category, or business handling category) is analyzed; a semantic feature category is a classification of the voice interaction content and reflects the type of appeal in the voice interaction task. Based on this analysis, a task processing resource corresponding to the semantic feature category is selected from the plurality of task processing resources. For example, if the semantic feature category is the complaint category, an agent adept at handling complaints, or a dedicated process flow, may be selected as the matched task processing resource and determined as the resource matched with the first voice interaction task. Among the plurality of task processing resources, at least some may be associated with different semantic feature categories, which provides the basis for selecting an appropriate task processing resource by semantic feature category.
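As an illustrative sketch of steps A1-A2, the following Python fragment shows one possible routing decision. The preset hit count, the queue names, and the category-to-queue mapping are hypothetical placeholders chosen for illustration, not values prescribed by this scheme:

```python
# Hypothetical routing sketch for steps A1-A2; all names and values are illustrative.
PRESET_HITS = 3  # preset hit count, set according to business needs and experience

# A2: at least some resources are associated with distinct semantic feature categories.
CATEGORY_ROUTES = {
    "complaint": "complaint_specialist_queue",
    "consultation": "consultation_queue",
    "business_handling": "transaction_queue",
}

def select_task_resource(hit_count: int, semantic_category: str) -> str:
    """Pick a task processing resource from the third voiceprint feature's reference info."""
    # A1: a high-frequency blacklist record is routed to the strongest resource.
    if hit_count > PRESET_HITS:
        return "senior_agent_queue"
    # A2: otherwise route by the semantic feature category of the past appeals.
    return CATEGORY_ROUTES.get(semantic_category, "general_agent_queue")
```

For example, select_task_resource(5, "complaint") would route to the senior agent queue under step A1, while a low-hit complaint record would be routed to the complaint specialist queue under step A2.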
Referring to fig. 2, the semantic memory and the voiceprint memory are responsible for entering the voiceprint blacklist information transmitted by the voice processing unit into the system. The semantic manager and the voiceprint manager are used to check, confirm, and delete recorded information to ensure its accuracy and validity, to classify the voiceprint blacklist information according to the reference information of the second voiceprint features it contains, to perform statistical analysis on the voiceprint blacklist information so as to understand relevant conditions and trends, and to adjust the labeled data of the model training unit according to the analysis results. The voiceprint comparator matches extracted voiceprint features against existing voiceprint features to authenticate against the voiceprint blacklist information. The route allocator determines, from the plurality of task processing resources, the task processing resource matched with the first voice interaction task according to the first voiceprint feature and the voiceprint blacklist information.
By selecting task processing resources based on hit counts and semantic feature categories, more suitable resources can be allocated to the corresponding tasks. For tasks with different semantic feature categories, matched task processing resources can be selected to solve problems more efficiently, improving task processing speed and quality and thus overall task processing efficiency. Matching more appropriate task processing resources to users better satisfies their needs, and determining task processing resources according to the voiceprint feature reference information allows system resources to be allocated rationally: waste is avoided, and resources are concentrated on the tasks and users that need them most.
According to the technical scheme, when the first voice interaction task is executed in the interactive voice response phase, determining the first voiceprint feature of the first voice data allows the attribution of the first voice data to be identified accurately, because a voiceprint is unique to its speaker. The voiceprint blacklist information not only contains voiceprint features but also associates them with semantic features: the semantic feature indicates the appeal information conveyed during execution of a voice interaction task, and the emotion feature reflects the emotional state of the speaker when the appeal is expressed through language. This association makes it possible to match the voice interaction task with the appropriate semantic feature according to the voiceprint feature; that is, the semantic feature related to the first voice data is found from the voiceprint feature and the semantic and emotional information reflected in previous voice interaction tasks, and a more suitable resource is then selected from the plurality of task processing resources to assist in processing the first voice interaction task that continues after the interactive voice response phase ends. This improves the utilization efficiency of task processing resources, allows problems raised during voice interaction to be solved quickly and effectively through reasonable resource allocation, and shortens the user's waiting time, improving the interaction experience.
In addition, the invention authenticates blacklist clients using only voiceprint features; because a voiceprint is unique and difficult to forge, blacklist identification is more accurate and reliable. The invention uses a multitask neural network model for semantic understanding and can recognize multiple scenarios within the set of blacklist client behaviors. Meanwhile, the information management unit provides a feedback mechanism to the model training unit and can adjust the labeling strategy of the training data according to the characteristics of new data, which helps to improve adaptability and accuracy. The semantic recognition function processes a client's multiple historical records, obtaining the historical data directly from the call-recording quality inspection system commonly deployed in customer service centers, so the structure is simple and no additional recording device is needed. Moreover, no semantic recognition is performed on the current call: only the voiceprint is extracted, so a decision can be made as soon as the client begins to speak, which facilitates a fast routing response for the current call and reduces occasional misjudgments.
Fig. 4 is a schematic structural diagram of a voice response processing device according to an embodiment of the present invention. The embodiment is applicable to allocating appropriate task processing resources in time while the first voice interaction task is executed in the interactive voice response phase, so that voice response processing can proceed in the manual response phase after the interactive voice response phase ends. The voice response processing device may be implemented in the form of hardware and/or software and may be configured in any electronic device having a network communication function.
As shown in fig. 4, the voice response processing apparatus provided in the present embodiment may include the following:
A determining module 410, configured to determine a first voiceprint feature of first voice data, where the first voice data includes voice data generated by performing a first voice interaction task during an interactive voice response phase;
the determining module 410 is further configured to determine voiceprint blacklist information, where the voiceprint blacklist information includes a plurality of second voiceprint features and semantic features associated with the second voiceprint features, the second voiceprint features are voiceprint features of second voice data generated when second voice interaction tasks are performed, each second voice interaction task is a voice interaction task performed before the first voice interaction task, and the semantic feature associated with each second voiceprint feature indicates the appeal information conveyed through voice expression during execution of the second voice interaction task and the emotion features contained when the appeal information is conveyed through voice expression;
And the processing module 420 is configured to determine, according to the first voiceprint feature and the voiceprint blacklist information, a task processing resource that is matched with the first voice interaction task from a plurality of task processing resources, and to assist in processing the first voice interaction task that is continuously executed after the interactive voice response phase is ended.
On the basis of the foregoing embodiment, optionally, determining the first voiceprint feature of the first voice data includes:
acquiring first voice data generated when a first voice interaction task is executed;
and extracting the first voiceprint feature from the first voice data using Mel-frequency cepstral coefficient (MFCC) feature extraction.
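As a minimal sketch of this extraction step, assuming the librosa library and a mean-pooled, utterance-level embedding (the pooling choice is an assumption for illustration, not part of the claimed method):

```python
import librosa
import numpy as np

def extract_voiceprint(wav_path: str, n_mfcc: int = 20) -> np.ndarray:
    """Extract a fixed-length voiceprint vector from a recording via MFCCs."""
    signal, sr = librosa.load(wav_path, sr=16000)  # load and resample to 16 kHz mono
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, frames)
    return mfcc.mean(axis=1)  # mean-pool over time into one embedding vector
```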
On the basis of the above embodiment, optionally, the voiceprint blacklist information is generated in the following manner:
acquiring second voice data, and performing semantic recognition on the second voice data to obtain the semantic features contained in the second voice data, wherein the semantic features contained in the second voice data comprise the appeal information conveyed through voice expression during execution of the second voice interaction task and the emotion features contained when the appeal information is conveyed through voice expression;
extracting the second voiceprint feature from the second voice data when the semantic features contained in the second voice data hit a semantic feature category associated with voiceprint blacklist information;
and determining the semantic features contained in the second voice data as the semantic features associated with the second voiceprint feature, and generating the voiceprint blacklist information.
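A sketch of this enrollment flow, reusing the hypothetical extract_voiceprint helper above; the flagged category set and the dictionary layout of a blacklist entry are assumptions for illustration:

```python
BLACKLIST_CATEGORIES = {"complaint", "harassment"}  # hypothetical flagged categories

def maybe_enroll(wav_path: str, semantics: dict, blacklist: list) -> None:
    """Enroll a caller in the voiceprint blacklist if their semantics hit a flagged category."""
    if semantics["category"] in BLACKLIST_CATEGORIES:
        voiceprint = extract_voiceprint(wav_path)  # MFCC extraction, sketched earlier
        blacklist.append({
            "voiceprint": voiceprint,  # second voiceprint feature
            "semantics": semantics,    # associated appeal and emotion features
            "hits": 0,                 # hit count, updated by later query operations
        })
```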
On the basis of the foregoing embodiment, optionally, performing semantic recognition on the second voice data to obtain semantic features contained in the second voice data includes:
determining text data of second voice data, and inputting the text data of the second voice data into a semantic recognition model for semantic recognition, wherein the semantic recognition model is a multitasking neural network model for analyzing and recognizing semantic information expressed by voice signals;
and outputting semantic features contained in the second voice data through the semantic recognition model.
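One way to realize such a multitask neural network is a shared text encoder with separate classification heads for the appeal category and the emotion feature. The sketch below, including its GRU encoder and dimension choices, is an assumption for illustration rather than the exact model of this scheme:

```python
import torch
import torch.nn as nn

class MultiTaskSemanticModel(nn.Module):
    """Shared encoder with two heads: appeal (semantic) category and emotion."""
    def __init__(self, vocab_size: int, embed_dim: int = 128, hidden: int = 256,
                 n_appeal: int = 4, n_emotion: int = 3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden, batch_first=True)
        self.appeal_head = nn.Linear(hidden, n_appeal)    # e.g. complaint/consultation/...
        self.emotion_head = nn.Linear(hidden, n_emotion)  # e.g. calm/agitated/angry

    def forward(self, token_ids: torch.Tensor):
        emb = self.embed(token_ids)  # (batch, seq_len, embed_dim)
        _, h = self.encoder(emb)     # h: (1, batch, hidden), last hidden state
        h = h.squeeze(0)
        return self.appeal_head(h), self.emotion_head(h)
```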
On the basis of the above embodiment, optionally, the semantic recognition model is generated in the following manner:
acquiring training sample data used by a semantic recognition training task, wherein the training sample data comprises text data of third voice data and semantic feature tags of the text data of the third voice data;
Inputting the training sample data into the semantic recognition model to perform recognition prediction of semantic features to obtain a semantic feature prediction result;
Calculating the loss difference between the semantic feature prediction result and the semantic feature label using a cross-entropy loss function, and back-propagating the gradients of the semantic recognition model's parameters according to the loss difference to update those parameters.
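A minimal training step matching this description, assuming PyTorch and the two-head model sketched above; summing the two tasks' losses is an assumption, as the scheme does not specify how the multitask losses are combined:

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, token_ids, appeal_labels, emotion_labels):
    """One update: cross-entropy loss, back-propagation, parameter update."""
    criterion = nn.CrossEntropyLoss()
    appeal_logits, emotion_logits = model(token_ids)
    # Loss difference between the prediction results and the semantic feature labels.
    loss = criterion(appeal_logits, appeal_labels) + criterion(emotion_logits, emotion_labels)
    optimizer.zero_grad()
    loss.backward()   # back-propagate gradients through the model parameters
    optimizer.step()  # update the semantic recognition model's parameters
    return loss.item()
```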
On the basis of the foregoing embodiment, optionally, determining, according to the first voiceprint feature and the voiceprint blacklist information, a task processing resource that matches the first voice interaction task from a plurality of task processing resources includes:
Determining semantic features associated with third voiceprint features in the voiceprint blacklist information under the condition that the third voiceprint features matched with the first voiceprint features exist in the voiceprint blacklist information, wherein the third voiceprint features are second voiceprint features which are screened from the voiceprint blacklist information and matched with the first voiceprint features;
And determining, from the plurality of task processing resources, the task processing resource matched with the first voice interaction task according to the reference information of the third voiceprint feature, where the reference information of the third voiceprint feature includes the semantic feature category to which the semantic feature associated with the third voiceprint feature belongs and the hit count of the third voiceprint feature; the hit count indicates the number of times the third voiceprint feature has been matched or accessed during query operations on the voiceprint blacklist information, and the task processing resource matched with the first voice interaction task is associated with the reference information of the third voiceprint feature.
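Matching the first voiceprint feature against the blacklist could, for instance, use cosine similarity; the threshold value and the entry layout below are assumptions carried over from the earlier sketches:

```python
import numpy as np

def match_blacklist(first_vp: np.ndarray, blacklist: list, threshold: float = 0.85):
    """Return the best blacklist entry whose similarity exceeds the threshold, or None."""
    best, best_sim = None, threshold
    for entry in blacklist:
        v = entry["voiceprint"]
        sim = float(np.dot(first_vp, v) /
                    (np.linalg.norm(first_vp) * np.linalg.norm(v) + 1e-8))
        if sim >= best_sim:
            best, best_sim = entry, sim
    if best is not None:
        best["hits"] += 1  # record the hit; this feeds the reference information
    return best  # the matched entry corresponds to the third voiceprint feature
```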
On the basis of the foregoing embodiment, optionally, determining, according to the reference information of the third voiceprint feature, a task processing resource that is matched with the first voice interaction task from a plurality of task processing resources, includes:
If the reference information of the third voiceprint feature indicates that the hit count of the third voiceprint feature is greater than the preset hit count, determining, from the plurality of task processing resources, the task processing resource matched with the first voice interaction task, where the task processing performance and/or task processing effect of that resource, when it assists in processing the first voice interaction task, is better than the task processing performance and/or task processing effect of the remaining task processing resources;
according to the semantic feature category to which the semantic feature associated with the third voiceprint feature belongs in the reference information of the third voiceprint feature, selecting, from the plurality of task processing resources, a task processing resource matched with that semantic feature category and determining it as the task processing resource matched with the first voice interaction task, where, in at least some of the plurality of task processing resources, different semantic feature categories are associated with different task processing resources.
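Tying the earlier sketches together, a hypothetical end-to-end routing call might look as follows; every helper and name here is one of the illustrative placeholders introduced above:

```python
blacklist = []  # would be populated by maybe_enroll over past second voice data

first_vp = extract_voiceprint("current_call.wav")  # IVR-phase voiceprint
entry = match_blacklist(first_vp, blacklist)       # third voiceprint feature, if any
if entry is not None:
    queue = select_task_resource(entry["hits"], entry["semantics"]["category"])
else:
    queue = "general_agent_queue"                  # no blacklist record: default route
print(f"route current call to: {queue}")
```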
The voice response processing device provided by the embodiment of the invention can execute the voice response processing method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of executing the voice response processing method.
It should be noted that the units and modules included in the apparatus above are divided only according to functional logic; other divisions are possible as long as the corresponding functions can be implemented. The specific names of the functional units are used only to distinguish them from one another and do not limit the protection scope of the embodiments of the present invention.
Fig. 5 shows a schematic diagram of the structure of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches), and other similar computing devices. The components shown here, their connections and relationships, and their functions are meant to be exemplary only and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 5, the electronic device 10 includes at least one processor 11 and a memory communicatively connected to the at least one processor 11, such as a read-only memory (ROM) 12 and a random access memory (RAM) 13. The memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the ROM 12 or loaded from the storage unit 18 into the RAM 13. The RAM 13 may also store various programs and data required for the operation of the electronic device 10. The processor 11, the ROM 12, and the RAM 13 are connected to one another via a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.
Various components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16, such as a keyboard or mouse; an output unit 17, such as various types of displays and speakers; a storage unit 18, such as a magnetic disk or optical disk; and a communication unit 19, such as a network card, modem, or wireless communication transceiver. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the Internet, and/or various telecommunication networks.
The processor 11 may be any of various general and/or special purpose processing components having processing and computing capabilities. Some examples of the processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, Digital Signal Processors (DSPs), and any suitable processor, controller, or microcontroller. The processor 11 performs the respective methods and processes described above, such as the voice response processing method.
In particular, according to embodiments of the present invention, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present invention include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from the network through the communication unit 19, or installed from the storage unit 18, or installed from the ROM 12. The above-described functions defined in the method of the embodiment of the present invention are performed when the computer program is executed by the processor 11.
In some embodiments, the voice response processing method may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the voice response processing method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the voice response processing method in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special or general purpose programmable processor capable of receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user, for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a Local Area Network (LAN), a Wide Area Network (WAN), a blockchain network, and the Internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that overcomes the defects of high management difficulty and weak service scalability in traditional physical hosts and VPS services.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, and the present invention is not limited herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method of processing a voice response, the method comprising:
determining a first voiceprint feature of first voice data, the first voice data comprising voice data generated by performing a first voice interaction task in an interactive voice response phase;
determining voiceprint blacklist information, wherein the voiceprint blacklist information comprises a plurality of second voiceprint features and semantic features associated with the second voiceprint features, the second voiceprint features are voiceprint features of second voice data generated when a second voice interaction task is executed, each second voice interaction task is a voice interaction task executed before the first voice interaction task, and the semantic feature associated with each second voiceprint feature is used for indicating appeal information conveyed through voice expression during execution of the second voice interaction task and emotion features contained when the appeal information is conveyed through voice expression;
and determining, from a plurality of task processing resources, the task processing resource matched with the first voice interaction task according to the first voiceprint feature and the voiceprint blacklist information, wherein the task processing resource is used for assisting in processing the first voice interaction task that continues after the interactive voice response phase ends.
2. The method of claim 1, wherein determining the first voiceprint feature of the first voice data comprises:
acquiring first voice data generated when a first voice interaction task is executed;
and extracting the first voiceprint feature from the first voice data using Mel-frequency cepstral coefficient (MFCC) feature extraction.
3. The method of claim 1, wherein the voiceprint blacklist information is generated by:
acquiring second voice data, and performing semantic recognition on the second voice data to obtain the semantic features contained in the second voice data, wherein the semantic features contained in the second voice data comprise the appeal information conveyed through voice expression during execution of the second voice interaction task and the emotion features contained when the appeal information is conveyed through voice expression;
extracting the second voiceprint feature from the second voice data when the semantic features contained in the second voice data hit a semantic feature category associated with voiceprint blacklist information;
and determining the semantic features contained in the second voice data as the semantic features associated with the second voiceprint feature, and generating the voiceprint blacklist information.
4. A method according to claim 3, wherein performing semantic recognition on the second speech data to obtain semantic features contained in the second speech data comprises:
determining text data of second voice data, and inputting the text data of the second voice data into a semantic recognition model for semantic recognition, wherein the semantic recognition model is a multitasking neural network model for analyzing and recognizing semantic information expressed by voice signals;
and outputting semantic features contained in the second voice data through the semantic recognition model.
5. The method of claim 4, wherein the semantic recognition model is generated by:
acquiring training sample data used by a semantic recognition training task, wherein the training sample data comprises text data of third voice data and semantic feature tags of the text data of the third voice data;
Inputting the training sample data into the semantic recognition model to perform recognition prediction of semantic features to obtain a semantic feature prediction result;
calculating the loss difference between the semantic feature prediction result and the semantic feature label using a cross-entropy loss function, and back-propagating the gradients of the semantic recognition model's parameters according to the loss difference to update those parameters.
6. The method of claim 1, wherein determining, from a plurality of task processing resources, the task processing resource matched with the first voice interaction task according to the first voiceprint feature and the voiceprint blacklist information comprises:
Determining semantic features associated with third voiceprint features in the voiceprint blacklist information under the condition that the third voiceprint features matched with the first voiceprint features exist in the voiceprint blacklist information, wherein the third voiceprint features are second voiceprint features which are screened from the voiceprint blacklist information and matched with the first voiceprint features;
and determining, from the plurality of task processing resources, the task processing resource matched with the first voice interaction task according to the reference information of the third voiceprint feature, wherein the reference information of the third voiceprint feature comprises the semantic feature category to which the semantic feature associated with the third voiceprint feature belongs and the hit count of the third voiceprint feature, the hit count indicating the number of times the third voiceprint feature has been matched or accessed during query operations on the voiceprint blacklist information, and the task processing resource matched with the first voice interaction task being associated with the reference information of the third voiceprint feature.
7. The method of claim 6, wherein determining, from the plurality of task processing resources, the task processing resource matched with the first voice interaction task according to the reference information of the third voiceprint feature comprises:
if the reference information of the third voiceprint feature indicates that the hit count of the third voiceprint feature is greater than a preset hit count, determining, from the plurality of task processing resources, the task processing resource matched with the first voice interaction task, wherein the task processing performance and/or task processing effect of that resource, when it assists in processing the first voice interaction task, is better than the task processing performance and/or task processing effect of the remaining task processing resources;
according to the semantic feature category to which the semantic feature associated with the third voiceprint feature belongs in the reference information of the third voiceprint feature, selecting, from the plurality of task processing resources, a task processing resource matched with that semantic feature category and determining it as the task processing resource matched with the first voice interaction task, wherein, in at least some of the plurality of task processing resources, different semantic feature categories are associated with different task processing resources.
8. A voice response processing apparatus, the apparatus comprising:
a determining module, configured to determine a first voiceprint feature of first voice data, wherein the first voice data comprises voice data generated by executing a first voice interaction task in an interactive voice response phase;
the determining module being further configured to determine voiceprint blacklist information, wherein the voiceprint blacklist information comprises a plurality of second voiceprint features and semantic features associated with the second voiceprint features, the second voiceprint features are voiceprint features of second voice data generated when a second voice interaction task is executed, each second voice interaction task is a voice interaction task executed before the first voice interaction task, and the semantic feature associated with each second voiceprint feature is used for indicating appeal information conveyed through voice expression during execution of the second voice interaction task and emotion features contained when the appeal information is conveyed through voice expression;
and a processing module, configured to determine, from a plurality of task processing resources, the task processing resource matched with the first voice interaction task according to the first voiceprint feature and the voiceprint blacklist information, and to assist in processing the first voice interaction task that continues after the interactive voice response phase ends.
9. An electronic device, the electronic device comprising:
at least one processor, and
A memory communicatively coupled to the at least one processor, wherein,
The memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the voice response processing method of any one of claims 1-7.
10. A computer readable storage medium storing computer instructions for causing a processor to perform the method of processing a voice response according to any one of claims 1 to 7.