CN114255786A - Interruption detection method and device for voice broadcast, storage medium and electronic equipment
- Publication number
- CN114255786A (application CN202111616939.7A)
- Authority
- CN
- China
- Prior art keywords
- voice
- speaker
- target person
- segment
- similarity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for particular use, for comparison or discrimination
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L17/06—Speaker identification or verification techniques; Decision making techniques; Pattern matching strategies
Abstract
The invention discloses an interruption detection method and device for voice broadcast, a storage medium, and an electronic device. The method includes the following steps: during broadcasting by the intelligent customer service, detecting speaker voice segments in the audio data by adopting a voiceprint recognition algorithm; performing speaker feature extraction on the speaker voice segments to obtain speaker voice features; performing similarity matching between the speaker voice features and the target person voice features; and determining whether to interrupt the broadcasting process according to the comparison result of the similarity between the speaker voice features and the target person voice features with a set threshold. The invention solves the technical problems of a high false-interruption rate and poor user experience caused by voice interference from unrelated speakers during voice broadcasting.
Description
Technical Field
The invention relates to the technical field of intelligent voice, and in particular to an interruption detection method and device for voice broadcast, a storage medium, and an electronic device.
Background
With the development of speech technology, intelligent voice applications have become increasingly widespread in production and daily life; at the same time, the complexity of real-world use poses serious challenges for speech technology.
The conventional interruption (barge-in) method of intelligent customer-service systems is based on voice activity detection (VAD): during voice broadcasting, the broadcast is interrupted whenever valid speech is detected. If, while the user is on the call, background speech is loud or there is interference from unrelated speakers, the intelligent voice broadcast is easily interrupted by mistake, resulting in a high false-interruption rate and poor user experience.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiments of the invention provide an interruption detection method and device for voice broadcast, a storage medium, and an electronic device, so as to at least solve the technical problems of a high false-interruption rate and poor user experience caused by voice interference from unrelated speakers during voice broadcasting.
According to one aspect of the embodiments of the invention, a method for detecting interruption of a voice broadcast is provided, including: during broadcasting by the intelligent customer service, detecting speaker voice segments in the audio data by adopting a voiceprint recognition algorithm; performing speaker feature extraction on the speaker voice segments to obtain speaker voice features; performing similarity matching between the speaker voice features and the target person voice features; and determining whether to interrupt the broadcasting process according to the comparison result of the similarity between the speaker voice features and the target person voice features with a set threshold.
Optionally, determining whether to interrupt the broadcasting process according to the comparison result of the similarity between the speaker voice features and the target person voice features with a set threshold includes: if the similarity is higher than or equal to the set threshold, judging the speaker voice segment to be a target person voice segment and interrupting the broadcasting process; if the similarity is lower than the set threshold, judging the speaker voice segment not to be a target person voice segment and continuing the broadcasting process.
Optionally, before the broadcasting by the intelligent customer service, the method further includes: at the start of the call, performing voice segment extraction on the voice data of the target person who has passed identity verification, to obtain target person voice segments; and performing speaker feature extraction on the target person voice segments accumulated to a specific duration, to obtain the target person voice features.
Optionally, performing voice segment extraction on the voice data of the target person to obtain the target person voice segments includes: performing probability calculation on the voice data by adopting a deep neural network model, to obtain a probability sequence indicating whether the voice data is speech or non-speech; performing voice segment judgment on the probability sequence by adopting a Viterbi decoding algorithm, to obtain the optimal state of the target person voice segment at each moment and generate a state sequence for start and end point judgment; and performing valid sound segment detection on the state sequence by adopting a start and end point judgment algorithm, to obtain the target person voice segments.
Optionally, performing valid sound segment detection on the state sequence by adopting the start and end point judgment algorithm to obtain the target person voice segments includes: if, during valid sound segment detection on the state sequence, the consecutive speech frames following a given frame exceed a set threshold, determining that frame to be the start point of a valid sound segment; if the consecutive non-speech frames following a given frame exceed a set threshold, determining that frame to be the tail point of the valid sound segment; and determining the target person voice segment according to the start point and the tail point of the valid sound segment.
Optionally, performing voice segment extraction on the voice data of the target person to obtain the target person voice segments includes: performing speaker clustering on the obtained multiple speaker voice segments by adopting a speaker clustering algorithm, to obtain a clustering result; determining the speaker with the largest number of voice segments in the clustering result as the target person; and extracting that speaker's voice segments to obtain the target person voice segments.
According to another aspect of the embodiments of the present invention, a voice broadcast interruption detection apparatus is also provided, including: a detection module, configured to detect speaker voice segments in the audio data by adopting a voiceprint recognition algorithm during broadcasting by the intelligent customer service; an extraction module, configured to perform speaker feature extraction on the speaker voice segments to obtain speaker voice features; a matching module, configured to perform similarity matching between the speaker voice features and the target person voice features; and a determining module, configured to determine whether to interrupt the broadcasting process according to the comparison result of the similarity between the speaker voice features and the target person voice features with a set threshold.
According to another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium storing a plurality of instructions, the instructions being adapted to be loaded by a processor and to execute any one of the above interruption detection methods for voice broadcasting.
According to another aspect of the embodiments of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored, and a processor configured to execute the computer program to perform any one of the above interruption detection methods for voice broadcast.
According to another aspect of the embodiments of the present invention, there is also provided a processor, where the processor is configured to execute a program, where the program executes any one of the above interruption detection methods for voice broadcast.
In the embodiments of the invention, interruption detection for voice broadcast is performed as follows: during broadcasting by the intelligent customer service, speaker voice segments in the audio data are detected by adopting a voiceprint recognition algorithm; speaker feature extraction is performed on the speaker voice segments to obtain speaker voice features; similarity matching is performed between the speaker voice features and the target person voice features; and whether to interrupt the broadcasting process is determined according to the comparison result of this similarity with a set threshold. By combining voice activity detection with voiceprint recognition to decide whether to interrupt the broadcasting process, interference from unrelated speakers is masked, which achieves the technical effects of reducing the false-interruption rate of voice broadcasting and improving its flexibility and the user experience, thereby solving the technical problems of a high false-interruption rate and poor user experience caused by voice interference from unrelated speakers during voice broadcasting.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a schematic diagram of residual blocks in a residual convolutional neural network model according to the prior art;
fig. 2 is a flowchart of a method for interruption detection of a voice broadcast according to an embodiment of the present invention;
fig. 3 is a flowchart of an alternative interruption detection method for voice broadcast according to an embodiment of the present invention;
fig. 4 is a flowchart of another alternative interruption detection method for voice broadcast according to an embodiment of the present invention;
FIG. 5 is a flow diagram of an alternative voice endpoint detection method according to an embodiment of the present invention;
fig. 6 is a flowchart of another alternative interruption detection method for voice broadcast according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an interruption detection device for voice broadcast according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
First, to facilitate understanding of the embodiments of the present invention, some of the terms used in the present invention are explained as follows:
Voiceprint recognition: one of the biometric identification techniques, also called speaker recognition, is a technique for identifying the speaker's identity from the voice.
Residual convolutional neural network (Residual CNN): a convolutional neural network (CNN) formed by stacking multiple residual networks (ResNet), used to address the degradation problem caused by increasing the depth of a CNN. Its structure is similar to that of a standard multi-layer convolutional neural network, except that it is composed of stacked residual blocks (Residual Blocks), each of which connects the lower-layer output directly to the higher-layer input, as shown in Fig. 1. The output of a residual block can be written as h = F(x, W_i) + x, where x denotes the input from the previous layer, h denotes the output of the residual block, and F(·) is the residual function representing the learned residual.
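As an illustration of the residual connection described above, the following is a minimal sketch of a residual block in PyTorch. The two-convolution structure, channel count, kernel size, and activation are illustrative assumptions; the patent does not specify the internal layers of its Residual CNN.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Implements h = F(x, W_i) + x: the block output adds the learned residual F to the input x."""

    def __init__(self, channels: int):
        super().__init__()
        # F(x, W_i): two 1-D convolutions with a non-linearity in between (assumed structure)
        self.residual = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
        )
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Skip connection: the lower-layer output x is added directly to the higher-layer input
        return self.relu(self.residual(x) + x)
```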
Example 1
In accordance with an embodiment of the present invention, there is provided an embodiment of a method for interruption detection of a voice announcement, it should be noted that the steps illustrated in the flowchart of the accompanying drawings may be performed in a computer system such as a set of computer-executable instructions, and that while a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different than here.
Fig. 2 is a flowchart of an interruption detection method for a voice broadcast according to an embodiment of the present invention, and as shown in fig. 2, the method includes the following steps:
step S102, in the broadcasting process of the intelligent customer service, detecting a voice segment of a speaker in audio data by adopting a voiceprint recognition algorithm;
step S104, carrying out speaker characteristic extraction on the speaker voice segment to obtain the speaker voice characteristic;
step S106, carrying out similarity matching on the voice characteristics of the speaker and the voice characteristics of the target person;
and step S108, determining whether to interrupt the broadcasting process according to the comparison result of the similarity between the voice characteristics of the speaker and the voice characteristics of the target person and a set threshold value.
Optionally, the output of the last layer of a residual convolutional neural network (Residual CNN) is used to extract the speaker voice features from the speaker voice segment, where the speaker voice features may be, but are not limited to, speaker spectral features.
Optionally, the target person voice features are obtained by performing voice feature extraction and training on the voice data of the target person who passes identity verification at the start of the call.
Optionally, the voice characteristics of the target person are obtained by combining voice endpoint detection (VAD) and a speaker clustering algorithm.
Optionally, similarity matching between the speaker voice features and the target person voice features may be, but is not limited to being, performed with the cosine similarity measure, whose expression is cos(x_i, x_j) = (x_i · x_j) / (||x_i|| · ||x_j||), where x_i denotes the speaker voice feature vector, x_j denotes the target person voice feature vector, and the calculated result cos(x_i, x_j) denotes the similarity between the speaker voice features and the target person voice features.
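The cosine score above can be computed directly once both features are fixed-length vectors. The following is a small sketch in NumPy; the example vectors are placeholders, and the feature extractor that would produce them is assumed rather than taken from the patent.

```python
import numpy as np

def cosine_similarity(x_i: np.ndarray, x_j: np.ndarray) -> float:
    """cos(x_i, x_j) = (x_i . x_j) / (||x_i|| * ||x_j||)."""
    denom = np.linalg.norm(x_i) * np.linalg.norm(x_j)
    if denom == 0.0:
        return 0.0  # degenerate case: at least one all-zero feature vector
    return float(np.dot(x_i, x_j) / denom)

# Example: score a detected speaker embedding against the registered target embedding.
speaker_vec = np.array([0.2, 0.9, -0.1])   # placeholder speaker voice feature
target_vec = np.array([0.25, 0.8, 0.0])    # placeholder target person voice feature
print(cosine_similarity(speaker_vec, target_vec))
```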
Optionally, if the similarity is higher than or equal to the set threshold, the speaker voice segment is judged to be a target person voice segment and the broadcasting process is interrupted; if the similarity is lower than the set threshold, the speaker voice segment is judged not to be a target person voice segment and the broadcasting process is continued.
In the embodiments of the invention, interruption detection for voice broadcast is performed as follows: during broadcasting by the intelligent customer service, speaker voice segments in the audio data are detected by adopting a voiceprint recognition algorithm; speaker feature extraction is performed on the speaker voice segments to obtain speaker voice features; similarity matching is performed between the speaker voice features and the target person voice features; and whether to interrupt the broadcasting process is determined according to the comparison result of this similarity with a set threshold. By combining voice activity detection with voiceprint recognition to decide whether to interrupt the broadcasting process, interference from unrelated speakers is masked, which achieves the technical effects of reducing the false-interruption rate of voice broadcasting and improving its flexibility and the user experience, thereby solving the technical problems of a high false-interruption rate and poor user experience caused by voice interference from unrelated speakers during voice broadcasting.
It should be noted that, with the conventional interruption detection method for voice broadcast, when the user's environment is noisy and background speech is loud, the broadcasting process of the intelligent customer service is easily interrupted by someone other than the user, resulting in poor user experience. The embodiments of the invention combine voice activity detection with a voiceprint recognition algorithm to extract and record speaker information from the user's voice, and mask unrelated speakers by comparing this information with the target speaker's information, thereby reducing the system's false-interruption rate and improving the user experience.
As an optional embodiment, Fig. 3 is a flowchart of an optional interruption detection method for voice broadcast according to an embodiment of the present invention. As shown in Fig. 3, the method mainly comprises two stages, a target person registration stage and a voice broadcast stage, and specifically includes the following steps: before interruption detection of the voice broadcast, the target person voice segments are determined by combining voice endpoint detection (VAD) with a speaker clustering algorithm, and voiceprint recognition is performed on them to obtain the target person voice features; in the voice broadcast stage, voice endpoint detection (VAD) is used to obtain the speaker voice segments, and voiceprint recognition is performed on them to obtain the speaker voice features; the target person voice features and the speaker voice features are then scored for similarity to obtain a similarity value; if the similarity value is greater than or equal to the threshold, the broadcast is interrupted; otherwise, the broadcast continues.
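A schematic sketch of these two stages is shown below. The helper functions detect_speech_segments and extract_speaker_embedding stand in for the VAD/clustering front end and the Residual CNN feature extractor, and the 0.7 threshold is an arbitrary illustrative value; none of these names or numbers come from the patent.

```python
from typing import Callable, List
import numpy as np

def register_target(audio: np.ndarray,
                    detect_speech_segments: Callable[[np.ndarray], List[np.ndarray]],
                    extract_speaker_embedding: Callable[[np.ndarray], np.ndarray]) -> np.ndarray:
    """Registration stage: pool embeddings of the target person's accumulated voice segments."""
    segments = detect_speech_segments(audio)                  # VAD + speaker clustering (assumed)
    embeddings = [extract_speaker_embedding(s) for s in segments]
    return np.mean(embeddings, axis=0)                        # registered target voiceprint

def should_interrupt(segment: np.ndarray,
                     target_embedding: np.ndarray,
                     extract_speaker_embedding: Callable[[np.ndarray], np.ndarray],
                     threshold: float = 0.7) -> bool:
    """Broadcast stage: interrupt only when the detected speaker matches the registered target."""
    emb = extract_speaker_embedding(segment)
    score = float(np.dot(emb, target_embedding) /
                  (np.linalg.norm(emb) * np.linalg.norm(target_embedding)))
    return score >= threshold                                 # above threshold -> barge in
```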
In an optional embodiment, determining whether to interrupt the broadcasting process according to the comparison result of the similarity between the speaker voice features and the target person voice features with a set threshold includes:
Step S202: if the similarity is higher than or equal to the set threshold, judging the speaker voice segment to be a target person voice segment and interrupting the broadcasting process;
Step S204: if the similarity is lower than the set threshold, judging the speaker voice segment not to be a target person voice segment and continuing the broadcasting process.
Optionally, if the similarity is higher than or equal to the set threshold, the speaker is judged to be the target speaker, i.e. the speaker voice segment is a target person voice segment, and the broadcasting process is interrupted; if the similarity is lower than the set threshold, the speaker is judged not to be the target speaker, i.e. the speaker voice segment is not a target person voice segment, and the broadcasting process is continued. This avoids false interruptions of the voice broadcast caused by voice interference from unrelated persons, further improving the user experience.
In an optional embodiment, before the broadcasting by the intelligent customer service, the method further includes:
Step S302: at the start of the call, performing voice segment extraction on the voice data of the target person who has passed identity verification, to obtain target person voice segments;
Step S304: performing speaker feature extraction on the target person voice segments accumulated to a specific duration, to obtain the target person voice features.
Optionally, the voice segment of the target person is obtained by combining voice endpoint detection (VAD) and a speaker clustering algorithm.
Optionally, the target person voice features may be, but are not limited to, the target person's spectral features.
As an alternative embodiment, Fig. 4 is a flowchart of another optional interruption detection method for voice broadcast according to an embodiment of the present invention. As shown in Fig. 4, performing voice segment extraction on the voice data of the target person to obtain the target person voice segments includes:
Step S402: performing probability calculation on the voice data by adopting a deep neural network model, to obtain a probability sequence indicating whether the voice data is speech or non-speech;
Step S404: performing voice segment judgment on the probability sequence by adopting a Viterbi decoding algorithm, to obtain the optimal state of the target person voice segment at each moment and generate a state sequence for start and end point judgment;
Step S406: performing valid sound segment detection on the state sequence by adopting a start and end point judgment algorithm, to obtain the target person voice segments.
Optionally, the deep neural network model may be, but is not limited to, a time-delay neural network model.
Optionally, the target person's spectral features (target person voice features) corresponding to the voice data are used as the input of the deep neural network model, and the probabilities that the voice data is speech and non-speech are calculated, yielding a probability sequence indicating whether the voice data is speech or non-speech.
Optionally, the start and end point judgment algorithm is used to determine the start point and tail point of a valid sound segment, and the target person voice segment is determined according to that start point and tail point.
As an alternative embodiment, Fig. 5 is a flowchart of an optional voice endpoint detection method according to an embodiment of the present invention. As shown in Fig. 5, the method specifically includes the following steps: the target person's spectral features (target person voice features) corresponding to the voice signal (i.e. the voice data) are used as the input of the deep neural network model, and the deep neural network in voice endpoint detection (VAD) performs the probability calculation on the voice data; a Viterbi decoding algorithm then performs voice segment judgment and start point detection on the output probability sequence; it is judged whether the continuous speech length is greater than the minimum speech length; if not, voice segment judgment and start point detection continue on the output probability sequence; if so, the start point of the voice segment is determined, and it is further judged whether the continuous speech length is greater than the maximum speech length; if the continuous speech length is greater than the maximum speech length, the tail point of the voice segment is determined and the whole process ends; otherwise, the initialization process of start point detection is started, and the Viterbi decoding algorithm is applied again for voice segment judgment and start point detection.
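As a concrete illustration of smoothing the network's frame-level posteriors into a state sequence, the following sketch runs a two-state (speech / non-speech) Viterbi pass over the probability sequence. The self-transition probability and the uniform initial prior are assumptions for illustration; the patent does not give these values.

```python
import numpy as np

def viterbi_speech_states(speech_prob: np.ndarray, stay: float = 0.95) -> np.ndarray:
    """Smooth per-frame speech posteriors into a 0/1 state sequence (1 = speech).

    speech_prob: P(speech | frame) from the deep neural network, shape (T,).
    stay: assumed self-transition probability; a high value discourages rapid state flips.
    """
    T = len(speech_prob)
    emit = np.stack([1.0 - speech_prob, speech_prob], axis=1)        # (T, 2): non-speech, speech
    log_emit = np.log(np.clip(emit, 1e-10, None))
    log_trans = np.log(np.array([[stay, 1.0 - stay],
                                 [1.0 - stay, stay]]))

    score = np.zeros((T, 2))
    back = np.zeros((T, 2), dtype=int)
    score[0] = log_emit[0]                                           # uniform prior assumed
    for t in range(1, T):
        cand = score[t - 1][:, None] + log_trans                     # [prev_state, cur_state]
        back[t] = np.argmax(cand, axis=0)
        score[t] = np.max(cand, axis=0) + log_emit[t]

    states = np.zeros(T, dtype=int)
    states[-1] = int(np.argmax(score[-1]))
    for t in range(T - 2, -1, -1):                                   # backtrace the best path
        states[t] = back[t + 1, states[t + 1]]
    return states
```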
In an optional embodiment, performing valid sound segment detection on the state sequence by adopting the start and end point judgment algorithm to obtain the target person voice segments includes:
Step S502: if, during valid sound segment detection on the state sequence, the consecutive speech frames following a given frame exceed a set threshold, determining that frame to be the start point of a valid sound segment;
Step S504: if, during valid sound segment detection on the state sequence, the consecutive non-speech frames following a given frame exceed a set threshold, determining that frame to be the tail point of the valid sound segment;
Step S506: determining the target person voice segment according to the start point and the tail point of the valid sound segment.
It should be noted that the start and end point judgment algorithm determines valid sound segments from the state sequence: when the consecutive speech frames following a frame exceed a set threshold, that frame is judged to be the start point of a valid sound segment; when the consecutive non-speech frames following a frame exceed a set threshold, that frame is judged to be the tail point of the valid sound segment; and the speech between the start point and the tail point of a valid sound segment is the target person voice segment.
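The start/end decision described above can be sketched as a simple hangover scheme over the Viterbi state sequence. The frame-count thresholds below are placeholders; the patent only says that they are set thresholds.

```python
import numpy as np

def endpoint_segments(states: np.ndarray,
                      start_hangover: int = 10,
                      end_hangover: int = 30):
    """Turn a 0/1 frame state sequence into a list of (start_frame, end_frame) segments.

    A frame is taken as a segment start once `start_hangover` consecutive speech frames
    follow it, and as a segment end once `end_hangover` consecutive non-speech frames
    follow it (both counts are assumed values, not taken from the patent).
    """
    segments, start, run = [], None, 0
    for t, s in enumerate(states):
        if start is None:                            # looking for a start point
            run = run + 1 if s == 1 else 0
            if run >= start_hangover:
                start = t - start_hangover + 1       # first frame of the speech run
                run = 0
        else:                                        # looking for a tail point
            run = run + 1 if s == 0 else 0
            if run >= end_hangover:
                segments.append((start, t - end_hangover + 1))
                start, run = None, 0
    if start is not None:                            # speech continued to the end of the audio
        segments.append((start, len(states)))
    return segments
```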
As an alternative embodiment, Fig. 6 is a flowchart of another optional interruption detection method for voice broadcast according to an embodiment of the present invention. As shown in Fig. 6, performing voice segment extraction on the voice data of the target person to obtain the target person voice segments includes:
Step S602: performing speaker clustering on the obtained multiple speaker voice segments by adopting a speaker clustering algorithm, to obtain a clustering result;
Step S604: determining the speaker with the largest number of voice segments in the clustering result as the target person;
Step S606: extracting the voice segments of that target person to obtain the target person voice segments.
Alternatively, the speaker clustering algorithm may be, but is not limited to, hierarchical clustering and K-Means clustering (K-Means).
Optionally, performing speaker clustering on the obtained multiple speaker voice segments by adopting a speaker clustering algorithm to obtain a clustering result includes: first, selecting the longer voice segments according to segment duration and performing hierarchical clustering on them to determine the cluster centers; then, after one pass of bottom-up agglomerative hierarchical clustering, refining the cluster centers with the K-means clustering algorithm to obtain the clustering result.
It should be noted that speaker features extracted from longer voice segments are better and more accurate, and that the K-means clustering algorithm is sensitive to the choice of initial cluster centers; the selection of the center points has an important influence on clustering performance.
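A minimal sketch of this two-step clustering with scikit-learn follows: agglomerative (bottom-up) clustering on the longer segments provides initial centers, which then initialize K-means over all segment embeddings. The number of speakers, the duration cutoff, and the use of scikit-learn are assumptions; the patent does not fix them.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

def cluster_speakers(embeddings: np.ndarray,
                     durations: np.ndarray,
                     n_speakers: int = 2,
                     min_duration: float = 2.0) -> np.ndarray:
    """Return one speaker label per segment embedding (embeddings: shape (N, D))."""
    long_idx = np.where(durations >= min_duration)[0]
    if len(long_idx) < n_speakers:                   # fall back when too few long segments exist
        long_idx = np.arange(len(embeddings))

    # Step 1: bottom-up hierarchical clustering on the longer (more reliable) segments
    agglo = AgglomerativeClustering(n_clusters=n_speakers)
    long_labels = agglo.fit_predict(embeddings[long_idx])
    init_centers = np.stack([embeddings[long_idx][long_labels == k].mean(axis=0)
                             for k in range(n_speakers)])

    # Step 2: K-means over all segments, initialized with the centers from step 1
    kmeans = KMeans(n_clusters=n_speakers, init=init_centers, n_init=1)
    labels = kmeans.fit_predict(embeddings)
    return labels

# The target person is then the cluster with the most segments, e.g.:
# target_label = np.bincount(cluster_speakers(embs, durs)).argmax()
```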
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Example 2
In this embodiment, a device for detecting interruption of voice broadcast is further provided, and the device is used to implement the foregoing embodiments and preferred embodiments, which have already been described and will not be described again. As used hereinafter, the terms "module" and "apparatus" may refer to a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
According to an embodiment of the present invention, an apparatus embodiment for implementing the interruption detection method for voice broadcast is further provided, and fig. 7 is a schematic structural diagram of an interruption detection apparatus for voice broadcast according to an embodiment of the present invention, and as shown in fig. 7, the interruption detection apparatus for voice broadcast includes: detection module 700, extraction module 702, matching module 704, determination module 706, wherein:
the detection module 700 is configured to detect a speaker voice segment in the audio data by using a voiceprint recognition algorithm in the broadcasting process of the intelligent customer service;
the extracting module 702 is configured to perform speaker feature extraction on the speaker voice segment to obtain a speaker voice feature;
the matching module 704 performs similarity matching on the speaker voice feature and the target person voice feature;
the determining module 706 is configured to determine whether to interrupt the broadcasting process according to a comparison result between the similarity between the voice characteristic of the speaker and the voice characteristic of the target person and a set threshold.
It should be noted that the above modules may be implemented by software or hardware, for example, for the latter, the following may be implemented: the modules can be located in the same processor; alternatively, the modules may be located in different processors in any combination.
It should be noted here that the detection module 700, the extraction module 702, the matching module 704, and the determination module 706 correspond to steps S102 to S108 in embodiment 1, and the modules are the same as the corresponding steps in the implementation example and application scenario, but are not limited to the disclosure in embodiment 1. It should be noted that the modules described above may be implemented in a computer terminal as part of an apparatus.
It should be noted that, reference may be made to the relevant description in embodiment 1 for alternative or preferred embodiments of this embodiment, and details are not described here again.
The interruption detection device for voice broadcasting may further include a processor and a memory, where the detection module 700, the extraction module 702, the matching module 704, the determination module 706, and the like are all stored in the memory as program units, and the processor executes the program units stored in the memory to implement corresponding functions.
The processor includes a kernel, which retrieves the corresponding program unit from the memory; one or more kernels may be provided. The memory may include volatile memory in a computer-readable medium, random access memory (RAM) and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.
According to an embodiment of the present application, there is also provided an embodiment of a non-volatile storage medium. Optionally, in this embodiment, the nonvolatile storage medium includes a stored program, and the device where the nonvolatile storage medium is located is controlled to execute any one of the interruption detection methods for voice broadcast when the program runs.
Optionally, in this embodiment, the nonvolatile storage medium may be located in any one of computer terminals in a computer terminal group in a computer network, or in any one of mobile terminals in a mobile terminal group, and the nonvolatile storage medium includes a stored program.
Optionally, when the program runs, the device in which the non-volatile storage medium is located is controlled to perform the following functions: during broadcasting by the intelligent customer service, detecting speaker voice segments in the audio data by adopting a voiceprint recognition algorithm; performing speaker feature extraction on the speaker voice segments to obtain speaker voice features; performing similarity matching between the speaker voice features and the target person voice features; and determining whether to interrupt the broadcasting process according to the comparison result of the similarity between the speaker voice features and the target person voice features with a set threshold.
Optionally, when the program runs, the device in which the non-volatile storage medium is located is controlled to perform the following functions: if the similarity is higher than or equal to the set threshold, judging the speaker voice segment to be a target person voice segment and interrupting the broadcasting process; if the similarity is lower than the set threshold, judging the speaker voice segment not to be a target person voice segment and continuing the broadcasting process.
Optionally, when the program runs, the device in which the non-volatile storage medium is located is controlled to perform the following functions: at the start of the call, performing voice segment extraction on the voice data of the target person who has passed identity verification, to obtain target person voice segments; and performing speaker feature extraction on the target person voice segments accumulated to a specific duration, to obtain the target person voice features.
Optionally, when the program runs, the device in which the non-volatile storage medium is located is controlled to perform the following functions: performing probability calculation on the voice data by adopting a deep neural network model, to obtain a probability sequence indicating whether the voice data is speech or non-speech; performing voice segment judgment on the probability sequence by adopting a Viterbi decoding algorithm, to obtain the optimal state of the target person voice segment at each moment and generate a state sequence for start and end point judgment; and performing valid sound segment detection on the state sequence by adopting a start and end point judgment algorithm, to obtain the target person voice segments.
Optionally, when the program runs, the device in which the non-volatile storage medium is located is controlled to perform the following functions: if, during valid sound segment detection on the state sequence, the consecutive speech frames following a given frame exceed a set threshold, determining that frame to be the start point of a valid sound segment; if the consecutive non-speech frames following a given frame exceed a set threshold, determining that frame to be the tail point of the valid sound segment; and determining the target person voice segment according to the start point and the tail point of the valid sound segment.
Optionally, when the program runs, the device in which the non-volatile storage medium is located is controlled to perform the following functions: performing speaker clustering on the obtained multiple speaker voice segments by adopting a speaker clustering algorithm to obtain a clustering result; determining the speaker with the largest number of voice segments in the clustering result as the target person; and extracting that speaker's voice segments to obtain the target person voice segments.
According to an embodiment of the present application, there is also provided an embodiment of a processor. Optionally, in this embodiment, the processor is configured to execute a program, where the program executes the interruption detection method for any one of the voice broadcasts when running.
According to an embodiment of the present application, there is also provided a computer program product adapted, when executed on a data processing device, to execute a program that initializes the steps of any one of the above interruption detection methods for voice broadcast.
Optionally, the computer program product is adapted, when executed on a data processing device, to execute a program that initializes the following method steps: during broadcasting by the intelligent customer service, detecting speaker voice segments in the audio data by adopting a voiceprint recognition algorithm; performing speaker feature extraction on the speaker voice segments to obtain speaker voice features; performing similarity matching between the speaker voice features and the target person voice features; and determining whether to interrupt the broadcasting process according to the comparison result of the similarity between the speaker voice features and the target person voice features with a set threshold.
According to an embodiment of the present application, there is further provided an embodiment of an electronic device, including a memory and a processor, where the memory stores a computer program, and the processor is configured to run the computer program to execute any one of the interruption detection methods for voice broadcast.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable non-volatile storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a non-volatile storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned nonvolatile storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing are only preferred embodiments of the present invention. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications shall also fall within the protection scope of the present invention.
Claims (10)
1. An interruption detection method for voice broadcast is characterized by comprising the following steps:
in the broadcasting process of the intelligent customer service, detecting a speaker voice segment in the audio data by adopting a voiceprint recognition algorithm;
carrying out speaker characteristic extraction on the speaker voice segment to obtain speaker voice characteristics;
carrying out similarity matching on the voice characteristics of the speaker and the voice characteristics of the target person;
and determining whether to interrupt the broadcasting process according to the comparison result of the similarity between the voice characteristics of the speaker and the voice characteristics of the target person and a set threshold value.
2. The method according to claim 1, wherein determining whether to interrupt the broadcasting process according to a comparison result of the similarity between the voice characteristics of the speaker and the voice characteristics of the target person with a set threshold comprises:
if the similarity is higher than or equal to a set threshold, judging that the speaking voice segment is a target voice segment, and interrupting the broadcasting process;
and if the similarity is lower than the set threshold, judging that the speaking voice segment is not the target voice segment, and continuing the broadcasting process.
3. The method of claim 1, wherein prior to the announcement of the smart customer service, the method further comprises:
at the beginning stage of communication, voice segment extraction is carried out on the voice data of the target person passing the identity authentication, and a voice segment of the target person is obtained;
and carrying out speaker characteristic extraction on the voice segments of the target person accumulated to a specific duration to obtain the voice characteristics of the target person.
4. The method of claim 3, wherein extracting the voice segment of the voice data of the target person to obtain the voice segment of the target person comprises:
performing probability calculation on the voice data by adopting a deep neural network model to obtain a probability sequence that the voice data is voice or non-voice;
carrying out voice segment judgment on the probability sequence by adopting a Viterbi decoding algorithm to obtain an optimal state corresponding to the voice segment of the target person at each moment, and generating a state sequence for start and end point judgment;
and detecting the effective sound segment according to the state sequence by adopting a starting point and ending point judgment algorithm to obtain the voice segment of the target person.
5. The method of claim 4, wherein performing valid voice segment detection according to the state sequence by using a beginning and end point judgment algorithm to obtain the target person voice segment comprises:
if, when valid sound segment detection is performed on the state sequence, the consecutive speech frames following any frame exceed a set threshold, determining that frame to be the starting point of a valid sound segment;
if, when valid sound segment detection is performed on the state sequence, the consecutive non-speech frames following any frame exceed a set threshold, determining that frame to be the tail point of the valid sound segment;
and determining the voice fragment of the target person according to the starting point and the tail point of the active voice fragment.
6. The method of claim 3, wherein extracting the voice segment of the voice data of the target person to obtain the voice segment of the target person comprises:
carrying out speaker clustering processing on the obtained multiple sections of speaker voice sections by adopting a speaker clustering algorithm to obtain a clustering result;
determining the speaker with the largest number of voice segments based on the clustering result as the target person;
and extracting voice fragments of the voice data of the target person to obtain the voice fragments of the target person.
7. An interruption detection device for voice broadcast, comprising:
the detection module is used for detecting the voice segments of the speaker in the audio data by adopting a voiceprint recognition algorithm in the broadcasting process of the intelligent customer service;
the extraction module is used for extracting the speaker characteristics of the speaker voice segment to obtain the speaker voice characteristics;
the matching module is used for matching the similarity of the voice characteristics of the speaker and the voice characteristics of the target person;
and the determining module is used for determining whether to interrupt the broadcasting process according to the comparison result of the similarity between the voice characteristics of the speaker and the voice characteristics of the target person and a set threshold value.
8. A computer-readable storage medium storing instructions adapted to be loaded by a processor and to perform the method of interruption detection of a voice broadcast according to any one of claims 1 to 6.
9. An electronic device comprising a memory and a processor, wherein the memory stores a computer program, and the processor is configured to execute the computer program to perform the interruption detection method for voice broadcasting according to any one of claims 1 to 6.
10. A processor for executing a program, wherein the program executes to execute the interruption detection method for voice broadcasting according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111616939.7A CN114255786A (en) | 2021-12-27 | 2021-12-27 | Interruption detection method and device for voice broadcast, storage medium and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114255786A true CN114255786A (en) | 2022-03-29 |
Family
ID=80798355
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115691482A (en) * | 2022-10-12 | 2023-02-03 | 海尔优家智能科技(北京)有限公司 | Electrical appliance control method based on voice recognition, storage medium and electronic device |
CN116189718A (en) * | 2023-02-15 | 2023-05-30 | 北京声智科技有限公司 | Speech activity detection method, device, equipment and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030083874A1 (en) * | 2001-10-26 | 2003-05-01 | Crane Matthew D. | Non-target barge-in detection |
US20100158207A1 (en) * | 2005-09-01 | 2010-06-24 | Vishal Dhawan | System and method for verifying the identity of a user by voiceprint analysis |
CN109215646A (en) * | 2018-08-15 | 2019-01-15 | 北京百度网讯科技有限公司 | Voice interaction processing method, device, computer equipment and storage medium |
CN110517697A (en) * | 2019-08-20 | 2019-11-29 | 中信银行股份有限公司 | Prompt tone intelligence cutting-off device for interactive voice response |
CN111508474A (en) * | 2019-08-08 | 2020-08-07 | 马上消费金融股份有限公司 | Voice interruption method, electronic equipment and storage device |
CN112908333A (en) * | 2021-05-08 | 2021-06-04 | 鹏城实验室 | Speech recognition method, device, equipment and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||