
CN120260546B - A voice stream segmentation method based on dual-model dynamic triggering - Google Patents

A voice stream segmentation method based on dual-model dynamic triggering

Info

Publication number
CN120260546B
CN120260546B (application CN202510726884.7A)
Authority
CN
China
Prior art keywords
voice
data
frame
audio
segmentation model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202510726884.7A
Other languages
Chinese (zh)
Other versions
CN120260546A (en)
Inventor
汤闻易
刘泽原
张阳
徐珂
丁辉
张明伟
唐敏敏
张翔
田靖
王凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 28 Research Institute
Original Assignee
CETC 28 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 28 Research Institute filed Critical CETC 28 Research Institute
Priority to CN202510726884.7A priority Critical patent/CN120260546B/en
Publication of CN120260546A publication Critical patent/CN120260546A/en
Application granted granted Critical
Publication of CN120260546B publication Critical patent/CN120260546B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Telephonic Communication Services (AREA)
  • Time-Division Multiplex Systems (AREA)

Abstract

The invention discloses a voice stream segmentation method based on dual-model dynamic triggering, which comprises the following steps: step 1, construct a data stream buffer management mechanism for multiple voice streams, establish an independent processing channel for each voice stream, and form the voice data accumulated up to a threshold duration into a to-be-processed voice set; step 2, screen and analyze the to-be-processed voice set with a rapid segmentation model, select the voice segments that meet the condition, and output them to a high-precision segmentation model; step 3, according to the screening result of the rapid segmentation model, splice the data that do not meet the condition with the data in the data stream buffer and adjust the threshold duration of the buffer corresponding to the voice segment; step 4, process the voice segments screened by the rapid segmentation model with the high-precision segmentation model; step 5, according to the processing result, output the segmented voice segments to other systems such as speech recognition, splice the remaining data with the data in the data stream buffer, and update the threshold duration of the corresponding buffer.

Description

Voice stream segmentation method based on dual-model dynamic triggering
Technical Field
The invention relates to the field of speech processing, and in particular to a voice stream segmentation method based on dual-model dynamic triggering.
Background
In the field of speech processing, speech recognition models are classified into streaming and non-streaming types, the latter having much higher recognition accuracy than the former. In scenarios that require speech recognition of a voice stream, the stream can be segmented into multiple voice segments with a voice stream segmentation technique, so that a non-streaming speech recognition model can be used to recognize the speech accurately and provide accurate text for subsequent processing. As a basic technology in the field of speech processing, voice stream segmentation has important application value in scenarios such as intelligent customer-service systems, multi-party conference transcription, and real-time speech analysis. With the exponential growth of real-time speech processing demands, the prior art faces a key technical bottleneck: processing efficiency and segmentation precision are difficult to reconcile in high-concurrency scenarios.
The current mainstream technical solutions have the following defects. 1. The sliding-window voice activity detection scheme uses an energy detection method with a fixed threshold; it offers millisecond-level real-time performance but is highly sensitive to environmental noise, with a misjudgment rate exceeding 35% in low signal-to-noise-ratio scenarios. 2. The end-to-end deep learning model scheme achieves a segmentation accuracy above 90% through a neural network model, but the computational complexity of the model makes inference time grow linearly with speech duration, so the efficiency of processing long speech drops sharply.
Disclosure of Invention
The technical problem that the invention aims to solve is to provide a voice stream segmentation method based on dual-model dynamic triggering, addressing the defects of the prior art.
In order to solve the above technical problem, the invention discloses a voice stream segmentation method based on dual-model dynamic triggering, which comprises the following steps:
Step 1, constructing a data stream buffer management mechanism of multiple voice streams, establishing an independent processing channel for each voice stream, and forming voice data accumulated to a threshold duration into a voice set to be processed;
Step 2, screening, analyzing and processing a voice set to be processed through a rapid segmentation model, selecting voice fragments meeting the conditions and outputting the voice fragments to a high-precision segmentation model;
Step 3, according to the screening result of the rapid segmentation model, splicing the data which do not meet the conditions with the data in the data stream buffer, and adjusting the threshold time of the buffer zone corresponding to the voice fragment;
Step 4, processing the voice fragments screened by the rapid segmentation model by using the high-precision segmentation model;
Step 5, outputting the segmented audio segments to other systems such as speech recognition according to the processing result, splicing the remaining data with the data in the data stream buffer, and updating the threshold duration of the corresponding buffer.
The data stream buffer management mechanism for constructing multiple voice streams in step 1 comprises the following steps:
Step 1-1: establish a separate data buffer structure for each voice stream. The data buffer structure of the kth voice stream contains the audio data audio_buffer that has been received and resampled to a 16000 Hz sampling rate, the duration audio_len of the audio that has been received (in seconds), and the threshold duration threshold (in seconds);
Step 1-2: after new data is received, store it in the data buffer structure of the corresponding voice stream: resample the new audio data to the 16000 Hz sampling rate, append it to the audio data audio_buffer, and update the audio duration audio_len;
Step 1-3: compare the audio duration audio_len with the threshold duration threshold; when audio_len reaches threshold, output all the voice data audio_buffer in the buffer structure to the to-be-processed voice set audio_set; otherwise, execute step 1-2.
The rapid segmentation model screening in the step 2 comprises the following steps:
Step 2-1: check whether the to-be-processed voice set audio_set is empty; if it is empty, wait for the next check, otherwise execute step 2-2;
Step 2-2: select the voice segment of the kth voice stream from the to-be-processed voice set audio_set, divide its last last_second seconds of data into num_frame frames with a sliding window of 30 ms frame length and 10 ms frame shift, extract the spectral feature of each frame, and input the features into the rapid segmentation model to obtain a probability list indicating whether each frame is a silence frame; count the proportion of frames in the probability list whose probability is greater than 0.5, and when this proportion exceeds 0.4 the voice segment is judged to meet the condition.
Dividing the last last_second seconds of data into num_frame frames with a sliding window of 30 ms frame length and 10 ms frame shift and extracting the spectral feature of each frame, as described in step 2-2, is performed as follows:
Step 2-2-1: apply a Hamming window to the speech data of the last last_second seconds and divide them into frames with a frame length of 30 ms and a frame shift of 10 ms, obtaining num_frame frames in total; the signal amplitude of each frame is x_i(r), where i denotes the ith frame and r denotes the rth sampling point;
Step 2-2-2: perform a fast Fourier transform on the speech data of each frame and compute the energy spectrum, obtaining the frequency-domain signal X_i(k) = Σ_r x_i(r)·e^(−j2πkr/N) and the energy spectrum E_i(k) = |X_i(k)|² of each frame, where i denotes the ith frame, k denotes the kth frequency point, r denotes the rth sampling point, and N is the number of sampling points in a frame;
Step 2-2-3: pass the energy spectrum through a Mel filter bank to obtain the M-dimensional Mel spectrum feature S_i(m) = Σ_k E_i(k)·H_m(k), where H_m(k) denotes the band-pass triangular filters of the Mel filter bank, M denotes the number of filters, and m denotes the mth filter;
Step 2-2-4: apply a discrete cosine transform to the logarithm of the Mel spectrum feature to obtain the Mel-frequency cepstral coefficients (MFCC) C_i(n) = Σ_m log(S_i(m))·cos(πn(m−0.5)/M), where n is the frequency point after the DCT; for simplicity the value range of n is taken to be the same as that of m, and C_i(n) is the spectral feature of the ith frame.
The rapid segmentation model in step 2-2 is a lightweight binary classification model based on a one-dimensional convolutional neural network, and its network structure comprises:
Input layer: receives the num_frame × M MFCC feature matrix, where M is the number of Mel filters;
One-dimensional convolution layer: performs one-dimensional convolution along the time axis using 32 convolution kernels of width 5 with a stride of 1, producing a 32-channel feature map;
Maximum pooling layer: pooling window size 2, stride 2, halving the time dimension of the 32-channel feature map;
Flattening layer: flattens the feature map into a 1-dimensional vector;
Fully connected layer: the flattened vector passes through a fully connected layer whose activation function is ReLU;
Output layer: a single node, through a Sigmoid activation function, outputs a probability value representing the probability that an input frame is not a silence frame.
Step 3 comprises the following steps:
Step 3-1: when the proportion of frames in the probability list whose probability is greater than 0.5 does not exceed 0.4, select the kth voice stream corresponding to the voice segment and splice the voice segment end-to-end with the newest data received in the buffer of that voice stream to form new buffer data;
Step 3-2: update the threshold duration according to the formula threshold = audio_len + 0.3, where audio_len is the duration of the new buffer data, thereby avoiding repeated triggering of the rapid segmentation model screening step.
The network structure of the high-precision segmentation model in the step 4 is as follows:
Input layer: receives the MFCC feature sequence of the whole speech, with dimension T × M, where T is the number of frames obtained after the whole speech passes through a sliding window with a frame length of 30 ms and a frame shift of 10 ms, and M is the number of Mel filters;
Two bidirectional LSTM layers, each with 128 hidden units; the output dimension is T × 256;
Fully connected layer: a fully connected layer of 64 neurons whose activation function is ReLU;
Output layer: outputs a T-dimensional probability sequence through a Sigmoid activation function, representing the probability that each time frame is not a silence frame;
Boundary decision module: a time frame is judged to be a boundary decision point when its probability value is greater than 0.7 and is a local maximum.
The specific steps of the step 4 comprise the following steps:
Step 4-1: extract the complete MFCC features of the input voice segment, with a frame length of 30 ms and a frame shift of 10 ms;
step 4-2, obtaining a boundary probability sequence through a high-precision segmentation model;
Step 4-3: segment at the probability peak points to obtain the set of audio segments in which someone is speaking, where k denotes the kth voice stream and L denotes the total number of audio segments contained in the set; if no speech is detected in the input voice segment, an empty set is output.
Step 5 comprises the following steps:
Step 5-1: check whether the set of audio segments with speech output by the high-precision segmentation model is empty:
If the set is non-empty, send the audio segments of the set (i = 1, 2, ..., L) in sequence to the speech recognition system and execute step 5-2;
If the set is empty, directly execute step 5-3;
Step 5-2: extract the remaining data of the original input voice segment that follows the last audio segment of the set, splice the remaining data end-to-end with the newest data received in the buffer of the corresponding voice stream to form new buffer data, and update the audio duration audio_len;
Step 5-3: update the threshold duration threshold:
When the set of audio segments is non-empty, update according to the formula threshold = max(2.0, average_duration + 0.5), where average_duration is the average duration of the audio segments in the set;
When the set of audio segments is empty, the threshold duration threshold remains unchanged;
Step 5-4: reset the audio duration audio_len to the actual duration of the buffer after the current splicing.
The rapid segmentation model and the high-precision segmentation model are both trained.
The beneficial effects are:
1. The invention proposes a dual-model dynamic trigger mechanism that balances efficiency and precision through the dynamic cooperation of a rapid detection model and a high-precision model.
2. The invention dynamically adjusts the buffer threshold by combining the dual-model segmentation results of the audio stream, effectively suppressing invalid triggering.
3. The method can efficiently segment multiple voice streams while maintaining high segmentation precision; in multi-voice-stream recognition scenarios it can be combined with a non-streaming speech recognition model to obtain high-precision recognition results, providing auxiliary support for intelligent customer service and conference transcription programs, and therefore has practical value.
Drawings
Fig. 1 is a schematic diagram of the overall flow of the present invention.
Fig. 2 is a schematic diagram of a rapid segmentation model.
Fig. 3 is a schematic diagram of a high-precision segmentation model.
Fig. 4 is a schematic diagram of the data flow in a practical system according to the present invention.
Detailed Description
A voice stream segmentation method based on dual-model dynamic triggering (as shown in figure 1) comprises the following steps:
Step 1, constructing a data stream buffer management mechanism of multiple voice streams, establishing an independent processing channel for each voice stream, and forming voice data accumulated to a threshold duration into a voice set to be processed;
Step 2, screening, analyzing and processing a voice set to be processed through a rapid segmentation model, selecting voice fragments meeting the conditions and outputting the voice fragments to a high-precision segmentation model;
Step 3, according to the screening result of the rapid segmentation model, splicing the data which do not meet the conditions with the data in the data stream buffer, and adjusting the threshold time of the buffer zone corresponding to the voice fragment;
Step 4, processing the voice fragments screened by the rapid segmentation model by using the high-precision segmentation model;
Step 5, outputting the segmented audio segments to other systems such as speech recognition according to the processing result, splicing the remaining data with the data in the data stream buffer, and updating the threshold duration of the corresponding buffer.
The data stream buffer management mechanism for constructing multiple voice streams in step 1 comprises the following steps:
Step 1-1: establish a separate data buffer structure for each voice stream. The data buffer structure of the kth voice stream contains the audio data audio_buffer that has been received and resampled to a 16000 Hz sampling rate, the duration audio_len of the audio that has been received (in seconds), and the threshold duration threshold (in seconds);
Step 1-2: after new data is received, store it in the data buffer structure of the corresponding voice stream: resample the new audio data to the 16000 Hz sampling rate, append it to the audio data audio_buffer, and update the audio duration audio_len;
Step 1-3: compare the audio duration audio_len with the threshold duration threshold; when audio_len reaches threshold, output all the voice data audio_buffer in the buffer structure to the to-be-processed voice set audio_set; otherwise, execute step 1-2.
The rapid segmentation model screening in the step 2 comprises the following steps:
Step 2-1: check whether the to-be-processed voice set audio_set is empty; if it is empty, wait for the next check, otherwise execute step 2-2;
Step 2-2: select the voice segment of the kth voice stream from the to-be-processed voice set audio_set, divide its last last_second seconds of data into num_frame frames with a sliding window of 30 ms frame length and 10 ms frame shift, extract the spectral feature of each frame, and input the features into the rapid segmentation model to obtain a probability list indicating whether each frame is a silence frame; count the proportion of frames in the probability list whose probability is greater than 0.5, and when this proportion exceeds 0.4 the voice segment is judged to meet the condition.
Dividing the last last_second seconds of data into num_frame frames with a sliding window of 30 ms frame length and 10 ms frame shift and extracting the spectral feature of each frame, as described in step 2-2, is performed as follows:
Step 2-2-1: apply a Hamming window to the speech data of the last last_second seconds and divide them into frames with a frame length of 30 ms and a frame shift of 10 ms, obtaining num_frame frames in total; the signal amplitude of each frame is x_i(r), where i denotes the ith frame and r denotes the rth sampling point;
Step 2-2-2: perform a fast Fourier transform on the speech data of each frame and compute the energy spectrum, obtaining the frequency-domain signal X_i(k) = Σ_r x_i(r)·e^(−j2πkr/N) and the energy spectrum E_i(k) = |X_i(k)|² of each frame, where i denotes the ith frame, k denotes the kth frequency point, r denotes the rth sampling point, and N is the number of sampling points in a frame;
Step 2-2-3: pass the energy spectrum through a Mel filter bank to obtain the M-dimensional Mel spectrum feature S_i(m) = Σ_k E_i(k)·H_m(k), where H_m(k) denotes the band-pass triangular filters of the Mel filter bank, M denotes the number of filters, and m denotes the mth filter;
Step 2-2-4: apply a discrete cosine transform to the logarithm of the Mel spectrum feature to obtain the Mel-frequency cepstral coefficients (MFCC) C_i(n) = Σ_m log(S_i(m))·cos(πn(m−0.5)/M), where n is the frequency point after the DCT; for simplicity the value range of n is taken to be the same as that of m, and C_i(n) is the spectral feature of the ith frame.
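As a concrete illustration of steps 2-2-1 to 2-2-4, a minimal numpy sketch of the feature extraction is given below; the 16 kHz rate and the 30 ms / 10 ms windowing follow the description, while the triangular filter-bank construction and all function names are illustrative assumptions rather than the patent's implementation.

import numpy as np

SR = 16000                      # sampling rate per step 1-1
FRAME_LEN = int(0.030 * SR)     # 30 ms frame length -> 480 samples
FRAME_SHIFT = int(0.010 * SR)   # 10 ms frame shift  -> 160 samples

def frame_signal(x):
    # Step 2-2-1: Hamming-windowed framing of the tail audio
    num_frame = 1 + (len(x) - FRAME_LEN) // FRAME_SHIFT
    win = np.hamming(FRAME_LEN)
    return np.stack([x[i * FRAME_SHIFT: i * FRAME_SHIFT + FRAME_LEN] * win
                     for i in range(num_frame)])

def mel_filterbank(m_filters, n_fft):
    # Triangular band-pass filters H_m(k) spaced evenly on the Mel scale (illustrative construction)
    mel_max = 2595.0 * np.log10(1.0 + (SR / 2) / 700.0)
    hz = 700.0 * (10.0 ** (np.linspace(0.0, mel_max, m_filters + 2) / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz / SR).astype(int)
    fb = np.zeros((m_filters, n_fft // 2 + 1))
    for m in range(1, m_filters + 1):
        lo, ce, hi = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, lo:ce] = (np.arange(lo, ce) - lo) / max(ce - lo, 1)
        fb[m - 1, ce:hi] = (hi - np.arange(ce, hi)) / max(hi - ce, 1)
    return fb

def mfcc_features(x, m_filters=13):
    frames = frame_signal(x)                                   # step 2-2-1
    energy = np.abs(np.fft.rfft(frames, n=FRAME_LEN)) ** 2     # step 2-2-2: E_i(k)
    mel = energy @ mel_filterbank(m_filters, FRAME_LEN).T      # step 2-2-3: S_i(m)
    log_mel = np.log(mel + 1e-10)
    n = np.arange(m_filters)
    dct = np.cos(np.pi * np.outer(n, np.arange(m_filters) + 0.5) / m_filters)
    return log_mel @ dct.T                                     # step 2-2-4: C_i(n), shape num_frame x M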
The rapid segmentation model described in step 2-2 is a lightweight binary classification model based on a one-dimensional convolutional neural network (as shown in fig. 2), and its network structure comprises:
Input layer: receives the num_frame × M MFCC feature matrix, where M is the number of Mel filters;
One-dimensional convolution layer: performs one-dimensional convolution along the time axis using 32 convolution kernels of width 5 with a stride of 1, producing a 32-channel feature map;
Maximum pooling layer: pooling window size 2, stride 2, halving the time dimension of the 32-channel feature map;
Flattening layer: flattens the feature map into a 1-dimensional vector;
Fully connected layer: the flattened vector passes through a fully connected layer whose activation function is ReLU;
Output layer: a single node, through a Sigmoid activation function, outputs a probability value representing the probability that an input frame is not a silence frame.
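One possible realization of this lightweight classifier is sketched below in PyTorch; PyTorch itself, the 'same' padding that keeps the time dimension, and the width of the fully connected layer (64, not specified in the description) are assumptions, and the class name is illustrative.

import torch
import torch.nn as nn

class FastSegmentationModel(nn.Module):
    # Lightweight 1-D CNN binary classifier over a num_frame x M MFCC matrix
    def __init__(self, num_frame, m_filters=13, fc_units=64):
        super().__init__()
        self.conv = nn.Conv1d(m_filters, 32, kernel_size=5, stride=1, padding=2)  # 32 kernels of width 5, stride 1
        self.pool = nn.MaxPool1d(kernel_size=2, stride=2)                          # pooling window 2, stride 2
        self.fc1 = nn.Linear((num_frame // 2) * 32, fc_units)                      # fully connected layer, ReLU
        self.fc2 = nn.Linear(fc_units, 1)                                          # single-node output

    def forward(self, mfcc):                   # mfcc: (batch, num_frame, M)
        x = self.conv(mfcc.transpose(1, 2))    # convolve along the time axis
        x = self.pool(x)
        x = torch.flatten(x, start_dim=1)      # flatten the feature map into a 1-D vector
        x = torch.relu(self.fc1(x))
        return torch.sigmoid(self.fc2(x))      # probability of "not a silence frame"

For a 1.5 s tail at 16 kHz the framing above yields 148 frames, so the model would be constructed as FastSegmentationModel(num_frame=148).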
Step 3 comprises the following steps:
Step 3-1: when the proportion of frames in the probability list whose probability is greater than 0.5 does not exceed 0.4, select the kth voice stream corresponding to the voice segment and splice the voice segment end-to-end with the newest data received in the buffer of that voice stream to form new buffer data;
Step 3-2: update the threshold duration according to the formula threshold = audio_len + 0.3, where audio_len is the duration of the new buffer data, thereby avoiding repeated triggering of the rapid segmentation model screening step.
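A small sketch of this splice-back and threshold bump, reusing the per-stream buffer structure from step 1-1 (keys audio_buffer, audio_len, threshold); the function name is illustrative.

import numpy as np

def handle_unqualified_segment(buffer, segment):
    # Step 3-1: put the unqualified segment back in front of the newly buffered data
    buffer['audio_buffer'] = np.concatenate([segment, buffer['audio_buffer']])
    buffer['audio_len'] = len(buffer['audio_buffer']) / 16000.0
    # Step 3-2: raise the trigger threshold so the fast screening is not re-triggered immediately
    buffer['threshold'] = buffer['audio_len'] + 0.3
    return buffer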
The network structure (as in fig. 3) of the high-precision segmentation model described in step 4 is as follows:
Input layer: receives the MFCC feature sequence of the whole speech, with dimension T × M, where T is the number of frames obtained after the whole speech passes through a sliding window with a frame length of 30 ms and a frame shift of 10 ms, and M is the number of Mel filters;
Two bidirectional LSTM layers, each with 128 hidden units; the output dimension is T × 256;
Fully connected layer: a fully connected layer of 64 neurons whose activation function is ReLU;
Output layer: outputs a T-dimensional probability sequence through a Sigmoid activation function, representing the probability that each time frame is not a silence frame;
Boundary decision module: a time frame is judged to be a boundary decision point when its probability value is greater than 0.7 and is a local maximum.
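A PyTorch sketch of this network; reading "two bidirectional LSTM layers comprising 128 hidden units" as a two-layer stacked BiLSTM is an interpretation, and the class name is illustrative.

import torch
import torch.nn as nn

class HighPrecisionSegmentationModel(nn.Module):
    # Frame-level boundary-probability model: stacked BiLSTM + per-frame sigmoid
    def __init__(self, m_filters=13):
        super().__init__()
        self.bilstm = nn.LSTM(input_size=m_filters, hidden_size=128, num_layers=2,
                              batch_first=True, bidirectional=True)   # output dimension T x 256
        self.fc = nn.Linear(256, 64)                                   # 64-neuron fully connected layer
        self.out = nn.Linear(64, 1)                                    # per-frame probability

    def forward(self, mfcc):                          # mfcc: (batch, T, M)
        h, _ = self.bilstm(mfcc)                      # (batch, T, 256)
        h = torch.relu(self.fc(h))
        return torch.sigmoid(self.out(h)).squeeze(-1)  # (batch, T) probability sequence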
The specific steps of the step 4 comprise the following steps:
Step 4-1: extract the complete MFCC features of the input voice segment, with a frame length of 30 ms and a frame shift of 10 ms;
step 4-2, obtaining a boundary probability sequence through a high-precision segmentation model;
Step 4-3: segment at the probability peak points to obtain the set of audio segments in which someone is speaking, where k denotes the kth voice stream and L denotes the total number of audio segments contained in the set; if no speech is detected in the input voice segment, an empty set is output.
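The boundary decision and the cutting at probability peaks could be realized as follows; using scipy.signal.find_peaks with height=0.7 is one way to implement the "greater than 0.7 and a local maximum" rule, the frame-to-sample mapping assumes the 10 ms shift, and returning every resulting piece is a simplification of the "segments in which someone is speaking" set.

import numpy as np
from scipy.signal import find_peaks

def segment_at_peaks(audio, boundary_probs, sr=16000, frame_shift_s=0.010):
    # Boundary decision points: probability > 0.7 and a local maximum
    peaks, _ = find_peaks(boundary_probs, height=0.7)
    if len(peaks) == 0:
        return []                                          # no boundaries detected -> empty set
    cuts = [int(p * frame_shift_s * sr) for p in peaks]    # frame index -> sample index
    edges = [0] + cuts + [len(audio)]
    return [audio[a:b] for a, b in zip(edges[:-1], edges[1:]) if b > a]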
Step 5 comprises the following steps:
Step 5-1: check whether the set of audio segments with speech output by the high-precision segmentation model is empty:
If the set is non-empty, send the audio segments of the set (i = 1, 2, ..., L) in sequence to the speech recognition system (as in fig. 4) and execute step 5-2;
If the set is empty, directly execute step 5-3;
Step 5-2: extract the remaining data of the original input voice segment that follows the last audio segment of the set, splice the remaining data end-to-end with the newest data received in the buffer of the corresponding voice stream to form new buffer data, and update the audio duration audio_len;
Step 5-3: update the threshold duration threshold:
When the set of audio segments is non-empty, update according to the formula threshold = max(2.0, average_duration + 0.5), where average_duration is the average duration of the audio segments in the set;
When the set of audio segments is empty, the threshold duration threshold remains unchanged;
Step 5-4: reset the audio duration audio_len to the actual duration of the buffer after the current splicing.
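A sketch of the step 5 post-processing on the same per-stream buffer structure; last_end (the sample index where the last detected segment ends) and send_to_asr (the hand-off to the recognition system) are illustrative stand-ins, not the patent's interfaces.

import numpy as np

def postprocess(buffer, original_segment, speech_segments, last_end, send_to_asr):
    if speech_segments:                                            # steps 5-1 / 5-2
        for seg in speech_segments:
            send_to_asr(seg)                                       # forward to speech recognition
        remainder = original_segment[last_end:]                    # data after the last segment
        buffer['audio_buffer'] = np.concatenate([remainder, buffer['audio_buffer']])
        average_duration = float(np.mean([len(s) for s in speech_segments])) / 16000.0
        buffer['threshold'] = max(2.0, average_duration + 0.5)     # step 5-3, non-empty case
    # step 5-3, empty case: the threshold is left unchanged
    buffer['audio_len'] = len(buffer['audio_buffer']) / 16000.0    # step 5-4
    return buffer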
The rapid segmentation model and the high-precision segmentation model are both trained.
Examples:
In this embodiment, an air traffic control ground-air voice communication recorder system is taken as an example to describe the application scenario of the method in the real-time segmentation of multiple voice streams. In combination with the schematic diagram shown in fig. 1, the specific implementation steps are as follows:
step 1, establishing a multipath voice stream buffer management mechanism:
The system establishes an independent buffer structure for each radio channel and defines the structure of the kth voice stream as follows:
{
    'audio_buffer': np.array([]),  # audio data resampled to 16000 Hz
    'audio_len': 0.0,              # accumulated duration (seconds)
    'threshold': 2.0               # initial trigger threshold (seconds)
}
When new audio data arrives, it is resampled to a 16000 Hz sampling rate and appended to the corresponding audio_buffer, audio_len = len(audio_buffer) / 16000 is updated, and when audio_len > threshold the complete buffer data is packed into the to-be-processed set audio_set.
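A sketch of this accumulate-and-trigger logic; the linear-interpolation resampler, the 8 kHz source rate, and emptying the buffer once its contents are handed to audio_set are illustrative assumptions.

import numpy as np

def on_new_audio(buffer, new_audio, audio_set, stream_id, src_rate=8000):
    # Resample the incoming chunk to 16000 Hz (simple linear interpolation for illustration)
    n_out = int(len(new_audio) * 16000 / src_rate)
    resampled = np.interp(np.linspace(0, len(new_audio) - 1, n_out),
                          np.arange(len(new_audio)), new_audio)
    buffer['audio_buffer'] = np.concatenate([buffer['audio_buffer'], resampled])
    buffer['audio_len'] = len(buffer['audio_buffer']) / 16000.0
    if buffer['audio_len'] > buffer['threshold']:
        audio_set[stream_id] = buffer['audio_buffer']   # hand the whole chunk to the to-be-processed set
        buffer['audio_buffer'] = np.array([])
        buffer['audio_len'] = 0.0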
Step 2, dynamic triggering of the rapid segmentation model (corresponding to fig. 2):
a. Extract the audio of the last 1.5 seconds for feature analysis:
Framing with a frame length of 30 ms and a frame shift of 10 ms;
Computing 13-dimensional MFCC features (M = 13).
b. Silence detection is performed by the lightweight model.
c. High-precision processing is triggered when the silence-frame proportion is less than or equal to 40 percent.
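The trigger decision in steps b-c can be written as a short check on the fast model's per-frame probability list; the 0.5 and 0.4 cut-offs follow step 2-2 of the description, and the function name is illustrative.

import numpy as np

def should_trigger(frame_probs, prob_cutoff=0.5, ratio_cutoff=0.4):
    # Trigger high-precision processing when enough frames score above the cut-off
    ratio = float(np.mean(np.asarray(frame_probs) > prob_cutoff))
    return ratio > ratio_cutoff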
Step 3, dynamically adjusting the buffer:
If processing is not triggered, the current data is concatenated with the preceding buffer and the threshold is updated by the formula threshold_new = current_len + 0.3. Example: when the original buffer duration is 2.1 seconds and processing is not triggered, the new threshold is set to 2.1 + 0.3 = 2.4 seconds.
Step 4, high-precision speech segmentation (corresponding to fig. 3):
The complete audio segment is processed using the bidirectional LSTM model.
step 5, the result output is linked with the system (corresponding to fig. 4):
the valid speech segments are pushed to the speech recognition engine and the remaining data is written back to the buffer.
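Putting the pieces together, one pass of the dual-model pipeline for a single stream could look like the sketch below, which wires up the helper sketches given earlier; fast_probs_fn and boundary_probs_fn stand in for inference wrappers around the two models, and none of these names are the patent's API.

def process_stream_once(buffer, audio_set, stream_id,
                        fast_probs_fn, boundary_probs_fn, send_to_asr):
    segment = audio_set.pop(stream_id, None)              # output of step 1
    if segment is None:
        return
    tail = segment[-int(1.5 * 16000):]                    # step 2: screen the last 1.5 s
    if not should_trigger(fast_probs_fn(tail)):
        handle_unqualified_segment(buffer, segment)       # step 3: splice back, raise threshold
        return
    boundary_probs = boundary_probs_fn(segment)           # step 4: per-frame boundary probabilities
    speech_segments = segment_at_peaks(segment, boundary_probs)
    last_end = len(segment) if speech_segments else 0
    postprocess(buffer, segment, speech_segments, last_end, send_to_asr)   # step 5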
This embodiment ensures high-precision segmentation through the cooperative work of the two-stage models (as shown in the flow of fig. 1) while, compared with a traditional single-model scheme, significantly reducing the computational resource requirement, and is particularly suitable for scenarios such as air traffic control that need to process multiple voice streams in real time.
The invention provides a voice stream segmentation method based on dual-model dynamic triggering, and there are many methods and ways to implement this technical scheme; the above is only a preferred embodiment of the invention. It should be pointed out that those skilled in the art can make several improvements and modifications without departing from the principle of the invention, and these improvements and modifications should also be regarded as within the protection scope of the invention. Components not explicitly described in this embodiment can be implemented with the prior art.

Claims (7)

1. A voice stream segmentation method based on dual-model dynamic triggering, characterized by comprising the following steps:
Step 1, constructing a data stream buffer management mechanism of multiple voice streams, establishing an independent processing channel for each voice stream, and forming voice data accumulated to a threshold duration into a voice set to be processed;
Step 2, screening, analyzing and processing a voice set to be processed through a rapid segmentation model, selecting voice fragments meeting the conditions and outputting the voice fragments to a high-precision segmentation model;
Step 3, according to the screening result of the rapid segmentation model, splicing the data which do not meet the conditions with the data in the data stream buffer, and adjusting the threshold time of the buffer zone corresponding to the voice fragment;
Step 4, processing the voice fragments screened by the rapid segmentation model by using the high-precision segmentation model;
Step 5, outputting the segmented audio clips to a voice recognition system according to the processing result, splicing the residual data with the data in the data stream buffer, and updating the threshold time of the corresponding buffer zone;
The rapid segmentation model screening in the step 2 specifically comprises the following steps:
Step 2-1: check whether the to-be-processed voice set is empty; if it is empty, wait for the next check, otherwise execute step 2-2;
Step 2-2: select the voice segment of the kth voice stream from the to-be-processed voice set, divide the data of the last second through a sliding window into num_frame frames, extract the spectral feature of each frame, input the features into the rapid segmentation model to obtain a probability list of whether each frame is a silence frame, perform statistics on the probability list, and judge whether the voice segment meets the condition;
The rapid segmentation model in step 2-2 is a lightweight binary classification model based on a one-dimensional convolutional neural network, and its network structure comprises:
Input layer: receives the num_frame × M MFCC feature matrix, where M is the number of Mel filters;
One-dimensional convolution layer: performs one-dimensional convolution along the time axis using 32 convolution kernels of width 5 with a stride of 1, producing a 32-channel feature map;
Maximum pooling layer: pooling window size 2, stride 2, halving the time dimension of the 32-channel feature map;
Flattening layer: flattens the feature map into a 1-dimensional vector;
Fully connected layer: the flattened vector passes through a fully connected layer whose activation function is ReLU;
Output layer: a single node, through a Sigmoid activation function, outputs a probability value representing the probability that an input frame is not a silence frame;
The network structure of the high-precision segmentation model in the step 4 is as follows:
Input layer: receives the MFCC feature sequence of the whole speech, with dimension T × M, where T is the number of frames obtained after the whole speech passes through a sliding window with a frame length of 30 ms and a frame shift of 10 ms, and M is the number of Mel filters;
Two bidirectional LSTM layers, each with 128 hidden units; the output dimension is T × 256;
Fully connected layer: a fully connected layer of 64 neurons whose activation function is ReLU;
Output layer: outputs a T-dimensional probability sequence through a Sigmoid activation function, representing the probability that each time frame is not a silence frame;
Boundary decision module: a time frame is judged to be a boundary decision point when its probability value is greater than 0.7 and is a local maximum.
2. The voice stream segmentation method based on dual-model dynamic triggering according to claim 1, characterized in that the data stream buffer management mechanism for multiple voice streams in step 1 is constructed through the following steps:
Step 1-1: establish a separate data buffer structure for each voice stream, the data buffer structure of the kth voice stream comprising the audio data that has been received and resampled, the duration of the audio that has been received, and the threshold duration;
Step 1-2: after new data is received, store it in the data buffer structure of the corresponding voice stream: resample the new audio data, append it to the received audio data, and update the audio duration;
Step 1-3: compare the audio duration with the threshold duration; when the audio duration reaches the threshold duration, output all the voice data in the buffer structure to the to-be-processed voice set; otherwise, execute step 1-2.
3. The voice stream segmentation method based on dual-model dynamic triggering according to claim 1, characterized in that dividing the data of the last second into num_frame frames through a sliding window in step 2-2 and extracting the spectral feature of each frame comprises:
Step 2-2-1: frame the speech data of the last second to obtain num_frame frames, where the signal amplitude of each frame is x_i(r), i denotes the ith frame, and r denotes the rth sampling point;
Step 2-2-2: perform a fast Fourier transform on the speech data of each frame and compute the energy spectrum, obtaining the frequency-domain signal X_i(k) = Σ_r x_i(r)·e^(−j2πkr/N) and the energy spectrum E_i(k) = |X_i(k)|² of each frame, where i denotes the ith frame, k denotes the kth frequency point, r denotes the rth sampling point, and N is the number of sampling points in a frame;
Step 2-2-3: pass the energy spectrum through a Mel filter bank to obtain the M-dimensional Mel spectrum feature S_i(m) = Σ_k E_i(k)·H_m(k), where H_m(k) denotes the band-pass triangular filters of the Mel filter bank, M denotes the number of filters, and m denotes the mth filter;
Step 2-2-4: apply a discrete cosine transform to the logarithm of the Mel spectrum feature to obtain the Mel-frequency cepstral coefficients C_i(n) = Σ_m log(S_i(m))·cos(πn(m−0.5)/M), where n is the frequency point after the DCT, the value range of n is the same as that of m, and C_i(n) is the spectral feature of the ith frame.
4. The voice stream segmentation method based on dual-model dynamic triggering according to claim 1, characterized in that step 3 comprises the following steps:
Step 3-1: when the proportion of frames in the probability list whose probability is greater than 0.5 does not exceed 0.4, select the kth voice stream corresponding to the voice segment and splice the voice segment end-to-end with the newest data received in the buffer of that voice stream to form new buffer data;
Step 3-2: update the threshold duration according to the formula: threshold duration = duration of the new buffer data + 0.3.
5. The voice stream segmentation method based on dual-model dynamic triggering according to claim 4, characterized in that step 4 comprises the following steps:
Step 4-1: extract the complete MFCC features of the input voice segment;
Step 4-2: obtain a boundary probability sequence through the high-precision segmentation model;
Step 4-3: segment at the probability peak points to obtain the set of audio segments in which someone is speaking, where k denotes the kth voice stream and L denotes the total number of audio segments contained in the set; if no speech is detected in the input voice segment, an empty set is output.
6. The voice stream segmentation method based on dual-model dynamic triggering according to claim 1, characterized in that step 5 comprises the following steps:
Step 5-1: check whether the set of audio segments with speech output by the high-precision segmentation model is empty:
If the set of audio segments is non-empty, send the audio segments of the set (i = 1, 2, ..., L) in sequence to the speech recognition system and execute step 5-2;
If the set of audio segments is empty, directly execute step 5-3;
Step 5-2: extract the remaining data of the original input voice segment that follows the last audio segment of the set, splice the remaining data end-to-end with the newest data received in the buffer of the corresponding voice stream to form new buffer data, and update the audio duration;
Step 5-3: update the threshold duration:
When the set of audio segments is non-empty, update according to the formula: threshold duration = max(2.0, average_duration + 0.5), where average_duration is the average duration of the audio segments in the set;
When the set of audio segments is empty, the threshold duration remains unchanged;
Step 5-4: reset the audio duration to the actual duration of the buffer after the current splicing.
7. The voice stream segmentation method based on dual-model dynamic triggering according to claim 1, characterized in that the rapid segmentation model and the high-precision segmentation model are both trained.
CN202510726884.7A 2025-06-03 2025-06-03 A voice stream segmentation method based on dual-model dynamic triggering Active CN120260546B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510726884.7A CN120260546B (en) 2025-06-03 2025-06-03 A voice stream segmentation method based on dual-model dynamic triggering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202510726884.7A CN120260546B (en) 2025-06-03 2025-06-03 A voice stream segmentation method based on dual-model dynamic triggering

Publications (2)

Publication Number Publication Date
CN120260546A CN120260546A (en) 2025-07-04
CN120260546B true CN120260546B (en) 2025-09-23

Family

ID=96187627

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510726884.7A Active CN120260546B (en) 2025-06-03 2025-06-03 A voice stream segmentation method based on dual-model dynamic triggering

Country Status (1)

Country Link
CN (1) CN120260546B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101021854A (en) * 2006-10-11 2007-08-22 鲍东山 Audio analysis system based on content
CN108766419A (en) * 2018-05-04 2018-11-06 华南理工大学 A kind of abnormal speech detection method based on deep learning

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1174374C (en) * 1999-06-30 2004-11-03 国际商业机器公司 Method for Concurrent Speech Recognition, Speaker Segmentation, and Classification
US8756061B2 (en) * 2011-04-01 2014-06-17 Sony Computer Entertainment Inc. Speech syllable/vowel/phone boundary detection using auditory attention cues
WO2020111676A1 (en) * 2018-11-28 2020-06-04 삼성전자 주식회사 Voice recognition device and method
CN113012684B (en) * 2021-03-04 2022-05-31 电子科技大学 Synthesized voice detection method based on voice segmentation
CN114283792B (en) * 2021-12-13 2025-06-20 亿嘉和科技股份有限公司 Method and device for identifying the opening and closing sound of grounding knife switch
CN114187898A (en) * 2021-12-31 2022-03-15 电子科技大学 An End-to-End Speech Recognition Method Based on Fusion Neural Network Structure
CN119132328A (en) * 2023-06-13 2024-12-13 腾讯科技(深圳)有限公司 A voice processing method, device, equipment, medium and program product
CN117238279A (en) * 2023-09-04 2023-12-15 中国电子科技集团公司第二十八研究所 A method of segmenting regulatory speech based on speech recognition and endpoint detection
CN118968970B (en) * 2024-07-15 2025-06-20 广州市中南民航空管通信网络科技有限公司 A voice segmentation method and system for air traffic control voice recorder
CN119673171B (en) * 2025-02-17 2025-04-25 深圳十方融海科技有限公司 Speech recognition feature extraction and reasoning method for artificial intelligence dialogue system

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101021854A (en) * 2006-10-11 2007-08-22 鲍东山 Audio analysis system based on content
CN108766419A (en) * 2018-05-04 2018-11-06 华南理工大学 A kind of abnormal speech detection method based on deep learning

Also Published As

Publication number Publication date
CN120260546A (en) 2025-07-04

Similar Documents

Publication Publication Date Title
CN110364143B (en) Voice awakening method and device and intelligent electronic equipment
KR100636317B1 (en) Distributed speech recognition system and method
Zazo et al. Feature Learning with Raw-Waveform CLDNNs for Voice Activity Detection.
JP3002204B2 (en) Time-series signal recognition device
CN1121681C (en) Speech processing
Hermansky et al. TRAPS-classifiers of temporal patterns.
CN103400580A (en) Method for estimating importance degree of speaker in multiuser session voice
CN111461173A (en) A multi-speaker clustering system and method based on attention mechanism
CN102543063A (en) Method for estimating speech speed of multiple speakers based on segmentation and clustering of speakers
CN112270931B (en) A Method for Deceptive Speech Detection Based on Siamese Convolutional Neural Networks
CN118486305B (en) Event triggering processing method based on voice recognition
CN112599123B (en) Lightweight speech keyword recognition network, method, device and storage medium
Lu et al. Real-time unsupervised speaker change detection
CN111429916B (en) Sound signal recording system
CN110047502A (en) The recognition methods of hierarchical voice de-noising and system under noise circumstance
CN112927723A (en) High-performance anti-noise speech emotion recognition method based on deep neural network
CN114822578A (en) Speech noise reduction method, device, equipment and storage medium
CN113889099A (en) Voice recognition method and system
CN113903328A (en) Speaker counting method, device, device and storage medium based on deep learning
CN116741159A (en) Audio classification and model training method and device, electronic equipment and storage medium
CN111341295A (en) Offline real-time multilingual broadcast sensitive word monitoring method
CN120260546B (en) A voice stream segmentation method based on dual-model dynamic triggering
CN119673173A (en) A streaming speaker log method and system
CN115132196A (en) Voice instruction recognition method and device, electronic equipment and storage medium
CN110930985B (en) Telephone voice recognition model, method, system, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant