CN120260546B - A voice stream segmentation method based on dual-model dynamic triggering - Google Patents
Info
- Publication number
- CN120260546B (application CN202510726884.7A)
- Authority
- CN
- China
- Prior art keywords
- voice
- data
- frame
- audio
- segmentation model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Telephonic Communication Services (AREA)
- Time-Division Multiplex Systems (AREA)
Abstract
The invention discloses a voice stream segmentation method based on dual-model dynamic triggering, which comprises the following steps: 1, constructing a data stream buffer management mechanism for multiple voice streams, establishing an independent processing channel for each voice stream, and forming the voice data accumulated to a threshold duration into a set of speech to be processed; 2, screening and analyzing the set of speech to be processed with a fast segmentation model, selecting the voice segments that meet the conditions and outputting them to a high-precision segmentation model; 3, according to the screening result of the fast segmentation model, splicing the data that do not meet the conditions with the data in the data stream buffer and adjusting the threshold duration of the buffer corresponding to the voice segment; 4, processing the voice segments screened out by the fast segmentation model with the high-precision segmentation model; 5, outputting the segmented voice segments to downstream systems such as speech recognition according to the processing result, splicing the remaining data with the data in the data stream buffer, and updating the threshold duration of the corresponding buffer.
Description
Technical Field
The invention relates to the field of voice processing, in particular to a voice stream segmentation method based on dual-model dynamic triggering.
Background
In the field of speech processing, speech recognition models fall into two types, streaming and non-streaming, the latter having much higher recognition accuracy than the former. In scenarios where a voice stream must be recognized, the stream can be segmented into several voice segments with a voice stream segmentation technique, so that accurate recognition can be achieved with a non-streaming speech recognition model and accurate text is provided for subsequent processing. As a basic technology in the field of speech processing, voice stream segmentation has important application value in scenarios such as intelligent customer service systems, multi-party conference transcription and real-time speech analysis. With the rapid growth of real-time speech processing demands, the prior art faces a key technical bottleneck: processing efficiency and segmentation precision are difficult to reconcile in high-concurrency scenarios.
The current mainstream technical schemes have the following defects. 1. Sliding-window voice activity detection schemes adopt energy detection with a fixed threshold; they offer millisecond-level real-time performance, but they are highly sensitive to environmental noise, and the misjudgment rate exceeds 35% in low signal-to-noise-ratio scenarios. 2. End-to-end deep learning model schemes achieve a segmentation accuracy above 90% through neural network models, but the computational complexity of the models makes the inference time grow linearly with the speech duration, so the efficiency of processing long speech drops rapidly.
Disclosure of Invention
The technical problem to be solved by the invention is to provide, in view of the defects of the prior art, a voice stream segmentation method based on dual-model dynamic triggering.
In order to solve the above technical problem, the invention discloses a voice stream segmentation method based on dual-model dynamic triggering, which comprises the following steps:
Step 1, constructing a data stream buffer management mechanism for multiple voice streams, establishing an independent processing channel for each voice stream, and forming the voice data accumulated to a threshold duration into a set of speech to be processed;
Step 2, screening and analyzing the set of speech to be processed with a fast segmentation model, selecting the voice segments that meet the conditions and outputting them to a high-precision segmentation model;
Step 3, according to the screening result of the fast segmentation model, splicing the data that do not meet the conditions with the data in the data stream buffer and adjusting the threshold duration of the buffer corresponding to the voice segment;
Step 4, processing the voice segments screened out by the fast segmentation model with the high-precision segmentation model;
Step 5, outputting the segmented audio segments to downstream systems such as speech recognition according to the processing result, splicing the remaining data with the data in the data stream buffer, and updating the threshold duration of the corresponding buffer.
Constructing the data stream buffer management mechanism for multiple voice streams in step 1 comprises the following steps:
Step 1-1, a separate data buffer structure is built for each voice stream; the data buffer structure of the kth voice stream comprises the audio data that has been received and resampled to a 16000 Hz sampling rate, the duration of the audio that has been received (in seconds), and the threshold duration (in seconds);
Step 1-2, when new data is received, it is stored into the data buffer structure of the corresponding voice stream: the new audio data is resampled to the 16000 Hz sampling rate and appended to the buffered audio data, after which the audio duration is updated;
Step 1-3, the audio duration is compared with the threshold duration; when the audio duration reaches the threshold duration, all voice data in the buffer structure is output to the set of speech to be processed; otherwise, step 1-2 is executed.
The fast segmentation model screening in step 2 comprises the following steps:
Step 2-1, the set of speech to be processed is checked; if it is empty, the next check is awaited; otherwise step 2-2 is executed;
Step 2-2, a speech segment of the kth voice stream is selected from the set of speech to be processed; the last last_second seconds of data are divided into num_frame frames using a sliding window with a 30 ms frame length and a 10 ms frame shift, the spectral features of each frame are extracted and input into the fast segmentation model, and a probability list indicating whether each frame is a silence frame is obtained; the proportion of frames in the probability list whose probability is greater than 0.5 is counted, and when this proportion exceeds 0.4 the speech segment is judged to meet the conditions.
Dividing the last last_second seconds of data into num_frame frames with a sliding window of 30 ms frame length and 10 ms frame shift and extracting the spectral features of each frame, as described in step 2-2, specifically comprises the following steps:
Step 2-2-1, a Hamming window is applied to the last last_second seconds of speech data for framing, with a frame length of 30 ms and a frame shift of 10 ms, yielding num_frame frames in total; the signal amplitude of each frame is x_i(r), where i denotes the i-th frame and r denotes the r-th sampling point within the frame;
Step 2-2-2, a fast Fourier transform is applied to the speech data of each frame and the energy spectrum is computed, giving the frequency-domain signal X_i(k) and the energy spectrum E_i(k) = |X_i(k)|^2 of each frame, where i denotes the i-th frame, k denotes the k-th frequency point, and r denotes the r-th sampling point;
Step 2-2-3, the energy spectrum is passed through a Mel filter bank to obtain the M-dimensional Mel spectrum features S_i(m), where H_m(k) denotes the band-pass triangular filters of the Mel filter bank, M denotes the number of filters, and m denotes the m-th filter;
Step 2-2-4, a discrete cosine transform is applied to the logarithm of the Mel spectrum features to obtain the Mel-frequency cepstral coefficients (MFCC) C_i(n), where n is the frequency point after the DCT; for simplicity the value range of n is taken to be the same as that of m, and C_i(n) is the spectral feature of the i-th frame.
The fast segmentation model in step 2-2 is a lightweight binary classification model based on a one-dimensional convolutional neural network; its network structure comprises:
an input layer, receiving a num_frame × M MFCC feature matrix, where M is the number of Mel filters;
a one-dimensional convolution layer, using 32 convolution kernels of width 5 and stride 1 to perform one-dimensional convolution along the time axis, producing a feature map with 32 channels;
a max pooling layer, with a pooling window of size 2 and a stride of 2, halving the time dimension while keeping 32 channels;
a flatten layer, flattening the feature map into a one-dimensional vector;
a fully connected layer, whose activation function is ReLU;
an output layer, which outputs probability values through a Sigmoid activation function, representing the probability that each of the input num_frame frames is not a silence frame.
Step 3 comprises the following steps:
Step 3-1, the proportion of frames in the probability list whose silence probability is greater than 0.5 is counted; when this proportion does not exceed 0.4, the kth voice stream corresponding to the speech segment is selected, and the speech segment is spliced end to end with the most recently received data in the buffer of the corresponding voice stream to form new buffer data;
Step 3-2, the threshold duration is updated according to the formula: new threshold duration = current buffered audio duration + 0.3 seconds, thereby avoiding repeated triggering of the fast segmentation model screening step.
The network structure of the high-precision segmentation model in step 4 is as follows:
an input layer, which receives the MFCC feature sequence of the whole speech, with dimension T × M, where T is the number of frames obtained by applying a sliding window with a 30 ms frame length and a 10 ms frame shift to the whole speech, and M is the number of Mel filters;
two bidirectional LSTM layers, each containing 128 hidden units, with an output dimension of T × 256;
a fully connected layer with 64 neurons, whose activation function is ReLU;
an output layer, which outputs a T-dimensional probability sequence through a Sigmoid activation function, representing the probability that each time frame is not a silence frame;
and a boundary decision module, which judges a point to be a boundary decision point when its probability value is greater than 0.7 and is a local maximum.
Step 4 specifically comprises the following steps:
Step 4-1, the complete MFCC features of the input speech segment are extracted, with a frame length of 30 ms and a frame shift of 10 ms;
Step 4-2, a boundary probability sequence is obtained through the high-precision segmentation model;
Step 4-3, the speech is segmented at the probability peak points to obtain a set of audio segments in which someone is speaking, where k denotes the kth voice stream and L denotes the total number of audio segments contained in the set; if no speech is detected in the input speech segment, an empty set is output.
Step 5 comprises the following steps:
Step 5-1, whether the set of audio segments with speech output by the high-precision segmentation model is empty is checked:
if the set is not empty, the audio segments (i = 1, 2, ..., L) in the set are sent to the speech recognition system in sequence and step 5-2 is executed;
if the set is empty, step 5-3 is executed directly;
Step 5-2, the remaining data in the original input speech segment that follows the audio segments in the set is extracted; the remaining data is spliced end to end with the most recently received data in the buffer of the corresponding voice stream to form new buffer data, and the audio duration is updated;
Step 5-3, the threshold duration is updated:
when the set of audio segments is not empty, the threshold duration is updated according to the formula: threshold duration = max(2.0, average_duration + 0.5), where average_duration is the average duration (in seconds) of the audio segments in the set;
when the set of audio segments is empty, the threshold duration remains unchanged;
Step 5-4, the audio duration is reset to the actual duration of the buffer after the current splicing.
The fast segmentation model and the high-precision segmentation model are both trained models.
The beneficial effects are that:
1. The invention provides a dual-model dynamic triggering mechanism, achieving a balance between efficiency and precision through the dynamic cooperation of a fast detection model and a high-precision model.
2. The method dynamically adjusts the buffer threshold based on the dual-model segmentation results of the audio stream, effectively suppressing invalid triggering.
3. The method can efficiently segment multiple voice streams while maintaining high segmentation precision; combined with a non-streaming speech recognition model in multi-stream recognition scenarios, it enables high-precision recognition results, provides supporting functions for intelligent customer service and conference transcription applications, and has practical value.
Drawings
Fig. 1 is a schematic diagram of the overall flow of the present invention.
Fig. 2 is a schematic diagram of a rapid segmentation model.
Fig. 3 is a schematic diagram of a high-precision segmentation model.
Fig. 4 is a schematic diagram of the data flow in a practical system according to the present invention.
Detailed Description
A voice stream segmentation method based on dual-model dynamic triggering (as shown in fig. 1) comprises the following steps:
Step 1, constructing a data stream buffer management mechanism for multiple voice streams, establishing an independent processing channel for each voice stream, and forming the voice data accumulated to a threshold duration into a set of speech to be processed;
Step 2, screening and analyzing the set of speech to be processed with a fast segmentation model, selecting the voice segments that meet the conditions and outputting them to a high-precision segmentation model;
Step 3, according to the screening result of the fast segmentation model, splicing the data that do not meet the conditions with the data in the data stream buffer and adjusting the threshold duration of the buffer corresponding to the voice segment;
Step 4, processing the voice segments screened out by the fast segmentation model with the high-precision segmentation model;
Step 5, outputting the segmented audio segments to downstream systems such as speech recognition according to the processing result, splicing the remaining data with the data in the data stream buffer, and updating the threshold duration of the corresponding buffer.
Constructing the data stream buffer management mechanism for multiple voice streams in step 1 comprises the following steps:
Step 1-1, a separate data buffer structure is built for each voice stream; the data buffer structure of the kth voice stream comprises the audio data that has been received and resampled to a 16000 Hz sampling rate, the duration of the audio that has been received (in seconds), and the threshold duration (in seconds);
Step 1-2, when new data is received, it is stored into the data buffer structure of the corresponding voice stream: the new audio data is resampled to the 16000 Hz sampling rate and appended to the buffered audio data, after which the audio duration is updated;
Step 1-3, the audio duration is compared with the threshold duration; when the audio duration reaches the threshold duration, all voice data in the buffer structure is output to the set of speech to be processed; otherwise, step 1-2 is executed.
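As an illustration of steps 1-1 to 1-3, the following Python sketch shows one possible form of the per-stream buffer structure and of the accumulation check. The field names audio_buffer, audio_len and threshold and the 2.0-second initial threshold follow the embodiment below; the resample helper is an assumption of the sketch, and clearing the buffer when its contents are handed over is this sketch's reading of the later splicing steps.

```python
import numpy as np

TARGET_SR = 16000  # every stream is resampled to 16000 Hz

def new_stream_buffer():
    """Independent data buffer structure for one voice stream (step 1-1)."""
    return {
        'audio_buffer': np.zeros(0, dtype=np.float32),  # received audio at 16000 Hz
        'audio_len': 0.0,                               # accumulated duration in seconds
        'threshold': 2.0,                               # threshold duration in seconds
    }

def append_audio(buf, new_audio, resample):
    """Step 1-2: resample newly received data, append it, update the duration.

    `resample` is a caller-supplied function (e.g. a wrapper around a
    resampling library) and is an assumption of this sketch."""
    samples = resample(new_audio, TARGET_SR)
    buf['audio_buffer'] = np.concatenate([buf['audio_buffer'], samples])
    buf['audio_len'] = len(buf['audio_buffer']) / TARGET_SR

def pop_if_ready(buf, audio_set, stream_id):
    """Step 1-3: once the accumulated duration reaches the threshold,
    hand the whole buffer over to the set of speech to be processed."""
    if buf['audio_len'] > buf['threshold']:
        audio_set[stream_id] = buf['audio_buffer']
        buf['audio_buffer'] = np.zeros(0, dtype=np.float32)
        buf['audio_len'] = 0.0
        return True
    return False
```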
The fast segmentation model screening in step 2 comprises the following steps:
Step 2-1, the set of speech to be processed is checked; if it is empty, the next check is awaited; otherwise step 2-2 is executed;
Step 2-2, a speech segment of the kth voice stream is selected from the set of speech to be processed; the last last_second seconds of data are divided into num_frame frames using a sliding window with a 30 ms frame length and a 10 ms frame shift, the spectral features of each frame are extracted and input into the fast segmentation model, and a probability list indicating whether each frame is a silence frame is obtained; the proportion of frames in the probability list whose probability is greater than 0.5 is counted, and when this proportion exceeds 0.4 the speech segment is judged to meet the conditions.
Dividing the last last_second seconds of data into num_frame frames with a sliding window of 30 ms frame length and 10 ms frame shift and extracting the spectral features of each frame, as described in step 2-2, specifically comprises the following steps:
Step 2-2-1, a Hamming window is applied to the last last_second seconds of speech data for framing, with a frame length of 30 ms and a frame shift of 10 ms, yielding num_frame frames in total; the signal amplitude of each frame is x_i(r), where i denotes the i-th frame and r denotes the r-th sampling point within the frame;
Step 2-2-2, a fast Fourier transform is applied to the speech data of each frame and the energy spectrum is computed, giving the frequency-domain signal X_i(k) and the energy spectrum E_i(k) = |X_i(k)|^2 of each frame, where i denotes the i-th frame, k denotes the k-th frequency point, and r denotes the r-th sampling point;
Step 2-2-3, the energy spectrum is passed through a Mel filter bank to obtain the M-dimensional Mel spectrum features S_i(m), where H_m(k) denotes the band-pass triangular filters of the Mel filter bank, M denotes the number of filters, and m denotes the m-th filter;
Step 2-2-4, a discrete cosine transform is applied to the logarithm of the Mel spectrum features to obtain the Mel-frequency cepstral coefficients (MFCC) C_i(n), where n is the frequency point after the DCT; for simplicity the value range of n is taken to be the same as that of m, and C_i(n) is the spectral feature of the i-th frame.
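A minimal NumPy sketch of the feature extraction in steps 2-2-1 to 2-2-4 follows. The triangular Mel filter construction and the small constant added before the logarithm are standard implementation choices, not details taken from this description; M = 13 follows the embodiment below.

```python
import numpy as np
from scipy.fftpack import dct  # DCT-II for the cepstral step

SR = 16000
FRAME_LEN = int(0.030 * SR)    # 30 ms -> 480 samples
FRAME_SHIFT = int(0.010 * SR)  # 10 ms -> 160 samples

def mel_filterbank(num_filters, n_fft, sr):
    """Triangular band-pass filters H_m(k) on the Mel scale (standard construction)."""
    low_mel, high_mel = 0.0, 2595.0 * np.log10(1.0 + (sr / 2) / 700.0)
    mel_points = np.linspace(low_mel, high_mel, num_filters + 2)
    hz_points = 700.0 * (10.0 ** (mel_points / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_points / sr).astype(int)
    fbank = np.zeros((num_filters, n_fft // 2 + 1))
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc_features(signal, num_filters=13):
    """Steps 2-2-1 to 2-2-4: Hamming-windowed framing, FFT energy spectrum,
    Mel filtering, then DCT of the log Mel spectrum."""
    num_frame = max(0, 1 + (len(signal) - FRAME_LEN) // FRAME_SHIFT)
    window = np.hamming(FRAME_LEN)
    fbank = mel_filterbank(num_filters, FRAME_LEN, SR)
    feats = np.zeros((num_frame, num_filters))
    for i in range(num_frame):
        frame = signal[i * FRAME_SHIFT: i * FRAME_SHIFT + FRAME_LEN] * window   # x_i(r)
        spectrum = np.fft.rfft(frame, n=FRAME_LEN)                              # X_i(k)
        energy = np.abs(spectrum) ** 2                                          # E_i(k)
        mel_energy = fbank @ energy                                             # S_i(m)
        feats[i] = dct(np.log(mel_energy + 1e-10), type=2, norm='ortho')[:num_filters]  # C_i(n)
    return feats  # shape: num_frame x M
```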
The fast segmentation model described in step 2-2 is a lightweight binary classification model based on a one-dimensional convolutional neural network (as shown in fig. 2); its network structure comprises:
an input layer, receiving a num_frame × M MFCC feature matrix, where M is the number of Mel filters;
a one-dimensional convolution layer, using 32 convolution kernels of width 5 and stride 1 to perform one-dimensional convolution along the time axis, producing a feature map with 32 channels;
a max pooling layer, with a pooling window of size 2 and a stride of 2, halving the time dimension while keeping 32 channels;
a flatten layer, flattening the feature map into a one-dimensional vector;
a fully connected layer, whose activation function is ReLU;
an output layer, which outputs probability values through a Sigmoid activation function, representing the probability that each of the input num_frame frames is not a silence frame.
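The following PyTorch sketch is one possible realization of the lightweight one-dimensional CNN described above. The hidden fully connected layer size (64) is an assumption, since the neuron count is not given here, and the output is written as one probability per input frame, which is how the probability list is consumed in step 2-2.

```python
import torch
import torch.nn as nn

class FastSegmentationModel(nn.Module):
    """Lightweight 1-D CNN screening model (sketch).

    Input:  (batch, num_frame, M) MFCC features.
    Output: (batch, num_frame) probabilities that each frame is not silence."""

    def __init__(self, num_frame: int, num_mels: int = 13, hidden: int = 64):
        super().__init__()
        # 32 kernels of width 5, stride 1, convolving along the time axis;
        # padding=2 keeps the time dimension at num_frame.
        self.conv = nn.Conv1d(num_mels, 32, kernel_size=5, stride=1, padding=2)
        self.pool = nn.MaxPool1d(kernel_size=2, stride=2)    # halves the time axis
        self.flatten = nn.Flatten()
        self.fc = nn.Linear((num_frame // 2) * 32, hidden)   # hidden size is an assumption
        self.out = nn.Linear(hidden, num_frame)              # one probability per frame
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.transpose(1, 2)               # (batch, M, num_frame) for Conv1d
        x = self.relu(self.conv(x))
        x = self.pool(x)
        x = self.flatten(x)
        x = self.relu(self.fc(x))
        return torch.sigmoid(self.out(x))   # per-frame "not silence" probabilities
```

In use, the screening rule of step 2-2 counts the fraction of these per-frame probabilities above 0.5 and compares it with 0.4.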
Step 3 comprises the following steps:
Step 3-1, the proportion of frames in the probability list whose silence probability is greater than 0.5 is counted; when this proportion does not exceed 0.4, the kth voice stream corresponding to the speech segment is selected, and the speech segment is spliced end to end with the most recently received data in the buffer of the corresponding voice stream to form new buffer data;
Step 3-2, the threshold duration is updated according to the formula: new threshold duration = current buffered audio duration + 0.3 seconds, thereby avoiding repeated triggering of the fast segmentation model screening step.
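A small sketch of steps 3-1 and 3-2, assuming the buffer dictionary from the earlier sketch; putting the rejected segment in front of the newly received data is how the "end to end" splicing above is read here.

```python
import numpy as np

def return_to_buffer(buf, rejected_segment, sample_rate=16000):
    """Steps 3-1 and 3-2: splice a segment that failed the screening condition
    back in front of the newly received data and raise the threshold."""
    buf['audio_buffer'] = np.concatenate([rejected_segment, buf['audio_buffer']])
    buf['audio_len'] = len(buf['audio_buffer']) / sample_rate
    # New threshold = current buffered duration + 0.3 s, so the screening step
    # is not re-triggered immediately on essentially the same data (step 3-2).
    buf['threshold'] = buf['audio_len'] + 0.3
```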
The network structure of the high-precision segmentation model in step 4 (as shown in fig. 3) is as follows:
an input layer, which receives the MFCC feature sequence of the whole speech, with dimension T × M, where T is the number of frames obtained by applying a sliding window with a 30 ms frame length and a 10 ms frame shift to the whole speech, and M is the number of Mel filters;
two bidirectional LSTM layers, each containing 128 hidden units, with an output dimension of T × 256;
a fully connected layer with 64 neurons, whose activation function is ReLU;
an output layer, which outputs a T-dimensional probability sequence through a Sigmoid activation function, representing the probability that each time frame is not a silence frame;
and a boundary decision module, which judges a point to be a boundary decision point when its probability value is greater than 0.7 and is a local maximum.
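Below is a PyTorch sketch of the high-precision model together with its boundary decision rule. The two bidirectional LSTM layers are realized with a stacked nn.LSTM, and the local-maximum test is implemented as a comparison with the two neighbouring frames; both are this sketch's reading of the description rather than details fixed by it.

```python
import torch
import torch.nn as nn

class HighPrecisionSegmentationModel(nn.Module):
    """Two bidirectional LSTM layers (128 hidden units per direction),
    a 64-unit ReLU layer and a per-frame sigmoid output (sketch)."""

    def __init__(self, num_mels: int = 13):
        super().__init__()
        self.lstm = nn.LSTM(input_size=num_mels, hidden_size=128,
                            num_layers=2, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(256, 64)   # T x 256 -> T x 64
        self.out = nn.Linear(64, 1)    # T-dimensional probability sequence
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, M) MFCC sequence of the whole speech segment
        h, _ = self.lstm(x)                             # (batch, T, 256)
        h = self.relu(self.fc(h))
        return torch.sigmoid(self.out(h)).squeeze(-1)   # (batch, T)

def boundary_points(probs: torch.Tensor, thresh: float = 0.7):
    """Boundary decision: frames whose probability exceeds 0.7 and is a local maximum."""
    p = probs.tolist()
    return [t for t in range(1, len(p) - 1)
            if p[t] > thresh and p[t] >= p[t - 1] and p[t] >= p[t + 1]]
```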
Step 4 specifically comprises the following steps:
Step 4-1, the complete MFCC features of the input speech segment are extracted, with a frame length of 30 ms and a frame shift of 10 ms;
Step 4-2, a boundary probability sequence is obtained through the high-precision segmentation model;
Step 4-3, the speech is segmented at the probability peak points to obtain a set of audio segments in which someone is speaking, where k denotes the kth voice stream and L denotes the total number of audio segments contained in the set; if no speech is detected in the input speech segment, an empty set is output.
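Putting steps 4-1 to 4-3 together, the following sketch assumes the mfcc_features, HighPrecisionSegmentationModel and boundary_points helpers from the earlier sketches. Returning the audio before the last boundary as completed pieces and the trailing audio as the remainder for step 5-2 is an interpretation, not something the description fixes.

```python
import numpy as np
import torch

FRAME_SHIFT_SAMPLES = 160  # 10 ms at 16000 Hz

def segment_speech(model, speech):
    """Steps 4-1 to 4-3: full MFCC extraction, boundary probability sequence,
    then cutting at the probability peak points.

    Returns (pieces, remainder): the audio pieces before the last boundary
    and the trailing data after it (empty pieces list if no boundary is found)."""
    feats = torch.tensor(mfcc_features(speech), dtype=torch.float32).unsqueeze(0)
    with torch.no_grad():
        probs = model(feats)[0]                          # step 4-2
    cuts = [t * FRAME_SHIFT_SAMPLES for t in boundary_points(probs)]
    if not cuts:                                         # no speech boundary: empty set
        return [], speech
    pieces, start = [], 0
    for c in cuts:                                       # step 4-3: split at each peak
        pieces.append(speech[start:c])
        start = c
    return [p for p in pieces if len(p) > 0], speech[start:]
```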
Step 5 comprises the following steps:
Step 5-1, whether the set of audio segments with speech output by the high-precision segmentation model is empty is checked:
if the set is not empty, the audio segments (i = 1, 2, ..., L) in the set are sent to the speech recognition system (as in fig. 4) in sequence and step 5-2 is executed;
if the set is empty, step 5-3 is executed directly;
Step 5-2, the remaining data in the original input speech segment that follows the audio segments in the set is extracted; the remaining data is spliced end to end with the most recently received data in the buffer of the corresponding voice stream to form new buffer data, and the audio duration is updated;
Step 5-3, the threshold duration is updated:
when the set of audio segments is not empty, the threshold duration is updated according to the formula: threshold duration = max(2.0, average_duration + 0.5), where average_duration is the average duration (in seconds) of the audio segments in the set;
when the set of audio segments is empty, the threshold duration remains unchanged;
Step 5-4, the audio duration is reset to the actual duration of the buffer after the current splicing.
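The sketch below covers steps 5-1 to 5-4 on top of the earlier buffer sketch. The send_to_recognizer callable stands in for the speech recognition interface and is purely illustrative, and writing the data back when nothing was detected is an assumption; the description above only says that step 5-2 is skipped in that case.

```python
import numpy as np

def handle_result(buf, remainder, speech_pieces, send_to_recognizer, sample_rate=16000):
    """Steps 5-1 to 5-4: forward detected speech, write the remainder back into
    the stream buffer and update the threshold duration."""
    if speech_pieces:                                    # step 5-1: set is not empty
        for piece in speech_pieces:
            send_to_recognizer(piece)                    # downstream recognition system
        # Step 5-3: threshold = max(2.0, average segment duration + 0.5)
        average_duration = float(np.mean([len(p) for p in speech_pieces])) / sample_rate
        buf['threshold'] = max(2.0, average_duration + 0.5)
    # Step 5-2 (and the empty-set case, as an assumption): splice the remaining
    # data in front of the newly received data in the stream buffer.
    buf['audio_buffer'] = np.concatenate([remainder, buf['audio_buffer']])
    buf['audio_len'] = len(buf['audio_buffer']) / sample_rate   # step 5-4
```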
The fast segmentation model and the high-precision segmentation model are both trained models.
Examples:
In this embodiment, an air traffic control ground-air communication voice recorder system is taken as an example to describe the application of the method to real-time segmentation of multiple voice streams. With reference to the schematic diagram in fig. 1, the specific implementation steps are as follows:
step 1, establishing a multipath voice stream buffer management mechanism:
The system establishes an independent buffer structure for each radio channel and defines the structure of the kth voice stream as follows:
{
    'audio_buffer': np.array([]),  # audio data resampled to 16000 Hz
    'audio_len': 0.0,              # accumulated duration (seconds)
    'threshold': 2.0               # initial trigger threshold (seconds)
}
When new audio data arrives, it is resampled to the 16000 Hz sampling rate and appended to the corresponding audio_buffer; the duration is updated as audio_len = len(audio_buffer) / 16000, and when audio_len > threshold the complete buffer data is packed into the set to be processed, audio_set.
Step 2, dynamic triggering of the fast segmentation model (corresponding to fig. 2):
a. The audio of the last 1.5 seconds is extracted for feature analysis:
framing with a frame length of 30 ms and a frame shift of 10 ms;
computing 13-dimensional MFCC features (M = 13);
b. silence detection is performed by the lightweight model;
c. high-precision processing is triggered when the silence frame proportion is less than or equal to 40%.
Step 3, dynamically adjusting the buffer:
If not triggered, the current data is concatenated with the preceding buffer data and the threshold is updated by the formula threshold_new = current_len + 0.3. For example, when the original buffer duration is 2.1 seconds and the trigger condition is not met, the new threshold is set to 2.1 + 0.3 = 2.4 seconds.
Step 4, high-precision speech segmentation (corresponding to fig. 3):
the complete audio segment is processed using the bidirectional LSTM model.
Step 5, the results are output and linked with downstream systems (corresponding to fig. 4):
the valid speech segments are pushed to the speech recognition engine, and the remaining data is written back to the buffer.
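To show how the pieces of this embodiment fit together for one radio channel, here is a schematic per-stream processing pass; fast_model and precise_model are instances of the earlier model sketches (with num_frame matching the 1.5-second window), the helper functions are the sketches given above, and the less-than-or-equal-to-40% silence-proportion trigger follows item c of step 2 of this embodiment.

```python
import torch

def process_stream(buf, audio_set, stream_id, fast_model, precise_model,
                   send_to_recognizer):
    """One pass of the dual-model pipeline for a single radio channel (sketch)."""
    if stream_id not in audio_set:          # nothing has reached the threshold yet
        return
    segment = audio_set.pop(stream_id)

    # Step 2: fast screening on the last 1.5 seconds of the segment.
    tail = segment[-int(1.5 * 16000):]
    feats = torch.tensor(mfcc_features(tail), dtype=torch.float32).unsqueeze(0)
    with torch.no_grad():
        probs = fast_model(feats)[0]        # per-frame "not silence" probabilities
    silence_ratio = float((probs <= 0.5).float().mean())
    if silence_ratio > 0.4:                 # not triggered: return data to the buffer (step 3)
        return_to_buffer(buf, segment)
        return

    # Steps 4 and 5: high-precision segmentation and result handling.
    pieces, remainder = segment_speech(precise_model, segment)
    handle_result(buf, remainder, pieces, send_to_recognizer)
```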
This embodiment ensures high-precision segmentation through the cooperative work of the two-stage models (as shown in the flow of fig. 1) while significantly reducing the computational resource requirement compared with a traditional single-model scheme, which makes it particularly suitable for scenarios such as air traffic control that require real-time processing of multiple voice streams.
The invention provides a voice stream segmentation method based on dual-model dynamic triggering, and there are many methods and ways to implement this technical scheme; the above is only a preferred embodiment of the invention, and it should be pointed out that those skilled in the art can make a number of improvements and modifications without departing from the principle of the invention, and such improvements and modifications also fall within the protection scope of the invention. Components not explicitly described in this embodiment can be implemented using the prior art.
Claims (7)
1. A voice stream segmentation method based on dual-model dynamic triggering, characterized by comprising the following steps:
Step 1, constructing a data stream buffer management mechanism for multiple voice streams, establishing an independent processing channel for each voice stream, and forming the voice data accumulated to a threshold duration into a set of speech to be processed;
Step 2, screening and analyzing the set of speech to be processed with a fast segmentation model, selecting the voice segments that meet the conditions and outputting them to a high-precision segmentation model;
Step 3, according to the screening result of the fast segmentation model, splicing the data that do not meet the conditions with the data in the data stream buffer and adjusting the threshold duration of the buffer corresponding to the voice segment;
Step 4, processing the voice segments screened out by the fast segmentation model with the high-precision segmentation model;
Step 5, outputting the segmented audio segments to a speech recognition system according to the processing result, splicing the remaining data with the data in the data stream buffer, and updating the threshold duration of the corresponding buffer;
The fast segmentation model screening in step 2 specifically comprises the following steps:
Step 2-1, the set of speech to be processed is checked; if it is empty, the next check is awaited; otherwise step 2-2 is executed;
Step 2-2, a speech segment of the kth voice stream is selected from the set of speech to be processed; the last last_second seconds of data are divided into num_frame frames through a sliding window, the spectral features of each frame are extracted and input into the fast segmentation model, and a probability list indicating whether each frame is a silence frame is obtained; the proportions in the probability list are counted to judge whether the speech segment meets the conditions;
The fast segmentation model in step 2-2 is a lightweight binary classification model based on a one-dimensional convolutional neural network, and its network structure comprises:
an input layer, receiving a num_frame × M MFCC feature matrix, where M is the number of Mel filters;
a one-dimensional convolution layer, using 32 convolution kernels of width 5 and stride 1 to perform one-dimensional convolution along the time axis, producing a feature map with 32 channels;
a max pooling layer, with a pooling window of size 2 and a stride of 2, halving the time dimension while keeping 32 channels;
a flatten layer, flattening the feature map into a one-dimensional vector;
a fully connected layer, whose activation function is ReLU;
an output layer, which outputs probability values through a Sigmoid activation function, representing the probability that each of the input num_frame frames is not a silence frame;
The network structure of the high-precision segmentation model in step 4 is as follows:
an input layer, which receives the MFCC feature sequence of the whole speech, with dimension T × M, where T is the number of frames obtained by applying a sliding window with a 30 ms frame length and a 10 ms frame shift to the whole speech, and M is the number of Mel filters;
two bidirectional LSTM layers, each containing 128 hidden units, with an output dimension of T × 256;
a fully connected layer with 64 neurons, whose activation function is ReLU;
an output layer, which outputs a T-dimensional probability sequence through a Sigmoid activation function, representing the probability that each time frame is not a silence frame;
and a boundary decision module, which judges a point to be a boundary decision point when its probability value is greater than 0.7 and is a local maximum.
2. The voice stream segmentation method based on dual-model dynamic triggering as claimed in claim 1, wherein constructing the data stream buffer management mechanism for multiple voice streams in step 1 comprises the following steps:
Step 1-1, a separate data buffer structure is established for each voice stream; the data buffer structure of the kth voice stream comprises the audio data that has been received and resampled, the duration of the audio that has been received, and the threshold duration;
Step 1-2, when new data is received, it is stored into the data buffer structure of the corresponding voice stream: the new audio data is resampled and appended to the buffered audio data, after which the audio duration is updated;
Step 1-3, the audio duration is compared with the threshold duration; when the audio duration reaches the threshold duration, all voice data in the buffer structure is output to the set of speech to be processed; otherwise, step 1-2 is executed.
3. The voice stream segmentation method based on dual-model dynamic triggering as claimed in claim 1, wherein dividing the last last_second seconds of data into num_frame frames through a sliding window and extracting the spectral features of each frame in step 2-2 is specifically as follows:
Step 2-2-1, the last last_second seconds of speech data are framed by applying a window, yielding num_frame frames; the signal amplitude of each frame is x_i(r), where i denotes the i-th frame and r denotes the r-th sampling point;
Step 2-2-2, a fast Fourier transform is applied to the speech data of each frame and the energy spectrum is computed, giving the frequency-domain signal X_i(k) and the energy spectrum E_i(k) of each frame, where i denotes the i-th frame, k denotes the k-th frequency point, and r denotes the r-th sampling point;
Step 2-2-3, the energy spectrum is passed through a Mel filter bank to obtain the M-dimensional Mel spectrum features S_i(m), where H_m(k) denotes the band-pass triangular filters of the Mel filter bank, M denotes the number of filters, and m denotes the m-th filter;
Step 2-2-4, a discrete cosine transform is applied to the logarithm of the Mel spectrum features to obtain the Mel-frequency cepstral coefficients C_i(n), where n is the frequency point after the DCT and the value range of n is the same as that of m; C_i(n) is the spectral feature of the i-th frame.
4. The voice stream segmentation method based on dual-model dynamic triggering as claimed in claim 1, wherein step 3 comprises the following steps:
Step 3-1, the proportion of frames in the probability list whose silence probability is greater than 0.5 is counted; when this proportion does not exceed 0.4, the kth voice stream corresponding to the speech segment is selected, and the speech segment is spliced end to end with the most recently received data in the buffer of the corresponding voice stream to form new buffer data;
Step 3-2, the threshold duration is updated according to the formula: new threshold duration = current buffered audio duration + 0.3 seconds.
5. The voice stream segmentation method based on dual-model dynamic triggering as claimed in claim 4, wherein step 4 specifically comprises the following steps:
Step 4-1, the complete MFCC features of the input speech segment are extracted;
Step 4-2, a boundary probability sequence is obtained through the high-precision segmentation model;
Step 4-3, the speech is segmented at the probability peak points to obtain a set of audio segments in which someone is speaking, where k denotes the kth voice stream and L denotes the total number of audio segments contained in the set; if no speech is detected in the input speech segment, an empty set is output.
6. The voice stream segmentation method based on dual-model dynamic triggering as claimed in claim 1, wherein step 5 comprises the following steps:
Step 5-1, whether the set of audio segments with speech output by the high-precision segmentation model is empty is checked:
if the set of audio segments is not empty, the audio segments (i = 1, 2, ..., L) in the set are sent to the speech recognition system in sequence and step 5-2 is executed;
if the set of audio segments is empty, step 5-3 is executed directly;
Step 5-2, the remaining data in the original input speech segment that follows the audio segments in the set is extracted; the remaining data is spliced end to end with the most recently received data in the buffer of the corresponding voice stream to form new buffer data, and the audio duration is updated;
Step 5-3, the threshold duration is updated:
when the set of audio segments is not empty, the threshold duration is updated according to the formula: threshold duration = max(2.0, average_duration + 0.5), where average_duration is the average duration (in seconds) of the audio segments in the set;
when the set of audio segments is empty, the threshold duration remains unchanged;
Step 5-4, the audio duration is reset to the actual duration of the buffer after the current splicing.
7. The voice stream segmentation method based on dual-model dynamic triggering according to claim 1, wherein the fast segmentation model and the high-precision segmentation model are both trained models.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202510726884.7A CN120260546B (en) | 2025-06-03 | 2025-06-03 | A voice stream segmentation method based on dual-model dynamic triggering |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202510726884.7A CN120260546B (en) | 2025-06-03 | 2025-06-03 | A voice stream segmentation method based on dual-model dynamic triggering |
Publications (2)
Publication Number | Publication Date |
---|---|
CN120260546A CN120260546A (en) | 2025-07-04 |
CN120260546B (en) | 2025-09-23
Family
ID=96187627
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202510726884.7A Active CN120260546B (en) | 2025-06-03 | 2025-06-03 | A voice stream segmentation method based on dual-model dynamic triggering |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN120260546B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101021854A (en) * | 2006-10-11 | 2007-08-22 | 鲍东山 | Audio analysis system based on content |
CN108766419A (en) * | 2018-05-04 | 2018-11-06 | 华南理工大学 | A kind of abnormal speech detection method based on deep learning |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1174374C (en) * | 1999-06-30 | 2004-11-03 | 国际商业机器公司 | Method for Concurrent Speech Recognition, Speaker Segmentation, and Classification |
US8756061B2 (en) * | 2011-04-01 | 2014-06-17 | Sony Computer Entertainment Inc. | Speech syllable/vowel/phone boundary detection using auditory attention cues |
WO2020111676A1 (en) * | 2018-11-28 | 2020-06-04 | 삼성전자 주식회사 | Voice recognition device and method |
CN113012684B (en) * | 2021-03-04 | 2022-05-31 | 电子科技大学 | Synthesized voice detection method based on voice segmentation |
CN114283792B (en) * | 2021-12-13 | 2025-06-20 | 亿嘉和科技股份有限公司 | Method and device for identifying the opening and closing sound of grounding knife switch |
CN114187898A (en) * | 2021-12-31 | 2022-03-15 | 电子科技大学 | An End-to-End Speech Recognition Method Based on Fusion Neural Network Structure |
CN119132328A (en) * | 2023-06-13 | 2024-12-13 | 腾讯科技(深圳)有限公司 | A voice processing method, device, equipment, medium and program product |
CN117238279A (en) * | 2023-09-04 | 2023-12-15 | 中国电子科技集团公司第二十八研究所 | A method of segmenting regulatory speech based on speech recognition and endpoint detection |
CN118968970B (en) * | 2024-07-15 | 2025-06-20 | 广州市中南民航空管通信网络科技有限公司 | A voice segmentation method and system for air traffic control voice recorder |
CN119673171B (en) * | 2025-02-17 | 2025-04-25 | 深圳十方融海科技有限公司 | Speech recognition feature extraction and reasoning method for artificial intelligence dialogue system |
- 2025-06-03: Application CN202510726884.7A filed (CN); granted as CN120260546B, status Active
Also Published As
Publication number | Publication date |
---|---|
CN120260546A (en) | 2025-07-04 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant |