
CN120260546B - A voice stream segmentation method based on dual-model dynamic triggering - Google Patents

A voice stream segmentation method based on dual-model dynamic triggering

Info

Publication number
CN120260546B
CN120260546B (application CN202510726884.7A)
Authority
CN
China
Prior art keywords
voice
data
frame
audio
segmentation model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202510726884.7A
Other languages
Chinese (zh)
Other versions
CN120260546A (en)
Inventor
汤闻易
刘泽原
张阳
徐珂
丁辉
张明伟
唐敏敏
张翔
田靖
王凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 28 Research Institute
Original Assignee
CETC 28 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 28 Research Institute filed Critical CETC 28 Research Institute
Priority to CN202510726884.7A priority Critical patent/CN120260546B/en
Publication of CN120260546A publication Critical patent/CN120260546A/en
Application granted granted Critical
Publication of CN120260546B publication Critical patent/CN120260546B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Telephonic Communication Services (AREA)
  • Time-Division Multiplex Systems (AREA)

Abstract

The invention discloses a voice stream segmentation method based on dual-model dynamic triggering, which comprises the following steps: step 1, construct a data stream buffer management mechanism for multiple voice streams, establish an independent processing channel for each voice stream, and form the voice data accumulated up to a threshold duration into a to-be-processed voice set; step 2, screen and analyze the to-be-processed voice set with a rapid segmentation model, select the voice segments that meet the condition, and output them to a high-precision segmentation model; step 3, according to the screening result of the rapid segmentation model, splice the data that do not meet the condition with the data in the data stream buffer and adjust the threshold duration of the buffer corresponding to the voice segment; step 4, process the voice segments screened by the rapid segmentation model with the high-precision segmentation model; step 5, according to the processing result, output the segmented voice segments to other systems such as speech recognition, splice the remaining data with the data in the data stream buffer, and update the threshold duration of the corresponding buffer.

Description

Voice stream segmentation method based on dual-model dynamic triggering
Technical Field
The invention relates to the field of speech processing, and in particular to a voice stream segmentation method based on dual-model dynamic triggering.
Background
In the field of speech processing, speech recognition models are classified into streaming and non-streaming types, the latter having much higher recognition accuracy than the former. In scenarios that require speech recognition of a voice stream, the stream can be segmented into multiple voice segments with a voice stream segmentation technique, so that a non-streaming speech recognition model can be used to recognize the speech accurately and provide accurate text for subsequent processing. As a basic technology in the field of speech processing, voice stream segmentation has important application value in scenarios such as intelligent customer-service systems, multi-party conference transcription, and real-time speech analysis. With the exponential growth of real-time speech processing demands, the prior art faces a key technical bottleneck: processing efficiency and segmentation precision are difficult to reconcile in high-concurrency scenarios.
The current mainstream technical solutions have the following defects. 1. The sliding-window voice activity detection scheme uses an energy detection method with a fixed threshold; it offers millisecond-level real-time performance but is highly sensitive to environmental noise, with a misjudgment rate exceeding 35% in low signal-to-noise-ratio scenarios. 2. The end-to-end deep learning model scheme achieves a segmentation accuracy above 90% through a neural network model, but the computational complexity of the model makes inference time grow linearly with speech duration, so the efficiency of processing long speech drops sharply.
Disclosure of Invention
The technical problem that the invention aims to solve is to provide a voice stream segmentation method based on dual-model dynamic triggering, addressing the defects of the prior art.
In order to solve the above technical problem, the invention discloses a voice stream segmentation method based on dual-model dynamic triggering, which comprises the following steps:
Step 1, constructing a data stream buffer management mechanism of multiple voice streams, establishing an independent processing channel for each voice stream, and forming voice data accumulated to a threshold duration into a voice set to be processed;
Step 2, screening, analyzing and processing a voice set to be processed through a rapid segmentation model, selecting voice fragments meeting the conditions and outputting the voice fragments to a high-precision segmentation model;
Step 3, according to the screening result of the rapid segmentation model, splicing the data which do not meet the conditions with the data in the data stream buffer, and adjusting the threshold time of the buffer zone corresponding to the voice fragment;
Step 4, processing the voice fragments screened by the rapid segmentation model by using the high-precision segmentation model;
Step 5, outputting the segmented audio segments to other systems such as speech recognition according to the processing result, splicing the remaining data with the data in the data stream buffer, and updating the threshold duration of the corresponding buffer.
The data stream buffer management mechanism for constructing multiple voice streams in step 1 comprises the following steps:
Step 1-1: establish a separate data buffer structure for each voice stream. The data buffer structure of the kth voice stream contains the audio data audio_buffer that has been received and resampled to a 16000 Hz sampling rate, the duration audio_len of the audio that has been received (in seconds), and the threshold duration threshold (in seconds);
Step 1-2: after new data is received, store it in the data buffer structure of the corresponding voice stream: resample the new audio data to the 16000 Hz sampling rate, append it to the audio data audio_buffer, and update the audio duration audio_len;
Step 1-3: compare the audio duration audio_len with the threshold duration threshold; when audio_len reaches threshold, output all the voice data audio_buffer in the buffer structure to the to-be-processed voice set audio_set; otherwise, execute step 1-2.
The rapid segmentation model screening in the step 2 comprises the following steps:
Step 2-1: check whether the to-be-processed voice set audio_set is empty; if it is empty, wait for the next check, otherwise execute step 2-2;
Step 2-2: select the voice segment of the kth voice stream from the to-be-processed voice set audio_set, divide its last last_second seconds of data into num_frame frames with a sliding window of 30 ms frame length and 10 ms frame shift, extract the spectral feature of each frame, and input the features into the rapid segmentation model to obtain a probability list indicating whether each frame is a silence frame; count the proportion of frames in the probability list whose probability is greater than 0.5, and when this proportion exceeds 0.4 the voice segment is judged to meet the condition.
Dividing the last last_second seconds of data into num_frame frames with a sliding window of 30 ms frame length and 10 ms frame shift and extracting the spectral feature of each frame, as described in step 2-2, is performed as follows:
Step 2-2-1: apply a Hamming window to the speech data of the last last_second seconds and divide them into frames with a frame length of 30 ms and a frame shift of 10 ms, obtaining num_frame frames in total; the signal amplitude of each frame is x_i(r), where i denotes the ith frame and r denotes the rth sampling point;
Step 2-2-2: perform a fast Fourier transform on the speech data of each frame and compute the energy spectrum, obtaining the frequency-domain signal X_i(k) = Σ_r x_i(r)·e^(−j2πkr/N) and the energy spectrum E_i(k) = |X_i(k)|² of each frame, where i denotes the ith frame, k denotes the kth frequency point, r denotes the rth sampling point, and N is the number of sampling points in a frame;
Step 2-2-3: pass the energy spectrum through a Mel filter bank to obtain the M-dimensional Mel spectrum feature S_i(m) = Σ_k E_i(k)·H_m(k), where H_m(k) denotes the band-pass triangular filters of the Mel filter bank, M denotes the number of filters, and m denotes the mth filter;
Step 2-2-4: apply a discrete cosine transform to the logarithm of the Mel spectrum feature to obtain the Mel-frequency cepstral coefficients (MFCC) C_i(n) = Σ_m log(S_i(m))·cos(πn(m−0.5)/M), where n is the frequency point after the DCT; for simplicity the value range of n is taken to be the same as that of m, and C_i(n) is the spectral feature of the ith frame.
The rapid segmentation model in step 2-2 is a lightweight binary classification model based on a one-dimensional convolutional neural network, and its network structure comprises:
Input layer: receives the num_frame × M MFCC feature matrix, where M is the number of Mel filters;
One-dimensional convolution layer: performs one-dimensional convolution along the time axis using 32 convolution kernels of width 5 with a stride of 1, producing a 32-channel feature map;
Maximum pooling layer: pooling window size 2, stride 2, halving the time dimension of the 32-channel feature map;
Flattening layer: flattens the feature map into a 1-dimensional vector;
Fully connected layer: the flattened vector passes through a fully connected layer whose activation function is ReLU;
Output layer: a single node, through a Sigmoid activation function, outputs a probability value representing the probability that an input frame is not a silence frame.
Step 3 comprises the following steps:
Step 3-1: when the proportion of frames in the probability list whose probability is greater than 0.5 does not exceed 0.4, select the kth voice stream corresponding to the voice segment and splice the voice segment end-to-end with the newest data received in the buffer of that voice stream to form new buffer data;
Step 3-2: update the threshold duration according to the formula threshold = audio_len + 0.3, where audio_len is the duration of the new buffer data, thereby avoiding repeated triggering of the rapid segmentation model screening step.
The network structure of the high-precision segmentation model in the step 4 is as follows:
Input layer: receives the MFCC feature sequence of the whole speech, with dimension T × M, where T is the number of frames obtained after the whole speech passes through a sliding window with a frame length of 30 ms and a frame shift of 10 ms, and M is the number of Mel filters;
Two bidirectional LSTM layers, each with 128 hidden units; the output dimension is T × 256;
Fully connected layer: a fully connected layer of 64 neurons whose activation function is ReLU;
Output layer: outputs a T-dimensional probability sequence through a Sigmoid activation function, representing the probability that each time frame is not a silence frame;
Boundary decision module: a time frame is judged to be a boundary decision point when its probability value is greater than 0.7 and is a local maximum.
The specific steps of the step 4 comprise the following steps:
Step 4-1: extract the complete MFCC features of the input voice segment, with a frame length of 30 ms and a frame shift of 10 ms;
step 4-2, obtaining a boundary probability sequence through a high-precision segmentation model;
Step 4-3: segment at the probability peak points to obtain the set of audio segments in which someone is speaking, where k denotes the kth voice stream and L denotes the total number of audio segments contained in the set; if no speech is detected in the input voice segment, an empty set is output.
Step 5 comprises the following steps:
Step 5-1: check whether the set of audio segments with speech output by the high-precision segmentation model is empty:
If the set is non-empty, send the audio segments of the set (i = 1, 2, ..., L) in sequence to the speech recognition system and execute step 5-2;
If the set is empty, directly execute step 5-3;
Step 5-2: extract the remaining data of the original input voice segment that follows the last audio segment of the set, splice the remaining data end-to-end with the newest data received in the buffer of the corresponding voice stream to form new buffer data, and update the audio duration audio_len;
Step 5-3: update the threshold duration threshold:
When the set of audio segments is non-empty, update according to the formula threshold = max(2.0, average_duration + 0.5), where average_duration is the average duration of the audio segments in the set;
When the set of audio segments is empty, the threshold duration threshold remains unchanged;
Step 5-4: reset the audio duration audio_len to the actual duration of the buffer after the current splicing.
The rapid segmentation model and the high-precision segmentation model are both trained.
The beneficial effects are:
1. The invention proposes a dual-model dynamic trigger mechanism that balances efficiency and precision through the dynamic cooperation of a rapid detection model and a high-precision model.
2. The invention dynamically adjusts the buffer threshold by combining the dual-model segmentation results of the audio stream, effectively suppressing invalid triggering.
3. The method can efficiently segment multiple voice streams while maintaining high segmentation precision; in multi-voice-stream recognition scenarios it can be combined with a non-streaming speech recognition model to obtain high-precision recognition results, providing auxiliary support for intelligent customer service and conference transcription programs, and therefore has practical value.
Drawings
Fig. 1 is a schematic diagram of the overall flow of the present invention.
Fig. 2 is a schematic diagram of a rapid segmentation model.
Fig. 3 is a schematic diagram of a high-precision segmentation model.
Fig. 4 is a schematic diagram of the data flow in a practical system according to the present invention.
Detailed Description
A voice stream segmentation method based on dual-model dynamic triggering (as shown in figure 1) comprises the following steps:
Step 1, constructing a data stream buffer management mechanism of multiple voice streams, establishing an independent processing channel for each voice stream, and forming voice data accumulated to a threshold duration into a voice set to be processed;
Step 2, screening, analyzing and processing a voice set to be processed through a rapid segmentation model, selecting voice fragments meeting the conditions and outputting the voice fragments to a high-precision segmentation model;
Step 3, according to the screening result of the rapid segmentation model, splicing the data which do not meet the conditions with the data in the data stream buffer, and adjusting the threshold time of the buffer zone corresponding to the voice fragment;
Step 4, processing the voice fragments screened by the rapid segmentation model by using the high-precision segmentation model;
Step 5, outputting the segmented audio segments to other systems such as speech recognition according to the processing result, splicing the remaining data with the data in the data stream buffer, and updating the threshold duration of the corresponding buffer.
The data stream buffer management mechanism for constructing multiple voice streams in step 1 comprises the following steps:
Step 1-1: establish a separate data buffer structure for each voice stream. The data buffer structure of the kth voice stream contains the audio data audio_buffer that has been received and resampled to a 16000 Hz sampling rate, the duration audio_len of the audio that has been received (in seconds), and the threshold duration threshold (in seconds);
Step 1-2: after new data is received, store it in the data buffer structure of the corresponding voice stream: resample the new audio data to the 16000 Hz sampling rate, append it to the audio data audio_buffer, and update the audio duration audio_len;
Step 1-3: compare the audio duration audio_len with the threshold duration threshold; when audio_len reaches threshold, output all the voice data audio_buffer in the buffer structure to the to-be-processed voice set audio_set; otherwise, execute step 1-2.
The rapid segmentation model screening in the step 2 comprises the following steps:
Step 2-1: check whether the to-be-processed voice set audio_set is empty; if it is empty, wait for the next check, otherwise execute step 2-2;
Step 2-2: select the voice segment of the kth voice stream from the to-be-processed voice set audio_set, divide its last last_second seconds of data into num_frame frames with a sliding window of 30 ms frame length and 10 ms frame shift, extract the spectral feature of each frame, and input the features into the rapid segmentation model to obtain a probability list indicating whether each frame is a silence frame; count the proportion of frames in the probability list whose probability is greater than 0.5, and when this proportion exceeds 0.4 the voice segment is judged to meet the condition.
Dividing the last last_second seconds of data into num_frame frames with a sliding window of 30 ms frame length and 10 ms frame shift and extracting the spectral feature of each frame, as described in step 2-2, is performed as follows:
Step 2-2-1: apply a Hamming window to the speech data of the last last_second seconds and divide them into frames with a frame length of 30 ms and a frame shift of 10 ms, obtaining num_frame frames in total; the signal amplitude of each frame is x_i(r), where i denotes the ith frame and r denotes the rth sampling point;
Step 2-2-2: perform a fast Fourier transform on the speech data of each frame and compute the energy spectrum, obtaining the frequency-domain signal X_i(k) = Σ_r x_i(r)·e^(−j2πkr/N) and the energy spectrum E_i(k) = |X_i(k)|² of each frame, where i denotes the ith frame, k denotes the kth frequency point, r denotes the rth sampling point, and N is the number of sampling points in a frame;
Step 2-2-3: pass the energy spectrum through a Mel filter bank to obtain the M-dimensional Mel spectrum feature S_i(m) = Σ_k E_i(k)·H_m(k), where H_m(k) denotes the band-pass triangular filters of the Mel filter bank, M denotes the number of filters, and m denotes the mth filter;
Step 2-2-4: apply a discrete cosine transform to the logarithm of the Mel spectrum feature to obtain the Mel-frequency cepstral coefficients (MFCC) C_i(n) = Σ_m log(S_i(m))·cos(πn(m−0.5)/M), where n is the frequency point after the DCT; for simplicity the value range of n is taken to be the same as that of m, and C_i(n) is the spectral feature of the ith frame.
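As a concrete illustration of steps 2-2-1 to 2-2-4, a minimal numpy sketch of the feature extraction is given below; the 16 kHz rate and the 30 ms / 10 ms windowing follow the description, while the triangular filter-bank construction and all function names are illustrative assumptions rather than the patent's implementation.

import numpy as np

SR = 16000                      # sampling rate per step 1-1
FRAME_LEN = int(0.030 * SR)     # 30 ms frame length -> 480 samples
FRAME_SHIFT = int(0.010 * SR)   # 10 ms frame shift  -> 160 samples

def frame_signal(x):
    # Step 2-2-1: Hamming-windowed framing of the tail audio
    num_frame = 1 + (len(x) - FRAME_LEN) // FRAME_SHIFT
    win = np.hamming(FRAME_LEN)
    return np.stack([x[i * FRAME_SHIFT: i * FRAME_SHIFT + FRAME_LEN] * win
                     for i in range(num_frame)])

def mel_filterbank(m_filters, n_fft):
    # Triangular band-pass filters H_m(k) spaced evenly on the Mel scale (illustrative construction)
    mel_max = 2595.0 * np.log10(1.0 + (SR / 2) / 700.0)
    hz = 700.0 * (10.0 ** (np.linspace(0.0, mel_max, m_filters + 2) / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz / SR).astype(int)
    fb = np.zeros((m_filters, n_fft // 2 + 1))
    for m in range(1, m_filters + 1):
        lo, ce, hi = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, lo:ce] = (np.arange(lo, ce) - lo) / max(ce - lo, 1)
        fb[m - 1, ce:hi] = (hi - np.arange(ce, hi)) / max(hi - ce, 1)
    return fb

def mfcc_features(x, m_filters=13):
    frames = frame_signal(x)                                   # step 2-2-1
    energy = np.abs(np.fft.rfft(frames, n=FRAME_LEN)) ** 2     # step 2-2-2: E_i(k)
    mel = energy @ mel_filterbank(m_filters, FRAME_LEN).T      # step 2-2-3: S_i(m)
    log_mel = np.log(mel + 1e-10)
    n = np.arange(m_filters)
    dct = np.cos(np.pi * np.outer(n, np.arange(m_filters) + 0.5) / m_filters)
    return log_mel @ dct.T                                     # step 2-2-4: C_i(n), shape num_frame x M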
The rapid segmentation model described in step 2-2 is a lightweight binary classification model based on a one-dimensional convolutional neural network (as shown in fig. 2), and its network structure comprises:
Input layer: receives the num_frame × M MFCC feature matrix, where M is the number of Mel filters;
One-dimensional convolution layer: performs one-dimensional convolution along the time axis using 32 convolution kernels of width 5 with a stride of 1, producing a 32-channel feature map;
Maximum pooling layer: pooling window size 2, stride 2, halving the time dimension of the 32-channel feature map;
Flattening layer: flattens the feature map into a 1-dimensional vector;
Fully connected layer: the flattened vector passes through a fully connected layer whose activation function is ReLU;
Output layer: a single node, through a Sigmoid activation function, outputs a probability value representing the probability that an input frame is not a silence frame.
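One possible realization of this lightweight classifier is sketched below in PyTorch; PyTorch itself, the 'same' padding that keeps the time dimension, and the width of the fully connected layer (64, not specified in the description) are assumptions, and the class name is illustrative.

import torch
import torch.nn as nn

class FastSegmentationModel(nn.Module):
    # Lightweight 1-D CNN binary classifier over a num_frame x M MFCC matrix
    def __init__(self, num_frame, m_filters=13, fc_units=64):
        super().__init__()
        self.conv = nn.Conv1d(m_filters, 32, kernel_size=5, stride=1, padding=2)  # 32 kernels of width 5, stride 1
        self.pool = nn.MaxPool1d(kernel_size=2, stride=2)                          # pooling window 2, stride 2
        self.fc1 = nn.Linear((num_frame // 2) * 32, fc_units)                      # fully connected layer, ReLU
        self.fc2 = nn.Linear(fc_units, 1)                                          # single-node output

    def forward(self, mfcc):                   # mfcc: (batch, num_frame, M)
        x = self.conv(mfcc.transpose(1, 2))    # convolve along the time axis
        x = self.pool(x)
        x = torch.flatten(x, start_dim=1)      # flatten the feature map into a 1-D vector
        x = torch.relu(self.fc1(x))
        return torch.sigmoid(self.fc2(x))      # probability of "not a silence frame"

For a 1.5 s tail at 16 kHz the framing above yields 148 frames, so the model would be constructed as FastSegmentationModel(num_frame=148).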
Step 3 comprises the following steps:
Step 3-1: when the proportion of frames in the probability list whose probability is greater than 0.5 does not exceed 0.4, select the kth voice stream corresponding to the voice segment and splice the voice segment end-to-end with the newest data received in the buffer of that voice stream to form new buffer data;
Step 3-2: update the threshold duration according to the formula threshold = audio_len + 0.3, where audio_len is the duration of the new buffer data, thereby avoiding repeated triggering of the rapid segmentation model screening step.
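A small sketch of this splice-back and threshold bump, reusing the per-stream buffer structure from step 1-1 (keys audio_buffer, audio_len, threshold); the function name is illustrative.

import numpy as np

def handle_unqualified_segment(buffer, segment):
    # Step 3-1: put the unqualified segment back in front of the newly buffered data
    buffer['audio_buffer'] = np.concatenate([segment, buffer['audio_buffer']])
    buffer['audio_len'] = len(buffer['audio_buffer']) / 16000.0
    # Step 3-2: raise the trigger threshold so the fast screening is not re-triggered immediately
    buffer['threshold'] = buffer['audio_len'] + 0.3
    return buffer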
The network structure (as in fig. 3) of the high-precision segmentation model described in step 4 is as follows:
Input layer: receives the MFCC feature sequence of the whole speech, with dimension T × M, where T is the number of frames obtained after the whole speech passes through a sliding window with a frame length of 30 ms and a frame shift of 10 ms, and M is the number of Mel filters;
Two bidirectional LSTM layers, each with 128 hidden units; the output dimension is T × 256;
Fully connected layer: a fully connected layer of 64 neurons whose activation function is ReLU;
Output layer: outputs a T-dimensional probability sequence through a Sigmoid activation function, representing the probability that each time frame is not a silence frame;
Boundary decision module: a time frame is judged to be a boundary decision point when its probability value is greater than 0.7 and is a local maximum.
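A PyTorch sketch of this network; reading "two bidirectional LSTM layers comprising 128 hidden units" as a two-layer stacked BiLSTM is an interpretation, and the class name is illustrative.

import torch
import torch.nn as nn

class HighPrecisionSegmentationModel(nn.Module):
    # Frame-level boundary-probability model: stacked BiLSTM + per-frame sigmoid
    def __init__(self, m_filters=13):
        super().__init__()
        self.bilstm = nn.LSTM(input_size=m_filters, hidden_size=128, num_layers=2,
                              batch_first=True, bidirectional=True)   # output dimension T x 256
        self.fc = nn.Linear(256, 64)                                   # 64-neuron fully connected layer
        self.out = nn.Linear(64, 1)                                    # per-frame probability

    def forward(self, mfcc):                          # mfcc: (batch, T, M)
        h, _ = self.bilstm(mfcc)                      # (batch, T, 256)
        h = torch.relu(self.fc(h))
        return torch.sigmoid(self.out(h)).squeeze(-1)  # (batch, T) probability sequence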
The specific steps of the step 4 comprise the following steps:
Step 4-1: extract the complete MFCC features of the input voice segment, with a frame length of 30 ms and a frame shift of 10 ms;
step 4-2, obtaining a boundary probability sequence through a high-precision segmentation model;
Step 4-3: segment at the probability peak points to obtain the set of audio segments in which someone is speaking, where k denotes the kth voice stream and L denotes the total number of audio segments contained in the set; if no speech is detected in the input voice segment, an empty set is output.
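The boundary decision and the cutting at probability peaks could be realized as follows; using scipy.signal.find_peaks with height=0.7 is one way to implement the "greater than 0.7 and a local maximum" rule, the frame-to-sample mapping assumes the 10 ms shift, and returning every resulting piece is a simplification of the "segments in which someone is speaking" set.

import numpy as np
from scipy.signal import find_peaks

def segment_at_peaks(audio, boundary_probs, sr=16000, frame_shift_s=0.010):
    # Boundary decision points: probability > 0.7 and a local maximum
    peaks, _ = find_peaks(boundary_probs, height=0.7)
    if len(peaks) == 0:
        return []                                          # no boundaries detected -> empty set
    cuts = [int(p * frame_shift_s * sr) for p in peaks]    # frame index -> sample index
    edges = [0] + cuts + [len(audio)]
    return [audio[a:b] for a, b in zip(edges[:-1], edges[1:]) if b > a]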
Step 5 comprises the following steps:
Step 5-1: check whether the set of audio segments with speech output by the high-precision segmentation model is empty:
If the set is non-empty, send the audio segments of the set (i = 1, 2, ..., L) in sequence to the speech recognition system (as in fig. 4) and execute step 5-2;
If the set is empty, directly execute step 5-3;
Step 5-2: extract the remaining data of the original input voice segment that follows the last audio segment of the set, splice the remaining data end-to-end with the newest data received in the buffer of the corresponding voice stream to form new buffer data, and update the audio duration audio_len;
Step 5-3: update the threshold duration threshold:
When the set of audio segments is non-empty, update according to the formula threshold = max(2.0, average_duration + 0.5), where average_duration is the average duration of the audio segments in the set;
When the set of audio segments is empty, the threshold duration threshold remains unchanged;
Step 5-4: reset the audio duration audio_len to the actual duration of the buffer after the current splicing.
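A sketch of the step 5 post-processing on the same per-stream buffer structure; last_end (the sample index where the last detected segment ends) and send_to_asr (the hand-off to the recognition system) are illustrative stand-ins, not the patent's interfaces.

import numpy as np

def postprocess(buffer, original_segment, speech_segments, last_end, send_to_asr):
    if speech_segments:                                            # steps 5-1 / 5-2
        for seg in speech_segments:
            send_to_asr(seg)                                       # forward to speech recognition
        remainder = original_segment[last_end:]                    # data after the last segment
        buffer['audio_buffer'] = np.concatenate([remainder, buffer['audio_buffer']])
        average_duration = float(np.mean([len(s) for s in speech_segments])) / 16000.0
        buffer['threshold'] = max(2.0, average_duration + 0.5)     # step 5-3, non-empty case
    # step 5-3, empty case: the threshold is left unchanged
    buffer['audio_len'] = len(buffer['audio_buffer']) / 16000.0    # step 5-4
    return buffer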
The rapid segmentation model and the high-precision segmentation model are both trained.
Examples:
In this embodiment, an air traffic control ground-air voice communication recorder system is taken as an example to describe the application scenario of the method in the real-time segmentation of multiple voice streams. In combination with the schematic diagram shown in fig. 1, the specific implementation steps are as follows:
step 1, establishing a multipath voice stream buffer management mechanism:
The system establishes an independent buffer structure for each radio channel and defines the structure of the kth voice stream as follows:
{
    'audio_buffer': np.array([]),  # audio data resampled to 16000 Hz
    'audio_len': 0.0,              # accumulated duration (seconds)
    'threshold': 2.0               # initial trigger threshold (seconds)
}
When new audio data arrives, it is resampled to a 16000 Hz sampling rate and appended to the corresponding audio_buffer, audio_len = len(audio_buffer) / 16000 is updated, and when audio_len > threshold the complete buffer data is packed into the to-be-processed set audio_set.
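A sketch of this accumulate-and-trigger logic; the linear-interpolation resampler, the 8 kHz source rate, and emptying the buffer once its contents are handed to audio_set are illustrative assumptions.

import numpy as np

def on_new_audio(buffer, new_audio, audio_set, stream_id, src_rate=8000):
    # Resample the incoming chunk to 16000 Hz (simple linear interpolation for illustration)
    n_out = int(len(new_audio) * 16000 / src_rate)
    resampled = np.interp(np.linspace(0, len(new_audio) - 1, n_out),
                          np.arange(len(new_audio)), new_audio)
    buffer['audio_buffer'] = np.concatenate([buffer['audio_buffer'], resampled])
    buffer['audio_len'] = len(buffer['audio_buffer']) / 16000.0
    if buffer['audio_len'] > buffer['threshold']:
        audio_set[stream_id] = buffer['audio_buffer']   # hand the whole chunk to the to-be-processed set
        buffer['audio_buffer'] = np.array([])
        buffer['audio_len'] = 0.0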
Step 2, dynamic triggering of the rapid segmentation model (corresponding to fig. 2):
a. Extract the audio of the last 1.5 seconds for feature analysis:
Framing with a frame length of 30 ms and a frame shift of 10 ms;
Computing 13-dimensional MFCC features (M = 13).
b. Silence detection is performed by the lightweight model.
c. High-precision processing is triggered when the silence-frame proportion is less than or equal to 40 percent.
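The trigger decision in steps b-c can be written as a short check on the fast model's per-frame probability list; the 0.5 and 0.4 cut-offs follow step 2-2 of the description, and the function name is illustrative.

import numpy as np

def should_trigger(frame_probs, prob_cutoff=0.5, ratio_cutoff=0.4):
    # Trigger high-precision processing when enough frames score above the cut-off
    ratio = float(np.mean(np.asarray(frame_probs) > prob_cutoff))
    return ratio > ratio_cutoff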
Step 3, dynamically adjusting the buffer:
If processing is not triggered, the current data is concatenated with the preceding buffer and the threshold is updated by the formula threshold_new = current_len + 0.3. Example: when the original buffer duration is 2.1 seconds and processing is not triggered, the new threshold is set to 2.1 + 0.3 = 2.4 seconds.
Step 4, high-precision speech segmentation (corresponding to fig. 3):
The complete audio segment is processed using the bidirectional LSTM model.
step 5, the result output is linked with the system (corresponding to fig. 4):
the valid speech segments are pushed to the speech recognition engine and the remaining data is written back to the buffer.
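Putting the pieces together, one pass of the dual-model pipeline for a single stream could look like the sketch below, which wires up the helper sketches given earlier; fast_probs_fn and boundary_probs_fn stand in for inference wrappers around the two models, and none of these names are the patent's API.

def process_stream_once(buffer, audio_set, stream_id,
                        fast_probs_fn, boundary_probs_fn, send_to_asr):
    segment = audio_set.pop(stream_id, None)              # output of step 1
    if segment is None:
        return
    tail = segment[-int(1.5 * 16000):]                    # step 2: screen the last 1.5 s
    if not should_trigger(fast_probs_fn(tail)):
        handle_unqualified_segment(buffer, segment)       # step 3: splice back, raise threshold
        return
    boundary_probs = boundary_probs_fn(segment)           # step 4: per-frame boundary probabilities
    speech_segments = segment_at_peaks(segment, boundary_probs)
    last_end = len(segment) if speech_segments else 0
    postprocess(buffer, segment, speech_segments, last_end, send_to_asr)   # step 5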
This embodiment ensures high-precision segmentation through the cooperative work of the two-stage models (as shown in the flow of fig. 1) while, compared with a traditional single-model scheme, significantly reducing the computational resource requirement, and is particularly suitable for scenarios such as air traffic control that need to process multiple voice streams in real time.
The invention provides a voice stream segmentation method based on dual-model dynamic triggering, and there are many methods and ways to implement this technical scheme; the above is only a preferred embodiment of the invention. It should be pointed out that those skilled in the art can make several improvements and modifications without departing from the principle of the invention, and these improvements and modifications should also be regarded as within the protection scope of the invention. Components not explicitly described in this embodiment can be implemented with the prior art.

Claims (7)

1. A voice stream segmentation method based on dual-model dynamic triggering, characterized by comprising the following steps:
Step 1, constructing a data stream buffer management mechanism of multiple voice streams, establishing an independent processing channel for each voice stream, and forming voice data accumulated to a threshold duration into a voice set to be processed;
Step 2, screening, analyzing and processing a voice set to be processed through a rapid segmentation model, selecting voice fragments meeting the conditions and outputting the voice fragments to a high-precision segmentation model;
Step 3, according to the screening result of the rapid segmentation model, splicing the data which do not meet the conditions with the data in the data stream buffer, and adjusting the threshold time of the buffer zone corresponding to the voice fragment;
Step 4, processing the voice fragments screened by the rapid segmentation model by using the high-precision segmentation model;
Step 5, outputting the segmented audio clips to a voice recognition system according to the processing result, splicing the residual data with the data in the data stream buffer, and updating the threshold time of the corresponding buffer zone;
The rapid segmentation model screening in the step 2 specifically comprises the following steps:
Step 2-1: check whether the to-be-processed voice set is empty; if it is empty, wait for the next check, otherwise execute step 2-2;
Step 2-2: select the voice segment of the kth voice stream from the to-be-processed voice set, divide the data of the last second through a sliding window into num_frame frames, extract the spectral feature of each frame, input the features into the rapid segmentation model to obtain a probability list of whether each frame is a silence frame, perform statistics on the probability list, and judge whether the voice segment meets the condition;
The rapid segmentation model in step 2-2 is a lightweight binary classification model based on a one-dimensional convolutional neural network, and its network structure comprises:
Input layer: receives the num_frame × M MFCC feature matrix, where M is the number of Mel filters;
One-dimensional convolution layer: performs one-dimensional convolution along the time axis using 32 convolution kernels of width 5 with a stride of 1, producing a 32-channel feature map;
Maximum pooling layer: pooling window size 2, stride 2, halving the time dimension of the 32-channel feature map;
Flattening layer: flattens the feature map into a 1-dimensional vector;
Fully connected layer: the flattened vector passes through a fully connected layer whose activation function is ReLU;
Output layer: a single node, through a Sigmoid activation function, outputs a probability value representing the probability that an input frame is not a silence frame;
The network structure of the high-precision segmentation model in the step 4 is as follows:
Input layer: receives the MFCC feature sequence of the whole speech, with dimension T × M, where T is the number of frames obtained after the whole speech passes through a sliding window with a frame length of 30 ms and a frame shift of 10 ms, and M is the number of Mel filters;
Two bidirectional LSTM layers, each with 128 hidden units; the output dimension is T × 256;
Fully connected layer: a fully connected layer of 64 neurons whose activation function is ReLU;
Output layer: outputs a T-dimensional probability sequence through a Sigmoid activation function, representing the probability that each time frame is not a silence frame;
Boundary decision module: a time frame is judged to be a boundary decision point when its probability value is greater than 0.7 and is a local maximum.
2. The voice stream segmentation method based on dual-model dynamic triggering according to claim 1, characterized in that the data stream buffer management mechanism for multiple voice streams in step 1 is constructed through the following steps:
Step 1-1: establish a separate data buffer structure for each voice stream, the data buffer structure of the kth voice stream comprising the audio data that has been received and resampled, the duration of the audio that has been received, and the threshold duration;
Step 1-2: after new data is received, store it in the data buffer structure of the corresponding voice stream: resample the new audio data, append it to the received audio data, and update the audio duration;
Step 1-3: compare the audio duration with the threshold duration; when the audio duration reaches the threshold duration, output all the voice data in the buffer structure to the to-be-processed voice set; otherwise, execute step 1-2.
3. The voice stream segmentation method based on dual-model dynamic triggering according to claim 1, characterized in that dividing the data of the last second into num_frame frames through a sliding window in step 2-2 and extracting the spectral feature of each frame comprises:
Step 2-2-1: frame the speech data of the last second to obtain num_frame frames, where the signal amplitude of each frame is x_i(r), i denotes the ith frame, and r denotes the rth sampling point;
Step 2-2-2: perform a fast Fourier transform on the speech data of each frame and compute the energy spectrum, obtaining the frequency-domain signal X_i(k) = Σ_r x_i(r)·e^(−j2πkr/N) and the energy spectrum E_i(k) = |X_i(k)|² of each frame, where i denotes the ith frame, k denotes the kth frequency point, r denotes the rth sampling point, and N is the number of sampling points in a frame;
Step 2-2-3: pass the energy spectrum through a Mel filter bank to obtain the M-dimensional Mel spectrum feature S_i(m) = Σ_k E_i(k)·H_m(k), where H_m(k) denotes the band-pass triangular filters of the Mel filter bank, M denotes the number of filters, and m denotes the mth filter;
Step 2-2-4: apply a discrete cosine transform to the logarithm of the Mel spectrum feature to obtain the Mel-frequency cepstral coefficients C_i(n) = Σ_m log(S_i(m))·cos(πn(m−0.5)/M), where n is the frequency point after the DCT, the value range of n is the same as that of m, and C_i(n) is the spectral feature of the ith frame.
4. The voice stream segmentation method based on dual-model dynamic triggering according to claim 1, characterized in that step 3 comprises the following steps:
Step 3-1: when the proportion of frames in the probability list whose probability is greater than 0.5 does not exceed 0.4, select the kth voice stream corresponding to the voice segment and splice the voice segment end-to-end with the newest data received in the buffer of that voice stream to form new buffer data;
Step 3-2: update the threshold duration according to the formula: threshold duration = duration of the new buffer data + 0.3.
5. The voice stream segmentation method based on dual-model dynamic triggering according to claim 4, characterized in that step 4 comprises the following steps:
Step 4-1: extract the complete MFCC features of the input voice segment;
Step 4-2: obtain a boundary probability sequence through the high-precision segmentation model;
Step 4-3: segment at the probability peak points to obtain the set of audio segments in which someone is speaking, where k denotes the kth voice stream and L denotes the total number of audio segments contained in the set; if no speech is detected in the input voice segment, an empty set is output.
6. The voice stream segmentation method based on dual-model dynamic triggering according to claim 1, characterized in that step 5 comprises the following steps:
Step 5-1: check whether the set of audio segments with speech output by the high-precision segmentation model is empty:
If the set of audio segments is non-empty, send the audio segments of the set (i = 1, 2, ..., L) in sequence to the speech recognition system and execute step 5-2;
If the set of audio segments is empty, directly execute step 5-3;
Step 5-2: extract the remaining data of the original input voice segment that follows the last audio segment of the set, splice the remaining data end-to-end with the newest data received in the buffer of the corresponding voice stream to form new buffer data, and update the audio duration;
Step 5-3: update the threshold duration:
When the set of audio segments is non-empty, update according to the formula: threshold duration = max(2.0, average_duration + 0.5), where average_duration is the average duration of the audio segments in the set;
When the set of audio segments is empty, the threshold duration remains unchanged;
Step 5-4: reset the audio duration to the actual duration of the buffer after the current splicing.
7. The voice stream segmentation method based on dual-model dynamic triggering according to claim 1, characterized in that the rapid segmentation model and the high-precision segmentation model are both trained.
CN202510726884.7A 2025-06-03 2025-06-03 A voice stream segmentation method based on dual-model dynamic triggering Active CN120260546B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510726884.7A CN120260546B (en) 2025-06-03 2025-06-03 A voice stream segmentation method based on dual-model dynamic triggering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202510726884.7A CN120260546B (en) 2025-06-03 2025-06-03 A voice stream segmentation method based on dual-model dynamic triggering

Publications (2)

Publication Number Publication Date
CN120260546A CN120260546A (en) 2025-07-04
CN120260546B true CN120260546B (en) 2025-09-23

Family

ID=96187627

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510726884.7A Active CN120260546B (en) 2025-06-03 2025-06-03 A voice stream segmentation method based on dual-model dynamic triggering

Country Status (1)

Country Link
CN (1) CN120260546B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101021854A (en) * 2006-10-11 2007-08-22 鲍东山 Audio analysis system based on content
CN108766419A (en) * 2018-05-04 2018-11-06 华南理工大学 A kind of abnormal speech detection method based on deep learning

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1174374C (en) * 1999-06-30 2004-11-03 国际商业机器公司 Method for Concurrent Speech Recognition, Speaker Segmentation, and Classification
US8756061B2 (en) * 2011-04-01 2014-06-17 Sony Computer Entertainment Inc. Speech syllable/vowel/phone boundary detection using auditory attention cues
WO2020111676A1 (en) * 2018-11-28 2020-06-04 삼성전자 주식회사 Voice recognition device and method
CN113012684B (en) * 2021-03-04 2022-05-31 电子科技大学 Synthesized voice detection method based on voice segmentation
CN114283792B (en) * 2021-12-13 2025-06-20 亿嘉和科技股份有限公司 Method and device for identifying the opening and closing sound of grounding knife switch
CN114187898A (en) * 2021-12-31 2022-03-15 电子科技大学 An End-to-End Speech Recognition Method Based on Fusion Neural Network Structure
CN119132328A (en) * 2023-06-13 2024-12-13 腾讯科技(深圳)有限公司 A voice processing method, device, equipment, medium and program product
CN117238279A (en) * 2023-09-04 2023-12-15 中国电子科技集团公司第二十八研究所 A method of segmenting regulatory speech based on speech recognition and endpoint detection
CN118968970B (en) * 2024-07-15 2025-06-20 广州市中南民航空管通信网络科技有限公司 A voice segmentation method and system for air traffic control voice recorder
CN119673171B (en) * 2025-02-17 2025-04-25 深圳十方融海科技有限公司 Speech recognition feature extraction and reasoning method for artificial intelligence dialogue system

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101021854A (en) * 2006-10-11 2007-08-22 鲍东山 Audio analysis system based on content
CN108766419A (en) * 2018-05-04 2018-11-06 华南理工大学 A kind of abnormal speech detection method based on deep learning

Also Published As

Publication number Publication date
CN120260546A (en) 2025-07-04

Similar Documents

Publication Publication Date Title
CN110364143B (en) Voice awakening method and device and intelligent electronic equipment
KR100636317B1 (en) Distributed speech recognition system and method
Zazo et al. Feature Learning with Raw-Waveform CLDNNs for Voice Activity Detection.
JP3002204B2 (en) Time-series signal recognition device
CN1121681C (en) Speech processing
Hermansky et al. TRAPS-classifiers of temporal patterns.
CN103400580A (en) Method for estimating importance degree of speaker in multiuser session voice
CN111461173A (en) A multi-speaker clustering system and method based on attention mechanism
CN102543063A (en) Method for estimating speech speed of multiple speakers based on segmentation and clustering of speakers
CN112270931B (en) A Method for Deceptive Speech Detection Based on Siamese Convolutional Neural Networks
CN118486305B (en) Event triggering processing method based on voice recognition
CN112599123B (en) Lightweight speech keyword recognition network, method, device and storage medium
Lu et al. Real-time unsupervised speaker change detection
CN111429916B (en) Sound signal recording system
CN110047502A (en) The recognition methods of hierarchical voice de-noising and system under noise circumstance
CN112927723A (en) High-performance anti-noise speech emotion recognition method based on deep neural network
CN114822578A (en) Speech noise reduction method, device, equipment and storage medium
CN113889099A (en) Voice recognition method and system
CN113903328A (en) Speaker counting method, device, device and storage medium based on deep learning
CN116741159A (en) Audio classification and model training method and device, electronic equipment and storage medium
CN111341295A (en) Offline real-time multilingual broadcast sensitive word monitoring method
CN120260546B (en) A voice stream segmentation method based on dual-model dynamic triggering
CN119673173A (en) A streaming speaker log method and system
CN115132196A (en) Voice instruction recognition method and device, electronic equipment and storage medium
CN110930985B (en) Telephone voice recognition model, method, system, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant