WO2018049391A1 - Method and apparatus for exemplary segment classification - Google Patents
Method and apparatus for exemplary segment classification
- Publication number
- WO2018049391A1 (PCT/US2017/051160)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- audio
- energy
- calculation module
- kurtosis
- silence
- Prior art date
Classifications
- G — PHYSICS
- G10 — MUSICAL INSTRUMENTS; ACOUSTICS
- G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00 — Speech recognition
- G10L15/04 — Segmentation; Word boundary detection
- G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00–G10L21/00
- G10L25/03 — Speech or voice analysis techniques characterised by the type of extracted parameters
- G10L25/09 — Speech or voice analysis techniques characterised by the extracted parameters being zero crossing rates
- G10L25/78 — Detection of presence or absence of voice signals
- G10L25/93 — Discriminating between voiced and unvoiced parts of speech signals
Abstract
Method and apparatus for segmenting speech by detecting the pauses between words and/or phrases, and determining whether a particular time interval contains speech or non-speech, such as a pause.
Description
METHOD AND APPARATUS FOR EXEMPLARY SEGMENT CLASSIFICATION
CROSS-REFERENCE TO RELATED PATENT APPLICATION
[0001] This application is a Continuation-in-Part of U.S. Application No. 15/075,786, filed March 21, 2016, which is a Continuation Application of U.S. Application No.
14/262,668, filed April 25, 2014, which claims the benefit of U.S. Provisional Patent Application No. 61/825,523, filed on April 25, 2013, in the U.S. Patent and Trademark Office, the disclosure of which is incorporated herein by reference in its entirety.
BACKGROUND
1. Field
[0002] Speech segmentation is the process of identifying the boundaries between words, syllables, or phonemes in spoken natural language. In all natural languages, the meaning of a complex spoken sentence (which often has never been heard or uttered before) can be understood only by decomposing the complex spoken sentence into smaller lexical segments (roughly, the words of the language), associating a meaning to each segment, and then combining those meanings according to the grammar rules of the language. The recognition of each lexical segment in turn requires its decomposition into a sequence of discrete phonetic segments and mapping each segment to one element of a finite set of elementary sounds (roughly, the phonemes of the language).
[0003] For most spoken languages, the boundaries between lexical units are surprisingly difficult to identify. One might expect that the inter-word spaces used by many written languages, like English or Spanish, would correspond to pauses in their spoken version; but that is true only in very slow speech, when the speaker deliberately inserts those pauses. In normal speech, one typically finds many consecutive words being said with no pauses therebetween.
2. Description of Related Art
[0004] Voice activity detection (VAD), also known as speech activity detection or speech detection, is a technique used in speech processing in which the presence or absence of human speech is detected. The main uses of VAD are in speech coding and speech recognition. VAD can facilitate speech processing, and can also be used to deactivate some processes during the non-speech sections of an audio session: VAD can avoid unnecessary coding/transmission of silence packets in Voice over Internet Protocol applications, saving on computation and on network bandwidth.
SUMMARY
[0005] Aspects of the exemplary embodiments relate to systems and methods designed to segment speech by detecting the pauses between the words and/or phrases, i.e. to determine whether a particular time interval contains speech or non-speech, e.g. a pause.
BRIEF DESCRIPTION OF THE DRAWINGS
[0006] FIG. 1 illustrates a block diagram of a system for classifying input audio, according to an exemplary embodiment.
[0007] FIG. 2 illustrates a flow diagram of a method of identifying lexical boundaries between silence and non-silence in input audio, according to an exemplary embodiment.
[0008] FIG. 3 illustrates a graph of the sliding energy ratio as a function of time, according to an exemplary embodiment.
[0009] FIG. 4 illustrates a block diagram of the blocks over which energy is calculated, according to an exemplary embodiment.
[0010] FIG. 5 is a flow diagram of a method of identifying audio transitions within a super segment, according to an exemplary embodiment.
[0011] FIG. 6 is a flow diagram of a method of extracting feature vectors for each block, according to an exemplary embodiment.
[0012] FIG. 7 illustrates a flow diagram of a method of making a preliminary classification of the audio in each block, according to an exemplary embodiment.
[0013] FIG. 8 illustrates a flow diagram of classifying the audio contained within a segment, according to an exemplary embodiment.
DETAILED DESCRIPTION OF THE EXEMPLARY EMBODIMENTS
[0014] FIG. 1 illustrates a block diagram of a system for classifying input audio, according to an exemplary embodiment.
[0015] The classification system in FIG. 1 may be implemented as a computer system 110. Computer system 110 is a computer comprising several modules, i.e. computer components embodied as either software modules, hardware modules, or a combination of software and hardware modules, whether separate or integrated, working together to form an exemplary computer system. The computer components may be implemented as a Field Programmable Gate Array (FPGA) or an Application Specific Integrated Circuit (ASIC), which performs certain tasks. A unit or module may advantageously be configured to reside on an addressable storage medium and configured to execute on one or more processors or microprocessors. Thus, a unit or module may include, by way of example, components such as software components, object-oriented software components, class components and task components, processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables. The functionality provided for in the components and units may be combined into fewer components and units or modules or further separated into additional components and units or modules.
[0016] Input 115 is a module configured to receive input audio from any audio source and output the received audio to the divider 117. The audio source may be live audio, for example received from a microphone; or recorded audio, for example received from a file, etc. The divider 117 is a module configured to receive the output of the input 115 and divide said input audio into blocks of samples (for example, 500 samples per block), i.e. the blocked audio. Audio may be silence, speech, music, background/ambient noise or any combination thereof. The energy calculator 120 is a module configured to receive the blocked audio output from the divider 117, calculate the energy of the waveform of the input audio and output the calculated energy of the waveform to the energy ratio calculator 125. The energy ratio calculator 125 is a module configured to calculate the ratio of the energy of N1 contiguous blocks of audio over the energy of N2 contiguous blocks of audio contained within the first set of contiguous blocks of audio, and output the energy ratio to the silence boundary locator 130. The silence boundary locator 130 is a module configured to receive the energy ratio calculated by the energy ratio calculator 125, identify the lexical boundaries between the silence and non-silence portions of the input audio and output the non-silence portions of the input audio to the kurtosis calculator 135.
[0017] The kurtosis calculator 135 is a module configured to receive the non-silence audio portions from the silence boundary locator 130, calculate the kurtosis over a sliding window in the non-silence audio portion and identify the transitions between the different types of audio within each non-silence audio portion. The non-silence audio portion between each transition is designated a super segment. Each super segment retains the block divisions from the
divider 117. The kurtosis calculator 135 is further configured to calculate the kurtosis for each individual block within a super segment.
[0018] The zero crossing calculator 140 is a module configured to receive the block-divided super segments from the kurtosis calculator 135 and calculate the number of zero crossings in each individual block within said super segments. A zero-crossing is a point at which the sign of a mathematical function changes (e.g. from positive to negative), represented by a crossing of the axis (zero value) in the graph of the function, and is a commonly used term in electronics, mathematics, sound, and image processing.
[0019] The silence ratio calculator 142 is a module configured to receive the block-divided super segments from the kurtosis calculator 135 and calculate the silence ratio over each individual block. The silence ratio of an audio block is the ratio of the number of samples that are smaller than a given threshold to the length of the block.
[0020] The weighted averager 145 is a module configured to create a weighted average of the parameters to create the feature vector of each individual block.
[0021] The distance 150 is a module configured to determine, for each block, the
Euclidean distance between vectors in n-dimensional space, and in particular between the feature vector for each block and a centroid of pure speech, pure music, pure silence and pure background/ambient noise.
[0022] The preliminary decider 155 is a module configured to make a preliminary classification of the audio contained within each block, i.e. whether the block is primarily speech, music or background/ambient noise or silence.
[0023] The polar 160 is a module configured to poll the individual blocks within a segment to make a final classification of the audio within said segment.
[0024] The pause detector 170 is a module configured to obtain segments, determine whether a segment is longer than a given threshold and, if so, divide the segment into smaller segments.
[0025] FIG. 2 illustrates a flow diagram of a method of identifying lexical boundaries between silence and non-silence in input audio, according to an exemplary embodiment.
[0026] The audio portions between these initial lexical boundaries are designated super segments, which may contain multiple types of audio.
[0027] At step 210, the input 115 receives input audio from an audio source. The audio source may be live audio, for example received from a microphone; recorded audio, for example audio recorded in a file; synthesized audio; etc.
[0028] At step 220, the divider 117 divides the input audio into blocks of samples, for example 500 samples per block. The length of the blocks is a function of the sample rate.
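The block division of step 220 can be sketched as follows. This is a minimal Python illustration, not the patent's implementation; in particular, the handling of a short trailing remainder is an assumption, since the patent does not specify it.

```python
def divide_into_blocks(samples, block_size=500):
    """Divide input audio samples into fixed-size blocks, as the
    divider 117 does. A short trailing remainder is dropped here;
    the patent does not specify how a partial block is handled."""
    return [samples[i:i + block_size]
            for i in range(0, len(samples) - block_size + 1, block_size)]
```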
[0029] At step 230, the energy calculator 120 calculates the energy over N1 contiguous blocks. In a preferred embodiment, N1 is 6 blocks.
[0030] At step 240, the energy calculator 120 calculates the energy over N2 contiguous blocks, where said contiguous blocks are wholly contained within the N1 contiguous blocks. In a preferred embodiment, N2 is 4 blocks.
[0031] At step 250, the energy ratio calculator 125 calculates the ratio of the energy of the N1 blocks over the energy of the N2 blocks. The energy ratio calculator 125 calculates the energy ratio over a sliding frame.
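Steps 230 through 250 can be sketched as below. Two details are assumptions not stated in the patent: the block energies are averaged (normalized per block), so that steady audio yields a ratio near 1 as in FIG. 3, and the N2 inner blocks are centred within the N1 outer blocks.

```python
def block_energy(block):
    """Energy of one block: the sum of squared samples."""
    return sum(s * s for s in block)

def sliding_energy_ratio(blocks, n1=6, n2=4):
    """Per-frame ratio of the mean energy of N1 contiguous blocks to
    the mean energy of the N2 blocks assumed to be centred within
    them, evaluated over a sliding frame."""
    pad = (n1 - n2) // 2
    ratios = []
    for i in range(len(blocks) - n1 + 1):
        outer = sum(block_energy(b) for b in blocks[i:i + n1]) / n1
        inner = sum(block_energy(b) for b in blocks[i + pad:i + pad + n2]) / n2
        ratios.append(outer / inner if inner else float("inf"))
    return ratios
```

With this normalization, audio of uniform loudness produces a ratio close to 1, and the ratio departs from 1 only near silence boundaries, matching the behavior described for FIG. 3.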
[0032] At step 260, the silence boundary locator 130 identifies the lexical boundaries to and from silence by the energy ratio of the blocks. When there is a transition from silence to non-silence, the moving energy ratio will spike sharply. When there is a transition from non-silence to silence, the moving energy ratio will decline abruptly. The silence boundary locator 130 will assign each super segment a label, silence or non-silence, based upon these transitions.
[0033] FIG. 3 illustrates a graph of the sliding energy ratio as a function of time.
[0034] The time index 310 illustrates a relatively uniform energy ratio of approximately 1, indicating that there is no transition between silence and non-silence within the time index 310.
[0035] The time index 320 illustrates a spike in the energy ratio, i.e. the energy ratio is substantially greater than 1. This spike in the energy ratio indicates a transition from silence to non-silence.
[0036] Similar to the time index 310, the time index 330 illustrates a uniform energy ratio of approximately 1, once again indicating a lack of transitions between silence and non-silence.
[0037] Time index 340 represents a dip in the energy ratio, i.e. the energy ratio is substantially less than 1, indicating a transition from non-silence to silence.
[0038] FIG. 4 illustrates a block diagram of the blocks over which energy calculator 120 calculates the energy.
[0039] Blocks 405 illustrate the blocks in the input audio. Blocks 410 illustrate the N1 contiguous blocks over which the energy calculator 120 first calculates the energy. Blocks 420 illustrate the N2 contiguous blocks over which the energy calculator 120 also calculates the energy.
[0040] FIG. 5 is a flow diagram of the method of identifying audio transitions within a super segment.
[0041] At step 510, the kurtosis calculator 135 receives a blocked super segment of non-silence audio.
[0042] At step 520, the kurtosis calculator 135 calculates the kurtosis over a sliding window of N3 blocks in the super segment. Calculating kurtosis is within the skill of one schooled in the art of signal processing.
[0043] At step 530, the kurtosis calculator 135 identifies the transitions between the different types of audio contained within the N3 blocks of the super segment, i.e. speech, music, and background/ambient noise. A rapid change in the kurtosis over a sliding window of N3 blocks indicates a change in the audio as shown in Table 1.
[0044] Table 1
[0045] The audio between transitions is designated a segment.
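Steps 520 and 530 can be sketched as follows. The kurtosis definition used here (fourth central moment divided by the squared variance) and the fixed change threshold are illustrative assumptions, since the patent defers the computation to the skilled reader and does not give Table 1's values.

```python
def kurtosis(samples):
    """Kurtosis of a window: fourth central moment over squared variance."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    if var == 0.0:
        return 0.0  # degenerate (constant) window
    m4 = sum((s - mean) ** 4 for s in samples) / n
    return m4 / (var * var)

def find_transitions(window_kurtoses, threshold):
    """Indices where the sliding-window kurtosis changes rapidly,
    taken as transitions between audio types. The threshold is a
    hypothetical stand-in for the patent's Table 1 criteria."""
    return [i + 1 for i, (a, b) in
            enumerate(zip(window_kurtoses, window_kurtoses[1:]))
            if abs(b - a) > threshold]
```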
[0046] FIG. 6 is a flow diagram of a method of extracting the feature vectors for each block. The feature vectors are used to classify the type of audio contained within each block in the segment.
[0047] At step 610 the kurtosis calculator 135 calculates the kurtosis for each individual block within a segment.
[0048] At step 620, the zero crossing calculator 140 calculates the zero crossing rate for each individual block. The zero-crossing rate is the rate of sign-changes along a signal, i.e., the rate at which the signal changes from positive to negative or back.
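A minimal sketch of the zero-crossing rate of step 620 follows; treating a zero-valued sample as non-negative is an assumption, as the patent does not state a sign convention.

```python
def zero_crossing_rate(block):
    """Fraction of adjacent sample pairs whose signs differ
    (zero is counted as non-negative here)."""
    crossings = sum(1 for a, b in zip(block, block[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / len(block)
```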
[0049] At step 630, the silence ratio calculator 142 calculates the silence ratio for each individual block. The silence ratio of an audio block is the ratio of the number of samples that are smaller than a given threshold to the length of the block.
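The silence ratio of step 630 can be sketched as below; comparing the absolute value of each sample against the threshold is an assumption (the patent says only "smaller than a given threshold").

```python
def silence_ratio(block, threshold):
    """Fraction of samples whose magnitude is below the given
    threshold; the use of the absolute value is an assumption."""
    return sum(1 for s in block if abs(s) < threshold) / len(block)
```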
[0050] At step 640, the energy calculator 120 calculates the energy of each individual block. The four parameters, energy, kurtosis, zero crossing rate and silence ratio, once weighted, will collectively represent the feature vector of a block.
[0051] FIG. 7 illustrates a flow diagram of a method of making a preliminary classification of the audio in each block.
[0052] At step 710, the weighted average calculator calculates a multiplication factor, i.e., a weight, to apply to each parameter. In one embodiment, the applied weight is the reciprocal of the standard deviation of each parameter. The weighted parameters are the elements in the feature vector of each block.
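The reciprocal-standard-deviation weighting can be sketched as below. This is a hedged example assuming the standard deviation of each parameter is taken across the blocks of the segment (the specification does not state the population); the fallback weight of 1.0 for a zero-variance parameter is likewise an assumption.

```python
# Hypothetical sketch: weight each parameter by 1/std-dev across the blocks,
# then form the weighted feature vector of each block.
from statistics import pstdev

def weighted_feature_vectors(raw):
    """raw: one tuple of raw parameter values per block, e.g.
    (energy, kurtosis, zero_crossing_rate, silence_ratio).
    Returns the weighted feature vector of each block."""
    columns = list(zip(*raw))
    weights = [1.0 / pstdev(c) if pstdev(c) > 0 else 1.0 for c in columns]
    return [tuple(v * w for v, w in zip(block, weights)) for block in raw]
```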
[0053] At step 720, for each block, the distance calculator 150 calculates the Euclidean distances between the feature vector and a centroid of pure speech, a centroid of pure music, a centroid of pure background/ambient noise, and a centroid of pure silence. Calculating a Euclidean distance is within the ordinary skill of one schooled in the art of signal processing.
[0054] At step 730, the preliminary decider 155 classifies the block based upon the centroid with the lowest Euclidean distance from the feature vector, e.g., if the centroid of pure music has the lowest Euclidean distance from the feature vector, the block is initially classified as music.
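The nearest-centroid decision of steps 720 and 730 can be sketched as follows. The centroid coordinates below are invented for the example (real centroids would be learned from labeled audio), and two-dimensional vectors are used purely for brevity.

```python
# Illustrative sketch: classify a feature vector by the nearest centroid
# under Euclidean distance. Centroid values here are made up for the demo.
from math import dist

CENTROIDS = {
    "speech": (1.0, 0.0),
    "music": (0.0, 1.0),
    "noise": (0.0, 0.0),
    "silence": (1.0, 1.0),
}

def classify_block(feature_vector, centroids=CENTROIDS):
    """Return the label whose centroid is closest to the feature vector."""
    return min(centroids, key=lambda label: dist(centroids[label], feature_vector))
```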
[0055] At step 740, each block is assigned a value corresponding to its preliminary classification. An example of such values is illustrated in Table 2.
[0056] Table 2
Numerical Value | Type of Audio
---|---
0 | Silence
1 | Pure speech
2 | Pure music
3 | Background/ambient noise
[0057] FIG. 8 illustrates a flow diagram of a method of classifying the audio contained within a segment.
[0058] At step 810, the polling module 160 averages the assigned values of each block within the segment.
[0059] At step 820, the polling module 160 determines which block classifications deviate from the average by more than a given threshold, e.g., one standard deviation.
[0060] At step 830, the blocks whose classifications deviate by more than the given threshold are identified as potentially misclassified, i.e., outliers.
[0061] At step 840, the classification of the outliers is reassigned to the current average.
[0062] At step 850, the average of all blocks, including the reclassified outliers, is recalculated. Alternatively, at step 850, the average is recalculated with all outliers excluded.
[0063] At step 860, the polling module 160 determines the difference between the recalculated average and each of the assigned audio values in Table 2. The segment is classified as the type of audio whose assigned value is closest to the recalculated average.
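The polling of steps 810 through 860 can be sketched as a single function. This is a hedged illustration using the one-standard-deviation threshold from the example above and the variant of step 850 that excludes outliers from the recalculated average; both choices are assumptions the specification leaves open.

```python
# Hypothetical sketch of the polling step: average the per-block values,
# drop blocks more than one standard deviation from the average (outliers),
# recompute the average, and pick the nearest Table 2 value.
from statistics import mean, pstdev

LABELS = {0: "silence", 1: "pure speech", 2: "pure music",
          3: "background/ambient noise"}

def poll_segment(values):
    """values: preliminary per-block classifications (Table 2 values).
    Returns the final classification of the segment."""
    avg = mean(values)
    threshold = pstdev(values)
    kept = [v for v in values if abs(v - avg) <= threshold]
    new_avg = mean(kept) if kept else avg
    return LABELS[min(LABELS, key=lambda v: abs(v - new_avg))]
```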
Claims
1. A system configured to identify a lexical segment in speech, the system comprising:
an input interface configured to receive an input of audio;
a divider module configured to divide the input of audio into blocks;
an energy calculation module configured to calculate energy of the audio over first and second time intervals of the audio;
an energy ratio calculation module configured to calculate the ratio of the energy in the first time intervals of the audio over every second time interval within the first time intervals that slides as the energy calculation module calculates the energy of the audio at each of the first time intervals;
a silence boundary locator module configured to determine transitions between silence and non-silence according to the ratio of the energy calculated by the energy ratio calculation module;
a first kurtosis calculation module configured to calculate the kurtosis of the audio signal over a third time interval that slides as the kurtosis calculation module calculates the kurtosis of the audio at each of the third time intervals;
a segment classification module configured to classify the audio in every third time interval according to the kurtosis of the input audio within each third time interval;
a second kurtosis calculation module configured to calculate kurtosis of the input audio over each fourth time interval that slides as the second kurtosis calculation module calculates the kurtosis of the audio at each fourth time interval;
a zero crossing calculation module configured to calculate zero crossings of the audio within a fourth time interval;
a silence ratio calculation module configured to calculate the silence ratio of the audio within a fourth time interval;
an individual block calculation module configured to classify the audio within a fourth time interval according to the second kurtosis calculation module, the zero crossing calculation module, the silence ratio calculation module, and the energy of the fourth time interval;
a weighted average calculation module configured to calculate the weighted average of the kurtosis, zero cross rate, silence ratio and energy over the fourth time interval;
a preliminary decision module configured to make a preliminary classification of the audio within the segment; and
a polling module configured to make a final classification of the audio within the segment.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP17849768.1A EP3510592A4 (en) | 2016-09-12 | 2017-09-12 | Method and apparatus for exemplary segment classification |
IL265310A IL265310A (en) | 2016-09-12 | 2019-03-12 | Method and apparatus for exemplary segment classification |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/262,963 | 2016-09-12 | ||
US15/262,963 US9767791B2 (en) | 2013-05-21 | 2016-09-12 | Method and apparatus for exemplary segment classification |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2018049391A1 true WO2018049391A1 (en) | 2018-03-15 |
Family
ID=61562835
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2017/051160 WO2018049391A1 (en) | 2016-09-12 | 2017-09-12 | Method and apparatus for exemplary segment classification |
Country Status (3)
Country | Link |
---|---|
EP (1) | EP3510592A4 (en) |
IL (1) | IL265310A (en) |
WO (1) | WO2018049391A1 (en) |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9324319B2 (en) * | 2013-05-21 | 2016-04-26 | Speech Morphing Systems, Inc. | Method and apparatus for exemplary segment classification |
2017
- 2017-09-12 EP EP17849768.1A patent/EP3510592A4/en not_active Withdrawn
- 2017-09-12 WO PCT/US2017/051160 patent/WO2018049391A1/en unknown

2019
- 2019-03-12 IL IL265310A patent/IL265310A/en unknown
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5617508A (en) * | 1992-10-05 | 1997-04-01 | Panasonic Technologies Inc. | Speech detection device for the detection of speech end points based on variance of frequency band limited energy |
US6629070B1 (en) * | 1998-12-01 | 2003-09-30 | Nec Corporation | Voice activity detection using the degree of energy variation among multiple adjacent pairs of subframes |
US20120253812A1 (en) * | 2011-04-01 | 2012-10-04 | Sony Computer Entertainment Inc. | Speech syllable/vowel/phone boundary detection using auditory attention cues |
Non-Patent Citations (1)
Title |
---|
See also references of EP3510592A4 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111276164A (en) * | 2020-02-15 | 2020-06-12 | 中国人民解放军空军特色医学中心 | Self-adaptive voice activation detection device and method for high-noise environment on airplane |
CN111276164B (en) * | 2020-02-15 | 2021-08-03 | 中国人民解放军空军特色医学中心 | Device and method for self-adaptive voice activation detection in high-noise environment on aircraft |
Also Published As
Publication number | Publication date |
---|---|
IL265310A (en) | 2019-05-30 |
EP3510592A4 (en) | 2020-04-29 |
EP3510592A1 (en) | 2019-07-17 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 17849768 Country of ref document: EP Kind code of ref document: A1 |
NENP | Non-entry into the national phase |
Ref country code: DE |
ENP | Entry into the national phase |
Ref document number: 2017849768 Country of ref document: EP Effective date: 20190412 |