Disclosure of Invention
Aiming at the defects in the prior art, the present invention provides a voice recognition method for Parkinson's disease, which solves the problem of low voice recognition accuracy in the prior art.
In order to achieve the above aim, the technical scheme adopted by the invention is a voice recognition method for Parkinson's disease, comprising the following steps:
S1, eliminating the tremor frequencies from the voice signal of a Parkinson's disease patient to obtain a tremor-removed speech spectrum set;
S2, dividing the tremor-removed speech spectrum set according to the differences between tremor-removed speech spectra at adjacent time instants to obtain a plurality of tremor-removed speech spectrum subsets, one per pronunciation;
S3, extracting a peak frequency, an overtone centroid value and an overtone offset value from the frequencies in the tremor-removed speech spectrum subset of each pronunciation to construct a pronunciation vector;
S4, processing the plurality of pronunciation vectors with a multi-layer pronunciation vector processing model to obtain the semantics of the Parkinson's disease patient's speech.
Further, S1 comprises the following sub-steps:
S11, sliding a window over the voice signal, advancing by one time instant per slide;
S12, performing a Fourier transform on the signal segment under the sliding window at each slide to obtain the speech spectrum at each time instant;
S13, calculating the average of all amplitudes in the speech spectrum at each time instant, and screening out the frequencies whose amplitudes are below the average to form a candidate tremor frequency set;
S14, taking the intersection of the candidate tremor frequency sets over all time instants to obtain the tremor frequencies;
S15, eliminating the tremor frequencies from the speech spectrum at each time instant to obtain the tremor-removed speech spectrum at each time instant, and arranging the tremor-removed speech spectra in order of occurrence to form the tremor-removed speech spectrum set.
Further, S2 comprises the following sub-steps:
S21, in the tremor-removed speech spectrum set, taking the union of the frequencies in the tremor-removed speech spectrum at time t+1 and the frequencies in the tremor-removed speech spectrum at time t;
S22, eliminating the frequencies of the tremor-removed speech spectrum at time t from the union to obtain the newly added frequencies at time t+1;
S23, calculating a frequency change value from the newly added frequencies at time t+1;
S24, when the frequency change value is larger than a frequency change threshold, marking time t+1 as a pronunciation transition point;
S25, dividing the tremor-removed speech spectrum set at the pronunciation transition points to obtain the tremor-removed speech spectrum subset of each pronunciation.
Further, the formula for calculating the frequency change value in S23 is:
μ_{t+1} = Σ_{i=1}^{N} d_{t+1,z,i},  d_{t+1,z,i} = min_{1≤j≤M} |f_{t+1,z,i} − f_{t,s,j}|,
wherein μ_{t+1} is the frequency change value at time t+1, f_{t+1,z,i} is the i-th newly added frequency at time t+1, f_{t,s,j} is the j-th frequency in the tremor-removed speech spectrum at time t, |·| denotes the absolute value, min denotes the minimum over j, M is the number of frequencies in the tremor-removed speech spectrum at time t, N is the number of newly added frequencies at time t+1, i and j are positive integers, and d_{t+1,z,i} is the minimum difference between the i-th newly added frequency at time t+1 and the frequencies in the tremor-removed speech spectrum at time t.
Further, S3 comprises the following sub-steps:
S31, in each tremor-removed speech spectrum of the tremor-removed speech spectrum subset of a pronunciation, finding the frequency corresponding to the maximum amplitude as the main frequency, and taking the average of all main frequencies to obtain the peak frequency;
S32, in the tremor-removed speech spectrum subset of a pronunciation, taking the frequencies other than the main frequency in each tremor-removed speech spectrum as overtone frequencies;
S33, calculating the overtone centroid value of the overtone frequencies;
S34, calculating the overtone offset value from the differences between the overtone frequencies and the peak frequency;
S35, constructing a pronunciation vector with the peak frequency, the overtone centroid value and the overtone offset value as its elements.
Further, the formula for calculating the overtone centroid value in S33 is:
f_{o,c} = ( Σ_{i=1}^{L} f_{o,i} · A_{o,i} ) / ( Σ_{i=1}^{L} A_{o,i} ),
wherein f_{o,c} is the overtone centroid value, f_{o,i} is the i-th overtone frequency in the tremor-removed speech spectrum subset of the pronunciation, A_{o,i} is the amplitude of the i-th overtone frequency f_{o,i}, L is the number of overtone frequencies in the tremor-removed speech spectrum subset of the pronunciation, and i is a positive integer.
Further, the formula for calculating the overtone offset value in S34 is:
f_{o,d} = (1/L) · Σ_{i=1}^{L} |f_{o,i} − f_{peak}|,
wherein f_{o,d} is the overtone offset value, f_{o,i} is the i-th overtone frequency in the tremor-removed speech spectrum subset of the pronunciation, i is a positive integer, f_{peak} is the peak frequency, and L is the number of overtone frequencies in the tremor-removed speech spectrum subset of the pronunciation.
Further, the multi-layer pronunciation vector processing model in S4 comprises a plurality of first pronunciation processing layers, a plurality of second pronunciation processing layers, a first BiLSTM layer, a second BiLSTM layer, a feature integration layer and a CRF unit;
the input end of each first pronunciation processing layer is used for inputting a pronunciation vector, and the input end of each second pronunciation processing layer is used for inputting a pronunciation vector; the input end of the first BiLSTM layer is connected with the output ends of the plurality of first pronunciation processing layers, and the input end of the second BiLSTM layer is connected with the output ends of the plurality of second pronunciation processing layers; the input end of the feature integration layer is connected with the output end of the first BiLSTM layer and the output end of the second BiLSTM layer; the output end of the feature integration layer is connected with the input end of the CRF unit, and the output end of the CRF unit serves as the output end of the multi-layer pronunciation vector processing model.
Further, the expression of the first pronunciation processing layer is:
y_1 = tanh( Σ_{n=1}^{3} (ω_{1,n} · x_n + b_{1,n}) ),
wherein y_1 is the output of the first pronunciation processing layer, tanh is the hyperbolic tangent activation function, x_n is the n-th element in the pronunciation vector, ω_{1,n} is the n-th weight in the first pronunciation processing layer, b_{1,n} is the n-th bias in the first pronunciation processing layer, and n is a positive integer ranging from 1 to 3;
the expression of the second pronunciation processing layer is:
y_2 = sigmoid( Σ_{n=1}^{3} (ω_{2,n} · x_n + b_{2,n}) ),
wherein y_2 is the output of the second pronunciation processing layer, sigmoid is the S-type activation function, ω_{2,n} is the n-th weight in the second pronunciation processing layer, and b_{2,n} is the n-th bias in the second pronunciation processing layer.
Further, the expression of the feature integration layer is:
H_m = ω_{1,m} · H_{1,m} + ω_{2,m} · H_{2,m},
wherein H_m is the m-th integrated feature output by the feature integration layer, H_{1,m} is the m-th feature output by the first BiLSTM layer, H_{2,m} is the m-th feature output by the second BiLSTM layer, ω_{1,m} is the m-th first weight in the feature integration layer, and ω_{2,m} is the m-th second weight in the feature integration layer.
The beneficial effects of the invention are as follows:
1. The invention eliminates the tremor frequencies from the voice signal, preventing the tremor frequencies from affecting voice recognition and improving the accuracy of voice recognition.
2. In the tremor-removed speech spectrum set, the invention finds the segmentation point of each pronunciation according to the differences between tremor-removed speech spectra at adjacent time instants and divides the set accordingly, so that different pronunciation segments can be distinguished more accurately, avoiding the inaccurate feature extraction caused by processing the whole voice signal as one segment.
3. The invention extracts a peak frequency, an overtone centroid value and an overtone offset value from the tremor-removed speech spectrum subset of each pronunciation, reflecting the voice characteristics of that pronunciation.
4. The invention adopts a multi-layer pronunciation vector processing model to process the plurality of pronunciation vectors, synthesizes the pronunciation vectors corresponding to the voice signal, recognizes the semantics of the Parkinson's disease patient's speech, and improves the accuracy of voice recognition.
Detailed Description
The following description of the embodiments of the present invention is provided to facilitate understanding of the present invention by those skilled in the art. It should be understood, however, that the present invention is not limited to the scope of these embodiments; all inventions that make use of the inventive concept fall within the spirit and scope of the present invention as defined in the appended claims.
As shown in fig. 1, a voice recognition method for Parkinson's disease comprises the following steps:
S1, eliminating the tremor frequencies from the voice signal of a Parkinson's disease patient to obtain a tremor-removed speech spectrum set;
S2, dividing the tremor-removed speech spectrum set according to the differences between tremor-removed speech spectra at adjacent time instants to obtain a plurality of tremor-removed speech spectrum subsets, one per pronunciation;
S3, extracting a peak frequency, an overtone centroid value and an overtone offset value from the frequencies in the tremor-removed speech spectrum subset of each pronunciation to construct a pronunciation vector;
S4, processing the plurality of pronunciation vectors with a multi-layer pronunciation vector processing model to obtain the semantics of the Parkinson's disease patient's speech.
In this embodiment, S1 includes the following sub-steps:
S11, sliding a window over the voice signal, advancing by one time instant per slide;
S12, performing a Fourier transform on the signal segment under the sliding window at each slide to obtain the speech spectrum at each time instant;
S13, calculating the average of all amplitudes in the speech spectrum at each time instant, and screening out the frequencies whose amplitudes are below the average to form a candidate tremor frequency set;
S14, taking the intersection of the candidate tremor frequency sets over all time instants to obtain the tremor frequencies;
S15, eliminating the tremor frequencies from the speech spectrum at each time instant to obtain the tremor-removed speech spectrum at each time instant, and arranging the tremor-removed speech spectra in order of occurrence to form the tremor-removed speech spectrum set.
The invention slides a window over the voice signal, advancing by one time instant per slide, and performs a Fourier transform on the signal segment under the window after each slide to obtain the speech spectrum at each time instant. This realizes dynamic, real-time spectrum analysis of the voice signal and can capture its fine changes. Because of impaired motor control by the nervous system, reduced muscle control, and reduced coordination of the vocal cords and vocal muscles, Parkinson's disease patients suffer from speech tremor. The tremor manifests as low-amplitude, persistent abnormal frequencies that tend to be masked in the normal speech signal. The invention therefore screens out the frequencies with low amplitudes in the speech spectrum at each time instant and takes the intersection of the candidate tremor frequency sets over all time instants, extracting the frequencies that are continuously present throughout the voice signal. In this way the tremor frequencies that are stably present in the speech of a Parkinson's disease patient can be located accurately, and sporadic noise is excluded.
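The S1 pipeline described above can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the function name `detect_tremor_frequencies`, the use of FFT bin indices to stand in for frequencies, and zeroing out the tremor bins are all assumptions made for the sketch.

```python
import numpy as np

def detect_tremor_frequencies(signal, window_len, hop_len):
    """Slide a window over the signal, FFT each segment, and collect the
    frequency bins whose amplitude is below that segment's mean amplitude
    (candidate tremor bins). The tremor frequencies are the bins that are
    low-amplitude at every time instant (intersection over all windows)."""
    candidate_sets = []
    spectra = []
    for start in range(0, len(signal) - window_len + 1, hop_len):
        segment = signal[start:start + window_len]
        amplitudes = np.abs(np.fft.rfft(segment))
        spectra.append(amplitudes)
        # S13: bins below the mean amplitude are tremor candidates.
        candidate_sets.append(set(np.flatnonzero(amplitudes < amplitudes.mean())))
    # S14: intersection over all time instants.
    tremor_bins = set.intersection(*candidate_sets)
    # S15: eliminate (zero out) tremor bins to get tremor-removed spectra.
    detrembled = []
    for amp in spectra:
        cleaned = amp.copy()
        cleaned[list(tremor_bins)] = 0.0
        detrembled.append(cleaned)
    return tremor_bins, detrembled
```

For a pure bin-aligned sine, every bin except the tone's own bin is low-amplitude in every window, so only the tone survives the intersection.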
In this embodiment, the sliding window is a one-dimensional window of length K, where K is a positive integer; each slide of the window covers K time instants, i.e. a signal segment of length K is obtained.
In this embodiment, the fundamental frequency of the human voice is typically 50-250 Hz and the frequency of Parkinsonian voice tremor is typically 4-7 Hz. The sliding window can be set to 20-40 ms, a length that covers the pitch period (typically 10-20 ms) and the typical frequency range of Parkinsonian voice tremor while balancing time resolution and frequency resolution. For example, at a sampling rate of 16 kHz, K can be taken as 32 ms, corresponding to about 16 kHz × 0.032 s = 512 sampling points. The length of one time instant of advance is in the range of 5-20 ms; for example, at a sampling rate of 16 kHz, 15 ms is one time-instant length, corresponding to 16 kHz × 0.015 s = 240 sampling points.
In this embodiment, the sliding window and the length of one time instant may be adjusted. When the sliding window is too small, the frequency resolution is low and the tremor cannot be captured accurately; when it is too large, the time resolution is reduced and detail features are lost. When the length of one time instant is too small, the amount of computation is large and the system burden increases; when it is too large, important voice features may be lost.
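The sampling-point arithmetic of the embodiment above can be checked directly; the constant names below are illustrative only.

```python
# Window and hop sizes from the embodiment: a 32 ms window and a 15 ms
# advance per time instant at a 16 kHz sampling rate.
SAMPLE_RATE_HZ = 16_000
window_samples = round(SAMPLE_RATE_HZ * 0.032)  # 16 kHz x 0.032 s = 512 samples
hop_samples = round(SAMPLE_RATE_HZ * 0.015)     # 16 kHz x 0.015 s = 240 samples
```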
In this embodiment, S2 includes the following sub-steps:
S21, in the tremor-removed speech spectrum set, taking the union of the frequencies in the tremor-removed speech spectrum at time t+1 and the frequencies in the tremor-removed speech spectrum at time t;
S22, eliminating the frequencies of the tremor-removed speech spectrum at time t from the union to obtain the newly added frequencies at time t+1;
S23, calculating a frequency change value from the newly added frequencies at time t+1;
S24, when the frequency change value is larger than a frequency change threshold, marking time t+1 as a pronunciation transition point;
S25, dividing the tremor-removed speech spectrum set at the pronunciation transition points to obtain the tremor-removed speech spectrum subset of each pronunciation.
The invention marks the pronunciation transition points in the tremor-removed speech spectrum set and divides the spectra of each segment at those points to form subsets.
Speech is a dynamically changing signal: the frequency characteristics of different phonemes and syllables change as pronunciation proceeds, and although the pronunciation process is continuous, the language it expresses is discrete. The invention therefore compares the frequencies in the tremor-removed speech spectra at adjacent time instants, finds the newly added frequencies at the next time instant, and calculates the frequency change value from the newly added frequencies at time t+1, which reflects how the pronunciation is changing. If the frequency change value is larger than the frequency change threshold, the pronunciation has changed between time t and time t+1. Pronunciation involves the coordination of organs such as the vocal cords, mouth and tongue; producing each phoneme requires a specific organ position and muscle state, and the transition between phonemes can lead to a significant change in frequency. The invention marks the pronunciation transition points so that each pronunciation has its own tremor-removed speech spectrum subset.
In this embodiment, if a tremor-removed speech spectrum subset contains only a small number of tremor-removed speech spectra, it may be discarded; for example, a subset containing a single tremor-removed speech spectrum whose frequencies differ greatly from those at the time instants to its left and right does not represent a continuous pronunciation and cannot provide valuable information. Discarding such unstable spectra improves the overall quality of the data set.
A speech signal is a continuous process: utterances are usually fluent, with natural transitions between phonemes, and the features of one pronunciation are similar across multiple time instants. During normal pronunciation, certain frequency characteristics remain relatively stable for a short period of time, and the tremor-removed speech spectra at those time instants reflect this stability, forming a subset.
When the frequency change value is larger than the frequency change threshold, the tremor-removed speech spectrum at time t belongs to the previous subset and the tremor-removed speech spectrum at time t+1 belongs to the next subset.
In this embodiment, the formula for calculating the frequency change value in S23 is:
μ_{t+1} = Σ_{i=1}^{N} d_{t+1,z,i},  d_{t+1,z,i} = min_{1≤j≤M} |f_{t+1,z,i} − f_{t,s,j}|,
wherein μ_{t+1} is the frequency change value at time t+1, f_{t+1,z,i} is the i-th newly added frequency at time t+1, f_{t,s,j} is the j-th frequency in the tremor-removed speech spectrum at time t, |·| denotes the absolute value, min denotes the minimum over j, M is the number of frequencies in the tremor-removed speech spectrum at time t, N is the number of newly added frequencies at time t+1, i and j are positive integers, and d_{t+1,z,i} is the minimum difference between the i-th newly added frequency at time t+1 and the frequencies in the tremor-removed speech spectrum at time t.
The invention calculates the difference between each newly added frequency and each frequency in the tremor-removed speech spectrum at time t, finds the minimum difference to determine the frequency offset, and combines it with the number of newly added frequencies to obtain the frequency change value. A larger frequency change value indicates that the pronunciation at time t+1 has changed significantly, so time t+1 is probably the pronunciation of the next word.
In the present invention, the frequency change threshold is a threshold set for the frequency change value; in this embodiment it is set to 0.5.
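Sub-steps S21-S24 can be sketched as below, assuming (consistently with the variable definitions in S23) that the frequency change value is the sum, over the newly added frequencies, of each one's minimum absolute difference to the time-t frequencies; the function names are illustrative.

```python
def frequency_change_value(freqs_t, freqs_t1):
    """S21-S23: union the two frequency sets, remove the time-t frequencies
    to get the newly added frequencies at time t+1, then sum each new
    frequency's minimum distance to the time-t frequencies."""
    new_freqs = (set(freqs_t1) | set(freqs_t)) - set(freqs_t)
    return sum(min(abs(f_new - f_old) for f_old in freqs_t)
               for f_new in new_freqs)

def is_transition_point(freqs_t, freqs_t1, threshold=0.5):
    # S24: time t+1 is a pronunciation transition point when the
    # frequency change value exceeds the threshold (0.5 in this embodiment).
    return frequency_change_value(freqs_t, freqs_t1) > threshold
```

For example, with time-t frequencies {100, 200} and time-(t+1) frequencies {100, 200, 210}, the only newly added frequency is 210, its minimum distance to the time-t frequencies is 10, and 10 > 0.5 marks a transition point.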
In this embodiment, S3 includes the following sub-steps:
S31, in each tremor-removed speech spectrum of the tremor-removed speech spectrum subset of a pronunciation, finding the frequency corresponding to the maximum amplitude as the main frequency, and taking the average of all main frequencies to obtain the peak frequency;
S32, in the tremor-removed speech spectrum subset of a pronunciation, taking the frequencies other than the main frequency in each tremor-removed speech spectrum as overtone frequencies;
S33, calculating the overtone centroid value of the overtone frequencies;
S34, calculating the overtone offset value from the differences between the overtone frequencies and the peak frequency;
S35, constructing a pronunciation vector with the peak frequency, the overtone centroid value and the overtone offset value as its elements.
In a sound signal, the frequency with the highest amplitude corresponds to the fundamental frequency, a main characteristic of the sound. The fundamental frequency determines the pitch of the sound and is a key factor in identifying pitch and timbre. Each tremor-removed speech spectrum has a main frequency corresponding to its maximum amplitude, so taking the average of all the main frequencies of a pronunciation yields the peak frequency, which reflects the main characteristics of that pronunciation.
In this embodiment, the formula for calculating the overtone centroid value in S33 is:
f_{o,c} = ( Σ_{i=1}^{L} f_{o,i} · A_{o,i} ) / ( Σ_{i=1}^{L} A_{o,i} ),
wherein f_{o,c} is the overtone centroid value, f_{o,i} is the i-th overtone frequency in the tremor-removed speech spectrum subset of the pronunciation, A_{o,i} is the amplitude of the i-th overtone frequency f_{o,i}, L is the number of overtone frequencies in the tremor-removed speech spectrum subset of the pronunciation, and i is a positive integer.
In this embodiment, the formula for calculating the overtone offset value in S34 is:
f_{o,d} = (1/L) · Σ_{i=1}^{L} |f_{o,i} − f_{peak}|,
wherein f_{o,d} is the overtone offset value, f_{o,i} is the i-th overtone frequency in the tremor-removed speech spectrum subset of the pronunciation, i is a positive integer, f_{peak} is the peak frequency, and L is the number of overtone frequencies in the tremor-removed speech spectrum subset of the pronunciation.
The invention groups the tremor-removed speech spectra belonging to the same pronunciation into a subset and, in each tremor-removed speech spectrum of the subset, finds the frequency corresponding to the maximum amplitude as the main frequency; averaging the main frequencies gives the peak frequency. Since the main frequency reflects the basic pronunciation characteristics, the peak frequency embodies the main characteristics of the pronunciation. The overtones reflect the subtle state of the pronunciation organs: the overtone centroid value of the overtone frequencies embodies the main frequency range of the overtones, and the overtone offset value embodies the difference between each overtone frequency and the peak frequency, further mining the distribution of the overtone frequencies relative to the peak frequency during pronunciation.
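The S31-S35 feature extraction can be sketched as follows; the function name and the (frequencies, amplitudes) pair representation of a spectrum are assumptions of the sketch, not the claimed implementation.

```python
import numpy as np

def pronunciation_vector(spectra):
    """spectra: list of (freqs, amps) array pairs, one tremor-removed
    spectrum per time instant of a single pronunciation."""
    # S31: main frequency = frequency of the maximum amplitude in each
    # spectrum; peak frequency = mean of the main frequencies.
    mains = [freqs[np.argmax(amps)] for freqs, amps in spectra]
    f_peak = float(np.mean(mains))
    # S32: every other frequency in each spectrum is an overtone.
    overtone_f, overtone_a = [], []
    for (freqs, amps), main in zip(spectra, mains):
        for f, a in zip(freqs, amps):
            if f != main:
                overtone_f.append(f)
                overtone_a.append(a)
    overtone_f = np.array(overtone_f)
    overtone_a = np.array(overtone_a)
    # S33: amplitude-weighted overtone centroid.
    f_oc = float((overtone_f * overtone_a).sum() / overtone_a.sum())
    # S34: mean absolute offset of the overtones from the peak frequency.
    f_od = float(np.abs(overtone_f - f_peak).mean())
    # S35: the three-element pronunciation vector.
    return [f_peak, f_oc, f_od]
```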
As shown in fig. 2, the multi-layer pronunciation vector processing model in S4 includes a plurality of first pronunciation processing layers, a plurality of second pronunciation processing layers, a first BiLSTM layer, a second BiLSTM layer, a feature integration layer and a CRF unit;
the input end of each first pronunciation processing layer is used for inputting a pronunciation vector, and the input end of each second pronunciation processing layer is used for inputting a pronunciation vector; the input end of the first BiLSTM layer is connected with the output ends of the plurality of first pronunciation processing layers, and the input end of the second BiLSTM layer is connected with the output ends of the plurality of second pronunciation processing layers; the input end of the feature integration layer is connected with the output end of the first BiLSTM layer and the output end of the second BiLSTM layer; the output end of the feature integration layer is connected with the input end of the CRF unit, and the output end of the CRF unit serves as the output end of the multi-layer pronunciation vector processing model.
In the present invention, the number of first pronunciation processing layers and the number of second pronunciation processing layers are each equal to the number of pronunciation vectors.
In this embodiment, the expression of the first pronunciation processing layer is:
y_1 = tanh( Σ_{n=1}^{3} (ω_{1,n} · x_n + b_{1,n}) ),
wherein y_1 is the output of the first pronunciation processing layer, tanh is the hyperbolic tangent activation function, x_n is the n-th element in the pronunciation vector, ω_{1,n} is the n-th weight in the first pronunciation processing layer, b_{1,n} is the n-th bias in the first pronunciation processing layer, and n is a positive integer ranging from 1 to 3.
The expression of the second pronunciation processing layer is:
y_2 = sigmoid( Σ_{n=1}^{3} (ω_{2,n} · x_n + b_{2,n}) ),
wherein y_2 is the output of the second pronunciation processing layer, sigmoid is the S-type activation function, ω_{2,n} is the n-th weight in the second pronunciation processing layer, and b_{2,n} is the n-th bias in the second pronunciation processing layer.
In this embodiment, the expression of the feature integration layer is:
H_m = ω_{1,m} · H_{1,m} + ω_{2,m} · H_{2,m},
wherein H_m is the m-th integrated feature output by the feature integration layer, H_{1,m} is the m-th feature output by the first BiLSTM layer, H_{2,m} is the m-th feature output by the second BiLSTM layer, ω_{1,m} is the m-th first weight in the feature integration layer, and ω_{2,m} is the m-th second weight in the feature integration layer.
According to the invention, two pronunciation processing layers process the same pronunciation vector, extracting characteristics of the voice signal at different levels and enhancing the model's ability to understand complex voice signals; this yields two pronunciation features for the same pronunciation vector. The first BiLSTM layer processes one pronunciation feature of each pronunciation vector and the second BiLSTM layer processes the other, so that each BiLSTM layer can better capture context. The feature integration layer then integrates the features of the first BiLSTM layer and the second BiLSTM layer, fusing the two channels and improving the accuracy of semantic recognition.
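The three layer expressions above can be sketched numerically as below. This only illustrates the formulas; the BiLSTM layers and the CRF unit are omitted, and the function names are illustrative.

```python
import numpy as np

def first_pronunciation_layer(x, w1, b1):
    # y1 = tanh( sum over n=1..3 of (w1[n] * x[n] + b1[n]) )
    return np.tanh(np.sum(w1 * x + b1))

def second_pronunciation_layer(x, w2, b2):
    # y2 = sigmoid( sum over n=1..3 of (w2[n] * x[n] + b2[n]) )
    z = np.sum(w2 * x + b2)
    return 1.0 / (1.0 + np.exp(-z))

def feature_integration(h1, h2, wa, wb):
    # H_m = wa[m] * h1[m] + wb[m] * h2[m], element-wise over features m,
    # where h1 and h2 are the feature vectors from the two BiLSTM layers.
    return wa * h1 + wb * h2
```

With all weights and biases zero, the first layer outputs tanh(0) = 0 and the second outputs sigmoid(0) = 0.5, which is a quick sanity check on the two activation functions.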
In this embodiment, the CRF unit may also be replaced with a Softmax layer or a CTC layer.
In the invention, all pronunciation vectors corresponding to the voice signal are converted into readable text through a multi-layer pronunciation vector processing model.
In this embodiment, the weights and biases in the multi-layer pronunciation vector processing model are trained using a standard gradient descent method.
The invention eliminates the tremor frequencies from the voice signal, preventing the tremor frequencies from affecting voice recognition and improving the accuracy of voice recognition.
In the tremor-removed speech spectrum set, the invention finds the segmentation point of each pronunciation according to the differences between tremor-removed speech spectra at adjacent time instants and divides the set accordingly, so that different pronunciation segments can be distinguished more accurately, avoiding the inaccurate feature extraction caused by processing the whole voice signal as one segment.
The invention extracts a peak frequency, an overtone centroid value and an overtone offset value from the tremor-removed speech spectrum subset of each pronunciation, reflecting the voice characteristics of that pronunciation.
The invention adopts a multi-layer pronunciation vector processing model to process the plurality of pronunciation vectors, synthesizes the pronunciation vectors corresponding to the voice signal, recognizes the semantics of the Parkinson's disease patient's speech, and improves the accuracy of voice recognition.
The above is only a preferred embodiment of the present invention and is not intended to limit the present invention; those skilled in the art can make various modifications and variations to the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention shall be included in the protection scope of the present invention.