CN112151052B - Speech enhancement method, device, computer equipment and storage medium - Google Patents
- Publication number
- CN112151052B CN112151052B CN202011153521.2A CN202011153521A CN112151052B CN 112151052 B CN112151052 B CN 112151052B CN 202011153521 A CN202011153521 A CN 202011153521A CN 112151052 B CN112151052 B CN 112151052B
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
Abstract
The invention discloses a voice enhancement method, device, computer equipment and storage medium, relating to the technical field of artificial intelligence. Its main aim is to automatically select, from a pre-constructed voice enhancement parameter set, the voice enhancement parameters matched to the surrounding environment, and to perform voice enhancement processing on the voice data to be recognized using those parameters so that voice recognition accuracy is maximized. The method comprises the following steps: acquiring voice data to be processed; extracting a first voice feature corresponding to the voice data, determining the target environment in which the voice data is located according to the first voice feature, and selecting target voice enhancement parameters corresponding to the target environment from a pre-constructed voice enhancement parameter set; and performing voice enhancement processing on the voice data according to the target voice enhancement parameters to obtain the enhanced voice data. The invention is mainly applicable to voice enhancement processing of voice data.
Description
Technical Field
The present invention relates to the field of artificial intelligence, and in particular, to a method and apparatus for enhancing speech, a computer device, and a storage medium.
Background
In recent years, with the rapid development and rise of intelligent wearable devices, voice-controlled consumer electronics have become the latest trend. Voice intelligence requires a highly reliable, highly accurate automatic voice recognition system as support, and front-end voice enhancement is the most critical technology among its components.
At present, when front-end voice enhancement technology is used to handle noise, the parameters of the voice enhancement module are usually adjusted according to the surrounding environment and expert experience so as to achieve a better voice recognition effect. However, adjusting the voice enhancement parameters according to expert experience can only adapt to the surrounding environment to a certain extent and improve voice recognition somewhat; it cannot guarantee that voice recognition accuracy is maximized.
Disclosure of Invention
The invention provides a voice enhancement method, device, computer equipment and storage medium that can automatically select the voice enhancement parameters matched to the surrounding environment from a pre-constructed voice enhancement parameter set. After voice enhancement processing is performed on the voice data to be recognized using these parameters, voice recognition accuracy is maximized, so that the optimal voice recognition effect can be achieved in any environment.
According to a first aspect of the present invention, there is provided a speech enhancement method comprising:
Acquiring voice data to be processed;
Extracting first voice characteristics corresponding to the voice data, determining a target environment in which the voice data is located according to the first voice characteristics, and selecting target voice enhancement parameters corresponding to the target environment from a pre-constructed voice enhancement parameter set, wherein the voice enhancement parameter set comprises voice enhancement parameters in different environments, and the voice enhancement parameters are used for enhancing voice recognition accuracy in different environments;
And carrying out voice enhancement processing on the voice data according to the target voice enhancement parameters to obtain voice data after the voice enhancement processing.
According to a second aspect of the present invention, there is provided a speech enhancement apparatus comprising:
an acquisition unit configured to acquire voice data to be processed;
A selecting unit, configured to extract a first voice feature corresponding to the voice data, determine a target environment where the voice data is located according to the first voice feature, and select a target voice enhancement parameter corresponding to the target environment from a pre-constructed voice enhancement parameter set, where the voice enhancement parameter set includes voice enhancement parameters in different environments, and the voice enhancement parameter is used to enhance voice recognition accuracy in different environments;
And the processing unit is used for carrying out voice enhancement processing on the voice data according to the target voice enhancement parameters to obtain voice data after the voice enhancement processing.
According to a third aspect of the present invention, there is provided a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
Acquiring voice data to be processed;
Extracting first voice characteristics corresponding to the voice data, determining a target environment in which the voice data is located according to the first voice characteristics, and selecting target voice enhancement parameters corresponding to the target environment from a pre-constructed voice enhancement parameter set, wherein the voice enhancement parameter set comprises voice enhancement parameters in different environments, and the voice enhancement parameters are used for enhancing voice recognition accuracy in different environments;
And carrying out voice enhancement processing on the voice data according to the target voice enhancement parameters to obtain voice data after the voice enhancement processing.
According to a fourth aspect of the present invention there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of:
Acquiring voice data to be processed;
Extracting first voice characteristics corresponding to the voice data, determining a target environment in which the voice data is located according to the first voice characteristics, and selecting target voice enhancement parameters corresponding to the target environment from a pre-constructed voice enhancement parameter set, wherein the voice enhancement parameter set comprises voice enhancement parameters in different environments, and the voice enhancement parameters are used for enhancing voice recognition accuracy in different environments;
And carrying out voice enhancement processing on the voice data according to the target voice enhancement parameters to obtain voice data after the voice enhancement processing.
Compared with the current mode of adjusting the parameters of the voice enhancement module according to expert experience, the voice enhancement method, device, computer equipment and storage medium provided by the invention can acquire the voice data to be processed; extract the first voice feature corresponding to the voice data, determine the target environment in which the voice data is located according to the first voice feature, and select the target voice enhancement parameters corresponding to the target environment from a pre-constructed voice enhancement parameter set, where the voice enhancement parameter set comprises voice enhancement parameters for different environments and the voice enhancement parameters are used to maximize voice recognition accuracy in those environments; and then perform voice enhancement processing on the voice data according to the target voice enhancement parameters to obtain the enhanced voice data. By determining the target environment in which the voice data to be processed is located, the target voice enhancement parameters corresponding to that environment are automatically selected from the voice enhancement parameter set and used to perform voice enhancement processing on the voice data, which improves the voice enhancement effect in the target environment and ensures that voice recognition accuracy in the target environment is maximized.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 shows a flowchart of a method for speech enhancement provided by an embodiment of the present invention;
FIG. 2 is a flow chart of another speech enhancement method provided by an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a voice enhancement device according to an embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating another speech enhancement apparatus according to an embodiment of the present invention;
Fig. 5 shows a schematic physical structure of a computer device according to an embodiment of the present invention.
Detailed Description
The application will be described in detail hereinafter with reference to the drawings in conjunction with embodiments. It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other.
At present, when front-end voice enhancement technology is used to handle noise, the parameters of the voice enhancement module are usually adjusted according to the surrounding environment and expert experience so as to achieve a better voice recognition effect. However, adjusting the voice enhancement parameters according to expert experience can only adapt to the surrounding environment to a certain extent and improve voice recognition somewhat; it cannot guarantee that voice recognition accuracy is maximized.
In order to solve the above problem, an embodiment of the present invention provides a voice enhancement method, as shown in fig. 1, including:
101. Acquire the voice data to be processed.
The voice data to be processed may be a voice sequence collected in different environments, for example, a voice sequence of a user collected beside a street or inside a factory. The embodiment of the invention is applicable to voice enhancement processing of voice data. The executing body of the embodiment is a device or apparatus capable of performing voice enhancement processing on voice data, which may be arranged on the client side or on the server side.
Specifically, a segment of voice data of a user in a certain scene is obtained. Before voice enhancement processing, the voice data needs to be preprocessed, specifically including pre-emphasis, framing and windowing, to obtain the preprocessed voice data. The target environment in which the preprocessed voice data is located is then determined, and voice enhancement processing is performed based on that target environment.
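The preprocessing steps above (pre-emphasis, framing, windowing) can be sketched as follows. This is a minimal illustration, not the patented implementation; the frame length, hop size and pre-emphasis coefficient are typical values assumed for the example.

```python
import numpy as np

def preprocess(signal, frame_len=400, hop=160, alpha=0.97):
    """Pre-emphasis, framing and Hamming windowing of a 1-D signal.

    frame_len/hop correspond to 25 ms / 10 ms at 16 kHz; alpha is a
    typical pre-emphasis coefficient (assumed, not from the patent).
    """
    # Pre-emphasis: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Split into overlapping frames
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    # Apply a Hamming window to each frame
    return frames * np.hamming(frame_len)
```

For a one-second signal at 16 kHz these settings yield 98 windowed frames of 400 samples each.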
102. Extract the first voice feature corresponding to the voice data, determine the target environment in which the voice data is located according to the first voice feature, and select the target voice enhancement parameters corresponding to the target environment from a pre-constructed voice enhancement parameter set.
The voice enhancement parameter set comprises voice enhancement parameters for different environments, and the voice enhancement parameters are used to maximize voice recognition accuracy in those environments. For the embodiment of the invention, sample voice data collected in different environments are stored in a preset sample library. In order to determine the environment of each sample, the sample voice data are clustered, yielding sample voice data grouped by environment. The voice enhancement model is then trained with the sample voice data of each environment; that is, the initial voice enhancement parameters in the voice enhancement model are optimized and adjusted until, when the enhanced sample voice data are input into a pre-constructed voice recognition model for voice recognition, the voice recognition accuracy reaches its maximum. In this way the voice enhancement parameters for different environments can be obtained and a voice enhancement parameter set constructed. When voice data are captured in a certain environment, voice enhancement processing is performed on them using the voice enhancement parameters corresponding to that environment, and the enhanced voice data are input into the pre-constructed voice recognition model, so that the voice recognition accuracy of the voice data is maximized.
For the embodiment of the invention, before voice enhancement processing is performed on the voice data, the target environment in which the voice data to be processed is located needs to be determined. Specifically, a first voice feature corresponding to the voice data to be processed is extracted, and second voice features corresponding to the sample voice data under the different cluster categories (different environments) are extracted. Then, according to the second voice features, a feature center corresponding to the sample voice data of each cluster category is calculated. Because voice features of voice data collected in the same environment are relatively similar, the distances between the first voice feature and the different feature centers are calculated to determine the cluster category to which the voice data to be processed belongs, from which the target environment in which the voice data is located can be determined.
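The environment-matching step above can be sketched as a nearest-feature-center lookup. The environment names and the three-dimensional feature vectors below are hypothetical placeholders; a real feature would be, e.g., a mel-cepstral vector.

```python
import numpy as np

def nearest_environment(first_feature, feature_centers):
    """Return the environment whose feature center is closest (by
    Euclidean distance) to the first voice feature."""
    distances = {env: np.linalg.norm(np.asarray(first_feature) - np.asarray(center))
                 for env, center in feature_centers.items()}
    return min(distances, key=distances.get)

# Hypothetical feature centers for three clustered environments
centers = {
    "roadside": np.array([0.2, 1.1, -0.5]),
    "factory":  np.array([2.3, -0.4, 0.9]),
    "airport":  np.array([-1.0, 0.7, 1.8]),
}
env = nearest_environment([2.0, -0.2, 1.0], centers)  # "factory" is nearest here
```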
Further, the target voice enhancement parameters corresponding to the target environment are selected from the pre-constructed voice enhancement parameter set, voice enhancement processing is performed on the voice data using those parameters, and the enhanced voice data are input into a pre-constructed voice recognition model for voice recognition, so that the voice recognition accuracy of the voice data is maximized. Thus, the target environment in which the voice data is located can be determined from the voice features of the voice data to be processed, the voice enhancement parameters corresponding to that environment are automatically selected from the voice enhancement parameter set, and voice enhancement processing is performed, improving the voice enhancement effect while ensuring that the recognition accuracy of the enhanced voice data is maximized.
103. Perform voice enhancement processing on the voice data according to the target voice enhancement parameters to obtain the enhanced voice data.
For the embodiment of the invention, voice enhancement processing mainly refers to noise reduction of voice noise in the voice data to be processed. In the voice enhancement process, an LMS adaptive filter noise reduction algorithm may be adopted. Specifically, first, a voice activity detection (VAD) algorithm is used to remove silence from the voice signal, giving a suitable voice spectral feature sequence X = (x_1, x_2, …, x_n). A multichannel wiener filtering operation is then carried out: beamforming is performed to obtain Y = (y_1, y_2, …, y_n); power spectral density (PSD) estimation is used to reduce residual noise components and obtain the wiener filter input components Φ_X(ω, τ) and Φ_V(ω, τ); the post-filter gain G_Wiener(ω, τ) is then obtained by the wiener filter calculation; and post-filtering yields the output signal Z(ω, τ) = G_Wiener(ω, τ) × Y. After signal compression or expansion processing, the enhanced voice data are obtained, which can be adapted to the input form of the voice recognition model.
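The wiener post-filter step, Z(ω, τ) = G_Wiener(ω, τ) × Y with G_Wiener = Φ_X / (Φ_X + Φ_V), can be sketched per time-frequency bin as follows. The PSD values are assumed inputs here; in practice they come from the beamformer's PSD estimation.

```python
import numpy as np

def wiener_gain(phi_x, phi_v, floor=1e-10):
    """Wiener post-filter gain G = Phi_X / (Phi_X + Phi_V) per (omega, tau) bin.

    phi_x: estimated speech PSD; phi_v: estimated residual-noise PSD.
    The floor guards against division by zero in silent bins.
    """
    phi_x = np.asarray(phi_x, dtype=float)
    phi_v = np.asarray(phi_v, dtype=float)
    return phi_x / np.maximum(phi_x + phi_v, floor)

def wiener_filter(beamformer_output, phi_x, phi_v):
    """Apply the gain to the beamformer output Y to get Z = G * Y."""
    return wiener_gain(phi_x, phi_v) * np.asarray(beamformer_output)
```

With Φ_X = 3 and Φ_V = 1 the gain is 0.75, so a bin value of 2.0 is attenuated to 1.5.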
Compared with the current mode of adjusting the parameters of the voice enhancement module according to expert experience, the voice enhancement method provided by the embodiment of the invention can acquire the voice data to be processed; extract the first voice feature corresponding to the voice data, determine the target environment in which the voice data is located according to the first voice feature, and select the target voice enhancement parameters corresponding to the target environment from a pre-constructed voice enhancement parameter set, where the voice enhancement parameter set comprises voice enhancement parameters for different environments and the voice enhancement parameters are used to maximize voice recognition accuracy in those environments; and then perform voice enhancement processing on the voice data according to the target voice enhancement parameters to obtain the enhanced voice data. By determining the target environment in which the voice data to be processed is located, the target voice enhancement parameters corresponding to that environment are automatically selected from the voice enhancement parameter set and used to perform voice enhancement processing on the voice data, which improves the voice enhancement effect in the target environment and ensures that voice recognition accuracy in the target environment is maximized.
Further, in order to better illustrate the above process of performing speech enhancement processing on speech data, as a refinement and extension of the above embodiment, another speech enhancement method is provided in an embodiment of the present invention, as shown in fig. 2, where the method includes:
201. Acquire the voice data to be processed.
For the embodiment of the present invention, in order to automatically select the voice enhancement parameters matched to the environment in which the voice data to be processed is located, so that voice recognition accuracy is maximized, the voice enhancement parameters for different environments need to be constructed in advance. Based on this, the method includes: performing voice enhancement processing on the sample voice data of different environments using the initial voice enhancement parameters to obtain enhanced sample voice data for each environment; constructing a voice recognition accuracy function for each environment according to the sample voice data; and optimizing and adjusting the initial voice enhancement parameters according to the accuracy function to obtain the voice enhancement parameters for each environment, and constructing the voice enhancement parameter set from them. Further, constructing the voice recognition accuracy function for different environments according to the sample voice data includes: performing voice recognition on the enhanced sample voice data using a pre-constructed voice recognition model to obtain voice recognition results for each environment; and constructing the voice recognition accuracy function for each environment according to those results. The pre-constructed voice recognition model may specifically be a neural network voice recognition model.
For example, initial voice enhancement parameters are first given; voice enhancement processing is then performed on sample voice data from a factory environment using these initial parameters, obtaining enhanced sample voice data for the factory environment. The enhanced sample voice data are input into a pre-constructed voice recognition model for voice recognition, giving the voice recognition result for the factory environment, from which a voice recognition accuracy function for that environment is constructed. The function is then solved for the parameters under which voice recognition accuracy is highest; when searching for this optimal solution, a genetic algorithm may be used to find the voice enhancement parameters for each environment. The specific formula is as follows:
θ_i = argmax_θ T(θ)
where T(θ) is the voice recognition accuracy in the factory environment and θ_i is the voice enhancement parameter for the factory environment. By continuously optimizing and adjusting the initial voice enhancement parameters, the voice enhancement parameter θ_i that maximizes voice recognition accuracy in the factory environment can be obtained. The voice enhancement parameters for other environments can be obtained in the same way, and a voice enhancement parameter set {θ_i} is constructed under which voice recognition accuracy in each environment is maximized.
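The search θ_i = argmax_θ T(θ) can be sketched with a toy genetic-style search over a scalar parameter. The quadratic fitness function below is only a stand-in for the real recognition-accuracy function T(θ), which in the patent is evaluated by running the recognizer on enhanced sample data; population size, mutation scale and generation count are assumed values.

```python
import random

def genetic_search(fitness, initial_population, generations=30, sigma=0.1, seed=0):
    """Toy genetic search: keep the fitter half, mutate it, repeat.

    `fitness` plays the role of T(theta); the return value is theta_i.
    """
    rng = random.Random(seed)
    population = list(initial_population)
    for _ in range(generations):
        # Selection: keep the fitter half as parents
        population.sort(key=fitness, reverse=True)
        parents = population[: max(2, len(population) // 2)]
        # Mutation: Gaussian perturbation of each parent
        children = [p + rng.gauss(0.0, sigma) for p in parents]
        population = parents + children
    return max(population, key=fitness)

# Stand-in accuracy function peaking at theta = 1.5
best_theta = genetic_search(lambda t: -(t - 1.5) ** 2, [0.0, 2.0, 5.0])
```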
For the embodiment of the invention, after the voice enhancement parameter set is constructed, the voice data to be processed can be obtained, and the corresponding voice enhancement parameters are selected from the voice enhancement parameter set to carry out voice enhancement processing by determining the target environment where the voice data to be processed is located.
202. Extract the first voice feature corresponding to the voice data, determine the target environment in which the voice data is located according to the first voice feature, and select the target voice enhancement parameters corresponding to the target environment from a pre-constructed voice enhancement parameter set.
The voice enhancement parameter set comprises voice enhancement parameters for different environments, and the voice enhancement parameters are used to maximize voice recognition accuracy in those environments. For the embodiment of the present invention, in order to determine the target environment in which the voice data to be processed is located, step 202 specifically includes: acquiring sample voice data for different environments and extracting the second voice features corresponding to the sample voice data; calculating, according to the second voice features, the feature centers corresponding to the sample voice data of each environment; and determining the target environment in which the voice data is located according to the feature centers and the first voice feature. Further, determining the target environment according to the feature centers and the first voice feature includes: calculating the Euclidean distances between the first voice feature and the different feature centers using a preset Euclidean distance algorithm; and screening out the minimum Euclidean distance and determining the environment of the sample voice data corresponding to it as the target environment. When extracting the voice features corresponding to the voice data to be processed and the sample voice data, a preset mel-cepstrum algorithm may be adopted: the mel-cepstral coefficients corresponding to the voice data to be processed and to the sample voice data are calculated and determined to be their respective voice features.
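The mel-cepstral feature extraction mentioned above can be sketched as: power spectrum → triangular mel filterbank → log → DCT. This is a simplified illustration (no liftering or delta features), with typical filter and coefficient counts assumed rather than taken from the patent.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters with centers equally spaced on the mel scale."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):           # rising slope
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):          # falling slope
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc(frame, sr=16000, n_filters=26, n_ceps=13):
    """Mel-cepstral coefficients of one windowed frame (simplified)."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2
    energies = mel_filterbank(n_filters, n_fft, sr) @ power
    log_e = np.log(energies + 1e-10)
    # DCT-II of the log filterbank energies gives the cepstral coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return dct @ log_e
```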
For example, suppose the feature center corresponding to roadside sample voice data is A, the feature center for factory sample voice data is B, and the feature center for airport sample voice data is C. Since voice features of voice data from the same environment are similar, the Euclidean distances between the first voice feature of the voice data to be processed and the feature centers A, B and C are calculated respectively, and the minimum is screened out. If the Euclidean distance between feature center B and the first voice feature is minimal, the voice data to be processed is judged similar to the factory sample voice data, and the voice data is therefore determined to be in a factory environment. The target environment of any voice data to be processed can be determined in this way.
203. Perform voice enhancement processing on the voice data according to the target voice enhancement parameters to obtain the enhanced voice data.
For the embodiment of the present invention, in order to perform voice enhancement processing on the voice data, step 203 specifically includes: performing filtering noise reduction processing on the voice data according to the target filtering noise reduction parameters to obtain the noise-reduced voice data. The method of performing noise reduction using the target filtering noise reduction parameters is the same as in step 103 and is not repeated here.
204. Extract features from the enhanced voice data to obtain a third voice feature corresponding to the voice data, and determine the voice recognition result corresponding to the voice data according to the third voice feature.
For this embodiment, after voice enhancement processing is performed on the voice data, voice recognition further needs to be performed on the enhanced voice data. The voice recognition model may specifically be a neural network voice recognition model: the enhanced voice data are input into the model, the hidden layers extract the third voice feature corresponding to the voice data, and voice recognition is performed according to the third voice feature to obtain the voice recognition result corresponding to the voice data, whose accuracy can reach its maximum.
Compared with the current mode of adjusting the parameters of the voice enhancement module according to expert experience, the voice enhancement method provided by the embodiment of the invention can acquire the voice data to be processed; extract the first voice feature corresponding to the voice data, determine the target environment in which the voice data is located according to the first voice feature, and select the target voice enhancement parameters corresponding to the target environment from a pre-constructed voice enhancement parameter set, where the voice enhancement parameter set comprises voice enhancement parameters for different environments and the voice enhancement parameters are used to maximize voice recognition accuracy in those environments; and then perform voice enhancement processing on the voice data according to the target voice enhancement parameters to obtain the enhanced voice data. By determining the target environment in which the voice data to be processed is located, the target voice enhancement parameters corresponding to that environment are automatically selected from the voice enhancement parameter set and used to perform voice enhancement processing on the voice data, which improves the voice enhancement effect in the target environment and ensures that voice recognition accuracy in the target environment is maximized.
Further, as a specific implementation of fig. 1, an embodiment of the present invention provides a voice enhancement device, as shown in fig. 3, where the device includes: an acquisition unit 31, a selection unit 32 and a processing unit 33.
The acquiring unit 31 may be configured to acquire the voice data to be processed; it is the module of the apparatus responsible for this acquisition step.

The selecting unit 32 may be configured to extract a first voice feature corresponding to the voice data, determine the target environment in which the voice data is located according to the first voice feature, and select a target voice enhancement parameter corresponding to the target environment from a pre-constructed voice enhancement parameter set, where the set comprises voice enhancement parameters for different environments, used to maximize voice recognition accuracy in those environments. The selecting unit 32 is the core module of the apparatus, responsible for feature extraction, environment determination, and parameter selection.

The processing unit 33 may be configured to perform voice enhancement processing on the voice data according to the target voice enhancement parameter, obtaining the enhanced voice data; it is the module of the apparatus responsible for this enhancement step.
Further, in order to determine the target environment in which the voice data is located, as shown in fig. 4, the selecting unit 32 includes an extracting module 321, a calculating module 322, and a determining module 323.
The extracting module 321 may be configured to obtain sample voice data under different environments, and extract a second voice feature corresponding to the sample voice data.
The calculating module 322 may be configured to calculate, according to the second voice feature, a feature center corresponding to the sample voice data in the different environments.
The determining module 323 may be configured to determine, according to the feature center and the first voice feature, a target environment in which the voice data is located.
Further, to determine the target environment in which the voice data is located, the determining module 323 includes: a calculation sub-module and a determination sub-module.
The calculation submodule can be used for calculating Euclidean distances between the first voice feature and different feature centers by using a preset Euclidean distance algorithm.
The determining submodule is used for screening out the minimum Euclidean distance from the calculated Euclidean distances and determining the environment where the sample voice data corresponding to the minimum Euclidean distance is located as the target environment.
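The environment-determination steps above (feature centers computed from sample data, then nearest-center assignment by Euclidean distance) can be sketched as follows; the environment names and feature values are invented for illustration.

```python
import math

def feature_center(features):
    """Mean vector of a list of equal-length feature vectors (one center per environment)."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical second voice features extracted from sample data per environment.
sample_features = {
    "office": [[0.1, 0.2], [0.2, 0.1]],
    "street": [[0.9, 0.8], [0.8, 0.9]],
}
centers = {env: feature_center(f) for env, f in sample_features.items()}

# First voice feature of the data to be processed; the environment whose
# center lies at the minimum Euclidean distance is the target environment.
first_feature = [0.85, 0.9]
target_env = min(centers, key=lambda env: euclidean(first_feature, centers[env]))
print(target_env)  # street
```

Screening out the minimum distance is just the `min` over centers; in a real system the features would be acoustic descriptors (e.g. spectral statistics) rather than two-dimensional toy vectors.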
Further, to construct the speech enhancement parameter set, the apparatus further comprises: a construction unit 34.
The processing unit 33 may be further configured to perform a speech enhancement process on the sample speech data under different environments by using the initial speech enhancement parameters, so as to obtain the speech-enhanced sample speech data under different environments.
The construction unit 34 may be configured to construct a speech recognition accuracy function under different environments according to the sample speech data.
The construction unit 34 may be further configured to perform optimization adjustment on the initial speech enhancement parameters according to the accuracy function, obtain speech enhancement parameters in different environments, and construct the speech enhancement parameter set based on the speech enhancement parameters in different environments.
Further, to construct the speech recognition accuracy function under different circumstances, the construction unit 34 includes: an identification module 341 and a construction module 342.
The recognition module 341 may be configured to perform speech recognition on the speech enhancement processed sample speech data by using a pre-constructed speech recognition model, so as to obtain speech recognition results in different environments.
The construction module 342 may be configured to construct a function of accuracy of speech recognition under different environments according to the speech recognition results under the different environments.
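A minimal sketch of the parameter-set construction described above: for each environment, candidate enhancement parameters are scored by a recognition-accuracy function and the best-scoring one is kept. The accuracy function here is a synthetic stand-in with a made-up per-environment optimum, not a real recognition model, and the simple grid search stands in for whatever optimization adjustment the embodiment uses.

```python
# Hedged sketch: build the per-environment voice enhancement parameter set by
# maximizing a (here synthetic) speech-recognition accuracy function.

def accuracy(env, param):
    """Hypothetical accuracy function; each environment has a made-up optimum."""
    optimum = {"office": 0.3, "street": 0.7}[env]
    return 1.0 - abs(param - optimum)  # peaks at the per-environment optimum

# Candidate enhancement parameters to evaluate (initial parameter plus variants).
candidates = [0.1, 0.3, 0.5, 0.7, 0.9]

# The constructed parameter set: best candidate per environment.
param_set = {
    env: max(candidates, key=lambda p: accuracy(env, p))
    for env in ("office", "street")
}
print(param_set)  # {'office': 0.3, 'street': 0.7}
```

At run time, the target environment determined for the incoming voice data indexes into `param_set` to obtain the target voice enhancement parameter.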
Further, for speech recognition of speech data, the apparatus further comprises: an extraction unit 35 and a determination unit 36.
The extracting unit 35 may be configured to perform feature extraction on the voice data after the voice enhancement processing to obtain a third voice feature corresponding to the voice data.
The determining unit 36 may be configured to determine a speech recognition result corresponding to the speech data according to the third speech feature.
Further, when the target voice enhancement parameter is a target filtering noise-reduction parameter, the processing unit 33 may specifically be configured to perform filtering noise-reduction processing on the voice data according to that parameter, so as to obtain the noise-reduced voice data.
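As an illustration of the filtering noise-reduction step, the sketch below assumes the target filtering noise-reduction parameter is the window length of a moving-average low-pass filter; the patent does not specify a particular filter, so this choice is purely illustrative.

```python
# Illustrative filtering noise reduction: moving-average low-pass filter whose
# window length plays the role of the target noise-reduction parameter.

def filter_denoise(voice_data, window):
    """Smooth the signal with a centered moving average of the given window length."""
    half = window // 2
    out = []
    for i in range(len(voice_data)):
        lo, hi = max(0, i - half), min(len(voice_data), i + half + 1)
        out.append(sum(voice_data[lo:hi]) / (hi - lo))
    return out

noisy = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]  # toy signal with alternating "noise"
denoised = filter_denoise(noisy, 3)      # window length = target parameter
print(denoised)
```

A larger window suppresses more high-frequency noise at the cost of smearing the speech envelope, which is why the parameter is tuned per environment.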
It should be noted that, for other corresponding descriptions of each functional module related to the voice enhancement device provided by the embodiment of the present invention, reference may be made to corresponding descriptions of the method shown in fig. 1, which are not repeated herein.
Based on the method shown in fig. 1, an embodiment of the present invention correspondingly further provides a computer-readable storage medium on which a computer program is stored, and the program, when executed by a processor, implements the following steps: acquiring voice data to be processed; extracting a first voice feature corresponding to the voice data, determining the target environment in which the voice data is located according to the first voice feature, and selecting a target voice enhancement parameter corresponding to the target environment from a pre-constructed voice enhancement parameter set, where the set comprises voice enhancement parameters for different environments that are used to maximize voice recognition accuracy in those environments; and performing voice enhancement processing on the voice data according to the target voice enhancement parameter to obtain the enhanced voice data.

Based on the embodiments of the method shown in fig. 1 and the device shown in fig. 3, an embodiment of the invention further provides a physical structure diagram of a computer device, as shown in fig. 5. The computer device includes a processor 41, a memory 42, and a computer program stored on the memory 42 and executable on the processor, where the memory 42 and the processor 41 are both connected to a bus 43; when the program is executed, the processor 41 performs the following steps: acquiring voice data to be processed; extracting a first voice feature corresponding to the voice data, determining the target environment in which the voice data is located according to the first voice feature, and selecting a target voice enhancement parameter corresponding to the target environment from a pre-constructed voice enhancement parameter set, where the set comprises voice enhancement parameters for different environments that are used to maximize voice recognition accuracy in those environments; and performing voice enhancement processing on the voice data according to the target voice enhancement parameter to obtain the enhanced voice data.
According to the technical scheme, the method and the device acquire the voice data to be processed; extract a first voice feature corresponding to the voice data, determine the target environment in which the voice data is located according to the first voice feature, and select a target voice enhancement parameter corresponding to the target environment from a pre-constructed voice enhancement parameter set, where the set comprises voice enhancement parameters for different environments that are used to maximize voice recognition accuracy in those environments; and then perform voice enhancement processing on the voice data according to the target voice enhancement parameter to obtain the enhanced voice data. By determining the target environment of the voice data to be processed, the target voice enhancement parameter corresponding to that environment is selected automatically from the parameter set, improving the voice enhancement effect in the target environment and maximizing the accuracy of voice recognition there.
It will be appreciated by those skilled in the art that the modules or steps of the invention described above may be implemented on a general-purpose computing device; they may be concentrated on a single computing device or distributed across a network of computing devices. Optionally, they may be implemented in program code executable by computing devices, so that they may be stored in a storage device and executed by the computing devices; in some cases, the steps shown or described may be performed in a different order than presented here. They may also be fabricated separately as individual integrated-circuit modules, or multiple modules or steps among them may be fabricated into a single integrated-circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description covers only preferred embodiments of the present invention and is not intended to limit it; those skilled in the art may make various modifications and variations. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included in its scope of protection.
Claims (8)
1. A method of speech enhancement, comprising:
acquiring voice data to be processed;

extracting a first voice feature corresponding to the voice data, determining a target environment in which the voice data is located according to the first voice feature, and selecting a target voice enhancement parameter corresponding to the target environment from a pre-constructed voice enhancement parameter set, wherein the voice enhancement parameter set comprises voice enhancement parameters in different environments, and the voice enhancement parameters are used for enhancing voice recognition accuracy in different environments and are parameters in a voice enhancement model;

performing voice enhancement processing on the voice data according to the target voice enhancement parameter to obtain voice data after the voice enhancement processing;

wherein the determining, according to the first voice feature, a target environment in which the voice data is located comprises:

acquiring sample voice data under different environments, and extracting second voice features corresponding to the sample voice data;

calculating feature centers corresponding to the sample voice data under different environments according to the second voice features;

determining a target environment in which the voice data is located according to the feature center and the first voice feature;

and the determining, according to the feature center and the first voice feature, a target environment in which the voice data is located comprises:

calculating Euclidean distances between the first voice feature and different feature centers by using a preset Euclidean distance algorithm;

and screening out the minimum Euclidean distance from the calculated Euclidean distances, and determining the environment where the sample voice data corresponding to the minimum Euclidean distance is located as the target environment.
2. The method according to claim 1, characterized in that before the acquisition of the speech data to be processed, the method comprises:
performing voice enhancement processing on the sample voice data under different environments by using the initial voice enhancement parameters to obtain the sample voice data after the voice enhancement processing under different environments;
constructing a voice recognition accuracy function under different environments according to the sample voice data;
and optimizing and adjusting the initial voice enhancement parameters according to the accuracy function to obtain voice enhancement parameters in different environments, and constructing the voice enhancement parameter set based on the voice enhancement parameters in different environments.
3. The method of claim 2, wherein constructing a speech recognition accuracy function under different circumstances from the sample speech data comprises:
Performing voice recognition on the sample voice data subjected to the voice enhancement processing by utilizing a pre-constructed voice recognition model to obtain voice recognition results under different environments;
And constructing a voice recognition accuracy function under different environments according to the voice recognition results under different environments.
4. The method of claim 1, wherein after performing speech enhancement processing on the speech data according to the target speech enhancement parameter to obtain speech data after the speech enhancement processing, the method further comprises:
Extracting features of the voice data after the voice enhancement processing to obtain a third voice feature corresponding to the voice data;
And determining a voice recognition result corresponding to the voice data according to the third voice characteristic.
5. The method according to any one of claims 1 to 4, wherein the target speech enhancement parameter is a target filtering noise reduction parameter, and the performing speech enhancement processing on the speech data according to the target speech enhancement parameter to obtain speech data after the speech enhancement processing includes:
And carrying out filtering noise reduction processing on the voice data according to the target filtering noise reduction parameters to obtain noise-reduced voice data.
6. A speech enhancement apparatus, comprising:
an acquisition unit configured to acquire voice data to be processed;
A selecting unit, configured to extract a first voice feature corresponding to the voice data, determine a target environment where the voice data is located according to the first voice feature, and select a target voice enhancement parameter corresponding to the target environment from a pre-constructed voice enhancement parameter set, where the voice enhancement parameter set includes voice enhancement parameters in different environments, and the voice enhancement parameter is used to enhance voice recognition accuracy in different environments, where the voice enhancement parameter is a parameter in a voice enhancement model;
the processing unit is used for carrying out voice enhancement processing on the voice data according to the target voice enhancement parameters to obtain voice data after the voice enhancement processing;
The selecting unit is specifically configured to obtain sample voice data in different environments, and extract a second voice feature corresponding to the sample voice data; calculating feature centers corresponding to the sample voice data under different environments according to the second voice features; determining a target environment in which the voice data are located according to the feature center and the first voice feature;
the selecting unit is used for calculating Euclidean distances between the first voice feature and different feature centers by using a preset Euclidean distance algorithm; and screening the minimum Euclidean distance from the calculated Euclidean distance, and determining the environment where the sample voice data corresponding to the minimum Euclidean distance is located as the target environment.
7. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 5.
8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the computer program when executed by the processor implements the steps of the method according to any one of claims 1 to 5.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011153521.2A CN112151052B (en) | 2020-10-26 | 2020-10-26 | Speech enhancement method, device, computer equipment and storage medium |
PCT/CN2020/136364 WO2021189979A1 (en) | 2020-10-26 | 2020-12-15 | Speech enhancement method and apparatus, computer device, and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112151052A CN112151052A (en) | 2020-12-29 |
CN112151052B true CN112151052B (en) | 2024-06-25 |
Family
ID=73955013
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113539262B (en) * | 2021-07-09 | 2023-08-22 | 广东金鸿星智能科技有限公司 | Sound enhancement and recording method and system for voice control of electric door |
CN114512136B (en) * | 2022-03-18 | 2023-09-26 | 北京百度网讯科技有限公司 | Model training method, audio processing method, device, equipment, storage medium and program |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104575509A (en) * | 2014-12-29 | 2015-04-29 | 乐视致新电子科技(天津)有限公司 | Voice enhancement processing method and device |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8082148B2 (en) * | 2008-04-24 | 2011-12-20 | Nuance Communications, Inc. | Testing a grammar used in speech recognition for reliability in a plurality of operating environments having different background noise |
CN101593522B (en) * | 2009-07-08 | 2011-09-14 | 清华大学 | Method and equipment for full frequency domain digital hearing aid |
CN101710490B (en) * | 2009-11-20 | 2012-01-04 | 安徽科大讯飞信息科技股份有限公司 | Method and device for compensating noise for voice assessment |
JP5731929B2 (en) * | 2011-08-08 | 2015-06-10 | 日本電信電話株式会社 | Speech enhancement device, method and program thereof |
CN103456305B (en) * | 2013-09-16 | 2016-03-09 | 东莞宇龙通信科技有限公司 | Terminal and the method for speech processing based on multiple sound collection unit |
KR20190037867A (en) * | 2017-09-29 | 2019-04-08 | 주식회사 케이티 | Device, method and computer program for removing noise from noisy speech data |
CN111698629B (en) * | 2019-03-15 | 2021-10-15 | 北京小鸟听听科技有限公司 | Calibration method and apparatus for audio playback device, and computer storage medium |
CN110473568B (en) * | 2019-08-08 | 2022-01-07 | Oppo广东移动通信有限公司 | Scene recognition method and device, storage medium and electronic equipment |
CN110503974B (en) * | 2019-08-29 | 2022-02-22 | 泰康保险集团股份有限公司 | Confrontation voice recognition method, device, equipment and computer readable storage medium |
CN110648680B (en) * | 2019-09-23 | 2024-05-14 | 腾讯科技(深圳)有限公司 | Voice data processing method and device, electronic equipment and readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110600017B (en) | Training method of voice processing model, voice recognition method, system and device | |
US12067989B2 (en) | Combined learning method and apparatus using deepening neural network based feature enhancement and modified loss function for speaker recognition robust to noisy environments | |
CN109326299B (en) | Speech enhancement method, device and storage medium based on full convolution neural network | |
EP3301675B1 (en) | Parameter prediction device and parameter prediction method for acoustic signal processing | |
CN108922544B (en) | Universal vector training method, voice clustering method, device, equipment and medium | |
CN111798860B (en) | Audio signal processing method, device, equipment and storage medium | |
CN108735199B (en) | Self-adaptive training method and system of acoustic model | |
CN105976812A (en) | Voice identification method and equipment thereof | |
CN112201270B (en) | Voice noise processing method and device, computer equipment and storage medium | |
CN112151052B (en) | Speech enhancement method, device, computer equipment and storage medium | |
CN110299143B (en) | Apparatus for recognizing a speaker and method thereof | |
KR20170101500A (en) | Method and apparatus for identifying audio signal using noise rejection | |
CN118368561A (en) | Bluetooth headset noise reduction processing method, device, equipment and storage medium | |
CN115602165A (en) | Digital staff intelligent system based on financial system | |
CN117198311A (en) | Voice control method and device based on voice noise reduction | |
CN113870863A (en) | Voiceprint recognition method and device, storage medium and electronic equipment | |
CN112420056A (en) | Speaker identity authentication method and system based on variational self-encoder and unmanned aerial vehicle | |
CN119889350A (en) | Unmanned aerial vehicle acoustic identification method and device based on frequency band feature extraction | |
CN114841287A (en) | Training method of classification model, image classification method and device | |
CN116705013B (en) | Voice wake-up word detection method and device, storage medium and electronic equipment | |
CN111640423B (en) | Word boundary estimation method and device and electronic equipment | |
CN113077779A (en) | Noise reduction method and device, electronic equipment and storage medium | |
CN117198300A (en) | A bird sound recognition method and device based on attention mechanism | |
JP7024615B2 (en) | Blind separation devices, learning devices, their methods, and programs | |
CN119007728A (en) | Method and device for extracting voice of target speaker |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||