CN117995176A - Multi-source voice recognition method and system - Google Patents
- Publication number: CN117995176A
- Application number: CN202410123577.5A
- Authority: CN (China)
- Legal status: Pending (assumed status, not a legal conclusion)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/028—Voice signal separating using properties of sound source
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/87—Detection of discrete points within a voice signal
Abstract
The invention provides a multi-source voice recognition method and system belonging to the technical field of voice recognition. The method comprises the following steps: determining the qualified voice periods of different sound sources and the distribution of those periods according to the amplitude characteristics of the voice signals of the different sound sources and a preset amplitude threshold; determining the voice reliability of the different sound sources and the reference sound source by combining the amplitude characteristics of the voice signals; determining sound source division periods based on the voice variation periods of the different other sound sources; determining the reference sound source of each division period according to the voice reliability of the sound sources; reconstructing with weights the voice characteristics of the overall reference sound source and of the reference sound sources of the different division periods as input quantities; and constructing a fusion voice recognition model to output the voice recognition result, thereby improving the reliability of voice recognition.
Description
Technical Field
The invention belongs to the technical field of voice recognition, and particularly relates to a multi-source voice recognition method and system.
Background
In order to improve the reliability and robustness of voice recognition results, prior-art systems collect and process voice signals with multiple microphones or multiple sound sources and output the recognition result after aligning the collected signals, so that the recognition output is coordinated across the different sound sources.
In order to solve this technical problem, the prior art CN201810673599.3, "speech recognition method, system, sound box and storage medium based on multi-source recognition", sets at least two speech recognition platforms in an intelligent sound box to recognize the user's speech: when the recognition results are the same, the result is output directly; when they differ, further processing is performed to obtain and output a final recognition result. This greatly improves the recognition accuracy of the intelligent sound box, but the following technical problem remains:
In some situations the position of the user is not fixed; in a conference, for example, it may change. If the change in the user's position cannot be recognized, the reliability of the analysis of the different sound sources cannot be determined, and the voice recognition result cannot be output accurately.
Aiming at the technical problems, the invention provides a multi-source voice recognition method and a multi-source voice recognition system.
Disclosure of Invention
The invention aims to provide a multi-source voice recognition method.
In order to solve the technical problems, the invention provides a multi-source voice recognition method, which specifically comprises the following steps:
S1, determining the qualified voice periods of different sound sources and the distribution of those periods according to the amplitude characteristics of the voice signals of the different sound sources and a preset amplitude threshold, and determining the voice reliability of the different sound sources and the reference sound source by combining the amplitude characteristics of the voice signals;
S2, taking the deviation between the amplitude characteristic of the voice signal of each other sound source and that of the reference sound source as the voice characteristic deviation amount, determining the voice variation periods based on the variation of the voice characteristic deviation amounts of the different other sound sources, and determining the voice characteristic fluctuation amounts of the different other sound sources and the fluctuation sound sources by combining the variation of the voice characteristic deviation amounts between different voice variation periods;
S3, determining the movement probability of the user's voice signal from the voice characteristic fluctuation amounts of the different sound sources and the number of fluctuation sound sources, and entering the next step when the movement probability does not meet the requirement;
S4, determining sound source division periods based on the voice variation periods of the different other sound sources, determining the reference sound source of each division period according to the voice reliability of the sound sources, reconstructing with weights the voice characteristics of the overall reference sound source and of the reference sound sources of the different division periods as input quantities, and constructing a fusion voice recognition model to output the voice recognition result.
The invention has the beneficial effects that:
1. The voice reliability of the different sound sources and the reference sound source are determined from the qualified voice periods and the amplitude characteristics of the voice signals. This takes into account both the duration and distribution continuity of the qualified voice periods of the different sound sources and the differences in recognition accuracy caused by differences in amplitude characteristics, so the reference sound source is screened reliably and a foundation is laid for determining the voice change situation.
2. The movement probability of the user's voice signal is determined from the voice characteristic fluctuation amounts of the different sound sources and the number of fluctuation sound sources. The movement of the voice signal is thus evaluated accurately from multiple angles, fully considering both the fluctuation of the voice characteristics of the different sound sources and the number of fluctuation sound sources, and the technical problem of inaccurate recognition results caused by movement of the voice signal is avoided.
3. The voice characteristics of the overall reference sound source and of the reference sound sources of the different division periods are reconstructed with weights and used as input quantities, and a fusion voice recognition model is constructed to output the voice recognition result. This improves the efficiency of recognition processing by reducing the input quantity of the fusion model, and at the same time, by reconstructing the voice characteristics, it fully considers how the fluctuation of the voice characteristics changes the reference value of the different sound sources, improving recognition accuracy.
In a further technical scheme, the preset amplitude threshold is determined from the voice recognition results obtained under different amplitude characteristics of the voice signals; specifically, it is set at the amplitude characteristic for which the accuracy of the voice recognition result exceeds a preset accuracy threshold.
In a further technical scheme, determining the voice variation periods based on the variation of the voice characteristic deviation amounts of the different other sound sources specifically includes:
dividing each other sound source into a number of periods according to the similarity of its voice characteristic deviation amounts at different moments, and determining the voice variation periods from the deviation of the voice characteristic deviation amounts between different periods.
On the other hand, an embodiment of the application provides a multi-source voice recognition system that adopts the multi-source voice recognition method described above and specifically comprises: a reference sound source determination module, a fluctuation sound source screening module, a movement probability evaluation module and a voice recognition module;
the reference sound source determination module is responsible for determining the qualified voice periods of different sound sources and their distribution according to the amplitude characteristics of the voice signals of the different sound sources and a preset amplitude threshold, and for determining the voice reliability of the different sound sources and the reference sound source by combining the amplitude characteristics of the voice signals;
the fluctuation sound source screening module is responsible for taking the deviation between the amplitude characteristic of the voice signal of each other sound source and that of the reference sound source as the voice characteristic deviation amount, determining the voice variation periods based on the variation of the voice characteristic deviation amounts of the different other sound sources, and determining the voice characteristic fluctuation amounts of the different other sound sources and the fluctuation sound sources from the variation of the voice characteristic deviation amounts between different voice variation periods;
the movement probability evaluation module is responsible for determining the movement probability of the user's voice signal from the voice characteristic fluctuation amounts of the different sound sources and the number of fluctuation sound sources;
the voice recognition module is responsible for determining sound source division periods based on the voice variation periods of the different other sound sources, determining the reference sound source of each division period according to the voice reliability of the sound sources, reconstructing with weights the voice characteristics of the overall reference sound source and of the reference sound sources of the different division periods as input quantities, and constructing a fusion voice recognition model to output the voice recognition result.
In another aspect, the present invention provides a computer storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform a multi-source speech recognition method as described above.
Additional features and advantages will be set forth in the description which follows, and in part will be apparent from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.
In order to make the above objects, features and advantages of the present invention more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
The above and other features and advantages of the present invention will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings.
Fig. 1 is a flowchart of a multi-source speech recognition method according to embodiment 1.
Fig. 2 is a flowchart of a method of determining a reference sound source in embodiment 1.
Fig. 3 is a frame diagram of a multi-source speech recognition system in embodiment 2.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments can be embodied in many forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The same reference numerals in the drawings denote the same or similar structures, and thus detailed descriptions thereof will be omitted.
The terms "a," "an," "the," and "said" are used to indicate the presence of one or more elements/components/etc.; the terms "comprising" and "having" are intended to be inclusive and mean that there may be additional elements/components/etc. in addition to the listed elements/components/etc.
Multi-source speech recognition refers to speech signal acquisition and processing using multiple microphones or multiple sound sources to improve the performance and robustness of the speech recognition system.
The following are some common multi-source speech recognition methods and systems:
1. Array microphone technology: multiple sound sources are captured and separated accurately by mounting several microphones in specific layouts and orientations. This approach provides more speech information, reduces the impact of ambient noise, and enhances recognition performance.
2. Sound source localization and separation: by using sound source localization and separation techniques, the voice signals of several people can be separated and recognized effectively. The position and direction of a sound source can be determined by analyzing characteristics of the signals such as time delay and intensity (see the TDOA sketch after this list).
3. Sound source tracking and adaptive processing: by tracking and dynamically analyzing a plurality of sound sources, self-adaptive processing can be realized, the influence of background noise is reduced, and the accuracy of voice recognition is improved. This approach can adjust parameters and models of the speech recognition algorithm based on dynamic changes in the signal.
4. Multimodal fusion: fusing the speech signal with other sensor data (e.g., images, gestures, etc.) can provide a more comprehensive speech recognition result. The method can combine information of different sensors to improve the robustness and accuracy of voice recognition.
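For illustration of the localization idea in item 2 above, the following minimal sketch estimates the time difference of arrival (TDOA) between two microphones with the classic GCC-PHAT method. It is not part of the claimed invention; the function names and the test setup are our own assumptions.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the delay of `sig` relative to `ref` (in seconds) via GCC-PHAT."""
    n = sig.shape[0] + ref.shape[0]
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-12              # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

# Example: a signal delayed by 25 samples against its reference
fs = 16000
ref = np.random.randn(fs)
sig = np.roll(ref, 25)
print(gcc_phat(sig, ref, fs))           # approximately 25/16000 s
```

From delays measured between several microphone pairs, the source position can then be triangulated, which is the basis of item 2 above.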
Example 1
In order to solve the above technical problems, as shown in fig. 1, the present invention provides a multi-source speech recognition method, which specifically includes:
S1, determining the qualified voice periods of different sound sources and the distribution of those periods according to the amplitude characteristics of the voice signals of the different sound sources and a preset amplitude threshold, and determining the voice reliability of the different sound sources and the reference sound source by combining the amplitude characteristics of the voice signals;
Before extracting the amplitude features, the sound data, i.e., the voice signals, must be collected, which specifically includes:
Data collection: voice data is collected from multiple microphones or multiple sound sources. Such data may come from different locations, directions or environments, and should cover different speakers, speech content and noise conditions so that the voice recognition system can be trained and evaluated.
Specifically:
1. Microphone configuration: the number and layout of the microphones is determined. For example, several microphones may be placed in fixed positions, or a movable microphone array may be used.
2. Data acquisition environment: a suitable acquisition environment is selected, either a laboratory or a real scene. The environment should have a certain heterogeneity, including different background noises, distances, angles and so on.
3. Speaker selection: a set of speakers is chosen to cover different genders, ages, speech rates and accents, so that the collected data is more diverse and the robustness of the recognition system improves.
4. Data recording: the speakers' voice data is recorded with several microphones simultaneously, ensuring that the position and orientation of each microphone match the configuration and remain constant during recording.
5. Labeling: the recorded voice data is labeled. Labeling may involve correcting timestamps, segmenting sentences, marking speakers and so on, providing the labels and references needed to train and evaluate the recognition system.
6. Data processing: the recorded voice data is processed, including preprocessing such as removing noise, equalizing the audio power and normalizing the audio features.
7. Data division: the processed data is divided into a training set, a validation set and a test set. Typically one portion is used for model training, one for adjusting model parameters, and the test set for evaluating the recognition system (a minimal split is sketched below).
It will be appreciated that the voice data also needs to be preprocessed before the voice features can be extracted:
Preprocessing: the voice data is preprocessed to reduce noise, equalize signal power and normalize the audio characteristics. This may include noise reduction, speech enhancement and audio normalization.
Specifically:
1. Noise reduction: ambient noise in the speech signal is removed by using a noise reduction algorithm. Common noise reduction methods include statistical-based methods (e.g., mean filtering, spectral subtraction) and machine-learning-based methods (e.g., deep-learning-based noise reduction networks).
2. Speech enhancement: the speech signal is enhanced to improve the quality and audibility of the speech signal. This may include increasing the volume of the speech signal, enhancing the clarity and loudness of the speech, etc.
3. Audio normalization: the audio characteristics of the speech signal are normalized to ensure that the different speech data have similar volume levels and dynamic ranges. This may be achieved by adjusting the audio gain or performing dynamic range compression.
4. Mel spectrum extraction: the speech signal is converted into mel-spectral features by taking its Fourier transform and applying a mel filter bank. The mel spectrum better captures the speech information perceived by the human ear.
5. Framing: the speech signal is divided into a series of short audio frames, typically 20-30 milliseconds each, which provides finer time-domain information for subsequent feature extraction and processing.
6. Feature extraction: features are extracted from each audio frame. Common features include mel-frequency cepstral coefficients (MFCCs), mel spectrograms and zero-crossing rates; they represent the spectral and time-domain characteristics of the speech signal (a sketch follows this list).
7. Feature normalization: the extracted speech features are normalized to reduce the variance between features. This may be achieved by mean removal and variance normalization of the features, or using other normalization methods.
The above steps can be flexibly adjusted and combined according to actual needs and may be different depending on the specific multi-source speech recognition method. The goal of the preprocessing is to reduce noise, improve robustness and recognizability of the speech, and provide better input data for the subsequent multi-source speech recognition algorithm.
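As one concrete realization of steps 4-7 above, the sketch below computes normalized MFCCs. It assumes the librosa library and a 25 ms frame with a 10 ms hop; neither the library nor these values are prescribed by the text.

```python
import numpy as np
import librosa

def extract_features(path, sr=16000, n_mfcc=13):
    y, sr = librosa.load(path, sr=sr)               # load and resample
    # MFCCs over 25 ms frames with a 10 ms hop (assumed values)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr),
                                hop_length=int(0.010 * sr))
    # Per-coefficient mean/variance normalization (step 7 above)
    mfcc = (mfcc - mfcc.mean(axis=1, keepdims=True)) \
           / (mfcc.std(axis=1, keepdims=True) + 1e-8)
    return mfcc                                     # shape: (n_mfcc, n_frames)
```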
It should be noted that the preset amplitude threshold is determined from the voice recognition results obtained under different amplitude characteristics of the voice signals; specifically, it is set at the amplitude characteristic for which the accuracy of the voice recognition result exceeds a preset accuracy threshold.
In a possible embodiment, as shown in fig. 2, the method for determining the reference sound source in the step S1 is:
acquiring the accumulated duration of the qualified voice periods of the different sound sources, and judging from the accumulated duration whether the sound source can be excluded as a reference sound source; if so, the sound source does not belong to the reference sound source, and if not, the next step is entered;
determining the comprehensive amplitude characteristic evaluation amount of the sound source from the amplitude characteristics of its voice signal at different moments, and judging from this evaluation amount whether the sound source can be excluded as a reference sound source; if so, the sound source does not belong to the reference sound source, and if not, the next step is entered;
determining the number of qualified voice periods of the sound source and the intervals between adjacent qualified voice periods from the distribution of its qualified voice periods, determining the voice qualification evaluation amount of the sound source by combining the accumulated duration of its qualified voice periods, and judging from the voice qualification evaluation amount whether the sound source can be excluded as a reference sound source; if so, the sound source does not belong to the reference sound source, and if not, the next step is entered;
acquiring the number of qualified voice periods of the sound source shorter than a preset duration and the time during which the amplitude characteristics of the voice signal do not meet the requirement, and determining the voice reliability of the sound source by combining its comprehensive amplitude characteristic evaluation amount and voice qualification evaluation amount, whereby whether the sound source belongs to the reference sound source is determined through the voice reliability.
Further, determining from the voice reliability whether a sound source belongs to the reference sound source specifically includes:
acquiring the voice reliability of the different sound sources, and taking the sound source with the maximum voice reliability as the reference sound source.
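For illustration only, the sketch below shows one plausible realization of this selection. The patent gives no formulas for the reliability score, so the scoring used here (accumulated qualified duration, mean amplitude within qualified periods, and a penalty for fragmented distribution) and all names are assumptions.

```python
import numpy as np

def qualified_periods(signal, threshold):
    """(start, end) index pairs of runs where |signal| exceeds the threshold."""
    mask = (np.abs(signal) > threshold).astype(int)
    edges = np.diff(np.concatenate(([0], mask, [0])))
    return list(zip(np.flatnonzero(edges == 1), np.flatnonzero(edges == -1)))

def voice_reliability(signal, threshold, fs):
    """Assumed score: longer, louder, less fragmented qualified speech is better."""
    periods = qualified_periods(signal, threshold)
    if not periods:
        return 0.0
    total = sum(e - s for s, e in periods) / fs      # accumulated duration (s)
    amp = np.mean([np.abs(signal[s:e]).mean() for s, e in periods])
    gaps = [s2 - e1 for (_, e1), (s2, _) in zip(periods, periods[1:])]
    fragmentation = (np.mean(gaps) / fs) if gaps else 0.0
    return total * amp / (1.0 + fragmentation)

def pick_reference(sources, threshold, fs):
    """The source with maximum reliability is taken as the reference sound source."""
    return max(sources, key=lambda s: voice_reliability(sources[s], threshold, fs))
```

Here `sources` would be a dictionary mapping a source name to its sampled signal; the highest-scoring entry is returned as the reference sound source.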
In another possible embodiment, the method for determining the reference sound source in the step S1 is:
S11, acquiring the accumulated duration of the qualified voice periods of the different sound sources, and judging whether the sound source's accumulated duration is the longest; if so, the next step is entered, and if not, step S14 is entered;
S12, judging whether the deviations between the accumulated duration of this sound source and the accumulated qualified-voice durations of the other sound sources are all greater than a preset deviation; if so, the sound source is taken as the reference sound source, and if not, the next step is entered;
S13, determining the comprehensive amplitude characteristic evaluation amount of the sound source from the amplitude characteristics of its voice signal at different moments, and judging whether this evaluation amount is the largest and whether its deviations from the evaluation amounts of the other sound sources are greater than a preset characteristic deviation value; if so, the sound source is taken as the reference sound source, and if not, the next step is entered;
S14, determining the number of qualified voice periods of the sound source and the intervals between adjacent qualified voice periods from the distribution of its qualified voice periods, and determining the voice qualification evaluation amount of the sound source by combining the accumulated duration of its qualified voice periods;
S15, acquiring the number of qualified voice periods of the sound source shorter than a preset duration and the time during which the amplitude characteristics of the voice signal do not meet the requirement, and determining the voice reliability of the sound source by combining its comprehensive amplitude characteristic evaluation amount and voice qualification evaluation amount, whereby whether the sound source belongs to the reference sound source is determined through the voice reliability.
S2, taking the deviation between the amplitude characteristic of the voice signal of each other sound source and that of the reference sound source as the voice characteristic deviation amount, determining the voice variation periods based on the variation of the voice characteristic deviation amounts of the different other sound sources, and determining the voice characteristic fluctuation amounts of the different other sound sources and the fluctuation sound sources by combining the variation of the voice characteristic deviation amounts between different voice variation periods;
It will be appreciated that before the voice characteristic deviation amounts can be determined, the different sound sources must also be aligned; this can in particular be done in combination with other sensor data.
Multimodal fusion (optional): if other sensor data exist (e.g., images, gestures), the voice data may be fused with them. Combining the timing and spatiotemporal characteristics of the sensor data with the voice data can make recognition more accurate and robust.
Specifically:
1. Data collection: voice data and other sensor data are collected simultaneously, such as image data from a camera or gesture data from a body posture sensor, ensuring that acquisition is synchronous and consistent.
2. Data preprocessing: the voice and other sensor data are preprocessed to remove noise, calibrate the data and normalize magnitudes. This may include denoising, alignment and normalization operations.
3. Feature extraction: features are extracted from the speech and other sensor data to obtain respective representations. For speech data, the feature extraction step may refer to the flow of "audio feature extraction". For other sensor data, corresponding feature extraction algorithms, such as image feature extraction or gesture feature extraction, etc., may be used.
4. Feature fusion: the features acquired from the different sensors are fused, either by simple feature-level fusion (e.g., concatenation, weighted summation) or by more complex fusion models (e.g., multimodal deep neural networks).
5. Joint modeling: a voice recognition model is trained on the fused features. Conventional statistical models such as Hidden Markov Models (HMMs), or deep learning models such as Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs), may be used.
6. Result fusion: the output of the voice recognition model is fused with the other sensor data to obtain the final multimodal result, using simple weighted averaging, voting, or more complex decision-level fusion methods.
The multimodal fusion may enable the speech recognition system to utilize information from a variety of data sources to provide more comprehensive and accurate speech recognition results. By combining the voice characteristics and information of other sensor data, the system can better process the problems of noise, voice variation, context uncertainty and the like, and improve the accuracy and the robustness of recognition.
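As a minimal sketch of feature-level fusion (step 4 above): per-modality normalization followed by weighted concatenation. The weights, dimensions and function names are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def fuse_features(audio_feat, visual_feat, w_audio=0.7, w_visual=0.3):
    """Feature-level fusion by weighted concatenation of normalized features."""
    def z(x):
        return (x - x.mean()) / (x.std() + 1e-8)   # zero-mean, unit-variance
    return np.concatenate([w_audio * z(audio_feat), w_visual * z(visual_feat)])

# Example: a 13-dim MFCC mean vector fused with a 5-dim gesture descriptor
fused = fuse_features(np.random.randn(13), np.random.randn(5))
print(fused.shape)                                  # (18,)
```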
It will be appreciated that determining the voice variation periods based on the variation of the voice characteristic deviation amounts of the different other sound sources specifically includes:
dividing each other sound source into a number of periods according to the similarity of its voice characteristic deviation amounts at different moments, and determining the voice variation periods from the deviation of the voice characteristic deviation amounts between different periods.
Specifically, the method for determining the voice characteristic fluctuation amount of a sound source is:
determining the change in the voice characteristic deviation amount between adjacent voice variation periods from the variation of the voice characteristic deviation amounts between different voice variation periods, taking this change as the voice characteristic fluctuation amount, and distinguishing severe fluctuation periods from other fluctuation periods according to it;
determining, within the severe fluctuation periods and the other fluctuation periods respectively, the variation of the voice characteristic deviation amount between different moments, and determining the period characteristic fluctuation amounts of the different severe fluctuation periods and other fluctuation periods by combining the proportion of moments at which the variation of the voice characteristic deviation amount does not meet the requirement;
and determining the voice characteristic fluctuation amount of the sound source based on the durations and period characteristic fluctuation amounts of its severe fluctuation periods and of its other fluctuation periods.
Further, a sound source is determined to be a fluctuation sound source when its voice characteristic fluctuation amount does not meet the requirement, specifically when the voice characteristic fluctuation amount is greater than a preset fluctuation amount threshold.
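The following sketch gives one possible reading of this step: deviation amounts are computed against the reference sound source, split into periods of mutually similar values, and a source is flagged as a fluctuation sound source when the mean jump between adjacent periods exceeds a threshold. All formulas here are assumptions; the patent does not specify them.

```python
import numpy as np

def feature_deviation(other_amp, ref_amp):
    """Per-frame voice characteristic deviation against the reference source."""
    return np.abs(other_amp - ref_amp)

def variation_periods(deviation, sim_tol):
    """Split frames into periods whose deviation values stay within sim_tol
    of the running period mean (assumed similarity criterion)."""
    periods, start = [], 0
    for i in range(1, len(deviation)):
        if abs(deviation[i] - deviation[start:i].mean()) > sim_tol:
            periods.append((start, i))
            start = i
    periods.append((start, len(deviation)))
    return periods

def is_fluctuation_source(deviation, sim_tol, fluct_threshold):
    """Mean jump of the deviation between adjacent periods, compared to a
    preset fluctuation amount threshold (assumed aggregation)."""
    p = variation_periods(deviation, sim_tol)
    means = [deviation[s:e].mean() for s, e in p]
    jumps = np.abs(np.diff(means)) if len(means) > 1 else np.array([0.0])
    return jumps.mean() > fluct_threshold
```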
S3, determining the movement probability of the user's voice signal from the voice characteristic fluctuation amounts of the different sound sources and the number of fluctuation sound sources, and entering the next step when the movement probability does not meet the requirement;
It can be understood that the movement probability of the user's voice signal is determined as follows:
S31, judging whether a fluctuation sound source exists among the sound sources; if so, the next step is entered, and if not, the movement probability of the user's voice signal is set to the preset movement probability;
S32, judging whether the number of fluctuation sound sources is smaller than a preset number of sound sources; if so, the next step is entered, and if not, step S34 is entered;
S33, determining, from the voice characteristic fluctuation amounts of the different sound sources, the number of sound sources whose fluctuation amount is greater than a set fluctuation threshold; determining the voice fluctuation evaluation amount of the sound sources by combining the average of the voice characteristic fluctuation amounts of those sound sources with the maximum voice characteristic fluctuation amount over all sound sources; judging whether the voice fluctuation evaluation amount meets the requirement; if so, the movement probability of the user's voice signal is determined from the voice fluctuation evaluation amount, and if not, the next step is entered;
S34, acquiring the number of fluctuation sound sources and the voice characteristic fluctuation amounts of the different fluctuation sound sources, and determining the movement probability of the user's voice signal by combining the voice fluctuation evaluation amount of the sound sources.
In another possible embodiment, the movement probability of the user's voice signal is determined as follows:
when no fluctuation sound source exists among the sound sources, the movement probability of the user's voice signal is determined to meet the requirement;
when fluctuation sound sources exist, their number is acquired, and when it is greater than a second preset number of sound sources, the movement probability of the user's voice signal is determined not to meet the requirement;
when the number of fluctuation sound sources is not greater than the second preset number, the number of sound sources whose voice characteristic fluctuation amount is greater than the set fluctuation threshold is determined from the voice characteristic fluctuation amounts of the different sound sources, and the voice fluctuation evaluation amount of the sound sources is determined by combining the average of the voice characteristic fluctuation amounts of those sound sources with the maximum voice characteristic fluctuation amount over all sound sources;
when the voice fluctuation evaluation amount meets the requirement, the number of fluctuation sound sources and the voice characteristic fluctuation amounts of the different fluctuation sound sources are acquired, and the movement probability of the user's voice signal is determined by combining the voice fluctuation evaluation amount;
and when the voice fluctuation evaluation amount does not meet the requirement, the movement probability of the user's voice signal is determined not to meet the requirement (a sketch of this decision flow follows).
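A minimal sketch of this decision flow is given below. Only the branching mirrors the embodiment; the numeric mapping to a probability in [0, 1] and the 0.5/0.5 weighting in the evaluation amount are assumptions the patent does not specify.

```python
import numpy as np

def movement_probability(fluct_amounts, fluct_threshold,
                         max_fluct_sources, eval_limit):
    """Branching follows the embodiment above; numbers are assumed."""
    amounts = np.asarray(fluct_amounts)
    fluct_sources = amounts[amounts > fluct_threshold]
    if fluct_sources.size == 0:
        return 0.0                 # no fluctuation source: requirement met
    if fluct_sources.size > max_fluct_sources:
        return 1.0                 # too many fluctuation sources
    # Evaluation amount: average of fluctuating sources plus global maximum
    evaluation = 0.5 * fluct_sources.mean() + 0.5 * amounts.max()
    if evaluation > eval_limit:
        return 1.0                 # evaluation amount fails the requirement
    # Combine source count and evaluation amount into a probability
    return min(1.0, evaluation * fluct_sources.size / max_fluct_sources)
```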
S4, determining sound source division periods based on the voice variation periods of the different other sound sources, determining the reference sound source of each division period according to the voice reliability of the sound sources, reconstructing with weights the voice characteristics of the overall reference sound source and of the reference sound sources of the different division periods as input quantities, and constructing a fusion voice recognition model to output the voice recognition result.
The construction of the fusion voice recognition model covers three aspects: model training, voice recognition, and evaluation and improvement.
Model training: a voice recognition model is trained with the preprocessed voice data and the corresponding labels. Common models include statistical Hidden Markov Models (HMMs) and deep learning models such as Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs). During training, optimizing model parameters and selecting suitable algorithms and model architectures are critical.
Specifically:
1. Data preparation: the preprocessed voice data is divided into training, validation and test sets. Typically the training set is used to train the model, the validation set to select and tune model parameters, and the test set to evaluate performance.
2. Feature representation: before being input into the model, the voice data is converted into a suitable feature representation. Common features include mel-frequency cepstral coefficients (MFCCs) and mel spectrograms, which represent the spectral and timing information of the speech signal.
3. Label preparation: corresponding labels are generated for the training data. The labels may be sequences of phonemes or of characters/words, and guide the model in learning the relation between the speech signal and its textual representation.
4. Model selection and architecture design: the appropriate model and architecture is selected based on the requirements of the particular task and the characteristics of the data set. Common models include statistical-based Hidden Markov Models (HMMs) and deep learning-based models, such as Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs), among others.
5. Model initialization: parameters of the model are initialized. This may be random initialization or initialization using a pre-trained model. The pre-trained model may be a model trained on large-scale speech data to provide better initial parameters.
6. Forward and backward propagation: forward and backward passes are performed on the training data. The forward pass feeds the features through the model to generate its output; the backward pass computes the loss from the output and the labels and adjusts the parameters through an optimization algorithm (e.g., gradient descent) to minimize the loss.
7. Parameter tuning: through iterative training and verification processes, parameters of the model are adjusted to improve performance and generalization capability of the model. This can be achieved by selecting a more appropriate learning rate, regularization method, optimization algorithm, etc.
8. Model evaluation: the trained model is evaluated on the test set. Evaluation indices may include recognition accuracy and the word error rate (WER).
9. Model improvement: the model is improved according to the evaluation results, which may include parameter tuning, data enhancement and model structure modification (a minimal training-step sketch follows this list).
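For illustration of steps 5-7 above, a deliberately small PyTorch sketch of one training step follows. The GRU classifier stands in for the fusion recognition model; all dimensions and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpeechModel(nn.Module):
    """Tiny GRU classifier over MFCC frames (assumed architecture)."""
    def __init__(self, n_feat=13, n_hidden=64, n_classes=10):
        super().__init__()
        self.rnn = nn.GRU(n_feat, n_hidden, batch_first=True)
        self.out = nn.Linear(n_hidden, n_classes)

    def forward(self, x):                # x: (batch, time, n_feat)
        h, _ = self.rnn(x)
        return self.out(h[:, -1])        # classify from the last time step

model = SpeechModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One training step on a dummy batch
x = torch.randn(8, 100, 13)              # 8 utterances, 100 frames, 13 MFCCs
y = torch.randint(0, 10, (8,))
opt.zero_grad()
loss = loss_fn(model(x), y)              # forward propagation
loss.backward()                          # backward propagation
opt.step()                               # gradient-based parameter update
```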
Voice recognition: the new voice data is recognized with the trained fusion voice recognition model. The feature representation is input into the model, and a recognition algorithm (e.g., the Viterbi algorithm) infers the most likely recognition result.
Specifically:
1. Feature representation: the input speech signal is converted into a feature representation the model accepts. Common features include mel-frequency cepstral coefficients (MFCCs) and mel spectrograms, which represent the spectral and timing information of the speech signal.
2. Model inference: the feature representation is input into the voice recognition model, which infers the corresponding recognition result. Common models include statistical Hidden Markov Models (HMMs) and deep learning models such as Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs).
3. Decoding algorithm: the model output is decoded to obtain the final recognition result. Common decoding algorithms include the Viterbi algorithm and beam search (a pruned search). These algorithms search over the model outputs, together with a language model if one is available, for the best recognition result.
4. Post-processing: post-processing operations are applied to the recognition result to further improve its quality, such as word-level error correction, sentence segmentation and punctuation insertion.
5. Evaluation and performance indices: evaluation indices of the recognition result, such as the word error rate (WER), are computed by comparison with the reference labels (a WER sketch follows this list). These can be used to evaluate and compare the performance of different systems.
6. Model updating and improvement: the model is updated and improved according to the evaluation results, which may include parameter tuning, data enhancement and model structure modification.
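The word error rate mentioned in step 5 is the standard edit distance between reference and hypothesis word sequences, normalized by the reference length; a self-contained sketch:

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with the standard edit-distance dynamic program."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / max(len(r), 1)

print(word_error_rate("turn on the light", "turn off the light"))  # 0.25
```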
Evaluation and improvement: the recognition results are evaluated, and the voice recognition system is adjusted and improved according to the feedback, including model parameter tuning, data enhancement and algorithm optimization.
Specifically:
1. Data set partitioning: the data set is divided into a training set, a validation set and a test set. The training set is used for model training, the validation set for selecting and optimizing model parameters, and the test set for the final performance evaluation.
2. Evaluation indices: appropriate indices are selected to evaluate the system. Common ones include the word error rate (WER) and the character error rate (CER); they measure the difference between the recognition result and the reference labels.
3. Initial evaluation: the trained voice recognition system is evaluated by computing the indices on the test set, establishing its baseline performance and showing its strengths and the directions for improvement.
4. Error analysis: the recognition results are analyzed for common error types and patterns. From the causes of the errors, problems with the system can be discovered, such as speech variation or vocabulary inconsistencies.
5. Model improvement: the system is improved according to the error analysis. Possible measures include adjusting model parameters, training with more data and introducing more accurate language models.
6. Data enhancement: performance is improved by augmenting the training data, e.g., adding noise, time shifting, pitch variation and speed perturbation (see the sketch after this list). This helps the system handle different speech variations and environmental conditions more robustly.
7. Parameter adjustment and optimization: the super parameters of the model, such as learning rate, regularization parameters, network structure, etc., are adjusted. This can be achieved by cross-validation and tuning using a validation set.
8. Re-evaluation: after improvement and optimization, the system is evaluated again and the indices are compared with the initial evaluation, showing the effect of the improvements on system performance.
9. Iterative improvement: based on the repeated evaluations, the improvement can be iterated if performance is still unsatisfactory; repeating improvement and evaluation gradually raises the performance and robustness of the system.
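A minimal sketch of the noise-augmentation part of step 6: additive white noise scaled to a target signal-to-noise ratio. The SNR value and function name are illustrative; real pipelines would also apply shifts and pitch/speed perturbation.

```python
import numpy as np

def add_noise(speech, snr_db):
    """Add white noise so that the result has the requested SNR in dB."""
    p_speech = np.mean(speech ** 2)
    p_noise = p_speech / (10 ** (snr_db / 10))
    return speech + np.random.randn(len(speech)) * np.sqrt(p_noise)

clean = np.random.randn(16000)            # stand-in for one second of speech
noisy = add_noise(clean, snr_db=10)
```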
Further, constructing the fusion voice recognition model to output the voice recognition result specifically includes:
reconstructing with weights the voice characteristics of the overall reference sound source and of the reference sound source of each division period, taking the result as the input quantity, constructing the fusion voice recognition model, and outputting the voice recognition result;
determining the weight of the overall reference sound source and the weight of the per-period reference sound source according to the voice reliability of the overall reference sound source and the voice reliability of the reference sound source of the corresponding division period;
reconstructing the voice characteristics of the overall reference sound source and of the per-period reference sound source based on these weights to obtain the reconstructed voice characteristics, which are taken as the input quantity of the fusion voice recognition model to output the voice recognition result of the corresponding division period;
and outputting the overall voice recognition result according to the voice recognition results of the different division periods (a sketch of the weighted reconstruction follows).
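One plausible realization of the weighted reconstruction is sketched below. Normalizing the two reliabilities into convex weights is our assumption; the patent does not state the weighting formula.

```python
import numpy as np

def reconstruct_features(ref_feat, period_feat,
                         ref_reliability, period_reliability):
    """Weighted reconstruction of the voice characteristics of the overall
    reference sound source and the reference sound source of one division
    period. Reliability-proportional weights are an assumed choice."""
    w_ref = ref_reliability / (ref_reliability + period_reliability)
    return w_ref * ref_feat + (1.0 - w_ref) * period_feat

# One division period: fuse 13-dim feature vectors from the two sources
fused = reconstruct_features(np.random.randn(13), np.random.randn(13),
                             ref_reliability=0.8, period_reliability=0.6)
```

The reconstructed vector replaces two separate inputs, which is what reduces the input quantity of the fusion voice recognition model as described above.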
Example 2
On the other hand, as shown in fig. 3, an embodiment of the application provides a multi-source voice recognition system that adopts the multi-source voice recognition method described above and specifically comprises: a reference sound source determination module, a fluctuation sound source screening module, a movement probability evaluation module and a voice recognition module;
the reference sound source determination module is responsible for determining the qualified voice periods of different sound sources and their distribution according to the amplitude characteristics of the voice signals of the different sound sources and a preset amplitude threshold, and for determining the voice reliability of the different sound sources and the reference sound source by combining the amplitude characteristics of the voice signals;
the fluctuation sound source screening module is responsible for taking the deviation between the amplitude characteristic of the voice signal of each other sound source and that of the reference sound source as the voice characteristic deviation amount, determining the voice variation periods based on the variation of the voice characteristic deviation amounts of the different other sound sources, and determining the voice characteristic fluctuation amounts of the different other sound sources and the fluctuation sound sources from the variation of the voice characteristic deviation amounts between different voice variation periods;
the movement probability evaluation module is responsible for determining the movement probability of the user's voice signal from the voice characteristic fluctuation amounts of the different sound sources and the number of fluctuation sound sources;
the voice recognition module is responsible for determining sound source division periods based on the voice variation periods of the different other sound sources, determining the reference sound source of each division period according to the voice reliability of the sound sources, reconstructing with weights the voice characteristics of the overall reference sound source and of the reference sound sources of the different division periods as input quantities, and constructing a fusion voice recognition model to output the voice recognition result.
Example 3
The present invention provides a computer storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform a multi-source speech recognition method as described above.
Based on the above embodiments, the following technical effects are expected to be obtained:
1. The voice reliability of the different sound sources and the reference sound source are determined from the qualified voice periods and the amplitude characteristics of the voice signals. This takes into account both the duration and distribution continuity of the qualified voice periods of the different sound sources and the differences in recognition accuracy caused by differences in amplitude characteristics, so the reference sound source is screened reliably and a foundation is laid for determining the voice change situation.
2. The movement probability of the user's voice signal is determined from the voice characteristic fluctuation amounts of the different sound sources and the number of fluctuation sound sources. The movement of the voice signal is thus evaluated accurately from multiple angles, fully considering both the fluctuation of the voice characteristics of the different sound sources and the number of fluctuation sound sources, and the technical problem of inaccurate recognition results caused by movement of the voice signal is avoided.
3. The voice characteristics of the overall reference sound source and of the reference sound sources of the different division periods are reconstructed with weights and used as input quantities, and a fusion voice recognition model is constructed to output the voice recognition result. This improves the efficiency of recognition processing by reducing the input quantity of the fusion model, and at the same time, by reconstructing the voice characteristics, it fully considers how the fluctuation of the voice characteristics changes the reference value of the different sound sources, improving recognition accuracy.
In the description of the present specification, the terms "one embodiment," "a preferred embodiment," and the like, mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the embodiments of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above is only a preferred embodiment of the present invention and is not intended to limit the embodiment of the present invention, and various modifications and variations can be made to the embodiment of the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the embodiments of the present invention should be included in the protection scope of the embodiments of the present invention.
Claims (10)
1. A multi-source voice recognition method, characterized by comprising the following steps:
Determining qualified voice time periods and distribution conditions of the qualified voice time periods of different sound sources according to amplitude characteristics of voice signals of the different sound sources and a preset amplitude threshold, and determining voice reliability of the different sound sources and a reference sound source by combining the amplitude characteristics of the voice signals of the different sound sources;
Taking the deviation between the amplitude characteristic of each other sound source's voice signal and that of the reference sound source as the voice characteristic deviation amount, determining voice variation periods based on how the voice characteristic deviation amounts of the different other sound sources change, and determining the voice characteristic fluctuation amounts of the different other sound sources and the fluctuating sound sources by combining the changes of the voice characteristic deviation amount between different voice variation periods;
Determining the movement probability of the user's voice signal from the voice characteristic fluctuation amounts of the different sound sources and the number of fluctuating sound sources, and entering the next step when the movement probability does not meet the requirement;
Determining sound source division periods based on the voice variation periods of the different other sound sources, determining a reference sound source for each division period according to the voice reliability of the sound sources, reconstructing with weights the voice characteristics of the reference sound source and of the different division periods' reference sound sources as input quantities, and constructing a fusion voice recognition model to output the voice recognition result.
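Claim 1 leaves the extraction of qualified voice periods abstract. One plausible reading, sketched below, treats a qualified voice period as a maximal run of frames whose amplitude characteristic stays at or above the preset threshold; the frame-based representation and the function name `qualified_periods` are assumptions.

```python
import numpy as np

def qualified_periods(amplitude, threshold):
    """Return (start, end) frame indices of maximal runs where the
    amplitude characteristic stays at or above the preset threshold."""
    mask = np.asarray(amplitude) >= threshold
    if mask.size == 0:
        return []
    # Rising (+1) and falling (-1) edges of the boolean mask.
    edges = np.diff(mask.astype(int))
    starts = np.where(edges == 1)[0] + 1
    ends = np.where(edges == -1)[0] + 1
    if mask[0]:
        starts = np.r_[0, starts]
    if mask[-1]:
        ends = np.r_[ends, mask.size]
    return list(zip(starts, ends))
```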
2. The multi-source voice recognition method of claim 1, wherein the preset amplitude threshold is determined from voice recognition results obtained under different amplitude characteristics of the voice signals, specifically as the amplitude characteristic at which the accuracy of the voice recognition result exceeds a preset accuracy threshold.
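Claim 2 amounts to a calibration loop: measure recognition accuracy under different amplitude characteristics and keep the amplitude at which accuracy clears the preset accuracy threshold. A minimal sketch, assuming the test results are already available as parallel lists and that the smallest qualifying amplitude is wanted:

```python
def calibrate_amplitude_threshold(amplitudes, accuracies, accuracy_threshold=0.9):
    """Pick the smallest tested amplitude whose measured recognition
    accuracy exceeds the preset accuracy threshold."""
    qualifying = [amp for amp, acc in sorted(zip(amplitudes, accuracies))
                  if acc > accuracy_threshold]
    if not qualifying:
        raise ValueError("no tested amplitude reaches the accuracy threshold")
    return qualifying[0]
```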
3. The multi-source voice recognition method of claim 1, wherein the reference sound source is determined as follows:
Acquiring the cumulative duration of each sound source's qualified voice periods, and determining from the cumulative duration whether the sound source is excluded as a reference sound source; if so, excluding it, and if not, entering the next step;
Determining the sound source's comprehensive amplitude characteristic evaluation quantity from the amplitude characteristics of its voice signal at different moments, and determining from this evaluation quantity whether the sound source is excluded as a reference sound source; if so, excluding it, and if not, entering the next step;
Determining the number of the sound source's qualified voice periods and the intervals between adjacent qualified voice periods from the distribution of those periods, determining the sound source's voice qualification evaluation quantity by combining the cumulative duration of its qualified voice periods, and determining from this evaluation quantity whether the sound source is excluded as a reference sound source; if so, excluding it, and if not, entering the next step;
acquiring the number of the sound source's qualified voice periods shorter than a preset duration and the time during which the amplitude characteristics of its voice signal fail the requirement, determining the sound source's voice reliability by combining its comprehensive amplitude characteristic evaluation quantity and voice qualification evaluation quantity, and determining through the voice reliability whether the sound source is the reference sound source.
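Claims 3 and 4 together describe a staged elimination followed by selecting the most reliable survivor. The sketch below mirrors that control flow; every threshold constant and every scoring formula (mean amplitude for the comprehensive evaluation, duration over mean gap for the qualification evaluation, and the final penalty term) is an illustrative assumption, since the patent does not disclose the actual expressions.

```python
import numpy as np

MIN_DURATION = 2.0    # s; assumed cut-off for stage 1
MIN_AMP_EVAL = 0.1    # assumed cut-off for stage 2
MIN_QUAL_EVAL = 0.5   # assumed cut-off for stage 3

def voice_reliability(src):
    """Staged screening of claim 3. Returns None when the source is
    eliminated at some stage, otherwise a reliability score."""
    # Stage 1: cumulative duration of the qualified voice periods.
    total = sum(end - start for start, end in src["periods"])
    if total < MIN_DURATION:
        return None
    # Stage 2: comprehensive amplitude evaluation (assumed: mean amplitude).
    amp_eval = float(np.mean(src["amplitudes"]))
    if amp_eval < MIN_AMP_EVAL:
        return None
    # Stage 3: qualification evaluation from period gaps and duration.
    gaps = [b[0] - a[1] for a, b in zip(src["periods"], src["periods"][1:])]
    mean_gap = float(np.mean(gaps)) if gaps else 0.0
    qual_eval = total / (1.0 + mean_gap)
    if qual_eval < MIN_QUAL_EVAL:
        return None
    # Stage 4: penalise very short periods and out-of-spec time.
    n_short = sum(1 for s, e in src["periods"] if e - s < 0.2)
    return amp_eval * qual_eval / (1.0 + n_short + src.get("bad_duration", 0.0))

def pick_reference(sources):
    """Claim 4: among the surviving sources, take the one with the
    greatest voice reliability."""
    scored = {name: rel for name, src in sources.items()
              if (rel := voice_reliability(src)) is not None}
    return max(scored, key=scored.get) if scored else None
```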
4. The multi-source voice recognition method of claim 3, wherein determining through the voice reliability whether the sound source is the reference sound source specifically comprises:
acquiring the voice reliability of the different sound sources, and taking the sound source with the greatest voice reliability as the reference sound source.
5. The multi-source voice recognition method of claim 1, wherein determining the voice variation periods based on how the voice characteristic deviation amounts of the different other sound sources change specifically comprises:
dividing each other sound source's signal into several periods according to the similarity of its voice characteristic deviation amounts at different moments, and determining the voice variation periods from the deviations of the voice characteristic deviation amount between different periods.
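As one concrete reading of claim 5, the deviation series can be cut wherever consecutive values stop being similar, and a segment kept as a voice variation period when its mean departs from the overall mean. The `jump` parameter and both uses of it below are assumptions:

```python
import numpy as np

def variation_periods(deviation, jump=0.2):
    """Cut the per-moment deviation series wherever consecutive values
    differ by more than `jump` (assumed similarity criterion), then keep
    segments whose mean departs from the overall mean by more than
    `jump` as voice variation periods."""
    deviation = np.asarray(deviation, dtype=float)
    cuts = np.where(np.abs(np.diff(deviation)) > jump)[0] + 1
    segments = np.split(deviation, cuts)
    periods, start = [], 0
    overall = deviation.mean()
    for seg in segments:
        if abs(seg.mean() - overall) > jump:
            periods.append((start, start + seg.size))
        start += seg.size
    return periods
```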
6. The multi-source voice recognition method of claim 1, wherein a sound source's voice characteristic fluctuation amount is determined as follows:
Determining the change value of the voice characteristic deviation amount between adjacent voice variation periods from how the deviation amount changes between different voice variation periods, taking the change value as the voice characteristic change amount, and determining severe fluctuation periods and other fluctuation periods according to the voice characteristic change amount;
Determining, within the severe fluctuation periods and the other fluctuation periods respectively, how the voice characteristic deviation amount changes between moments, and determining the period characteristic fluctuation amounts of the different severe fluctuation periods and other fluctuation periods by combining the proportion of moments at which that change fails the requirement;
and determining the sound source's voice characteristic fluctuation amount based on the durations and period characteristic fluctuation amounts of its severe fluctuation periods and of its other fluctuation periods.
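Claim 6 combines adjacent-period change values, a severe/other split, the share of non-compliant moments, and period durations into one fluctuation amount. The duration-weighted aggregate below is a minimal sketch of that combination; the doubled weight for severe fluctuation periods and the shared 0.3 cut-off are assumptions.

```python
import numpy as np

def fluctuation_amount(period_means, period_devs, severe_cut=0.3):
    """Aggregate a per-source voice characteristic fluctuation amount
    from the mean deviation of each voice variation period
    (period_means) and the per-period deviation series (period_devs)."""
    changes = np.abs(np.diff(period_means))   # change values between adjacent periods
    severe = changes > severe_cut             # severe vs. other fluctuation periods
    total, frames = 0.0, 0
    for i, dev in enumerate(period_devs[1:]): # period i+1 carries change i
        dev = np.asarray(dev, dtype=float)
        # Share of moments whose frame-to-frame change fails the requirement.
        bad = float(np.mean(np.abs(np.diff(dev)) > severe_cut)) if dev.size > 1 else 0.0
        weight = 2.0 if severe[i] else 1.0    # severe periods weigh double (assumed)
        total += weight * changes[i] * (1.0 + bad) * dev.size
        frames += dev.size
    return total / frames if frames else 0.0
```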
7. The multi-source voice recognition method of claim 1, wherein a sound source is determined to be a fluctuating sound source when its voice characteristic fluctuation amount fails the requirement, specifically when the fluctuation amount is greater than a preset fluctuation threshold.
8. The multi-source voice recognition method of claim 1, wherein constructing the fusion voice recognition model to output the voice recognition result specifically comprises:
Determining the weight of the reference sound source and the weights of the different division periods' reference sound sources according to the voice reliability of the reference sound source and the voice reliability of the reference sound sources in their corresponding division periods;
Reconstructing the voice characteristics of the reference sound source and of each division period's reference sound source based on these weights to obtain reconstructed voice characteristics, and taking the reconstructed voice characteristics as the input quantity of the fusion voice recognition model to output the voice recognition result of the corresponding division period;
and outputting the final voice recognition result from the voice recognition results of the different division periods.
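The claim states only that the weights derive from voice reliabilities; a convex combination is one natural reading. A minimal sketch of the reconstruction step under that assumption:

```python
import numpy as np

def reconstruct_features(ref_feat, period_feat, ref_rel, period_rel):
    """Reliability-weighted reconstruction of the reference sound
    source's features with the division period's own reference source.
    The convex-combination form is an assumption."""
    w_ref = ref_rel / (ref_rel + period_rel)
    return w_ref * np.asarray(ref_feat) + (1.0 - w_ref) * np.asarray(period_feat)
```

The reconstructed features of each division period would then be fed to the fusion recognition model and the per-period outputs merged into the final result, which is what keeps the model's input size small.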
9. A multi-source voice recognition system employing the multi-source voice recognition method of any one of claims 1-8, comprising: a reference sound source determining module, a fluctuating sound source screening module, a movement probability evaluation module, and a voice recognition module;
The reference sound source determining module is responsible for determining the qualified voice periods of the different sound sources and the distribution of those periods from the amplitude characteristics of the different sound sources' voice signals and the preset amplitude threshold, and determining the voice reliability of the different sound sources and the reference sound source by combining the amplitude characteristics of the voice signals;
The fluctuating sound source screening module is responsible for taking the deviation between the amplitude characteristics of the other sound sources' voice signals and that of the reference sound source as the voice characteristic deviation amount, determining voice variation periods based on how the voice characteristic deviation amounts of the different other sound sources change, and determining the voice characteristic fluctuation amounts of the different other sound sources and the fluctuating sound sources from the changes of the deviation amount between different voice variation periods;
The movement probability evaluation module is responsible for determining the movement probability of the user's voice signal from the voice characteristic fluctuation amounts of the different sound sources and the number of fluctuating sound sources;
The voice recognition module is responsible for determining sound source division periods based on the voice variation periods of the different other sound sources, determining a reference sound source for each division period according to the voice reliability of the sound sources, reconstructing with weights the voice characteristics of the reference sound source and of the division periods' reference sound sources as input quantities, and constructing the fusion voice recognition model to output the voice recognition result.
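Structurally, claim 9 maps the four steps of claim 1 onto four modules. The dataclass below sketches that layout with callables standing in for the modules; all signatures and the 0.5 cut-off are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class MultiSourceRecognizer:
    """Module layout of claim 9; each field stands in for one module."""
    determine_reference: Callable[[Dict], str]            # reference sound source determining module
    screen_fluctuating: Callable[[Dict, str], List[str]]  # fluctuating sound source screening module
    movement_probability: Callable[[Dict], float]         # movement probability evaluation module
    recognize: Callable[[Dict, str], str]                 # voice recognition module

    def run(self, signals: Dict) -> str:
        ref = self.determine_reference(signals)
        self.screen_fluctuating(signals, ref)
        # Claim 1 proceeds to recognition when the movement probability
        # "does not meet the requirement"; read here as the probability
        # staying below an assumed 0.5 cut-off.
        if self.movement_probability(signals) >= 0.5:
            raise RuntimeError("voice signal appears to be moving")
        return self.recognize(signals, ref)
```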
10. A computer storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the multi-source voice recognition method of any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410123577.5A CN117995176A (en) | 2024-01-29 | 2024-01-29 | Multi-source voice recognition method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410123577.5A CN117995176A (en) | 2024-01-29 | 2024-01-29 | Multi-source voice recognition method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117995176A true CN117995176A (en) | 2024-05-07 |
Family
ID=90892085
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410123577.5A Pending CN117995176A (en) | 2024-01-29 | 2024-01-29 | Multi-source voice recognition method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117995176A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN119495288A (en) * | 2025-01-20 | 2025-02-21 | 北京海百川科技有限公司 | An intelligent humanoid companion robot and its voice interaction system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108305616B (en) | Audio scene recognition method and device based on long-time and short-time feature extraction | |
Zhou et al. | Modality attention for end-to-end audio-visual speech recognition | |
US9792900B1 (en) | Generation of phoneme-experts for speech recognition | |
KR20130133858A (en) | Speech syllable/vowel/phone boundary detection using auditory attention cues | |
CN102394062A (en) | Method and system for automatically identifying voice recording equipment source | |
CN109036470B (en) | Voice distinguishing method, device, computer equipment and storage medium | |
Liu et al. | MSDWild: Multi-modal Speaker Diarization Dataset in the Wild. | |
CN112712809B (en) | Voice detection method and device, electronic equipment and storage medium | |
CN107871496A (en) | Speech recognition method and device | |
KR102564570B1 (en) | System and method for analyzing multimodal emotion | |
Xue et al. | Cross-modal information fusion for voice spoofing detection | |
CN111932056A (en) | Customer service quality scoring method and device, computer equipment and storage medium | |
Upadhyaya | Comparative study of visual feature for bimodal Hindi speech recognition | |
Falavigna et al. | DNN adaptation by automatic quality estimation of ASR hypotheses | |
CN117995176A (en) | Multi-source voice recognition method and system | |
Saenko et al. | Multistream articulatory feature-based models for visual speech recognition | |
CN119993161A (en) | A conference recording method based on Internet of Things | |
KR20200114705A (en) | User adaptive stress state classification Method using speech signal | |
KR102300599B1 (en) | Method and Apparatus for Determining Stress in Speech Signal Using Weight | |
Shilaskar et al. | CTC-CNN-Bidirectional LSTM based Lip Reading System | |
Nikitaras et al. | Fine-grained noise control for multispeaker speech synthesis | |
CN117475989A (en) | Timbre cloning method for automatic training of small amount of data | |
CN114283788B (en) | Pronunciation evaluation method, training method, device and equipment of pronunciation evaluation system | |
Mengistu | Automatic text independent amharic language speaker recognition in noisy environment using hybrid approaches of LPCC, MFCC and GFCC | |
Kanisha et al. | Speech recognition with advanced feature extraction methods using adaptive particle swarm optimization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||