
CN104008751A - Speaker recognition method based on BP neural network - Google Patents


Info

Publication number
CN104008751A
CN104008751A
Authority
CN
China
Prior art keywords
neural network
speech
voice
training
particle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410270239.0A
Other languages
Chinese (zh)
Inventor
周婷婷
李燕萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN201410270239.0A priority Critical patent/CN104008751A/en
Publication of CN104008751A publication Critical patent/CN104008751A/en
Pending legal-status Critical Current


Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The invention provides a speaker recognition method based on a BP neural network, comprising a speech training phase and a speech recognition phase. In the training phase, the speaker's speech is first processed to obtain a preprocessed speech signal; features are extracted from this signal with the MFCC speech parameter extraction method; a PSO-BP neural network is then trained on the features, and the trained models are used to build and optimize a PSO-BP neural network model base. In the recognition phase, features are extracted from the speech to be identified by the same method and fed into the BP neural network; the output is computed with the PSO-BP algorithm and compared one by one with the expected identities in the database, and the identity with the smallest recognition error is taken as the final recognition result.

Description

A speaker recognition method based on a BP neural network
Technical field
The present invention relates to speaker recognition technology, and in particular to a speaker recognition method based on a BP neural network.
Background technology
Speaker recognition (SR), also called voice-based identity recognition, is the technology of automatically confirming a speaker's identity by analyzing and processing the speaker's speech signal. It combines knowledge from physiology, phonetics, digital signal processing, pattern recognition, artificial intelligence, and other disciplines; with its unique convenience, economy, and accuracy it plays an important role in related fields and has a broad market background. The basic principle of speaker recognition is to use each speaker's speech to build a model describing that speaker's characteristics, which serves as a template of the speaker's speech feature parameters; test speech signals are then compared against these templates to determine the speaker's identity.
A speaker's personal characteristics are reflected to some extent in the speaker's vocal tract, and vocal-tract features can identify a speaker well. The main vocal-tract features are: (1) Mel-frequency cepstral coefficients (MFCC), cepstral parameters extracted on the Mel frequency scale, based on the critical-band effect of the auditory system. MFCC makes relatively full use of the perceptual characteristics of the human ear, is fairly robust, and is widely used. (2) Linear prediction cepstral coefficients (LPCC): Wiener first proposed the term "linear prediction" in 1947, and Itakura et al. first applied linear prediction to speech analysis and synthesis in 1967. LPCC was the earliest cepstral parameter applied to speech recognition. Its main advantages are that it thoroughly removes the excitation information of the speech production process and mainly reflects the vocal-tract response; it is cheap to compute, describes vowels well, and often needs only a few dozen cepstral coefficients to describe the formant characteristics of speech, so it has found good application in speaker recognition.
In speech technology research and applications there are three kinds of recognition algorithms for speech signals: methods based on vocal-tract models and speech knowledge, template matching, and artificial neural networks. Although research based on vocal-tract models and speech knowledge started early, its complexity has prevented good practical results so far. Template-matching methods include dynamic time warping (DTW), hidden Markov model (HMM) theory, and vector quantization (VQ); these algorithms resist interference poorly in noisy environments and cannot achieve good recognition. Artificial neural networks offer adaptivity, robustness, fault tolerance, and learning ability; their powerful classification and input-output mapping capabilities are very attractive in speech recognition.
A back-propagation (BP) network is a multi-layer feed-forward network trained with the error back-propagation algorithm; it offers massively parallel processing, distributed information storage, good self-organizing and self-learning ability, and a simple, easily implemented principle. But it also has inherent defects: it easily falls into local minima, converges slowly, and generalizes weakly. A genetic algorithm, by contrast, is a global optimization method that can quickly search the whole solution space without being trapped in local optima; because it computes in a distributed fashion it can speed up practical solving, and the optimized network achieves stronger prediction accuracy than a traditional BP neural network and a smaller mean squared prediction error.
Summary of the invention
The object of the present invention is to provide a speaker recognition method based on a BP neural network that overcomes the defects of the prior art described above.
The object of the present invention is achieved by the following technical solution: a speaker recognition method based on a BP neural network, divided into two steps, a speech training phase and a speech recognition phase. (1) In the speech training phase, the speaker's speech is first processed to obtain the speaker's speech signal and then a preprocessed speech signal. Features are extracted from the preprocessed signal with the MFCC speech parameter extraction method to obtain the speaker's feature parameters; a PSO-BP neural network is then used for model training, and the trained models build and optimize the PSO-BP neural network model base. (2) In the speech recognition phase, speech features are extracted from the speech to be identified by the same method as in the training phase. The feature parameters are fed into the BP neural network, and the network weights stored for each speaker in the model base are loaded in turn; the output is computed with the PSO-BP algorithm and compared one by one with the expected identities in the database, and the identity with the smallest recognition error is taken as the final recognition result.
The beneficial effects of the invention are as follows: the present invention combines MFCC with a BP neural network, so the disclosed method identifies speakers more effectively. Taking a standard back-propagation (BP) neural network as the reference, the BP network is optimized with a particle swarm algorithm to reduce misjudgments caused by abnormal sounds; compared with a traditional BP neural network it achieves stronger prediction accuracy and a smaller mean squared prediction error, and it has broad application prospects.
Brief description of the drawings
Fig. 1 is a schematic diagram of the speech recognition process of the present invention.
Fig. 2 is a schematic diagram of MFCC speech parameter extraction in the present invention.
Fig. 3 is a schematic diagram of the PSO-BP algorithm flow in the present invention.
Fig. 4 is a schematic diagram of the PSO-BP neural network of the present invention.
Embodiment
The present invention is described in detail below with reference to the drawings and specific embodiments.
According to the speaker recognition method based on a BP neural network shown in Figs. 1 to 4, the method is divided into two steps, a speech training phase and a speech recognition phase. In the speech training phase, the speaker's speech is first processed to obtain the speaker's speech signal and then a preprocessed speech signal. Speech signal preprocessing comprises four parts: pre-emphasis, end-point detection, framing, and windowing.
1. pre-emphasis
Because the high-frequency end of the speech signal falls off rapidly, the spectral content of the speech signal at higher frequencies is smaller, so pre-emphasis is applied. Its purpose is to boost the useful high-frequency part of the spectrum and flatten it, keeping the spectrum comparable from low to high frequencies across the whole band, so that the spectrum can be computed with the same signal-to-noise ratio for spectral analysis or vocal-tract parameter analysis. The transfer function of the pre-emphasis filter is H(z) = 1 − μz⁻¹, where μ is the pre-emphasis factor, taken as 1 or a value slightly smaller than 1; typically μ = 0.95.
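As an illustration (not part of the patent text), the pre-emphasis filter y(n) = x(n) − μx(n−1) can be sketched in Python with NumPy; the function name and defaults are assumptions:

```python
import numpy as np

def pre_emphasis(x, mu=0.95):
    """Apply H(z) = 1 - mu*z^-1, i.e. y[n] = x[n] - mu*x[n-1]."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    y[1:] -= mu * x[:-1]   # first sample passes through unchanged
    return y
```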
2. end-point detection
The purpose of end-point detection is to determine the start and end of speech within a segment of signal that contains speech. It not only reduces the processing time to a minimum but also excludes the noise of silent segments, giving the recognition system good recognition performance.
End-point detection is mostly based on time-domain features of the speech signal. Two time-domain features are used here, short-time energy and short-time zero-crossing rate, detected against set thresholds for each. The short-time energy is defined as E_n = Σ_{m=0}^{N−1} [x(m)w(n−m)]²; letting h(n) = w²(n), this becomes E_n = Σ_{m=0}^{N−1} x(m)²·h(n−m). The short-time average magnitude of the speech signal is M_n = Σ_{m=0}^{N−1} |x(m)|w(n−m).
Both E_n and M_n reflect signal strength. The short-time average zero-crossing rate of the speech signal x(n) is defined as
Z_n = Σ_{m=−∞}^{∞} |sgn[x(m)] − sgn[x(m−1)]|·w(n−m), where sgn[x(m)] = 1 for x(m) ≥ 0 and −1 for x(m) < 0.
Here w(n) is a window function that plays the same role as in the short-time energy; it is usually taken as
w(n) = 1/(2N) for 0 ≤ n ≤ N−1, and 0 otherwise.
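The two time-domain features and threshold test above can be sketched per frame as follows; this is an illustrative NumPy version, and `is_speech` is a simplified stand-in for the patent's threshold-based detection:

```python
import numpy as np

def short_time_energy(frame):
    """E_n: sum of squared samples over the frame (rectangular window)."""
    frame = np.asarray(frame, dtype=float)
    return float(np.sum(frame ** 2))

def short_time_zcr(frame):
    """Z_n with w(n) = 1/(2N): average number of sign changes per sample."""
    frame = np.asarray(frame, dtype=float)
    s = np.where(frame >= 0, 1.0, -1.0)   # sgn[x] = 1 if x >= 0, else -1
    return float(np.sum(np.abs(np.diff(s))) / (2 * len(frame)))

def is_speech(frame, energy_thresh, zcr_thresh):
    """Keep a frame as speech when either feature exceeds its threshold."""
    return short_time_energy(frame) > energy_thresh or short_time_zcr(frame) > zcr_thresh
```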
3. Framing
Speech of a certain length is divided into many frames for analysis, so that the analysis methods for stationary processes can be applied. The present invention therefore divides the speech signal into short segments one by one, each called a frame, with a frame length of roughly 10-30 ms. To make the transition between frames smooth and keep continuity, overlapping segmentation is used: the tail of each frame overlaps the head of the next frame.
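The overlapping segmentation can be sketched as follows (illustrative; `frame_len` and `hop` are sample counts, with the hop smaller than the frame length so consecutive frames overlap):

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split x into overlapping frames; each frame starts hop samples after the last."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - frame_len) // hop   # only complete frames are kept
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])
```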
4. windowing
To reduce the truncation effect on a speech frame, lower the gradient at its two ends, and let the ends transition smoothly to zero rather than change sharply, the speech frame is multiplied by a window function. If the frame signal is x(n), the window function is w(n), and the number of samples per frame is N, the windowed signal y(n) is
y(n) = x(n)w(n), 0 ≤ n ≤ N−1.
The window function adopted by the present invention is the Hamming window, whose expression is
w(n) = 0.54 − 0.46 cos[2πn/(N−1)] for 0 ≤ n ≤ N−1, and 0 otherwise.
Multiplying the waveform by a Hamming window compresses the parts of the waveform near its two ends, which is equivalent to shortening the analysis interval by about 40%; the frequency resolution drops by about 40% accordingly. Even so, in the spectral analysis of clearly periodic voiced sounds, multiplying by a suitable window function suppresses the effect of the varying phase relationship between the pitch period and the analysis segment, so a stable spectrum can be obtained.
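The Hamming window above and the product y(n) = x(n)w(n) can be sketched as follows (illustrative helper names):

```python
import numpy as np

def hamming_window(N):
    """w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)) for 0 <= n <= N-1."""
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

def window_frame(frame):
    """Multiply a speech frame by the Hamming window of the same length."""
    frame = np.asarray(frame, dtype=float)
    return frame * hamming_window(len(frame))
```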
5. Speech denoising
The speech signal should be purified as much as possible before transmission; this is crucial for improving voice communication quality. The present invention uses the wavelet transform to denoise the signal, with a good purification effect.
Suppose the noisy speech signal is f(t) = s(t) + n(t), where s(t) is the clean speech signal and n(t) is white Gaussian noise with variance σ².
Applying the discrete wavelet transform to this formula gives
w_{j,k}(f) = ∫ f(t) ψ*_{j,k}(t) dt, j = 0, 1, 2, …, N; k = 0, 1, …, N,
where ψ_{j,k}(t) = 2^{j/2} ψ(2^j t − k).
w_{j,k}(f) is the wavelet coefficient, written cd_{j,k}. The noise-polluted speech signal is first given a discrete wavelet transform to obtain the noisy wavelet coefficients; the coefficients are then thresholded with a set threshold λ: coefficients below λ are treated as caused by noise, and only the significant coefficients exceeding λ are used to reconstruct the speech signal.
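The patent does not name a particular wavelet, so as a minimal illustration the thresholding scheme is shown with a single-level Haar transform; a practical implementation would use a wavelet library and several decomposition levels:

```python
import numpy as np

def haar_denoise(x, lam):
    """One-level Haar DWT, hard-threshold the detail coefficients at lam, invert."""
    x = np.asarray(x, dtype=float)          # length must be even
    a = (x[0::2] + x[1::2]) / np.sqrt(2)    # approximation coefficients
    d = (x[0::2] - x[1::2]) / np.sqrt(2)    # detail coefficients cd_{j,k}
    d = np.where(np.abs(d) > lam, d, 0.0)   # coefficients below lam are treated as noise
    y = np.empty_like(x)
    y[0::2] = (a + d) / np.sqrt(2)          # inverse Haar transform
    y[1::2] = (a - d) / np.sqrt(2)
    return y
```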
Features are extracted from the preprocessed speech signal with the MFCC speech parameter extraction method to obtain the speaker's feature parameters. The MFCC speech parameter extraction method is as follows:
1. The amplitude of the preprocessed speech signal X(n, ω_k) is weighted by the frequency responses of a Mel-scale filter bank. The center frequencies of the Mel-scale filter bank are evenly spaced on the Mel frequency scale; the two base points of each triangular filter lie at the centers of the adjacent filters, so the center frequencies and bandwidths of these filters roughly match the auditory critical-band filter bank. In this system the number of Mel-scale filters is 28.
2. This step computes the energy after Mel-scale filter weighting. Let V_l(ω) denote the frequency response of the l-th filter. The output energy of the l-th Mel-scale filter for the speech frame at time n, E_mel(n, l), is computed over the non-zero region between the lowest and highest frequencies L_l and U_l of each filter.
A normalization term in the formula scales each filter according to its bandwidth, so that an input with a flat spectrum produces equal output energy from every filter.
3. According to E_mel(n, l), the logarithm of the filter-bank outputs is taken and a discrete cosine transform (DCT) is applied, giving the Mel cepstral coefficients of the speech frame at time n, computed as
C_mel[n, m] = (1/R) Σ_{l=0}^{R−1} log{E_mel(n, l)} cos(2πlm/R).
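Given a precomputed Mel filter bank (R filters by spectrum bins), the log-energy and DCT step can be sketched as follows; building the triangular filter bank itself is omitted, and the names are illustrative:

```python
import numpy as np

def mel_cepstrum(power_spectrum, fbank, n_ceps):
    """C_mel[m] = (1/R) * sum_l log(E_mel(l)) * cos(2*pi*l*m/R)."""
    e = fbank @ np.asarray(power_spectrum, dtype=float)  # E_mel(n, l) for one frame
    R = len(e)
    l = np.arange(R)
    return np.array([np.sum(np.log(e) * np.cos(2 * np.pi * l * m / R)) / R
                     for m in range(n_ceps)])
```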
A PSO-BP neural network is then used for model training; the trained models build and optimize the PSO-BP neural network model base. The method for building and optimizing the model base is as follows:
Step 1: Initialization
Initialize the BP network structure, including the numbers of neurons in the input, hidden, and output layers, the learning rate, and the inputs and outputs of the training samples.
Initialize the particle swarm, including the swarm size N, the position and velocity vectors of each particle, each particle's individual extremum and the global optimum, the iteration error precision, the constant coefficients c1 and c2, the maximum inertia weight w_max, the minimum inertia weight w_min, the maximum velocity Vmax, and the maximum number of iterations.
Step 2: Iterative update
1. Update the velocity of each particle, and check whether the updated velocity exceeds the maximum velocity Vmax; if it does, set it to Vmax; otherwise leave it unchanged.
2. Update the position of each particle.
3. Compute the fitness value of each particle.
4. Compute the global minimum fitness of the swarm, fg = min{f1, f2, …, fN}. If the current iteration count has reached the maximum number of iterations, or fg is below the required network training error, stop iterating and go to Step 3; otherwise, update each particle's individual extremum Pi and the global extremum position Pg, and return to sub-step 1 of the iterative update to continue updating particle velocities and positions.
Step 3: Output the network weights and thresholds determined by the position of the global extremum Pg; the algorithm ends.
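Steps 1 to 3 can be sketched as a generic particle swarm over a weight vector. Here `fitness` stands in for the BP network's training error on the sample set, and every parameter default is an illustrative choice, not a value from the patent:

```python
import numpy as np

def pso(fitness, dim, n_particles=20, iters=200, c1=2.0, c2=2.0,
        w_max=0.9, w_min=0.4, v_max=1.0, tol=1e-6, seed=0):
    """Return the best position (network weight vector) and its fitness fg."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1, 1, (n_particles, dim))       # particle positions
    v = np.zeros((n_particles, dim))                 # particle velocities
    pbest = x.copy()                                 # individual extrema Pi
    pbest_f = np.array([fitness(p) for p in x])
    g = pbest[np.argmin(pbest_f)].copy()             # global extremum Pg
    for t in range(iters):
        w = w_max - (w_max - w_min) * t / iters      # decreasing inertia weight
        r1, r2 = rng.random((2, n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        v = np.clip(v, -v_max, v_max)                # cap velocity at Vmax
        x = x + v
        f = np.array([fitness(p) for p in x])
        improved = f < pbest_f
        pbest[improved], pbest_f[improved] = x[improved], f[improved]
        g = pbest[np.argmin(pbest_f)].copy()
        if pbest_f.min() < tol:                      # fg meets the accuracy requirement
            break
    return g, float(pbest_f.min())
```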
4. Speech recognition phase
In the speech recognition phase, speech features are extracted from the speech to be identified by the same method as in the training phase. The feature parameters are fed into the BP neural network, and the network weights stored for each speaker in the model base are loaded in turn; the output is computed with the PSO-BP algorithm and compared one by one with the expected identities in the database, and the identity with the smallest recognition error is taken as the final recognition result.
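The comparison against the model base can be sketched as follows; `forward` stands in for the trained BP network's forward pass with a given speaker's stored weights, and the target output and names are illustrative:

```python
import numpy as np

def identify(features, model_bank, forward, target=1.0):
    """Score the features with each speaker's stored weights; the identity with
    the smallest recognition error is the final result."""
    errors = {spk: float(np.mean((forward(w, features) - target) ** 2))
              for spk, w in model_bank.items()}
    return min(errors, key=errors.get)
```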
The foregoing is only a representative embodiment of the present invention and does not limit the present invention in any way; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (4)

1. A speaker recognition method based on a BP neural network, divided into two steps, a speech training phase and a speech recognition phase, characterized in that the speech training phase comprises: first processing the speaker's speech to obtain the speaker's speech signal and a preprocessed speech signal, the speech signal preprocessing comprising pre-emphasis, end-point detection, framing, and windowing.
2. The speaker recognition method based on a BP neural network according to claim 1, characterized in that the MFCC speech parameter extraction method extracts features from the preprocessed speech signal to obtain the speaker's feature parameters; the MFCC speech parameter extraction method is as follows:
(1) the amplitude of the preprocessed speech signal X(n, ω_k) is weighted by the frequency responses of a Mel-scale filter bank; the center frequencies of the Mel-scale filter bank are evenly spaced on the Mel frequency scale, the two base points of each triangular filter lie at the centers of the adjacent filters, and the center frequencies and bandwidths of these filters roughly match the auditory critical-band filter bank; in the system the number of Mel-scale filters is 28;
(2) this step computes the energy after Mel-scale filter weighting, with V_l(ω) denoting the frequency response of the l-th filter; the output energy of the l-th Mel-scale filter for the speech frame at time n, E_mel(n, l), is computed over the non-zero region between the lowest and highest frequencies L_l and U_l of each filter;
a normalization term in the formula scales each filter according to its bandwidth, so that an input with a flat spectrum produces equal output energy from every filter;
(3) according to E_mel(n, l), the logarithm of the filter-bank outputs is taken and a discrete cosine transform (DCT) is applied, giving the Mel cepstral coefficients of the speech frame at time n, computed as
C_mel[n, m] = (1/R) Σ_{l=0}^{R−1} log{E_mel(n, l)} cos(2πlm/R).
3. The speaker recognition method based on a BP neural network according to claim 2, characterized in that the PSO-BP neural network performs model training and the trained models build and optimize the PSO-BP neural network model base, as follows:
Step 1: Initialization
Initialize the BP network structure, including the numbers of neurons in the input, hidden, and output layers, the learning rate, and the inputs and outputs of the training samples;
Initialize the particle swarm, including the swarm size N, the position and velocity vectors of each particle, each particle's individual extremum and the global optimum, the iteration error precision, the constant coefficients c1 and c2, the maximum inertia weight w_max, the minimum inertia weight w_min, the maximum velocity Vmax, and the maximum number of iterations;
Step 2: Iterative update
(1) update the velocity of each particle, and check whether the updated velocity exceeds the maximum velocity Vmax; if it does, set it to Vmax; otherwise leave it unchanged;
(2) update the position of each particle;
(3) compute the fitness value of each particle;
(4) compute the global minimum fitness of the swarm, fg = min{f1, f2, …, fN}; if the current iteration count has reached the maximum number of iterations, or fg is below the required network training error, stop iterating and go to Step 3; otherwise, update each particle's individual extremum Pi and the global extremum position Pg, and return to sub-step (1) to continue updating particle velocities and positions;
Step 3: Output the network weights and thresholds determined by the position of the global extremum Pg; the algorithm ends.
4. The speaker recognition method based on a BP neural network according to claim 1, characterized in that in the speech recognition phase, speech features are extracted from the speech to be identified by the same method as in the training phase; the feature parameters are fed into the BP neural network, and the network weights stored for each speaker in the model base are loaded in turn; the output is computed with the PSO-BP algorithm and compared one by one with the expected identities in the database, and the identity with the smallest recognition error is taken as the final recognition result.
CN201410270239.0A 2014-06-18 2014-06-18 Speaker recognition method based on BP neural network Pending CN104008751A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410270239.0A CN104008751A (en) 2014-06-18 2014-06-18 Speaker recognition method based on BP neural network


Publications (1)

Publication Number Publication Date
CN104008751A true CN104008751A (en) 2014-08-27

Family

ID=51369378

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410270239.0A Pending CN104008751A (en) 2014-06-18 2014-06-18 Speaker recognition method based on BP neural network

Country Status (1)

Country Link
CN (1) CN104008751A (en)

Cited By (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104569035A (en) * 2015-02-04 2015-04-29 神华集团有限责任公司 Method for acquiring critical property parameters of coal liquefaction oil
CN104732978A (en) * 2015-03-12 2015-06-24 上海交通大学 A Text-Dependent Speaker Recognition Method Based on Joint Deep Learning
CN105323700A (en) * 2015-12-02 2016-02-10 逢甲大学 Manufacturing Method of Customized In-Ear Headphones
CN106157953A (en) * 2015-04-16 2016-11-23 科大讯飞股份有限公司 continuous speech recognition method and system
CN106328126A (en) * 2016-10-20 2017-01-11 北京云知声信息技术有限公司 Far-field speech recognition processing method and device
CN106448680A (en) * 2016-03-01 2017-02-22 常熟苏大低碳应用技术研究院有限公司 Missing data feature (MDF) speaker identification method using perception auditory scene analysis (PASA)
CN106601240A (en) * 2015-10-16 2017-04-26 三星电子株式会社 Apparatus and method for normalizing input data of acoustic model and speech recognition apparatus
CN106611598A (en) * 2016-12-28 2017-05-03 上海智臻智能网络科技股份有限公司 VAD dynamic parameter adjusting method and device
CN106952649A (en) * 2017-05-14 2017-07-14 北京工业大学 Speaker Recognition Method Based on Convolutional Neural Network and Spectrogram
CN107240397A (en) * 2017-08-14 2017-10-10 广东工业大学 A kind of smart lock and its audio recognition method and system based on Application on Voiceprint Recognition
CN107527620A (en) * 2017-07-25 2017-12-29 平安科技(深圳)有限公司 Electronic installation, the method for authentication and computer-readable recording medium
CN107808659A (en) * 2017-12-02 2018-03-16 宫文峰 Intelligent sound signal type recognition system device
CN108140386A (en) * 2016-07-15 2018-06-08 谷歌有限责任公司 Speaker verification
WO2018107810A1 (en) * 2016-12-15 2018-06-21 平安科技(深圳)有限公司 Voiceprint recognition method and apparatus, and electronic device and medium
CN108417217A (en) * 2018-01-11 2018-08-17 苏州思必驰信息科技有限公司 Speaker recognition network model training method, speaker recognition method and system
CN108590244A (en) * 2018-07-12 2018-09-28 吉林工程技术师范学院 A kind of books post house for reading journalism object
CN108847244A (en) * 2018-08-22 2018-11-20 华东计算技术研究所(中国电子科技集团公司第三十二研究所) Voiceprint recognition method and system based on MFCC and improved BP neural network
CN108847245A (en) * 2018-08-06 2018-11-20 北京海天瑞声科技股份有限公司 Speech detection method and device
CN108899032A (en) * 2018-06-06 2018-11-27 平安科技(深圳)有限公司 Method for recognizing sound-groove, device, computer equipment and storage medium
CN108899037A (en) * 2018-07-05 2018-11-27 平安科技(深圳)有限公司 Animal vocal print feature extracting method, device and electronic equipment
CN109036385A (en) * 2018-10-19 2018-12-18 北京旋极信息技术股份有限公司 A kind of voice instruction recognition method, device and computer storage medium
CN109119085A (en) * 2018-08-24 2019-01-01 深圳竹云科技有限公司 A kind of relevant audio recognition method of asymmetric text based on wavelet analysis and super vector
CN109394472A (en) * 2018-09-19 2019-03-01 宁波杰曼智能科技有限公司 A kind of healing robot motion intention recognition methods based on neural network classifier
CN110232372A (en) * 2019-06-26 2019-09-13 电子科技大学成都学院 Gait recognition method based on particle group optimizing BP neural network
CN110914899A (en) * 2017-07-19 2020-03-24 日本电信电话株式会社 Mask calculation device, cluster weight learning device, mask calculation neural network learning device, mask calculation method, cluster weight learning method, and mask calculation neural network learning method
CN111259750A (en) * 2020-01-10 2020-06-09 西北工业大学 Underwater sound target identification method for optimizing BP neural network based on genetic algorithm
CN111341327A (en) * 2020-02-28 2020-06-26 广州国音智能科技有限公司 Speaker voice recognition method, device and equipment based on particle swarm optimization
CN111524520A (en) * 2020-04-22 2020-08-11 星际(重庆)智能装备技术研究院有限公司 Voiceprint recognition method based on error reverse propagation neural network
CN112053680A (en) * 2020-09-11 2020-12-08 中航华东光电(上海)有限公司 Voice air conditioner control device suitable for blind person
US10984795B2 (en) 2018-04-12 2021-04-20 Samsung Electronics Co., Ltd. Electronic apparatus and operation method thereof
CN113053398A (en) * 2021-03-11 2021-06-29 东风汽车集团股份有限公司 Speaker recognition system and method based on MFCC (Mel frequency cepstrum coefficient) and BP (Back propagation) neural network
CN113611291A (en) * 2020-08-12 2021-11-05 广东电网有限责任公司 Speech recognition algorithm for electric power major

Cited By (42)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104569035A (en) * 2015-02-04 2015-04-29 神华集团有限责任公司 Method for acquiring critical property parameters of coal liquefaction oil
CN104732978B (en) * 2015-03-12 2018-05-08 上海交通大学 Text-related speaker recognition method based on joint deep learning
CN104732978A (en) * 2015-03-12 2015-06-24 上海交通大学 A Text-Dependent Speaker Recognition Method Based on Joint Deep Learning
CN106157953B (en) * 2015-04-16 2020-02-07 科大讯飞股份有限公司 Continuous speech recognition method and system
CN106157953A (en) * 2015-04-16 2016-11-23 科大讯飞股份有限公司 continuous speech recognition method and system
CN106601240A (en) * 2015-10-16 2017-04-26 三星电子株式会社 Apparatus and method for normalizing input data of acoustic model and speech recognition apparatus
CN106601240B (en) * 2015-10-16 2021-10-01 三星电子株式会社 Apparatus and method for normalizing input data of an acoustic model and speech recognition apparatus
CN105323700A (en) * 2015-12-02 2016-02-10 逢甲大学 Manufacturing Method of Customized In-Ear Headphones
CN106448680A (en) * 2016-03-01 2017-02-22 常熟苏大低碳应用技术研究院有限公司 Missing data feature (MDF) speaker identification method using perception auditory scene analysis (PASA)
CN108140386B (en) * 2016-07-15 2021-11-23 谷歌有限责任公司 Speaker verification
CN108140386A (en) * 2016-07-15 2018-06-08 谷歌有限责任公司 Speaker verification
CN106328126A (en) * 2016-10-20 2017-01-11 北京云知声信息技术有限公司 Far-field speech recognition processing method and device
WO2018107810A1 (en) * 2016-12-15 2018-06-21 平安科技(深圳)有限公司 Voiceprint recognition method and apparatus, and electronic device and medium
CN106611598B (en) * 2016-12-28 2019-08-02 上海智臻智能网络科技股份有限公司 A kind of VAD dynamic parameter adjustment method and device
CN106611598A (en) * 2016-12-28 2017-05-03 上海智臻智能网络科技股份有限公司 VAD dynamic parameter adjusting method and device
CN106952649A (en) * 2017-05-14 2017-07-14 北京工业大学 Speaker Recognition Method Based on Convolutional Neural Network and Spectrogram
CN110914899B (en) * 2017-07-19 2023-10-24 日本电信电话株式会社 Mask calculation device, cluster weight learning device, mask calculation neural network learning device, mask calculation method, cluster weight learning method, and mask calculation neural network learning method
CN110914899A (en) * 2017-07-19 2020-03-24 日本电信电话株式会社 Mask calculation device, cluster weight learning device, mask calculation neural network learning device, mask calculation method, cluster weight learning method, and mask calculation neural network learning method
CN107527620A (en) * 2017-07-25 2017-12-29 平安科技(深圳)有限公司 Electronic device, authentication method and computer-readable storage medium
CN107527620B (en) * 2017-07-25 2019-03-26 平安科技(深圳)有限公司 Electronic device, authentication method and computer-readable storage medium
CN107240397A (en) * 2017-08-14 2017-10-10 广东工业大学 Smart lock based on voiceprint recognition and speech recognition method and system thereof
CN107808659A (en) * 2017-12-02 2018-03-16 宫文峰 Intelligent sound signal type recognition system device
CN108417217A (en) * 2018-01-11 2018-08-17 苏州思必驰信息科技有限公司 Speaker recognition network model training method, speaker recognition method and system
US10984795B2 (en) 2018-04-12 2021-04-20 Samsung Electronics Co., Ltd. Electronic apparatus and operation method thereof
CN108899032A (en) * 2018-06-06 2018-11-27 平安科技(深圳)有限公司 Voiceprint recognition method, device, computer equipment and storage medium
CN108899037A (en) * 2018-07-05 2018-11-27 平安科技(深圳)有限公司 Animal vocal print feature extracting method, device and electronic equipment
CN108899037B (en) * 2018-07-05 2024-01-26 平安科技(深圳)有限公司 Animal voiceprint feature extraction method and device and electronic equipment
CN108590244A (en) * 2018-07-12 2018-09-28 吉林工程技术师范学院 Book post for reading news publications
CN108590244B (en) * 2018-07-12 2024-02-09 吉林工程技术师范学院 Book post for reading news publications
CN108847245A (en) * 2018-08-06 2018-11-20 北京海天瑞声科技股份有限公司 Speech detection method and device
CN108847244A (en) * 2018-08-22 2018-11-20 华东计算技术研究所(中国电子科技集团公司第三十二研究所) Voiceprint recognition method and system based on MFCC and improved BP neural network
CN109119085A (en) * 2018-08-24 2019-01-01 深圳竹云科技有限公司 Asymmetric text-dependent speech recognition method based on wavelet analysis and supervectors
CN109394472A (en) * 2018-09-19 2019-03-01 宁波杰曼智能科技有限公司 Rehabilitation robot motion intention recognition method based on neural network classifier
CN109036385A (en) * 2018-10-19 2018-12-18 北京旋极信息技术股份有限公司 Voice instruction recognition method, device and computer storage medium
CN110232372A (en) * 2019-06-26 2019-09-13 电子科技大学成都学院 Gait recognition method based on particle group optimizing BP neural network
CN111259750A (en) * 2020-01-10 2020-06-09 西北工业大学 Underwater sound target identification method for optimizing BP neural network based on genetic algorithm
CN111341327A (en) * 2020-02-28 2020-06-26 广州国音智能科技有限公司 Speaker voice recognition method, device and equipment based on particle swarm optimization
CN111524520A (en) * 2020-04-22 2020-08-11 星际(重庆)智能装备技术研究院有限公司 Voiceprint recognition method based on error reverse propagation neural network
CN113611291A (en) * 2020-08-12 2021-11-05 广东电网有限责任公司 Speech recognition algorithm for the electric power domain
CN112053680A (en) * 2020-09-11 2020-12-08 中航华东光电(上海)有限公司 Voice-controlled air conditioner control device suitable for blind persons
CN113053398A (en) * 2021-03-11 2021-06-29 东风汽车集团股份有限公司 Speaker recognition system and method based on MFCC (Mel frequency cepstrum coefficient) and BP (Back propagation) neural network
CN113053398B (en) * 2021-03-11 2022-09-27 东风汽车集团股份有限公司 Speaker recognition system and method based on MFCC (Mel frequency cepstrum coefficient) and BP (Back propagation) neural network

Similar Documents

Publication Publication Date Title
CN104008751A (en) Speaker recognition method based on BP neural network
CN107146601B (en) Rear-end i-vector enhancement method for speaker recognition system
US20200074997A1 (en) Method and system for detecting voice activity in noisy conditions
Chang et al. Robust CNN-based speech recognition with Gabor filter kernels.
Cai et al. Sensor network for the monitoring of ecosystem: Bird species recognition
CN109192200B (en) Speech recognition method
WO2019023877A1 (en) Specific sound recognition method and device, and storage medium
CN103236260A (en) Voice recognition system
CN102800316A (en) Optimal codebook design method for voiceprint recognition system based on neural network
CN102568476B (en) Voice conversion method based on self-organizing feature map network cluster and radial basis network
CN110942766A (en) Audio event detection method, system, mobile terminal and storage medium
CN110853656A (en) Audio Tampering Recognition Algorithm Based on Improved Neural Network
CN110211594A (en) Speaker recognition method based on twin network model and KNN algorithm
Mistry et al. Overview: Speech recognition technology, mel-frequency cepstral coefficients (mfcc), artificial neural network (ann)
CN102237083A (en) Portable interpretation system based on WinCE platform and language recognition method thereof
Yusnita et al. Automatic gender recognition using linear prediction coefficients and artificial neural network on speech signal
CN107424625A (en) Multi-center voice activity detection method based on support vector machine framework
Nawas et al. Speaker recognition using random forest
CN118016106A (en) Emotional health analysis and support system for the elderly
CN113628639A (en) Voice emotion recognition method based on multi-head attention mechanism
CN206781702U (en) Speech recognition automotive anti-theft system based on quantum neural network
Song et al. Research on scattering transform of urban sound events detection based on self-attention mechanism
Zeng et al. Multi-feature fusion speech emotion recognition based on SVM
Wei et al. Improvements on self-adaptive voice activity detector for telephone data
Sankavi et al. Deep learning based automatic noisy speech classification for enhanced speech analysis

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
DD01 Delivery of document by public notice

Addressee: Zhou Tingting

Document name: Notification of Passing Preliminary Examination of the Application for Invention

DD01 Delivery of document by public notice

Addressee: Zhou Tingting

Document name: Notification of Publication of the Application for Invention

DD01 Delivery of document by public notice

Addressee: Zhou Tingting

Document name: Notification before Expiration of Time Limit for Request of Examination as to Substance

DD01 Delivery of document by public notice

Addressee: Zhou Tingting

Document name: Notification that Application Deemed to be Withdrawn

WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140827
