CN106601234A - Implementation method of placename speech modeling system for goods sorting

Info

Publication number: CN106601234A
Application number: CN201611007973.3A
Authority: CN
Inventors: 谢巍; 董万里; 何伶珍
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2016-11-16
Filing date: 2016-11-16
Publication date: 2017-04-26

Abstract

The invention discloses a method for implementing a place-name speech modeling system for goods sorting, comprising the following steps: 1) preprocessing the speech signal, including pre-emphasis and framing; 2) performing endpoint detection and cutting on the speech signal preprocessed in step 1); 3) checking the result of the cutting in step 2) and, if the cut is correct, saving the signal in a specified folder; 4) once all speech signals in step 3) have been recorded, building a hidden Markov model of the speech signals and saving the resulting model data. Because place-name speech signals are short, the invention treats each place-name speech signal as a single recognition object and builds a hidden Markov model directly on the whole place name, which keeps the computation simple and efficient.

Description

A method for implementing a place-name speech modeling system for goods sorting

Technical Field

The invention relates to the fields of signal processing, pattern recognition and human-computer interaction, and in particular to a method for implementing a place-name speech modeling system for goods sorting.

Background Art

At present, in the sorting stage at logistics sites, the main sorting method is to confirm the slot to which goods are to be directed by pressing keys. The operator must key in the sorting information for the goods, which is time-consuming and cumbersome. With a place-name speech recognition system, the operator can speak directly to the sorting system and tell it where the goods are to be routed, which makes sorting more efficient, faster and less time-consuming. A place-name speech modeling system can build hidden Markov models of place-name speech signals, which facilitates place-name speech recognition.

Summary of the Invention

The object of the invention is to overcome the above deficiencies of the prior art by providing a method for implementing a place-name speech modeling system for goods sorting that is easy to operate and supports human-computer interaction.

The object of the invention can be achieved through the following technical solution:

A method for implementing a place-name speech modeling system for goods sorting, the method comprising the following steps:

1) preprocessing the speech signal, including pre-emphasis and framing;

2) performing endpoint detection and cutting on the speech signal preprocessed in step 1);

3) checking the result of the cutting in step 2) and, if the cut is correct, saving the signal in a specified folder;

4) once all speech signals in step 3) have been recorded, building a hidden Markov model of the speech signals and saving the resulting model data.

Preferably, in step 1), the pre-emphasis operation passes the signal through a high-pass filter so that its spectrum becomes flat and stays flat over the whole band from low to high frequencies, allowing the spectrum to be computed with the same signal-to-noise ratio throughout.

Preferably, the endpoint detection method used in step 2) is a double-threshold method based on short-time energy and short-time zero-crossing rate, with the following specific steps:

1. Before endpoint detection begins, two thresholds are set for each of the short-time energy and the zero-crossing rate: a low threshold with a small value, which is sensitive to changes in the signal and easily exceeded, and a high threshold with a larger value, which is exceeded only when the signal reaches the set strength;

2. The speech signal x(n) is divided into frames, each frame being denoted s_i(n), n = 1, 2, …, N, where n is the discrete time index of the speech signal, N is the frame length, and i is the frame index;

3. The short-time energy of each frame is calculated, giving the short-time frame energy of the speech signal:

E_i = \sum_{n=1}^{N} s_i^2(n)

where N is the frame length, i is the frame index, and s_i^2(n) is the square of the value of the n-th (1 ≤ n ≤ N) sample of the i-th frame;

4. The zero-crossing rate of each frame is calculated, giving the short-time zero-crossing rate of the speech signal:

Z_i = \frac{1}{2} \sum_{n=1}^{N} | sgn[s_i(n)] - sgn[s_i(n-1)] |

where:

sgn[x] = 1 for x ≥ 0, and sgn[x] = -1 for x < 0

and sgn[s_i(n)] is the sign of the value of the n-th (1 ≤ n ≤ N) sample of the i-th frame;
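By way of illustration only, the two per-frame features can be sketched in Python as follows. The short-time energy is taken as the sum of squared samples of a frame, and the zero-crossing rate as half the sum of absolute differences between the signs of adjacent samples. The function names are hypothetical, the convention sgn(0) = 1 is an assumption, and the sum starts at the second sample of the frame because the first sample has no predecessor inside the frame.

```python
from typing import Sequence


def sgn(x: float) -> int:
    """Sign function; treating zero as positive is an assumption of this sketch."""
    return 1 if x >= 0 else -1


def short_time_energy(frame: Sequence[float]) -> float:
    """Sum of squared sample values of one frame."""
    return sum(v * v for v in frame)


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Half the sum of absolute sign differences between adjacent samples.

    Each sign change contributes |1 - (-1)| / 2 = 1, so the result is the
    number of zero crossings inside the frame.
    """
    return 0.5 * sum(
        abs(sgn(frame[n]) - sgn(frame[n - 1])) for n in range(1, len(frame))
    )


frame = [1.0, -1.0, 1.0, -1.0]
print(short_time_energy(frame))   # 4.0
print(zero_crossing_rate(frame))  # 3.0
```

A frame with low energy but a high zero-crossing rate typically corresponds to an unvoiced consonant, which is why the detector considers both features rather than energy alone.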

The endpoint detection then proceeds through four states: silence, transition, speech, and end. In the silence state, if the short-time energy or the zero-crossing rate exceeds its low threshold, the frame is marked as the starting point and the detector enters the transition state. In the transition state, if both the short-time energy and the zero-crossing rate fall back below their low thresholds, the detector returns to the silence state; if either of them exceeds its high threshold, the detector is considered to have entered the speech state. In the speech state, if both parameters drop below their low thresholds while the total duration counted so far is shorter than the minimum-duration threshold, the segment is treated as noise and the length of the speech signal is counted again.
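The state transitions described above can be sketched as a small state machine over per-frame energy and zero-crossing values. This is a minimal sketch, not the patented implementation: the function name, the parameter names and the default values are hypothetical, and the `max_silence` tolerance (the number of quiet frames allowed inside speech before it is considered finished) is an assumption added so that short pauses between syllables do not end the segment.

```python
from typing import Optional, Sequence, Tuple

SILENCE, TRANSITION, SPEECH = 0, 1, 2


def detect_endpoints(
    energy: Sequence[float],
    zcr: Sequence[float],
    e_low: float,
    e_high: float,
    z_low: float,
    z_high: float,
    min_len: int = 5,       # hypothetical minimum speech length, in frames
    max_silence: int = 3,   # hypothetical tolerated pause, in frames
) -> Optional[Tuple[int, int]]:
    """Return (start_frame, end_frame) of the first speech segment, or None."""
    state = SILENCE
    start = last_voiced = silence = 0
    for i, (e, z) in enumerate(zip(energy, zcr)):
        above_low = e > e_low or z > z_low
        above_high = e > e_high or z > z_high
        if state == SILENCE:
            if above_low:
                # Mark a candidate starting point.
                start = last_voiced = i
                silence = 0
                state = SPEECH if above_high else TRANSITION
        elif state == TRANSITION:
            if above_high:
                state, last_voiced = SPEECH, i
            elif not above_low:
                state = SILENCE  # both features fell back below the low thresholds
        else:  # SPEECH
            if above_low:
                last_voiced, silence = i, 0
            else:
                silence += 1
                if silence > max_silence:
                    if last_voiced - start + 1 >= min_len:
                        return start, last_voiced
                    state = SILENCE  # too short: treat as noise and start over
    if state == SPEECH and last_voiced - start + 1 >= min_len:
        return start, last_voiced
    return None


energy = [0, 0, 2, 6, 6, 0, 6, 6, 0, 0, 0, 0]
zcr = [0] * len(energy)
print(detect_endpoints(energy, zcr, 1, 5, 10, 20, min_len=3, max_silence=1))  # (2, 7)
```

In the usage example, the single quiet frame at index 5 is shorter than the tolerated pause, so the two bursts are kept together as one segment spanning frames 2 to 7.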

Preferably, in step 2), a threshold length is set according to the syllables of the speech signal, together with a variable slience1 that counts the length of the signal that may belong to a speech segment. If slience1 is smaller than the threshold length, the loop continues and the length of the speech signal is recalculated; if it is larger than the threshold length, the preceding speech signal is discarded.

Preferably, the hardware of the place-name speech modeling system for goods sorting comprises a high-performance noise-canceling headset and a computer.

Preferably, the place-name speech modeling system for goods sorting is operated through a purpose-built operation interface.

Compared with the prior art, the invention has the following advantages and beneficial effects:

1. Because place-name speech signals are short, the invention treats each place-name speech signal as a single recognition object and builds a hidden Markov model directly on the whole place name, which keeps the computation simple.

2. The invention provides a graphical interface that facilitates human-computer interaction and is easy to operate. Through the system, the speech waveform and the plot produced after endpoint detection can be viewed directly to judge whether the detection is correct. When the system is deployed in different regions, models can be built for a group of speakers with a particular accent, which improves the accuracy of subsequent recognition.

Brief Description of the Drawings

Fig. 1 is a schematic diagram of the place-name speech modeling system for goods sorting according to the invention.

Fig. 2 shows the endpoint detection result before the improvement, according to an embodiment of the invention.

Fig. 3 shows the endpoint detection result after the improvement, according to an embodiment of the invention.

Fig. 4 shows the result of double-threshold endpoint detection on a speech signal according to an embodiment of the invention, where Fig. 4(a) is the waveform of the speech signal "Wuhan" (武汉), Fig. 4(b) is its short-time energy, and Fig. 4(c) is its zero-crossing rate.

Fig. 5 shows the hidden Markov chain model of place-name speech signals for goods sorting according to the invention.

Fig. 6 is a schematic diagram of how the invention builds a hidden Markov model.

Fig. 7 shows the interface of the place-name speech modeling system for goods sorting according to the invention.

Fig. 8 shows the interface of the place-name speech modeling system for goods sorting in use.

Detailed Description

The invention is described in further detail below with reference to an embodiment and the accompanying drawings, but embodiments of the invention are not limited thereto.

Embodiment:

This embodiment provides a method for implementing a place-name speech modeling system for goods sorting. Fig. 1 is a schematic diagram of the system. The method comprises steps 1) to 4) set out above.

In step 1), the pre-emphasis operation passes the signal through a high-pass filter so that its spectrum becomes flat and stays flat over the whole band from low to high frequencies, allowing the spectrum to be computed with the same signal-to-noise ratio throughout. In this embodiment, the sampling frequency is 8 kHz, the frame length is 256 samples, and the frame shift is 128 samples.
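As a minimal sketch of this preprocessing, the following Python code applies a first-order pre-emphasis filter and splits the result into overlapping frames using the frame length of 256 and frame shift of 128 stated for this embodiment. The filter form y(n) = x(n) - αx(n-1) and the coefficient α = 0.97 are assumptions (the text only specifies a high-pass filter), and the function names are hypothetical.

```python
from typing import List, Sequence


def pre_emphasis(signal: Sequence[float], alpha: float = 0.97) -> List[float]:
    """First-order high-pass filter y(n) = x(n) - alpha * x(n-1).

    alpha = 0.97 is a commonly used value, assumed here for illustration.
    """
    if not signal:
        return []
    out = [signal[0]]
    for n in range(1, len(signal)):
        out.append(signal[n] - alpha * signal[n - 1])
    return out


def split_frames(
    signal: Sequence[float], frame_len: int = 256, frame_shift: int = 128
) -> List[List[float]]:
    """Cut the signal into overlapping frames; an incomplete tail is dropped."""
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(list(signal[start:start + frame_len]))
        start += frame_shift
    return frames


# A 2 s recording at 8 kHz contains 16000 samples (constant test signal here).
signal = [1.0] * 16000
frames = split_frames(pre_emphasis(signal))
print(len(frames), len(frames[0]))  # 124 256
```

With a 50% overlap between neighbouring frames, a 2 s recording at 8 kHz yields 124 complete frames of 32 ms each.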

In step 2), the endpoint detection method used is the double-threshold method based on short-time energy and short-time zero-crossing rate, whose specific steps are as described above.

Because the signals in this embodiment are place-name speech signals, each consisting of 2 to 4 syllables, a problem can arise when the first syllable is detected: if the second syllable lies far from the first and the first syllable is very short, that is, if the speech is intermittent, the first syllable may be filtered out as noise. This embodiment therefore sets a threshold length according to the syllables of the speech signal, together with a variable slience1 that counts the length of the signal that may belong to a speech segment. If slience1 is smaller than the threshold length, the loop continues and the length of the speech signal is recalculated; if it is larger than the threshold length, the preceding speech signal is discarded. Fig. 2 shows the endpoint detection result of this embodiment before the improvement, and Fig. 3 shows the result after the improvement.

Fig. 4 shows the result of double-threshold endpoint detection on a speech signal in this embodiment, where Fig. 4(a) is the waveform of the speech signal "Wuhan" (武汉), Fig. 4(b) is its short-time energy, and Fig. 4(c) is its zero-crossing rate.

Fig. 5 shows the hidden Markov chain model of place-name speech signals for goods sorting in this embodiment. In this embodiment, one model contains 4 states, and the state transition matrix is a 4×4 matrix whose element a_ij is the probability of moving from state i to state j. States 1 to 4 are associated with the four functions f₁, f₂, f₃, f₄ respectively. The four functions have the same structure, each being composed of three 39-dimensional Gaussian probability density functions. For example, f₁ has the following structure:

f₁ = k₁₁N[o, μ₁₁, U₁₁] + k₁₂N[o, μ₁₂, U₁₂] + k₁₃N[o, μ₁₃, U₁₃]

The expression above is the function f₁ of state 1. The three coefficients k₁₁, k₁₂, k₁₃ are the weights, and μ and U are the mean and variance of each Gaussian probability density function. In summary, each state in the model comprises a 1×3 weight vector, a 3×39 mean matrix and a 3×39 variance matrix. One hidden Markov model therefore contains the following data: one 4×4 state transition matrix, four 1×3 weight vectors, four 3×39 mean matrices and four 3×39 variance matrices.
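The model layout described above, and the evaluation of one state's mixture density, can be sketched in Python as follows. Since each state stores a 3×39 variance matrix, the sketch treats every Gaussian as having a diagonal covariance. The function names, the dictionary keys and the placeholder initial values (uniform transitions and weights, zero means, unit variances) are hypothetical and serve only to show the shapes.

```python
import math
from typing import Dict, List, Sequence

N_STATES, N_MIX, DIM = 4, 3, 39  # sizes stated in the embodiment


def diag_gaussian_logpdf(
    o: Sequence[float], mean: Sequence[float], var: Sequence[float]
) -> float:
    """Log density of a Gaussian with diagonal covariance at observation o."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (x - m) ** 2 / v
        for x, m, v in zip(o, mean, var)
    )


def state_density(
    o: Sequence[float],
    weights: Sequence[float],
    means: Sequence[Sequence[float]],
    variances: Sequence[Sequence[float]],
) -> float:
    """Mixture density f = sum_m k_m * N[o, mu_m, U_m] for one state."""
    return sum(
        k * math.exp(diag_gaussian_logpdf(o, mu, u))
        for k, mu, u in zip(weights, means, variances)
    )


def empty_model() -> Dict[str, List]:
    """Placeholder model holding the four parameter groups of one HMM."""
    return {
        "A": [[1.0 / N_STATES] * N_STATES for _ in range(N_STATES)],
        "weights": [[1.0 / N_MIX] * N_MIX for _ in range(N_STATES)],
        "means": [[[0.0] * DIM for _ in range(N_MIX)] for _ in range(N_STATES)],
        "variances": [[[1.0] * DIM for _ in range(N_MIX)] for _ in range(N_STATES)],
    }


def count_parameters(model: Dict[str, List]) -> int:
    """Total number of stored values: 16 + 12 + 468 + 468."""
    total = sum(len(row) for row in model["A"])
    total += sum(len(w) for w in model["weights"])
    total += sum(len(vec) for st in model["means"] for vec in st)
    total += sum(len(vec) for st in model["variances"] for vec in st)
    return total


model = empty_model()
print(count_parameters(model))  # 964
f1 = state_density([0.0] * DIM, model["weights"][0], model["means"][0], model["variances"][0])
print(f1 > 0.0)  # True
```

In practice the per-component log densities would usually be combined in the log domain to avoid underflow for 39-dimensional observations; the direct sum is kept here to mirror the expression for f₁.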

Fig. 6 is a schematic diagram of how the invention builds a hidden Markov model. First, one class of speech signals is input and the MFCC (Mel-frequency cepstral coefficient) matrix of every signal, that is, its feature matrix, is extracted. The state transition matrix and the mean and variance matrices of the model are initialized, and the parameters are re-estimated with the Baum-Welch algorithm. All feature matrices of this class of signals are then fed into the model, which is used to compute their output probability. If the growth rate of the output probability between two successive iterations is below a set threshold, the model has converged, training can end, and the data are saved. If the growth rate is above the threshold, training continues, stopping at the latest when the number of training iterations reaches a set limit. Once the model data have been obtained, training is complete and the model data can be used by the subsequent recognition program.
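The stopping rule of this training procedure can be sketched as a generic loop in Python. The sketch does not implement Baum-Welch itself: `reestimate` is a hypothetical callback standing for one re-estimation pass that returns the updated model and its log output probability. Measuring the growth rate as the relative change of the log probability, and the default values of `max_iter` and `tol`, are assumptions.

```python
from typing import Callable, Tuple, TypeVar

M = TypeVar("M")


def train_until_converged(
    model: M,
    reestimate: Callable[[M], Tuple[M, float]],
    max_iter: int = 40,   # hypothetical cap on training iterations
    tol: float = 1e-3,    # hypothetical growth-rate threshold
) -> Tuple[M, int, float]:
    """Repeat re-estimation until the log probability stops growing.

    Returns the final model, the number of passes run, and the last
    log output probability.
    """
    prev = None
    log_prob = float("-inf")
    for iteration in range(1, max_iter + 1):
        model, log_prob = reestimate(model)
        if prev is not None:
            scale = abs(prev) if prev != 0.0 else 1.0
            growth = abs(log_prob - prev) / scale
            if growth < tol:
                return model, iteration, log_prob  # converged
        prev = log_prob
    return model, max_iter, log_prob  # iteration limit reached


def toy_step(step_count: int) -> Tuple[int, float]:
    """Stand-in for one Baum-Welch pass: log probability approaches -10."""
    step_count += 1
    return step_count, -10.0 - 90.0 * 0.5 ** step_count


final_model, iterations, final_log_prob = train_until_converged(0, toy_step)
print(iterations < 40, round(final_log_prob, 1))  # True -10.0
```

Capping the number of passes guarantees termination even when the probability keeps creeping upward by more than the threshold.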

The specific operation is as follows. Fig. 7 shows the modeling interface of the place-name speech recognition system. Clicking the "System Operation Instructions" button opens the operating instructions. First, in the "System Settings" panel, the user selects the target folder, the operator's gender, the total number of place-name samples for this recording session, and which place-name sample is to be recorded. Once the settings are complete, the "Sample Status" panel shows the name of the sample to be recorded and the number of samples that already exist. The user can then work in the "System Operation" panel. Clicking the "Record" button starts a recording lasting 2 s. When the recording is finished, the plot in the middle shows the waveform of the recorded speech and the waveform after cutting, so that the user can check whether the cut waveform is correct. If it is correct, the next sample can be recorded. If it is wrong, the user clicks the "Delete" button to delete the sample just recorded and then records again. When all place-name samples have been recorded, the user clicks the "Build Speech Model" button to start modeling. When modeling is complete, the system automatically saves the data to the selected folder, where the speech recognition part can load it. Fig. 8 shows the interface in use: in the "System Settings" panel, after the file path has been selected, the gender is set to "male", the total number of place-name samples is 25, and the third place-name sample is about to be recorded. The "Sample Status" panel shows that the name of the third place-name sample is "Beijing" (北京) and that 1 sample already exists. Samples can be recorded in the "System Operation" panel, and the status panel displays the sample information as it changes.

The above is only a preferred embodiment of the invention, but the scope of protection of the invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art, within the scope disclosed by the invention and in accordance with its technical solution and inventive concept, falls within the scope of protection of the invention.

Claims

1. A method for implementing a place-name speech modeling system for goods sorting, characterized in that the method comprises the following steps:

1) preprocessing the speech signal, including pre-emphasis and framing;

2) performing endpoint detection and cutting on the speech signal preprocessed in step 1);

3) checking the result of the cutting in step 2) and, if the cut is correct, saving the signal in a specified folder;

4) once all speech signals in step 3) have been recorded, building a hidden Markov model of the speech signals and saving the resulting model data.

2. The method for implementing a place-name speech modeling system for goods sorting according to claim 1, characterized in that: in step 1), the pre-emphasis operation passes the signal through a high-pass filter so that its spectrum becomes flat and stays flat over the whole band from low to high frequencies, allowing the spectrum to be computed with the same signal-to-noise ratio throughout.

3. The method for implementing a place-name speech modeling system for goods sorting according to claim 1, characterized in that: the endpoint detection method used in step 2) is a double-threshold method based on short-time energy and short-time zero-crossing rate, with the following specific steps:

1. Before endpoint detection begins, two thresholds are set for each of the short-time energy and the zero-crossing rate: a low threshold with a small value, which is sensitive to changes in the signal and easily exceeded, and a high threshold with a larger value, which is exceeded only when the signal reaches the set strength;

2. The speech signal x(n) is divided into frames, each frame being denoted s_i(n), n = 1, 2, …, N, where n is the discrete time index of the speech signal, N is the frame length, and i is the frame index;

3. The short-time energy of each frame is calculated, giving the short-time frame energy of the speech signal:

E_i = \sum_{n=1}^{N} s_i^2(n)

where N is the frame length, i is the frame index, and s_i^2(n) is the square of the value of the n-th (1 ≤ n ≤ N) sample of the i-th frame;

4. The zero-crossing rate of each frame is calculated, giving the short-time zero-crossing rate of the speech signal:

Z_i = \frac{1}{2} \sum_{n=1}^{N} | sgn[s_i(n)] - sgn[s_i(n-1)] |

where:

sgn[x] = 1 for x ≥ 0, and sgn[x] = -1 for x < 0

and sgn[s_i(n)] is the sign of the value of the n-th (1 ≤ n ≤ N) sample of the i-th frame;

The endpoint detection then proceeds through four states: silence, transition, speech, and end. In the silence state, if the short-time energy or the zero-crossing rate exceeds its low threshold, the frame is marked as the starting point and the detector enters the transition state. In the transition state, if both the short-time energy and the zero-crossing rate fall back below their low thresholds, the detector returns to the silence state; if either of them exceeds its high threshold, the detector is considered to have entered the speech state. In the speech state, if both parameters drop below their low thresholds while the total duration counted so far is shorter than the minimum-duration threshold, the segment is treated as noise and the length of the speech signal is counted again.

4. The method for implementing a place-name speech modeling system for goods sorting according to claim 3, characterized in that: in step 2), a threshold length is set according to the syllables of the speech signal, together with a variable slience1 that counts the length of the signal that may belong to a speech segment; if slience1 is smaller than the threshold length, the loop continues and the length of the speech signal is recalculated; if it is larger than the threshold length, the preceding speech signal is discarded.

5. The method for implementing a place-name speech modeling system for goods sorting according to claim 1, characterized in that: the hardware of the place-name speech modeling system for goods sorting comprises a high-performance noise-canceling headset and a computer.

6. The method for implementing a place-name speech modeling system for goods sorting according to claim 1, characterized in that: the place-name speech modeling system for goods sorting is operated through a purpose-built operation interface.