CN1570923A

CN1570923A - Sentence boundary identification method in spoken language dialogue

Info

Publication number: CN1570923A
Application number: CN 03147553
Authority: CN
Inventors: 宗成庆; 刘丁
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Institute of Automation of Chinese Academy of Science
Priority date: 2003-07-22
Filing date: 2003-07-22
Publication date: 2005-01-26
Anticipated expiration: 2023-07-22
Also published as: CN1271550C

Abstract

A sentence boundary segmentation method based on a bidirectional N-gram model and a Maximum Entrpy model includes two processes of training and segmentation. The training process includes the steps of: obtaining a spoken language corpus; performing pretreatments such as substitution on the spoken language corpus; counting n -N-element co-occurrence frequency of the gram model; estimate n-element forward dependency probability and n-element reverse dependency probability; obtain n-element positive and reverse dependency probability database; set the characteristic function of the Maximum Entropy model; cyclically calculate the characteristic function parameters; obtain A database of characteristic function parameters. The sentence boundary segmentation method based on the bidirectional n-gram model and the Maximum Entropy model is a pure statistical method, and its implementation only requires a background spoken language corpus, and the corpus does not need any deep segmentation or labeling. This method is not limited by language, and can be applied to sentence boundary segmentation in any language by changing the training corpus.

Description

Sentence Boundary Recognition Method in Spoken Conversation

技术领域technical field

本发明涉及语音识别，特别涉及口语句子的边界识别方法。The invention relates to speech recognition, in particular to a boundary recognition method for spoken sentences.

背景技术Background technique

随着计算机硬件条件的飞速发展和语音识别技术的不断提高，以语音为接口的语言理解以及生成系统(以下简称为语音语言联合系统)如人机接口、人机对话系统、同声翻译系统等开始走向实用化。这些系统有着广阔的应用前景。比如人机语音界面，它的完善将使人们不再为学习繁琐的计算机操作而苦恼，因为任何事情你只需“说”给计算机听，它便会按你的要求执行。再如同声翻译技术，它将消除不同种语言使用者之间的交流障碍，这将极大地方便人们跨国旅行，以及在大型国际盛会(奥运会，亚运会等)中让来自不同国度的成员进行方便自如地交流成为可能。在军事上，语音语言联合系统也有着重要的应用。美国已着手研发士兵用同声翻译机，以方便其在异域作战时向当地居民了解情况。另外电话窃听一直是获取军事情报的有效手段，而从大量语音信息中虑取有效信息一直以来完全依赖人工，如果实现机器的自动虑取，将极大地提高效率和节省人力。With the rapid development of computer hardware conditions and the continuous improvement of speech recognition technology, language understanding and generation systems using speech as the interface (hereinafter referred to as speech-language combined systems) such as man-machine interface, man-machine dialogue system, simultaneous translation system, etc. Started to be practical. These systems have broad application prospects. For example, the man-machine voice interface, its perfection will make people no longer worry about learning cumbersome computer operations, because you only need to "speak" to the computer to listen to anything, and it will execute according to your requirements. Another example is simultaneous translation technology, which will eliminate communication barriers between users of different languages, which will greatly facilitate people's cross-border travel, and allow members from different countries to communicate in large international events (Olympic Games, Asian Games, etc.) Free communication becomes possible. In the military, the voice-language joint system also has important applications. The United States has begun to develop simultaneous translators for soldiers to facilitate their understanding of the situation from local residents when fighting in foreign lands. In addition, telephone eavesdropping has always been an effective means of obtaining military intelligence, and obtaining effective information from a large amount of voice information has always been completely dependent on manual work. If the machine can automatically obtain information, it will greatly improve efficiency and save manpower.

从图1可以看出，语音语言联合系统一般由三个模块组成：语音识别模块、句子边界切分模块和语言分析与生成模块。由于语音识别的结果是没有任何标点的连续文本，要进行下一步的分析、转换和生成处理必须先断句，也就是把连续文本分割成一个个的句子，而句子边界切分模块正是行使这样一种功能，它处于语音识别模块和语言分析生成模块的中间，是连接它们的一道桥梁。语音识别技术和语言分析与生成技术一直是计算机科学领域的研究热点，而句子边界切分技术在语音语言联合系统的初步实用化以前并没有受到广泛关注。如今随着语音语言联合系统的应用的不断扩展，句子边界切分技术作为支撑这种联合应用的核心技术之一，日益受到重视。As can be seen from Figure 1, the speech-language joint system generally consists of three modules: a speech recognition module, a sentence boundary segmentation module, and a language analysis and generation module. Since the result of speech recognition is a continuous text without any punctuation, the next step of analysis, conversion and generation must be segmented, that is, the continuous text is divided into sentences, and the sentence boundary segmentation module is used in this way A function, it is in the middle of the speech recognition module and the language analysis and generation module, and it is a bridge connecting them. Speech recognition technology and language analysis and generation technology have always been research hotspots in the field of computer science, but sentence boundary segmentation technology has not received widespread attention before the initial practical application of speech-language joint systems. Nowadays, with the continuous expansion of the application of speech-language joint system, the sentence boundary segmentation technology, as one of the core technologies supporting this joint application, has been paid more and more attention.

发明内容Contents of the invention

本发明的目的是提供一种口语会话中句子边界识别方法，其解决了将语音识别后的连续文本转化为后续分析模块可处理的句子的问题。The purpose of the present invention is to provide a sentence boundary recognition method in oral conversation, which solves the problem of converting continuous text after speech recognition into sentences that can be processed by subsequent analysis modules.

为实现上述目的，一种口语会话中句子边界识别方法，包括训练和切分两个过程，所述的训练过程包括步骤：In order to achieve the above object, a sentence boundary recognition method in a spoken conversation includes two processes of training and segmentation, and the training process includes steps:

获得口语语料库；Obtain a spoken language corpus;

对口语语料库进行替代等预处理；Perform preprocessing such as substitution on the spoken language corpus;

统计n-gram模型的n元同现频率；Statistical n-gram co-occurrence frequency of n-gram model;

估计n元正向依存概率和n元逆向依存概率；Estimate the probability of n-ary forward dependence and the probability of n-ary reverse dependence;

获得n元正、逆向依存概率数据库；Obtain the n-element positive and negative dependency probability database;

设定Maximum Entropy模型的特征函数；Set the characteristic function of the Maximum Entropy model;

循环计算特征函数参数；Cyclic calculation of characteristic function parameters;

获得特征函数参数数据库。Obtain a database of characteristic function parameters.

口语会话中句子边界识别方法属于纯统计方法，其实施只需要一个后台口语语料库，语料库不需要进行任何深层地切分或者标注等处理。该方法不受语言的限制，通过更换训练语料库，可以运用于任何一种语言的句子边界切分。The sentence boundary recognition method in oral conversation is a purely statistical method, and its implementation only requires a background spoken language corpus, and the corpus does not need any deep segmentation or labeling. This method is not limited by language, and can be applied to sentence boundary segmentation in any language by changing the training corpus.

附图说明Description of drawings

图1是语音语音应用系统的一般模式。Figure 1 is a general schema of a speech speech application system.

具体实施方式Detailed ways

下面结合附图详细说明本发明技术方案中所涉及的各个细节问题。Various details involved in the technical solution of the present invention will be described in detail below in conjunction with the accompanying drawings.

口语语料的预处理Preprocessing of spoken corpus

获取的口语语料不能直接拿来训练，必须经过一些预处理。句子边界切分，就是在连续文本中寻找句子的结束点，也即预测那些句末标点的出现位置，因而只要是句末标点，对切分而言都没有区别。预处理的主要工作就是将语料中的句末标点用统一的符号代替，为方便叙述，本文中的替代符号用“SB”表示；而对于非句末标点的其他标点，则要删除，因为语音识别出的文本中是不可能含有这样的标点符号的。对于中文而言，这项工作很容易，直接将句号、问号、感叹号等句末标点替换成同一的符号，再将逗号、冒号、引号等非句末标点删除即可。但有些语言的标点具有歧义，比英文的句号“.”，它也用来表示缩写形式，比如“Mr.”，“Dr.”等，这时我们必须先将这些缩写形式替换成不含“.”的形式，然后再用统一的符号替换掉“.”。The acquired oral corpus cannot be directly used for training, but must undergo some preprocessing. Sentence boundary segmentation is to find the end point of the sentence in the continuous text, that is, to predict the occurrence position of those sentence-end punctuation, so as long as it is the end-sentence punctuation, there is no difference for segmentation. The main task of preprocessing is to replace the end-of-sentence punctuation in the corpus with a unified symbol. For the convenience of description, the replacement symbol in this article is represented by "SB"; for other punctuation other than the end-of-sentence punctuation, it must be deleted, because the phonetic It is impossible for the recognized text to contain such punctuation marks. For Chinese, this work is very easy, just replace the end-of-sentence punctuation such as period, question mark, and exclamation mark with the same symbol, and then delete the non-end-of-sentence punctuation such as comma, colon, and quotation marks. However, the punctuation in some languages is ambiguous, such as the period "." in English, which is also used to indicate abbreviations, such as "Mr.", "Dr.", etc. At this time, we must first replace these abbreviations with " ".", and then replace "." with a unified symbol.

N元同现频率的统计和N元依存概率估计Statistics of N-gram Co-occurrence Frequency and Estimation of N-gram Dependency Probability

N元同现概率的统计建立在经过预处理的口语语料库的基础上。首先我们要统计出一个基元词表，对于中文而言，这个词表就是语料库中出现的所有字和“SB”，对于英文，这个词表包括所有语料库中出现的单词和缩写等的替换形式以及“SB”。在根据语料库统计出的N元组频率的基础上，我们用Modified Kneser-Ney Smoothing算法对词典中所有条目的N元组合的依存概率进行估计。Modified Kneser-Ney Smoothing对不同出现次数的N元组给予不同程度的消减来补偿那些出现次数为零的N元组，这种平滑方法经Stanley F.Chen等人的评测，性能超过其他平滑方法。The statistics of N-gram co-occurrence probability are based on the preprocessed spoken language corpus. First of all, we need to count a primitive vocabulary. For Chinese, this vocabulary is all the words and "SB" that appear in the corpus. For English, this vocabulary includes all the words and abbreviations that appear in the corpus. and "SB". On the basis of the N-tuple frequency calculated from the corpus, we use the Modified Kneser-Ney Smoothing algorithm to estimate the dependency probability of the N-tuple combinations of all entries in the dictionary. Modified Kneser-Ney Smoothing gives different degrees of reduction to N-tuples with different occurrences to compensate those N-tuples with zero occurrences. This smoothing method has been evaluated by Stanley F. Chen et al., and its performance exceeds other smoothing methods.

Maximum Entropy模型特征函数设定及参数训练Maximum Entropy model feature function setting and parameter training

Maximum Entropy模型是用来估算联合概率的一种统计模型，其中心思想是在满足训练语料约束的情况下使联合事件的熵，也就是不确定性达到最大。在自然语言处理中，联合概率一般表示为：P(b，c)，b表示可能的情况，c表示所在的上下文。在本文所描述的句子边界切分方法中，b被设定为一个布尔型变量，其为真表示判断位置为句子边界，为假则表示判断位置不是句子边界。相应的特征函数成组出现，如下所示：The Maximum Entropy model is a statistical model used to estimate the joint probability. Its central idea is to maximize the entropy of joint events, that is, the uncertainty, under the condition of satisfying the constraints of the training corpus. In natural language processing, the joint probability is generally expressed as: P(b, c), b represents the possible situation, and c represents the context. In the sentence boundary segmentation method described in this paper, b is set as a Boolean variable, which is true to indicate that the judgment position is a sentence boundary, and false to indicate that the judgment position is not a sentence boundary. The corresponding eigenfunctions appear in groups, as follows:

从上面的公式可以看出，每一组特征函数和一个S_j对应，S_j表示某一长度的字组(中文)或词组(英文)，在本方法中S_j设定为训练语料库中出现的所有三元、二元和一元组。公式里prefix(c)、suffix(c)分别表示判断位置的所有前缀和后缀的集合，举例而言，比如句子“请<1>明<2>天<3>再<4>来<5>”，对于位置<3>，其前缀的集合为{天，明天，请明天}，后缀的集合为{再，再来}；include(prefix(c)，S_j)表示S_j属于prefix(c)。每一个特征函数都有一个权值与之对应，权值表明了特征函数所属的特征对结果影响程度的大小。在本方法中，权值也成组出现，表示为α_j10，α_j11，α_j20，α_j21，这些权值通过Generalized Iterative Scaling算法计算得到，并储存于最大熵参数数据库中。对于某一上下文环境下，某种情况出现的概率计算如下：It can be seen from the above formula that each set of feature functions corresponds to a S _j , and S _j represents a word group (Chinese) or phrase (English) of a certain length. In this method, S _j is set to appear in the training corpus All triples, binary and unary of . The prefix(c) and suffix(c) in the formula respectively represent the set of all prefixes and suffixes for judging the position, for example, such as the sentence "Please <1> tomorrow <2> day <3> and then <4> come to <5>", for position <3>, the prefix set is {day, tomorrow, please tomorrow}, and the suffix set is {again, come again}; include(prefix(c), S _j ) indicates that S _j belongs to prefix(c) . Each feature function has a weight corresponding to it, and the weight indicates the degree of influence of the feature to which the feature function belongs to the result. In this method, the weights also appear in groups, expressed as α _j10 , α _j11 , α _j20 , α _j21 , and these weights are calculated by the Generalized Iterative Scaling algorithm and stored in the maximum entropy parameter database. For a certain context, the probability of a certain situation is calculated as follows:

$P P ((c c,, b b)) = = {πΠ πΠ}_{j j = = 11}^{k k} (({α α}_{j j 1010}^{{f f}_{j j 1010} ((b b,, c c))} \times \times {α α}_{j j 1111}^{{f f}_{j j 1111} ((b b,, c c))} \times \times {α α}_{j j 2020}^{{f f}_{j j 2020} ((b b,, c c))} \times \times {α α}_{j j 21 twenty one}^{{f f}_{j j 21 twenty one} ((b b,, c c))}))$

k为所设的特征函数的组数，π为归一化变量，在本例中其值为：k is the group number of the set characteristic function, π is the normalized variable, in this example its value is:

π＝P(c，0)+P(c，1) π = P(c, 0) + P(c, 1)

特别地，有时我们只想考虑左边的上下文或者右边上下文和某种情况出现的联合概率，这时计算公式分别为：In particular, sometimes we only want to consider the joint probability of the left context or the right context and a certain situation. At this time, the calculation formulas are:

$P P ((c c__left left,, b b)) = = {πΠ πΠ}_{j j = = 11}^{k k} (({α α}_{j j 1010}^{{f f}_{j j 1010} ((b b,, c c))} \times \times {α α}_{j j 1111}^{{f f}_{j j 1111} ((b b,, c c))}))$

$P P ((c c__right right,, b b)) = = {πΠ πΠ}_{j j = = 11}^{k k} (({α α}_{j j 2020}^{{f f}_{j j 2020} ((b b,, c c))} \times \times {α α}_{j j 21 twenty one}^{{f f}_{j j 21 twenty one} ((b b,, c c))}))$

基于双向n-gram模型和Maximum Entropy模型的句子边界切分方法Sentence boundary segmentation method based on bidirectional n-gram model and Maximum Entropy model

对于给定的连续文本“W₁<1>W₂<2>...<n-1>W_n”，其中W_i(1 i n-1)表示基元，句子边界切分就是判断所示的n-1个位置是否为句子边界。用P_is(i)表示位置i是句子边界的概率，用P_no(i)表示位置i不是句子边界的概率，那么位置i被判定是一个句子边界当且仅当P_is(i)＞P_no(i)。For a given continuous text "W ₁ <1>W ₂ <2>...<n-1>W _n ", where W _i (1 i n-1) represents a primitive, sentence boundary segmentation is the judgment Whether the n-1 positions shown are sentence boundaries. Use P _is (i) to represent the probability that position i is a sentence boundary, and use P _no (i) to represent the probability that position i is not a sentence boundary, then position i is judged to be a sentence boundary if and only if P _is (i)>P _no (i).

在本方法中，P_is(i)和P_no(i)分别由四部分组成：正向n-gram概率、逆向n-gram概率、最大熵正向修正概率和最大熵逆向修正概率。用公式描述如下：In this method, P _is (i) and P _no (i) are composed of four parts: forward n-gram probability, reverse n-gram probability, maximum entropy forward correction probability and maximum entropy reverse correction probability. It is described by the formula as follows:

P_is(i)＝W_{n_is}(C_i)×P_is(i|NN)×W_{r_is}(C_i)×P_is(i|RN)P _is (i)＝W _{n_is} (C _i )×P _is (i|NN)×W _{r_is} (C _i )×P _is (i|RN)

P_no(i)＝W_{n_no}(C_i)×P_no(i|NN)×W_{r_no}(C_i)×P_no(i|RN)P _no (i)＝W _{n_no} (C _i )×P _no (i|NN)×W _{r_no} (C _i )×P _no (i|RN)

其中P_is(i/NN)、P_no(i/NN)和P_is(i/RN)、P_no(i/RN)分别表示正、逆向n-gram概率，W_{n_is}(C_i)、W_{n_no}(C_i)和W_{r_is}(C_i)，W_{r_no}(C_i)分别表示对正、逆向n-gram概率的加权值，下面我们分别描述上述各项的计算方法。Among them, P _is (i/NN), P _no (i/NN) and P _is (i/RN), P _no (i/RN) represent positive and reverse n-gram probabilities respectively, W _{n_is} (C _i ), W _{n_no} (C _i ) and W _{r_is} (C _i ), W _{r_no} (C _i ) represent the weighted values of the forward and reverse n-gram probabilities respectively. Below we describe the calculation methods of the above items respectively.

正向n-gram切分概率Forward n-gram segmentation probability

正向n-gram模型将文本视为从左至右的马尔可夫序列。我们用W₁W₂...W_m(m为自然数)来表示一个输入文字序列，W_i(1≤i≤m)表示基元，根据马尔可夫特性，某一基元出现的概率只和它左边n-1个基元相关，也就是P(W_m|W₁W₂...W_m-1)＝P(W_m|W_m-n+1...W_m-1)。由条件概率公式，文字序列出现的概率可写为：P(W₁W₂...W_m)＝P(W₁W₂...W_m-1)×P(W_m|W₁W₂...W_m-1)，综合起来我们得到：Forward n-gram models treat text as a left-to-right Markov sequence. We use W ₁ W ₂ ...W _m (m is a natural number) to represent an input text sequence, W _i (1≤i≤m) represents a primitive, and according to the Markov property, the probability of a certain primitive appearing is only It is related to the n-1 primitives on its left, that is, P(W _m |W ₁ W ₂ ...W _m-1 )=P(W _m |W _m-n+1 ...W _m-1 ) . According to the conditional probability formula, the probability of word sequence occurrence can be written as: P(W ₁ W ₂ ...W _m )=P(W ₁ W ₂ ...W _m-1 )×P(W _m |W ₁ W ₂ ...W _m-1 ), combined we get:

P(W₁W₂...W_m)＝P(W₁W₂...W_m-1)×P(W_m|W_m-n+1...W_m-1)P(W ₁ W ₂ ...W _m )＝P(W ₁ W ₂ ...W _m-1 )×P(W _m |W _m-n+1 ...W _m-1 )

将表示句子的边界的符号“SB”加入字符序列中，判断位置i是否为一个句子的边界，即计算P(W₁W₂...W_iSBW_i+1)(即P_is(i|NN))和P(W₁W₂...W_iW_i+1)(即P_no(i|NN))的大小。以3-gram模型为例，考虑到位置i-1有两种情况，一是其为句子边界，二是其不为句子边界，计算P(W₁W₂...W_iSBW_i+1)和P(W₁W₂...W_iW_i+1)的迭带公式为：Add the symbol "SB" representing the boundary of a sentence into the character sequence, and judge whether the position i is a boundary of a sentence, that is, calculate P(W ₁ W ₂ ...W _i SBW _i+1 ) (that is, P _is (i| NN)) and P(W ₁ W ₂ ... W _i W _i+1 ) (that is, the size of P _no (i|NN)). Taking the 3-gram model as an example, considering that there are two situations for position i-1, one is that it is a sentence boundary, and the other is that it is not a sentence boundary, calculate P(W ₁ W ₂ ...W _i SBW _i+1 ) and P(W ₁ W ₂ ...W _i W _i+1 ) the overlapping band formula is:

P(W₁W₂...W_iSBW_i+1)＝P(W₁W₂...SBW_i×P(SB|SBW_i×P(W_i+1|W_iSB)+P(W₁W₂...W_i-1W_i)P(W ₁ W ₂ ...W _i SBW _i+1 )＝P(W ₁ W ₂ ...SBW _i ×P(SB|SBW _i ×P(W _i+1 |W _i SB)+P( W ₁ W ₂ ...W _i-1 W _i )

×P(SB|W_i-1W_i)×P(W_i+1|W_iSB)×P(SB|W _i-1 W _i )×P(W _i+1 |W _i SB)

P(W₁W₂...W_iW_i+1)＝P(W₁W₂...W_i-1SBW_i)×P(W_i+1|SBW_i)+P(W₁W₂...W_i-1W_i)P(W ₁ W ₂ ...W _i W _i+1 )＝P(W ₁ W ₂ ...W _i-1 SBW _i )×P(W _i+1 |SBW _i )+P(W ₁ W ₂ ... W _i-1 W _i )

×P(W_i+1|W_i-1W_i)×P(W _i+1 |W _i-1 W _i )

假设W₁左边的位置编号为0，那么迭带的初始值为：Assuming that the position number on the left of W ₁ is 0, then the initial value of the iterative band is:

P_is(0|NN)＝1P _is (0|NN)＝1

P_n0(0|NN)＝0P _n0 (0|NN)＝0

逆向n-gram切分概率Reverse n-gram segmentation probability

逆向n-gram模型和正向n-gram模型类似，只是它将字符序列W₁W₂...W_m看成一个从右到左的马尔可夫序列，也就是将他们出现的先后顺序看成是W_mW_m-1...W₁。同样，通过条件概率公式和马尔可夫特性我们得到：The reverse n-gram model is similar to the forward n-gram model, except that it regards the character sequence W ₁ W ₂ ... W _m as a Markov sequence from right to left, that is, the order in which they appear is regarded as are W _m W _m-1 . . . W ₁ . Similarly, through the conditional probability formula and the Markov property we get:

P(W_mW_m-1...W₁)＝P(W_mW_m-1...W₂)×P(W₁|W_nW_n-1...W₂)P(W _m W _m-1 ...W ₁ )＝P(W _m W _m-1 ...W ₂ )×P(W ₁ |W _n W _n-1 ...W ₂ )

W_i为一个句子的终结点，当且仅当P(W_mW_m-1...W_i+1SBW_i)＞P(W_mW_m-1...W_i+1W_i)。W _i is the end point of a sentence if and only if P(W _m W _m-1 ...W _i+1 SBW _i )＞P(W _m W _m-1 ...W _i+1 W _i ) .

同样，我们给出迭带计算P(W_mW_m-1...W_i+1SBW_i)(即P_is(i+1|RN))和P(W_mW_m-1...W_i+1W_i)即(P_no(i+1|RN))的公式(3-gram)：Similarly, we give iterative calculations P(W _m W _m-1 ...W _i+1 SBW _i ) (ie P _is (i+1|RN)) and P(W _m W _m-1 ... W _i+1 W _i ) is the formula (3-gram) of (P _no (i+1|RN)):

P(W_mW_m-1...W_i+1SBW_i)＝P(W_mW_m-1...W_i+2SBW_i+1)×P(SB|SBW_i+1)×P(W_i|W_i+1SB)+P(W_mW_m-1...W_i+2W_i+1)P(W _m W _m-1 ...W _i+1 SBW _i )＝P(W _m W _m-1 ...W _i+2 SBW _i+1 )×P(SB|SBW _i+1 )× P(W _i |W _i+1 SB)+P(W _m W _m-1 ...W _i+2 W _i+1 )

×P(SB|W_i+2W_i+1)×P(W_i|W_i+1SB) 以×P(SB|W _i+2 W _i+1 )×P(W _i |W _i+1 SB) with

P(W_mW_m-1...W_i+1W_i)＝P(W_mW_m-1...W_i+2SBW_i+1)×P(W_i|SBW_i+1)+P(W_mW_m-1...W_i+2W_i+1)P(W _m W _m-1 ...W _i+1 W _i )＝P(W _m W _m-1 ...W _i+2 SBW _i+1 )×P(W _i |SBW _i+1 ) +P(W _m W _m-1 ...W _i+2 W _i+1 )

×P(W_i|W_i+2W_i+1)×P(W _i |W _i+2 W _i+1 )

及初始条件：P(SBW_m)＝1，P(W_m)＝0。And initial conditions: P(SBW _m )=1, P(W _m )=0.

逆向n-gram模型从右向左迭带计算每个位置为句子边界的概率，这样做可以避免正向模型的一些错误，比如下面这个句子“小张病了一个星期”，如果采用正向切分，很可能输出如下的结果“小张病了SB一个星期”，因为从左往右搜索，“小张病了”就是一个完整的句子；而如果采用逆向切分，从右往左搜索，我们显然不会把“一个星期”认为是一个完整的句子，那么搜索继续向右，直到句子真正的边界。The reverse n-gram model stacks from right to left to calculate the probability that each position is a sentence boundary, which can avoid some errors in the forward model, such as the following sentence "Xiao Zhang was ill for a week", if the forward cut It is likely to output the following result "Xiao Zhang was sick with SB for a week", because searching from left to right, "Xiao Zhang is sick" is a complete sentence; and if reverse segmentation is used, searching from right to left, We obviously don't consider "a week" to be a complete sentence, so the search continues to the right until the true sentence boundary.

最大熵修正权值maximum entropy correction weight

通过上面的叙述，基于逆向n-gram切分对正向n-gram的有益补充，我们考虑将正、逆向n-gram概率加权综合起来，而权值的确定即依靠本方法所述的最大熵模型的参数。Through the above description, based on the beneficial supplement of the reverse n-gram segmentation to the forward n-gram, we consider combining the positive and reverse n-gram probability weights, and the determination of the weight depends on the maximum entropy described in this method The parameters of the model.

如上文所述，W_{n_is}(C_i)，W_{n_no}(C_i)，表示对正向n-gram概率的加权，其计算等同于P(c_left，1)和P(c_left，0)，如下所示：As mentioned above, W _{n_is} (C _i ), W _{n_no} (C _i ), represents the weighting of the forward n-gram probability, and its calculation is equivalent to P(c_left, 1) and P(c_left, 0), as follows Show:

${W W}_{n no__is is} (({C C}_{i i})) = = {πΠ πΠ}_{j j = = 11}^{k k} {α α}_{j j 1111}^{{f f}_{j j 1111} ((11,, {c c}_{i i}))}$

${W W}_{n no__no no} (({C C}_{i i})) = = {πΠ πΠ}_{j j = = 11}^{k k} {α α}_{j j 1010}^{{f f}_{j j 1010} ((00,, {c c}_{i i}))}$

W_{r_is}(C_i)，W_{r_no}(C_i)分别表示对正向n-gram概率的加权，其计算等同于P(c_right，1)和P(c_right，0)，如下所示：W _{r_is} (C _i ), W _{r_no} (C _i ) respectively represent the weighting of the forward n-gram probability, and their calculations are equivalent to P(c_right, 1) and P(c_right, 0), as follows:

${W W}_{r r__is is} (({C C}_{i i})) = = {πΠ πΠ}_{j j = = 11}^{k k} {α α}_{j j 21 twenty one}^{{f f}_{j j 21 twenty one} ((11,, {c c}_{i i}))}$

${W W}_{n no__no no} (({C C}_{i i})) = = {πΠ πΠ}_{j j = = 11}^{k k} {α α}_{j j 2020}^{{f f}_{j j 2020} ((00,, {c c}_{i i}))}$

为了验证本方法的切分性能，我们利用收集的汉语和英语口语语料进行了句子边界切分试验，并和参考文献中的语言模型(正向n-gram模型)进行了对比。训练语料和测试语料以及切分结果如下所示，需要说明的一点是，测试结果中的准确率为正确切分数目占总切分数目的比率，召回率为正确切分数目占原有句数的比率，而F-Score是综合衡量正确率和召回率的指标，其计算公式为：

In order to verify the segmentation performance of this method, we used the collected Chinese and English spoken corpora to conduct a sentence boundary segmentation experiment, and compared it with the language model (forward n-gram model) in the reference. The training corpus, test corpus, and segmentation results are shown below. It should be noted that the accuracy rate in the test results is the ratio of the number of correct segmentations to the total number of segmentations, and the recall rate is the ratio of the number of correct segmentations to the number of original sentences. Ratio, while F-Score is a comprehensive measure of accuracy and recall, and its calculation formula is:

表1.训练语料的详细情况语言大小句子数目平均句长汉语 4.02MB 148967 8字英语 4.49MB 149311 6词 Table 1. Details of the training corpus language size number of sentences average sentence length Chinese 4.02MB 148967 8 words English 4.49MB 149311 6 words

表2.测试语料的详细情况语言大小句子数目平均句长汉语 412KB 12032 10字英语 391KB 10518 7词 Table 2. Details of the test corpus language size number of sentences average sentence length Chinese 412KB 12032 10 characters English 391KB 10518 7 words

表3.汉语切分试验结果方法准确率召回率 F-Score 文献[1] 79.4％ 84.5％ 81.9％本方法 86.7％ 86.0％ 86.3％ Table 3. Results of Chinese segmentation test method Accuracy recall rate F-Score Literature[1] 79.4% 84.5% 81.9% This method 86.7% 86.0% 86.3%

表4.英语切分试验结果方法准确率召回率 F-Score 文献[1] 73.4％ 83.0％ 77.9％本方法 78.8％ 84.9％ 81.7％ Table 4. English Segmentation Test Results method Accuracy recall rate F-Score Literature[1] 73.4% 83.0% 77.9% This method 78.8% 84.9% 81.7%

从试验结果可以看出，我们提出的基于双向n-gram模型和maximumentropy模型的句子边界切分方法在性能上明显超过文献[1]中所用的单纯基于正向n-gram模型的方法，这是因为我们的方法在判断某一位置是否为句子边界时，综合考虑了正、逆向搜索对切分结果的影响，并通过最大熵参数对正、逆向概率进行合理地调整。It can be seen from the test results that the sentence boundary segmentation method based on the two-way n-gram model and the maximumentropy model we proposed is significantly better than the method based solely on the forward n-gram model used in the literature [1]. This is Because our method comprehensively considers the impact of forward and reverse searches on the segmentation results when judging whether a certain position is a sentence boundary, and reasonably adjusts the forward and reverse probabilities through the maximum entropy parameter.

Claims

1. a sentence boundary recognition method in spoken conversation, comprise training and two processes of segmentation, described training process comprises steps:

Obtain a spoken language corpus;

Perform preprocessing such as substitution on the spoken language corpus;

Statistical n-gram co-occurrence frequency of n-gram model;

Estimate the probability of n-ary forward dependence and the probability of n-ary reverse dependence;

Obtain the n-element positive and negative dependency probability database;

Set the characteristic function of the Maximum Entropy model;

Cyclic calculation of characteristic function parameters;

Obtain a database of characteristic function parameters.

2. by the described method of claim 1, it is characterized in that the algorithm of described cyclic calculation feature function parameter is Generalized Iterative Scaling algorithm.

3. by the described method of claim 1, it is characterized in that described segmentation process comprises steps:

Segment the text with a segmentation method based on the forward n-gram model;

Segment the text with a segmentation method based on the reverse n-gram model;

The context of the segmentation point is extracted, and the parameters of the characteristic function of the Maximum Entropy model are used to carry out weighted synthesis of the forward and reverse segmentation results.