CN106649263A - Multi-word expression extraction method and device

Info

Publication number: CN106649263A
Application number: CN201610990921.6A
Authority: CN
Inventors: 朱泽德; 曾新华; 郑守国; 孙熊伟; 翁士状
Original assignee: Hefei Technology Innovation Engineering Institute of CAS
Current assignee: Hefei Technology Innovation Engineering Institute of CAS
Priority date: 2016-11-10
Filing date: 2016-11-10
Publication date: 2017-05-10

Abstract

The present invention relates to a multi-word expression extraction method and device. The method comprises: preprocessing a document library to form a vocabulary set; calculating the mutual information of adjacent words across multiple documents; obtaining the jump information between successive values of the mutual information sequence; combining the mutual information with the jump information to form two-dimensional mutual information; and clustering the two-dimensional mutual information to screen out multi-word expressions and build a multi-word expression library. The invention avoids the need to set a threshold manually for one-dimensional mutual information, and with it the problem of adapting to different data. It is also not limited to two-word structures: multi-word expressions made up of several words can be obtained in a single pass, without step-by-step implementation. This effectively improves the utilization rate of multi-word expressions and the accuracy of multi-word expression library construction.

Description

A multi-word expression extraction method and device thereof

Technical field

The invention relates to the technical field of statistical machine translation and cross-language information retrieval, and in particular to a multi-word expression extraction method and device.

Background

A multi-word expression is a combination of multiple words that has grammatical, semantic, or pragmatic properties and a complete meaning. Recognizing multi-word expressions can improve the efficiency and accuracy of tasks such as word segmentation, part-of-speech tagging, and machine translation. In machine translation, correctly identifying multi-word expressions in the source language helps select a suitable translation and avoids the unnatural or even unintelligible target-language output that results from translating the words separately.

Methods for extracting multi-word expressions fall broadly into statistical methods and rule-based methods. Rule-based methods generally study one specific type, such as verb phrase structures, or are confined to a particular domain. Statistical methods can extract multi-word expressions independently of form, that is, they use statistical information to extract multi-word expressions of all structures and domains without distinction. However, existing statistical methods face several problems: one-dimensional mutual information requires a manually set threshold and adapts poorly to different data; the methods are limited to two-word structures and cannot obtain multi-word expressions of several words in a single pass, so they must be implemented step by step; and the accuracy of multi-word expression library construction is low.

Summary of the invention

The primary purpose of the present invention is to provide a multi-word expression extraction method that obtains multi-word expressions of several words in a single pass, without step-by-step implementation, effectively improving the utilization rate of multi-word expression extraction and the accuracy of multi-word expression library construction.

To achieve the above purpose, the present invention adopts the following technical solution: a multi-word expression extraction method comprising the following steps in order:

(1) Preprocess the document library with word segmentation and part-of-speech tagging to form source-language documents;

(2) Calculate the mutual information of adjacent words across multiple documents, and then calculate the jump information between successive values of the mutual information sequence;

(3) Combine the mutual information sequence and the jump information sequence into a two-dimensional mutual information set;

(4) Use a classifier to divide the two-dimensional mutual information set into inner points and outer points of multi-word expressions, and link multiple inner points to construct multi-word expressions.

Further, in step (1), all documents in the collected document library are preprocessed by Chinese word segmentation, part-of-speech tagging, named entity recognition, and part-of-speech selection to form a candidate word set with a specific order.

Further, step (2) comprises the following steps in order:

(a) Calculate the mutual information of all adjacent words across multiple documents;

(b) Calculate the jump information between successive values of the mutual information sequence.

Further, in step (3), two-dimensional mutual information (MI_i, f_i) is constructed from the points at corresponding positions in the mutual information sequence and the jump information sequence, and the resulting two-dimensional mutual information values form the two-dimensional mutual information set.

Further, in step (4), a classifier divides all points in the two-dimensional mutual information set into two classes, inner points and outer points of multi-word expressions, and adjacent words that contain inner points are linked to form multi-word expressions.

Further, in step (a), the mutual information of adjacent words across multiple documents is calculated to form a mutual information sequence MI, wherein the mutual information MI_i (0 ≤ i < len(MI) - α) of adjacent words x and y is calculated as follows:

$$MI_i = \log\left[\frac{M \times p(x, y)}{p(x) \times p(y)} \times \frac{N_{x,y}}{N}\right]$$

where x and y denote adjacent words; MI_i denotes the i-th mutual information value, formed by adjacent words x and y; len(MI) denotes the length of the mutual information sequence MI; α denotes a constant; M denotes the total number of words in all documents; p(x, y) denotes the number of co-occurrences of words x and y in all documents; p(x) denotes the number of occurrences of word x in all documents; p(y) denotes the number of occurrences of word y in all documents; N denotes the number of documents in the document set; and N_{x,y} denotes the number of documents in which x and y co-occur.

Further, in step (b), the jump information between successive values of the mutual information sequence is calculated to form a jump information sequence f, wherein the jump information f_i of adjacent mutual information is calculated as follows:

$$f_i = \left|\frac{1}{\alpha}\sum_{j=1}^{\alpha} MI_{i+j} - MI_i\right|, \quad 0 \le i < len(MI) - \alpha$$

where f_i denotes the jump information between the current mutual information value and the subsequent mutual information values in the mutual information sequence, and | | denotes the absolute value.

Further, α is 2.

Another purpose of the present invention is to provide a multi-word expression extraction device, comprising:

A candidate word acquisition device, which preprocesses all documents in the collected document library by Chinese word segmentation, part-of-speech tagging, named entity recognition, and part-of-speech selection to form a candidate word set with a specific order;

A mutual information and jump information acquisition device, which calculates the mutual information of adjacent candidate words across multiple documents and, from adjacent mutual information values, calculates the jump information between successive values of the mutual information sequence;

A two-dimensional mutual information acquisition device, which pairs the mutual information and the jump information at corresponding positions in the mutual information sequence and the jump information sequence to form two-dimensional mutual information;

A classification and screening device for multi-word expressions, which uses a classifier to classify all points in the two-dimensional mutual information set as inner points or outer points of multi-word expressions, and links adjacent words that have inner points to form multi-word expressions.

As the above technical solution shows, the present invention converts the mutual information between adjacent words into two-dimensional mutual information and clusters the two-dimensional mutual information to screen out multi-word expressions. This avoids the need to set a threshold manually for one-dimensional mutual information, and with it the problem of adapting to different data. The invention is also not limited to two-word structures: multi-word expressions of several words can be obtained in a single pass, without step-by-step implementation, which effectively improves the utilization rate of multi-word expressions and the accuracy of multi-word expression library construction.

Description of drawings

Fig. 1 is a schematic flow chart of the method of the present invention;

Fig. 2 is a structural block diagram of the device of the present invention.

Detailed description

A multi-word expression extraction method comprises the following steps in order: (1) preprocess the document library with word segmentation, part-of-speech tagging, and similar operations to form source-language documents; (2) calculate the mutual information of adjacent words across multiple documents, and then calculate the jump information between successive values of the mutual information sequence; (3) combine the mutual information sequence and the jump information sequence into a two-dimensional mutual information set; (4) use a classifier to divide the two-dimensional mutual information set into inner points and outer points of multi-word expressions, and select runs of consecutive inner points to link into multi-word expressions. This is shown in Fig. 1.

The present invention is further described below with reference to Fig. 1.

In step (1), all texts in the collected document library are preprocessed by Chinese word segmentation, part-of-speech tagging, named entity recognition, and part-of-speech selection to form a candidate word set with a specific order.
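The part-of-speech selection in step (1) can be illustrated with a minimal Python sketch. It assumes the text has already been segmented and tagged by some external tool, so each document arrives as a list of (word, tag) pairs. The tag set `KEPT_TAGS`, the function name `select_candidates`, and the sample tags are hypothetical; the patent does not say which parts of speech are retained, and named entity recognition is left out here.

```python
# Hypothetical tag set: the patent does not list which parts of speech are kept.
KEPT_TAGS = {"n", "v", "a"}


def select_candidates(tagged_doc, kept_tags=KEPT_TAGS):
    """Keep the words whose tag is in kept_tags, preserving document order."""
    return [word for word, tag in tagged_doc if tag in kept_tags]


# Illustrative input: (word, tag) pairs as a segmenter/tagger might produce them.
tagged = [("机器", "n"), ("翻译", "v"), ("的", "u"), ("质量", "n")]
candidates = select_candidates(tagged)
```

Preserving order matters because every later step works on adjacent candidate words.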

Step (2) comprises the following steps in order: (a) calculate the mutual information of all adjacent words across multiple documents; (b) calculate the jump information between successive values of the mutual information sequence.

In step (a), the mutual information of adjacent words across multiple documents is calculated to form a mutual information sequence MI, wherein the mutual information MI_i (0 ≤ i < len(MI) - α) of adjacent words x and y is calculated as follows:

$$MI_i = \log\left[\frac{M \times p(x, y)}{p(x) \times p(y)} \times \frac{N_{x,y}}{N}\right]$$

where x and y denote adjacent words; MI_i denotes the i-th mutual information value, formed by adjacent words x and y; len(MI) denotes the length of the mutual information sequence MI; α denotes a constant; M denotes the total number of words in all documents; p(x, y) denotes the number of co-occurrences of words x and y in all documents; p(x) denotes the number of occurrences of word x in all documents; p(y) denotes the number of occurrences of word y in all documents; N denotes the number of documents in the document set; N_{x,y} denotes the number of documents in which x and y co-occur; and the constant α is 2.
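Step (a) can be sketched in Python as follows. The sketch computes log[(M × p(x,y)) / (p(x) × p(y)) × N_{x,y} / N] for every adjacent pair, under assumptions the text does not fix: co-occurrence is taken to mean adjacency in the given order, the logarithm is natural, and one MI sequence is produced per document. The function name and data layout are illustrative only.

```python
import math
from collections import Counter


def mutual_information_sequences(docs):
    """docs: list of documents, each a list of candidate words in order.

    Returns one list of MI values per document, one value per adjacent pair.
    """
    word_count = Counter()  # p(x): occurrences of x in all documents
    pair_count = Counter()  # p(x, y): occurrences of the adjacent pair (x, y)
    pair_docs = Counter()   # N_{x,y}: documents that contain the adjacent pair
    for doc in docs:
        pairs = list(zip(doc, doc[1:]))
        word_count.update(doc)
        pair_count.update(pairs)
        pair_docs.update(set(pairs))

    M = sum(word_count.values())  # total number of words in all documents
    N = len(docs)                 # number of documents

    sequences = []
    for doc in docs:
        seq = []
        for x, y in zip(doc, doc[1:]):
            ratio = (M * pair_count[(x, y)]) / (word_count[x] * word_count[y])
            seq.append(math.log(ratio * pair_docs[(x, y)] / N))
        sequences.append(seq)
    return sequences


docs = [["a", "b", "c"], ["a", "b", "d"]]
mi = mutual_information_sequences(docs)
```

The document-frequency factor N_{x,y} / N lowers the score of pairs that are frequent in only a few documents.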

In step (b), the jump information between successive values of the mutual information sequence is calculated to form a jump information sequence f, wherein the jump information f_i of adjacent mutual information is calculated as follows:

$$f_i = \left|\frac{1}{\alpha}\sum_{j=1}^{\alpha} MI_{i+j} - MI_i\right|, \quad 0 \le i < len(MI) - \alpha$$
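A minimal Python sketch of step (b), reading the jump information f_i as the absolute difference between the mean of the next α mutual information values and the current value MI_i, for 0 ≤ i < len(MI) - α. The function name `jump_sequence` is an assumption for illustration; the default α = 2 follows the value stated in the text.

```python
def jump_sequence(mi, alpha=2):
    """Return f_i = |mean(MI_{i+1} .. MI_{i+alpha}) - MI_i| for each valid i."""
    return [
        abs(sum(mi[i + 1:i + 1 + alpha]) / alpha - mi[i])
        for i in range(len(mi) - alpha)
    ]


f = jump_sequence([1.0, 2.0, 4.0, 3.0, 5.0])
```

The result has α fewer entries than the input, because the last α positions have no full look-ahead window.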

In step (3), two-dimensional mutual information (MI_i, f_i) is constructed from the points at corresponding positions in the mutual information sequence and the jump information sequence, and the resulting two-dimensional mutual information values form the two-dimensional mutual information set.
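Step (3) amounts to pairing values by position, which can be sketched in a few lines of Python. The function name `two_dimensional_mi` is illustrative. The sketch assumes the jump sequence is shorter than the mutual information sequence by α entries, so the trailing mutual information values without a jump value are dropped.

```python
def two_dimensional_mi(mi, f):
    """Pair MI_i with f_i at matching positions; zip stops at the shorter list."""
    return list(zip(mi, f))


points = two_dimensional_mi([1.0, 2.0, 4.0, 3.0, 5.0], [2.0, 1.5, 0.0])
```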

In step (4), a classifier divides all points in the two-dimensional mutual information set into two classes, inner points and outer points of multi-word expressions, and adjacent words that contain inner points are linked to form multi-word expressions.
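The text does not name a specific classifier, so the following Python sketch substitutes a tiny two-cluster k-means loop as a stand-in. It further assumes that the cluster whose centre has the higher mutual information holds the inner points, and that point i describes the adjacent pair (w_i, w_{i+1}) within a single document. All function names are hypothetical.

```python
def _dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def classify_inner(points, iters=20):
    """Return one boolean per (MI_i, f_i) point; True marks an assumed inner point."""
    if not points:
        return []
    centres = [min(points), max(points)]  # seeds: lowest-MI and highest-MI points
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [0 if _dist2(p, centres[0]) <= _dist2(p, centres[1]) else 1
                  for p in points]
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centres[k] = (sum(m[0] for m in members) / len(members),
                              sum(m[1] for m in members) / len(members))
    # Assumption: the cluster with the higher mean MI contains the inner points.
    inner = 1 if centres[1][0] > centres[0][0] else 0
    return [lab == inner for lab in labels]


def link_inner(words, inner_flags):
    """Link runs of consecutive inner points into word sequences.

    inner_flags[i] refers to the adjacent pair (words[i], words[i + 1]).
    """
    expressions, start = [], None
    for i, flag in enumerate(inner_flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            expressions.append(words[start:i + 1])
            start = None
    if start is not None:
        expressions.append(words[start:len(inner_flags) + 1])
    return expressions


flags = classify_inner([(0.1, 0.2), (0.2, 0.1), (3.0, 0.3), (3.2, 0.2)])
expressions = link_inner(["a", "b", "c", "d", "e"], flags)
```

Because a run of k consecutive inner points spans k + 1 words, expressions of any length come out of one pass, which matches the single-pass claim in the text.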

As shown in Fig. 2, the device of the present invention comprises: a candidate word acquisition device, which preprocesses all texts in the collected document library by Chinese word segmentation, part-of-speech tagging, named entity recognition, part-of-speech selection, and similar operations to form a candidate word set with a specific order; a mutual information and jump information acquisition device, which calculates the mutual information of adjacent candidate words across multiple documents and, from adjacent mutual information values, calculates the jump information between successive values of the mutual information sequence; a two-dimensional mutual information acquisition device, which pairs the mutual information and the jump information at corresponding positions in the mutual information sequence and the jump information sequence to form two-dimensional mutual information; and a classification and screening device for multi-word expressions, which uses a classifier to classify all points in the two-dimensional mutual information set as inner points or outer points of multi-word expressions, and links adjacent words that have inner points to form multi-word expressions.

In summary, the present invention converts the mutual information between adjacent words into two-dimensional mutual information and clusters the two-dimensional mutual information to screen out multi-word expressions. This avoids the need to set a threshold manually for one-dimensional mutual information, and with it the problem of adapting to different data. The invention is also not limited to two-word structures: multi-word expressions of several words can be obtained in a single pass, without step-by-step implementation, which effectively improves the utilization rate of multi-word expressions and the accuracy of multi-word expression library construction.

Claims

1. A multi-word expression extraction method, characterized in that the method comprises the following steps in order:

(1) Preprocess the document library with word segmentation and part-of-speech tagging to form source-language documents;

(2) Calculate the mutual information of adjacent words across multiple documents, and then calculate the jump information between successive values of the mutual information sequence;

(3) Combine the mutual information sequence and the jump information sequence into a two-dimensional mutual information set;

(4) Use a classifier to divide the two-dimensional mutual information set into inner points and outer points of multi-word expressions, and link multiple inner points to construct multi-word expressions.

2. The method according to claim 1, characterized in that: in step (1), all documents in the collected document library are preprocessed by Chinese word segmentation, part-of-speech tagging, named entity recognition, and part-of-speech selection to form a candidate word set with a specific order.

3. The method according to claim 1, characterized in that: step (2) comprises the following steps in order:

(a) Calculate the mutual information of all adjacent words across multiple documents;

(b) Calculate the jump information between successive values of the mutual information sequence.

4. The method according to claim 1, characterized in that: in step (3), two-dimensional mutual information (MI_i, f_i) is constructed from the points at corresponding positions in the mutual information sequence and the jump information sequence, and the resulting two-dimensional mutual information values form the two-dimensional mutual information set.

5. The method according to claim 1, characterized in that: in step (4), a classifier divides all points in the two-dimensional mutual information set into two classes, inner points and outer points of multi-word expressions, and adjacent words that contain inner points are linked to form multi-word expressions.

6. The method according to claim 3, characterized in that: in step (a), the mutual information of adjacent words across multiple documents is calculated to form a mutual information sequence MI, wherein the mutual information MI_i (0 ≤ i < len(MI) - α) of adjacent words x and y is calculated as follows:

$$MI_i = \log\left[\frac{M \times p(x, y)}{p(x) \times p(y)} \times \frac{N_{x,y}}{N}\right],$$

where x and y denote adjacent words; MI_i denotes the i-th mutual information value, formed by adjacent words x and y; len(MI) denotes the length of the mutual information sequence MI; α denotes a constant; M denotes the total number of words in all documents; p(x, y) denotes the number of co-occurrences of words x and y in all documents; p(x) denotes the number of occurrences of word x in all documents; p(y) denotes the number of occurrences of word y in all documents; N denotes the number of documents in the document set; and N_{x,y} denotes the number of documents in which x and y co-occur.

7. The method according to claim 3, characterized in that: in step (b), the jump information between successive values of the mutual information sequence is calculated to form a jump information sequence f, wherein the jump information f_i of adjacent mutual information is calculated as follows:

$$f_i = \left|\frac{1}{\alpha}\sum_{j=1}^{\alpha} MI_{i+j} - MI_i\right|, \quad 0 \le i < len(MI) - \alpha$$

where f_i denotes the jump information between the current mutual information value and the subsequent mutual information values in the mutual information sequence, and | | denotes the absolute value.

8. The method according to claim 6, characterized in that: said α is 2.

9. A multi-word expression extraction device, comprising:

A candidate word acquisition device, which preprocesses all documents in the collected document library by Chinese word segmentation, part-of-speech tagging, named entity recognition, part-of-speech selection, and similar operations to form a candidate word set with a specific order;

A mutual information and jump information acquisition device, which calculates the mutual information of adjacent candidate words across multiple documents and, from adjacent mutual information values, calculates the jump information between successive values of the mutual information sequence;

A two-dimensional mutual information acquisition device, which pairs the mutual information and the jump information at corresponding positions in the mutual information sequence and the jump information sequence to form two-dimensional mutual information;

A classification and screening device for multi-word expressions, which uses a classifier to classify all points in the two-dimensional mutual information set as inner points or outer points of multi-word expressions, and links adjacent words that have inner points to form multi-word expressions.