
CN106649263A - Multi-word expression extraction method and device - Google Patents

Multi-word expression extraction method and device

Info

Publication number
CN106649263A
Authority
CN
China
Prior art keywords
mutual information
word
information
documents
jump
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610990921.6A
Other languages
Chinese (zh)
Inventor
朱泽德
曾新华
郑守国
孙熊伟
翁士状
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei Technology Innovation Engineering Institute of CAS
Original Assignee
Hefei Technology Innovation Engineering Institute of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei Technology Innovation Engineering Institute of CAS
Priority to CN201610990921.6A
Publication of CN106649263A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a multi-word expression extraction method and device. A document library is preprocessed into a vocabulary set; the mutual information of adjacent words across multiple documents is calculated; the jump information between successive values of the mutual information sequence is obtained; each mutual information value is paired with its jump information to form two-dimensional mutual information; and the two-dimensional mutual information is clustered to screen out multi-word expressions, from which a multi-word expression library is built. This avoids the manually set threshold that one-dimensional mutual information requires, along with its poor adaptability to different data. The method is also not limited to two-word structures: it obtains multi-word expressions made of several words in a single pass, without step-by-step processing, which improves the utilization of multi-word expressions and the accuracy of the resulting library.

Description

A multi-word expression extraction method and device

Technical field

The invention relates to the technical field of statistical machine translation and cross-language information retrieval, and in particular to a multi-word expression extraction method and device.

Background

A multi-word expression is a combination of several words that has grammatical, semantic, or pragmatic properties and a complete meaning. Recognizing multi-word expressions improves the efficiency and accuracy of tasks such as word segmentation, part-of-speech tagging, and machine translation. In machine translation, correctly identifying multi-word expressions in the source language helps select an appropriate translation and avoids the unnatural or even unintelligible target text that results from translating the words separately.

Extraction methods for multi-word expressions fall broadly into statistical methods and rule-based methods. Rule-based methods usually target one specific type, such as verb phrase structures, or are confined to one particular domain. Statistical methods can extract multi-word expressions independently of their form; that is, they use statistical information to extract expressions of any structure and domain without distinction. Existing statistical methods, however, have several problems: one-dimensional mutual information requires a manually set threshold and adapts poorly to different data; the methods are limited to two-word structures and cannot obtain expressions of more than two words in one pass, so extraction must proceed step by step; and the resulting multi-word expression library has low accuracy.

Summary of the invention

The primary object of the invention is to provide a multi-word expression extraction method that obtains multi-word expressions made of several words in a single pass, without step-by-step processing, effectively improving the utilization of extracted multi-word expressions and the accuracy of the multi-word expression library.

To achieve this object, the invention adopts the following technical solution: a multi-word expression extraction method comprising the following steps in order:

(1) the document library is preprocessed by word segmentation and part-of-speech tagging to form source-language documents;

(2) the mutual information of adjacent words across multiple documents is calculated, and the jump information between successive values of the mutual information sequence is then calculated;

(3) the mutual information sequence and the jump information sequence are combined into a two-dimensional mutual information set;

(4) a classifier divides the two-dimensional mutual information set into multi-word expression inliers and outliers, and runs of linked inliers are selected to build multi-word expressions.

Further, in step (1), all documents in the collected document library are preprocessed by Chinese word segmentation, part-of-speech tagging, named entity recognition, and part-of-speech selection to form a candidate word set with a specific order.

Further, step (2) comprises the following steps in order:

(a) calculate the mutual information of all adjacent words across multiple documents;

(b) calculate the jump information between successive values of the mutual information sequence.

Further, in step (3), two-dimensional mutual information (MI_i, f_i) is constructed from the values at corresponding positions of the mutual information sequence and the jump information sequence; the two-dimensional mutual information values together form the two-dimensional mutual information set.

Further, in step (4), a classifier divides all points in the two-dimensional mutual information set into two classes, multi-word expression inliers and outliers, and the adjacent words joined by inliers are linked to form multi-word expressions.

Further, in step (a), the mutual information of adjacent words across multiple documents is calculated to form the mutual information sequence MI, where the mutual information MI_i (0 ≤ i < len(MI) - α) of adjacent words x and y is calculated as follows:

$$MI_i = \log\left[\frac{M \times p(x, y)}{p(x) \times p(y)} \times \frac{N_{x,y}}{N}\right]$$

where x and y are adjacent words; MI_i is the i-th mutual information value, formed by adjacent words x and y; len(MI) is the length of the mutual information sequence MI; α is a constant; M is the total number of words in all documents; p(x, y) is the number of co-occurrences of x and y in all documents; p(x) is the number of occurrences of x in all documents; p(y) is the number of occurrences of y in all documents; N is the number of documents in the document set; and N_{x,y} is the number of documents in which x and y co-occur.
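The calculation in step (a) can be sketched in Python as follows. This is a minimal illustration, not the applicant's implementation: it takes the formula as MI_i = log[M × p(x, y) / (p(x) × p(y)) × N_{x,y} / N] (as written out in claim 6), assumes the natural logarithm because the text does not state a base, and assumes that adjacent pairs do not cross document boundaries. The function name `mutual_information_sequence` and the input layout (one ordered word list per document) are hypothetical.

```python
import math
from collections import Counter


def mutual_information_sequence(docs):
    """docs: list of documents, each an ordered list of candidate words.

    Returns (pairs, mi), where pairs[i] is the i-th adjacent word pair and
    mi[i] = log[M * p(x, y) / (p(x) * p(y)) * N_xy / N] for that pair.
    """
    word_count = Counter()  # p(x): occurrences of x over all documents
    pair_count = Counter()  # p(x, y): adjacent co-occurrences of (x, y)
    pair_docs = Counter()   # N_xy: documents containing the adjacent pair
    for doc in docs:
        word_count.update(doc)
        doc_pairs = list(zip(doc, doc[1:]))
        pair_count.update(doc_pairs)
        pair_docs.update(set(doc_pairs))  # count each document once per pair

    M = sum(word_count.values())  # total number of words in all documents
    N = len(docs)                 # number of documents

    pairs, mi = [], []
    for doc in docs:
        # Assumption: pairs are taken within a document, in reading order.
        for x, y in zip(doc, doc[1:]):
            ratio = M * pair_count[(x, y)] / (word_count[x] * word_count[y])
            pairs.append((x, y))
            mi.append(math.log(ratio * pair_docs[(x, y)] / N))
    return pairs, mi
```

For two toy documents `["machine", "translation", "system"]` and `["machine", "translation", "quality"]`, the pair that recurs in both documents scores log 3, while the pairs seen in only one document score log 1.5.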

Further, in step (b), the jump information between successive values of the mutual information sequence is calculated to form the jump information sequence f, where the jump information f_i of adjacent mutual information values is calculated as follows:

$$f_i = \left|\frac{1}{\alpha}\sum_{j=1}^{\alpha} MI_{i+j} - MI_i\right|, \quad 0 \le i < len(MI) - \alpha$$

where f_i is the jump information between the current mutual information value and the subsequent values in the mutual information sequence, and | | denotes the absolute value.
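Step (b) can be sketched in the same way. The sketch below reads the formula as f_i = |(1/α) Σ_{j=1..α} MI_{i+j} - MI_i| for 0 ≤ i < len(MI) - α (as written out in claim 7), that is, the absolute difference between the mean of the next α values and the current value. The function name `jump_sequence` is hypothetical; the default α = 2 follows the value the text gives.

```python
def jump_sequence(mi, alpha=2):
    """Return f, where f[i] = |mean(mi[i+1 .. i+alpha]) - mi[i]|.

    f is defined only for 0 <= i < len(mi) - alpha, so it is alpha
    entries shorter than mi (and empty when mi is too short).
    """
    if alpha < 1:
        raise ValueError("alpha must be a positive integer")
    jumps = []
    for i in range(len(mi) - alpha):
        following_mean = sum(mi[i + 1 : i + 1 + alpha]) / alpha
        jumps.append(abs(following_mean - mi[i]))
    return jumps
```

Because the last α positions have no jump value, any later step that pairs MI_i with f_i has to decide how to treat those trailing pairs; the text does not say, so that choice is left to the implementer.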

Further, α is 2.

Another object of the invention is to provide a multi-word expression extraction device, comprising:

a candidate word acquisition device, which preprocesses all documents in the collected document library by Chinese word segmentation, part-of-speech tagging, named entity recognition, and part-of-speech selection to form a candidate word set with a specific order;

a mutual information and jump information acquisition device, which calculates the mutual information of adjacent candidate words across multiple documents and, from adjacent mutual information values, calculates the jump information of the mutual information sequence;

a two-dimensional mutual information acquisition device, which pairs the mutual information and jump information at corresponding positions of the two sequences to form two-dimensional mutual information;

a classification and screening device, which uses a classifier to classify all points in the two-dimensional mutual information set as multi-word expression inliers or outliers and links the adjacent words joined by inliers to form multi-word expressions.

As the above technical solution shows, the invention converts the mutual information between adjacent words into two-dimensional mutual information and clusters it to screen out multi-word expressions. This avoids the manually set threshold that one-dimensional mutual information requires, along with its poor adaptability to different data. The method is also not limited to two-word structures: it obtains multi-word expressions made of several words in a single pass, without step-by-step processing, which improves the utilization of multi-word expressions and the accuracy of the multi-word expression library.

Description of the drawings

Fig. 1 is a flow chart of the method of the invention;

Fig. 2 is a structural block diagram of the device of the invention.

Detailed description

As shown in Fig. 1, a multi-word expression extraction method comprises the following steps in order: (1) the document library is preprocessed by word segmentation, part-of-speech tagging, and similar operations to form source-language documents; (2) the mutual information of adjacent words across multiple documents is calculated, and the jump information between successive values of the mutual information sequence is then calculated; (3) the mutual information sequence and the jump information sequence are combined into a two-dimensional mutual information set; (4) a classifier divides the two-dimensional mutual information set into multi-word expression inliers and outliers, and runs of consecutive inliers are selected and linked to build multi-word expressions.

The invention is further described below with reference to Fig. 1.

In step (1), all texts in the collected document library are preprocessed by Chinese word segmentation, part-of-speech tagging, named entity recognition, and part-of-speech selection to form a candidate word set with a specific order.
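The part-of-speech selection at the end of step (1) can be illustrated with a short Python sketch. It assumes that segmentation and tagging have already been done by an upstream tool, which the text does not name, and that each document arrives as an ordered list of (word, tag) tuples. The tag set `KEEP_POS` and the function name `candidate_words` are hypothetical; the text does not say which parts of speech are kept.

```python
# Hypothetical tag set: noun, verb, adjective. The source does not specify it.
KEEP_POS = {"n", "v", "a"}


def candidate_words(tagged_doc, keep_pos=KEEP_POS):
    """Keep the words whose tag is selected, preserving document order."""
    return [word for word, pos in tagged_doc if pos in keep_pos]
```

Preserving order matters here because the later steps score adjacent candidate words, so the output is a sequence rather than a set.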

Step (2) comprises the following steps in order: (a) calculate the mutual information of all adjacent words across multiple documents; (b) calculate the jump information between successive values of the mutual information sequence.

In step (a), the mutual information of adjacent words across multiple documents is calculated to form the mutual information sequence MI, where the mutual information MI_i (0 ≤ i < len(MI) - α) of adjacent words x and y is calculated as follows:

$$MI_i = \log\left[\frac{M \times p(x, y)}{p(x) \times p(y)} \times \frac{N_{x,y}}{N}\right]$$

where x and y are adjacent words; MI_i is the i-th mutual information value, formed by adjacent words x and y; len(MI) is the length of the mutual information sequence MI; α is a constant; M is the total number of words in all documents; p(x, y) is the number of co-occurrences of x and y in all documents; p(x) is the number of occurrences of x in all documents; p(y) is the number of occurrences of y in all documents; N is the number of documents in the document set; N_{x,y} is the number of documents in which x and y co-occur; and the constant α is 2.
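As a worked check of the mutual information formula (written out in claim 6), take hypothetical counts that are not from the source: M = 1000, p(x, y) = 8, p(x) = 10, p(y) = 16, N = 20, and N_{x,y} = 5. The text does not state the base of the logarithm; the natural logarithm is assumed here.

```latex
\begin{aligned}
MI_i &= \ln\left[\frac{M \times p(x,y)}{p(x) \times p(y)} \times \frac{N_{x,y}}{N}\right]
      = \ln\left[\frac{1000 \times 8}{10 \times 16} \times \frac{5}{20}\right] \\
     &= \ln\left[50 \times 0.25\right] = \ln 12.5 \approx 2.53
\end{aligned}
```

Without the document factor N_{x,y}/N the same counts would give ln 50 ≈ 3.91. Since N_{x,y}/N is at most 1, this factor can only lower the score, and it lowers it most for pairs that are concentrated in a few documents.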

In step (b), the jump information between successive values of the mutual information sequence is calculated to form the jump information sequence f, where the jump information f_i of adjacent mutual information values is calculated as follows:

$$f_i = \left|\frac{1}{\alpha}\sum_{j=1}^{\alpha} MI_{i+j} - MI_i\right|, \quad 0 \le i < len(MI) - \alpha$$

where f_i is the jump information between the current mutual information value and the subsequent values in the mutual information sequence, and | | denotes the absolute value.
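As a worked check of the jump formula (written out in claim 7), take a hypothetical four-value sequence MI = (1.0, 4.0, 5.0, 0.5) with α = 2. Since len(MI) - α = 2, only f_0 and f_1 are defined:

```latex
\begin{aligned}
f_0 &= \left|\tfrac{1}{2}\,(MI_1 + MI_2) - MI_0\right| = \left|\tfrac{1}{2}\,(4.0 + 5.0) - 1.0\right| = 3.5 \\
f_1 &= \left|\tfrac{1}{2}\,(MI_2 + MI_3) - MI_1\right| = \left|\tfrac{1}{2}\,(5.0 + 0.5) - 4.0\right| = 1.25
\end{aligned}
```

A large f_i marks a position where the mutual information changes sharply over the next α pairs; a small f_i marks a position inside a run of similar values.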

In step (3), two-dimensional mutual information (MI_i, f_i) is constructed from the values at corresponding positions of the mutual information sequence and the jump information sequence; the two-dimensional mutual information values together form the two-dimensional mutual information set.

In step (4), a classifier divides all points in the two-dimensional mutual information set into two classes, multi-word expression inliers and outliers, and the adjacent words joined by inliers are linked to form multi-word expressions.
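Steps (3) and (4) can be sketched together in Python. The text does not name the classifier, so a plain two-cluster k-means is swapped in here purely for illustration. Two further assumptions are made: the cluster with the higher mean MI is treated as the inlier class, and the trailing pairs that have no jump value are left out. All function names are hypothetical, and `words` is a single ordered word sequence in which `mi[i]` scores the pair (words[i], words[i+1]).

```python
def two_means(points, iterations=20):
    """Split 2D points into two clusters; returns (labels, centers)."""
    # Deterministic start: the points with the lowest and the highest MI.
    centers = [min(points), max(points)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest center (squared Euclidean distance).
        labels = [
            min((0, 1), key=lambda k: (p[0] - centers[k][0]) ** 2
                                      + (p[1] - centers[k][1]) ** 2)
            for p in points
        ]
        # Move each center to the mean of its members.
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return labels, centers


def link_inliers(words, inlier):
    """inlier[i] is True when the pair (words[i], words[i+1]) is an inlier.

    Consecutive inlier pairs are chained into one multi-word expression.
    """
    expressions, start = [], None
    for i, flag in enumerate(inlier):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            expressions.append(tuple(words[start:i + 1]))
            start = None
    if start is not None:
        expressions.append(tuple(words[start:len(inlier) + 1]))
    return expressions


def extract(words, mi, jumps):
    """Build (MI_i, f_i) points, cluster them, and link the inlier pairs."""
    points = list(zip(mi, jumps))  # zip stops at len(jumps): trailing pairs dropped
    if len(points) < 2:
        return []
    labels, centers = two_means(points)
    # Assumption: the cluster with the higher mean MI holds the inliers.
    inlier_cluster = 0 if centers[0][0] >= centers[1][0] else 1
    return link_inliers(words, [lab == inlier_cluster for lab in labels])
```

Chaining consecutive inlier pairs is what lets a single pass return expressions of three or more words, rather than only two-word units.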

As shown in Fig. 2, the device of the invention comprises: a candidate word acquisition device, which preprocesses all texts in the collected document library by Chinese word segmentation, part-of-speech tagging, named entity recognition, part-of-speech selection, and similar operations to form a candidate word set with a specific order; a mutual information and jump information acquisition device, which calculates the mutual information of adjacent candidate words across multiple documents and, from adjacent mutual information values, calculates the jump information of the mutual information sequence; a two-dimensional mutual information acquisition device, which pairs the mutual information and jump information at corresponding positions of the two sequences to form two-dimensional mutual information; and a classification and screening device, which uses a classifier to classify all points in the two-dimensional mutual information set as multi-word expression inliers or outliers and links the adjacent words joined by inliers to form multi-word expressions.

In summary, the invention converts the mutual information between adjacent words into two-dimensional mutual information and clusters it to screen out multi-word expressions. This avoids the manually set threshold that one-dimensional mutual information requires, along with its poor adaptability to different data. The method is also not limited to two-word structures: it obtains multi-word expressions made of several words in a single pass, without step-by-step processing, which improves the utilization of multi-word expressions and the accuracy of the multi-word expression library.

Claims (9)

1. A multi-word expression extraction method, characterized in that the method comprises the following steps in order:

(1) the document library is preprocessed by word segmentation and part-of-speech tagging to form source-language documents;

(2) the mutual information of adjacent words across multiple documents is calculated, and the jump information between successive values of the mutual information sequence is then calculated;

(3) the mutual information sequence and the jump information sequence are combined into a two-dimensional mutual information set;

(4) a classifier divides the two-dimensional mutual information set into multi-word expression inliers and outliers, and runs of linked inliers are selected to build multi-word expressions.

2. The method according to claim 1, characterized in that in step (1), all documents in the document library are preprocessed by Chinese word segmentation, part-of-speech tagging, named entity recognition, and part-of-speech selection to form a candidate word set with a specific order.

3. The method according to claim 1, characterized in that step (2) comprises the following steps in order:

(a) calculate the mutual information of all adjacent words across multiple documents;

(b) calculate the jump information between successive values of the mutual information sequence.

4. The method according to claim 1, characterized in that in step (3), two-dimensional mutual information (MI_i, f_i) is constructed from the values at corresponding positions of the mutual information sequence and the jump information sequence, and the two-dimensional mutual information values together form the two-dimensional mutual information set.

5. The method according to claim 1, characterized in that in step (4), a classifier divides all points in the two-dimensional mutual information set into two classes, multi-word expression inliers and outliers, and the adjacent words joined by inliers are linked to form multi-word expressions.

6. The method according to claim 3, characterized in that in step (a), the mutual information of adjacent words across multiple documents is calculated to form the mutual information sequence MI, where the mutual information MI_i (0 ≤ i < len(MI) - α) of adjacent words x and y is calculated as follows:

$$MI_i = \log\left[\frac{M \times p(x, y)}{p(x) \times p(y)} \times \frac{N_{x,y}}{N}\right]$$

where x and y are adjacent words; MI_i is the i-th mutual information value, formed by adjacent words x and y; len(MI) is the length of the mutual information sequence MI; α is a constant; M is the total number of words in all documents; p(x, y) is the number of co-occurrences of x and y in all documents; p(x) is the number of occurrences of x in all documents; p(y) is the number of occurrences of y in all documents; N is the number of documents in the document set; and N_{x,y} is the number of documents in which x and y co-occur.

7. The method according to claim 3, characterized in that in step (b), the jump information between successive values of the mutual information sequence is calculated to form the jump information sequence f, where the jump information f_i of adjacent mutual information values is calculated as follows:

$$f_i = \left|\frac{1}{\alpha}\sum_{j=1}^{\alpha} MI_{i+j} - MI_i\right|, \quad 0 \le i < len(MI) - \alpha$$

where f_i is the jump information between the current mutual information value and the subsequent values in the mutual information sequence, and | | denotes the absolute value.

8. The method according to claim 6, characterized in that α is 2.

9. A multi-word expression extraction device, comprising:

a candidate word acquisition device, which preprocesses all documents in the collected document library by operations such as Chinese word segmentation, part-of-speech tagging, named entity recognition, and part-of-speech selection to form a candidate word set with a specific order;

a mutual information and jump information acquisition device, which calculates the mutual information of adjacent candidate words across multiple documents and, from adjacent mutual information values, calculates the jump information of the mutual information sequence;

a two-dimensional mutual information acquisition device, which pairs the mutual information and jump information at corresponding positions of the two sequences to form two-dimensional mutual information;

a classification and screening device, which uses a classifier to classify all points in the two-dimensional mutual information set as multi-word expression inliers or outliers and links the adjacent words joined by inliers to form multi-word expressions.
CN201610990921.6A 2016-11-10 2016-11-10 Multi-word expression extraction method and device Pending CN106649263A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610990921.6A CN106649263A (en) 2016-11-10 2016-11-10 Multi-word expression extraction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610990921.6A CN106649263A (en) 2016-11-10 2016-11-10 Multi-word expression extraction method and device

Publications (1)

Publication Number Publication Date
CN106649263A true CN106649263A (en) 2017-05-10

Family

ID=58806046

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610990921.6A Pending CN106649263A (en) 2016-11-10 2016-11-10 Multi-word expression extraction method and device

Country Status (1)

Country Link
CN (1) CN106649263A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108549631A (en) * 2018-03-30 2018-09-18 北京智慧正安科技有限公司 Noun dictionary extracting method, electronic device and computer readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040044528A1 (en) * 2002-09-03 2004-03-04 Chelba Ciprian I. Method and apparatus for generating decision tree questions for speech processing
CN1567297A (en) * 2003-07-03 2005-01-19 中国科学院声学研究所 Method for extracting multi-word translation equivalent cells from bilingual corpus automatically
US20050216443A1 (en) * 2000-07-06 2005-09-29 Streamsage, Inc. Method and system for indexing and searching timed media information based upon relevance intervals
JP2006178536A (en) * 2004-12-20 2006-07-06 Oki Electric Ind Co Ltd Bilingual expression extraction device
CN106095736A (en) * 2016-06-07 2016-11-09 华东师范大学 A kind of method of field neologisms extraction

Similar Documents

Publication Publication Date Title
CN108984526B (en) A deep learning-based document topic vector extraction method
CN104572892B (en) A Text Classification Method Based on Recurrent Convolutional Network
Wilkinson et al. Neural Ctrl-F: segmentation-free query-by-string word spotting in handwritten manuscript collections
CN110413768B (en) A method of automatic generation of article title
CN116167362A (en) Model training method, Chinese text error correction method, electronic device and storage medium
CN112926345B (en) Multi-feature fusion neural machine translation error detection method based on data augmentation training
CN112101031B (en) Entity identification method, terminal equipment and storage medium
CN106095749A (en) A kind of text key word extracting method based on degree of depth study
CN106611055A (en) Chinese Fuzzy Restricted Information Range Detection Method Based on Laminated Neural Network
US10303938B2 (en) Identifying a structure presented in portable document format (PDF)
CN108427717B (en) Letter class language family medical text relation extraction method based on gradual expansion
CN103559193B (en) A kind of based on the theme modeling method selecting unit
CN110490081A (en) A kind of remote sensing object decomposition method based on focusing weight matrix and mutative scale semantic segmentation neural network
CN105261358A (en) N-gram grammar model constructing method for voice identification and voice identification system
CN109446333A (en) A kind of method that realizing Chinese Text Categorization and relevant device
CN111476036A (en) A Word Embedding Learning Method Based on Chinese Word Feature Substrings
CN108763192B (en) Entity relation extraction method and device for text processing
CN114510568A (en) Author name disambiguation method and author name disambiguation device
CN116186067A (en) Industrial data table storage query method and equipment
CN107168953A (en) The new word discovery method and system that word-based vector is characterized in mass text
CN104077274A (en) Method and device for extracting hot word phrases from document set
CN116842168A (en) Cross-domain problem processing method and device, electronic equipment and storage medium
CN108846033A (en) The discovery and classifier training method and apparatus of specific area vocabulary
CN109472020B (en) Feature alignment Chinese word segmentation method
CN106484672A (en) Vocabulary recognition methods and vocabulary identifying system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20170510)