
CN102629266A - Diagram text structure representation model based on harmonic progression - Google Patents


Info

Publication number
CN102629266A
Authority
CN
China
Prior art keywords
text
keywords
keyword
harmonic series
keyword pairs
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012100594049A
Other languages
Chinese (zh)
Inventor
陈雪
吴超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology
Priority to CN2012100594049A
Publication of CN102629266A

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a text graph structure representation model based on the harmonic series. The method comprises the following steps: (1) open a single text from a domain corpus; (2) rearrange the text content in descending order of importance; (3) segment the text into words, retaining punctuation marks; (4) count the occurrences of keywords and keyword pairs; (5) take the keywords as the nodes of a graph and connect every keyword pair whose co-occurrence count is not 0; (6) calculate the weights of keywords and keyword pairs with the harmonic series method. The method avoids losing text structure information and can calculate the weights of keywords and keyword pairs from the structure information of a single text. It is simple and easy to operate, works well, and can also serve the function of TFIDF.

Description

A Text Graph Structure Representation Model Based on Harmonic Series

Technical field

The present invention relates to a text representation model, and in particular to a model that represents text with a graph structure and uses the harmonic series to calculate the weights of keywords and keyword pairs; that is, a text graph structure representation model based on the harmonic series.

Background

Humans are good at processing unstructured text, because unstructured text matches the way human language is expressed and, more importantly, because humans have strong logical reasoning ability. Machines, by contrast, are good at processing structured text such as graphs and tables. Human-computer interaction therefore requires converting human-readable unstructured text into machine-readable structured text, which calls for a text representation model.

The most widely used text representation model at present is the vector space model. It represents a text as a weight vector in which each component corresponds to a term, and the weight of each term is determined by the TFIDF method. TFIDF uses a term weight formula to measure how important a term is to a single text in a corpus: the term weight is the product of the term frequency TF (Term Frequency) and the inverse document frequency IDF (Inverse Document Frequency). The TFIDF formula is as follows:

                                                 

$$\mathrm{TFIDF}_i = \mathrm{TF}_i \times \mathrm{IDF}_i = \mathrm{TF}_i \times \log\frac{N}{n_i}$$

where TF_i is the term frequency of term i, that is, the number of times term i occurs in the text; IDF_i is the inverse document frequency of term i, calculated as log(N/n_i); N is the total number of texts in the text set; and n_i is the number of texts in the set that contain term i.
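As a minimal sketch of the definitions above (the function name `tfidf` and the toy corpus are hypothetical and only serve as illustration), the weight TF_i × log(N/n_i) might be computed like this:

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists. TF_i is the raw count of term i in a
    document and IDF_i = log(N / n_i), following the definitions above.
    """
    n_docs = len(docs)
    # n_i: number of documents that contain term i at least once
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights

# Hypothetical toy corpus, for illustration only
corpus = [["graph", "text", "graph"], ["text", "model"], ["text", "vector"]]
w = tfidf(corpus)
```

In this toy corpus "text" occurs in all three documents, so its weight is 0. With a corpus of one document (N = 1) every IDF is log(1) = 0, so the formula cannot rank the terms of a single text on its own.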

However, representing text with the vector space model combined with the TFIDF method has the following shortcomings:

(1) The vector space model treats a text as a set of terms and treats the terms as independent of one another, so a large amount of text structure information is lost.

(2) When calculating term frequency, the TFIDF method ignores the influence of a term's position on its weight; the number of occurrences or co-occurrences alone is not enough to express the actual weight.

(3) When calculating the inverse document frequency of a term, the TFIDF method needs a domain text set and cannot be applied to a single text.

Summary of the invention

The purpose of the present invention is to address the shortcomings of the vector space model and the TFIDF method by providing a text graph structure representation model based on the harmonic series. The model avoids losing text structure information and can calculate the weights of keywords and keyword pairs from the structure information of a single text.

To achieve this purpose, the idea of the invention is as follows: a graph structure model is used to represent a single text, which avoids losing text structure information and allows the weights of keywords and keyword pairs to be calculated from the structure information of that single text. The graph structure model organizes the keywords of the text and the relationships between them as a graph, and the weights are then calculated by the harmonic series method.

Based on this idea, the present invention adopts the following technical solution:

A text graph structure representation model based on the harmonic series, characterized by the following steps:

(1) Open a single text from a domain corpus;

(2) Rearrange the text content in descending order of importance;

(3) Segment the text into words, retaining punctuation marks;

(4) Count the occurrences of keywords and keyword pairs;

(5) Take the keywords as the nodes of a graph and connect every keyword pair whose co-occurrence count is not 0;

(6) Calculate the weights of keywords and keyword pairs with the harmonic series method.

The harmonic series method, denoted HP, calculates the weights of keywords and keyword pairs as follows:

[Formula image: HP weight formula, not recoverable from the extracted text]

where n is the number of occurrences of a keyword or keyword pair, [symbol image] is Euler's constant, and a further expression is given only as an image.
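The HP formula survives here only as an image reference, so the following is a minimal sketch under the assumption that HP(n) is the n-th harmonic number, 1 + 1/2 + ... + 1/n, which is the standard quantity that involves both a logarithm and Euler's constant. The function names and the 1/(2n) correction term are illustrative choices, not taken from the patent.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler's constant (Euler-Mascheroni)

def harmonic_weight(n):
    """Exact n-th harmonic number 1 + 1/2 + ... + 1/n (assumed form of HP)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return sum(1.0 / k for k in range(1, n + 1))

def harmonic_weight_approx(n):
    """Standard asymptotic estimate ln(n) + gamma + 1/(2n), for n >= 1."""
    return math.log(n) + EULER_GAMMA + 1.0 / (2 * n)

# A keyword seen 4 times gets 1 + 1/2 + 1/3 + 1/4 under this assumption
weight_4 = harmonic_weight(4)
```

Under this assumption each further occurrence adds less weight than the previous one, and the closed-form estimate avoids summing the series for large counts.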

Compared with the prior art, the text graph structure representation model based on the harmonic series of the present invention has the following salient features and advantages: when no domain text set is available, so that the discriminating power of a keyword within a text set cannot be determined, the weight of a keyword can still be determined from its number of occurrences and its positions by scanning a single text; although only the number of occurrences is used to evaluate the weight, the method is simple, easy to operate, and effective; and because the logarithm in the harmonic series method is scalable in order of magnitude, the method can serve the function of TFIDF while being simpler.
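The role of the logarithm can be made concrete with two standard facts about harmonic numbers. This assumes, as a reading of the text rather than something the extracted source confirms, that the HP weight of a term with n occurrences is the n-th harmonic number H_n:

```latex
H_n = \sum_{k=1}^{n} \frac{1}{k}, \qquad
H_{n+1} - H_n = \frac{1}{n+1}, \qquad
\ln(n+1) < H_n \le 1 + \ln n \quad (n \ge 1).
```

Under that assumption the weight grows only logarithmically in the raw count, whereas a plain term frequency grows linearly, so a tenfold increase in occurrences raises the weight by roughly ln 10 ≈ 2.3.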

Description of drawings

Fig. 1 is a flowchart of the text graph structure representation model based on the harmonic series of the present invention.

Detailed description

Embodiments of the present invention are further described below with reference to the accompanying drawing.

Embodiment 1: Referring to Fig. 1, this text graph structure representation model based on the harmonic series is characterized in that a graph structure model is used to represent a single text, and the harmonic series method is used to calculate the weights of keywords and keyword pairs.

The graph structure model connects the keywords of the text according to the co-occurrence of keyword pairs within the same sentence.
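The sentence-level co-occurrence rule can be sketched as follows. This is an illustrative reading, not the patent's implementation: the function name, the choice to count a pair once per sentence, and the sample data are all assumptions.

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence_graph(sentences, keywords):
    """Count keywords (nodes) and same-sentence keyword pairs (edges).

    sentences is a list of token lists; keywords is the set of terms
    that may become nodes. Only pairs that co-occur at least once ever
    receive an entry, so every stored edge has a non-zero count.
    """
    node_counts = Counter()
    edge_counts = Counter()
    for tokens in sentences:
        present = [t for t in tokens if t in keywords]
        node_counts.update(present)
        # Assumption: each unordered pair is counted once per sentence
        for a, b in combinations(sorted(set(present)), 2):
            edge_counts[(a, b)] += 1
    return node_counts, edge_counts

# Hypothetical sample input
sample = [["graph", "model", "text"], ["text", "graph"], ["model"]]
nodes, edges = build_cooccurrence_graph(sample, {"graph", "model", "text"})
```

Sorting each pair before counting keeps the graph undirected, so ("graph", "text") and ("text", "graph") share one edge.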

The harmonic series method calculates the weights of keywords and keyword pairs as follows:

[Formula image: HP weight formula, not recoverable from the extracted text]

where n is the number of occurrences of a keyword or keyword pair, [symbol image] is Euler's constant, and a further expression is given only as an image.

Embodiment 2: The text graph structure representation model based on the harmonic series is applied to represent the text of 70 papers published in TKDE from 2011 to 2012. As shown in Fig. 1, the steps of this embodiment are as follows:

S1. Open a single text from the domain corpus, for example, a single paper from Volume 24, Issue 1, 2011.

S2. Rearrange the text content in descending order of importance, for example, in the order title, abstract, introduction, summary.

S3. Segment the text into words, retaining punctuation marks, for example, periods.

S4. Count the occurrences of keywords and keyword pairs, denoted n.

S5. Take the keywords as the nodes of a graph and connect every keyword pair whose co-occurrence count is not 0.

S6. Calculate the weights of keywords and keyword pairs with the harmonic series method. The harmonic series formula, denoted HP, calculates the weights of keywords and keyword pairs as follows:

[Formula image: HP weight formula, not recoverable from the extracted text]

where n is the number of occurrences of a keyword or keyword pair, [symbol image] is Euler's constant, and a further expression is given only as an image.
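Steps S2 and S3 can be sketched as follows, assuming English-language papers, so that word segmentation reduces to simple tokenization. The section keys, the regular expression, and the sample paper are hypothetical and only illustrate the idea of keeping periods so that sentence boundaries remain available for counting co-occurrences.

```python
import re

# Assumed importance order, following the example given in S2
SECTION_ORDER = ["title", "abstract", "introduction", "summary"]

def reorder_sections(sections):
    """Return section texts in descending importance (S2)."""
    return [sections[name] for name in SECTION_ORDER if name in sections]

def tokenize_keep_periods(text):
    """Lowercase word tokens plus '.' tokens, so sentence ends survive (S3)."""
    return re.findall(r"[a-z0-9]+|\.", text.lower())

def split_sentences(tokens):
    """Group tokens into sentences, using the retained '.' as delimiter."""
    sentences, current = [], []
    for tok in tokens:
        if tok == ".":
            if current:
                sentences.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        sentences.append(current)
    return sentences

# Hypothetical sample paper, given out of order on purpose
paper = {"abstract": "Graphs model text. Text has structure.",
         "title": "Text graphs"}
ordered = reorder_sections(paper)
tokens = tokenize_keep_periods(". ".join(ordered))
sentences = split_sentences(tokens)
```

The resulting list of sentences is the kind of input from which keyword and keyword-pair occurrences (S4) could then be counted.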

Claims (2)

1. A text graph structure representation model based on the harmonic series, characterized in that: a graph structure model is used to represent a single text, and the harmonic series method is used to calculate the weights of keywords and keyword pairs; the graph structure model connects the keywords of the text according to the co-occurrence of keyword pairs within the same sentence; the specific steps are as follows:
(1) open a single text from a domain corpus;
(2) rearrange the text content in descending order of importance;
(3) segment the text into words, retaining punctuation marks;
(4) count the occurrences of keywords and keyword pairs;
(5) take the keywords as the nodes of a graph and connect every keyword pair whose co-occurrence count is not 0;
(6) calculate the weights of keywords and keyword pairs with the harmonic series method.

2. The text graph structure representation model based on the harmonic series according to claim 1, characterized in that the harmonic series method in step (6) calculates the weights of keywords and keyword pairs as follows:

[Formula image: HP weight formula, not recoverable from the extracted text]

where n is the number of occurrences of a keyword or keyword pair, [symbol image] is Euler's constant, and a further expression is given only as an image.
CN2012100594049A 2012-03-08 2012-03-08 Diagram text structure representation model based on harmonic progression Pending CN102629266A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012100594049A CN102629266A (en) 2012-03-08 2012-03-08 Diagram text structure representation model based on harmonic progression

Publications (1)

Publication Number Publication Date
CN102629266A true CN102629266A (en) 2012-08-08

Family

ID=46587526

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012100594049A Pending CN102629266A (en) 2012-03-08 2012-03-08 Diagram text structure representation model based on harmonic progression

Country Status (1)

Country Link
CN (1) CN102629266A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103744835A (en) * 2014-01-02 2014-04-23 上海大学 Text keyword extracting method based on subject model
CN109766408A (en) * 2018-12-04 2019-05-17 上海大学 Calculation method of text keyword weight by combining word position factor and word frequency factor
CN114328900A (en) * 2022-03-14 2022-04-12 深圳格隆汇信息科技有限公司 Information abstract extraction method based on key words

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020111941A1 (en) * 2000-12-19 2002-08-15 Xerox Corporation Apparatus and method for information retrieval
CN101067808A (en) * 2007-05-24 2007-11-07 上海大学 Text Keyword Extraction Method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘巧凤: "基于图结构的中文文本聚类方法研究", 《万方硕士学位论文》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103744835A (en) * 2014-01-02 2014-04-23 上海大学 Text keyword extracting method based on subject model
CN103744835B (en) * 2014-01-02 2016-12-07 上海大学 A kind of text key word extracting method based on topic model
CN109766408A (en) * 2018-12-04 2019-05-17 上海大学 Calculation method of text keyword weight by combining word position factor and word frequency factor
CN114328900A (en) * 2022-03-14 2022-04-12 深圳格隆汇信息科技有限公司 Information abstract extraction method based on key words

Similar Documents

Publication Publication Date Title
Brank et al. Annotating documents with relevant wikipedia concepts
Yang Research and realization of internet public opinion analysis based on improved TF-IDF algorithm
Huang et al. Detecting suicidal ideation in Chinese microblogs with psychological lexicons
Paltoglou et al. A study of information retrieval weighting schemes for sentiment analysis
Huang et al. A text similarity measurement combining word semantic information with TF-IDF method
CN103390004B (en) Determination method and apparatus, corresponding searching method and the device of a kind of semantic redundancy
Demiroz et al. Learning domain-specific polarity lexicons
Lahiri et al. Keyword extraction from emails
CN102737112B (en) Concept Relevance Calculation Method Based on Representational Semantic Analysis
Lalji et al. Twitter sentiment analysis using hybrid approach
Barbieri et al. Do We Criticise (and Laugh) in the Same Way? Automatic Detection of Multi-Lingual Satirical News in Twitter.
CN101833579A (en) Method and system for automatically detecting academic misconduct literature
Al-Saqqa et al. Stemming effects on sentiment analysis using large arabic multi-domain resources
Khalil et al. Which configuration works best? an experimental study on supervised Arabic twitter sentiment analysis
Aliguliyev A novel partitioning-based clustering method and generic document summarization
CN102629266A (en) Diagram text structure representation model based on harmonic progression
JP6230190B2 (en) Important word extraction device and program
CN103164394B (en) A kind of based on gravitational Text similarity computing method
Chung et al. Improve polarity detection of online reviews with bag-of-sentimental-concepts
CN102136006A (en) Measuring method for text understanding complexity based on concept learning of human
CN103593339A (en) Electronic-book-oriented semantic space representing method and system
Xu et al. A hybrid topic model for multi-document summarization
Gella et al. Unimelb_nlp-core: Integrating predictions from multiple domains and feature sets for estimating semantic textual similarity
Badaro et al. An efficient model for sentiment classification of Arabic tweets on mobiles
Liu et al. Extracting main content of a topic on online social network by multi-document summarization

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20120808