CN110134951B

CN110134951B - A method and system for analyzing potential topic phrases in text data

Info

Publication number: CN110134951B
Application number: CN201910354460.7A
Authority: CN
Inventors: 马甲林; 张琳; 程清雯
Original assignee: Huaiyin Institute of Technology
Current assignee: Huaiyin Institute of Technology
Priority date: 2019-04-29
Filing date: 2019-04-29
Publication date: 2021-08-31
Anticipated expiration: 2039-04-29
Also published as: CN110134951A

Abstract

The invention discloses a method and a system for analyzing potential topic phrases of text data. The method includes: collecting a text data set, and segmenting the text data set to obtain a word expression form of the text data set; extracting words according to the text data set The effective phrases formed after word collocation are obtained, and the mixed expressions of words and phrase sets that are not collocated into effective phrases are obtained; the corresponding word vector model is obtained after word vector training is performed on the text data set of mixed expressions; DR‑Phrase LDA is constructed and Solve for each parameter; train DR‑Phrase LDA and output potential topic phrases from text data based on the training results. The present invention adopts a phrase topic model based on word vectors, which reasonably improves the statistical information of phrases in model training with the help of Chinese linguistic laws in the training of probabilistic topic models, and specifically adopts the method of word vectors to measure the relationship between phrase components. Quantitatively reflects the semantic relationship of words in the whole text and phrase parts, which makes the model more accurate.

Description

A method and system for analyzing potential topic phrases in text data

技术领域technical field

本发明涉及文本数据挖掘分析领域，具体涉及一种分析文本数据潜在主题短语的方法及系统。The invention relates to the field of text data mining and analysis, in particular to a method and system for analyzing potential topic phrases of text data.

背景技术Background technique

随着信息技术的发展，各个领域累积了大量的电子文本，导致了信息过载。为了帮助人们快速检索、查找和有效利用这些信息，文本语义及结构分析成为当今研究热点之一。其中从文本数据中分析潜在的主题信息，是信息检索、推荐系统、自动文摘等高级应用系统的关键技术之一。现有的常用方法采用LDA、PLDA等传统以“词袋”为基础的概率主题模型进行文本主题分析。这些方法分析所得的主题结果以主题词形式呈现，而人类自然语言习惯以短语块形式表达语义，因此这些方法获取的主题结果存在可读性、一致性和可视化差等缺陷。With the development of information technology, a large number of electronic texts have been accumulated in various fields, resulting in information overload. In order to help people quickly retrieve, find and effectively use this information, text semantics and structure analysis has become one of the hotspots of today's research. The analysis of potential subject information from text data is one of the key technologies in advanced application systems such as information retrieval, recommendation systems, and automatic summarization. The existing common methods use traditional probabilistic topic models based on "bag of words" such as LDA and PLDA for text topic analysis. The subject results obtained by these methods are presented in the form of subject words, while human natural language is accustomed to express semantics in the form of phrase blocks, so the subject results obtained by these methods have defects such as readability, consistency and poor visualization.

目前同类方法有两种策略：第一种先从文本数据中提取短语后再训练主题模型，由于短语缺乏统计信息导致这类方法在模型训练过程短语出现概率极低，无法有效体现在主题短语结果中；第二种先训练主题模型获取主题词，再有主题词合成短语，由于汉语用词灵活多变，这种后期合成的主题短语质量也较差。At present, there are two strategies for similar methods: the first one extracts phrases from text data and then trains the topic model. Due to the lack of statistical information of phrases, the probability of occurrence of phrases in this type of method is extremely low during the model training process and cannot be effectively reflected in the results of topic phrases. The second is to train the topic model to obtain the topic words, and then to synthesize the phrases with the topic words. Due to the flexibility of Chinese words, the quality of the topic phrases synthesized later is also poor.

发明内容SUMMARY OF THE INVENTION

发明目的：为了克服现有技术的不足，本发明提供一种分析文本数据潜在主题短语的方法，该方法克服了传统以“词袋”为基础的主题模型训练所得主题结果存在可读性、一致性和可视化差的缺陷；并且解决了同类方法由于短语缺乏统计信息所导致的无法获取有效主题短语结果的问题；本发明还提供一种分析文本数据潜在主题短语的系统。Purpose of the invention: In order to overcome the deficiencies of the prior art, the present invention provides a method for analyzing potential topic phrases in text data, which overcomes the problems of readability and consistency in the topic results obtained by the topic model training based on the traditional "bag of words". The defects of poor performance and visualization are solved; and the problem that similar methods cannot obtain effective subject phrase results due to lack of statistical information of phrases is solved; the invention also provides a system for analyzing potential subject phrases of text data.

技术方案：本发明所述的一种分析文本数据潜在主题短语的方法，该方法包括：Technical solution: a method for analyzing potential topic phrases in text data according to the present invention, the method includes:

(1)采集文本数据集，并对所述文本数据集进行分词，得到文本数据集的词语表现形式；(1) Collecting a text data set, and performing word segmentation on the text data set to obtain the word representation of the text data set;

(2)根据文本数据集的词语提取词语搭配后形成的有效短语，得到未搭配成有效短语的词语与短语集的混合表现形式；(2) According to the words in the text data set, the effective phrases formed by the collocation of the words are extracted, and the mixed expression form of the words that are not collocated into the effective phrases and the phrase set is obtained;

(3)对混合表现形式的文本数据集进行词向量训练后得到对应的词向量模型；(3) The corresponding word vector model is obtained after word vector training is performed on the text data set of mixed expressions;

(4)构建基于词向量的短语主题模型DR-Phrase LDA并求解各个参数；(4) Construct a phrase topic model DR-Phrase LDA based on word vector and solve each parameter;

(5)对所述DR-Phrase LDA训练，并根据训练结果输出文本数据潜在的主题短语。(5) Train the DR-Phrase LDA, and output potential topic phrases of text data according to the training results.

进一步地，包括：Further, include:

所述步骤(2)中，所述有效短语包括n元短语，n元为组成短语的词语个数，所述根据文本数据集的词语提取词语搭配后形成的有效短语，具体包括：In the step (2), the valid phrases include n-gram phrases, where n-grams are the number of words that form the phrase, and the valid phrases formed after the word collocation is extracted according to the words in the text data set, specifically including:

(21)统计文本数据集的双词语或短语搭配共现频率，构成二元短语候选集；(21) Count the co-occurrence frequencies of two-word or phrase collocations in the text data set to form a binary phrase candidate set;

(22)计算二元短语候选集score(w_i，w_j)分值，选取分值高的前m个构成正式的二元短语，并加入到短语集中，同时在步骤(1)所述的文本数据集的词语表现形式中更新相应的词语为所得短语；(22) Calculate the score(w _i , w _j ) of the candidate set of binary phrases, select the top m with high scores to form formal binary phrases, and add them to the phrase set. Update the corresponding words in the word representations of the text data set to be the obtained phrases;

(23)迭代步骤(21)(22)计算得到的所述二元短语与其他词语或短语搭配组成的n元短语，依次加入到短语集中。(23) An n-gram phrase formed by collocation of the binary phrase and other words or phrases calculated in the iterative steps (21) and (22) is added to the phrase set in turn.

进一步地，包括：Further, include:

所述步骤(4)中，DR-Phrase LDA模型为概率生成模型，DR-Phrase LDA实现文本生成的过程为：首先文本数据集D中有M篇文本，从超参数为α的狄利克雷分布中取样生成文本d的主题分布θ_d；从主题的多项式分布θ_d中取样生成文本d中词语或短语的主题参数z，主题编号记为z_mn；从超参数为β狄利克雷分布中取样生成主题z相应的词语或短语分布

其中，所有文本共享K个主题；从词语的多项式分布

中采样生成词语或短语t。In the step (4), the DR-Phrase LDA model is a probability generation model, and the process of realizing text generation by DR-Phrase LDA is: first, there are M texts in the text data set D, from the Dirichlet distribution whose hyperparameter is α The topic distribution θ _d of the generated text d is sampled from ; the topic parameter z of the words or phrases in the text d is generated by sampling from the multinomial distribution θ _d of the topic, and the topic number is denoted as z _mn ; the hyperparameter is β Dirichlet distribution. Generate a distribution of words or phrases corresponding to topic z

where all texts share K topics; from the multinomial distribution of words

Sampling produces a word or phrase t.

进一步地，包括：Further, include:

所述词语或短语的主题参数z的计算采用吉布斯采样近似求解方法完成，表示为：The calculation of the topic parameter z of the word or phrase is completed using the Gibbs sampling approximate solution method, which is expressed as:

其中，采样过程中文档d当前位置词语或者有效短语表示为t，记为term，k表示被分配的主题编号，K为预设的主题个数，N_t为文本数据集中的总term数，n_k/d表示文档d中主题k的计数，n_t/k表示主题k中t的计数，n_r表示t的语义相关term的个数，n_tr表示t的相关term个数，α和β为Dirichlet超参数，α，β分别为α和β对应的向量。Among them, the word or valid phrase in the current position of document d in the sampling process is denoted as t, denoted as term, k is the assigned topic number, K is the preset number of topics, N _t is the total number of terms in the text data set, n _k/d represents the count of topic k in document d, n _t/k represents the count of t in topic k, n _r represents the number of semantically related terms of t, n _tr represents the number of related terms of t, α and β are Dirichlet hyperparameters, α, β are the vectors corresponding to α and β, respectively.

所述文本数据集中的某篇文本d中潜在的主题比例θ_d表示为：The potential topic ratio θd in a certain text _d in the text dataset is expressed as:

所述编号为k的主题所包含term概率

值表示为：The term probability contained in the topic numbered k

The value is represented as:

进一步地，包括：Further, include:

所述步骤(5)中，包括：In the step (5), including:

(51)对所述DR-Phrase LDA模型训练，训练步骤包括：(51) to described DR-Phrase LDA model training, the training step comprises:

输入：未搭配成有效短语的词语与短语集的混合表现形式的文本数据集，已训练所得的词向量模型及DR-Phrase LDA模型的狄利克雷分布超参数α，狄利克雷分布超参数β，主题个数K；迭代次数IterNum；短语w^p语义相似度最大的前γ个对应的取值；短语w^p的长度计数调节参数μ；Input: a text dataset of mixed expressions of words and phrase sets that are not matched into valid phrases, the trained word vector model and the Dirichlet distribution hyperparameter α of the DR-Phrase LDA model, and the Dirichlet distribution hyperparameter β , the number of topics K; the number of iterations IterNum; the first γ corresponding values of the phrase w ^p with the largest semantic similarity; the length count adjustment parameter μ of the phrase w ^p ;

训练过程：Training process:

遍历文本数据集中每个文本中的词语或有效短语t对应的编号；Traverse the number corresponding to the word or valid phrase t in each text in the text dataset;

若t为有效短语，则增加所述短语w^p被抽到主题k后的计数C(w^p)；同时遍历与w^p语义相似度最大的前γ个词集，并相应增加该词集中所有词分配到主题k下的计数C(w_i)；If t is a valid phrase, increase the count C(w ^p ) after the phrase w ^p is drawn to topic k; at the same time, traverse the first γ word set with the largest semantic similarity with w ^p , and increase all the words in the word set accordingly. word assigned to count C( _wi ) under topic k;

否则，若t为实词词语，且存在语义相关短语，则增加词语t在主题k的计数C(w)，同时增加该词语作为语义背景的有效短语

在主题k下的计数

Otherwise, if t is a content word and there is a semantically related phrase, increase the count C(w) of the word t in topic k, and add the word as a valid phrase for semantic context

count under topic k

否则，若t为虚词词语，则对应的计数减1；Otherwise, if t is a function word, the corresponding count is decremented by 1;

迭代上述步骤至设定的次数IterNum；Iterate the above steps to the set number of times IterNum;

输出：文本数据集中的所有词语和有效短语的主题编号二维矩阵z；Output: A two-dimensional matrix z of topic numbers for all words and valid phrases in the text dataset;

(52)根据训练结果输出文本数据潜在的主题短语，具体包括：(52) Output potential topic phrases of text data according to the training results, including:

根据某篇文本d中潜在的主题比例θ_d统计可得每一篇文本的主题比例概率值矩阵θ：According to the potential topic ratio θ _d in a certain text d, the probability value matrix θ of the topic ratio of each text can be obtained:

θ＝{θ_(m，k)，m∈{0，..M-1}，k∈{0，..K-1}}θ={θ _{(m, k)} , m∈{0,..M-1}, k∈{0,..K-1}}

其中，M为文本数据集文本总的文本数量；K为主题数。Among them, M is the total number of texts in the text dataset; K is the number of topics.

根据编号为k的主题所包含term概率

值统计可得每个主题与term的概率值矩阵

According to the probability of the term contained in the topic numbered k

Value statistics can get the probability value matrix of each topic and term

其中，N_t为文本数据集的总term数。where N _t is the total number of terms in the text dataset.

进一步地，包括：Further, include:

所述有效短语w^p被抽到主题k的计数C(w^p)，表示为：The valid phrase w ^p is drawn to the count C(w ^p ) of topic k, expressed as:

C(w^p)＝μlen(w^p)C(w ^p )=μlen(w ^p )

其中，μ为大于等于1的整数，调节参数len(w^p)为有效短语w^p的长度。Wherein, μ is an integer greater than or equal to 1, and the adjustment parameter len(w ^p ) is the length of the effective phrase w ^p .

所述有效短语w^p对应的短语语义相关词w_i分配到主题k下的计数C(w_i)，表示为：The phrase semantically related words _wi corresponding to the valid phrase w ^p are assigned to the count C( _wi ) under topic k, which is expressed as:

其中，Int()表示取整数，Sim(w_i w^p)表示在词语w_i与有效短语w^p通过词向量计算的相似度；Among them, Int() represents an integer, Sim( _wi w ^p ) represents the similarity between word _wi and valid phrase w ^p calculated by word vector;

所述词语w在主题k的计数C(w)，表示为：The count C(w) of the word w in topic k is expressed as:

其中，

表示在词语w对应的词向量与词语对应的有效短语

对应的词向量

之间的相似度；in,

Represents a valid phrase corresponding to the word vector corresponding to the word w and the word

Corresponding word vector

similarity between

所述以词w作为语义背景的有效短语

在主题k下的计数

表示为：The effective phrase with the word w as the semantic background

count under topic k

Expressed as:

本发明还公开一种分析文本数据潜在主题短语的系统，所述系统包括：The invention also discloses a system for analyzing potential topic phrases of text data, the system comprising:

预处理模块，用于采集文本数据集，并对所述文本数据集进行分词，得到文本数据集的词语表现形式；a preprocessing module, used to collect a text data set, and perform word segmentation on the text data set to obtain a word representation of the text data set;

短语提取模块，用于根据文本数据集的词语提取词语搭配后形成的有效短语，得到未搭配成有效短语的词语与短语集的混合表现形式；The phrase extraction module is used for extracting valid phrases formed by collocation of words according to the words in the text data set, and obtaining a mixed expression form of the words and phrase sets that are not matched into valid phrases;

词向量构建模块，用于对混合表现形式的文本数据集进行词向量训练后得到对应的词向量模型；The word vector building module is used to obtain the corresponding word vector model after word vector training is performed on the text data set with mixed representation;

模型构建模块，用于构建基于词向量的短语主题模型DR-Phrase LDA并求解各个参数；The model building module is used to build a word vector-based phrase topic model DR-Phrase LDA and solve each parameter;

结果输出模块，用于对所述DR-Phrase LDA训练，并根据训练结果输出文本数据潜在的主题短语。A result output module is used to train the DR-Phrase LDA, and output potential topic phrases of the text data according to the training results.

进一步地，包括：Further, include:

所述短语提取模块中，包括：The phrase extraction module includes:

候选集统计单元，用于统计文本数据集的双词语或短语搭配共现频率，构成二元短语候选集；The candidate set statistical unit is used to count the co-occurrence frequency of two-word or phrase collocations in the text data set to form a binary phrase candidate set;

二元短语计算单元，用于计算二元短语候选集score(w_i，w_j)分值，选取分值高的前m个构成正式的二元短语，并加入到短语集中，同时在候选集统计单元所述的文本数据集的词语表现形式中更新相应的词语为所得短语；The binary phrase calculation unit is used to calculate the score ( _wi , w _j ) of the candidate set of binary phrases, select the top m with high scores to form the formal binary phrases, and add them to the phrase set. Update the corresponding words in the word representations of the text data set described by the statistical unit as the obtained phrases;

短语集生成单元，迭代候选集统统计单元以及二元短语计算单元，并计算得到的所述二元短语与其他词语或短语搭配组成的n元短语，依次加入到短语集中。The phrase set generation unit, the iterative candidate set statistics unit and the binary phrase calculation unit, and the calculated n-gram phrases composed of the binary phrase and other words or phrases are added to the phrase set in turn.

进一步地，包括：Further, include:

所述模型构建模块中，DR-Phrase LDA模型为概率生成模型，DR-Phrase LDA实现文本生成的过程为：首先文本数据集D中有M篇文本，从超参数为α的狄利克雷分布中取样生成文本d的主题分布θ_d；从主题的多项式分布θ_d中取样生成文本d中词语或短语的主题参数z，主题编号记为z_mn；从超参数为β狄利克雷分布中取样生成主题z相应的词语或短语分布

其中，所有文本共享K个主题；从词语的多项式分布

中采样生成词语或短语t。In the model building module, the DR-Phrase LDA model is a probability generation model, and the process of realizing text generation by DR-Phrase LDA is: first, there are M texts in the text data set D, from the Dirichlet distribution whose hyperparameter is α. Sampling to generate topic distribution θ _d of text d; sampling from multinomial distribution θ _d of topics to generate topic parameters z of words or phrases in text d, and the topic number is denoted as z _mn ; sampling from hyperparameter β Dirichlet distribution to generate distribution of words or phrases corresponding to topic z

where all texts share K topics; from the multinomial distribution of words

Sampling produces a word or phrase t.

进一步地，包括：Further, include:

所述结果输出模块，具体训练包括：The result output module, the specific training includes:

训练单元，用于对所述DR-Phrase LDA模型训练，具体包括：A training unit for training the DR-Phrase LDA model, specifically including:

训练过程：Training process:

在主题k下的计数

count under topic k

结果输出单元：用于根据训练结果输出文本数据潜在的主题短语，具体包括：Result output unit: used to output potential topic phrases of text data according to the training results, including:

根据编号为k的主题所包含term概率

k值统计可得每个主题与term的概率值矩阵

According to the probability of the term contained in the topic numbered k

The probability value matrix of each topic and term can be obtained by k-value statistics

有益效果：本发明采用基于词向量的短语主题模型，该模型在概率主题模型训练中借助汉语言学规律来合理提升模型训练中短语的统计信息，具体采用词向量的方法度量短语成分词之间的关系，来定量反映词在文本整体和短语局部中的语义关系，使得模型精度更高；同时本方法中考虑了短语长度、上下文背景及词性等短语语言学特性，提取的主题短语可读性和一致性更强，本方法所构建的语义表达能力更强的短语主题模型，所得的短语主题更符合语言使用习惯，其结果具有更广泛的应用价值。Beneficial effects: the present invention adopts a phrase topic model based on word vectors, which reasonably improves the statistical information of phrases in model training with the help of Chinese linguistic laws in the training of probabilistic topic models, and specifically adopts the method of word vectors to measure the difference between phrase components and words. to quantitatively reflect the semantic relationship of words in the whole text and local phrases, which makes the model more accurate; at the same time, this method considers phrase linguistic characteristics such as phrase length, context, and part of speech, and the extracted topic phrases are readable. The phrase topic model with stronger semantic expression ability constructed by this method is more in line with language usage habits, and the result has wider application value.

附图说明Description of drawings

图1为本发明所述的方法流程图；Fig. 1 is the method flow chart of the present invention;

图2为本发明所述的DR-Phrase LDA模型示意图；Fig. 2 is the DR-Phrase LDA model schematic diagram of the present invention;

图3为本发明所述的系统结构示意图；3 is a schematic diagram of the system structure according to the present invention;

图4为本发明所述的四种模型在不同主题下的一致性指标值。Figure 4 shows the consistency index values of the four models described in the present invention under different themes.

具体实施方式Detailed ways

下面将结合附图对本发明作进一步地详细描述，显然，所描述的实施例仅仅是本发明一部分实施例，而不是全部的实施例。基于本发明中的实施例，本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其它实施例，都属于本发明保护的范围。The present invention will be further described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, but not all of the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention.

本发明提供一种分析文本数据潜在主题短语的方法，如图1所示，包括以下步骤：The present invention provides a method for analyzing potential topic phrases of text data, as shown in Figure 1, comprising the following steps:

S1采集文本数据集，并对所述文本数据集进行分词，得到文本数据集的词语表现形式。S1 collects a text data set, and performs word segmentation on the text data set to obtain a word representation of the text data set.

S2根据文本数据集的词语提取词语搭配后形成的有效短语，得到未搭配成有效短语的词语与短语集的混合表现形式。S2 extracts valid phrases formed by collocation of words according to the words in the text data set, and obtains a mixed expression form of words and phrase sets that are not matched into valid phrases.

该步骤具体包括：This step specifically includes:

S21统计二元(bigram)共现词频，构成二元短语候选集；S21 counts the co-occurrence word frequencies of bigrams to form a candidate set of bigram phrases;

S22通过计算二元短语候选集score(w_i，w_j)分值，选取分值高的前m个构成正式的二元短语，并加入到短语集中。二元(bigram)短语的score(w_i，w_j)分值具体由如下式计算：S22 calculates the score(w _i , w _j ) of the candidate set of binary phrases, selects the top m with high scores to form formal binary phrases, and adds them to the phrase set. The score( _wi , _wj ) of a bigram phrase is specifically calculated by the following formula:

在公式1中，w_i和w_j是从语料库中捕获的二元短语词对，count(w_i，w_j)为词w_i和w_j的共现频率，count(w_i)为词w_i单词频，count(w_j)为词w_j的单词频，参数η的作用是过滤低频短语。例如η＝5可以过滤掉(w_i，w_j)共现频次低于5的词对。In Equation 1, _wi and w _j are the binary phrase word pairs captured from the corpus, count( _wi , w _j ) is the co-occurrence frequency of words _wi and w _j , and count( _wi ) is the word w _i word frequency, count(w _j ) is the word frequency of word w _j , and the function of parameter η is to filter low-frequency phrases. For example, n=5 can filter out word pairs whose co-occurrence frequency is lower than 5 ( _wi , w _j ).

S23在对步骤S21和S22重复3-4次，通过对二元短语循环合成更长短语，统称为n元短语，本发明的实施例中主要考虑二元短语、三元短语、四元短语及五元短语，在每次合成过程中，不断降低η值以获取更长的短语被抽取。In S23, steps S21 and S22 are repeated 3-4 times, and longer phrases are synthesized by cyclically synthesizing binary phrases, which are collectively referred to as n-gram phrases. In the embodiment of the present invention, binary phrases, ternary phrases, quaternary phrases and Five-element phrases, in each synthesis process, continuously reduce the value of η to obtain longer phrases to be extracted.

S3采用开源包word2vec进行词向量训练后得到该文本数据集的词向量模型。S3 uses the open source package word2vec to train the word vector to obtain the word vector model of the text data set.

S4构建基于词向量的短语主题模型DR-Phrase LDA并求解各个参数。S4 builds a word vector-based phrase topic model DR-Phrase LDA and solves each parameter.

具体包括：S41构建DR-Phrase LDA模型：Specifically: S41 builds the DR-Phrase LDA model:

DR-Phrase LDA是一个概率生成模型，其图模型表示如图2所示。图2中，文本数据集D有M篇文本，N_d表示文本d所含词或短语(term)数量，具体生成文本中某个term的过程包含两个步骤，第一，从文档-主题(document-topic)分布中随机抽取一个分布θ；在主题-term(topic-term)分布中抽取一个分布

后，为该主题抽取一个term t；所有文本共享K个主题；document-topic and topic-term分布均为多项式分布，并且它们均服从先验为对称的狄利克雷分布(Dirichlet distribution)，αandβ为Dirichlet先验的超参数，DR-PhraseLDA模型的文本数据集生成过程可描述为如表1所示。DR-Phrase LDA is a probabilistic generative model, and its graphical model representation is shown in Figure 2. In Figure 2, the text dataset D has M texts, and N _d represents the number of words or phrases (terms) contained in the text d. The specific process of generating a term in the text includes two steps. First, from the document-topic ( A distribution θ is randomly drawn from the document-topic) distribution; a distribution is drawn from the topic-term (topic-term) distribution

After that, a term t is extracted for the topic; all texts share K topics; document-topic and topic-term distributions are multinomial distributions, and they all obey the prior symmetric Dirichlet distribution, α and β are The hyperparameters of the Dirichlet prior, the text dataset generation process of the DR-PhraseLDA model can be described as shown in Table 1.

表1 DR-Phrase LDA模型的文本数据集生成过程Table 1 Text dataset generation process of DR-Phrase LDA model

S42参数求解S42 parameter solver

DR-Phrase LDA模型重点是对文本数据集中每一篇文本d的每一个term的求解其隐变量主题编号z_mn。其他参数可通过间接计算获得。对于隐变量z采用吉布斯采样(GibbsSampling)近似求解方法完成。采样过程主要核心是对短语统计低频问题的处理上。根据波利亚坛子模型原理，在采样过程中来提升短语的权重，主要结合了如下的短语语言学特性：The focus of the DR-Phrase LDA model is to solve the hidden variable topic number z _mn for each term of each text d in the text dataset. Other parameters can be obtained by indirect calculation. For the hidden variable z, the approximate solution method of Gibbs sampling is used to complete. The main core of the sampling process is to deal with the low frequency problem of phrase statistics. According to the principle of the Polya jar model, the weight of phrases is increased in the sampling process, which mainly combines the following linguistic characteristics of phrases:

(1)短语的语义强度与短语长度有关：通常情况下，长短语表达更强的语义；(1) The semantic strength of a phrase is related to the length of the phrase: in general, long phrases express stronger semantics;

(2)短语整体语义与其所处的上下文(context)相关：除部分短语语义和该短语成分词有关，很多情况下，由于语言具有非语义组合性，短语的语义和普通词汇一样需根据上下文背景理解其含义；(2) The overall semantics of a phrase is related to the context in which it is located: except that the semantics of some phrases are related to the constituent words of the phrase, in many cases, due to the non-semantic compositionality of language, the semantics of phrases, like ordinary words, need to be based on the context. understand its meaning;

(3)实词在语义表达中占主要地位：按词的语法功能和性质可分为实词和虚词两大类。虚词一般不含有实在的词义，其基本用途是表达语法关系。虚词主要是指副词、连词、介词、助词、叹词和拟声词等。实词有实在意义，主要指名词、动词、形容词、数词、量词等。无论短语还是词汇，实词为语义表达的核心，对语义分析起主导地位。(3) Content words play a major role in semantic expression: according to their grammatical functions and properties, they can be divided into two categories: content words and function words. Function words generally do not contain real word meanings, and their basic purpose is to express grammatical relations. Function words mainly refer to adverbs, conjunctions, prepositions, auxiliary words, interjections and onomatopoeia. Content words have real meaning, mainly referring to nouns, verbs, adjectives, numerals, quantifiers, etc. Regardless of phrases or vocabulary, content words are the core of semantic expression and play a dominant role in semantic analysis.

根据以上三条特性，结合波利亚坛子模型原理，通过设计恰当的策略来在采样过程中更加体现短语的语义地位。具体采用如下方法：According to the above three characteristics, combined with the principle of the Polya jar model, the semantic status of the phrase can be more reflected in the sampling process by designing an appropriate strategy. Specifically, the following methods are used:

S421采样过程到达某个位置遇到短语w^p，同时提升该短语及其语义相关词计数，具体做法是从训练好的词向量模型中计算得到和有效短语w^p语义相似度最大的前topγ个词集，作为短语w^p的语义上下文背景，则：该短语w^p被抽到主题k后增加k中w^p的计数C(w^p)通过如下公式计算：S421 The sampling process reaches a certain position and encounters a phrase w ^p , and at the same time increases the count of the phrase and its semantically related words. The specific method is to calculate from the trained word vector model the top γ with the largest semantic similarity to the valid phrase w ^p . The word set, as the semantic context background of the phrase w ^p , then: after the phrase w ^p is drawn to topic k, the count C(w ^p ) of w ^p in k is increased by the following formula:

C(w^p)＝μlen(w^p) 公式2C(w ^p )=μlen(w ^p ) Equation 2

其中，μ＞＝1，取整数，调节参数len(w^p)为短语w^p的长度。Wherein, μ>=1, which is an integer, and the adjustment parameter len(w ^p ) is the length of the phrase w ^p .

同时，增加短语语义相关词w_i，在主题k下的计数C(w_i)采用如下公式：At the same time, the semantically related words w _i of the phrase are added, and the count C( _wi ) under topic k adopts the following formula:

公式3中Int()表示取整数，Sim(w_i w^p)表示在词w_i与短语w^p通过词向量计算的相似度。w_i，w^p的词向量分别表示为w_i＝(w_i，0，w_i，1，......，w_i，n)，w^p＝(w^p ₀，w^p ₁......，w^p _n)，其相似度采用余弦公式计算如下：In Equation 3, Int() represents an integer, and Sim( _wi w ^p ) represents the similarity calculated by word vector between word _wi and phrase w ^p . The word vectors _of w _i and w ^p are respectively represented as w _i ⁼ (wi _,0 ,wi _,1 ,...,wi _,n ),wp ⁼ ( ^wp0 , _wp1 . ....., w ^p _n ), the similarity is calculated using the cosine formula as follows:

S422抽样过程到达某个位置遇到某一个词w，如果该词为虚词，则按照正常抽样计数；如果该词为实词，且为某些短语的topγ语义相关词，则同时提升该词及其语义相关短语计数，该词语w被抽到主题k的增加计数词C(w)通过如下公式：S422 The sampling process reaches a certain position and encounters a word w. If the word is a function word, it is counted according to normal sampling; if the word is a real word and is a topγ semantically related word of some phrases, the word and its The count of semantically related phrases, the word w is drawn to topic k's incremental count word C(w) by the following formula:

公式5中，词w为m个短语上下文语义背景。In Equation 5, the word w is the contextual semantic background of m phrases.

同时，增加所有以词w作为语义背景短语

在主题k下的计数

采用如下公式：At the same time, add all the word w as the semantic background phrase

count under topic k

Use the following formula:

公式6中的符号含义同上。公式2和6作用均为计数增加吉布斯采样过程中短语的计数，具有相似之处；但是，考虑到短语自身引起的计数增加与由上下文背景词引起的计数增加应该有所不同，前者的对短语在主题语义贡献更加直接，贡献度更大，因此，在公式6加入了词向量相似度值进行强度度量，确保计数的增加量与语义相关度关联，使其更符合实际。The symbols in Equation 6 have the same meanings as above. Equations 2 and 6 both act to increase the count of phrases in the Gibbs sampling process, and have similarities; however, considering that the count increase caused by the phrase itself and the count increase caused by the context word should be different, the former's count increase should be different. The contribution to the topic semantics of the phrase is more direct and the contribution is greater. Therefore, the word vector similarity value is added to the formula 6 to measure the strength to ensure that the increase in the count is associated with the semantic relevance, making it more realistic.

通过公式2到公式6，新设计的波利亚坛子模型，不仅使得每次抽到的球登记颜色放回去坛子被再次抽到的概率增加，而且使该球相关球被再次抽到的概率也增加。Through formula 2 to formula 6, the newly designed Polya jar model not only increases the probability that the registered color of each drawn ball is put back into the jar to be drawn again, but also increases the probability that the ball related to the ball is drawn again. Increase.

根据上述策略得到了DR-Phrase LDA模型求解文本数据集中每一个term的主题编号的Gibbs采样公式可表达为如下：According to the above strategy, the DR-Phrase LDA model is obtained. The Gibbs sampling formula for solving the topic number of each term in the text dataset can be expressed as follows:

公式7中，t表示词或者短语，为抽样过程中文档d当前位置term，k表示被分配的主题编号，K为预设的主题个数，N_t为语料库的总term数，n_k/d表示文档d中主题k的计数，n_t/k表示主题k中t的计数，n_r表示t的语义相关term的个数，n_tr表示t的相关term个数，α和β为Dirichlet超参数。In formula 7, t represents a word or phrase, is the current position term of document d in the sampling process, k represents the assigned topic number, K is the preset number of topics, N _t is the total number of terms in the corpus, n _k/d represents the count of topic k in document d, n _t/k represents the count of t in topic k, n _r represents the number of semantically related terms of t, n _tr represents the number of related terms of t, α and β are Dirichlet hyperparameters .

DR-Phrase LDA模型的Gibbs Sampling主要通过不同策略下构建的公式2-6来改进经典LDA Gibbs Sampling过程中，term在不同情况下分配主题的计数，主要体现在公式7的后半部分。短语到词和词到短语采样过程中有四种情况下有不同的计数提升策略，为了表述方便，统一用公式7的简化符号表达，实际程序实现过程中，需根据具体情况进行判断计数。The Gibbs Sampling of the DR-Phrase LDA model mainly improves the classical LDA Gibbs Sampling process through formulas 2-6 constructed under different strategies. In the process of Gibbs Sampling, the term is assigned the count of topics in different situations, which is mainly reflected in the second half of formula 7. Phrase-to-word and word-to-phrase sampling processes have four different counting promotion strategies in four cases. For convenience of expression, the simplified notation of Equation 7 is used uniformly. In the actual program implementation process, it is necessary to judge and count according to the specific situation.

S423其他参数求解：S423 other parameters to solve:

文本数据集中的某一篇文本d中潜在的主题比例θ_d通过以下公式计算：The potential topic ratio θd in a text _d in the text dataset is calculated by the following formula:

编号为k的主题所包含term概率

值通过以下公式计算：The term probability contained in topic k

The value is calculated by the following formula:

S5对所述DR-Phrase LDA训练，并根据训练结果输出文本数据潜在的主题短语。具体包括：S5 trains the DR-Phrase LDA, and outputs potential topic phrases of the text data according to the training results. Specifically include:

S51根据下述算法训练DR-Phrase LDA模型S51 trains the DR-Phrase LDA model according to the following algorithm

输入：S2所得的文本数据集的词与短语混合表示形式；S3所得的该文本数据集的词向量模型；α(狄利克雷分布超参数)，β(狄利克雷分布超参数)，K(主题个数)；IterNum(迭代次数)；γ(短语w^p语义相似度最大的前top取值)；μ(短语w^p的长度计数调节参数)。Input: the mixed representation of words and phrases of the text data set obtained by S2; the word vector model of the text data set obtained by S3; α (Dirichlet distribution hyperparameter), β (Dirichlet distribution hyperparameter), K( The number of topics); IterNum (the number of iterations); γ (the top value of the phrase w ^p with the largest semantic similarity); μ (the length count adjustment parameter of the phrase w ^p ).

输出：文本数据集中的所有term的主题编号二维矩阵z。Output: A two-dimensional matrix z of topic numbers for all terms in the text dataset.

训练过程：Training process:

(1)初始化：对文本数据集中每一篇文本中的每一个term，从(1～K)中随机一个抽取主题编号；(1) Initialization: For each term in each text in the text dataset, randomly select a topic number from (1~K);

(2)Gibbs采样过程(2) Gibbs sampling process

按照公式6增加t语义相关短语分配到主题k的计数；其他统计量做相应增加；Increase the count of t semantically related phrases assigned to topic k according to formula 6; other statistics are increased accordingly;

S52根据训练结果输出文本数据潜在的主题短语S52 outputs the potential topic phrases of the text data according to the training results

S521根据公式8统计可得每一篇文本的主题比例概率值矩阵θ进行输出；S521 according to formula 8 statistics can obtain the subject ratio probability value matrix θ of each text for output;

S522根据公式9统计可得每个主题与term的概率值矩阵

并进行输出；S522 According to formula 9, the probability value matrix of each topic and term can be obtained by statistics

and output;

基于相同的技术构思，本发明提供一种分析文本数据潜在主题短语的方法实现的系统，如图3所示，所述系统包括：Based on the same technical concept, the present invention provides a system implemented by a method for analyzing potential topic phrases in text data, as shown in FIG. 3 , the system includes:

预处理模块1，用于采集文本数据集，并对所述文本数据集进行分词，得到文本数据集的词语表现形式；A preprocessing module 1 is used to collect a text data set, and perform word segmentation on the text data set to obtain a word representation of the text data set;

短语提取模块2，用于根据文本数据集的词语提取词语搭配后形成的有效短语，得到未搭配成有效短语的词语与短语集的混合表现形式。The phrase extraction module 2 is used for extracting valid phrases formed by collocation of words according to the words in the text data set, and obtaining a mixed expression form of the words and phrase sets that are not matched into valid phrases.

所述短语提取模块2，具体包括：The phrase extraction module 2 specifically includes:

候选集统计单元21，用于统计二元短语中搭配的词语共现词频，构成二元短语候选集；The candidate set statistics unit 21 is used to count the co-occurrence word frequencies of the words collocated in the binary phrases to form a binary phrase candidate set;

二元短语计算单元22，用于计算二元短语候选集score(w_i，w_j)分值，选取分值高的前m个构成正式的二元短语，并加入到短语集中。The binary phrase calculation unit 22 is used to calculate the score ( _wi , w _j ) of the candidate binary phrase set, select the top m with high scores to form a formal binary phrase, and add it to the phrase set.

二元(bigram)短语的score(w_i，w_j)分值具体由如下式计算：The score( _wi , _wj ) of a bigram phrase is specifically calculated by the following formula:

短语集生成单元23，用于将迭代计算得到的所述二元短语与词语搭配组成的三元短语、三元短语与词与搭配组成的四元短语、四元短语与词语搭配组成的五元短语依次加入到短语集中；所述有效短语包括二元短语、三元短语、四元短语以及五元短语。The phrase set generating unit 23 is used for combining the binary phrases obtained by the iterative calculation with the ternary phrases formed by collocations with words, the quaternary phrases formed by the ternary phrases with words and collocations, and the quintuple phrases formed by collocations of the quaternary phrases and words Phrases are sequentially added to the phrase set; the valid phrases include binary phrases, tri-grams, quaternary phrases, and pentaphrases.

词向量构建模块3，用于对混合表现形式的文本数据集采用开源包word2vec进行词向量训练，得到该文本数据集的词向量模型。The word vector building module 3 is used for word vector training using the open source package word2vec for the text data set with mixed expressions, and the word vector model of the text data set is obtained.

模型构建模块4，用于构建基于词向量的短语主题模型DR-Phrase LDA并求解各个参数。The model building module 4 is used to build a word vector-based phrase topic model DR-Phrase LDA and solve each parameter.

DR-Phrase LDA模型构建单元41，DR-Phrase LDA是一个概率生成模型，文本数据集D有M篇文本，N_d表示文本d所含词或短语(term)数量，具体生成文本中某个term的过程包含两个步骤，第一，从文档-主题(document-topic)分布中随机抽取一个分布θ；在主题-term(topic-term)分布中抽取一个分布

后，为该主题抽取一个term t；所有文本共享K个主题；document-topic and topic-term分布均为多项式分布，并且它们均服从先验为对称的狄利克雷分布(Dirichlet distribution)，αandβ为Dirichlet先验的超参数，DR-PhraseLDA模型的文本数据集生成过程可描述为如表1所示。DR-Phrase LDA model building unit 41, DR-Phrase LDA is a probability generation model, the text data set D has M texts, N _d represents the number of words or phrases (terms) contained in the text d, and specifically generates a certain term in the text The process consists of two steps. First, a distribution θ is randomly selected from the document-topic (document-topic) distribution; a distribution θ is selected from the topic-term (topic-term) distribution.

参数求解单元42：DR-Phrase LDA模型重点是对文本数据集中每一篇文本d的每一个term的求解其隐变量主题编号z_mn。其他参数可通过间接计算获得。对于隐变量z采用吉布斯采样(Gibbs Sampling)近似求解方法完成。采样过程主要核心是对短语统计低频问题的处理上。根据波利亚坛子模型原理，在采样过程中来提升短语的权重，主要结合了如下的短语语言学特性：Parameter solving unit 42: The DR-Phrase LDA model focuses on solving the hidden variable topic number z _mn of each term of each text d in the text data set. Other parameters can be obtained by indirect calculation. For the hidden variable z, the approximate solution method of Gibbs sampling is used to complete it. The main core of the sampling process is to deal with the low frequency problem of phrase statistics. According to the principle of the Polya jar model, the weight of phrases is increased in the sampling process, which mainly combines the following linguistic characteristics of phrases:

当采样过程到达某个位置遇到短语w^p，同时提升该短语及其语义相关词计数，具体做法是从训练好的词向量模型中计算得到和有效短语w^p语义相似度最大的前topγ个词集，作为短语w^p的语义上下文背景，则：该短语w^p被抽到主题k后增加k中w^p的计数C(w^p)通过如下公式计算：When the sampling process reaches a certain position and encounters a phrase w ^p , the count of the phrase and its semantically related words is increased at the same time. The specific method is to calculate from the trained word vector model the top γ with the largest semantic similarity to the valid phrase w ^p . The word set, as the semantic context background of the phrase w ^p , then: after the phrase w ^p is drawn to topic k, the count C(w ^p ) of w ^p in k is increased by the following formula:

C(w^p)＝μlen(w^p) 公式2C(w ^p )=μlen(w ^p ) Equation 2

若抽样过程到达某个位置遇到某一个词w，如果该词为虚词，则按照正常抽样计数；如果该词为实词，且为某些短语的topγ语义相关词，则同时提升该词及其语义相关短语计数，该词语w被抽到主题k的增加计数词C(w)通过如下公式：If the sampling process reaches a certain position and encounters a certain word w, if the word is a function word, it will be counted according to normal sampling; if the word is a real word and is a topγ semantically related word of some phrases, the word and its associated words will be promoted at the same time. The count of semantically related phrases, the word w is drawn to topic k's incremental count word C(w) by the following formula:

同时，增加所有以词w作为语义背景短语

在主题k下的计数

count under topic k

Use the following formula:

编号为k的主题所包含term概率

值通过以下公式计算：The term probability contained in topic k

The value is calculated by the following formula:

结果输出模块5，用于对所述DR-Phrase LDA训练，并根据训练结果输出文本数据潜在的主题短语。The result output module 5 is used to train the DR-Phrase LDA, and output potential topic phrases of the text data according to the training results.

具体训练包括：Specific training includes:

训练单元51，用于对所述DR-Phrase LDA模型训练，具体包括：The training unit 51 is used to train the DR-Phrase LDA model, and specifically includes:

输入：根据所得的文本数据集的词与短语混合表示形式；文本数据集的词向量模型；α(狄利克雷分布超参数)，β(狄利克雷分布超参数)，K(主题个数)；IterNum(迭代次数)；γ(短语w^p语义相似度最大的前top取值)；μ(短语w^p的长度计数调节参数)。Input: mixed representation of words and phrases based on the resulting text dataset; word vector model for the text dataset; α (Dirichlet distribution hyperparameter), β (Dirichlet distribution hyperparameter), K (number of topics) ; IterNum (the number of iterations); γ (the top value of the phrase w ^p with the largest semantic similarity); μ (the length count adjustment parameter of the phrase w ^p ).

训练过程：初始化：对文本数据集中每一篇文本中的每一个term，从(1～K)中随机一个抽取主题编号；Training process: Initialization: For each term in each text in the text dataset, randomly select a topic number from (1~K);

Gibbs采样过程Gibbs sampling process

结果输出单元52，用于根据训练结果输出文本数据潜在的主题短语，具体包括：The result output unit 52 is used to output potential topic phrases of the text data according to the training results, specifically including:

根据公式8统计可得每一篇文本的主题比例概率值矩阵θ进行输出；According to formula 8, the subject ratio probability value matrix θ of each text can be obtained for output;

根据公式9统计可得每个主题与term的概率值矩阵

并进行输出；According to formula 9, the probability value matrix of each topic and term can be obtained

and output;

为了验证本发明的有效性，用具体数据进行仿真实验：In order to verify the validity of the present invention, carry out simulation experiment with concrete data:

试验选用国家标准——中医临床诊疗术语疾病部分(GBT16751.1-1997)中对中医疾病的解释文本，作为数据，共计20类，978篇疾病的解释文本，其相关信息见表2所示。The test selected the interpretive texts of TCM diseases in the national standard - the disease section of clinical diagnosis and treatment of traditional Chinese medicine (GBT16751.1-1997), as data, a total of 20 categories and 978 explanatory texts of diseases, and the relevant information is shown in Table 2.

表2中医临床诊疗术语疾病类别Table 2 TCM clinical diagnosis and treatment terms disease categories

采用python的开源的分词软件jieba 0.39版本对文本数据集进行分词。并提取文本数据集中的有效短语后得到文本数据集的词与短语混合表示形式；然后，采用开源包word2vec进行词向量训练后得到该文本数据集的词向量模型；Using python's open source word segmentation software jieba version 0.39 to segment the text data set. And extract the effective phrases in the text data set to obtain the mixed representation of words and phrases of the text data set; then, use the open source package word2vec for word vector training to obtain the word vector model of the text data set;

具体的：采用Google团队于2013并推出了基于神经网络的词向量(DistributedRepresentation)开源包word2ve对文本数据集进行词向量训练。训练中指定训练集中词的最小出现频次参数：min_count＝5；词向量维度：size＝50。其他参数均采用默认参数。Specifically: The Google team launched the word vector (Distributed Representation) open source package word2ve based on neural network in 2013 to train the word vector on the text data set. During training, specify the minimum frequency of occurrence of words in the training set: min_count=5; word vector dimension: size=50. Other parameters are default parameters.

最后，训练DR-Phrase LDA模型，并输出最终的主题短语。Finally, train the DR-Phrase LDA model and output the final topic phrase.

为了评估本发明提供的DR-Phrase LDA主题模型主题质量，以下两种模型作为比较基准：1)标准LDA模型下文简称LDA)；2)基于短语袋的LDA(下文简称Phrase-LDA)；3)接近本发明的GPU模型(Generalized Pólya Urn，下文简称GPU-LDA)。采用评价可读性和一致性的指标进行评价，计算公式入下：In order to evaluate the topic quality of the DR-Phrase LDA topic model provided by the present invention, the following two models are used as comparison benchmarks: 1) standard LDA model hereinafter referred to as LDA); 2) LDA based on phrase bag (hereinafter referred to as Phrase-LDA); 3) It is close to the GPU model of the present invention (Generalized Pólya Urn, hereinafter referred to as GPU-LDA). The evaluation is carried out using indicators for evaluating readability and consistency, and the calculation formula is as follows:

其中，D(v)表示词v出现过的文档频率；D(v，v’)表示词v和v’共现的文档频率；M表示主题t的概率最高的前M个词；

表示主题t的一些列概率高的前M个词。Among them, D(v) represents the document frequency in which the word v has appeared; D(v, v') represents the document frequency in which the word v and v'co-occur; M represents the top M words with the highest probability of topic t;

A list of top M words with high probability representing topic t.

试验中，设置α＝50/K，andβ＝0.01，为了节省效率和计算量仅提取了常用的二元和三元短语。图4展示了LDA、Phrase-LDA、GPU-LDA及本研究提出的DR-Phrase LDA四种模型根据上述指标计算的各个主题的一致性指标。纵坐标为四种模型在各类别的一致性指标值Topic Coherence，横坐标1-20表示实验数据集的20个疾病类别。从图中可见，Phrase-LDA的各类一致性指标值最差，大约在-2400至-2600之间；LDA模大约都在-2200至-2400之间；而GPU-LDA和DR-Phrase LDA大约在-1800至-2100；应该注意的是Topic Coherence指标为负值，其绝对值越小主题质量更高。In the experiment, α=50/K, and β=0.01, and only common binary and ternary phrases are extracted in order to save efficiency and computational complexity. Figure 4 shows the consistency indicators of each topic calculated by the four models of LDA, Phrase-LDA, GPU-LDA and DR-Phrase LDA proposed in this study according to the above indicators. The ordinate is Topic Coherence, the consistency index value of the four models in each category, and the abscissa 1-20 represents the 20 disease categories of the experimental data set. As can be seen from the figure, Phrase-LDA has the worst consistency index values, ranging from -2400 to -2600; LDA modes are approximately between -2200 and -2400; while GPU-LDA and DR-Phrase LDA Around -1800 to -2100; it should be noted that the Topic Coherence metric is negative, and the smaller the absolute value, the higher the topic quality.

从图4可见，GPU-LDA和DR-Phrase LDA的主题质量明显好于LDA模型和Phrase-LDA。虽然Phrase-LDA模型以“短语袋”(Phrase-of-Word，POW)替代传统“词袋”(Bag-of-Word，BOW)；但是传统主题模型理论对短语整体赋予相同主题进行共现捕获中会引起短语共现信息下降。因此，在Phrase-LDA模型中短语的共性信息远低于单个词，导致大量短语在主题中的概率极低，所以Phrase-LDA的主题一致性低于LDA，对主题的可读性和一致性并无帮助。As can be seen from Figure 4, the topic quality of GPU-LDA and DR-Phrase LDA is significantly better than that of LDA model and Phrase-LDA. Although the Phrase-LDA model replaces the traditional "Bag-of-Word" (BOW) with "Phrase-of-Word" (POW); but the traditional topic model theory assigns the same topic to the whole phrase for co-occurrence capture will cause a decrease in phrase co-occurrence information. Therefore, the common information of phrases in the Phrase-LDA model is much lower than that of a single word, resulting in a very low probability of a large number of phrases in the topic, so the topic consistency of Phrase-LDA is lower than that of LDA, and the readability and consistency of the topic didn't help.

相反，GPU-LDA及本研究提出的DR-Phrase LDA考虑了短语强语义表达特性，在模型中特意提升了短语及其成分词的统计量；因此，图4反映出二者主题质量均得到了提高。GPU-LDA模型对短语及其成分同时等量提升其统计量；相比而言DR-Phrase LDA模型对短语及其成分词统计量的提升过程中考虑了其成分词的不同词性和词的语义，尤其很多短语的成分词语义完全不同于短语整体语义，该处理策略更符合自然语言实际。在图4中也反映出了DR-Phrase LDA模型的主题质量优于GPU-LDA模型。On the contrary, GPU-LDA and the DR-Phrase LDA proposed in this study consider the strong semantic expression characteristics of phrases, and deliberately improve the statistics of phrases and their component words in the model; therefore, Figure 4 reflects that the topic quality of both has been improved. improve. The GPU-LDA model improves the statistics of phrases and their components at the same time; in contrast, the DR-Phrase LDA model takes into account the different parts of speech and the semantics of the words in the process of improving the statistics of phrases and their components. , especially the semantics of the constituent words of many phrases are completely different from the overall semantics of the phrases, and this processing strategy is more in line with the reality of natural language. It is also reflected in Figure 4 that the topic quality of the DR-Phrase LDA model is better than that of the GPU-LDA model.

表3列举了四种模型4个主题训练结果概率最高的Top 5 Term。从表中内容可见，四种模型训练结果均能从整体上分辨各主题的中心语义，LDA和Phrase LDA模型的Top 5个Term结果均为单词；GPU-LDA模型短语和单词都有，但单词相对多点，而DR-Phrase LDA正好相反，短语多于单词。LDA是一个完全“词袋”模型；Phrase LDA虽然基于短语袋，但由于短语和一般词的统计地位相同导致大量短语共现信息较少，因而难以获得较高的概率，未能体现短语的强语义块的作用。GPU-LDA模型采用更一般的罐子模型(GPU)来增加成分词和短语之间的联系。Table 3 lists the Top 5 Term with the highest probability of training results for the four topics of the four models. As can be seen from the table, the training results of the four models can distinguish the central semantics of each topic as a whole. The Top 5 Term results of the LDA and Phrase LDA models are all words; the GPU-LDA model has both phrases and words, but words Relatively many points, while DR-Phrase LDA is just the opposite, with more phrases than words. LDA is a complete "word bag" model; Phrase LDA is based on phrase bag, but due to the same statistical status of phrases and general words, a large number of phrases have less co-occurrence information, so it is difficult to obtain a high probability and fails to reflect the strong phrases. The role of semantic blocks. The GPU-LDA model adopts a more general jar model (GPU) to increase the connections between constituent words and phrases.

根据GPU的原理，观察到词再重新放入罐子中来提升短语的概率，这种模型的可理解为富者更富。该方法采取两种策略构建短语主题模型：当对一个词分配主题的时，包含该词的所有短语的计数增加；当对一个短语分配主题的时，短语的所有成分词计数增加。显然，自然语言中，尤其是汉语中同义词和多义词现象非常普遍，这种以简单增加计数的方法来增强短语的概率会引起不符合现实的过分增强问题。GPU-LDA不仅提升了短语的统计量，而且提升了短语成分的统计量，在GPU-LDA的模型试验结果中发现，短语的成分词提升结果超过短语，如在主题Topic 5的结果中“脑部疾病”的成分词：“疾病”概率高于“脑部疾病”。According to the principle of GPU, observed words are put back into jars to improve the probability of phrases. This model can be understood as the rich get richer. The method adopts two strategies to construct a phrase topic model: when a word is assigned a topic, the counts of all phrases containing the word are increased; when a phrase is assigned a topic, the counts of all the constituent words of the phrase are increased. Obviously, the phenomenon of synonyms and polysemous words is very common in natural language, especially in Chinese, and the probability of enhancing phrases by simply increasing the count will cause the problem of over-enhancement that does not conform to reality. GPU-LDA not only improves the statistics of phrases, but also improves the statistics of phrase components. In the model test results of GPU-LDA, it is found that the component words of phrases are improved more than phrases. For example, in the results of topic Topic 5, "brain" "Brain disease": the probability of "disease" is higher than that of "brain disease".

表3四种模型部分主题分析结果Top5 Term示例Table 3 Examples of Top5 Term analysis results of some of the four models

本方法提出的基于词向量的DR-Phrase LDA短语主题模型，采用词向量的方法度量短语成分词之间的关系，来定量反映单词在文档整体和短语局部中的语义贡献，以提高短语主题模型的分析结果的质量。DR-Phrase LDA模型结果中，短语的整体语义一般强于其成分词，如Topic 5中“精神刺激”概率高于“精神”，Topic 29中“肾气”，概率高于“肾”。The word vector-based DR-Phrase LDA phrase topic model proposed by this method uses the word vector method to measure the relationship between phrase components to quantitatively reflect the semantic contribution of words in the whole document and the phrase part, so as to improve the phrase topic model. the quality of the analysis results. In the results of the DR-Phrase LDA model, the overall semantics of phrases are generally stronger than their constituent words. For example, the probability of "mental stimulation" in Topic 5 is higher than that of "spirit", and the probability of "shen qi" in Topic 29 is higher than that of "kidney".

本领域内的技术人员应明白，本申请的实施例可提供为方法、系统、或计算机程序产品。因此，本申请可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且，本申请可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。As will be appreciated by those skilled in the art, the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

本申请是参照根据本申请实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器，使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the present application. It will be understood that each flow and/or block in the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general purpose computer, special purpose computer, embedded processor or other programmable data processing device to produce a machine such that the instructions executed by the processor of the computer or other programmable data processing device produce Means for implementing the functions specified in a flow or flow of a flowchart and/or a block or blocks of a block diagram.

这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中，使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品，该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory result in an article of manufacture comprising instruction means, the instructions The apparatus implements the functions specified in the flow or flow of the flowcharts and/or the block or blocks of the block diagrams.

这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上，使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理，从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。These computer program instructions can also be loaded on a computer or other programmable data processing device to cause a series of operational steps to be performed on the computer or other programmable device to produce a computer-implemented process such that The instructions provide steps for implementing the functions specified in the flow or blocks of the flowcharts and/or the block or blocks of the block diagrams.

最后应当说明的是：以上实施例仅用以说明本发明的技术方案而非对其限制，尽管参照上述实施例对本发明进行了详细的说明，所属领域的普通技术人员应当理解：依然可以对本发明的具体实施方式进行修改或者等同替换，而未脱离本发明精神和范围的任何修改或者等同替换，其均应涵盖在本发明的权利要求保护范围之内。Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention rather than to limit them. Although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: the present invention can still be Modifications or equivalent replacements are made to the specific embodiments of the present invention, and any modifications or equivalent replacements that do not depart from the spirit and scope of the present invention shall be included within the protection scope of the claims of the present invention.

Claims

1. a method for analyzing text data potential topic phrases, it is characterised in that the method comprises:

(1) Collecting a text data set, and performing word segmentation on the text data set to obtain the word representation of the text data set;

(2) According to the words in the text data set, the effective phrases formed by the collocation of the words are extracted, and the mixed expression form of the words that are not collocated into the effective phrases and the phrase set is obtained;

(3) The corresponding word vector model is obtained after word vector training is performed on the text data set of mixed expressions;

(4) Construct a phrase topic model DR-Phrase LDA based on word vector and solve each parameter;

In the step (4), the DR-Phrase LDA model is a probability generation model, and the process of realizing text generation by DR-Phrase LDA is: first, there are M texts in the text data set D, from the Dirichlet distribution whose hyperparameter is α The topic distribution θ _d of the generated text d is sampled from ; the topic parameter z of the words or phrases in the text d is generated by sampling from the multinomial distribution θ _d of the topic, and the topic number is denoted as z _mn ; the hyperparameter is β Dirichlet distribution. Generate a distribution of words or phrases corresponding to topic z

where all texts share K topics; from the multinomial distribution of words

medium sampling to generate a word or phrase t;

The calculation of the topic parameter z of the word or phrase is completed using the Gibbs sampling approximate solution method, which is expressed as:

Among them, the word or valid phrase in the current position of document d in the sampling process is denoted as t, denoted as term, k is the assigned topic number, K is the preset number of topics, N _t is the total number of terms in the text data set, n _k/d represents the count of topic k in document d, n _t/k represents the count of t in topic k, n _r represents the number of semantically related terms of t, n _tr represents the number of related terms of t, α and β are Dirichlet hyperparameters, α, β are the vectors corresponding to α and β, respectively;

The potential topic ratio θd in a certain text _d in the text dataset is expressed as:

The term probability contained in topic k

The value is represented as:

(5) the DR-Phrase LDA is trained, and the potential subject phrase of the text data is output according to the training result;

In step (5), including:

(51) to described DR-Phrase LDA model training, the training step comprises:

Input: a text dataset of mixed expressions of words and phrase sets that are not matched into valid phrases, the trained word vector model and the Dirichlet distribution hyperparameter α of the DR-Phrase LDA model, and the Dirichlet distribution hyperparameter β , the number of topics K; the number of iterations IterNum; the first γ corresponding values of the phrase w ^p with the largest semantic similarity; the length count adjustment parameter μ of the phrase w ^p ;

Training process:

Traverse the number corresponding to the word or valid phrase t in each text in the text dataset;

If t is a valid phrase, increase the count C(w ^p ) after the phrase w ^p is drawn to topic k; at the same time, traverse the first γ word sets with the largest semantic similarity with w ^p , and increase all the words in the word set accordingly. word assigned to count C( _wi ) under topic k;

Otherwise, if t is a content word and there is a semantically related phrase, then increase the count C(w) of the word t in topic k, and add the word as a valid phrase for the semantic context

count under topic k

Otherwise, if t is a function word, the corresponding count is decremented by 1;

Iterate the above steps to the set number of times IterNum;

Output: A two-dimensional matrix z of topic numbers for all words and valid phrases in the text dataset;

(52) Output potential topic phrases of text data according to the training results, including:

According to the potential topic ratio θ _d in a certain text d, the probability value matrix θ of the topic ratio of each text can be obtained:

θ={θ _(m,k) ,m∈{0,..M-1},k∈{0,..K-1}}

Among them, M is the total number of texts in the text dataset; K is the number of topics;

According to the probability of the term contained in the topic numbered k

Value statistics can get the probability value matrix of each topic and term

Among them, N _t is the total number of terms in the text dataset;

The valid phrase w ^p is drawn to the count C(w ^p ) of topic k, expressed as:

C(w ^p )=μlen(w ^p )

Among them, μ is an integer greater than or equal to 1, and the adjustment parameter len(w ^p ) is the length of the valid phrase w ^p ;

The phrase semantically related words _wi corresponding to the valid phrase w ^p are assigned to the count C( _wi ) under topic k, which is expressed as:

Among them, Int() represents an integer, Sim( _wi w ^p ) represents the similarity between word _wi and valid phrase w ^p calculated by word vector;

The count C(w) of the word w in topic k is expressed as:

in,

Corresponding word vector

similarity between

Valid phrases with word w as semantic background

count under topic k

Expressed as:

2. The method for analyzing text data potential topic phrases according to claim 1, wherein in the step (2), the effective phrases include n-gram phrases, and n-grams are the number of words that form a phrase, and the Describe the effective phrases formed by extracting word collocations according to the words in the text data set, including:

(21) Count the co-occurrence frequencies of two-word or phrase collocations in the text data set to form a binary phrase candidate set;

(22) Calculate the score(w _i ,w _j ) score of the candidate set of binary phrases, select the top m with high scores to form formal binary phrases, and add them to the phrase set. Update the corresponding words in the word representations of the text data set to be the obtained phrases;

(23) An n-gram phrase formed by collocation of the binary phrase and other words or phrases calculated in the iterative steps (21) and (22) is added to the phrase set in turn.

3. A system implemented by the method for analyzing potential topic phrases of text data according to any one of claims 1-2, wherein the system comprises:

a preprocessing module, used to collect a text data set, and perform word segmentation on the text data set to obtain a word representation of the text data set;

The phrase extraction module is used for extracting valid phrases formed by collocation of words according to the words in the text data set, and obtaining a mixed expression form of the words and phrase sets that are not matched into valid phrases;

The word vector building module is used to obtain the corresponding word vector model after word vector training is performed on the text data set with mixed representation;

The model building module is used to build a word vector-based phrase topic model DR-Phrase LDA and solve each parameter;

The DR-Phrase LDA model is a probabilistic generation model. The process of DR-Phrase LDA to realize text generation is as follows: First, there are M texts in the text data set D, and the topic distribution of the text d is sampled from the Dirichlet distribution with the hyperparameter α. θ _d ; Sampling from the multinomial distribution θ _d of the topic to generate the topic parameter z of the words or phrases in the text d, and the topic number is denoted as z _mn ; Sampling from the hyperparameter β Dirichlet distribution to generate the corresponding words or phrases of the topic z distributed

where all texts share K topics; from the multinomial distribution of words

medium sampling to generate a word or phrase t;

The term probability contained in the topic numbered k

The value is represented as:

A result output module, for training the DR-Phrase LDA, and outputting potential topic phrases of text data according to the training results;

The result output module, the specific training includes:

A training unit for training the DR-Phrase LDA model, specifically including:

Training process:

If t is a valid phrase, increase the count C(w ^p ) after the phrase w ^p is drawn to topic k; at the same time, traverse the first γ word set with the largest semantic similarity with w ^p , and increase all the words in the word set accordingly. word assigned to count C( _wi ) under topic k;

count under topic k

Iterate the above steps to the set number of times IterNum;

Result output unit: used to output potential topic phrases of text data according to the training results, including:

θ={θ _(m,k) ,m∈{0,..M-1},k∈{0,..K-1}}

According to the probability of the term contained in the topic numbered k

Value statistics can get the probability value matrix of each topic and term

Among them, N _t is the total number of terms in the text dataset;

The valid phrase w ^p is drawn to the count C(w ^p ) of topic k, expressed as:

C(w ^p )=μlen(w ^p )

The count C(w) of the word w in topic k is expressed as:

in,

Corresponding word vector

similarity between

Valid phrases with word w as semantic background

count under topic k

Expressed as:

4. The system according to claim 3, wherein, in the phrase extraction module, comprising:

The candidate set statistical unit is used to count the co-occurrence frequency of two-word or phrase collocations in the text data set to form a binary phrase candidate set;

The binary phrase calculation unit is used to calculate the score ( _wi , w _j ) of the candidate set of binary phrases, select the top m with high scores to form the formal binary phrases, and add them to the phrase set. Update the corresponding words in the word representations of the text data set described by the statistical unit as the obtained phrases;

The phrase set generation unit, the iterative candidate set statistics unit and the binary phrase calculation unit, and the calculated n-gram phrases composed of the binary phrase and other words or phrases are added to the phrase set in turn.