
CN111178048B - Topic extraction method and device based on smooth phrase topic model - Google Patents


Info

Publication number
CN111178048B
CN111178048B (application CN201911421842.3A)
Authority
CN
China
Prior art keywords
phrase
frequent
phrases
dataset
topic
Prior art date
Legal status: Active
Application number
CN201911421842.3A
Other languages
Chinese (zh)
Other versions
CN111178048A (en)
Inventor
郭佳
张景鹏
徐路
李油
赵小琦
Current Assignee
Weibo Internet Technology China Co Ltd
Original Assignee
Weibo Internet Technology China Co Ltd
Priority date
Filing date
Publication date
Application filed by Weibo Internet Technology China Co Ltd
Priority to CN201911421842.3A
Publication of CN111178048A
Application granted
Publication of CN111178048B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract


An embodiment of the present invention provides a topic extraction method and device based on a smooth phrase topic model, including: extracting the effective words in the data set to be processed to obtain a preprocessed data set; extracting frequent phrases from the preprocessed data set with the Apriori association algorithm to form a frequent phrase data set; according to the Gaussian distribution characteristics of the occurrence frequency of frequent phrases, combining adjacent frequent phrases in the preprocessed data set that meet preset requirements into new phrases and adding the new phrases to the frequent phrase data set to form a candidate phrase data set; and analyzing the candidate phrase data set with the SPLDA smooth phrase topic model to obtain topic phrases, from which the corresponding topics are formed. Analyzing the candidate phrase data set with the smooth phrase topic model yields topic phrases; forming topics from these phrases improves the readability of the topics and expresses their real information more accurately.

Description

Topic extraction method and device based on smooth phrase topic model
Technical Field
The invention relates to the field of data mining, and in particular to a topic extraction method and device based on a smooth phrase topic model.
Background
With the rapid development of the Internet, social platforms such as Weibo (microblog), WeChat, and Toutiao have become mainstream media through which users transmit information and express opinions. Weibo attracts more and more users thanks to its open platform, timely information, concise content, and wide coverage, and has gradually become an important platform for netizens to read news, interact with others, post comments, participate in discussions of social events, and reflect public opinion.
Common microblog hot topics are typically described using manually labeled phrases, as shown in table 1.
TABLE 1 microblog hot search topic
In carrying out the present invention, the applicant has found that at least the following problems exist in the prior art:
most existing topic discovery methods extract features with a bag-of-words model. Because the association information among the words in a phrase is not considered, part of the effective information is lost; when such results are used to represent topics, the topic expression is poorly readable and ambiguous, and cannot accurately reflect the real information of the topic. For example, mining the data of topic 1 yields results such as "sun", "Korea", and "Song Huiqiao", while a phrase-level description such as "Descendants of the Sun" is hard to obtain, so topic comprehensiveness needs improvement.
Disclosure of Invention
The embodiment of the invention provides a topic extraction method and device based on a smooth phrase topic model, which are used for analyzing a candidate phrase data set through an SPLDA smooth phrase topic model to obtain a topic phrase, and forming a corresponding topic through the topic phrase, so that the readability of the topic is improved, and the true information of the topic is expressed more accurately.
In order to achieve the above object, in one aspect, an embodiment of the present invention provides a topic extraction method based on a smooth phrase topic model, including:
extracting effective words in the data set to be processed to obtain a preprocessed data set;
extracting frequent phrases from the preprocessing data set through an Apriori association algorithm to form a frequent phrase data set, and updating the frequent phrase data set through the Apriori association algorithm; according to the Gaussian distribution characteristic of the occurrence frequency of the frequent phrases, combining adjacent frequent phrases meeting preset requirements in the preprocessing data set into new phrases, and adding the new phrases into the frequent phrase data set to form a candidate phrase data set;
and analyzing the candidate phrase data set through the SPLDA smooth phrase topic model to obtain topic phrases, and forming corresponding topics through the topic phrases.
In another aspect, an embodiment of the present invention provides a topic extraction apparatus based on a smooth phrase topic model, including:
a preprocessing module, configured to extract the effective words in the data set to be processed to obtain a preprocessed data set;
a phrase extraction module, configured to extract frequent phrases from the preprocessed data set through the Apriori association algorithm to form a frequent phrase data set and update the frequent phrase data set through the Apriori association algorithm, and, according to the Gaussian distribution characteristic of the occurrence frequency of the frequent phrases, combine adjacent frequent phrases in the preprocessed data set that meet preset requirements into new phrases and add the new phrases to the frequent phrase data set to form a candidate phrase data set; and
a topic generation module, configured to analyze the candidate phrase data set through the SPLDA smooth phrase topic model to obtain topic phrases, and form corresponding topics through the topic phrases.
The technical scheme has the following beneficial effects: frequent phrases are generated with the Apriori association algorithm, and high-quality candidate phrases are generated by further combining the Gaussian distribution characteristics of the text, so the candidate phrase set can be obtained with fast convergence. Candidate phrases are mined from microblog topics using the Gaussian distribution characteristics of the text, the candidate phrase data set is analyzed through the SPLDA smooth phrase topic model to obtain topic phrases, and the corresponding topics are formed through the topic phrases.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a topic extraction method based on a smooth phrase topic model in accordance with an embodiment of the present invention;
FIG. 2 is a schematic diagram of a topic extraction device based on a smooth phrase topic model according to an embodiment of the present invention;
FIG. 3 is a framework diagram of topic extraction based on a smooth phrase topic model in accordance with an embodiment of the present invention;
FIG. 4 is a schematic diagram of a preprocessing module according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the SPLDA structure according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in fig. 1, in combination with an embodiment of the present invention, there is provided a topic extraction method based on a smooth phrase topic model, including:
S101: extracting effective words in the data set to be processed to obtain a preprocessed data set;
S102: extracting frequent phrases from the preprocessing data set through an Apriori association algorithm to form a frequent phrase data set, and updating the frequent phrase data set through the Apriori association algorithm; according to the Gaussian distribution characteristic of the occurrence frequency of the frequent phrases, combining adjacent frequent phrases meeting preset requirements in the preprocessing data set into new phrases, and adding the new phrases into the frequent phrase data set to form a candidate phrase data set;
S103: and analyzing the candidate phrase data set through the SPLDA smooth phrase topic model to obtain topic phrases, and forming corresponding topics through the topic phrases.
Preferably, in step 102, frequent phrases are extracted from the preprocessed data set by an Apriori association algorithm to form a frequent phrase data set, which specifically includes:
S1021: the preprocessed data set comprises a text-level data set; when the number of times a word occurs in the text-level data set is greater than the minimum support in the Apriori algorithm, the word is set as a frequent phrase, and a frequent phrase data set is generated;
S1022: the updating of the frequent phrase data set by the Apriori association algorithm specifically comprises the following steps:
marking the position of each frequent phrase in the text-level dataset;
detecting whether a text-level data set contains frequent phrases of a preset length, retaining the text-level data set when it does, and otherwise deleting the text-level data set; and,
in the retained text-level data sets, for frequent phrases of the same length, according to the positions of the frequent phrases: when the phrase adjacent to one side of a frequent phrase is also a frequent phrase, synthesizing the frequent phrase and the adjacent phrase into a first-level phrase; when the first-level phrase reaches the minimum support, adding the first-level phrase to the frequent phrase data set and deleting the two adjacent frequent phrases corresponding to it from the frequent phrase data set; and repeating this cycle of synthesizing first-level phrases from frequent phrases and their adjacent phrases until no first-level phrase meets the minimum support, completing the update of the frequent phrase data set.
Preferably, in step 102, a new phrase is synthesized from adjacent frequent phrases in the preprocessed data set meeting a preset requirement, and the new phrase is added to the frequent phrase data set to form a candidate phrase data set, which specifically includes:
S1023: acquiring two adjacent frequent phrases in the data set of the text level, combining the two frequent phrases into a second-level phrase, and calculating the importance of the second-level phrase in the data set of the text level, wherein the importance is the probability that the two frequent phrases appear in the same position in the data set of the text level;
S1024: when the importance is not less than a preset first threshold value, adding the second-level phrase into the frequent phrase data set, and deleting the two adjacent frequent phrases;
S1025: cycling the operation of combining two adjacent frequent phrases into a second-level phrase until the importance of the second-level phrase synthesized from any two adjacent frequent phrases is smaller than the preset first threshold value, so as to obtain the candidate phrase data set.
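The greedy merging loop of steps S1023-S1025 can be sketched as follows. The function and parameter names (`merge_candidates`, `importance`, `threshold`) are illustrative, not from the patent; the importance score (described later as the sig index) is supplied by the caller.

```python
def merge_candidates(doc, importance, threshold):
    """Greedily merge the adjacent pair of phrases with the highest importance
    until no pair reaches the threshold (the loop of S1023-S1025)."""
    phrases = [(w,) for w in doc]   # start from single-word phrases
    while len(phrases) > 1:
        scores = [importance(phrases[i], phrases[i + 1])
                  for i in range(len(phrases) - 1)]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] < threshold:
            break
        # merge the winning pair and delete the two originals
        phrases = phrases[:best] + [phrases[best] + phrases[best + 1]] + phrases[best + 2:]
    return phrases
```

With an importance function that only favors the pair "new" + "york", the loop merges that pair once and then stops, since the merged phrase no longer clears the threshold with its neighbor.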
Preferably, step 103 specifically includes:
calculating the probability of candidate phrases under different topics through the SPLDA smooth phrase topic model, and when the probability of the candidate phrases in a certain topic is not smaller than a second threshold value, taking the candidate phrases as topic phrases, and forming corresponding topics through the topic phrases.
Preferably, step 103 further comprises: calculating the standard deviation of the probability distribution of the words in the candidate phrases under the topics, and correcting the probability of the candidate phrases under different topics through the standard deviation of the words.
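The selection and standard-deviation correction of step 103 can be sketched as below. The exact SPLDA correction formula is not given in this summary, so the score used here (mean per-word probability minus `lam` times its standard deviation, favoring phrases whose words agree on the topic) is an illustrative stand-in, and all names are hypothetical.

```python
from statistics import mean, pstdev

def topic_phrases(phi, phrases, prob_threshold, lam=1.0):
    """Score each phrase under each topic by the mean per-word probability
    minus lam times its standard deviation, keeping phrases whose words agree
    on the topic. phi maps topic -> {word: probability}."""
    result = {}
    for topic, word_probs in phi.items():
        kept = []
        for phrase in phrases:
            probs = [word_probs.get(w, 1e-9) for w in phrase]
            score = mean(probs) - lam * pstdev(probs)
            if score >= prob_threshold:
                kept.append(phrase)
        result[topic] = kept
    return result
```

A phrase whose words are all probable under a topic keeps a high score, while a phrase mixing one probable and one improbable word is penalized by the large spread of its word probabilities.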
As shown in fig. 2, in combination with an embodiment of the present invention, there is also provided a topic extraction apparatus based on a smooth phrase topic model, including:
pretreatment module 21: extracting effective words in the data set to be processed to obtain a preprocessed data set;
phrase extraction module 22: the method comprises the steps of extracting frequent phrases from a preprocessing data set through an Apriori association algorithm to form a frequent phrase data set, and updating the frequent phrase data set through the Apriori association algorithm; according to the Gaussian distribution characteristic of the occurrence frequency of the frequent phrases, combining adjacent frequent phrases meeting preset requirements in the preprocessing data set into new phrases, and adding the new phrases into the frequent phrase data set to form a candidate phrase data set;
the topic generation module 23: and the method is used for analyzing the candidate phrase data set through the SPLDA smooth phrase topic model to obtain topic phrases, and forming corresponding topics through the topic phrases.
Preferably, the phrase extraction module 22 includes a frequent phrase mining sub-module 221, which is specifically configured to:
the preprocessed data set comprises a text-level data set; when the number of times a word occurs in the text-level data set is greater than the minimum support in the Apriori algorithm, the word is set as a frequent phrase, and a frequent phrase data set is generated. The updating of the frequent phrase data set by the Apriori association algorithm specifically comprises:
marking the position of each frequent phrase in the text-level data set;
detecting whether a text-level data set contains frequent phrases of a preset length, retaining the text-level data set when it does, and otherwise deleting the text-level data set; and,
in the retained text-level data sets, according to the positions of frequent phrases of the same length: when the phrase adjacent to one side of a frequent phrase is also a frequent phrase, synthesizing the frequent phrase and the adjacent phrase into a first-level phrase; when the first-level phrase reaches the minimum support, adding the first-level phrase to the frequent phrase data set and deleting the two adjacent frequent phrases corresponding to it from the frequent phrase data set; and repeating this cycle of synthesizing first-level phrases from frequent phrases and their adjacent phrases until no first-level phrase meets the minimum support, completing the update of the frequent phrase data set.
Preferably, the phrase extraction module 22 includes a candidate phrase generation sub-module 222, which is specifically configured to:
acquire two adjacent frequent phrases in the text-level data set, combine the two frequent phrases into a second-level phrase, and calculate the importance of the second-level phrase in the text-level data set, wherein the importance is the probability that the two frequent phrases appear in the same position in the text-level data set;
When the importance is not less than a preset first threshold value, adding the second-level phrase into a frequent phrase data set, and deleting the two adjacent frequent phrases;
the operation of combining two adjacent frequent phrases into one second-level phrase is circulated until the importance of the second-level phrase synthesized by any two adjacent frequent phrases is smaller than a preset first threshold value.
Preferably, the theme generation module 23 is specifically configured to: calculating the probability of candidate phrases under different topics through the SPLDA smooth phrase topic model, and when the probability of the candidate phrases in a certain topic is not smaller than a second threshold value, taking the candidate phrases as topic phrases, and forming corresponding topics through the topic phrases.
Preferably, the theme generation module 23 is specifically further configured to: and calculating standard deviation of probability distribution of words in the candidate phrases under the topics, and correcting the probability of the candidate phrases under different topics through the standard deviation of the words.
The invention has the beneficial effects that: frequent phrases are generated using the Apriori association algorithm, which uses two important rules to mine frequent phrases quickly and efficiently; high-quality candidate phrases are then generated by further combining the Gaussian distribution characteristics of the text.
Candidate phrases are mined from microblog topics using the Gaussian distribution characteristics of the text, the probability of each candidate phrase under different topics is calculated through the SPLDA smooth phrase topic model, and when the probability of a candidate phrase under a certain topic is not smaller than the second threshold value, the candidate phrase is taken as a topic phrase; the corresponding topics are formed through the topic phrases.
By combining the variance (i.e., the standard deviation) of the probability distribution of the words in a phrase under the same topic, the probability that the phrase belongs to that topic is adjusted, and the topic-phrase probability distribution is thereby corrected, which improves the convergence speed of sampling and the accuracy of the result. Expressing topics with the generated topic phrases improves the readability of the topics, reduces ambiguity, and expresses the real information of the topics more accurately.
The foregoing technical solutions of the embodiments of the present invention will be described in detail with reference to specific application examples, and reference may be made to the foregoing related description for details of the implementation process that are not described.
Abbreviations and key term definitions appearing in the present invention:
Topic model: a statistical model for discovering the abstract topics in a collection of documents, used in fields such as machine learning and natural language processing.
SPLDA: smoothed Phrase LDA (smooth phrase topic model).
It is a method for mining microblog topics in which the Gibbs sampling equation is smoothed by combining the variance of the probabilities of the terms in a phrase under the same topic, thereby correcting the topic-phrase distribution.
In order to improve the effect of public-opinion-oriented microblog topic discovery, a topic discovery method (also called a topic extraction method) based on SPLDA (Smoothed Phrase LDA) is provided. Describing microblog topics with phrases helps people grasp the meaning of topics more accurately and comprehensively, and further assists microblog public opinion monitoring.
The principle of the SPLDA-based topic discovery method is shown in fig. 3. A data set is input into the preprocessing module, and the effective words in the data set to be processed are extracted to obtain the preprocessed data set. The data set consists of collected microblog post content and can contain many posts, which may be sampled randomly from the current day's post data. The preprocessed data set is obtained by performing noise-tag filtering, word segmentation, and stop-word removal on the data set; these are general processes in natural language processing (NLP).
The text processed by the preprocessing module is denoised and converted into a data form that can be input into a Gaussian model. The preprocessed text is input into the phrase mining module, frequent phrases are mined through the Apriori association algorithm, and the frequent phrases are recombined according to the Gaussian distribution characteristics of their occurrence frequency in the preprocessed text to generate candidate phrases. A frequent phrase is a frequent item; "frequent item" is a term from association-rule mining and refers to a phrase that meets the minimum support. The recombination of frequent phrases works as follows: if the importance of two adjacent phrases meets a threshold (their frequency of occurrence reaches the minimum support), the two phrases are combined into one, for example, "Beijing" and "tally" are combined into "Beijing tally"; in this way the candidate phrase set is generated.
Finally, the candidate phrase set is input into the SPLDA topic phrase generation model and analyzed with the phrase topic model. By combining the association relation of the words in a phrase that belong to one topic and the variance of their probability distribution under the same topic, the "topic-phrase" probability distribution is obtained, and one or more topic phrases with high probability values are selected to represent each topic. For example, among tens of thousands of microblog texts there may be several core topics, such as celebrity gossip or the Nobel prize.
The specific details are as follows:
As shown in fig. 4, a schematic diagram of the preprocessing module: the process first uses rules to filter noise symbols in the microblog data set such as HTML tags, emoticons, and punctuation marks (HTML tags are tags in web page source code, for example <br> and <div>), and performs traditional-to-simplified Chinese conversion. Then a Chinese word segmentation tool is used to segment the data set into words and tag parts of speech, and stop words are removed; stop words are words that occur frequently in text but carry little meaning and are therefore removed. In addition, languages such as English are naturally segmented and need no word segmentation; since microblogs are mainly in Chinese and the hot spots of the collected data set can be obtained by analyzing the Chinese data, separate preprocessing for English is omitted. Finally, microblog texts whose content contains fewer than 4 effective words are removed; effective word types generally include nouns, verbs, adjectives, numerals, and time words.
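The preprocessing just described can be sketched minimally as follows, assuming a whitespace split as a stand-in for a Chinese word segmenter and an illustrative English stop-word list; all names are hypothetical.

```python
import re

# Illustrative stop-word list; a real system would use a Chinese stop-word lexicon.
STOPWORDS = {"the", "a", "an", "of", "and", "is"}

def preprocess(doc, min_effective=4):
    """Filter noise tags and punctuation, tokenize, and drop stop words.

    A whitespace split stands in for a Chinese word segmenter here; documents
    with fewer than min_effective effective words are discarded (None)."""
    doc = re.sub(r"<[^>]+>", " ", doc)    # strip HTML tags such as <br>, <div>
    doc = re.sub(r"[^\w\s]", " ", doc)    # strip punctuation and symbol noise
    tokens = [t.lower() for t in doc.split() if t.lower() not in STOPWORDS]
    return tokens if len(tokens) >= min_effective else None
```

Posts that survive the filter come out as token lists ready for phrase mining; posts with too few effective words are dropped, matching the 4-word cutoff above.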
Phrase mining (the phrase extraction module) consists of two main steps: (1) frequent phrase mining, which generates frequent phrases using the Apriori association algorithm and counts their occurrences; (2) candidate phrase generation, which combines the Gaussian distribution characteristics of the text to generate high-quality candidate phrases.
The task of frequent phrase mining is to collect all contiguous word sequences in the corpus (the preprocessed data set) whose count is greater than the minimum support of the Apriori algorithm; the minimum support is the minimum number of occurrences required of a phrase, for example, a minimum support of 3 means a phrase must occur at least 3 times. The algorithm mines frequent phrases quickly and effectively using two important rules, the "downward closure principle" and the "anti-monotonicity of data", which are two parallel, independent rules. Specifically, the downward closure principle states: if phrase P is not a frequent item, then any phrase containing P is also not a frequent item. The anti-monotonicity of data states: if a document does not contain a frequent phrase of length n, then it does not contain a frequent phrase of length greater than n.
The flow of the frequent phrase mining algorithm is shown in table 2. The algorithm first mines frequent items (frequent phrases) with the Apriori algorithm and maintains an active index set. The specific operation is: phrases that meet the support are obtained by applying the downward closure principle, i.e., non-frequent phrases and phrases that include non-frequent phrases are removed. The active indexes are the index information of frequent phrases of length n in the document containing the microblog content; at initialization, the active indexes are the positions in the data set of all words that meet the support.
The anti-monotonicity rule of the data is then used to judge whether a document needs further mining: since the active indexes record the frequent phrases of length n in the document containing the microblog content, a document that contains no frequent phrase of length n will contain no frequent phrase of length greater than n, so it is deleted in the next iteration of the Apriori algorithm. These two pruning techniques exploit the natural sparsity of phrases, speed up convergence, allow the algorithm iteration to terminate early, and improve the efficiency of phrase mining.
An example:
1. After word segmentation, obtain the positions of the words that meet the support threshold.
2. Suppose document d1 is processed. If an active word w1 (a frequent phrase) is in d1, combine w1 with its neighboring word according to the position of the frequent phrase in the text-level data set to generate phrase p1 (the neighboring word is combined only if it is itself a frequent phrase; otherwise it is not combined, based on the downward closure principle that if phrase P is not a frequent item, any phrase containing P is also not a frequent item).
3. Judge whether p1 meets the support; if so, add p1 to the active index set, and after the current iteration (for example, over frequent phrases of length 5) is completed, delete w1 and the word adjacent to w1 from the frequent phrase data set.
4. Combine the active words in d1 in this manner, and delete document d1 if none of the words in d1 is in the active index set.
5. When a combined new phrase (first-level phrase) meets the minimum support, add the first-level phrase to the frequent phrase data set.
6. Iterate the above steps over the documents of all microblog content until the active index set no longer changes.
Table 2 frequent phrase mining algorithm
That is, in the Apriori-based frequent phrase mining algorithm, phrases are generated using the "downward closure principle" and the "anti-monotonicity of data": the corpus (data set) is scanned with a sliding window, phrases that meet the support are taken as frequent phrases and counted, and the size of the sliding window is increased by 1 at each iteration (the window size is the phrase length; a scanning window of n means finding phrases of length n that meet the support). In practice, the initial value of n is usually taken as 2. On the nth iteration, candidate phrases of length n-1 are taken starting from each active index position and counted with a hash-based counter (HashMap), as shown in line 7 of table 2. A phrase of length n-1 that does not meet the Apriori minimum support threshold is not added to the phrase candidate set, and its starting position index is removed from the active index set before the next iteration.
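The sliding-window mining with downward closure and document pruning can be sketched as follows. Names are illustrative, and the sketch tracks pruning at the document level rather than maintaining per-position active indexes as the algorithm of table 2 does.

```python
from collections import Counter

def mine_frequent_phrases(docs, min_support=2):
    """Grow frequent phrases length by length with a sliding window.

    Downward closure: a length-n candidate is counted only if both of its
    length n-1 sub-phrases are already frequent. Anti-monotonicity: a document
    with no length-n candidate is pruned from later iterations."""
    counts = Counter(w for doc in docs for w in doc)
    frequent = {(w,): c for w, c in counts.items() if c >= min_support}
    active_docs = list(docs)
    n = 2
    while active_docs:
        counter = Counter()
        survivors = []
        for doc in active_docs:
            found = False
            for i in range(len(doc) - n + 1):
                gram = tuple(doc[i:i + n])
                if gram[:-1] in frequent and gram[1:] in frequent:
                    counter[gram] += 1
                    found = True
            if found:
                survivors.append(doc)   # else: pruned by anti-monotonicity
        new = {g: c for g, c in counter.items() if c >= min_support}
        if not new:
            break
        frequent.update(new)
        active_docs = survivors
        n += 1
    return frequent
```

Each pass widens the window by one, so the loop terminates as soon as no phrase of the current length clears the minimum support.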
The candidate phrase generation submodule uses the result of the frequent phrase mining module and recombines the frequent phrases in that result to generate candidate phrases. Assume the corpus (the dataset processed by the Apriori algorithm) is generated by a series of independent Bernoulli trials: whether phrase P occurs at a particular position in the corpus follows a Bernoulli distribution, the number of occurrences of phrase P follows a binomial distribution, and the occurrence events of different phrases are independent, i.e., each phrase's occurrence count follows its own binomial distribution. Since the number of phrase positions L in the corpus is quite large, the binomial distribution can reasonably be approximated by a Gaussian distribution according to the central limit theorem. Assume the frequent phrases P1 and P2 are known. The method uses an importance score sig(P1, P2) as the basis for deciding whether P1 and P2 should be merged, where sig(P1, P2) measures the probability that the known frequent phrases P1 and P2 occur together, i.e., the probability that P1 and P2 should be merged into the same phrase (co-occurrence of the two phrases). The score is derived as follows. Let f(P) denote the number of occurrences of phrase P in the corpus; the probability distribution of f(P) is given by formula (1).
h0(f(P)) = N(L·p(P), L·p(P)(1 − p(P))) ≈ N(L·p(P), L·p(P))   (1)
where p(P) is the success probability of phrase P in a Bernoulli trial; the empirical probability of phrase P occurring in the corpus can be estimated as p(P) ≈ f(P)/L. Assume phrase P0 is composed of phrases P1 and P2 (a phrase here covers both words and phrases), and that P1 and P2 are independent of each other. The expected number of occurrences of phrase P0 is then:

μ0(f(P0)) = L·p(P1)·p(P2)   (2)
From formulas (1) and (2), the variance of the distribution h0(f(P0)) can be obtained as:

σ0²(f(P0)) ≈ L·p(P1)·p(P2)·(1 − p(P1)·p(P2)) ≈ μ0(f(P0))   (3)
the method uses importance sig (P 1 ,P 2 ) Index to measure phrase P 1 And P 2 Probability of simultaneous occurrence. sig (P) 1 ,P 2 ) The calculation method is shown in formula (4).
Two formulas are used in approximating the binomial distribution by a Gaussian: expected value = L·p(P) and variance = L·p(P)(1 − p(P)). Together, these two formulas give the Gaussian distribution of the occurrences of the phrase P0 = P1 + P2 in the corpus.
From formula (1), the Gaussian distribution (in fact, the probability density function) of phrases P1 and P2 appearing together in the document is obtained, and f(P1P2) is normalized to obtain sig(P1, P2); a larger value of sig(P1, P2) indicates a higher probability that P1 and P2 can be merged into one phrase.
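Under this approximation, the score can be computed from raw counts alone. A minimal sketch, assuming sig takes the standardized form (f(P1P2) − μ0) / √f(P1P2) derived from formulas (1)-(4):

```python
import math

def sig(f1, f2, f12, L):
    """Significance of merging adjacent phrases P1 and P2.

    f1, f2: corpus counts of P1 and P2; f12: count of the concatenation
    P1P2; L: total number of phrase positions in the corpus. Under the
    independence null hypothesis, f(P1P2) ~ N(mu0, mu0) with
    mu0 = L * p(P1) * p(P2), so the score is the standardized excess
    co-occurrence (variance approximated by the observed count f12)."""
    mu0 = L * (f1 / L) * (f2 / L)  # expected count if P1, P2 independent
    if f12 == 0:
        return 0.0
    return (f12 - mu0) / math.sqrt(f12)
```

A pair that co-occurs far more often than independence predicts gets a large positive score; a pair at or below its expected count scores near zero.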
The specific process of generating candidate phrases is shown in Table 3. The candidate phrase generation rule recombines frequent phrases into longer phrases to improve their readability. After the dataset passes through the Apriori frequent phrase mining module, the document of each microblog content consists of frequent phrases. For each such document, first a bottom-up agglomeration method is adopted: frequent phrases and words that are adjacent are combined from left to right to generate new phrases. Next, the importance sig of each new phrase is computed; the phrases that meet the threshold are added to a candidate phrase set, i.e., a MaxHeap hash container (the key of the container is a phrase and the value is its importance), and the phrase at the corresponding position in the document is updated, i.e., the hash container is used to store the active index set. For example, with P1 = "Beijing Tiananmen" and P2 = "majestic" adjacent in document d1: if sig(P1, P2) meets the threshold, P1 and P2 are merged in the document into the phrase "Beijing Tiananmen majestic", and P1 and P2 are deleted from d1; if the threshold is not met, P1 and P2 are left unchanged. Next, the most important phrase Best in the MaxHeap hash container is selected (rows 2-4 in Table 3), new phrases are generated from the words or phrases adjacent to Best on its left and right in the document, the new phrases that meet the threshold are added to the candidate phrase set, and Best is removed from the MaxHeap. Finally, the above process is iterated until the sig value of the Best phrase is less than the threshold or the MaxHeap container is empty.
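The agglomeration loop above can be sketched for a single document as follows (a simplified sketch: it rebuilds the heap after every merge instead of maintaining the MaxHeap container incrementally, and `score` stands in for the sig computation; all names are illustrative):

```python
import heapq

def merge_candidates(doc, score, threshold):
    """Bottom-up agglomeration of one document.

    doc: list of tokens; score(a, b): significance of merging the two
    adjacent phrase tuples a and b. Repeatedly merges the most
    significant adjacent pair until no pair clears the threshold."""
    phrases = [(w,) for w in doc]  # start from single-word phrases
    while len(phrases) > 1:
        # Max-heap via negated scores over all adjacent positions.
        heap = [(-score(phrases[i], phrases[i + 1]), i)
                for i in range(len(phrases) - 1)]
        heapq.heapify(heap)
        best_neg, i = heapq.heappop(heap)
        if -best_neg < threshold:
            break  # no remaining pair meets the sig threshold
        # Merge the best adjacent pair in place; positions shift,
        # so the heap is rebuilt on the next pass.
        phrases[i:i + 2] = [phrases[i] + phrases[i + 1]]
    return phrases
```

Rebuilding the heap per merge costs O(n) per step; a production version would update only the scores of the pairs adjacent to the merged phrase, as the MaxHeap hash container in Table 3 does.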
TABLE 3 candidate phrase generation process
After the dataset passes through the preprocessing and phrase mining modules, the content of each document (the blog content of each microblog) has been divided into several candidate phrases, and the documents are converted from bag-of-words form into bag-of-phrases form. From the phrase mining module it is known that the words within a generated phrase have a strong association; therefore, in the phrase topic model, it is reasonable to assume that all the words in a phrase share one topic.
In 2003, Blei et al. proposed the LDA (Latent Dirichlet Allocation) model, which introduced Dirichlet prior distributions on the basis of the pLSA model. In the LDA model, each document d is assumed to follow a multinomial distribution θd over the K topics, and each topic k follows a multinomial distribution φk over the vocabulary; the multinomial distributions θd and φk are assumed to obey Dirichlet priors with hyperparameters α and β, respectively. The SPLDA method is based on LDA but replaces the bag-of-words model in LDA with a bag-of-phrases model, as shown in fig. 5. The SPLDA model rests on two assumptions: text independence, and that the terms in a phrase belong to one topic. Text independence means: if there are 10,000 microblog items in the dataset, each item is independent and has no influence on the others. The assumption that the terms in a phrase belong to one topic is exemplified as follows: if the phrase P0 = "Peking University score line" belongs to the topic "college entrance examination", then P1 = "Peking University" and P2 = "score line" both belong to the topic "college entrance examination" as well. The SPLDA generative process for a document is as follows:
(1) For each topic k (multiple microblog documents may correspond to multiple topics):
generate the topic-word distribution φk ~ Dir(β);
update the topic-phrase distribution when the topics of the terms in a phrase are the same.
(2) For each document d:
a) generate the document-topic distribution θd ~ Dir(α);
b) for the i-th word in the document:
i. generate the topic assignment z_{d,i} ~ Multi(θd);
ii. generate the term w_{d,i} ~ Multi(φ_{z_{d,i}}).
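The generative story above can be exercised with a toy sampler (a sketch under illustrative hyperparameters; the patent does not prescribe this code, and the function and argument names are assumed):

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from Dirichlet(alphas) via normalized Gamma draws."""
    xs = [rng.gammavariate(a, 1.0) for a in alphas]
    s = sum(xs)
    return [x / s for x in xs]

def splda_generate(n_docs, phrase_lens, K, V, alpha=0.5, beta=0.1, seed=0):
    """Toy sampler for the SPLDA generative story: one topic is drawn
    per phrase, and every word in the phrase is emitted from that
    topic's word distribution (the shared-topic assumption).

    phrase_lens: phrase lengths per document (same for all docs here,
    for brevity). Returns (docs, topics): docs[d] is a list of phrases
    (tuples of word ids), topics[d] the topic of each phrase."""
    rng = random.Random(seed)
    phi = [sample_dirichlet([beta] * V, rng) for _ in range(K)]  # topic-word
    docs, topics = [], []
    for _ in range(n_docs):
        theta = sample_dirichlet([alpha] * K, rng)               # doc-topic
        d_phrases, d_topics = [], []
        for g_len in phrase_lens:
            z = rng.choices(range(K), weights=theta)[0]          # phrase topic
            words = tuple(rng.choices(range(V), weights=phi[z])[0]
                          for _ in range(g_len))
            d_phrases.append(words)
            d_topics.append(z)
        docs.append(d_phrases)
        topics.append(d_topics)
    return docs, topics
```

The only difference from plain LDA sampling is that z is drawn once per phrase rather than once per word.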
The variables involved in the SPLDA model and their meanings are shown in table 4.
TABLE 4 SPLDA model-related parameters
In Table 4, the vocabulary is the list of all distinct words in the dataset. For example, all the distinct words in the dataset are stored in one file, each row is one word, and the line number is the position of the word in the vocabulary. Nd is the number of tokens in the d-th document, where one token is one word; Gd is the number of candidate phrases in the d-th document, each candidate phrase containing one or more words.
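A minimal sketch of the vocabulary construction described here (assumed function name; the word id is its line number in the vocabulary file):

```python
def build_vocabulary(docs):
    """Map every distinct word in the dataset to its position (line
    number) in the vocabulary, in first-occurrence order."""
    vocab = {}
    for doc in docs:
        for w in doc:
            if w not in vocab:
                vocab[w] = len(vocab)  # line number = insertion order
    return vocab
```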
In the SPLDA model, the random variable C_{d,g} denotes the topic of the g-th phrase of document d. The joint distribution function of the parameters Z, W, Φ, Θ and the topic-word joint distribution in SPLDA are as follows:

P_SPLDA(Z, W, Φ, Θ) = (1/C) · P_LDA(Z, W, Φ, Θ) · ∏_{d=1}^{D} ∏_{g=1}^{G_d} f(C_{d,g})   (5)

P_LDA(Z, W, Φ, Θ) = ∏_{k=1}^{K} p(φ_k | β) · ∏_{d=1}^{D} p(θ_d | α) · ∏_{i=1}^{N_d} p(z_{d,i} | θ_d) · p(w_{d,i} | φ_{z_{d,i}})   (6)
where Z is the set of topic assignments, W is the set of terms, Φ is the set of topic-word distributions (one distribution φk generated per topic k), and Θ is the set of document-topic distributions (one distribution θd ~ Dir(α) generated per document d).
where P_LDA(Z, W, Φ, Θ) is the joint distribution function of the parameters in the LDA model, see formula (6), and C is a normalization constant used to ensure that the left side of formula (5) is a proper probability distribution. The function f(C_{d,g}) constrains the words in a phrase to belong to the same topic, see formula (7):

f(C_{d,g}) = 1 if every word of the g-th phrase of document d is assigned the topic C_{d,g}, and 0 otherwise.   (7)
Assume phrase P0 has the same topic-phrase probability under topic z_j and topic z_i, i.e., p(C_{d,g} = z_j) = p(C_{d,g} = z_i), while the terms {w_{d,g,i}} contained in P0 have a small spread of topic-word probabilities under topic z_i and a large spread under topic z_j. In this case, assigning P0 to topic z_i better fits the assumption that the terms in a phrase belong to the same topic. However, when the parameters of the bag-of-phrases topic model are optimized with a conventional Gibbs sampling method, the choice between topics z_j and z_i for phrase P0 cannot be distinguished, leading to an inaccurate topic-phrase distribution. To address this, the method uses a statistical property of the probability distributions of the terms {w_{d,g,i}} under the same topic (namely, the standard deviation) to improve the Gibbs sampling method, thereby modifying the topic-phrase probability distribution. C_{d,g} has K possible values, and C_{d,g} = k denotes one of them, where K is the number of topics and k a specific topic. The parameter optimization equation is:

p(C_{d,g} = k | Z_{¬(d,g)}, W) ∝ p_Gibbs(C_{d,g} = k | Z_{¬(d,g)}, W) · tanh(Var_k) / Var_k   (8)
where Var_k (written VarSqrt in the algorithm) is the standard deviation of the probability distribution of the terms in the phrase under topic k, and p_Gibbs is the conventional Gibbs conditional. In formula (8), as Var increases, the value of Var/tanh(Var) increases accordingly, aggravating the penalty on p(C_{d,g} = k), so the probability of the phrase taking topic k becomes smaller. As Var decreases, the value of Var/tanh(Var) decreases (when Var is 0, the value is 1), relaxing the penalty on p(C_{d,g} = k), so the probability of the phrase taking topic k becomes larger. Through formula (8), the method fuses the spread of the probability distributions of the terms in a phrase under the same topic into the training process of the topic model, thereby correcting the topic-phrase probability distribution.
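The correction amounts to rescaling each topic's unnormalized probability by tanh(Var_k)/Var_k. A minimal sketch, assuming the base Gibbs conditional for the phrase is already available as `base_probs` (names are illustrative):

```python
import math
import statistics

def penalized_topic_probs(base_probs, word_probs_by_topic):
    """Smooth-phrase correction of the Gibbs conditional.

    base_probs[k]: unnormalized conventional conditional for topic k.
    word_probs_by_topic[k]: p(w | k) for each word of the phrase.
    Each topic's score is scaled by tanh(Var_k)/Var_k, where Var_k is
    the standard deviation of the phrase's word probabilities under
    topic k. Var_k -> 0 gives factor 1 (no penalty); a larger spread
    shrinks that topic's probability. Returns normalized probabilities."""
    scores = []
    for p, wp in zip(base_probs, word_probs_by_topic):
        var = statistics.pstdev(wp)          # spread of p(w|k) over the phrase
        factor = 1.0 if var == 0 else math.tanh(var) / var
        scores.append(p * factor)
    total = sum(scores)
    return [s / total for s in scores]
```

With equal base probabilities, the topic under which the phrase's words have the more uniform probabilities ends up favored, matching the shared-topic assumption.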
The invention has the beneficial effects that:
Frequent phrases are generated using the Apriori association algorithm, which uses two important rules to mine frequent phrases quickly and efficiently. Further combining this with the Gaussian distribution characteristics of the text generates high-quality candidate phrases.
Candidate phrases are mined for microblog topics based on the smooth phrase topic model using the Gaussian distribution characteristics of the text; the probabilities of the candidate phrases under different topics are computed through the SPLDA smooth phrase topic model, and when the probability of a candidate phrase in a certain topic is not less than a second threshold, the candidate phrase is taken as a topic phrase, and the corresponding topics are formed through the topic phrases.
The probability that a phrase belongs to a topic is corrected by combining it with the standard deviation of the probability distribution of the words in the phrase under the same topic, thereby correcting the topic-phrase probability distribution and improving the convergence speed of sampling and the accuracy of the result. The generated topic phrases are used to express the topic, which improves the readability of the topic, reduces ambiguity, and expresses the real information of the topic more accurately.
It should be understood that the specific order or hierarchy of steps in the processes disclosed are examples of exemplary approaches. Based on design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not meant to be limited to the specific order or hierarchy presented.
In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, invention lies in less than all features of a single disclosed embodiment. Thus the following claims are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate preferred embodiment of this invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, as used in the specification or claims, the term "includes" is intended to be inclusive in a manner similar to the term "comprising", as interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification or claims is intended to mean a non-exclusive "or".
Those of skill in the art will further appreciate that the various illustrative logical blocks, units, and steps described in connection with the embodiments of the invention may be implemented by electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, elements, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design requirements of the overall system. Those skilled in the art may implement the described functionality in varying ways for each particular application, but such implementation is not to be understood as beyond the scope of the embodiments of the present invention.
The various illustrative logical blocks or units described in the embodiments of the invention may be implemented or performed with a general purpose processor, a digital signal processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described. A general purpose processor may be a microprocessor, but in the alternative, the general purpose processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a digital signal processor and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a digital signal processor core, or any other similar configuration.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. In an example, a storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC, which may reside in a user terminal. In the alternative, the processor and the storage medium may reside as distinct components in a user terminal.
In one or more exemplary designs, the above-described functions of embodiments of the present invention may be implemented in hardware, software, firmware, or any combination of the three. If implemented in software, the functions may be stored on a computer-readable medium or transmitted as one or more instructions or code on the computer-readable medium. Computer-readable media include both computer storage media and communication media that facilitate transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer. For example, such computer-readable media may include, but are not limited to, RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that may be used to carry or store program code in the form of instructions or data structures readable by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Further, any connection is properly termed a computer-readable medium: if the software is transmitted from a website, server, or other remote source via a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wirelessly such as by infrared, radio, or microwave, those media are also included in the definition of computer-readable medium. Disk and disc, as used herein, include compact disc, laser disc, optical disc, DVD, floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above may also be included within computer-readable media.
The foregoing description of the embodiments is provided to illustrate the general principles of the invention and is not meant to limit the invention to the particular embodiments shown; any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (6)

1. A topic extraction method based on a smooth phrase topic model, characterized by comprising:
extracting the valid words in a dataset to be processed to obtain a preprocessed dataset;
extracting frequent phrases from the preprocessed dataset through the Apriori association algorithm to form a frequent phrase dataset, and updating the frequent phrase dataset through the Apriori association algorithm; according to the Gaussian distribution characteristics of the occurrence frequency of the frequent phrases, combining adjacent frequent phrases in the preprocessed dataset that meet preset requirements into new phrases, and adding the new phrases to the frequent phrase dataset to form a candidate phrase dataset;
analyzing the candidate phrase dataset through the SPLDA smooth phrase topic model to obtain topic phrases, and forming corresponding topics through the topic phrases;
wherein extracting frequent phrases from the preprocessed dataset through the Apriori association algorithm to form the frequent phrase dataset specifically comprises:
the preprocessed dataset comprises a text-level dataset; when the number of occurrences of a word in the text-level dataset is greater than the minimum support in the Apriori algorithm, setting the word as a frequent phrase, and generating the frequent phrase dataset;
wherein updating the frequent phrase dataset through the Apriori association algorithm specifically comprises:
marking the position of each frequent phrase in the text-level dataset;
detecting whether the text-level dataset contains frequent phrases of a preset length; when it contains frequent phrases of the preset length, retaining the text-level dataset, otherwise deleting the text-level dataset; and
in the retained text-level dataset, for frequent phrases of the same length and according to the position of each frequent phrase, when the phrase adjacent to one side of a frequent phrase is also a frequent phrase, combining the frequent phrase and the adjacent phrase into a first-level phrase; when the first-level phrase reaches the minimum support, adding the first-level phrase to the frequent phrase dataset and deleting the two adjacent frequent phrases corresponding to the first-level phrase from the frequent phrase dataset; repeatedly combining frequent phrases with adjacent phrases into first-level phrases until a first-level phrase no longer satisfies the minimum support, completing the update of the frequent phrase dataset;
wherein combining adjacent frequent phrases in the preprocessed dataset that meet preset requirements into new phrases and adding the new phrases to the frequent phrase dataset to form the candidate phrase dataset specifically comprises:
obtaining two adjacent frequent phrases in the text-level dataset and combining the two frequent phrases into a second-level phrase, and computing the importance of the second-level phrase in the text-level dataset, the importance being the probability that the two frequent phrases occur at the same position in the text-level dataset;
when the importance is not less than a preset first threshold, adding the second-level phrase to the frequent phrase dataset and deleting the two adjacent frequent phrases;
repeating the operation of combining two adjacent frequent phrases into one second-level phrase until the importance of the second-level phrase combined from any two adjacent frequent phrases is less than the preset first threshold, obtaining the candidate phrase dataset.

2. The topic extraction method based on a smooth phrase topic model according to claim 1, characterized in that analyzing the candidate phrase dataset through the SPLDA smooth phrase topic model to obtain topic phrases and forming corresponding topics through the topic phrases specifically comprises:
computing the probabilities of a candidate phrase under different topics through the SPLDA smooth phrase topic model; when the probability of the candidate phrase in a certain topic is not less than a second threshold, taking the candidate phrase as a topic phrase, and forming the corresponding topic through the topic phrase.

3. The topic extraction method based on a smooth phrase topic model according to claim 2, characterized by further comprising: computing the standard deviation of the probability distribution of the words in a candidate phrase under a topic, and correcting the probabilities of the candidate phrase under different topics through the standard deviation of the words.

4. A topic extraction device based on a smooth phrase topic model, characterized by comprising:
a preprocessing module, configured to extract the valid words in a dataset to be processed to obtain a preprocessed dataset;
a phrase extraction module, configured to extract frequent phrases from the preprocessed dataset through the Apriori association algorithm to form a frequent phrase dataset, and to update the frequent phrase dataset through the Apriori association algorithm; and, according to the Gaussian distribution characteristics of the occurrence frequency of the frequent phrases, to combine adjacent frequent phrases in the preprocessed dataset that meet preset requirements into new phrases and add the new phrases to the frequent phrase dataset to form a candidate phrase dataset;
a topic generation module, configured to analyze the candidate phrase dataset through the SPLDA smooth phrase topic model to obtain topic phrases, and to form corresponding topics through the topic phrases;
wherein the phrase extraction module comprises a frequent phrase mining submodule and a candidate phrase generation submodule, wherein:
the frequent phrase mining submodule is specifically configured to:
treat the preprocessed dataset as comprising a text-level dataset; when the number of occurrences of a word in the text-level dataset is greater than the minimum support in the Apriori algorithm, set the word as a frequent phrase and generate the frequent phrase dataset; and update the frequent phrase dataset through the Apriori association algorithm, specifically:
mark the position of each frequent phrase in the text-level dataset;
detect whether the text-level dataset contains frequent phrases of a preset length; when it contains frequent phrases of the preset length, retain the text-level dataset, otherwise delete the text-level dataset; and
in the retained text-level dataset, for frequent phrases of the same length and according to the position of each frequent phrase, when the phrase adjacent to one side of a frequent phrase is also a frequent phrase, combine the frequent phrase and the adjacent phrase into a first-level phrase; when the first-level phrase reaches the minimum support, add the first-level phrase to the frequent phrase dataset and delete the two adjacent frequent phrases corresponding to the first-level phrase from the frequent phrase dataset; repeatedly combine frequent phrases with adjacent phrases into first-level phrases until a first-level phrase no longer satisfies the minimum support, completing the update of the frequent phrase dataset;
the candidate phrase generation submodule is specifically configured to:
obtain two adjacent frequent phrases in the text-level dataset and combine the two frequent phrases into a second-level phrase, and compute the importance of the second-level phrase in the text-level dataset, the importance being the probability that the two frequent phrases occur at the same position in the text-level dataset;
when the importance is not less than a preset first threshold, add the second-level phrase to the frequent phrase dataset and delete the two adjacent frequent phrases;
repeat the operation of combining two adjacent frequent phrases into one second-level phrase until the importance of the second-level phrase combined from any two adjacent frequent phrases is less than the preset first threshold.

5. The topic extraction device based on a smooth phrase topic model according to claim 4, characterized in that the topic generation module is specifically configured to:
compute the probabilities of a candidate phrase under different topics through the SPLDA smooth phrase topic model; when the probability of the candidate phrase in a certain topic is not less than a second threshold, take the candidate phrase as a topic phrase, and form the corresponding topic through the topic phrase.

6. The topic extraction device based on a smooth phrase topic model according to claim 5, characterized in that the topic generation module is further configured to:
compute the standard deviation of the probability distribution of the words in a candidate phrase under a topic, and correct the probabilities of the candidate phrase under different topics through the standard deviation of the words.
CN201911421842.3A 2019-12-31 2019-12-31 Topic extraction method and device based on smooth phrase topic model Active CN111178048B (en)

Publications (2)

Publication Number Publication Date
CN111178048A (en) 2020-05-19
CN111178048B (en) 2023-08-01


Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108399162A (en) * 2018-03-21 2018-08-14 北京理工大学 The topic of phrase-based bag topic model finds method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030065655A1 (en) * 2001-09-28 2003-04-03 International Business Machines Corporation Method and apparatus for detecting query-driven topical events using textual phrases on foils as indication of topic
US20180357684A1 (en) * 2017-01-12 2018-12-13 Hefei University Of Technology Method for identifying prefereed region of product, apparatus and storage medium thereof
US10896444B2 (en) * 2017-01-24 2021-01-19 International Business Machines Corporation Digital content generation based on user feedback

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108399162A (en) * 2018-03-21 2018-08-14 北京理工大学 The topic of phrase-based bag topic model finds method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Topic key phrase extraction techniques for Chinese text; Yang; Zhang Desheng; Computer Science (Issue S2); full text *
A feature phrase extraction model based on frequent word sets in large-scale word sequences; Yu Qinqin; Peng Dunlu; Liu Cong; Journal of Chinese Computer Systems (Issue 05); full text *

Also Published As

Publication number Publication date
CN111178048A (en) 2020-05-19

Similar Documents

Publication Publication Date Title
CN112347778B (en) Keyword extraction method, keyword extraction device, terminal equipment and storage medium
CN111767796B (en) Video association method, device, server and readable storage medium
CN111651986B (en) Event keyword extraction method, device, equipment and medium
CN108959418A (en) Character relation extraction method and device, computer device and computer readable storage medium
CN107463548B (en) Phrase mining method and device
CN109783787A (en) A kind of generation method of structured document, device and storage medium
CN109086375B (en) A short text topic extraction method based on word vector enhancement
US20140032207A1 (en) Information Classification Based on Product Recognition
CN116050397B (en) Method, system, equipment and storage medium for generating long text abstract
CN109993216B (en) Text classification method and device based on K nearest neighbor KNN
CN111177375B (en) Electronic document classification method and device
CN114492390B (en) Data expansion method, device, equipment and medium based on keyword recognition
CN110457707B (en) Method and device for extracting real word keywords, electronic equipment and readable storage medium
Bykau et al. Fine-grained controversy detection in Wikipedia
US9754023B2 (en) Stochastic document clustering using rare features
CN116029280A (en) A document key information extraction method, device, computing device and storage medium
CN117391086A (en) Bid participation information extraction method, device, equipment and medium
CN111178048B (en) Topic extraction method and device based on smooth phrase topic model
CN118153007B (en) Text-oriented data database watermark embedding method, system and storage medium
CN110489759B (en) Text feature weighting and short text similarity calculation method, system and medium based on word frequency
CN103886097A (en) Chinese microblog viewpoint sentence recognition feature extraction method based on self-adaption lifting algorithm
CN109446321B (en) Text classification method, text classification device, terminal and computer readable storage medium
CN116226638A (en) Model training method, data benchmarking method, device and computer storage medium
TWI534640B (en) Chinese network information monitoring and analysis system and its method
CN115906835A (en) A Method for Learning Chinese Question Text Representation Based on Clustering and Contrastive Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant