
CN111178048B - Topic extraction method and device based on smooth phrase topic model - Google Patents


Info

Publication number
CN111178048B
CN111178048B (application CN201911421842.3A)
Authority
CN
China
Prior art keywords
phrase
frequent
phrases
dataset
topic
Prior art date
Legal status: Active
Application number
CN201911421842.3A
Other languages
Chinese (zh)
Other versions
CN111178048A (en)
Inventor
郭佳
张景鹏
徐路
李油
赵小琦
Current Assignee
Weibo Internet Technology China Co Ltd
Original Assignee
Weibo Internet Technology China Co Ltd
Priority date
Filing date
Publication date
Application filed by Weibo Internet Technology China Co Ltd
Priority to CN201911421842.3A
Publication of CN111178048A
Application granted
Publication of CN111178048B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract


An embodiment of the present invention provides a topic extraction method and device based on a smooth phrase topic model, including: extracting the effective words in the data set to be processed to obtain a preprocessed data set; extracting frequent phrases from the preprocessed data set with the Apriori association algorithm to form a frequent phrase data set; according to the Gaussian distribution characteristics of the occurrence frequency of frequent phrases, combining adjacent frequent phrases in the preprocessed data set that meet preset requirements into new phrases and adding the new phrases to the frequent phrase data set to form a candidate phrase data set; and analyzing the candidate phrase data set with the SPLDA smooth phrase topic model to obtain topic phrases, from which the corresponding topics are formed. Analyzing the candidate phrase data set with the smooth phrase topic model yields topic phrases; forming topics from these phrases improves the readability of the topics and expresses their real information more accurately.

Description

Topic extraction method and device based on smooth phrase topic model
Technical Field
The invention relates to the field of data mining, and in particular to a topic extraction method and device based on a smooth phrase topic model.
Background
With the rapid development of the Internet, social platforms such as Weibo (microblog), WeChat, and Toutiao have become mainstream media through which users transmit information and express opinions. Weibo attracts more and more users thanks to its open platform, timely information, concise content, and wide coverage, and has gradually become an important platform for netizens to read news, interact with others, post comments, participate in discussions of social events, and reflect public opinion.
Common microblog hot topics are typically described using manually labeled phrases, as shown in table 1.
TABLE 1 microblog hot search topic
In carrying out the present invention, the applicant has found that at least the following problems exist in the prior art:
most existing topic discovery methods extract features with a bag-of-words model. Because the association information among the words in a phrase is not considered, part of the effective information is lost; when such results are used to represent topics, the topic expression is poorly readable and ambiguous, and cannot accurately reflect the real information of the topic. For example, mining the data of topic 1 yields results such as "sun", "Korea", and "Song Huiqiao", while a phrase-level description such as "Descendants of the Sun" is hard to obtain, so topic comprehensiveness needs improvement.
Disclosure of Invention
The embodiment of the invention provides a topic extraction method and device based on a smooth phrase topic model, which are used for analyzing a candidate phrase data set through an SPLDA smooth phrase topic model to obtain a topic phrase, and forming a corresponding topic through the topic phrase, so that the readability of the topic is improved, and the true information of the topic is expressed more accurately.
In order to achieve the above object, in one aspect, an embodiment of the present invention provides a topic extraction method based on a smooth phrase topic model, including:
extracting effective words in the data set to be processed to obtain a preprocessed data set;
extracting frequent phrases from the preprocessing data set through an Apriori association algorithm to form a frequent phrase data set, and updating the frequent phrase data set through the Apriori association algorithm; according to the Gaussian distribution characteristic of the occurrence frequency of the frequent phrases, combining adjacent frequent phrases meeting preset requirements in the preprocessing data set into new phrases, and adding the new phrases into the frequent phrase data set to form a candidate phrase data set;
and analyzing the candidate phrase data set through the SPLDA smooth phrase topic model to obtain topic phrases, and forming corresponding topics through the topic phrases.
In another aspect, an embodiment of the present invention provides a topic extraction apparatus based on a smooth phrase topic model, including:
a preprocessing module, configured to extract the effective words in the data set to be processed to obtain a preprocessed data set;
a phrase extraction module, configured to extract frequent phrases from the preprocessed data set through the Apriori association algorithm to form a frequent phrase data set and update the frequent phrase data set through the Apriori association algorithm, and, according to the Gaussian distribution characteristic of the occurrence frequency of the frequent phrases, combine adjacent frequent phrases in the preprocessed data set that meet preset requirements into new phrases and add the new phrases to the frequent phrase data set to form a candidate phrase data set; and
a topic generation module, configured to analyze the candidate phrase data set through the SPLDA smooth phrase topic model to obtain topic phrases, and form corresponding topics through the topic phrases.
The technical scheme has the following beneficial effects: frequent phrases are generated with the Apriori association algorithm, and high-quality candidate phrases are generated by further combining the Gaussian distribution characteristics of the text, so the candidate phrase set can be obtained with fast convergence. Candidate phrases are mined from microblog topics using the Gaussian distribution characteristics of the text, the candidate phrase data set is analyzed through the SPLDA smooth phrase topic model to obtain topic phrases, and the corresponding topics are formed through the topic phrases.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a topic extraction method based on a smooth phrase topic model in accordance with an embodiment of the present invention;
FIG. 2 is a schematic diagram of a topic extraction device based on a smooth phrase topic model according to an embodiment of the present invention;
FIG. 3 is a framework diagram of topic extraction based on a smooth phrase topic model in accordance with an embodiment of the present invention;
FIG. 4 is a schematic diagram of a preprocessing module according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the SPLDA structure according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in fig. 1, in combination with an embodiment of the present invention, there is provided a topic extraction method based on a smooth phrase topic model, including:
S101: extracting effective words in the data set to be processed to obtain a preprocessed data set;
S102: extracting frequent phrases from the preprocessing data set through an Apriori association algorithm to form a frequent phrase data set, and updating the frequent phrase data set through the Apriori association algorithm; according to the Gaussian distribution characteristic of the occurrence frequency of the frequent phrases, combining adjacent frequent phrases meeting preset requirements in the preprocessing data set into new phrases, and adding the new phrases into the frequent phrase data set to form a candidate phrase data set;
S103: and analyzing the candidate phrase data set through the SPLDA smooth phrase topic model to obtain topic phrases, and forming corresponding topics through the topic phrases.
Preferably, in step 102, frequent phrases are extracted from the preprocessed data set by an Apriori association algorithm to form a frequent phrase data set, which specifically includes:
S1021: the preprocessed data set comprises a text-level data set; when the number of times a word occurs in the text-level data set is greater than the minimum support in the Apriori algorithm, the word is set as a frequent phrase, and a frequent phrase data set is generated;
S1022: the updating of the frequent phrase data set by the Apriori association algorithm specifically comprises the following steps:
marking the position of each frequent phrase in the text-level dataset;
detecting whether a text-level data set contains frequent phrases of a preset length, retaining the text-level data set when it does, and otherwise deleting the text-level data set; and,
in the retained text-level data sets, for frequent phrases of the same length, according to the positions of the frequent phrases: when the phrase adjacent to one side of a frequent phrase is also a frequent phrase, synthesizing the frequent phrase and the adjacent phrase into a first-level phrase; when the first-level phrase reaches the minimum support, adding the first-level phrase to the frequent phrase data set and deleting the two adjacent frequent phrases corresponding to it from the frequent phrase data set; and repeating this cycle of synthesizing first-level phrases from frequent phrases and their adjacent phrases until no first-level phrase meets the minimum support, completing the update of the frequent phrase data set.
Preferably, in step 102, a new phrase is synthesized from adjacent frequent phrases in the preprocessed data set meeting a preset requirement, and the new phrase is added to the frequent phrase data set to form a candidate phrase data set, which specifically includes:
S1023: acquiring two adjacent frequent phrases in the data set of the text level, combining the two frequent phrases into a second-level phrase, and calculating the importance of the second-level phrase in the data set of the text level, wherein the importance is the probability that the two frequent phrases appear in the same position in the data set of the text level;
S1024: when the importance is not less than a preset first threshold value, adding the second-level phrase into the frequent phrase data set, and deleting the two adjacent frequent phrases;
S1025: cycling the operation of combining two adjacent frequent phrases into a second-level phrase until the importance of the second-level phrase synthesized from any two adjacent frequent phrases is smaller than the preset first threshold value, so as to obtain the candidate phrase data set.
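The greedy merging loop of steps S1023-S1025 can be sketched as follows. The function and parameter names (`merge_candidates`, `importance`, `threshold`) are illustrative, not from the patent; the importance score (described later as the sig index) is supplied by the caller.

```python
def merge_candidates(doc, importance, threshold):
    """Greedily merge the adjacent pair of phrases with the highest importance
    until no pair reaches the threshold (the loop of S1023-S1025)."""
    phrases = [(w,) for w in doc]   # start from single-word phrases
    while len(phrases) > 1:
        scores = [importance(phrases[i], phrases[i + 1])
                  for i in range(len(phrases) - 1)]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] < threshold:
            break
        # merge the winning pair and delete the two originals
        phrases = phrases[:best] + [phrases[best] + phrases[best + 1]] + phrases[best + 2:]
    return phrases
```

With an importance function that only favors the pair "new" + "york", the loop merges that pair once and then stops, since the merged phrase no longer clears the threshold with its neighbor.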
Preferably, step 103 specifically includes:
calculating the probability of candidate phrases under different topics through the SPLDA smooth phrase topic model, and when the probability of the candidate phrases in a certain topic is not smaller than a second threshold value, taking the candidate phrases as topic phrases, and forming corresponding topics through the topic phrases.
Preferably, step 103 further comprises: calculating the standard deviation of the probability distribution of the words in the candidate phrases under the topics, and correcting the probability of the candidate phrases under different topics through the standard deviation of the words.
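The selection and standard-deviation correction of step 103 can be sketched as below. The exact SPLDA correction formula is not given in this summary, so the score used here (mean per-word probability minus `lam` times its standard deviation, favoring phrases whose words agree on the topic) is an illustrative stand-in, and all names are hypothetical.

```python
from statistics import mean, pstdev

def topic_phrases(phi, phrases, prob_threshold, lam=1.0):
    """Score each phrase under each topic by the mean per-word probability
    minus lam times its standard deviation, keeping phrases whose words agree
    on the topic. phi maps topic -> {word: probability}."""
    result = {}
    for topic, word_probs in phi.items():
        kept = []
        for phrase in phrases:
            probs = [word_probs.get(w, 1e-9) for w in phrase]
            score = mean(probs) - lam * pstdev(probs)
            if score >= prob_threshold:
                kept.append(phrase)
        result[topic] = kept
    return result
```

A phrase whose words are all probable under a topic keeps a high score, while a phrase mixing one probable and one improbable word is penalized by the large spread of its word probabilities.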
As shown in fig. 2, in combination with an embodiment of the present invention, there is also provided a topic extraction apparatus based on a smooth phrase topic model, including:
pretreatment module 21: extracting effective words in the data set to be processed to obtain a preprocessed data set;
phrase extraction module 22: the method comprises the steps of extracting frequent phrases from a preprocessing data set through an Apriori association algorithm to form a frequent phrase data set, and updating the frequent phrase data set through the Apriori association algorithm; according to the Gaussian distribution characteristic of the occurrence frequency of the frequent phrases, combining adjacent frequent phrases meeting preset requirements in the preprocessing data set into new phrases, and adding the new phrases into the frequent phrase data set to form a candidate phrase data set;
the topic generation module 23: and the method is used for analyzing the candidate phrase data set through the SPLDA smooth phrase topic model to obtain topic phrases, and forming corresponding topics through the topic phrases.
Preferably, the phrase extraction module 22 includes a frequent phrase mining sub-module 221, which is specifically configured to:
the preprocessed data set comprises a text-level data set; when the number of times a word occurs in the text-level data set is greater than the minimum support in the Apriori algorithm, the word is set as a frequent phrase, and a frequent phrase data set is generated. The updating of the frequent phrase data set by the Apriori association algorithm specifically comprises:
marking the position of each frequent phrase in the text-level data set;
detecting whether a text-level data set contains frequent phrases of a preset length, retaining the text-level data set when it does, and otherwise deleting the text-level data set; and,
in the retained text-level data sets, according to the positions of frequent phrases of the same length: when the phrase adjacent to one side of a frequent phrase is also a frequent phrase, synthesizing the frequent phrase and the adjacent phrase into a first-level phrase; when the first-level phrase reaches the minimum support, adding the first-level phrase to the frequent phrase data set and deleting the two adjacent frequent phrases corresponding to it from the frequent phrase data set; and repeating this cycle of synthesizing first-level phrases from frequent phrases and their adjacent phrases until no first-level phrase meets the minimum support, completing the update of the frequent phrase data set.
Preferably, the phrase extraction module 22 includes a candidate phrase generation sub-module 222, which is specifically configured to:
acquire two adjacent frequent phrases in the text-level data set, combine the two frequent phrases into a second-level phrase, and calculate the importance of the second-level phrase in the text-level data set, wherein the importance is the probability that the two frequent phrases appear in the same position in the text-level data set;
When the importance is not less than a preset first threshold value, adding the second-level phrase into a frequent phrase data set, and deleting the two adjacent frequent phrases;
the operation of combining two adjacent frequent phrases into one second-level phrase is circulated until the importance of the second-level phrase synthesized by any two adjacent frequent phrases is smaller than a preset first threshold value.
Preferably, the theme generation module 23 is specifically configured to: calculating the probability of candidate phrases under different topics through the SPLDA smooth phrase topic model, and when the probability of the candidate phrases in a certain topic is not smaller than a second threshold value, taking the candidate phrases as topic phrases, and forming corresponding topics through the topic phrases.
Preferably, the theme generation module 23 is specifically further configured to: and calculating standard deviation of probability distribution of words in the candidate phrases under the topics, and correcting the probability of the candidate phrases under different topics through the standard deviation of the words.
The invention has the beneficial effects that: frequent phrases are generated using the Apriori association algorithm, which uses two important rules to mine frequent phrases quickly and efficiently; high-quality candidate phrases are then generated by further combining the Gaussian distribution characteristics of the text.
Candidate phrases are mined from microblog topics using the Gaussian distribution characteristics of the text, the probability of each candidate phrase under different topics is calculated through the SPLDA smooth phrase topic model, and when the probability of a candidate phrase under a certain topic is not smaller than the second threshold value, the candidate phrase is taken as a topic phrase; the corresponding topics are formed through the topic phrases.
By combining the variance (i.e., the standard deviation) of the probability distribution of the words in a phrase under the same topic, the probability that the phrase belongs to that topic is adjusted, and the topic-phrase probability distribution is thereby corrected, which improves the convergence speed of sampling and the accuracy of the result. Expressing topics with the generated topic phrases improves the readability of the topics, reduces ambiguity, and expresses the real information of the topics more accurately.
The foregoing technical solutions of the embodiments of the present invention will be described in detail with reference to specific application examples, and reference may be made to the foregoing related description for details of the implementation process that are not described.
Abbreviations and key term definitions appearing in the present invention:
Topic model: a statistical model for discovering the abstract topics in a collection of documents, used in fields such as machine learning and natural language processing.
SPLDA: smoothed Phrase LDA (smooth phrase topic model).
It is a method for mining microblog topics in which the Gibbs sampling equation is smoothed by combining the variance of the probabilities of the terms in a phrase under the same topic, thereby correcting the topic-phrase distribution.
In order to improve the effect of public-opinion-oriented microblog topic discovery, a topic discovery method (also called a topic extraction method) based on SPLDA (Smoothed Phrase LDA) is provided. Describing microblog topics with phrases helps people grasp the meaning of topics more accurately and comprehensively, and further assists microblog public opinion monitoring.
The principle of the SPLDA-based topic discovery method is shown in fig. 3. A data set is input into the preprocessing module, and the effective words in the data set to be processed are extracted to obtain the preprocessed data set. The data set consists of collected microblog post content and can contain many posts, which may be sampled randomly from the current day's post data. The preprocessed data set is obtained by performing noise-tag filtering, word segmentation, and stop-word removal on the data set; these are general processes in natural language processing (NLP).
The text processed by the preprocessing module is denoised and converted into a data form that can be input into a Gaussian model. The preprocessed text is input into the phrase mining module, frequent phrases are mined through the Apriori association algorithm, and the frequent phrases are recombined according to the Gaussian distribution characteristics of their occurrence frequency in the preprocessed text to generate candidate phrases. A frequent phrase is a frequent item; "frequent item" is a term from association-rule mining and refers to a phrase that meets the minimum support. The recombination of frequent phrases works as follows: if the importance of two adjacent phrases meets a threshold (their frequency of occurrence reaches the minimum support), the two phrases are combined into one, for example, "Beijing" and "tally" are combined into "Beijing tally"; in this way the candidate phrase set is generated.
Finally, the candidate phrase set is input into the SPLDA topic phrase generation model and analyzed with the phrase topic model. By combining the association relation of the words in a phrase that belong to one topic and the variance of their probability distribution under the same topic, the "topic-phrase" probability distribution is obtained, and one or more topic phrases with high probability values are selected to represent each topic. For example, among tens of thousands of microblog texts there may be several core topics, such as celebrity gossip or the Nobel prize.
The specific details are as follows:
As shown in fig. 4, a schematic diagram of the preprocessing module: the process first uses rules to filter noise symbols in the microblog data set such as HTML tags, emoticons, and punctuation marks (HTML tags are tags in web page source code, for example <br> and <div>), and performs traditional-to-simplified Chinese conversion. Then a Chinese word segmentation tool is used to segment the data set into words and tag parts of speech, and stop words are removed; stop words are words that occur frequently in text but carry little meaning and are therefore removed. In addition, languages such as English are naturally segmented and need no word segmentation; since microblogs are mainly in Chinese and the hot spots of the collected data set can be obtained by analyzing the Chinese data, separate preprocessing for English is omitted. Finally, microblog texts whose content contains fewer than 4 effective words are removed; effective word types generally include nouns, verbs, adjectives, numerals, and time words.
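The preprocessing just described can be sketched minimally as follows, assuming a whitespace split as a stand-in for a Chinese word segmenter and an illustrative English stop-word list; all names are hypothetical.

```python
import re

# Illustrative stop-word list; a real system would use a Chinese stop-word lexicon.
STOPWORDS = {"the", "a", "an", "of", "and", "is"}

def preprocess(doc, min_effective=4):
    """Filter noise tags and punctuation, tokenize, and drop stop words.

    A whitespace split stands in for a Chinese word segmenter here; documents
    with fewer than min_effective effective words are discarded (None)."""
    doc = re.sub(r"<[^>]+>", " ", doc)    # strip HTML tags such as <br>, <div>
    doc = re.sub(r"[^\w\s]", " ", doc)    # strip punctuation and symbol noise
    tokens = [t.lower() for t in doc.split() if t.lower() not in STOPWORDS]
    return tokens if len(tokens) >= min_effective else None
```

Posts that survive the filter come out as token lists ready for phrase mining; posts with too few effective words are dropped, matching the 4-word cutoff above.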
Phrase mining (the phrase extraction module) consists of two main steps: (1) frequent phrase mining, which generates frequent phrases using the Apriori association algorithm and counts their occurrences; (2) candidate phrase generation, which combines the Gaussian distribution characteristics of the text to generate high-quality candidate phrases.
The task of frequent phrase mining is to collect all contiguous word sequences in the corpus (the preprocessed data set) whose count is greater than the minimum support of the Apriori algorithm; the minimum support is the minimum number of occurrences required of a phrase, for example, a minimum support of 3 means a phrase must occur at least 3 times. The algorithm mines frequent phrases quickly and effectively using two important rules, the "downward closure principle" and the "anti-monotonicity of data", which are two parallel, independent rules. Specifically, the downward closure principle states: if phrase P is not a frequent item, then any phrase containing P is also not a frequent item. The anti-monotonicity of data states: if a document does not contain a frequent phrase of length n, then it does not contain a frequent phrase of length greater than n.
The flow of the frequent phrase mining algorithm is shown in table 2. The algorithm first mines frequent items (frequent phrases) with the Apriori algorithm and maintains an active index set. The specific operation is: phrases that meet the support are obtained by applying the downward closure principle, i.e., non-frequent phrases and phrases that include non-frequent phrases are removed. The active indexes are the index information of frequent phrases of length n in the document containing the microblog content; at initialization, the active indexes are the positions in the data set of all words that meet the support.
The anti-monotonicity rule of the data is then used to judge whether a document needs further mining: since the active indexes record the frequent phrases of length n in the document containing the microblog content, a document that contains no frequent phrase of length n will contain no frequent phrase of length greater than n, so it is deleted in the next iteration of the Apriori algorithm. These two pruning techniques exploit the natural sparsity of phrases, speed up convergence, allow the algorithm iteration to terminate early, and improve the efficiency of phrase mining.
An example:
1. After word segmentation, obtain the positions of the words that meet the support threshold.
2. Suppose document d1 is processed. If an active word w1 (a frequent phrase) is in d1, combine w1 with its neighboring word according to the position of the frequent phrase in the text-level data set to generate phrase p1 (the neighboring word is combined only if it is itself a frequent phrase; otherwise it is not combined, based on the downward closure principle that if phrase P is not a frequent item, any phrase containing P is also not a frequent item).
3. Judge whether p1 meets the support; if so, add p1 to the active index set, and after the current iteration (for example, over frequent phrases of length 5) is completed, delete w1 and the word adjacent to w1 from the frequent phrase data set.
4. Combine the active words in d1 in this manner, and delete document d1 if none of the words in d1 is in the active index set.
5. When a combined new phrase (first-level phrase) meets the minimum support, add the first-level phrase to the frequent phrase data set.
6. Iterate the above steps over the documents of all microblog content until the active index set no longer changes.
Table 2 frequent phrase mining algorithm
That is, in the Apriori-based frequent phrase mining algorithm, phrases are generated using the "downward closure principle" and the "anti-monotonicity of data": the corpus (data set) is scanned with a sliding window, phrases that meet the support are taken as frequent phrases and counted, and the size of the sliding window is increased by 1 at each iteration (the window size is the phrase length; a scanning window of n means finding phrases of length n that meet the support). In practice, the initial value of n is usually taken as 2. On the nth iteration, candidate phrases of length n-1 are taken starting from each active index position and counted with a hash-based counter (HashMap), as shown in line 7 of table 2. A phrase of length n-1 that does not meet the Apriori minimum support threshold is not added to the phrase candidate set, and its starting position index is removed from the active index set before the next iteration.
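The sliding-window mining with downward closure and document pruning can be sketched as follows. Names are illustrative, and the sketch tracks pruning at the document level rather than maintaining per-position active indexes as the algorithm of table 2 does.

```python
from collections import Counter

def mine_frequent_phrases(docs, min_support=2):
    """Grow frequent phrases length by length with a sliding window.

    Downward closure: a length-n candidate is counted only if both of its
    length n-1 sub-phrases are already frequent. Anti-monotonicity: a document
    with no length-n candidate is pruned from later iterations."""
    counts = Counter(w for doc in docs for w in doc)
    frequent = {(w,): c for w, c in counts.items() if c >= min_support}
    active_docs = list(docs)
    n = 2
    while active_docs:
        counter = Counter()
        survivors = []
        for doc in active_docs:
            found = False
            for i in range(len(doc) - n + 1):
                gram = tuple(doc[i:i + n])
                if gram[:-1] in frequent and gram[1:] in frequent:
                    counter[gram] += 1
                    found = True
            if found:
                survivors.append(doc)   # else: pruned by anti-monotonicity
        new = {g: c for g, c in counter.items() if c >= min_support}
        if not new:
            break
        frequent.update(new)
        active_docs = survivors
        n += 1
    return frequent
```

Each pass widens the window by one, so the loop terminates as soon as no phrase of the current length clears the minimum support.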
The candidate phrase generation submodule uses the result of the frequent phrase mining module and recombines the frequent phrases in that result to generate candidate phrases. Assume the corpus (the dataset processed by the Apriori algorithm) is generated by a series of independent Bernoulli trials: whether phrase P occurs at a particular position in the corpus follows a Bernoulli distribution, the number of occurrences of phrase P follows a binomial distribution, and the occurrence events of different phrases are independent, i.e., each phrase's occurrence count follows its own binomial distribution. Since the number of phrase positions L in the corpus is quite large, the binomial distribution can reasonably be approximated by a Gaussian distribution according to the central limit theorem. Assume the frequent phrases P1 and P2 are known. The method uses an importance score sig(P1, P2) as the basis for deciding whether P1 and P2 should be merged, where sig(P1, P2) measures the probability that the known frequent phrases P1 and P2 occur together, i.e., the probability that P1 and P2 should be merged into the same phrase (co-occurrence of the two phrases). The score is derived as follows. Let f(P) denote the number of occurrences of phrase P in the corpus; the probability distribution of f(P) is given by formula (1).
h0(f(P)) = N(L·p(P), L·p(P)(1 − p(P))) ≈ N(L·p(P), L·p(P))   (1)
where p(P) is the success probability of phrase P in a Bernoulli trial; the empirical probability of phrase P occurring in the corpus can be estimated as p(P) ≈ f(P)/L. Assume phrase P0 is composed of phrases P1 and P2 (a phrase here covers both words and phrases), and that P1 and P2 are independent of each other. The expected number of occurrences of phrase P0 is then:

μ0(f(P0)) = L·p(P1)·p(P2)   (2)
From formulas (1) and (2), the variance of the distribution h0(f(P0)) can be obtained as:

σ0²(f(P0)) ≈ L·p(P1)·p(P2)·(1 − p(P1)·p(P2)) ≈ μ0(f(P0))   (3)
the method uses importance sig (P 1 ,P 2 ) Index to measure phrase P 1 And P 2 Probability of simultaneous occurrence. sig (P) 1 ,P 2 ) The calculation method is shown in formula (4).
Two formulas are used in approximating the binomial distribution by a Gaussian: expected value = L·p(P) and variance = L·p(P)(1 − p(P)). Together, these two formulas give the Gaussian distribution of the occurrences of the phrase P0 = P1 + P2 in the corpus.
From formula (1), the Gaussian distribution (in fact, the probability density function) of phrases P1 and P2 appearing together in the document is obtained, and f(P1P2) is normalized to obtain sig(P1, P2); a larger value of sig(P1, P2) indicates a higher probability that P1 and P2 can be merged into one phrase.
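Under this approximation, the score can be computed from raw counts alone. A minimal sketch, assuming sig takes the standardized form (f(P1P2) − μ0) / √f(P1P2) derived from formulas (1)-(4):

```python
import math

def sig(f1, f2, f12, L):
    """Significance of merging adjacent phrases P1 and P2.

    f1, f2: corpus counts of P1 and P2; f12: count of the concatenation
    P1P2; L: total number of phrase positions in the corpus. Under the
    independence null hypothesis, f(P1P2) ~ N(mu0, mu0) with
    mu0 = L * p(P1) * p(P2), so the score is the standardized excess
    co-occurrence (variance approximated by the observed count f12)."""
    mu0 = L * (f1 / L) * (f2 / L)  # expected count if P1, P2 independent
    if f12 == 0:
        return 0.0
    return (f12 - mu0) / math.sqrt(f12)
```

A pair that co-occurs far more often than independence predicts gets a large positive score; a pair at or below its expected count scores near zero.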
The specific process of generating candidate phrases is shown in Table 3. The candidate phrase generation rule recombines frequent phrases into longer phrases to improve their readability. After the dataset passes through the Apriori frequent phrase mining module, the document of each microblog content consists of frequent phrases. For each such document, first a bottom-up agglomeration method is adopted: frequent phrases and words that are adjacent are combined from left to right to generate new phrases. Next, the importance sig of each new phrase is computed; the phrases that meet the threshold are added to a candidate phrase set, i.e., a MaxHeap hash container (the key of the container is a phrase and the value is its importance), and the phrase at the corresponding position in the document is updated, i.e., the hash container is used to store the active index set. For example, with P1 = "Beijing Tiananmen" and P2 = "majestic" adjacent in document d1: if sig(P1, P2) meets the threshold, P1 and P2 are merged in the document into the phrase "Beijing Tiananmen majestic", and P1 and P2 are deleted from d1; if the threshold is not met, P1 and P2 are left unchanged. Next, the most important phrase Best in the MaxHeap hash container is selected (rows 2-4 in Table 3), new phrases are generated from the words or phrases adjacent to Best on its left and right in the document, the new phrases that meet the threshold are added to the candidate phrase set, and Best is removed from the MaxHeap. Finally, the above process is iterated until the sig value of the Best phrase is less than the threshold or the MaxHeap container is empty.
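The agglomeration loop above can be sketched for a single document as follows (a simplified sketch: it rebuilds the heap after every merge instead of maintaining the MaxHeap container incrementally, and `score` stands in for the sig computation; all names are illustrative):

```python
import heapq

def merge_candidates(doc, score, threshold):
    """Bottom-up agglomeration of one document.

    doc: list of tokens; score(a, b): significance of merging the two
    adjacent phrase tuples a and b. Repeatedly merges the most
    significant adjacent pair until no pair clears the threshold."""
    phrases = [(w,) for w in doc]  # start from single-word phrases
    while len(phrases) > 1:
        # Max-heap via negated scores over all adjacent positions.
        heap = [(-score(phrases[i], phrases[i + 1]), i)
                for i in range(len(phrases) - 1)]
        heapq.heapify(heap)
        best_neg, i = heapq.heappop(heap)
        if -best_neg < threshold:
            break  # no remaining pair meets the sig threshold
        # Merge the best adjacent pair in place; positions shift,
        # so the heap is rebuilt on the next pass.
        phrases[i:i + 2] = [phrases[i] + phrases[i + 1]]
    return phrases
```

Rebuilding the heap per merge costs O(n) per step; a production version would update only the scores of the pairs adjacent to the merged phrase, as the MaxHeap hash container in Table 3 does.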
TABLE 3 candidate phrase generation process
After the dataset passes through the preprocessing and phrase mining modules, the content of each document (the blog content of each microblog) has been divided into several candidate phrases, and the documents are converted from bag-of-words form into bag-of-phrases form. From the phrase mining module it is known that the words within a generated phrase have a strong association; therefore, in the phrase topic model, it is reasonable to assume that all the words in a phrase share one topic.
In 2003, Blei et al. proposed the LDA (Latent Dirichlet Allocation) model, which introduced Dirichlet prior distributions on the basis of the pLSA model. In the LDA model, each document d is assumed to follow a multinomial distribution θd over the K topics, and each topic k follows a multinomial distribution φk over the vocabulary; the multinomial distributions θd and φk are assumed to obey Dirichlet priors with hyperparameters α and β, respectively. The SPLDA method is based on LDA but replaces the bag-of-words model in LDA with a bag-of-phrases model, as shown in fig. 5. The SPLDA model rests on two assumptions: text independence, and that the terms in a phrase belong to one topic. Text independence means: if there are 10,000 microblog items in the dataset, each item is independent and has no influence on the others. The assumption that the terms in a phrase belong to one topic is exemplified as follows: if the phrase P0 = "Peking University score line" belongs to the topic "college entrance examination", then P1 = "Peking University" and P2 = "score line" both belong to the topic "college entrance examination" as well. The SPLDA generative process for a document is as follows:
(1) For each topic k (multiple microblog documents may correspond to multiple topics):
generate the topic-word distribution φk ~ Dir(β);
update the topic-phrase distribution when the topics of the terms in a phrase are the same.
(2) For each document d:
a) generate the document-topic distribution θd ~ Dir(α);
b) for the i-th word in the document:
i. generate the topic assignment z_{d,i} ~ Multi(θd);
ii. generate the term w_{d,i} ~ Multi(φ_{z_{d,i}}).
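The generative story above can be exercised with a toy sampler (a sketch under illustrative hyperparameters; the patent does not prescribe this code, and the function and argument names are assumed):

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from Dirichlet(alphas) via normalized Gamma draws."""
    xs = [rng.gammavariate(a, 1.0) for a in alphas]
    s = sum(xs)
    return [x / s for x in xs]

def splda_generate(n_docs, phrase_lens, K, V, alpha=0.5, beta=0.1, seed=0):
    """Toy sampler for the SPLDA generative story: one topic is drawn
    per phrase, and every word in the phrase is emitted from that
    topic's word distribution (the shared-topic assumption).

    phrase_lens: phrase lengths per document (same for all docs here,
    for brevity). Returns (docs, topics): docs[d] is a list of phrases
    (tuples of word ids), topics[d] the topic of each phrase."""
    rng = random.Random(seed)
    phi = [sample_dirichlet([beta] * V, rng) for _ in range(K)]  # topic-word
    docs, topics = [], []
    for _ in range(n_docs):
        theta = sample_dirichlet([alpha] * K, rng)               # doc-topic
        d_phrases, d_topics = [], []
        for g_len in phrase_lens:
            z = rng.choices(range(K), weights=theta)[0]          # phrase topic
            words = tuple(rng.choices(range(V), weights=phi[z])[0]
                          for _ in range(g_len))
            d_phrases.append(words)
            d_topics.append(z)
        docs.append(d_phrases)
        topics.append(d_topics)
    return docs, topics
```

The only difference from plain LDA sampling is that z is drawn once per phrase rather than once per word.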
The variables involved in the SPLDA model and their meanings are shown in table 4.
TABLE 4 SPLDA model-related parameters
In Table 4, the vocabulary is the list of all distinct words in the dataset. For example, all the distinct words in the dataset are stored in one file, each row is one word, and the line number is the position of the word in the vocabulary. Nd is the number of tokens in the d-th document, where one token is one word; Gd is the number of candidate phrases in the d-th document, each candidate phrase containing one or more words.
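A minimal sketch of the vocabulary construction described here (assumed function name; the word id is its line number in the vocabulary file):

```python
def build_vocabulary(docs):
    """Map every distinct word in the dataset to its position (line
    number) in the vocabulary, in first-occurrence order."""
    vocab = {}
    for doc in docs:
        for w in doc:
            if w not in vocab:
                vocab[w] = len(vocab)  # line number = insertion order
    return vocab
```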
In the SPLDA model, the random variable C_{d,g} denotes the topic of the g-th phrase of document d. The joint distribution function of the parameters Z, W, Φ, Θ and the topic-word joint distribution in SPLDA are as follows:

P_SPLDA(Z, W, Φ, Θ) = (1/C) · P_LDA(Z, W, Φ, Θ) · ∏_{d=1}^{D} ∏_{g=1}^{G_d} f(C_{d,g})   (5)

P_LDA(Z, W, Φ, Θ) = ∏_{k=1}^{K} p(φ_k | β) · ∏_{d=1}^{D} p(θ_d | α) · ∏_{i=1}^{N_d} p(z_{d,i} | θ_d) · p(w_{d,i} | φ_{z_{d,i}})   (6)
where Z is the set of topic assignments, W is the set of terms, Φ is the set of topic-word distributions (one distribution φk generated per topic k), and Θ is the set of document-topic distributions (one distribution θd ~ Dir(α) generated per document d).
where P_LDA(Z, W, Φ, Θ) is the joint distribution function of the parameters in the LDA model, see formula (6), and C is a normalization constant used to ensure that the left side of formula (5) is a proper probability distribution. The function f(C_{d,g}) constrains the words in a phrase to belong to the same topic, see formula (7):

f(C_{d,g}) = 1 if every word of the g-th phrase of document d is assigned the topic C_{d,g}, and 0 otherwise.   (7)
Assume phrase P0 has the same topic-phrase probability under topic z_j and topic z_i, i.e., p(C_{d,g} = z_j) = p(C_{d,g} = z_i), while the terms {w_{d,g,i}} contained in P0 have a small spread of topic-word probabilities under topic z_i and a large spread under topic z_j. In this case, assigning P0 to topic z_i better fits the assumption that the terms in a phrase belong to the same topic. However, when the parameters of the bag-of-phrases topic model are optimized with a conventional Gibbs sampling method, the choice between topics z_j and z_i for phrase P0 cannot be distinguished, leading to an inaccurate topic-phrase distribution. To address this, the method uses a statistical property of the probability distributions of the terms {w_{d,g,i}} under the same topic (namely, the standard deviation) to improve the Gibbs sampling method, thereby modifying the topic-phrase probability distribution. C_{d,g} has K possible values, and C_{d,g} = k denotes one of them, where K is the number of topics and k a specific topic. The parameter optimization equation is:

p(C_{d,g} = k | Z_{¬(d,g)}, W) ∝ p_Gibbs(C_{d,g} = k | Z_{¬(d,g)}, W) · tanh(Var_k) / Var_k   (8)
where Var_k (written VarSqrt in the algorithm) is the standard deviation of the probability distribution of the terms in the phrase under topic k, and p_Gibbs is the conventional Gibbs conditional. In formula (8), as Var increases, the value of Var/tanh(Var) increases accordingly, aggravating the penalty on p(C_{d,g} = k), so the probability of the phrase taking topic k becomes smaller. As Var decreases, the value of Var/tanh(Var) decreases (when Var is 0, the value is 1), relaxing the penalty on p(C_{d,g} = k), so the probability of the phrase taking topic k becomes larger. Through formula (8), the method fuses the spread of the probability distributions of the terms in a phrase under the same topic into the training process of the topic model, thereby correcting the topic-phrase probability distribution.
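The correction amounts to rescaling each topic's unnormalized probability by tanh(Var_k)/Var_k. A minimal sketch, assuming the base Gibbs conditional for the phrase is already available as `base_probs` (names are illustrative):

```python
import math
import statistics

def penalized_topic_probs(base_probs, word_probs_by_topic):
    """Smooth-phrase correction of the Gibbs conditional.

    base_probs[k]: unnormalized conventional conditional for topic k.
    word_probs_by_topic[k]: p(w | k) for each word of the phrase.
    Each topic's score is scaled by tanh(Var_k)/Var_k, where Var_k is
    the standard deviation of the phrase's word probabilities under
    topic k. Var_k -> 0 gives factor 1 (no penalty); a larger spread
    shrinks that topic's probability. Returns normalized probabilities."""
    scores = []
    for p, wp in zip(base_probs, word_probs_by_topic):
        var = statistics.pstdev(wp)          # spread of p(w|k) over the phrase
        factor = 1.0 if var == 0 else math.tanh(var) / var
        scores.append(p * factor)
    total = sum(scores)
    return [s / total for s in scores]
```

With equal base probabilities, the topic under which the phrase's words have the more uniform probabilities ends up favored, matching the shared-topic assumption.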
The invention has the beneficial effects that:
Frequent phrases are generated using the Apriori association algorithm, which uses two important rules to mine frequent phrases quickly and efficiently. Further combining this with the Gaussian distribution characteristics of the text generates high-quality candidate phrases.
Candidate phrases are mined for microblog topics based on the smooth phrase topic model using the Gaussian distribution characteristics of the text; the probabilities of the candidate phrases under different topics are computed through the SPLDA smooth phrase topic model, and when the probability of a candidate phrase in a certain topic is not less than a second threshold, the candidate phrase is taken as a topic phrase, and the corresponding topics are formed through the topic phrases.
The probability that a phrase belongs to a topic is corrected by combining it with the standard deviation of the probability distribution of the words in the phrase under the same topic, thereby correcting the topic-phrase probability distribution and improving the convergence speed of sampling and the accuracy of the result. The generated topic phrases are used to express the topic, which improves the readability of the topic, reduces ambiguity, and expresses the real information of the topic more accurately.
It should be understood that the specific order or hierarchy of steps in the processes disclosed are examples of exemplary approaches. Based on design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not meant to be limited to the specific order or hierarchy presented.
In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, invention lies in less than all features of a single disclosed embodiment. Thus the following claims are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate preferred embodiment of this invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, as used in the specification or claims, the term "includes" is intended to be inclusive in a manner similar to the term "comprising", as interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification or claims is intended to mean a non-exclusive "or".
Those of skill in the art will further appreciate that the various illustrative logical blocks, units, and steps described in connection with the embodiments of the invention may be implemented by electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, elements, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design requirements of the overall system. Those skilled in the art may implement the described functionality in varying ways for each particular application, but such implementation is not to be understood as beyond the scope of the embodiments of the present invention.
The various illustrative logical blocks or units described in the embodiments of the invention may be implemented or performed with a general purpose processor, a digital signal processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described. A general purpose processor may be a microprocessor, but in the alternative, the general purpose processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a digital signal processor and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a digital signal processor core, or any other similar configuration.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. In an example, a storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC, which may reside in a user terminal. In the alternative, the processor and the storage medium may reside as distinct components in a user terminal.
In one or more exemplary designs, the above-described functions of embodiments of the present invention may be implemented in hardware, software, firmware, or any combination of the three. If implemented in software, the functions may be stored on a computer-readable medium or transmitted as one or more instructions or code on the computer-readable medium. Computer-readable media include both computer storage media and communication media that facilitate transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer. For example, such computer-readable media may include, but are not limited to, RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that may be used to carry or store program code in the form of instructions or data structures readable by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Further, any connection is properly termed a computer-readable medium: if the software is transmitted from a website, server, or other remote source via a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wirelessly such as by infrared, radio, or microwave, those media are also included in the definition of computer-readable medium. Disk and disc, as used herein, include compact disc, laser disc, optical disc, DVD, floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above may also be included within computer-readable media.
The foregoing description of the embodiments is provided to illustrate the general principles of the invention and is not meant to limit the invention to the particular embodiments shown; any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (6)

1. A topic extraction method based on a smooth phrase topic model, characterized by comprising:
extracting the valid words in a dataset to be processed to obtain a preprocessed dataset;
extracting frequent phrases from the preprocessed dataset through the Apriori association algorithm to form a frequent phrase dataset, and updating the frequent phrase dataset through the Apriori association algorithm; according to the Gaussian distribution characteristics of the occurrence frequency of the frequent phrases, combining adjacent frequent phrases in the preprocessed dataset that meet preset requirements into new phrases, and adding the new phrases to the frequent phrase dataset to form a candidate phrase dataset;
analyzing the candidate phrase dataset through the SPLDA smooth phrase topic model to obtain topic phrases, and forming corresponding topics through the topic phrases;
wherein extracting frequent phrases from the preprocessed dataset through the Apriori association algorithm to form the frequent phrase dataset specifically comprises:
the preprocessed dataset comprises a text-level dataset; when the number of occurrences of a word in the text-level dataset is greater than the minimum support in the Apriori algorithm, setting the word as a frequent phrase, and generating the frequent phrase dataset;
wherein updating the frequent phrase dataset through the Apriori association algorithm specifically comprises:
marking the position of each frequent phrase in the text-level dataset;
detecting whether the text-level dataset contains frequent phrases of a preset length; when it contains frequent phrases of the preset length, retaining the text-level dataset, otherwise deleting the text-level dataset; and
in the retained text-level dataset, for frequent phrases of the same length and according to the position of each frequent phrase, when the phrase adjacent to one side of a frequent phrase is also a frequent phrase, combining the frequent phrase and the adjacent phrase into a first-level phrase; when the first-level phrase reaches the minimum support, adding the first-level phrase to the frequent phrase dataset and deleting the two adjacent frequent phrases corresponding to the first-level phrase from the frequent phrase dataset; repeatedly combining frequent phrases with adjacent phrases into first-level phrases until a first-level phrase no longer satisfies the minimum support, completing the update of the frequent phrase dataset;
wherein combining adjacent frequent phrases in the preprocessed dataset that meet preset requirements into new phrases and adding the new phrases to the frequent phrase dataset to form the candidate phrase dataset specifically comprises:
obtaining two adjacent frequent phrases in the text-level dataset and combining the two frequent phrases into a second-level phrase, and computing the importance of the second-level phrase in the text-level dataset, the importance being the probability that the two frequent phrases occur at the same position in the text-level dataset;
when the importance is not less than a preset first threshold, adding the second-level phrase to the frequent phrase dataset and deleting the two adjacent frequent phrases;
repeating the operation of combining two adjacent frequent phrases into one second-level phrase until the importance of the second-level phrase combined from any two adjacent frequent phrases is less than the preset first threshold, obtaining the candidate phrase dataset.

2. The topic extraction method based on a smooth phrase topic model according to claim 1, characterized in that analyzing the candidate phrase dataset through the SPLDA smooth phrase topic model to obtain topic phrases and forming corresponding topics through the topic phrases specifically comprises:
computing the probabilities of a candidate phrase under different topics through the SPLDA smooth phrase topic model; when the probability of the candidate phrase in a certain topic is not less than a second threshold, taking the candidate phrase as a topic phrase, and forming the corresponding topic through the topic phrase.

3. The topic extraction method based on a smooth phrase topic model according to claim 2, characterized by further comprising: computing the standard deviation of the probability distribution of the words in a candidate phrase under a topic, and correcting the probabilities of the candidate phrase under different topics through the standard deviation of the words.

4. A topic extraction device based on a smooth phrase topic model, characterized by comprising:
a preprocessing module, configured to extract the valid words in a dataset to be processed to obtain a preprocessed dataset;
a phrase extraction module, configured to extract frequent phrases from the preprocessed dataset through the Apriori association algorithm to form a frequent phrase dataset, and to update the frequent phrase dataset through the Apriori association algorithm; and, according to the Gaussian distribution characteristics of the occurrence frequency of the frequent phrases, to combine adjacent frequent phrases in the preprocessed dataset that meet preset requirements into new phrases and add the new phrases to the frequent phrase dataset to form a candidate phrase dataset;
a topic generation module, configured to analyze the candidate phrase dataset through the SPLDA smooth phrase topic model to obtain topic phrases, and to form corresponding topics through the topic phrases;
wherein the phrase extraction module comprises a frequent phrase mining submodule and a candidate phrase generation submodule, wherein:
the frequent phrase mining submodule is specifically configured to:
treat the preprocessed dataset as comprising a text-level dataset; when the number of occurrences of a word in the text-level dataset is greater than the minimum support in the Apriori algorithm, set the word as a frequent phrase and generate the frequent phrase dataset; and update the frequent phrase dataset through the Apriori association algorithm, specifically:
mark the position of each frequent phrase in the text-level dataset;
detect whether the text-level dataset contains frequent phrases of a preset length; when it contains frequent phrases of the preset length, retain the text-level dataset, otherwise delete the text-level dataset; and
in the retained text-level dataset, for frequent phrases of the same length and according to the position of each frequent phrase, when the phrase adjacent to one side of a frequent phrase is also a frequent phrase, combine the frequent phrase and the adjacent phrase into a first-level phrase; when the first-level phrase reaches the minimum support, add the first-level phrase to the frequent phrase dataset and delete the two adjacent frequent phrases corresponding to the first-level phrase from the frequent phrase dataset; repeatedly combine frequent phrases with adjacent phrases into first-level phrases until a first-level phrase no longer satisfies the minimum support, completing the update of the frequent phrase dataset;
the candidate phrase generation submodule is specifically configured to:
obtain two adjacent frequent phrases in the text-level dataset and combine the two frequent phrases into a second-level phrase, and compute the importance of the second-level phrase in the text-level dataset, the importance being the probability that the two frequent phrases occur at the same position in the text-level dataset;
when the importance is not less than a preset first threshold, add the second-level phrase to the frequent phrase dataset and delete the two adjacent frequent phrases;
repeat the operation of combining two adjacent frequent phrases into one second-level phrase until the importance of the second-level phrase combined from any two adjacent frequent phrases is less than the preset first threshold.

5. The topic extraction device based on a smooth phrase topic model according to claim 4, characterized in that the topic generation module is specifically configured to:
compute the probabilities of a candidate phrase under different topics through the SPLDA smooth phrase topic model; when the probability of the candidate phrase in a certain topic is not less than a second threshold, take the candidate phrase as a topic phrase, and form the corresponding topic through the topic phrase.

6. The topic extraction device based on a smooth phrase topic model according to claim 5, characterized in that the topic generation module is further configured to:
compute the standard deviation of the probability distribution of the words in a candidate phrase under a topic, and correct the probabilities of the candidate phrase under different topics through the standard deviation of the words.
CN201911421842.3A 2019-12-31 2019-12-31 Topic extraction method and device based on smooth phrase topic model Active CN111178048B (en)

Publications (2)

Publication Number Publication Date
CN111178048A (en) 2020-05-19
CN111178048B (en) 2023-08-01


Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108399162A (en) * 2018-03-21 2018-08-14 北京理工大学 The topic of phrase-based bag topic model finds method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030065655A1 (en) * 2001-09-28 2003-04-03 International Business Machines Corporation Method and apparatus for detecting query-driven topical events using textual phrases on foils as indication of topic
US20180357684A1 (en) * 2017-01-12 2018-12-13 Hefei University Of Technology Method for identifying prefereed region of product, apparatus and storage medium thereof
US10896444B2 (en) * 2017-01-24 2021-01-19 International Business Machines Corporation Digital content generation based on user feedback

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108399162A (en) * 2018-03-21 2018-08-14 北京理工大学 The topic of phrase-based bag topic model finds method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Topic key phrase extraction techniques for Chinese text; Yang; Zhang Desheng; Computer Science (Issue S2); full text *
A feature phrase extraction model based on frequent word sets in large-scale word sequences; Yu Qinqin; Peng Dunlu; Liu Cong; Journal of Chinese Computer Systems (Issue 05); full text *

Also Published As

Publication number Publication date
CN111178048A (en) 2020-05-19

Similar Documents

Publication Publication Date Title
CN112347778B (en) Keyword extraction method, keyword extraction device, terminal equipment and storage medium
CN111767796B (en) Video association method, device, server and readable storage medium
CN111651986B (en) Event keyword extraction method, device, equipment and medium
CN108959418A (en) Character relation extraction method and device, computer device and computer readable storage medium
CN107463548B (en) Phrase mining method and device
CN109783787A (en) A kind of generation method of structured document, device and storage medium
CN109086375B (en) A short text topic extraction method based on word vector enhancement
US20140032207A1 (en) Information Classification Based on Product Recognition
CN116050397B (en) Method, system, equipment and storage medium for generating long text abstract
CN109993216B (en) Text classification method and device based on K nearest neighbor KNN
CN111177375B (en) Electronic document classification method and device
CN114492390B (en) Data expansion method, device, equipment and medium based on keyword recognition
CN110457707B (en) Method and device for extracting real word keywords, electronic equipment and readable storage medium
Bykau et al. Fine-grained controversy detection in Wikipedia
US9754023B2 (en) Stochastic document clustering using rare features
CN116029280A (en) A document key information extraction method, device, computing device and storage medium
CN117391086A (en) Bid participation information extraction method, device, equipment and medium
CN111178048B (en) Topic extraction method and device based on smooth phrase topic model
CN118153007B (en) Text-oriented data database watermark embedding method, system and storage medium
CN110489759B (en) Text feature weighting and short text similarity calculation method, system and medium based on word frequency
CN103886097A (en) Chinese microblog viewpoint sentence recognition feature extraction method based on self-adaption lifting algorithm
CN109446321B (en) Text classification method, text classification device, terminal and computer readable storage medium
CN116226638A (en) Model training method, data benchmarking method, device and computer storage medium
TWI534640B (en) Chinese network information monitoring and analysis system and its method
CN115906835A (en) A Method for Learning Chinese Question Text Representation Based on Clustering and Contrastive Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant