CN114117045B - Method and electronic device for extracting topic tags from text set - Google Patents
- Publication number
- CN114117045B CN114117045B CN202111409911.6A CN202111409911A CN114117045B CN 114117045 B CN114117045 B CN 114117045B CN 202111409911 A CN202111409911 A CN 202111409911A CN 114117045 B CN114117045 B CN 114117045B
- Authority
- CN
- China
- Prior art keywords
- cluster
- text
- word
- topic
- clusters
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/231—Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
Abstract
The application provides a method and an electronic device for extracting topic labels from a text set. The method includes: converting each text in the text set into a text vector; taking each text vector as a bottom-layer cluster and performing bottom-up hierarchical clustering to determine the topic label of each layer of clusters; for any word, obtaining the cluster set corresponding to that word from the clusters whose topic labels contain the word, where the cluster set comprises at least one cluster and each cluster contains at least one text; finding the target cluster set mapped by a keyword to be extracted according to the cluster sets corresponding to different words; and obtaining the topic labels related to the keyword extracted from the text set according to the topic labels corresponding to the clusters in the target cluster set. This makes topic label extraction simpler and more convenient.
Description
Technical Field
The application relates to the technical field of natural language processing, and in particular to a method and an electronic device for extracting topic labels from a text set.
Background
A topic is the central idea of a text: it summarizes and reflects the body and core of the text content, and a topic label can briefly summarize the main content of a text with a small number of words. In an age of information overload and rapid data growth, enterprises accumulate massive text resources. With a large volume of text drawn from a wide range of channels, a data set contains content of different fields and types, and the practical problems are how to obtain the topics of the texts from multiple angles and how to understand the relative relations among the texts.
For topic label extraction over a large amount of text, a topic model such as Latent Dirichlet Allocation (LDA) or Probabilistic Latent Semantic Analysis (PLSA) can be used directly. However, such models learn poorly when the data set contains short texts; beyond the bag-of-words model they cannot perceive higher-level features such as word order, grammar, and semantics; their extensibility is limited; and newer neural network techniques cannot be introduced.
Since a data set may contain text content in different directions, without prior knowledge it is not known whether different texts describe similar or different content, so the first step should be to organize the content with a classification or clustering algorithm. A classification algorithm is supervised learning: the labels, i.e., the subject words to be extracted, must be set in advance, which inverts the extraction problem. Alternatively, texts can be vectorized and preprocessed with clustering: the algorithm divides the texts into a number of clusters, and each cluster is then analyzed separately to extract its subject words. However, how such clusters cut across topics depends on the vector representation, which is difficult to control in a targeted way, and traditional algorithms such as K-means also require the number of clusters to be fixed in advance, which is hard to determine when the number of divisions in the data set is unknown.
Disclosure of Invention
The embodiment of the application provides a method for extracting topic labels from a text set, which reduces the difficulty of topic label extraction.
The embodiment of the application also provides a method for extracting topic labels from a text set, comprising the following steps:
converting each text in the text set into a text vector;
taking each text vector as a bottom-most cluster, performing bottom-up hierarchical clustering, and determining the topic label of each layer of clusters;
for any word, obtaining the cluster set corresponding to the word according to the clusters whose topic labels contain the word, wherein the cluster set comprises at least one cluster and each cluster contains at least one text;
finding the target cluster set mapped by a keyword to be extracted according to the cluster sets corresponding to different words;
and obtaining the topic labels related to the keyword extracted from the text set according to the topic label corresponding to each cluster in the target cluster set.
In an embodiment, converting each text in the text set into a text vector includes:
performing word segmentation on each text in the text set;
and converting each text into a text vector through a trained encoder according to the words contained in it.
In an embodiment, determining the topic label of each layer of clusters includes:
for each cluster, calculating the importance value of each word according to the word frequency value of each word in the cluster;
and selecting, according to the importance values of the words in the cluster, a preset number of words with the largest importance values as the topic label corresponding to the cluster.
In an embodiment, calculating the importance value of each word according to its word frequency value in the cluster includes:
calculating the importance value of each word in the cluster according to the word's inverse document frequency value over the text set and its word frequency value in the cluster.
In an embodiment, calculating the importance value of a word according to its inverse document frequency value over the text set and its word frequency value in the cluster includes:
multiplying the word's inverse document frequency value over the text set by its word frequency value in the cluster to obtain a term frequency-inverse document frequency (TF-IDF) value;
and taking the TF-IDF value as the importance value of the word.
In an embodiment, the topic label includes a plurality of subject words, and obtaining, for any word, the cluster set corresponding to the word according to the clusters whose topic labels contain the word includes:
for each subject word, dividing the clusters whose topic labels contain the subject word into the same cluster set;
and if any two sub-clusters are merged into a new cluster and the topic labels of the two sub-clusters are identical, deleting the two sub-clusters from the cluster set corresponding to each subject word contained in that topic label.
In an embodiment, after obtaining the cluster set corresponding to the word, the method further includes:
sorting the clusters in the cluster set in descending order of the number of texts each cluster contains.
In an embodiment, obtaining, according to the topic label corresponding to each cluster in the target cluster set, the topic labels related to the keyword extracted from the text set includes:
taking, according to the ranking order of the clusters in the target cluster set, the topic label of the highest-ranked cluster as the topic label related to the keyword extracted from the text set.
In an embodiment, after finding the target cluster set mapped by the keyword, the method further comprises:
obtaining the text content related to the keyword extracted from the text set according to the texts contained in each cluster of the target cluster set.
The embodiment of the application also provides an electronic device, comprising:
a processor; and
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the above method of extracting topic labels from a text set.
According to the technical solution provided by the embodiment of the application, each text in the text set is converted into a text vector, the topic label of each layer of clusters is determined through hierarchical clustering, and for each word a cluster set is constructed from the clusters whose topic labels contain the word, so that the target cluster set mapped by a keyword can be found and the topic labels related to the keyword obtained from the topic labels of the clusters in the target cluster set. Because short and long texts alike are represented as fixed-length semantic vectors, both can be processed and mined by clustering, avoiding the problem traditional topic models face with short text. The bottom-up clustering requires no preset number of clusters, makes the computation of topic labels convenient, makes the target cluster set mapped by a keyword easy to find, and yields topic labels for clusters at multiple levels (i.e., at different granularities), so the computation is simpler and the range of application wider.
Drawings
In order to more clearly illustrate the technical solution of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 2 is a flow chart of a method for extracting topic labels from a text set according to an embodiment of the present application;
FIG. 3 is a schematic diagram of hierarchical clustering provided by an embodiment of the present application;
FIG. 4 is a detailed flowchart of step S220 in the corresponding embodiment of FIG. 2;
Fig. 5 is a block diagram of an apparatus for extracting a subject tag from a text set according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application.
Like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only to distinguish the description, and are not to be construed as indicating or implying relative importance.
Fig. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device 100 may be used to perform the method of extracting a subject tag from a text set provided by an embodiment of the present application. As shown in fig. 1, the electronic device 100 includes one or more processors 102, one or more memories 104 storing processor-executable instructions. Wherein the processor 102 is configured to perform the method of extracting a subject tag from a text set provided by the embodiments of the application described below.
The processor 102 may be a gateway, an intelligent terminal, or a device comprising a central processing unit (CPU), a graphics processing unit (GPU), or another form of processing unit with data processing and/or instruction execution capabilities; it may process data from other components in the electronic device 100 and may control other components in the electronic device 100 to perform desired functions.
The memory 104 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random-Access Memory (RAM) and/or cache memory (cache). The non-volatile memory may include, for example, Read-Only Memory (ROM), hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 102 to implement the method of extracting topic labels from a text set described below. Various applications and various data, such as data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
In one embodiment, fig. 1 illustrates that the electronic device 100 may further include an input device 106, an output device 108, and a data acquisition device 110, which are interconnected by a bus system 112 and/or other form of connection mechanism (not shown). It should be noted that the components and structures of the electronic device 100 shown in fig. 1 are exemplary only and not limiting, as the electronic device 100 may have other components and structures as desired.
The input device 106 may be a device used by a user to input instructions and may include one or more of a keyboard, mouse, microphone, touch screen, and the like. The output device 108 may output various information (e.g., images or sounds) to the outside (e.g., a user), and may include one or more of a display, a speaker, and the like. The data acquisition device 110 may acquire images of the subject and store the acquired images in the memory 104 for use by other components. The data acquisition device 110 may be a camera, for example.
In an embodiment, the various components in the exemplary electronic device 100 for implementing the method for extracting a subject tag from a text set according to embodiments of the present application may be integrally configured, or may be separately configured, such as integrally configured with the processor 102, the memory 104, the input device 106, and the output device 108, while separately configured with the data acquisition device 110.
In an embodiment, the example electronic device 100 for implementing the method of extracting a subject tag from a text set of embodiments of the present application may be implemented as a smart terminal such as a smart phone, tablet, desktop, server, or the like.
Fig. 2 is a flowchart of a method for extracting a topic label from a text set according to an embodiment of the present application. As shown in fig. 2, the method includes the following steps S210 to S250.
Step S210, converting each text in the text set into a text vector.
Extracting topic labels from a text set may be, for example, extracting topic labels from the e-commerce reviews of a computer accessory store, so as to determine users' evaluations of different products and of sales communication, and thereby improve service quality.
A text set is a collection of a large number of texts; a single review, for example, can be one text. Assuming the computer accessory store sells accessories such as CPUs, motherboards, graphics cards, and memory of different brands, and the reviews, divided into positive and negative, total 100,000 items, the text set can be these 100,000 reviews.
A text vector represents the textual features of a text, and each text yields one corresponding text vector. Specifically, an algorithm that encodes text into a fixed-length vector can be chosen to convert each text into a text vector, such as bag-of-words-based Doc2Vec, or deep-learning models such as recurrent neural networks (RNN) and attention models.
In one embodiment, each text in the text set may first be word-segmented and then converted into a text vector by a trained encoder according to the words it contains.
Word segmentation can be performed on the texts with an existing tokenizer. The encoder can be bag-of-words-based Doc2Vec, or can be trained with deep learning algorithms such as RNNs or attention models. Training can start from general-purpose or industry pre-trained word vectors; if the data set has labels, the labels are used, and if it has none, the encoder can be learned through contrastive learning or an autoencoder algorithm. All texts in the data set are numbered, and the encoder represents each text independently as a vector, giving the set of all text vectors; assuming the data set has N texts, N text vectors are obtained.
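As a minimal illustration of step S210, the sketch below uses a plain bag-of-words count encoder rather than a trained Doc2Vec or attention model; all function names are hypothetical, not from the patent:

```python
from collections import Counter

def build_vocab(segmented_texts):
    """Collect the vocabulary of the whole text set (word -> index)."""
    vocab = {}
    for words in segmented_texts:
        for w in words:
            vocab.setdefault(w, len(vocab))
    return vocab

def encode(words, vocab):
    """Represent one already word-segmented text as a fixed-length
    bag-of-words count vector; a trained encoder would instead produce
    a denser semantic vector of the same fixed-length shape."""
    vec = [0] * len(vocab)
    for w, c in Counter(words).items():
        vec[vocab[w]] = c
    return vec

texts = [["cpu", "fast"], ["cpu", "hot"], ["screen", "bright"]]
vocab = build_vocab(texts)
vectors = [encode(t, vocab) for t in texts]  # one vector per text
```

Any fixed-length representation works with the clustering steps that follow; a real encoder simply produces lower-dimensional, semantically richer vectors.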
Step S220, taking each text vector as the bottom-most cluster, performing bottom-up hierarchical clustering, and determining the topic label of each layer of clusters.
In hierarchical (agglomerative) clustering, each text vector is initially regarded as a separate cluster at the bottom layer: as shown in fig. 3, clusters 1, 2, 3, 4, and 5 may each represent the text vector of one text. The distance between every two clusters is calculated and the two closest clusters are found, for example 1 and 2, and 4 and 5; cluster 1 and cluster 2 are merged into a new cluster 6, and cluster 4 and cluster 5 into a new cluster 7, giving the three clusters 3, 6, and 7. Merging can continue in this way until a distance threshold is reached or a set number of classes is obtained.
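The bottom-up merging loop described above can be sketched as follows — an illustrative centroid-distance variant in plain Python, not the patent's implementation; `math.dist` computes Euclidean distance:

```python
import math

def hierarchical_cluster(vectors):
    """Bottom-up hierarchical clustering as in fig. 3 (illustrative
    centroid-linkage sketch): every vector starts as its own cluster,
    and the closest pair is repeatedly merged into a newly numbered
    cluster until one root remains. Returns {cluster id: member text
    indices} for every node of the tree T."""
    clusters = {i: (v, {i}) for i, v in enumerate(vectors)}  # id -> (centroid, members)
    tree = {i: {i} for i in range(len(vectors))}
    next_id = len(vectors)
    while len(clusters) > 1:
        # find the two currently closest clusters by centroid distance
        a, b = min(((x, y) for x in clusters for y in clusters if x < y),
                   key=lambda p: math.dist(clusters[p[0]][0], clusters[p[1]][0]))
        members = clusters[a][1] | clusters[b][1]
        centroid = [sum(col) / len(members)
                    for col in zip(*(vectors[i] for i in members))]
        clusters[next_id] = (centroid, members)
        tree[next_id] = members  # save the text numbers contained in each cluster
        del clusters[a], clusters[b]
        next_id += 1
    return tree
```

With four vectors forming two tight pairs, the tree gains merged nodes 4, 5, and a root 6 containing all texts, mirroring the numbering convention of fig. 3.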
During hierarchical clustering, the text numbers contained in each cluster can be saved continuously, finally yielding a tree-shaped clustering structure T similar to fig. 3, whose root node 9 contains all the texts. The topic label of each layer of clusters can be determined at the start of hierarchical clustering and each time clusters are merged. Taking fig. 3 as an example, clusters 1, 2, 3, 4, and 5 represent the text vectors of texts A, B, C, D, and E, respectively, so the topic label of cluster 1 is the topic label of text A, the topic label of cluster 2 is the topic label of text B, and so on. Cluster 6 is the merger of clusters 1 and 2 and contains texts A and B, so its topic label is the topic label common to texts A and B; cluster 7 contains texts D and E, so its topic label is the topic label common to texts D and E, and so on. A topic label consists of K subject words, where K is a hyperparameter that can be adjusted according to the number of subject words to be extracted, generally 5-10. For example, a topic label may be [CPU, motherboard, memory, customer service, excellent].
It should be noted that in the clustering structure T, each node is a cluster over the text set with K subject words attached. These subject words describe the topical core of the texts in that cluster and can be used to describe the content those texts express. The cluster of the root node represents the entire text set, and its subject words are representative of all texts. Therefore, in an embodiment, the topic label of the root node can be extracted directly, as needed, as the topic label of the whole text set.
Step S230: for any word, obtain the cluster set corresponding to the word according to the clusters whose topic labels contain the word, wherein the cluster set comprises at least one cluster and each cluster contains at least one text.
A corresponding cluster set may be constructed for each word that appears in a topic label. For a word, if the topic label of a certain cluster contains the word, the cluster is added to the cluster set corresponding to the word. A cluster set may thus be regarded as a set of clusters whose topic labels contain the same subject word.
For example, assume cluster C1 has topic label [A, B, C, D, E], cluster C2 has topic label [A, C, D, E, F], cluster C3 has topic label [A, C, E, F, G], and cluster C4 has topic label [A, B, E, F, G]. For word A, the topic labels of clusters C1, C2, C3, and C4 all contain A, so the cluster set (C1, C2, C3, C4) can be constructed; for word B, the cluster set (C1, C4); for word C, the cluster set (C1, C2, C3); for word D, the cluster set (C1, C2); and so on, a corresponding cluster set is constructed for each word.
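Building the cluster set per word, as in the C1-C4 example above, amounts to a simple inversion of the label map (an illustrative sketch; names are not from the patent):

```python
def build_word_to_clusters(cluster_labels):
    """Step S230 (illustrative sketch): invert {cluster: topic label}
    into {subject word: set of clusters whose label contains it}."""
    mapping = {}
    for cid, label in cluster_labels.items():
        for word in label:
            mapping.setdefault(word, set()).add(cid)
    return mapping

# The C1-C4 example from the text:
cluster_labels = {"C1": ["A", "B", "C", "D", "E"],
                  "C2": ["A", "C", "D", "E", "F"],
                  "C3": ["A", "C", "E", "F", "G"],
                  "C4": ["A", "B", "E", "F", "G"]}
m = build_word_to_clusters(cluster_labels)  # m["A"] == {"C1", "C2", "C3", "C4"}
```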
Step S240: find the target cluster set mapped by the keyword according to the cluster sets corresponding to different words and the keyword to be extracted.
A keyword, entered by a user or sent by another device, is a keyword in the related information the user wishes to understand; when the store owner wants to know the CPU-related reviews, the keyword may be "CPU". The computer receives the keyword to be extracted and can find the cluster set corresponding to it among the cluster sets corresponding to different words. For distinction, the cluster set corresponding to the keyword is called the target cluster set.
Step S250: obtain the topic labels related to the keyword extracted from the text set according to the topic label corresponding to each cluster in the target cluster set.
Let the target cluster set be C = [C_1, C_2, ..., C_n]. Any node cluster C_i contains K subject words; setting the query keyword itself aside, the remaining K-1 subject words can be returned as supplementary explanation of the keyword. In this way, n groups of topic labels related to the keyword are finally extracted.
Since the text numbers contained in each cluster are saved during the hierarchical clustering stage, the text content related to the keyword can also be obtained from the texts contained in each cluster of the target cluster set; this text content may include the texts contained in each cluster of the target cluster set.
The topic label of each cluster describes the core content of that cluster. When the target cluster set is empty (|C| = 0), no text in the text set describes information related to the keyword. When C is not empty, each cluster in C provides a description of the same keyword at a different granularity.
In general, the topics cut out by the clusters themselves are determined by the generated sentence vectors; an encoder algorithm usually guarantees that similar texts lie closer together, but the dimensions separating different topics are hard to control. For example, suppose a data set simultaneously covers Chinese reading, Chinese writing, English reading, and English writing, with a similar number of texts in each of the four classes. When the data set is clustered into two classes, it is hard to guarantee whether the result is [Chinese reading + Chinese writing, English reading + English writing] or [Chinese reading + English reading, Chinese writing + English writing]. Preserving the hierarchical clustering results at every layer solves this problem to some extent: if the clustering result is [Chinese reading + Chinese writing, English reading + English writing] and "reading" is to be queried as a subject keyword, the two clusters [Chinese reading] and [English reading] can still be found at a deeper level of the tree structure, with "Chinese" and "English" serving as auxiliary subject words that supplement the description of the texts in those clusters.
According to the technical solution provided by the embodiment of the application, each text in the text set is converted into a text vector, the topic label of each layer of clusters is determined through hierarchical clustering, and for each word a cluster set is constructed from the clusters whose topic labels contain the word, so that the target cluster set mapped by a keyword can be found and the topic labels related to the keyword obtained from the topic labels of the clusters in the target cluster set. Because short and long texts alike are represented as fixed-length semantic vectors, both can be processed and mined by clustering, avoiding the problem traditional topic models face with short text. The bottom-up clustering requires no preset number of clusters, makes the computation of topic labels convenient, makes the target cluster set mapped by a keyword easy to find, and yields topic labels for clusters at multiple levels (i.e., at different granularities), so the computation is simpler and the range of application wider.
In an embodiment, after the cluster set corresponding to each word is constructed in step S230, the clusters in each cluster set may further be sorted in descending order of the number of texts each cluster contains.
That is, within one cluster set, clusters containing more texts are placed in front and clusters containing fewer texts behind. For example, assuming there are m words, m cluster sets [C_1, C_2, ..., C_m] can be obtained, where C_i is the cluster set corresponding to word i. Any cluster set C_i = [C_i1, C_i2, ..., C_il] is sorted in descending order of the number of texts contained in each C_ij, i.e., cluster C_i1 contains the most texts and C_il the fewest.
Clusters near the front of the target cluster set C contain more texts, so their topic labels carry stronger, more general semantics, while clusters near the back explain the content of the cluster set from a finer-grained angle. Therefore, in an embodiment, the topic label of the highest-ranked cluster, i.e., the cluster containing the most texts in the target cluster set, can be taken as the topic label related to the keyword extracted from the text set; its label is the most general. The topic labels of later clusters can also be selected as needed, so that keyword-related topic labels can be obtained at the desired granularity.
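Steps S240-S250, combined with the descending ordering just described, can be sketched as a single lookup — an illustrative sketch, not the claimed implementation, returning the remaining K-1 subject words of the largest cluster:

```python
def query(keyword, word_to_clusters, cluster_texts, cluster_labels):
    """Steps S240-S250 (illustrative sketch): map the keyword to its
    target cluster set, rank clusters by descending text count, and
    return the remaining K-1 subject words of the largest cluster as
    the keyword-related topic label."""
    target = word_to_clusters.get(keyword, set())
    if not target:
        return []  # no text in the set discusses the keyword
    ranked = sorted(target, key=lambda c: len(cluster_texts[c]), reverse=True)
    return [w for w in cluster_labels[ranked[0]] if w != keyword]
```

Labels of lower-ranked clusters in `ranked` could be returned as well when finer-grained descriptions are wanted.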
To simplify the cluster sets, step S230 may specifically include: for each subject word, dividing the clusters whose topic labels contain the subject word into the same cluster set; and if any two sub-clusters are merged into a new cluster and their topic labels are identical, deleting the two sub-clusters from the cluster set corresponding to each subject word contained in that topic label.
Taking fig. 3 again as an example, assume clusters 1 and 2 are merged into cluster 6; clusters 1 and 2 may be called sub-clusters and cluster 6 the new cluster. If the topic labels extracted for cluster 1 and cluster 2 are identical, the two clusters each describe the same content; after they are merged into the larger cluster 6, whose topic label is the same as theirs, only cluster 6 needs to be kept and clusters 1 and 2 can be deleted.
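This deletion rule can be sketched as a small helper (illustrative only, assuming cluster sets are stored as Python sets):

```python
def prune_duplicate_subclusters(word_to_clusters, children, labels):
    """Simplification rule of step S230 (illustrative sketch): when two
    sub-clusters that were merged into a new cluster carry identical
    topic labels, the merged cluster already represents them, so both
    sub-clusters are removed from the cluster set of every subject
    word in that shared label."""
    a, b = children
    if labels[a] == labels[b]:
        for word in labels[a]:
            word_to_clusters[word].discard(a)
            word_to_clusters[word].discard(b)
```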
In one embodiment, as shown in fig. 4, the step S220 determines the theme label of each layer cluster, and specifically includes steps S221-S222.
Step S221: for each cluster, calculate the importance value of each word according to the word frequency value of each word in the cluster.
Here, the word frequency value of a word is its frequency of occurrence in the texts contained in the cluster, i.e., the number of occurrences of the word divided by the total number of words in those texts. The importance value represents how important a word is within the cluster; the larger the word frequency value, the larger the importance value. In one embodiment, the importance value may be represented directly by the word frequency value. In another embodiment, for each word in the cluster, the importance value may be calculated from the word's inverse document frequency (IDF) value over the text set and its term frequency (TF) value within the cluster: the two are multiplied to obtain the term frequency-inverse document frequency (TF-IDF) value, which is taken as the importance value of the word in the cluster.
Step S222: according to the importance value of each word in the cluster, select the preset number of words with the largest importance values as the topic label corresponding to the cluster.
The preset number can be adjusted according to the number of subject words to be extracted, generally 5-10: the 5-10 words with the largest importance values in a cluster serve as its topic label; in one embodiment, these are the 5-10 words with the largest TF-IDF values. During hierarchical clustering, the K words with the largest TF-IDF values can be selected for each cluster as its topic label and stored at each node of the clustering structure T. Because bottom-up clustering is used, when computing the TF value the occurrence count of a word and the total word count of a merged cluster are simply the sums of those of its sub-clusters, which makes the computation of cluster topic labels convenient.
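Steps S221-S222 can be sketched as a per-cluster TF-IDF computation — a minimal sketch with a smoothed IDF of our own choosing; the patent does not prescribe this exact formula:

```python
import math
from collections import Counter

def topic_label(cluster_texts, all_texts, k=5):
    """Steps S221-S222 (illustrative sketch): score every word in the
    cluster by TF (frequency within the cluster) times IDF (inverse
    document frequency over the whole text set, smoothed here as an
    assumption) and keep the k highest-scoring words as the label."""
    words = [w for text in cluster_texts for w in text]
    tf = Counter(words)
    total = len(words)
    n = len(all_texts)
    def idf(w):
        df = sum(1 for text in all_texts if w in text)
        return math.log(n / (1 + df))
    scores = {w: (c / total) * idf(w) for w, c in tf.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Words common to the whole text set receive low IDF and are pushed out of the label, so each cluster's label highlights what distinguishes it.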
The following is an example of a practical application scenario.
Consider extracting topic labels from the customer reviews of an e-commerce computer-accessory store. Analyzing users' reviews of different products and of sales communication helps the store improve its service quality.
Assume the store has 100,000 reviews and sells accessories of different brands such as CPUs, motherboards, graphics cards, and memory, with reviews divided into positive and negative. An encoder is built by unsupervised Doc2Vec training, and the number of extracted topic words is set to K=5. Steps S210-S230 above yield a tree-like cluster structure T and the cluster sets corresponding to each word (i.e., the mapping M).
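Construction of the mapping M from the labeled clusters can be sketched as below (the dict-based cluster representation with `label` and `n_texts` keys is a hypothetical layout, chosen only for illustration; clusters in each set are sorted by contained text count, descending, as in claim 6):

```python
from collections import defaultdict

def build_word_to_clusters(nodes):
    """Build mapping M: word -> cluster set (list of clusters whose
    topic label contains the word), sorted by text count descending.

    Each node is a dict with a "label" (list of topic words) and an
    "n_texts" (number of texts contained in the cluster).
    """
    m = defaultdict(list)
    for node in nodes:
        for word in node["label"]:
            m[word].append(node)
    for clusters in m.values():
        clusters.sort(key=lambda n: n["n_texts"], reverse=True)
    return dict(m)
```

With M in hand, any keyword lookup reduces to a dictionary access rather than a scan over the whole tree.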
The topic words extracted at the root node are [CPU, motherboard, memory, customer service, excellent]. These most general topic words show that, at the product level, the reviews mainly concern CPUs, motherboards, and memory; that many reviews discuss customer-service behavior; and that the overall impression is positive. Of course, buyers do not necessarily write "positive review" or "negative review" directly in the review text, so sentiment is analyzed indirectly: here, for example, the word "excellent" reflects the buyers' overall experience with the products.
When the store owner wants to examine CPU-related reviews, the cluster set C = [C_1, C_2, ..., C_n] corresponding to "CPU" is found in the mapping M, where C_1 is the cluster at the shallowest level; most CPU-related reviews can then be retrieved through the text numbers contained in its node. The topic label of cluster C_1 turns out to be [CPU, motherboard, customer service, excellent, INTEL]. Analysis shows that most CPU reviews also mention a motherboard; after all, e-commerce stores often sell CPU-and-motherboard bundles. The CPUs sold by the store naturally include both AMD and INTEL products. Cluster C_2 has the topic label [CPU, motherboard, customer service, excellent, AMD]; it covers most of the remaining CPU reviews but ranks lower because AMD sales are low relative to INTEL.
Some hierarchical relationships among keywords can be observed directly. Here, however, it turns out that the store actually sells graphics cards, yet "graphics card" does not appear in the shallowest topic labels. In practice, clusters C_1 and C_2 do contain some graphics-card and other accessory reviews, but most of their texts still concern CPUs, so their extracted labels can be regarded as describing CPUs. The cluster set corresponding to "graphics card" can be queried further; its first cluster's topic label is found to be [graphics card, ASUS, good, hard to buy, power supply], from which one can infer that ASUS graphics cards sell well but are in short supply on the market.
The information found so far comes mainly from positive reviews. When the store wants to study negative reviews to improve its products and services, common negative-review terms can be used as search keywords based on experience. For example, searching directly for "broken" may reveal that the first cluster in the resulting cluster set has the topic label [case, broken, negative review, package, problem]. Even without an exact count, this shows that the packaging of some computer cases arrived broken, and the corresponding review texts can be located directly through the text numbers contained in the cluster. The remaining clusters can likewise be inspected to find other damaged goods and their reviews.
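These keyword queries can be sketched as a lookup in M followed by reading each cluster's label and text numbers (a hypothetical helper over the dict-based cluster layout assumed above, with a `texts` key holding text numbers):

```python
def query(m, keyword, max_clusters=3):
    """Return (topic_label, text_numbers) pairs for the top clusters
    mapped to `keyword` in M; clusters are assumed pre-sorted by
    contained text count, descending."""
    clusters = m.get(keyword, [])
    return [(c["label"], c["texts"]) for c in clusters[:max_clusters]]
```

The returned text numbers let the store owner jump straight to the underlying review texts, as in the "broken" example above.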
The following is an apparatus embodiment of the present application, which may be used to perform the above-described method embodiment of extracting topic labels from a text set. For details not disclosed in the apparatus embodiment, please refer to the embodiment of the method for extracting topic labels from a text set.
Fig. 5 illustrates an apparatus for extracting topic labels from a text set according to an embodiment of the present application. The apparatus includes a vector conversion module 510, a hierarchical clustering module 520, a mapping determination module 530, a cluster set determination module 540, and a label extraction module 550.
A vector conversion module 510 for converting each text in the text set into a text vector;
A hierarchical clustering module 520, configured to take each text vector as a bottom-most cluster, perform bottom-up hierarchical clustering, and determine a topic label of each layer of clusters;
The mapping determination module 530 is configured to obtain, for any word, the cluster set corresponding to the word according to the clusters whose topic labels contain the word, where the cluster set includes at least one cluster, and each cluster includes at least one text;
The cluster set determination module 540 is configured to find the target cluster set mapped by the keyword according to the cluster sets corresponding to different words and the keyword to be extracted;
And the tag extraction module 550 is configured to obtain, according to the topic tag corresponding to each cluster in the target cluster set, a topic tag related to the keyword extracted from the text set.
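The five modules form a linear pipeline, which can be sketched as a thin wiring class (the callables passed in stand for modules 510-550; all names are illustrative, not the embodiment's actual interfaces):

```python
class TopicLabelExtractor:
    """Pipeline mirroring the five modules of Fig. 5."""

    def __init__(self, to_vectors, cluster, build_map,
                 pick_clusters, pick_labels):
        self.to_vectors = to_vectors        # vector conversion module 510
        self.cluster = cluster              # hierarchical clustering module 520
        self.build_map = build_map          # mapping determination module 530
        self.pick_clusters = pick_clusters  # cluster set determination module 540
        self.pick_labels = pick_labels      # label extraction module 550

    def extract(self, texts, keyword):
        vectors = self.to_vectors(texts)
        tree = self.cluster(vectors)
        m = self.build_map(tree)
        target = self.pick_clusters(m, keyword)
        return self.pick_labels(target)
```

Each stage corresponds one-to-one to a method step, so the modules can be tested or replaced independently.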
The implementation of the functions and roles of each module in the above apparatus is detailed in the implementation of the corresponding steps in the method for extracting topic labels from a text set, and is not described here again.
In the several embodiments provided in the present application, the disclosed apparatus and method may be implemented in other manners. The apparatus embodiments described above are merely illustrative, for example, of the flowcharts and block diagrams in the figures that illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored on a computer readable storage medium. Based on this understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method of the embodiments of the present application. The storage medium includes a U disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, an optical disk, or other various media capable of storing program codes.
Claims (9)
1. A method of extracting a subject tag from a text set, comprising:
converting each text in the text set into a text vector;
Taking each text vector as the bottom-most cluster, performing bottom-up hierarchical clustering, and determining a theme label of each layer of clusters;
For any word, obtaining a cluster set corresponding to the word according to the clusters whose topic labels contain the word, wherein for each topic word, the clusters whose topic labels contain the topic word are grouped into the same cluster set; if any two sub-clusters are merged into a new cluster and the topic labels of the two sub-clusters are the same, the two sub-clusters are deleted from the cluster set corresponding to each topic word contained in their topic labels; the cluster set includes at least one cluster, and each cluster includes at least one text;
Finding out a target cluster set mapped by the keyword according to the cluster sets corresponding to different words and the keyword to be extracted;
and obtaining the topic label which is extracted from the text set and is related to the keyword according to the topic label corresponding to each cluster in the target cluster set.
2. The method of claim 1, wherein converting each text in the set of texts into a text vector comprises:
word segmentation processing is carried out on each text in the text set;
Each text is converted to a text vector by a trained encoder based on the words contained in each text.
3. The method of claim 1, wherein determining the topic label of each layer of clusters comprises:
For each cluster, calculating the importance value of each word according to the word frequency value of the word in the cluster;
And selecting, according to the importance value of each word in the cluster, a preset number of words with the largest importance values as the topic label corresponding to the cluster.
4. The method of claim 3, wherein the calculating the importance value of each word according to the word frequency value of the word in the cluster comprises:
Calculating the importance value of each word in the cluster according to the inverse document frequency value of the word in the text set and the word frequency value of the word in the cluster.
5. The method of claim 4, wherein the calculating the importance value of the word according to the inverse document frequency value of the word in the text set and the word frequency value in the cluster comprises:
Multiplying the inverse document frequency value of the word in the text set by the word frequency value in the cluster to obtain a term frequency-inverse document frequency value;
And taking the term frequency-inverse document frequency value as the importance value of the word.
6. The method of claim 1, wherein after the obtaining the cluster corresponding to the term, the method further comprises:
and sorting the clusters in the cluster set according to the number of texts contained in each cluster in the cluster set in descending order.
7. The method according to claim 6, wherein the obtaining, according to the topic label corresponding to each cluster in the target cluster set, the topic label related to the keyword extracted from the text set includes:
And taking the topic label of the cluster with the highest ranking as the topic label related to the keyword extracted from the text set according to the ranking order of each cluster in the target cluster set.
8. The method of claim 1, wherein after locating the set of target clusters for keyword mapping, the method further comprises:
And obtaining text content which is extracted from the text set and is related to the keywords according to the text contained in each cluster in the target cluster set.
9. An electronic device, the electronic device comprising:
A processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the method of extracting a subject tag from a text set of any of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111409911.6A CN114117045B (en) | 2021-11-25 | 2021-11-25 | Method and electronic device for extracting topic tags from text set |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114117045A CN114117045A (en) | 2022-03-01 |
CN114117045B true CN114117045B (en) | 2025-04-04 |
Family
ID=80372636
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111409911.6A Active CN114117045B (en) | 2021-11-25 | 2021-11-25 | Method and electronic device for extracting topic tags from text set |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114117045B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101430708A (en) * | 2008-11-21 | 2009-05-13 | 哈尔滨工业大学深圳研究生院 | Blog hierarchy classification tree construction method based on label clustering |
CN113342980A (en) * | 2021-06-29 | 2021-09-03 | 中国平安人寿保险股份有限公司 | PPT text mining method and device, computer equipment and storage medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113407679B (en) * | 2021-06-30 | 2023-10-03 | 竹间智能科技(上海)有限公司 | Text topic mining method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||