TWI748749B - Method for automatic extraction of text classification and text keyword and device using same - Google Patents
- Publication number: TWI748749B (application TW109139905A)
- Authority
- TW
- Taiwan
- Prior art keywords
- keywords
- classification
- categories
- keyword
- automatic extraction
- Prior art date
Classifications
- (all under G—PHYSICS › G06—COMPUTING OR CALCULATING; COUNTING › G06F—ELECTRIC DIGITAL DATA PROCESSING)
- G06F16/355 — Creation or modification of classes or clusters (information retrieval of unstructured textual data; clustering/classification)
- G06F16/3329 — Natural language query formulation
- G06F16/38 — Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/24143 — Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
- G06F40/258 — Heading extraction; Automatic titling; Numbering
- G06F40/284 — Lexical analysis, e.g. tokenisation or collocates
- G06F40/30 — Semantic analysis
Abstract
Description
The present invention relates to a method for automatically extracting classifications and keywords from short texts, and in particular to such a method that can perform word-vector classification training through keywords on a small sample size.
The training datasets of current short-text classification models are mostly news corpora, applied mainly to news gathering or online public-opinion analysis. Such corpora feature an abundant number of documents, consistent article topics, and clear syntactic structure, so they suit classification by the conventional approach of training word vectors and a classifier.
However, when a typical enterprise handles customer-service submissions (such as e-mail or message inquiries), the relatively small volume (e.g., between 10 and 1,000,000 records) makes it difficult to run ordinary clustering and classification natural language processing (NLP) models; the conventional approach of training word vectors and a classifier suits only large sample sizes. Moreover, classifiers trained by machine learning are time- and resource-intensive and, once built, cannot be fine-tuned on demand: if keywords such as product names or feature names must be added in response to business needs, the only option is to retrain, which lacks flexibility.
In view of this, the conventional technology is in need of improvement.
Therefore, a main objective of the method and device for automatic extraction of text classification and text keywords of the present invention is to select, by a fitness algorithm, the best candidate keywords as the keywords corresponding to a specific category, so that, from the perspective of word-vector analysis, the keyword set adequately represents the gist of most of the short texts, enabling word-vector classification through keywords on a small sample size.
Another main objective of the method and device of the present invention is to introduce an open-source pretrained word-vector set to solve the prior art's poor analysis efficiency when training word vectors on small sample sizes.
A further main objective of the method and device of the present invention is that the keyword extraction module can re-extract and update the keywords of merged categories, correcting the prior art's problem of poorly representative category keywords.
A further main objective of the method and device of the present invention is that, compared with prior-art classification, the present invention needs neither word-vector training nor a machine-learned classifier, thereby saving system resources.
A further main objective of the method and device of the present invention is, by adopting an open-source pretrained word-vector set together with a fitness algorithm, to extract the best sets of keywords and classify with a word-vector similarity classification module, so that when keywords must later be adjusted (for instance, adding product names or feature names for business needs), no retraining is required, achieving greater agility and flexibility.
The present invention discloses a method for automatic extraction of text classification and text keywords, comprising: inputting a plurality of short texts; preprocessing the short texts to produce a plurality of preprocessed short texts; dividing the preprocessed short texts into a plurality of categories of different topics according to a text topic analysis model, and outputting the preprocessed short texts and the corresponding category labels; extracting the keyword sets corresponding to the categories according to the preprocessed short texts, the category labels, an open-source pretrained word-vector set and a fitness algorithm; and building a word-vector short-text classifier according to the categories and the keyword sets.
10: automatic short-text extraction, classification and keyword device
102: preprocessing module
104: short-text topic classification module
106: keyword extraction module
108: keyword merging module
110: similarity classification module
12: input device
14: output device
T1~Tm: short texts
T1'~Tm': preprocessed short texts
C1~Cn: categories
TI1~TIm: category labels
KW1~KWn: keywords
60: flow
600~612: steps
FIG. 1 is a schematic diagram of an automatic short-text extraction, classification and keyword device according to an embodiment of the present invention.
FIGS. 2 to 5 are schematic diagrams of the operation of the device of FIG. 1 according to an embodiment of the present invention.
FIG. 6 is a schematic diagram of an automatic short-text extraction, classification and keyword flow according to an embodiment of the present invention.
Please refer to FIG. 1, a schematic diagram of an automatic short-text extraction, classification and keyword device 10 according to an embodiment of the present invention. As shown in FIG. 1, the device 10 comprises a preprocessing module 102, a short-text topic classification module 104, a keyword extraction module 106, a keyword merging module 108 and a similarity classification module 110. In brief, an input device 12 feeds short texts T1~Tm to the preprocessing module 102, which performs preprocessing such as tokenization, stop-word removal, case handling, lemmatization and verb-form normalization to produce preprocessed texts T1'~Tm'. The topic classification module 104 divides T1'~Tm' into categories C1~Cn of different topics according to a text topic analysis model and outputs T1'~Tm' with the corresponding category labels TI1~TIm. The keyword extraction module 106 extracts the keywords KW1~KWn corresponding to categories C1~Cn according to T1'~Tm', the labels TI1~TIm, an open-source pretrained word-vector set and a fitness algorithm. The keyword merging module 108 evaluates the vector distances between the keywords KW1~KWn and merges those categories among C1~Cn whose corresponding keywords lie within a distance threshold; the keyword extraction module 106 then re-extracts and updates the keywords of the merged categories to correct poor keyword representativeness. Finally, the similarity classification module 110 computes distances from the keywords KW1~KWn and their word vectors, builds a word-vector short-text classifier and provides it to an output device 14, improving keyword and category recognition accuracy for subsequent small-sample texts (e.g., a subsequent text is judged to belong to the category whose keyword set among KW1~KWn it is most similar to). In this way, the present invention can classify by word vectors through keywords on a small sample size with greater flexibility, and saves resources by training neither word vectors nor a classifier.
Specifically, please refer to FIGS. 2 to 5, which illustrate the operation of the device 10. As shown in FIG. 2, after the preprocessing module 102 applies tokenization, stop-word removal, case handling, lemmatization and verb-form normalization, short texts T1~T6 become preprocessed texts T1'~T6' that retain only the important words in a form suitable for processing, and these are provided to the topic classification module 104. Preprocessing is well known to those of ordinary skill in the art and is not detailed here. Next, as shown in FIG. 3, the topic classification module 104 divides T1'~T6' into categories C1~C4 and assigns the category labels TI1~TI6 as C1, C2, C2, C3, C1 and C4 respectively, so that texts on similar topics fall into the same category; the keyword extraction module 106 then extracts the keywords KW1~KW4 to represent categories C1~C4. Then, as shown in FIG. 4, the keyword merging module 108 determines that the vector distance between keywords KW2 and KW3 is below the threshold and merges category C3 into category C2, combining the nearby categories; the keyword extraction module 106 re-extracts keywords for the merged category C2, i.e., updates keyword KW2 as shown in FIG. 5 (the original label TI4 is also updated to category C2).
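The preprocessing steps just described can be sketched as follows. This is a minimal illustration: the stop-word list and lemma table below are invented stand-ins, not resources named by the patent.

```python
import re

# Illustrative resources: the patent does not specify its stop-word list
# or lemmatizer, so tiny stand-ins are used here.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of"}
LEMMAS = {"running": "run", "ran": "run", "files": "file"}

def preprocess(text: str) -> list[str]:
    """Tokenize, lower-case, drop stop words, and lemmatize one short text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())       # tokenization + case handling
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    return [LEMMAS.get(t, t) for t in tokens]             # lemmatization / verb-form normalization

print(preprocess("The user is running a files backup"))   # ['user', 'run', 'file', 'backup']
```

A real deployment would substitute a proper tokenizer and lemmatizer for the language of the texts; the structure of the pipeline stays the same.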
In detail, the text topic analysis model of the topic classification module 104 may comprise at least one of a Gibbs Sampling Dirichlet Multinomial Mixture model (GSDMM), a Dirichlet Multinomial Mixture model (DMM), a Generalized Pólya Urn Dirichlet Multinomial Mixture model (GPU-DMM), Latent Dirichlet Allocation (LDA), Topic Representative Term Discovery (TRTD), a Biterm Topic Model (BTM), Latent Semantic Indexing (LSI) and Latent Semantic Analysis (LSA). For example, GSDMM first assigns the preprocessed texts T1'~Tm' to groups at random and then moves texts between groups until the texts within each of the categories C1~Cn of different topics share similar properties. GSDMM and the other methods are well known in the art and are not detailed here for brevity.
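As an illustration of the first option, a minimal GSDMM Gibbs sampler in the spirit of Yin and Wang's movie-group process might look like the sketch below. It is not the patent's implementation, and the hyperparameters are arbitrary.

```python
import random
from collections import Counter

def gsdmm(docs, K=4, alpha=0.1, beta=0.1, iters=15, seed=0):
    """Minimal GSDMM Gibbs sampler: each short text belongs to exactly one cluster."""
    rng = random.Random(seed)
    vocab = {w for d in docs for w in d}
    V, D = len(vocab), len(docs)
    z = [rng.randrange(K) for _ in docs]       # cluster label per document
    m = [0] * K                                # documents per cluster
    n = [0] * K                                # total words per cluster
    nw = [Counter() for _ in range(K)]         # per-word counts per cluster
    for d, k in zip(docs, z):
        m[k] += 1; n[k] += len(d); nw[k].update(d)
    for _ in range(iters):
        for i, d in enumerate(docs):
            k = z[i]                           # remove doc i from its cluster
            m[k] -= 1; n[k] -= len(d); nw[k].subtract(d)
            weights = []
            for c in range(K):                 # sampling weight for each cluster
                p = (m[c] + alpha) / (D - 1 + K * alpha)
                pos = 0
                for w, cnt in Counter(d).items():
                    for j in range(cnt):
                        p *= (nw[c][w] + beta + j) / (n[c] + V * beta + pos)
                        pos += 1
                weights.append(p)
            k = rng.choices(range(K), weights=weights)[0]
            z[i] = k; m[k] += 1; n[k] += len(d); nw[k].update(d)
    return z

docs = [["apple", "fruit"], ["apple", "fruit"], ["stock", "market"], ["stock", "market"]]
print(gsdmm(docs, K=2, iters=20))
```

Texts are tokenized word lists; the sampler tends to empty out superfluous clusters, which is how GSDMM infers the number of topics from an upper bound K.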
On the other hand, the open-source pretrained word-vector set used by the keyword extraction module 106 may comprise at least one of the pretrained sets wiki-news-300d-1M.vec, wiki-news-300d-1M-subword, crawl-300d-2M.vec and crawl-300d-2M-subword. Because word-vector training cannot perform well on small sample sizes, an open-source pretrained word-vector set is introduced to improve analysis efficiency. Such pretrained sets are likewise well known in the art and are not detailed here for brevity.
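For illustration, files such as wiki-news-300d-1M.vec use a plain-text word-vector format: a "count dimension" header line, then one word per line followed by its vector components. A minimal parser under that assumption, exercised on a tiny in-memory stand-in rather than the real 1M-word file:

```python
import io

def load_vec(stream):
    """Parse the plain-text .vec word-vector format:
    a 'count dim' header, then one 'word v1 v2 ... vdim' line per word."""
    count, dim = (int(x) for x in stream.readline().split())
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    assert len(vectors) == count and all(len(v) == dim for v in vectors.values())
    return vectors

# Tiny stand-in for a real .vec file (2 words, 3 dimensions).
sample = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
vecs = load_vec(sample)
print(vecs["hello"])  # [0.1, 0.2, 0.3]
```

In practice a library loader (e.g., gensim's word2vec-format reader) would be used instead of a hand-rolled parser; this sketch only shows what the file contains.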
In this case, the keyword extraction module 106 executes the fitness algorithm and decides the keywords corresponding to a specific category among C1~Cn according to at least one of the summed similarities, or the proportions of texts passing a threshold, of the candidate keyword sets corresponding to that category. In detail, the module 106 uses the open-source pretrained word-vector set to compute the similarity between each candidate keyword set and every preprocessed text in the specific category, and then either sums the similarities or sets a threshold (e.g., 0.3) and computes the proportion of texts passing it (e.g., if 70 texts have similarity above 0.3 out of 100 texts in the category, the proportion is 0.7). The candidate keyword set with the highest similarity sum, or the highest pass proportion, is taken as the keywords corresponding to the specific category. The similarity between keywords and a text is computed by mapping both into the vector space and measuring the difference between the resulting vectors; for example, ["hello", "world"] versus ["hi", "world"] may yield a similarity of 0.76. Similarity computation is likewise well known in the art and is not detailed here for brevity.
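The two fitness measures can be sketched as follows, under the assumption (one common reading of the vector-space mapping described above) that a keyword set and a text are each represented by the average of their word vectors and compared by cosine similarity. The toy two-dimensional vectors are invented for illustration.

```python
import math

def avg_vector(words, vecs):
    """Average word vector of a token list (assumes every word is in vocabulary)."""
    rows = [vecs[w] for w in words if w in vecs]
    return [sum(col) / len(rows) for col in zip(*rows)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def fitness(keywords, texts, vecs, threshold=0.3):
    """Similarity sum and pass-threshold proportion of one candidate keyword set
    against every preprocessed text of a category."""
    kv = avg_vector(keywords, vecs)
    sims = [cosine(kv, avg_vector(t, vecs)) for t in texts]
    return sum(sims), sum(s > threshold for s in sims) / len(sims)

toy_vecs = {"hello": [1.0, 0.0], "hi": [0.9, 0.1], "world": [0.0, 1.0]}
total, ratio = fitness(["hello", "world"], [["hi", "world"]], toy_vecs)
print(total, ratio)
```

The exact similarity value (such as the 0.76 in the example above) depends entirely on the pretrained vectors used; the toy numbers here only demonstrate the mechanics.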
It is worth noting that the above embodiment mainly executes the fitness algorithm to take the candidate keyword set with the highest similarity sum, or the highest pass proportion, within a specific category as the keywords of that category, extracts the keywords KW1~KWn of categories C1~Cn accordingly, and then builds a word-vector short-text classifier to classify subsequent texts; those of ordinary skill in the art may modify or vary this accordingly, without limitation thereto. For example, after the first extraction of the keywords KW1~KWn, the above embodiment merges those categories among C1~Cn whose keyword vector distances are below a threshold, to correct poor representativeness; in other embodiments, however, the merge may be omitted. In addition, because the word-vector short-text classifier is built from the categories C1~Cn and keywords KW1~KWn, in practical applications the similarity classification module 110 can also build the classifier with categories and keywords adjusted as needed (e.g., categories added or removed), giving better flexibility.
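A minimal sketch of the resulting word-vector short-text classifier: a new text is assigned to the category whose keyword set has the most similar averaged vector, so adding or removing a category only edits a dictionary, with no retraining. The category names and toy vectors below are invented for illustration.

```python
import math

def centroid(words, vecs):
    """Average word vector of a token list (assumes in-vocabulary words)."""
    rows = [vecs[w] for w in words if w in vecs]
    return [sum(col) / len(rows) for col in zip(*rows)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def classify(tokens, category_keywords, vecs):
    """Return the category whose keyword set is closest to the text."""
    tv = centroid(tokens, vecs)
    return max(category_keywords,
               key=lambda c: cosine(tv, centroid(category_keywords[c], vecs)))

toy_vecs = {"password": [1.0, 0.0], "login": [0.9, 0.1],
            "invoice": [0.0, 1.0], "billing": [0.1, 0.9]}
cats = {"account": ["password", "login"], "payment": ["invoice", "billing"]}
print(classify(["login"], cats, toy_vecs))  # account
```

Adjusting the classifier for a new business need is then a dictionary edit, e.g. `cats["loan"] = ["mortgage", "rate"]`, which matches the flexibility claim above.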
On the other hand, the above embodiment takes, among the candidate keyword sets, the best candidate with the highest similarity sum or the highest pass proportion as the keywords of a specific category. In one embodiment, the candidate keyword sets may be generated by a genetic algorithm, and the fitness algorithm takes the best candidate as the keywords of the specific category once the genetic algorithm has executed a specified number of iterations, or once the best candidate's highest similarity sum or highest pass proportion exceeds a fitness threshold.
For example, the genetic algorithm may encode a plurality of words of the specific category to produce parent candidate keyword sets, e.g., choosing any 3 of 5 words as one parent set, and produce 10 to 20 parent sets. The fitness algorithm then computes the similarity sum or pass proportion of the 10 to 20 candidate sets and retains those parents whose similarity sum or pass proportion exceeds a given value (e.g., a pass proportion above 0.8); the genetic algorithm then mates the retained candidates. For example, the presence or absence of the 5 words can be represented by a binary code: for a word originally present in one retained parent and absent in the other, crossover may make it present in both (as with the first word), absent in both (as with the third word), or exchange its presence between the two parents (as with the fourth word):
In addition, the genetic algorithm may also mutate, randomly flipping with a certain (deliberately low) probability the presence or absence of a particular word in a retained candidate; e.g., a retained candidate that originally lacked the third word mutates to include it. In this way, the genetic algorithm can mate or mutate the retained candidates, forming a loop that produces new candidates whose similarity sums or pass proportions the fitness algorithm computes in turn. Once the genetic algorithm has executed a specified number of iterations, or the best candidate's highest similarity sum or highest pass proportion exceeds the fitness threshold, the fitness algorithm takes the best candidate as the keywords of the specific category and the genetic-algorithm loop ends.
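The loop just described (binary encoding, selection, crossover, low-probability mutation, and the two stopping conditions) can be sketched as follows. The fitness function is passed in as a parameter, standing for the similarity-sum or pass-proportion measure above; the specific selection scheme (keeping the top half) is an assumption of this sketch, not stated in the patent.

```python
import random

def genetic_keywords(words, fitness, pop_size=15, pick=3,
                     mut_rate=0.05, target=0.8, max_iters=50, seed=0):
    """Binary-encoded GA over word subsets: uniform crossover plus
    low-probability bit-flip mutation, stopping after max_iters or when
    the best candidate's fitness reaches the target threshold."""
    rng = random.Random(seed)
    def random_mask():
        mask = [0] * len(words)
        for i in rng.sample(range(len(words)), pick):  # e.g. any 3 of 5 words
            mask[i] = 1
        return mask
    def decode(mask):
        return [w for w, bit in zip(words, mask) if bit]
    pop = [random_mask() for _ in range(pop_size)]
    for _ in range(max_iters):
        scored = sorted(pop, key=lambda m: fitness(decode(m)), reverse=True)
        if fitness(decode(scored[0])) >= target:
            break                                       # fitness-threshold stop
        keep = scored[: pop_size // 2]                  # retained candidates
        children = []
        while len(keep) + len(children) < pop_size:
            a, b = rng.sample(keep, 2)
            child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]  # crossover
            child = [bit ^ (rng.random() < mut_rate) for bit in child]      # mutation
            children.append(child)
        pop = keep + children
    return decode(max(pop, key=lambda m: fitness(decode(m))))
```

A toy run: with `words = ["refund", "invoice", "billing", "hello", "weather"]` and a fitness that rewards overlap with a hypothetical target set, the loop converges on the target words within a few generations.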
Therefore, the operation of the device 10 can be summarized as an automatic short-text extraction, classification and keyword flow 60, shown in FIG. 6, comprising the following steps:
Step 600: Start.
Step 602: Input a plurality of short texts.
Step 604: Preprocess the short texts to produce a plurality of preprocessed short texts.
Step 606: Divide the preprocessed short texts into a plurality of categories of different topics according to a text topic analysis model, and output the preprocessed short texts and the corresponding category labels.
Step 608: Extract the keyword sets corresponding to the categories according to the preprocessed short texts, the category labels, an open-source pretrained word-vector set and a fitness algorithm.
Step 610: Build a word-vector short-text classifier according to the categories and the keyword sets.
Step 612: End.
For the detailed operation of flow 60, refer to the description of the device 10 above; it is not repeated here for brevity.
In addition, the device 10 may comprise a processing device and a storage unit. The processing device may be a microprocessor or an application-specific integrated circuit (ASIC). The storage unit may be any data storage device storing program code that the processing device reads and executes to perform the functions of the preprocessing module 102, the topic classification module 104, the keyword extraction module 106, the keyword merging module 108 and the similarity classification module 110, thereby completing the steps of flow 60. The storage unit may be a subscriber identity module (SIM), read-only memory (ROM), random-access memory (RAM), CD-ROMs, magnetic tapes, floppy disks, optical data storage devices and so on, without limitation thereto.
In summary, the present invention executes a fitness algorithm with an open-source pretrained word-vector set to extract the keywords corresponding to each category from a small sample, and then builds a word-vector short-text classifier from those keywords to classify subsequent texts; this is more flexible and saves resources, since neither word vectors nor a classifier need be trained.
The above are merely preferred embodiments of the present invention; all equivalent changes and modifications made within the scope of the claims of the present invention shall fall within the scope of the present invention.
Claims (8)
Priority Applications (2)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW109139905A (TWI748749B) | 2020-11-16 | 2020-11-16 | Method for automatic extraction of text classification and text keyword and device using same |
| CN202111356362.0A (CN114510565B) | 2020-11-16 | 2021-11-16 | Short article automatic extraction classification and keyword method and device using the method |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW109139905A (TWI748749B) | 2020-11-16 | 2020-11-16 | Method for automatic extraction of text classification and text keyword and device using same |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| TWI748749B | 2021-12-01 |
| TW202221528A | 2022-06-01 |
Family

ID=80680977

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| TW109139905A (TWI748749B) | Method for automatic extraction of text classification and text keyword and device using same | 2020-11-16 | 2020-11-16 |
Country Status (2)

| Country | Link |
|---|---|
| CN | CN114510565B |
| TW | TWI748749B |
Citations (2)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN104063472A | 2014-06-30 | 2014-09-24 | 电子科技大学 | KNN text classifying method for optimizing training sample set |
| CN107085581A | 2016-02-16 | 2017-08-22 | 腾讯科技(深圳)有限公司 | Short text classification method and device |
Family Cites Families (11)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2003016106A | 2001-06-29 | 2003-01-17 | Fuji Xerox Co Ltd | Device for calculating degree of association value |
| JP4466334B2 | 2004-11-08 | 2010-05-26 | 日本電信電話株式会社 | Information classification method and apparatus, program, and storage medium storing program |
| JP5311378B2 | 2008-06-26 | 2013-10-09 | 国立大学法人京都大学 | Feature word automatic learning system, content-linked advertisement distribution computer system, search-linked advertisement distribution computer system, text classification computer system, and computer programs and methods thereof |
| CA2638558C | 2008-08-08 | 2013-03-05 | Bloorview Kids Rehab | Topic word generation method and system |
| KR101536520B1 | 2014-04-28 | 2015-07-14 | 숭실대학교산학협력단 | Method and server for extracting topic and evaluating compatibility of the extracted topic |
| CN106156204B | 2015-04-23 | 2020-05-29 | 深圳市腾讯计算机系统有限公司 | Text label extraction method and device |
| CN105843795B | 2016-03-21 | 2019-05-14 | 华南理工大学 | Document keyword extraction method and system based on topic model |
| CN108334533B | 2017-10-20 | 2021-12-24 | 腾讯科技(深圳)有限公司 | Keyword extraction method and device, storage medium and electronic device |
| CN109508378B | 2018-11-26 | 2023-07-14 | 平安科技(深圳)有限公司 | A sample data processing method and device |
| CN109885680B | 2019-01-22 | 2020-05-19 | 仲恺农业工程学院 | A method, system and device for short text classification preprocessing based on sememe expansion |
| CN110457707B | 2019-08-16 | 2023-01-17 | 秒针信息技术有限公司 | Method and device for extracting real word keywords, electronic equipment and readable storage medium |

Application events:

- 2020-11-16: TW application TW109139905A filed; published as TWI748749B (active)
- 2021-11-16: CN application CN202111356362.0A filed; published as CN114510565B (active)
Also Published As

| Publication Number | Publication Date |
|---|---|
| TW202221528A | 2022-06-01 |
| CN114510565A | 2022-05-17 |
| CN114510565B | 2025-04-25 |