TWI748749B - Method for automatic extraction of text classification and text keyword and device using same - Google Patents
- Publication number: TWI748749B (application TW109139905A)
- Authority
- TW
- Taiwan
- Prior art keywords
- keywords
- classification
- categories
- keyword
- automatic extraction
- Prior art date
Classifications
- (all under G—PHYSICS › G06—COMPUTING OR CALCULATING; COUNTING › G06F—ELECTRIC DIGITAL DATA PROCESSING)
- G06F16/355 — Creation or modification of classes or clusters (information retrieval of unstructured textual data; clustering/classification)
- G06F16/3329 — Natural language query formulation
- G06F16/38 — Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/24143 — Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
- G06F40/258 — Heading extraction; Automatic titling; Numbering
- G06F40/284 — Lexical analysis, e.g. tokenisation or collocates
- G06F40/30 — Semantic analysis
Abstract
Description
The present invention relates to a method for automatically extracting classifications and keywords from short texts, and in particular to such a method that can perform word-vector classification training through keywords on a small sample size.
The training datasets of current short-text classification models are mostly news corpora, applied mainly to news gathering or online public-opinion analysis. Such corpora feature an abundant number of documents, consistent article topics, and clear syntactic structure, so they suit classification by the conventional approach of training word vectors and a classifier.
However, when a typical enterprise handles customer-service submissions (such as e-mail or message inquiries), the relatively small volume (e.g., between 10 and 1,000,000 records) makes it difficult to run ordinary clustering and classification natural language processing (NLP) models; the conventional approach of training word vectors and a classifier suits only large sample sizes. Moreover, classifiers trained by machine learning are time- and resource-intensive and, once built, cannot be fine-tuned on demand: if keywords such as product names or feature names must be added in response to business needs, the only option is to retrain, which lacks flexibility.
In view of this, the conventional technology is in need of improvement.
Therefore, a main objective of the method and device for automatic extraction of text classification and text keywords of the present invention is to select, by a fitness algorithm, the best candidate keywords as the keywords corresponding to a specific category, so that, from the perspective of word-vector analysis, the keyword set adequately represents the gist of most of the short texts, enabling word-vector classification through keywords on a small sample size.
Another main objective of the method and device of the present invention is to introduce an open-source pretrained word-vector set to solve the prior art's poor analysis efficiency when training word vectors on small sample sizes.
A further main objective of the method and device of the present invention is that the keyword extraction module can re-extract and update the keywords of merged categories, correcting the prior art's problem of poorly representative category keywords.
A further main objective of the method and device of the present invention is that, compared with prior-art classification, the present invention needs neither word-vector training nor a machine-learned classifier, thereby saving system resources.
A further main objective of the method and device of the present invention is, by adopting an open-source pretrained word-vector set together with a fitness algorithm, to extract the best sets of keywords and classify with a word-vector similarity classification module, so that when keywords must later be adjusted (for instance, adding product names or feature names for business needs), no retraining is required, achieving greater agility and flexibility.
The present invention discloses a method for automatic extraction of text classification and text keywords, comprising: inputting a plurality of short texts; preprocessing the short texts to produce a plurality of preprocessed short texts; dividing the preprocessed short texts into a plurality of categories of different topics according to a text topic analysis model, and outputting the preprocessed short texts and the corresponding category labels; extracting the keyword sets corresponding to the categories according to the preprocessed short texts, the category labels, an open-source pretrained word-vector set and a fitness algorithm; and building a word-vector short-text classifier according to the categories and the keyword sets.
10: automatic short-text extraction, classification and keyword device
102: preprocessing module
104: short-text topic classification module
106: keyword extraction module
108: keyword merging module
110: similarity classification module
12: input device
14: output device
T1~Tm: short texts
T1'~Tm': preprocessed short texts
C1~Cn: categories
TI1~TIm: category labels
KW1~KWn: keywords
60: flow
600~612: steps
FIG. 1 is a schematic diagram of an automatic short-text extraction, classification and keyword device according to an embodiment of the present invention.
FIGS. 2 to 5 are schematic diagrams of the operation of the device of FIG. 1 according to an embodiment of the present invention.
FIG. 6 is a schematic diagram of an automatic short-text extraction, classification and keyword flow according to an embodiment of the present invention.
Please refer to FIG. 1, a schematic diagram of an automatic short-text extraction, classification and keyword device 10 according to an embodiment of the present invention. As shown in FIG. 1, the device 10 comprises a preprocessing module 102, a short-text topic classification module 104, a keyword extraction module 106, a keyword merging module 108 and a similarity classification module 110. In brief, an input device 12 feeds short texts T1~Tm to the preprocessing module 102, which performs preprocessing such as tokenization, stop-word removal, case handling, lemmatization and verb-form normalization to produce preprocessed texts T1'~Tm'. The topic classification module 104 divides T1'~Tm' into categories C1~Cn of different topics according to a text topic analysis model and outputs T1'~Tm' with the corresponding category labels TI1~TIm. The keyword extraction module 106 extracts the keywords KW1~KWn corresponding to categories C1~Cn according to T1'~Tm', the labels TI1~TIm, an open-source pretrained word-vector set and a fitness algorithm. The keyword merging module 108 evaluates the vector distances between the keywords KW1~KWn and merges those categories among C1~Cn whose corresponding keywords lie within a distance threshold; the keyword extraction module 106 then re-extracts and updates the keywords of the merged categories to correct poor keyword representativeness. Finally, the similarity classification module 110 computes distances from the keywords KW1~KWn and their word vectors, builds a word-vector short-text classifier and provides it to an output device 14, improving keyword and category recognition accuracy for subsequent small-sample texts (e.g., a subsequent text is judged to belong to the category whose keyword set among KW1~KWn it is most similar to). In this way, the present invention can classify by word vectors through keywords on a small sample size with greater flexibility, and saves resources by training neither word vectors nor a classifier.
Specifically, please refer to FIGS. 2 to 5, which illustrate the operation of the device 10. As shown in FIG. 2, after the preprocessing module 102 applies tokenization, stop-word removal, case handling, lemmatization and verb-form normalization, short texts T1~T6 become preprocessed texts T1'~T6' that retain only the important words in a form suitable for processing, and these are provided to the topic classification module 104. Preprocessing is well known to those of ordinary skill in the art and is not detailed here. Next, as shown in FIG. 3, the topic classification module 104 divides T1'~T6' into categories C1~C4 and assigns the category labels TI1~TI6 as C1, C2, C2, C3, C1 and C4 respectively, so that texts on similar topics fall into the same category; the keyword extraction module 106 then extracts the keywords KW1~KW4 to represent categories C1~C4. Then, as shown in FIG. 4, the keyword merging module 108 determines that the vector distance between keywords KW2 and KW3 is below the threshold and merges category C3 into category C2, combining the nearby categories; the keyword extraction module 106 re-extracts keywords for the merged category C2, i.e., updates keyword KW2 as shown in FIG. 5 (the original label TI4 is also updated to category C2).
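The preprocessing steps just described can be sketched as follows. This is a minimal illustration: the stop-word list and lemma table below are invented stand-ins, not resources named by the patent.

```python
import re

# Illustrative resources: the patent does not specify its stop-word list
# or lemmatizer, so tiny stand-ins are used here.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of"}
LEMMAS = {"running": "run", "ran": "run", "files": "file"}

def preprocess(text: str) -> list[str]:
    """Tokenize, lower-case, drop stop words, and lemmatize one short text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())       # tokenization + case handling
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    return [LEMMAS.get(t, t) for t in tokens]             # lemmatization / verb-form normalization

print(preprocess("The user is running a files backup"))   # ['user', 'run', 'file', 'backup']
```

A real deployment would substitute a proper tokenizer and lemmatizer for the language of the texts; the structure of the pipeline stays the same.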
In detail, the text topic analysis model of the topic classification module 104 may comprise at least one of a Gibbs Sampling Dirichlet Multinomial Mixture model (GSDMM), a Dirichlet Multinomial Mixture model (DMM), a Generalized Pólya Urn Dirichlet Multinomial Mixture model (GPU-DMM), Latent Dirichlet Allocation (LDA), Topic Representative Term Discovery (TRTD), a Biterm Topic Model (BTM), Latent Semantic Indexing (LSI) and Latent Semantic Analysis (LSA). For example, GSDMM first assigns the preprocessed texts T1'~Tm' to groups at random and then moves texts between groups until the texts within each of the categories C1~Cn of different topics share similar properties. GSDMM and the other methods are well known in the art and are not detailed here for brevity.
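As an illustration of the first option, a minimal GSDMM Gibbs sampler in the spirit of Yin and Wang's movie-group process might look like the sketch below. It is not the patent's implementation, and the hyperparameters are arbitrary.

```python
import random
from collections import Counter

def gsdmm(docs, K=4, alpha=0.1, beta=0.1, iters=15, seed=0):
    """Minimal GSDMM Gibbs sampler: each short text belongs to exactly one cluster."""
    rng = random.Random(seed)
    vocab = {w for d in docs for w in d}
    V, D = len(vocab), len(docs)
    z = [rng.randrange(K) for _ in docs]       # cluster label per document
    m = [0] * K                                # documents per cluster
    n = [0] * K                                # total words per cluster
    nw = [Counter() for _ in range(K)]         # per-word counts per cluster
    for d, k in zip(docs, z):
        m[k] += 1; n[k] += len(d); nw[k].update(d)
    for _ in range(iters):
        for i, d in enumerate(docs):
            k = z[i]                           # remove doc i from its cluster
            m[k] -= 1; n[k] -= len(d); nw[k].subtract(d)
            weights = []
            for c in range(K):                 # sampling weight for each cluster
                p = (m[c] + alpha) / (D - 1 + K * alpha)
                pos = 0
                for w, cnt in Counter(d).items():
                    for j in range(cnt):
                        p *= (nw[c][w] + beta + j) / (n[c] + V * beta + pos)
                        pos += 1
                weights.append(p)
            k = rng.choices(range(K), weights=weights)[0]
            z[i] = k; m[k] += 1; n[k] += len(d); nw[k].update(d)
    return z

docs = [["apple", "fruit"], ["apple", "fruit"], ["stock", "market"], ["stock", "market"]]
print(gsdmm(docs, K=2, iters=20))
```

Texts are tokenized word lists; the sampler tends to empty out superfluous clusters, which is how GSDMM infers the number of topics from an upper bound K.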
On the other hand, the open-source pretrained word-vector set used by the keyword extraction module 106 may comprise at least one of the pretrained sets wiki-news-300d-1M.vec, wiki-news-300d-1M-subword, crawl-300d-2M.vec and crawl-300d-2M-subword. Because word-vector training cannot perform well on small sample sizes, an open-source pretrained word-vector set is introduced to improve analysis efficiency. Such pretrained sets are likewise well known in the art and are not detailed here for brevity.
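For illustration, files such as wiki-news-300d-1M.vec use a plain-text word-vector format: a "count dimension" header line, then one word per line followed by its vector components. A minimal parser under that assumption, exercised on a tiny in-memory stand-in rather than the real 1M-word file:

```python
import io

def load_vec(stream):
    """Parse the plain-text .vec word-vector format:
    a 'count dim' header, then one 'word v1 v2 ... vdim' line per word."""
    count, dim = (int(x) for x in stream.readline().split())
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    assert len(vectors) == count and all(len(v) == dim for v in vectors.values())
    return vectors

# Tiny stand-in for a real .vec file (2 words, 3 dimensions).
sample = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
vecs = load_vec(sample)
print(vecs["hello"])  # [0.1, 0.2, 0.3]
```

In practice a library loader (e.g., gensim's word2vec-format reader) would be used instead of a hand-rolled parser; this sketch only shows what the file contains.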
In this case, the keyword extraction module 106 executes the fitness algorithm and decides the keywords corresponding to a specific category among C1~Cn according to at least one of the summed similarities, or the proportions of texts passing a threshold, of the candidate keyword sets corresponding to that category. In detail, the module 106 uses the open-source pretrained word-vector set to compute the similarity between each candidate keyword set and every preprocessed text in the specific category, and then either sums the similarities or sets a threshold (e.g., 0.3) and computes the proportion of texts passing it (e.g., if 70 texts have similarity above 0.3 out of 100 texts in the category, the proportion is 0.7). The candidate keyword set with the highest similarity sum, or the highest pass proportion, is taken as the keywords corresponding to the specific category. The similarity between keywords and a text is computed by mapping both into the vector space and measuring the difference between the resulting vectors; for example, ["hello", "world"] versus ["hi", "world"] may yield a similarity of 0.76. Similarity computation is likewise well known in the art and is not detailed here for brevity.
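The two fitness measures can be sketched as follows, under the assumption (one common reading of the vector-space mapping described above) that a keyword set and a text are each represented by the average of their word vectors and compared by cosine similarity. The toy two-dimensional vectors are invented for illustration.

```python
import math

def avg_vector(words, vecs):
    """Average word vector of a token list (assumes every word is in vocabulary)."""
    rows = [vecs[w] for w in words if w in vecs]
    return [sum(col) / len(rows) for col in zip(*rows)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def fitness(keywords, texts, vecs, threshold=0.3):
    """Similarity sum and pass-threshold proportion of one candidate keyword set
    against every preprocessed text of a category."""
    kv = avg_vector(keywords, vecs)
    sims = [cosine(kv, avg_vector(t, vecs)) for t in texts]
    return sum(sims), sum(s > threshold for s in sims) / len(sims)

toy_vecs = {"hello": [1.0, 0.0], "hi": [0.9, 0.1], "world": [0.0, 1.0]}
total, ratio = fitness(["hello", "world"], [["hi", "world"]], toy_vecs)
print(total, ratio)
```

The exact similarity value (such as the 0.76 in the example above) depends entirely on the pretrained vectors used; the toy numbers here only demonstrate the mechanics.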
It is worth noting that the above embodiment mainly executes the fitness algorithm to take the candidate keyword set with the highest similarity sum, or the highest pass proportion, within a specific category as the keywords of that category, extracts the keywords KW1~KWn of categories C1~Cn accordingly, and then builds a word-vector short-text classifier to classify subsequent texts; those of ordinary skill in the art may modify or vary this accordingly, without limitation thereto. For example, after the first extraction of the keywords KW1~KWn, the above embodiment merges those categories among C1~Cn whose keyword vector distances are below a threshold, to correct poor representativeness; in other embodiments, however, the merge may be omitted. In addition, because the word-vector short-text classifier is built from the categories C1~Cn and keywords KW1~KWn, in practical applications the similarity classification module 110 can also build the classifier with categories and keywords adjusted as needed (e.g., categories added or removed), giving better flexibility.
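A minimal sketch of the resulting word-vector short-text classifier: a new text is assigned to the category whose keyword set has the most similar averaged vector, so adding or removing a category only edits a dictionary, with no retraining. The category names and toy vectors below are invented for illustration.

```python
import math

def centroid(words, vecs):
    """Average word vector of a token list (assumes in-vocabulary words)."""
    rows = [vecs[w] for w in words if w in vecs]
    return [sum(col) / len(rows) for col in zip(*rows)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def classify(tokens, category_keywords, vecs):
    """Return the category whose keyword set is closest to the text."""
    tv = centroid(tokens, vecs)
    return max(category_keywords,
               key=lambda c: cosine(tv, centroid(category_keywords[c], vecs)))

toy_vecs = {"password": [1.0, 0.0], "login": [0.9, 0.1],
            "invoice": [0.0, 1.0], "billing": [0.1, 0.9]}
cats = {"account": ["password", "login"], "payment": ["invoice", "billing"]}
print(classify(["login"], cats, toy_vecs))  # account
```

Adjusting the classifier for a new business need is then a dictionary edit, e.g. `cats["loan"] = ["mortgage", "rate"]`, which matches the flexibility claim above.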
On the other hand, the above embodiment takes, among the candidate keyword sets, the best candidate with the highest similarity sum or the highest pass proportion as the keywords of a specific category. In one embodiment, the candidate keyword sets may be generated by a genetic algorithm, and the fitness algorithm takes the best candidate as the keywords of the specific category once the genetic algorithm has executed a specified number of iterations, or once the best candidate's highest similarity sum or highest pass proportion exceeds a fitness threshold.
For example, the genetic algorithm may encode a plurality of words of the specific category to produce parent candidate keyword sets, e.g., choosing any 3 of 5 words as one parent set, and produce 10 to 20 parent sets. The fitness algorithm then computes the similarity sum or pass proportion of the 10 to 20 candidate sets and retains those parents whose similarity sum or pass proportion exceeds a given value (e.g., a pass proportion above 0.8); the genetic algorithm then mates the retained candidates. For example, the presence or absence of the 5 words can be represented by a binary code: for a word originally present in one retained parent and absent in the other, crossover may make it present in both (as with the first word), absent in both (as with the third word), or exchange its presence between the two parents (as with the fourth word):
In addition, the genetic algorithm may also mutate, randomly flipping with a certain (deliberately low) probability the presence or absence of a particular word in a retained candidate; e.g., a retained candidate that originally lacked the third word mutates to include it. In this way, the genetic algorithm can mate or mutate the retained candidates, forming a loop that produces new candidates whose similarity sums or pass proportions the fitness algorithm computes in turn. Once the genetic algorithm has executed a specified number of iterations, or the best candidate's highest similarity sum or highest pass proportion exceeds the fitness threshold, the fitness algorithm takes the best candidate as the keywords of the specific category and the genetic-algorithm loop ends.
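The loop just described (binary encoding, selection, crossover, low-probability mutation, and the two stopping conditions) can be sketched as follows. The fitness function is passed in as a parameter, standing for the similarity-sum or pass-proportion measure above; the specific selection scheme (keeping the top half) is an assumption of this sketch, not stated in the patent.

```python
import random

def genetic_keywords(words, fitness, pop_size=15, pick=3,
                     mut_rate=0.05, target=0.8, max_iters=50, seed=0):
    """Binary-encoded GA over word subsets: uniform crossover plus
    low-probability bit-flip mutation, stopping after max_iters or when
    the best candidate's fitness reaches the target threshold."""
    rng = random.Random(seed)
    def random_mask():
        mask = [0] * len(words)
        for i in rng.sample(range(len(words)), pick):  # e.g. any 3 of 5 words
            mask[i] = 1
        return mask
    def decode(mask):
        return [w for w, bit in zip(words, mask) if bit]
    pop = [random_mask() for _ in range(pop_size)]
    for _ in range(max_iters):
        scored = sorted(pop, key=lambda m: fitness(decode(m)), reverse=True)
        if fitness(decode(scored[0])) >= target:
            break                                       # fitness-threshold stop
        keep = scored[: pop_size // 2]                  # retained candidates
        children = []
        while len(keep) + len(children) < pop_size:
            a, b = rng.sample(keep, 2)
            child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]  # crossover
            child = [bit ^ (rng.random() < mut_rate) for bit in child]      # mutation
            children.append(child)
        pop = keep + children
    return decode(max(pop, key=lambda m: fitness(decode(m))))
```

A toy run: with `words = ["refund", "invoice", "billing", "hello", "weather"]` and a fitness that rewards overlap with a hypothetical target set, the loop converges on the target words within a few generations.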
Therefore, the operation of the device 10 can be summarized as an automatic short-text extraction, classification and keyword flow 60, shown in FIG. 6, comprising the following steps:
Step 600: Start.
Step 602: Input a plurality of short texts.
Step 604: Preprocess the short texts to produce a plurality of preprocessed short texts.
Step 606: Divide the preprocessed short texts into a plurality of categories of different topics according to a text topic analysis model, and output the preprocessed short texts and the corresponding category labels.
Step 608: Extract the keyword sets corresponding to the categories according to the preprocessed short texts, the category labels, an open-source pretrained word-vector set and a fitness algorithm.
Step 610: Build a word-vector short-text classifier according to the categories and the keyword sets.
Step 612: End.
For the detailed operation of flow 60, refer to the description of the device 10 above; it is not repeated here for brevity.
In addition, the device 10 may comprise a processing device and a storage unit. The processing device may be a microprocessor or an application-specific integrated circuit (ASIC). The storage unit may be any data storage device storing program code that the processing device reads and executes to perform the functions of the preprocessing module 102, the topic classification module 104, the keyword extraction module 106, the keyword merging module 108 and the similarity classification module 110, thereby completing the steps of flow 60. The storage unit may be a subscriber identity module (SIM), read-only memory (ROM), random-access memory (RAM), CD-ROMs, magnetic tapes, floppy disks, optical data storage devices and so on, without limitation thereto.
In summary, the present invention executes a fitness algorithm with an open-source pretrained word-vector set to extract the keywords corresponding to each category from a small sample, and then builds a word-vector short-text classifier from those keywords to classify subsequent texts; this is more flexible and saves resources, since neither word vectors nor a classifier need be trained.
The above are merely preferred embodiments of the present invention; all equivalent changes and modifications made within the scope of the claims of the present invention shall fall within the scope of the present invention.
Claims (8)
Priority Applications (2)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW109139905A (TWI748749B) | 2020-11-16 | 2020-11-16 | Method for automatic extraction of text classification and text keyword and device using same |
| CN202111356362.0A (CN114510565B) | 2020-11-16 | 2021-11-16 | Short article automatic extraction classification and keyword method and device using the method |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW109139905A (TWI748749B) | 2020-11-16 | 2020-11-16 | Method for automatic extraction of text classification and text keyword and device using same |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| TWI748749B | 2021-12-01 |
| TW202221528A | 2022-06-01 |
Family

ID=80680977

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| TW109139905A (TWI748749B) | Method for automatic extraction of text classification and text keyword and device using same | 2020-11-16 | 2020-11-16 |
Country Status (2)

| Country | Link |
|---|---|
| CN | CN114510565B |
| TW | TWI748749B |
Citations (2)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN104063472A | 2014-06-30 | 2014-09-24 | 电子科技大学 | KNN text classifying method for optimizing training sample set |
| CN107085581A | 2016-02-16 | 2017-08-22 | 腾讯科技(深圳)有限公司 | Short text classification method and device |
Family Cites Families (11)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2003016106A | 2001-06-29 | 2003-01-17 | Fuji Xerox Co Ltd | Device for calculating degree of association value |
| JP4466334B2 | 2004-11-08 | 2010-05-26 | 日本電信電話株式会社 | Information classification method and apparatus, program, and storage medium storing program |
| JP5311378B2 | 2008-06-26 | 2013-10-09 | 国立大学法人京都大学 | Feature word automatic learning system, content-linked advertisement distribution computer system, search-linked advertisement distribution computer system, text classification computer system, and computer programs and methods thereof |
| CA2638558C | 2008-08-08 | 2013-03-05 | Bloorview Kids Rehab | Topic word generation method and system |
| KR101536520B1 | 2014-04-28 | 2015-07-14 | 숭실대학교산학협력단 | Method and server for extracting topic and evaluating compatibility of the extracted topic |
| CN106156204B | 2015-04-23 | 2020-05-29 | 深圳市腾讯计算机系统有限公司 | Text label extraction method and device |
| CN105843795B | 2016-03-21 | 2019-05-14 | 华南理工大学 | Document keyword extraction method and system based on topic model |
| CN108334533B | 2017-10-20 | 2021-12-24 | 腾讯科技(深圳)有限公司 | Keyword extraction method and device, storage medium and electronic device |
| CN109508378B | 2018-11-26 | 2023-07-14 | 平安科技(深圳)有限公司 | A sample data processing method and device |
| CN109885680B | 2019-01-22 | 2020-05-19 | 仲恺农业工程学院 | A method, system and device for short text classification preprocessing based on sememe expansion |
| CN110457707B | 2019-08-16 | 2023-01-17 | 秒针信息技术有限公司 | Method and device for extracting real word keywords, electronic equipment and readable storage medium |

Application events:

- 2020-11-16: TW application TW109139905A filed; published as TWI748749B (active)
- 2021-11-16: CN application CN202111356362.0A filed; published as CN114510565B (active)
Also Published As

| Publication Number | Publication Date |
|---|---|
| TW202221528A | 2022-06-01 |
| CN114510565A | 2022-05-17 |
| CN114510565B | 2025-04-25 |