
TWI748749B - Method for automatic extraction of text classification and text keyword and device using same - Google Patents

Method for automatic extraction of text classification and text keyword and device using same

Info

Publication number
TWI748749B
TWI748749B (application TW109139905A)
Authority
TW
Taiwan
Prior art keywords
keywords
classification
categories
keyword
automatic extraction
Prior art date
Application number
TW109139905A
Other languages
Chinese (zh)
Other versions
TW202221528A (en)
Inventor
張凱喬
黃戎歆
曾文彥
Original Assignee
威聯通科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 威聯通科技股份有限公司
Priority to TW109139905A priority Critical patent/TWI748749B/en
Priority to CN202111356362.0A priority patent/CN114510565B/en
Application granted granted Critical
Publication of TWI748749B publication Critical patent/TWI748749B/en
Publication of TW202221528A publication Critical patent/TW202221528A/en


Classifications

    • G06F16/355 Creation or modification of classes or clusters
    • G06F16/3329 Natural language query formulation
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24143 Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a method for automatic extraction of text classifications and text keywords. The method includes: inputting a plurality of texts; preprocessing the plurality of texts to generate a plurality of preprocessed texts; classifying the plurality of preprocessed texts into a plurality of classifications with different topics according to a text topic analysis model, and outputting the plurality of preprocessed texts and a plurality of corresponding classification indexes; extracting a plurality of sets of keywords corresponding to the plurality of classifications according to the plurality of preprocessed texts, the plurality of classification indexes, an open-source word-vector pretraining set, and a fitness algorithm; and establishing a word-vector text classifier according to the plurality of classifications and the plurality of sets of keywords.

Description

Method for automatic extraction of classifications and keywords from short texts, and device using the same

The present invention relates to a method for automatically extracting classifications and keywords from short texts, and more particularly to such a method that can perform word-vector classification through keywords on small sample sizes.

The training datasets of current short-text classification models are mostly news databases, applied mainly to news gathering or online public-opinion analysis. Such corpora have a sufficient number of texts, consistent article topics, and clear syntactic structure, so they are well suited to classification by the conventional approach of training word vectors and a classifier.

However, when a typical enterprise processes customer-service submissions (such as questions arriving by e-mail or message), the volume is relatively small (e.g., between 100 thousand and 1 million items), making it difficult to run general clustering and classification natural language processing (NLP) models: the conventional approach of training word vectors and a classifier for classification is suitable only for large sample sizes. Moreover, a classifier trained by machine learning in the conventional manner is time- and resource-consuming, and once built it cannot be fine-tuned on demand; if keywords such as product names or feature names must be added to meet business needs, the only option is to retrain, which lacks flexibility.

In view of this, the conventional technology is in need of improvement.

Therefore, a main objective of the present method and device for automatic classification and keyword extraction of short texts is to use a fitness algorithm to select the best candidate keywords as the keywords corresponding to a specific category, so that, from the perspective of word-vector analysis, this keyword set is sufficient to represent the gist of most short texts, enabling word-vector classification through keywords on small sample sizes.

Another main objective of the present method and device is to introduce an open-source word-vector pretraining set, solving the prior-art problem of poor analysis efficiency when word vectors are trained on small sample sizes.

A further main objective of the present method and device is a keyword extraction module that re-extracts and updates the keywords of merged categories, correcting the prior-art problem of poorly representative category keywords.

A further main objective of the present method and device is that, compared with the prior art, the present invention performs classification without training word vectors or training a classifier through machine learning, thereby saving system resources.

A further main objective of the present method and device is to use an open-source word-vector pretraining set together with a fitness algorithm to obtain the best sets of keywords and classify with a word-vector similarity classification module, so that when keywords need to be adjusted in the future (for example, adding product names or feature names to meet business needs), no retraining is required, achieving greater agility and flexibility.

The present invention discloses a method for automatic classification and keyword extraction of short texts, comprising: inputting a plurality of short texts; performing preprocessing on the plurality of short texts to generate a plurality of preprocessed short texts; according to a text topic analysis model, dividing the plurality of preprocessed short texts into a plurality of categories of different topics, and outputting the plurality of preprocessed short texts and the corresponding plurality of category labels; according to the plurality of preprocessed short texts, the plurality of category labels, an open-source word-vector pretraining set, and a fitness algorithm, extracting the sets of keywords corresponding to the plurality of categories; and according to the plurality of categories and the sets of keywords, establishing a word-vector short-text classifier.

10: short-text automatic classification and keyword extraction device

102: preprocessing module

104: short-text topic classification module

106: keyword extraction module

108: keyword merging module

110: similarity classification module

12: input device

14: output device

T1~Tm: short texts

T1'~Tm': preprocessed short texts

C1~Cn: categories

TI1~TIm: category labels

KW1~KWn: keywords

60: flow

600~612: steps

FIG. 1 is a schematic diagram of a short-text automatic classification and keyword extraction device according to an embodiment of the present invention.

FIG. 2 to FIG. 5 are schematic diagrams of the operation of the device shown in FIG. 1 according to an embodiment of the present invention.

FIG. 6 is a schematic diagram of a short-text automatic classification and keyword extraction flow according to an embodiment of the present invention.

Please refer to FIG. 1, a schematic diagram of a short-text automatic classification and keyword extraction device 10 according to an embodiment of the present invention. As shown in FIG. 1, the device 10 includes a preprocessing module 102, a short-text topic classification module 104, a keyword extraction module 106, a keyword merging module 108, and a similarity classification module 110. In brief, an input device 12 feeds short texts T1~Tm to the preprocessing module 102, which applies preprocessing such as word segmentation, stop-word removal, case handling, lemmatization, and verb-form handling to produce preprocessed short texts T1'~Tm'. The short-text topic classification module 104 divides the preprocessed short texts T1'~Tm' into categories C1~Cn of different topics according to a text topic analysis model, and outputs the preprocessed texts together with the corresponding category labels TI1~TIm. The keyword extraction module 106 extracts the keywords KW1~KWn corresponding to the categories C1~Cn according to the preprocessed texts T1'~Tm', the category labels TI1~TIm, an open-source word-vector pretraining set, and a fitness algorithm. The keyword merging module 108 evaluates the vector distances between the keywords KW1~KWn and merges those categories among C1~Cn whose corresponding keywords are closer than a threshold; the keyword extraction module 106 then re-extracts and updates the keywords of the merged categories, correcting the problem of poorly representative keywords for merged categories. Finally, the similarity classification module 110 computes distances from the keywords KW1~KWn and their word vectors, establishes a word-vector short-text classifier, and provides it to an output device 14, improving keyword and classification accuracy for subsequent small-sample short texts (for example, a subsequent short text is judged to belong to the category whose keyword set among KW1~KWn it is most similar to). In this way, the present invention can flexibly perform word-vector classification through keywords on small sample sizes, and saves resources by not training word vectors or a classifier.
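The preprocessing stage described above (word segmentation, stop-word removal, case handling, and lemmatization) can be sketched as follows. This is an illustrative minimal version for English text: the stop-word subset and the crude "-ing" stripping that stands in for full lemmatization are assumptions for the example, not the patent's actual implementation.

```python
import re

# Illustrative stop-word subset; a real module would use a full list
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and"}

def preprocess(text: str) -> list:
    # Case handling + word segmentation (tokenization) for English text
    tokens = re.findall(r"[a-z']+", text.lower())
    # Stop-word removal
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Crude "-ing" stripping standing in for lemmatization / verb-form handling
    return [t[:-3] if t.endswith("ing") and len(t) > 5 else t for t in tokens]

print(preprocess("The server IS restarting and reporting errors"))
# → ['server', 'restart', 'report', 'errors']
```

The output keeps only the important, normalized words, which is what the modules downstream of the preprocessing module consume.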

Specifically, please refer to FIG. 2 to FIG. 5, schematic diagrams of the operation of the device 10 according to an embodiment of the present invention. As shown in FIG. 2, after the preprocessing module 102 applies word segmentation, stop-word removal, case handling, lemmatization, and verb-form handling, the short texts T1~T6 yield preprocessed short texts T1'~T6' that retain only the important words in a form suitable for processing, which are provided to the short-text topic classification module 104; such preprocessing is well known to those skilled in the art and is not repeated here. Next, as shown in FIG. 3, the topic classification module 104 divides the preprocessed texts T1'~T6' into categories C1~C4 and assigns the corresponding category labels TI1~TI6 as C1, C2, C2, C3, C1, C4, so that texts with similar topics are placed in the same category; the keyword extraction module 106 then extracts the keywords KW1~KW4 corresponding to categories C1~C4 to represent each category. Then, as shown in FIG. 4, the keyword merging module 108 determines that the vector distance between keywords KW2 and KW3 is less than the threshold and merges category C3 into category C2, combining the close categories; the keyword extraction module 106 then extracts keywords for the merged category C2, i.e., updates keyword KW2 as shown in FIG. 5 (the original category label TI4 is also updated to category C2).

In detail, the text topic analysis model of the short-text topic classification module 104 may include at least one of the Gibbs Sampling Dirichlet Multinomial Mixture model (GSDMM), the Dirichlet Multinomial Mixture model (DMM), the Generalized Polya Urn Dirichlet Multinomial Mixture (GPU-DMM), Latent Dirichlet Allocation (LDA), Topic Representative Term Discovery (TRTD), the Biterm Topic Model (BTM), Latent Semantic Indexing (LSI), and Latent Semantic Analysis (LSA). For example, GSDMM first groups the preprocessed short texts T1'~Tm' randomly, then moves each text between clusters until the texts within each of the topic categories C1~Cn are similar in nature; GSDMM and the other approaches are well known in the art and are not detailed here for brevity.
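As one concrete possibility, the GSDMM-style "move texts between clusters" step can be sketched as a toy Gibbs-sampling DMM like the one below. The hyperparameters, the simplified cluster-scoring formula, and the toy scale are all assumptions for illustration; a faithful GSDMM implementation uses the exact conditional distribution from the original algorithm.

```python
import random
from collections import Counter

def gsdmm_sketch(docs, k=4, alpha=0.1, beta=0.1, iters=15, seed=0):
    """Toy Gibbs-sampling DMM: each short text belongs to exactly one
    cluster; texts are repeatedly reassigned to clusters scored by how
    well the cluster's word counts explain the text."""
    rng = random.Random(seed)
    vocab = {w for d in docs for w in d}
    z = [rng.randrange(k) for _ in docs]     # cluster label per document
    m = Counter(z)                           # documents per cluster
    n = [Counter() for _ in range(k)]        # word counts per cluster
    for d, zi in zip(docs, z):
        n[zi].update(d)
    D, V = len(docs), len(vocab)
    for _ in range(iters):
        for i, d in enumerate(docs):
            zi = z[i]                        # remove doc from its cluster
            m[zi] -= 1
            n[zi].subtract(d)
            weights = []
            for c in range(k):               # score each candidate cluster
                p = (m[c] + alpha) / (D - 1 + k * alpha)
                denom = sum(n[c].values()) + V * beta
                for j, w in enumerate(d):
                    p *= (n[c][w] + beta) / (denom + j)
                weights.append(p)
            z[i] = rng.choices(range(k), weights)[0]
            m[z[i]] += 1
            n[z[i]].update(d)
    return z
```

Calling `gsdmm_sketch([["cat", "meow"], ["cat", "purr"], ["dog", "bark"], ["dog", "woof"]], k=2)` returns one cluster label per short text; texts sharing vocabulary tend to end up in the same cluster.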

On the other hand, the open-source word-vector pretraining set used by the keyword extraction module 106 may include at least one of the pretrained sets wiki-news-300d-1M.vec, wiki-news-300d-1M-subword, crawl-300d-2M.vec, and crawl-300d-2M-subword. With small sample sizes, training word vectors from scratch does not give good results, so an open-source pretrained word-vector set is introduced to improve analysis efficiency; such pretraining sets are likewise well known in the art and are not detailed here for brevity.
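The named pretraining sets are distributed in fastText's plain-text `.vec` format: a header line with the vocabulary size and dimensionality, then one token followed by that many floats per line. A minimal loader might look like this, demonstrated on a tiny in-memory stand-in rather than the real 300-dimensional files:

```python
import io
import numpy as np

def load_vec(fileobj, limit=None):
    """Parse the fastText .vec text format into a {token: vector} dict."""
    count, dim = map(int, fileobj.readline().split())
    vecs = {}
    for i, line in enumerate(fileobj):
        if limit is not None and i >= limit:
            break
        parts = line.rstrip().split(" ")
        vecs[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vecs, dim

# Tiny stand-in; a real file such as wiki-news-300d-1M.vec has a header
# like "<~1M tokens> 300" followed by one 300-float line per token
sample = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld 0.0 1.0 0.5\n")
vecs, dim = load_vec(sample)
print(dim, vecs["hello"][0])  # → 3 0.1
```

In practice a `limit` on the number of loaded tokens keeps memory bounded, since only the vocabulary appearing in the short texts and candidate keywords is needed.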

Under this arrangement, the keyword extraction module 106 executes the fitness algorithm and determines the keywords corresponding to a specific category among C1~Cn according to at least one of the summed similarities or the proportions of texts passing a threshold for that category's candidate keyword sets. In detail, the module uses the open-source word-vector pretraining set to compute the similarity between each candidate keyword set and every preprocessed short text in the specific category, then either sums the similarities or sets a threshold (e.g., 0.3) and computes the proportion of texts passing it (e.g., if 70 texts have similarity greater than 0.3 out of 100 texts in the category, the passing proportion is 0.7); the best candidate keyword set, namely the one with the highest summed similarity or the highest passing proportion in the category, is taken as the keywords corresponding to that category. The similarity between keywords and a text is computed by mapping both into a vector space and measuring the difference between the vectors; for example, ["hello", "world"] and ["hi", "world"] may yield a similarity of 0.76. Similarity computation is likewise well known in the art and is not detailed here for brevity.
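The two fitness measures just described (summed similarity, and the proportion of texts passing a similarity threshold) can be sketched as follows, using cosine similarity over averaged word vectors. The averaging step and the toy 2-dimensional vectors are illustrative assumptions, since the patent does not fix how a multi-word set is mapped to a single vector.

```python
import numpy as np

def avg_vec(tokens, vecs):
    # Map a keyword set or a short text to one vector by averaging
    return np.mean([vecs[t] for t in tokens if t in vecs], axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def fitness(keywords, docs, vecs, threshold=0.3):
    """Return (summed similarity, proportion of texts passing the threshold)
    for one candidate keyword set against all texts of one category."""
    kv = avg_vec(keywords, vecs)
    sims = [cosine(kv, avg_vec(d, vecs)) for d in docs]
    return sum(sims), sum(s > threshold for s in sims) / len(sims)

# Toy 2-d vectors standing in for a 300-d pretrained set
vecs = {"hello": np.array([1.0, 0.0]), "world": np.array([0.0, 1.0]),
        "hi": np.array([1.0, 0.2])}
total, ratio = fitness(["hello", "world"],
                       [["hello", "world"], ["hi", "world"]], vecs)
```

Here both toy texts pass the 0.3 threshold, so the passing proportion is 1.0; comparing these two numbers across candidate keyword sets is what lets the module pick the best set for the category.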

It is worth noting that the above embodiment mainly executes the fitness algorithm to take the candidate keyword set with the highest summed similarity or the highest passing proportion as the keywords corresponding to a specific category, extracting the keywords KW1~KWn corresponding to the categories C1~Cn and then establishing a word-vector short-text classifier to classify subsequent texts; those of ordinary skill in the art may make modifications or variations accordingly, and the invention is not limited thereto. For example, after the first extraction of keywords KW1~KWn for categories C1~Cn, the above embodiment merges categories whose corresponding keywords have a vector distance below a threshold to correct poor representativeness, but in other embodiments the merge may be omitted. In addition, since the word-vector short-text classifier is built from the keywords KW1~KWn of categories C1~Cn, in practice the similarity classification module 110 may also adjust the categories C1~Cn and keywords KW1~KWn as required when building the classifier (e.g., adding or removing categories), providing better flexibility.
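The resulting word-vector short-text classifier then assigns a new text to the category whose keyword set it is most similar to. A minimal sketch, again assuming averaged word vectors, with hypothetical category names and toy vectors:

```python
import numpy as np

def classify(tokens, keyword_sets, vecs):
    """Return the category whose keyword set has the highest cosine
    similarity to the given (preprocessed) short text."""
    def avg(ts):
        return np.mean([vecs[t] for t in ts if t in vecs], axis=0)
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    tv = avg(tokens)
    return max(keyword_sets, key=lambda c: cos(tv, avg(keyword_sets[c])))

# Hypothetical categories and 2-d vectors for illustration
vecs = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0]),
        "kitten": np.array([0.9, 0.1])}
keyword_sets = {"C1": ["cat"], "C2": ["dog"]}
print(classify(["kitten"], keyword_sets, vecs))  # → C1
```

Adding or removing a category later only requires editing `keyword_sets`, with no retraining, which is the flexibility point made above.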

On the other hand, the above embodiment takes the best candidate keyword set, the one with the highest summed similarity or the highest passing proportion among the candidate keyword sets, as the keywords corresponding to a specific category. In one embodiment, the candidate keyword sets may be generated by a genetic algorithm; when the genetic algorithm has run for a specified number of iterations, or when the highest summed similarity or highest passing proportion of a best candidate keyword set exceeds a fitness threshold, the fitness algorithm takes that best candidate keyword set as the keywords corresponding to the specific category.

For example, the genetic algorithm may encode a plurality of words in a specific category to produce parent candidate keyword sets, e.g., choosing any 3 of 5 words as one parent candidate set and generating 10 to 20 such parent sets. The fitness algorithm then computes the summed similarity or passing proportion of the 10 to 20 candidate sets and retains those parents whose summed similarity or passing proportion exceeds a specific value (e.g., a passing proportion greater than 0.8); the genetic algorithm then mates the retained candidate sets. For example, the presence or absence of the 5 words can be represented in binary code as in the table below: a word originally present in one retained candidate set and absent in the other may, after crossover, be present in both (the first word) or absent in both (the third word), or its presence may be swapped between the two sets (the fourth word):

[Table 109139905-A0305-02-0009-1: binary encodings of the two keyword sets before and after crossover]

In addition, the genetic algorithm may also perform mutation, randomly changing the presence or absence of particular words in the retained candidate sets with a certain (low) probability; for example, a retained candidate set that originally lacks the third word may mutate to contain it. In this way, the genetic algorithm mates or mutates the retained candidate sets, forming a loop that produces new candidate sets, whose summed similarity or passing proportion is then computed by the fitness algorithm. Thus, when the genetic algorithm has run for a specified number of iterations, or when the highest summed similarity or highest passing proportion of a best candidate set exceeds the fitness threshold, the fitness algorithm takes that best candidate set as the keywords corresponding to the specific category and the genetic-algorithm loop ends.
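The crossover and mutation operators above can be sketched over binary keyword masks (1 = word included in the set). The five-word mask length, the parent values, and the mutation rate are assumptions for illustration:

```python
import random

def crossover(a, b, rng):
    """Single-point crossover of two binary keyword masks."""
    p = rng.randrange(1, len(a))
    return a[:p] + b[p:], b[:p] + a[p:]

def mutate(mask, rng, rate=0.1):
    """Flip each bit with a low probability (presence/absence of a word)."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in mask]

rng = random.Random(7)
# Two retained parent keyword sets over 5 hypothetical candidate words
parent_a, parent_b = [1, 0, 1, 0, 1], [0, 1, 1, 1, 0]
child_a, child_b = crossover(parent_a, parent_b, rng)
child_a = mutate(child_a, rng)
```

Each loop iteration would score the children with the fitness measures, retain the best, and repeat until the iteration budget or the fitness threshold is reached.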

Therefore, the operation of the short-text automatic classification and keyword extraction device 10 can be summarized as a flow 60, shown in FIG. 6, which includes the following steps:

Step 600: Start.

Step 602: Input a plurality of short texts.

Step 604: Perform preprocessing on the plurality of short texts to generate a plurality of preprocessed short texts.

Step 606: According to a text topic analysis model, divide the plurality of preprocessed short texts into a plurality of categories of different topics, and output the preprocessed short texts and the corresponding category labels.

Step 608: According to the preprocessed short texts, the category labels, an open-source word-vector pretraining set, and a fitness algorithm, extract the sets of keywords corresponding to the categories.

Step 610: According to the categories and the keyword sets, establish a word-vector short-text classifier.

Step 612: End.

For the detailed operation of the flow 60, reference may be made to the description of the device 10 above; it is not repeated here for brevity.

In addition, the device 10 may include a processing device and a storage unit. The processing device may be a microprocessor or an application-specific integrated circuit (ASIC). The storage unit may be any data storage device storing program code that the processing device reads and executes to perform the functions of the preprocessing module 102, the short-text topic classification module 104, the keyword extraction module 106, the keyword merging module 108, and the similarity classification module 110, thereby completing the steps of the flow 60. The storage unit may be, but is not limited to, a subscriber identity module (SIM), read-only memory (ROM), random-access memory (RAM), CD-ROMs, magnetic tapes, floppy disks, or optical data storage devices.

In summary, the present invention executes a fitness algorithm with an open-source word-vector pretraining set to extract the keywords corresponding to each category from a small sample, and then builds a word-vector short-text classifier from those keywords to classify subsequent texts; it is therefore more flexible, and saves resources by not training word vectors or a classifier.

The above are merely preferred embodiments of the present invention, and all equivalent changes and modifications made within the scope of the claims of the present invention shall fall within the scope of the present invention.

60: Process

600~612: Steps

Claims (8)

1. An automatic short-text classification and keyword extraction method, comprising: receiving a plurality of short texts; performing preprocessing on the plurality of short texts to generate a plurality of preprocessed short texts; dividing the plurality of preprocessed short texts into a plurality of categories of different topics according to a text topic analysis model, and outputting a plurality of category labels corresponding to the plurality of preprocessed short texts, each of the plurality of categories having a plurality of corresponding candidate keywords; determining the keyword corresponding to each of the plurality of categories according to the plurality of preprocessed short texts, the corresponding plurality of category labels, an open-source pretrained word-vector set, the plurality of categories with their corresponding candidate keywords, and a fitness algorithm, based on at least one of a sum of similarities or a proportion of short texts passing a threshold within each category; performing the fitness algorithm again, computing the vector distances between the keywords corresponding to the plurality of categories, merging the categories whose keyword vector distance is smaller than a threshold, and re-extracting corresponding keywords for the merged categories to update them; and building a word-vector short-text classifier according to the plurality of categories and the keyword corresponding to each category, so as to achieve automatic extraction-based classification of short texts.
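The merging step of claim 1 — collapsing categories whose extracted keywords lie within a vector-distance threshold — could be sketched roughly as below. The category names, keyword vectors, and threshold value are illustrative assumptions; re-extracting an updated keyword for each merged group is left out.

```python
import math

# Toy keyword vectors; a real system would look these up in an open-source
# pretrained word-vector set. All names and values here are assumptions.
KEYWORD_VECTORS = {
    "soccer":   [0.90, 0.10],
    "football": [0.88, 0.12],
    "banking":  [0.10, 0.90],
}

# Representative keyword per category, as chosen by the fitness algorithm.
CATEGORY_KEYWORD = {"cat_a": "soccer", "cat_b": "football", "cat_c": "banking"}

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def merge_close_categories(category_keyword, vectors, threshold=0.1):
    """Group categories whose representative keyword vectors are closer
    than `threshold`; each group would then be re-mined for one keyword."""
    names = list(category_keyword)
    groups = {name: {name} for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if euclidean(vectors[category_keyword[a]],
                         vectors[category_keyword[b]]) < threshold:
                union = groups[a] | groups[b]
                for member in union:  # keep every member pointing at the union
                    groups[member] = union
    return {frozenset(g) for g in groups.values()}

print(merge_close_categories(CATEGORY_KEYWORD, KEYWORD_VECTORS))
```

With these toy values, "soccer" and "football" sit close together, so cat_a and cat_b merge into one group while cat_c stays separate.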
2. The automatic short-text classification and keyword extraction method of claim 1, wherein the preprocessing comprises at least one of word segmentation, stop-word removal, case normalization, lemmatization, and verb-form normalization.

3. The automatic short-text classification and keyword extraction method of claim 1, wherein the text topic analysis model comprises at least one of a Gibbs Sampling Dirichlet Multinomial Mixture model (GSDMM), a Dirichlet Multinomial Mixture model (DMM), a Generalized Polya Urn Dirichlet Multinomial Mixture model (GPU-DMM), a Latent Dirichlet Allocation model (LDA), Topic Representative Term Discovery (TRTD), a Biterm Topic Model (BTM), Latent Semantic Indexing (LSI), and Latent Semantic Analysis (LSA); and the open-source pretrained word-vector set comprises at least one of the pretrained sets wiki-news-300d-1M.vec, wiki-news-300d-1M-subword, crawl-300d-2M.vec, and crawl-300d-2M-subword.
4. The automatic short-text classification and keyword extraction method of claim 1, further comprising: the fitness algorithm taking, as the keyword corresponding to a specific category, the best candidate keyword having the highest similarity sum or the highest proportion of short texts passing the threshold within the specific category.

5. The automatic short-text classification and keyword extraction method of claim 1, further comprising: generating the plurality of sets of candidate keywords according to a genetic algorithm; and, when the genetic algorithm has executed a specified number of iterations, or when the highest similarity sum or the highest threshold-passing short-text proportion of a best candidate keyword exceeds a fitness threshold, the fitness algorithm taking the best candidate keyword as the keyword corresponding to the specific category.

6. The automatic short-text classification and keyword extraction method of claim 5, wherein generating the plurality of sets of candidate keywords according to the genetic algorithm comprises: encoding a plurality of words in the specific category to generate a plurality of sets of parent candidate keywords among the plurality of sets of candidate keywords.
7. The automatic short-text classification and keyword extraction method of claim 5, further comprising: performing crossover or mutation on a plurality of retained candidate keywords whose similarities or threshold-passing short-text proportions among the plurality of sets of parent candidate keywords are greater than a specific value, to generate a plurality of new sets of candidate keywords among the plurality of sets of candidate keywords.

8. An automatic short-text classification and keyword extraction system, comprising: a processor for executing a program; and a storage unit, coupled to the processor, for storing the program; wherein the program instructs the processor to execute the automatic short-text classification and keyword extraction method of any one of claims 1 to 7.
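Claims 5 through 7 describe mining each category's keywords with a genetic algorithm: candidate keyword sets are encoded as chromosomes, high-fitness parents are retained, and crossover and mutation produce new candidates until an iteration limit or a fitness threshold is reached. A minimal sketch follows; the vocabulary, the toy fitness function (standing in for the similarity sum or threshold-passing proportion of claim 1), and all parameter values are illustrative assumptions.

```python
import random

random.seed(0)  # deterministic for the sketch

# Candidate vocabulary of one category; a chromosome is a bit mask over it.
VOCAB = ["game", "team", "score", "noise", "misc"]

# Toy fitness: reward informative words, penalize the rest. In the patent's
# terms this would be the similarity sum or threshold-passing proportion.
GOOD = {"game", "team", "score"}

def fitness(chrom):
    chosen = {w for w, bit in zip(VOCAB, chrom) if bit}
    return len(chosen & GOOD) - len(chosen - GOOD)

def crossover(a, b):
    cut = random.randrange(1, len(a))  # single-point crossover
    return a[:cut] + b[cut:]

def mutate(chrom, rate=0.1):
    return [bit ^ (random.random() < rate) for bit in chrom]

def run_ga(pop_size=8, generations=20):
    pop = [[random.randint(0, 1) for _ in VOCAB] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # retain high-fitness parents
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    best = max(pop, key=fitness)
    return {w for w, bit in zip(VOCAB, best) if bit}, fitness(best)

keywords, score = run_ga()
print(keywords, score)
```

Claim 5's stopping rule appears here only as the fixed generation count; a fuller version would also stop early once the best candidate's fitness exceeds the configured threshold.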
TW109139905A 2020-11-16 2020-11-16 Method for automatic extraction of text classification and text keyword and device using same TWI748749B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
TW109139905A TWI748749B (en) 2020-11-16 2020-11-16 Method for automatic extraction of text classification and text keyword and device using same
CN202111356362.0A CN114510565B (en) 2020-11-16 2021-11-16 Short article automatic extraction classification and keyword method and device using the method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW109139905A TWI748749B (en) 2020-11-16 2020-11-16 Method for automatic extraction of text classification and text keyword and device using same

Publications (2)

Publication Number Publication Date
TWI748749B true TWI748749B (en) 2021-12-01
TW202221528A TW202221528A (en) 2022-06-01

Family

ID=80680977

Family Applications (1)

Application Number Title Priority Date Filing Date
TW109139905A TWI748749B (en) 2020-11-16 2020-11-16 Method for automatic extraction of text classification and text keyword and device using same

Country Status (2)

Country Link
CN (1) CN114510565B (en)
TW (1) TWI748749B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104063472A (en) * 2014-06-30 2014-09-24 电子科技大学 KNN text classifying method for optimizing training sample set
CN107085581A (en) * 2016-02-16 2017-08-22 腾讯科技(深圳)有限公司 Short text classification method and device

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003016106A (en) * 2001-06-29 2003-01-17 Fuji Xerox Co Ltd Device for calculating degree of association value
JP4466334B2 (en) * 2004-11-08 2010-05-26 日本電信電話株式会社 Information classification method and apparatus, program, and storage medium storing program
JP5311378B2 (en) * 2008-06-26 2013-10-09 国立大学法人京都大学 Feature word automatic learning system, content-linked advertisement distribution computer system, search-linked advertisement distribution computer system, text classification computer system, and computer programs and methods thereof
CA2638558C (en) * 2008-08-08 2013-03-05 Bloorview Kids Rehab Topic word generation method and system
KR101536520B1 (en) * 2014-04-28 2015-07-14 숭실대학교산학협력단 Method and server for extracting topic and evaluating compatibility of the extracted topic
CN106156204B (en) * 2015-04-23 2020-05-29 深圳市腾讯计算机系统有限公司 Text label extraction method and device
CN105843795B (en) * 2016-03-21 2019-05-14 华南理工大学 Document keyword extraction method and system based on topic model
CN108334533B (en) * 2017-10-20 2021-12-24 腾讯科技(深圳)有限公司 Keyword extraction method and device, storage medium and electronic device
CN109508378B (en) * 2018-11-26 2023-07-14 平安科技(深圳)有限公司 A sample data processing method and device
CN109885680B (en) * 2019-01-22 2020-05-19 仲恺农业工程学院 A method, system and device for short text classification preprocessing based on sememe expansion
CN110457707B (en) * 2019-08-16 2023-01-17 秒针信息技术有限公司 Method and device for extracting real word keywords, electronic equipment and readable storage medium


Also Published As

Publication number Publication date
TW202221528A (en) 2022-06-01
CN114510565A (en) 2022-05-17
CN114510565B (en) 2025-04-25

Similar Documents

Publication Publication Date Title
CN112269868B (en) A method of using a machine reading comprehension model based on multi-task joint training
CN108280206B (en) Short text classification method based on semantic enhancement
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
CN109739986A (en) A short text classification method for complaints based on deep ensemble learning
CN109815336B (en) Text aggregation method and system
CN109002473A (en) A kind of sentiment analysis method based on term vector and part of speech
Subhash et al. Fake news detection using deep learning and transformer-based model
CN112395393A (en) Remote supervision relation extraction method based on multitask and multiple examples
CN112860898B (en) Short text box clustering method, system, equipment and storage medium
CN111191051B (en) Method and system for constructing emergency knowledge map based on Chinese word segmentation technology
CN108804595B (en) A short text representation method based on word2vec
CN110795564A (en) Text classification method lacking negative cases
CN112949713A (en) Text emotion classification method based on ensemble learning of complex network
CN111737453A (en) An Extractive Text Summarization Method Based on Unsupervised Multi-Model Fusion
CN111859961A (en) A Text Keyword Extraction Method Based on Improved TopicRank Algorithm
CN114298020A (en) Keyword vectorization method based on subject semantic information and application thereof
CN108038099B (en) A low-frequency keyword recognition method based on word clustering
CN113722439B (en) Cross-domain emotion classification method and system based on adversarial category alignment network
CN111274402A (en) E-commerce comment emotion analysis method based on unsupervised classifier
CN110222172A (en) A kind of multi-source network public sentiment Topics Crawling method based on improvement hierarchical clustering
CN112988970A (en) Text matching algorithm serving intelligent question-answering system
Abuhaiba et al. Combining different approaches to improve arabic text documents classification
CN115098690B (en) Multi-data document classification method and system based on cluster analysis
CN111930953B (en) Text attribute feature identification, classification and structure analysis method and device
CN114936277A (en) Similarity problem matching method and user similarity problem matching system