WO2019153552A1 - Automatic tagging method and apparatus, and computer device and storage medium
- Publication number: WO2019153552A1
- Application number: PCT/CN2018/085348 (CN2018085348W)
- Authority: WO (WIPO (PCT))
- Prior art keywords: text, word, keyword, tagged, frequency
- Prior art date: 2018-02-12
- Legal status: Ceased (assumed; not a legal conclusion)
Classifications
- G06F16/35 — Information retrieval of unstructured textual data; clustering; classification
- G06F16/3346 — Information retrieval of unstructured textual data; query execution using a probabilistic model
- G06F40/216 — Handling natural language data; parsing using statistical methods
Description
- This application claims priority to Chinese patent application No. 201810145692.7, entitled "Automatic tagging method, apparatus, computer device and storage medium" and filed with the Chinese Patent Office on February 12, 2018, the entire contents of which are incorporated herein by reference.
- The present application relates to the field of article classification, and in particular to an automatic tagging method, apparatus, computer device, and storage medium.
- An article's tags aid searching for and classifying articles. The current common approach is manual tagging: authors edit tags for their own articles, but not all authors tag their articles. If tags for a large volume of untagged articles are all added manually, the process is extremely inefficient and labor costs rise sharply.
- The present application provides an automatic tagging method, apparatus, computer device, and storage medium, aiming to solve the prior-art problem that adding tags to large numbers of untagged articles by manual tagging is extremely inefficient and greatly increases labor costs.
- In a first aspect, the present application provides an automatic tagging method, comprising: performing word-segmentation preprocessing on the text to be tagged to obtain preprocessed text; inputting the preprocessed text into a word inverse-frequency TF-IDF algorithm model to obtain the keyword set of the text to be tagged; obtaining an initialization transition matrix from the keyword set, and iteratively multiplying the initialization transition matrix with the initial keyword probability distribution until convergence to obtain the final keyword probability distribution; and obtaining the row holding the maximum probability in the final keyword probability distribution, obtaining the keyword corresponding to that row, and setting that keyword as the tag of the text to be tagged.
- In a second aspect, the present application provides an automatic tagging apparatus, comprising:
- a text preprocessing unit configured to perform word-segmentation preprocessing on the text to be tagged to obtain preprocessed text;
- a keyword set obtaining unit configured to input the preprocessed text into the word inverse-frequency TF-IDF algorithm model to obtain the keyword set of the text to be tagged;
- a final probability distribution obtaining unit configured to obtain an initialization transition matrix from the keyword set and iteratively multiply it with the initial keyword probability distribution until convergence, obtaining the final keyword probability distribution;
- a tagging unit configured to obtain the row holding the maximum probability in the final keyword probability distribution, obtain the keyword corresponding to that row, and set that keyword as the tag of the text to be tagged.
- In a third aspect, the present application further provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the automatic tagging method of any of the above.
- In a fourth aspect, the present application also provides a storage medium storing a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the automatic tagging method of any of the above.
- The present application thus provides an automatic tagging method, apparatus, computer device, and storage medium. The method tags articles through automatic learning, avoiding manual tagging, improving tagging efficiency, and saving labor costs.
- FIG. 1 is a schematic flowchart of an automatic tagging method according to an embodiment of the present application;
- FIG. 2 is a schematic flowchart of a sub-flow of the automatic tagging method according to an embodiment of the present application;
- FIG. 3 is a schematic flowchart of another sub-flow of the automatic tagging method according to an embodiment of the present application;
- FIG. 4 is a schematic block diagram of an automatic tagging apparatus according to an embodiment of the present application;
- FIG. 5 is a schematic block diagram of subunits of the automatic tagging apparatus according to an embodiment of the present application;
- FIG. 6 is a schematic block diagram of other subunits of the automatic tagging apparatus according to an embodiment of the present application;
- FIG. 7 is a schematic block diagram of a computer device according to an embodiment of the present application.
- Referring to FIG. 1, which is a schematic flowchart of an automatic tagging method according to an embodiment of the present application, the method is applied to terminals such as desktop computers, laptop computers, and tablet computers. As shown in FIG. 1, the method includes steps S101 to S104.
- S101: Perform word-segmentation preprocessing on the text to be tagged to obtain preprocessed text.
- As shown in FIG. 2, step S101 includes the following steps:
- S1011: Perform word segmentation on the text to be tagged to obtain segmented text.
- In this embodiment, the text to be tagged is segmented using a word segmentation method based on a probabilistic statistical model, whose steps are as follows (a code sketch follows the list):
- Step 11: For a string S to be segmented, take out all candidate words w1, w2, ..., wi, ..., wn in left-to-right order;
- Step 12: Look up the probability value P(wi) of each candidate word in the dictionary, and record all left-neighbor words of each candidate word;
- Step 13: Calculate the cumulative probability of each candidate word and, by comparison, obtain the best left neighbor of each candidate word;
- Step 14: If the current word wn is the last word of the string S and its cumulative probability P(wn) is the largest, then wn is the end word of S;
- Step 15: Starting from wn, output the best left neighbor of each word in right-to-left order; this is the segmentation result of S.
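For illustration, the five steps above amount to a left-to-right dynamic program over dictionary words. The following is a minimal Python sketch under an assumed toy dictionary of word probabilities; the dictionary, its values, and the maximum word length are illustrative, not taken from the patent.

```python
# Minimal sketch of the dictionary-based probabilistic segmentation (steps 11-15).
def segment(s, word_prob, max_len=4):
    n = len(s)
    best = [0.0] * (n + 1)  # best cumulative probability of a segmentation ending at i
    prev = [0] * (n + 1)    # start index of the best left-neighbor word ending at i
    best[0] = 1.0
    for i in range(1, n + 1):                  # steps 11-13: scan left to right
        for j in range(max(0, i - max_len), i):
            w = s[j:i]                         # candidate word
            p = word_prob.get(w)
            if p is not None and best[j] * p > best[i]:
                best[i], prev[i] = best[j] * p, j
    words, i = [], n                           # steps 14-15: backtrack from the end word
    while i > 0:
        words.append(s[prev[i]:i])
        i = prev[i]
    return list(reversed(words))

word_prob = {"自动": 0.02, "打": 0.01, "标签": 0.03, "打标签": 0.015}  # toy dictionary
print(segment("自动打标签", word_prob))  # ['自动', '打标签']
```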
- S1012: Set a weighting value for each word included in the segmented text.
- In this embodiment, weighting is applied to the words of the segmented text to be tagged; that is, the segmented text can be regarded as consisting of multiple words, and each word in the text is weighted from beginning to end according to factors such as position, part of speech, and length, by the following rules (a code sketch follows): if the word is the first word of the text, i.e. the title, it is given a weight of 8; if the first word of a paragraph equals "abstract", it is given a weight of 5; if the first word of a paragraph equals "keyword" or "conclusion", it is given a weight of 5; if the word length equals 2, it is given a weight of 3; if the part of speech is a noun, it is given a weight of 2; otherwise, a weight of 1 is given.
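A minimal sketch of these weighting rules follows, assuming the caller supplies the title flag, the paragraph-start flag, and a part-of-speech tag from the segmenter (these inputs are assumptions, not part of the patent's text):

```python
# Sketch of the S1012 weighting rules; the rule values follow the text above.
def word_weight(word, is_title, is_paragraph_start, pos_tag):
    if is_title:                                # first word of the text (the title)
        return 8
    if is_paragraph_start and word in ("摘要", "关键词", "结论"):
        return 5                                # "abstract" / "keyword" / "conclusion"
    if len(word) == 2:                          # word length equal to 2
        return 3
    if pos_tag == "noun":                       # part of speech is a noun
        return 2
    return 1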
- S1013: Delete the stop words in the segmented text and count the word frequency of each word to obtain the first triple.
- The triple <w_i, fre_i, v_i> represents the processed result set of the text to be tagged, where w_i is a word, fre_i is the weighted number of occurrences of w_i, and v_i is the positional weight of the word in the text. After weighting values have been set for the words of the segmented text, the stop words among them must be deleted (stop words include function words, modal particles, adverbs, symbols, and single-character words; they are not used as keyword candidates), so that candidate keywords can be accurately screened for subsequent processing.
- S1014: Obtain the word similarity between those words of the first triple whose word frequency exceeds the preset word-frequency threshold.
- Here, word similarity sim_ij is computed for all words of the first triple <w_i, fre_i, v_i> with word frequency fre_i > 2. When sim_ij > 0.9, the two words are considered so similar that they can replace one another in the text; the quadruple <w_i, w_j, sim_ij, fre_i+fre_j> is returned and the word w_j is deleted from the first triple. The quadruple <w_i, w_j, sim_ij, fre_i+fre_j> represents the set obtained after computing similarities for some words of the triple, where sim_ij is the similarity of the words w_i and w_j, and fre_i+fre_j is the sum of the word frequencies of the two words.
- S1015: If the word similarity between two words exceeds the preset word-similarity threshold, keep either one of them to obtain the second triple, and use the second triple as the preprocessed text.
- Specifically, in the first triple <w_i, fre_i, v_i>, look up the words that appear in the quadruples <w_i, w_j, sim_ij, fre_i+fre_j>; fre_i in the triple is then replaced by fre_i+fre_j from the quadruple, forming the second triple <w_i, fre_i+fre_j, v_i>, which is the preprocessed text.
- Through this text preprocessing, the resulting preprocessed text satisfies the input criteria of the keyword screening model (that is, the text has been vectorized), so the keywords of the article can be obtained more accurately. Once the text to be tagged has been segmented, the word frequency of each word can be counted to produce candidates for the article's tags. A sketch of this triple and quadruple bookkeeping follows.
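The bookkeeping of steps S1013-S1015 might be sketched as below; the similarity function is an assumed placeholder (for example, a word-embedding cosine), and the thresholds default to the fre_i > 2 and sim_ij > 0.9 values mentioned above:

```python
from collections import Counter

# Sketch of S1013: drop stop words and accumulate weighted frequencies into triples.
def build_triples(weighted_words, stop_words, position_weight):
    freq = Counter()
    for word, weight in weighted_words:        # (word, weight) pairs from S1012
        if word not in stop_words:
            freq[word] += weight
    return [(w, f, position_weight[w]) for w, f in freq.items()]

# Sketch of S1014-S1015: merge near-duplicate words, keeping one of each pair.
def merge_similar(triples, similarity, freq_threshold=2, sim_threshold=0.9):
    kept = {w: (f, v) for w, f, v in triples}
    frequent = [w for w, f, _ in triples if f > freq_threshold]
    for i, wi in enumerate(frequent):
        for wj in frequent[i + 1:]:
            if wi in kept and wj in kept and similarity(wi, wj) > sim_threshold:
                fi, vi = kept[wi]
                fj, _ = kept.pop(wj)           # delete w_j from the triple set
                kept[wi] = (fi + fj, vi)       # fre_i := fre_i + fre_j
    return [(w, f, v) for w, (f, v) in kept.items()]  # the second triples
```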
- S102: Input the preprocessed text into the word inverse-frequency TF-IDF algorithm model to obtain the keyword set of the text to be tagged.
- In an embodiment, the word inverse-frequency TF-IDF algorithm model is (reconstructed here in the standard TF-IWF form implied by the definitions below):
- TF-IWF_{i,j} = (n_{i,j} / Σ_k n_{k,j}) × log(Σ_m nt_m / nt_i)
- where in the TF factor the numerator n_{i,j} is the number of occurrences of the word t_i in text j and the denominator is the sum of the frequencies of all words in text j; in the IWF factor the numerator is the sum of the frequencies of all words in the corpus and nt_i is the total frequency with which the word t_i appears in the corpus.
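A sketch of this scoring, assuming the corpus statistics arrive as a mapping from each word to its total corpus frequency nt_i, and assuming the keyword set is simply the top-scoring words (the patent does not state how many keywords are kept):

```python
import math

# Sketch of the TF-IWF score for one word of the preprocessed text.
def tf_iwf(word, doc_freq, corpus_freq):
    tf = doc_freq[word] / sum(doc_freq.values())               # n_ij / sum_k n_kj
    iwf = math.log(sum(corpus_freq.values()) / corpus_freq.get(word, 1))
    return tf * iwf

def keyword_set(doc_freq, corpus_freq, top_k=10):              # top_k is an assumption
    scored = {w: tf_iwf(w, doc_freq, corpus_freq) for w in doc_freq}
    return sorted(scored, key=scored.get, reverse=True)[:top_k]
```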
- As shown in FIG. 3, step S102 includes the following steps:
- S1021: Generate the corpus word statistical result set;
- S1022: Obtain the preprocessed text;
- S1023: Input the preprocessed text and the corpus word statistical result set into the word inverse-frequency TF-IDF algorithm model to obtain the keyword set of the text to be tagged.
- In this embodiment, the corpus word statistical result set is obtained from a corpus. The corpus is a certain number of articles pre-selected by the user (for example, 2,000 articles); each article in the corpus is processed with the text preprocessing algorithm of steps S1011-S1015, skipping the similarity calculation of step S1014, to obtain a pair <w_i, fre_i>, where w_i is a word and fre_i is the weighted frequency of w_i. All pairs <w_i, fre_i> are merged into <w_i, fre_isum>, where fre_isum is the total frequency of w_i in the corpus; the set of <w_i, fre_isum> is the generated corpus word statistical result set. Inputting the preprocessed text and this result set into the word inverse-frequency TF-IDF algorithm model yields a keyword set of high accuracy for the text to be tagged.
- S103: Obtain an initialization transition matrix from the keyword set of the text to be tagged, and iteratively multiply the initialization transition matrix with the initial keyword probability distribution until convergence, obtaining the final keyword probability distribution.
- In an embodiment, the initialization transition matrix is an n-dimensional square matrix whose dimension equals the total number of keywords in the keyword set; the initial keyword probability distribution is an n-dimensional column vector in which every entry is 1/n, where n is a positive integer equal to the total number of keywords in the keyword set.
- The repeated iterative multiplication is written V_m = MV_{m-1}, where m is a positive integer, V_0 is the initial keyword probability distribution, and M is the initialization transition matrix.
- In this embodiment, suppose for example that the keyword set contains four keywords, denoted A, B, C, and D; the initialization transition matrix M is then obtained from factors such as the position, part of speech, and length of the keywords. Each keyword is assumed equally likely to be the final tag of the text, i.e. with probability 1/n, so V_0 is an n-dimensional column vector all of whose entries are 1/n, and the final keyword probability distribution V_n is computed by V_n = MV_{n-1}. Repeated multiplication by M converges after a limited number of iterations (generally about 30), so the iteration does not run indefinitely.
- S104: Obtain the row holding the maximum probability in the final keyword probability distribution, obtain the keyword corresponding to that row, and set that keyword as the tag of the text to be tagged.
- As another embodiment of steps S103-S104, the initialization transition matrix may be obtained from the keyword set of the text to be tagged and iteratively multiplied with the initial keyword probability distribution; as soon as one row of the n-dimensional column vector V_n exceeds the preset probability value, the iteration is stopped and the keyword corresponding to that row is used as the tag of the text to be tagged. If several rows exceed the preset probability value at the same time, the keywords corresponding to all of those rows are used together as tags of the text to be tagged (a code sketch of the iteration follows).
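In the sketch below, since the patent does not give the exact construction of M, the transition matrix is a toy uniform (column-stochastic) example, and the tolerance and iteration cap are assumptions:

```python
import numpy as np

# Sketch of V_m = M V_{m-1} iterated until convergence (S103), then argmax (S104).
def final_distribution(M, tol=1e-8, max_iter=100):
    n = M.shape[0]
    v = np.full((n, 1), 1.0 / n)        # V_0: every keyword equally likely
    for _ in range(max_iter):           # typically converges in about 30 steps
        v_next = M @ v                  # V_m = M V_{m-1}
        if np.linalg.norm(v_next - v, 1) < tol:
            break
        v = v_next
    return v

keywords = ["A", "B", "C", "D"]
M = np.full((4, 4), 0.25)               # toy matrix; the real M is built from
v = final_distribution(M)               # keyword position, part of speech, length
print(keywords[int(np.argmax(v))])      # the row with the maximum probability
```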
- In an embodiment, before step S101, the method further includes:
- Step 1: Crawl the text to be tagged and store it in the MongoDB database. That is, raw data is crawled from the Internet to obtain the text to be tagged, which is then stored in the MongoDB database. When crawling the data, a filter condition can be set so that only text with no tags set is crawled for tagging; a sketch follows.
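A minimal sketch of this pre-step using pymongo; the connection string, the database and collection names, and the document schema are all assumptions for illustration:

```python
from pymongo import MongoClient

# Store crawled articles and pull back only those with no tags set.
client = MongoClient("mongodb://localhost:27017")
articles = client["tagging"]["articles"]

def save_crawled(title, body):
    articles.insert_one({"title": title, "body": body, "tags": []})

for doc in articles.find({"tags": []}):   # filter condition: untagged text only
    text_to_tag = doc["body"]             # feed into steps S101-S104
```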
- The method thus tags articles through automatic learning, avoiding manual tagging, improving tagging efficiency, and saving labor costs.
- The embodiment of the present application further provides an automatic tagging apparatus for performing any of the foregoing automatic tagging methods. Specifically, referring to FIG. 4, FIG. 4 is a schematic block diagram of an automatic tagging apparatus provided by an embodiment of the present application.
- The automatic tagging apparatus 100 can be installed in a desktop computer, tablet computer, laptop computer, or similar terminal.
- As shown in FIG. 4, the automatic tagging apparatus 100 includes a text preprocessing unit 101, a keyword set acquisition unit 102, a final probability distribution acquisition unit 103, and a tagging unit 104.
- The text preprocessing unit 101 is configured to perform word-segmentation preprocessing on the text to be tagged to obtain preprocessed text.
- As shown in FIG. 5, the text preprocessing unit 101 includes the following subunits:
- The word segmentation unit 1011 is configured to segment the text to be tagged to obtain segmented text.
- In this embodiment, the text to be tagged is segmented by the word segmentation method based on the probabilistic statistical model, following the same steps 11 to 15 described above for step S1011.
- The weighting unit 1012 is configured to set a weighting value for each word included in the segmented text.
- As in step S1012, the segmented text to be tagged is regarded as consisting of multiple words, and each word in the text is weighted from beginning to end according to factors such as position, part of speech, and length, using the weighting rules given above.
- The statistics unit 1013 is configured to delete the stop words in the segmented text and count the word frequency of each word to obtain the first triple <w_i, fre_i, v_i>, as described for step S1013; the stop words are not used as keyword candidates, so candidate keywords can be accurately screened for subsequent processing.
- The similarity obtaining unit 1014 is configured to obtain the word similarity between those words of the first triple whose word frequency exceeds the preset word-frequency threshold, returning the quadruples <w_i, w_j, sim_ij, fre_i+fre_j> and deleting w_j from the first triple, as described for step S1014.
- The word deletion unit 1015 is configured to keep either word when the word similarity between two words exceeds the preset word-similarity threshold, obtaining the second triple <w_i, fre_i+fre_j, v_i>, which is used as the preprocessed text, as described for step S1015.
- As noted above, the resulting preprocessed text satisfies the input criteria of the keyword screening model (the text has been vectorized), so the keywords of the article can be obtained more accurately, and the word frequency of each word can be counted to produce candidates for the article's tags.
- The keyword set obtaining unit 102 is configured to input the preprocessed text into the word inverse-frequency TF-IDF algorithm model defined above (with n_{i,j}, Σ_k n_{k,j}, Σ_m nt_m, and nt_i as before) to obtain the keyword set of the text to be tagged.
- As shown in FIG. 6, the keyword set obtaining unit 102 includes the following subunits:
- a first processing unit 1021 configured to generate the corpus word statistical result set;
- a second processing unit 1022 configured to acquire the preprocessed text;
- a keyword set calculation unit 1023 configured to input the preprocessed text and the corpus word statistical result set into the word inverse-frequency TF-IDF algorithm model to obtain the keyword set of the text to be tagged.
- In this embodiment, the corpus word statistical result set is obtained from the corpus as described for step S1021: a certain number of pre-selected articles (for example, 2,000) are processed with the text preprocessing algorithm of the text preprocessing unit 101, skipping the similarity calculation, and the resulting pairs <w_i, fre_i> are merged.
- The final probability distribution obtaining unit 103 is configured to obtain the initialization transition matrix from the keyword set of the text to be tagged and to iteratively multiply it with the initial keyword probability distribution until convergence, obtaining the final keyword probability distribution.
- The initialization transition matrix, the initial keyword probability distribution, and the iteration V_m = MV_{m-1} are constructed exactly as described for step S103.
- The tagging unit 104 is configured to obtain the row holding the maximum probability in the final keyword probability distribution, obtain the keyword corresponding to that row, and set that keyword as the tag of the text to be tagged.
- As another embodiment of the final probability distribution obtaining unit 103 and the tagging unit 104, the iteration may instead be stopped as soon as one row of the n-dimensional column vector V_n exceeds the preset probability value, the keyword corresponding to that row then being used as the tag of the text to be tagged; if several rows exceed the preset probability value at the same time, the keywords corresponding to all of those rows are used together as tags.
- In an embodiment, the automatic tagging apparatus 100 further includes:
- a crawling unit for crawling the text to be tagged and storing it in the MongoDB database. That is, raw data is crawled from the Internet to obtain the text to be tagged, which is stored in the MongoDB database; a filter condition can be set so that only text with no tags set is crawled for tagging.
- The apparatus thus tags articles through automatic learning, avoiding manual tagging, improving tagging efficiency, and saving labor costs.
- The above automatic tagging apparatus can be implemented in the form of a computer program that can be run on a computer device as shown in FIG. 7.
- FIG. 7 is a schematic block diagram of a computer device according to an embodiment of the present application.
- The computer device 500 can be a terminal, such as a tablet computer, a notebook computer, a desktop computer, or a personal digital assistant.
- The computer device 500 includes a processor 502, a memory, and a network interface 505 connected by a system bus 501, wherein the memory can include a non-volatile storage medium 503 and an internal memory 504.
- The non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032.
- The computer program 5032 includes program instructions that, when executed, cause the processor 502 to perform the automatic tagging method.
- The processor 502 is used to provide computing and control capabilities to support the operation of the entire computer device 500.
- The internal memory 504 provides an environment for running the computer program 5032 stored in the non-volatile storage medium 503; when executed by the processor 502, the computer program causes the processor 502 to perform the automatic tagging method.
- The network interface 505 is used for network communication, such as sending assigned tasks. Those skilled in the art will understand that the structure shown in FIG. 7 is only a block diagram of the parts of the structure related to the present solution and does not limit the computer device 500 to which the solution is applied; a specific computer device 500 may include more or fewer components than shown, combine certain components, or arrange components differently.
- The processor 502 is configured to run the computer program 5032 stored in the memory to implement the following functions: performing word-segmentation preprocessing on the text to be tagged to obtain preprocessed text; inputting the preprocessed text into the word inverse-frequency TF-IDF algorithm model to obtain the keyword set of the text to be tagged; obtaining the initialization transition matrix from the keyword set and iteratively multiplying it with the initial keyword probability distribution until convergence to obtain the final keyword probability distribution; and obtaining the row holding the maximum probability in that distribution, obtaining the corresponding keyword, and setting it as the tag of the text to be tagged.
- In an embodiment, the processor 502 further performs the following operations: segmenting the text to be tagged to obtain segmented text; setting a weighting value for each word included in the segmented text; deleting the stop words and counting the word frequency of each word to obtain the first triple; obtaining the word similarity between those words of the first triple whose frequency exceeds the preset word-frequency threshold; and, if the similarity between two words exceeds the preset word-similarity threshold, keeping either word to obtain the second triple, which is used as the preprocessed text.
- In an embodiment, the processor 502 further performs the following operations: generating the corpus word statistical result set; obtaining the preprocessed text; and inputting the preprocessed text and the corpus word statistical result set into the word inverse-frequency TF-IDF algorithm model described above to obtain the keyword set of the text to be tagged.
- In an embodiment, the initialization transition matrix and the initial keyword probability distribution are constructed as described for step S103, and the iteration V_m = MV_{m-1} is carried out until convergence.
- In an embodiment, the processor 502 also performs the following operation: crawling the text to be tagged and storing it in the MongoDB database.
- It will be understood by those skilled in the art that the embodiment of the computer device shown in FIG. 7 does not limit its specific configuration; the computer device may include more or fewer components than illustrated, combine certain components, or arrange components differently. For example, in some embodiments the computer device may include only a memory and a processor; in such embodiments the structure and function of the memory and the processor are the same as in the embodiment shown in FIG. 7 and are not repeated here.
- The processor 502 may be a central processing unit (CPU) or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
- In another embodiment, a storage medium is provided, which can be a non-transitory computer-readable storage medium. The storage medium stores a computer program comprising program instructions which, when executed by a processor, carry out the automatic tagging method of the embodiments of the present application.
- The storage medium may be an internal storage unit of the aforementioned device, such as the hard disk or memory of the device; it may also be an external storage device of the device, such as a plug-in hard disk equipped on the device, a smart memory card (SMC), a secure digital (SD) card, or a flash card; or it may include both an internal storage unit of the device and an external storage device.
Abstract
Description
本申请要求于2018年2月12日提交中国专利局、申请号为201810145692.7、申请名称为“自动打标签的方法、装置、计算机设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of the Chinese patent application filed on February 12, 2018, the Chinese Patent Office, Application No. 201101145692.7, the application of the name of the "automatic labeling method, device, computer equipment and storage medium", the entire contents of which are The citations are incorporated herein by reference.
本申请涉及文章分类技术领域,尤其涉及一种自动打标签的方法、装置、计算机设备及存储介质。The present application relates to the field of article classification technology, and in particular, to a method, an apparatus, a computer device and a storage medium for automatic labeling.
文章的标签有助于文章的搜索以及分类,目前常用的方式是手动打标签,即作者为自己的文章编辑标签,但是并非所有作者都为自己的文章打标签。若海量的未打标签的文章都通过手动打标的方式来实现标签的添加,则效率极其低下,而且极大的增加了人力成本。The article's tags help in the search and classification of articles. The current common method is to manually tag, that is, the author edits the tags for their articles, but not all authors tag their articles. If a large number of unlabeled articles are added by manual marking, the efficiency is extremely low, and the labor cost is greatly increased.
发明内容Summary of the invention
本申请提供了一种自动打标签的方法、装置、计算机设备及存储介质,旨在解决现有技术海量的未打标签的文章都通过手动打标的方式来实现标签的添加,导致效率极其低下,而且极大增加了人力成本的问题。The present application provides a method, a device, a computer device and a storage medium for automatically labeling, aiming at solving the problem that the prior art mass unlabeled articles are manually added by labeling, resulting in extremely low efficiency. And greatly increased the problem of labor costs.
第一方面,本申请提供了一种自动打标签的方法,其包括:将待打标签文本进行分词预处理,得到预处理文本;将预处理文本输入词语逆频率TF-IDF算法模型,得到待打标签文本的关键词集;根据待打标签文本的关键词集得到初始化转移矩阵,由初始化转移矩阵及初始的关键词概率分布经过多次迭代相乘运算直至收敛后,得到关键词最终概率分布;获取关键词最终概率分布中概率最大值的对应行,获取概率最大值的对应行所对应关键词,并将所述关键词设置为待打标签文本的标签。In a first aspect, the present application provides a method for automatically labeling, which comprises: preprocessing a word to be tagged to obtain a preprocessed text; and inputting the preprocessed text into a reverse frequency TF-IDF algorithm model to obtain a The keyword set of the tagged text; the initial transfer matrix is obtained according to the keyword set of the tagged text, and the initial transition matrix and the initial keyword probability distribution are subjected to multiple iterative multiplication operations until convergence, and the final probability distribution of the keyword is obtained. Obtaining a corresponding row of the maximum value of the probability in the final probability distribution of the keyword, obtaining a keyword corresponding to the row corresponding to the maximum value of the probability, and setting the keyword as a tag of the text to be tagged.
第二方面,本申请提供了一种自动打标签的装置,其包括:In a second aspect, the present application provides an apparatus for automatically labeling, comprising:
文本预处理单元,用于将待打标签文本进行分词预处理,得到预处理文本;a text preprocessing unit, configured to perform word segmentation on the tagged text to obtain a preprocessed text;
关键词集获取单元,用于将预处理文本输入词语逆频率TF-IDF算法模型,得到待打标签文本的关键词集;a keyword set obtaining unit, configured to input the preprocessed text into a reverse frequency TF-IDF algorithm model to obtain a keyword set of the to-be-labeled text;
最终概率分布获取单元,用于根据待打标签文本的关键词集得到初始化转移矩阵,由初始化转移矩阵及初始的关键词概率分布经过多次迭代相乘运算直至收敛后,得到关键词最终概率分布;The final probability distribution obtaining unit is configured to obtain an initial transition matrix according to the keyword set of the to-be-labeled text, and the initial transition matrix and the initial keyword probability distribution are subjected to multiple iterative multiplication operations until convergence, and the final probability distribution of the keyword is obtained. ;
打标单元,用于获取关键词最终概率分布中概率最大值的对应行,获取概率最大值的对应行所对应关键词,并将所述关键词设置为待打标签文本的标签。The marking unit is configured to obtain a corresponding row of the probability maximum value in the final probability distribution of the keyword, obtain a keyword corresponding to the corresponding row of the maximum probability, and set the keyword as a label of the to-be-labeled text.
第三方面,本申请又提供了一种计算机设备,包括存储器、处理器及存储在所述存储器上并可在所述处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现本申请提供的任一项所述的自动打标签的方法。In a third aspect, the present application further provides a computer device comprising a memory, a processor, and a computer program stored on the memory and operable on the processor, the processor implementing the computer program The method of automatic labeling according to any of the preceding claims.
第四方面,本申请还提供了一种存储介质,其中所述存储介质存储有计算机程序,所述计算机程序包括程序指令,所述程序指令当被处理器执行时使所述处理器执行本申请提供的任一项所述的自动打标签的方法。In a fourth aspect, the present application also provides a storage medium, wherein the storage medium stores a computer program, the computer program comprising program instructions, the program instructions, when executed by a processor, causing the processor to execute the application A method of automatic labeling as described in any of the preceding claims.
本申请提供一种自动打标签的方法、装置、计算机设备及存储介质。该方法通过自动学习的方式对文章打标签,避免了手动打标,提高打标效率且节省人力成本。The application provides a method, device, computer device and storage medium for automatic labeling. The method labels the articles by means of automatic learning, avoids manual marking, improves marking efficiency and saves labor costs.
为了更清楚地说明本申请实施例技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the description of the embodiments will be briefly described below. Obviously, the drawings in the following description are some embodiments of the present application, For the ordinary technicians, other drawings can be obtained based on these drawings without any creative work.
图1为本申请实施例提供的一种自动打标签的方法的示意流程图;FIG. 1 is a schematic flowchart of a method for automatically labeling according to an embodiment of the present application;
图2为本申请实施例提供的一种自动打标签的方法的子流程示意流程图;2 is a schematic flowchart of a sub-flow of a method for automatically labeling according to an embodiment of the present application;
图3是本申请实施例提供的一种自动打标签的方法的另一子流程示意图;3 is a schematic diagram of another sub-flow of a method for automatically labeling according to an embodiment of the present application;
图4为本申请实施例提供的一种自动打标签的装置的示意性框图;4 is a schematic block diagram of an apparatus for automatically labeling according to an embodiment of the present application;
图5为本申请实施例提供的一种自动打标签的装置的子单元示意性框图;FIG. 5 is a schematic block diagram of a subunit of an automatic labeling apparatus according to an embodiment of the present disclosure;
图6为本申请实施例提供的一种自动打标签的装置的另一子单元示意性框图;FIG. 6 is a schematic block diagram of another subunit of an automatic labeling apparatus according to an embodiment of the present disclosure;
图7为本申请实施例提供的一种计算机设备的示意性框图。FIG. 7 is a schematic block diagram of a computer device according to an embodiment of the present application.
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。The technical solutions in the embodiments of the present application are clearly and completely described in the following with reference to the drawings in the embodiments of the present application. It is obvious that the described embodiments are a part of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without departing from the inventive scope are the scope of the present application.
请参阅图1,图1是本申请实施例提供的一种自动打标签的方法的示意流程图。该方法应用于台式电脑、手提电脑、平板电脑等终端中。如图1所示,该方法包括步骤S101~S104。Please refer to FIG. 1. FIG. 1 is a schematic flow chart of a method for automatically labeling according to an embodiment of the present application. The method is applied to terminals such as desktop computers, laptop computers, and tablet computers. As shown in FIG. 1, the method includes steps S101 to S104.
S101、将待打标签文本进行分词预处理,得到预处理文本。S101: Perform word segmentation on the to-be-labeled text to obtain pre-processed text.
如图2所示,所述步骤S101包括以下步骤:As shown in FIG. 2, the step S101 includes the following steps:
S1011、对待打标签文本进行分词,得到分词后文本。S1011: Perform word segmentation on the tagged text to obtain a word segmentation text.
在本实施例中,是基于概率统计模型的分词方法对待打标签文本进行分词。基于概率统计模型的分词方法的步骤如下:In this embodiment, the word segmentation method based on the probability statistical model performs word segmentation on the tagged text. The steps of the word segmentation method based on the probability and statistics model are as follows:
步骤十一、对一个待分词的子串S,按照从左到右的顺序取出全部候选词w1,w2,…,wi,…,wn;Step 11: For a substring S to be segmented, all candidate words w1, w2, ..., wi, ..., wn are taken out in order from left to right;
步骤十二、到词典中查出每个候选词的概率值P(wi),并记录每个候选词的全部左邻词;Step 12: Find the probability value P(wi) of each candidate word in the dictionary, and record all the adjacent words of each candidate word;
步骤十三、计算每个候选词的累计概率,同时比较得到每个候选词的最佳左邻词;Step 13: Calculate the cumulative probability of each candidate word, and compare and obtain the best neighbors of each candidate word;
步骤十四、如果当前词wn是字串S的尾词,且累计概率P(wn)最大,则wn就是S的终点词;Step 14. If the current word wn is the end word of the string S and the cumulative probability P(wn) is the largest, then wn is the end word of S;
步骤十五、从wn开始,按照从右到左顺序,依次将每个词的最佳左邻词输出,即S的分词结果。Step 15. Beginning with wn, in order from right to left, the best left neighbor words of each word are sequentially output, that is, the word segmentation result of S.
S1012、对分词后文本包括的分词一一设置加权值。S1012: Set a weighting value by using a participle included in the text after the word segmentation.
在本实施例中,以已进行分词的待打标签文本中分词来进行加权处理,也就是已进行分词的待打标签文本中是可以视作由多个分词组成,此时对整篇已进行分词的待打标签文本从头至尾按位置、词性、长度等因素对文本中的各分词进行加权处理,按如下规则:In this embodiment, the weighting process is performed by the word segmentation in the tagged text that has been segmented, that is, the tagged text that has been segmented can be regarded as composed of a plurality of word segments, and the entire article has been performed. The tagged text of the word segmentation weights the word segmentation in the text from beginning to end by factors such as position, part of speech, length, etc., according to the following rules:
文本第一个词是标题,赋予权值8*;段首第一个词等于“摘要”,则赋予 权值5*;段首第一个词等于“关键词”或“结论”,则赋予权值5*;词语长度等于2,赋予权值3*;词性为名词,赋予权值2*;其他,每段首赋予权值1*。The first word of the text is the title, giving the weight 8*; the first word at the beginning of the paragraph is equal to the "summary", then the weight is given 5*; the first word at the beginning of the paragraph is equal to "keyword" or "conclusion", then The weight is 5*; the length of the word is equal to 2, and the weight is 3*; the part of speech is a noun, and the weight is given 2*; otherwise, the weight of each paragraph is given 1*.
S1013、删除分词后文本中的停用词,并统计各分词的词频,得到第一三元组。S1013. Delete the stop words in the text after the word segmentation, and count the word frequency of each word segment to obtain the first triplet.
三元组<w i,fre i,v i>表示待打标签文本经处理后的结果集,其中w i是词语,fre i是词语w i加权后出现的次数,v i是词语在文本中的位置权重;其中,当对分词后文本包括的分词一一设置加权值后,需删除其中的停用词(停用词包括虚拟词、语气组词、副词、符号、一个字的词,这些停用词不会作为关键词的候选词),能准确的筛选出候选的关键词进行后续处理。 The triplet <w i ,fre i ,v i > represents the processed result set of the to-be-labeled text, where w i is the word, fre i is the number of times the word w i is weighted, and v i is the word in the text Position weight; wherein, when the weights are included in the participles included in the text after the word segmentation, the stop words are deleted (the stop words include virtual words, modal words, adverbs, symbols, words of a word, etc.) The stop word will not be used as a candidate for the keyword), and the candidate keywords can be accurately screened for subsequent processing.
S1014、获取第一三元组中词频大于预设词频阈值所对分词之间的词语相似度;S1014. Obtain a similarity between words in the first triad that is greater than a word frequency threshold of the preset word frequency threshold;
其中,通过词语相似度计算,计算第一三元组<w i,fre i,v i>中词频fre i>2的所有词语相似度sim ij;当sim ij>0.9则认为两个词语的相似度极高,在文本中可以替换,将返回四元组<w i,w j,sim ij,fre i+fre j>,并删除第一三元组里的词语w j。四元组<w i,w j,sim ij,fre i+fre j>表示对三元组中部分词语计算相似度后的集合,其中sim ij表示词语w i、w j的相似度,fre i+fre j表示两个词语的词频之和。 Wherein, by word similarity calculation, all word similarities sim ij of the word frequency fre i >2 in the first triad <w i , fre i , v i > are calculated; when sim ij >0.9, the similarity of the two words is considered Very high, can be replaced in the text, will return the quads <w i , w j , sim ij , fre i +fre j >, and delete the word w j in the first triple. The quaternion <w i , w j , sim ij , fre i +fre j > represents a set of similarities after calculating partial words in the triple, where sim ij represents the similarity of the words w i , w j , fre i +fre j represents the sum of the word frequencies of the two words.
S1015、若分词之间的词语相似度大于预设词语相似度阈值,保留其中任意一个分词,得到第二三元组,并将第二三元组作为预处理文本。S1015: If the word similarity between the word segments is greater than the preset word similarity threshold, retain any one of the word segments to obtain the second triplet, and use the second triplet as the pre-processed text.
其中,在第一三元组<w i,fre i,v i>中,查找四元组<w i,w j,sim ij,fre i+fre j>中的词语;当三元组的fre i替换为四元组中的fre i+fre j,重新组成第二三元组<wi,fre i+fre j,vi>,该第二三元组<w i,fre i+fre j,v i>即为预处理文本。通过文本预处理,所得到的预处理文本满足关键词筛选模型的输入标准(即将文本进行了向量化处理),能更为准确的得到文章的关键词。当对待打标签文本进行分词后,就能对各词语的词频进行统计,以作为文章标签的候选词。 Wherein, in the first triplet <w i ,fre i ,v i >, look for the words in the quads <w i , w j , sim ij , fre i +fre j >; when the triad is fre i is replaced by fre i +fre j in the quaternary, reconstituting the second triplet <wi,fre i +fre j ,vi>, the second ternary <w i ,fre i +fre j ,v i > is the pre-processed text. Through text preprocessing, the obtained preprocessed text satisfies the input criteria of the keyword screening model (that is, the text is vectorized), and the keywords of the article can be obtained more accurately. When the tagged text is segmented, the word frequency of each word can be counted as a candidate for the article tag.
S102、将预处理文本输入词语逆频率TF-IDF算法模型,得到待打标签文本的关键词集。S102. Enter the preprocessed text into the inverse frequency TF-IDF algorithm model to obtain a keyword set of the to-be-labeled text.
在一实施例中,所述词语逆频率TF-IDF算法模型为:In an embodiment, the word inverse frequency TF-IDF algorithm model is:
其中,TF部分分子n i,j表示词语t i在文本j中出现的次数,分母表示文本j 中所有的词语频词和,IWF部分分子表示语料库中所有词语频数之和,nt j表示词语t i在语料库中出现的总频数。 Wherein, the TF partial molecule n i,j represents the number of occurrences of the word t i in the text j, the denominator represents all the word frequency words in the text j, the IWF part numerator represents the sum of the frequency of all words in the corpus, and nt j represents the word t The total frequency that i appears in the corpus.
如图3所示,所述步骤S102包括以下步骤:As shown in FIG. 3, the step S102 includes the following steps:
S1021、生成语料库词语统计结果集;S1021: Generating a corpus term statistical result set;
S1022、获取预处理文本;S1022: Obtain a preprocessed text;
S1023、将预处理文本及语料库词语统计结果集输入词语逆频率TF-IDF算法模型,得到待打标签文本的关键词集。S1023: Input the pre-processed text and the corpus term statistical result set into the word inverse frequency TF-IDF algorithm model to obtain a keyword set of the to-be-labeled text.
在本实施例中,语料库词语统计结果集是基于语料库得到的。语料库是用户预先挑选一定数量文章(如2000篇),采用步骤S10111-S10115中的文本预处理算法,忽略步骤S10114中相似度计算的步骤,对语料库中的文章分别进行处理,得到二元组<w i,fre i>,其中w i是词语,fre i是词语w i加权后出现的频次。合并所有二元组<w i,fre i>,得到<w i,fre isum>,其中fre isum是词语w i在语料库中出现的总频次,也即<wi,freisum>即为所生成的语料库词语统计结果集。将预处理文本及语料库词语统计结果集输入词语逆频率TF-IDF算法模型,得到待打标签文本的关键词集,所得到的关键词集准确度较高。 In this embodiment, the corpus term statistical result set is obtained based on the corpus. The corpus is that the user pre-selects a certain number of articles (such as 2000 articles), adopts the text preprocessing algorithm in steps S10111-S10115, ignores the steps of similarity calculation in step S10114, and processes the articles in the corpus separately to obtain a binary group < w i ,fre i >, where w i is a word, and fre i is the frequency at which the word w i is weighted. Combine all the binary groups <w i ,fre i >, and get <w i ,fre isum >, where fre isum is the total frequency of the word w i appearing in the corpus, that is, <wi,freisum> is the generated corpus Word statistics result set. The pre-processed text and the corpus term statistical result set are input into the word inverse frequency TF-IDF algorithm model to obtain the keyword set of the tagged text, and the obtained keyword set has high accuracy.
S103、根据待打标签文本的关键词集得到初始化转移矩阵,由初始化转移矩阵及初始的关键词概率分布经过多次迭代相乘运算直至收敛后,得到关键词最终概率分布。S103. The initialization transition matrix is obtained according to the keyword set of the to-be-labeled text, and the initial transition matrix and the initial keyword probability distribution are subjected to multiple iterative multiplication operations until convergence, and the keyword final probability distribution is obtained.
在一实施例中,所述根据待打标签文本的关键词集得到初始化转移矩阵中,所述初始化转移矩阵为n维方阵,n维方阵的维数与关键词集中关键词总个数相等;所述初始的关键词概率分布为每一行值均为1/n的n维列向量;其中,n为与关键词集中关键词总个数相等的正整数;In an embodiment, the initialization transition matrix is obtained according to the keyword set of the to-be-labeled text, and the initialization transition matrix is an n-dimensional square matrix, the dimension of the n-dimensional square matrix and the total number of keywords in the keyword set. Equal; the initial keyword probability distribution is an n-dimensional column vector whose row value is 1/n; wherein n is a positive integer equal to the total number of keywords in the keyword set;
所述由初始化转移矩阵及初始的关键词概率分布经过多次迭代相乘运算记为V m=MV m-1,其中,m为正整数,V 0为初始的关键词概率分布,M为初始化转移矩阵。 The initialization transition matrix and the initial keyword probability distribution are recorded as V m =MV m-1 through multiple iterative multiplication operations, where m is a positive integer, V 0 is an initial keyword probability distribution, and M is an initialization. Transfer matrix.
在本实施例中,例如,关键词集中关键词的总个数为4个,分别记为A,B,C,D。根据关键词的位置、词性、长度等因素得到初始化转移矩阵M为:In the present embodiment, for example, the total number of keywords in the keyword set is four, which are respectively denoted as A, B, C, and D. According to the position, part of speech, length and other factors of the keyword, the initialization transfer matrix M is obtained as follows:
假设每一个关键词为待打标签文本的最终标签的概率都是相等的,即1/n;故初始的关键词概率分布就是一个所有值都为1/n的n维列向量V 0;用V n=MV n-1计算得到关键词最终概率分布Vn(此初始化转移矩阵M乘以V n-1,经过不断迭代(一般是30次左右)最终会收敛,不会出现一直迭代的情况)。 Suppose that the probability of each keyword being the final label of the text to be tagged is equal, ie 1/n; therefore the initial keyword probability distribution is an n-dimensional column vector V 0 whose values are all 1/n; V n = MV n-1 calculates the final probability distribution Vn of the keyword (this initial transfer matrix M is multiplied by V n-1 , and it will eventually converge after repeated iterations (generally about 30 times), and there will be no iterative conditions) .
S104、获取关键词最终概率分布中概率最大值的对应行,获取概率最大值的对应行所对应关键词,并将所述关键词设置为待打标签文本的标签。S104. Acquire a corresponding row of the maximum probability value in the final probability distribution of the keyword, obtain a keyword corresponding to the corresponding row of the maximum value of the probability, and set the keyword as a label of the to-be-labeled text.
作为步骤S103-S104的另一实施例,也可以是根据待打标签文本的关键词集得到初始化转移矩阵,由初始化转移矩阵及初始的关键词概率分布经过多次迭代相乘运算,当Vn这一n维列向量种有一行出现大于预设概率值时,则停止迭代并将该行对应的关键词作为待打标签文本的标签。若同时出现了多行大于预设概率值的情况,则将这些行对应的关键词同时作为待打标签文本的标签。As another embodiment of the steps S103-S104, the initialization transition matrix may be obtained according to the keyword set of the to-be-labeled text, and the initial transition matrix and the initial keyword probability distribution are subjected to multiple iterative multiplication operations, when Vn When a row of n-dimensional column vectors appears larger than the preset probability value, the iteration is stopped and the keyword corresponding to the row is used as the label of the text to be marked. If multiple lines are greater than the preset probability value at the same time, the keywords corresponding to the lines are simultaneously used as the labels of the to-be-labeled text.
在一实施例中,所述步骤S101之前,还包括:In an embodiment, before the step S101, the method further includes:
步骤一、爬取待打标签文本,并存储至MongoDB数据库中。即原始数据从网上爬取,得到待打标签文本,存放到MongoDB数据库。通过爬取数据,可设置一筛选条件,即爬取未设置标签的文本从而进行打标签。Step 1. Crawl the text to be tagged and store it in the MongoDB database. That is, the original data is crawled from the Internet, and the text to be tagged is stored and stored in the MongoDB database. By crawling the data, you can set a filter condition that crawls the text of the unset label for labeling.
该方法通过自动学习的方式对文章打标签,避免了手动打标,提高打标效率且节省人力成本。The method labels the articles by means of automatic learning, avoids manual marking, improves marking efficiency and saves labor costs.
本申请实施例还提供一种自动打标签的装置,该自动打标签的装置用于执行前述任一项自动打标签的方法。具体地,请参阅图4,图4是本申请实施例提供的一种自动打标签的装置的示意性框图。自动打标签的装置100可以安装于台式电脑、平板电脑、手提电脑、等终端中。The embodiment of the present application further provides an automatic labeling device for performing the method for automatically labeling any of the foregoing. Specifically, please refer to FIG. 4 , which is a schematic block diagram of an apparatus for automatically labeling provided by an embodiment of the present application. The
如图4所示,自动打标签的装置100包括文本预处理单元101、关键词集获取单元102、最终概率分布获取单元103、打标单元104。As shown in FIG. 4, the
文本预处理单元101,用于将待打标签文本进行分词预处理,得到预处理文本。The
如图5所示,所述文本预处理单元101包括以下子单元:As shown in FIG. 5, the
分词单元1011,用于对待打标签文本进行分词,得到分词后文本。The
在本实施例中,是基于概率统计模型的分词方法对待打标签文本进行分词。基于概率统计模型的分词方法如下:In this embodiment, the word segmentation method based on the probability statistical model performs word segmentation on the tagged text. The word segmentation method based on probability and statistical model is as follows:
1)对一个待分词的子串S,按照从左到右的顺序取出全部候选词w1,w2,…,wi,…,wn;2)到词典中查出每个候选词的概率值P(wi),并记录每个候选词的全部左邻词;3)计算每个候选词的累计概率,同时比较得到每个候选词的最佳左邻词;4)如果当前词wn是字串S的尾词,且累计概率P(wn)最大,则wn就是S的终点词;5)从wn开始,按照从右到左顺序,依次将每个词的最佳左邻词输出,即S的分词结果。1) For a substring S to be word-divided, all candidate words w1, w2, ..., wi, ..., wn; 2) are taken in order from left to right to find the probability value P of each candidate word in the dictionary ( Wi), and record all the adjacent words of each candidate word; 3) calculate the cumulative probability of each candidate word, and compare the best neighbors of each candidate word; 4) if the current word wn is the string S The last word, and the cumulative probability P(wn) is the largest, then wn is the end word of S; 5) starting from wn, in order from right to left, the best left neighbor of each word is output, that is, S Word segmentation results.
加权单元1012,用于对分词后文本包括的分词一一设置加权值。The
在本实施例中,以已进行分词的待打标签文本中分词来进行加权处理,也就是已进行分词的待打标签文本中是可以视作由多个分词组成,此时对整篇已进行分词的待打标签文本从头至尾按位置、词性、长度等因素对文本中的各分词进行加权处理,按如下规则:文本第一个词是标题,赋予权值8*;段首第一个词等于“摘要”,则赋予权值5*;段首第一个词等于“关键词”或“结论”,则赋予权值5*;词语长度等于2,赋予权值3*;词性为名词,赋予权值2*;其他,每段首赋予权值1*。In this embodiment, the weighting process is performed by the word segmentation in the tagged text that has been segmented, that is, the tagged text that has been segmented can be regarded as composed of a plurality of word segments, and the entire article has been performed. The word-to-label text of the word segmentation weights each participle in the text from beginning to end according to factors such as position, part of speech, length, etc., according to the following rules: the first word of the text is the title, and the weight is 8*; the first one of the paragraph If the word is equal to "summary", the weight is given 5*; the first word at the beginning of the paragraph is equal to "keyword" or "conclusion", then the weight is given 5*; the length of the word is equal to 2, the weight is given 3*; the part of speech is noun , the weight is given 2*; others, each paragraph is given a weight of 1*.
统计单元1013,用于删除分词后文本中的停用词,并统计各分词的词频,得到第一三元组。The
三元组<w i,fre i,v i>表示待打标签文本经处理后的结果集,其中w i是词语,fre i是词语w i加权后出现的次数,v i是词语在文本中的位置权重;其中,当对分词后文本包括的分词一一设置加权值后,需删除其中的停用词(停用词包括虚拟词、语气组词、副词、符号、一个字的词,这些停用词不会作为关键词的候选词),能准确的筛选出候选的关键词进行后续处理。 The triplet <w i ,fre i ,v i > represents the processed result set of the to-be-labeled text, where w i is the word, fre i is the number of times the word w i is weighted, and v i is the word in the text Position weight; wherein, when the weights are included in the participles included in the text after the word segmentation, the stop words are deleted (the stop words include virtual words, modal words, adverbs, symbols, words of a word, etc.) The stop word will not be used as a candidate for the keyword), and the candidate keywords can be accurately screened for subsequent processing.
相似度获取单元1014,用于获取第一三元组中词频大于预设词频阈值所对分词之间的词语相似度。The
其中,通过词语相似度计算,计算第一三元组<w i,fre i,v i>中词频fre i>2的所有词语相似度sim ij;当sim ij>0.9则认为两个词语的相似度极高,在文本中可以替换,将返回四元组<w i,w j,sim ij,fre i+fre j>,并删除第一三元组里的词语w j。四元组<w i,w j,sim ij,fre i+fre j>表示对三元组中部分词语计算相似度后的集 合,其中sim ij表示词语w i、w j的相似度,fre i+fre j表示两个词语的词频之和。 Wherein, by word similarity calculation, all word similarities sim ij of the word frequency fre i >2 in the first triad <w i , fre i , v i > are calculated; when sim ij >0.9, the similarity of the two words is considered Very high, can be replaced in the text, will return the quads <w i , w j , sim ij , fre i +fre j >, and delete the word w j in the first triple. The quaternion <w i , w j , sim ij , fre i +fre j > represents a set of similarities after calculating partial words in the triple, where sim ij represents the similarity of the words w i , w j , fre i +fre j represents the sum of the word frequencies of the two words.
删词单元1015,用于若分词之间的词语相似度大于预设词语相似度阈值,保留其中任意一个分词,得到第二三元组,并将第二三元组作为预处理文本。The deleted
其中,在第一三元组<w i,fre i,v i>中,查找四元组<w i,w j,sim ij,fre i+fre j>中的词语;当三元组的fre i替换为四元组中的fre i+fre j,重新组成第二三元组<wi,fre i+fre j,vi>,该第二三元组<w i,fre i+fre j,v i>即为预处理文本。通过文本预处理,所得到的预处理文本满足关键词筛选模型的输入标准(即将文本进行了向量化处理),能更为准确的得到文章的关键词。当对待打标签文本进行分词后,就能对各词语的词频进行统计,以作为文章标签的候选词。 Wherein, in the first triplet <w i ,fre i ,v i >, look for the words in the quads <w i , w j , sim ij , fre i +fre j >; when the triad is fre i is replaced by fre i +fre j in the quaternary, reconstituting the second triplet <wi,fre i +fre j ,vi>, the second ternary <w i ,fre i +fre j ,v i > is the pre-processed text. Through text preprocessing, the obtained preprocessed text satisfies the input criteria of the keyword screening model (that is, the text is vectorized), and the keywords of the article can be obtained more accurately. When the tagged text is segmented, the word frequency of each word can be counted as a candidate for the article tag.
关键词集获取单元102,用于将预处理文本输入词语逆频率TF-IDF算法模型,得到待打标签文本的关键词集。The keyword
在一实施例中,所述词语逆频率TF-IDF算法模型为:In an embodiment, the word inverse frequency TF-IDF algorithm model is:
其中,TF部分分子n i,j表示词语t i在文本j中出现的次数,分母表示文本j中所有的词语频词和,IWF部分分子表示语料库中所有词语频数之和,nt i表示词语t i在语料库中出现的总频数。 Wherein, the TF partial molecule n i,j represents the number of occurrences of the word t i in the text j, the denominator represents all the word frequency words in the text j, the IWF part of the molecule represents the sum of the frequency of all words in the corpus, and nt i represents the word t The total frequency that i appears in the corpus.
As shown in FIG. 6, the keyword-set acquisition unit 102 includes the following subunits:
a first processing unit 1021, configured to generate a corpus word-statistics result set;
a second processing unit 1022, configured to obtain the preprocessed text; and
a keyword-set calculation unit 1023, configured to input the preprocessed text and the corpus word-statistics result set into the word inverse-frequency TF-IDF algorithm model to obtain the keyword set of the to-be-tagged text.
In this embodiment, the corpus word-statistics result set is obtained from a corpus. To build the corpus, the user pre-selects a certain number of articles (for example, 2000), applies the text preprocessing algorithm of the text preprocessing unit 101 to each article while skipping the similarity-computation step, and obtains pairs <w_i, fre_i>, where w_i is a word and fre_i is the weighted number of occurrences of w_i. All pairs <w_i, fre_i> are then merged to obtain <w_i, fre_isum>, where fre_isum is the total frequency with which the word w_i appears in the corpus; the set of <w_i, fre_isum> is the generated corpus word-statistics result set. The preprocessed text and the corpus word-statistics result set are input into the word inverse-frequency TF-IDF algorithm model to obtain the keyword set of the to-be-tagged text, and the resulting keyword set has high accuracy.
The final probability-distribution acquisition unit 103 is configured to obtain an initialized transition matrix from the keyword set of the to-be-tagged text, and to multiply the initialized transition matrix and the initial keyword probability distribution iteratively until convergence, obtaining the final keyword probability distribution.
In an embodiment, in obtaining the initialized transition matrix from the keyword set of the to-be-tagged text, the initialized transition matrix is an n-dimensional square matrix whose dimension equals the total number of keywords in the keyword set, and the initial keyword probability distribution is an n-dimensional column vector in which every entry is 1/n, where n is a positive integer equal to the total number of keywords in the keyword set.
The repeated iterative multiplication of the initialized transition matrix and the initial keyword probability distribution is written as V_m = M V_{m-1}, where m is a positive integer, V_0 is the initial keyword probability distribution, and M is the initialized transition matrix.
In this embodiment, suppose for example that the keyword set contains a total of four keywords, denoted A, B, C, and D. The initialized transition matrix M is then obtained from factors such as the position, part of speech, and length of each keyword.
Assuming that each keyword is equally likely to be the final tag of the to-be-tagged text, i.e., with probability 1/n, the initial keyword probability distribution is an n-dimensional column vector V_0 in which every value is 1/n. The final keyword probability distribution V_m is then computed via V_m = M V_{m-1}: the initialized transition matrix M is repeatedly multiplied with V_{m-1}, and after successive iterations (generally about 30) the product converges, so the iteration does not continue indefinitely.
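A minimal sketch of this iteration, assuming M is the column-stochastic initialized transition matrix; the tolerance, iteration cap, and optional early-stop threshold are illustrative assumptions:

```python
# Sketch of the iteration V_m = M @ V_{m-1} until convergence.
from typing import Optional
import numpy as np

def final_distribution(
    M: np.ndarray,                           # n x n initialized transition matrix
    tol: float = 1e-8,
    max_iter: int = 100,
    stop_threshold: Optional[float] = None,  # optional preset probability value
) -> np.ndarray:
    n = M.shape[0]
    v = np.full((n, 1), 1.0 / n)             # V_0: every entry is 1/n
    for _ in range(max_iter):
        v_next = M @ v                        # V_m = M V_{m-1}
        if stop_threshold is not None and (v_next > stop_threshold).any():
            return v_next                     # early stop (alternative embodiment)
        if np.linalg.norm(v_next - v, 1) < tol:
            return v_next                     # converged
        v = v_next
    return v
```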
The tagging unit 104 is configured to obtain the row with the maximum probability in the final keyword probability distribution, obtain the keyword corresponding to that row, and set the keyword as the tag of the to-be-tagged text.
As another embodiment of the final probability-distribution acquisition unit 103 and the tagging unit 104, the initialized transition matrix may likewise be obtained from the keyword set of the to-be-tagged text and iteratively multiplied with the initial keyword probability distribution; in this case, when any row of the n-dimensional column vector V_m exceeds a preset probability value, the iteration stops and the keyword corresponding to that row is used as the tag of the to-be-tagged text. If several rows exceed the preset probability value at the same time, the keywords corresponding to all of those rows are used together as the tags of the to-be-tagged text.
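This alternative embodiment corresponds to passing the optional early-stop threshold into the sketch above. In the following usage example, the matrix values, keyword names, and threshold are invented purely for illustration:

```python
# Hypothetical 4-keyword example (A, B, C, D); the column-stochastic matrix
# values and the 0.27 threshold are invented for illustration only.
M = np.array([
    [0.1, 0.4, 0.3, 0.2],
    [0.3, 0.2, 0.3, 0.3],
    [0.4, 0.2, 0.2, 0.3],
    [0.2, 0.2, 0.2, 0.2],
])
v_final = final_distribution(M, stop_threshold=0.27)
keywords = ["A", "B", "C", "D"]
# All rows above the preset probability value become tags at once.
tags = [keywords[i] for i in np.flatnonzero(v_final > 0.27)]
```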
In an embodiment, the automatic tagging apparatus 100 further includes:
a crawling unit, configured to crawl the to-be-tagged text and store it in a MongoDB database. That is, the raw data is crawled from the Internet to obtain the to-be-tagged text, which is stored in the MongoDB database. When crawling the data, a filter condition can be set so that only texts without tags are crawled for tagging.
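A minimal sketch of such a crawling unit using pymongo, assuming a local MongoDB instance; the URL handling, database and collection names, and document fields are illustrative assumptions:

```python
# Sketch of crawling untagged articles into MongoDB for later labeling.
import requests
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["tagging"]["articles"]   # hypothetical db/collection names

def crawl_and_store(url: str) -> None:
    """Fetch an article and store it without tags for later labeling."""
    html = requests.get(url, timeout=10).text
    collection.insert_one({"url": url, "text": html})  # no "tags" field yet

def untagged_articles():
    """Filter condition: fetch only documents that have no tags set."""
    return collection.find({"tags": {"$exists": False}})
```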
It can be seen that the apparatus tags articles by means of automatic learning, avoiding manual tagging, improving tagging efficiency, and saving labor costs.
The above automatic tagging apparatus may be implemented in the form of a computer program, and the computer program may be run on a computer device as shown in FIG. 7.
Referring to FIG. 7, FIG. 7 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 500 may be a terminal, such as a tablet computer, a notebook computer, a desktop computer, a personal digital assistant, or another electronic device.
Referring to FIG. 7, the computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504. The non-volatile storage medium 503 may store an operating system 5031 and a computer program 5032. The computer program 5032 includes program instructions that, when executed, cause the processor 502 to perform an automatic tagging method. The processor 502 provides computing and control capabilities and supports the operation of the entire computer device 500. The internal memory 504 provides an environment for running the computer program 5032 stored in the non-volatile storage medium 503; when the computer program 5032 is executed by the processor 502, the processor 502 performs an automatic tagging method. The network interface 505 is used for network communication, such as sending assigned tasks. A person skilled in the art will understand that the structure shown in FIG. 7 is merely a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device 500 to which the solution is applied; a specific computer device 500 may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
The processor 502 is configured to run the computer program 5032 stored in the memory to implement the following functions: performing word-segmentation preprocessing on the to-be-tagged text to obtain preprocessed text; inputting the preprocessed text into the word inverse-frequency TF-IDF algorithm model to obtain the keyword set of the to-be-tagged text; obtaining an initialized transition matrix from the keyword set of the to-be-tagged text, and iteratively multiplying the initialized transition matrix and the initial keyword probability distribution until convergence to obtain the final keyword probability distribution; and obtaining the row with the maximum probability in the final keyword probability distribution, obtaining the keyword corresponding to that row, and setting the keyword as the tag of the to-be-tagged text.
In an embodiment, the processor 502 further performs the following operations: segmenting the to-be-tagged text to obtain segmented text; assigning a weight to each segmented word included in the segmented text; deleting the stop words in the segmented text and counting the frequency of each segmented word to obtain the first triple; obtaining the word similarity between the segmented words in the first triple whose frequency exceeds a preset word-frequency threshold; and, if the word similarity between two segmented words exceeds a preset word-similarity threshold, retaining either one of them to obtain the second triple and using the second triple as the preprocessed text.
In an embodiment, the processor 502 further performs the following operations: generating a corpus word-statistics result set; obtaining the preprocessed text; and inputting the preprocessed text and the corpus word-statistics result set into the word inverse-frequency TF-IDF algorithm model to obtain the keyword set of the to-be-tagged text, where the word inverse-frequency TF-IDF algorithm model is as stated above:
Here, as before, the numerator n_{i,j} of the TF part denotes the number of occurrences of the word t_i in text j, the denominator denotes the sum of the frequencies of all words in text j, the numerator of the IWF part denotes the sum of the frequencies of all words in the corpus, and nt_i denotes the total frequency with which the word t_i appears in the corpus.
In an embodiment, in obtaining the initialized transition matrix from the keyword set of the to-be-tagged text, the initialized transition matrix is an n-dimensional square matrix whose dimension equals the total number of keywords in the keyword set, and the initial keyword probability distribution is an n-dimensional column vector in which every entry is 1/n, where n is a positive integer equal to the total number of keywords in the keyword set.
The repeated iterative multiplication of the initialized transition matrix and the initial keyword probability distribution is written as V_m = M V_{m-1}, where m is a positive integer, V_0 is the initial keyword probability distribution, and M is the initialized transition matrix.
In an embodiment, the processor 502 further performs the following operation: crawling the to-be-tagged text and storing it in the MongoDB database.
A person skilled in the art will understand that the embodiment of the computer device shown in FIG. 7 does not limit the specific configuration of the computer device; in other embodiments, the computer device may include more or fewer components than illustrated, combine certain components, or have a different arrangement of components. For example, in some embodiments, the computer device may include only a memory and a processor; in such an embodiment, the structures and functions of the memory and the processor are the same as in the embodiment shown in FIG. 7 and are not repeated here.
It should be understood that in the embodiments of the present application, the processor 502 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor or any conventional processor.
Another embodiment of the present application provides a storage medium. The storage medium may be a non-volatile computer-readable storage medium. The storage medium stores a computer program, where the computer program includes program instructions that, when executed by a processor, implement the automatic tagging method of the embodiments of the present application.
The storage medium may be an internal storage unit of the aforementioned device, such as a hard disk or a memory of the device. The storage medium may also be an external storage device of the device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the device. Further, the storage medium may include both an internal storage unit of the device and an external storage device.
A person skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the devices, apparatuses, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
The foregoing is only a specific embodiment of the present application, but the scope of protection of the present application is not limited thereto. Any equivalent modification or substitution readily conceivable by a person skilled in the art within the technical scope disclosed in the present application shall be covered by the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.