CN111428028A - Information classification method based on deep learning and related equipment - Google Patents
- Publication number
- CN111428028A CN111428028A CN202010142300.9A CN202010142300A CN111428028A CN 111428028 A CN111428028 A CN 111428028A CN 202010142300 A CN202010142300 A CN 202010142300A CN 111428028 A CN111428028 A CN 111428028A
- Authority
- CN
- China
- Prior art keywords
- classification
- information
- data
- identified
- deep learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Creation or modification of classes or clusters
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present application relates to the technical field of data analysis, and in particular to a deep learning-based information classification method and related equipment. The method includes: obtaining the data quantity of the information to be identified, determining a clustering method for the information according to that quantity, and applying the clustering method to preprocess the information, yielding pre-classified data; performing word vector conversion on the pre-classified data to obtain its word vectors; inputting the word vectors into a deep learning model for text feature extraction to obtain multiple text features; classifying each text feature to obtain classification results; and scoring the classification results with a voting mechanism, determining the classification label of the information to be identified from the scores. The present application effectively addresses the problem that, under imbalanced data, text features extracted by a deep learning model fail to accurately reflect the content of the original information.
Description
Technical Field
The present application relates to the technical field of data analysis, and in particular to a deep learning-based information classification method and related equipment.
Background
People can usually understand the multiple intentions expressed in a text, but it is difficult for a bot to recognize all of them. This leads to incomplete answers: customers cannot obtain complete, satisfactory responses from the bot, and the bot may even return a wrong answer because it missed one of the customer's intentions, producing a very poor experience and reducing customer satisfaction. Enabling a machine to interpret the multiple intentions a customer expresses is therefore an important task for customer-service bots.
At present, multi-intent recognition mainly relies on text classification. However, text classification suffers from a data imbalance problem, so text features extracted by a deep learning model may fail to accurately reflect the content of the original information.
Summary of the Invention
In view of the problem that data imbalance in text classification prevents deep learning models from extracting text features that accurately reflect the content of the original information, a deep learning-based information classification method and related equipment are provided.
A deep learning-based information classification method comprises the following steps:
obtaining the data quantity of the information to be identified, determining a clustering method for the information to be identified according to the data quantity, and applying the clustering method to preprocess the information to be identified to obtain pre-classified data;
performing word vector conversion on the pre-classified data to obtain word vectors of the pre-classified data;
inputting the word vectors of the pre-classified data into a preset deep learning model for text feature extraction to obtain multiple text features;
classifying each text feature to obtain classification results of the text features;
scoring the classification results with a preset voting mechanism to obtain scoring results, and determining the classification label of the information to be identified according to the scoring results.
In one possible embodiment, obtaining the data quantity of the information to be identified, determining a clustering method according to the data quantity, and applying the clustering method to preprocess the information to obtain pre-classified data includes:
comparing the data quantity with a preset data quantity threshold; if the data quantity is greater than the threshold, determining that the information to be identified is large-sample data, otherwise determining that it is small-sample data;
if the information to be identified is large-sample data, removing noise points and outliers from the large-sample data and then applying a clustering algorithm to cluster it, obtaining the pre-classified data;
if the information to be identified is small-sample data, applying a clustering algorithm to cluster similar samples in the small-sample data into multiple clusters, and processing the data in each cluster with a genetic crossover algorithm to obtain the pre-classified data.
In one possible embodiment, performing word vector conversion on the pre-classified data to obtain its word vectors includes:
obtaining a preset word vector embedding model, and dividing the pre-classified data into multiple sentences according to the attributes of the model;
inputting the sentences into the word vector embedding model for mapping to obtain initial text word vectors;
computing the feature values of the initial text word vectors, deleting those whose feature value is zero, and aggregating the remaining vectors to obtain the word vectors of the pre-classified data.
In one possible embodiment, inputting the word vectors of the pre-classified data into a preset deep learning model for text feature extraction to obtain multiple text features includes:
inputting preset standard word vectors into the input layer of a preset recurrent neural network model, performing probability prediction on the word vectors processed by the input layer through the model's hidden layer to obtain probability prediction results, and converting the probability prediction results through the output layer to obtain predicted keywords;
comparing the predicted keywords with the keywords corresponding to the standard word vectors; if they match, inputting the word vectors of the pre-classified data into the recurrent neural network model for feature extraction; otherwise, adjusting the hidden-layer parameters and re-predicting until the predicted keywords match the keywords corresponding to the standard word vectors.
In one possible embodiment, classifying each text feature to obtain classification results of the text features includes:
obtaining classifiers of different categories and establishing a classifier subtree according to the hierarchical relationships among the classifiers;
inputting the text features into the root node of the classifier subtree for a first classification, and passing the first classification result to the next-level leaf node of the root node;
continuing classification with the next-level leaf node as the new root node until the next-level leaf node is a minimal leaf node;
aggregating the classification results of the minimal leaf nodes to obtain the classification results of the text features.
In one possible embodiment, scoring the classification results with a preset voting mechanism to obtain scoring results, and determining the classification label of the information to be identified according to the scoring results, includes:
obtaining the classification accuracy of the terminal classifier corresponding to each minimal leaf node, and using the accuracy as the weight of that terminal classifier;
using the weights as auxiliary parameters, applying the voting mechanism to vote on and score the classification labels output by the terminal classifiers;
extracting the classification labels whose voting scores exceed a score threshold as the classification labels of the information to be identified.
In one possible embodiment, continuing classification with the next-level leaf node as the new root node until the next-level leaf node is a minimal leaf node includes:
obtaining the similarity between the outputs of all leaf nodes at a given level, and extracting the target leaf nodes whose outputs have similarity greater than a similarity threshold as the root nodes of the next level of classification;
obtaining the node labels corresponding to the target leaf nodes and inputting them into the next-level classifier for classification, until the root node of the next level of classification is a minimal leaf node.
A deep learning-based information classification apparatus comprises the following modules:
a pre-classification module, configured to obtain the data quantity of the information to be identified, determine a clustering method according to the data quantity, and apply the clustering method to preprocess the information to obtain pre-classified data;
a word vector module, configured to perform word vector conversion on the pre-classified data to obtain its word vectors;
a feature extraction module, configured to input the word vectors of the pre-classified data into a preset deep learning model for text feature extraction to obtain multiple text features;
a result generation module, configured to classify each text feature to obtain classification results;
a label generation module, configured to score the classification results with a preset voting mechanism to obtain scoring results, and determine the classification label of the information to be identified according to the scoring results.
A computer device comprises a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the above deep learning-based information classification method.
A storage medium stores computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the above deep learning-based information classification method.
Compared with existing mechanisms, the present application obtains the data quantity of the information to be identified, determines a clustering method according to that quantity, and preprocesses the information with the clustering method to obtain pre-classified data; performs word vector conversion on the pre-classified data to obtain its word vectors; inputs the word vectors into a preset deep learning model for text feature extraction to obtain multiple text features; classifies each text feature to obtain classification results; and scores the classification results with a preset voting mechanism, determining the classification label of the information to be identified from the scoring results. This effectively addresses the problem that, under imbalanced data, text features extracted by a deep learning model fail to accurately reflect the content of the original information.
Brief Description of the Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are provided only to illustrate the preferred embodiments and are not to be considered limiting of the application.
Fig. 1 is an overall flowchart of a deep learning-based information classification method in an embodiment of the present application;
Fig. 2 is a schematic diagram of the pre-classification process in a deep learning-based information classification method in an embodiment of the present application;
Fig. 3 is a schematic diagram of the result generation process in a deep learning-based information classification method in an embodiment of the present application;
Fig. 4 is a structural diagram of a deep learning-based information classification apparatus in an embodiment of the present application.
Detailed Description
To make the purpose, technical solutions, and advantages of the present application clearer, the application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are intended only to explain the present application, not to limit it.
Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "the", and "said" used herein may also include the plural forms. It should further be understood that the word "comprising" used in this specification indicates the presence of the stated features, integers, steps, operations, elements, and/or components, but does not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Fig. 1 is an overall flowchart of a deep learning-based information classification method in an embodiment of the present application. The method comprises the following steps:
S1. Obtain the data quantity of the information to be identified, determine a clustering method for the information according to the data quantity, and apply the clustering method to preprocess the information to obtain pre-classified data.
Specifically, during information recognition it must be determined whether the information to be identified is single-intent or multi-intent. Single-intent means a sentence contains only one intention; for example, "I want to listen to Jay Chou's songs" can be attributed to a music intent. Multi-intent means an utterance may carry several intentions; for example, "I want to buy an apple" can be attributed to a fruit intent (I want to buy fruit) or to an electronics intent (I want to buy an iPhone). The system then judges, from the user's past search records or contextual information, which intent the sentence expresses in the current conversation, and returns the best answer first. When counting the data quantity of the information to be identified, a natural language recognition algorithm classifies the single-sentence and multi-sentence intents: the single sentence corresponding to each single-sentence intent counts as one data item, and the multiple sentences corresponding to each multi-sentence intent together count as one data item. For clustering, the algorithms used are mainly the K-means algorithm and agglomerative hierarchical clustering.
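As a minimal illustration of the dispatch described above (the threshold, function names, and toy data are hypothetical, not from the patent, and a trivial 1-D k-means stands in for the real K-means / agglomerative clustering step), the choice of clustering path by data quantity can be sketched as:

```python
def kmeans_1d(values, k=2, iters=20):
    """Tiny 1-D k-means used as a stand-in clustering step."""
    centers = sorted(values)[::max(1, len(values) // k)][:k]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for v in values:
            idx = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[idx].append(v)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters

def preclassify(data, threshold=100):
    """Dispatch by data quantity: large samples are clustered directly;
    small samples would additionally receive genetic-crossover
    augmentation (omitted here)."""
    if len(data) > threshold:
        return "large", kmeans_1d(data)
    return "small", kmeans_1d(data)
```

A real implementation would likely use scikit-learn's KMeans and AgglomerativeClustering; the sketch only shows the branch-by-quantity control flow.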
S2. Perform word vector conversion on the pre-classified data to obtain its word vectors.
Specifically, word vector conversion commonly uses word2vec; in this step, the BERT model is used as the word vector embedding model. With BERT, text can be converted to word vectors as follows: 1. Install BERT, (1) the service on the server side and (2) the client on the client side. 2. Start the service by executing bert-serving-start -model_dir /tmp/english_L-12_H-768_A-12/ -num_worker=4, where /tmp/english_L-12_H-768_A-12/ is the path of the downloaded model. 3. Vectorize text with a Python script, executing: from bert_serving.client import BertClient; bc = BertClient(); bc.encode(['哪种保险好', '小孩买什么保险好']).
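The bert-serving call above requires a running server. As a self-contained stand-in that only illustrates the interface shape (the hashing "embedding" below is purely a toy assumption, carries no semantics, and is not part of the patent), sentence-to-vector encoding can be sketched as:

```python
import hashlib

DIM = 8  # real BERT-base sentence vectors have 768 dimensions

def encode(sentences):
    """Toy stand-in for BertClient().encode(): maps each sentence to a
    fixed-length numeric vector, deterministically but without meaning."""
    vectors = []
    for s in sentences:
        digest = hashlib.sha256(s.encode("utf-8")).digest()
        vectors.append([b / 255.0 for b in digest[:DIM]])
    return vectors

vecs = encode(["哪种保险好", "小孩买什么保险好"])
```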
S3. Input the word vectors of the pre-classified data into a preset deep learning model for text feature extraction to obtain multiple text features.
Specifically, the deep learning model selected for text feature extraction is usually a convolutional neural network or a recurrent neural network. Before extracting text features from the word vectors, the model must be trained; only when the trained model's text feature accuracy exceeds a preset threshold can it be used to extract text features from the pre-classified word vectors. In this step, the term frequency-inverse document frequency (TF-IDF) algorithm can be used for text feature extraction.
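A minimal TF-IDF computation consistent with this step (the toy corpus and tokenization are hypothetical) can be sketched as:

```python
import math

def tf_idf(docs):
    """docs: list of token lists. Returns per-document {term: tf-idf} maps,
    using tf = count / len(doc) and idf = ln(N / df)."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    scores = []
    for doc in docs:
        s = {}
        for term in set(doc):
            tf = doc.count(term) / len(doc)
            s[term] = tf * math.log(n / df[term])
        scores.append(s)
    return scores

docs = [["insurance", "child"], ["insurance", "car"], ["music", "song"]]
scores = tf_idf(docs)
```

Terms that occur in many documents ("insurance") receive lower weight than terms specific to one document ("child"), which is the property that makes TF-IDF useful as a text feature.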
S4. Classify each text feature to obtain classification results of the text features.
Specifically, text feature classification uses a computer to automatically label a text collection (or other entities or objects) according to a given classification system or standard. Commonly used methods include naive Bayes, decision trees, and support vector machines (SVM). Classifiers must be trained before classifying text features, and only results produced by validated classifiers can be used as reliable results.
The decision tree method applies a decision tree model to classify text features: the features are first input to the root node of the model for a first classification, then passed through the first leaf node for a second classification, and so on until the minimal leaf node of the model is reached. A decision tree model can classify text features level by level, from coarse to fine, yielding more accurate classification results. For example, for the text feature "orange", the root node in the decision tree model is "organism", the first leaf node is "plant", and the minimal leaf node is "fruit".
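The coarse-to-fine walk described above, using the node names from the patent's orange example (the tree structure itself is a hypothetical sketch, not the patent's actual model), can be illustrated as:

```python
# Hypothetical classifier subtree: each node maps a feature to a child
# node; a node absent from TREE is a minimal leaf.
TREE = {
    "organism": {"orange": "plant", "dog": "animal"},
    "plant": {"orange": "fruit"},
}

def classify(feature, root="organism"):
    """Walk root -> ... -> minimal leaf, collecting labels on the way."""
    path = [root]
    node = root
    while node in TREE and feature in TREE[node]:
        node = TREE[node][feature]
        path.append(node)
    return path  # the last element is the minimal leaf's label

path = classify("orange")
```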
S5. Score the classification results with a preset voting mechanism to obtain scoring results, and determine the classification label of the information to be identified according to the scoring results.
A voting mechanism is a combination strategy for classification problems in ensemble learning. The basic idea is to select the class output most often across all of the machine learning algorithms. Machine learning classification algorithms produce two types of output: class labels directly, or class probabilities. Voting on the former is called majority (hard) voting; combining the latter is called soft voting. In this step, soft voting is used, with weights added as auxiliary parameters during voting so that classification labels can be obtained more reliably.
Soft voting can proceed as follows: first obtain the class probabilities output by the machine learning algorithm, e.g. 50% for class A, 30% for class B, and 20% for class C; then obtain the weight of each class, e.g. 0.3 for class A, 0.5 for class B, and 0.2 for class C; compute the weighted average for each class, and select the class with the largest value.
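In the usual ensemble formulation, the weights attach to classifiers (e.g. the accuracy of each terminal classifier, as in the embodiment above) rather than to classes; a sketch under that assumption, with hypothetical probabilities and weights, is:

```python
def soft_vote(prob_lists, weights):
    """Weighted soft voting: prob_lists[i] is classifier i's
    {class: probability} map; weights[i] is its accuracy-based weight.
    Returns the class with the highest weighted-average probability."""
    totals = {}
    for probs, w in zip(prob_lists, weights):
        for cls, p in probs.items():
            totals[cls] = totals.get(cls, 0.0) + w * p
    wsum = sum(weights)
    avg = {cls: t / wsum for cls, t in totals.items()}
    return max(avg, key=avg.get), avg

label, avg = soft_vote(
    [{"A": 0.5, "B": 0.3, "C": 0.2}, {"A": 0.2, "B": 0.6, "C": 0.2}],
    weights=[0.9, 0.6],  # e.g. the two terminal classifiers' accuracies
)
```

Here the second, slightly less accurate classifier still pulls the decision toward class B because its probability mass there is much larger.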
In this embodiment, by pre-classifying the information that requires intent recognition and applying different processing methods according to its type, the problem that data imbalance prevents a deep learning model from extracting text features that accurately reflect the content of the original information can be effectively solved.
图2为本申请在一个实施例中的一种基于深度学习的信息分类方法中的预分类过程示意图,如图所示,所述S1、获取待识别信息的数据数量,根据所述数据数量确定所述待识别信息的聚类方式,应用所述聚类方式对所述待识别信息进行预处理,得到预分类数据,包括:2 is a schematic diagram of a pre-classification process in an information classification method based on deep learning in an embodiment of the present application. As shown in the figure, in S1, the data quantity of the information to be identified is obtained, which is determined according to the data quantity The clustering method of the to-be-identified information, using the clustering method to preprocess the to-be-identified information to obtain pre-classified data, including:
S11、将所述数据数量与预设的数据数量阈值进行比较,若所述数据数量大于所述数据数量阈值,则确定所述待识别信息为大样本数据,否则确定所述待识别信息为小样本数据;S11. Compare the data quantity with a preset data quantity threshold. If the data quantity is greater than the data quantity threshold, determine that the to-be-identified information is large sample data, otherwise, determine that the to-be-identified information is small sample;
其中,通常针对不同的数据集,大小样本划分的数据量阈值设置不同,一般可以采用中位数、众数、平均值等指标作为数据量阈值,样本数据量大于阈值判定为大样本,反之则判定为小样本。Among them, usually for different data sets, the data volume thresholds for large and small samples are set differently. Generally, indicators such as median, mode, and average can be used as data volume thresholds. If the sample data volume is greater than the threshold, it is determined as a large sample, and vice versa. judged to be a small sample.
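A minimal sketch of the median-based threshold suggested above (the per-sample counts are hypothetical):

```python
def median_threshold(counts):
    """Return the median of the per-sample data quantities,
    used as the large/small-sample threshold."""
    s = sorted(counts)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2

def split_samples(counts):
    """Partition samples into large (> threshold) and small (<= threshold)."""
    t = median_threshold(counts)
    large = [c for c in counts if c > t]
    small = [c for c in counts if c <= t]
    return t, large, small

t, large, small = split_samples([12, 90, 300, 45, 7])
```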
S12. If the information to be identified is large-sample data, remove noise points and outliers from the large-sample data, then apply a clustering algorithm to cluster it and obtain the pre-classified data.
Specifically, large-sample data is data conforming to a normal distribution. When clustering large-sample data, non-text noise points such as the special symbols ";" and "。" must be removed; and if the large-sample data contains points that do not conform to the normal distribution, these points are treated as outliers and must be removed before clustering.
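The two cleanup operations above can be sketched as follows (the symbol set and the k-sigma outlier rule are illustrative assumptions; the patent names only the operations, not their parameters):

```python
import re

NOISE = r"[;；。,，!！?？]"  # illustrative set of non-text noise symbols

def strip_noise(texts):
    """Remove special-symbol noise points from each text."""
    return [re.sub(NOISE, "", t) for t in texts]

def drop_outliers(values, k=3.0):
    """Remove points more than k standard deviations from the mean,
    a common stand-in for 'does not fit the normal distribution'."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    if std == 0:
        return list(values)
    return [v for v in values if abs(v - mean) <= k * std]
```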
S13. If the information to be identified is small-sample data, apply a clustering algorithm to cluster similar samples in the small-sample data into multiple clusters, and process the data in each cluster with a genetic crossover algorithm to obtain the pre-classified data.
Specifically, compared with large-sample data, small-sample data contains less feature information: the machine learns relatively little from a small sample, and the computed joint probabilities are relatively small. A clustering algorithm such as K-means therefore cannot be applied directly for classification statistics, as doing so would seriously harm classification accuracy. When the genetic crossover algorithm is applied, the crossover operator used is a uniform crossover operator: two crossover points are generated at random in the two parents A and B, and genes are exchanged according to randomly generated integers 0, 1, and 2, forming two new individuals and completing the crossover.
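The crossover operator is only loosely specified above (it is named "uniform" but described with two cut points); the sketch below implements a plain two-point crossover as one concrete reading, with hypothetical parent encodings:

```python
import random

def two_point_crossover(a, b, rng=random):
    """Hypothetical sketch of the crossover step: pick two cut points and
    swap the middle segment between parents a and b, yielding two offspring."""
    assert len(a) == len(b) and len(a) >= 2
    i, j = sorted(rng.sample(range(len(a)), 2))
    child1 = a[:i] + b[i:j] + a[j:]
    child2 = b[:i] + a[i:j] + b[j:]
    return child1, child2

rng = random.Random(0)
c1, c2 = two_point_crossover([0, 0, 0, 0, 0], [1, 1, 1, 1, 1], rng)
```

Each offspring keeps the parents' length, and the two offspring together conserve the parents' genes, which is what makes crossover useful for augmenting small clusters with plausible new samples.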
In this embodiment, by dividing the data to be analyzed into sample types and applying different clustering algorithms to each, inaccurate information classification caused by data imbalance is avoided.
In one embodiment, performing word vector conversion on the pre-classified data to obtain its word vectors includes:
obtaining a preset word vector embedding model, and dividing the pre-classified data into multiple sentences according to the attributes of the model;
specifically, different word vector embedding models impose different limits on the character length accepted for word vector conversion;
inputting the sentences into the word vector embedding model for mapping to obtain initial text word vectors;
computing the feature values of the initial text word vectors, deleting those whose feature value is zero, and aggregating the remaining vectors to obtain the word vectors of the pre-classified data.
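The patent does not define how a vector's "feature value" is computed; assuming the squared L2 norm as a stand-in, the zero-vector filtering step can be sketched as:

```python
def drop_zero_vectors(vectors):
    """Keep only vectors whose feature value is nonzero; the squared
    L2 norm stands in for the patent's unspecified feature value."""
    def feature_value(v):
        return sum(x * x for x in v)
    return [v for v in vectors if feature_value(v) != 0]

kept = drop_zero_vectors([[0.0, 0.0], [0.3, -0.1], [0.0, 0.5]])
```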
In this embodiment, BERT is used for word-vector embedding. BERT is built on the Transformer sequence model, is bidirectional, and can obtain sentence-level semantic representations above the word level; it is highly general-purpose and performs well.
In one embodiment, inputting the word vectors of the pre-classified data into a preset deep learning model for text feature extraction to obtain multiple text features includes:
inputting a preset standard word vector into the input layer of a preset recurrent neural network model, performing probability prediction on the word vector processed by the input layer through the hidden layer of the recurrent neural network model to obtain a probability prediction result, and converting the probability prediction result through the output layer of the recurrent neural network model to obtain a predicted keyword;
The recurrent neural network model comprises an input layer, a hidden layer and an output layer: the input layer receives the data, the hidden layer processes it, and the output layer outputs the result. In the hidden layer the data undergoes a series of operations, mainly including gradient clipping, regularization and gating, through which the data is processed effectively.
comparing the predicted keyword with the keyword corresponding to the standard word vector; if they are consistent, inputting the word vectors of the pre-classified data into the recurrent neural network model for feature extraction; otherwise, changing the parameters of the hidden layer and predicting again until the predicted keyword is consistent with the keyword corresponding to the standard word vector.
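The accept-or-retune check just described might look like the following loop; `model.predict_keyword` and `adjust_fn` are hypothetical stand-ins for the recurrent network's forward pass and its hidden-layer parameter update, and the bounded retry count is an added safeguard not stated in the text.

```python
def validate_then_extract(model, standard_vec, expected_keyword,
                          adjust_fn, max_rounds=10):
    """Sketch of the validation step above: the model must reproduce the
    expected keyword for the standard word vector before it is trusted
    for feature extraction on the pre-classified data."""
    for _ in range(max_rounds):
        predicted = model.predict_keyword(standard_vec)
        if predicted == expected_keyword:
            return True   # model accepted; proceed to feature extraction
        adjust_fn(model)  # change hidden-layer parameters, then re-predict
    return False          # gave up within the retry budget
```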
In this embodiment, text features are extracted from the word vectors by the deep learning model, which ensures the accuracy of the text features.
FIG. 3 is a schematic diagram of the result generation process in a deep-learning-based information classification method in one embodiment of the present application. As shown in the figure, step S4, classifying the text features to obtain classification results of the text features, includes:
S41: obtaining classifiers of different categories, and building a classifier subtree according to the hierarchical relationship between the classifiers;
Specifically, a probability calculation can be used for the hierarchical classification, as follows:
P(n_k | d_i) = ∏_j p(n_j | d_i) × a_k

where P(n_k | d_i) is the probability that document d_i is finally classified to node n_k, p(n_j | d_i) is the probability for each ancestor node n_j that document d_i passes through before reaching node n_k, and a_k is a height penalty factor.
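The hierarchical score can be computed directly from the formula above:

```python
def path_probability(ancestor_probs, height_penalty):
    """P(n_k | d_i) = (product over ancestors j of p(n_j | d_i)) * a_k.

    ancestor_probs: the probabilities p(n_j | d_i) along the path from
    the root down to node n_k; height_penalty: the factor a_k."""
    prod = 1.0
    for p in ancestor_probs:
        prod *= p
    return prod * height_penalty
```

Multiplying the whole ancestor path into the score is what lets a confident deep node outrank a shallow node whose early-level decision was shaky.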
S42: inputting the text features into the root node of the classifier subtree, performing a first classification to obtain a first classification result, and inputting the first classification result into the leaf nodes at the next level below the root node;
S43: continuing the classification with the next-level leaf node as a new root node, until the next-level leaf node is a minimum leaf node;
Specifically, the similarities between the output results of all leaf nodes at a given level are obtained, and the target leaf nodes corresponding to the output results whose similarity is greater than a similarity threshold are extracted as root nodes for the next level of classification;
the node labels corresponding to the target leaf nodes are obtained and input into the next-level classifier for classification, until the root node of the next-level classification is a minimum leaf node.
The similarity threshold is set according to the type of classifier. For example, suppose the classifier judges the probabilities that a text belongs to the task, small-talk and FAQ categories, the calculated probabilities are 0.8, 0.3 and 0.9 respectively, and the threshold is set to 0.5: the text data can then be judged to belong to the task and small-talk categories. Since small talk is a leaf node, no further judgment is made for it, while the next-level classifier computes the probabilities that the text belongs to the customer-service task type and the insurance-advisory task type; if the calculated probabilities are 0.88 and 0.21 respectively, the final result is the customer-service task type and small talk.
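The threshold test in the example above reduces to a simple filter over classifier scores; this is a sketch, and the label names are illustrative only.

```python
def labels_above_threshold(scores, threshold=0.5):
    """Keep the candidate classes whose classifier score exceeds the
    threshold; each survivor either stops (leaf node) or continues to
    the next-level classifier."""
    return [label for label, p in scores.items() if p > threshold]
```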
S44: summarizing the classification results of the minimum leaf nodes to obtain the classification results of the text features.
In this embodiment, an improved hierarchical classification method avoids the problem that a classification error at an upper level causes all subsequent classifications to be wrong.
In one embodiment, applying a preset voting mechanism to score the classification results to obtain scoring results, and determining the classification label of the information to be identified according to the scoring results, includes:
obtaining the classification accuracy of the end classifier corresponding to each minimum leaf node, and using the classification accuracy as the weight of that end classifier;
The classification accuracy is computed as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives.
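Expressed in code, the accuracy used as the classifier weight is:

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN), i.e. the fraction of
    all examples the end classifier labels correctly."""
    return (tp + tn) / (tp + tn + fp + fn)
```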
using the weights as auxiliary parameters, applying the voting mechanism to vote on and score the classification labels output by the end classifiers; and
extracting the classification labels whose voting score is greater than a score threshold as the classification labels of the information to be identified.
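Putting the accuracy weights and the score threshold together, the voting step might be sketched as follows; the pair-list input format and the additive score aggregation are assumptions, since the patent does not fix them.

```python
from collections import defaultdict

def weighted_vote(predictions, score_threshold):
    """Each end classifier votes for its output label with its own
    classification accuracy as the vote weight; labels whose summed
    score exceeds the threshold become the final classification labels.

    predictions: list of (label, accuracy_weight) pairs."""
    scores = defaultdict(float)
    for label, weight in predictions:
        scores[label] += weight
    return [label for label, s in scores.items() if s > score_threshold]
```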
In this embodiment, the voting mechanism scores the classification results effectively, which greatly improves the accuracy of the classification labels.
The technical features mentioned in any of the above embodiments or implementations also apply to the embodiment corresponding to FIG. 4 of the present application; similar details are not repeated below.
The deep-learning-based information classification method of the present application has been described above; an apparatus for performing the above method is described below.
FIG. 4 shows a structural diagram of a deep-learning-based information classification apparatus, which can be applied to deep-learning-based information classification. The apparatus in this embodiment of the present application can implement the steps of the deep-learning-based information classification method performed in the embodiment corresponding to FIG. 1. The functions implemented by the apparatus may be realized by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions, and the modules may be software and/or hardware.
In one embodiment, a deep-learning-based information classification apparatus is provided, as shown in FIG. 4, comprising the following modules:
a pre-classification module 10, configured to obtain the data quantity of the information to be identified, determine the clustering method for the information to be identified according to the data quantity, and pre-process the information to be identified using the clustering method to obtain pre-classified data;
a word vector module 20, configured to perform word-vector conversion on the pre-classified data to obtain word vectors of the pre-classified data;
a feature extraction module 30, configured to input the word vectors of the pre-classified data into a preset deep learning model for text feature extraction to obtain multiple text features;
a result generation module 40, configured to classify the text features to obtain classification results of the text features; and
a label generation module 50, configured to apply a preset voting mechanism to score the classification results to obtain scoring results, and determine the classification label of the information to be identified according to the scoring results.
In one embodiment, a computer device is provided, comprising a memory and a processor, wherein the memory stores computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the deep-learning-based information classification method in the above embodiments.
In one embodiment, a storage medium storing computer-readable instructions is provided; when executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the steps of the deep-learning-based information classification method in the above embodiments. The storage medium may be a non-volatile storage medium or a volatile storage medium; this application places no specific limitation on it.
Those of ordinary skill in the art will understand that all or some of the steps in the methods of the above embodiments can be completed by a program instructing relevant hardware, and the program can be stored in a computer-readable storage medium, which may include read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disk, and the like.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features of the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only some exemplary embodiments of the present application; they are described relatively specifically and in detail, but should not therefore be construed as limiting the patent scope of the present application. It should be noted that those of ordinary skill in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of this application. Therefore, the protection scope of this patent shall be governed by the appended claims.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010142300.9A CN111428028A (en) | 2020-03-04 | 2020-03-04 | Information classification method based on deep learning and related equipment |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010142300.9A CN111428028A (en) | 2020-03-04 | 2020-03-04 | Information classification method based on deep learning and related equipment |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN111428028A true CN111428028A (en) | 2020-07-17 |
Family
ID=71547408
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010142300.9A Pending CN111428028A (en) | 2020-03-04 | 2020-03-04 | Information classification method based on deep learning and related equipment |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN111428028A (en) |
Cited By (13)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112133306A (en) * | 2020-08-03 | 2020-12-25 | 浙江百世技术有限公司 | Response method and device based on express delivery user and computer equipment |
| CN112287084A (en) * | 2020-10-30 | 2021-01-29 | 国网江苏省电力有限公司营销服务中心 | Question-answering method and system based on ensemble learning |
| CN112329877A (en) * | 2020-11-16 | 2021-02-05 | 山西三友和智慧信息技术股份有限公司 | Voting mechanism-based web service classification method and system |
| CN112632274A (en) * | 2020-10-29 | 2021-04-09 | 中科曙光南京研究院有限公司 | Abnormal event classification method and system based on text processing |
| CN112632222A (en) * | 2020-12-25 | 2021-04-09 | 海信视像科技股份有限公司 | Terminal equipment and method for determining data belonging field |
| CN112784607A (en) * | 2021-02-23 | 2021-05-11 | 中国工商银行股份有限公司 | Client intention identification method and device |
| CN112988954A (en) * | 2021-05-17 | 2021-06-18 | 腾讯科技(深圳)有限公司 | Text classification method and device, electronic equipment and computer-readable storage medium |
| CN114036354A (en) * | 2021-11-08 | 2022-02-11 | 中车大同电力机车有限公司 | Research and development knowledge pushing method and device, electronic equipment and storage medium |
| CN115575380A (en) * | 2022-10-27 | 2023-01-06 | 江南大学 | Rapid Quality Identification Method of Daqu Based on Raman Spectroscopy |
| CN116204645A (en) * | 2023-03-02 | 2023-06-02 | 北京数美时代科技有限公司 | Intelligent text classification method, system, storage medium and electronic equipment |
| CN116738343A (en) * | 2023-08-08 | 2023-09-12 | 云筑信息科技(成都)有限公司 | Material data identification method and device for construction industry and electronic equipment |
| CN117150003A (en) * | 2022-05-19 | 2023-12-01 | 中国移动通信集团浙江有限公司 | Work order analysis method and device |
| CN118820910A (en) * | 2024-09-19 | 2024-10-22 | 北京芯盾时代科技有限公司 | A heterogeneous network security big data governance method and system |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN102024179A (en) * | 2010-12-07 | 2011-04-20 | 南京邮电大学 | Genetic algorithm-self-organization map (GA-SOM) clustering method based on semi-supervised learning |
| CN102866179A (en) * | 2012-09-13 | 2013-01-09 | 重庆大学 | Online recognition and inhibition method based on non-target interference smell in electronic nose of artificial intelligent learning machine |
| CN108959482A (en) * | 2018-06-21 | 2018-12-07 | 北京慧闻科技发展有限公司 | Single-wheel dialogue data classification method, device and electronic equipment based on deep learning |
| CN109101537A (en) * | 2018-06-27 | 2018-12-28 | 北京慧闻科技发展有限公司 | More wheel dialogue data classification methods, device and electronic equipment based on deep learning |
| CN110309304A (en) * | 2019-06-04 | 2019-10-08 | 平安科技(深圳)有限公司 | A text classification method, device, equipment and storage medium |
| CN110532558A (en) * | 2019-08-29 | 2019-12-03 | 杭州涂鸦信息技术有限公司 | A kind of more intension recognizing methods and system based on the parsing of sentence structure deep layer |
| CN110782879A (en) * | 2019-09-18 | 2020-02-11 | 平安科技(深圳)有限公司 | Sample size-based voiceprint clustering method, device, equipment and storage medium |
- 2020-03-04: CN202010142300.9A filed; published as CN111428028A (status: pending)
Cited By (20)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112133306A (en) * | 2020-08-03 | 2020-12-25 | 浙江百世技术有限公司 | Response method and device based on express delivery user and computer equipment |
| CN112133306B (en) * | 2020-08-03 | 2023-10-03 | 浙江百世技术有限公司 | Response method and device based on express delivery user and computer equipment |
| CN112632274A (en) * | 2020-10-29 | 2021-04-09 | 中科曙光南京研究院有限公司 | Abnormal event classification method and system based on text processing |
| CN112632274B (en) * | 2020-10-29 | 2024-04-26 | 中科曙光南京研究院有限公司 | Abnormal event classification method and system based on text processing |
| CN112287084A (en) * | 2020-10-30 | 2021-01-29 | 国网江苏省电力有限公司营销服务中心 | Question-answering method and system based on ensemble learning |
| CN112329877A (en) * | 2020-11-16 | 2021-02-05 | 山西三友和智慧信息技术股份有限公司 | Voting mechanism-based web service classification method and system |
| CN112632222B (en) * | 2020-12-25 | 2023-02-03 | 海信视像科技股份有限公司 | Terminal equipment and method for determining data belonging field |
| CN112632222A (en) * | 2020-12-25 | 2021-04-09 | 海信视像科技股份有限公司 | Terminal equipment and method for determining data belonging field |
| CN112784607A (en) * | 2021-02-23 | 2021-05-11 | 中国工商银行股份有限公司 | Client intention identification method and device |
| CN112988954A (en) * | 2021-05-17 | 2021-06-18 | 腾讯科技(深圳)有限公司 | Text classification method and device, electronic equipment and computer-readable storage medium |
| CN114036354A (en) * | 2021-11-08 | 2022-02-11 | 中车大同电力机车有限公司 | Research and development knowledge pushing method and device, electronic equipment and storage medium |
| CN114036354B (en) * | 2021-11-08 | 2025-12-02 | 中车大同电力机车有限公司 | Research and development knowledge delivery methods, devices, electronic equipment and storage media |
| CN117150003A (en) * | 2022-05-19 | 2023-12-01 | 中国移动通信集团浙江有限公司 | Work order analysis method and device |
| CN115575380A (en) * | 2022-10-27 | 2023-01-06 | 江南大学 | Rapid Quality Identification Method of Daqu Based on Raman Spectroscopy |
| CN116204645A (en) * | 2023-03-02 | 2023-06-02 | 北京数美时代科技有限公司 | Intelligent text classification method, system, storage medium and electronic equipment |
| CN116204645B (en) * | 2023-03-02 | 2024-02-20 | 北京数美时代科技有限公司 | Intelligent text classification method, system, storage medium and electronic equipment |
| CN116738343B (en) * | 2023-08-08 | 2023-10-20 | 云筑信息科技(成都)有限公司 | Material data identification method and device for construction industry and electronic equipment |
| CN116738343A (en) * | 2023-08-08 | 2023-09-12 | 云筑信息科技(成都)有限公司 | Material data identification method and device for construction industry and electronic equipment |
| CN118820910A (en) * | 2024-09-19 | 2024-10-22 | 北京芯盾时代科技有限公司 | A heterogeneous network security big data governance method and system |
| CN118820910B (en) * | 2024-09-19 | 2024-11-19 | 北京芯盾时代科技有限公司 | Heterogeneous network security big data management method and system |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN111428028A (en) | Information classification method based on deep learning and related equipment | |
| CN108376151B (en) | Question classification method and device, computer equipment and storage medium | |
| CN106951422B (en) | Webpage training method and device, and search intention identification method and device | |
| CN111539197B (en) | Text matching method and device, computer system and readable storage medium | |
| CN113011533A (en) | Text classification method and device, computer equipment and storage medium | |
| WO2021208719A1 (en) | Voice-based emotion recognition method, apparatus and device, and storage medium | |
| CN109872162B (en) | Wind control classification and identification method and system for processing user complaint information | |
| WO2020244073A1 (en) | Speech-based user classification method and device, computer apparatus, and storage medium | |
| CN111651606B (en) | Text processing method and device and electronic equipment | |
| WO2022141875A1 (en) | User intention recognition method and apparatus, device, and computer-readable storage medium | |
| CN114491079A (en) | Knowledge graph construction and query method, device, equipment and medium | |
| CN116644148A (en) | Keyword recognition method, device, electronic equipment and storage medium | |
| CN118193731A (en) | Method and system for topic identification and clustering screening of scientific and technological texts based on SAO structure | |
| TWI734085B (en) | Dialogue system using intention detection ensemble learning and method thereof | |
| CN113488194B (en) | Medicine identification method and device based on distributed system | |
| CN118585641A (en) | A text summary generation method based on pre-training model | |
| CN113095073B (en) | Corpus tag generation method and device, computer equipment and storage medium | |
| CN118277401A (en) | A natural language query method and device | |
| CN115496066B (en) | Text analysis system, method, electronic device and storage medium | |
| CN115114493A (en) | Method and device for realizing intelligent question answering system based on question matching | |
| CN114791975A (en) | Cross-platform AI model recommendation system and method | |
| CN117235137B (en) | Professional information query method and device based on vector database | |
| CN119293274A (en) | A method and device for intention recognition based on large model | |
| CN117271701A (en) | A method and system for extracting abnormal event relationships in system operation based on TGGAT and CNN | |
| WO2023207566A1 (en) | Voice room quality assessment method, apparatus, and device, medium, and product |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | | |
| SE01 | Entry into force of request for substantive examination | | |