CN111488400A - Data classification method, apparatus and computer readable storage medium - Google Patents

Info

Publication number
CN111488400A
Authority
CN
China
Prior art keywords
processed
data
graph
node
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910348542.0A
Other languages
Chinese (zh)
Other versions
CN111488400B (en)
Inventor
屠明
黄静
何晓冬
周伯文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN201910348542.0A
Publication of CN111488400A
Application granted
Publication of CN111488400B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/211 Schema design and management
    • G06F16/212 Schema design and management with details for data modelling support
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure relates to a data classification method, an apparatus, and a computer-readable storage medium, in the field of computer technology. The method includes: generating a first graph to be processed according to the parts of data contained in the data to be processed and the relationships between the parts of data, the first graph containing nodes corresponding to the parts of data and edges between the nodes determined from the relationships between the parts of data; determining, using a machine learning model, the associated feature vector of each corresponding node in the first graph and the clustering result of each node, according to the extracted feature vectors of the parts of data and the relationships between them; and classifying the data to be processed using a classifier according to the associated feature vectors of the nodes and the clustering results of the nodes. The technical solution of the present disclosure can improve the accuracy of data classification.

Description

Data classification method, apparatus and computer readable storage medium

Technical Field

The present disclosure relates to the field of computer technology, and in particular to a data classification method, a data classification apparatus, and a computer-readable storage medium.

Background

MIL (Multiple Instance Learning) is a weakly supervised method that can be used to process weakly labeled data. In MIL, each data sample is a "bag", and a bag contains multiple "instances". A bag has a classification label, while an instance does not.

The machine learning tasks of many application scenarios can be modeled as MIL problems with these properties. For example, a review on an e-commerce platform may contain multiple sentences, but the review has only one annotation result; an image may contain multiple regions, but the image has only one classification result for the target.

In the related art, classification is performed using the degree of similarity between bags, or a bag is mapped to a single vector and then classified in a single-instance manner.

Summary of the Invention

The inventors of the present disclosure found the following problem in the above related art: the associations among the instances in a bag are not considered, which leads to low accuracy of data classification.

In view of this, the present disclosure proposes a technical solution for data classification that can improve the accuracy of data classification.

According to some embodiments of the present disclosure, a data classification method is provided, including: generating a first graph to be processed according to the parts of data contained in the data to be processed and the relationships between the parts of data, the first graph containing nodes corresponding to the parts of data and edges between the nodes determined from the relationships between the parts of data; determining, using a machine learning model, the associated feature vector of each corresponding node in the first graph and the clustering result of each node, according to the extracted feature vectors of the parts of data and the relationships between them; and classifying the data to be processed using a classifier according to the associated feature vectors of the nodes and the clustering results of the nodes.

In some embodiments, classifying the data to be processed using a classifier according to the feature vectors of the nodes and the clustering results of the nodes includes: generating a second graph to be processed according to the associated feature vectors of the nodes and the clustering results, the number of nodes in the second graph being determined according to the clustering results; and classifying the data to be processed using the classifier according to the second graph.

In some embodiments, generating the second graph to be processed includes: pooling the first graph according to the associated feature vectors of the nodes and the clustering results; and generating the second graph according to the result of the pooling.

In some embodiments, generating the second graph to be processed includes: generating the node feature vector matrix V of the second graph according to the formula V = S^T Z, where S is the probability distribution matrix composed of the probabilities that each node of the first graph belongs to each class in the clustering result, and Z is the associated feature vector matrix composed of the associated feature vectors of the nodes of the first graph; generating the edge matrix A of the second graph according to the formula A = S^T A′ S, where A′ is the edge matrix of the first graph determined from the relationships between the parts of data; and generating the second graph according to V and A.

In some embodiments, the machine learning model includes a first graph neural network model and a second graph neural network model; determining the associated feature vector of each node in the first graph and the clustering results of the nodes using the machine learning model includes: determining the associated feature vector of each node in the first graph using the first graph neural network model; and determining the clustering result of each node using the second graph neural network model.

In some embodiments, the machine learning model further includes a third graph neural network model; classifying the data to be processed using the classifier according to the second graph includes: determining the feature vector of each node in the second graph using the third graph neural network model; and classifying the data to be processed using the classifier according to the feature vectors of the nodes of the second graph.

In some embodiments, classifying the data to be processed using the classifier according to the feature vectors of the nodes of the second graph includes: performing max pooling or concatenation on the feature vectors of the nodes of the second graph to determine a vector to be processed; and classifying the data to be processed using the classifier according to the vector to be processed.

In some embodiments, the relationships between the parts of data are determined according to the degree of association between the feature vectors of the parts of data.

In some embodiments, the machine learning model and the classifier are trained using a cross-entropy loss function determined from a first classification result, a second classification result, and a third classification result, where the first classification result is the classification result determined by the classifier from the associated feature vectors of the nodes of the first graph, the second classification result is the classification result determined by the classifier from the feature vectors of the nodes of the second graph, and the third classification result is the classification result determined by the classifier from the vector to be processed.
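A minimal numeric sketch of such a training objective follows. The equal-weight sum of the three cross-entropy terms and the list-based probability format are assumptions made here for illustration; the disclosure only states that the loss is determined from the three classification results.

```python
import math

def cross_entropy(y_true, probs):
    """Cross-entropy of one sample: -log of the probability of the true class."""
    return -math.log(probs[y_true] + 1e-12)

def combined_loss(y_true, p_first, p_second, p_pooled):
    """Hypothetical equal-weight combination of the three classification results:
    from the first graph's node features, from the second graph's node features,
    and from the pooled vector to be processed."""
    return (cross_entropy(y_true, p_first)
            + cross_entropy(y_true, p_second)
            + cross_entropy(y_true, p_pooled))
```

In practice the three terms could also be weighted; the weighting is not specified in this section.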

According to other embodiments of the present disclosure, a data classification apparatus is provided, including: a generating unit configured to generate a first graph to be processed according to the parts of data contained in the data to be processed and the relationships between the parts of data, the first graph containing nodes corresponding to the parts of data and edges between the nodes determined from the relationships between the parts of data; a determining unit configured to determine, using a machine learning model, the associated feature vector of each corresponding node in the first graph and the clustering result of each node, according to the extracted feature vectors of the parts of data and the relationships between them; and a classification unit configured to classify the data to be processed using a classifier according to the associated feature vectors of the nodes and the clustering results of the nodes.

According to still other embodiments of the present disclosure, a data classification apparatus is provided, including: a memory; and a processor coupled to the memory, the processor being configured to execute the data classification method in any one of the above embodiments based on instructions stored in the memory.

According to still further embodiments of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the program implements the data classification method in any one of the above embodiments.

In the above embodiments, the data to be processed is modeled as a graph, with the parts of the data as nodes and the relationships between the parts as edges. Feature extraction and clustering are then performed according to the associations between the nodes of the graph, and classification is performed based on the results. In this way, the associations between the parts of the data are fully exploited for classification, which improves the accuracy of data classification.

Brief Description of the Drawings

The accompanying drawings, which form a part of the specification, illustrate embodiments of the present disclosure and, together with the description, serve to explain the principles of the present disclosure.

The present disclosure can be understood more clearly from the following detailed description with reference to the accompanying drawings, in which:

FIG. 1 shows a flowchart of some embodiments of the data classification method of the present disclosure;

FIG. 2 shows a flowchart of some embodiments of step S30 of FIG. 1;

FIG. 3 shows a flowchart of some embodiments of step S320 of FIG. 2;

FIG. 4 shows a schematic diagram of some embodiments of the data classification method of the present disclosure;

FIG. 5 shows a schematic diagram of other embodiments of the data classification method of the present disclosure;

FIG. 6 shows a block diagram of some embodiments of the data classification apparatus of the present disclosure;

FIG. 7 shows a block diagram of other embodiments of the data classification apparatus of the present disclosure;

FIG. 8 shows a block diagram of further embodiments of the data classification apparatus of the present disclosure.

Detailed Description

Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that, unless specifically stated otherwise, the relative arrangement of the components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present disclosure.

Meanwhile, it should be understood that, for convenience of description, the dimensions of the parts shown in the drawings are not drawn to scale.

The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit the present disclosure or its application or uses.

Techniques, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but, where appropriate, such techniques, methods, and devices should be considered part of the specification.

In all examples shown and discussed herein, any specific value should be construed as merely illustrative and not limiting. Accordingly, other examples of the exemplary embodiments may have different values.

It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it need not be discussed further in subsequent drawings.

FIG. 1 shows a flowchart of some embodiments of the data classification method of the present disclosure.

As shown in FIG. 1, the method includes: step S10, generating a first graph to be processed; step S20, determining the associated feature vectors and the clustering results; and step S30, classifying the data to be processed.

In step S10, a first graph to be processed is generated according to the parts of data contained in the data to be processed and the relationships between the parts. The first graph contains nodes corresponding to the parts of data and edges between the nodes determined from the relationships between the parts. For example, the data to be processed is a bag in the MIL model, and the parts of data are the instances contained in the bag.

In some embodiments, the data to be processed may be an image, with the regions contained in the image as the parts of data; or the data to be processed may be a text, with the sentences contained in the text as the parts of data.

In some embodiments, the feature vector of each part of the data may be extracted (for example, using image processing methods, word vector methods, etc.), and the relationships between the parts of data may then be determined according to the degree of association between their feature vectors.

For example, the data to be processed is a bag containing K instances, and the set of feature vectors of the K instances is X = [x1, x2, ..., xk, ..., xK], where xk is the D-dimensional feature vector of the k-th instance, k is a positive integer not greater than K, and D is a positive integer. The distances between the feature vectors can be computed, the relationship parameters between the instances can be determined according to those distances, and the K×K edge matrix A′ of the first graph can then be generated according to the following formula:

$$a_{mn} = \begin{cases} 1, & \mathrm{dist}(x_m, x_n) < \eta \\ 0, & \text{otherwise} \end{cases}$$

where a_mn is the element in row m, column n of A′, dist(x_m, x_n) is the distance between feature vectors x_m and x_n, and m and n are positive integers not greater than K. η is a distance threshold used to determine whether there is an edge between two nodes of the first graph (there is an edge if a_mn is 1, and no edge otherwise). When η = 0, the first graph has no edges; when η = +∞, the first graph is a complete graph.

For example, the first graph G1 = (A′, V′) can be established according to the edge matrix A′, where V′ = X is the feature vector matrix of the nodes of G1.
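The graph construction described above can be sketched as follows. The Euclidean distance and the sample vectors are illustrative assumptions; the disclosure leaves the choice of dist() open.

```python
import math

def build_edge_matrix(X, eta):
    """Build the K x K edge matrix A' of the first graph to be processed:
    a_mn = 1 if dist(x_m, x_n) < eta, else 0 (Euclidean distance assumed)."""
    K = len(X)
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return [[1 if dist(X[m], X[n]) < eta else 0 for n in range(K)]
            for m in range(K)]

# The first graph G1 = (A', V') with V' = X:
X = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]   # three instance feature vectors
A_prime = build_edge_matrix(X, eta=1.0)
# nodes 0 and 1 are close, node 2 is far from both
# (diagonal entries are 1 because dist(x, x) = 0 < eta)
```

With eta = 0 the matrix is all zeros (no edges); with a very large eta every entry is 1 (a complete graph), matching the two limiting cases in the text.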

In step S20, the associated feature vector of each corresponding node in the first graph and the clustering result of each node are determined using a machine learning model, according to the extracted feature vectors of the parts of data and the relationships between them. For example, clustering may be performed multiple times to obtain the desired clustering results.

In some embodiments, the machine learning model includes a first graph neural network model and a second graph neural network model. For example, the first graph neural network model is used to determine the associated feature vector of each node in the first graph, and the second graph neural network model is used to determine the clustering result of each node.

In some embodiments, A′ and V′ may be input into the first graph neural network model for graph embedding, to determine the D′-dimensional associated feature vector of each node in the first graph; these vectors form the associated feature vector matrix Z, where D′ is a positive integer.

It can be seen that the feature vectors in V′ are the feature vectors of the instances extracted independently, whereas the feature vectors in Z are determined according to the associations among the instances (the relationships between their feature vectors). That is, the feature vectors in Z encode the associations among the instances, so classifying the data to be processed using Z can improve classification accuracy.

In some embodiments, A′ and V′ may be input into the second graph neural network model for clustering, to determine the clustering result of each node in the first graph. For example, the second graph neural network model divides the nodes of the first graph into C classes, where C is a positive integer greater than 1 set according to the actual situation. For example, the nodes may be clustered into two classes: positive samples and negative samples for the classification.

The softmax function can be used to process the output of the second graph neural network model to obtain the probability that each node of the first graph belongs to each class, thereby obtaining the probability distribution matrix S.

It can be seen that S contains the association relationships (clustering information) of the nodes, so classifying the data to be processed using S can improve classification accuracy.

In step S30, the data to be processed is classified using a classifier according to the associated feature vectors of the nodes and the clustering results of the nodes. For example, the classifier may be configured as an MLP (Multi-Layer Perceptron) model.

In some embodiments, step S30 may be implemented by the embodiment in FIG. 2.

FIG. 2 shows a flowchart of some embodiments of step S30 of FIG. 1.

As shown in FIG. 2, step S30 includes: step S310, generating a second graph to be processed; and step S320, classifying the data to be processed.

In step S310, a second graph to be processed is generated according to the associated feature vectors of the nodes and the clustering results; the number of nodes in the second graph is determined according to the clustering results.

In some embodiments, the first graph is pooled according to the associated feature vectors of the nodes and the clustering results, and the second graph is then generated from the result of the pooling. For isomorphic graphs, the pooling guarantees the same result (the representation of the graph) regardless of how the nodes of the graph are ordered.

For example, the node feature vector matrix V of the second graph is generated according to the formula V = S^T Z, where the feature vectors in V are D′-dimensional; the edge matrix A of the second graph is generated according to the formula A = S^T A′ S; and the second graph G2 = (A, V) is then generated.

It can be seen that G2 is generated according to the association information among the instances (the relationships between the feature vectors and the clustering results), and the number of nodes in G2 is smaller than the number of nodes in G1. In this way, a "large graph" whose node feature vectors contain no association information is converted into a "small graph" whose node feature vectors contain association information, which improves the accuracy of data classification.
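The pooling step V = S^T Z, A = S^T A′ S can be sketched with small made-up matrices. The hard 0/1 assignments in S are an illustrative simplification of the softmax output, which in general is a soft probability distribution.

```python
import numpy as np

def pool_graph(S: np.ndarray, Z: np.ndarray, A_prime: np.ndarray):
    """Pool the first graph (K nodes) into the second graph (C nodes):
    node features V = S^T Z, edge matrix A = S^T A' S."""
    V = S.T @ Z            # (C, D') node feature matrix of the second graph
    A = S.T @ A_prime @ S  # (C, C) edge matrix of the second graph
    return V, A

# K = 3 nodes, C = 2 clusters, D' = 2 feature dimensions
S = np.array([[1.0, 0.0],   # node 0 assigned to cluster 0
              [1.0, 0.0],   # node 1 assigned to cluster 0
              [0.0, 1.0]])  # node 2 assigned to cluster 1
Z = np.array([[1.0, 0.0],
              [3.0, 0.0],
              [0.0, 2.0]])
A_prime = np.array([[0, 1, 0],
                    [1, 0, 1],
                    [0, 1, 0]], dtype=float)  # path graph 0 - 1 - 2
V, A = pool_graph(S, Z, A_prime)
# V sums the features of each cluster's members; A counts the edges
# within and between clusters
```

This coarsening is the same algebra used by cluster-based graph pooling methods such as DiffPool, with S playing the role of the learned assignment matrix.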

In step S320, the data to be processed is classified using the classifier according to the second graph.

In some embodiments, the machine learning model further includes a third graph neural network model, which can be used to determine the feature vector of each node in the second graph, i.e., to perform graph embedding on the nodes of the second graph.

For example, in some embodiments, if C is set to 1, G2 has only one node, and the graph embedding can be regarded as a learned graph representation; if C is set to 2, the graph embedding of the nodes of G2 can be completed by max pooling or concatenation.

In some embodiments, a GNN (Graph Neural Network) has two basic operations: aggregation and combination. Aggregation collects information from the neighbor nodes of the current node, and combination fuses the collected information to produce the representation of the current node. One aggregation followed by one combination is called a hop. The aggregation and combination of the l-th hop of a GNN can be expressed by the following formulas, respectively:

$$a_v^{(l)} = f_{agg}\left(\left\{ h_u^{(l-1)} : u \in N(v) \right\}\right)$$

$$h_v^{(l)} = f_{com}\left( h_v^{(l-1)},\ a_v^{(l)} \right)$$

where f_agg() and f_com() are the aggregation function and the combination function, respectively, and N(v) denotes the neighbor nodes of node v in the graph. a_v^(l) and h_v^(l) are the aggregation result and the combination result of the l-th hop, respectively; h_v^(l) is the representation of node v (e.g., its feature vector) learned after l hops.

For example, according to the above description, the aggregation and combination of any of the GNN models in the above embodiments can be combined into the following formula:

$$Vec_v = act\left(\mathrm{MEAN}\left(\left\{ Vec_u : u \in N(v) \cup \{v\} \right\}\right)\right)$$

The representation Vec_v of node v can be obtained through the above formula, where Vec_u is the representation of a neighbor node u, act() is an activation function (such as LeakyReLU), and MEAN() is the averaging function.
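One hop of such a mean-aggregation update can be sketched as follows. Including the node's own representation in the mean and using the identity activation are assumptions made for illustration; the disclosure only specifies MEAN() and some activation act().

```python
def gnn_hop(vecs, neighbors, act=lambda x: x):
    """One hop: each node's new representation is act(MEAN of its own and its
    neighbors' current representations). Self-inclusion is an assumption."""
    new_vecs = []
    for v, vec in enumerate(vecs):
        group = [vecs[u] for u in neighbors[v]] + [vec]
        mean = [sum(col) / len(group) for col in zip(*group)]
        new_vecs.append([act(x) for x in mean])
    return new_vecs

# 3-node path graph 0 - 1 - 2, with 1-D features
vecs = [[0.0], [3.0], [6.0]]
neighbors = {0: [1], 1: [0, 2], 2: [1]}
print(gnn_hop(vecs, neighbors))  # [[1.5], [3.0], [4.5]]
```

Stacking several such hops lets each node's representation absorb information from nodes several edges away, which is how the models above propagate instance associations through the graph.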

In some embodiments, step S320 may be implemented by the embodiment in FIG. 3.

FIG. 3 shows a flowchart of some embodiments of step S320 of FIG. 2.

As shown in FIG. 3, step S320 includes: step S3210, determining a vector to be processed; and step S3220, classifying the data to be processed.

In step S3210, max pooling or concatenation is performed on the feature vectors of the nodes in the second graph to be processed to determine the vector to be processed. For example, based on the number of classes C in the clustering result, either max pooling or concatenation can be chosen so as to obtain a graph embedding result of fixed dimension (the vector to be processed).

In step S3220, the data to be processed is classified by a classifier according to the vector to be processed.
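Assuming the node features of the second graph are stacked in a (C, d) matrix, the two readout options of step S3210 can be sketched as follows; the function name `readout` and the toy sizes are hypothetical:

```python
import numpy as np

def readout(node_feats, mode="max"):
    """Collapse the (C, d) node features of the second graph into one vector.

    mode="max":    element-wise max pooling -> dimension d (independent of C).
    mode="concat": concatenation -> dimension C * d (fixed once C is fixed).
    """
    if mode == "max":
        return node_feats.max(axis=0)
    return node_feats.reshape(-1)

feats = np.array([[0.2, 0.9], [0.7, 0.1]])  # C = 2 cluster nodes, d = 2
pooled = readout(feats, "max")
concat = readout(feats, "concat")
```

Either result is a fixed-dimension vector that can then be fed to the classifier of step S3220.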

FIG. 4 shows a schematic diagram of some embodiments of the data classification method of the present disclosure.

As shown in FIG. 4, a first graph to be processed 41 can be generated from an MIL (multiple-instance learning) model serving as the data to be processed (e.g., an image with multiple regions, a text with multiple sentences, etc.). The graph 41 contains multiple nodes corresponding to the multiple instances of the MIL model (such as regions, sentences, etc.). Each node of the graph 41 has a feature vector, namely the feature vector of the corresponding instance in the MIL model (which can be extracted, for example, by image processing or natural language processing methods). The edges in the graph 41 represent the relationships between the nodes (which can be determined, for example, from the feature vectors of the nodes).
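One plausible way to realize the edge construction described above is to connect instances whose feature vectors are sufficiently similar. The cosine-similarity measure and the 0.5 threshold below are illustrative assumptions, since the disclosure does not fix a particular similarity function:

```python
import numpy as np

def build_instance_graph(features, threshold=0.5):
    """Build an adjacency matrix over MIL instances from feature similarity.

    features: (n, d) matrix, one feature vector per instance
              (region, sentence, ...).
    An edge connects two instances whose cosine similarity exceeds `threshold`.
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T                  # pairwise cosine similarity
    adj = (sim > threshold).astype(float)
    np.fill_diagonal(adj, 0.0)           # no self-loops
    return adj

feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
adj = build_instance_graph(feats)
```

In this toy example only the first two instances are similar enough to be joined by an edge.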

Graph embedding may be performed on the nodes in the graph 41 by the first graph neural network model in any of the above embodiments, so as to obtain the associated feature vectors of the nodes in the graph 41.

The nodes in the graph 41 can be clustered by the second graph neural network model in any of the above embodiments.

In some embodiments, the second graph neural network model can be trained without supervision to distinguish positive samples (positively labeled instances) from negative samples (negatively labeled instances) in the MIL model. For example, the clustering process automatically divides the nodes in the graph 41 into a class 421 and a class 422, where the nodes in class 421 are positive samples and the nodes in class 422 are negative samples.

A second graph to be processed 42 can be generated according to the associated feature vectors and the clustering result. For example, the graph 42 has two nodes, corresponding to class 421 and class 422, respectively.

Graph embedding and pooling are performed on the graph 42 to obtain a vector to be processed 44. The vector 44 can be viewed as a graph with only one node.

The above embodiment converts the MIL model into a "big graph", converts the "big graph" into a "small graph", further converts the "small graph" into a fixed-dimensional "vector" with only one node (the vector to be processed), and processes the "vector" to obtain the classification result. In this way, the classification process not only fully considers the relationships between instances but also performs dimensionality reduction, thereby improving classification accuracy and efficiency.

FIG. 5 shows a schematic diagram of other embodiments of the data classification method of the present disclosure.

As shown in FIG. 5, the machine learning model includes GNN 51, GNN 52, GNN 53 and MLP 54.

GNN 51 is used to perform graph embedding on the nodes in the first graph to be processed to obtain the associated feature vector of each node. GNN 52 is used to perform graph clustering on the nodes in the first graph to be processed to obtain the clustering result of each node. The second graph to be processed can then be obtained from the outputs of GNN 51 and GNN 52.

In some embodiments, in the training phase, a first classification result can be determined by a classifier according to the associated feature vectors of the nodes in the first graph to be processed (the output of GNN 51).

In some embodiments, in the training phase, a second classification result can be determined by the classifier according to the feature vectors of the nodes in the second graph to be processed.

GNN 53 is used to perform graph embedding on the nodes in the second graph to be processed to obtain the feature vector of each node. The second graph to be processed can be further processed to obtain the vector to be processed. MLP 54 is used to classify the vector to be processed to obtain a third classification result.

In some embodiments, in the training phase, a cross-entropy loss function can be determined according to the first, second and third classification results, and the machine learning model is trained using this cross-entropy loss function. In this way, processing results from different depths of the machine learning model can be used for training, improving the efficiency and accuracy of training.
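A minimal sketch of such a multi-head training loss follows, assuming each classification result is a vector of class probabilities and the three cross-entropy terms are simply summed; the disclosure does not specify a particular weighting between the terms:

```python
import numpy as np

def cross_entropy(probs, label):
    # probs: predicted class probabilities; label: index of the true class.
    # The small epsilon guards against log(0).
    return -np.log(probs[label] + 1e-12)

def combined_loss(result1, result2, result3, label):
    """Sum the cross-entropy of the three classification results
    (first-graph head, second-graph head, final MLP head)."""
    return sum(cross_entropy(r, label) for r in (result1, result2, result3))

# Toy binary-classification outputs of the three heads, true class = 0.
r1, r2, r3 = np.array([0.7, 0.3]), np.array([0.6, 0.4]), np.array([0.9, 0.1])
loss = combined_loss(r1, r2, r3, label=0)
```

Minimizing this sum trains the shallow heads and the deep head jointly, which is the "different depths" supervision described above.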

In the above embodiments, the data to be processed is modeled as a graph, with each part of the data as a node and the relationships between the parts as edges. Feature extraction and clustering are then performed according to the associations between the nodes in the graph, and classification is performed based on the processing results. In this way, the associations between the parts of the data are fully exploited for classification, improving the accuracy of data classification.

FIG. 6 shows a block diagram of some embodiments of the data classification apparatus of the present disclosure.

As shown in FIG. 6, the data classification apparatus 6 includes a generation unit 61, a determination unit 62 and a classification unit 63.

The generation unit 61 generates a first graph to be processed according to the parts of data contained in the data to be processed and the relationships between those parts. The first graph to be processed contains nodes corresponding to the parts of data and edges between the nodes determined according to the relationships between the parts. For example, the relationship between parts of data is determined according to the degree of association between their feature vectors.

In some embodiments, the generation unit 61 generates a second graph to be processed according to the associated feature vector of each node and the clustering result. The generation unit 61 may pool the first graph to be processed according to the associated feature vectors and the clustering result, and generate the second graph to be processed from the pooling result.

For example, the generation unit 61 generates the node feature vector matrix V of the second graph to be processed according to the formula $V = S^T Z$, where S is the probability distribution matrix composed of the probabilities that each node in the first graph to be processed belongs to each class in the clustering result, and Z is the associated feature vector matrix composed of the associated feature vectors of the nodes in the first graph to be processed. The generation unit 61 generates the edge matrix A of the second graph to be processed according to the formula $A = S^T A' S$, where A′ is the edge matrix of the first graph to be processed determined according to the relationships between the parts of data. The generation unit 61 then generates the second graph to be processed from V and A.
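The two pooling formulas can be checked directly in NumPy; the sizes used below (4 nodes, 2 clusters, 3 feature dimensions) are arbitrary illustrations:

```python
import numpy as np

def pool_graph(S, Z, A_prime):
    """Coarsen the first graph into the second graph.

    S:       (n, C) cluster-assignment probabilities (rows sum to 1).
    Z:       (n, d) associated feature vectors of the first graph's nodes.
    A_prime: (n, n) edge matrix of the first graph.
    Returns (V, A): node feature matrix and edge matrix of the second graph.
    """
    V = S.T @ Z             # V = S^T Z    -> (C, d)
    A = S.T @ A_prime @ S   # A = S^T A' S -> (C, C)
    return V, A

S = np.array([[1.0, 0.0], [0.8, 0.2], [0.1, 0.9], [0.0, 1.0]])
Z = np.random.rand(4, 3)
A_prime = np.array([[0, 1, 0, 0],
                    [1, 0, 1, 0],
                    [0, 1, 0, 1],
                    [0, 0, 1, 0]], float)
V, A = pool_graph(S, Z, A_prime)
```

The result is a C-node graph whose node features are probability-weighted sums of the original node features, and whose edge matrix stays symmetric when A′ is symmetric.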

The determination unit 62 determines, by using a machine learning model, the associated feature vectors of the corresponding nodes in the first graph to be processed and the clustering results of the nodes according to the extracted feature vectors of the parts of data and the relationships between the parts of data.

In some embodiments, the machine learning model includes a first graph neural network model and a second graph neural network model. The determination unit 62 determines the associated feature vector of each node in the first graph to be processed by using the first graph neural network model, and determines the clustering result of each node by using the second graph neural network model.

In some embodiments, the machine learning model further includes a third graph neural network model. The determination unit 62 determines the feature vector of each node in the second graph to be processed by using the third graph neural network model.

In some embodiments, the determination unit 62 performs max pooling or concatenation on the feature vectors of the nodes in the second graph to be processed to determine the vector to be processed.

In some embodiments, the machine learning model and the classifier are trained by using a cross-entropy loss function. For example, the cross-entropy loss function is determined according to a first classification result, a second classification result and a third classification result. The first classification result is the classification result determined by the classifier according to the associated feature vectors of the nodes in the first graph to be processed. The second classification result is the classification result determined by the classifier according to the feature vectors of the nodes in the second graph to be processed. The third classification result is the classification result determined by the classifier according to the vector to be processed.

The classification unit 63 classifies the data to be processed by using a classifier according to the associated feature vector of each node and the clustering result of each node.

In some embodiments, the number of nodes of the second graph to be processed is determined according to the clustering result. The classification unit 63 classifies the data to be processed by using the classifier according to the second graph to be processed.

In some embodiments, the classification unit 63 classifies the data to be processed by using the classifier according to the feature vectors of the nodes in the second graph to be processed.

In some embodiments, the classification unit 63 classifies the data to be processed by using the classifier according to the vector to be processed.

In the above embodiments, the data to be processed is modeled as a graph, with each part of the data as a node and the relationships between the parts as edges. Feature extraction and clustering are then performed according to the associations between the nodes in the graph, and classification is performed based on the processing results. In this way, the associations between the parts of the data are fully exploited for classification, improving the accuracy of data classification.

FIG. 7 shows a block diagram of other embodiments of the data classification apparatus of the present disclosure.

As shown in FIG. 7, the data classification apparatus 7 of this embodiment includes a memory 71 and a processor 72 coupled to the memory 71. The processor 72 is configured to perform, based on instructions stored in the memory 71, the data classification method in any one of the embodiments of the present disclosure.

The memory 71 may include, for example, a system memory, a fixed non-volatile storage medium, and the like. The system memory stores, for example, an operating system, application programs, a boot loader, a database and other programs.

FIG. 8 shows a block diagram of further embodiments of the data classification apparatus of the present disclosure.

As shown in FIG. 8, the data classification apparatus 8 of this embodiment includes a memory 810 and a processor 820 coupled to the memory 810. The processor 820 is configured to perform, based on instructions stored in the memory 810, the data classification method in any one of the foregoing embodiments.

The memory 810 may include, for example, a system memory, a fixed non-volatile storage medium, and the like. The system memory stores, for example, an operating system, application programs, a boot loader and other programs.

The data classification apparatus 8 may further include an input/output interface 830, a network interface 840, a storage interface 850, and the like. These interfaces 830, 840, 850, as well as the memory 810 and the processor 820, may be connected, for example, through a bus 860. The input/output interface 830 provides a connection interface for input/output devices such as a display, a mouse, a keyboard and a touch screen. The network interface 840 provides a connection interface for various networked devices. The storage interface 850 provides a connection interface for external storage devices such as SD cards and USB flash drives.

As will be appreciated by those skilled in the art, the embodiments of the present disclosure may be provided as a method, a system, or a computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

The data classification method, the data classification apparatus and the computer-readable storage medium according to the present disclosure have thus been described in detail. Some details well known in the art have not been described in order to avoid obscuring the concept of the present disclosure. Based on the above description, those skilled in the art can fully understand how to implement the technical solutions disclosed herein.

The method and system of the present disclosure may be implemented in many ways. For example, the method and system of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware and firmware. The above order of the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless otherwise specified. Furthermore, in some embodiments, the present disclosure may also be implemented as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the method according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.

Although some specific embodiments of the present disclosure have been described in detail by way of examples, those skilled in the art should understand that the above examples are for illustration only and are not intended to limit the scope of the present disclosure. Those skilled in the art should understand that the above embodiments can be modified without departing from the scope and spirit of the present disclosure. The scope of the present disclosure is defined by the appended claims.

Claims (12)

1. A data classification method, comprising:
generating a first graph to be processed according to parts of data contained in data to be processed and relationships between the parts of data, the first graph to be processed containing nodes corresponding to the parts of data and edges between the nodes determined according to the relationships between the parts of data;
determining, by using a machine learning model, associated feature vectors of the corresponding nodes in the first graph to be processed and clustering results of the nodes according to extracted feature vectors of the parts of data and the relationships between the parts of data; and
classifying the data to be processed by using a classifier according to the associated feature vectors of the nodes and the clustering results of the nodes.

2. The data classification method according to claim 1, wherein classifying the data to be processed by using the classifier according to the feature vectors of the nodes and the clustering results of the nodes comprises:
generating a second graph to be processed according to the associated feature vectors of the nodes and the clustering results, a number of nodes of the second graph to be processed being determined according to the clustering results; and
classifying the data to be processed by using the classifier according to the second graph to be processed.

3. The data classification method according to claim 2, wherein generating the second graph to be processed comprises:
pooling the first graph to be processed according to the associated feature vectors of the nodes and the clustering results; and
generating the second graph to be processed according to a result of the pooling.

4. The data classification method according to claim 3, wherein generating the second graph to be processed comprises:
generating a node feature vector matrix V of the second graph to be processed according to the formula V = S^T Z, where S is a probability distribution matrix composed of probabilities that each node in the first graph to be processed belongs to each class in the clustering results, and Z is an associated feature vector matrix composed of the associated feature vectors of the nodes in the first graph to be processed;
generating an edge matrix A of the second graph to be processed according to the formula A = S^T A′S, where A′ is an edge matrix of the first graph to be processed determined according to the relationships between the parts of data; and
generating the second graph to be processed according to V and A.

5. The data classification method according to claim 2, wherein:
the machine learning model comprises a first graph neural network model and a second graph neural network model; and
determining the associated feature vectors of the nodes in the first graph to be processed and the clustering results of the nodes by using the machine learning model comprises:
determining the associated feature vectors of the nodes in the first graph to be processed by using the first graph neural network model; and
determining the clustering results of the nodes by using the second graph neural network model.

6. The data classification method according to claim 2, wherein:
the machine learning model further comprises a third graph neural network model; and
classifying the data to be processed by using the classifier according to the second graph to be processed comprises:
determining feature vectors of the nodes in the second graph to be processed by using the third graph neural network model; and
classifying the data to be processed by using the classifier according to the feature vectors of the nodes in the second graph to be processed.

7. The data classification method according to claim 6, wherein classifying the data to be processed by using the classifier according to the feature vectors of the nodes in the second graph to be processed comprises:
performing max pooling or concatenation on the feature vectors of the nodes in the second graph to be processed to determine a vector to be processed; and
classifying the data to be processed by using the classifier according to the vector to be processed.

8. The data classification method according to any one of claims 1-7, wherein the relationships between the parts of data are determined according to degrees of association between the feature vectors of the parts of data.

9. The data classification method according to claim 7, wherein the machine learning model and the classifier are trained by using a cross-entropy loss function, the cross-entropy loss function being determined according to a first classification result, a second classification result and a third classification result, the first classification result being a classification result determined by the classifier according to the associated feature vectors of the nodes in the first graph to be processed, the second classification result being a classification result determined by the classifier according to the feature vectors of the nodes in the second graph to be processed, and the third classification result being a classification result determined by the classifier according to the vector to be processed.

10. A data classification apparatus, comprising:
a generation unit configured to generate a first graph to be processed according to parts of data contained in data to be processed and relationships between the parts of data, the first graph to be processed containing nodes corresponding to the parts of data and edges between the nodes determined according to the relationships between the parts of data;
a determination unit configured to determine, by using a machine learning model, associated feature vectors of the corresponding nodes in the first graph to be processed and clustering results of the nodes according to extracted feature vectors of the parts of data and the relationships between the parts of data; and
a classification unit configured to classify the data to be processed by using a classifier according to the associated feature vectors of the nodes and the clustering results of the nodes.

11. A data classification apparatus, comprising:
a memory; and
a processor coupled to the memory, the processor being configured to perform the data classification method according to any one of claims 1-9 based on instructions stored in the memory.

12. A computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the data classification method according to any one of claims 1-9.
CN201910348542.0A 2019-04-28 2019-04-28 Data classification method, apparatus and computer readable storage medium Active CN111488400B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910348542.0A CN111488400B (en) 2019-04-28 2019-04-28 Data classification method, apparatus and computer readable storage medium


Publications (2)

Publication Number Publication Date
CN111488400A true CN111488400A (en) 2020-08-04
CN111488400B CN111488400B (en) 2021-03-30

Family

ID=71796780

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910348542.0A Active CN111488400B (en) 2019-04-28 2019-04-28 Data classification method, apparatus and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111488400B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112651450A (en) * 2020-12-30 2021-04-13 哈尔滨工程大学 Medical image classification method based on multi-example deep learning
CN112766346A (en) * 2021-01-12 2021-05-07 合肥黎曼信息科技有限公司 Multi-example learning method based on graph convolution network

Citations (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102750541A (en) * 2011-04-22 2012-10-24 北京文通科技有限公司 Document image classifying distinguishing method and device
CN102799899A (en) * 2012-06-29 2012-11-28 北京理工大学 Special audio event layered and generalized identification method based on SVM (Support Vector Machine) and GMM (Gaussian Mixture Model)
Patent Citations (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102750541A (en) * 2011-04-22 2012-10-24 北京文通科技有限公司 Document image classification and recognition method and device
CN102799899A (en) * 2012-06-29 2012-11-28 北京理工大学 Layered and generalized identification method for special audio events based on SVM (Support Vector Machine) and GMM (Gaussian Mixture Model)
CN103345645A (en) * 2013-06-27 2013-10-09 复旦大学 Commodity image category forecasting method based on online shopping platform
US20160048727A1 (en) * 2014-08-15 2016-02-18 Konica Minolta Laboratory U.S.A., Inc. Method and system for recognizing an object
CN104244035A (en) * 2014-08-27 2014-12-24 南京邮电大学 Network video flow classification method based on multilayer clustering
CN104217225A (en) * 2014-09-02 2014-12-17 中国科学院自动化研究所 A visual target detection and labeling method
CN105468638A (en) * 2014-09-09 2016-04-06 中国银联股份有限公司 Data classification method, system and classifier implementation method
CN104463191A (en) * 2014-10-30 2015-03-25 华南理工大学 Robot visual processing method based on attention mechanism
CN105046269A (en) * 2015-06-19 2015-11-11 鲁东大学 A multi-instance multi-label scene classification method based on multi-core fusion
US20170193097A1 (en) * 2016-01-03 2017-07-06 Gracenote, Inc. Model-based media classification service using sensed media noise characteristics
US20180107645A1 (en) * 2016-10-13 2018-04-19 SkywriterRX, Inc. Book analysis and recommendation
CN107153677A (en) * 2017-04-18 2017-09-12 北京思特奇信息技术股份有限公司 A data processing method and system for finding high-value users
CN107391545A (en) * 2017-05-25 2017-11-24 阿里巴巴集团控股有限公司 A method for classifying users, input method and device
US20190005384A1 (en) * 2017-06-29 2019-01-03 General Electric Company Topology aware graph neural nets
CN108280058A (en) * 2018-01-02 2018-07-13 中国科学院自动化研究所 Relation extraction method and apparatus based on intensified learning
CN108399431A (en) * 2018-02-28 2018-08-14 国信优易数据有限公司 Classification model training method and classification method
CN108962238A (en) * 2018-04-25 2018-12-07 苏州思必驰信息科技有限公司 Dialogue method, system, device and storage medium based on structured neural networks
CN108875827A (en) * 2018-06-15 2018-11-23 广州深域信息科技有限公司 A method and system for fine-grained image classification
CN109325412A (en) * 2018-08-17 2019-02-12 平安科技(深圳)有限公司 Pedestrian recognition method, device, computer equipment and storage medium
CN109241903A (en) * 2018-08-30 2019-01-18 平安科技(深圳)有限公司 Sample data cleaning method, device, computer equipment and storage medium
CN109389162A (en) * 2018-09-28 2019-02-26 北京达佳互联信息技术有限公司 Sample image screening method and device, electronic device and storage medium
CN109614979A (en) * 2018-10-11 2019-04-12 北京大学 A data augmentation method and image classification method based on selection and generation
CN109614975A (en) * 2018-10-26 2019-04-12 桂林电子科技大学 Graph embedding method, device and storage medium
CN109300550A (en) * 2018-11-09 2019-02-01 天津新开心生活科技有限公司 Medical data relationship mining method and device
CN109582864A (en) * 2018-11-19 2019-04-05 华南师范大学 Course recommended method and system based on big data science and changeable weight adjustment
CN109299248A (en) * 2018-12-12 2019-02-01 成都航天科工大数据研究院有限公司 A business intelligence collection method based on natural language processing

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Gang et al.: "A New Scene Classification Method Based on Multi-Instance Learning", Journal of Shandong University *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112651450A (en) * 2020-12-30 2021-04-13 哈尔滨工程大学 Medical image classification method based on multi-example deep learning
CN112651450B (en) * 2020-12-30 2022-10-25 哈尔滨工程大学 A medical image classification method based on multi-instance deep learning
CN112766346A (en) * 2021-01-12 2021-05-07 合肥黎曼信息科技有限公司 Multi-example learning method based on graph convolution network

Also Published As

Publication number Publication date
CN111488400B (en) 2021-03-30

Similar Documents

Publication Publication Date Title
US9053358B2 (en) Learning device for generating a classifier for detection of a target
CN111460806B (en) Intent recognition method, device, equipment and storage medium based on loss function
US10963685B2 (en) Generating variations of a known shred
CN112001488A Training generative adversarial networks
CN102667859B Pattern recognition device and method for general objects using exclusive classifiers
CN112488241B (en) Zero sample picture identification method based on multi-granularity fusion network
US20170076152A1 (en) Determining a text string based on visual features of a shred
CN111444372A (en) System and method for image processing
US20110235926A1 (en) Information processing apparatus, method and program
JP2004054956A (en) Face detection method and system using pattern classifier learned from face / similar face image
CN113987188B A short text classification method, device and electronic device
CN110879938A (en) Text sentiment classification method, device, equipment and storage medium
CN114287005B (en) Negative sampling algorithm for enhanced image classification
JP6004015B2 (en) Learning method, information processing apparatus, and learning program
CN106845358A A method and system for handwritten character image feature recognition
CN107609113A An automatic document classification method
CN111522953A (en) A marginal attack method, device and storage medium for naive Bayes classifier
CN110414622A (en) Classifier training method and device based on semi-supervised learning
CN111488400B (en) Data classification method, apparatus and computer readable storage medium
CN110199300A (en) Fuzzy input for autoencoders
Sahbi Explicit context-aware kernel map learning for image annotation
CN111710331B (en) Voice filing method and device based on multi-slice deep neural network
CN111950592A (en) A Multimodal Sentiment Feature Fusion Method Based on Supervised Least Squares Multi-Class Kernel Canonical Correlation Analysis
CN114722180B (en) Method, apparatus, device, medium and program product for generating intent labels
CN111767710B (en) Indonesia emotion classification method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant