CN118506902A

CN118506902A - Molecular property prediction method based on metric small sample learning method

Info

Publication number: CN118506902A
Application number: CN202410969130.XA
Authority: CN
Inventors: 宋弢; 王珣; 王爽; 王璐璐
Original assignee: China University of Petroleum East China
Current assignee: China University of Petroleum East China
Priority date: 2024-07-19
Filing date: 2024-07-19
Publication date: 2024-08-16
Anticipated expiration: 2044-07-19
Also published as: CN118506902B

Abstract

The invention belongs to the technical field of artificial intelligence, and particularly relates to a molecular property prediction method based on a measurement small sample learning method. The invention adopts a Sinkhorn K-means algorithm of a prototype network framework combined with a graph pre-training strategy, low-rank representation, contrast learning and optimization. The prototype network framework aims at solving the problems of feature learning and clustering of the graph data, and the prototype network framework is used for extracting general features of molecules, including local and global information, by combining the graph to cope with the complexity of the molecules. Mapping high-dimensional molecular data to a more compact representation space by low-rank representation helps to improve prediction performance. Contrast learning is introduced to maintain structural features of the data after dimension reduction so as to further improve prediction accuracy. The tagged data is integrated into the predictive model by means of the extended Sinkhorn K-means algorithm, thereby enabling more accurate molecular property predictions.

Description

Molecular property prediction method based on metric small sample learning method

技术领域Technical Field

本发明属于人工智能技术领域，特别涉及基于度量小样本学习方法的分子性质预测方法。The present invention belongs to the field of artificial intelligence technology, and in particular relates to a molecular property prediction method based on a metric small sample learning method.

背景技术Background Art

分子性质预测是计算机辅助药物发现中的关键任务，对于药物筛选和药物设计具有重要意义。目前，药物研发中存在着分子性质预测的高度复杂性和数据噪声问题。传统的分子性质预测方法在面对大规模和多样化的分子数据时，难以提供准确的预测结果，从而拖累了药物研发的进展。在分子性质预测领域，尤其是在数据有限的情况下，准确性一直是一个关键挑战。Molecular property prediction is a key task in computer-aided drug discovery and is of great significance for drug screening and drug design. Currently, there are problems of high complexity and data noise in molecular property prediction in drug development. Traditional molecular property prediction methods are difficult to provide accurate prediction results when faced with large-scale and diverse molecular data, thus hindering the progress of drug development. In the field of molecular property prediction, especially when data is limited, accuracy has always been a key challenge.

发明内容Summary of the invention

本发明提出基于度量小样本学习方法的分子性质预测方法，解决了上述问题。The present invention proposes a molecular property prediction method based on a metric small sample learning method to solve the above problems.

本发明的技术方案是这样实现的：The technical solution of the present invention is achieved in this way:

基于度量小样本学习方法的分子性质预测方法，包括以下步骤：The molecular property prediction method based on the metric small sample learning method includes the following steps:

S1、选用原型网络框架构建模型的框架；S1, the prototype network framework is selected to build the model framework;

S2、借助原型网络框架结合图方法学习图数据的通用特征；S2, using the prototype network framework combined with graph methods to learn the general features of graph data;

S3、运用低秩表示对数据进行降维处理，得到子空间和降维后的嵌入表示；S3. Use low-rank representation to reduce the dimension of the data and obtain the subspace and embedded representation after dimension reduction;

S4、使用对比学习捕获数据点与其邻居和其他数据点之间的相似性；S4, using contrastive learning to capture the similarities between a data point and its neighbors and other data points;

S5、采用扩展的Sinkhorn K-means算法，将带标签的数据纳入算法，进行数据的聚类分析。S5. Use the extended Sinkhorn K-means algorithm to incorporate labeled data into the algorithm and perform cluster analysis on the data.

可选的，步骤S3中，模型需要进行子空间学习，其中，子空间学习包括SVD分解和QR分解，SVD分解的公式如下：Optionally, in step S3, the model needs to perform subspace learning, where subspace learning includes SVD decomposition and QR decomposition. The formula of SVD decomposition is as follows:

； ;

其中，A是待分解的矩阵/原矩阵，U是大小为m×m的酉矩阵，也是左奇异矩阵，V是大小为n×n的酉矩阵，也是右奇异矩阵，Σ是一个包含有奇异值的对角矩阵，大小为m×n，U和V是两个正交矩阵，其中的每一行或每一列分别被称为左奇异向量和右奇异向量，T表示矩阵的转置；Where A is the matrix to be decomposed/original matrix, U is a unitary matrix of size m×m, also a left singular matrix, V is a unitary matrix of size n×n, also a right singular matrix, Σ is a diagonal matrix containing singular values, size m×n, U and V are two orthogonal matrices, each row or column of which is called a left singular vector and a right singular vector, respectively, and T represents the transpose of the matrix;

QR分解的公式如下：The formula for QR decomposition is as follows:

； ;

其中，Q是一个m×n的正交矩阵，其列向量是彼此正交且是单位长度的，满足；即：Q的转置和Q的乘积等于单位矩阵I，而R是一个n×n的上三角矩阵，即：除了主对角线及其以上的元素外，其他元素都为零。Where Q is an m×n orthogonal matrix whose column vectors are mutually orthogonal and of unit length, satisfying ; That is, the product of the transpose of Q and Q is equal to the identity matrix I, and R is an n×n upper triangular matrix, that is, except for the elements on the main diagonal and above, all other elements are zero.

通过上述技术方案，SVD分解是一种常见的矩阵分解方法，可以将一个矩阵分解为三个矩阵的乘积，分别是左奇异矩阵、奇异值矩阵和右奇异矩阵。奇异值分解是特征分解在任意矩阵上的推广，其目的是为了提取一个矩阵最重要的特征，在降维、数据压缩、推荐系统等有广泛的应用。QR分解是为了简化矩阵运算和求解问题的过程。这是一种常用的矩阵分解方法，可以将一个矩阵分解为一个正交矩阵和一个上三角矩阵的乘积。Through the above technical solution, SVD decomposition is a common matrix decomposition method, which can decompose a matrix into the product of three matrices, namely the left singular matrix, the singular value matrix and the right singular matrix. Singular value decomposition is a generalization of eigendecomposition on any matrix. Its purpose is to extract the most important features of a matrix. It has a wide range of applications in dimensionality reduction, data compression, recommendation systems, etc. QR decomposition is to simplify the process of matrix operations and problem solving. This is a commonly used matrix decomposition method that can decompose a matrix into the product of an orthogonal matrix and an upper triangular matrix.

可选的，步骤S3中，低秩表示的具体步骤如下：Optionally, in step S3, the specific steps of low-rank representation are as follows:

A1、对原始数据进行QR分解，将输入矩阵降维为一个三角矩阵R，其包含了数据的主要结构信息；A1. Perform QR decomposition on the original data and reduce the dimension of the input matrix to a triangular matrix R, which contains the main structural information of the data;

A2、对三角矩阵R进行SVD分解，得到左奇异矩阵U；A2. Perform SVD decomposition on the triangular matrix R to obtain the left singular matrix U;

A3、在U的基础上求出相似矩阵S，通过计算左奇异矩阵U与其转置的乘积，得到相似性矩阵W，即：；A3. Based on U, we find the similarity matrix S. By calculating the product of the left singular matrix U and its transpose, we get the similarity matrix W, that is: ;

其中，S是相似矩阵，U是左奇异矩阵，T表示矩阵的转置；Among them, S is the similarity matrix, U is the left singular matrix, and T represents the transpose of the matrix;

A4、对相似性矩阵S的对角线元素进行置零操作；A4. Set the diagonal elements of the similarity matrix S to zero;

A5、为了使得每个数据点的相似性权重之和为1，对相似性矩阵S进行归一化操作；A5. In order to make the sum of the similarity weights of each data point equal to 1, the similarity matrix S is normalized;

A6、根据相似矩阵S构建不相似矩阵S1，不相似矩阵S1是与相似矩阵S具有相同形状的全1矩阵，对于一个N-way K-shot C-query问题，不相似矩阵S1的计算公式如下：A6. Construct a dissimilarity matrix S1 based on the similarity matrix S. The dissimilarity matrix S1 is an all-1 matrix with the same shape as the similarity matrix S. For an N-way K-shot C-query problem, the calculation formula of the dissimilarity matrix S1 is as follows:

； ;

其中e是全1向量，E是单位矩阵，N是类别数，K是每个类别样本数，C是查询集样本数量，T表示矩阵的转置；Where e is a vector of all 1s, E is the identity matrix, N is the number of categories, K is the number of samples in each category, C is the number of samples in the query set, and T represents the transpose of the matrix;

通过上述技术方案，低秩表示是一种数据表示方法，旨在通过使用低秩矩阵来揭示数据的内在结构和模式。低秩表示方法的核心思想是将原始数据表示为低秩矩阵的线性组合，从而提取出数据的潜在信息和共享特征。SVD分解通过取前k个奇异值对应的列向量，可以将原始数据降维到低维空间，同时保留数据的主要结构和信息。Through the above technical solutions, low-rank representation is a data representation method that aims to reveal the intrinsic structure and pattern of data by using low-rank matrices. The core idea of the low-rank representation method is to represent the original data as a linear combination of low-rank matrices, thereby extracting the potential information and shared features of the data. SVD decomposition can reduce the original data to a low-dimensional space by taking the column vectors corresponding to the first k singular values, while retaining the main structure and information of the data.

可选的，步骤S3中，数据降维的步骤如下：Optionally, in step S3, the steps of data dimensionality reduction are as follows:

B1、通过最大化相似特征的相似性和最小化不相似特征的不相似性，学习到特征在子空间的投影P，即：B1. By maximizing the similarity of similar features and minimizing the dissimilarity of dissimilar features, we learn the projection P of the feature in the subspace, that is:

； ;

其中F为分子特征,S1是不相似矩阵，S是相似矩阵，T表示矩阵的转置；Where F is the molecular feature, S1 is the dissimilarity matrix, S is the similarity matrix, and T represents the transpose of the matrix;

B2、对P进行特征值分解，提取其特征向量并保留前k个最大特征值对应的特征向量，将原始数据矩阵与这些特征向量相乘，即可将数据F映射到一个新的坐标空间，即：B2. Perform eigenvalue decomposition on P, extract its eigenvectors and retain the eigenvectors corresponding to the first k largest eigenvalues. Multiply the original data matrix with these eigenvectors to map the data F to a new coordinate space, namely:

； ;

其中V是前k个最大的特征向量，F1是数据在新坐标空间的特征映射。Where V is the first k largest eigenvectors, and F1 is the feature map of the data in the new coordinate space.

可选的，步骤S5中，扩展的Sinkhorn K-means算法指的是基于Sinkhorn K-means算法对数据进行聚类，它是在K-means算法的基础上进行改进，通过引入最优传输矩阵和Sinkhorn进行迭代，首先初始化K个聚类中心，然后，通过迭代的方式不断更新聚类中心和数据点之间的关系。Optionally, in step S5, the extended Sinkhorn K-means algorithm refers to clustering data based on the Sinkhorn K-means algorithm, which is an improvement on the K-means algorithm by introducing the optimal transmission matrix and Sinkhorn for iteration. First, K cluster centers are initialized, and then the relationship between the cluster centers and data points is continuously updated through iteration.

可选的，扩展的Sinkhorn K-means算法进行聚类分析的步骤如下：The optional, extended Sinkhorn K-means algorithm performs cluster analysis as follows:

C1、初始化类中心，借鉴原型网络，对支撑集中所有样本的嵌入取平均，得到各个类别的原型表示，也就是初始的类中心向量；C1. Initialize the class center. By referring to the prototype network, average the embeddings of all samples in the support set to obtain the prototype representation of each category, that is, the initial class center vector;

C2、根据每个数据点与聚类中心的距离M，计算概率转移矩阵P，使用Sinkhorn算法进行迭代优化，得到稳定的矩阵；C2. Calculate the probability transfer matrix P based on the distance M between each data point and the cluster center, and use the Sinkhorn algorithm for iterative optimization to obtain a stable matrix;

C3、根据概率转移矩阵P，计算每个化合物点属于每个类别的概率，得到概率矩阵；C3. According to the probability transfer matrix P, the probability of each compound point belonging to each category is calculated to obtain a probability matrix;

C4、计算新的类中心并更新类中心，将概率转移矩阵与数据降维后的嵌入相乘，得到每个类别中样本的加权和，再将其与总样本数相除，即可得到类中心的估计值；C4. Calculate the new class center and update the class center. Multiply the probability transfer matrix with the embedding after data dimensionality reduction to obtain the weighted sum of samples in each category, and then divide it by the total number of samples to get the estimated value of the class center. ;

将计算的估计值与原始类中心的差值作为类中心的更新量，使用学习率对更新量进行调整缩，并将调整更新后的值加到原始类中心上，完成类中心的更新，即：The difference between the calculated estimate and the original class center is used as the update amount of the class center, using the learning rate Adjust the update amount and add the adjusted updated value to the original class center On the top, complete the update of the class center, namely:

； ;

C5、重复上述步骤，直到聚类中心不再发生显著变化或达到最大迭代次数。C5. Repeat the above steps until the cluster center no longer changes significantly or the maximum number of iterations is reached.

可选的，步骤C2中，Sinkhorn算法进行迭代优化，得到稳定的矩阵，具体步骤如下：Optionally, in step C2, the Sinkhorn algorithm is iteratively optimized to obtain a stable matrix, and the specific steps are as follows:

D1、将高斯核函数的值作为概率转移矩阵的初始值，并对其进行归一化操作，即：D1. Use the value of the Gaussian kernel function as the initial value of the probability transfer matrix and perform a normalization operation on it, namely:

； ;

其中P是概率转移矩阵，其中，既包括有标签的支撑集中的数据，又包括没有标签的查询集中的数据，是样本i的嵌入，是j阶级的中心，表示每个分子到类中心的欧氏距离，是高斯核函数的带宽参数，控制着数据映射的范围和数据点对决策边界的影响程度；Where P is the probability transfer matrix, which includes both the data in the support set with labels and the data in the query set without labels. is the embedding of sample i, It is the center of class j. represents the Euclidean distance from each molecule to the cluster center, It is the bandwidth parameter of the Gaussian kernel function, which controls the range of data mapping and the influence of data points on the decision boundary;

D2、通过迭代更新的方式逐步逼近概率转移矩阵P，在每次迭代中，首先计算当前的行权值向量，并将其与概率转移矩阵P相乘，实现行缩放。接着，根据已知的标签值，将前支撑集的数据在对应的位置设为0或1，以对已知类别进行约束。同样地，在列方向上进行相同的操作，通过相乘和约束操作来逐步调整传输矩阵P的元素值，使其趋近于最优解。D2, gradually approximate the probability transfer matrix P through iterative updating. In each iteration, first calculate the current row weight vector and multiply it with the probability transfer matrix P to achieve row scaling. Then, according to the known label value, set the data of the previous support set to 0 or 1 at the corresponding position to constrain the known category. Similarly, perform the same operation in the column direction, and gradually adjust the element values of the transfer matrix P through multiplication and constraint operations to make it approach the optimal solution.

D3、通过将每次迭代中传输矩阵P在第二个维度上求和的结果与上一次迭代结果之间的差值的绝对值的最大值与预先设定的阈值进行比较，判断当前传输矩阵是否已经足够接近最优解，如果差值的绝对值的最大值大于阈值，说明当前传输矩阵与上一次迭代的结果之间还存在较大的差异，需要继续迭代调整。如果差值的绝对值的最大值小于或等于阈值，认为当前传输矩阵已经足够接近最优解，可以结束迭代过程。D3. By comparing the maximum absolute value of the difference between the result of summing the transmission matrix P in the second dimension in each iteration and the result of the previous iteration with a preset threshold, it is determined whether the current transmission matrix is close enough to the optimal solution. If the maximum absolute value of the difference is greater than the threshold, it means that there is still a large difference between the current transmission matrix and the result of the previous iteration, and it is necessary to continue iterative adjustment. If the maximum absolute value of the difference is less than or equal to the threshold, it is considered that the current transmission matrix is close enough to the optimal solution, and the iteration process can be terminated.

通过上述技术方案，上述Sinkhorn算法对最优传输矩阵进行了迭代优化，得到了稳定的传输矩阵。Through the above technical solution, the above Sinkhorn algorithm iteratively optimizes the optimal transmission matrix and obtains a stable transmission matrix.

可选的，步骤S2中，原型网络框架结合图方法通过节点级和图级的预训练，使得模型捕获了图的局部和全局信息。Optionally, in step S2, the prototype network framework is combined with the graph method through node-level and graph-level pre-training, so that the model captures local and global information of the graph.

采用了上述技术方案后，本发明的有益效果是：After adopting the above technical solution, the beneficial effects of the present invention are:

本发明采用了原型网络框架，结合了先进的预训练技术（原型网络框架结合图）和低秩表示，以改进分子性质预测。利用原型网络框架结合图提取了分子的通用特征，包括局部和全局信息，以应对分子复杂性。随后，通过低秩表示，将高维分子数据映射到更紧凑的表示空间，有助于提高预测性能。同时，引入了对比学习来保持数据在降维后的结构特征，以进一步提高预测准确性。最后，借助扩展的Sinkhorn K-means算法，将带有标签的数据集成到预测模型中，从而实现更精确的分子性质预测。The present invention adopts a prototype network framework, combined with advanced pre-training techniques (prototype network framework combined with graphs) and low-rank representation to improve the prediction of molecular properties. The prototype network framework combined with graphs extracts common features of molecules, including local and global information, to cope with molecular complexity. Subsequently, high-dimensional molecular data is mapped to a more compact representation space through low-rank representation, which helps to improve the prediction performance. At the same time, contrastive learning is introduced to maintain the structural features of the data after dimensionality reduction to further improve the prediction accuracy. Finally, with the help of the extended Sinkhorn K-means algorithm, the labeled data is integrated into the prediction model to achieve more accurate prediction of molecular properties.

原型网络框架，旨在解决图数据的特征学习和聚类问题。原型网络框架结合图方法通过节点级和图级的预训练，能够全面捕获图的局部和全局特征。低秩表示技术，将数据进行降维处理，得到了更紧凑的子空间和降维后的嵌入表示，有助于数据的可视化和理解。为了维护数据在降维后的结构特征，运用对比学习，强调了数据点与其邻居以及其他数据点之间的相似性关系。扩展的Sinkhorn K-means算法，将带有标签的数据融入聚类过程，以更准确地对数据进行聚类分析。在此过程中，利用支持集的均值来初始化类中心，通过计算数据点与类中心的距离来不断优化传输矩阵，从而得到稳定的传输矩阵。利用这个传输矩阵，能够精确地将化合物分配到其所属的类别中，为数据的有效分类提供了强大支持。这些方法的结合使得本发明在图数据的特征学习和聚类任务中取得了显著的成果。本发明中的分子性质预测方法能够更准确地预测分子的性质，为药物设计和发现提供了有力支持。The prototype network framework is designed to solve the problem of feature learning and clustering of graph data. The prototype network framework, combined with the graph method, can fully capture the local and global features of the graph through node-level and graph-level pre-training. The low-rank representation technology reduces the dimension of the data to obtain a more compact subspace and embedded representation after dimensionality reduction, which is helpful for data visualization and understanding. In order to maintain the structural characteristics of the data after dimensionality reduction, contrastive learning is used to emphasize the similarity relationship between the data point and its neighbors and other data points. The extended Sinkhorn K-means algorithm integrates the labeled data into the clustering process to more accurately cluster the data. In this process, the class center is initialized by using the mean of the support set, and the transmission matrix is continuously optimized by calculating the distance between the data point and the class center, thereby obtaining a stable transmission matrix. Using this transmission matrix, the compound can be accurately assigned to the category to which it belongs, providing strong support for the effective classification of the data. The combination of these methods enables the present invention to achieve remarkable results in the task of feature learning and clustering of graph data. The molecular property prediction method in the present invention can more accurately predict the properties of molecules, providing strong support for drug design and discovery.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

为了更清楚地说明本发明实施例或现有技术中的技术方案，下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图仅仅是本发明的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动性的前提下，还可以根据这些附图获得其他的附图。In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required for use in the embodiments or the description of the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For ordinary technicians in this field, other drawings can be obtained based on these drawings without paying creative labor.

图1是随着血脑屏障穿透性数据集的维度增加，模型性能变化趋势图；Figure 1 is a graph showing the performance trend of the model as the dimension of the blood-brain barrier permeability dataset increases;

图2是随着毒性ClinTox数据集的维度增加，模型性能变化趋势图；Figure 2 is a graph showing the performance trend of the model as the dimension of the ClinTox dataset increases;

图3是随着β-分泌酶的抑制活性数据集的维度增加，模型性能变化趋势图；FIG3 is a graph showing the trend of model performance changes as the dimension of the β-secretase inhibitory activity dataset increases;

图4是随着人类免疫缺陷病毒数据集的维度增加，模型性能变化趋势图；Figure 4 is a graph showing the performance trend of the model as the dimension of the human immunodeficiency virus dataset increases;

图5是随着血脑屏障穿透性支撑集中样本数量的增加，模型性能变化趋势图；FIG5 is a graph showing the trend of model performance changes as the number of samples in the blood-brain barrier permeability support set increases;

图6是随着毒性ClinTox支撑集中样本数量的增加，模型性能变化趋势图；FIG6 is a graph showing the trend of model performance changes as the number of samples in the toxicity ClinTox support set increases;

图7是随着β-分泌酶的抑制活性支撑集中样本数量的增加，模型性能变化趋势图；FIG7 is a graph showing the trend of model performance changes as the number of samples in the β-secretase inhibitory activity support set increases;

图8是随着人类免疫缺陷病毒支撑集中样本数量的增加，模型性能变化趋势图；FIG8 is a graph showing the trend of model performance changes as the number of samples in the human immunodeficiency virus support set increases;

图9是随着毒性Tox21支撑集中样本数量的增加，模型性能变化趋势图；FIG9 is a graph showing the trend of model performance changes as the number of samples in the toxicity Tox21 support set increases;

图10是随着血脑屏障穿透性查询集中数据量增加，模型性能变化趋势图；FIG10 is a graph showing the trend of model performance changes as the amount of data in the blood-brain barrier permeability query set increases;

图11是随着毒性ClinTox查询集中数据量增加，模型性能变化趋势图；FIG11 is a graph showing the trend of model performance changes as the amount of data in the toxicity ClinTox query set increases;

图12是随着β-分泌酶的抑制活性查询集中数据量增加，模型性能变化趋势图；FIG12 is a graph showing the trend of model performance changes as the amount of data in the β-secretase inhibitory activity query set increases;

图13是随着人类免疫缺陷病毒查询集中数据量增加，模型性能变化趋势图。FIG13 is a graph showing the trend of model performance changes as the amount of data in the human immunodeficiency virus query set increases.

具体实施方式DETAILED DESCRIPTION

下面将结合本发明实施例中的附图，对本发明实施例中的技术方案进行清楚、完整地描述，显然，所描述的实施例仅仅是本发明一部分实施例，而不是全部的实施例。基于本发明中的实施例，本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例，都属于本发明保护的范围。The following will be combined with the drawings in the embodiments of the present invention to clearly and completely describe the technical solutions in the embodiments of the present invention. Obviously, the described embodiments are only part of the embodiments of the present invention, not all of the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by ordinary technicians in this field without creative work are within the scope of protection of the present invention.

本申请实施例公开了基于度量小样本学习方法的分子性质预测方法。The embodiments of the present application disclose a method for predicting molecular properties based on a metric small sample learning method.

实施例：根据图1至图13所示，基于度量小样本学习方法的分子性质预测方法，包括以下几个部分：Embodiment: As shown in FIG. 1 to FIG. 13 , the molecular property prediction method based on the metric small sample learning method includes the following parts:

1、方法部分1. Methods

1.1模型构建1.1 Model construction

本发明采用了原型网络框架。首先，借助原型网络框架结合图方法学习了图数据的通用特征。原型网络框架结合图是一种基于图神经网络的预训练策略，它通过节点级和图级的预训练，捕获了图的局部和全局信息。接着，运用低秩表示对数据进行降维处理，得到子空间和降维后的嵌入表示。然后使用对比学习来捕获数据点与其邻居和其他数据点之间的相似性，确保数据在降维后仍保持一定的空间结构。最后，采用扩展的Sinkhorn K-means算法，将带标签的数据纳入算法，从而进行数据的聚类分析。在此过程中，将支持集中不同类别分子表征的均值作为初始类中心，借助数据点与类中心的距离计算最佳传输矩阵，并不断优化以得到稳定的传输矩阵。利用该传输矩阵，能够准确地判定化合物所属的类别。The present invention adopts a prototype network framework. First, the general features of graph data are learned by combining the prototype network framework with the graph method. The prototype network framework combined with the graph is a pre-training strategy based on a graph neural network, which captures the local and global information of the graph through pre-training at the node level and the graph level. Next, the data is reduced in dimension using a low-rank representation to obtain a subspace and an embedded representation after dimensionality reduction. Contrastive learning is then used to capture the similarity between the data point and its neighbors and other data points to ensure that the data still maintains a certain spatial structure after dimensionality reduction. Finally, an extended Sinkhorn K-means algorithm is used to incorporate labeled data into the algorithm to perform cluster analysis of the data. In this process, the mean of the molecular representations of different categories in the support set is used as the initial class center, and the optimal transmission matrix is calculated with the help of the distance between the data point and the class center, and is continuously optimized to obtain a stable transmission matrix. Using this transmission matrix, the category to which the compound belongs can be accurately determined.

1.2子空间学习1.2 Subspace Learning

模型首先学习了一个子空间，它可以提高特征判别的能力。The model first learns a subspace that improves the discriminative power of features.

奇异值分解（简称SVD）是一种常见的矩阵分解方法，可以将一个矩阵分解为三个矩阵的乘积，分别是左奇异矩阵、奇异值矩阵和右奇异矩阵。奇异值分解是特征分解在任意矩阵上的推广，其目的是为了提取一个矩阵最重要的特征，在降维、数据压缩、推荐系统等有广泛的应用。其公式为：Singular value decomposition (SVD) is a common matrix decomposition method that can decompose a matrix into the product of three matrices, namely the left singular matrix, the singular value matrix, and the right singular matrix. Singular value decomposition is a generalization of eigendecomposition on any matrix. Its purpose is to extract the most important features of a matrix. It is widely used in dimensionality reduction, data compression, recommendation systems, etc. Its formula is:

； ;

其中，A是待分解的矩阵/原矩阵；U是大小为m×m的酉矩阵，也是左奇异矩阵，可以用于数据压缩和降维；V是大小为n×n的酉矩阵，也是右奇异矩阵，可以用于数据恢复和重建；Σ是一个包含有奇异值的对角矩阵，大小为m×n。U和V是两个正交矩阵，其中的每一行或每一列分别被称为左奇异向量和右奇异向量，T表示矩阵的转置。Among them, A is the matrix to be decomposed/the original matrix; U is a unitary matrix of size m×m, which is also a left singular matrix and can be used for data compression and dimensionality reduction; V is a unitary matrix of size n×n, which is also a right singular matrix and can be used for data recovery and reconstruction; Σ is a diagonal matrix containing singular values, which is m×n in size. U and V are two orthogonal matrices, each row or column of which is called a left singular vector and a right singular vector, respectively, and T represents the transpose of the matrix.

为了简化矩阵运算和求解问题的过程，使用了QR分解。这是一种常用的矩阵分解方法，可以将一个矩阵分解为一个正交矩阵和一个上三角矩阵的乘积。给定一个m×n的矩阵A，QR分解将A分解为两个矩阵的乘积：In order to simplify the process of matrix operations and problem solving, QR decomposition is used. This is a commonly used matrix decomposition method that can decompose a matrix into the product of an orthogonal matrix and an upper triangular matrix. Given an m×n matrix A, QR decomposition decomposes A into the product of two matrices:

； ;

其中，Q是一个m×n的正交矩阵，其列向量是彼此正交且是单位长度的，满足Where Q is an m×n orthogonal matrix whose column vectors are mutually orthogonal and of unit length, satisfying

； ;

即：Q的转置和Q的乘积等于单位矩阵I。而R是一个n×n的上三角矩阵，即：除了主对角线及其以上的元素外，其他元素都为零。That is, the product of the transpose of Q and Q is equal to the identity matrix I. R is an n×n upper triangular matrix, that is, except for the elements on the main diagonal and above, all other elements are zero.

Q矩阵由单位正交向量构成，它包含了原始矩阵在列空间的正交基。每个向量代表原始数据在列方向上的结构和特征。而R矩阵则记录了原始矩阵在每一列方向上的模长以及与其他列的正交关系，即：原始数据在列方向上的结构特征。The Q matrix is composed of unit orthogonal vectors, which contains the orthogonal basis of the original matrix in the column space. Each vector represents the structure and characteristics of the original data in the column direction. The R matrix records the modulus of the original matrix in each column direction and the orthogonal relationship with other columns, that is, the structural characteristics of the original data in the column direction.

低秩表示（简称LRR）是一种数据表示方法，旨在通过使用低秩矩阵来揭示数据的内在结构和模式。LRR方法的核心思想是将原始数据表示为低秩矩阵的线性组合，从而提取出数据的潜在信息和共享特征，在本发明的具体步骤如下：Low rank representation (LRR for short) is a data representation method that aims to reveal the intrinsic structure and pattern of data by using low rank matrices. The core idea of the LRR method is to represent the original data as a linear combination of low rank matrices, thereby extracting the potential information and shared features of the data. The specific steps of the present invention are as follows:

（1）对原始数据进行QR分解。这个步骤是为了加速SVD计算，将输入矩阵降维为一个三角矩阵R，其包含了数据的主要结构信息。(1) Perform QR decomposition on the original data. This step is to speed up the SVD calculation and reduce the input matrix to a triangular matrix R, which contains the main structural information of the data.

（2）对三角矩阵R进行SVD分解，得到左奇异矩阵U。U矩阵的列向量表示原始数据在列空间上的结构和特征，是原始矩阵在列方向上的正交基。这些列向量按照奇异值由大到小的顺序排列，表示特征向量的重要性逐个递减。因此通过取前k个奇异值对应的列向量，可以将原始数据降维到低维空间，同时保留数据的主要结构和信息。(2) Perform SVD decomposition on the triangular matrix R to obtain the left singular matrix U. The column vectors of the U matrix represent the structure and characteristics of the original data in the column space and are the orthogonal basis of the original matrix in the column direction. These column vectors are arranged in descending order according to the singular value, indicating that the importance of the eigenvectors decreases one by one. Therefore, by taking the column vectors corresponding to the first k singular values, the original data can be reduced to a low-dimensional space while retaining the main structure and information of the data.

（3）在U的基础上求出相似矩阵S，通过计算左奇异矩阵U与其转置的乘积，得到相似性矩阵W。即：(3) Based on U, we can find the similarity matrix S. By calculating the product of the left singular matrix U and its transpose, we can get the similarity matrix W. That is:

； ;

其中S是相似矩阵，U是左奇异矩阵，T表示矩阵的转置。具体来说，使用torch.matmul()函数计算U与其转置的乘积，并使用torch.abs()函数取绝对值，得到一个对称矩阵。这个相似性矩阵描述了数据点之间的相似性程度。Where S is the similarity matrix, U is the left singular matrix, and T represents the transpose of the matrix. Specifically, the torch.matmul() function is used to calculate the product of U and its transpose, and the torch.abs() function is used to take the absolute value to obtain a symmetric matrix. This similarity matrix describes the degree of similarity between data points.

（4）对相似性矩阵S的对角线元素进行置零操作，这是因为数据点与自身的相似性为1，置零可以避免自身的相似性对降维结果产生影响。(4) The diagonal elements of the similarity matrix S are set to zero. This is because the similarity between a data point and itself is 1. Setting the elements to zero can prevent its own similarity from affecting the dimensionality reduction result.

（5）为了使得每个数据点的相似性权重之和为1，对相似性矩阵S进行归一化操作。(5) In order to make the sum of the similarity weights of each data point equal to 1, the similarity matrix S is normalized.

（6）根据相似矩阵S构建不相似矩阵S1。不相似矩阵S1用于描述数据点之间的不相似性。它是相似矩阵S的补集，表示数据点之间的相似度权重为0的部分。首先，初始化一个与相似矩阵S具有相同形状的全1矩阵S1。为了避免数据点与自身的相似度对降维结果产生影响，将S1的对角线元素置为0。同时，为了确保每个数据点的相似性权重之和为1，对S1进行归一化操作。对于一个N-way K-shot C-query问题，不相似矩阵S1的计算遵循以下公式：(6) Construct the dissimilarity matrix S1 based on the similarity matrix S. The dissimilarity matrix S1 is used to describe the dissimilarity between data points. It is the complement of the similarity matrix S, which represents the part of the similarity weight between data points with a value of 0. First, initialize an all-1 matrix S1 with the same shape as the similarity matrix S. In order to avoid the influence of the similarity between the data point and itself on the dimensionality reduction result, the diagonal elements of S1 are set to 0. At the same time, in order to ensure that the sum of the similarity weights of each data point is 1, S1 is normalized. For an N-way K-shot C-query problem, the calculation of the dissimilarity matrix S1 follows the following formula:

； ;

其中e是全1向量，E是单位矩阵，N是类别数，K是每个类别样本数，C是查询集样本数量，T代表矩阵的转置。Where e is the all-1 vector, E is the identity matrix, N is the number of categories, K is the number of samples in each category, C is the number of samples in the query set, and T represents the transpose of the matrix.

至此，已经得到了用来描述数据共享特征的相似矩阵S和描述数据差异度的不相似矩阵S1。接下来，通过S和S1计算特征在子空间的投影，并将分子的原特征映射到新的坐标空间，即可实现数据降维，具体过程如下：So far, we have obtained the similarity matrix S used to describe the shared features of the data and the dissimilarity matrix S1 used to describe the differences in the data. Next, we calculate the projection of the features in the subspace through S and S1, and map the original features of the molecules to the new coordinate space to achieve data dimensionality reduction. The specific process is as follows:

（1）通过最大化相似特征的相似性和最小化不相似特征的不相似性，可以学习到特征在子空间的投影P，即：(1) By maximizing the similarity of similar features and minimizing the dissimilarity of dissimilar features, the projection P of the feature in the subspace can be learned, that is:

； ;

其中F为分子特征,S1和S分别是不相似矩阵和相似矩阵，T代表矩阵的转置。Where F is the molecular feature, S1 and S are the dissimilarity matrix and similarity matrix respectively, and T represents the transpose of the matrix.

（2）对P进行特征值分解，提取其特征向量并保留前k个最大特征值对应的特征向量。然后，将原始数据矩阵与这些特征向量相乘，即可将数据F映射到一个新的坐标空间，即：(2) Perform eigenvalue decomposition on P, extract its eigenvectors and retain the eigenvectors corresponding to the first k largest eigenvalues. Then, multiply the original data matrix with these eigenvectors to map the data F to a new coordinate space, namely:

； ;

其中V是top-k个特征向量，F1是数据在新坐标空间的特征映射。这个过程实现了数据的降维和特征向量的提取。Where V is the top-k feature vectors, and F1 is the feature map of the data in the new coordinate space. This process achieves data dimensionality reduction and feature vector extraction.

1.3聚类1.3 Clustering

基于Sinkhorn K-means算法对数据进行聚类，它是在K-means算法的基础上进行改进，通过引入最优传输矩阵和Sinkhorn迭代来实现更好的聚类效果，主要用于解决小样本学习问题。The data is clustered based on the Sinkhorn K-means algorithm. It is an improvement on the K-means algorithm. It achieves better clustering effect by introducing the optimal transmission matrix and Sinkhorn iteration. It is mainly used to solve small sample learning problems.

K-means算法是一种常用的聚类算法，用于将数据集划分为K个簇，每个簇具有相似的特征。它的主要思想是通过迭代的方式，将数据点分配到最近的聚类中心，并更新聚类中心的位置，以达到最小化数据点与所属聚类中心之间距离的目标。其算法步骤为：K-means algorithm is a commonly used clustering algorithm, which is used to divide a data set into K clusters, each with similar characteristics. Its main idea is to assign data points to the nearest cluster center in an iterative manner and update the location of the cluster center to minimize the distance between the data point and the cluster center to which it belongs. The algorithm steps are:

（1）选择聚类数K：确定要将数据集划分为的簇的数量K。(1) Select the number of clusters K: Determine the number K of clusters into which the dataset is to be divided.

（2）初始化聚类中心：可以从数据集中随机选择K个样本作为初始聚类中心，或者通过其他方法进行初始化。(2) Initialize cluster centers: K samples can be randomly selected from the data set as initial cluster centers, or initialized by other methods.

（3）分配样本到最近的聚类中心：对于每个样本，计算其与每个聚类中心之间的距离，将样本分配到距离最近的聚类中心所属的簇。(3) Assign samples to the nearest cluster center: For each sample, calculate the distance between it and each cluster center, and assign the sample to the cluster to which the nearest cluster center belongs.

（4）更新聚类中心：对于每个簇，计算该簇中所有样本的均值，将该均值作为新的聚类中心。(4) Update cluster centers: For each cluster, calculate the mean of all samples in the cluster and use the mean as the new cluster center.

（5）重复步骤（3）和步骤（4），直到满足停止条件。停止条件可以是达到最大迭代次数、聚类中心的变化小于设定的阈值，或者达到了预先定义的性能指标。(5) Repeat steps (3) and (4) until the stopping condition is met. The stopping condition can be that the maximum number of iterations is reached, the change in the cluster center is less than the set threshold, or a pre-defined performance indicator is reached.

（6）得到最终聚类结果：当算法停止时，每个样本将属于与其最近的聚类中心所属的簇。最终的聚类结果就是这些簇的集合。(6) Get the final clustering result: When the algorithm stops, each sample will belong to the cluster to which its closest cluster center belongs. The final clustering result is the set of these clusters.

Sinkhorn算法是一种用于解决最优传输问题的迭代优化算法。最优传输问题是指在给定两个概率分布之间找到最佳映射方案，使得总的运输成本最小。该算法的主要目标是计算一个转移矩阵，该矩阵可以表示从一个概率分布到另一个概率分布的最佳映射，也被称为最优传输矩阵或权重矩阵。Sinkhorn算法通过迭代更新转移矩阵，逐步逼近最优传输方案。它通过缩放行和列、约束已知类别等操作，在更新过程中优化转移矩阵以最小化源分布与目标分布之间的距离或相似度。这个算法的关键思想是通过迭代更新转移矩阵P，使其逐渐趋近于最优解，在计算机视觉、自然语言处理等领域中被广泛应用。The Sinkhorn algorithm is an iterative optimization algorithm for solving the optimal transport problem. The optimal transport problem refers to finding the best mapping solution between two given probability distributions so that the total transportation cost is minimized. The main goal of the algorithm is to calculate a transfer matrix, which can represent the best mapping from one probability distribution to another, also known as the optimal transfer matrix or weight matrix. The Sinkhorn algorithm gradually approaches the optimal transport solution by iteratively updating the transfer matrix. It optimizes the transfer matrix to minimize the distance or similarity between the source distribution and the target distribution during the update process by scaling rows and columns, constraining known categories, and other operations. The key idea of this algorithm is to iteratively update the transfer matrix P so that it gradually approaches the optimal solution. It is widely used in computer vision, natural language processing and other fields.

本发明对sinkhorn k-means算法进行了改进：在计算最优传输矩阵时，sinkhornk-means算法只需要使用未标记数据，而本发明不仅使用了查询集中的未标记数据，还使用了支持集中的标记数据，因此本发明的聚类的过程是半监督的。The present invention improves the sinkhorn k-means algorithm: when calculating the optimal transmission matrix, the sinkhorn k-means algorithm only needs to use unlabeled data, while the present invention not only uses the unlabeled data in the query set, but also uses the labeled data in the support set, so the clustering process of the present invention is semi-supervised.

通过利用标记数据和未标记数据，本发明的方法能够更好地利用有限的标记信息来指导聚类过程。在计算最优传输矩阵时，本发明不仅考虑了未标记数据与支持集中标记数据之间的相似性，还充分利用了查询集中的未标记数据与标记数据之间的关联。通过这种半监督的方式，本发明的方法能够更好地利用数据的潜在信息，从而更准确地推断查询集中的数据的所属的类别。By utilizing labeled data and unlabeled data, the method of the present invention can better utilize limited labeled information to guide the clustering process. When calculating the optimal transmission matrix, the present invention not only considers the similarity between the unlabeled data and the labeled data in the support set, but also fully utilizes the association between the unlabeled data and the labeled data in the query set. Through this semi-supervised approach, the method of the present invention can better utilize the potential information of the data, thereby more accurately inferring the category to which the data in the query set belongs.

在本发明中，首先初始化K个聚类中心。然后，通过迭代的方式不断更新聚类中心和数据点之间的关系。具体步骤如下：In the present invention, K cluster centers are first initialized. Then, the relationship between the cluster centers and data points is continuously updated in an iterative manner. The specific steps are as follows:

（1）首先初始化类中心，借鉴原型网络(1) First, initialize the class center and refer to the prototype network

对支撑集中所有样本的嵌入取平均，得到各个类别的原型表示，也就是初始的类中心向量。The embeddings of all samples in the support set are averaged to obtain the prototype representation of each category, which is the initial class center vector.

（2）根据每个数据点与聚类中心的距离（文中是M），计算概率转移矩阵（最优传输矩阵）（文中是P），使用Sinkhorn算法进行迭代优化，得到稳定的矩阵。具体来说：(2) According to the distance between each data point and the cluster center (M in this paper), the probability transfer matrix (optimal transfer matrix) (P in this paper) is calculated, and the Sinkhorn algorithm is used for iterative optimization to obtain a stable matrix. Specifically:

首先将高斯核函数的值作为概率转移矩阵的初始值，并对其进行归一化操作，即：First, the value of the Gaussian kernel function is used as the initial value of the probability transfer matrix and normalized, that is:

； ;

其中P是概率转移矩阵（最优传输矩阵），其中既包括有标签的支撑集中的数据，又包括没有标签的查询集中的数据。是样本i的嵌入，是j阶级的中心，表示每个分子到类中心的欧氏距离，是高斯核函数的带宽参数，控制着数据映射的范围和数据点对决策边界的影响程度。Where P is the probability transfer matrix (optimal transfer matrix), which includes both the data in the support set with labels and the data in the query set without labels. is the embedding of sample i, It is the center of class j. represents the Euclidean distance from each molecule to the cluster center, It is the bandwidth parameter of the Gaussian kernel function, which controls the range of data mapping and the influence of data points on the decision boundary.

通过迭代更新的方式逐步逼近最优传输矩阵P。在每次迭代中，首先计算当前的行权值向量，并将其与概率转移矩阵P相乘，实现行缩放。接着，根据已知的标签值，将前支撑集的数据在对应的位置设为0或1，以对已知类别进行约束。同样地，在列方向上进行相同的操作，通过相乘和约束操作来逐步调整传输矩阵P的元素值，使其趋近于最优解。The optimal transfer matrix P is gradually approached by iterative updating. In each iteration, the current row weight vector is first calculated and multiplied with the probability transfer matrix P to achieve row scaling. Then, according to the known label value, the data of the previous support set is set to 0 or 1 at the corresponding position to constrain the known category. Similarly, the same operation is performed in the column direction, and the element values of the transfer matrix P are gradually adjusted through multiplication and constraint operations to make it approach the optimal solution.

通过将每次迭代中传输矩阵P在第二个维度上求和的结果与上一次迭代结果之间的差值的绝对值的最大值与预先设定的阈值进行比较，判断当前传输矩阵是否已经足够接近最优解。如果差值的绝对值的最大值大于阈值，说明当前传输矩阵与上一次迭代的结果之间还存在较大的差异，需要继续迭代调整。反之，如果差值的绝对值的最大值小于或等于阈值，可以认为当前传输矩阵已经足够接近最优解，可以结束迭代过程。这种比较方法可以用来监测迭代过程中传输矩阵的收敛情况，即：判断当前的迭代是否已经达到了最优传输矩阵。差值越小，则表示传输矩阵在迭代过程中变化越小，越接近最优解。因此，通过设定阈值，可以控制迭代过程的收敛性和精度，以得到最优传输矩阵。By comparing the maximum absolute value of the difference between the result of summing the transmission matrix P in the second dimension in each iteration and the result of the previous iteration with a preset threshold, it is determined whether the current transmission matrix is close enough to the optimal solution. If the maximum absolute value of the difference is greater than the threshold, it means that there is still a large difference between the current transmission matrix and the result of the previous iteration, and it is necessary to continue iterative adjustment. On the contrary, if the maximum absolute value of the difference is less than or equal to the threshold, it can be considered that the current transmission matrix is close enough to the optimal solution and the iteration process can be ended. This comparison method can be used to monitor the convergence of the transmission matrix during the iteration process, that is, to determine whether the current iteration has reached the optimal transmission matrix. The smaller the difference, the smaller the change of the transmission matrix during the iteration process, and the closer it is to the optimal solution. Therefore, by setting the threshold, the convergence and accuracy of the iteration process can be controlled to obtain the optimal transmission matrix.

（3）根据概率转移矩阵，计算每个化合物点属于每个类别的概率，得到概率矩阵。(3) According to the probability transfer matrix, the probability of each compound point belonging to each category is calculated to obtain the probability matrix.

将传输矩阵P（即：概率转移矩阵P）中的每个样本中概率最大的位置置为1，表示该样本所属的类别，其余位置置为0。The position with the highest probability in each sample in the transmission matrix P (i.e., probability transfer matrix P) is set to 1 to indicate the category to which the sample belongs, and the remaining positions are set to 0.

（4）计算新的类中心并更新类中心(4) Calculate the new class center and update the class center

将最优传输矩阵与数据降维后的嵌入相乘，得到每个类别中样本的加权和，再将其与总样本数相除，即可得到类中心的估计值。Multiply the optimal transfer matrix by the embedding after data dimensionality reduction to obtain the weighted sum of samples in each category, and then divide it by the total number of samples to get the estimated value of the class center .

将计算的估计值与原始类中心的差值作为类中心的更新量，使用学习率对更新量进行调整（缩放），并将调整（更新）后的值加到原始类中心上，完成类中心的更新，即：The difference between the calculated estimate and the original class center is used as the update amount of the class center, using the learning rate Adjust (scale) the update amount and add the adjusted (updated) value to the original class center On the top, complete the update of the class center, namely:

； ;

（5）重复以上步骤，直到聚类中心不再发生显著变化或达到最大迭代次数。(5) Repeat the above steps until the cluster center no longer changes significantly or the maximum number of iterations is reached.

2、数据集2. Dataset

为了评估模型在不同性质预测任务中的性能，本发明采用了五个公开可用的基准数据集，分别是血脑屏障穿透性数据集（简称BBBP）、毒性数据集ClinTox、β-分泌酶的抑制活性数据集、人类免疫缺陷病毒数据集（简称HIV）与毒性数据集Tox21，其具体描述分别如下：In order to evaluate the performance of the model in different property prediction tasks, the present invention uses five publicly available benchmark datasets, namely the blood-brain barrier permeability dataset (BBBP for short), the toxicity dataset ClinTox, the β-secretase inhibitory activity dataset, the human immunodeficiency virus dataset (HIV for short) and the toxicity dataset Tox21, which are described as follows:

（1）BBBP：BBBP数据集是用于预测化合物在血脑屏障中穿透性的公开数据集。血脑屏障是一层位于中枢神经系统和外周循环系统之间的生物屏障，可以阻挡大多数药物、激素和神经递质，通过抑制血液中化学物质的进入来保护大脑。因此，了解合物在血脑屏障中的穿透性对于药物发现和设计具有重要意义。BBBP是一个二进制数据集，包含超过2000个具有不同的结构和化学性质的化合物，每个化合物都被标记为可渗透（“1”）或不可渗透（“0”）这两类标签，这些标签是基于实验测量或模拟预测得出的。该数据集广泛应用于药物发现和计算机辅助药物设计领域。(1) BBBP: The BBBP dataset is a public dataset used to predict the permeability of compounds across the blood-brain barrier. The blood-brain barrier is a biological barrier located between the central nervous system and the peripheral circulatory system that blocks most drugs, hormones, and neurotransmitters and protects the brain by inhibiting the entry of chemicals in the blood. Therefore, the permeability of compounds across the blood-brain barrier is of great significance for drug discovery and design. BBBP is a binary dataset containing more than 2,000 compounds with different structures and chemical properties. Each compound is labeled as permeable ("1") or impermeable ("0") based on experimental measurements or simulation predictions. This dataset is widely used in the fields of drug discovery and computer-aided drug design.

（2）ClinTox：ClinTox数据集是一个用于预测药物毒性的公开数据集，它被广泛应用于药物研发和计算机辅助药物设计领域。该数据集由Williams等人于2017年发布，包含了1000多种小分子药物、天然产物和生物活性分子，每个化合物都被标记为有毒（“1”）或无毒（“0”）两类。(2) ClinTox: The ClinTox dataset is a public dataset for predicting drug toxicity. It is widely used in drug development and computer-aided drug design. The dataset was released by Williams et al. in 2017 and contains more than 1,000 small molecule drugs, natural products, and bioactive molecules. Each compound is labeled as toxic ("1") or non-toxic ("0").

（3）Bace：BACE数据集是一个用于预测药物分子对β-分泌酶抑制活性的数据集。0和1代表了化合物对酶抑制活性的分类标签,其中“0”表示化合物对酶的抑制作用较弱或不存在，而“0”表示化合物对酶的抑制作用较强。该数据集包含1000多个化合物，每个化合物都被测量了其对β-分泌酶的抑制活性，以半抑制浓度（IC50）的负对数形式表示，较低的IC50值表示化合物对酶的抑制作用更强。(3) Bace: The BACE dataset is a dataset used to predict the inhibitory activity of drug molecules on β-secretase. 0 and 1 represent the classification labels of the compound's inhibitory activity on the enzyme, where "0" indicates that the compound has a weak or non-existent inhibitory effect on the enzyme, while "0" indicates that the compound has a strong inhibitory effect on the enzyme. The dataset contains more than 1,000 compounds, and each compound has been measured for its inhibitory activity on β-secretase, expressed in the negative logarithm of the half-inhibitory concentration (IC50). A lower IC50 value indicates that the compound has a stronger inhibitory effect on the enzyme.

（4）HIV：人类免疫缺陷病毒数据集来源于药物治疗计划(简称DTP)的艾滋病抗病毒筛查，该筛查测试了超过40000种化合物抑制HIV复制的能力。(4) HIV: The human immunodeficiency virus dataset comes from the Drug Treatment Program (DTP) HIV antiviral screening, which tested more than 40,000 compounds for their ability to inhibit HIV replication.

3、实验部分3. Experimental part

3.1实验设置3.1 Experimental Setup

3.1.1基线模型3.1.1 Baseline Model

为了验证模型的有效性，将其与目前公认效果较好的六种基线模型进行对比实验，以下是这六种基线模型的详细介绍：In order to verify the effectiveness of the model, a comparative experiment was conducted with six baseline models that are currently recognized to have good results. The following is a detailed introduction to these six baseline models:

（1）Meta-MGNN:一种基于MAML小样本学习方法，这是一种基于优化的小样本学习方法。首先，使用预训练的图神经网络原型网络框架结合图学习分子的嵌入表示，然后进一步使用元学习来学习模型的初始化参数，这些参数可以通过少量数据样本快速适应新的分子特性。(1) Meta-MGNN: A MAML-based small sample learning method, which is an optimization-based small sample learning method. First, a pre-trained graph neural network prototype network framework is used in conjunction with graph learning to embed the molecular representation, and then meta-learning is further used to learn the initialization parameters of the model, which can quickly adapt to new molecular properties with a small amount of data samples.

（2）FS-GNNTR：模型以分子图为输入，对图嵌入的局部空间进行上下文建模，同时保留深度表示的全局信息。使用双模块元学习框架在小样本学习任务中迭代更新模型参数。(2) FS-GNNTR: The model takes a molecular graph as input and models the context in the local space of graph embedding while retaining the global information of the deep representation. A dual-module meta-learning framework is used to iteratively update the model parameters in the few-shot learning task.

（3）FS-GNNConv：一种跨图神经网络和卷积神经网络的小样本学习策略，并提出了一个双模块元学习框架，能够利用图嵌入的丰富信息，用于从任务可转移知识中进行学习。(3) FS-GNNConv: A few-shot learning strategy across graph neural networks and convolutional neural networks, and a dual-module meta-learning framework that can leverage the rich information of graph embeddings for learning from task-transferable knowledge.

（4）Relation Guide：模型首先通过具有共同性质的分子训练多了个性质感知图神经网络，并从这些分子中提取分子表示，构建性质感知矩阵，然后使用斯皮尔曼相关系数来计算性质与性质感知矩阵之间的关系。在这个过程中，采用元学习策略从元训练的性质类别中学习公共预测知识。(4) Relation Guide: The model first trains a property-aware graph neural network with molecules with common properties, extracts molecular representations from these molecules, constructs a property-aware matrix, and then uses the Spearman correlation coefficient to calculate the relationship between the properties and the property-aware matrix. In this process, a meta-learning strategy is used to learn common prediction knowledge from meta-trained property categories.

（5）HSL-RG：这是一种基于关系图的层次结构学习方法，使用任务自适应元学习方法，分别从全局和局部两个方面探索分子的结构语义。首先，利用图核构建关系图，以传递来自相邻分子的分子结构信息进行全局探索；然后，设计基于结构优化的自监督学习信号，以在局部范围内学习分子的恒定表示。(5) HSL-RG: This is a relational graph-based hierarchical structure learning method that uses a task-adaptive meta-learning method to explore the structural semantics of molecules from both global and local perspectives. First, a relational graph is constructed using a graph kernel to transfer molecular structural information from neighboring molecules for global exploration; then, a self-supervised learning signal based on structural optimization is designed to learn a constant representation of molecules in a local range.

（6）PAR：PAR是基于属性感知的小分子性质预测关系网络，使用属性感知嵌入函数，将通用分子嵌入转换为与目标属性相关的子结构感知空间，再使用自适应关系图学习模块来联合估计分子关系图，同时使用元学习策略以便在任务中有选择地更新参数。(6) PAR: PAR is an attribute-aware small molecule property prediction relationship network. It uses an attribute-aware embedding function to convert the general molecular embedding into a substructure-aware space related to the target attribute, and then uses an adaptive relationship graph learning module to jointly estimate the molecular relationship graph. At the same time, a meta-learning strategy is used to selectively update parameters in the task.

3.1.2模型参数设置3.1.2 Model parameter setting

在本发明中，为了防止过拟合，将迭代次数设为40。除了在对模型的某些性能进行测试时，将降维维度设定为40，支撑集的样本数量设定为5，查询集的样本数量设定为10。In the present invention, in order to prevent overfitting, the number of iterations is set to 40. Except when testing certain performance of the model, the dimension reduction is set to 40, the number of samples in the support set is set to 5, and the number of samples in the query set is set to 10.

3.2降维维度对模型的影响3.2 Impact of Dimensionality Reduction on the Model

为了探究降维的维度对模型性能的影响，本发明对维度分别设置为5、10、15、20、30、40、50、60并进行了性能测试，具体结果如图1-图4和表1所示。In order to explore the impact of the reduced dimension on the model performance, the present invention sets the dimensions to 5, 10, 15, 20, 30, 40, 50, and 60 respectively and conducts performance tests. The specific results are shown in Figures 1 to 4 and Table 1.

表1 降维的维度对模型的影响Table 1 The impact of dimension reduction on the model

在图1-图4和表1中，可以得到以下结论：From Figures 1 to 4 and Table 1, we can draw the following conclusions:

（1）初期，在维度为5至10之间，模型性能快速提升；随着维度的进一步增加，性能提升速度变缓，最终达到峰值后不再随着维度的增加而增加。(1) In the early stage, when the dimension is between 5 and 10, the model performance improves rapidly. As the dimension increases further, the performance improvement rate slows down, and finally reaches a peak and no longer increases with the increase of dimension.

（2）在BBBP、ClinTox、Bace和HIV这些数据集中，ClinTox在维度为30时性能最佳，而其余数据集在维度为40时性能最佳。(2) Among the BBBP, ClinTox, Bace, and HIV datasets, ClinTox performs best when the dimension is 30, while the remaining datasets perform best when the dimension is 40.

因此，为了能保持模型性能最佳的同时减小后续的计算量，在后续实验中维度将被设为40，以实现模型性能和计算效率之间的平衡。Therefore, in order to maintain the best model performance while reducing the subsequent calculation amount, the dimension will be set to 40 in subsequent experiments to achieve a balance between model performance and computational efficiency.

3.3支撑集样本数量对模型的影响3.3 The impact of the number of support set samples on the model

在之前的实验中，训练集样本数量一直设为5，为了探索发明提出的小样本学习方法如何从更多的训练数据中获益，本发明测试了支撑集中的样本数量对模型性能的影响，训练集样本数量分别取1、2、3、4、5与10，具体结果如表2与表3所示。In previous experiments, the number of training set samples has been set to 5. In order to explore how the small sample learning method proposed in the invention can benefit from more training data, the present invention tests the influence of the number of samples in the support set on the model performance. The number of training set samples is 1, 2, 3, 4, 5 and 10 respectively. The specific results are shown in Tables 2 and 3.

表2 Tox21中支撑集数量对模型的影响Table 2 The impact of the number of support sets on the model in Tox21

表3 支撑集数量对模型的影响Table 3 The impact of the number of support sets on the model

从图5-图9中可以看出，在毒性、血脑屏障穿透性等多种性质预测数据集中，模型的性能都随着支撑集中数据量的增加而增加，且增速大多符合先快后慢的趋势。其中，Tox21包含了12项任务，详见表3。当支撑集中的样本数量为1时，ROC_AUC已经超过80，这已经超过了目前许多模型的性能；而当支撑集中的样本数量增加到10时，ROC_AUC超过90，性能提升了约10%。As can be seen from Figures 5 to 9, in various property prediction data sets such as toxicity and blood-brain barrier permeability, the performance of the model increases with the increase in the amount of data in the support set, and the growth rate mostly conforms to the trend of first fast and then slow. Among them, Tox21 includes 12 tasks, see Table 3 for details. When the number of samples in the support set is 1, ROC_AUC exceeds 80, which has exceeded the performance of many current models; and when the number of samples in the support set increases to 10, ROC_AUC exceeds 90, and the performance is improved by about 10%.

这是因为在聚类过程中，支撑集中的标记数据不仅仅用于计算初始的类中心，在计算最优传输矩阵阶段，算法也考虑了支撑集中标记数据与查询集中未标记数据的相似性和关联，以进行半监督训练来充分利用数据的潜在信息。This is because in the clustering process, the labeled data in the support set is not only used to calculate the initial class center. In the stage of calculating the optimal transfer matrix, the algorithm also considers the similarity and correlation between the labeled data in the support set and the unlabeled data in the query set to perform semi-supervised training to fully utilize the potential information of the data.

3.4查询集样本数量对模型的影响3.4 The impact of the number of query set samples on the model

由于在半监督学习阶段，未标记数据全部来源于查询集，因此查询集中的样本数量会影响模型的性能。为了测试查询集中的样本数量对模型性能的影响，在BBBP、ClinTox、Bace与HIV数据集上，分别将样本数量分别设置为1至10进行测试，具体结果如表4与图10-图13所示。Since all unlabeled data in the semi-supervised learning stage come from the query set, the number of samples in the query set will affect the performance of the model. In order to test the impact of the number of samples in the query set on the model performance, the number of samples was set to 1 to 10 for testing on the BBBP, ClinTox, Bace and HIV datasets, respectively. The specific results are shown in Table 4 and Figures 10-13.

表4 查询集数量对模型的影响Table 4 The impact of the number of query sets on the model

如图10-图13所示，在所有任务上，随着查询集中样本数不断增加，模型的性能都有所提高。当查询集的数量从1增加到2时，所有任务的性能都有显著提升，这表明增加一个额外的查询实例可以显著提高模型的泛化能力。随着查询集数量的进一步增加，性能提升逐渐减缓，这可能是因为模型已经从有限的数据中学习到了足够的信息。尽管如此，总体趋势表明，增加查询集的数量仍然有助于提高模型的性能，特别是在某些任务上（如ClinTox和HIV），即使在查询集数量达到10时，性能仍在继续改善。As shown in Figures 10-13, on all tasks, the performance of the model improves as the number of samples in the query set increases. When the number of query sets increases from 1 to 2, the performance of all tasks improves significantly, indicating that adding an additional query instance can significantly improve the generalization ability of the model. As the number of query sets increases further, the performance improvement gradually slows down, which may be because the model has learned enough information from the limited data. Nevertheless, the overall trend shows that increasing the number of query sets still helps to improve the performance of the model, especially on some tasks (such as ClinTox and HIV), where the performance continues to improve even when the number of query sets reaches 10.

因此，这个实验结果表明，模型能够有效利用未标记数据中的信息，可以通过使用更多的未标记数据获得更好的性能。然而，需要注意的是，增加查询集的数量可能会导致计算成本的增加，因此需要在性能提升和计算资源之间进行权衡，其他实验中查询集的样本数量被设为5。Therefore, the experimental results show that the model can effectively utilize the information in unlabeled data and can achieve better performance by using more unlabeled data. However, it should be noted that increasing the number of query sets may lead to an increase in computational costs, so a trade-off needs to be made between performance improvement and computing resources. The number of query set samples in other experiments is set to 5.

3.5与基线模型结果比较3.5 Comparison with Baseline Model Results

为了测试模型的性能，将本发明的模型与当前效果较好的一些模型进行比较，包括基于优化的小样本学习方法与基于度量的小样本学习方法，结果如表5所示。其中，Pos、Neg和All分别表示模型用正样本、负样本和所有样本进行进训练。In order to test the performance of the model, the model of the present invention is compared with some current models with better results, including the optimization-based small sample learning method and the metric-based small sample learning method, and the results are shown in Table 5. Among them, Pos, Neg and All respectively indicate that the model is trained with positive samples, negative samples and all samples.

表5 基线模型的总体性能Table 5 Overall performance of baseline models

在所有模型中，本发明提出的模型在查询集数量为5和10时都取得了最好的性能，这表明Few-SK能够有效地利用有限数据训练出准确模型。当查询集数量为5时，Few-SK比次优模型PAR的ROC-AUC值高4.13%；当查询集数量为10时，Few-SK比次优模型HSL-RG的ROC-AUC值高4.92%，远超目前最优模型的性能。Among all the models, the model proposed in this paper achieved the best performance when the number of query sets was 5 and 10, which shows that Few-SK can effectively use limited data to train an accurate model. When the number of query sets is 5, the ROC-AUC value of Few-SK is 4.13% higher than that of the second-best model PAR; when the number of query sets is 10, the ROC-AUC value of Few-SK is 4.92% higher than that of the second-best model HSL-RG, far exceeding the performance of the current optimal model.

整体来看，基于度量的小样本学习方法比基于优化的小样本学习方法效果要好。这可能是因为基于度量的小样本学习方法更加注重样本间的相似性，而这种相似性对于小样本学习来说至关重要；相比之下，基于优化的小样本学习方法通常通过设计并优化损失函数的方法使模型能在新任务上取得较好的泛化能力。Overall, the metric-based small sample learning method is better than the optimization-based small sample learning method. This may be because the metric-based small sample learning method pays more attention to the similarity between samples, which is crucial for small sample learning; in contrast, the optimization-based small sample learning method usually designs and optimizes the loss function to enable the model to achieve better generalization ability on new tasks.

以上仅为本发明的较佳实施例而已，并不用以限制本发明，凡在本发明的精神和原则之内，所作的任何修改、等同替换、改进等，均应包含在本发明的保护范围之内。The above are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc. made within the spirit and principles of the present invention should be included in the protection scope of the present invention.

Claims

1. A molecular property prediction method based on a metric small sample learning method, characterized in that it comprises the following steps:

S1, the prototype network framework is selected to build the model framework;

S2, using the prototype network framework combined with graph methods to learn the general features of graph data;

S3. Use low-rank representation to reduce the dimension of the data and obtain the subspace and embedded representation after dimension reduction;

S4, using contrastive learning to capture the similarities between a data point and its neighbors and other data points;

S5. Use the extended Sinkhorn K-means algorithm to incorporate labeled data into the algorithm and perform cluster analysis on the data.

2. The molecular property prediction method based on the metric small sample learning method according to claim 1 is characterized in that, in step S3, the model needs to perform subspace learning, wherein the subspace learning includes SVD decomposition and QR decomposition, and the formula of SVD decomposition is as follows:

;

Where A is the matrix to be decomposed/original matrix, U is a unitary matrix of size m×m, also a left singular matrix, V is a unitary matrix of size n×n, also a right singular matrix, Σ is a diagonal matrix containing singular values, size m×n, U and V are two orthogonal matrices, each row or column of which is called a left singular vector and a right singular vector, respectively, and T represents the transpose of the matrix;

The formula for QR decomposition is as follows:

;

Where Q is an m×n orthogonal matrix whose column vectors are mutually orthogonal and of unit length, satisfying ; That is, the product of the transpose of Q and Q is equal to the identity matrix I, and R is an n×n upper triangular matrix, that is, except for the elements on the main diagonal and above, all other elements are zero.

3. The molecular property prediction method based on the metric small sample learning method according to claim 2 is characterized in that in step S3, the specific steps of low-rank representation are as follows:

A1. Perform QR decomposition on the original data and reduce the dimension of the input matrix to a triangular matrix R, which contains the main structural information of the data;

A2. Perform SVD decomposition on the triangular matrix R to obtain the left singular matrix U;

A3. Based on U, we find the similarity matrix S. By calculating the product of the left singular matrix U and its transpose, we get the similarity matrix W, that is: ;

Among them, S is the similarity matrix, U is the left singular matrix, and T represents the transpose of the matrix;

A4. Set the diagonal elements of the similarity matrix S to zero;

A5. In order to make the sum of the similarity weights of each data point equal to 1, the similarity matrix S is normalized;

A6. Construct a dissimilarity matrix S1 based on the similarity matrix S. The dissimilarity matrix S1 is an all-1 matrix with the same shape as the similarity matrix S. For an N-way K-shot C-query problem, the calculation formula of the dissimilarity matrix S1 is as follows:

;

Where e is a vector of all 1s, E is the identity matrix, N is the number of categories, K is the number of samples in each category, C is the number of samples in the query set, and T represents the transpose of the matrix.

4. The molecular property prediction method based on the metric small sample learning method according to claim 3 is characterized in that in step S3, the step of data dimensionality reduction is as follows:

B1. By maximizing the similarity of similar features and minimizing the dissimilarity of dissimilar features, we learn the projection P of the feature in the subspace, that is:

;

Where F is the molecular feature, S1 is the dissimilarity matrix, S is the similarity matrix, and T represents the transpose of the matrix;

B2. Perform eigenvalue decomposition on P, extract its eigenvectors and retain the eigenvectors corresponding to the first k largest eigenvalues. Multiply the original data matrix with these eigenvectors to map the data F to a new coordinate space, namely:

;

Where V is the first k largest eigenvectors, and F1 is the feature map of the data in the new coordinate space.

5. The molecular property prediction method based on the metric small sample learning method according to claim 1 is characterized in that, in step S5, the extended Sinkhorn K-means algorithm refers to clustering data based on the Sinkhorn K-means algorithm, which is improved on the basis of the K-means algorithm, by introducing the optimal transmission matrix and Sinkhorn for iteration, first initializing K cluster centers, and then continuously updating the relationship between the cluster centers and data points through iteration.

6. The molecular property prediction method based on the metric small sample learning method according to claim 1, characterized in that the steps of performing cluster analysis using the extended Sinkhorn K-means algorithm are as follows:

C1. Initialize the class center. By referring to the prototype network, average the embeddings of all samples in the support set to obtain the prototype representation of each category, that is, the initial class center vector;

C2. Calculate the probability transfer matrix P based on the distance M between each data point and the cluster center, and use the Sinkhorn algorithm for iterative optimization to obtain a stable matrix;

C3. According to the probability transfer matrix P, the probability of each compound point belonging to each category is calculated to obtain a probability matrix;

C4. Calculate the new class center and update the class center. Multiply the probability transfer matrix with the embedding after data dimensionality reduction to obtain the weighted sum of samples in each category, and then divide it by the total number of samples to get the estimated value of the class center. ;

The difference between the calculated estimate and the original class center is used as the update amount of the class center, using the learning rate Adjust the update amount and add the adjusted updated value to the original class center On the top, complete the update of the class center, namely:

;

C5. Repeat the above steps until the cluster center no longer changes significantly or the maximum number of iterations is reached.

7. The molecular property prediction method based on the metric small sample learning method according to claim 6 is characterized in that in step C2, the Sinkhorn algorithm is iteratively optimized to obtain a stable matrix, and the specific steps are as follows:

D1. Use the value of the Gaussian kernel function as the initial value of the probability transfer matrix and perform a normalization operation on it, namely:

;

Where P is the probability transfer matrix, which includes both the data in the support set with labels and the data in the query set without labels. is the embedding of sample i, It is the center of class j. represents the Euclidean distance from each molecule to the cluster center, It is the bandwidth parameter of the Gaussian kernel function, which controls the range of data mapping and the influence of data points on the decision boundary;

D2. Gradually approximate the probability transfer matrix P through iterative updating. In each iteration, first calculate the current row weight vector and multiply it with the probability transfer matrix P to achieve row scaling. Then, according to the known label value, set the data of the previous support set to 0 or 1 at the corresponding position to constrain the known category. Similarly, perform the same operation in the column direction, and gradually adjust the element values of the transfer matrix P through multiplication and constraint operations to make it approach the optimal solution.

D3. By comparing the maximum absolute value of the difference between the result of summing the transmission matrix P in the second dimension in each iteration and the result of the previous iteration with a preset threshold, it is determined whether the current transmission matrix is already the optimal solution. If the maximum absolute value of the difference is greater than the threshold, it means that there is still a difference between the current transmission matrix and the result of the previous iteration, and it is necessary to continue iterative adjustment. If the maximum absolute value of the difference is less than or equal to the threshold, it is considered that the current transmission matrix is already the optimal solution and the iteration process can be ended.

8. According to the molecular property prediction method based on the metric small sample learning method according to claim 1, it is characterized in that in step S2, the prototype network framework is combined with the graph method through node-level and graph-level pre-training, so that the model captures the local and global information of the graph.