CN114818948B

CN114818948B - Data-mechanism driven material attribute prediction method of graph neural network

Info

Publication number: CN114818948B
Application number: CN202210483568.8A
Authority: CN
Inventors: 张桃红; 陈赛安; 陈晗
Original assignee: University of Science and Technology Beijing USTB
Current assignee: University of Science and Technology Beijing USTB
Priority date: 2022-05-05
Filing date: 2022-05-05
Publication date: 2023-02-03
Anticipated expiration: 2042-05-05
Also published as: CN114818948A

Abstract

The invention discloses a data-mechanism-driven material property prediction method based on a graph neural network, comprising the following steps: S1, obtaining the descriptor features and graph structure of the molecules of the material to be predicted; S2, screening out the final feature descriptors by feature engineering; S3, extracting molecular graph features at different levels using graph convolution and graph attention networks; S4, fusing the molecular graph features with the descriptor features in a feature fusion layer; S5, using a correction module to better fuse calculated values and experimental values, where a calculated value is a number obtained by first-principles simulation and an experimental value is the actual material property measured in an experiment; and S6, fusing the calculated values of the mechanism-driven model with the deep learning data-driven model for model inference, and outputting the value of the predicted property. The invention fuses the descriptor features and the graph structure features of molecules, overcoming the problems that graph structure data carry incomplete information and that descriptor features ignore molecular attributes.

Description

A Data-Mechanism-Driven Material Property Prediction Method Based on Graph Neural Network

Technical field

The invention relates to the technical fields of material discovery and graph neural networks, and in particular to a data-mechanism-driven material property prediction method based on a graph neural network.

Background

Molecular materials are widely used in healthcare, food, household chemicals, and other fields. Accelerating the discovery of new molecular materials is therefore of great significance for the development of science and society. At present, research on molecular materials is very time-consuming: a great deal of effort is needed to determine a given target property and to optimize the synthesis conditions of a molecule. Theoretical high-throughput computational methods are commonly used to predict molecular properties. Such mechanism-driven computational models, which come with a rational explanation, can effectively accelerate the discovery of new materials. However, a mechanism-driven computational model is a theoretical model with simplified parameters. It ignores the influence of material defects, the real environment, facilities, researcher skill, and similar factors, and these factors can make its predictions inaccurate.

In recent years, big-data-driven artificial intelligence methods have been widely applied in fields such as computer vision, natural language processing, medicine, and transportation. Because of their strong nonlinear modeling capability and the availability of large molecular datasets, machine learning and deep learning methods for predicting material properties have attracted broad attention from researchers. At present, artificial intelligence methods in materials science fall into two main groups. The first is descriptor-based machine learning prediction, which requires finding descriptors that correlate strongly with the target property. The second is the end-to-end deep learning model based on a graph neural network, a neural network that takes the molecular graph structure as input, extracts abstract information from it, and maps that information to the target property. However, graph neural networks share a problem with other machine learning methods: they generalize poorly and are easily limited by their training data. For the discovery of new materials in particular, the predictions of deep learning methods may be inaccurate. In addition, when a real molecule is abstracted into a graph structure, part of its three-dimensional structural information and its extranuclear electron information is lost, which also leads to inaccurate predictions. Fusing descriptor features with graph-structure features therefore addresses both the lack of graph structure information in descriptors and the poor generalization of graph structures.

Summary of the invention

The invention provides a data-mechanism-driven material property prediction method based on a graph neural network, to solve the problems that mechanism-driven computational models ignore material attributes and that deep learning networks generalize poorly. Descriptor features and graph structure features are fed into a deep learning network for training, and a mechanism-driven model is used to adjust the output of the deep learning network, which improves the accuracy of molecular property prediction.

To solve the above technical problems, the invention provides the following technical solution:

The invention provides a data-mechanism-driven material property prediction method based on a graph neural network, including:

S1, obtaining the descriptor features and graph structure of the molecules of the material to be predicted;

S2, screening out the final feature descriptors by feature engineering;

S3, extracting molecular graph features at different levels using graph convolution and graph attention networks;

S4, fusing the molecular graph features with the descriptor features in a feature fusion layer;

S5, using a correction module to better fuse calculated values and experimental values, where a calculated value is a number obtained by first-principles simulation and an experimental value is the actual material property measured in an experiment;

S6, fusing the calculated values of the mechanism-driven model with the deep learning data-driven model for model inference, and outputting the value of the predicted property.

Further, in S1, obtaining the descriptor features and graph structure of the molecules of the material to be predicted includes:

JSON files containing graph structure information and experimental data are obtained from the PubChem website through the RESTful API that the site provides; the graph structure information includes atom attributes and bond attributes. The corresponding descriptor features of each molecule are collected using the open-source cheminformatics software rdkit. The molecular descriptors cover the basic, electronic, and topological properties of the molecule.
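The retrieval step can be sketched as follows. This is a minimal illustration, not the patent's own code: it assumes the PubChem PUG REST record endpoint and the `PC_Compounds` / `atoms` / `bonds` layout of its JSON records, the compound ID 2244 is an arbitrary example, and the sample record is hand-written and heavily trimmed rather than downloaded. Descriptor computation with rdkit is not shown.

```python
import json

# Base URL of PubChem's PUG REST service (assumed endpoint layout).
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def compound_record_url(cid: int) -> str:
    """URL of the full JSON record for one PubChem compound ID."""
    return f"{PUG_REST}/compound/cid/{cid}/JSON"


def graph_from_record(record: dict) -> tuple[list[int], list[tuple[int, int, int]]]:
    """Return (atomic numbers, bonds) from a PubChem-style compound record.

    Each bond is (atom index, atom index, bond order) with 0-based indices.
    """
    compound = record["PC_Compounds"][0]
    atoms = compound["atoms"]
    # PubChem atom IDs start at 1; map them to 0-based node indices.
    index_of = {aid: i for i, aid in enumerate(atoms["aid"])}
    raw = compound.get("bonds", {})
    bonds = [
        (index_of[a], index_of[b], order)
        for a, b, order in zip(raw.get("aid1", []), raw.get("aid2", []), raw.get("order", []))
    ]
    return list(atoms["element"]), bonds


# Hand-written, trimmed record for a three-atom molecule (illustrative only).
SAMPLE = json.loads("""
{"PC_Compounds": [{
  "atoms": {"aid": [1, 2, 3], "element": [8, 1, 1]},
  "bonds": {"aid1": [1, 1], "aid2": [2, 3], "order": [1, 1]}
}]}
""")

elements, bonds = graph_from_record(SAMPLE)
print(compound_record_url(2244))
print(elements, bonds)
```

The atom list and the bond list correspond to the node and edge information that the later graph network consumes; an adjacency matrix can be built directly from the bond tuples.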

Further, in S2, screening out the final feature descriptors by feature engineering includes:

For the initial feature descriptors, polynomial features are constructed for feature screening, which makes the data easier to fit. The polynomial features are ranked by relevance using the Pearson correlation coefficient and the maximal information coefficient, and the final feature descriptors are selected from this ranking.

Further, the polynomial features include:

The feature descriptors can be expressed as X = {x₁, x₂, x₃, ..., x_k}, where k is the dimension of the features. The transformations x², x^(1/2), and ln(1+x) are applied to the features to construct the polynomial features [formula not reproduced in the source text].
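The screening idea can be illustrated with a small sketch. It applies the three transformations named in the text to one descriptor column and ranks the candidates by absolute Pearson correlation with the target. The function names and toy data are hypothetical, the maximal information coefficient is omitted (it needs a dedicated library), and the patent's actual selection threshold is not stated, so none is applied here.

```python
import math


def expand(x: float) -> dict[str, float]:
    """The three transforms named in the text, for one non-negative value."""
    return {"x^2": x * x, "x^1/2": math.sqrt(x), "ln(1+x)": math.log1p(x)}


def pearson(a: list[float], b: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def rank_features(column: list[float], target: list[float]) -> list[tuple[str, float]]:
    """Rank the raw column and its transforms by |Pearson r| against the target."""
    candidates = {"x": column}
    for name in ("x^2", "x^1/2", "ln(1+x)"):
        candidates[name] = [expand(v)[name] for v in column]
    scored = [(name, abs(pearson(vals, target))) for name, vals in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data: the target is exactly the square of the descriptor.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [v * v for v in x]
ranking = rank_features(x, y)
print(ranking[0][0])  # prints "x^2"
```

On this toy data the squared feature ranks first because it is perfectly linear in the target, which is the situation the polynomial expansion is meant to expose.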

Further, in S3, extracting molecular graph features at different levels using graph convolution and graph attention networks includes:

Feature extraction layers containing different numbers of convolutions are built to extract molecular features at different levels: shallow graph convolutions capture local feature information, while deep graph convolutions capture global, long-range feature information. Graph convolution aggregates the neighbor information of each atom in the molecule; the more graph convolution layers there are, the wider the range of feature information that can be captured. Graph attention adjusts the weights of the neighbor nodes around each atom. The input of the graph network is the triple {V, E, A}, where V is the feature matrix of the atoms that make up the molecule, E is the feature matrix of the edges, and A is the adjacency matrix of the graph. The feature pyramid is constructed as follows:

X₀ = GCN(V, E, A) (1)

X₁ = GAT(X₀) (2)

X₂ = GAT(GCN(GCN(X₁))) (3)

X₃ = GAT(GCN(GCN(GCN(GCN(X₂))))) (4)

X_Out = Set2Set(mean(X₁ + X₂ + X₃)) (5)

Here, GCN denotes a graph convolutional network layer and GAT denotes a graph attention network layer; X₀ denotes the initial features obtained by graph convolution, and X₁, X₂, X₃ denote the features obtained from graph convolutional networks of different depths; mean denotes averaging the features, and Set2Set denotes the global pooling readout operation.
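The statement that more graph convolution layers reach a wider neighborhood can be checked with a toy sketch. The aggregation below is a deliberately simplified stand-in for a GCN layer (plain neighbor averaging, with no learned weights, edge features, or attention), so it illustrates only the growth of the receptive field, not the network described here. All names and the five-atom chain are hypothetical.

```python
def mean_aggregate(features: list[float], neighbors: dict[int, list[int]]) -> list[float]:
    """One simplified graph-convolution step: average each node with its neighbors."""
    out = []
    for node, value in enumerate(features):
        group = [value] + [features[j] for j in neighbors[node]]
        out.append(sum(group) / len(group))
    return out


def stack(features: list[float], neighbors: dict[int, list[int]], layers: int) -> list[float]:
    """Apply the aggregation step `layers` times in sequence."""
    for _ in range(layers):
        features = mean_aggregate(features, neighbors)
    return features


# A 5-atom chain 0-1-2-3-4; only atom 0 carries a non-zero feature.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
signal = [1.0, 0.0, 0.0, 0.0, 0.0]

one_layer = stack(signal, chain, 1)
three_layers = stack(signal, chain, 3)
print([v > 0 for v in one_layer])     # atoms reached after 1 layer
print([v > 0 for v in three_layers])  # atoms reached after 3 layers
```

After one layer only the direct neighbor of atom 0 has received its signal; after three layers the signal has reached atoms up to three bonds away. This is why the deeper branches in equations (3) and (4) can be read as capturing longer-range structure than the shallow branch in equation (2).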

Further, in S4, fusing the molecular graph features with the descriptor features in a feature fusion layer includes:

The feature fusion layer fuses the descriptor features selected by feature engineering with the molecular graph features extracted by the graph neural network. Descriptor features are modeled on the electronic and peripheral structure of the molecule; the methods used to compute them carry a large amount of three-dimensional information and information on the interactions of extranuclear electrons, which can compensate for the information lost in the graph structure. The features are fused as follows:

X_Cat = Concat(X_Out, X_f) (6)

X_fusion = ReLU(FC(X_Cat)) (7)

X_Out denotes the graph structure features, X_f denotes the descriptor features, and X_Cat denotes their concatenation; Concat denotes concatenating the features along the feature dimension; FC denotes a fully connected layer; ReLU denotes the nonlinear activation function; X_fusion denotes the fused features.
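Equations (6) and (7) amount to a concatenation followed by one dense layer and a ReLU. A minimal sketch in plain Python follows; the weights, bias, and input values are made-up illustrative numbers, and the layer sizes used in the patent are not stated.

```python
def relu(values: list[float]) -> list[float]:
    """Element-wise ReLU."""
    return [max(0.0, v) for v in values]


def fully_connected(x: list[float], weights: list[list[float]], bias: list[float]) -> list[float]:
    """Dense layer: one output per weight row, y_i = w_i . x + b_i."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def fuse(x_out: list[float], x_f: list[float],
         weights: list[list[float]], bias: list[float]) -> list[float]:
    x_cat = list(x_out) + list(x_f)                      # equation (6)
    return relu(fully_connected(x_cat, weights, bias))   # equation (7)


x_out = [0.5, -1.0]                      # graph features (illustrative)
x_f = [2.0]                              # descriptor features (illustrative)
weights = [[1.0, 0.0, 1.0],              # hypothetical 2x3 weight matrix
           [0.0, 1.0, -1.0]]
bias = [0.0, 0.5]

x_fusion = fuse(x_out, x_f, weights, bias)
print(x_fusion)  # [2.5, 0.0]
```

The second output is clipped to zero by the ReLU, which shows where the nonlinearity acts on the mixed graph and descriptor signal.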

Further, in S5, using a correction module to better fuse calculated values and experimental values, where a calculated value is a number obtained by first-principles simulation and an experimental value is the actual material property measured in an experiment, includes:

To fuse the calculated and experimental values more effectively, the experimental values are used to construct the prediction label L for deep learning. The label L takes into account both the prediction of the deep learning model and the real experimental data. The prediction label is created as follows:

L = F(L_c, L_e) = αL_c + βL_e, with α + β = 1 (8)

Here, L_c is the output of the deep learning network, which is trained on the calculation results of the theoretical model; L_e denotes the real experimental data; α and β are the weighting coefficients of the calculation results and the experimental data; bz denotes the batch size used during model training, and Loss denotes the loss value.
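Equation (8) is a weighted blend of the two label sources. A minimal sketch follows; restricting α to the interval [0, 1] is an assumption added here for safety, since the text states only that α + β = 1, and the numeric values are illustrative.

```python
def corrected_label(l_c: float, l_e: float, alpha: float) -> float:
    """Equation (8) with beta = 1 - alpha.

    The range check on alpha is an assumption of this sketch; the source
    states only that alpha + beta = 1.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    beta = 1.0 - alpha
    return alpha * l_c + beta * l_e


label = corrected_label(l_c=2.0, l_e=4.0, alpha=0.25)
print(label)  # 3.5
```

With α = 1 the label reduces to the calculated value and with α = 0 to the experimental value, so α controls how strongly the training target is pulled toward experiment.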

Further, in S6, fusing the calculated values of the mechanism-driven model with the deep learning data-driven model for model inference and outputting the value of the predicted property includes:

The mechanism-driven model computes the property values of a molecule from formulas, and the results it produces are credible. The mechanism-driven model is used to modulate the deep learning model, improving the accuracy of the deep learning model's output. The model output is computed as follows:

O_D denotes the prediction of the deep learning network model, O_M denotes the result of the mechanism model, and [symbol missing in the source text] denotes the final output; ε₁ and ε₂ are the modulation coefficients of the mechanism model and the deep learning model respectively, with ε₁ + ε₂ = 1.
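The output formula itself is not reproduced in this text. The sketch below therefore rests on an explicit assumption: that the final output is the weighted sum ε₁·O_M + ε₂·O_D, which is one natural reading of two modulation coefficients that sum to 1. The function name and the numbers are hypothetical.

```python
def modulate(o_m: list[float], o_d: list[float], eps1: float) -> list[float]:
    """ASSUMED form: O = eps1 * O_M + eps2 * O_D, with eps2 = 1 - eps1.

    o_m: mechanism-model values, o_d: deep-learning predictions,
    one entry per molecule.
    """
    if len(o_m) != len(o_d):
        raise ValueError("both models must cover the same molecules")
    eps2 = 1.0 - eps1
    return [eps1 * m + eps2 * d for m, d in zip(o_m, o_d)]


final = modulate(o_m=[1.0, 2.0], o_d=[3.0, 6.0], eps1=0.5)
print(final)  # [2.0, 4.0]
```

Under this assumed form, a larger ε₁ places more trust in the mechanism model and a smaller ε₁ places more trust in the data-driven prediction.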

The beneficial effects of the technical solution provided by the invention include at least the following:

The above technical solution provides a method for obtaining the initial molecular descriptor features and graph structure; uses feature engineering to select the most relevant descriptor features from the initial molecular descriptor features; uses graph convolution and graph attention networks to extract molecular graph features at different levels; uses a feature fusion layer to fuse the molecular graph features with the descriptor features; uses a correction module to better fuse calculated values and experimental values; and fuses the calculated values of the mechanism-driven model with the deep learning data-driven model for model inference, outputting the value of the predicted property. The data calculated by the mechanism model and the features predicted by the deep learning model are combined and complement each other, which improves the accuracy of molecular property prediction.

Description of the drawings

To illustrate the technical solutions in the embodiments of the invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Figure 1 is a schematic diagram of the execution flow of the data-mechanism-driven material property prediction method based on a graph neural network provided by an embodiment of the invention;

Figure 2 is a schematic diagram of the overall processing flow of the data-mechanism-driven material property prediction method based on a graph neural network provided by an embodiment of the invention;

Figure 3 is a schematic diagram of obtaining the descriptor features and graph structure of the molecules of the material to be predicted, provided by an embodiment of the invention;

Figure 4 is a schematic diagram of the deep learning network structure provided by an embodiment of the invention;

Figure 5 is a schematic diagram of the correction module provided by an embodiment of the invention.

Detailed description

To make the purpose, technical solution, and advantages of the invention clearer, embodiments of the invention are described in further detail below with reference to the drawings.

As shown in Figure 1, an embodiment of the invention provides a data-mechanism-driven material property prediction method based on a graph neural network, which includes:

Note that molecular datasets usually contain only the structural features of molecules and no descriptor features, although descriptors are a good supplement to the three-dimensional structural information of a molecule. To address the problem that graph structure features are not rich enough, this embodiment therefore obtains the descriptor features of the molecules so that the two kinds of information complement each other.

Note that this embodiment uses feature engineering to select the most relevant features. Specifically, the transformations x², x^(1/2), and ln(1+x) are applied to the features to construct polynomial features. The features are then ranked by relevance using the Pearson correlation coefficient and the maximal information coefficient, and the final feature descriptors are selected.

Specifically, in this embodiment, molecular graph features at different levels are extracted as follows: feature extraction layers containing different numbers of convolutions are built to extract molecular features at different levels, with shallow graph convolutions capturing local feature information and deep graph convolutions capturing global, long-range feature information. The feature information from the different levels is fused by concatenation and fed into the global pooling layer to obtain the final graph features. Graph convolution aggregates the neighbor information of each atom in the molecule; the more graph convolution layers there are, the wider the range of feature information that can be captured. Graph attention adjusts the weights of the neighbor nodes around each atom. The input of the graph network is the triple {V, E, A}, where V is the feature matrix of the atoms that make up the molecule, E is the feature matrix of the edges, and A is the adjacency matrix of the graph. The feature pyramid is constructed as follows:

X₀ = GCN(V, E, A) (1)

X₁ = GAT(X₀) (2)

X₂ = GAT(GCN(GCN(X₁))) (3)

Specifically, in this embodiment, the feature fusion layer fuses the descriptor features selected by feature engineering with the molecular graph features extracted by the graph neural network. The two are concatenated along the same dimension, and a fully connected layer with a nonlinear activation function is used to fuse the features more effectively. Descriptor features are modeled on the electronic and peripheral structure of the molecule; the methods used to compute them carry a large amount of three-dimensional information and information on the interactions of extranuclear electrons, which can compensate for the information lost in the graph structure. The features are fused as follows:

X_Cat = Concat(X_Out, X_f) (6)

X_fusion = ReLU(FC(X_Cat)) (7)

Specifically, in this embodiment, to fuse the calculated and experimental values more effectively, the experimental values are used to construct the prediction label L for deep learning. The label L takes into account both the prediction of the deep learning model and the real experimental data. The prediction label is created as follows:

L = F(L_c, L_e) = αL_c + βL_e, with α + β = 1 (8)

S6, fusing the calculated values of the mechanism-driven model with the deep learning data-driven model for model inference, and outputting the value of the predicted property;

Specifically, in this embodiment, the mechanism-driven model computes the property values of a molecule from formulas, and the results it produces are credible. The mechanism-driven model is used to modulate the deep learning model, improving the accuracy of the deep learning model's output. The model output is computed as follows:

Further, the network structure of the model used by the data-mechanism-driven material property prediction method of this embodiment is shown in Figure 2.

Embodiment 1

In this embodiment, a collected molecular dataset is used to verify the effect of the data-mechanism-driven material property prediction method based on a graph neural network. The dataset contains 4208 entries, and each molecule comes with a graph structure, descriptors, an experimental value, and a calculated value. The graph structure includes the features of the atoms and the features of the bonds in the graph. The descriptors include: relative molecular mass, relative mass of heavy atoms, number of amino and hydroxyl groups, number of nitro groups, number of hydrogen acceptors, number of hydrogen donors, number of rotatable bonds, and number of valence electrons. The method used to collect the molecular dataset is shown in Figure 3.

The model is trained with a batch size of 32 using the Adam optimizer, with an initial learning rate of 0.001 that is later reduced to 0.0001 according to the results on the validation set. Training runs for 120 epochs. The dataset is split into training, validation, and test sets in a ratio of 8:1:1. The evaluation metric is the mean absolute error (MAE).
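The split and the metric can be made concrete with a short sketch. The shuffling seed and the rounding rule (integer division, with the remainder going to the test set) are assumptions of this sketch, since the text states only the 8:1:1 ratio and the dataset size of 4208.

```python
import random


def split_indices(n: int, seed: int = 0) -> tuple[list[int], list[int], list[int]]:
    """Shuffle 0..n-1 and cut it 8:1:1 into train / validation / test.

    The seed and the rounding rule are assumptions of this sketch.
    """
    order = list(range(n))
    random.Random(seed).shuffle(order)
    n_train = n * 8 // 10
    n_val = n // 10
    return order[:n_train], order[n_train:n_train + n_val], order[n_train + n_val:]


def mean_absolute_error(predicted: list[float], actual: list[float]) -> float:
    """MAE: the average of |prediction - ground truth|."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


train_idx, val_idx, test_idx = split_indices(4208)
print(len(train_idx), len(val_idx), len(test_idx))             # 3366 420 422
print(mean_absolute_error([1.0, 2.0, 4.0], [1.5, 2.0, 3.0]))   # 0.5
```

Because 4208 is not divisible by 10, the three parts cannot match the ratio exactly; here the two leftover samples fall into the test set.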

Table 1 Experimental data (graph structure and descriptors) and their interpretation

The specific implementation steps are:

(1) Obtain the descriptor features and graph structure of the molecules of the material to be predicted. The molecular feature collection method is shown in the figure: the graph structure information and experimental data are obtained from the PubChem website, and the molecular descriptor features are collected using the open-source cheminformatics software rdkit.

(2) Screen the molecular descriptor features using feature engineering.

(3) Use graph convolution and graph attention networks of different depths to extract the structural feature information of the molecular graph.

(4) Fuse the descriptor features with the output of the graph neural network. Use the correction module to build more accurate labels from the experimental values, which guide the training of the graph neural network.

(5) Fuse the calculated values of the mechanism-driven model with the deep learning data-driven model for model inference, and output the value of the predicted property.

Further, the data-mechanism-driven material property prediction method of this embodiment is denoted MD-GNN, and the detailed structure of the network is shown in Figure 4. To demonstrate that the feature fusion method proposed in this embodiment improves the accuracy of molecular property prediction, three groups of experiments were set up. The first uses only the graph structure for property prediction, with the models ene-s2s, GAT, GraphSage, and SchNet. The second uses only molecular descriptors, with the traditional machine learning models random forest, gradient boosting decision tree, and multilayer perceptron. The third is the MD-GNN model, which fuses the two. To control for confounding factors when verifying the accuracy of feature fusion, the correction module was not used, and the calculated values were used directly as training labels.

Table 2 Prediction accuracy of roughness category from different angles (%)

Table 2 gives the prediction accuracy of roughness categories from different angles. Using graph structure plus descriptor features works better than using either the graph structure or the descriptor features alone. The loss of the fusion model is 0.151 lower than that of the ene-s2s model alone, 0.156 lower than that of the GAT model, 0.175 lower than that of the GraphSage model, and 0.158 lower than that of the SchNet model. Comparing the graph-structure-only and descriptor-only results also shows that the loss with graph structure features is lower than with descriptor features, which indicates that graph structure features are the more important of the two for molecular structures.

To demonstrate the effectiveness of the correction module, the proposed MD-GNN model was compared experimentally with ene-s2s, GAT, GraphSage, and SchNet. In the correction block, the labels are built from experimental data and calculated data. The mechanism-based model is then fused into the artificial intelligence model through the M-D fusion block to modulate the model's predictions. Table 3 lists the experimental results with and without the correction module.

Table 3 Comparison of experimental results with and without the correction module

To observe the performance of the correction block, Table 3 compares ene-s2s, GAT, GraphSage, and SchNet with and without the correction block. Without correction, the calculated values and the experimental values are each used as labels, and the error between the model output and the experimental values of the test set is computed. Models trained with only calculated values as labels show large errors against the actual experimental values. The error is also high when experimental values are used as labels, because experimental data are not perfectly clean, which makes the distribution hard for the model to fit.

The MAE values in Table 4 show that incorporating mechanism-driven data is essential for predicting real experimental properties, and that the error decreases in every case once the proposed correction block is added.

综上，本实施例的方法将分子描述符特征与图结构特征相融合输入到网络模型中预测分子的属性。同时，为了更好的提升图神经网络的特征学习能力，引入了修正模块，缩小了计算值与实验值之间的差距。To sum up, the method of this embodiment fuses molecular descriptor features and graph structure features into the network model to predict the properties of molecules. At the same time, in order to better improve the feature learning ability of the graph neural network, a correction module is introduced to narrow the gap between the calculated value and the experimental value.

还需要说明的是，在本文中，诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来，而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、物品或者终端设备不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、物品或者终端设备所固有的要素。在没有更多限制的情况下，由语句“包括一个……”限定的要素，并不排除在包括所述要素的过程、方法、物品或者终端设备中还存在另外的相同要素。It should also be noted that in this article, relational terms such as first and second etc. are only used to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply that these entities or operations Any such actual relationship or order exists between. The term "comprises", "comprises" or any other variation thereof is intended to cover a non-exclusive inclusion such that a process, method, article or end-equipment comprising a set of elements includes not only those elements but also items not expressly listed other elements, or also include elements inherent in such a process, method, article, or end-equipment. Without further limitations, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or terminal device comprising said element.

Finally, it should be noted that the above describes preferred embodiments of the present invention. Although preferred embodiments have been described, those skilled in the art, once aware of the basic inventive concept, may make improvements and refinements without departing from the principles of the present invention, and such improvements and refinements shall also fall within the scope of protection of the present invention. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the embodiments of the present invention.

Claims

1. A data-mechanism driven material property prediction method for a graph neural network, comprising:

S1, obtaining descriptor features and graph structure information of the molecules of a material to be predicted, wherein JSON files containing the graph structure information and experimental data are obtained from a website, the graph structure information comprises atom attributes and bond attributes, the descriptor features corresponding to each molecule are collected online using the open-source cheminformatics software RDKit, and the molecular descriptors cover basic properties, electronic properties and topological properties of the molecule;

S2, screening out final feature descriptors by feature engineering, wherein polynomial features are constructed from the initial feature descriptors for feature screening, the polynomial features are ranked by correlation using the Pearson correlation coefficient and the maximum information coefficient, and the final feature descriptors are screened out;

the polynomial features comprising:

the feature descriptors are represented by $X = \{x_1, x_2, x_3, \ldots, x_k\}$, where $k$ denotes the dimension of the features and $x$ denotes a descriptor feature; the transformations $x^2$, $x^{1/2}$ and $\ln(1 + x)$ are applied to the descriptor features to construct the polynomial features;

S3, extracting molecular graph features at different levels using graph convolution and a graph attention network;

S4, fusing the molecular graph features with the feature descriptors using a feature fusion layer;

S5, fusing the calculated value and the experimental value using a correction module, wherein the calculated value is a numerical value obtained by first-principles simulation, and the experimental value is an actual material property measured by experiment;

S6, fusing the calculated value of the mechanism-driven model with the deep learning data-driven model for model inference, and outputting the numerical value of the predicted property;

the mechanism-driven model calculates the property value of the molecule through a formula, and the result it produces is interpretable and credible; the deep learning model is modulated by the mechanism-driven model, which enhances the accuracy of the output of the deep learning model, and the model output is calculated as follows:

$O = \varepsilon_1 O_M + \varepsilon_2 O_D$, where $O_D$ represents the prediction of the deep learning network model, $O_M$ represents the result of the mechanism model, and $O$ represents the final output; $\varepsilon_1$ and $\varepsilon_2$ represent the modulation coefficients of the mechanism model and the deep learning model, respectively, where $\varepsilon_1 + \varepsilon_2 = 1$.
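The modulation in S6 can be read as a weighted combination of the two model outputs whose coefficients sum to 1. The sketch below assumes the form $O = \varepsilon_1 O_M + \varepsilon_2 O_D$; the function name, the restriction of the coefficient to the interval [0, 1], and the example values are assumptions made for illustration and are not stated in the claim.

```python
def modulated_output(o_mechanism, o_deep, eps_mechanism):
    """Combine the mechanism-model result and the deep-learning prediction.

    eps_mechanism plays the role of epsilon_1; epsilon_2 is derived as
    1 - epsilon_1 so that the two coefficients always sum to 1.
    """
    # Assumption: coefficients are kept in [0, 1]; the claim only requires
    # that they sum to 1.
    if not 0.0 <= eps_mechanism <= 1.0:
        raise ValueError("eps_mechanism must lie in [0, 1]")
    eps_deep = 1.0 - eps_mechanism
    return eps_mechanism * o_mechanism + eps_deep * o_deep


# Hypothetical values: mechanism model gives 2.0, deep model gives 3.0.
final = modulated_output(o_mechanism=2.0, o_deep=3.0, eps_mechanism=0.25)
```

Deriving the second coefficient from the first, rather than passing both, makes the constraint that they sum to 1 impossible to violate by construction.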

2. The data-mechanism driven material property prediction method of the graph neural network as claimed in claim 1, wherein in S3, extracting molecular graph features of different levels by using graph convolution and graph attention network comprises:

feature extraction layers with different numbers of graph convolutions are constructed to extract molecular features at different levels, wherein shallow graph convolutions obtain local feature information and deep graph convolutions obtain long-range global feature information; graph convolution is used to aggregate the neighbor information of the atoms in a molecule, and the more graph convolution layers there are, the wider the range of feature information obtained; graph attention is used to adjust the weights of the neighbor nodes around each atom; the input of the graph network is a triple {V, E, A}, wherein V represents the feature matrix of the atoms forming the molecule, E represents the feature matrix of the edges, and A represents the adjacency matrix of the graph.
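The effect of stacking graph convolutions can be illustrated with a simplified sketch. It assumes mean aggregation over each atom's neighbors (including the atom itself) as the convolution and a softmax over dot-product scores as a stand-in for graph attention; neither is claimed to be the exact layer used in the method. The edge feature matrix E is omitted for brevity, and the `depths` parameter is hypothetical.

```python
import math


def neighbours_of(i, adjacency):
    """Indices of atom i and the atoms bonded to it."""
    return [j for j in range(len(adjacency)) if adjacency[i][j] or j == i]


def graph_conv(features, adjacency):
    """One simplified graph convolution: average each atom with its neighbours."""
    out = []
    for i in range(len(features)):
        nbrs = neighbours_of(i, adjacency)
        dim = len(features[i])
        out.append([sum(features[j][d] for j in nbrs) / len(nbrs) for d in range(dim)])
    return out


def attention_aggregate(features, adjacency):
    """Re-weight neighbours with a softmax over dot-product scores."""
    out = []
    for i in range(len(features)):
        nbrs = neighbours_of(i, adjacency)
        scores = [sum(a * b for a, b in zip(features[i], features[j])) for j in nbrs]
        top = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        dim = len(features[i])
        out.append([sum(w * features[j][d] for w, j in zip(weights, nbrs)) / total
                    for d in range(dim)])
    return out


def multi_level_features(features, adjacency, depths=(1, 3)):
    """Extract one feature set per depth: shallow = local, deep = longer range."""
    levels = []
    for depth in depths:
        h = features
        for _ in range(depth):
            h = graph_conv(h, adjacency)
        levels.append(attention_aggregate(h, adjacency))
    return levels


# A chain of four atoms 0-1-2-3; only atom 3 carries a non-zero feature.
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
V = [[0.0], [0.0], [0.0], [1.0]]
shallow, deep = multi_level_features(V, A)
```

With one convolution, atom 0 sees only atoms 0 and 1 and its feature stays at zero; with three convolutions, information from atom 3 reaches atom 0, which is the sense in which deeper stacks capture longer-range information.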

3. The data-mechanism driven material property prediction method of graph neural network of claim 1, wherein in S4, fusing molecular graph features with feature descriptors using a feature fusion layer, comprises:

the feature fusion layer fuses the descriptor features obtained through feature engineering screening with the molecular graph features obtained through graph neural network feature extraction, wherein the descriptor features are modeled based on electrons and peripheral structures.
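One simple way to realize such a fusion layer is to pool each level of atom features into a molecule-level vector and concatenate the result with the screened descriptor vector. The claim does not specify the fusion operator, so mean pooling and concatenation are assumptions here, as are the function names and the example values.

```python
def mean_pool(atom_features):
    """Average per-atom feature vectors into one molecule-level vector."""
    n = len(atom_features)
    dim = len(atom_features[0])
    return [sum(row[d] for row in atom_features) / n for d in range(dim)]


def fuse(graph_levels, descriptors):
    """Concatenate pooled graph features from every level with the descriptors."""
    fused = []
    for level in graph_levels:
        fused.extend(mean_pool(level))
    fused.extend(descriptors)
    return fused


# Hypothetical two-atom molecule with 2-dimensional atom features per level.
shallow = [[1.0, 0.0], [3.0, 2.0]]
deep = [[0.5, 0.5], [1.5, 2.5]]
descriptors = [0.2, 0.7, 1.1]   # screened descriptor features

fused = fuse([shallow, deep], descriptors)
```

The fused vector would then be passed to the downstream prediction layers; a learned projection could replace plain concatenation without changing the overall structure.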

4. The data-mechanism driven material property prediction method of graph neural network of claim 1, wherein in S5, fusing the calculated value and the experimental value using a correction module, the calculated value being a numerical value obtained by first-principles simulation and the experimental value being an actual material property measured by experiment, comprises:

the experimental values are used to construct a prediction label L for deep learning.