CN119008034A

CN119008034A - Drug toxicity prediction method based on drug fusion correlation data and graph convolution network

Info

Publication number: CN119008034A
Application number: CN202410922164.3A
Authority: CN
Inventors: 苏苒; 李婷玉
Original assignee: Hefei Institute Of Innovation And Development Tianjin University
Current assignee: Hefei Institute Of Innovation And Development Tianjin University
Priority date: 2024-07-10
Filing date: 2024-07-10
Publication date: 2024-11-22
Anticipated expiration: 2044-07-10
Also published as: CN119008034B

Abstract

The invention relates to a method for predicting drug toxicity based on drug fusion correlation data and a graph convolutional network. It belongs to the technical field combining machine learning with medicinal chemistry and toxicogenomics, and addresses the problem of how to improve the performance of drug toxicity prediction. Twelve-fold cross-validation is used to mitigate the small size of the data set. The constructed model is compared against an existing drug toxicity prediction model, and the test results show that feature extraction and classification with the graph convolutional neural network and gene expression data yield higher prediction accuracy.

Description

Drug toxicity prediction method based on drug fusion correlation data and graph convolution network

Technical Field

The invention belongs to the technical field combining machine learning with medicinal chemistry and toxicogenomics, and relates to a method for predicting drug toxicity based on drug fusion correlation data and a graph convolutional network.

Background

Drug toxicity refers to the adverse reactions or effects that a drug may cause at therapeutic doses. It may arise from the physicochemical properties of the drug itself, from toxic metabolites produced during drug metabolism, or from interactions between the drug and human tissues or organs. Toxicity prediction is a major problem in drug development: every drug must undergo reliable side-effect testing, toxicity testing in particular, to ensure that it is safe and reliable in actual use. Existing toxicity prediction methods rely mainly on in vitro tests, animal tests, and computational simulation. Although these methods provide useful information, they cannot fully reproduce the complex metabolic transformations and toxic reactions that occur in the human body, so their predictions are subject to errors and limitations. Prediction methods therefore need continued improvement, and toxicity should be evaluated by combining several approaches in order to predict drug safety and efficacy more accurately.

Because machine learning can construct abstract features from data, many researchers now combine regression or classification models with chemical structure and biological activity data to predict toxicity. This makes it possible to relate the chemical structure of a drug to its biological activity and thus assess its toxic potential more accurately. Structural information can offer a convincing molecular-level explanation of drug toxicity, but it is of limited use in complex biological environments, mainly because interactions between different molecular structures are often difficult to quantify; in such cases pharmacogenomic data can provide more valuable information for toxicity prediction. Although these studies have predicted drug toxicity successfully using many kinds of data, they generally use response data at only a single concentration. Failing to exploit response data at different concentrations limits the ability to evaluate and predict drug toxicity comprehensively and accurately.

In addition, the complexity of the biological environment gives rise to many kinds of links between chemical substances and organisms. Interactions between proteins are the basis of biological function, so studying them in depth is of great significance to the life sciences. Protein interactions are interactions that occur within a cell or in vivo through direct contact or indirect association between different proteins. They may be protein-to-protein interactions or interactions between proteins and other biomolecules (e.g., nucleic acids or small-molecule compounds). Studying them gives a deeper understanding of protein function, regulatory mechanisms, and the complex interrelationships among biological processes.

The protein-protein interaction (PPI) network is a network graph model for studying protein interactions, with the structural properties of small-world behavior, scale-free degree distribution, and functional modularity. However, PPI networks suffer from noise and related problems, so the data quality is low, which in turn degrades prediction accuracy. Combining drug chemical structure with the PPI network, and using a graph convolutional neural network to perform feature extraction and classification together with gene expression data, is therefore of significant value for predicting drug toxicity from fused compound structure and gene expression data.

Disclosure of Invention

The technical solution of the invention addresses the problem of how to improve the performance of drug toxicity prediction.

The invention solves this technical problem through the following technical solution:

A method for predicting drug toxicity based on drug fusion correlation data and a graph convolutional network comprises the following steps:

Step 1, screening the data on the Open TG-GATEs website, selecting the toxicity data of drugs in the TG-GATEs database and the gene expression data corresponding to each drug, and annotating the gene expression data to mark target proteins, non-target proteins, and drug targets;

Step 2, constructing new drug correlation data by calculating the closeness between drugs and proteins, calculating the correlation between drugs, and calculating the chemical similarity of drugs;

Step 3, constructing a graph convolutional neural network model, determining a training set and a test set by twelve-fold cross-validation, training the graph convolutional neural network model on the training set, testing the trained model on the test set, and reporting performance evaluation indexes.

Further, the closeness between a drug and a protein in Step 2 is calculated by the following formula:

where the left-hand side is the closeness between drug m and protein v, v_n is a known target of the given drug m, L_{vv_n} is the shortest distance between v and v_n in the PPI network, the term applied to L_{vv_n} converts the protein-to-protein distance into a protein-to-protein closeness, and T(m) denotes the set of all targets of drug m.

Further, the correlation between drugs in Step 2 is calculated as follows:

Given drugs m₁ and m₂, the correlation between the two drugs is the average closeness between their known targets, calculated as follows:

where No.T(m₁) is the total number of known targets of drug m₁ and No.T(m₂) is the total number of known targets of drug m₂.

Further, the chemical similarity of drugs in Step 2 is calculated as follows:

The drug chemical similarity is calculated using the Tanimoto coefficient, with the following formula:

where G and H are two different compounds.

Further, the new drug correlation data in Step 2 are constructed as follows:

The inter-drug correlation is fused with the inter-drug chemical similarity to obtain the drug correlation data S′ that incorporates the chemical similarity relationship:

where S is the inter-drug correlation data, T is the inter-drug chemical similarity data, and S′ is the new drug correlation data obtained by fusing the two.

Further, the graph convolutional neural network model in Step 3 is constructed as follows:

The inter-drug correlation data S and the new drug correlation data S′ are used as inputs to the graph convolution layers. There are two graph convolution layers, which extract features from the graph-structured data. The layers use the localized first-order approximation of Chebyshev polynomials to simplify the computation of the graph convolution kernel, with the following formula:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2}\,\tilde{A}\,\tilde{D}^{-1/2}\,H^{(l)}\,W^{(l)}\right)$$

where H is the output of a graph convolution layer, $H^{(l+1)}$ is the feature matrix of layer l+1, and $H^{(l)}$ is the feature matrix of layer l, which for the input layer is the N×X feature matrix K; $\tilde{A} = A + I$ is the adjacency matrix A with self-connections added; $\tilde{D}$ is the degree matrix of $\tilde{A}$; $\sigma$ is a nonlinear activation function; and $W^{(l)}$ is the weight matrix of layer l.

The graph has N nodes, each with its own features. The node features together form the N×X matrix K, and the relationships between nodes form the N×N adjacency matrix A. Inputting S and S′ produces the final prediction result.

Further, the performance evaluation indexes in Step 3 comprise six indexes: accuracy (ACC), recall (Recall), specificity (Specificity), precision (Precision), sensitivity (Sensitivity), and F1 score (F1), calculated as follows:

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives of the model, respectively.
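As an illustration of how the six indexes relate to the four counts, the following is a minimal sketch that uses the standard confusion-matrix definitions, which are assumed here to match the patent's formulas. The function name `classification_metrics` and the example counts are hypothetical and do not come from the patent.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute the six indexes from confusion-matrix counts (standard definitions)."""

    def ratio(numerator: float, denominator: float) -> float:
        # Guard against empty classes so the sketch never divides by zero.
        return numerator / denominator if denominator else 0.0

    recall = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "ACC": ratio(tp + tn, tp + tn + fp + fn),
        "Recall": recall,
        "Specificity": ratio(tn, tn + fp),
        "Precision": precision,
        "Sensitivity": recall,  # same quantity as recall under the standard definition
        "F1": ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical counts, for illustration only (not results from the patent).
example = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(example)
```

Under these definitions sensitivity and recall coincide, so the six indexes carry five independent values.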

Further, the training set and the test set in Step 3 are determined by twelve-fold cross-validation as follows: the data are divided into 12 independent, identically distributed subsets; 11 of them are selected as the training set and the remaining 1 as the test set; 12 classifiers are trained in this way; and the average of the 12 classifiers' test set results is taken as the performance evaluation index of the graph convolutional neural network model.

An electronic device comprises a memory and a processor, wherein the memory stores a program that supports the processor in performing the above method of predicting drug toxicity based on drug fusion correlation data and a graph convolutional network, and the processor is configured to execute the program stored in the memory.

A storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above method of predicting drug toxicity based on drug fusion correlation data and a graph convolution network.

The invention has the advantages that:

The invention uses the protein interaction network to obtain the correlation between drugs, analyzes the chemical structure of the drugs, and combines the inter-drug correlation with the chemical similarity. Feature extraction and classification are then performed by the graph convolutional neural network model together with the gene expression data to obtain the prediction result. Twelve-fold cross-validation is used to mitigate the small size of the data set. The constructed model is compared against an existing drug toxicity prediction model, and the test results show that feature extraction and classification with the graph convolutional neural network and gene expression data yield higher prediction accuracy.

Drawings

FIG. 1 is a flow chart of a method for predicting drug toxicity based on drug fusion correlation data and a graph convolution network;

FIG. 2 is a schematic diagram of drug and protein actions;

FIG. 3 is a graph convolution neural network model index comparison graph constructed based on drug related data S and new drug related data S';

FIG. 4 is a graph comparing indices of the PSO-optimized SVM model and the graph convolution neural network model.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions in the embodiments of the present invention will be clearly and completely described in the following in conjunction with the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The technical scheme of the invention is further described below with reference to the attached drawings and specific embodiments:

Example 1

As shown in FIG. 1, a method for predicting drug toxicity based on drug fusion correlation data and a graph convolutional network according to an embodiment of the present invention includes the following steps:

Step 1, constructing a data set

Step 1-1, preparing a data set. The data set comprises the drug toxicity data M contained in the TG-GATEs database and the gene expression data V corresponding to each drug.

In Step 1-1, TG-GATEs is the database of a large research program led by Astellas Pharma Inc. of Japan. It stores drug toxicity data, mainly blood biochemistry and histopathology data obtained in vivo (rat) and in vitro (primary hepatocytes of rats and humans). The required gene expression data were obtained from Toxygates (an online platform for analyzing Open TG-GATEs data), the interaction relationships between proteins were constructed using the STRING database (an online database and tool for analyzing protein interactions), and the chemical formulas of the 108 test drugs were obtained from DrugBank (a database providing bioinformatics and cheminformatics data).

In this embodiment, screening the data on the Open TG-GATEs website yields toxicity data M for 108 drugs and the gene expression data V corresponding to each drug; the expression data have 6009 dimensions.

Step 1-2, annotating the gene expression data V from Step 1-1 to mark target proteins, non-target proteins, and drug targets.

In Step 1-2, because the drug similarity measure plays an important role in prediction, the gene expression data V must be annotated and the closeness between drugs and proteins determined. As shown in FIG. 2, the drug is represented by node m and the target protein by node v; non-target proteins are represented by nodes v_n (n = 1, 2, 3, …), and the proteins inside the solid ellipse are the drug targets, through which the drug can be associated with the target protein. The dashed edges of the protein-protein interaction network represent the information in genomic space that interacts with the drug targets, and the closeness between drug m and protein v can be defined as any number in [0, 1].

Step 2, constructing drug correlation data that incorporate drug chemical structure

Constructing the drug correlation data S′ involves calculating the closeness between drugs and proteins based on the PPI network, using that closeness to calculate the inter-drug correlation S, and calculating the drug chemical similarity T, as follows.

Step 2-1, calculating the closeness between a drug and a protein. The closeness is defined through the PPI network, which connects the pharmacological space and the genomic space:

where the left-hand side of formula (1) is the closeness between drug m and protein v, v_n is a known target of the given drug m, L_{vv_n} is the shortest distance between v and v_n in the PPI network, and the term applied to L_{vv_n} converts the protein-to-protein distance into a protein-to-protein closeness; the distance data can be obtained directly from the STRING database. T(m) denotes the set of all targets of drug m.

Formula (1) shows that the closeness between drug m and protein v is the sum of the closeness between v and every known target of m. If two proteins are not linked, L_{vv_n} is defined as infinity. The closeness between a drug and a target protein is obtained in this way.
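The structure described for formula (1), a sum over the known targets of a term that decays with shortest-path distance and vanishes for unlinked proteins, can be sketched as follows. This is a minimal sketch under stated assumptions: the distance is taken as an unweighted hop count found by breadth-first search (the patent obtains its distances from STRING), and the distance-to-closeness transform `exp(-d)` is a hypothetical placeholder, since the exact transform in formula (1) is not reproduced in this text. All names are illustrative.

```python
import math
from collections import deque


def shortest_distances(ppi: dict, source: str) -> dict:
    """Breadth-first search: hop counts from `source` to every reachable protein."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in ppi.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def closeness(ppi: dict, targets: set, protein: str,
              transform=lambda d: math.exp(-d)) -> float:
    """Sum the transformed shortest distances between `protein` and each known target.

    `transform` is an ASSUMED placeholder for the patent's distance-to-closeness term.
    Targets that cannot be reached have infinite distance and contribute nothing.
    """
    dist = shortest_distances(ppi, protein)
    return sum(transform(dist[t]) for t in targets if t in dist)


# Toy PPI network as an adjacency list; protein D is isolated.
toy_ppi = {"A": ["B"], "B": ["A", "C"], "C": ["B"], "D": []}
print(closeness(toy_ppi, {"A", "D"}, "C"))  # only target A (2 hops away) contributes
```

Treating an unreachable target as contributing zero mirrors the statement that L_{vv_n} is infinite when two proteins are not linked.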

Step 2-2, calculating the correlation between drugs

The inter-drug correlation data are calculated according to formula (2): given drugs m₁ and m₂, the correlation between the two drugs is defined as the average closeness between their known targets.
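The averaging step can be illustrated with a short sketch. It assumes that the average is taken over every pair formed by one known target of each drug, that is, normalized by No.T(m₁) × No.T(m₂); the exact form of formula (2) is not reproduced in this text, so this normalization, the function name `drug_correlation`, and the toy closeness values are all assumptions.

```python
def drug_correlation(targets_1, targets_2, pair_closeness) -> float:
    """Mean closeness over all pairs made of one target of each drug.

    `pair_closeness(a, b)` is a caller-supplied function returning the
    protein-to-protein closeness; its form is not fixed by this sketch.
    """
    if not targets_1 or not targets_2:
        return 0.0  # a drug without known targets has no defined correlation here
    total = sum(pair_closeness(a, b) for a in targets_1 for b in targets_2)
    return total / (len(targets_1) * len(targets_2))


# Hypothetical closeness values between target proteins, for illustration only.
toy_closeness = {("P1", "P3"): 0.5, ("P2", "P3"): 0.25}
s_12 = drug_correlation(["P1", "P2"], ["P3"], lambda a, b: toy_closeness[(a, b)])
print(s_12)
```

Evaluating this for every pair of drugs would fill the N×N correlation matrix S.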

Step 2-3, calculating the chemical similarity of the medicines

To obtain the chemical similarity between drugs, the MACCS molecular fingerprint of each target drug, of length 167, is derived from its chemical formula, and the fingerprints are compared. The drug chemical similarity is calculated using the Tanimoto coefficient:

$$T(G, H) = \frac{|G \cap H|}{|G \cup H|} \qquad (3)$$

where G and H are two different compounds, each represented by the set of bits that are set in its fingerprint.
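For binary fingerprints the Tanimoto coefficient is the number of bits set in both fingerprints divided by the number of bits set in at least one of them. The sketch below represents each fingerprint as a set of on-bit positions; the bit positions shown are made up for illustration, and generating real MACCS keys would require a cheminformatics toolkit, which is outside this sketch.

```python
def tanimoto(fp_g: set, fp_h: set) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    shared = len(fp_g & fp_h)
    union = len(fp_g) + len(fp_h) - shared
    # Two empty fingerprints share nothing; define their similarity as 0.
    return shared / union if union else 0.0


# Hypothetical on-bit positions within a 167-bit MACCS fingerprint.
compound_g = {1, 5, 42, 160}
compound_h = {5, 42, 99}
print(tanimoto(compound_g, compound_h))  # 2 shared bits out of 5 distinct bits -> 0.4
```

The value lies in [0, 1] and equals 1 only when the two fingerprints are identical and non-empty.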

Step 2-4, constructing new drug related data S'

Steps 2-2 and 2-3 yield the inter-drug correlation and the inter-drug chemical similarity. The two are fused to obtain the drug correlation data S′ that incorporates the chemical similarity relationship:

As formula (4) shows, the correlation data and the chemical similarity data are fused mainly by taking their weighted average. The aim is to reduce the impact of low-quality PPI network data and thereby improve the predictive performance of the model.
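A weighted average of two N×N matrices can be sketched as below. The weight `alpha` and its default of 0.5 are assumptions made for illustration; the weights actually used in formula (4) are not reproduced in this text, and the matrix values are made up.

```python
def fuse(s_matrix, t_matrix, alpha: float = 0.5):
    """Element-wise weighted average of correlation S and chemical similarity T.

    `alpha` is a HYPOTHETICAL weight on S; (1 - alpha) is the weight on T.
    """
    n = len(s_matrix)
    return [
        [alpha * s_matrix[i][j] + (1.0 - alpha) * t_matrix[i][j] for j in range(n)]
        for i in range(n)
    ]


# Toy 2x2 matrices for two drugs, for illustration only.
S = [[1.0, 0.2], [0.2, 1.0]]
T = [[1.0, 0.6], [0.6, 1.0]]
S_prime = fuse(S, T)
print(S_prime)
```

Because T comes from chemical structure alone, it is unaffected by noise in the PPI network, which is the stated motivation for the fusion.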

Step 3, constructing and training a graph convolutional neural network model and evaluating

Graph convolutional neural network models are constructed based on the drug correlation data and on the new drug correlation data.

Step 3-1, constructing a graph convolution neural network model

The drug correlation data S and the new drug correlation data S′ are used as inputs to the graph convolution layers. There are two graph convolution layers, which extract features from the graph-structured data. The layers use the localized first-order approximation of Chebyshev polynomials to simplify the computation of the graph convolution kernel, which reduces the computational cost of evaluating the Laplacian matrix in graph convolution; the calculation is given by formula (5). The graph has N nodes, each with its own features. The node features together form the N×X matrix K, and the relationships between nodes form the N×N adjacency matrix A. After S and S′ are input, the final prediction result is generated:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2}\,\tilde{A}\,\tilde{D}^{-1/2}\,H^{(l)}\,W^{(l)}\right) \qquad (5)$$

where H denotes the output of a graph convolution layer, $H^{(l+1)}$ is the feature matrix of layer l+1, and $H^{(l)}$ is the feature matrix of layer l, which for the input layer is the N×X feature matrix K; $\tilde{A} = A + I$ is the adjacency matrix A with self-connections added; $\tilde{D}$ is the degree matrix of $\tilde{A}$; $\sigma$ is a nonlinear activation function; and $W^{(l)}$ is the weight matrix of layer l. Because the elements on the main diagonal of the adjacency matrix A are all 0, each node's own features would be ignored when A is multiplied by the feature matrix H; the identity matrix I is therefore added to A. Left- and right-multiplying $\tilde{A}$ by $\tilde{D}^{-1/2}$ yields a symmetric normalized matrix.
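The propagation rule of a graph convolution layer (add self-connections, normalize symmetrically by the degree matrix, multiply by the features and the weights, apply the activation) can be sketched in plain Python. This is a minimal sketch, not the patent's implementation: it assumes the fused matrix S′ serves as the weighted adjacency matrix, uses ReLU as the activation, and uses made-up feature and weight values; in practice the weights would be learned during training.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(adj):
    """Return D^(-1/2) (A + I) D^(-1/2), where D is the degree matrix of A + I."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Degrees are positive for non-negative weights because of the added self-connection.
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_tilde]
    return [[inv_sqrt_deg[i] * a_tilde[i][j] * inv_sqrt_deg[j] for j in range(n)]
            for i in range(n)]


def gcn_layer(a_hat, features, weights, activation):
    """One layer: activation(a_hat @ features @ weights), applied element-wise."""
    z = matmul(matmul(a_hat, features), weights)
    return [[activation(value) for value in row] for row in z]


def relu(value: float) -> float:
    return max(0.0, value)


# Toy example: 3 drugs, 2 features per drug (all values hypothetical).
S_prime = [[0.0, 0.8, 0.1], [0.8, 0.0, 0.3], [0.1, 0.3, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W0 = [[0.5, -0.2], [0.1, 0.4]]   # first-layer weights (untrained placeholders)
W1 = [[1.0], [-1.0]]             # second-layer weights (untrained placeholders)

a_hat = normalized_adjacency(S_prime)
hidden = gcn_layer(a_hat, K, W0, relu)
scores = gcn_layer(a_hat, hidden, W1, lambda v: v)  # one raw score per drug
print(scores)
```

Stacking two such layers lets each drug's representation depend on drugs up to two hops away in the correlation graph.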

Step 3-2, determining training set and test set by twelve-fold cross validation

The data are first divided into 12 independent, identically distributed subsets. In each round, 11 subsets are used as the training set and the remaining 1 as the test set, so that 12 classifiers are trained. The average of the 12 classifiers' test set results is taken as the performance evaluation index of the graph convolutional neural network model.
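The splitting procedure can be sketched with the standard library alone. The sketch shuffles the sample indices and deals them into 12 folds; it does not stratify by toxicity label, which is a simplification, and the function name `k_fold_splits` and the fixed seed are illustrative choices rather than details from the patent.

```python
import random


def k_fold_splits(n_samples: int, k: int = 12, seed: int = 0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# With the 108 drugs of this embodiment, each fold tests on 9 drugs and trains on 99.
for train_idx, test_idx in k_fold_splits(108, k=12):
    pass  # train a classifier on train_idx, evaluate on test_idx, collect the metrics
print(len(train_idx), len(test_idx))
```

Averaging each metric over the 12 folds gives the reported performance index, and every drug is used for testing exactly once.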

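The performance evaluation indexes comprise six indexes: accuracy (ACC), recall (Recall), specificity (Specificity), precision (Precision), sensitivity (Sensitivity), and F1 score (F1), calculated as follows:

$$ACC = \frac{TP + TN}{TP + TN + FP + FN} \qquad (6)$$

$$Recall = \frac{TP}{TP + FN} \qquad (7)$$

$$Specificity = \frac{TN}{TN + FP} \qquad (8)$$

$$Precision = \frac{TP}{TP + FP} \qquad (9)$$

$$Sensitivity = \frac{TP}{TP + FN} \qquad (10)$$

$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \qquad (11)$$

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives of the model, respectively. The test results are shown in Table 1: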

TABLE 1 index information of prediction model constructed based on two kinds of correlation data

To analyze the difference between the prediction models built on S′ and on S more precisely, the result for each index is shown as a bar chart in FIG. 3.

The data above show that every index improves for the graph convolutional neural network model built on the fused correlation data S′ compared with the model built only on the PPI-derived correlation data S. Using a graph convolutional neural network model that performs feature extraction and classification on the fused data can improve the overall performance of drug toxicity prediction considerably. Compared with the single-correlation-data model, the accuracy of the prediction model built on the fused data improves by about 4% and the recall by 7%. This demonstrates the effectiveness of the model built on S′ for predicting drug toxicity, and also the need to fuse chemical similarity into drug correlations obtained from PPI network data alone.

Comparative test

A comparison test verifies the effectiveness of the graph convolutional neural network model. To better validate the prediction model constructed by the method, three evaluation indexes are compared between a support vector machine (SVM) model optimized by particle swarm optimization (PSO) and the graph convolutional neural network model, with both given the same input data set (the fused drug correlation data S′).

(1) Constructing the PSO-optimized SVM model.

The support vector machine (SVM) is a commonly used machine learning algorithm for classification and regression tasks. Its basic principle is to find an optimal hyperplane in a high-dimensional space that separates sample points of different classes. The hyperplane is expressed as:

$$w^{T}x + b = 0 \qquad (12)$$

where w is the normal vector of the hyperplane (which determines its orientation), x is the feature vector of the input sample, and b is the bias term. A sample point lies on the hyperplane when the equality holds; it is assigned to the positive class (class 1) if w^T x + b > 0 and to the negative class (class -1) if w^T x + b < 0.
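The decision rule based on the sign of w^T x + b can be written in a few lines. This is a minimal sketch of the rule for a linear hyperplane only; the weights, bias, and sample values are made up, and assigning points that lie exactly on the hyperplane to class -1 is an arbitrary choice of this sketch.

```python
def svm_predict(w, x, b: float) -> int:
    """Return 1 if w.x + b > 0, otherwise -1 (points on the hyperplane map to -1 here)."""
    score = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return 1 if score > 0 else -1


# Hypothetical hyperplane and samples, for illustration only.
w, b = [1.0, -1.0], -1.0
print(svm_predict(w, [2.0, 0.5], b))  # score = 2.0 - 0.5 - 1.0 = 0.5 -> 1
print(svm_predict(w, [0.0, 1.0], b))  # score = 0.0 - 1.0 - 1.0 = -2.0 -> -1
```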

PSO is an optimization algorithm based on swarm intelligence. It maintains a group of particles, each representing a candidate solution, and repeatedly adjusts each particle's position according to the particle's own experience and the experience of the swarm until an optimal solution is found.

The PSO-optimized SVM model combines the global search capability of PSO with the classification accuracy of the SVM. During training, PSO searches for the optimal hyperplane by adjusting the parameters of the SVM model, which improves classification accuracy and generalization. Through iterative optimization, PSO helps the SVM find a better parameter configuration and thereby improves the performance of the model.
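The particle update described above (each particle is pulled toward its own best position and the swarm's best position) can be sketched as a small generic minimizer. This is a textbook-style sketch, not the patent's configuration: the inertia and acceleration coefficients, swarm size, and iteration count are assumed values, and the quadratic objective merely stands in for a validation error that would, in the PSO-SVM setting, be computed by training an SVM with the candidate parameters.

```python
import random


def pso_minimize(objective, bounds, n_particles=20, n_iters=200,
                 inertia=0.7, c_personal=1.5, c_global=1.5, seed=0):
    """Minimize `objective` over a box given by `bounds` = [(low, high), ...]."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                    # best position seen by each particle
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]   # best position seen by the swarm
    for _ in range(n_iters):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (pbest[i][d] - pos[i][d])
                             + c_global * r2 * (gbest[d] - pos[i][d]))
                # Move the particle and keep it inside the search box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            value = objective(pos[i])
            if value < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], value
                if value < gbest_val:
                    gbest, gbest_val = pos[i][:], value
    return gbest, gbest_val


def stand_in_error(params):
    """Hypothetical stand-in for a validation error with its minimum at (1, -2)."""
    return (params[0] - 1.0) ** 2 + (params[1] + 2.0) ** 2


best_params, best_error = pso_minimize(stand_in_error, [(-5.0, 5.0), (-5.0, 5.0)])
print(best_params, best_error)
```

Replacing `stand_in_error` with a function that trains and validates an SVM for a given parameter vector would give the combined scheme the text describes.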

(2) Determining the evaluation indexes of the PSO-optimized SVM model and the graph convolutional neural network model.

The performance indexes used to evaluate the models are accuracy (ACC), recall (Recall), and F1 score (F1), calculated according to formulas (6), (7), and (11) in Step 3-2. The PSO-optimized SVM model was constructed and compared with the graph convolutional neural network model based on the fused correlation data; the index results are shown in Table 2:

Table 2 comparison of two model indexes

In order to analyze the difference between the two models more accurately, the results of each index are represented by a bar graph, as shown in fig. 4.

Table 2 shows that the graph convolutional neural network prediction model built on the fused data outperforms the PSO-optimized SVM model on every index. The difference in accuracy between the two models is not very significant, but the recall is nearly 2% higher than that of the PSO-optimized SVM model. The test results show that the model provided by the invention predicts drug toxicity relatively effectively.

The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for predicting drug toxicity based on drug fusion correlation data and graph convolutional network, characterized in that it includes the following steps:

Step 1: Screen the data on the Open TG-GATEs website, select the toxicity data of drugs in the TG-GATEs database and the gene expression data corresponding to each drug, annotate the gene expression data, and mark the target proteins, non-target proteins, and drug targets;

Step 2: construct new drug correlation data by calculating the closeness between drugs and proteins, calculating the correlation between drugs, and calculating the chemical similarity of drugs;

Step 3: Build a graph convolutional neural network model, use twelve-fold cross-validation to determine the training set and test set, use the training set to train the graph convolutional neural network model, use the test set to test the trained graph convolutional neural network model and give performance evaluation indicators.

2. The method for predicting drug toxicity based on drug fusion correlation data and graph convolutional network according to claim 1, characterized in that the calculation formula for the closeness between the drug and the protein in step 2 is as follows:

where the left-hand side is the closeness between the drug and the protein, v_n is a known target of a given drug m, L_{vv_n} is the shortest distance between v and v_n in the PPI network, the term applied to L_{vv_n} converts the distance between proteins into the closeness between proteins, and T(m) represents the set formed by all targets of drug m.

3. The method for predicting drug toxicity based on drug fusion correlation data and graph convolutional network according to claim 2, characterized in that the calculation method of the correlation between drugs in step 2 is as follows:

Given drugs m ₁ and m ₂ , the correlation between the two drugs is the average closeness between their known targets, calculated as follows:

Wherein, No.T(m ₁ ) is the total number of known targets belonging to drug m ₁ , and No.T(m ₂ ) is the total number of known targets belonging to drug m ₂ .

4. The method for predicting drug toxicity based on drug fusion correlation data and graph convolutional network according to claim 3, characterized in that the method for calculating drug chemical similarity described in step 2 is specifically as follows:

The Tanimoto coefficient was used to calculate the drug chemical similarity, and the calculation formula is as follows:

Among them, G and H are two different compounds.

5. The method for predicting drug toxicity based on drug fusion correlation data and graph convolutional network according to claim 4, characterized in that the method for constructing new drug correlation data in step 2 is as follows:

By integrating the correlation between drugs and the chemical similarity between drugs, the drug correlation data S′ integrating the chemical similarity relationship is obtained as follows:

where S is the drug correlation data, T is the drug chemical similarity data, and S′ is the new drug correlation data obtained by merging the two.

6. The method for predicting drug toxicity based on drug fusion correlation data and graph convolutional network according to claim 4, characterized in that the method for constructing the graph convolutional neural network model in step 3 is as follows:

The drug correlation data S and the new drug correlation data S′ are used as the input of the graph convolution layer. The graph convolution layer has two layers, which are used to extract features from graph structure data. The graph convolution layer uses the local first-order approximation of Chebyshev polynomials to simplify the calculation process of the graph convolution kernel. The calculation formula is as follows:

where H is the output of the graph convolution layer, $H^{(l+1)}$ is the feature of the (l+1)-th layer, and $H^{(l)}$ is the feature of the l-th layer, which for the input layer is the N×X-dimensional feature matrix K; $\tilde{A}$ is the matrix obtained by adding self-connections to the adjacency matrix A; $\tilde{D}$ is the degree matrix of $\tilde{A}$; $\sigma$ is the nonlinear activation function; and $W^{(l)}$ is the weight matrix of the l-th layer.

There are N nodes in the graph structure, and each node has its corresponding features. These nodes are combined to form an N×X-dimensional matrix K. An N×N-dimensional adjacency matrix A is formed between the nodes. After inputting S and S′, the final prediction result is generated.

7. The method for predicting drug toxicity based on drug fusion correlation data and graph convolutional network according to claim 6 is characterized in that the performance evaluation indexes described in step 3 include six indicators: accuracy ACC, recall rate Recall, specificity Specificity, precision Precision, sensitivity Sensitivity, and F1 score F1, and their calculation formulas are as follows:

Among them, TP, TN, FP, and FN represent the number of true positives, true negatives, false positives, and false negatives of the model, respectively.

8. The method for predicting drug toxicity based on drug fusion correlation data and graph convolutional network according to claim 7, characterized in that the twelve-fold cross-validation used in step 3 to determine the training set and the test set is as follows: the data are divided into 12 independent, identically distributed subsets; 11 of them are selected as the training set and the remaining 1 as the test set; 12 classifiers are trained; and the average of the 12 classifiers' test set results is taken as the performance evaluation index of the graph convolutional neural network model.

9. An electronic device comprising a memory and a processor, characterized in that the memory is used to store a program that supports the processor to execute the method for predicting drug toxicity based on drug fusion correlation data and graph convolutional network as described in any one of claims 1 to 8, and the processor is configured to execute the program stored in the memory.

10. A storage medium having a computer program stored thereon, characterized in that when the computer program is executed by a processor, the steps of the method for predicting drug toxicity based on drug fusion correlation data and graph convolutional network as described in any one of claims 1 to 8 are executed.