CN119008034A - Drug toxicity prediction method based on drug fusion correlation data and graph convolution network - Google Patents

Drug toxicity prediction method based on drug fusion correlation data and graph convolution network

Info

Publication number
CN119008034A
Authority
CN
China
Prior art keywords
drug
data
graph
correlation data
follows
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410922164.3A
Other languages
Chinese (zh)
Other versions
CN119008034B (en)
Inventor
苏苒
李婷玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei Institute Of Innovation And Development Tianjin University
Original Assignee
Hefei Institute Of Innovation And Development Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei Institute Of Innovation And Development Tianjin University
Priority to CN202410922164.3A
Publication of CN119008034A
Application granted
Publication of CN119008034B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H70/00 ICT specially adapted for the handling or processing of medical references
    • G16H70/40 ICT specially adapted for the handling or processing of medical references relating to drugs, e.g. their side effects or intended usage
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/042 Knowledge-based neural networks; Logical representations of neural networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Genetics & Genomics (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Chemical & Material Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biotechnology (AREA)
  • Medicinal Chemistry (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Toxicology (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a method for predicting drug toxicity based on fused drug correlation data and a graph convolutional network. It belongs to the technical field that combines machine learning with medicinal chemistry and toxicogenomics, and addresses the problem of how to improve drug toxicity prediction. Ten-fold cross-validation is used to mitigate the small size of the data set. The constructed model is compared with an existing drug toxicity prediction model, and the test results show that feature extraction and classification with the graph convolutional neural network and gene expression data yields higher prediction accuracy.

Description

Drug toxicity prediction method based on drug fusion correlation data and graph convolution network
Technical Field
The invention belongs to the technical field that combines machine learning with medicinal chemistry and toxicogenomics, and relates to a method for predicting drug toxicity based on fused drug correlation data and a graph convolutional network.
Background
Drug toxicity refers to the adverse reactions or effects that a drug may cause at therapeutic doses. It may arise from the physicochemical properties of the drug itself, from toxic metabolites produced during drug metabolism, or from interactions between the drug and human tissues or organs. Toxicity prediction is a major problem in drug development: every drug must undergo reliable side-effect testing, toxicity testing in particular, to ensure that it is safe and reliable in actual use. Existing prediction methods rely mainly on in vitro tests, animal tests, and computational simulation. Although these methods provide useful information, they cannot fully reproduce the complex metabolic transformations and toxic reactions that occur in the human body, so their predictions are subject to error and limitation. Prediction methods therefore need continual improvement, and toxicity should be evaluated by combining several approaches in order to improve the accuracy with which drug safety and efficacy are predicted.
Because machine learning can construct abstract features from data, many researchers now combine regression or classification models with chemical structure and biological activity data to predict toxicity. This links the chemical structure of a drug to its biological activity and allows its toxic potential to be assessed more accurately. Structural information offers a convincing molecular-level explanation of drug toxicity, but it is of limited use in complex biological environments, mainly because interactions between different molecular structures are often difficult to quantify; in such cases pharmacogenomic data can provide more valuable information for toxicity prediction. Although these studies have predicted drug toxicity successfully from many kinds of data, they generally use response data at a single concentration only. Failing to make full use of response data at different concentrations limits the ability to evaluate and predict drug toxicity comprehensively and accurately. In addition, the complexity of the biological environment gives rise to many kinds of links between chemical substances and organisms, and interactions between proteins are the basis of biological function, so studying protein interactions in depth is of great significance for the biological sciences. Protein interactions are interactions that occur within a cell or organism through direct contact or indirect association between different proteins. They may be protein-protein interactions or interactions between proteins and other biological molecules (e.g., nucleic acids or small-molecule compounds). Studying them gives a deeper understanding of protein function, regulatory mechanisms, and the complex interrelationships among biological processes.
The protein-protein interaction network (PPI network) is a network graph model for studying protein interactions. It exhibits small-world behavior, a scale-free degree distribution, and functional modularity. However, PPI networks suffer from noise and similar problems, which lowers data quality and in turn affects prediction accuracy. Combining the chemical structure of drugs with the PPI network, and performing feature extraction and classification on these data together with gene expression data using a graph convolutional neural network, is therefore of great significance for predicting drug toxicity from fused compound structure and gene expression data.
Disclosure of Invention
The technical scheme of the invention addresses the problem of how to improve drug toxicity prediction.
The invention solves this technical problem through the following technical scheme:
The method for predicting drug toxicity based on drug fusion correlation data and a graph convolution network comprises the following steps:
Step 1, screening the data on the Open TG-GATEs website, selecting the drug toxicity data in the TG-GATEs database and the gene expression data corresponding to each drug, annotating the gene expression data, and marking target proteins, non-target proteins, and drug targets;
Step 2, constructing new drug correlation data by calculating the closeness between drugs and proteins, the correlation between drugs, and the chemical similarity of the drugs;
Step 3, constructing a graph convolutional neural network model, determining a training set and a test set by twelve-fold cross-validation, training the model on the training set, testing the trained model on the test set, and reporting performance evaluation indexes.
Further, the closeness between the drug and the protein in step 2 is calculated as follows:
$$C(m, v) = \sum_{v_n \in T(m)} f\left(L_{v v_n}\right)$$
where $C(m, v)$ is the closeness between drug $m$ and protein $v$, $v_n$ is a known target of the given drug $m$, $L_{v v_n}$ is the shortest distance between $v$ and $v_n$ in the PPI network, $f(\cdot)$ converts a protein-to-protein distance into a protein-to-protein closeness, and $T(m)$ denotes the set of all targets of drug $m$.
Further, the method for calculating the correlation between the drugs in step 2 is as follows:
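Given drugs $m_1$ and $m_2$, the correlation between the two drugs is the average closeness between their known targets, calculated as follows:
$$S(m_1, m_2) = \frac{\sum_{v_i \in T(m_1)} \sum_{v_j \in T(m_2)} f\left(L_{v_i v_j}\right)}{\mathrm{No.}T(m_1) \cdot \mathrm{No.}T(m_2)}$$
where $f\left(L_{v_i v_j}\right)$ is the closeness between target proteins $v_i$ and $v_j$ derived from their shortest distance in the PPI network, $\mathrm{No.}T(m_1)$ is the total number of known targets of drug $m_1$, and $\mathrm{No.}T(m_2)$ is the total number of known targets of drug $m_2$.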
Further, the chemical similarity of the drugs in step 2 is calculated as follows:
The drug chemical similarity is calculated with the Tanimoto coefficient:
$$T(G, H) = \frac{|G \cap H|}{|G \cup H|}$$
where $G$ and $H$ are the sets of features present in the molecular fingerprints of two different compounds.
Further, the new drug correlation data in step 2 are constructed as follows:
The inter-drug correlation is fused with the inter-drug chemical similarity to obtain the drug correlation data S' that incorporates the chemical similarity relationship:
Where S is the inter-drug correlation data, T is the inter-drug chemical similarity data, and S' is the new drug correlation data obtained by fusing the two.
Further, the graph convolutional neural network model in step 3 is constructed as follows:
The inter-drug correlation data S and the new drug correlation data S' are used as inputs to the graph convolution layers. There are two graph convolution layers, which extract features from the graph-structured data. Each layer uses the local first-order approximation of Chebyshev polynomials to simplify the computation of the graph convolution kernel:
$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$
where $H$ is the output of a graph convolution layer, $H^{(l+1)}$ is the feature matrix of layer $l+1$, $H^{(l)}$ is the feature matrix of layer $l$ (for the input layer, the $N \times X$ feature matrix $K$), $\tilde{A} = A + I$ is the adjacency matrix $A$ with self-connections added, $\tilde{D}$ is the degree matrix of $\tilde{A}$, $\sigma$ is a nonlinear activation function, and $W^{(l)}$ is the weight matrix of layer $l$.
The graph has N nodes, each with its own features. Stacking the node features gives an N×X matrix K, and the connections between nodes form an N×N adjacency matrix A. After S and S' are input, the final prediction result is generated.
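Further, the performance evaluation indexes in step 3 comprise six indexes: accuracy (ACC), recall (Recall), specificity (Specificity), precision (Precision), sensitivity (Sensitivity), and F1 score (F1), calculated as follows:
$$ACC = \frac{TP + TN}{TP + TN + FP + FN}$$
$$Recall = \frac{TP}{TP + FN}$$
$$Specificity = \frac{TN}{TN + FP}$$
$$Precision = \frac{TP}{TP + FP}$$
$$Sensitivity = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives produced by the model, respectively.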
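The six indexes can be computed directly from the four confusion-matrix counts. The Python sketch below is illustrative only: the function name `classification_metrics`, the example counts, and the convention of returning 0 when a denominator is zero are assumptions, not part of the method as described.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the six indexes from confusion-matrix counts."""
    def ratio(num, den):
        # Assumed convention: return 0.0 instead of dividing by zero.
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "ACC": ratio(tp + tn, tp + tn + fp + fn),
        "Recall": recall,
        "Specificity": ratio(tn, tn + fp),
        "Precision": precision,
        "Sensitivity": recall,  # same formula as recall
        "F1": ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical counts for one test fold.
m = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Recall and sensitivity share one formula, so the two values always coincide.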
Further, the training set and test set in step 3 are determined by twelve-fold cross-validation as follows: the data are divided into 12 independent, identically distributed subsets; in each round, 11 subsets are used as the training set and the remaining subset as the test set, so that 12 classifiers are trained; the mean of the 12 test-set results is taken as the performance evaluation index of the graph convolutional neural network model.
An electronic device comprising a memory and a processor, the memory storing a program that enables the processor to perform the above method for predicting drug toxicity based on drug fusion correlation data and a graph convolution network, and the processor being configured to execute the program stored in the memory.
A storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above method of predicting drug toxicity based on drug fusion correlation data and a graph convolution network.
The invention has the advantages that:
The invention uses the protein interaction network to obtain the correlation between drugs and, at the same time, analyzes the chemical structure of the drugs. The inter-drug correlation is combined with the chemical similarity, and feature extraction and classification are performed with the graph convolutional neural network model and gene expression data to obtain the prediction result. Ten-fold cross-validation is used to mitigate the small size of the data set. The constructed model is compared with an existing drug toxicity prediction model, and the test results show that feature extraction and classification with the graph convolutional neural network and gene expression data yields higher prediction accuracy.
Drawings
FIG. 1 is a flow chart of a method for predicting drug toxicity based on drug fusion correlation data and a graph convolution network;
FIG. 2 is a schematic diagram of drug and protein actions;
FIG. 3 is a graph convolution neural network model index comparison graph constructed based on drug related data S and new drug related data S';
FIG. 4 is a graph comparing indices of the PSO-optimized SVM model and the graph convolution neural network model.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions in the embodiments of the present invention will be clearly and completely described in the following in conjunction with the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The technical scheme of the invention is further described below with reference to the attached drawings and specific embodiments:
Example 1
As shown in fig. 1, a method for predicting drug toxicity based on drug fusion correlation data and graph convolution network according to an embodiment of the present invention includes the following steps:
Step 1, constructing a data set
Step 1-1, preparing the data set. The data set comprises the drug toxicity data M in the TG-GATEs database and the gene expression data V corresponding to each drug.
In step 1-1, the TG-GATEs database comes from a large research program led by Astellas Pharma Inc. of Japan. It stores drug toxicity data, mainly blood biochemistry and histopathology data obtained in vivo (rats) and in vitro (primary hepatocytes of rats and humans). The required gene expression data were then obtained from Toxygates (an online platform for analyzing the data in Open TG-GATEs), the interaction relationships between proteins were constructed with the STRING database (an online database and tool for analyzing protein interactions), and the chemical formulas of the 108 test drugs were obtained from DrugBank (a database providing bioinformatics and cheminformatics information).
In this embodiment, screening the data on the Open TG-GATEs website yields toxicity data M for 108 drugs and the gene expression data V corresponding to each drug; the expression data have dimension 6009.
Step 1-2, annotating the gene expression data V from step 1-1 and marking target proteins, non-target proteins, and drug targets.
In step 1-2, because the drug similarity measure plays a major role in prediction, the gene expression data V must be annotated and the closeness between drug and protein determined. As shown in FIG. 2, the drug is represented by node m and the target protein by node v; non-target proteins are represented by nodes v_n (n = 1, 2, 3, …), and the proteins inside the solid oval region are drug targets, that is, the drug can be linked to target proteins through these targets. The dotted edges of the protein-protein interaction network represent information in genomic space that interacts with the drug target. The closeness between drug m and protein v can be defined as any number in [0, 1].
Step 2, constructing drug correlation data that incorporate the drug chemical structure
Constructing the drug correlation data involves calculating the closeness $C(m, v)$ between drug and protein based on the PPI network, using this closeness to calculate the inter-drug correlation S, and calculating the drug chemical similarity T, wherein:
Step 2-1, calculating the closeness between the drug and the protein. The closeness is defined through the PPI network, which connects the pharmacological space and the genomic space:
$$C(m, v) = \sum_{v_n \in T(m)} f\left(L_{v v_n}\right) \tag{1}$$
where $C(m, v)$ is the closeness between drug $m$ and protein $v$, $v_n$ is a known target of the given drug $m$, $L_{v v_n}$ is the shortest distance between $v$ and $v_n$ in the PPI network, $f(\cdot)$ converts a protein-to-protein distance into a protein-to-protein closeness (these data can be obtained directly from the STRING database), and $T(m)$ denotes the set of all targets of drug $m$.
Equation (1) shows that the closeness between drug m and protein v is the sum of the closeness between v and every target of m. If there is no path between two proteins, $L_{v v_n}$ is defined as infinity. The closeness between the drug and the target protein is obtained in this way.
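The closeness computation of step 2-1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this embodiment: the toy PPI graph, the helper names `shortest_distances` and `closeness`, and in particular the distance-to-closeness transform `exp(-L)` are all hypothetical, because the exact transform is not reproduced in the text. Proteins with no connecting path are treated as infinitely distant, as described above.

```python
import math
from collections import deque


def shortest_distances(ppi, source):
    """Breadth-first search: hop distance from `source` to every reachable protein."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in ppi.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def closeness(ppi, targets, protein, convert=lambda d: math.exp(-d)):
    """Sum the converted distances between `protein` and each known target.

    `convert` is an ASSUMED distance-to-closeness transform (exp(-L)).
    A missing path counts as infinite distance and contributes 0.
    """
    dist = shortest_distances(ppi, protein)
    total = 0.0
    for target in targets:
        d = dist.get(target, math.inf)
        total += 0.0 if math.isinf(d) else convert(d)
    return total


# Hypothetical undirected PPI network stored as an adjacency dict.
ppi = {
    "P1": ["P2"],
    "P2": ["P1", "P3"],
    "P3": ["P2"],
    "P4": [],  # isolated protein, unreachable from the others
}
drug_targets = ["P1", "P3"]  # T(m) for a hypothetical drug m

c = closeness(ppi, drug_targets, "P2")
print(round(c, 4))
```

With a real network, the adjacency dict would be built from STRING interaction records, and `convert` would be replaced by whatever transform the method actually specifies.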
Step 2-2, calculating the correlation between drugs
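The inter-drug correlation data are calculated according to formula (2). Given drugs $m_1$ and $m_2$, the correlation between the two drugs is defined as the average closeness between their known targets:
$$S(m_1, m_2) = \frac{\sum_{v_i \in T(m_1)} \sum_{v_j \in T(m_2)} f\left(L_{v_i v_j}\right)}{\mathrm{No.}T(m_1) \cdot \mathrm{No.}T(m_2)} \tag{2}$$
where $f\left(L_{v_i v_j}\right)$ is the closeness between target proteins $v_i$ and $v_j$ derived from their shortest distance in the PPI network, $\mathrm{No.}T(m_1)$ is the total number of known targets of drug $m_1$, and $\mathrm{No.}T(m_2)$ is the total number of known targets of drug $m_2$.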
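The averaging in step 2-2 can be illustrated with a short Python sketch. It assumes that the average runs over every pair formed by one target of each drug; the function name `drug_correlation`, the lookup table, and its values are hypothetical.

```python
def drug_correlation(targets_1, targets_2, protein_closeness):
    """Average pairwise closeness between the known targets of two drugs."""
    if not targets_1 or not targets_2:
        return 0.0  # assumed convention when a drug has no known target
    total = sum(protein_closeness(a, b) for a in targets_1 for b in targets_2)
    return total / (len(targets_1) * len(targets_2))


# Hypothetical protein-protein closeness values (symmetric lookup table).
_table = {
    frozenset(["P1", "P3"]): 0.5,
    frozenset(["P1", "P4"]): 0.1,
    frozenset(["P2", "P3"]): 0.9,
    frozenset(["P2", "P4"]): 0.3,
}


def lookup(a, b):
    """Return 1.0 for identical proteins, else the table value (0.0 if absent)."""
    if a == b:
        return 1.0
    return _table.get(frozenset([a, b]), 0.0)


s = drug_correlation(["P1", "P2"], ["P3", "P4"], lookup)
print(s)
```

Evaluating `drug_correlation` for every pair of drugs fills the inter-drug correlation matrix S.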
Step 2-3, calculating the chemical similarity of the drugs
To obtain the chemical similarity between drugs, the MACCS molecular fingerprint of each drug, which has length 167, is derived from the drug's chemical formula, and the fingerprints are compared. The drug chemical similarity is calculated with the Tanimoto coefficient:
$$T(G, H) = \frac{|G \cap H|}{|G \cup H|} \tag{3}$$
where $G$ and $H$ are the sets of features present in the molecular fingerprints of two different compounds.
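For binary fingerprints, the Tanimoto coefficient is the number of bits set in both fingerprints divided by the number of bits set in either. The sketch below represents each fingerprint as the set of its "on" bit positions; the bit positions shown are made up for illustration, and generating real MACCS keys would require a cheminformatics toolkit, which is outside this example.

```python
def tanimoto(bits_g, bits_h):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit positions."""
    g, h = set(bits_g), set(bits_h)
    union = len(g | h)
    if union == 0:
        return 0.0  # assumed convention for two empty fingerprints
    return len(g & h) / union


# Hypothetical 'on' bits of two 167-bit fingerprints.
fp_g = {3, 42, 77, 120, 165}
fp_h = {3, 42, 88, 165}

t = tanimoto(fp_g, fp_h)
print(t)
```

Here three bits are shared and six distinct bits are set in total, so the coefficient is 3/6.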
Step 2-4, constructing new drug related data S'
Steps 2-2 and 2-3 give the inter-drug correlation and the inter-drug chemical similarity. The two are fused to obtain the drug correlation data S' that incorporates the chemical similarity relationship:
Where S is the inter-drug correlation data, T is the inter-drug chemical similarity data, and S' is the new drug correlation data obtained by fusing the two.
As formula (4) shows, the correlation data and the chemical similarity data are fused mainly by taking their weighted average. The aim is to reduce the impact of the low quality of PPI network data and thereby improve the prediction performance of the model.
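A weighted average of two similarity matrices can be sketched as below. The weights are not stated in the text, so the parameter `alpha` and its default of 0.5 are assumptions made purely for illustration, as are the function name `fuse` and the 2×2 example matrices.

```python
def fuse(S, T, alpha=0.5):
    """Element-wise weighted average of two equally shaped matrices.

    `alpha` is an ASSUMED weight on S; (1 - alpha) is applied to T.
    """
    if len(S) != len(T) or any(len(rs) != len(rt) for rs, rt in zip(S, T)):
        raise ValueError("S and T must have the same shape")
    return [[alpha * s + (1 - alpha) * t for s, t in zip(rs, rt)]
            for rs, rt in zip(S, T)]


# Hypothetical 2-drug example: PPI-based correlation S, chemical similarity T.
S = [[1.0, 0.2],
     [0.2, 1.0]]
T = [[1.0, 0.6],
     [0.6, 1.0]]

S_fused = fuse(S, T)
print(S_fused)
```

Because both inputs are symmetric, the fused matrix is symmetric as well, so it can still serve as a weighted adjacency matrix between drugs.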
Step 3, constructing, training, and evaluating the graph convolutional neural network model
Graph convolutional neural network models are constructed on the drug correlation data and on the new drug correlation data.
Step 3-1, constructing the graph convolutional neural network model
The drug correlation data S and the new drug correlation data S' are used as inputs to the graph convolution layers. There are two graph convolution layers, which extract features from the graph-structured data. Each layer uses the local first-order approximation of Chebyshev polynomials to simplify the computation of the graph convolution kernel, which reduces the cost of computing the Laplacian matrix in graph convolution; the calculation is given by formula (5). The graph has N nodes, each with its own features. Stacking the node features gives an N×X matrix K, and the connections between nodes form an N×N adjacency matrix A. After S and S' are input, the final prediction result is generated:
$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right) \tag{5}$$
where $H$ denotes the output of a graph convolution layer, $H^{(l+1)}$ is the feature matrix of layer $l+1$, $H^{(l)}$ is the feature matrix of layer $l$ (for the input layer, the $N \times X$ feature matrix $K$), $\tilde{A} = A + I$ is the adjacency matrix $A$ with self-connections added, $\tilde{D}$ is the degree matrix of $\tilde{A}$, $\sigma$ is a nonlinear activation function, and $W^{(l)}$ is the weight matrix of layer $l$. Because the elements on the main diagonal of the adjacency matrix A are all 0, multiplying A by H ignores each node's own features, so the identity matrix I must be added to A. Multiplying $\tilde{A}$ on the left and right by $\tilde{D}^{-\frac{1}{2}}$ yields a symmetrically normalized matrix.
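The propagation rule of a single graph convolution layer can be sketched in plain Python as follows. This is an illustration under assumptions rather than the model used in this embodiment: the 3-node graph, the feature matrix, the weights, and the choice of ReLU as the activation are hypothetical, and in the method described here the adjacency would be derived from S or S'.

```python
import math


def matmul(A, B):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def normalized_adjacency(A):
    """Return D~^(-1/2) (A + I) D~^(-1/2), the symmetrically normalized adjacency."""
    n = len(A)
    # Add self-connections so each node keeps its own features.
    A_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in A_tilde]
    inv_sqrt = [1.0 / math.sqrt(d) for d in degree]
    return [[inv_sqrt[i] * A_tilde[i][j] * inv_sqrt[j] for j in range(n)]
            for i in range(n)]


def gcn_layer(A_hat, H, W):
    """One propagation step: ReLU(A_hat @ H @ W). ReLU is an assumed activation."""
    Z = matmul(matmul(A_hat, H), W)
    return [[max(0.0, z) for z in row] for row in Z]


# Hypothetical 3-node graph, 2 features per node, 2 hidden units.
A = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
K = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
W0 = [[0.5, -0.5],
      [0.25, 0.75]]

A_hat = normalized_adjacency(A)
H1 = gcn_layer(A_hat, K, W0)
for row in H1:
    print([round(v, 4) for v in row])
```

A two-layer model would apply `gcn_layer` a second time with another weight matrix and feed the result to a classifier; training the weights is omitted here.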
Step 3-2, determining training set and test set by twelve-fold cross validation
The data are first divided into 12 independent, identically distributed subsets. In each round, 11 subsets are used as the training set and the remaining subset as the test set, so that 12 classifiers are trained. The mean of the 12 test-set results is taken as the performance evaluation index of the graph convolutional neural network model.
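The fold rotation can be sketched in Python as below. The helper names `k_fold_indices` and `cross_validate` are hypothetical, and the split is a plain shuffled partition; it does not enforce that each fold has the same class distribution, which a stratified split would be needed for. The evaluator passed in is a placeholder rather than a trained model.

```python
import random


def k_fold_indices(n_samples, k=12, seed=0):
    """Shuffle sample indices and yield (train, test) index lists for each fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_samples, evaluate, k=12):
    """Average the per-fold score returned by `evaluate(train, test)`."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# 108 drugs, as in this embodiment. The evaluator below is a placeholder that
# only reports the test-fold size instead of training and scoring a model.
splits = list(k_fold_indices(108))
mean_test_size = cross_validate(108, lambda train, test: len(test))
print(len(splits), mean_test_size)
```

With 108 samples and 12 folds, each round trains on 99 samples and tests on 9.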
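The performance evaluation indexes comprise six indexes: accuracy (ACC), recall (Recall), specificity (Specificity), precision (Precision), sensitivity (Sensitivity), and F1 score (F1), calculated as follows:
$$ACC = \frac{TP + TN}{TP + TN + FP + FN} \tag{6}$$
$$Recall = \frac{TP}{TP + FN} \tag{7}$$
$$Specificity = \frac{TN}{TN + FP} \tag{8}$$
$$Precision = \frac{TP}{TP + FP} \tag{9}$$
$$Sensitivity = \frac{TP}{TP + FN} \tag{10}$$
$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{11}$$
where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives produced by the model, respectively. The test results are shown in Table 1: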
TABLE 1 index information of prediction model constructed based on two kinds of correlation data
To analyze the difference between the prediction models constructed on S' and on S more precisely, the result for each index is shown as a bar chart in FIG. 3.
The data above show that every index improves for the graph convolutional neural network model constructed on the fused correlation data S' compared with the model constructed on the correlation data S obtained from the PPI network alone. Using a graph convolutional neural network for feature extraction and classification on the fused data improves the overall drug toxicity prediction. Compared with the model built on the single correlation data, the accuracy of the model built on the fused data improves by about 4% and the recall by 7%. This demonstrates the effectiveness of the graph convolutional neural network model constructed on the fused correlation data S' for predicting drug toxicity, and also the need to fuse chemical similarity into the drug correlation obtained from PPI network data alone.
Comparative test
The effectiveness of the graph convolutional neural network model is verified by a comparative test. To better validate the prediction model constructed by this method, three evaluation indexes of a particle swarm optimization (PSO)-optimized SVM model and of the graph convolutional neural network model are compared on the same input data set (the fused drug correlation data S').
(1) Constructing an SVM model optimized by particle swarm optimization (PSO).
The support vector machine (SVM) is a commonly used machine learning algorithm for classification and regression tasks. Its basic principle is to find an optimal hyperplane in a high-dimensional space that separates sample points of different classes. The hyperplane is expressed as:
$$w^{T}x + b = 0 \tag{12}$$
where w is the normal vector of the hyperplane (which determines its orientation), x is the feature vector of the input sample, and b is the bias term. The equality holds when a sample point lies on the hyperplane. A sample point is assigned to the positive class (class 1) when $w^{T}x + b > 0$ and to the negative class (class -1) when $w^{T}x + b < 0$.
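The decision rule above amounts to taking the sign of the linear score. A minimal Python sketch, with a made-up two-dimensional hyperplane and the hypothetical function name `svm_decision`:

```python
def svm_decision(w, x, b):
    """Return +1 or -1 from the sign of w^T x + b; 0 means 'on the hyperplane'."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    if score > 0:
        return 1
    if score < 0:
        return -1
    return 0


# Hypothetical hyperplane x1 + x2 - 1 = 0 in two dimensions.
w, b = [1.0, 1.0], -1.0
points = ([2.0, 2.0], [0.0, 0.0], [0.5, 0.5])
labels = [svm_decision(w, x, b) for x in points]
print(labels)
```

Training an SVM means choosing w and b so that this rule separates the classes with the largest margin; only the prediction step is shown here.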
PSO is an optimization algorithm based on swarm intelligence. It maintains a group of particles, each representing a candidate solution, and repeatedly adjusts each particle's position according to that particle's own experience and the experience of the swarm until the optimal solution is found.
The PSO-optimized SVM model combines the global search capability of PSO with the classification accuracy of the SVM. During training, PSO searches for the best hyperplane by adjusting the parameters of the SVM model, which improves classification accuracy and generalization. Through iterative optimization, PSO helps the SVM find a better parameter configuration and thus improves model performance.
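A basic global-best PSO loop can be sketched as follows. This is a generic illustration, not the configuration used in the comparative test: the inertia and acceleration coefficients, swarm size, iteration count, and the function name `pso_minimize` are all assumptions, and the objective is a toy bowl-shaped function standing in for "validation error as a function of two SVM parameters".

```python
import random


def pso_minimize(objective, bounds, n_particles=20, n_iter=100,
                 inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Basic global-best PSO; returns (best_position, best_value)."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]            # each particle's own best position
    best_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=best_val.__getitem__)
    g_pos, g_val = best_pos[g][:], best_val[g]  # best position seen by the swarm

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Blend momentum, individual experience, and swarm experience.
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                lo, hi = bounds[d]
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val < g_val:
                    g_val, g_pos = val, pos[i][:]
    return g_pos, g_val


# Toy stand-in objective with its minimum at (1.0, -2.0).
def toy_error(p):
    return (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2


best, value = pso_minimize(toy_error, bounds=[(-5.0, 5.0), (-5.0, 5.0)])
print([round(v, 3) for v in best], value)
```

To tune an SVM this way, `toy_error` would be replaced by a function that trains the SVM with the candidate parameters and returns its cross-validation error.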
(2) Determining the evaluation indexes of the PSO-optimized SVM model and the graph convolutional neural network model.
The performance indexes used to evaluate the models are accuracy (ACC), recall (Recall), and F1 score (F1), calculated by formulas (6), (7), and (11) in step 3-2. The PSO-optimized SVM model was constructed and compared with the graph convolutional neural network model based on the fused correlation data; the index results are shown in Table 2:
Table 2 comparison of two model indexes
To analyze the difference between the two models more precisely, the result for each index is shown as a bar chart in FIG. 4.
Table 2 shows that the graph convolutional neural network prediction model constructed on the fused data outperforms the PSO-optimized SVM model on every index. The difference in accuracy between the two models is small, but recall is nearly 2% higher than that of the PSO-optimized SVM model. The test results show that the proposed model predicts drug toxicity effectively.
The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1.基于药物融合相关性数据及图卷积网络预测药物毒性方法,其特征在于,包括以下步骤:1. A method for predicting drug toxicity based on drug fusion correlation data and graph convolutional network, characterized in that it includes the following steps: 步骤1、对Open TG-GATEs网站上的数据进行筛选,选取TG-GATEs数据库药物的毒性数据和每种药物相对应的基因表达数据,对基因表达数据进行标注,标记目标蛋白质、非目标蛋白质和药物靶点;Step 1: Screen the data on the Open TG-GATEs website, select the toxicity data of drugs in the TG-GATEs database and the gene expression data corresponding to each drug, annotate the gene expression data, and mark the target proteins, non-target proteins, and drug targets; 步骤2、通过计算药物与蛋白质之间的紧密度、计算药物与药物之间的相关性、计算药物化学相似性,构建新药物相关性数据;Step 2: construct new drug correlation data by calculating the closeness between drugs and proteins, calculating the correlation between drugs, and calculating the chemical similarity of drugs; 步骤3、构建图卷积神经网络模型,采用十二折交叉验证确定训练集以及测试集,采用训练集对图卷积神经网络模型进行训练,采用测试集对训练好的图卷积神经网络模型进行测试并给出性能评价指标。Step 3: Build a graph convolutional neural network model, use twelve-fold cross-validation to determine the training set and test set, use the training set to train the graph convolutional neural network model, use the test set to test the trained graph convolutional neural network model and give performance evaluation indicators. 2.根据权利要求1所述的基于药物融合相关性数据及图卷积网络预测药物毒性方法,其特征在于,步骤2中所述的药物与蛋白质之间的紧密度的计算公式如下:2. The method for predicting drug toxicity based on drug fusion correlation data and graph convolutional network according to claim 1, characterized in that the calculation formula for the closeness between the drug and the protein in step 2 is as follows: 其中,为药物与蛋白质之间的紧密度,vn是给定药物m的已知靶点,Lvvn是PPI网络中v和vn之间的最短距离,用于将蛋白质与蛋白质之间的距离转换为蛋白质与蛋白质之间的紧密型,T(m)代表药物m的全部靶点所形成的集合。in, is the closeness between drug and protein, vn is the known target of a given drug m, Lvvn is the shortest distance between v and vn in the PPI network, It is used to convert the distance between proteins into the closeness between proteins. T(m) represents the set formed by all targets of drug m. 
3. The method for predicting drug toxicity based on drug fusion correlation data and a graph convolutional network according to claim 2, characterized in that the correlation between drugs in step 2 is calculated as follows:
Given drugs m_1 and m_2, the correlation between the two drugs is the average closeness between their known targets, calculated as:
[formula omitted in source]
where No.T(m_1) is the total number of known targets of drug m_1, and No.T(m_2) is the total number of known targets of drug m_2.
4. The method for predicting drug toxicity based on drug fusion correlation data and a graph convolutional network according to claim 3, characterized in that the drug chemical similarity in step 2 is calculated as follows:
The Tanimoto coefficient is used to calculate the drug chemical similarity, with the formula:
[formula omitted in source]
where G and H are two different compounds.
5. The method for predicting drug toxicity based on drug fusion correlation data and a graph convolutional network according to claim 4, characterized in that the new drug correlation data in step 2 are constructed as follows:
The correlation between drugs is fused with the chemical similarity between drugs to obtain drug correlation data S′ that incorporate the chemical similarity relationship:
[formula omitted in source]
where S is the drug-drug correlation data, T is the drug-drug chemical similarity data, and S′ is the new drug correlation data obtained by fusing the two.
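The Tanimoto coefficient named in claim 4 is commonly defined on binary fingerprints as the number of shared on-bits divided by the number of on-bits set in either compound, i.e. |G ∩ H| / |G ∪ H|. The sketch below assumes that standard definition and assumes each compound is represented as a set of on-bit indices; the patent's own formula and fingerprint type are not reproduced in this text, and the function name `tanimoto` and the two example fingerprints are hypothetical.

```python
def tanimoto(fp_g, fp_h):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    g, h = set(fp_g), set(fp_h)
    union = len(g | h)
    if union == 0:
        # Convention chosen for this sketch: two empty fingerprints score 0.
        return 0.0
    return len(g & h) / union

# Two made-up fingerprints: 3 shared bits out of 6 distinct bits in total.
FP_G = {1, 2, 3, 5}
FP_H = {2, 3, 4, 5, 8}
print(tanimoto(FP_G, FP_H))  # 3 / 6 = 0.5
```

The value ranges from 0 (no shared bits) to 1 (identical fingerprints), so a matrix of pairwise values can serve as the chemical similarity data T used in claim 5.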
6. The method for predicting drug toxicity based on drug fusion correlation data and a graph convolutional network according to claim 4, characterized in that the graph convolutional neural network model in step 3 is constructed as follows:
The drug correlation data S and the new drug correlation data S′ are used as the input of the graph convolution layers. There are two graph convolution layers, which extract features from the graph-structured data. The graph convolution layers use the local first-order approximation of Chebyshev polynomials to simplify the computation of the graph convolution kernel, with the formula:
[formula omitted in source]
where H is the output of the graph convolution layer; H^(l+1) is the feature matrix of layer l+1; H^(l) is the feature matrix of layer l, which for the input layer is the N×X feature matrix K; the formula also uses the matrix obtained by adding self-connections to the adjacency matrix A, together with the degree matrix of that self-connected matrix (both symbols are missing from the source text); σ is a nonlinear activation function; and W_l is the weight matrix of layer l.
The graph structure has N nodes, each with its own features. Combining the node features gives the N×X matrix K, and the connections between nodes form the N×N adjacency matrix A. After S and S′ are input, the final prediction result is generated.
7. The method for predicting drug toxicity based on drug fusion correlation data and a graph convolutional network according to claim 6, characterized in that the performance evaluation indicators in step 3 comprise six indicators: accuracy (ACC), recall (Recall), specificity (Specificity), precision (Precision), sensitivity (Sensitivity), and F1 score (F1), calculated respectively as follows:
[formulas omitted in source]
where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives of the model, respectively.
8. The method for predicting drug toxicity based on drug fusion correlation data and a graph convolutional network according to claim 7, characterized in that the twelve-fold cross-validation in step 3 determines the training set and the test set as follows: the data are divided into 12 independent and identically distributed subsets; 11 of them are selected as the training set and the remaining 1 as the test set; 12 classifiers are trained; and the average of the test-set results of the 12 classifiers is taken as the performance evaluation indicator of the graph convolutional neural network model.
9. An electronic device comprising a memory and a processor, characterized in that the memory stores a program that supports the processor in executing the method for predicting drug toxicity based on drug fusion correlation data and a graph convolutional network according to any one of claims 1 to 8, and the processor is configured to execute the program stored in the memory.
10. A storage medium having a computer program stored thereon, characterized in that, when the computer program is run by a processor, the steps of the method for predicting drug toxicity based on drug fusion correlation data and a graph convolutional network according to any one of claims 1 to 8 are executed.
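The graph convolution described in claim 6 (first-order Chebyshev approximation, adjacency matrix with self-connections, degree matrix, activation σ, per-layer weights) matches the widely used propagation rule H^(l+1) = σ(D̃^(-1/2) Ã D̃^(-1/2) H^(l) W^(l)) with Ã = A + I. The sketch below assumes that rule, since the patent's formula is not reproduced in this text. It is a forward pass only, with no training, and the toy matrices `A`, `K`, `W0`, `W1` and all function names are made up for illustration. How S and S′ are mapped onto the adjacency matrix and node features is not spelled out in the claim, so that mapping is not modeled here.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalized_adjacency(adj):
    """Return D^(-1/2) (A + I) D^(-1/2), where D is the degree matrix of A + I."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_tilde]
    return [[inv_sqrt_deg[i] * a_tilde[i][j] * inv_sqrt_deg[j] for j in range(n)]
            for i in range(n)]

def relu(m):
    """Element-wise ReLU, used here as the nonlinear activation."""
    return [[max(0.0, v) for v in row] for row in m]

def gcn_layer(a_hat, h, w, activation=relu):
    """One propagation step: activation(A_hat @ H @ W)."""
    return activation(matmul(matmul(a_hat, h), w))

# Toy graph with N = 3 nodes (a path 0 - 1 - 2) and X = 2 features per node.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W0 = [[0.5, -0.2], [0.1, 0.3]]   # hypothetical layer-0 weights (2 x 2)
W1 = [[1.0], [-1.0]]             # hypothetical layer-1 weights (2 x 1)

A_HAT = normalized_adjacency(A)
H1 = gcn_layer(A_HAT, K, W0)                          # first graph convolution layer
OUT = gcn_layer(A_HAT, H1, W1, activation=lambda m: m)  # second layer, raw scores
print(OUT)
```

Adding the identity before normalizing guarantees every node has degree of at least 1 when the adjacency weights are non-negative, so the inverse square root is always defined.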
CN202410922164.3A 2024-07-10 2024-07-10 A method for predicting drug toxicity based on drug fusion correlation data and graph convolutional network Active CN119008034B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410922164.3A CN119008034B (en) 2024-07-10 2024-07-10 A method for predicting drug toxicity based on drug fusion correlation data and graph convolutional network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410922164.3A CN119008034B (en) 2024-07-10 2024-07-10 A method for predicting drug toxicity based on drug fusion correlation data and graph convolutional network

Publications (2)

Publication Number Publication Date
CN119008034A (en) 2024-11-22
CN119008034B (en) 2025-09-16

Family

ID=93472920

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410922164.3A Active CN119008034B (en) 2024-07-10 2024-07-10 A method for predicting drug toxicity based on drug fusion correlation data and graph convolutional network

Country Status (1)

Country Link
CN (1) CN119008034B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109378081A (en) * 2018-09-27 2019-02-22 华东师范大学 A method for analyzing functional network features of breast cancer
CN113593633A (en) * 2021-08-02 2021-11-02 中国石油大学(华东) Drug-protein interaction prediction model based on convolutional neural network
CN114242186A (en) * 2021-12-30 2022-03-25 湖南大学 Method, system and storage medium for relocation of Chinese and Western medicines fused with GHP and GCN
US20220130495A1 (en) * 2021-04-06 2022-04-28 Beijing Baidu Netcom Science Technology Co., Ltd. Method and Device for Determining Correlation Between Drug and Target, and Electronic Device
CN114613425A (en) * 2022-03-10 2022-06-10 中国石油大学(华东) Drug-target interaction prediction algorithm based on graph volume and similarity
CN115458045A (en) * 2022-09-15 2022-12-09 哈尔滨工业大学 Drug pair interaction prediction method based on heterogeneous information network and recommendation system
CN117496211A (en) * 2022-07-20 2024-02-02 天津大学 Method for predicting drug liver toxicity based on histopathological images and graph representation framework

Also Published As

Publication number Publication date
CN119008034B (en) 2025-09-16

Similar Documents

Publication Publication Date Title
Liao et al. DeepDock: enhancing ligand-protein interaction prediction by a combination of ligand and structure information
Zhang et al. Predicting drug-induced liver injury in human with Naïve Bayes classifier approach
Semenova et al. A Bayesian neural network for toxicity prediction
US12260939B2 (en) Systems and methods for predicting compounds associated with transcriptional signatures
CN111627494B (en) Protein property prediction method and device based on multidimensional features and computing equipment
CN114333986A (en) Method and device for model training, drug screening and affinity prediction
CN103093108B (en) A kind of Chinese medicine system Pharmacological Analysis platform and the method for analysis
CN111933212A (en) Clinical omics data processing method and device based on machine learning
Wu et al. The role of artificial intelligence in drug screening, drug design, and clinical trials
CN115050428A (en) Drug property prediction method and system based on deep learning fusion molecular graph and fingerprint
CN112133367A (en) Method and device for predicting interaction relationship between drug and target
Yi et al. Learning representation of molecules in association network for predicting intermolecular associations
Nandhini et al. Hybrid CNN-LSTM and modified wild horse herd Model-based prediction of genome sequences for genetic disorders
CN118412146A (en) Prediction model construction method, prediction method and device for drug combination synergy
Park et al. Dual representation learning for predicting drug-side effect frequency using protein target information
CN114360637A (en) A protein-ligand affinity evaluation method based on graph attention network
Bai et al. Prediction of the antioxidant response elements' response of compound by deep learning
Shi et al. Protein complex detection with semi-supervised learning in protein interaction networks
Gousiadou et al. Development of artificial neural network models to predict the PAMPA effective permeability of new, orally administered drugs active against the coronavirus SARS-CoV-2
Mazumdar et al. Predicting renal toxicity of compounds with deep learning and machine learning methods
Sundar et al. An intelligent prediction model for target protein identification in hepatic carcinoma using novel graph theory and ann model
CN119008034B (en) A method for predicting drug toxicity based on drug fusion correlation data and graph convolutional network
Hassan et al. Dimensionality reduction methods for extracting functional networks from large‐scale CRISPR screens
Shehab et al. OPTUNA optimization for predicting chemical respiratory toxicity using ML models
Balamurugan et al. Machine Learning based Modeling of Drugs using Virtual Screening and in Silico Approach

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant