Detailed Description
The molecular structure prediction system based on the multi-modal fine-grained molecular pre-training model with prompt learning comprises a data acquisition module, a processing module and a prediction module;
the data acquisition module is used for acquiring sample data in the multi-mode molecular pre-training data set and sample data in the downstream task data set;
The processing module is used for establishing a multi-modal fine-grained molecular pre-training model MolFinePrompt based on prompt learning and obtaining a trained multi-modal fine-grained molecular pre-training model based on prompt learning;
MolFinePrompt stands for Fine-Grained Multimodal Molecular Pre-training Large Model via Prompt Learning;
The prediction module is used for predicting the attributes and the drug interaction relationships of the molecular structure to be detected based on the trained multi-modal fine-grained molecular pre-training model with prompt learning, and obtaining a prediction result.
The molecular structure to be tested and the data obtained by the data acquisition module come from the same field (the same data type refers to molecule-text pairs; for example, the molecular structure CC(=O)O and the corresponding text description: "Acetic acid is a product of the oxidation of ethanol and of the destructive distillation of wood. It is used locally, occasionally internally, as a counterirritant and also as a reagent. Acetic acid otic (for the ear) is an antibiotic that treats infections caused by bacteria or fungus.").
The second embodiment is different from the first embodiment in that the data acquisition module is configured to acquire sample data in a multi-modal molecular pre-training data set and sample data in a downstream task data set, and the specific process is as follows:
Sample data in the multimodal molecular pre-training dataset are molecular-text pairs;
Sample data in the downstream task dataset are molecular structures only (the structure has no paired text; the text input is supplied by the prompt, and the molecular structure is encoded with the trained GIN);
The sample data in the multi-modal molecular pre-training dataset and the sample data in the downstream task dataset both involve a molecular structure G and a textual description T (for downstream tasks the text is the instruction prompt);
The molecular structure G and the textual description T are drawn from 316k molecule-text pairs (k denotes thousands) collected from the public PubChem dataset;
for example, the fine-grained molecular structure graph G′ and the standardized molecular text description T′ processed from the pre-training dataset are used as the input of the MolFinePrompt model, while the fine-grained molecular structure graph G′ from the downstream task dataset together with the task-specific instruction text are used as the input of the pre-trained model, and the model outputs the result of the prediction task.
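As a concrete illustration of the two input types described above, the following is a minimal Python sketch; the class and field names are illustrative stand-ins, not taken from the source:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PretrainSample:
    smiles: str        # molecular structure, e.g. "CC(=O)O"
    description: str   # paired natural-language description T

@dataclass
class DownstreamSample:
    smiles: str                  # molecular structure only, no paired text
    prompt: str                  # task instruction text supplied at fine-tuning time
    label: Optional[int] = None  # task label, e.g. BBBP penetrable yes/no
```

The key difference is that the downstream sample carries no textual description of its own; the instruction prompt plays that role.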
Other steps and parameters are the same as in the first embodiment.
The third embodiment is different from the first or second embodiments in that the processing module is configured to establish a multimodal fine granularity molecular pre-training model based on prompt learning, and obtain a trained multimodal fine granularity molecular pre-training model based on prompt learning, where the specific process is as follows:
The multi-modal fine-grained molecular pre-training model based on prompt learning comprises a fine-grained molecular graph construction module, a molecule and text description representation learning module, a self-supervised collaborative contrastive optimization module and an instruction fine-tuning downstream task module;
The fine-granularity molecular graph construction module adopts a specific decomposition rule to decompose a molecular structure graph G in sample data in the multi-mode molecular pre-training data set into sub-structure units, and builds a fine-granularity molecular graph G' based on the sub-structure units;
The molecule and text description representation learning module is used for inputting the text description T in the sample data of the multi-modal molecular pre-training dataset into a text encoder, which outputs semantic features, and inputting the fine-grained molecular graphs G′ corresponding to 1.65 million molecular structures in the PubChem dataset into a molecular encoder, which outputs molecular feature vectors;
the self-supervision collaborative contrast optimization module trains a pre-trained molecular encoder GIN and a text encoder by adopting a contrast learning method to obtain a trained molecular encoder GIN and a trained text encoder, wherein the trained molecular encoder GIN and the trained text encoder form a multi-modal fine-granularity molecular pre-training model;
The instruction fine-tuning downstream task module is used for adopting a prompt learning guiding multi-mode fine-granularity molecular pre-training model to understand downstream tasks and obtaining a trained multi-mode fine-granularity molecular pre-training model based on prompt learning.
This improves the applicability and flexibility of the method in specific application scenarios.
On this basis, the multi-modal fine-grained molecular pre-training model is optimized by constructing fine-grained molecular structure graphs and combining text information through contrastive learning, and the model is applied to key specific tasks in the drug discovery field via the prompt learning technique.
Other steps and parameters are the same as in the first or second embodiment.
The fourth embodiment is different from the first to third embodiments in that the fine-grained molecular graph construction module adopts a specific decomposition rule to decompose the molecular structure graph G in the sample data of the multi-modal molecular pre-training dataset into sub-structural units, and constructs the fine-grained molecular graph G′ based on the sub-structural units;
the specific process is as follows:
1) Acquiring the chemical element characteristics of each atom in the molecular structure represented by the SMILES string in the sample data of the multi-modal molecular pre-training dataset based on the RDKit tool;
the chemical element characteristics of each atom include the atomic number, the connectivity, the bond types between atoms (single bond, double bond, triple bond), and the ring type of the atom (whether or not the atom is part of a ring);
Taking the connectivity characteristic as a node characteristic;
Taking the bond type among atoms and the atom ring forming type as bond characteristics;
The connectivity is the number of chemical bonds connected to the atoms;
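The connectivity feature defined above can be computed directly from the bond list. A minimal pure-Python sketch (function name is illustrative; in practice RDKit would supply these values):

```python
def connectivity(num_atoms, bonds):
    """Node feature: the number of chemical bonds connected to each atom."""
    degree = [0] * num_atoms
    for u, v in bonds:       # each bond joins atoms u and v
        degree[u] += 1
        degree[v] += 1
    return degree

# Heavy-atom graph of acetic acid CC(=O)O: atoms [C0, C1, O2, O3],
# bonds C0-C1, C1=O2, C1-O3
print(connectivity(4, [(0, 1), (1, 2), (1, 3)]))  # -> [1, 3, 1, 1]
```

The central carbon has connectivity 3, the other heavy atoms connectivity 1.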
2) Converting the molecular structure represented by the SMILES string in the sample data of the multi-modal molecular pre-training dataset into a 2D topology graph G = (V, E) based on the RDKit tool, in order to characterize the chemical nature of the molecular structure more completely;
Wherein V represents a node set of the 2D topology graph G, and E represents an edge set of the 2D topology graph G;
each node in V represents a corresponding atom in the molecule, and each edge in E is a chemical bond between atoms;
3) Decomposing the molecular structures in the sample data of the multi-modal molecular pre-training dataset based on the BRICS decomposition method to obtain sub-structural units; the specific process is as follows:
performing a preliminary decomposition of the molecular structures in the sample data of the multi-modal molecular pre-training dataset based on the BRICS decomposition method to obtain a plurality of sub-structural units after preliminary decomposition;
each of the sub-structural units after preliminary decomposition is further decomposed to obtain a plurality of minimum sub-structural units (this step is performed separately; the sub-structural units after preliminary decomposition cannot themselves be taken as the minimum sub-structural units). For example, a molecule whose SMILES string is C1=CC=C2C(=C1)N(C(=O)N2CCO)CCO can be decomposed by the BRICS method into the three sub-structural units CCO, C1=CC=C2C(=C1)N(C(=O)N2) and CCO; C1=CC=C2C(=C1)N(C(=O)N2) is further decomposed into C1=CC=C2C(=C1), C=O and C1N=CN=C1; finally, according to the decomposition rule, the molecule has the five sub-structural units CCO, C1=CC=C2C(=C1), C=O, C1N=CN=C1 and CCO;
4) Adding the sub-structural units as new nodes V_f into the topological graph;
adding the connection relations E_f between each sub-structural unit and the nodes it contains as new edges into the topological graph;
constructing an empty graph-level node V_g, and connecting the graph-level node V_g with all new nodes V_f to obtain the connection relations E_g; the fine-grained molecular graph G′ is formed based on V, V_f, V_g, E, E_f, E_g so as to mine the hidden semantic information in molecules:
G′ = (V′, E′), V′ = [V, V_f, V_g], E′ = [E, E_f, E_g].
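The construction of G′ from the atom graph, the sub-structural units and the graph-level node can be sketched in a few lines of pure Python; node indices and the function name are illustrative, not from the source:

```python
def build_fine_grained_graph(num_atoms, bonds, substructures):
    """Form G' = (V', E') with V' = [V, V_f, V_g] and E' = [E, E_f, E_g]."""
    V = list(range(num_atoms))                       # atom nodes
    E = list(bonds)                                  # atom-atom chemical bonds
    # V_f: one new node per sub-structural unit
    V_f = list(range(num_atoms, num_atoms + len(substructures)))
    # E_f: each sub-structure node is linked to the atoms it contains
    E_f = [(f, a) for f, atoms in zip(V_f, substructures) for a in atoms]
    V_g = num_atoms + len(substructures)             # empty graph-level node
    E_g = [(V_g, f) for f in V_f]                    # V_g linked to every V_f node
    return V + V_f + [V_g], E + E_f + E_g
```

For a toy 5-atom chain with bonds (0,1), (1,2), (2,3), (3,4) and two sub-structural units [0,1] and [2,3,4], this yields 8 nodes (5 atoms, 2 substructure nodes, 1 graph-level node) and 11 edges (4 bonds + 5 membership edges + 2 graph-level edges).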
Other steps and parameters are the same as in the first to third embodiments.
The fifth embodiment is different from the first to fourth embodiments in that the molecule and text description representation learning module is used for inputting the text description T in the sample data of the multi-modal molecular pre-training dataset into a text encoder, which outputs semantic features, inputting the fine-grained molecular graphs G′ corresponding to 1.65 million molecular structures in the PubChem dataset into the molecular encoder, which outputs molecular feature vectors, and training the molecular encoder GIN to obtain the pre-trained molecular encoder GIN;
the specific process is as follows:
Step 1), changing the molecular names in the texts of the molecule-text pairs in the sample data of the multi-modal molecular pre-training dataset into the unified format "The molecule is ...";
inputting the text with the unified format "The molecule is ..." into the text encoder, which outputs semantic features;
The text encoder is SciBERT based on the BERT architecture;
The text description serves as supplementary knowledge for the molecular graph and summarizes, in natural language form, the functions, properties, etc. related to the molecule. To prevent leakage of the collected molecular text data and to eliminate the bias introduced by molecular names, we change the molecular names in the 316k molecule-text pairs to the unified format "The molecule is ..." to enhance the generalization and interpretability of the model, so that it can concentrate on the inherent links between molecular structure and properties rather than relying on specific molecular names.
SciBERT provides more abundant and accurate semantic domain knowledge for molecules by pre-training on a large number of corpora in the fields of biochemistry, medicine and the like.
Step 2), inputting the node features and bond features corresponding to 1.65 million molecular structures in the PubChem dataset, together with the fine-grained molecular graphs G′ corresponding to those structures, into the molecular encoder GIN, which outputs feature vectors;
the molecular encoder GIN sequentially comprises an input layer, hidden layers and an output layer;
Optimizing the molecular encoder GIN by adopting a self-supervision learning method of a generating task and a predicting task until convergence to obtain a pre-trained molecular encoder GIN;
The generation task is completed using the feature vectors corresponding to V and E;
the prediction task is completed using the feature vectors corresponding to V_g and E_g;
the generation tasks are predicting the connectivity of atoms, the atom type corresponding to the atomic number (e.g., hydrogen, carbon and oxygen belong to different atom types), and the bond types between atoms;
the prediction tasks are predicting the number of atoms in the molecular structure (how many atoms the molecule contains) and the number of bonds (how many chemical bonds the molecule contains);
The first layer in the molecular encoder GIN receives the original node features as input; the 2nd to 4th layers are hidden layers, each of which takes the output of the previous layer as input, aggregates the features of neighbouring nodes, and updates the node representations through a learnable multi-layer perceptron (MLP); the node features after adding residual connections (adding the node representation of the current layer to that of the previous layer) are the output of each layer;
For molecular structure processing, we employ a Graph Isomorphism Network (GIN) to encode the topology; GIN is a Graph Neural Network (GNN) variant widely used in the field of molecular representation learning and is distinguished by its excellent ability to capture and express the topological features of molecules.
The 1.65 million molecular structures in the PubChem dataset are processed with the RDKit tool to obtain node features and bond features, which are input into the five-layer GIN model; the GIN model outputs feature vectors, and the molecular encoder GIN is optimized with the two self-supervised learning tasks (the generation task and the prediction task) until convergence, yielding the pre-trained molecular encoder GIN model;
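The per-layer update described above (neighbour aggregation, MLP, residual connection) can be sketched as follows. This is a minimal pure-Python illustration using scalar node features and scalar MLP weights for readability; a real GIN operates on feature vectors with learned weight matrices:

```python
def gin_layer(features, neighbours, eps=0.0, w1=1.0, w2=1.0):
    """One GIN update with a residual connection, on scalar node features."""
    def mlp(x):  # two-layer perceptron with ReLU, scalar weights for brevity
        return max(0.0, x * w1) * w2
    out = []
    for v, h_v in enumerate(features):
        # aggregate: (1 + eps) * own feature + sum of neighbour features
        agg = (1.0 + eps) * h_v + sum(features[u] for u in neighbours[v])
        out.append(mlp(agg) + h_v)  # residual: add the previous representation
    return out

# path graph 0-1-2 with features [1, 2, 3]
print(gin_layer([1.0, 2.0, 3.0], [[1], [0, 2], [1]]))  # -> [4.0, 8.0, 8.0]
```

Stacking several such layers (as in the five-layer encoder above) lets each node see progressively larger neighbourhoods.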
Our model is able to learn and generalize more accurately the structural information of the molecules.
Other steps and parameters are the same as in the first to fourth embodiments.
The sixth embodiment is different from the first to fifth embodiments in that the self-supervision cooperative contrast optimization module trains the pre-trained molecular encoder GIN and text encoder by adopting a contrast learning method to obtain the trained molecular encoder GIN and text encoder, wherein the trained molecular encoder GIN and text encoder form a multi-modal fine-granularity molecular pre-training model;
the specific process is as follows:
The training set (molecule-text pair) is sample data within the multimodal molecule pre-training data set;
The loss function is the InfoNCE loss function;
the optimization method is contrastive learning;
The feature vector corresponding to V_g and E_g among the molecular feature vectors output by the pre-trained molecular encoder GIN is used as the feature vector representation of the molecule.
Training the text encoder and the pre-trained molecular encoder GIN until convergence to obtain a trained text encoder and a trained molecular encoder GIN;
The trained text encoder and the trained molecular encoder GIN form the multi-modal fine-grained molecular pre-training model, so as to achieve optimal alignment between molecular structural features and semantic features;
The InfoNCE loss function is:
L = -(1/N) Σ_{i=1}^{N} log( exp(sim(h_g^i, h_t^i)/τ) / Σ_{j=1}^{N} exp(sim(h_g^i, h_t^j)/τ) )
where h_g^i is the molecular (graph) feature vector output by the pre-trained molecular encoder GIN for the molecule in the i-th sample pair;
h_t^i is the text feature vector output by the text encoder for the text in the i-th sample pair;
h_t^j is the text feature vector output by the text encoder for the text in the j-th sample pair;
sim(·,·) is a similarity function that measures the similarity between the two modal features, τ is a temperature parameter, and N is the total number of molecule-text pairs.
A multi-modal contrastive learning method is adopted to enhance the semantic relevance between molecules and text. By continuously optimizing the model, the method improves the matching quality of molecules and texts while ensuring that unmatched molecule-text pairs keep a certain distance in the embedding space. N molecule-text pairs (g_1, t_1), (g_2, t_2), …, (g_N, t_N) are contrasted, where g_i and t_i are the i-th molecule and its corresponding text description, respectively;
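The contrastive objective above can be sketched in pure Python with cosine similarity as sim(·,·); function names are illustrative, and scalar loops stand in for the batched tensor operations used in practice:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def info_nce(mol_feats, txt_feats, tau=0.1):
    """InfoNCE over N molecule-text pairs: the i-th molecule is pulled toward
    the i-th text, while all N texts serve as candidates (negatives for j != i)."""
    n = len(mol_feats)
    loss = 0.0
    for i in range(n):
        scores = [math.exp(cosine(mol_feats[i], txt_feats[j]) / tau) for j in range(n)]
        loss -= math.log(scores[i] / sum(scores))
    return loss / n
```

With correctly matched pairs the loss is small; shuffling the text side against the molecule side increases it, which is exactly the signal that pulls matched pairs together and pushes unmatched pairs apart.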
verifying a multimodal fine-grained molecular pre-training model:
the zero-shot retrieval task for new molecular structures/texts based on the multi-modal fine-grained molecular pre-training model proceeds as follows:
Acquiring the new molecular structure-text pair datasets PCDes and MoMu (new molecular structure-text pairs);
freezing the parameters of the multi-modal fine-grained molecular pre-training model and performing the zero-shot molecular structure/text retrieval task;
in the zero-shot molecular structure retrieval task, the new molecular structure-text pairs are input into the multi-modal fine-grained molecular pre-training model to obtain the molecular structure features h_g and the semantic features h_t, and the cosine similarity between the molecular structure feature vectors h_g and the semantic feature vectors h_t is computed to obtain a cosine similarity matrix, where each row contains the similarities between one molecular structure feature vector and all semantic feature vectors;
for each molecular structure representation, the index of its most similar text representation is determined and compared with the index of the correct text representation (the first text corresponds to the first molecular structure, the second text to the second molecular structure, and so on); if the indexes match, the retrieval is considered successful, and the retrieval accuracy is calculated;
for each molecular structure feature vector, the semantic feature vectors are sorted by similarity value and the most similar items are searched; the retrieval is counted as successful if the correct item appears among the top 20 most similar items after sorting, and the proportion of correct retrievals is calculated to obtain the average recall rate.
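The two evaluation metrics above (top-1 retrieval accuracy and recall among the top-k items) can be computed from the cosine similarity matrix as follows; this is a minimal sketch with an illustrative function name, assuming matching molecule/text items share the same index:

```python
def retrieval_metrics(sim_matrix, k=20):
    """Rows: molecular structure features; columns: semantic (text) features.
    Returns (top-1 retrieval accuracy, recall at k)."""
    top1 = topk = 0
    for i, row in enumerate(sim_matrix):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        top1 += ranked[0] == i   # most similar text is the correct one
        topk += i in ranked[:k]  # correct text within the top k items
    n = len(sim_matrix)
    return top1 / n, topk / n
```

For instance, a 2×2 similarity matrix with the largest value on the diagonal of each row gives accuracy 1.0 and recall 1.0.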
Other steps and parameters are the same as in one of the first to fifth embodiments.
The seventh embodiment is different from the first to sixth embodiments in that the instruction fine-tuning downstream task module is used for adopting prompt learning to guide the multi-modal fine-grained molecular pre-training model to understand downstream tasks, so as to obtain the trained multi-modal fine-grained molecular pre-training model based on prompt learning;
the specific process is as follows:
In order to improve the performance of the model on specific molecular prediction tasks, expertise-guided prompts are combined with parameter-efficient fine-tuning of the model to adapt it to the specific molecular prediction task. Specifically, initialization prompts are first customized for the different tasks, enabling the model to understand the task goals and requirements. By designing specialized prompts for the attribute prediction and drug-drug interaction (DDI) tasks, rich background and target information is provided, helping the model understand the related biological and chemical principles and ensuring that the model can identify and utilize key molecular features.
1) Acquiring the downstream task training sets and constructing instruction prompt texts; the specific process is as follows:
11) Acquiring the attribute prediction tasks in the training set within the sample data of the downstream task dataset (the structure only, without text; the text is supplied by the prompt, and the molecular structure is encoded with the trained GIN);
The attribute prediction task comprises six subtasks: BBBP, BACE, SIDER, Tox21, ToxCast and HIV; each subtask corresponds to one dataset, i.e., the attribute prediction task comprises six datasets;
BBBP is blood-brain barrier penetrability;
BACE is beta-amyloid precursor protein cleaving enzyme;
SIDER refers to drug side effects;
Tox21 is the Toxicology in the 21st Century testing program;
ToxCast is a multi-year toxicology prediction research project initiated by the United States Environmental Protection Agency (EPA);
HIV is the human immunodeficiency virus;
The BBBP dataset comprises 2039 molecular structures and corresponding attribute tags;
The BACE dataset comprises 1513 molecular structures and corresponding attribute tags;
The Sider dataset comprises 1427 molecular structures and corresponding attribute tags;
The Tox21 dataset comprises 7831 molecular structures and corresponding attribute tags;
the ToxCast dataset comprises 8576 molecular structures and corresponding attribute tags;
The HIV dataset contains 41127 molecular structures and corresponding attribute tags;
12) Acquiring the drug interaction tasks in the training set within the sample data of the downstream task dataset (the structure only, without text; the text is supplied by the prompt, and the molecular structure is encoded with the trained GIN);
The drug interaction task comprises three subtasks: ZhangDDI, ChChMiner and DeepDDI; each subtask corresponds to one dataset, i.e., the drug interaction task comprises three datasets;
The ZhangDDI dataset contains 48548 molecular structure pairs (two molecular structures) and corresponding drug interaction relationship tags;
The ChChMiner dataset contains 48514 molecular structure pairs (two molecular structures each) and corresponding drug interaction relationship tags;
The DeepDDI dataset contains 192284 molecular structure pairs (two molecular structures) and corresponding drug interaction relationship tags;
13) Constructing an instruction prompt text according to the task goal, task background, guiding principles and task requirements of each subtask;
Taking BBBP tasks as an example:
For the BBBP (blood-brain barrier penetrability) task, the task goal, task background, guiding principles and task requirements are designed, and the instruction prompt is: "This task is BBBP, our objective is to predict whether a drug molecule can penetrate the blood-brain barrier, which is composed of brain capillary endothelial cells. The blood-brain barrier is highly selective, allowing only certain substances to pass through. We need to analyze the following molecular structural characteristics: lipophilicity, molecular weight, charge state, protein binding capacity, presence of hydrophobic groups, activity of metabolic products, and the cyclic structure of the molecule. If the molecule has high lipophilicity, a small molecular weight, lacks charge, binds minimally with proteins, has hydrophobic groups, no active metabolites, and is a polycyclic compound, it is more likely to penetrate the blood-brain barrier. Please use these guiding principles to determine whether this molecule has the capability to do so."
The BACE (beta-amyloid precursor protein cleaving enzyme) task prompt is: "This task is BACE, our goal is to predict whether a drug molecule can effectively inhibit BACE enzyme, which is a key target for the treatment of Alzheimer's disease. The activity of BACE enzyme is closely related to the production of beta-amyloid protein. We need to analyze the following molecular structural characteristics: binding affinity, molecular weight, lipophilicity, hydrogen bond capability, stereoisomerism, and metabolic stability. If the molecule demonstrates strong binding affinity with the active site of BACE enzyme, possesses key chemical groups to form necessary hydrogen bonds, and is stable under physiological conditions with good cell membrane permeability, it may be an effective BACE inhibitor. Please use these guiding principles to analyze and determine whether this molecule has the potential for BACE inhibition.";
The SIDER (side effect) task prompt is: "This task is SIDER, our goal is to predict the potential side effects that a drug molecule might cause. Predicting drug side effects is crucial for the safety assessment of medications. We need to analyze the following molecular structural characteristics: pharmacological mechanisms of action such as receptor binding or enzyme inhibition, properties like molecular weight, lipophilicity, and solubility, and pharmacokinetic characteristics such as absorption, distribution, metabolism, and excretion. Please use these guidelines to analyze and predict the potential side effects that this drug molecule may cause."
The Tox21 (Toxicology in the 21st Century) task prompt is: "This task is TOX21, our goal is to predict whether a pharmaceutical molecule might induce a range of biological effects associated with toxicity. The TOX21 project aims to identify the potential toxicity of chemical substances, including hormone disruption, genotoxicity, and more. We need to analyze molecular structural characteristics: the presence of aromatic rings, the balance between lipophilicity and hydrophilicity, metabolic stability, and permeability. If a molecule affects the pharmacological action of known toxicity-related targets, shows metabolic instability, or is likely to produce toxic metabolites, or exhibits poor cellular permeability, then it may have TOX21 properties. Please use these guiding principles to predict whether the pharmaceutical molecule has TOX21 characteristics."
The ToxCast (a multi-year toxicology prediction research project initiated by the Environmental Protection Agency (EPA)) task prompt is: "This task is ToxCast, our goal is to assess the potential toxicity of chemical substances to biological systems. The ToxCast project is a large-scale, high-throughput screening project that uses a variety of biological tests to predict the toxicity of chemical substances. We need to analyze the following molecular structural characteristics: molecular weight, solubility, lipophilicity, etc. If a molecule has a large molecular weight, low solubility, high lipophilicity, and active metabolites, it is more likely to exhibit specific toxicity. Please judge whether this molecule has potential toxicity based on these guidelines."
The HIV (human immunodeficiency virus) task prompt is: "This task is HIV, our goal is to assess the potential inhibitory effect of drug molecules on HIV. HIV is a virus that attacks the human immune system, particularly affecting CD4+ T cells, leading to Acquired Immunodeficiency Syndrome (AIDS). We need to analyze the following molecular structural characteristics: lipophilicity, molecular weight, charge state, protein binding capacity, hydrophobic groups, and the activity of metabolic products. If the molecule has an optimized chemical structure targeting HIV's key proteins, a smaller molecular weight, strong protein binding capacity, hydrophobic groups, and good metabolic stability, then it is more likely to have an inhibitory effect on HIV. Please judge whether this drug molecule has antiviral potential."
The drug interaction task prompt is: "This task is Drug-Drug Interaction, which refers to the phenomenon where the simultaneous use of two or more drugs in the body can lead to an enhancement or reduction in their efficacy, or even cause adverse reactions, due to the mutual influence between the drugs. Determine if the interaction between these two drugs is positive or not." For the drug interaction task, only the drug types differ, and the model only needs to determine whether an interaction exists between each drug pair.
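Assembling an instruction prompt from the four designed components can be sketched as follows; the function name and the abbreviated component strings are illustrative, not the exact prompts from the source:

```python
def build_instruction_prompt(task_name, goal, background, principles, requirement):
    """Concatenate task goal, background, guiding principles and requirement
    into one instruction prompt string."""
    return f"This task is {task_name}, {goal} {background} {principles} {requirement}"

bbbp_prompt = build_instruction_prompt(
    "BBBP",
    "our objective is to predict whether a drug molecule can penetrate the blood-brain barrier,",
    "which is composed of brain capillary endothelial cells.",
    "If the molecule has high lipophilicity and a small molecular weight, it is more likely to penetrate.",
    "Please use these guiding principles to determine whether this molecule has the capability to do so.",
)
```

Each subtask supplies its own four components, so a single template covers both the attribute prediction and the drug interaction prompts.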
2) Obtaining the feature vector h_p of the prompt text based on the instruction prompt text constructed in step 1);
3) Performing attribute prediction on a molecular structure (without attribute tags) based on the multi-modal fine-grained molecular pre-training model and the feature vector of the prompt text;
4) Predicting the drug interaction relationship of a molecular structure pair (without interaction labels) based on the multi-modal fine-grained molecular pre-training model and the feature vector of the prompt text;
5) Repeating the steps 1) to 4) until convergence, and obtaining a trained multi-mode fine-granularity molecular pre-training model based on prompt learning.
Other steps and parameters are the same as those of one of embodiments one to six.
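Steps 1) to 5) above form an iterate-until-convergence fine-tuning loop. The following hypothetical driver shows only that control flow; the loss function is a deterministic stand-in, not the patent's combined attribute and drug-interaction objective:

```python
def epoch_loss(epoch):
    """Stand-in for the combined attribute (step 3) and DDI (step 4) loss."""
    return 1.0 / (epoch + 1)

losses, prev = [], float("inf")
for epoch in range(1000):            # step 5): repeat until convergence
    loss = epoch_loss(epoch)
    losses.append(loss)
    if prev - loss < 1e-3:           # convergence criterion (assumed)
        break
    prev = loss

print(losses[-1] < losses[0])   # True: training stops once improvement stalls
```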
The eighth embodiment is different from one of the first to seventh embodiments in that in step 2), the feature vector h_p of the instruction prompt text is obtained based on the instruction prompt text constructed in step 1); the specific process is as follows:
creating a prompt embedding layer P ∈ R^(l×d), wherein l is the number of virtual tokens and d is the embedding dimension;
inputting the instruction prompt text constructed in step 1) into the tokenizer corresponding to the trained text encoder, the tokenizer outputting a token-ID sequence;
inputting the token-ID sequence into the trained text encoder, the word embedding layer in the trained text encoder outputting word embedding vectors;
using the word embedding vectors as the initial weights of the created prompt embedding layer P, obtaining an initialized prompt embedding layer;
and inputting the virtual tokens in the initialized prompt embedding layer into the trained text encoder, the trained text encoder outputting the feature vector h_p corresponding to the instruction prompt text.
Each subtask corresponds to a number of molecular structures; each molecular structure corresponds to one h_p, and all h_p within one subtask are identical.
Other steps and parameters are the same as those of one of embodiments one to seven.
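The prompt-embedding procedure of step 2) can be sketched as follows. This is a toy illustration: the tokenizer, embedding table, and mean-pooling encoder are stand-ins for the patent's trained text encoder, not its actual components:

```python
import random

random.seed(0)
VOCAB = {"this": 0, "task": 1, "is": 2, "hiv": 3, "bbbp": 4}
D = 8                                                # embedding dimension d
WORD_EMB = [[random.gauss(0, 1) for _ in range(D)] for _ in VOCAB]

def tokenize(text):
    """Stand-in tokenizer: map an instruction prompt to a token-ID sequence."""
    return [VOCAB[w] for w in text.lower().split() if w in VOCAB]

def init_prompt_embedding(text):
    """Create P in R^(l x d), initialised from the encoder's word embeddings."""
    return [WORD_EMB[i][:] for i in tokenize(text)]  # l virtual tokens

def text_encoder(P):
    """Stand-in text encoder: mean-pool the virtual tokens into h_p."""
    l = len(P)
    return [sum(row[j] for row in P) / l for j in range(D)]

P = init_prompt_embedding("This task is HIV")
h_p = text_encoder(P)
print(len(P), len(P[0]), len(h_p))   # 4 8 8
```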
The ninth embodiment is different from one of the first to eighth embodiments in that in step 3), attribute prediction is performed on a molecular structure (without an attribute label) based on the multimodal fine-grained molecular pre-training model and the feature vector of the prompt text; the specific process is as follows:
31) Inputting a BBBP (blood-brain barrier penetrability) molecular structure (without an attribute label) into the trained molecular encoder GIN in the multimodal fine-grained molecular pre-training model, the trained molecular encoder GIN outputting a structural feature vector h_g1;
obtaining a feature vector h_p1 of the prompt text according to step 2), based on the BBBP instruction prompt text constructed in step 1);
fusing the feature vector h_p1 of the prompt text with the structural feature vector h_g1 to obtain the prompt-guided attribute feature representation corresponding to the BBBP molecular structure;
wherein
h_property1 is the prompt-guided attribute feature representation corresponding to the BBBP molecular structure,
α is a weight parameter, and ⊕ denotes splicing (concatenation);
inputting the prompt-guided attribute feature representation corresponding to the BBBP molecular structure into a classifier for classification (obtaining the attribute);
32) Inputting a BACE (β-site amyloid precursor protein cleaving enzyme) molecular structure (without an attribute label) into the trained molecular encoder GIN in the multimodal fine-grained molecular pre-training model, the trained molecular encoder GIN outputting a structural feature vector h_g2;
obtaining a feature vector h_p2 of the prompt text according to step 2), based on the BACE instruction prompt text constructed in step 1);
fusing the feature vector h_p2 of the prompt text with the structural feature vector h_g2 to obtain the prompt-guided attribute feature representation corresponding to the BACE molecular structure;
wherein
h_property2 is the prompt-guided attribute feature representation corresponding to the BACE molecular structure,
α is a weight parameter, and ⊕ denotes splicing (concatenation);
inputting the prompt-guided attribute feature representation corresponding to the BACE molecular structure into a classifier for classification (obtaining the attribute);
33) Inputting a SIDER (side-effect resource) molecular structure (without an attribute label) into the trained molecular encoder GIN in the multimodal fine-grained molecular pre-training model, the trained molecular encoder GIN outputting a structural feature vector h_g3;
obtaining a feature vector h_p3 of the prompt text according to step 2), based on the SIDER instruction prompt text constructed in step 1);
fusing the feature vector h_p3 of the prompt text with the structural feature vector h_g3 to obtain the prompt-guided attribute feature representation corresponding to the SIDER molecular structure;
wherein
h_property3 is the prompt-guided attribute feature representation corresponding to the SIDER molecular structure,
α is a weight parameter, and ⊕ denotes splicing (concatenation);
inputting the prompt-guided attribute feature representation corresponding to the SIDER molecular structure into a classifier for classification (obtaining the attribute);
34) Inputting a Tox21 (Toxicology in the 21st Century) molecular structure (without an attribute label) into the trained molecular encoder GIN in the multimodal fine-grained molecular pre-training model, the trained molecular encoder GIN outputting a structural feature vector h_g4;
obtaining a feature vector h_p4 of the prompt text according to step 2), based on the Tox21 instruction prompt text constructed in step 1);
fusing the feature vector h_p4 of the prompt text with the structural feature vector h_g4 to obtain the prompt-guided attribute feature representation corresponding to the Tox21 molecular structure;
wherein
h_property4 is the prompt-guided attribute feature representation corresponding to the Tox21 molecular structure,
α is a weight parameter, and ⊕ denotes splicing (concatenation);
inputting the prompt-guided attribute feature representation corresponding to the Tox21 molecular structure into a classifier for classification (obtaining the attribute);
35) Inputting a ToxCast (a multi-year toxicology prediction research project started by the United States Environmental Protection Agency (EPA)) molecular structure (without an attribute label) into the trained molecular encoder GIN in the multimodal fine-grained molecular pre-training model, the trained molecular encoder GIN outputting a structural feature vector h_g5;
obtaining a feature vector h_p5 of the prompt text according to step 2), based on the ToxCast instruction prompt text constructed in step 1);
fusing the feature vector h_p5 of the prompt text with the structural feature vector h_g5 to obtain the prompt-guided attribute feature representation corresponding to the ToxCast molecular structure;
wherein
h_property5 is the prompt-guided attribute feature representation corresponding to the ToxCast molecular structure,
α is a weight parameter, and ⊕ denotes splicing (concatenation);
inputting the prompt-guided attribute feature representation corresponding to the ToxCast molecular structure into a classifier for classification (obtaining the attribute);
36) Inputting an HIV (human immunodeficiency virus) molecular structure (without an attribute label) into the trained molecular encoder GIN in the multimodal fine-grained molecular pre-training model, the trained molecular encoder GIN outputting a structural feature vector h_g6;
obtaining a feature vector h_p6 of the prompt text according to step 2), based on the HIV instruction prompt text constructed in step 1);
fusing the feature vector h_p6 of the prompt text with the structural feature vector h_g6 to obtain the prompt-guided attribute feature representation corresponding to the HIV molecular structure;
wherein
h_property6 is the prompt-guided attribute feature representation corresponding to the HIV molecular structure,
α is a weight parameter, and ⊕ denotes splicing (concatenation);
inputting the prompt-guided attribute feature representation corresponding to the HIV molecular structure into a classifier for classification (obtaining the attribute).
Each subtask corresponds to a data set; each data set comprises a number of molecular structures and their corresponding attribute labels; each molecular structure corresponds to one h_p, and all h_p within one subtask are identical.
Other steps and parameters are the same as those of one of embodiments one to eight.
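Substeps 31) to 36) all follow one pattern: encode the molecule with the trained GIN, fetch the subtask's shared h_p, fuse the two vectors, and classify. A minimal sketch of that pattern, assuming (since the exact formula is not reproduced in the text) that fusion weights the two vectors by α and 1−α before splicing; all encoders and the classifier head are toy stand-ins:

```python
import random

random.seed(2)
D = 8

def gin_encoder(mol):
    """Stand-in for the trained molecular encoder GIN."""
    return [random.gauss(0, 1) for _ in range(D)]

def fuse(h_p, h_g, alpha=0.5):
    """Assumed fusion: weight h_p and h_g by alpha / (1 - alpha), then splice."""
    return [alpha * x for x in h_p] + [(1 - alpha) * x for x in h_g]

def classify(h, W):
    """Toy linear head: one logit per class row of W."""
    return [sum(hi * wi for hi, wi in zip(h, row)) for row in W]

W = [[random.gauss(0, 1) for _ in range(2 * D)] for _ in range(2)]  # 2 classes
subtasks = {"BBBP": ["mol_a", "mol_b"], "HIV": ["mol_c"]}           # toy data

logits = {}
for task, molecules in subtasks.items():
    h_p = [random.gauss(0, 1) for _ in range(D)]  # one shared h_p per subtask
    for mol in molecules:
        h_property = fuse(h_p, gin_encoder(mol))
        logits[(task, mol)] = classify(h_property, W)

print(len(logits), len(logits[("BBBP", "mol_a")]))   # 3 2
```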
The tenth embodiment is different from one of the first to ninth embodiments in that in step 4), the drug-interaction relationship of a molecular structure pair (without an interaction label) is predicted based on the multimodal fine-grained molecular pre-training model and the feature vector of the prompt text; the specific process is as follows:
41) Inputting a ZhangDDI molecular structure pair (without an interaction label) into the trained molecular encoder GIN in the multimodal fine-grained molecular pre-training model, the trained molecular encoder GIN outputting structural feature vectors h_g7 and h_g8;
obtaining a feature vector h_p7 of the prompt text according to step 2), based on the ZhangDDI instruction prompt text constructed in step 1);
splicing the structural feature vectors h_g7 and h_g8 with the feature vector h_p7 of the prompt text to obtain the prompt-guided drug-interaction feature representation corresponding to the ZhangDDI molecular structure pair;
wherein
h_ddi1 is the prompt-guided drug-interaction feature representation corresponding to the ZhangDDI molecular structure pair,
and ⊕ denotes splicing (concatenation);
inputting the prompt-guided drug-interaction feature representation corresponding to the ZhangDDI molecular structure pair into a classifier for classification (obtaining the interaction relationship);
42) Inputting a CHCHMINER molecular structure pair (without an interaction label) into the trained molecular encoder GIN in the multimodal fine-grained molecular pre-training model, the trained molecular encoder GIN outputting structural feature vectors h_g9 and h_g10;
obtaining a feature vector h_p8 of the prompt text according to step 2), based on the CHCHMINER instruction prompt text constructed in step 1);
splicing the structural feature vectors h_g9 and h_g10 with the feature vector h_p8 of the prompt text to obtain the prompt-guided drug-interaction feature representation corresponding to the CHCHMINER molecular structure pair;
wherein
h_ddi2 is the prompt-guided drug-interaction feature representation corresponding to the CHCHMINER molecular structure pair;
inputting the prompt-guided drug-interaction feature representation corresponding to the CHCHMINER molecular structure pair into a classifier for classification (obtaining the interaction relationship);
43) Inputting a DeepDDI molecular structure pair (without an interaction label) into the trained molecular encoder GIN in the multimodal fine-grained molecular pre-training model, the trained molecular encoder GIN outputting structural feature vectors h_g11 and h_g12;
obtaining a feature vector h_p9 of the prompt text according to step 2), based on the DeepDDI instruction prompt text constructed in step 1);
splicing the structural feature vectors h_g11 and h_g12 with the feature vector h_p9 of the prompt text to obtain the prompt-guided drug-interaction feature representation corresponding to the DeepDDI molecular structure pair;
wherein
h_ddi3 is the prompt-guided drug-interaction feature representation corresponding to the DeepDDI molecular structure pair;
inputting the prompt-guided drug-interaction feature representation corresponding to the DeepDDI molecular structure pair into a classifier for classification (obtaining the interaction relationship).
Each subtask corresponds to a data set; each data set comprises a number of molecular structure pairs (two molecular structures each) and the corresponding drug-interaction labels; each molecular structure pair corresponds to one h_p, and all h_p within one subtask are identical.
Other steps and parameters are the same as those of one of embodiments one to nine.
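For the drug-interaction subtasks, the splicing step differs from the attribute case in that both structural vectors of the pair are concatenated with the prompt vector. A toy sketch (random stand-in vectors; the classifier head is illustrative):

```python
import random

random.seed(3)
D = 8
h_g7 = [random.gauss(0, 1) for _ in range(D)]   # GIN output, drug 1
h_g8 = [random.gauss(0, 1) for _ in range(D)]   # GIN output, drug 2
h_p7 = [random.gauss(0, 1) for _ in range(D)]   # prompt feature, ZhangDDI task

h_ddi1 = h_g7 + h_g8 + h_p7                     # spliced DDI representation
W = [[random.gauss(0, 1) for _ in range(3 * D)] for _ in range(2)]  # 2 classes
logits = [sum(h * w for h, w in zip(h_ddi1, row)) for row in W]
print(len(h_ddi1), len(logits))   # 24 2
```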
The present invention is capable of other and further embodiments, and its several details are capable of modification and variation, as will be apparent to those skilled in the art, without departing from the spirit and scope of the invention as defined in the appended claims.