Detailed Description
The molecular structure prediction system based on the multi-modal fine-grained molecular pre-training model with prompt learning comprises a data acquisition module, a processing module and a prediction module;
the data acquisition module is used for acquiring sample data in the multi-mode molecular pre-training data set and sample data in the downstream task data set;
The processing module is used for establishing a multi-modal fine-grained molecular pre-training model MolFinePrompt based on prompt learning and obtaining a trained multi-modal fine-grained molecular pre-training model based on prompt learning;
MolFinePrompt stands for Fine-Grained Multimodal Molecular Pre-training Large Model via Prompt Learning;
The prediction module is used for predicting the attributes and the drug interaction relationships of the molecular structure to be detected based on the trained multi-modal fine-grained molecular pre-training model with prompt learning, and obtaining a prediction result.
The molecular structure to be tested and the data obtained by the data acquisition module come from the same field (the same data type refers to molecule-text pairs; for example, the molecular structure CC(=O)O and the corresponding text description: "Acetic acid is a product of the oxidation of ethanol and of the destructive distillation of wood. It is used locally, occasionally internally, as a counterirritant and also as a reagent. Acetic acid otic (for the ear) is an antibiotic that treats infections caused by bacteria or fungus.").
The second embodiment is different from the first embodiment in that the data acquisition module is configured to acquire sample data in a multi-modal molecular pre-training data set and sample data in a downstream task data set, and the specific process is as follows:
Sample data in the multimodal molecular pre-training dataset are molecular-text pairs;
Sample data in the downstream task dataset are molecular structures only (the structure has no paired text; the text input is supplied by the prompt, and the molecular structure is encoded with the trained GIN);
The sample data in the multi-modal molecular pre-training dataset and the sample data in the downstream task dataset both involve a molecular structure G and a textual description T (for downstream tasks the text is the instruction prompt);
The molecular structure G and the textual description T are drawn from 316k molecule-text pairs (k denotes thousands) collected from the public PubChem dataset;
for example, the fine-grained molecular structure graph G′ and the standardized molecular text description T′ processed from the pre-training dataset are used as the input of the MolFinePrompt model, while the fine-grained molecular structure graph G′ from the downstream task dataset together with the task-specific instruction text are used as the input of the pre-trained model, and the model outputs the result of the prediction task.
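As a concrete illustration of the two input types described above, the following is a minimal Python sketch; the class and field names are illustrative stand-ins, not taken from the source:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PretrainSample:
    smiles: str        # molecular structure, e.g. "CC(=O)O"
    description: str   # paired natural-language description T

@dataclass
class DownstreamSample:
    smiles: str                  # molecular structure only, no paired text
    prompt: str                  # task instruction text supplied at fine-tuning time
    label: Optional[int] = None  # task label, e.g. BBBP penetrable yes/no
```

The key difference is that the downstream sample carries no textual description of its own; the instruction prompt plays that role.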
Other steps and parameters are the same as in the first embodiment.
The third embodiment is different from the first or second embodiments in that the processing module is configured to establish a multimodal fine granularity molecular pre-training model based on prompt learning, and obtain a trained multimodal fine granularity molecular pre-training model based on prompt learning, where the specific process is as follows:
The multi-modal fine-grained molecular pre-training model based on prompt learning comprises a fine-grained molecular graph construction module, a molecule and text description representation learning module, a self-supervised collaborative contrastive optimization module and an instruction fine-tuning downstream task module;
The fine-granularity molecular graph construction module adopts a specific decomposition rule to decompose a molecular structure graph G in sample data in the multi-mode molecular pre-training data set into sub-structure units, and builds a fine-granularity molecular graph G' based on the sub-structure units;
The molecule and text description representation learning module is used for inputting the text description T in the sample data of the multi-modal molecular pre-training dataset into a text encoder, which outputs semantic features, and inputting the fine-grained molecular graphs G′ corresponding to 1.65 million molecular structures in the PubChem dataset into a molecular encoder, which outputs molecular feature vectors;
the self-supervision collaborative contrast optimization module trains a pre-trained molecular encoder GIN and a text encoder by adopting a contrast learning method to obtain a trained molecular encoder GIN and a trained text encoder, wherein the trained molecular encoder GIN and the trained text encoder form a multi-modal fine-granularity molecular pre-training model;
The instruction fine-tuning downstream task module is used for adopting a prompt learning guiding multi-mode fine-granularity molecular pre-training model to understand downstream tasks and obtaining a trained multi-mode fine-granularity molecular pre-training model based on prompt learning.
This improves the applicability and flexibility of the method in specific application scenarios.
On this basis, the multi-modal fine-grained molecular pre-training model is optimized by constructing fine-grained molecular structure graphs and combining text information through contrastive learning, and the model is applied to key specific tasks in the drug discovery field via the prompt learning technique.
Other steps and parameters are the same as in the first or second embodiment.
The fourth embodiment is different from the first to third embodiments in that the fine-grained molecular graph construction module adopts a specific decomposition rule to decompose the molecular structure graph G in the sample data of the multi-modal molecular pre-training dataset into sub-structural units, and constructs the fine-grained molecular graph G′ based on the sub-structural units;
the specific process is as follows:
1) Acquiring the chemical element characteristics of each atom in the molecular structure represented by the SMILES string in the sample data of the multi-modal molecular pre-training dataset based on the RDKit tool;
the chemical element characteristics of each atom include the atomic number, the connectivity, the bond types between atoms (single bond, double bond, triple bond), and the ring type of the atom (whether or not the atom is part of a ring);
Taking the connectivity characteristic as a node characteristic;
Taking the bond type among atoms and the atom ring forming type as bond characteristics;
The connectivity is the number of chemical bonds connected to the atoms;
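The connectivity feature defined above can be computed directly from the bond list. A minimal pure-Python sketch (function name is illustrative; in practice RDKit would supply these values):

```python
def connectivity(num_atoms, bonds):
    """Node feature: the number of chemical bonds connected to each atom."""
    degree = [0] * num_atoms
    for u, v in bonds:       # each bond joins atoms u and v
        degree[u] += 1
        degree[v] += 1
    return degree

# Heavy-atom graph of acetic acid CC(=O)O: atoms [C0, C1, O2, O3],
# bonds C0-C1, C1=O2, C1-O3
print(connectivity(4, [(0, 1), (1, 2), (1, 3)]))  # -> [1, 3, 1, 1]
```

The central carbon has connectivity 3, the other heavy atoms connectivity 1.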
2) Converting the molecular structure represented by the SMILES string in the sample data of the multi-modal molecular pre-training dataset into a 2D topology graph G = (V, E) based on the RDKit tool, in order to characterize the chemical nature of the molecular structure more completely;
Wherein V represents a node set of the 2D topology graph G, and E represents an edge set of the 2D topology graph G;
each node in V represents a corresponding atom in the molecule, and each edge in E is a chemical bond between atoms;
3) Decomposing the molecular structures in the sample data of the multi-modal molecular pre-training dataset based on the BRICS decomposition method to obtain sub-structural units; the specific process is as follows:
performing a preliminary decomposition of the molecular structures in the sample data of the multi-modal molecular pre-training dataset based on the BRICS decomposition method to obtain a plurality of sub-structural units after preliminary decomposition;
each of the sub-structural units after preliminary decomposition is further decomposed to obtain a plurality of minimum sub-structural units (this step is performed separately; the sub-structural units after preliminary decomposition cannot themselves be taken as the minimum sub-structural units). For example, a molecule whose SMILES string is C1=CC=C2C(=C1)N(C(=O)N2CCO)CCO can be decomposed by the BRICS method into the three sub-structural units CCO, C1=CC=C2C(=C1)N(C(=O)N2) and CCO; C1=CC=C2C(=C1)N(C(=O)N2) is further decomposed into C1=CC=C2C(=C1), C=O and C1N=CN=C1; finally, according to the decomposition rule, the molecule has the five sub-structural units CCO, C1=CC=C2C(=C1), C=O, C1N=CN=C1 and CCO;
4) Adding the sub-structural units as new nodes V_f into the topological graph;
adding the connection relations E_f between each sub-structural unit and the nodes it contains as new edges into the topological graph;
constructing an empty graph-level node V_g, and connecting the graph-level node V_g with all new nodes V_f to obtain the connection relations E_g; the fine-grained molecular graph G′ is formed based on V, V_f, V_g, E, E_f, E_g so as to mine the hidden semantic information in molecules:
G′ = (V′, E′), V′ = [V, V_f, V_g], E′ = [E, E_f, E_g].
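The construction of G′ from the atom graph, the sub-structural units and the graph-level node can be sketched in a few lines of pure Python; node indices and the function name are illustrative, not from the source:

```python
def build_fine_grained_graph(num_atoms, bonds, substructures):
    """Form G' = (V', E') with V' = [V, V_f, V_g] and E' = [E, E_f, E_g]."""
    V = list(range(num_atoms))                       # atom nodes
    E = list(bonds)                                  # atom-atom chemical bonds
    # V_f: one new node per sub-structural unit
    V_f = list(range(num_atoms, num_atoms + len(substructures)))
    # E_f: each sub-structure node is linked to the atoms it contains
    E_f = [(f, a) for f, atoms in zip(V_f, substructures) for a in atoms]
    V_g = num_atoms + len(substructures)             # empty graph-level node
    E_g = [(V_g, f) for f in V_f]                    # V_g linked to every V_f node
    return V + V_f + [V_g], E + E_f + E_g
```

For a toy 5-atom chain with bonds (0,1), (1,2), (2,3), (3,4) and two sub-structural units [0,1] and [2,3,4], this yields 8 nodes (5 atoms, 2 substructure nodes, 1 graph-level node) and 11 edges (4 bonds + 5 membership edges + 2 graph-level edges).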
Other steps and parameters are the same as in the first to third embodiments.
The fifth embodiment is different from the first to fourth embodiments in that the molecule and text description representation learning module is used for inputting the text description T in the sample data of the multi-modal molecular pre-training dataset into a text encoder, which outputs semantic features, inputting the fine-grained molecular graphs G′ corresponding to 1.65 million molecular structures in the PubChem dataset into the molecular encoder, which outputs molecular feature vectors, and training the molecular encoder GIN to obtain the pre-trained molecular encoder GIN;
the specific process is as follows:
Step 1), changing the molecular names in the texts of the molecule-text pairs in the sample data of the multi-modal molecular pre-training dataset into the unified format "The molecule is ...";
inputting the text with the unified format "The molecule is ..." into the text encoder, which outputs semantic features;
The text encoder is SciBERT based on the BERT architecture;
The text description serves as supplementary knowledge for the molecular graph and summarizes, in natural language form, the functions, properties, etc. related to the molecule. To prevent leakage of the collected molecular text data and to eliminate the bias introduced by molecular names, we change the molecular names in the 316k molecule-text pairs to the unified format "The molecule is ..." to enhance the generalization and interpretability of the model, so that it can concentrate on the inherent links between molecular structure and properties rather than relying on specific molecular names.
SciBERT provides more abundant and accurate semantic domain knowledge for molecules by pre-training on a large number of corpora in the fields of biochemistry, medicine and the like.
Step 2), inputting the node features and bond features corresponding to 1.65 million molecular structures in the PubChem dataset, together with the fine-grained molecular graphs G′ corresponding to those structures, into the molecular encoder GIN, which outputs feature vectors;
the molecular encoder GIN sequentially comprises an input layer, hidden layers and an output layer;
Optimizing the molecular encoder GIN by adopting a self-supervision learning method of a generating task and a predicting task until convergence to obtain a pre-trained molecular encoder GIN;
The generation task is completed using the feature vectors corresponding to V and E;
the prediction task is completed using the feature vectors corresponding to V_g and E_g;
the generation tasks are predicting the connectivity of atoms, the atom type corresponding to the atomic number (e.g., hydrogen, carbon and oxygen belong to different atom types), and the bond types between atoms;
the prediction tasks are predicting the number of atoms in the molecular structure (how many atoms the molecule contains) and the number of bonds (how many chemical bonds the molecule contains);
The first layer in the molecular encoder GIN receives the original node features as input; the 2nd to 4th layers are hidden layers, each of which takes the output of the previous layer as input, aggregates the features of neighbouring nodes, and updates the node representations through a learnable multi-layer perceptron (MLP); the node features after adding residual connections (adding the node representation of the current layer to that of the previous layer) are the output of each layer;
For molecular structure processing, we employ a Graph Isomorphism Network (GIN) to encode the topology; GIN is a Graph Neural Network (GNN) variant widely used in the field of molecular representation learning and is distinguished by its excellent ability to capture and express the topological features of molecules.
The 1.65 million molecular structures in the PubChem dataset are processed with the RDKit tool to obtain node features and bond features, which are input into the five-layer GIN model; the GIN model outputs feature vectors, and the molecular encoder GIN is optimized with the two self-supervised learning tasks (the generation task and the prediction task) until convergence, yielding the pre-trained molecular encoder GIN model;
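The per-layer update described above (neighbour aggregation, MLP, residual connection) can be sketched as follows. This is a minimal pure-Python illustration using scalar node features and scalar MLP weights for readability; a real GIN operates on feature vectors with learned weight matrices:

```python
def gin_layer(features, neighbours, eps=0.0, w1=1.0, w2=1.0):
    """One GIN update with a residual connection, on scalar node features."""
    def mlp(x):  # two-layer perceptron with ReLU, scalar weights for brevity
        return max(0.0, x * w1) * w2
    out = []
    for v, h_v in enumerate(features):
        # aggregate: (1 + eps) * own feature + sum of neighbour features
        agg = (1.0 + eps) * h_v + sum(features[u] for u in neighbours[v])
        out.append(mlp(agg) + h_v)  # residual: add the previous representation
    return out

# path graph 0-1-2 with features [1, 2, 3]
print(gin_layer([1.0, 2.0, 3.0], [[1], [0, 2], [1]]))  # -> [4.0, 8.0, 8.0]
```

Stacking several such layers (as in the five-layer encoder above) lets each node see progressively larger neighbourhoods.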
Our model is able to learn and generalize more accurately the structural information of the molecules.
Other steps and parameters are the same as in the first to fourth embodiments.
The sixth embodiment is different from the first to fifth embodiments in that the self-supervision cooperative contrast optimization module trains the pre-trained molecular encoder GIN and text encoder by adopting a contrast learning method to obtain the trained molecular encoder GIN and text encoder, wherein the trained molecular encoder GIN and text encoder form a multi-modal fine-granularity molecular pre-training model;
the specific process is as follows:
The training set (molecule-text pair) is sample data within the multimodal molecule pre-training data set;
The loss function is the InfoNCE loss function;
the optimization method is contrastive learning;
The feature vector corresponding to V_g and E_g among the molecular feature vectors output by the pre-trained molecular encoder GIN is used as the feature vector representation of the molecule.
Training the text encoder and the pre-trained molecular encoder GIN until convergence to obtain a trained text encoder and a trained molecular encoder GIN;
The trained text encoder and the trained molecular encoder GIN form the multi-modal fine-grained molecular pre-training model, so as to achieve optimal alignment between molecular structural features and semantic features;
The InfoNCE loss function is:
L = -(1/N) Σ_{i=1}^{N} log( exp(sim(h_g^i, h_t^i)/τ) / Σ_{j=1}^{N} exp(sim(h_g^i, h_t^j)/τ) )
where h_g^i is the molecular (graph) feature vector output by the pre-trained molecular encoder GIN for the molecule in the i-th sample pair;
h_t^i is the text feature vector output by the text encoder for the text in the i-th sample pair;
h_t^j is the text feature vector output by the text encoder for the text in the j-th sample pair;
sim(·,·) is a similarity function that measures the similarity between the two modal features, τ is a temperature parameter, and N is the total number of molecule-text pairs.
A multi-modal contrastive learning method is adopted to enhance the semantic relevance between molecules and text. By continuously optimizing the model, the method improves the matching quality of molecules and texts while ensuring that unmatched molecule-text pairs keep a certain distance in the embedding space. N molecule-text pairs (g_1, t_1), (g_2, t_2), …, (g_N, t_N) are contrasted, where g_i and t_i are the i-th molecule and its corresponding text description, respectively;
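The contrastive objective above can be sketched in pure Python with cosine similarity as sim(·,·); function names are illustrative, and scalar loops stand in for the batched tensor operations used in practice:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def info_nce(mol_feats, txt_feats, tau=0.1):
    """InfoNCE over N molecule-text pairs: the i-th molecule is pulled toward
    the i-th text, while all N texts serve as candidates (negatives for j != i)."""
    n = len(mol_feats)
    loss = 0.0
    for i in range(n):
        scores = [math.exp(cosine(mol_feats[i], txt_feats[j]) / tau) for j in range(n)]
        loss -= math.log(scores[i] / sum(scores))
    return loss / n
```

With correctly matched pairs the loss is small; shuffling the text side against the molecule side increases it, which is exactly the signal that pulls matched pairs together and pushes unmatched pairs apart.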
verifying a multimodal fine-grained molecular pre-training model:
the zero-shot retrieval task for new molecular structures/texts based on the multi-modal fine-grained molecular pre-training model proceeds as follows:
Acquiring the new molecular structure-text pair datasets PCDes and MoMu (new molecular structure-text pairs);
freezing the parameters of the multi-modal fine-grained molecular pre-training model and performing the zero-shot molecular structure/text retrieval task;
in the zero-shot molecular structure retrieval task, the new molecular structure-text pairs are input into the multi-modal fine-grained molecular pre-training model to obtain the molecular structure features h_g and the semantic features h_t, and the cosine similarity between the molecular structure feature vectors h_g and the semantic feature vectors h_t is computed to obtain a cosine similarity matrix, where each row contains the similarities between one molecular structure feature vector and all semantic feature vectors;
for each molecular structure representation, the index of its most similar text representation is determined and compared with the index of the correct text representation (the first text corresponds to the first molecular structure, the second text to the second molecular structure, and so on); if the indexes match, the retrieval is considered successful, and the retrieval accuracy is calculated;
for each molecular structure feature vector, the semantic feature vectors are sorted by similarity value and the most similar items are searched; the retrieval is counted as successful if the correct item appears among the top 20 most similar items after sorting, and the proportion of correct retrievals is calculated to obtain the average recall rate.
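The two evaluation metrics above (top-1 retrieval accuracy and recall among the top-k items) can be computed from the cosine similarity matrix as follows; this is a minimal sketch with an illustrative function name, assuming matching molecule/text items share the same index:

```python
def retrieval_metrics(sim_matrix, k=20):
    """Rows: molecular structure features; columns: semantic (text) features.
    Returns (top-1 retrieval accuracy, recall at k)."""
    top1 = topk = 0
    for i, row in enumerate(sim_matrix):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        top1 += ranked[0] == i   # most similar text is the correct one
        topk += i in ranked[:k]  # correct text within the top k items
    n = len(sim_matrix)
    return top1 / n, topk / n
```

For instance, a 2×2 similarity matrix with the largest value on the diagonal of each row gives accuracy 1.0 and recall 1.0.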
Other steps and parameters are the same as in one of the first to fifth embodiments.
The seventh embodiment is different from the first to sixth embodiments in that the instruction fine-tuning downstream task module is used for adopting prompt learning to guide the multi-modal fine-grained molecular pre-training model to understand downstream tasks, so as to obtain the trained multi-modal fine-grained molecular pre-training model based on prompt learning;
the specific process is as follows:
In order to improve the performance of the model on specific molecular prediction tasks, expertise-guided prompts are combined with parameter-efficient fine-tuning of the model to adapt it to the specific molecular prediction task. Specifically, initialization prompts are first customized for the different tasks, enabling the model to understand the task goals and requirements. By designing specialized prompts for the attribute prediction and drug-drug interaction (DDI) tasks, rich background and target information is provided, helping the model understand the related biological and chemical principles and ensuring that the model can identify and utilize key molecular features.
1) Acquiring the downstream task training sets and constructing instruction prompt texts; the specific process is as follows:
11) Acquiring the attribute prediction tasks in the training set within the sample data of the downstream task dataset (the structure only, without text; the text is supplied by the prompt, and the molecular structure is encoded with the trained GIN);
The attribute prediction task comprises six subtasks: BBBP, BACE, SIDER, Tox21, ToxCast and HIV; each subtask corresponds to one dataset, i.e., the attribute prediction task comprises six datasets;
BBBP is blood-brain barrier penetrability;
BACE is beta-amyloid precursor protein cleaving enzyme;
SIDER refers to drug side effects;
Tox21 is the Toxicology in the 21st Century testing program;
ToxCast is a multi-year toxicology prediction research project initiated by the United States Environmental Protection Agency (EPA);
HIV is the human immunodeficiency virus;
The BBBP dataset comprises 2039 molecular structures and corresponding attribute tags;
The BACE dataset comprises 1513 molecular structures and corresponding attribute tags;
The Sider dataset comprises 1427 molecular structures and corresponding attribute tags;
The Tox21 dataset comprises 7831 molecular structures and corresponding attribute tags;
the ToxCast dataset comprises 8576 molecular structures and corresponding attribute tags;
The HIV dataset contains 41127 molecular structures and corresponding attribute tags;
12) Acquiring the drug interaction tasks in the training set within the sample data of the downstream task dataset (the structure only, without text; the text is supplied by the prompt, and the molecular structure is encoded with the trained GIN);
The drug interaction task comprises three subtasks: ZhangDDI, ChChMiner and DeepDDI; each subtask corresponds to one dataset, i.e., the drug interaction task comprises three datasets;
The ZhangDDI dataset contains 48548 molecular structure pairs (two molecular structures) and corresponding drug interaction relationship tags;
The ChChMiner dataset contains 48514 molecular structure pairs (two molecular structures each) and corresponding drug interaction relationship tags;
The DeepDDI dataset contains 192284 molecular structure pairs (two molecular structures) and corresponding drug interaction relationship tags;
13) Constructing an instruction prompt text according to the task goal, task background, guiding principles and task requirements of each subtask;
Taking BBBP tasks as an example:
For the BBBP (blood-brain barrier penetrability) task, the task goal, task background, guiding principles and task requirements are designed, and the instruction prompt is: "This task is BBBP, our objective is to predict whether a drug molecule can penetrate the blood-brain barrier, which is composed of brain capillary endothelial cells. The blood-brain barrier is highly selective, allowing only certain substances to pass through. We need to analyze the following molecular structural characteristics: lipophilicity, molecular weight, charge state, protein binding capacity, presence of hydrophobic groups, activity of metabolic products, and the cyclic structure of the molecule. If the molecule has high lipophilicity, a small molecular weight, lacks charge, binds minimally with proteins, has hydrophobic groups, no active metabolites, and is a polycyclic compound, it is more likely to penetrate the blood-brain barrier. Please use these guiding principles to determine whether this molecule has the capability to do so."
The BACE (beta-amyloid precursor protein cleaving enzyme) task prompt is: "This task is BACE, our goal is to predict whether a drug molecule can effectively inhibit BACE enzyme, which is a key target for the treatment of Alzheimer's disease. The activity of BACE enzyme is closely related to the production of beta-amyloid protein. We need to analyze the following molecular structural characteristics: binding affinity, molecular weight, lipophilicity, hydrogen bond capability, stereoisomerism, and metabolic stability. If the molecule demonstrates strong binding affinity with the active site of BACE enzyme, possesses key chemical groups to form necessary hydrogen bonds, and is stable under physiological conditions with good cell membrane permeability, it may be an effective BACE inhibitor. Please use these guiding principles to analyze and determine whether this molecule has the potential for BACE inhibition.";
The SIDER (side effect) task prompt is: "This task is SIDER, our goal is to predict the potential side effects that a drug molecule might cause. Predicting drug side effects is crucial for the safety assessment of medications. We need to analyze the following molecular structural characteristics: pharmacological mechanisms of action such as receptor binding or enzyme inhibition, properties like molecular weight, lipophilicity, and solubility, and pharmacokinetic characteristics such as absorption, distribution, metabolism, and excretion. Please use these guidelines to analyze and predict the potential side effects that this drug molecule may cause."
The Tox21 (Toxicology in the 21st Century) task prompt is: "This task is TOX21, our goal is to predict whether a pharmaceutical molecule might induce a range of biological effects associated with toxicity. The TOX21 project aims to identify the potential toxicity of chemical substances, including hormone disruption, genotoxicity, and more. We need to analyze molecular structural characteristics: the presence of aromatic rings, the balance between lipophilicity and hydrophilicity, metabolic stability, and permeability. If a molecule affects the pharmacological action of known toxicity-related targets, shows metabolic instability, or is likely to produce toxic metabolites, or exhibits poor cellular permeability, then it may have TOX21 properties. Please use these guiding principles to predict whether the pharmaceutical molecule has TOX21 characteristics."
The ToxCast (a multi-year toxicology prediction research project initiated by the Environmental Protection Agency (EPA)) task prompt is: "This task is ToxCast, our goal is to assess the potential toxicity of chemical substances to biological systems. The ToxCast project is a large-scale, high-throughput screening project that uses a variety of biological tests to predict the toxicity of chemical substances. We need to analyze the following molecular structural characteristics: molecular weight, solubility, lipophilicity, etc. If a molecule has a large molecular weight, low solubility, high lipophilicity, and active metabolites, it is more likely to exhibit specific toxicity. Please judge whether this molecule has potential toxicity based on these guidelines."
The HIV (human immunodeficiency virus) task prompt is: "This task is HIV, our goal is to assess the potential inhibitory effect of drug molecules on HIV. HIV is a virus that attacks the human immune system, particularly affecting CD4+ T cells, leading to Acquired Immunodeficiency Syndrome (AIDS). We need to analyze the following molecular structural characteristics: lipophilicity, molecular weight, charge state, protein binding capacity, hydrophobic groups, and the activity of metabolic products. If the molecule has an optimized chemical structure targeting HIV's key proteins, a smaller molecular weight, strong protein binding capacity, hydrophobic groups, and good metabolic stability, then it is more likely to have an inhibitory effect on HIV. Please judge whether this drug molecule has antiviral potential."
The drug interaction task prompt is: "This task is Drug-Drug Interaction, which refers to the phenomenon where the simultaneous use of two or more drugs in the body can lead to an enhancement or reduction in their efficacy, or even cause adverse reactions, due to the mutual influence between the drugs. Determine if the interaction between these two drugs is positive or not." For the drug interaction task, only the drug types differ, and the model only needs to determine whether an interaction exists between each drug pair.
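Assembling an instruction prompt from the four designed components can be sketched as follows; the function name and the abbreviated component strings are illustrative, not the exact prompts from the source:

```python
def build_instruction_prompt(task_name, goal, background, principles, requirement):
    """Concatenate task goal, background, guiding principles and requirement
    into one instruction prompt string."""
    return f"This task is {task_name}, {goal} {background} {principles} {requirement}"

bbbp_prompt = build_instruction_prompt(
    "BBBP",
    "our objective is to predict whether a drug molecule can penetrate the blood-brain barrier,",
    "which is composed of brain capillary endothelial cells.",
    "If the molecule has high lipophilicity and a small molecular weight, it is more likely to penetrate.",
    "Please use these guiding principles to determine whether this molecule has the capability to do so.",
)
```

Each subtask supplies its own four components, so a single template covers both the attribute prediction and the drug interaction prompts.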
2) Obtaining the feature vector h_p of the prompt text based on the instruction prompt text constructed in step 1);
3) Performing attribute prediction on a molecular structure (without attribute tags) based on the multi-modal fine-grained molecular pre-training model and the feature vector of the prompt text;
4) Predicting the drug interaction relationship of a molecular structure pair (without interaction labels) based on the multi-modal fine-grained molecular pre-training model and the feature vector of the prompt text;
5) Repeating the steps 1) to 4) until convergence, and obtaining a trained multi-mode fine-granularity molecular pre-training model based on prompt learning.
Other steps and parameters are the same as those of one of embodiments one to six.
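Steps 1) to 5) above form an iterate-until-convergence fine-tuning loop. The following hypothetical driver shows only that control flow; the loss function is a deterministic stand-in, not the patent's combined attribute and drug-interaction objective:

```python
def epoch_loss(epoch):
    """Stand-in for the combined attribute (step 3) and DDI (step 4) loss."""
    return 1.0 / (epoch + 1)

losses, prev = [], float("inf")
for epoch in range(1000):            # step 5): repeat until convergence
    loss = epoch_loss(epoch)
    losses.append(loss)
    if prev - loss < 1e-3:           # convergence criterion (assumed)
        break
    prev = loss

print(losses[-1] < losses[0])   # True: training stops once improvement stalls
```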
The eighth embodiment is different from one of the first to seventh embodiments in that in step 2), the feature vector h_p of the instruction prompt text is obtained based on the instruction prompt text constructed in step 1); the specific process is as follows:
creating a prompt embedding layer P ∈ R^(l×d), wherein l is the number of virtual tokens and d is the embedding dimension;
inputting the instruction prompt text constructed in step 1) into the tokenizer corresponding to the trained text encoder, the tokenizer outputting a token-ID sequence;
inputting the token-ID sequence into the trained text encoder, the word embedding layer in the trained text encoder outputting word embedding vectors;
using the word embedding vectors as the initial weights of the created prompt embedding layer P, obtaining an initialized prompt embedding layer;
and inputting the virtual tokens in the initialized prompt embedding layer into the trained text encoder, the trained text encoder outputting the feature vector h_p corresponding to the instruction prompt text.
Each subtask corresponds to a number of molecular structures; each molecular structure corresponds to one h_p, and all h_p within one subtask are identical.
Other steps and parameters are the same as those of one of embodiments one to seven.
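The prompt-embedding procedure of step 2) can be sketched as follows. This is a toy illustration: the tokenizer, embedding table, and mean-pooling encoder are stand-ins for the patent's trained text encoder, not its actual components:

```python
import random

random.seed(0)
VOCAB = {"this": 0, "task": 1, "is": 2, "hiv": 3, "bbbp": 4}
D = 8                                                # embedding dimension d
WORD_EMB = [[random.gauss(0, 1) for _ in range(D)] for _ in VOCAB]

def tokenize(text):
    """Stand-in tokenizer: map an instruction prompt to a token-ID sequence."""
    return [VOCAB[w] for w in text.lower().split() if w in VOCAB]

def init_prompt_embedding(text):
    """Create P in R^(l x d), initialised from the encoder's word embeddings."""
    return [WORD_EMB[i][:] for i in tokenize(text)]  # l virtual tokens

def text_encoder(P):
    """Stand-in text encoder: mean-pool the virtual tokens into h_p."""
    l = len(P)
    return [sum(row[j] for row in P) / l for j in range(D)]

P = init_prompt_embedding("This task is HIV")
h_p = text_encoder(P)
print(len(P), len(P[0]), len(h_p))   # 4 8 8
```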
The ninth embodiment is different from one of the first to eighth embodiments in that in step 3), attribute prediction is performed on a molecular structure (without an attribute label) based on the multimodal fine-grained molecular pre-training model and the feature vector of the prompt text; the specific process is as follows:
31) Inputting a BBBP (blood-brain barrier penetrability) molecular structure (without an attribute label) into the trained molecular encoder GIN in the multimodal fine-grained molecular pre-training model, the trained molecular encoder GIN outputting a structural feature vector h_g1;
obtaining a feature vector h_p1 of the prompt text according to step 2), based on the BBBP instruction prompt text constructed in step 1);
fusing the feature vector h_p1 of the prompt text with the structural feature vector h_g1 to obtain the prompt-guided attribute feature representation corresponding to the BBBP molecular structure;
wherein
h_property1 is the prompt-guided attribute feature representation corresponding to the BBBP molecular structure,
α is a weight parameter, and ⊕ denotes splicing (concatenation);
inputting the prompt-guided attribute feature representation corresponding to the BBBP molecular structure into a classifier for classification (obtaining the attribute);
32) Inputting a BACE (β-site amyloid precursor protein cleaving enzyme) molecular structure (without an attribute label) into the trained molecular encoder GIN in the multimodal fine-grained molecular pre-training model, the trained molecular encoder GIN outputting a structural feature vector h_g2;
obtaining a feature vector h_p2 of the prompt text according to step 2), based on the BACE instruction prompt text constructed in step 1);
fusing the feature vector h_p2 of the prompt text with the structural feature vector h_g2 to obtain the prompt-guided attribute feature representation corresponding to the BACE molecular structure;
wherein
h_property2 is the prompt-guided attribute feature representation corresponding to the BACE molecular structure,
α is a weight parameter, and ⊕ denotes splicing (concatenation);
inputting the prompt-guided attribute feature representation corresponding to the BACE molecular structure into a classifier for classification (obtaining the attribute);
33) Inputting a SIDER (side-effect resource) molecular structure (without an attribute label) into the trained molecular encoder GIN in the multimodal fine-grained molecular pre-training model, the trained molecular encoder GIN outputting a structural feature vector h_g3;
obtaining a feature vector h_p3 of the prompt text according to step 2), based on the SIDER instruction prompt text constructed in step 1);
fusing the feature vector h_p3 of the prompt text with the structural feature vector h_g3 to obtain the prompt-guided attribute feature representation corresponding to the SIDER molecular structure;
wherein
h_property3 is the prompt-guided attribute feature representation corresponding to the SIDER molecular structure,
α is a weight parameter, and ⊕ denotes splicing (concatenation);
inputting the prompt-guided attribute feature representation corresponding to the SIDER molecular structure into a classifier for classification (obtaining the attribute);
34) Inputting a Tox21 (Toxicology in the 21st Century) molecular structure (without an attribute label) into the trained molecular encoder GIN in the multimodal fine-grained molecular pre-training model, the trained molecular encoder GIN outputting a structural feature vector h_g4;
obtaining a feature vector h_p4 of the prompt text according to step 2), based on the Tox21 instruction prompt text constructed in step 1);
fusing the feature vector h_p4 of the prompt text with the structural feature vector h_g4 to obtain the prompt-guided attribute feature representation corresponding to the Tox21 molecular structure;
wherein
h_property4 is the prompt-guided attribute feature representation corresponding to the Tox21 molecular structure,
α is a weight parameter, and ⊕ denotes splicing (concatenation);
inputting the prompt-guided attribute feature representation corresponding to the Tox21 molecular structure into a classifier for classification (obtaining the attribute);
35) Inputting a ToxCast (a multi-year toxicology prediction research project started by the United States Environmental Protection Agency (EPA)) molecular structure (without an attribute label) into the trained molecular encoder GIN in the multimodal fine-grained molecular pre-training model, the trained molecular encoder GIN outputting a structural feature vector h_g5;
obtaining a feature vector h_p5 of the prompt text according to step 2), based on the ToxCast instruction prompt text constructed in step 1);
fusing the feature vector h_p5 of the prompt text with the structural feature vector h_g5 to obtain the prompt-guided attribute feature representation corresponding to the ToxCast molecular structure;
wherein
h_property5 is the prompt-guided attribute feature representation corresponding to the ToxCast molecular structure,
α is a weight parameter, and ⊕ denotes splicing (concatenation);
inputting the prompt-guided attribute feature representation corresponding to the ToxCast molecular structure into a classifier for classification (obtaining the attribute);
36) Inputting an HIV (human immunodeficiency virus) molecular structure (without an attribute label) into the trained molecular encoder GIN in the multimodal fine-grained molecular pre-training model, the trained molecular encoder GIN outputting a structural feature vector h_g6;
obtaining a feature vector h_p6 of the prompt text according to step 2), based on the HIV instruction prompt text constructed in step 1);
fusing the feature vector h_p6 of the prompt text with the structural feature vector h_g6 to obtain the prompt-guided attribute feature representation corresponding to the HIV molecular structure;
wherein
h_property6 is the prompt-guided attribute feature representation corresponding to the HIV molecular structure,
α is a weight parameter, and ⊕ denotes splicing (concatenation);
inputting the prompt-guided attribute feature representation corresponding to the HIV molecular structure into a classifier for classification (obtaining the attribute).
Each subtask corresponds to a data set; each data set comprises a number of molecular structures and their corresponding attribute labels; each molecular structure corresponds to one h_p, and all h_p within one subtask are identical.
Other steps and parameters are the same as those of one of embodiments one to eight.
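Substeps 31) to 36) all follow one pattern: encode the molecule with the trained GIN, fetch the subtask's shared h_p, fuse the two vectors, and classify. A minimal sketch of that pattern, assuming (since the exact formula is not reproduced in the text) that fusion weights the two vectors by α and 1−α before splicing; all encoders and the classifier head are toy stand-ins:

```python
import random

random.seed(2)
D = 8

def gin_encoder(mol):
    """Stand-in for the trained molecular encoder GIN."""
    return [random.gauss(0, 1) for _ in range(D)]

def fuse(h_p, h_g, alpha=0.5):
    """Assumed fusion: weight h_p and h_g by alpha / (1 - alpha), then splice."""
    return [alpha * x for x in h_p] + [(1 - alpha) * x for x in h_g]

def classify(h, W):
    """Toy linear head: one logit per class row of W."""
    return [sum(hi * wi for hi, wi in zip(h, row)) for row in W]

W = [[random.gauss(0, 1) for _ in range(2 * D)] for _ in range(2)]  # 2 classes
subtasks = {"BBBP": ["mol_a", "mol_b"], "HIV": ["mol_c"]}           # toy data

logits = {}
for task, molecules in subtasks.items():
    h_p = [random.gauss(0, 1) for _ in range(D)]  # one shared h_p per subtask
    for mol in molecules:
        h_property = fuse(h_p, gin_encoder(mol))
        logits[(task, mol)] = classify(h_property, W)

print(len(logits), len(logits[("BBBP", "mol_a")]))   # 3 2
```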
The tenth embodiment is different from one of the first to ninth embodiments in that in step 4), the drug-interaction relationship of a molecular structure pair (without an interaction label) is predicted based on the multimodal fine-grained molecular pre-training model and the feature vector of the prompt text; the specific process is as follows:
41) Inputting a ZhangDDI molecular structure pair (without an interaction label) into the trained molecular encoder GIN in the multimodal fine-grained molecular pre-training model, the trained molecular encoder GIN outputting structural feature vectors h_g7 and h_g8;
obtaining a feature vector h_p7 of the prompt text according to step 2), based on the ZhangDDI instruction prompt text constructed in step 1);
splicing the structural feature vectors h_g7 and h_g8 with the feature vector h_p7 of the prompt text to obtain the prompt-guided drug-interaction feature representation corresponding to the ZhangDDI molecular structure pair;
wherein
h_ddi1 is the prompt-guided drug-interaction feature representation corresponding to the ZhangDDI molecular structure pair,
and ⊕ denotes splicing (concatenation);
inputting the prompt-guided drug-interaction feature representation corresponding to the ZhangDDI molecular structure pair into a classifier for classification (obtaining the interaction relationship);
42) Inputting a CHCHMINER molecular structure pair (without an interaction label) into the trained molecular encoder GIN in the multimodal fine-grained molecular pre-training model, the trained molecular encoder GIN outputting structural feature vectors h_g9 and h_g10;
obtaining a feature vector h_p8 of the prompt text according to step 2), based on the CHCHMINER instruction prompt text constructed in step 1);
splicing the structural feature vectors h_g9 and h_g10 with the feature vector h_p8 of the prompt text to obtain the prompt-guided drug-interaction feature representation corresponding to the CHCHMINER molecular structure pair;
wherein
h_ddi2 is the prompt-guided drug-interaction feature representation corresponding to the CHCHMINER molecular structure pair;
inputting the prompt-guided drug-interaction feature representation corresponding to the CHCHMINER molecular structure pair into a classifier for classification (obtaining the interaction relationship);
43) Inputting a DeepDDI molecular structure pair (without an interaction label) into the trained molecular encoder GIN in the multimodal fine-grained molecular pre-training model, the trained molecular encoder GIN outputting structural feature vectors h_g11 and h_g12;
obtaining a feature vector h_p9 of the prompt text according to step 2), based on the DeepDDI instruction prompt text constructed in step 1);
splicing the structural feature vectors h_g11 and h_g12 with the feature vector h_p9 of the prompt text to obtain the prompt-guided drug-interaction feature representation corresponding to the DeepDDI molecular structure pair;
wherein
h_ddi3 is the prompt-guided drug-interaction feature representation corresponding to the DeepDDI molecular structure pair;
inputting the prompt-guided drug-interaction feature representation corresponding to the DeepDDI molecular structure pair into a classifier for classification (obtaining the interaction relationship).
Each subtask corresponds to a data set; each data set comprises a number of molecular structure pairs (two molecular structures each) and the corresponding drug-interaction labels; each molecular structure pair corresponds to one h_p, and all h_p within one subtask are identical.
Other steps and parameters are the same as those of one of embodiments one to nine.
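For the drug-interaction subtasks, the splicing step differs from the attribute case in that both structural vectors of the pair are concatenated with the prompt vector. A toy sketch (random stand-in vectors; the classifier head is illustrative):

```python
import random

random.seed(3)
D = 8
h_g7 = [random.gauss(0, 1) for _ in range(D)]   # GIN output, drug 1
h_g8 = [random.gauss(0, 1) for _ in range(D)]   # GIN output, drug 2
h_p7 = [random.gauss(0, 1) for _ in range(D)]   # prompt feature, ZhangDDI task

h_ddi1 = h_g7 + h_g8 + h_p7                     # spliced DDI representation
W = [[random.gauss(0, 1) for _ in range(3 * D)] for _ in range(2)]  # 2 classes
logits = [sum(h * w for h, w in zip(h_ddi1, row)) for row in W]
print(len(h_ddi1), len(logits))   # 24 2
```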
The present invention is capable of other and further embodiments, and its several details are capable of modification and variation, as will be apparent to those skilled in the art, without departing from the spirit and scope of the invention as defined in the appended claims.