
CN119479906A - Molecular structure prediction system based on multimodal fine-grained molecular pre-training model based on prompt learning - Google Patents

Molecular structure prediction system based on multimodal fine-grained molecular pre-training model based on prompt learning Download PDF

Info

Publication number
CN119479906A
CN119479906A (application number CN202411539556.8A; granted as CN119479906B)
Authority
CN
China
Prior art keywords
molecular
prompt
encoder
text
trained
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202411539556.8A
Other languages
Chinese (zh)
Other versions
CN119479906B (en)
Inventor
李洋
卫政鑫
汪国华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeast Forestry University
Original Assignee
Northeast Forestry University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeast Forestry University filed Critical Northeast Forestry University
Priority to CN202411539556.8A priority Critical patent/CN119479906B/en
Publication of CN119479906A publication Critical patent/CN119479906A/en
Application granted granted Critical
Publication of CN119479906B publication Critical patent/CN119479906B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/042 Knowledge-based neural networks; Logical representations of neural networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/0895 Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/30 Prediction of properties of chemical compounds, compositions or mixtures
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/50 Molecular design, e.g. of drugs

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Medicinal Chemistry (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Machine Translation (AREA)

Abstract


A molecular structure prediction system based on a multimodal fine-grained molecular pre-training model of prompt learning. The present invention relates to the field of molecular structure prediction, and in particular to a molecular structure prediction system. The purpose of the present invention is to solve the problems of low accuracy and efficiency in processing complex molecular data and low accuracy in prediction of intermolecular interactions caused by data scarcity and insufficient task applicability in existing methods. The system includes: a data acquisition module for acquiring sample data in a multimodal molecular pre-training data set and sample data in a downstream task data set; a processing module for establishing a multimodal fine-grained molecular pre-training model based on prompt learning, and acquiring a trained multimodal fine-grained molecular pre-training model based on prompt learning; a prediction module for predicting the properties and drug interaction relationships of the molecular structure to be tested based on the trained multimodal fine-grained molecular pre-training model of prompt learning, and obtaining a prediction result.

Description

Molecular structure prediction system based on a multimodal fine-grained molecular pre-training model with prompt learning
Technical Field
The invention relates to the field of molecular structure prediction, in particular to a molecular structure prediction system.
Background
With the continuous progress of artificial intelligence technology, deep learning has become a key tool for accelerating research and development. Traditional discovery methods are limited by expertise and experimental conditions and progress relatively slowly. At present, molecular representation learning methods based on machine learning rely on supervised models and provide a new perspective for research and development by analyzing features such as molecular fingerprints, SMILES strings, two-dimensional molecular graphs and three-dimensional structures. However, the scarcity of data and the complexity of labeling limit the development of these models. To overcome these limitations, researchers have begun to focus on large-scale pre-trained language models (PTMs) from the field of natural language processing (NLP), such as BERT, the GPT series and T5. These models perform unsupervised pre-training on large-scale text data, reduce the dependence on large amounts of annotated data through fine-tuning on small amounts of annotated data, and improve generalization, exhibiting excellent performance on a variety of tasks.
Inspired by multimodal models such as CLIP, BLIP2 and LLaVA, researchers have employed self-supervised learning methods to explore the inherent links between different molecular modalities. Some studies treat SMILES strings as a language with a special grammar, whose relationship to molecular structure is learned through masked language modeling tasks. The T5 backbone model, in turn, utilizes the Transformer's attention mechanism to learn a serialized representation of the molecular structure. This approach makes the expression of molecular structure more compact and flexible, but the SMILES string representation has limitations, in particular the difficulty of expressing the spatial structure inside the molecule. In order to fully capture the structural features of a molecule, researchers have considered representing the molecule in other forms. The topology of a molecule provides a visual representation of its spatial arrangement, helping to understand the inherent associations and property functions of the molecule. Researchers learn complex features of molecules by converting SMILES strings into two-dimensional molecular graphs to model molecular topologies, and by using a graph neural network to aggregate and propagate information between atoms and chemical bonds. Although these methods have advanced molecular modeling, they may destroy the integrity of the molecular topology when reconstructing masked internal structure, resulting in insufficient capture of the unique structural characteristics of the molecule and thereby affecting the accuracy of the model in predicting molecular properties and functions.
Multimodal fine-grained molecular pre-training models have made significant progress in molecular property prediction, molecule-to-text generation, molecular optimization and similar tasks. However, the potential of these models for predicting intermolecular interactions has not been fully explored. Currently, fine-tuning pre-trained language models through transfer learning has become common practice to improve model performance on specific tasks. Although this strategy achieves significant performance improvements on natural language tasks, the computing and storage costs of traditional full-model fine-tuning keep increasing as pre-trained models grow in size. Recently, parameter-efficient fine-tuning methods, which update only a small fraction of the model parameters, have alleviated this problem. They exhibit the advantages of modular design, strong adaptability and prevention of catastrophic forgetting while maintaining comparable performance. Prompt tuning has become a promising approach that reduces the number of adjusted parameters by introducing trainable prompt vectors into the input. However, incorrect initialization can prevent the model from effectively using the knowledge obtained in the pre-training stage, thereby affecting its performance on specific tasks.
Disclosure of Invention
The invention aims to solve the problems of low accuracy and efficiency in processing complex molecular data and low accuracy in predicting intermolecular interactions, caused by data scarcity and insufficient task applicability in existing methods, and provides a molecular structure prediction system based on a multimodal fine-grained molecular pre-training model with prompt learning.
The prediction system based on the prompt-learning multimodal fine-grained molecular pre-training model comprises a data acquisition module, a processing module and a prediction module;
the data acquisition module is used for acquiring sample data in the multi-mode molecular pre-training data set and sample data in the downstream task data set;
the processing module is used for establishing a multi-modal fine-grained molecular pre-training model based on prompt learning and obtaining a trained multi-modal fine-grained molecular pre-training model based on prompt learning;
The prediction module is used for predicting the properties and drug-interaction relationships of the molecular structure to be tested based on the trained prompt-learning multimodal fine-grained molecular pre-training model, and obtaining a prediction result.
The beneficial effects of the invention are as follows:
the invention provides a multimodal fine-grained molecular pre-training model, MolFinePrompt, based on prompt learning, which preserves the structural integrity of molecules when learning complex structures, accurately predicts interactions between molecules, fully exploits the potential of instruction fine-tuning across different tasks, and improves the accuracy and efficiency of drug discovery.
By integrating substructures that embody chemical properties into the molecular topology, the prompt-learning multimodal fine-grained molecular pre-training model improves accuracy and efficiency when processing complex molecular data;
In order to enhance the adaptability of the prompt-learning multimodal fine-grained molecular pre-training model to diverse downstream tasks, the invention designs a corresponding prompt template for each task, which guides the model to identify and utilize task-relevant features more effectively and enhances its generalization and adaptability across diverse tasks.
By deeply analyzing the interaction patterns between different drug molecules, the prompt-learning multimodal fine-grained molecular pre-training model MolFinePrompt can more accurately understand the properties and potential synergistic effects of molecules, providing a foundation for the development of drug combination therapies.
Drawings
Fig. 1 is a block diagram of the system of the present invention.
Detailed Description
The molecular structure prediction system based on the multi-mode fine-granularity molecular pre-training model for prompt learning comprises a data acquisition module, a processing module and a prediction module;
the data acquisition module is used for acquiring sample data in the multi-mode molecular pre-training data set and sample data in the downstream task data set;
The processing module is used for establishing a multi-modal fine-grained molecular pre-training model MolFinePrompt based on prompt learning and obtaining a trained multi-modal fine-grained molecular pre-training model based on prompt learning;
MolFinePrompt stands for Fine-Grained Multimodal Molecular Pre-training Large Model via Prompt-Learning;
The prediction module is used for predicting the properties and drug-interaction relationships of the molecular structure to be tested based on the trained prompt-learning multimodal fine-grained molecular pre-training model, and obtaining a prediction result.
The molecular structure to be tested and the data obtained by the data acquisition module come from the same domain (the same data type, i.e., molecule-text pairs; for example, the molecular structure CC(=O)O and the corresponding text description: "Acetic acid is a product of the oxidation of ethanol and of the destructive distillation of wood. It is used locally, occasionally internally, as a counterirritant and also as a reagent. Acetic acid otic (for the ear) is an antibiotic that treats infections caused by bacteria or fungus.").
The second embodiment is different from the first embodiment in that the data acquisition module is configured to acquire sample data in a multi-modal molecular pre-training data set and sample data in a downstream task data set, and the specific process is as follows:
Sample data in the multimodal molecular pre-training dataset are molecular-text pairs;
Sample data in the downstream task dataset are molecular structures only (structure without text; the text is supplied by the prompt, and the molecular structure is encoded with the trained GIN);
The sample data in the multimodal molecular pre-training dataset include a molecular structure G and a textual description T;
The molecular structures G and textual descriptions T are 316k molecule-text pairs collected from the public dataset PubChem, where k denotes thousands;
for example, the fine-grained molecular structure graph G′ and the standardized molecular text description T′ in the pre-training data set are used as inputs to the MolFinePrompt model; the fine-grained molecular structure graph G′ and the task-specific instruction text in the downstream task data set are used as inputs to the pre-trained model, and the model outputs the result of the prediction task.
Other steps and parameters are the same as in the first embodiment.
The third embodiment is different from the first or second embodiments in that the processing module is configured to establish a multimodal fine granularity molecular pre-training model based on prompt learning, and obtain a trained multimodal fine granularity molecular pre-training model based on prompt learning, where the specific process is as follows:
The multi-mode fine granularity molecular pre-training model based on prompt learning comprises a fine granularity molecular graph construction module, a molecular and text description representation learning module, a self-supervision cooperative contrast optimization module and an instruction fine adjustment downstream task module;
The fine-granularity molecular graph construction module adopts a specific decomposition rule to decompose a molecular structure graph G in sample data in the multi-mode molecular pre-training data set into sub-structure units, and builds a fine-granularity molecular graph G' based on the sub-structure units;
The molecular and text description representation learning module is used for inputting the text descriptions T in the sample data of the multimodal molecular pre-training data set into a text encoder, which outputs semantic features, and inputting the fine-grained molecular graphs G′ corresponding to the 1.65 million molecular structures in the PubChem data set into a molecular encoder, which outputs molecular feature vectors;
the self-supervision collaborative contrast optimization module trains a pre-trained molecular encoder GIN and a text encoder by adopting a contrast learning method to obtain a trained molecular encoder GIN and a trained text encoder, wherein the trained molecular encoder GIN and the trained text encoder form a multi-modal fine-granularity molecular pre-training model;
The instruction fine-tuning downstream task module is used for adopting a prompt learning guiding multi-mode fine-granularity molecular pre-training model to understand downstream tasks and obtaining a trained multi-mode fine-granularity molecular pre-training model based on prompt learning.
This improves the applicability and flexibility of the method in specific application scenarios.
Based on the above, the multimodal fine-grained molecular pre-training model is optimized by constructing fine-grained molecular structure graphs and combining text information through contrastive learning, and is applied to key specific tasks in the drug discovery field using prompt learning techniques.
Other steps and parameters are the same as in the first or second embodiment.
The fourth embodiment is different from the first to third embodiments in that the fine-grained molecular graph construction module adopts a specific decomposition rule to decompose the molecular structure graph G in the sample data of the multimodal molecular pre-training data set into sub-structural units, and constructs the fine-grained molecular graph G′ based on the sub-structural units;
the specific process is as follows:
1) Acquiring the chemical element characteristics of each atom in the molecular structure represented by the SMILES character string in the sample data in the multi-mode molecular pre-training dataset based on RDKit tool;
the chemical element characteristics of each atom include the atomic number, the connectivity, the bond types between atoms (single bond, double bond, triple bond), and the ring membership of the atom (whether it forms part of a ring);
Taking the connectivity characteristic as a node characteristic;
Taking the bond type among atoms and the atom ring forming type as bond characteristics;
The connectivity is the number of chemical bonds connected to the atoms;
2) Converting the molecular structure represented by the SMILES string in the sample data of the multimodal molecular pre-training dataset into a 2D topology graph G = (V, E) based on the RDKit tool, for a more complete characterization of the chemical nature of the molecular structure;
Wherein V represents a node set of the 2D topology graph G, and E represents an edge set of the 2D topology graph G;
each node in V represents an atom in the molecule, and each edge in E is a chemical bond between atoms;
3) Decomposing a molecular structure in sample data in a multi-mode molecular pre-training data set based on BRICS decomposition method to obtain a substructure unit, wherein the specific process is as follows:
performing preliminary decomposition on molecular structures in sample data in the multi-mode molecular pre-training dataset based on BRICS decomposition method to obtain a plurality of sub-structural units after preliminary decomposition;
each preliminarily decomposed sub-structural unit is then decomposed further to obtain a plurality of minimum sub-structural units (this step is performed per unit; the set of preliminarily decomposed sub-structural units cannot simply be taken as the minimum sub-structural units). For example, the SMILES string of a molecule expressed as C1=CC=C2C(=C1)N(C(=O)N2CCO)CCO can be decomposed by the BRICS decomposition method into three sub-structural units, CCO, C1=CC=C2C(=C1)N(C(=O)N2) and CCO; C1=CC=C2C(=C1)N(C(=O)N2) is further decomposed into C1=CC=C2C(=C1), C=O and C1N=CN=C1, so that according to the decomposition rule the molecule finally has five sub-structural units: CCO, C1=CC=C2C(=C1), C=O, C1N=CN=C1 and CCO;
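The two-stage decomposition above can be sketched as plain-Python bookkeeping. This is only an illustration of how preliminary fragments resolve into minimum units: the `decompose_further` table is a hypothetical stand-in for the second-stage rules, and the fragment SMILES follow the worked example (real decomposition would use a BRICS implementation such as RDKit's).

```python
# Sketch of the two-stage decomposition: each preliminary fragment is
# resolved separately; a fragment with no further rule is already minimal.
def flatten_units(preliminary, decompose_further):
    units = []
    for frag in preliminary:
        # each preliminary fragment is processed on its own
        units.extend(decompose_further.get(frag, [frag]))
    return units

# First-stage fragments from the worked example above.
preliminary = ["CCO", "C1=CC=C2C(=C1)N(C(=O)N2)", "CCO"]
# Hypothetical second-stage rule table (stand-in for BRICS-based rules).
decompose_further = {
    "C1=CC=C2C(=C1)N(C(=O)N2)": ["C1=CC=C2C(=C1)", "C=O", "C1N=CN=C1"],
}
units = flatten_units(preliminary, decompose_further)  # five minimum units
```

Note that the two CCO fragments are kept as separate units, matching the count of five in the example.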
4) Adding the sub-structural unit as a new node V f into the topological graph;
Adding the connection relation E f between each sub-structure unit and the node contained by the sub-structure unit as a new edge into the topological graph;
Constructing an empty graph-level node Vg, connecting the graph-level node Vg with all new nodes Vf to obtain the connection relation Eg, and forming the fine-grained molecular graph G′ based on V, Vf, Vg, E, Ef, Eg to mine hidden semantic information in the molecule;
G′ = (V′, E′), V′ = [V, Vf, Vg], E′ = [E, Ef, Eg].
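Steps 3) and 4) above can be sketched as follows. This is a minimal illustration of the graph augmentation only (node indices stand in for atoms; the function name and toy molecule are assumptions, not from the source):

```python
# Build the fine-grained graph G' = (V', E'): atom nodes V and bonds E are
# kept, one new node per substructure unit (Vf) is linked to its member
# atoms (Ef), and one empty graph-level node Vg is linked to every
# substructure node (Eg).
def build_fine_grained_graph(num_atoms, bonds, substructures):
    V = list(range(num_atoms))                                # atom nodes
    E = list(bonds)                                           # chemical bonds
    Vf = [num_atoms + i for i in range(len(substructures))]   # substructure nodes
    Ef = [(f, a) for f, atoms in zip(Vf, substructures) for a in atoms]
    Vg = num_atoms + len(substructures)                       # graph-level node
    Eg = [(Vg, f) for f in Vf]                                # Vg -- every Vf
    return V + Vf + [Vg], E + Ef + Eg

# Toy molecule: 5 atoms in a chain, split into two substructure units.
Vp, Ep = build_fine_grained_graph(
    5, [(0, 1), (1, 2), (2, 3), (3, 4)], [[0, 1, 2], [3, 4]])
```

The returned node list is V′ = [V, Vf, Vg] and the edge list is E′ = [E, Ef, Eg], in that order.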
other steps and parameters are the same as in one to three embodiments.
The fifth embodiment is different from the first to fourth embodiments in that the molecular and text description representation learning module is used for inputting the text descriptions T in the sample data of the multimodal molecular pre-training data set into a text encoder, which outputs semantic features, inputting the fine-grained molecular graphs G′ corresponding to the 1.65 million molecular structures in the PubChem data set into the molecular encoder, which outputs molecular feature vectors, and training the molecular encoder GIN to obtain a pre-trained molecular encoder GIN;
the specific process is as follows:
Step 1), changing the molecular names in the texts of the molecule-text pairs in the sample data of the multimodal molecular pre-training data set into the unified format "The molecule is ...";
inputting the text with the unified format "The molecule is ..." into the text encoder, which outputs semantic features;
The text encoder is SciBERT based on the BERT architecture;
The text description serves as supplementary knowledge for the molecular graph and summarizes, in natural language, the functions, properties and other aspects of the molecule. To prevent leakage of the collected molecular text data and eliminate bias from molecular names, we change the molecule names in the 316k molecule-text pairs to the unified format "The molecule is ..." to enhance the generalization and interpretability of the model, so that it concentrates on the inherent links between molecular structure and properties rather than relying on specific molecule names.
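The name normalization can be sketched as a simple string rewrite. Note the exact replacement string is garbled in the source ("thermo molecular is"); "The molecule" is assumed here, and the helper name is illustrative:

```python
# Replace the molecule name at the start of each description with a fixed
# prefix so the text encoder cannot rely on specific molecule names.
def normalize_description(text, molecule_name):
    if text.startswith(molecule_name):
        # swap the name for the neutral prefix, keeping the rest of the text
        return "The molecule" + text[len(molecule_name):]
    return "The molecule is " + text

t = normalize_description(
    "Acetic acid is a product of the oxidation of ethanol.", "Acetic acid")
```

The normalized text begins "The molecule is a product of ..." and no longer mentions the compound by name.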
SciBERT, pre-trained on a large corpus from the fields of biochemistry, medicine and related domains, provides richer and more accurate semantic domain knowledge for molecules.
Step 2), inputting the node features and bond features corresponding to the 1.65 million molecular structures in the PubChem data set, together with the fine-grained molecular graphs G′ corresponding to those structures, into the molecular encoder GIN, which outputs feature vectors;
the molecular encoder GIN sequentially comprises an input layer, three hidden layers and an output layer;
Optimizing the molecular encoder GIN by adopting a self-supervision learning method of a generating task and a predicting task until convergence to obtain a pre-trained molecular encoder GIN;
Adopting the feature vectors corresponding to V and E to complete the generation task;
adopting a feature vector corresponding to V g,Eg to complete a prediction task;
The generation tasks are atom connectivity, the atom type corresponding to the atomic number (hydrogen, carbon and oxygen belong to different atom types), and the bond types between atoms;
The predictive tasks are the number of atoms in the molecular structure (how many atoms are in the molecule) and the number of bonds (how many chemical bonds are in the molecule);
The first layer of the molecular encoder GIN receives the original node features as input; layers 2 to 4 are hidden layers, each taking the output of the previous layer as input, aggregating the features of neighboring nodes and updating the node representations through a learnable multi-layer perceptron (MLP); the node features after adding residual connections (adding the node representation of the current layer to that of the previous layer) are the output of each layer;
For molecular structure processing, we employ a Graph Isomorphism Network (GIN), a graph neural network (GNN) variant widely used in the field of molecular representation learning, to encode the topology. GIN is distinguished by its excellent ability to capture and express the topological features of molecules.
Processing the 1.65 million molecular structures in the PubChem data set with the RDKit tool to obtain node features and bond features, inputting them into the five-layer GIN model, which outputs feature vectors, and optimizing the molecular encoder GIN with the two self-supervised learning tasks (generation and prediction) until convergence, to obtain the pre-trained molecular encoder GIN;
Our model is thus able to learn and generalize the structural information of molecules more accurately.
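The layer update described above can be sketched in plain Python. This is a minimal illustration only: the learnable MLP is reduced to a single linear transform plus ReLU, and all weights and features are toy values, not trained parameters:

```python
# One GIN-style update: aggregate (1+eps)-scaled self feature plus neighbor
# features, transform with a one-layer "MLP", then add a residual connection.
def gin_layer(h, adj, w, eps=0.0):
    """h: list of d-dim node features; adj: adjacency list; w: d x d weights."""
    out = []
    for v, hv in enumerate(h):
        # aggregate the node's own scaled feature and its neighbors' features
        agg = [(1 + eps) * x for x in hv]
        for u in adj[v]:
            agg = [a + b for a, b in zip(agg, h[u])]
        # stand-in for the learnable MLP: linear transform + ReLU
        z = [max(0.0, sum(w[i][j] * agg[j] for j in range(len(agg))))
             for i in range(len(w))]
        # residual connection: add the previous-layer node representation
        out.append([a + b for a, b in zip(z, hv)])
    return out

h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 nodes, 2-dim features
adj = [[1], [0, 2], [1]]                   # a path graph 0-1-2
w = [[1.0, 0.0], [0.0, 1.0]]               # identity weights for the toy case
h1 = gin_layer(h, adj, w)
```

Stacking four such updates after the input layer gives the five-layer structure described in the text.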
Other steps and parameters are the same as in one to four embodiments.
The sixth embodiment is different from the first to fifth embodiments in that the self-supervision cooperative contrast optimization module trains the pre-trained molecular encoder GIN and text encoder by adopting a contrast learning method to obtain the trained molecular encoder GIN and text encoder, wherein the trained molecular encoder GIN and text encoder form a multi-modal fine-granularity molecular pre-training model;
the specific process is as follows:
The training set (molecule-text pair) is sample data within the multimodal molecule pre-training data set;
The loss function is InfoNCE loss functions;
the optimization method is a contrast learning method;
The feature vector corresponding to Vg, Eg among the molecular feature vectors output by the pre-trained molecular encoder GIN is used as the feature-vector representation of the molecule.
Training the text encoder and the pre-trained molecular encoder GIN until convergence to obtain a trained text encoder and a trained molecular encoder GIN;
The trained text encoder and the molecular encoder GIN form a multi-mode fine-granularity molecular pre-training model so as to achieve the optimal alignment between the molecular structural characteristics and the semantic characteristics;
The InfoNCE loss function is:
L = -(1/N) Σ_{i=1}^{N} log[ exp(sim(h_g^i, h_t^i)/τ) / Σ_{j=1}^{N} exp(sim(h_g^i, h_t^j)/τ) ]
where h_g^i is the molecular feature vector output by the pre-trained molecular encoder GIN for the molecule in the i-th sample pair;
h_t^i is the text feature vector output by the text encoder for the text in the i-th sample pair;
h_t^j is the text feature vector output by the text encoder for the text in the j-th sample pair;
sim(·) is a similarity function that measures the similarity between the two modal features, τ is a temperature parameter, and N is the total number of molecule-text pairs.
The semantic relevance between molecules and text is enhanced with a multimodal contrastive learning method. The method improves the matching quality of molecules and texts by continuously optimizing the model, while ensuring that unmatched molecule-text pairs keep a certain distance in the embedding space. We contrast n molecule-text pairs (g_1, t_1), (g_2, t_2), …, (g_n, t_n), where g_i and t_i are the i-th molecule and its corresponding text description, respectively;
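The InfoNCE objective above can be sketched in plain Python. This assumes cosine similarity for sim(·); the feature vectors and temperature are toy values:

```python
import math

def cosine(a, b):
    # cosine similarity between two feature vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def info_nce(mol_feats, text_feats, tau=0.1):
    """Average InfoNCE loss: for molecule i, text i is the positive pair
    and all other texts are negatives."""
    n = len(mol_feats)
    loss = 0.0
    for i in range(n):
        logits = [cosine(mol_feats[i], text_feats[j]) / tau for j in range(n)]
        m = max(logits)  # subtract the max for numerical stability
        denom = sum(math.exp(l - m) for l in logits)
        loss += -(logits[i] - m - math.log(denom))
    return loss / n

mols = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]   # perfectly aligned toy pairs
loss = info_nce(mols, texts)       # near zero for aligned pairs
```

With perfectly aligned pairs the loss approaches zero; mismatched embeddings drive it up, which is what pushes matched pairs together and unmatched pairs apart.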
verifying a multimodal fine-grained molecular pre-training model:
the zero sample retrieval task of a new molecular structure/text based on a multi-mode fine-granularity molecular pre-training model comprises the following specific processes:
Acquiring new molecular structure text pair datasets PCDes and MoMu (new molecular structure text pairs);
freezing the parameters of the multimodal fine-grained molecular pre-training model for the zero-shot molecular structure/text retrieval task;
The zero-shot molecular structure retrieval task inputs the new molecular structure-text pairs into the multimodal fine-grained molecular pre-training model to obtain molecular structure features h_g and semantic features h_t, and computes the cosine similarity between h_g and h_t to obtain a cosine similarity matrix, where each row contains the similarities between one molecular structure feature vector and all semantic feature vectors;
For each molecular structure representation, determining the index of the most similar text representation and comparing it with the index of the correct text representation (the first text corresponds to the first molecular structure, the second text to the second molecular structure, and so on); if they match, the retrieval is considered successful, and the retrieval accuracy is calculated;
For each molecular structure feature vector, the semantic feature vectors are ranked by similarity value and the most similar items are retrieved; a retrieval is counted as successful if the correct item appears among the top 20 most similar items after ranking. The proportion of correct retrievals is calculated to obtain the average recall rate.
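The retrieval evaluation above can be sketched as follows. This computes top-1 accuracy and top-K recall from the cosine-similarity matrix; the features are toy values and K=2 here stands in for the K=20 used in the text:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def retrieval_metrics(mol_feats, text_feats, k):
    """Molecule i's correct text is text i. Returns (top-1 accuracy,
    recall@k) over all molecules."""
    n = len(mol_feats)
    top1_hits = topk_hits = 0
    for i in range(n):
        # row i of the cosine similarity matrix
        sims = [cosine(mol_feats[i], text_feats[j]) for j in range(n)]
        ranked = sorted(range(n), key=lambda j: sims[j], reverse=True)
        top1_hits += ranked[0] == i      # most similar text is the correct one
        topk_hits += i in ranked[:k]     # correct text within the top k
    return top1_hits / n, topk_hits / n

mols = [[1.0, 0.1], [0.1, 1.0], [1.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
acc, recall = retrieval_metrics(mols, texts, k=2)
```

Swapping the roles of `mols` and `texts` gives the text-to-molecule direction of the same zero-shot retrieval task.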
Other steps and parameters are the same as in one of the first to fifth embodiments.
This embodiment differs from embodiments one to six in that the instruction fine-tuning downstream task module uses prompt learning to guide the multi-modal fine-grained molecular pre-training model to understand downstream tasks, obtaining a trained prompt-learning-based multi-modal fine-grained molecular pre-training model;
the specific process is as follows:
To improve the model's performance on specific molecular prediction tasks, the project fine-tunes the model parameter-efficiently with prompts guided by expert knowledge. Specifically, initialization prompts are first customized for different tasks so that the model can understand the task goals and requirements. Specialized prompts designed for the attribute prediction and drug-drug interaction (DDI) tasks provide rich background and target information, helping the model understand the relevant biological and chemical principles and ensuring that it can identify and exploit key molecular features.
1) Construct the instruction prompt texts; the specific steps are as follows:
11) Acquire the attribute prediction tasks in the training set of sample data in the downstream task data set (the samples contain only molecular structures without text; the text comes from the prompts, and the molecular structures are encoded with the trained GIN);
The attribute prediction task comprises six subtasks: BBBP, Bace, Sider, Tox21, ToxCast, and HIV; each subtask corresponds to one data set, i.e., the attribute prediction task comprises six data sets;
BBBP is blood brain barrier penetrability;
the Bace is a beta-amyloid precursor protein lyase;
Sider is a side effect;
Tox21 refers to the Toxicology in the 21st Century testing program;
ToxCast is a multi-year toxicology prediction research program initiated by the United States Environmental Protection Agency (EPA);
the HIV is a human immunodeficiency virus;
The BBBP dataset comprises 2039 molecular structures and corresponding attribute tags;
the Bace dataset comprises 1513 molecular structures and corresponding attribute tags;
The Sider dataset comprises 1427 molecular structures and corresponding attribute tags;
The Tox21 dataset comprises 7831 molecular structures and corresponding attribute tags;
the ToxCast dataset comprises 8576 molecular structures and corresponding attribute tags;
The HIV dataset contains 41127 molecular structures and corresponding attribute tags;
12) Acquire the drug interaction tasks in the training set of sample data in the downstream task data set (the samples contain only molecular structures without text; the text comes from the prompts, and the molecular structures are encoded with the trained GIN);
The drug interaction task comprises three subtasks: ZhangDDI, ChChMiner, and DeepDDI; each subtask corresponds to one data set, i.e., the drug interaction task comprises three data sets;
The ZhangDDI dataset contains 48548 molecular structure pairs (two molecular structures) and corresponding drug interaction relationship tags;
The ChChMiner dataset contains 48514 molecular structure pairs (two molecular structures each) and corresponding drug interaction relationship labels;
The DeepDDI dataset contains 192284 molecular structure pairs (two molecular structures) and corresponding drug interaction relationship tags;
13 Constructing an instruction prompt text according to the task target, the task background, the guiding principle and the task requirement of each subtask;
Taking the BBBP (blood-brain barrier penetrability) task as an example, its prompt is: "This task is BBBP, our objective is to predict whether a drug molecule can penetrate the blood-brain barrier, which is composed of brain capillary endothelial cells. The blood-brain barrier is highly selective, allowing only certain substances to pass through. We need to analyze the following molecular structural characteristics: lipophilicity, molecular weight, charge state, protein binding capacity, presence of hydrophobic groups, activity of metabolic products, and the cyclic structure of the molecule. If the molecule has high lipophilicity, a small molecular weight, lacks charge, binds minimally with proteins, has hydrophobic groups, no active metabolites, and is a polycyclic compound, it is more likely to penetrate the blood-brain barrier. Please use these guiding principles to determine whether this molecule has the capability to do so." The prompt is designed around the task goal, task background, guiding principles, and task requirements.
The Bace (beta-amyloid precursor protein lyase) task prompt is: "This task is BACE, our goal is to predict whether a drug molecule can effectively inhibit BACE enzyme, which is a key target for the treatment of Alzheimer's disease. The activity of BACE enzyme is closely related to the production of beta-amyloid protein. We need to analyze the following molecular structural characteristics: binding affinity, molecular weight, lipophilicity, hydrogen bond capability, stereoisomerism, and metabolic stability. If the molecule demonstrates strong binding affinity with the active site of BACE enzyme, possesses key chemical groups to form necessary hydrogen bonds, and is stable under physiological conditions with good cell membrane permeability, it may be an effective BACE inhibitor. Please use these guiding principles to analyze and determine whether this molecule has the potential for BACE inhibition.";
The Sider (side effect) task prompt is: "This task is SIDE, our goal is to predict the potential side effects that a drug molecule might cause. Predicting drug side effects is crucial for the safety assessment of medications. We need to analyze the following molecular structural characteristics: pharmacological mechanisms of action such as receptor binding or enzyme inhibition, properties like molecular weight, lipophilicity, and solubility, and pharmacokinetic characteristics such as absorption, distribution, metabolism, and excretion. Please use these guidelines to analyze and predict the potential side effects that this drug molecule may cause."
The Tox21 (21st century toxicology test) task prompt is: "This task is TOX21, our goal is to predict whether a pharmaceutical molecule might induce a range of biological effects associated with toxicity. The TOX21 project aims to identify the potential toxicity of chemical substances, including hormone disruption, genotoxicity, and more. We need to analyze molecular structural characteristics: the presence of aromatic rings, the balance between lipophilicity and hydrophilicity, metabolic stability, and permeability. If a molecule affects the pharmacological action of known toxicity-related targets, shows metabolic instability or is likely to produce toxic metabolites, or exhibits poor cellular permeability, then it may have TOX21 properties. Please use these guiding principles to predict whether the pharmaceutical molecule has TOX21 characteristics."
The ToxCast (a multi-year toxicology prediction research program initiated by the United States Environmental Protection Agency (EPA)) task prompt is: "This task is ToxCast, our goal is to assess the potential toxicity of chemical substances to biological systems. The ToxCast project is a large-scale, high-throughput screening project that uses a variety of biological tests to predict the toxicity of chemical substances. We need to analyze the following molecular structural characteristics: molecular weight, solubility, lipophilicity, etc. If a molecule has a large molecular weight, low solubility, high lipophilicity, and active metabolites, it is more likely to exhibit specific toxicity. Please judge whether this molecule has potential toxicity based on these guidelines."
The HIV (human immunodeficiency virus) task prompt is: "This task is HIV, our goal is to assess the potential inhibitory effect of drug molecules on HIV. HIV is a virus that attacks the human immune system, particularly affecting CD4+ T cells, leading to Acquired Immunodeficiency Syndrome (AIDS). We need to analyze the following molecular structural characteristics: lipophilicity, molecular weight, charge state, protein binding capacity, hydrophobic groups, and the activity of metabolic products. If the molecule has an optimized chemical structure targeting HIV's key proteins, a smaller molecular weight, strong protein binding capacity, hydrophobic groups, and good metabolic stability, then it is more likely to have an inhibitory effect on HIV. Please judge whether this molecule has the antiviral potential of specific drug molecules.";
The drug interaction task prompt is: "This task is Drug-Drug Interaction, which refers to the phenomenon where the simultaneous use of two or more drugs in the body can lead to an enhancement or reduction in their efficacy, or even cause adverse reactions, due to the mutual influence between the drugs. Determine if the interaction between these two drugs is positive or not." The drug interaction tasks differ only in drug type; the model only needs to determine whether an interaction exists between each drug pair.
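For illustration, the per-subtask instruction prompts could be organized as a simple registry keyed by subtask name, so that every molecule in a subtask shares one prompt. The dictionary and function names are hypothetical, and the prompt strings are abbreviated versions of the full texts above:

```python
# Hypothetical prompt registry: each subtask maps to the instruction prompt
# text constructed in step 13).  Prompts are abbreviated here; the full
# texts appear above.
TASK_PROMPTS = {
    "BBBP": ("This task is BBBP, our objective is to predict whether a drug "
             "molecule can penetrate the blood-brain barrier..."),
    "Bace": ("This task is BACE, our goal is to predict whether a drug "
             "molecule can effectively inhibit BACE enzyme..."),
    "Sider": "This task is SIDE, our goal is to predict the potential side effects...",
    "Tox21": "This task is TOX21, our goal is to predict whether a pharmaceutical molecule...",
    "ToxCast": "This task is ToxCast, our goal is to assess the potential toxicity...",
    "HIV": "This task is HIV, our goal is to assess the potential inhibitory effect...",
    "DDI": "This task is Drug-Drug Interaction...",
}

def get_prompt(subtask):
    """Fetch the instruction prompt for a subtask; every molecule in the
    same subtask shares this one prompt (so its h_p is identical)."""
    return TASK_PROMPTS[subtask]
```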
2) Obtain the feature vector h_p of the prompt text based on the instruction prompt text constructed in step 1);
3) Perform attribute prediction on molecular structures (without attribute labels) based on the multi-modal fine-grained molecular pre-training model and the feature vector of the prompt text;
4) Predict the drug interaction relationship of molecular structure pairs (without interaction labels) based on the multi-modal fine-grained molecular pre-training model and the feature vector of the prompt text;
5) Repeat steps 1) to 4) until convergence, obtaining a trained prompt-learning-based multi-modal fine-grained molecular pre-training model.
Other steps and parameters are the same as those of embodiments one to six.
The eighth embodiment differs from the first to seventh embodiments in step 2), obtaining the feature vector h_p of the prompt text based on the instruction prompt text constructed in step 1); the specific process is as follows:
Create a prompt embedding layer P ∈ R^(l×d), where l is the number of virtual tokens and d is the embedding dimension;
Input the instruction prompt text constructed in step 1) into the tokenizer of the trained text encoder, which outputs a token ID sequence;
Input the token ID sequence into the trained text encoder; the word embedding layer in the trained text encoder outputs word embedding vectors;
Use the word embedding vectors as the initial weights of the created prompt embedding layer P, obtaining an initialized prompt embedding layer;
Input the initialized virtual tokens in the prompt embedding layer into the trained text encoder, which outputs the feature vector h_p corresponding to the instruction prompt text.
Each subtask corresponds to a number of molecular structures; each molecular structure corresponds to one h_p, and the h_p values within one subtask are identical.
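The steps above — tokenize the instruction prompt, look up its word embeddings, and use them as the initial weights of the prompt embedding layer P ∈ R^(l×d) — can be sketched with a toy tokenizer and embedding table. In the actual system these come from the trained SciBERT encoder; everything below is a stand-in for illustration:

```python
import numpy as np

# Toy stand-ins for the trained text encoder's tokenizer and word-embedding
# layer (in the actual system these come from SciBERT).
VOCAB = {"this": 0, "task": 1, "is": 2, "bbbp": 3, "[unk]": 4}
rng = np.random.default_rng(0)
WORD_EMBEDDINGS = rng.normal(size=(len(VOCAB), 16))   # (vocab size, d)

def tokenize(text):
    """Map the instruction prompt text to a token ID sequence."""
    return [VOCAB.get(w, VOCAB["[unk]"]) for w in text.lower().split()]

def init_prompt_embedding(prompt_text, num_virtual_tokens):
    """Create the prompt embedding layer P in R^(l x d) and initialise its
    weights from the word embeddings of the instruction prompt text."""
    ids = tokenize(prompt_text)[:num_virtual_tokens]
    d = WORD_EMBEDDINGS.shape[1]
    P = np.zeros((num_virtual_tokens, d))
    P[:len(ids)] = WORD_EMBEDDINGS[ids]   # word embeddings as initial weights
    return P
```

During fine-tuning, the rows of P would then be passed through the frozen text encoder as virtual tokens to produce h_p.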
Other steps and parameters are the same as those of embodiments one to seven.
The ninth embodiment differs from the first to eighth embodiments in step 3), performing attribute prediction on molecular structures (without attribute labels) based on the multi-modal fine-grained molecular pre-training model and the feature vector of the prompt text; the specific process is as follows:
31) Input the BBBP (blood-brain barrier penetrability) molecular structures (without attribute labels) into the trained molecular encoder GIN in the multi-modal fine-grained molecular pre-training model; the trained molecular encoder GIN outputs the structural feature vector h_g1;
Based on the instruction prompt text constructed for BBBP in step 1), obtain the feature vector h_p1 of the prompt text according to step 2);
Fuse the prompt-text feature vector h_p1 with the structural feature vector h_g1 to obtain the attribute feature representation of the BBBP molecular structure after prompt guidance;
where h_property1 is the attribute feature representation of the BBBP molecular structure after prompt guidance, α is a weight parameter, and ⊕ denotes splicing;
Input the prompt-guided attribute feature representation of the BBBP molecular structure into a classifier for classification (to obtain the attributes);
32) Input the Bace (beta-amyloid precursor protein lyase) molecular structures (without attribute labels) into the trained molecular encoder GIN in the multi-modal fine-grained molecular pre-training model; the trained molecular encoder GIN outputs the structural feature vector h_g2;
Based on the instruction prompt text constructed for Bace in step 1), obtain the feature vector h_p2 of the prompt text according to step 2);
Fuse the prompt-text feature vector h_p2 with the structural feature vector h_g2 to obtain the attribute feature representation of the Bace molecular structure after prompt guidance;
where h_property2 is the attribute feature representation of the Bace molecular structure after prompt guidance, α is a weight parameter, and ⊕ denotes splicing;
Input the prompt-guided attribute feature representation of the Bace molecular structure into a classifier for classification (to obtain the attributes);
33) Input the Sider (side effect) molecular structures (without attribute labels) into the trained molecular encoder GIN in the multi-modal fine-grained molecular pre-training model; the trained molecular encoder GIN outputs the structural feature vector h_g3;
Based on the instruction prompt text constructed for Sider in step 1), obtain the feature vector h_p3 of the prompt text according to step 2);
Fuse the prompt-text feature vector h_p3 with the structural feature vector h_g3 to obtain the attribute feature representation of the Sider molecular structure after prompt guidance;
where h_property3 is the attribute feature representation of the Sider molecular structure after prompt guidance, α is a weight parameter, and ⊕ denotes splicing;
Input the prompt-guided attribute feature representation of the Sider molecular structure into a classifier for classification (to obtain the attributes);
34) Input the Tox21 (21st century toxicology test) molecular structures (without attribute labels) into the trained molecular encoder GIN in the multi-modal fine-grained molecular pre-training model; the trained molecular encoder GIN outputs the structural feature vector h_g4;
Based on the instruction prompt text constructed for Tox21 in step 1), obtain the feature vector h_p4 of the prompt text according to step 2);
Fuse the prompt-text feature vector h_p4 with the structural feature vector h_g4 to obtain the attribute feature representation of the Tox21 molecular structure after prompt guidance;
where h_property4 is the attribute feature representation of the Tox21 molecular structure after prompt guidance, α is a weight parameter, and ⊕ denotes splicing;
Input the prompt-guided attribute feature representation of the Tox21 molecular structure into a classifier for classification (to obtain the attributes);
35) Input the ToxCast (a multi-year toxicology prediction research program initiated by the United States Environmental Protection Agency (EPA)) molecular structures (without attribute labels) into the trained molecular encoder GIN in the multi-modal fine-grained molecular pre-training model; the trained molecular encoder GIN outputs the structural feature vector h_g5;
Based on the instruction prompt text constructed for ToxCast in step 1), obtain the feature vector h_p5 of the prompt text according to step 2);
Fuse the prompt-text feature vector h_p5 with the structural feature vector h_g5 to obtain the attribute feature representation of the ToxCast molecular structure after prompt guidance;
where h_property5 is the attribute feature representation of the ToxCast molecular structure after prompt guidance, α is a weight parameter, and ⊕ denotes splicing;
Input the prompt-guided attribute feature representation of the ToxCast molecular structure into a classifier for classification (to obtain the attributes);
36) Input the HIV (human immunodeficiency virus) molecular structures (without attribute labels) into the trained molecular encoder GIN in the multi-modal fine-grained molecular pre-training model; the trained molecular encoder GIN outputs the structural feature vector h_g6;
Based on the instruction prompt text constructed for HIV in step 1), obtain the feature vector h_p6 of the prompt text according to step 2);
Fuse the prompt-text feature vector h_p6 with the structural feature vector h_g6 to obtain the attribute feature representation of the HIV molecular structure after prompt guidance;
where h_property6 is the attribute feature representation of the HIV molecular structure after prompt guidance, α is a weight parameter, and ⊕ denotes splicing;
Input the prompt-guided attribute feature representation of the HIV molecular structure into a classifier for classification (to obtain the attributes);
Each subtask corresponds to one data set; each data set comprises a number of molecular structures and corresponding attribute labels; each molecular structure corresponds to one h_p, and the h_p values within one subtask are identical.
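The fusion equations themselves appear as images in the source and are not reproduced in the text, which states only that h_p and h_g are fused with a weight parameter α and a splicing (concatenation) operation ⊕. Under that stated assumption, one plausible sketch of the fusion and classification step is:

```python
import numpy as np

def fuse(h_p, h_g, alpha=0.5):
    """One plausible reading of the prompt-guided fusion: weight the prompt
    feature h_p by alpha, weight the structure feature h_g by (1 - alpha),
    and splice (concatenate) them.  The patent's exact formula is an image
    and is not reproduced in the text, so this form is an assumption.
    """
    return np.concatenate([alpha * h_p, (1 - alpha) * h_g])

def classify(h_property, W, b):
    """Linear classifier head over the fused attribute representation
    (the classifier weights W, b are placeholders)."""
    logits = W @ h_property + b
    return int(np.argmax(logits))
```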
Other steps and parameters are the same as in embodiments one to eight.
The tenth embodiment differs from the first to ninth embodiments in step 4), predicting the drug interaction relationship of molecular structure pairs (without interaction labels) based on the multi-modal fine-grained molecular pre-training model and the feature vector of the prompt text; the specific process is as follows:
41) Input the ZhangDDI molecular structure pairs (without interaction labels) into the trained molecular encoder GIN in the multi-modal fine-grained molecular pre-training model; the trained molecular encoder GIN outputs structural feature vectors h_g7 and h_g8;
Based on the instruction prompt text constructed for ZhangDDI in step 1), obtain the feature vector h_p7 of the prompt text according to step 2);
Splice the structural feature vectors h_g7 and h_g8 with the prompt-text feature vector h_p7 to obtain the drug interaction relationship feature representation of the ZhangDDI molecular structure pair after prompt guidance;
where h_ddi1 is the drug interaction relationship feature representation of the ZhangDDI molecular structure pair after prompt guidance, and ⊕ denotes splicing;
Input the prompt-guided drug interaction relationship feature representation of the ZhangDDI molecular structure pair into a classifier for classification (to obtain the interaction relationship);
42) Input the ChChMiner molecular structure pairs (without interaction labels) into the trained molecular encoder GIN in the multi-modal fine-grained molecular pre-training model; the trained molecular encoder GIN outputs structural feature vectors h_g9 and h_g10;
Based on the instruction prompt text constructed for ChChMiner in step 1), obtain the feature vector h_p8 of the prompt text according to step 2);
Splice the structural feature vectors h_g9 and h_g10 with the prompt-text feature vector h_p8 to obtain the drug interaction relationship feature representation of the ChChMiner molecular structure pair after prompt guidance;
where h_ddi2 is the drug interaction relationship feature representation of the ChChMiner molecular structure pair after prompt guidance;
Input the prompt-guided drug interaction relationship feature representation of the ChChMiner molecular structure pair into a classifier for classification (to obtain the interaction relationship);
43) Input the DeepDDI molecular structure pairs (without interaction labels) into the trained molecular encoder GIN in the multi-modal fine-grained molecular pre-training model; the trained molecular encoder GIN outputs structural feature vectors h_g11 and h_g12;
Based on the instruction prompt text constructed for DeepDDI in step 1), obtain the feature vector h_p9 of the prompt text according to step 2);
Splice the structural feature vectors h_g11 and h_g12 with the prompt-text feature vector h_p9 to obtain the drug interaction relationship feature representation of the DeepDDI molecular structure pair after prompt guidance;
where h_ddi3 is the drug interaction relationship feature representation of the DeepDDI molecular structure pair after prompt guidance;
Input the prompt-guided drug interaction relationship feature representation of the DeepDDI molecular structure pair into a classifier for classification (to obtain the interaction relationship).
Each subtask corresponds to one data set; each data set comprises a number of molecular structure pairs (two molecular structures each) and corresponding drug interaction relationship labels; each molecular structure pair corresponds to one h_p, and the h_p values within one subtask are identical.
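A minimal sketch of the DDI prediction step: splice the two GIN structure features of a drug pair with the shared task-prompt feature and apply a binary classifier head. The sigmoid head and its weights are assumptions made for illustration; the patent only specifies splicing followed by a classifier:

```python
import numpy as np

def predict_interaction(h_ga, h_gb, h_p, w, b=0.0):
    """Splice the two GIN structure features of a drug pair with the shared
    prompt feature (h_ddi = h_ga (+) h_gb (+) h_p), then apply a binary
    classifier head.  The sigmoid head and the weights w, b are assumptions;
    the patent only specifies splicing followed by a classifier."""
    h_ddi = np.concatenate([h_ga, h_gb, h_p])        # spliced representation
    score = 1.0 / (1.0 + np.exp(-(w @ h_ddi + b)))   # sigmoid probability
    return score > 0.5                               # interaction or not
```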
Other steps and parameters are the same as in embodiments one through nine.
The present invention is capable of other and further embodiments and its several details are capable of modification and variation in light of the present invention, as will be apparent to those skilled in the art, without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1.基于提示学习的多模态细粒度分子预训练模型的分子结构预测系统,其特征在于:所述系统包括:数据获取模块、处理模块、预测模块;1. A molecular structure prediction system based on a multimodal fine-grained molecular pre-training model of prompt learning, characterized in that: the system comprises: a data acquisition module, a processing module, and a prediction module; 所述数据获取模块用于获取多模态分子预训练数据集内的样本数据以及下游任务数据集内的样本数据;The data acquisition module is used to acquire sample data in the multimodal molecular pre-training data set and sample data in the downstream task data set; 处理模块用于建立基于提示学习的多模态细粒度分子预训练模型,并获取训练好的基于提示学习的多模态细粒度分子预训练模型;The processing module is used to establish a multimodal fine-grained molecular pre-training model based on prompt learning, and obtain the trained multimodal fine-grained molecular pre-training model based on prompt learning; 所述预测模块用于基于训练好的提示学习的多模态细粒度分子预训练模型对待测分子结构进行属性和药物相互作用关系的预测,获得预测结果。The prediction module is used to predict the properties and drug interaction relationship of the molecular structure to be tested based on the multimodal fine-grained molecular pre-training model of trained prompt learning to obtain a prediction result. 2.根据权利要求1所述的基于提示学习的多模态细粒度分子预训练模型的分子结构预测系统,其特征在于:所述数据获取模块用于获取多模态分子预训练数据集内的样本数据以及下游任务数据集内的样本数据;2. According to the molecular structure prediction system of the multimodal fine-grained molecular pre-training model based on prompt learning in claim 1, it is characterized in that: the data acquisition module is used to acquire sample data in the multimodal molecular pre-training data set and sample data in the downstream task data set; 具体过程为:The specific process is: 多模态分子预训练数据集内的样本数据为分子-文本对;The sample data in the multimodal molecular pre-training dataset are molecule-text pairs; 下游任务数据集内的样本数据为分子结构。The sample data in the downstream task dataset is a molecular structure. 3.根据权利要求2所述的基于提示学习的多模态细粒度分子预训练模型的分子结构预测系统,其特征在于:所述处理模块用于建立基于提示学习的多模态细粒度分子预训练模型,并获取训练好的基于提示学习的多模态细粒度分子预训练模型;3. 
The molecular structure prediction system of the multimodal fine-grained molecular pre-training model based on prompt learning according to claim 2, characterized in that: the processing module is used to establish the multimodal fine-grained molecular pre-training model based on prompt learning, and obtain the trained multimodal fine-grained molecular pre-training model based on prompt learning; 具体过程为:The specific process is: 基于提示学习的多模态细粒度分子预训练模型包括:细粒度分子图构建模块、分子和文本描述表示学习模块、自监督协同对比优化模块、指令微调下游任务模块;The multimodal fine-grained molecular pre-training model based on prompt learning includes: a fine-grained molecular graph construction module, a molecular and text description representation learning module, a self-supervised collaborative comparison optimization module, and an instruction fine-tuning downstream task module; 所述细粒度分子图构建模块采用特定的分解规则将多模态分子预训练数据集内的样本数据中的分子结构图G分解为子结构单元,基于子结构单元构建细粒度分子图G';The fine-grained molecular graph construction module uses a specific decomposition rule to decompose the molecular structure graph G in the sample data in the multimodal molecular pre-training data set into sub-structure units, and constructs a fine-grained molecular graph G' based on the sub-structure units; 所述分子和文本描述表示学习模块用于将多模态分子预训练数据集内的样本数据中的文本描述T输入文本编码器,文本编码器输出语义特征;将PubChem数据集中的165万个分子结构对应的细粒度分子图G″输入分子编码器,分子编码器输出分子特征向量;对分子编码器GIN进行训练,得到预训练好的分子编码器GIN;The molecular and text description representation learning module is used to input the text description T in the sample data in the multimodal molecular pre-training data set into the text encoder, and the text encoder outputs semantic features; input the fine-grained molecular graph G″ corresponding to 1.65 million molecular structures in the PubChem data set into the molecular encoder, and the molecular encoder outputs a molecular feature vector; train the molecular encoder GIN to obtain a pre-trained molecular encoder GIN; 所述自监督协同对比优化模块采用对比学习方法训练预训练好的分子编码器GIN和文本编码器,获得训练好的分子编码器GIN和文本编码器;训练好的分子编码器GIN和文本编码器构成多模态细粒度分子预训练模型;The 
self-supervised collaborative contrast optimization module uses a contrast learning method to train the pre-trained molecular encoder GIN and text encoder to obtain the trained molecular encoder GIN and text encoder; the trained molecular encoder GIN and text encoder constitute a multimodal fine-grained molecular pre-training model; 所述指令微调下游任务模块用于采用提示学习引导多模态细粒度分子预训练模型理解下游任务,获得训练好的基于提示学习的多模态细粒度分子预训练模型。The instruction fine-tuning downstream task module is used to use prompt learning to guide the multimodal fine-grained molecular pre-training model to understand downstream tasks, and obtain a trained multimodal fine-grained molecular pre-training model based on prompt learning. 4.根据权利要求3所述的基于提示学习的多模态细粒度分子预训练模型的分子结构预测系统,其特征在于:所述细粒度分子图构建模块采用特定的分解规则将多模态分子预训练数据集内的样本数据中的分子结构图G分解为子结构单元,基于子结构单元构建细粒度分子图G';4. The molecular structure prediction system of the multimodal fine-grained molecular pre-training model based on prompt learning according to claim 3 is characterized in that: the fine-grained molecular graph construction module uses a specific decomposition rule to decompose the molecular structure graph G in the sample data in the multimodal molecular pre-training data set into sub-structure units, and constructs a fine-grained molecular graph G' based on the sub-structure units; 具体过程为:The specific process is: 1)、基于RDKit工具获取多模态分子预训练数据集内的样本数据中的SMILES字符串表示的分子结构中每个原子的化学元素特征;1) Based on the RDKit tool, obtain the chemical element characteristics of each atom in the molecular structure represented by the SMILES string in the sample data in the multimodal molecular pre-training dataset; 所述每个原子的化学元素特征包含原子的序号、连接度、原子间的键类型和原子成环类型;The chemical element characteristics of each atom include the atomic number, degree of connectivity, type of bonds between atoms, and type of atomic ring formation; 将连接度特征作为节点特征;Use connectivity features as node features; 将原子间的键类型和原子成环类型作为键特征;The bond type between atoms and the type of atoms forming a ring are used as bond characteristics; 
The connectivity degree is the number of chemical bonds attached to an atom.
2) Using the RDKit toolkit, convert the molecular structure represented by the SMILES string in the sample data of the multimodal molecular pre-training dataset into a 2D topological graph G = (V, E), where V is the node set and E is the edge set of G.
3) Decompose the molecular structures in the sample data of the multimodal molecular pre-training dataset with the BRICS decomposition method to obtain substructure units; the specific process is: first, perform a preliminary BRICS decomposition of the molecular structures to obtain multiple preliminary substructure units; then decompose each preliminary substructure unit further to obtain the minimal substructure units.
4) Add the substructure units to the topological graph as new nodes Vf; add the connection relations Ef between each substructure unit and the atom nodes it contains as new edges; construct an empty graph-level node Vg and connect it to all new nodes Vf to obtain the connection relations Eg. From V, Vf, Vg, E, Ef, and Eg, form the fine-grained molecular graph G' = (V', E'), where V' = [V, Vf, Vg] and E' = [E, Ef, Eg].
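Steps 2) through 4) above can be sketched in plain Python (a minimal illustration in which hand-made atom groupings stand in for a real RDKit/BRICS decomposition; `build_fine_grained_graph` and its argument names are hypothetical, not from the patent):

```python
def build_fine_grained_graph(atom_nodes, bonds, substructures):
    """Augment an atom-level graph (V, E) with substructure nodes Vf,
    atom-substructure edges Ef, and one graph-level node Vg with edges Eg,
    mirroring G' = (V', E') with V' = [V, Vf, Vg] and E' = [E, Ef, Eg]."""
    V = list(atom_nodes)                       # atom-level nodes
    E = [tuple(b) for b in bonds]              # atom-atom bonds
    Vf = [f"sub{i}" for i in range(len(substructures))]
    # Ef: connect each substructure node to the atoms it contains
    Ef = [(vf, a) for vf, atoms in zip(Vf, substructures) for a in atoms]
    Vg = "graph"                               # empty graph-level node
    Eg = [(Vg, vf) for vf in Vf]               # graph node -> every substructure node
    V_prime = V + Vf + [Vg]
    E_prime = E + Ef + Eg
    return V_prime, E_prime

# Toy example: a 4-atom chain split into two substructure units
V_p, E_p = build_fine_grained_graph(
    atom_nodes=[0, 1, 2, 3],
    bonds=[(0, 1), (1, 2), (2, 3)],
    substructures=[[0, 1], [2, 3]],
)
```

The resulting node list mixes the three granularities (atoms, substructures, whole graph), which is what lets a message-passing encoder exchange information across levels.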
5. The molecular structure prediction system based on the prompt-learning multimodal fine-grained molecular pre-training model according to claim 4, characterized in that the molecular and text-description representation learning module inputs the text description T in the sample data of the multimodal molecular pre-training dataset into the text encoder, which outputs semantic features; inputs the fine-grained molecular graphs G″ corresponding to the 1.65 million molecular structures in the PubChem dataset into the molecular encoder, which outputs molecular feature vectors; and trains the molecular encoder GIN to obtain a pre-trained molecular encoder GIN; the specific process is:
Step 1) Rewrite the molecule name in the text of each molecule-text pair in the sample data of the multimodal molecular pre-training dataset into the unified format "The molecule is"; input the reformatted text into the text encoder, which outputs semantic features. The text encoder is SciBERT, based on the BERT architecture.
Step 2) Input the fine-grained molecular graphs G″ corresponding to the 1.65 million molecular structures in the PubChem dataset, together with their node features and bond features, into the molecular encoder GIN, which outputs feature vectors. The molecular encoder GIN consists, in order, of an input layer followed by four hidden layers. Optimize the molecular encoder GIN with self-supervised learning on generation and prediction tasks until convergence, obtaining the pre-trained molecular encoder GIN. The generation tasks are the connectivity degree of each atom, the type corresponding to each atomic number, and the bond types between atoms; the prediction tasks are the number of atoms and the number of bonds in the molecular structure.
6. The molecular structure prediction system based on the prompt-learning multimodal fine-grained molecular pre-training model according to claim 5, characterized in that the self-supervised collaborative contrastive optimization module trains the pre-trained molecular encoder GIN and the text encoder with a contrastive learning method to obtain the trained molecular encoder GIN and text encoder, which constitute the multimodal fine-grained molecular pre-training model; the specific process is:
The training set is the sample data in the multimodal molecular pre-training dataset; the loss function is the InfoNCE loss; the optimization method is contrastive learning. The text encoder and the pre-trained molecular encoder GIN are trained until convergence, yielding the trained text encoder and molecular encoder GIN, which together constitute the multimodal fine-grained molecular pre-training model.
7. The molecular structure prediction system based on the prompt-learning multimodal fine-grained molecular pre-training model according to claim 6, characterized in that the instruction fine-tuning downstream-task module uses prompt learning to guide the multimodal fine-grained molecular pre-training model to understand downstream tasks, yielding the trained prompt-learning-based multimodal fine-grained molecular pre-training model; the specific process is:
1) Construct the instruction prompt text; the specific process is:
11) Obtain the property prediction tasks from the training set in the sample data of the downstream-task dataset. The property prediction task comprises six subtasks, each with its own dataset: BBBP, Bace, Sider, Tox21, ToxCast, and HIV. BBBP is blood-brain barrier penetration; Bace is the β-site amyloid precursor protein cleaving enzyme; Sider covers drug side effects; Tox21 is the Toxicology in the 21st Century testing program; ToxCast is a multi-year toxicology forecasting research project initiated by the EPA; HIV is the human immunodeficiency virus. The BBBP dataset contains 2,039 molecular structures with property labels; the Bace dataset contains 1,513; the Sider dataset contains 1,427; the Tox21 dataset contains 7,831; the ToxCast dataset contains 8,576; the HIV dataset contains 41,127.
12) Obtain the drug-interaction tasks from the training set in the sample data of the downstream-task dataset. The drug-interaction task comprises three subtasks, each with its own dataset: ZhangDDI, ChChMiner, and DeepDDI. The ZhangDDI dataset contains 48,548 molecular structure pairs with drug-interaction labels; the ChChMiner dataset contains 48,514; the DeepDDI dataset contains 192,284.
13) Construct the instruction prompt text according to each subtask's objective, background, guiding principles, and requirements.
2) From the instruction prompt text constructed in 1), obtain the feature vector hp of the prompt text.
3) Predict the properties of molecular structures based on the multimodal fine-grained molecular pre-training model and the prompt-text feature vector.
4) Predict drug-interaction relations for molecular structure pairs based on the multimodal fine-grained molecular pre-training model and the prompt-text feature vector.
5) Repeat 1) through 4) until convergence, obtaining the trained prompt-learning-based multimodal fine-grained molecular pre-training model.
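The InfoNCE objective named in claim 6 can be sketched without any deep-learning framework (a toy pure-Python version over already-computed embeddings; the function and variable names are illustrative, and the real system would compute this over batched GIN and SciBERT outputs with learned temperature and projection heads):

```python
import math

def cosine(u, v):
    # cosine similarity between two dense vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(mol_embs, txt_embs, tau=0.1):
    """InfoNCE over paired embeddings: the i-th molecule and i-th text
    form the positive pair; every other text in the batch is a negative."""
    losses = []
    for i, m in enumerate(mol_embs):
        sims = [cosine(m, t) / tau for t in txt_embs]
        log_denom = math.log(sum(math.exp(s) for s in sims))
        # -log softmax of the positive pair's similarity
        losses.append(log_denom - sims[i])
    return sum(losses) / len(losses)

mols = [[1.0, 0.0], [0.0, 1.0]]
txts = [[0.9, 0.1], [0.1, 0.9]]          # aligned molecule-text pairs
loss_aligned = info_nce(mols, txts)
loss_swapped = info_nce(mols, list(reversed(txts)))  # mismatched pairs
```

Aligned pairs give a lower loss than mismatched pairs, which is exactly the pressure that pulls each molecule's graph embedding toward its own description and away from the others.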
8. The molecular structure prediction system based on the prompt-learning multimodal fine-grained molecular pre-training model according to claim 7, characterized in that in 2), the feature vector hp of the prompt text is obtained from the instruction prompt text constructed in 1); the specific process is:
Create a prompt embedding layer P ∈ R^(l×d), where l is the number of virtual tokens, d is the embedding dimension, and R denotes the real numbers. Input the instruction prompt text constructed in 1) into the tokenizer of the trained text encoder, which outputs a token-ID sequence. Input the token-ID sequence into the trained text encoder; the word-embedding layer of the trained text encoder outputs word-embedding vectors, which serve as the initial weights of the prompt embedding layer P, yielding the initialized prompt embedding layer. Input the virtual tokens of the initialized prompt embedding layer into the trained text encoder, which outputs the feature vector hp corresponding to the instruction prompt text.
9. The molecular structure prediction system based on the prompt-learning multimodal fine-grained molecular pre-training model according to claim 8, characterized in that in 3), the properties of molecular structures are predicted based on the multimodal fine-grained molecular pre-training model and the prompt-text feature vector; the specific process is:
31) Input the BBBP molecular structures into the trained molecular encoder GIN of the multimodal fine-grained molecular pre-training model, which outputs the structural feature vector hg1; from the BBBP instruction prompt text constructed in 1), obtain the prompt-text feature vector hp1 according to 2); fuse hp1 with hg1 (weight parameter α, concatenation) to obtain hproperty1, the property feature representation of the prompt-guided BBBP molecular structures, and input it into the classifier for classification.
32) Input the Bace molecular structures into the trained molecular encoder GIN of the multimodal fine-grained molecular pre-training model, which outputs the structural feature vector hg2; from the Bace instruction prompt text constructed in 1), obtain the prompt-text feature vector hp2 according to 2).
Fuse hp2 with hg2 (weight parameter α, concatenation) to obtain hproperty2, the property feature representation of the prompt-guided Bace molecular structures, and input it into the classifier for classification.
33) Input the Sider molecular structures into the trained molecular encoder GIN of the multimodal fine-grained molecular pre-training model, which outputs the structural feature vector hg3; from the Sider instruction prompt text constructed in 1), obtain the prompt-text feature vector hp3 according to 2); fuse hp3 with hg3 (weight parameter α, concatenation) to obtain hproperty3, the property feature representation of the prompt-guided Sider molecular structures, and input it into the classifier for classification.
34) Input the Tox21 molecular structures into the trained molecular encoder GIN of the multimodal fine-grained molecular pre-training model, which outputs the structural feature vector hg4; from the Tox21 instruction prompt text constructed in 1), obtain the prompt-text feature vector hp4 according to 2); fuse hp4 with hg4 (weight parameter α, concatenation) to obtain hproperty4, the property feature representation of the prompt-guided Tox21 molecular structures, and input it into the classifier for classification.
35) Input the ToxCast molecular structures into the trained molecular encoder GIN of the multimodal fine-grained molecular pre-training model, which outputs the structural feature vector hg5; from the ToxCast instruction prompt text constructed in 1), obtain the prompt-text feature vector hp5 according to 2); fuse hp5 with hg5 (weight parameter α, concatenation) to obtain hproperty5, the property feature representation of the prompt-guided ToxCast molecular structures, and input it into the classifier for classification.
36) Input the HIV molecular structures into the trained molecular encoder GIN of the multimodal fine-grained molecular pre-training model, which outputs the structural feature vector hg6; from the HIV instruction prompt text constructed in 1), obtain the prompt-text feature vector hp6 according to 2); fuse hp6 with hg6 (weight parameter α, concatenation) to obtain hproperty6, the property feature representation of the prompt-guided HIV molecular structures, and input it into the classifier for classification.
10. The molecular structure prediction system based on the prompt-learning multimodal fine-grained molecular pre-training model according to claim 9, characterized in that in 4), drug-interaction relations are predicted for molecular structure pairs based on the multimodal fine-grained molecular pre-training model and the prompt-text feature vector; the specific process is:
41) Input the ZhangDDI molecular structure pairs into the trained molecular encoder GIN of the multimodal fine-grained molecular pre-training model, which outputs the structural feature vectors hg7 and hg8; from the ZhangDDI instruction prompt text constructed in 1), obtain the prompt-text feature vector hp7 according to 2); concatenate hg7, hg8, and hp7 to obtain hddi1, the drug-interaction relation feature representation of the prompt-guided ZhangDDI molecular structure pairs, and input it into the classifier for classification.
42) Input the ChChMiner molecular structure pairs into the trained molecular encoder GIN of the multimodal fine-grained molecular pre-training model, which outputs the structural feature vectors hg9 and hg10.
From the ChChMiner instruction prompt text constructed in 1), obtain the prompt-text feature vector hp8 according to 2); concatenate hg9, hg10, and hp8 to obtain hddi2, the drug-interaction relation feature representation of the prompt-guided ChChMiner molecular structure pairs, and input it into the classifier for classification.
43) Input the DeepDDI molecular structure pairs into the trained molecular encoder GIN of the multimodal fine-grained molecular pre-training model, which outputs the structural feature vectors hg11 and hg12; from the DeepDDI instruction prompt text constructed in 1), obtain the prompt-text feature vector hp9 according to 2); concatenate hg11, hg12, and hp9 to obtain hddi3, the drug-interaction relation feature representation of the prompt-guided DeepDDI molecular structure pairs.
Input the drug-interaction relation feature representation of the prompt-guided DeepDDI molecular structure pairs into the classifier for classification.
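The fusion steps of claims 9 and 10 can be sketched as follows. The exact fusion equations appear as formulas in the patent drawings and are not reproduced in this text, so this toy version assumes a weighted concatenation of the prompt vector hp with the structure vector(s) hg, which matches the textual description ("weight parameter α", "concatenation") but is an assumption about the precise form; the function names are illustrative:

```python
def fuse_property(hp, hg, alpha=0.5):
    """Property-prediction fusion (claim 9, assumed form): scale the
    prompt-text features by alpha, then concatenate with the molecular
    structure features to get h_property for the classifier."""
    return [alpha * x for x in hp] + hg

def fuse_ddi(hg_a, hg_b, hp):
    """Drug-drug interaction fusion (claim 10): concatenate both
    structure vectors of the pair with the prompt vector to get h_ddi."""
    return hg_a + hg_b + hp

h_property = fuse_property(hp=[0.2, 0.4], hg=[1.0, -1.0], alpha=0.5)
h_ddi = fuse_ddi([1.0], [2.0], [0.3, 0.7])
# h_property and h_ddi would then be passed to the task classifiers
```

Concatenation (rather than addition) keeps the prompt and structure channels separate, leaving it to the downstream classifier to learn how much each channel matters per task.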
CN202411539556.8A 2024-10-31 2024-10-31 A molecular structure prediction system based on a multimodal fine-grained molecular pre-training model based on prompt learning Active CN119479906B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411539556.8A CN119479906B (en) 2024-10-31 2024-10-31 A molecular structure prediction system based on a multimodal fine-grained molecular pre-training model based on prompt learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411539556.8A CN119479906B (en) 2024-10-31 2024-10-31 A molecular structure prediction system based on a multimodal fine-grained molecular pre-training model based on prompt learning

Publications (2)

Publication Number Publication Date
CN119479906A true CN119479906A (en) 2025-02-18
CN119479906B CN119479906B (en) 2025-09-30

Family

ID=94578380


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116431830A (en) * 2023-04-13 2023-07-14 清华大学 Representation learning method and device for multimodal biomedical data
CN117238436A (en) * 2023-09-21 2023-12-15 江苏运动健康研究院 Model pre-training method and device for drug molecular analysis design
GB202402771D0 (en) * 2023-07-14 2024-04-10 Benevolental Tech Limited Method and system for predicting biological entities
CN117935938A (en) * 2023-12-28 2024-04-26 北京水木分子生物科技有限公司 Protein natural language interaction method based on biological medicine large language model
CN118355391A (en) * 2021-11-16 2024-07-16 伯耐沃伦人工智能科技有限公司 Biological entity recognition method and system for drug discovery
CN118430639A (en) * 2024-04-19 2024-08-02 中国科学院计算机网络信息中心 Drug-drug interaction prediction method and system based on multimodal knowledge graph


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SHEN WANG et al.: "Multi‐modal Homogeneous Chemical Reaction Performance Prediction with Graph and Chemical Language Information", 《CJC》, 20 February 2025 (2025-02-20) *
ZHANG Jie; CHEN Jun: "Research progress of magnetic resonance arterial spin labeling in common addictive diseases", Diagnostic Imaging and Interventional Radiology, no. 04, 25 August 2020 (2020-08-25) *

Also Published As

Publication number Publication date
CN119479906B (en) 2025-09-30

Similar Documents

Publication Publication Date Title
Abdel-Nabi et al. Deep learning-based question answering: a survey
CN111710428B (en) Biomedical text representation method for modeling global and local context interaction
Peng et al. TOP: a deep mixture representation learning method for boosting molecular toxicity prediction
WO2023226351A1 (en) Small-molecule generation method based on pharmacophore model, and device and medium
Wang et al. Graph foundation models: A comprehensive survey
Alruqi et al. Evaluation of an Arabic chatbot based on extractive question-answering transfer learning and language transformers
Zhang et al. Chinese medical relation extraction based on multi-hop self-attention mechanism
Zeng et al. Automatic melody harmonization via reinforcement learning by exploring structured representations for melody sequences
Lu et al. Extracting chemical-protein interactions from biomedical literature via granular attention based recurrent neural networks
CN119132386A (en) A drug-target binding affinity prediction method based on graph neural network
Jiang et al. Relation-aware graph structure embedding with co-contrastive learning for drug–drug interaction prediction
Hao et al. TCKGCN: Graph convolutional network for aspect-based sentiment analysis with three-channel knowledge fusion
CN120636600B (en) Molecular property prediction method and device based on graphic neural network and large language model
CN119479906B (en) A molecular structure prediction system based on a multimodal fine-grained molecular pre-training model based on prompt learning
Liu et al. An improvement to conformer-based model for high-accuracy speech feature extraction and learning
CN119517201A (en) A drug interaction event prediction method based on multimodal graph diffusion static subgraph
CN117542542A (en) Deep learning-based drug metabolism interaction evaluation method
CN118692701A (en) A drug interaction prediction method integrating multimodal drug information and static subgraphs
Liu et al. Mol-L2: Transferring text knowledge with frozen language models for molecular representation learning
Zhang Multi-modal graph-based sentiment analysis via hybrid contrastive learning
Mousa et al. A comparative survey on large language models for biological data
Perikos et al. Explainable emotion recognition in social networks with transformers
Xu et al. AHRNN: attention‐based hybrid robust neural network for emotion recognition
Shang et al. Molecular structure-driven multi-relation DGI prediction with high-low-order attention denoise
Mori Structured Understanding of Unstructured Data: A Cross-Domain Review of LLM-Augmented Analysis Systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant