CN111508568B

CN111508568B - Molecule generation method, molecule generation device, computer readable storage medium and terminal device

Info

Publication number: CN111508568B
Application number: CN202010311281.8A
Authority: CN
Inventors: 王硕; 谢昌谕; 张胜誉; 廖奔犇; 姚小军
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2020-04-20
Filing date: 2020-04-20
Publication date: 2023-08-29
Anticipated expiration: 2040-04-20
Also published as: CN111508568A

Abstract

The embodiments of the invention disclose a molecule generation method, a molecule generation device, a computer readable storage medium and a terminal device, applied to the technical field of artificial intelligence information processing. The molecule generation device determines activity information of the generated molecules formed by changing the shape of a seed molecule, so that the stability with which each generated molecule binds to the receptor is taken into account. The device also compares the first pharmacophore feature of each higher-activity generated molecule with the second pharmacophore feature of the seed molecule, takes the generated molecules whose similarity falls within a preset range as new seed molecules, and forms new generated molecules from them.

Description

Molecule generation method, molecule generation device, computer readable storage medium and terminal device

Technical Field

The present invention relates to the field of information processing technologies of artificial intelligence, and in particular, to a method and an apparatus for generating molecules, a computer readable storage medium, and a terminal device.

Background

In drug design, designing a more accurate drug requires considering not only the spatial information of the drug, that is, the shape and structure of the molecule it represents, but also the mode in which the drug binds as a ligand to a receptor, the conditions under which the drug acts, and so on.

With the development of artificial intelligence, its practical applications have become increasingly widespread, and applying it to drug design can make drug design more accurate. Specifically, a convolutional neural network (CNN) and a long short-term memory (LSTM) network can combine an arbitrary molecule with other information to form other molecules. However, the molecules obtained by prior art methods vary in their similarity to the molecule originally input to the network, which affects the result of docking the obtained molecules as ligands to the receptor and lowers the proportion of obtained molecules that meet expectations.

Disclosure of Invention

The embodiments of the invention provide a molecule generation method, a molecule generation device, a computer readable storage medium and a terminal device, which generate molecules with diverse skeletons and high activity.

In one aspect, an embodiment of the present invention provides a method for generating a molecule, including:

(a) Performing shape change processing on a seed molecule to obtain a generated molecule set, wherein the generated molecule set comprises generated molecules;

(b) Determining activity information of the generated molecules in the generated molecule set;

(c) Calculating, according to the activity information of the generated molecules in the generated molecule set, a similarity parameter between a first pharmacophore feature of at least one generated molecule with the highest activity and a second pharmacophore feature of the seed molecule;

(d) When it is determined according to the similarity parameter that the similarity between the first pharmacophore feature of a given generated molecule and the second pharmacophore feature is within a preset range, taking that generated molecule as a new seed molecule;

Steps (a) to (d) are then performed for the new seed molecule.

Another aspect of an embodiment of the present invention provides a molecular generating device, including:

a molecule generating unit, configured to perform shape change processing on a seed molecule to obtain a generated molecule set, wherein the generated molecule set includes generated molecules;

an activity determination unit, configured to determine activity information of the generated molecules in the generated molecule set;

a pharmacophore unit, configured to calculate, according to the activity information of the generated molecules in the generated molecule set, a similarity parameter between a first pharmacophore feature of at least one generated molecule with the highest activity and a second pharmacophore feature of the seed molecule;

a new molecule unit, configured to take a given generated molecule as a new seed molecule when it is determined according to the similarity parameter that the similarity between the first pharmacophore feature of that generated molecule and the second pharmacophore feature is within a preset range, and to notify the molecule generating unit, the activity determination unit, the pharmacophore unit and the new molecule unit to respectively perform, for the new seed molecule, the steps of obtaining the generated molecule set, determining activity information, comparing pharmacophores and obtaining a new seed molecule.

Another aspect of the embodiments of the present invention also provides a computer readable storage medium storing a plurality of computer programs adapted to be loaded by a processor and to perform the molecular generation method according to the embodiments of the present invention.

In another aspect, the embodiment of the invention further provides a terminal device, which comprises a processor and a memory;

the memory is used for storing a plurality of computer programs, and the computer programs are used for being loaded by a processor and executing the molecular generation method according to the embodiment of the invention; the processor is configured to implement each of the plurality of computer programs.

It can be seen that, in the method of this embodiment, the molecule generation device determines the activity information of the generated molecules formed by changing the shape of the seed molecule, so that the stability with which the generated molecules bind to the receptor is taken into account. The device also compares the first pharmacophore feature of each higher-activity generated molecule with the second pharmacophore feature of the seed molecule, takes the generated molecules whose similarity is within a preset range as new seed molecules, and then forms new generated molecules from them.

Drawings

In order to illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention, and a person skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a schematic diagram of a method for generating a molecule according to an embodiment of the present invention;

FIG. 2 is a flow chart of a method of generating a molecule according to one embodiment of the present invention;

FIG. 3 is a flow chart of a method of training a shape-changing encoding model and a subtitle decoding model in one embodiment of the present invention;

FIG. 4 is a schematic diagram of the structure of a shape-changing coding initial model and a subtitle decoding initial model according to an embodiment of the present invention;

FIG. 5 is a flow chart of a shape change encoding initial model and a subtitle decoding initial model in an application embodiment of the present invention;

FIG. 6 is a schematic diagram of a resulting molecule in an example of an application of the present invention;

FIG. 7 is a schematic diagram of a method of molecular generation in an embodiment of the invention;

FIG. 8 is another schematic diagram of a molecular generation method in an application example of the present invention;

FIG. 9 is a schematic diagram of a molecular generating device according to an embodiment of the present invention;

FIG. 10 is a schematic structural diagram of a terminal device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of protection of the invention.

The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented, for example, in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

An embodiment of the invention provides a molecule generation method that mainly generates new generated molecules from known seed molecules such that the generated molecules and the seed molecules have similar drug effects. As shown in FIG. 1, the molecule generation device in this embodiment can generate molecules according to the following steps:

(a) Performing shape change processing on the seed molecule to obtain a generated molecule set, wherein the generated molecule set comprises generated molecules (the figure takes n generated molecules as an example); (b) determining activity information of the generated molecules in the generated molecule set; (c) calculating, according to the activity information of the generated molecules in the generated molecule set, a similarity parameter between the first pharmacophore feature of at least one generated molecule with the highest activity (the figure takes the m most active molecules as an example, where m is less than or equal to n) and the second pharmacophore feature of the seed molecule; (d) when it is determined according to the similarity parameter that the similarity between the first pharmacophore feature of a given generated molecule and the second pharmacophore feature is within a preset range, taking that generated molecule as a new seed molecule (the figure takes p new seed molecules as an example, where p is less than or equal to m). Steps (a) to (d) are then performed for the new seed molecule.

Specifically, when performing shape change processing on the seed molecule, the generated molecule set can be obtained using a shape change encoding model and a subtitle decoding model, both of which are artificial intelligence machine learning models. Artificial intelligence (AI) is the theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines so that the machines have the functions of perception, reasoning and decision-making.

Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.

Machine learning (ML) is a multidisciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout the various fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from demonstration.

It can be seen that the molecule generation device can determine the activity information of the generated molecules formed by changing the shape of the seed molecule, so that the stability with which the generated molecules bind to the receptor is taken into account. The device can also compare the first pharmacophore feature of each higher-activity generated molecule with the second pharmacophore feature of the seed molecule, take the generated molecules whose similarity is within a preset range as new seed molecules, and then form new generated molecules from them.

An embodiment of the invention provides a molecule generation method, mainly executed by a molecule generation device. A flow chart is shown in FIG. 2, and the method comprises the following steps:

Step 101, performing shape change processing on a seed molecule to obtain a generated molecule set, wherein the generated molecule set includes a plurality of generated molecules.

It can be understood that, in the process of designing a drug, the shape of an existing seed molecule with a certain drug effect can be changed, for example by adding a structure to the seed molecule or by replacing some of its original structures. In a specific implementation:

The molecule generation device can call a shape change coding model and a subtitle decoding model. The shape change coding model performs shape change coding according to the seed molecule and the pharmacophore in the seed molecule to obtain coded feature information, and the subtitle decoding model decodes the coded feature information to obtain character string representations of the generated molecules in the generated molecule set.

The pharmacophore in the seed molecule refers to the pharmacophoric structure in the seed molecule and its key information, for example a functional group of the seed molecule or a combination of several functional groups in the seed molecule. The shape change coding model can therefore combine the seed molecule with its pharmacophore to obtain the coded feature information.

Here, a pharmacophore is an abstract description of the molecular features necessary for molecular recognition of a ligand by a biological macromolecule. The International Union of Pure and Applied Chemistry (IUPAC) defines a pharmacophore as "an ensemble of steric and electronic features that is necessary to ensure the optimal supramolecular interactions with a specific biological target and to trigger (or block) its biological response". A pharmacophore model explains how structurally different ligands bind to a common receptor site, and can also be used to identify, by de novo design or virtual screening, novel ligands that will bind to the same receptor.

A functional group is an atom or group of atoms that determines the chemical properties of an organic compound. Functional groups are the relatively reactive parts of a molecule and often determine and reflect the main properties of the compound; compounds containing the same functional group have similar properties and can be classified together.

The shape change coding model and the subtitle decoding model can be machine learning models obtained by training with a certain training method. The operation logic of the trained models is stored in the molecule generation device, and the two models can be called when the device initiates the flow of this embodiment.

Further, the molecule generation device may perform the following steps 102 to 104 for each generated molecule in the generated molecule set obtained in step 101. Alternatively, it may sample the generated molecule set obtained in step 101 to obtain a subset of the generated molecules, and perform steps 102 to 104 for that subset.
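The sampling option can be sketched as follows. This is a minimal illustration only: the function name `sample_generated`, the subset size `k` and the use of uniform sampling without replacement are assumptions, since the description does not specify how the subset is drawn.

```python
import random


def sample_generated(molecules, k, seed=None):
    """Return up to k generated molecules, drawn uniformly without replacement.

    `molecules` is a list of character string representations; `seed` only
    makes the illustration reproducible.
    """
    rng = random.Random(seed)
    k = min(k, len(molecules))  # never ask for more molecules than exist
    return rng.sample(molecules, k)


# Toy generated molecule set (strings chosen purely for illustration).
generated = ["CCO", "CCN", "c1ccccc1", "CC(=O)O", "CCCl"]
subset = sample_generated(generated, k=3, seed=0)
```

Steps 102 to 104 would then be run on `subset` instead of on the full set.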

Step 102, determining activity information of the generated molecules in the generated molecule set.

The activity information of a generated molecule indicates whether the generated molecule, binding to the receptor as a ligand, can affect the physiological function of the receptor. For example, a receptor inhibitor inhibits the function of a protein after binding to it. Specifically, the molecule generation device may determine the activity information of the generated molecules using, but not limited to, the following methods:

(1) Molecular docking.

Molecular docking is a method of predicting the preferred orientation of a ligand molecule when it binds to a receptor molecule to form a stable complex. Information about the preferred orientation can in turn be used, through a scoring function, to predict the strength of association or binding affinity between the two molecules.

In this embodiment, the molecule generation device docks the generated molecule with the corresponding target (that is, the receptor molecule) to obtain a docked molecule, and then scores the docked molecule with a preset scoring function to obtain a docking score. The activity information of the generated molecule includes the docking score, and a molecule with a better docking score tends to bind more stably and have higher physiological activity.

Specifically, the molecule generation device can dock the generated molecule with the target using the Vina docking method: a docking helper module based on High-Throughput Molecular Dynamics (HTMD) reads the generated molecule and a protein file in Protein Data Bank (PDB) format, converts them into molecule types, and then performs rigid docking. After automatic docking and scoring by AutoDock Vina, the docking score obtained is generally a negative number, and the larger its absolute value, the stronger the activity. The PDB-format protein file is the target file corresponding to the generated molecule.

(2) An affinity prediction model based on drug-protein interaction pairs.

Specifically, the molecule generation device can generate a file in another format from the generated molecule and the corresponding target file, and then predict the affinity value of the generated molecule from that file. The larger the affinity value, the higher the activity of the generated molecule.

When generating the file in another format, the molecule generation device may first convert the generated molecule into MOL2 format using the SDWriter tool of RDKit, and then generate a Hierarchical Data Format (HDF) file, that is, the file in another format, from the MOL2-format generated molecule and the pocketPDB file of the corresponding protein. Prediction is then performed on this file, and the affinity value obtained is specifically an inhibition constant Ki or a dissociation constant Kd, both of which describe the binding affinity of a molecule to an enzyme or receptor.

Step 103, calculating, according to the activity information of the generated molecules in the generated molecule set, a similarity parameter between the first pharmacophore feature of at least one generated molecule with the highest activity and the second pharmacophore feature of the seed molecule. The similarity parameter here is a parameter that represents the similarity between the first pharmacophore feature and the second pharmacophore feature of the seed molecule.

In this embodiment, the molecule generation device sorts the generated molecules in the generated molecule set according to their activity information and selects at least one generated molecule with the highest activity.
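As a rough sketch of this selection, assuming the activity information is a Vina-style docking score (negative, with a larger absolute value meaning stronger activity), the molecules can be ranked in ascending score order and the first m kept. The function name `top_m_by_docking`, the dictionary layout and the score values are illustrative assumptions, not part of the described method.

```python
def top_m_by_docking(scores, m):
    """Return the m molecules with the most negative docking scores.

    `scores` maps a molecule string to its docking score; a more negative
    score is treated as higher activity.
    """
    ranked = sorted(scores, key=lambda mol: scores[mol])  # most negative first
    return ranked[:m]


# Made-up scores for four hypothetical generated molecules.
docking = {"mol_a": -7.2, "mol_b": -9.1, "mol_c": -5.4, "mol_d": -8.3}
best = top_m_by_docking(docking, m=2)
```

If a predicted affinity value is used instead, where a larger value means higher activity, the sort order would simply be reversed.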

Then, for each selected generated molecule, the molecule generation device calculates the first pharmacophore feature of the generated molecule and the second pharmacophore feature of the seed molecule. Specifically, it computes the one-hot encoding matrices corresponding to the generated molecule and the seed molecule respectively, which serve as the first pharmacophore feature and the second pharmacophore feature. The pharmacophore feature of a molecule is a 3D pharmacophore fingerprint that characterizes the pharmacophore of that molecule.

Further, when calculating the similarity parameter between the first pharmacophore feature and the second pharmacophore feature, the molecule generation device may calculate the root mean square deviation (RMSD) between them.
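For illustration, an element-wise RMSD between two equally shaped one-hot matrices can be computed as below. The description does not give the exact RMSD formula or say whether the two features are aligned first, so the element-wise definition, the matrix shapes and the function name `rmsd` are assumptions made for this sketch.

```python
import math


def rmsd(a, b):
    """Element-wise root mean square deviation between two matrices.

    Both matrices are lists of rows and must have identical shapes.
    """
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same shape")
    n = sum(len(row) for row in a)
    if n == 0:
        raise ValueError("matrices must be non-empty")
    squared = sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return math.sqrt(squared / n)


# Toy one-hot matrices: rows are feature positions, columns are feature types.
first_feature = [[1, 0, 0], [0, 1, 0]]
second_feature = [[1, 0, 0], [0, 0, 1]]
value = rmsd(first_feature, second_feature)
```

Identical matrices give an RMSD of zero, and the value grows as more entries differ.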

Step 104, when it is determined according to the similarity parameter that the similarity between the first pharmacophore feature of a given generated molecule and the second pharmacophore feature is within a preset range, taking that generated molecule as a new seed molecule and returning to steps 101 to 104 for the new seed molecule; that is, the steps of obtaining a generated molecule set, determining activity information, comparing pharmacophores and obtaining a new seed molecule are performed for the new seed molecule.

Specifically, if RMSD is used as the similarity parameter between the first pharmacophore feature and the second pharmacophore feature, then when the root mean square deviation between the first pharmacophore feature of a generated molecule and the second pharmacophore feature is negative 1, the similarity between the two is low and within the preset range, the skeletons of the generated molecule and the seed molecule are dissimilar, and the generated molecule is taken as a new seed molecule. When the root mean square deviation is not negative 1, the similarity between the first pharmacophore feature of the generated molecule and the second pharmacophore feature exceeds the preset range, that is, the similarity is high and the skeletons of the generated molecule and the seed molecule are similar, so the generated molecule cannot be taken as a new seed molecule.

Thus, by performing steps 101 to 104 in a loop, generated molecules with more diverse skeletons and higher activity are obtained, and a virtual screening (VS) process is realized. Virtual screening is a computational technique for drug discovery that searches a small molecule database to identify the structures most likely to bind to a drug target, typically a protein receptor or enzyme.
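The loop over steps 101 to 104 can be sketched as below. Everything here is a stand-in: `generate`, `activity`, `similarity` and `in_range` are hypothetical callables supplied by the caller (in practice the trained models, a docking or affinity scorer, and the pharmacophore comparison), higher `activity` is assumed to mean more active, and the cap on the number of rounds is an assumption because the description does not state when the loop ends.

```python
def generate_iteratively(seed, generate, activity, similarity, in_range,
                         top_m=2, rounds=3):
    """Repeat steps 101-104 for a bounded number of rounds.

    Returns every generated molecule that was accepted as a new seed.
    """
    seeds = [seed]
    accepted = []
    for _ in range(rounds):
        next_seeds = []
        for current in seeds:
            candidates = generate(current)                             # step 101
            ranked = sorted(candidates, key=activity, reverse=True)    # step 102
            for molecule in ranked[:top_m]:                            # step 103
                if in_range(similarity(molecule, current)):            # step 104
                    next_seeds.append(molecule)
        if not next_seeds:
            break  # nothing qualified as a new seed, so stop early
        accepted.extend(next_seeds)
        seeds = next_seeds
    return accepted


# Toy stand-ins on strings, chosen only so the example is deterministic.
seeds_found = generate_iteratively(
    "C",
    generate=lambda s: [s + "C", s + "N", s + "O"],
    activity=lambda mol: mol.count("C"),
    similarity=lambda mol, s: -1.0 if mol[-1] != s[-1] else 0.0,
    in_range=lambda value: value == -1.0,
    top_m=2,
    rounds=2,
)
```

Passing the components as callables keeps the control flow separate from the choice of generator, scorer and similarity measure.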

In a specific embodiment, the shape change coding model and the subtitle decoding model used in step 101 may be trained according to the following training method, whose flow chart is shown in FIG. 3 and which includes:

Step 201, determining a shape change coding initial model and a subtitle decoding initial model.

It can be understood that, when determining the shape change coding initial model and the subtitle decoding initial model, the molecule generation device determines the multi-layer structure included in each of the two initial models and the initial values of the parameters in each layer.

Specifically, as shown in FIG. 4, the shape change coding initial model may include a voxelization module and a variational autoencoder (VAE) module. The voxelization module voxelizes the seed molecule to obtain a three-dimensional (3D) molecular representation carrying pharmacophore information. The VAE module combines the features of the pharmacophore in the seed molecule with the 3D molecular representation output by the voxelization module, and performs molecular feature re-parameterization (that is, the shape change of the molecule) and encoding to obtain the coded feature information. Voxelization, as performed by the voxelization module, converts a geometric representation of an object into the voxel representation closest to that object; the resulting 3D molecular representation contains not only the surface information of the object but can also describe its internal properties.

The subtitle decoding initial model may include a feature conversion module and a decoding module. The feature conversion module converts the coded feature information output by the shape change coding initial model into feature information in a high-dimensional space, and the decoding module outputs, according to that feature information, character string representations of the generated molecules in the generated molecule set.

The parameters of the shape change coding initial model and the subtitle decoding initial model are the fixed parameters used in the calculations of each layer of the two models that do not need to be reassigned at run time, such as the parameter scale, the number of network layers and the user vector length.
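A much simplified picture of voxelization with pharmacophore channels is sketched below. It assigns each atom to the single grid cell that contains it, whereas a practical implementation would more likely spread a smooth density over neighbouring voxels. The grid size, resolution, channel meanings and the function name `voxelize` are all assumptions made for illustration.

```python
def voxelize(atoms, grid_size=4, resolution=1.0, n_channels=2):
    """Map atoms onto an occupancy grid indexed as grid[channel][i][j][k].

    `atoms` is an iterable of (x, y, z, channel) tuples, with coordinates
    measured from one corner of the grid and `channel` an integer index
    for a pharmacophore type. Atoms outside the grid are ignored.
    """
    grid = [[[[0.0] * grid_size for _ in range(grid_size)]
             for _ in range(grid_size)]
            for _ in range(n_channels)]
    for x, y, z, channel in atoms:
        i, j, k = (int(v // resolution) for v in (x, y, z))
        inside = all(0 <= index < grid_size for index in (i, j, k))
        if inside and 0 <= channel < n_channels:
            grid[channel][i][j][k] = 1.0
    return grid


# Three toy atoms; the last one lies outside the 4 x 4 x 4 grid.
atoms = [(0.2, 0.3, 0.1, 0), (1.5, 2.4, 3.9, 1), (9.0, 0.0, 0.0, 0)]
grid = voxelize(atoms)
```

Because every voxel inside the grid has a value, such a representation can describe the interior of the molecule as well as its surface.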

Step 202, determining a training sample, wherein the training sample comprises a plurality of seed molecule samples and receptor molecule samples corresponding to the seed molecule samples.

Specifically, when selecting training samples, a small molecule structure database of reduced scope may be chosen, such as the ZINC15 database (containing only drug-like molecules), and molecules whose character string length is within a certain range (for example, within 60 characters) and which contain no free radicals are used as seed molecule samples.

Step 203, performing shape change coding on the seed molecule samples through the shape change coding initial model to obtain coded sample feature information, and decoding the coded sample feature information through the subtitle decoding initial model to obtain information of a generated molecule sample set.

Specifically, the voxelization module in the shape change coding initial model voxelizes the seed molecule sample to obtain a 3D molecular representation carrying pharmacophore information, and the VAE module combines the features of the pharmacophore in the seed molecule sample with the 3D molecular representation output by the voxelization module and performs molecular feature re-parameterization and encoding to obtain the coded sample feature information. The feature conversion module in the subtitle decoding initial model converts the coded sample feature information output by the shape change coding initial model into feature information in a high-dimensional space, and the decoding module outputs, according to that feature information, character string representations of the generated molecules in the generated molecule sample set.

Step 204, adjusting the parameter values of the shape change coding initial model and the subtitle decoding initial model according to the information of the generated molecule sample set obtained by the subtitle decoding initial model and the receptor molecule samples, so as to obtain the final shape change coding model and subtitle decoding model.
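The sample filter can be sketched as follows. The 60-character limit comes from the example above; the function name `select_seed_samples` and the `has_radical` predicate are hypothetical, and in practice a cheminformatics toolkit would be used to detect radicals rather than the toy string test shown here.

```python
def select_seed_samples(strings, max_len=60, has_radical=lambda s: False):
    """Keep molecule strings that are short enough and free of radicals.

    `has_radical` is a caller-supplied predicate; the default accepts
    everything and is only a placeholder.
    """
    return [s for s in strings if len(s) <= max_len and not has_radical(s)]


# Toy inputs: one acceptable string, one too long, one flagged as a radical.
candidates = ["CCO", "C" * 61, "C[CH2]"]
kept = select_seed_samples(candidates, has_radical=lambda s: "[CH2]" in s)
```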
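The re-parameterization step of a VAE is conventionally written as z = mu + sigma * eps, with eps drawn from a standard normal distribution, so that a different latent vector (and hence a differently shaped molecule) is obtained on each draw. The sketch below assumes the encoder outputs a mean vector `mu` and a log-variance vector `log_var`; the description does not state this parameterization, so it is an assumption.

```python
import math
import random


def reparameterize(mu, log_var, rng=random):
    """Draw z = mu + sigma * eps, where sigma = exp(0.5 * log_var)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


# Two draws from the same encoder output give different latent vectors.
generator = random.Random(0)
z_first = reparameterize([0.0, 1.0], [0.0, 0.0], rng=generator)
z_second = reparameterize([0.0, 1.0], [0.0, 0.0], rng=generator)
```

Writing the sample this way keeps the randomness in `eps`, so gradients can still flow through `mu` and `log_var` during training.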

Specifically, the molecular generating device calculates a loss function related to the shape change coding initial model and the subtitle decoding initial model according to the information of the generated molecular sample set, which is the result obtained by the subtitle decoding initial model in the step 203, and training the receptor molecular samples in the samples, where the loss function is used to indicate the binding activity of the generated molecular sample set obtained by combining the shape change coding initial model and the subtitle decoding initial model with the corresponding receptor molecular sample, and the difference between the binding activity of the seed molecular sample with the corresponding receptor molecular sample, such as KL divergence (Kullback-Leibler divergence, KL) and log loss of multiple categories (log loss).
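The two named loss terms are not written out in the description. The sketch below uses the forms that are standard for a VAE with a diagonal Gaussian latent and a decoder that emits one categorical distribution per output character; treating the loss this way, and the function names, are assumptions.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, diag(exp(log_var))) to N(0, I)."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def mean_log_loss(prob_rows, targets):
    """Multi-class log loss averaged over the characters of one string.

    `prob_rows[t]` is the predicted distribution over the vocabulary at
    position t and `targets[t]` is the index of the correct character.
    """
    losses = [-math.log(row[t]) for row, t in zip(prob_rows, targets)]
    return sum(losses) / len(losses)


kl_value = kl_to_standard_normal([1.0], [0.0])
ce_value = mean_log_loss([[0.25, 0.75], [0.5, 0.5]], [1, 0])
```

A combined training objective would typically be a weighted sum of the two terms.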

The training process of the shape change coding model and the subtitle decoding model is to minimize the difference, and the training process is to continuously optimize the parameter values of the parameters in the shape change coding initial model and the subtitle decoding initial model determined in the step 201 through a series of mathematical optimization means such as back propagation derivative and gradient descent, and minimize the calculated value of the loss function.

Specifically, when the function value of the calculated loss function is large, such as larger than a preset value, it is necessary to change the parameter value, such as to decrease the weight value of a certain neuron connection, or the like, so that the function value of the loss function calculated in accordance with the adjusted parameter value is decreased.

It should be noted that steps 203 to 204 adjust the parameter values in the shape change coding initial model and the subtitle decoding initial model once, according to the generated molecule set obtained by these two models. In practical applications, steps 203 to 204 need to be executed in a loop until the adjustment of the parameter values satisfies a certain stop condition.

Therefore, after performing steps 201 to 204 of the above embodiment, the molecular generating device needs to determine whether the current adjustment of the parameter values satisfies a preset stop condition. When it does, the flow ends; when it does not, steps 203 to 204 are performed again for the shape change coding initial model and the subtitle decoding initial model with the adjusted parameter values. The preset stop conditions include, but are not limited to, any one of the following: the difference between the currently adjusted parameter value and the previously adjusted parameter value is smaller than a threshold, that is, the adjusted parameter value has converged; or the number of adjustments of the parameter value is equal to a preset number.
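The loop with its two stop conditions can be sketched as follows. This is a toy gradient descent on a quadratic loss, not the patent's model; `train`, `toy_grad`, the learning rate and the tolerance are all hypothetical names and values chosen for illustration.

```python
def train(params, grad_fn, lr=0.1, max_steps=1000, tol=1e-6):
    """Gradient descent that stops on convergence or after max_steps adjustments."""
    for step in range(1, max_steps + 1):
        grads = grad_fn(params)
        new_params = [p - lr * g for p, g in zip(params, grads)]
        # Stop condition 1: the parameter change falls below a threshold.
        change = max(abs(n - p) for n, p in zip(new_params, params))
        params = new_params
        if change < tol:
            return params, step, "converged"
    # Stop condition 2: the preset number of adjustments has been reached.
    return params, max_steps, "max_steps"

def toy_grad(w):
    """Gradient of the toy loss (w0 - 3)^2 + (w1 + 1)^2."""
    return [2 * (w[0] - 3), 2 * (w[1] + 1)]

final, steps, reason = train([0.0, 0.0], toy_grad)
print(final, steps, reason)
```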

It should be further noted that, in a specific embodiment, in the process of performing the step 201, a property prediction model may be further determined, where the property prediction model is used to predict a drug property according to the encoded sample feature information, so as to obtain a property prediction result.

The drug properties may include solubility, such as water solubility or lipid solubility, and may also include drug and biological toxicity; the property prediction model may target the above properties, or absorption, distribution, metabolism, excretion and toxicity (ADMET), and so on. The property prediction model may specifically be Bayesian optimization (BO) or the like, which is not limited here. Specifically, the property prediction model may predict the drug properties from the feature information of the high-dimensional space output by the feature conversion module in the subtitle decoding initial model.

In this case, when the molecular generating device performs the above step 204, it may calculate a loss function related to the shape change coding initial model and the subtitle decoding initial model from the property prediction result output by the property prediction model, the generated molecular sample set obtained by the subtitle decoding initial model, and the receptor molecular sample, and then adjust the parameter values of the shape change coding initial model and the subtitle decoding initial model according to the loss function. Specifically, the calculated loss function includes two terms: the difference between the property prediction result that the property prediction model outputs from the feature information of the high-dimensional space and the property prediction result obtained from the seed molecule; and the difference between the binding activity of the generated molecular sample set (obtained by the shape change coding initial model together with the subtitle decoding initial model) with the corresponding receptor molecular sample and the binding activity of the seed molecular sample with the corresponding receptor molecular sample. In this way, the prediction result of the property prediction model is added to the supervision of the training process, so that the trained shape change coding model and subtitle decoding model are further optimized and the drug properties of the finally generated molecules fall within a certain range.
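One way to combine the two terms is a weighted sum, sketched below. The squared-difference form of the property term and the `property_weight` parameter are assumptions of this sketch; the description only states that both differences are included in the loss.

```python
def combined_loss(activity_diff, property_pred, property_ref, property_weight=1.0):
    """Binding-activity term plus a weighted squared property-difference term.

    activity_diff: precomputed difference in binding activity (e.g. a KL value).
    property_pred: property predicted from the high-dimensional features.
    property_ref:  property obtained for the seed molecule.
    """
    property_diff = (property_pred - property_ref) ** 2
    return activity_diff + property_weight * property_diff

# Illustrative values only.
print(combined_loss(activity_diff=0.5, property_pred=2.0, property_ref=1.0))
```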

A specific application example is described below to illustrate the molecular generation method of the present invention. In this embodiment, the molecular generating device may use a two-word laboratory server cluster, in which each server may include 8 graphics processing units (GPUs) and at least 72 cores. The method in this embodiment may include the following two parts:

(1) As shown in fig. 5, the shape change encoding model and the subtitle decoding model may be trained by:

Step 301, determining a shape change coding initial model, a subtitle decoding initial model and a property prediction model as shown in fig. 4, wherein the property prediction model specifically adopts a BO optimization model, the output end of the shape change coding initial model is connected to the subtitle decoding initial model, and the output end of the feature conversion module in the subtitle decoding initial model is connected both to the decoding module and to the input end of the BO optimization model.

If the BO optimization model is used to predict water solubility, its range can be set to 0 to 5 and its single-step optimization number to 500.

Further, the molecular generating device may further set parameters in the training process as shown in the following table 1:

TABLE 1

Further, in this embodiment, the decoding module in the subtitle decoding initial model determined by the molecular generating device may specifically be an LSTM. Alternatively, an attention-based Transformer decoder may be adopted. Such a decoder ensures that each training sample carries global semantic information, because computing attention pairwise shortens the distance between any two positions to 1. In addition, applying multi-head attention lets each head learn the semantics of a different subspace, which improves the effect.
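The pairwise attention and the per-head subspaces mentioned above can be sketched in plain Python. This is a generic scaled dot-product attention, not the decoder of this embodiment; the learned projection matrices of a real Transformer are omitted, and all function names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: every query attends to every key directly."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

def multi_head_attention(x, num_heads):
    """Slice each vector into num_heads subspaces, attend per slice, concatenate.

    Assumption: learned projections are left out, so each head simply sees
    a contiguous slice of the input vector.
    """
    dim = len(x[0])
    if dim % num_heads != 0:
        raise ValueError("vector dimension must be divisible by num_heads")
    head_dim = dim // num_heads
    heads = []
    for h in range(num_heads):
        part = [v[h * head_dim:(h + 1) * head_dim] for v in x]
        heads.append(attention(part, part, part))
    return [sum((heads[h][i] for h in range(num_heads)), []) for i in range(len(x))]

tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
print(multi_head_attention(tokens, num_heads=2))
```

Because every position is compared with every other position in a single step, no information has to pass through intermediate positions, which is the "distance of 1" property referred to in the text.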

Step 302, determining training samples, which comprise a plurality of seed molecule samples and the receptor molecule samples corresponding to the seed molecule samples.

Step 303, the voxelization module in the shape change coding initial model voxelizes the seed molecule sample to obtain a 3D molecular characterization with pharmacophore information, and the VAE module, combining the characteristics of the pharmacophores in the seed molecule sample, performs molecular feature re-parameterization and coding on the 3D molecular characterization output by the voxelization module to obtain coded sample feature information. The voxelization module reads the seed molecule sample into the RDMol format through the RDKit tool, adds hydrogens, converts the sample into a 3D molecule through an embedding operation carried out by the RDKit tool, reads the 3D molecule using HTMD, and then outputs a 3D molecular characterization with pharmacophore information using a preset voxelization function.

The feature conversion module in the subtitle decoding initial model converts the coded sample feature information output by the shape change coding initial model into feature information of a high-dimensional space; the decoding module outputs, according to the feature information of the high-dimensional space, the string representation of each generated molecule in the generated molecular sample set; and the property prediction model predicts, according to the feature information of the high-dimensional space, the drug properties of the coded sample feature information, such as water solubility or lipid solubility.

Step 304, calculating a loss function related to the shape change coding initial model and the subtitle decoding initial model according to the property prediction result predicted by the property prediction model, the generated molecular sample set obtained by the subtitle decoding initial model and the receptor molecular sample in the training sample.
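The voxelization performed at the start of step 303 can be illustrated with a minimal sketch that does not depend on RDKit or HTMD. It assumes the 3D coordinates and a pharmacophore label per atom are already available; the channel names, grid size and resolution are hypothetical, and the binary occupancy used here is a simplification of a real voxelization function.

```python
def voxelize(atoms, grid_size=8, resolution=1.0,
             channels=("hydrophobic", "donor", "acceptor")):
    """Map (x, y, z, channel) atoms onto a sparse occupancy grid.

    The grid is centred on the mean atom position. Keys of the returned
    dict are (channel_index, i, j, k); atoms outside the grid are dropped.
    """
    if not atoms:
        return {}
    centre = [sum(atom[axis] for atom in atoms) / len(atoms) for axis in range(3)]
    half = grid_size * resolution / 2
    grid = {}
    for x, y, z, channel in atoms:
        if channel not in channels:
            continue
        index = tuple(int((coord - mid + half) // resolution)
                      for coord, mid in zip((x, y, z), centre))
        if all(0 <= i < grid_size for i in index):
            grid[(channels.index(channel),) + index] = 1.0
    return grid

# Two hypothetical atoms, 2 angstroms apart along x.
example = [(0.0, 0.0, 0.0, "donor"), (2.0, 0.0, 0.0, "acceptor")]
print(voxelize(example))
```

Keeping one grid channel per pharmacophore type is what lets the downstream encoder see shape and pharmacophore information together.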

Step 305, adjusting parameter values in the shape change coding initial model and the subtitle decoding initial model according to the loss function calculated in the step 304.

Given an objective function to be optimized (namely the loss function), the BO optimization model keeps adding sample points to update the posterior distribution, which is a Gaussian process, until the posterior distribution is consistent with the actual distribution. Compared with reinforcement learning (RL), the BO optimization model can explicitly optimize the output of a certain part of the high-dimensional space feature information while consuming fewer computing resources, that is, it constrains and optimizes the drug property of the coded sample feature information output by the feature conversion module.

Step 306, judging whether the current adjustment of the parameter values meets the preset stop condition. If yes, the flow ends, and the shape change coding initial model and the subtitle decoding initial model with the parameter values adjusted in step 305 are the trained shape change coding model and subtitle decoding model; if not, the process returns to step 303 for the shape change coding initial model and the subtitle decoding initial model with the adjusted parameter values.

(2) As shown in fig. 6 and 7, the following steps can be performed on any seed molecule to form a highly active and skeleton-rich molecule:

Step 401, passing any seed molecule through the trained shape change coding model and subtitle decoding model to output the string representation of each generated molecule in the generated molecule set.

Step 402, sampling from the generated molecule set to obtain a part of generated molecules, calculating activity information, such as a docking score or an affinity value, corresponding to the generated molecules, and selecting at least one generated molecule with higher activity according to the activity information.

For example, as shown in fig. 7, when calculating activity information, each generated molecule may be docked with the corresponding target by Vina to obtain a docked molecule, and the docked molecule is then scored to obtain a docking score. The generated molecules corresponding to the docked molecules with the highest docking scores are extracted, since these have higher activity, and for each extracted generated molecule, steps 403 and 404 are performed as follows.

Step 403, calculating a first pharmacophore characteristic of each of the at least one generation molecule with higher activity and a second pharmacophore characteristic of the seed molecule, and calculating RMSD values between the first and second pharmacophore characteristics of each generation molecule as similarity parameters.

Step 404, if the RMSD value between the first pharmacophore feature of a higher-activity generated molecule and the second pharmacophore feature of the seed molecule is -1, which indicates that the skeletons of that generated molecule and the seed molecule are dissimilar, the generated molecule is used as a new seed molecule and steps 401 to 404 are executed again, that is, a new generated molecule set is formed and processed correspondingly. If the RMSD value between the first pharmacophore feature of the generated molecule and the second pharmacophore feature of the seed molecule is not -1, it is judged whether the RMSD corresponding to another higher-activity generated molecule is -1. The iteration continues until all of the higher-activity generated molecules that serve as new seed molecules have been determined, so that generated molecules with rich skeletons can be produced.

In practice, several small molecules in the AA2AR protein-small molecule interaction pairs are selected as seed molecules, and the method in the embodiment of the present invention is applied, that is, steps 401 to 404 are executed in a loop to generate a plurality of generated molecules for the seed molecules. The seed molecules and each generated molecule are docked with the corresponding targets and then scored, giving the docking scores shown in the following table 2:
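Steps 401 to 404 can be sketched as a loop. The description does not say how an RMSD of -1 arises; this sketch assumes it is a sentinel returned when the two pharmacophore feature sets cannot be paired. The callables `generate`, `activity` and `features`, and the limits `top_k` and `max_rounds`, are hypothetical stand-ins for the trained models, the docking score and the pharmacophore extraction.

```python
import math

def pharmacophore_rmsd(features_a, features_b):
    """RMSD between two equal-length lists of 3D pharmacophore points.

    Assumption: -1.0 is returned when the feature sets cannot be paired,
    which this sketch treats as "skeletons are dissimilar".
    """
    if not features_a or len(features_a) != len(features_b):
        return -1.0
    squared = sum((a - b) ** 2
                  for point_a, point_b in zip(features_a, features_b)
                  for a, b in zip(point_a, point_b))
    return math.sqrt(squared / len(features_a))

def explore(seed, generate, activity, features, top_k=2, max_rounds=3):
    """Repeat steps 401 to 404 and collect molecules promoted to new seeds."""
    seeds, promoted = [seed], []
    for _ in range(max_rounds):
        next_seeds = []
        for current in seeds:
            candidates = generate(current)                        # step 401
            best = sorted(candidates, key=activity,
                          reverse=True)[:top_k]                   # step 402
            for mol in best:
                rmsd = pharmacophore_rmsd(features(mol),
                                          features(current))      # step 403
                if rmsd == -1.0 and mol not in promoted:          # step 404
                    promoted.append(mol)
                    next_seeds.append(mol)
        if not next_seeds:
            break
        seeds = next_seeds
    return promoted
```

Bounding the loop with `max_rounds` is a choice of this sketch; the description only requires iterating until all new seed molecules among the higher-activity generated molecules are determined.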

TABLE 2

In fact, 20 docking models are established in each round of docking. It can be seen from the table that, with the scheme of this embodiment, after two to three rounds of iteration the docking score of the obtained new seed molecule (i.e., the molecule screened from the generated molecules) indicates higher activity, and the new seed molecule most suitable for the target is consistently screened out after two to three rounds of iteration. This demonstrates that, by combining molecule generation with virtual screening, the scheme of this embodiment obtains new seed molecules with a certain accuracy and rationality, and accelerates the discovery of new drugs.

Meanwhile, if a property prediction model, such as a model for predicting the lipid-water partition coefficient, is combined in the process of training the shape change coding model and the subtitle decoding model, and the lipid-water partition coefficient log P is controlled to be in the range of 0 to 3, then the log P of the generated molecules obtained from the seed molecules through the trained shape change coding model and subtitle decoding model can also be controlled to be in the range of 0 to 3, which is a common range for drugs that are water-soluble yet easily absorbed. This shows that the way of obtaining new seed molecules in this embodiment can guide specific properties. In addition, the skeletons of the generated molecules screened by the method of this embodiment are also rich, as in the structures of the generated molecules shown in fig. 8, where the positions of specific functional groups are more flexible. Therefore, small molecules with target specificity can be generated directly for different targets, which improves the theoretical effective rate of drug-protein interaction; and because the method is optimized for the properties of small molecules, the potential enrichment of the sampled information is increased, so that ligand-shape-based screening tasks can be better realized.

Therefore, the method of this embodiment can control the generated molecules to have higher water solubility or lipid solubility, or lower biotoxicity; the generated molecules can have a specified target specificity, i.e., a definite activity against the corresponding target; and their skeletons are rich.
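Selecting the most active molecules from a set of docking scores can be sketched as below. The description speaks of the "highest docking score"; since AutoDock Vina reports binding affinities in kcal/mol, where more negative values indicate stronger predicted binding, the sketch exposes the ranking direction as a `higher_is_better` flag. The function name, the flag and the sample scores are hypothetical.

```python
def select_most_active(scores, top_k=1, higher_is_better=True):
    """Return the identifiers of the top_k molecules ranked by docking score."""
    ranked = sorted(scores.items(), key=lambda item: item[1],
                    reverse=higher_is_better)
    return [molecule for molecule, _ in ranked[:top_k]]

# Illustrative affinities in kcal/mol (lower means stronger predicted binding).
affinities = {"mol_a": -7.1, "mol_b": -9.3, "mol_c": -8.0}
print(select_most_active(affinities, top_k=2, higher_is_better=False))
```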
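The log P constraint can also be checked after generation with a simple range filter, sketched here. The function name and the sample values are hypothetical, and the log P values are assumed to come from some external predictor that the sketch does not implement.

```python
def within_logp_range(logp_values, low=0.0, high=3.0):
    """Keep only the molecules whose predicted log P lies in [low, high]."""
    return {molecule: value for molecule, value in logp_values.items()
            if low <= value <= high}

# Illustrative predicted log P values.
predicted = {"mol_1": 1.2, "mol_2": 3.5, "mol_3": -0.4}
print(within_logp_range(predicted))
```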

The embodiment of the invention also provides a molecular generating device, the structure schematic diagram of which is shown in fig. 9, and the molecular generating device specifically can comprise:

A molecule generating unit 10, configured to perform shape change processing on the seed molecule to obtain a generated molecule set, where the generated molecule set includes generated molecules.

The molecule generating unit 10 is specifically configured to call a shape change coding model and a subtitle decoding model; the shape change coding model performs shape change coding according to the seed molecule and the pharmacophores in the seed molecule to obtain coded feature information; and the subtitle decoding model decodes according to the coded feature information to obtain the string representation of each generated molecule in the generated molecule set.

An activity determination unit 11 for determining activity information of the generated molecules obtained by the molecule generation unit 10 in the generated molecule set.

The activity determining unit 11 is specifically configured to dock the generated molecule with a corresponding target to obtain a docked molecule, and to score the docked molecule with a preset scoring function to obtain a docking score, wherein the activity information of the generated molecule comprises the docking score.

Alternatively, the activity determining unit 11 is specifically configured to generate a file in another format according to the generated molecule and the corresponding target file, and to predict an affinity value of the generated molecule according to the file in the other format, wherein the activity information comprises the affinity value.

A pharmacophore unit 12, configured to calculate, according to the activity information of the generated molecules in the generated molecule set determined by the activity determining unit 11, a similarity parameter between a first pharmacophore feature of at least one generated molecule with the highest activity and a second pharmacophore feature of the seed molecule.

A new molecule unit 13, configured to take a certain generated molecule as a new seed molecule when it is determined, according to the similarity parameter calculated by the pharmacophore unit 12, that the similarity between the first pharmacophore feature of that generated molecule and the second pharmacophore feature is within a preset range; and to notify the molecule generating unit 10, the activity determining unit 11, the pharmacophore unit 12 and the new molecule unit 13 to respectively perform, for the new seed molecule, the steps of obtaining a generated molecule set, determining activity information, comparing pharmacophores and obtaining a new seed molecule.

The new molecule unit 13 is specifically configured to calculate a root mean square deviation between the first pharmacophore feature and the second pharmacophore feature, and determine that a similarity between the first pharmacophore feature and the second pharmacophore feature of a certain generated molecule is within a preset range when the root mean square deviation corresponding to the certain generated molecule is minus 1.

Further, the molecular generating device in the embodiment of the invention may further include: a training unit 14 for determining a shape change coding initial model and a subtitle decoding initial model; determining a training sample, wherein the training sample comprises a plurality of seed molecule samples and receptor molecule samples corresponding to the seed molecule samples; performing shape change coding on the seed molecular sample through the shape change coding initial model to obtain coded sample characteristic information, and decoding the coded sample characteristic information through the subtitle decoding initial model to obtain information for generating a molecular sample set; and adjusting parameter values of the shape change coding initial model and the subtitle decoding initial model according to a generated molecular sample set obtained by the subtitle decoding initial model and the receptor molecular sample to obtain a final shape change coding model and a final subtitle decoding model.

Further, the training unit 14 is further configured to determine a property prediction model, where the property prediction model is used to predict a drug property according to the encoded sample feature information, so as to obtain a property prediction result. In this way, the training unit 14 is specifically configured to calculate a loss function related to the shape change coding initial model and the subtitle decoding initial model from the property prediction result, the generated molecular sample set obtained by the subtitle decoding initial model, and the receptor molecular sample when adjusting the parameter values of the shape change coding initial model and the subtitle decoding initial model based on the generated molecular sample set obtained by the subtitle decoding initial model and the receptor molecular sample; and adjusting parameter values of the shape change coding initial model and the subtitle decoding initial model according to the loss function.

The training unit 14 is further configured to stop the adjustment of the parameter value when the number of adjustments of the parameter value is equal to a preset number, or when the difference between the currently adjusted parameter value and the previously adjusted parameter value is smaller than a threshold.

It can be seen that, in the molecular generating device of the present embodiment, the activity determining unit 11 determines the activity information of the generated molecule formed after the shape change of the seed molecule, so that the stability of the generated molecule binding to the receptor is considered; meanwhile, the pharmacophore unit 12 in the molecule generating device also compares the first pharmacophore characteristic of the generating molecule with higher activity with the second pharmacophore characteristic of the seed molecule, the new molecule unit 13 obtains the generating molecule with the similarity in a preset range as the new seed molecule, and then the new generating molecule is formed.

The embodiment of the present invention further provides a terminal device, whose structure schematic diagram is shown in fig. 10. The terminal device may differ considerably depending on its configuration or performance, and may include one or more central processing units (CPUs) 20 (e.g., one or more processors), a memory 21, and one or more storage media 22 (e.g., one or more mass storage devices) storing application programs 221 or data 222. The memory 21 and the storage medium 22 may provide transitory or persistent storage. The program stored in the storage medium 22 may include one or more modules (not shown), each of which may include a series of instruction operations for the terminal device. Still further, the central processor 20 may be arranged to communicate with the storage medium 22 and to execute, on the terminal device, the series of instruction operations in the storage medium 22.

Specifically, the application program 221 stored in the storage medium 22 includes an application program for molecular generation, and the program may include the molecular generation unit 10, the activity determination unit 11, the pharmacophore unit 12, the new molecular unit 13, and the training unit 14 in the molecular generation apparatus described above, which will not be described herein. Still further, the central processor 20 may be arranged to communicate with the storage medium 22, and to execute a series of operations on the terminal device corresponding to the application of molecular generation stored in the storage medium 22.

The terminal device may also include one or more power supplies 23, one or more wired or wireless network interfaces 24, one or more input/output interfaces 25, and/or one or more operating systems 223, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.

The steps performed by the above molecular generating device described in the above method embodiment may be based on the structure of the terminal device shown in fig. 10.

Embodiments of the present invention also provide a computer readable storage medium storing a plurality of computer programs adapted to be loaded by a processor and to perform a molecular generation method as performed by the molecular generation device described above.

The embodiment of the invention also provides a terminal device, which comprises a processor and a memory; the memory is used for storing a plurality of computer programs, which are to be loaded by the processor to execute the molecular generation method performed by the molecular generating device described above; the processor is configured to implement each of the plurality of computer programs.

Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing related hardware. The program may be stored in a computer readable storage medium, and the storage medium may include: read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks, and the like.

The molecular generation method, the device, the computer readable storage medium and the terminal device provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present invention, and the description of the above embodiments is only intended to help understand the method and the core idea of the present invention. Meanwhile, those skilled in the art may, in accordance with the ideas of the present invention, make changes to the specific embodiments and the scope of application. In view of the above, the content of this description should not be construed as limiting the present invention.

Claims

1. A method of generating a molecule, comprising:

(c) Calculating similarity parameters of the first pharmacophore characteristic of at least one generation molecule with highest activity and the second pharmacophore characteristic of the seed molecule according to the activity information of the generation molecules in the generation molecule set;

performing the steps of (a) - (d) for the new seed molecule;

the performing of shape change processing on the seed molecule to obtain a generated molecule set specifically comprises:

calling a shape change coding model and a subtitle decoding model;

the shape change coding model carries out shape change coding according to the seed molecules and pharmacophores in the seed molecules to obtain coded characteristic information;

and the subtitle decoding model decodes according to the coded feature information to obtain the string representation of each generated molecule in the generated molecule set.

2. The method of claim 1, wherein the method further comprises:

determining a shape change coding initial model and a subtitle decoding initial model;

determining a training sample, wherein the training sample comprises a plurality of seed molecule samples and receptor molecule samples corresponding to the seed molecule samples;

performing shape change coding on the seed molecular sample through the shape change coding initial model to obtain coded sample characteristic information, and decoding the coded sample characteristic information through the subtitle decoding initial model to obtain information for generating a molecular sample set;

and adjusting parameter values of the shape change coding initial model and the subtitle decoding initial model according to a generated molecular sample set obtained by the subtitle decoding initial model and the receptor molecular sample to obtain a final shape change coding model and a final subtitle decoding model.

3. The method of claim 2, wherein the method further comprises:

determining a property prediction model, wherein the property prediction model is used for predicting a drug property according to the coded sample feature information to obtain a property prediction result.

4. The method according to claim 3, wherein the adjusting the parameter values of the shape change coding initial model and the subtitle decoding initial model according to the generated molecular sample set obtained by the subtitle decoding initial model and the receptor molecular sample specifically includes:

calculating a loss function related to the shape change coding initial model and the subtitle decoding initial model according to the property prediction result, a generated molecular sample set obtained by the subtitle decoding initial model and the receptor molecular sample;

and adjusting parameter values of the shape change coding initial model and the subtitle decoding initial model according to the loss function.

5. The method of claim 2, wherein the adjustment of the parameter value is stopped when the number of adjustments to the parameter value is equal to a preset number or when the difference between the currently adjusted parameter value and the previously adjusted parameter value is less than a threshold.

6. The method of any one of claims 1 to 5, wherein said determining activity information of the generated molecules in the set of generated molecules comprises:

docking the generated molecules with corresponding targets to obtain docked molecules;

and scoring the docked molecules with a preset scoring function to obtain a docking score, wherein the activity information of the generated molecules comprises the docking score.

7. The method of any one of claims 1 to 5, wherein said determining activity information of the generated molecules in the set of generated molecules comprises:

generating a file in another format according to the generated molecules and the corresponding target files;

predicting an affinity value of the generated molecule according to the other format file, wherein the activity information comprises the affinity value.

8. The method according to any one of claims 1 to 5, wherein calculating a similarity parameter of a first pharmacophore feature of the at least one generating molecule with highest activity to a second pharmacophore feature of the seed molecule comprises:

and calculating the root mean square deviation between the first pharmacophore characteristic and the second pharmacophore characteristic of at least one generation molecule, wherein when the root mean square deviation corresponding to a certain generation molecule is minus 1, the similarity between the first pharmacophore characteristic and the second pharmacophore characteristic of the certain generation molecule is within a preset range.

9. A molecular generating device, comprising:

a molecule generating unit, configured to perform shape change processing on a seed molecule to obtain a generated molecule set, where the generated molecule set includes generated molecules; the molecule generating unit is specifically configured to call a shape change coding model and a subtitle decoding model; the shape change coding model performs shape change coding according to the seed molecule and the pharmacophores in the seed molecule to obtain coded feature information; the subtitle decoding model decodes according to the coded feature information to obtain the string representations of the generated molecules in the generated molecule set;

10. A computer readable storage medium, characterized in that it stores a plurality of computer programs adapted to be loaded by a processor and to perform the molecular generation method according to any one of claims 1 to 8.

11. A terminal device comprising a processor and a memory;

the memory is used for storing a plurality of computer programs, which are to be loaded by the processor to execute the molecular generation method according to any one of claims 1 to 8; the processor is configured to implement each of the plurality of computer programs.