CN116189759B - A virtual screening method for quorum sensing lead compounds and its application - Google Patents

Info

Publication number
CN116189759B
Authority
CN
China
Prior art keywords
network
sent
layer
protein
quorum sensing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310234744.9A
Other languages
Chinese (zh)
Other versions
CN116189759A (en
Inventor
江高飞
薛卫
张家璇
刘佐
韦中
徐阳春
沈其荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Agricultural University
Original Assignee
Nanjing Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Agricultural University filed Critical Nanjing Agricultural University
Priority to CN202310234744.9A priority Critical patent/CN116189759B/en
Publication of CN116189759A publication Critical patent/CN116189759A/en
Application granted granted Critical
Publication of CN116189759B publication Critical patent/CN116189759B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00: ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30: Drug targeting using structural data; Docking or binding prediction
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00: Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/50: Molecular design, e.g. of drugs
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Medicinal Chemistry (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • General Engineering & Computer Science (AREA)
  • Public Health (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Peptides Or Proteins (AREA)

Abstract

The present invention discloses a virtual screening method for quorum sensing lead compounds. The main process is as follows: the input molecular compound structure is preprocessed to construct a molecular adjacency matrix, which is fed into the GNN1 network to generate compound features; from the input protein sequence, the amino acid composition and dipeptide frequencies are extracted and combined into a preliminary protein feature vector, which is fed into the cross network to generate cross-fusion features; at the same time, a contact map is generated from the protein sequence and fed into the GNN2 network to generate protein sequence features; finally, the three features are combined and fed into fully connected layers to predict the affinity value. The invention can be used to discover new compounds with quorum sensing activity and provides new ideas and means for the control and prevention of bacteria such as Ralstonia solanacearum; the method can also efficiently screen for compounds that bind to the PhcA and PhcR proteins, thereby discovering compounds with quorum sensing activity.

Description

Virtual screening method for quorum sensing lead compounds and application thereof
Technical Field
The invention relates to the field of medicinal chemistry, and in particular to a targeted virtual screening method for quorum sensing lead compounds of Ralstonia solanacearum and its application.
Background
Ralstonia solanacearum, the causal agent of bacterial wilt, is one of the most destructive soil-borne pathogens in the world. It is widely distributed in tropical, subtropical and temperate regions and is gradually spreading to high-latitude and high-altitude regions. The virulence behavior of R. solanacearum during invasion of the crop rhizosphere is regulated by quorum sensing. R. solanacearum has two quorum sensing systems, the AHL system and the methyl 3-hydroxypalmitate (3-OH PAME) system, of which the AHL system does not affect virulence. Through the 3-OH PAME quorum sensing system, R. solanacearum globally regulates metabolism and virulence, coordinates the assembly of various secretion systems, and directs the timed expression and secretion of various virulence factors, thereby completing the rhizosphere invasion process. The system consists of the PhcBSR components and the regulatory factor PhcA. PhcB is responsible for synthesizing the signal molecule 3-OH PAME, and PhcS is responsible for sensing it; when the concentration of 3-OH PAME exceeds a certain threshold, PhcS activates PhcR, thereby relieving the inhibition of PhcA by PhcR. PhcA regulates not only the primary metabolism and the AHL quorum sensing system of R. solanacearum, but also motility, siderophores, biofilm, cell wall degrading enzymes, type III virulence factors, extracellular polysaccharides that disrupt the plant immune system, and other virulence behaviors closely related to rhizosphere invasion. Some studies have attempted to block rhizosphere invasion by R. solanacearum by degrading quorum sensing signal molecules, but the blocking effect is not ideal.
Virtual screening is currently a very common strategy in computer-aided drug design and has been widely used. Drug-target affinity (DTA) prediction is an important step in virtual screening: it can rapidly match targets with drugs and speed up drug development. DTA prediction provides information on the binding strength of a drug to a target protein and can be used to indicate whether a small molecule binds to the protein. For proteins with known structure and binding site information, molecular simulation and molecular docking can be used for detailed modeling to obtain more accurate results; this is called structure-based virtual screening. However, many proteins still lack structural information, and even with homology modeling it remains difficult to obtain structural information for many of them. Predicting the binding affinity between proteins and drug molecules from sequences (sequence-based virtual screening) is therefore an urgent problem, and it is an important aspect of the present invention.
Virtual screening based on molecular docking has become a core technology of computer-aided compound design and is widely applied in the targeted development of novel compounds. Using virtually screened quorum sensing lead compounds to interfere with bacterial quorum sensing may therefore be one of the important ways to control soil-borne bacterial wilt.
Disclosure of Invention
The invention aims to provide a targeted virtual screening method for quorum sensing lead compounds of Ralstonia solanacearum. The method automatically searches for an optimal graph network structure based on reinforcement learning, which improves the performance of the graph-network-based virtual screening model. It is a screening method for quorum sensing lead compounds based on the PhcA and PhcR protein structures, can be used to discover new compounds with quorum sensing activity, and provides new ideas and means for the control and prevention of bacteria such as R. solanacearum.
The technical scheme adopted by the invention is as follows:
A virtual screening method for quorum sensing lead compounds takes a molecular compound structure and a protein sequence as input, feeds them into a preprocessing module to extract preliminary features, and then feeds these features into a prediction model network, wherein the structure and parameters of the prediction model are generated through training with an LSTM controller. The specific flow is as follows:
The input molecular compound structure is preprocessed to construct a molecular adjacency matrix, which is fed into the GNN1 network to generate compound features;
From the input protein sequence, the amino acid composition and dipeptide frequencies are extracted and combined into a preliminary protein feature vector, which is fed into the cross network to generate cross-fusion features; at the same time, a contact map is generated from the protein sequence and fed into the GNN2 network to generate protein sequence features;
Finally, the three features are combined and fed into fully connected layers to predict the affinity value.
Further, let the constructed molecular adjacency matrix be $X_1$. Matrix elements for atoms that are adjacent in the molecular structure graph are 1, and elements for non-adjacent atoms are 0. The matrix has size $n \times n$, where $n$ is the number of nodes in the structure graph, i.e. the number of all atoms.
Further, Pconsc4 software is used to process the protein sequence and output a probability matrix of size $m \times m$ indicating whether each residue pair is in contact. Values greater than 0.5 are retained and all other values are set to 0. The filtered matrix is the protein contact map $X_2$, where $m$ is the number of residues.
Further, the amino acid composition is the frequency of occurrence of each of the 20 amino acids constituting the sequence, and the dipeptide frequency is the frequency of occurrence of each amino acid pair formed by any two amino acids.
Further, the prediction model network consists of the GNN1 network, the GNN2 network and the cross network connected in parallel; their outputs are merged and fed into a concatenation layer and a dropout layer, followed by two fully connected layers.
Further, the cross network consists of 5 cross layers connected in series, followed by a 128-dimensional fully connected layer whose output is $f_3$. Each cross layer is defined by:
$$C_{l+1} = C_0 C_l^{T} W_{c,l} + b_{c,l} + C_l$$
where $l = 1, 2, \ldots, 5$; $C_l$ and $C_{l+1}$ are the outputs of the $l$-th and $(l+1)$-th cross layers, respectively; $C_0$ is the combination $X_3$ of amino acid composition and dipeptide frequencies, which serves as the cross network input; and $W_{c,l}$ and $b_{c,l}$ are the connection parameters between these two layers. All variables in the above formula are column vectors. The output of each layer is the output of the previous layer plus a feature cross.
Further, the molecular adjacency matrix and the protein contact map are input into two different networks, GNN1 and GNN2, respectively, each consisting of 3 GNN layers. The output features of the two GNN networks are $f_1$ and $f_2$. Together with the cross-fusion feature $f_3$, they are concatenated into $f_1 + f_2 + f_3$ (where $+$ denotes concatenation), which gives the overall feature of the corresponding small molecule-protein pair used for prediction. This feature is fed into a fully connected layer with output dimension 128 and then into a second fully connected layer with output dimension 1, which is the affinity value predicted by the network.
Furthermore, optimization of the virtual screening network model is carried out by an LSTM controller. Model optimization means using reinforcement learning within a defined parameter space to obtain the optimal structural parameters of the two GNNs as well as the parameters of the other neurons in the whole network. For the GNN structure M, several parameters must be determined: the sampling function (S), the correlation metric function (Att), the aggregation function (Agg), the number of attention heads (K), the output hidden embedding (Dim) and the activation function (Act).
Further, the optimization consists of two steps. In the first step, the LSTM predicts the operations corresponding to [S, Att, Agg, Act, K, Dim] for GNN1. Each prediction is made by a softmax classifier of the LSTM, and the predicted value is then fed in at the next time step to obtain the next parameter prediction. Once 3 GNN layers have been generated, the LSTM controller has completed one architecture generation, and the process is repeated to generate the parameters of GNN2. The whole prediction network is then constructed and trained to obtain the weight parameters of the GNN networks and the other network layers. In the second step, the parameters of the LSTM are optimized by reinforcement learning, based on the accuracy obtained after network training, to obtain an optimized controller model. The two steps are executed alternately for a certain number of rounds to obtain the final screening network model.
The method can be applied to the virtual screening of quorum sensing lead compounds of Ralstonia solanacearum.
The method has the following advantages. First, it extracts the molecular adjacency matrix, the protein contact map and the cross-fusion feature of the protein sequence to form multidimensional features that better reflect the characteristics of molecules and proteins. Second, it uses reinforcement learning to optimize the model structure, which avoids selecting model parameters by experience or with a large amount of manual effort. Further refinement and wider adoption of the method would allow lead compounds to be virtually screened effectively, so the method has broad prospects and considerable significance.
Drawings
FIG. 1 is a diagram of a virtual screening model of the present invention;
Fig. 2 is a molecular diagram representation.
Detailed Description
The present invention will be described in detail with reference to the accompanying drawings.
As shown in FIG. 1, the virtual screening method for quorum sensing lead compounds is a screening method based on the PhcA and PhcR protein structures, established using computer-aided virtual screening technology, molecular biology technology and the like. The method is implemented through a virtual screening model. The input of the model is a molecular compound structure and a protein sequence, which are fed into a preprocessing module to extract preliminary features; the preliminary features are then fed into a prediction model network, whose structure and parameters are generated through training with an LSTM controller. The basic flow is as follows. The compound structure is input, the molecular adjacency matrix is constructed, and the matrix is fed into the GNN1 subnetwork to generate compound features. The protein sequence is input, and its amino acid composition and dipeptide frequencies are extracted and combined into a preliminary protein feature vector, which is fed into the cross network to generate cross-fusion features; at the same time, a contact map is generated from the protein sequence and fed into the GNN2 network to generate protein sequence features. Finally, the three features are combined and fed into fully connected layers to predict the affinity value, wherein the output value is 0, and the affinity value is 1. The following sections describe the construction of the molecular adjacency matrix and the protein contact map, the generation of the protein sequence cross-fusion features, and the optimization of the GNN structure and the other parameters of the prediction network by the LSTM controller.
1. Data preprocessing
(1) Construction of molecular adjacency matrix
Molecules are represented in the dataset in SMILES format. A molecular graph is constructed from the SMILES string of the drug, with atoms as nodes and bonds as edges. The process of constructing the molecular graph structure is shown in FIG. 2. Let the constructed molecular adjacency matrix be $X_1$: elements for adjacent atoms are 1 and elements for non-adjacent atoms are 0. The matrix has size $n \times n$, where $n$ is the number of nodes in the graph, i.e. the number of all atoms.
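The construction of $X_1$ can be illustrated with a minimal sketch. It assumes the SMILES string has already been parsed into an atom count and a bond list (for example by a cheminformatics toolkit, which the patent does not name); the function name `adjacency_matrix` and the ethanol example are hypothetical, and only the atoms written explicitly in the SMILES string are counted here.

```python
def adjacency_matrix(num_atoms, bonds):
    """Return the n x n 0/1 adjacency matrix X1 of a molecular graph.

    num_atoms: number of nodes (atoms) in the graph.
    bonds: iterable of (i, j) index pairs, one per chemical bond.
    """
    x1 = [[0] * num_atoms for _ in range(num_atoms)]
    for i, j in bonds:
        x1[i][j] = 1
        x1[j][i] = 1  # bonds are undirected, so the matrix is symmetric
    return x1


# Ethanol, SMILES "CCO": atoms C0, C1, O2 with bonds C0-C1 and C1-O2.
ethanol = adjacency_matrix(3, [(0, 1), (1, 2)])
for row in ethanol:
    print(row)
```

Whether self-loops are added on the diagonal before the matrix enters GNN1 is not stated in the text, so the sketch leaves the diagonal at 0.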
(2) Protein contact map
A protein contact map is a two-dimensional representation of which residue pairs in a protein are in contact, and it is used to describe protein structure and function. Pconsc4 software is used to process the protein sequence and output a probability matrix of size $m \times m$ indicating whether each residue pair is in contact. Values greater than 0.5 are retained and all other values are set to 0; the filtered matrix is the protein contact map $X_2$.
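The filtering step can be sketched as follows, assuming the contact probabilities are already available as a square list of lists. The function name `contact_map` and the 3 x 3 probability values are made up for illustration; they are not Pconsc4 output.

```python
def contact_map(prob, threshold=0.5):
    """Keep probabilities strictly greater than the threshold; set the rest to 0."""
    return [[p if p > threshold else 0.0 for p in row] for row in prob]


# Toy 3-residue probability matrix (illustrative values only).
prob = [
    [1.0, 0.8, 0.3],
    [0.8, 1.0, 0.5],
    [0.3, 0.5, 1.0],
]
x2 = contact_map(prob)
for row in x2:
    print(row)
```

Note that a value of exactly 0.5 is dropped, because the text retains only values greater than 0.5.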
(3) Protein composition characterization
The protein composition feature $X_3$ is the combination of amino acid composition and dipeptide frequencies and has 420 dimensions. The amino acid composition is the frequency of occurrence of each of the 20 amino acids constituting the sequence. The dipeptide frequency is the frequency of occurrence of each amino acid pair formed by any two amino acids. Because protein sequences are built from 20 amino acids, there are 20 composition values and 20 × 20 = 400 dipeptide frequencies, giving 420 dimensions in total.
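A minimal sketch of the 420-dimensional feature follows. It assumes that dipeptides are counted as overlapping adjacent residue pairs and that both parts are normalized to frequencies; the patent text does not spell out either choice. The names `composition_features`, `AMINO_ACIDS` and `DIPEPTIDES` and the example sequence are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard amino acids
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def composition_features(seq):
    """Return amino acid composition (20 values) followed by dipeptide frequencies (400)."""
    if not seq:
        raise ValueError("sequence must not be empty")
    n = len(seq)
    aac = [seq.count(a) / n for a in AMINO_ACIDS]
    # Overlapping adjacent pairs: positions (0,1), (1,2), ...
    pairs = [seq[i:i + 2] for i in range(n - 1)]
    total = max(len(pairs), 1)
    dpc = [pairs.count(d) / total for d in DIPEPTIDES]
    return aac + dpc


x3 = composition_features("MKVLAAGIVK")
print(len(x3))
```

Residues outside the 20 standard letters are simply ignored by this sketch; a real pipeline would need an explicit policy for them.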
2. Prediction model architecture
(1) Cross network
The input of the cross network is $X_3$. The network consists of 5 cross layers connected in series, followed by a 128-dimensional fully connected layer whose output is $f_3$. Each cross layer is defined by:
$$C_{l+1} = C_0 C_l^{T} W_{c,l} + b_{c,l} + C_l$$
where $l = 1, 2, \ldots, 5$; $C_l$ and $C_{l+1}$ are the outputs of the $l$-th and $(l+1)$-th cross layers, respectively; $C_0$ is $X_3$; and $W_{c,l}$ and $b_{c,l}$ are the connection parameters between these two layers. All variables in the above equation are column vectors. The output of each layer is the output of the previous layer plus a feature cross.
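The cross-layer formula can be traced with a small sketch in plain Python. Because all variables are column vectors, $C_l^{T} W_{c,l}$ is a scalar, so the layer can be evaluated as $C_0 (C_l^{T} W_{c,l}) + b_{c,l} + C_l$ without forming the full outer product. The function names and the two-dimensional toy weights are hypothetical, and the final 128-dimensional fully connected layer is omitted.

```python
def cross_layer(c0, cl, w, b):
    """One cross layer: C_{l+1} = C_0 (C_l^T W) + b + C_l, all arguments are vectors."""
    scalar = sum(ci * wi for ci, wi in zip(cl, w))  # C_l^T W is a scalar
    return [x0 * scalar + bi + xi for x0, bi, xi in zip(c0, b, cl)]


def cross_network(c0, params):
    """Apply cross layers in series; params is a list of (w, b) pairs, one per layer."""
    c = c0
    for w, b in params:
        c = cross_layer(c0, c, w, b)
    return c


# Toy 2-dimensional example with made-up parameters.
c0 = [1.0, 2.0]
w = [0.5, -1.0]
b = [0.0, 1.0]
c1 = cross_layer(c0, c0, w, b)
print(c1)
```

With all weights and biases set to zero, every layer returns its input unchanged, which reflects the residual term $C_l$ in the formula.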
(2) Affinity prediction network overall structure
The prediction model consists of two GNN networks (GNN1 and GNN2) and a cross network connected in parallel; their outputs are merged and fed into a concatenation layer and a dropout layer, followed by two fully connected layers. The molecular adjacency matrix of the drug molecule and the protein contact map are input into the two different networks GNN1 and GNN2, respectively. Each network consists of 3 GNN layers. The output features of the two GNN networks are $f_1$ and $f_2$; concatenated with the cross-fusion feature $f_3$ they give $f_1 + f_2 + f_3$ (where $+$ denotes concatenation), the overall feature of the corresponding small molecule-protein pair used for prediction. This feature is fed into a fully connected layer with output dimension 128 and then into a second fully connected layer with output dimension 1, which is the affinity value predicted by the network. The specific structural parameters of the GNN layers and the other network layers are obtained through the network model optimization and training process described below.
(3) LSTM controller implementing virtual screening network model optimization
Model optimization means using reinforcement learning within a defined parameter space to obtain the optimal structural parameters of the two GNNs as well as the parameters of the other neurons in the whole network. For the GNN structure M, several parameters must be determined: the sampling function (S), the correlation metric function (Att), the aggregation function (Agg), the number of attention heads (K), the output hidden embedding (Dim) and the activation function (Act).
The description and candidate values of each parameter are as follows:
1. Output hidden embedding (Dim). Dim is the output dimension of each GNN layer and is an integer.
2. Sampling function (S). Each GNN layer requires a sampling function, which selects the receptive field for a given target node. Three methods are used: a. fixed neighbor number sampling, b. importance sampling, c. first-order neighbor ordering.
3. Correlation metric function (Att) and number of attention heads (K). For each GNN layer, an Att method and a number of attention heads K are chosen. Att can be one of two metric functions, GAT or GCN, which correspond to the GAT and GCN network structures, respectively. The GCN network includes 2 graph convolution layers, 1 ReLU activation layer and 1 dropout layer. The GAT network includes K multi-head graph attention layers, 1 softmax activation layer and 1 dropout layer. GAT assigns importance to neighbors using an attention layer, whereas GCN assigns importance to neighbors according to node degree.
4. Aggregation function (Agg). Each GNN layer requires an aggregation function. The candidates are: a. sum aggregator, b. mean aggregator, c. pool aggregator.
5. Activation function (Act). Each GNN layer requires an activation function. The candidates are: a. ReLU, b. LeakyReLU, c. ELU, d. Linear, e. Softmax. The activation function increases the nonlinear fitting capacity of the graph network and the expressive power of the model.
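The three candidate aggregators in item 4 can be illustrated with a short sketch. It assumes that the pool aggregator is an element-wise max over neighbor features; aggregators of this kind often apply a learned transformation before pooling, which is left out here. The function name `aggregate` and the toy neighbor features are hypothetical.

```python
def aggregate(neighbor_feats, mode):
    """Combine a list of neighbor feature vectors element-wise."""
    columns = list(zip(*neighbor_feats))  # one tuple per feature dimension
    if mode == "sum":
        return [sum(c) for c in columns]
    if mode == "mean":
        return [sum(c) / len(c) for c in columns]
    if mode == "pool":
        return [max(c) for c in columns]  # max pooling is an assumption
    raise ValueError(f"unknown aggregator: {mode}")


# Two neighbors, each with a 2-dimensional feature vector (made-up values).
neighbors = [[1.0, 4.0], [3.0, 2.0]]
for mode in ("sum", "mean", "pool"):
    print(mode, aggregate(neighbors, mode))
```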
Network model optimization uses an LSTM controller network and consists of two steps. In the first step, the LSTM predicts the operations corresponding to [S, Att, Agg, Act, K, Dim] for GNN1. Each prediction is made by a softmax classifier of the LSTM, and the predicted value is then fed in at the next time step to obtain the next parameter prediction. Once 3 GNN layers have been generated, the LSTM controller has completed one architecture generation, and the process is repeated to generate the parameters of GNN2. The whole prediction network is then constructed and trained to obtain the weight parameters of the GNN networks and the other network layers. In the second step, the LSTM parameters are optimized by reinforcement learning, based on the accuracy obtained after network training, to obtain an optimized controller model. The two steps are executed alternately for a certain number of rounds to obtain the final screening network model.
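The architecture-generation step can be sketched as a sequence of discrete choices, one per parameter and per layer. This is only an illustration of the search space: uniform random sampling stands in for the LSTM softmax predictions, the reinforcement learning update is not implemented, and the candidate values for K and Dim are assumptions (the text does not list them). All names are hypothetical.

```python
import random

# Decision order follows the text: [S, Att, Agg, Act, K, Dim].
SEARCH_SPACE = {
    "S": ["fixed_neighbors", "importance", "first_order"],
    "Att": ["gat", "gcn"],
    "Agg": ["sum", "mean", "pool"],
    "Act": ["relu", "leaky_relu", "elu", "linear", "softmax"],
    "K": [1, 2, 4, 8, 16],   # assumed candidate values
    "Dim": [64, 128, 256],   # assumed candidate values
}


def sample_gnn(rng, num_layers=3):
    """Sample one GNN description: six decisions for each of the layers."""
    return [
        {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}
        for _ in range(num_layers)
    ]


rng = random.Random(0)   # stand-in for the controller; a real one is learned
gnn1 = sample_gnn(rng)   # architecture for the compound branch
gnn2 = sample_gnn(rng)   # architecture for the protein branch
for layer in gnn1:
    print(layer)
```

In the described method, each sampled pair of architectures would be built and trained, and the resulting accuracy would serve as the reward for updating the controller.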
The training of the model is further described below:
Model training uses the public KIBA dataset. The dataset includes 229 unique proteins and 2111 unique drugs, with 118254 affinity interaction pairs between proteins and drug molecules. For training, the dataset is divided into a training set, a validation set and a test set in the proportion 80%:10%:10%.
Training parameters: the controller is an LSTM network with 100 hidden units, trained using the Adam optimizer with a learning rate of 0.0035. The controller samples graph network structures to generate sub-models, and each sub-model is trained for 200 epochs. During training, L2 regularization with λ = 0.0005 is applied. Furthermore, dropout with p = 0.5 is applied to the inputs of both layers and to the normalized attention layer.
The LSTM comprises three layers: an input layer, a hidden layer and an output layer. The input dimension is 6 × 1, the hidden layer has 100 neurons, and the time step is 10.
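The split can be sketched on index level as follows. The text gives only the proportions, so the random shuffle, the fixed seed and the function name `split_indices` are assumptions made for illustration.

```python
import random


def split_indices(n, seed=0, train=0.8, valid=0.1):
    """Shuffle indices 0..n-1 and cut them into train / validation / test parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * train)
    n_valid = int(n * valid)
    return (
        idx[:n_train],
        idx[n_train:n_train + n_valid],
        idx[n_train + n_valid:],  # the remainder goes to the test set
    )


train_idx, valid_idx, test_idx = split_indices(118254)
print(len(train_idx), len(valid_idx), len(test_idx))
```

Because 118254 is not divisible by 10, the three parts cannot be exactly 80%, 10% and 10%; this sketch assigns the leftover pair to the test set.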
LSTM network parameters are set as follows:
Layer1:LSTM(input_size=8,hidden_size=100,num_layers=2)
Layer2:Dropout(p=0.5)
Layer3:Linear(hidden_size=500,n_class=1)
Here input_size is the input data dimension, hidden_size is the output dimension, num_layers is the number of stacked LSTM layers (default 1), and n_class is the output dimension of the LSTM network; a value of 1 means that a regression value is output.
After the controller has been trained 1000 times, the controller outputs the best model among 200 sampled GNN architectures. The results show that this optimization method can design the best-performing version of the original virtual screening model.
Experimental results
The structure of the optimal virtual screening model after optimization and screening is as follows:
GNN1 structure:
Layer1: molecular graph attention layer GAT1 (input dimension = (n × n), output dimension = 128, number of hidden units = 128, number of attention heads = 4, activation function = elu(), aggregation function = sum()).
Layer2: molecular graph convolution layer GCN2 (input dimension = 128, output dimension = 256, number of hidden units = 256, number of attention heads = 4, activation function = relu(), aggregation function = max()).
Layer3: molecular graph attention layer GAT3 (input dimension = 256, output dimension = 128, number of hidden units = 128, number of attention heads = 8, activation function = elu(), aggregation function = avg()).
GNN2 structure:
Layer1: protein graph convolution layer GCN4 (input dimension = (m × m), output dimension = 64, number of hidden units = 64, number of attention heads = 16, activation function = relu(), aggregation function = max()).
Layer2: protein graph attention layer GCN5 (input dimension = 64, output dimension = 256, number of hidden units = 256, number of attention heads = 4, activation function = elu(), aggregation function = pooling()).
Layer3: protein graph convolution layer GCN6 (input dimension = 256, output dimension = 256, number of hidden units = 256, number of attention heads = 16, activation function = relu(), aggregation function = max()).
Cross network: input dimension = 420, output dimension = 128, number of layers n = 5.
Feature concatenation layer: Concat1 (input dimension = (128, 256, 128), output dimension = 512).
Dropout layer: (512, 512), ratio p = 0.5.
Fully connected layer: Linear (512, 128).
Fully connected layer: Linear (128, 1).
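The dimensions in the listing above can be cross-checked with a few lines of arithmetic. The sketch below only verifies that each layer's input size matches the previous layer's output size and that the three branch outputs add up to the concatenated size; the helper name `check_chain` is hypothetical, and the strings "n" and "m" stand for the variable input sizes of the first layers.

```python
def check_chain(layers):
    """Check that consecutive (input, output) sizes agree; return the final output size."""
    for (_, prev_out), (next_in, _) in zip(layers, layers[1:]):
        if prev_out != next_in:
            raise ValueError(f"dimension mismatch: {prev_out} -> {next_in}")
    return layers[-1][1]


gnn1_out = check_chain([("n", 128), (128, 256), (256, 128)])  # GAT1, GCN2, GAT3
gnn2_out = check_chain([("m", 64), (64, 256), (256, 256)])    # GCN4, GCN5, GCN6
cross_out = 128                                               # cross network output

concat_out = gnn1_out + gnn2_out + cross_out                  # Concat1
head_out = check_chain([(concat_out, 128), (128, 1)])         # two fully connected layers
print(concat_out, head_out)
```

Dropout does not change the feature size, so it is left out of the chain.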
After 300 training iterations the network reaches a good state, and the corresponding parameters are saved for the virtual screening of quorum sensing lead compounds of Ralstonia solanacearum.
In summary, the invention provides a screening method for quorum sensing lead compounds based on the PhcA and PhcR protein structures, which can be used to discover new compounds with quorum sensing activity and provides new ideas and means for the control and prevention of bacteria such as R. solanacearum. The method combines computer-aided virtual screening technology with molecular biology technology and can efficiently screen for compounds that bind to the PhcA and PhcR proteins, thereby discovering compounds with quorum sensing activity. The method is simple to operate, efficient, fast and accurate, and can be widely applied in fields such as biomedicine and agriculture.
Notably, PhcA and PhcR in the present invention are two key proteins of R. solanacearum, whose structure and function play an important role in the quorum sensing of this bacterium. Other bacteria may have different quorum sensing proteins, so screening and research must be carried out for each bacterial species in order to find quorum sensing lead compounds suitable for it.
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It should be understood by those skilled in the art that the above embodiments do not limit the scope of the present invention in any way, and all technical solutions obtained by equivalent substitution and the like fall within the scope of the present invention.
Parts of the invention that are not described in detail are the same as the prior art or can be implemented using the prior art.

Claims (7)

1. A virtual screening method for quorum sensing lead compounds, characterized in that a molecular compound structure and a protein sequence are fed as input to a preprocessing module to extract preliminary features, which are then fed to a prediction model network, wherein the structure and parameters of the prediction model are generated through training by an LSTM controller; the specific process is as follows:

the input molecular compound structure is preprocessed to construct a molecular adjacency matrix, which is fed to the GNN1 network to generate compound features;

from the input protein sequence, the amino acid composition and the dipeptide frequencies are extracted and combined into a preliminary protein feature vector, which is fed to the cross network to generate cross-fusion features; at the same time, a corresponding contact map is generated from the protein sequence and fed to the GNN2 network to generate protein sequence features;

finally, the three features are combined and fed to the fully connected layers to predict the affinity value;

the prediction model network consists of the GNN1 network, the GNN2 network and the cross network in parallel, whose outputs are merged and fed to a concatenation layer and a DROPOUT layer, followed by two fully connected layers;

the cross network consists of 5 cross layers connected in series, followed by a 128-dimensional fully connected layer whose output is $f_3$; each cross layer has the following formula:

$C_{l+1} = C_0 C_l^{T} W_{c,l} + b_{c,l} + C_l$

where $l = 1, 2, \dots, 5$; $C_l$ and $C_{l+1}$ are the outputs of the $l$-th and $(l+1)$-th cross layers, respectively; $C_0$ is the combination $X_3$ of the amino acid composition and the dipeptide frequencies, $X_3$ being the input of the cross network; $W_{c,l}$ and $b_{c,l}$ are the connection parameters between these two layers; all variables in the above formula are column vectors; the output of each layer is the output of the previous layer plus the feature cross;

the molecular adjacency matrix and the protein contact map are input to two different networks, GNN1 and GNN2, respectively, each consisting of 3 GNN layers; the output features of the two GNN networks are concatenated with the cross-fusion feature to obtain the overall feature of the corresponding small molecule-protein pair used for prediction; this feature is then fed to a fully connected layer with an output dimension of 128, and then to a second fully connected layer with an output dimension of 1, which is the affinity value predicted by the network.

2. The virtual screening method for quorum sensing lead compounds according to claim 1, characterized in that, in the constructed molecular adjacency matrix, the elements for atoms that are adjacent in the molecular structure graph have the value 1 and the elements for non-adjacent atoms have the value 0; the size of the molecular adjacency matrix is determined by the number of nodes in the structure graph, that is, the number of all atoms.
3. The virtual screening method for quorum sensing lead compounds according to claim 1, characterized in that the protein sequence is processed with the Pconsc4 software, which outputs a probability matrix indicating whether each pair of residues is in contact; values greater than 0.5 in the matrix are retained and all other values are set to 0; the filtered matrix is the protein contact map $X_2$, whose size is determined by the number of residues.

4. The virtual screening method for quorum sensing lead compounds according to claim 1, characterized in that the amino acid composition is the frequency of occurrence of each of the 20 amino acids that make up the sequence, and the dipeptide frequency is the frequency of occurrence of each amino acid pair formed by any two amino acids.

5. The virtual screening method for quorum sensing lead compounds according to claim 1, characterized in that the virtual screening network model is optimized by an LSTM controller; model optimization means using reinforcement learning within a defined parameter space to obtain the optimal structural parameters of the two GNNs and the parameters of the other neurons in the entire network; for a GNN structure M, the following parameters need to be determined: sampling function (S), correlation metric function (Att), aggregation function (Agg), number of attention heads (K), output hidden embedding dimension (Dim) and activation function (Act).

6. The virtual screening method for quorum sensing lead compounds according to claim 5, characterized in that the optimization consists of two steps: first, the LSTM predicts the operations corresponding to S, Att, Agg, Act, K and Dim for GNN1, each prediction being made by the softmax classifier of the LSTM, and each predicted value is then fed in at the next time step to obtain the next parameter prediction; when the number of GNN layers reaches 3, the LSTM controller has completed the generation of one architecture; the process is repeated to generate the parameters of GNN2; the entire prediction network is then constructed and trained to obtain the weight parameters of the GNN networks and the other network layers; second, based on the accuracy obtained after training the network, the parameters of the LSTM are optimized by reinforcement learning to obtain an optimized controller model; the two steps are executed alternately for a set number of steps, yielding the final screening network model.

7. Use of the virtual screening method for quorum sensing lead compounds according to any one of claims 1 to 6 in Ralstonia solanacearum.
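As an illustration only, the preprocessing and cross-layer steps recited in claims 1 to 4 might be sketched in plain Python as follows. This is not the patented implementation. The function names, the bond-list input format, the alphabet ordering, and the reading of "dipeptide" as an ordered pair of adjacent residues are all assumptions made for this sketch; the GNN layers, the Pconsc4 call and the LSTM controller are not shown.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (ordering is an assumption)


def adjacency_matrix(num_atoms, bonds):
    """Claim 2: 1 for atoms adjacent in the structure graph, 0 otherwise.
    `bonds` is a hypothetical list of (i, j) atom-index pairs."""
    adj = [[0] * num_atoms for _ in range(num_atoms)]
    for i, j in bonds:
        adj[i][j] = adj[j][i] = 1
    return adj


def filter_contact_map(prob, threshold=0.5):
    """Claim 3: keep probabilities greater than 0.5, set all other values to 0."""
    return [[p if p > threshold else 0.0 for p in row] for row in prob]


def amino_acid_composition(seq):
    """Claim 4: frequency of each of the 20 amino acids (20 values)."""
    n = max(len(seq), 1)
    return [seq.count(a) / n for a in AMINO_ACIDS]


def dipeptide_frequency(seq):
    """Claim 4: frequency of each amino acid pair (400 values).
    Assumption: pairs are ordered and taken from adjacent positions."""
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # ignore non-standard residue codes
            counts[pair] += 1
    total = max(len(seq) - 1, 1)
    return [counts[p] / total for p in pairs]


def cross_layer(c0, cl, w, b):
    """One cross layer: C_{l+1} = C_0 (C_l^T W_{c,l}) + b_{c,l} + C_l.
    All arguments are column vectors stored as flat lists."""
    scale = sum(ci * wi for ci, wi in zip(cl, w))  # scalar C_l^T W_{c,l}
    return [x0 * scale + bi + xi for x0, bi, xi in zip(c0, b, cl)]


def cross_network(x3, weights, biases):
    """Cross layers applied in series, starting from C_0 = X_3."""
    c = x3
    for w, b in zip(weights, biases):
        c = cross_layer(x3, c, w, b)
    return c


# Usage on a made-up sequence; the weights here are placeholders, not trained values.
seq = "MKTAYIAKQR"
x3 = amino_acid_composition(seq) + dipeptide_frequency(seq)  # 20 + 400 values
weights = [[0.0] * len(x3) for _ in range(5)]
biases = [[0.0] * len(x3) for _ in range(5)]
cross_out = cross_network(x3, weights, biases)
```

Because $C_l^{T} W_{c,l}$ is a scalar when every variable is a column vector, each cross layer rescales the original input $C_0$ and adds it, together with the bias, to the previous output, which is why the sketch needs no matrix products. In the claimed network the cross output would additionally pass through the 128-dimensional fully connected layer to give $f_3$.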
CN202310234744.9A 2023-03-13 2023-03-13 A virtual screening method for quorum sensing lead compounds and its application Active CN116189759B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310234744.9A CN116189759B (en) 2023-03-13 2023-03-13 A virtual screening method for quorum sensing lead compounds and its application

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310234744.9A CN116189759B (en) 2023-03-13 2023-03-13 A virtual screening method for quorum sensing lead compounds and its application

Publications (2)

Publication Number Publication Date
CN116189759A CN116189759A (en) 2023-05-30
CN116189759B true CN116189759B (en) 2025-05-06

Family

ID=86446247

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310234744.9A Active CN116189759B (en) 2023-03-13 2023-03-13 A virtual screening method for quorum sensing lead compounds and its application

Country Status (1)

Country Link
CN (1) CN116189759B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2025114899A1 (en) * 2023-11-29 2025-06-05 Technology Innovation Institute – Sole Proprietorship LLC System and method for predicting protein binding using a multi-modal prediction model

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114333984A (en) * 2022-01-10 2022-04-12 青岛理工大学 Intelligent prediction method for small molecule-protein binding affinity
CN115188412A (en) * 2022-07-27 2022-10-14 上海数因信科智能科技有限公司 Drug prediction algorithm based on Transformer and graph neural network
CN115631805A (en) * 2022-09-09 2023-01-20 北京算路科技有限公司 Medicine and protein affinity prediction method and system based on graph neural network

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004519206A (en) * 2000-06-30 2004-07-02 インサイト・ゲノミックス・インコーポレイテッド Protein modification and conservative molecules
US7424370B2 (en) * 2004-02-06 2008-09-09 Council Of Scientific And Industrial Research Computational method for identifying adhesin and adhesin-like proteins of therapeutic potential
CN101019123A (en) * 2004-02-06 2007-08-15 科学与工业研究委员会 Computational method for identifying adhesin and adhesin-like proteins of therapeutic potential
EP3191524A4 (en) * 2014-09-12 2018-05-30 The Regents of The University of California Macropinocytosing human anti-cd46 antibodies and targeted cancer therapeutics
CN109652392B (en) * 2019-02-20 2020-12-04 南京农业大学 A kind of ferulic acid esterase and its preparation method and application
CN115104105A (en) * 2020-02-19 2022-09-23 英矽智能科技有限公司 Antagonistic autocoder architecture for graph-to-sequence model approach
CN114724623B (en) * 2022-04-29 2024-12-20 中国海洋大学 A method for predicting drug-target affinity based on multi-source protein feature fusion
CN114974409B (en) * 2022-05-31 2024-12-10 浙江大学 A drug virtual screening system for newly discovered targets based on zero-shot learning

Also Published As

Publication number Publication date
CN116189759A (en) 2023-05-30

Similar Documents

Publication Publication Date Title
CN114999565B (en) Drug target affinity prediction method based on representation learning and graph neural network
CN113764037B (en) Method and apparatus for model training, antibody engineering and binding site prediction
CN114333984A (en) Intelligent prediction method for small molecule-protein binding affinity
CN113539358B (en) Hilbert coding-based enhancer-promoter interaction prediction method and device
CN110543895A (en) An Image Classification Method Based on VGGNet and ResNet
CN117423378B (en) Graph representation learning-based intelligent drug-target recommendation method
CN116189759B (en) A virtual screening method for quorum sensing lead compounds and its application
CN112530515A (en) Novel deep learning model for predicting protein affinity of compound, computer equipment and storage medium
CN116343911B (en) Medicine target affinity prediction method and system based on three-dimensional spatial biological reaction
CN115116549A (en) Cell data annotation method, device, equipment and medium
CN114582423A (en) Protein solubility prediction method based on combined machine learning model
CN119132386B (en) A drug-target binding affinity prediction method based on graph neural network
CN116230113A (en) Compound-protein interaction prediction method fusing multi-view information
CN119626312B (en) Protein interaction prediction method based on cross-modal enhancement representation learning
CN114530197A (en) Medicine target point prediction method and system based on matrix completion
Luo et al. A Caps-UBI model for protein ubiquitination site prediction
CN119889426A (en) Drug target prediction method based on 3D structure and multi-level attention mechanism
Ning et al. DMHGNN: Double multi-view heterogeneous graph neural network framework for drug-target interaction prediction
CN114512188B (en) DNA binding protein recognition method based on improved protein sequence position specificity matrix
CN120375914B (en) A drug-target binding affinity prediction method based on multi-feature fusion to reduce feature redundancy
CN120145001A (en) Vehicle trajectory prediction method based on improved Transformer network based on graph convolution and dilated temporal convolution
CN116705146B (en) A multi-perspective enzyme function prediction method that combines molecular structure and sequence mining
CN118692554A (en) A method for predicting multiple conformations and conformational transition pathways of proteins
CN116525027A (en) Drug target interaction prediction method based on comprehensive learning technology
CN113889183B (en) PROTAC molecular degradation rate prediction system based on neural network and construction method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant