Disclosure of Invention
The invention aims to provide a targeted virtual screening method for quorum sensing lead compounds against bacterial wilt. The method automatically searches for an optimal graph structure based on reinforcement learning, improving the performance of the graph-network virtual screening model for quorum sensing lead compounds. It is a screening method for quorum sensing lead compounds based on the PhcA and PhcR protein structures, can be used to discover new compounds with quorum sensing activity, and provides new ideas and means for controlling and preventing bacteria such as the bacterial wilt pathogen.
The technical scheme adopted by the invention is as follows:
A virtual screening method for quorum sensing lead compounds takes a molecular compound structure and a protein sequence as input, sends them to a preprocessing module to extract preliminary features, and feeds the preliminary features into a prediction model network whose structure and parameters are generated through training of an LSTM controller. The specific flow is as follows:
The input molecular compound structure is preprocessed to construct a molecular adjacency matrix, which is fed into the GNN1 network to generate compound features;
From the input protein sequence, the amino acid composition and the dipeptide frequencies are extracted and combined into a preliminary protein feature vector, which is fed into a cross network to generate cross-fusion features;
Finally, the three feature sets are combined and fed into the fully connected layers to predict the affinity value.
Further, let the constructed molecular adjacency matrix be X1; its element value is 1 for atoms that are adjacent in the molecular structure graph and 0 for non-adjacent atoms. The size of the molecular adjacency matrix is (n × n), where n is the number of nodes in the structure graph, i.e. the total number of atoms.
Further, the Pconsc software is used to process the protein sequence and output a probability matrix of whether each residue pair is in contact, of size m × m, where m is the number of residues. Values greater than 0.5 in the matrix are retained and all other values are set to 0; the filtered matrix is the protein contact map X2.
Further, the amino acid composition is the frequency of occurrence of each of the 20 amino acids in the sequence, and the dipeptide frequency is the frequency of occurrence of each possible amino acid pair in the sequence.
Further, the prediction model network consists of the GNN1 network, the GNN2 network and the cross network connected in parallel; their outputs are combined and fed into a concatenation layer and a Dropout layer, followed by two fully connected layers.
Further, the cross network is formed by connecting 5 cross layers in series, followed by a 128-dimensional fully connected layer whose output is f3. Each cross layer follows the formula:
C(l+1) = C0 · Cl^T · Wc,l + bc,l + Cl
where l = 1, 2, ..., 5; Cl and C(l+1) are the outputs of the l-th and (l+1)-th cross layers, respectively; C0 is the combination of the amino acid composition and dipeptide frequencies X3; and Wc,l and bc,l are the connection parameters between the two layers. All variables in the formula are column vectors. The output of each layer is the output of the previous layer plus the feature crossing.
Further, the molecular adjacency matrix and the protein contact map are input into two different networks, GNN1 and GNN2, respectively, each consisting of 3 GNN layers. The output features of the two GNN networks are f1 and f2; these are concatenated with the cross-fusion features as f1 + f2 + f3 to obtain the overall features of the corresponding small molecule-protein pair for prediction. The overall features are then fed into a fully connected layer with output dimension 128, and then into a second fully connected layer with output dimension 1, i.e. the affinity value predicted by the network.
Furthermore, the virtual screening network model is optimized by an LSTM controller. Model optimization uses reinforcement learning, within a determined parameter space, to obtain the optimal structural parameters of the two GNNs and the parameters of the other neurons in the whole network. The GNN structure M requires several parameters to be determined: a sampling function (S), a correlation metric function (Att), an aggregation function (Agg), a multi-head attention number (K), an output hidden embedding (Dim) and an activation function (Act).
Further, the optimization consists of two steps. First, the LSTM predicts the operations corresponding to [S, Att, Agg, Act, K, Dim] for GNN1; each prediction is performed by a softmax classifier of the LSTM, and the predicted value is fed as input to the next time step to obtain the next parameter prediction. When the number of GNN layers reaches 3, the LSTM controller completes one architecture generation; the process is repeated to generate the parameters of GNN2, and the whole prediction network is constructed and trained to obtain the weight parameters of the GNN networks and the other network layers. Second, based on the accuracy obtained after network training, the parameters of the LSTM are optimized by reinforcement learning to obtain an optimal controller model. The two steps are executed alternately for a certain number of iterations to obtain the final screening network model.
The method can be applied to the virtual screening of bacterial wilt quorum sensing lead compounds.
The method has two main advantages. First, it extracts the molecular adjacency matrix, the protein contact map and the protein-sequence cross-fusion features to form multidimensional features that better reflect the characteristics of molecules and proteins. Second, it uses reinforcement learning to optimize the model structure, avoiding the previous selection of model parameters by experience or extensive manual effort. With refinement and popularization, the method can effectively perform virtual screening of lead compounds, and it has broad prospects and notable significance.
Detailed Description
The present invention will be described in detail with reference to the accompanying drawings.
As shown in fig. 1, a method for virtually screening quorum sensing lead compounds is established based on the PhcA and PhcR protein structures by combining computer-aided virtual screening technology, molecular biology technology and the like. The method is realized through a virtual screening model whose inputs are a molecular compound structure and a protein sequence; these are sent to a preprocessing module to extract preliminary features, which are fed into a prediction model network whose structure and parameters are generated through training of an LSTM controller. The basic flow is as follows: the molecular formula of the compound is input, the molecular adjacency matrix is constructed, and the matrix is fed into the GNN1 subnetwork to generate compound features. The protein sequence is input, and the amino acid composition and dipeptide frequencies are extracted and combined into a preliminary protein feature vector, which is fed into the cross network to produce cross-fusion features; at the same time, the corresponding contact map is generated from the protein sequence and fed into the GNN2 network to produce protein sequence features. Finally, the three feature sets are combined and fed into the fully connected layers, which predict and output a single value, the affinity value. The following describes the construction of the molecular adjacency matrix, the protein contact map and the generation of the protein-sequence cross-fusion features, as well as the optimization of the GNN structure and the other parameters of the prediction network by the LSTM controller.
1. Data preprocessing
(1) Construction of molecular adjacency matrix
The molecules in the dataset are represented in SMILES format. A molecular graph is constructed from a drug SMILES string, using atoms as nodes and bonds as edges. The process of constructing the molecular graph structure is shown in FIG. 2. Let the constructed molecular adjacency matrix be X1; the element value is 1 for adjacent atoms and 0 for non-adjacent atoms, and the size is (n × n), where n is the number of nodes in the graph, i.e. the total number of atoms.
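A minimal sketch of this construction, assuming the atoms and bonds have already been parsed from the SMILES string (the `bonds` list of atom-index pairs is a hypothetical input; in practice a cheminformatics toolkit such as RDKit would supply it):

```python
def build_adjacency(n_atoms, bonds):
    """Build the (n x n) molecular adjacency matrix X1.

    n_atoms -- number of atoms, i.e. graph nodes.
    bonds   -- list of (i, j) atom-index pairs, one per chemical bond.
    Adjacent atoms get element value 1, non-adjacent atoms get 0.
    """
    X1 = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        X1[i][j] = 1
        X1[j][i] = 1  # chemical bonds are undirected, so X1 is symmetric
    return X1

# Ethanol (SMILES "CCO"): three heavy atoms C-C-O, bonds (0,1) and (1,2)
adj = build_adjacency(3, [(0, 1), (1, 2)])  # [[0,1,0],[1,0,1],[0,1,0]]
```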
(2) Protein contact map
A protein contact map is a graphical representation of the contacts and interactions between residues within a protein, and is used to describe protein structure and function. The Pconsc software is used to process the protein sequence and output a probability matrix of whether each residue pair is in contact, of size m × m. Values greater than 0.5 in the matrix are retained, all other values are set to 0, and the filtered matrix is the protein contact map X2.
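A minimal sketch of the thresholding step, with a toy 2 × 2 probability matrix standing in for Pconsc's m × m output:

```python
def contact_map(prob_matrix, threshold=0.5):
    """Filter an m x m residue-contact probability matrix into X2.

    Values greater than the threshold are kept; all others are set to 0.
    """
    return [[p if p > threshold else 0 for p in row] for row in prob_matrix]

probs = [[0.9, 0.3], [0.3, 0.7]]
X2 = contact_map(probs)  # [[0.9, 0], [0, 0.7]]
```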
(3) Protein composition characterization
The protein composition feature X3 is the combination of the amino acid composition and the dipeptide frequencies, with a size of 420 dimensions. The amino acid composition is the frequency of occurrence of each of the 20 amino acids in the sequence. The dipeptide frequency is the frequency of occurrence of each amino acid pair; since protein sequences are built from 20 amino acids, there are 400 possible dipeptides, giving 20 + 400 = 420 dimensions in total.
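A self-contained sketch of how the 420-dimensional feature X3 can be computed from a sequence (the toy sequence is illustrative):

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def composition_features(seq):
    """420-dim feature X3: 20 amino-acid frequencies + 400 dipeptide frequencies."""
    # amino acid composition: occurrence frequency of each residue type
    aac = [seq.count(a) / len(seq) for a in AMINO_ACIDS]
    # dipeptide frequencies: occurrence frequency of each of the 20*20 pairs
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    n_dip = len(seq) - 1  # number of overlapping dipeptides in the sequence
    dipep = [sum(1 for i in range(n_dip) if seq[i:i + 2] == p) / n_dip
             for p in pairs]
    return aac + dipep

x3 = composition_features("ACDAC")  # toy 5-residue sequence
```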
2. Prediction model architecture
(1) Cross network
The cross network takes X3 as input. The network consists of 5 cross layers connected in series, followed by a 128-dimensional fully connected layer whose output is f3. Each cross layer follows the formula:
C(l+1) = C0 · Cl^T · Wc,l + bc,l + Cl
where l = 1, 2, ..., 5; Cl and C(l+1) are the outputs of the l-th and (l+1)-th cross layers, respectively; C0 is X3; and Wc,l and bc,l are the connection parameters between the two layers. All variables in the equation are column vectors. The output of each layer is the output of the previous layer plus the feature crossing.
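A minimal pure-Python sketch of the cross layer and its 5-layer stack (function names are illustrative; in practice Wc,l and bc,l would be trainable parameters in a deep learning framework). Since Cl^T · Wc,l is a scalar, the rank-one product C0 · Cl^T · Wc,l reduces to scaling C0 by that scalar:

```python
def cross_layer(c0, cl, w, b):
    """One cross layer: C(l+1) = C0 * (Cl^T W) + b + Cl, all d-dim vectors."""
    s = sum(ci * wi for ci, wi in zip(cl, w))  # scalar Cl^T W
    return [c0i * s + bi + cli for c0i, bi, cli in zip(c0, b, cl)]

def cross_network(x3, weights, biases):
    """Stack of 5 cross layers; C0 = X3 enters every layer."""
    c = x3
    for w, b in zip(weights, biases):
        c = cross_layer(x3, c, w, b)
    return c

# Toy 2-dim example with 5 identical layers (real input is 420-dim)
out = cross_network([1.0, 2.0], [[0.5, 0.5]] * 5, [[0.0, 0.0]] * 5)
```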
(2) Affinity prediction network overall structure
The prediction model consists of two GNN networks (GNN1 and GNN2) and a cross network connected in parallel; their outputs are combined and fed into a concatenation layer and a Dropout layer, followed by two fully connected layers. The molecular adjacency matrix of the drug molecule and the protein contact map are input into the two different networks GNN1 and GNN2, each consisting of 3 GNN layers. The output features of the two GNN networks are f1 and f2; after concatenation with the cross-fusion features, f1 + f2 + f3 gives the overall features of the corresponding small molecule-protein pair for prediction. These are then fed into one fully connected layer with output dimension 128, and then into a second fully connected layer with output dimension 1, i.e. the affinity value predicted by the network. The specific structural parameters of the GNN layers and the other network layers are obtained by the network model optimization training process described below.
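The post-concatenation part of the network can be sketched in pure Python as follows. The ReLU on the 128-dimensional hidden layer is an assumption (the source does not name its activation), and the toy weights and dimensions are purely illustrative, not trained parameters:

```python
def linear(x, w, b):
    """Fully connected layer: y_j = sum_i x_i * w[i][j] + b[j]."""
    return [sum(xi * wij for xi, wij in zip(x, col)) + bj
            for col, bj in zip(zip(*w), b)]

def predict_affinity(f1, f2, f3, params):
    """Concatenate f1 (GNN1), f2 (GNN2), f3 (cross net), then FC-128 and FC-1.

    The Dropout layer is active only during training and is omitted here.
    """
    feat = f1 + f2 + f3  # "+" on Python lists is concatenation (splicing layer)
    h = [max(0.0, v) for v in linear(feat, params["w1"], params["b1"])]
    out = linear(h, params["w2"], params["b2"])
    return out[0]  # single predicted affinity value

# Toy dimensions for illustration (the real model uses 128/256/128 -> 512 -> 128 -> 1)
params = {
    "w1": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], "b1": [0.0, 0.0],
    "w2": [[1.0], [1.0]], "b2": [0.5],
}
score = predict_affinity([1.0], [2.0], [3.0], params)
```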
(3) LSTM controller implementing virtual screening network model optimization
Model optimization uses reinforcement learning, within a determined parameter space, to obtain the optimal structural parameters of the two GNNs and the parameters of the other neurons in the whole network. The GNN structure M requires several parameters to be determined: a sampling function (S), a correlation metric function (Att), an aggregation function (Agg), a multi-head attention number (K), an output hidden embedding (Dim) and an activation function (Act).
The specific description and the corresponding parameter values of each parameter are as follows:
1. Output hidden embedding (Dim). Dim is the output dimension of each GNN layer and is an integer value.
2. Sampling function (S). Each GNN layer requires a sampling function. Sampling is used in the neural network to select the receptive field for a given target node. Three methods are used: (a) fixed neighbor-number sampling, (b) importance sampling, and (c) first-order neighbor ordering.
3. Correlation metric function (Att) and multi-head attention number (K). For each GNN layer, an Att method and a multi-head attention number K are chosen. Att can be one of two metric functions, GAT and GCN, which correspond to the GAT and GCN network structures respectively. The GCN network includes 2 graph convolution layers, 1 ReLU activation function layer and 1 Dropout layer. The GAT network includes K graph multi-head attention layers, 1 Softmax activation function layer and 1 Dropout layer. GAT assigns neighborhood importance using an attention layer, while GCN assigns neighborhood importance according to node degree.
4. Aggregation function (Agg). Each GNN layer requires an Agg aggregation. The optional aggregation functions are (a) sum aggregator, (b) mean aggregator and (c) pooling aggregator.
5. Activation function (Act). Each GNN layer uses an Act activation function. The optional activation functions include (a) ReLU, (b) LeakyReLU, (c) ELU, (d) Linear and (e) Softmax. The activation function increases the nonlinear fitting capacity of the graph network and the expressive capacity of the model.
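The candidate activation functions listed above can be implemented minimally as follows; the LeakyReLU slope of 0.01 is an assumption (the source does not specify it):

```python
import math

# Candidate activation functions in the Act search space
ACTIVATIONS = {
    "relu":       lambda x: max(0.0, x),
    "leaky_relu": lambda x: x if x > 0 else 0.01 * x,      # slope 0.01 assumed
    "elu":        lambda x: x if x > 0 else math.exp(x) - 1.0,
    "linear":     lambda x: x,
}

def softmax(xs):
    """Softmax acts on a whole vector rather than a single scalar."""
    exps = [math.exp(x - max(xs)) for x in xs]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]
```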
The network model optimization trains the network with an LSTM controller neural network and consists of two steps. First, the LSTM predicts the operations corresponding to [S, Att, Agg, Act, K, Dim] for GNN1; each prediction is performed by a softmax classifier of the LSTM, and the predicted value is fed as input to the next time step to obtain the next parameter prediction. When the number of GNN layers reaches 3, the LSTM controller completes one architecture generation, and the process is repeated to generate the parameters of GNN2. The whole prediction network is then constructed and trained to obtain the weight parameters of the GNN networks and the other network layers. Second, based on the accuracy obtained after network training, the LSTM parameters are optimized by reinforcement learning to obtain an improved controller model. The two steps are executed alternately for a certain number of iterations to obtain the final screening network model.
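The controller's sampling step can be sketched as follows. The candidate value lists for K and Dim are illustrative assumptions (the source fixes only the six parameter names), and `random.choice` stands in for the LSTM's learned softmax classifier, which in the real method conditions each choice on the previously predicted parameters:

```python
import random

# Search space for one GNN layer: [S, Att, Agg, Act, K, Dim]
SEARCH_SPACE = {
    "S":   ["fixed_neighbors", "importance", "first_order_rank"],
    "Att": ["gat", "gcn"],
    "Agg": ["sum", "mean", "pool"],
    "Act": ["relu", "leaky_relu", "elu", "linear", "softmax"],
    "K":   [1, 4, 8, 16],        # assumed candidate head counts
    "Dim": [64, 128, 256],       # assumed candidate output dimensions
}

def sample_architecture(n_layers=3, rng=random):
    """Sample one 3-layer GNN architecture from the parameter space."""
    return [{name: rng.choice(opts) for name, opts in SEARCH_SPACE.items()}
            for _ in range(n_layers)]

gnn1_arch = sample_architecture()  # one architecture generation for GNN1
gnn2_arch = sample_architecture()  # repeated for GNN2
```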
The training of the model is further described below:
Model training uses the public KIBA dataset, which includes 229 unique proteins and 2111 unique drugs, with 118254 protein-drug affinity interaction pairs. In this training method, the dataset is divided into a training set, a validation set and a test set in the ratio 80% : 10% : 10%.
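The 80/10/10 split can be sketched as follows; the shuffle and the fixed seed are illustrative assumptions, since the source specifies only the ratio:

```python
import random

def split_dataset(n_pairs, seed=0):
    """Shuffle pair indices and split 80% / 10% / 10% into train/val/test."""
    idx = list(range(n_pairs))
    random.Random(seed).shuffle(idx)  # seeded shuffle for reproducibility
    n_train = int(0.8 * n_pairs)
    n_val = int(0.1 * n_pairs)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# 118254 affinity pairs in KIBA
train, val, test = split_dataset(118254)
```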
Training parameters: the controller is an LSTM network with 100 hidden units, trained using the ADAM optimizer with a learning rate of 0.0035. The controller samples graph network structures to generate a sub-model, which is trained for 200 epochs. During training, L2 regularization with λ = 0.0005 is applied. Furthermore, Dropout with p = 0.5 is applied to the inputs of both layers and to the normalized attention layer.
The LSTM comprises three layers: an input layer, a hidden layer and an output layer, with an input dimension of 6 × 1, 100 hidden-layer neurons and a time step of 10.
LSTM network parameters are set as follows:
Layer1: LSTM(input_size=8, hidden_size=100, num_layers=2)
Layer2: Dropout(p=0.5)
Layer3: Linear(hidden_size=500, n_class=1)
where input_size represents the input data dimension, hidden_size represents the output dimension, num_layers represents the number of stacked LSTM layers (default 1), and n_class represents the output dimension of the LSTM network, where 1 indicates a single regression output value.
Experimental results
After the controller has been trained for 1000 iterations, the controller outputs the best model among 200 sampled GNN architectures. The results show that this optimization method can design a model with optimal performance relative to the original virtual screening model.
The structure of the optimal virtual screening model after optimization and screening is as follows:
GNN1 structure:
Layer1: molecular graph attention layer GAT1 (input dimension = (n × n), output dimension = 128, number of hidden-layer units = 128, number of attention heads = 4, activation function = elu(), aggregation function = sum()).
Layer2: molecular graph convolution layer GCN2 (input dimension = 128, output dimension = 256, number of hidden-layer units = 256, number of attention heads = 4, activation function = relu(), aggregation function = max()).
Layer3: molecular graph attention layer GAT3 (input dimension = 256, output dimension = 128, number of hidden-layer units = 128, number of attention heads = 8, activation function = elu(), aggregation function = avg()).
GNN2 structure:
Layer1: protein graph convolution layer GCN4 (input dimension = (m × m), output dimension = 64, number of hidden-layer units = 64, number of attention heads = 16, activation function = relu(), aggregation function = max()).
Layer2: protein graph attention layer GCN5 (input dimension = 64, output dimension = 256, number of hidden-layer units = 256, number of attention heads = 4, activation function = elu(), aggregation function = pooling()).
Layer3: protein graph convolution layer GCN6 (input dimension = 256, output dimension = 256, number of hidden-layer units = 256, number of attention heads = 16, activation function = relu(), aggregation function = max()).
Cross network: input dimension = 420, output dimension = 128, number of layers = 5.
Feature concatenation layer Concat1 (input dimensions = (128, 256, 128), output dimension = 512).
Dropout layer (512), ratio p = 0.5.
Full link layer: linear (512,128).
Full link layer: linear (128, 1).
After 300 training iterations the network reaches a good state, and the corresponding parameters are saved for the virtual screening of bacterial wilt quorum sensing lead compounds.
In summary, the invention provides a screening method for quorum sensing lead compounds based on the PhcA and PhcR protein structures, which can be used to discover new compounds with quorum sensing activity and provides new ideas and means for controlling and preventing bacteria such as the bacterial wilt pathogen. The method combines computer-aided virtual screening technology and molecular biology technology, and can efficiently screen out compounds that bind to the PhcA and PhcR proteins, thereby discovering compounds with quorum sensing activity. The method has the advantages of simple operation, high efficiency, high speed and high screening accuracy, and can be widely applied in fields such as biomedicine and agriculture.
Notably, PhcA and PhcR in the present invention are two key proteins of the bacterial wilt pathogen, whose structure and function play an important role in bacterial quorum sensing. However, other bacteria may carry different quorum sensing proteins, so screening and research based on different bacterial species are required in order to find quorum sensing lead compounds suitable for different bacteria.
The foregoing has shown and described the basic principles, principal features and advantages of the invention. Those skilled in the art should understand that the above embodiments do not limit the scope of the present invention in any way, and all technical solutions obtained by equivalent substitution and the like fall within the scope of the present invention.
Parts of the invention not described in detail are the same as the prior art or can be implemented using the prior art.