Disclosure of Invention
The invention aims to provide a targeted virtual screening method for quorum sensing lead compounds against bacterial wilt. The method automatically searches for an optimal graph structure based on reinforcement learning, improving the performance of the graph-network virtual screening model for quorum sensing lead compounds. It is a screening method for quorum sensing lead compounds based on the PhcA and PhcR protein structures, can be used to discover new compounds with quorum sensing activity, and provides new ideas and means for controlling and preventing bacteria such as the bacterial wilt pathogen.
The technical scheme adopted by the invention is as follows:
A virtual screening method for quorum sensing lead compounds takes a molecular compound structure and a protein sequence as input, sends them to a preprocessing module to extract preliminary features, and feeds the preliminary features into a prediction model network whose structure and parameters are generated through training of an LSTM controller. The specific flow is as follows:
The input molecular compound structure is preprocessed to construct a molecular adjacency matrix, which is fed into the GNN1 network to generate compound features;
From the input protein sequence, the amino acid composition and the dipeptide frequencies are extracted and combined into a preliminary protein feature vector, which is fed into a cross network to generate cross-fusion features;
Finally, the three feature sets are combined and fed into the fully connected layers to predict the affinity value.
Further, let the constructed molecular adjacency matrix be X1; its element value is 1 for atoms that are adjacent in the molecular structure graph and 0 for non-adjacent atoms. The size of the molecular adjacency matrix is (n × n), where n is the number of nodes in the structure graph, i.e. the total number of atoms.
Further, the Pconsc software is used to process the protein sequence and output a probability matrix of whether each residue pair is in contact, of size m × m, where m is the number of residues. Values greater than 0.5 in the matrix are retained and all other values are set to 0; the filtered matrix is the protein contact map X2.
Further, the amino acid composition is the frequency of occurrence of each of the 20 amino acids in the sequence, and the dipeptide frequency is the frequency of occurrence of each possible amino acid pair in the sequence.
Further, the prediction model network consists of the GNN1 network, the GNN2 network and the cross network connected in parallel; their outputs are combined and fed into a concatenation layer and a Dropout layer, followed by two fully connected layers.
Further, the cross network is formed by connecting 5 cross layers in series, followed by a 128-dimensional fully connected layer whose output is f3. Each cross layer follows the formula:
C(l+1) = C0 · Cl^T · Wc,l + bc,l + Cl
where l = 1, 2, ..., 5; Cl and C(l+1) are the outputs of the l-th and (l+1)-th cross layers, respectively; C0 is the combination of the amino acid composition and dipeptide frequencies X3; and Wc,l and bc,l are the connection parameters between the two layers. All variables in the formula are column vectors. The output of each layer is the output of the previous layer plus the feature crossing.
Further, the molecular adjacency matrix and the protein contact map are input into two different networks, GNN1 and GNN2, respectively, each consisting of 3 GNN layers. The output features of the two GNN networks are f1 and f2; these are concatenated with the cross-fusion features as f1 + f2 + f3 to obtain the overall features of the corresponding small molecule-protein pair for prediction. The overall features are then fed into a fully connected layer with output dimension 128, and then into a second fully connected layer with output dimension 1, i.e. the affinity value predicted by the network.
Furthermore, the virtual screening network model is optimized by an LSTM controller. Model optimization uses reinforcement learning, within a determined parameter space, to obtain the optimal structural parameters of the two GNNs and the parameters of the other neurons in the whole network. The GNN structure M requires several parameters to be determined: a sampling function (S), a correlation metric function (Att), an aggregation function (Agg), a multi-head attention number (K), an output hidden embedding (Dim) and an activation function (Act).
Further, the optimization consists of two steps. First, the LSTM predicts the operations corresponding to [S, Att, Agg, Act, K, Dim] for GNN1; each prediction is performed by a softmax classifier of the LSTM, and the predicted value is fed as input to the next time step to obtain the next parameter prediction. When the number of GNN layers reaches 3, the LSTM controller completes one architecture generation; the process is repeated to generate the parameters of GNN2, and the whole prediction network is constructed and trained to obtain the weight parameters of the GNN networks and the other network layers. Second, based on the accuracy obtained after network training, the parameters of the LSTM are optimized by reinforcement learning to obtain an optimal controller model. The two steps are executed alternately for a certain number of iterations to obtain the final screening network model.
The method can be applied to the virtual screening of bacterial wilt quorum sensing lead compounds.
The method has two main advantages. First, it extracts the molecular adjacency matrix, the protein contact map and the protein-sequence cross-fusion features to form multidimensional features that better reflect the characteristics of molecules and proteins. Second, it uses reinforcement learning to optimize the model structure, avoiding the previous selection of model parameters by experience or extensive manual effort. With refinement and popularization, the method can effectively perform virtual screening of lead compounds, and it has broad prospects and notable significance.
Detailed Description
The present invention will be described in detail with reference to the accompanying drawings.
As shown in fig. 1, a method for virtually screening quorum sensing lead compounds is established based on the PhcA and PhcR protein structures by combining computer-aided virtual screening technology, molecular biology technology and the like. The method is realized through a virtual screening model whose inputs are a molecular compound structure and a protein sequence; these are sent to a preprocessing module to extract preliminary features, which are fed into a prediction model network whose structure and parameters are generated through training of an LSTM controller. The basic flow is as follows: the molecular formula of the compound is input, the molecular adjacency matrix is constructed, and the matrix is fed into the GNN1 subnetwork to generate compound features. The protein sequence is input, and the amino acid composition and dipeptide frequencies are extracted and combined into a preliminary protein feature vector, which is fed into the cross network to produce cross-fusion features; at the same time, the corresponding contact map is generated from the protein sequence and fed into the GNN2 network to produce protein sequence features. Finally, the three feature sets are combined and fed into the fully connected layers, which predict and output a single value, the affinity value. The following describes the construction of the molecular adjacency matrix, the protein contact map and the generation of the protein-sequence cross-fusion features, as well as the optimization of the GNN structure and the other parameters of the prediction network by the LSTM controller.
1. Data preprocessing
(1) Construction of molecular adjacency matrix
The molecules in the dataset are represented in SMILES format. A molecular graph is constructed from a drug SMILES string, using atoms as nodes and bonds as edges. The process of constructing the molecular graph structure is shown in FIG. 2. Let the constructed molecular adjacency matrix be X1; the element value is 1 for adjacent atoms and 0 for non-adjacent atoms, and the size is (n × n), where n is the number of nodes in the graph, i.e. the total number of atoms.
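A minimal sketch of this construction, assuming the atoms and bonds have already been parsed from the SMILES string (the `bonds` list of atom-index pairs is a hypothetical input; in practice a cheminformatics toolkit such as RDKit would supply it):

```python
def build_adjacency(n_atoms, bonds):
    """Build the (n x n) molecular adjacency matrix X1.

    n_atoms -- number of atoms, i.e. graph nodes.
    bonds   -- list of (i, j) atom-index pairs, one per chemical bond.
    Adjacent atoms get element value 1, non-adjacent atoms get 0.
    """
    X1 = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        X1[i][j] = 1
        X1[j][i] = 1  # chemical bonds are undirected, so X1 is symmetric
    return X1

# Ethanol (SMILES "CCO"): three heavy atoms C-C-O, bonds (0,1) and (1,2)
adj = build_adjacency(3, [(0, 1), (1, 2)])  # [[0,1,0],[1,0,1],[0,1,0]]
```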
(2) Protein contact map
A protein contact map is a graphical representation of the contacts and interactions between residues within a protein, and is used to describe protein structure and function. The Pconsc software is used to process the protein sequence and output a probability matrix of whether each residue pair is in contact, of size m × m. Values greater than 0.5 in the matrix are retained, all other values are set to 0, and the filtered matrix is the protein contact map X2.
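A minimal sketch of the thresholding step, with a toy 2 × 2 probability matrix standing in for Pconsc's m × m output:

```python
def contact_map(prob_matrix, threshold=0.5):
    """Filter an m x m residue-contact probability matrix into X2.

    Values greater than the threshold are kept; all others are set to 0.
    """
    return [[p if p > threshold else 0 for p in row] for row in prob_matrix]

probs = [[0.9, 0.3], [0.3, 0.7]]
X2 = contact_map(probs)  # [[0.9, 0], [0, 0.7]]
```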
(3) Protein composition characterization
The protein composition feature X3 is the combination of the amino acid composition and the dipeptide frequencies, with a size of 420 dimensions. The amino acid composition is the frequency of occurrence of each of the 20 amino acids in the sequence. The dipeptide frequency is the frequency of occurrence of each amino acid pair; since protein sequences are built from 20 amino acids, there are 400 possible dipeptides, giving 20 + 400 = 420 dimensions in total.
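A self-contained sketch of how the 420-dimensional feature X3 can be computed from a sequence (the toy sequence is illustrative):

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def composition_features(seq):
    """420-dim feature X3: 20 amino-acid frequencies + 400 dipeptide frequencies."""
    # amino acid composition: occurrence frequency of each residue type
    aac = [seq.count(a) / len(seq) for a in AMINO_ACIDS]
    # dipeptide frequencies: occurrence frequency of each of the 20*20 pairs
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    n_dip = len(seq) - 1  # number of overlapping dipeptides in the sequence
    dipep = [sum(1 for i in range(n_dip) if seq[i:i + 2] == p) / n_dip
             for p in pairs]
    return aac + dipep

x3 = composition_features("ACDAC")  # toy 5-residue sequence
```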
2. Prediction model architecture
(1) Cross network
The cross network takes X3 as input. The network consists of 5 cross layers connected in series, followed by a 128-dimensional fully connected layer whose output is f3. Each cross layer follows the formula:
C(l+1) = C0 · Cl^T · Wc,l + bc,l + Cl
where l = 1, 2, ..., 5; Cl and C(l+1) are the outputs of the l-th and (l+1)-th cross layers, respectively; C0 is X3; and Wc,l and bc,l are the connection parameters between the two layers. All variables in the equation are column vectors. The output of each layer is the output of the previous layer plus the feature crossing.
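A minimal pure-Python sketch of the cross layer and its 5-layer stack (function names are illustrative; in practice Wc,l and bc,l would be trainable parameters in a deep learning framework). Since Cl^T · Wc,l is a scalar, the rank-one product C0 · Cl^T · Wc,l reduces to scaling C0 by that scalar:

```python
def cross_layer(c0, cl, w, b):
    """One cross layer: C(l+1) = C0 * (Cl^T W) + b + Cl, all d-dim vectors."""
    s = sum(ci * wi for ci, wi in zip(cl, w))  # scalar Cl^T W
    return [c0i * s + bi + cli for c0i, bi, cli in zip(c0, b, cl)]

def cross_network(x3, weights, biases):
    """Stack of 5 cross layers; C0 = X3 enters every layer."""
    c = x3
    for w, b in zip(weights, biases):
        c = cross_layer(x3, c, w, b)
    return c

# Toy 2-dim example with 5 identical layers (real input is 420-dim)
out = cross_network([1.0, 2.0], [[0.5, 0.5]] * 5, [[0.0, 0.0]] * 5)
```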
(2) Affinity prediction network overall structure
The prediction model consists of two GNN networks (GNN1 and GNN2) and a cross network connected in parallel; their outputs are combined and fed into a concatenation layer and a Dropout layer, followed by two fully connected layers. The molecular adjacency matrix of the drug molecule and the protein contact map are input into the two different networks GNN1 and GNN2, each consisting of 3 GNN layers. The output features of the two GNN networks are f1 and f2; after concatenation with the cross-fusion features, f1 + f2 + f3 gives the overall features of the corresponding small molecule-protein pair for prediction. These are then fed into one fully connected layer with output dimension 128, and then into a second fully connected layer with output dimension 1, i.e. the affinity value predicted by the network. The specific structural parameters of the GNN layers and the other network layers are obtained by the network model optimization training process described below.
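The post-concatenation part of the network can be sketched in pure Python as follows. The ReLU on the 128-dimensional hidden layer is an assumption (the source does not name its activation), and the toy weights and dimensions are purely illustrative, not trained parameters:

```python
def linear(x, w, b):
    """Fully connected layer: y_j = sum_i x_i * w[i][j] + b[j]."""
    return [sum(xi * wij for xi, wij in zip(x, col)) + bj
            for col, bj in zip(zip(*w), b)]

def predict_affinity(f1, f2, f3, params):
    """Concatenate f1 (GNN1), f2 (GNN2), f3 (cross net), then FC-128 and FC-1.

    The Dropout layer is active only during training and is omitted here.
    """
    feat = f1 + f2 + f3  # "+" on Python lists is concatenation (splicing layer)
    h = [max(0.0, v) for v in linear(feat, params["w1"], params["b1"])]
    out = linear(h, params["w2"], params["b2"])
    return out[0]  # single predicted affinity value

# Toy dimensions for illustration (the real model uses 128/256/128 -> 512 -> 128 -> 1)
params = {
    "w1": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], "b1": [0.0, 0.0],
    "w2": [[1.0], [1.0]], "b2": [0.5],
}
score = predict_affinity([1.0], [2.0], [3.0], params)
```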
(3) LSTM controller implementing virtual screening network model optimization
Model optimization uses reinforcement learning, within a determined parameter space, to obtain the optimal structural parameters of the two GNNs and the parameters of the other neurons in the whole network. The GNN structure M requires several parameters to be determined: a sampling function (S), a correlation metric function (Att), an aggregation function (Agg), a multi-head attention number (K), an output hidden embedding (Dim) and an activation function (Act).
The specific description and the corresponding parameter values of each parameter are as follows:
1. Output hidden embedding (Dim). Dim is the output dimension of each GNN layer and is an integer value.
2. Sampling function (S). Each GNN layer requires a sampling function. Sampling is used in the neural network to select the receptive field for a given target node. Three methods are used: (a) fixed neighbor-number sampling, (b) importance sampling, and (c) first-order neighbor ordering.
3. Correlation metric function (Att) and multi-head attention number (K). For each GNN layer, an Att method and a multi-head attention number K are chosen. Att can be one of two metric functions, GAT and GCN, which correspond to the GAT and GCN network structures respectively. The GCN network includes 2 graph convolution layers, 1 ReLU activation function layer and 1 Dropout layer. The GAT network includes K graph multi-head attention layers, 1 Softmax activation function layer and 1 Dropout layer. GAT assigns neighborhood importance using an attention layer, while GCN assigns neighborhood importance according to node degree.
4. Aggregation function (Agg). Each GNN layer requires an Agg aggregation. The optional aggregation functions are (a) sum aggregator, (b) mean aggregator and (c) pooling aggregator.
5. Activation function (Act). Each GNN layer uses an Act activation function. The optional activation functions include (a) ReLU, (b) LeakyReLU, (c) ELU, (d) Linear and (e) Softmax. The activation function increases the nonlinear fitting capacity of the graph network and the expressive capacity of the model.
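The candidate activation functions listed above can be implemented minimally as follows; the LeakyReLU slope of 0.01 is an assumption (the source does not specify it):

```python
import math

# Candidate activation functions in the Act search space
ACTIVATIONS = {
    "relu":       lambda x: max(0.0, x),
    "leaky_relu": lambda x: x if x > 0 else 0.01 * x,      # slope 0.01 assumed
    "elu":        lambda x: x if x > 0 else math.exp(x) - 1.0,
    "linear":     lambda x: x,
}

def softmax(xs):
    """Softmax acts on a whole vector rather than a single scalar."""
    exps = [math.exp(x - max(xs)) for x in xs]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]
```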
The network model optimization trains the network with an LSTM controller neural network and consists of two steps. First, the LSTM predicts the operations corresponding to [S, Att, Agg, Act, K, Dim] for GNN1; each prediction is performed by a softmax classifier of the LSTM, and the predicted value is fed as input to the next time step to obtain the next parameter prediction. When the number of GNN layers reaches 3, the LSTM controller completes one architecture generation, and the process is repeated to generate the parameters of GNN2. The whole prediction network is then constructed and trained to obtain the weight parameters of the GNN networks and the other network layers. Second, based on the accuracy obtained after network training, the LSTM parameters are optimized by reinforcement learning to obtain an improved controller model. The two steps are executed alternately for a certain number of iterations to obtain the final screening network model.
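The controller's sampling step can be sketched as follows. The candidate value lists for K and Dim are illustrative assumptions (the source fixes only the six parameter names), and `random.choice` stands in for the LSTM's learned softmax classifier, which in the real method conditions each choice on the previously predicted parameters:

```python
import random

# Search space for one GNN layer: [S, Att, Agg, Act, K, Dim]
SEARCH_SPACE = {
    "S":   ["fixed_neighbors", "importance", "first_order_rank"],
    "Att": ["gat", "gcn"],
    "Agg": ["sum", "mean", "pool"],
    "Act": ["relu", "leaky_relu", "elu", "linear", "softmax"],
    "K":   [1, 4, 8, 16],        # assumed candidate head counts
    "Dim": [64, 128, 256],       # assumed candidate output dimensions
}

def sample_architecture(n_layers=3, rng=random):
    """Sample one 3-layer GNN architecture from the parameter space."""
    return [{name: rng.choice(opts) for name, opts in SEARCH_SPACE.items()}
            for _ in range(n_layers)]

gnn1_arch = sample_architecture()  # one architecture generation for GNN1
gnn2_arch = sample_architecture()  # repeated for GNN2
```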
The training of the model is further described below:
Model training uses the public KIBA dataset, which includes 229 unique proteins and 2111 unique drugs, with 118254 protein-drug affinity interaction pairs. In this training method, the dataset is divided into a training set, a validation set and a test set in the ratio 80% : 10% : 10%.
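The 80/10/10 split can be sketched as follows; the shuffle and the fixed seed are illustrative assumptions, since the source specifies only the ratio:

```python
import random

def split_dataset(n_pairs, seed=0):
    """Shuffle pair indices and split 80% / 10% / 10% into train/val/test."""
    idx = list(range(n_pairs))
    random.Random(seed).shuffle(idx)  # seeded shuffle for reproducibility
    n_train = int(0.8 * n_pairs)
    n_val = int(0.1 * n_pairs)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# 118254 affinity pairs in KIBA
train, val, test = split_dataset(118254)
```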
Training parameters: the controller is an LSTM network with 100 hidden units, trained using the ADAM optimizer with a learning rate of 0.0035. The controller samples graph network structures to generate a sub-model, which is trained for 200 epochs. During training, L2 regularization with λ = 0.0005 is applied. Furthermore, Dropout with p = 0.5 is applied to the inputs of both layers and to the normalized attention layer.
The LSTM comprises three layers: an input layer, a hidden layer and an output layer, with an input dimension of 6 × 1, 100 hidden-layer neurons and a time step of 10.
LSTM network parameters are set as follows:
Layer1: LSTM(input_size=8, hidden_size=100, num_layers=2)
Layer2: Dropout(p=0.5)
Layer3: Linear(hidden_size=500, n_class=1)
where input_size represents the input data dimension, hidden_size represents the output dimension, num_layers represents the number of stacked LSTM layers (default 1), and n_class represents the output dimension of the LSTM network, where 1 indicates a single regression output value.
Experimental results
After the controller has been trained for 1000 iterations, the controller outputs the best model among 200 sampled GNN architectures. The results show that this optimization method can design a model with optimal performance relative to the original virtual screening model.
The structure of the optimal virtual screening model after optimization and screening is as follows:
GNN1 structure:
Layer1: molecular graph attention layer GAT1 (input dimension = (n × n), output dimension = 128, number of hidden-layer units = 128, number of attention heads = 4, activation function = elu(), aggregation function = sum()).
Layer2: molecular graph convolution layer GCN2 (input dimension = 128, output dimension = 256, number of hidden-layer units = 256, number of attention heads = 4, activation function = relu(), aggregation function = max()).
Layer3: molecular graph attention layer GAT3 (input dimension = 256, output dimension = 128, number of hidden-layer units = 128, number of attention heads = 8, activation function = elu(), aggregation function = avg()).
GNN2 structure:
Layer1: protein graph convolution layer GCN4 (input dimension = (m × m), output dimension = 64, number of hidden-layer units = 64, number of attention heads = 16, activation function = relu(), aggregation function = max()).
Layer2: protein graph attention layer GCN5 (input dimension = 64, output dimension = 256, number of hidden-layer units = 256, number of attention heads = 4, activation function = elu(), aggregation function = pooling()).
Layer3: protein graph convolution layer GCN6 (input dimension = 256, output dimension = 256, number of hidden-layer units = 256, number of attention heads = 16, activation function = relu(), aggregation function = max()).
Cross network: input dimension = 420, output dimension = 128, number of layers = 5.
Feature concatenation layer Concat1 (input dimensions = (128, 256, 128), output dimension = 512).
Dropout layer (512), ratio p = 0.5.
Full link layer: linear (512,128).
Full link layer: linear (128, 1).
After 300 training iterations the network reaches a good state, and the corresponding parameters are saved for the virtual screening of bacterial wilt quorum sensing lead compounds.
In summary, the invention provides a screening method for quorum sensing lead compounds based on the PhcA and PhcR protein structures, which can be used to discover new compounds with quorum sensing activity and provides new ideas and means for controlling and preventing bacteria such as the bacterial wilt pathogen. The method combines computer-aided virtual screening technology and molecular biology technology, and can efficiently screen out compounds that bind to the PhcA and PhcR proteins, thereby discovering compounds with quorum sensing activity. The method has the advantages of simple operation, high efficiency, high speed and high screening accuracy, and can be widely applied in fields such as biomedicine and agriculture.
Notably, PhcA and PhcR in the present invention are two key proteins of the bacterial wilt pathogen, whose structure and function play an important role in bacterial quorum sensing. However, other bacteria may carry different quorum sensing proteins, so screening and research based on different bacterial species are required in order to find quorum sensing lead compounds suitable for different bacteria.
The foregoing has shown and described the basic principles, principal features and advantages of the invention. Those skilled in the art should understand that the above embodiments do not limit the scope of the present invention in any way, and all technical solutions obtained by equivalent substitution and the like fall within the scope of the present invention.
Parts of the invention not described in detail are the same as the prior art or can be implemented using the prior art.