
GB2584588A - A computer-implemented method of training a graph neural network - Google Patents


Info

Publication number
GB2584588A
GB2584588A GB1821192.0A GB201821192A
Authority
GB
United Kingdom
Prior art keywords
node
graph
neural network
parameters
parameter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1821192.0A
Other versions
GB201821192D0 (en)
Inventor
Filippo Federici Canova
David Z Gao
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanolayers Res Computing Ltd
Original Assignee
Nanolayers Res Computing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanolayers Res Computing Ltd filed Critical Nanolayers Res Computing Ltd
Priority to GB1821192.0A priority Critical patent/GB2584588A/en
Publication of GB201821192D0 publication Critical patent/GB201821192D0/en
Publication of GB2584588A publication Critical patent/GB2584588A/en

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16CCOMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/30Prediction of properties of chemical compounds, compositions or mixtures
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/086Learning methods using evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16CCOMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70Machine learning, data mining or chemometrics
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16CCOMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C10/00Computational theoretical chemistry, i.e. ICT specially adapted for theoretical aspects of quantum chemistry, molecular mechanics, molecular dynamics or the like
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16CCOMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/80Data visualisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Physiology (AREA)
  • Evolutionary Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Computer-implemented method for training a graph neural network, comprising during a first phase: using a genetic algorithm to optimise S220 an initial population of parameters relating to the graph neural network, the parameters comprising a set of graph parameters and neural network parameters; and outputting S240 an optimised set of graph parameters; and during a second phase: setting S250 the set of graph parameters to be constant based on the optimised set of graph parameters output during the first phase; for each input graph in a training dataset, calculating S260 node states using the constant set of graph parameters, the node states describing a state of each node in the input graph; and using S270 a learning algorithm to learn the neural network parameters that cause the neural network to map the node states to known outputs of the training dataset. When the input graph is a molecule graph, the graph neural network is configured to output an estimated atomic charge population for at least one atom in the molecule graph.

Description

A Computer-implemented Method of Training a Graph Neural Network
Field of the Invention
The invention relates to a computer-implemented method of training a graph neural network. In particular, but not exclusively, the invention relates to training an electronic graph predictor for approximating an atomic charge population in a molecule graph.
Background of the Invention
Data is often represented using graphical structures. Information regarding relationships among data in science and engineering (for example in computer vision, molecular chemistry and biology, pattern recognition, data mining) can be represented graphically.
Whilst many graphs are simple structures involving single nodes, information may also be contained within complex graphical structures including sequences or trees.
Existing approaches aim to simplify the information contained within graphical structures by using a pre-processing stage, which transforms the graphs into a set of flat vectors.
However, important topological information may be lost during the pre-processing stage and the achieved results depend, potentially unpredictably, on the algorithm used for the pre-processing stage.
A type of neural network model known as a graph neural network has been proposed with the capability to process inputs in the form of graphs. An input graph consists of one or more nodes connected to each other by edges, thereby defining a graph topology. The nodes represent objects and are described by a set of numerical properties. Edges represent interactions between the objects and are also described by a set of numerical properties. A connection list would normally describe the graph topology. The graph neural network has a set of internal parameters, which are used to calculate a desired output property of each node of the input graph, using the node numerical properties, edge numerical properties, and graph topology. A numerical node state vector is used for each node to encode information on the node and its interactions with neighbouring nodes. Once the node state vectors are available, the graph neural network is designed to calculate a node output from the node state and the node's numerical properties, taken as inputs.
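By way of illustration only, the graph structure described above (nodes with numerical properties, edges between node indices, and a connection list describing the topology) can be sketched in a few lines of code; the class and property values below are illustrative placeholders, not part of the claimed method:

```python
# Minimal illustrative representation of an input graph: nodes carry
# numerical properties, edges connect node indices, and the topology is
# held as a connection list. All names here are illustrative only.

class Graph:
    def __init__(self, node_properties, edges):
        self.node_properties = node_properties  # per-node property vectors
        self.edges = edges                      # list of (i, j) index pairs
        # Build the connection list describing the graph topology.
        self.neighbours = {i: [] for i in range(len(node_properties))}
        for i, j in edges:
            self.neighbours[i].append(j)
            self.neighbours[j].append(i)

# Example: three nodes connected by two edges (a simple chain).
g = Graph(node_properties=[[1.0], [6.0], [8.0]], edges=[(0, 1), (1, 2)])
```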
However, the numerical properties describing the nodes and the edges, together with the internal parameter set of the graph neural network, need to be learned in order for the node outputs to be calculated accurately. A known approach uses a learning algorithm to train the numerical properties describing the nodes and the edges and the internal parameter set of the graph neural network, so as to maximise the accuracy of calculated node outputs for a large sample of input graphs with known outputs. A comprehensive discussion of graph neural networks can be found in a paper by F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner and G. Monfardini, "The Graph Neural Network Model," in IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61-80, Jan. 2009, the contents of which are incorporated herein by reference.
Graph neural networks may be utilised in the field of organic chemistry to predict an atomic charge population of an atom to describe reaction pathways between atoms such that, for example, the electronic properties of materials may be determined. The nodes of a graph represent atoms and the interaction (chemical bonds) between neighbouring atoms is given by the edges of the graph.
As mentioned above, in the graph neural network prior art, gradient-based learning algorithms are applied to train all parts of the graph neural network simultaneously (i.e. the numerical properties describing the nodes and the edges, and the internal parameter set of the graph neural network). However, this approach has been found to be particularly difficult to implement and prone to numerical instabilities. In order to learn complex behaviour from large amounts of data, neural networks require many layers, each containing multiple neurons. The effect of any minuscule change to parameters in the initial layers cascades through the neural network, changing the output significantly. On the other hand, small changes to parameters close to the output layer will naturally have a small impact. Thus, training via gradient-based learning algorithms is not effective for large and deep neural networks, and an alternative and improved method is desired.
Furthermore, the calculation of electronic properties of materials is known to be computationally demanding and time consuming, and is usually calculated using quantum mechanical simulation software. This approach requires large computational resources and a knowledgeable expert to implement the software, incurring considerable research and development costs. Whilst a skilled and experienced chemist may be able to approximate an atomic charge population of an atom in their head, they are likely to develop a more accurate picture of an atom or molecule by using a tool capable of quickly and accurately predicting the atomic charge populations of interacting atoms and molecules, and could use the output of such a tool to subsequently design chemical reactions. Additionally, a student may find such a tool to be a valuable resource, akin to a calculator, to support their studies.
In practice for graph neural networks, node state vectors depend on neighbouring node states and can only be calculated self-consistently with an iterative procedure, until all of the node state vectors converge within a desired tolerance. The inventors have discovered through preliminary tests that several iterations may be necessary to converge the node state vectors, depending on the complexity of the chemical environment, which amplifies the above-mentioned issues with gradient descent approaches even further at each iteration. Gradient descent methods for optimisation problems are also known to fall into local minima where the gradients of the error with respect to the internal parameters become negligible, indicating the training has converged, but the solution found is not optimal.
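The self-consistent procedure described above can be illustrated as a generic fixed-point iteration. This is a toy sketch only: the update rule, tolerance and scalar states below are placeholders chosen so that the map is a contraction, and are not the patent's actual node-state equations:

```python
def converge_states(update, initial_states, tol=1e-8, max_iter=1000):
    """Iterate a state update until no state changes by more than `tol`
    between successive sweeps (illustrative fixed-point iteration)."""
    states = list(initial_states)
    for _ in range(max_iter):
        new_states = update(states)
        delta = max(abs(a - b) for a, b in zip(new_states, states))
        states = new_states
        if delta < tol:
            break
    return states

# Toy contraction map on scalar states: each node moves halfway toward
# the mean of its neighbours' states, so the iteration converges.
neigh = {0: [1], 1: [0, 2], 2: [1]}

def update(states):
    return [0.5 * states[i]
            + 0.5 * sum(states[j] for j in neigh[i]) / len(neigh[i])
            for i in range(len(states))]

final = converge_states(update, [0.0, 1.0, 2.0])
```

For this symmetric three-node chain the states converge to a common value, illustrating why the number of iterations needed depends on how quickly the particular map contracts.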
It would therefore be desirable to develop an alternative and improved method of training a graph neural network. In particular, an aim is to provide a reliable and quick method capable of quickly and accurately training a graph neural network. Such a training technique would provide a graph neural network capable of, for example, quickly and accurately estimating the atomic charge populations of interacting atoms and molecules, which may then be used as a tool to design chemical reactions with an acceptable degree of accuracy.
Summary of the Invention
According to one aspect of the invention, there is provided a computer-implemented method for training a graph neural network. The graph neural network is configured to process a graph and, when the graph is a molecule graph, is configured to output an estimated atomic charge population for at least one atom in the molecule graph. The computer-implemented method comprises a first and a second phase.
In the first phase, a learning algorithm such as a genetic algorithm is used to optimise an initial population of parameters relating to the graph neural network, until the genetic algorithm population reaches a desired fitness level. The desired fitness level may occur when the genetic algorithm population ceases to increase its accuracy over several generations by a noticeable amount. The parameters relating to the graph neural network comprise a set of graph parameters and neural network parameters. The first phase also comprises outputting a set of optimised graph parameters. In principle, gradient descent methods or other stochastic methods (such as Monte Carlo algorithms), may be used as useful alternatives to the genetic algorithm, and this disclosure is intended to cover those alternatives. However, as mentioned, training with gradient-descent methods is not effective for large and deep neural networks. Genetic algorithms generally converge faster than Monte Carlo algorithms, and are trivially parallelised on large computer clusters with excellent scaling. Therefore, the better solution is to use a genetic algorithm.
In the second phase, the method comprises creating a set of constant graph parameters based on the set of optimised graph parameters, and, for each input graph in a training dataset, calculating node states using the set of constant graph parameters, wherein the node states describe a state of each node in the graph. In principle, the set of optimised graph parameters outputted in the first phase need not necessarily be a complete set of optimised graph parameters used as inputs in the second phase. That is to say that one or more parameters of the set may instead be optimised in the second phase. The second phase of the method further comprises using a learning algorithm to learn the neural network parameters that cause the neural network to map the node states to known outputs of the training dataset.
As mentioned already, the graph neural network prior art discusses applying gradient-based learning algorithms to train every one of its parts simultaneously; the inventors have found this approach to be particularly difficult to implement and prone to numerical stability issues. The disclosed two-phase method of training a graph neural network provides an easier-to-implement training method which is more stable (i.e. less prone to numerical issues) and which is also accurate. By appropriately choosing the learning algorithm, the computer-implemented method provides for faster training of the graph neural network, while still allowing accurate results to be determined from unknown input graphs. In a particular implementation used on input molecule graphs, the trained graph neural network can be used to find an approximation of the atomic charge population of a given atom or plurality of atoms defined by the molecule graph.
The input graph associated with the computer-implemented method may comprise a single node or a plurality of nodes. The nodes represent objects and may be described by a set of numerical properties. The input graph may also comprise at least one edge, wherein each edge defines an interaction between two nodes of the graph, and may also be described by a set of numerical properties. For example, an input graph may contain: a single node and no edges; two nodes connected by an edge; three nodes connected by two or three edges; and so on and so forth. A topology of the input graph may be described by a list of connections between nodes. The computer-implemented method may be implemented flexibly on input graphs with different properties.
The set of graph parameters may comprise a bias parameter to represent each node in the graph neural network. The bias parameter is a vector quantity of a chosen size, and signifies the likelihood of a node reacting with another neighbouring node. The bias vector could, in principle, be of any size, based on a trade-off between an amount of information encoded versus computational resource availability. In one example embodiment, the bias parameter is a column matrix comprising 8 elements.
The set of graph parameters may also comprise an interaction parameter, wherein the interaction parameter represents an interaction between any two connected nodes in the graph. The interaction parameter may be a matrix, and have a dimension which matches a dimension of the bias parameter. In one example embodiment, the interaction parameter is a square matrix with 64 elements, however the interaction parameter could, in principle be of any size that relates to the size of the bias parameter.
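As an illustration of the parameter shapes in the example embodiment above (an 8-element bias vector and a matching 8 x 8 interaction matrix), the graph parameters might be sketched as follows; the function names and random initialisation are illustrative assumptions, not part of the patent:

```python
import random

STATE_SIZE = 8  # size of the bias vector / node state in the example embodiment

def random_bias():
    """Illustrative bias parameter: a vector of STATE_SIZE elements."""
    return [random.uniform(-1, 1) for _ in range(STATE_SIZE)]

def random_interaction():
    """Illustrative interaction parameter: a STATE_SIZE x STATE_SIZE matrix
    (64 elements in the example embodiment), matching the bias dimension."""
    return [[random.uniform(-1, 1) for _ in range(STATE_SIZE)]
            for _ in range(STATE_SIZE)]

b = random_bias()
A = random_interaction()
```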
The graph neural network comprises neurons, and the neural network parameters comprise connection weights between neurons, wherein each connection weight represents a relative importance of respective inputs (i.e. the graph parameters) to the output. The graph neural network may also comprise neuron biases. Optionally, the neuron biases of the graph neural network may be constrained to be zero in order to ensure a zero output on a zero input. When combined with an appropriate choice of interpolation function, constraining the neuron biases to zero ensures that predicted charge populations of isolated atoms match exactly with periodic table entries for those atoms.
The genetic algorithm is implemented in the first phase to optimise an initial population of parameters relating to the graph neural network until the genetic algorithm population reaches a desired fitness level, wherein the desired fitness level is reached when the genetic algorithm population reaches a slowly changing or steady value. Training an entire population of parameters is time-consuming and demanding on computing resources.
Mutations may occur that increase the fitness of the population, but waiting for such a mutation may not be worth any additional demand on computing resources. Hence, the evolution of the initial population of parameters may be stopped once the genetic algorithm population reaches a slowly changing or steady value, wherein the slowly changing or steady value occurs when the genetic algorithm population converges.
Small changes in the bias parameter and/or the interaction parameter cause large changes in the fitness of the genetic algorithm, making improvement through evolution slow and unstable. Therefore, the optimised graph parameters (including the bias and interaction parameters) outputted during the first phase are used to set constant graph parameters that are then used as inputs for the learning algorithm. The constant graph parameters are selected based on a best fitness output. A higher level of fitness implies a higher chance of selection, thereby steering the evolution towards more accurate parameters sets over time.
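A minimal sketch of the first-phase optimisation might look like the following. The tournament-style selection, uniform crossover, Gaussian mutation, toy fitness function and plateau-based stopping rule are all illustrative assumptions rather than the patent's specific genetic algorithm:

```python
import random

def evolve(fitness, genome_len, pop_size=30, patience=20, max_gen=500):
    """Minimal genetic algorithm sketch. Evolution stops once the best
    fitness has not improved for `patience` generations: a simple stand-in
    for the population reaching a slowly changing or steady value."""
    pop = [[random.uniform(-1, 1) for _ in range(genome_len)]
           for _ in range(pop_size)]
    best, best_fit, stale = None, float("-inf"), 0
    for _ in range(max_gen):
        scored = sorted(pop, key=fitness, reverse=True)
        top_fit = fitness(scored[0])
        if top_fit > best_fit + 1e-9:
            best, best_fit, stale = scored[0][:], top_fit, 0
        else:
            stale += 1
            if stale >= patience:
                break
        # Fitter individuals are more likely to be selected as parents.
        parents = scored[:pop_size // 2]
        pop = parents[:]
        while len(pop) < pop_size:
            a, b = random.sample(parents, 2)
            # Uniform crossover followed by Gaussian mutation.
            child = [(ai if random.random() < 0.5 else bi) + random.gauss(0, 0.1)
                     for ai, bi in zip(a, b)]
            pop.append(child)
    return best

# Toy fitness: higher is better as the genome approaches all-zeros.
best = evolve(lambda g: -sum(x * x for x in g), genome_len=4)
```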
The constant graph parameters are used to calculate node states for each node in the molecule graph. Each node has a scaling constant, the scaling constant comprising one or both of: a total number of nodes connected to that node; and a size of the node state.
An advantage of choosing the scaling constant to be the total number of nodes connected to that node is to ensure that the model is self-consistent (avoiding any mismatch in dimensions) and that the function is a contraction map. The size of the node state is related to the size of the bias parameters and the interaction parameters to ensure that the model is self-consistent.
The process of calculating node states using the constant graph parameters comprises calculating an intermediate node state for each node using one or more of: the bias parameter of that node; the interaction parameter from that node to each connected node; and the bias parameter for the connected node. The process of calculating an intermediate node state for each node may optionally comprise using the scaling constant. The intermediate node state may be rescaled using the scaling constant and unbiased using the node bias parameter of that node to determine the node state of that node. For a node that is not connected to any other node, calculating node states using the constant graph parameters comprises returning a zero vector for the node, in order to ensure that the output of disconnected nodes is calculated to a pre-set, known quantity.
The following equation is used to calculate the node state y_i for a given node i using the constant graph parameters:

y_i = (x_i − b_i) / (c_i · s)    (1)

In Equation 1, c_i is the total number of nodes connected to node i; s is the size of the node state; and b_i is the bias parameter for node i. The term x_i is an intermediate node state, and is further defined as:

x_i = b_i + Σ_{j ∈ N(i)} A_ij (y_j + b_j)    (2)

In Equation 2, x_i is the intermediate node state of the node i; b_i is the bias parameter for node i; N(i) is the set of nodes connected to node i; y_j is the node state of a node j connected to node i; b_j is the bias parameter for the connected node j; and A_ij is the interaction parameter between the two connected nodes i and j. Once the node states for all atoms in a training dataset containing known atoms are calculated, the neural networks for each atomic type can be trained independently.
Alternatively, calculating node states using the constant graph parameters can simply be performed using Equation 2 alone (i.e. with no unbiasing or rescaling).
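A sketch of this node-state calculation is given below, assuming an update of the form x_i = b_i + Σ_j A_ij (y_j + b_j) over connected nodes j, followed by y_i = (x_i − b_i)/(c_i · s); this assumed form, and all names in the code, are illustrative rather than the patent's verbatim equations:

```python
# Illustrative node-state calculation: intermediate state from the node's
# bias, the interaction matrices to its neighbours and the neighbours'
# biased states, then unbiased and rescaled by c_i * s. A zero vector is
# returned for a disconnected node (pre-set, known output).

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def node_state(i, states, bias, interactions, neighbours, s):
    if not neighbours[i]:
        return [0.0] * s                  # disconnected node
    c = len(neighbours[i])                # scaling-constant component c_i
    x = list(bias[i])                     # start from the node's own bias
    for j in neighbours[i]:
        contrib = matvec(interactions[(i, j)],
                         [yj + bj for yj, bj in zip(states[j], bias[j])])
        x = [xi + ci for xi, ci in zip(x, contrib)]
    # Unbias and rescale the intermediate state to obtain the node state.
    return [(xi - bi) / (c * s) for xi, bi in zip(x, bias[i])]

# Two connected nodes with 2-element states and identity interactions.
s = 2
bias = {0: [0.0, 0.0], 1: [0.5, 0.5]}
I2 = [[1.0, 0.0], [0.0, 1.0]]
interactions = {(0, 1): I2, (1, 0): I2}
neighbours = {0: [1], 1: [0]}
states = {0: [1.0, 1.0], 1: [1.5, 1.5]}
y0 = node_state(0, states, bias, interactions, neighbours, s)
```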
During the second phase, the learning algorithm maps the node states to known outputs of the training dataset based on a predetermined degree of accuracy. By predetermining the degree of accuracy, the user can decide a satisfactory degree of accuracy, taking into consideration other trade-off factors such as computational time and numerical stability.
The idea of having the second phase is that it is much quicker than the first phase, since the recurrent part of the graph neural network is now assumed to be satisfactory and kept constant (e.g. the node biases and interaction parameters). The node-state-to-output mapping can be done with smaller and simpler networks, so a genetic algorithm is not needed and faster gradient methods can be used reliably. This allows different neural networks to be tested quickly in an attempt to reach maximal accuracy. Accuracy itself can be defined in many ways. In the inventors' case, the mean relative error was used, since it gave better results than the mean absolute error, but that is not intended to limit the inventive concept. For an atomic charge predictor implementation, the inventors targeted an absolute error of the order of 0.01 electrons, and iteratively tested until they achieved that level of accuracy. However, they could not have known beforehand whether that level of accuracy was even possible.
The learning algorithm may be a gradient descent algorithm. An advantage of using a gradient descent algorithm is that gradient descent algorithms are much faster at providing solutions than genetic algorithms.
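The second-phase learning can be illustrated with a deliberately minimal gradient descent loop: a single linear neuron fitted to fixed node states and known targets. A real implementation would use a small feed-forward network, but this sketch shows why gradient methods become fast and reliable once the graph parameters are held constant (all names and values are illustrative):

```python
# Illustrative second-phase training: stochastic gradient descent on a
# single linear neuron mapping fixed node states to known outputs.

def train(node_states, targets, lr=0.1, epochs=500):
    w = [0.0] * len(node_states[0])
    for _ in range(epochs):
        for x, t in zip(node_states, targets):
            y = sum(wi * xi for wi, xi in zip(w, x))   # neuron output
            err = y - t                                 # prediction error
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

# Toy node states with a known linear relationship to the target output.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
targets = [0.5, -0.25, 0.25]
w = train(states, targets)
```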
The graph neural network may be configured to output a value that is one of: less than or equal to 1; greater than or equal to -1; between -1 and 1 inclusive; or within a sub-range between -1 and 1 inclusive. A bounded output ensures that the population tends towards convergence, and thus prevents the model from ballooning.
An interpolation function may be applied to the output of the graph neural network to map the output of the graph neural network to a plurality of predicted outputs of the graph.
The interpolation function may be a global interpolation function, or a piecewise interpolation function, which is defined by multiple sub-functions (having different conditions), providing further granularity.
The outputs may be a plurality of charge values, when the input graph is a molecule graph, and the charge values are predetermined. The outputs may be different for each atomic orbital of an atom.
The plurality of predicted charge values may be scaled by a normalisation factor. When calculating a system's total electronic population, small errors accumulate due to small prediction errors on each atomic component; therefore a normalisation factor is introduced for each predicted charge to ensure that the summed total is equal to the total electronic population. For example, if the total predicted charge is M, but should be N, then each predicted charge is scaled by N/M. In one implementation, the networks can be trained to output the correct charge. In another implementation, the scaling was added to make the results consistent with chemistry. In practice, the total charge from predictions is very close to the formal value for all molecules that were tested, within 1% error.
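The N/M rescaling described above can be written directly (the function name and charge values are illustrative):

```python
def normalise_charges(predicted, total):
    """Rescale each predicted charge by total/sum(predicted) so the summed
    population matches the known total electronic population."""
    m = sum(predicted)
    return [p * (total / m) for p in predicted]

# Predictions summing to 9.9 rescaled to a formal total of 10 electrons.
charges = normalise_charges([0.9, 2.2, 6.8], total=10.0)
```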
The graph neural network may comprise a feed-forward neural network, where the output from one layer is used as an input to feed the next layer, such that the information moves only in one direction. Feed-forward neural networks are easy to implement and train (e.g. with gradient-based methods), whilst still being universal function approximators. A recurrent network may also be used.
The graph neural network may comprise hyperbolic tangent activation, which is practical to implement. Alternatively, other activation functions may be suitable, such as rectifiers, provided that the interpolation function is adjusted accordingly. The choice of neural network is open. If a neural network is used to calculate node biases and interaction matrix elements, then a bound activation function is preferred, such as tanh, at least in the last layer. For the node states to output networks one could use any other activation function (even an unbound function like ReLU) and adjust the interpolation function accordingly.
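A minimal feed-forward pass with hyperbolic tangent activation and neuron biases constrained to zero illustrates the zero-input-gives-zero-output property mentioned above; the weights are arbitrary illustrative values:

```python
import math

def forward(x, layers):
    """Feed-forward pass with tanh activation and neuron biases constrained
    to zero, so a zero input is guaranteed to produce a zero output."""
    for W in layers:  # each layer is a weight matrix (rows = neurons)
        x = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]
    return x

layers = [[[0.5, -0.3], [0.2, 0.8]],   # 2 inputs -> 2 hidden neurons
          [[1.0, 1.0]]]                # 2 hidden -> 1 output
zero_out = forward([0.0, 0.0], layers)
```

Since tanh(0) = 0 and there are no bias terms, every layer maps a zero vector to a zero vector.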
In one useful example, the graph may be a molecule graph, and the graphs of the training dataset are also molecule graphs. Each node represents an atom in the molecule graph and is distinguished from other nodes by its atomic type (Z). The final node state of each node is used to compute an atomic charge population of a corresponding atom associated with each node of the molecule graph. In one example embodiment, the atom is one of: hydrogen, carbon, oxygen, nitrogen and fluorine. The final node state provides a user with additional information regarding s, p and d orbitals of a given atom, which is important when estimating the reactivity of that atom. Organic molecules, such as those containing hydrogen, carbon, oxygen, nitrogen and fluorine, are of interest because they appear in many fields of science and technology, including drug design and oils, and there is a publicly available database of organic molecule structures; however, other molecules could form the basis of the learning and provide a useful tool for a chemist.
Brief Description of the Drawings
The invention shall now be described, by way of example only, with reference to the accompanying drawings, in which:
Figure 1 is a block diagram which shows an example of how a graph neural network is trained using a genetic algorithm, a learning algorithm and a training dataset;
Figure 2 is a flowchart illustrating the process of optimising graph neural network parameters using the genetic algorithm, determining if the genetic algorithm has reached a desired fitness level, and using the learning algorithm to map node states;
Figure 3 is a flowchart illustrating the operation of the learning algorithm;
Figure 4 shows examples of nodes connected by edges;
Figure 5 illustrates a hierarchical structure of parameters of the graph neural network;
Figure 6 shows an example of a bias parameter;
Figure 7 shows an example of an interaction parameter;
Figure 8 is a graph showing an example of how the fitness level of the genetic algorithm evolves with each generation of the genetic algorithm, until the fitness reaches a slowly changing or steady state;
Figure 9 is a graphical representation of the output of the neural network as a function of predicted charge;
Figure 10 is a block diagram which shows how the neural network produces a predicted charge output, based on an input molecule graph; and
Figure 11 shows a regression plot for a number of predicted partial atomic charges (a), and a histogram of the error associated with the predicted partial atomic charges (b).
Figure 12 is a block diagram showing an example computer architecture configured to perform the computer-implemented method described herein.
Detailed Description
Embodiments of the invention will be now described with reference to the attached Figures. It is to be noted that the following description is merely used for enabling the skilled person to understand the invention, without any intention to limit the applicability of the invention to other embodiments which could be readily understood and/or envisaged by the reader.
Figure 1 is a block diagram which shows an example of a computer-implemented method for training a graph neural network. A training algorithm 100 acts on the graph neural network and comprises a genetic algorithm 110 and a learning algorithm 130. The graph neural network may be trained in two phases. In a first phase, the genetic algorithm 110 takes one or more inputs and outputs a set of optimised graph parameters. In a second phase, the learning algorithm 130 is fed with the optimised graph parameters output in the first phase, together with a training dataset 120, and outputs a set of neural network parameters.
Figure 2 is a flowchart illustrating a process of optimising graph neural network parameters using the genetic algorithm 110. At step S220, the genetic algorithm 110 optimises the graph neural network parameters, wherein the graph neural network parameters may include a set of graph parameters and neural network parameters. At step S230, a decision is made as to whether the genetic algorithm 110 has reached a desired fitness level. The desired fitness level occurs when the genetic algorithm population ceases to increase its accuracy over several generations by a noticeable amount. If the desired fitness level has been reached, the process proceeds to step S240, where the genetic algorithm 110 outputs optimised graph parameters. If the desired fitness level has not been reached, the process reverts back to step S220. At step S250, constant graph parameters are set based on the optimised graph parameters outputted by the genetic algorithm 110. At step S260, the constant graph parameters are used to calculate node states for each graph in the training dataset 120. The graph comprises one or more nodes, and the node states describe a state of each node in the graph. The calculated node states are fed, at step S270, into the learning algorithm 130 described below in relation to Figure 3, to learn neural network parameters that cause the neural network to map the node states to known outputs of the training dataset 120.
Figure 3 is a flowchart illustrating the operation of the learning algorithm. At step S320, the learning algorithm is fed with the training dataset 120 and the optimised graph parameters outputted from the genetic algorithm 110. The training dataset 120 may comprise a plurality of graphs. At step S330, constant graph parameters are set based on the optimised parameters, and at step S340 the constant graph parameters are used to determine a node state for each of the plurality of graphs in the training dataset. At step S350, the node state is fed into a neural network of the learning algorithm 130. The learning algorithm 130 iteratively operates the neural network until an output of the neural network maps to a known output of the training dataset 120.
In the first phase, the one or more inputs may comprise an initial population of parameters relating to the graph neural network. The parameters may further comprise a set of graph parameters and neural network parameters. By the set of graph parameters, we mean any parameter related to the graph, and by neural network parameters, we mean any parameter related to the neural network.
Figure 4 shows example graphs 400 illustrating example configurations of the nodes of a graph, wherein the nodes are connected by edges. Connections between different nodes describe the graph topology, and an example involving three different connected nodes is described below. The graph topology comprises a connection list of nodes in a graph. The graph may comprise a single node, or a plurality of nodes and at least one edge. Each edge represents an interaction between a node and a neighbouring node of the graph. In general a graph may contain: a single node and no edges; two nodes connected by an edge; three nodes connected by two or three edges; and so on and so forth. The nodes, edges and topology are each described by a set of numerical properties. In the case of a graph containing only a single node, the set of numerical properties describing the edges and topology would return null values. In one example configuration a node 410 is connected to another node 420 by edge 415. Node 410 is connected to a further node 430 by edge 425. Nodes 410, 420 and 430 may all be the same type of node, may all be different types of nodes, or may be a combination of types of nodes. In a second example configuration, node 440 is connected to nodes 450 and 460 by edges 445 and 465 respectively. Node 450 is connected to node 460 by edge 455. Nodes 440, 450 and 460 may all be the same type of node, may all be different types of nodes, or may be a combination of types of nodes. The computer-implemented method may further be implemented flexibly on graphs with different properties, and it will be appreciated that other node configurations are possible.
The set of numerical properties comprises a bias parameter to represent each node in the graph neural network. The bias parameter is a vector quantity of a chosen size, and signifies the likelihood of a node reacting with another neighbouring node. The bias vector could, in principle, be of any size, based on a trade-off between an amount of information encoded versus computational resource availability. In one example embodiment, the bias parameter is a column matrix comprising 8 elements.
The set of numerical properties also includes an interaction parameter, wherein the interaction parameter represents an interaction between any two connected nodes in the graph. The interaction parameter is a matrix, and has a dimension which matches a dimension of the bias parameter. In one example embodiment, the interaction parameter is a square matrix with 64 elements; however, the interaction parameter could, in principle, be of any size that relates to the size of the bias parameter.
Figure 5 illustrates a hierarchical structure of sub-features of the graph neural network for this example embodiment. Graph neural network 510 comprises a set of graph parameters 520 and neural network parameters 530. The set of graph parameters 520 further comprises bias parameters 540 and interaction parameters 550 for each node. Each node in the graph neural network is represented by a bias parameter 540. Neural network parameters 530 comprise one or more neurons 560, and the neurons 560 have neuron biases 570 and one or more connection weights 580 between the neurons 560. Each of the one or more connection weights 580 represents a relative importance of respective inputs (i.e. the graph parameters) to the output of the neural network.
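The parameter hierarchy of Figure 5 can be sketched in code as follows. This is an illustrative container only; the class name, dictionary layout and initialisation scheme are assumptions, not part of the patent. The initial values are drawn bounded between -1 and 1, consistent with the bound on parameter elements stated later in the appendix.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class GraphNNParams:
    """Hypothetical container mirroring the hierarchy of Figure 5."""
    bias: dict          # node type -> bias vector (e.g. 8 elements)
    interaction: dict   # (type_i, type_j) -> interaction matrix (e.g. 8 x 8)
    weights: dict       # node type -> list of neural-network weight matrices

def random_params(types, s=8, seed=0):
    """Draw an initial parameter set with elements bound in [-1, 1]."""
    rng = np.random.default_rng(seed)
    u = lambda shape: rng.uniform(-1.0, 1.0, shape)
    return GraphNNParams(
        bias={t: u(s) for t in types},
        interaction={(a, b): u((s, s)) for a in types for b in types},
        # a small two-layer stack stands in for the per-type output networks
        weights={t: [u((s, 16)), u((16, 1))] for t in types},
    )
```

In a genetic-algorithm setting, each population element would hold one such full parameter set.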
Each bias parameter represents a node in the graph neural network. Figure 6 shows an example of a bias parameter 610, where the bias parameter 610 may be a vector quantity. In the example of Figure 6, bias parameter 610 is shown as a column matrix containing a number of elements e_n and having a dimension d_n. In one example embodiment, bias parameter 610 contains 8 elements. However, the bias vector could, in principle, comprise any number of elements.
Each interaction parameter represents a graph edge, that is, an interaction between any two connected nodes in the input graph. Figure 7 shows an example of an interaction parameter 710, where, in the example of Figure 7, the interaction parameter 710 is shown as a square matrix. The interaction parameter 710 comprises n by m elements and has dimensions d_n by d_m. In the example shown, the interaction parameter 710 is an 8 by 8 matrix. However, the interaction parameter could, in principle, be of any size that relates to the dimension of the bias parameter.
As discussed in connection with Figures 1-3, the constant graph parameters are used to calculate node states for each node in the graph. Optionally, each node has a scaling constant, wherein the scaling constant comprises one or both of: a total number of nodes connected to that node; and a size of the node state. Choosing the scaling constant to be the total number of nodes connected to that node ensures that the model is self-consistent (avoiding any mismatch in dimensions) and that the function of Equation 2 is a contraction map. The size of the node state is related to the size of the bias parameter and the interaction parameter to ensure that the model is self-consistent. The node state may be calculated using the following equation:

x̄_i = (c_i s) · x_i − b_zi (1)

In Equation 1, x̄_i is the node state of node i; c_i is the total number of nodes connected to node i; s is the size of the node state; x_i is an intermediate node state of node i; and b_zi is a bias parameter for node i.
The process of calculating node states using the constant graph parameters further comprises calculating an intermediate node state for each node using one or more of: the bias parameter of that node; the interaction parameter from that node to each connected node; and the bias parameter for the connected node. The process of calculating an intermediate node state for each node may optionally comprise using the scaling constant. The intermediate node state, x_i, for a node i may be calculated as follows:

x_i = (1 / (c_i s)) · (b_zi + Σ_j A_zizj · x_j) (2)

In Equation 2, c_i is the total number of nodes connected to node i; s is the size of the node state; b_zi is a bias parameter for node i; x_j is the node state of a node j connected to node i; and A_zizj is an interaction parameter between the two connected nodes i and j. The sum runs over all nodes j connected to node i. Once the node states for all nodes in the training dataset are calculated, the neural networks for each node type can be trained independently.
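The node-state calculation described above can be sketched numerically as follows: iterate Equation 2 to a fixed point, then rescale and unbias via Equation 1, returning a zero vector for disconnected nodes. The function and variable names are illustrative, not taken from the patent.

```python
import numpy as np

def node_states(adj, b, A, s=8, tol=1e-10, max_iter=1000):
    """adj: adjacency list, adj[i] = indices of nodes connected to node i.
    b:   (n, s) array of per-node bias vectors (already looked up by type).
    A:   dict mapping (i, j) -> (s, s) interaction matrix for that edge."""
    n = len(adj)
    x = np.zeros((n, s))
    for _ in range(max_iter):
        x_new = np.empty_like(x)
        for i in range(n):
            c = max(len(adj[i]), 1)                 # scaling constant c_i
            acc = b[i] + sum(A[(i, j)] @ x[j] for j in adj[i])
            x_new[i] = acc / (c * s)                # Equation 2
        if np.max(np.abs(x_new - x)) < tol:         # fixed point reached
            x = x_new
            break
        x = x_new
    # Equation 1: rescale and unbias; disconnected nodes get the zero vector
    out = np.empty_like(x)
    for i in range(n):
        if adj[i]:
            out[i] = (len(adj[i]) * s) * x[i] - b[i]
        else:
            out[i] = 0.0
    return out
```

With parameter elements bounded between -1 and 1, the scaled update contracts and the iteration converges quickly in practice.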
The intermediate node state, calculated using Equation 2, may be rescaled using the scaling constant and unbiased using the node bias parameter of that node to determine the node state of that node, as shown in Equation 1. For a node that is not connected to any other node, calculating node states using the constant graph parameters comprises returning a zero vector for the node in order to ensure numerical stability.
In an alternative example, if unbiasing and unscaling is not required, the node states can be calculated directly using Equation 2.
During the second phase, the learning algorithm 130 maps the node states to known outputs of the training dataset 120 based on a predetermined degree of accuracy. By predetermining the degree of accuracy, the user can decide a satisfactory degree of accuracy, taking into consideration other trade-off factors such as computational time/numerical stability. The idea of having the second phase is that it is much quicker than the first phase, since now the recurrent part of the graph neural network is assumed to be satisfactory and kept constant (e.g. the node biases and interaction parameters). The node state to output mapping can be done with smaller and simpler networks, so a genetic algorithm is not needed and faster gradient methods can be used reliably. This allows for testing of different neural networks quickly, trying to reach maximal accuracy.
Accuracy itself can be defined in many ways. In the inventors' case, the mean relative error was used, since it gave better results than the mean absolute error, but that is not intended to be limiting to the inventive concept. For an atomic charge predictor implementation, the inventors targeted an absolute error of the order of 0.01 electrons, and they iteratively tested until they achieved that level of accuracy. However, they could not have known beforehand whether that level of accuracy was achievable.
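The two error measures mentioned above can be written out directly. These are standard definitions, not an implementation from the patent; the epsilon guard is an added assumption to handle near-zero reference values.

```python
def mean_absolute_error(pred, ref):
    """Mean of |prediction - reference| over all samples."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(pred)

def mean_relative_error(pred, ref, eps=1e-8):
    """Mean of |prediction - reference| / |reference|; eps guards against
    division by zero when a reference value is (near) zero."""
    return sum(abs(p - r) / (abs(r) + eps) for p, r in zip(pred, ref)) / len(pred)
```

The relative error weights small charges (such as d populations) as heavily as large ones, which is one plausible reason it worked better here.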
In one example embodiment, the graph neural network comprises a feed-forward neural network, where the output from one layer is used as an input to feed the next layer, such that the information moves only in one direction. However, the skilled person will appreciate that the graph neural network may comprise any other suitable neural network, for example, a recurrent network.
Figure 8 is a graph showing an example of how an average fitness level of a genetic algorithm population evolves with each generation of the genetic algorithm 810, until the fitness of the population reaches a slowly changing or steady state 830. Figure 8 shows an example of a best fitness level 820 overlaid onto the average fitness level for each generation of the genetic algorithm 810. The slowly changing or steady value 830 occurs when the genetic algorithm population converges. The slowly changing or steady value 830 may be used to indicate when a desired fitness level of the genetic algorithm 810 has been reached.
The genetic algorithm 810 is used in the first phase of the training algorithm to optimise an initial population of parameters relating to the graph neural network until the population of the genetic algorithm 810 reaches a desired fitness level, wherein the desired fitness level is reached when the genetic algorithm population reaches a slowly changing or steady value 830. Training an entire population of parameters is time-consuming and demanding on computing resources. Mutations may occur that increase the fitness of the population.
Depending on the availability of computing resources, waiting for such a mutation may not be worth any additional demand on computing resources. However, if computational resources are not a constraining factor, seeking a mutation that increases the fitness of the population may be beneficial. In one embodiment, the evolution of the genetic algorithm 810 is stopped once the genetic algorithm population reaches a slowly changing or steady value, wherein the slowly changing or steady value occurs when the genetic algorithm 810 population converges.
Convergence of the genetic algorithm 810 is ensured by configuring the training algorithm 100, acting on the graph neural network, to output a bounded value, for example a value whose square is less than or equal to 1. Optionally, the output may be further constrained to be one of: a value less than or equal to 1; a value greater than or equal to -1; a value between -1 and 1 inclusive; a value in a sub-range between -1 and 1 inclusive; or any other suitable value.
A best fitness output is used when setting the constant graph parameters based on the optimised graph parameters output during the first phase as discussed above in connection with Figures 1-3.
In this example embodiment, the learning algorithm 130 is a gradient descent algorithm.
An advantage of using a gradient descent algorithm is that gradient descent algorithms are much faster than using the genetic algorithm 110 to additionally carry out the function of the learning algorithm 130. However, another algorithm may be used.
The training algorithm 100 is configured to process one or more graphs as an input. In one example implementation, the input graph is a molecule graph. The training algorithm is configured to output an estimated atomic charge population for at least one atom in the molecule graph. Each node represents an atom in the molecule graph and is distinguished from other nodes by its atomic type (Z). Each edge represents electronic interactions between the connected atoms. The final node state of each node is used to compute an atomic charge population of a corresponding atom associated with each node of the molecule graph. In this example, the atom is one of: hydrogen, carbon, oxygen, nitrogen and fluorine, although the skilled person will appreciate that the atom could be any atom.
Figure 9 is a graphical representation 900 of the output of the graph neural network implemented by the learning algorithm as a function of predicted charge. In one embodiment, Figure 9 shows an interpolation function 910 that maps an output of the graph neural network to a plurality of predicted node states of an input graph. The interpolation function 910 may optionally be a piecewise interpolation function. The predicted node states may be predicted charge values associated with a node, and may be represented by one or more of q_m, q_0 and q_M. In one specific example embodiment, the plurality of predicted charge values are associated with an atomic orbital of an atom, and are predetermined and different for each available atomic orbital. However, the skilled person will appreciate that the learning algorithm described above is suitable to be applied more broadly to predict other properties based on the output of the neural network.
The graph neural network uses hyperbolic tangent activation. The neuron biases of the graph neural network, discussed above in relation to Figure 5, may be constrained to be zero. When combined with an appropriate choice of q_0, constraining the neuron biases to zero ensures that predicted charge populations of isolated atoms match exactly with periodic table entries for those atoms.
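The zero-bias property can be checked numerically. With all neuron biases constrained to zero and tanh activation, a zero input state propagates through every layer as exactly zero, so an isolated atom (whose node state is the zero vector) always maps to the raw output 0 and hence to the charge q_0. The layer sizes and random weights below are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
# two illustrative weight layers; no bias vectors anywhere
weights = [rng.uniform(-1, 1, (8, 16)), rng.uniform(-1, 1, (16, 1))]

def feed_forward(x, weights):
    """Bias-free feed-forward pass with tanh activation."""
    for W in weights:
        x = np.tanh(x @ W)     # tanh(0 @ W) = tanh(0) = 0, layer by layer
    return x

out = feed_forward(np.zeros(8), weights)   # exactly zero for a zero state
```

The same pass also illustrates the bounded output: tanh keeps every value strictly inside (-1, 1) for any finite input.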
Figure 10 is a block diagram which shows an example of a graph neural network embodied as a charge predictor 1000 comprising an input interpreter 1010 and a neural network 1020. The input interpreter 1010 takes a graph as an input and outputs a node state using Equation 2 optionally with Equation 1. The node state is used as an input by the neural network 1020 which then outputs a predicted charge value. The input graph is an input molecule graph. The charge predictor 1000 of Figure 10 is an example of how, once trained by the training algorithm 100, the charge predictor 1000 may be configured to produce a predicted charge output, based on an input molecule graph. In the embodiment of Figure 10, the input molecule graph may be any molecule graph containing atoms that have been used in training.
The predicted charge value is scaled by a normalisation factor. When calculating a total electronic population of a system, small errors may occur due to small prediction errors on each atomic component, therefore a normalisation factor may be introduced to each predicted charge to ensure that the summed total of any predicted charge values in the system is equal to the total electronic population of the system. However, scaling is optional.
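The normalisation step described above amounts to a single rescaling. The function name is illustrative; the patent only specifies that the predicted charges are scaled so their sum matches the total electronic population.

```python
def normalise_charges(predicted, z_total):
    """Rescale per-atom predicted charges so their sum equals the known
    total electronic population z_total of the (neutral) system."""
    factor = z_total / sum(predicted)
    return [q * factor for q in predicted]
```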
Figure 12 shows an example computer architecture 1200, comprising at least a CPU 1210, a memory 1220 and an I/O interface 1230, configured to perform the computer-implemented method described herein.
Appendix A: Example Embodiment To aid understanding, the invention will now be described by way of a specific worked example where the graph neural network training algorithm is utilised to estimate one or more atomic charge populations when the inputs to the genetic algorithm are molecule graphs. The following is intended only to aid the general understanding of the invention in practice, and is not intended to be in any way limiting. Indeed, the skilled reader will appreciate that the training algorithm approach described above is applicable to other types of input graph.
The following specific example of the training strategy involves finding an optimal set of atomic biases, interaction matrices and neural network parameters, such that atomic charges predicted by the training algorithm closely match known reference database values.
A.1 Atomic Charge Prediction Atomic charges are calculated from the graph representation of the system using a machine-learning model based on the graph neural network. Although the following is provided as a specific example, the general principle is the same: abstract the system as a graph, compute the state of the nodes from the graph topology and input properties, and compute the output of the nodes from their states. The detailed procedure follows.
Atoms are represented by graph nodes, and are distinguished solely by their atomic type Z. Chemical bonds between atoms become the graph edges. In our implementation, this is the only input information required from the user in order to complete the calculation.
Internally, each node has an associated bias parameter b_Zi, determined by its atomic type, and an intermediate node state x_i, of a chosen fixed size s, that is computed from the input data and biases. The intermediate node state of the node, x_i, depends on the node type, as well as the node state and type of all nodes connected to it:

x_i = (1 / (c_i s)) · (b_Zi + Σ_j A_ZiZj · x_j) (A.1)

where A_ZiZj is the interaction parameter between nodes i and j, c_i is the scaling constant, and the sum runs over all nodes j connected to node i. Due to the mutual dependence of the node states, they are calculated by applying Equation A.1 iteratively, and convergence is guaranteed as long as the function in Equation A.1 is a contraction map. In our implementation this is ensured by choosing the scaling constant to be the total number of nodes connected to i, and interaction parameters and bias parameters with elements bound between -1 and 1 inclusive. As the index notation suggests, the bias parameter is only dependent on the atomic type of the node. Similarly, different interaction parameters are used in the calculation of the node states, depending on the atomic type of both connected nodes.
Once converged, the intermediate node states are rescaled and unbiased, giving the final node state:

x̄_i = (c_i s) · x_i − b_Zi (A.2)

The above step is designed to ensure that the final node state of disconnected atoms is a zero vector. The last part of the calculation consists of using the final states to compute the atomic charge populations q_t,i on each atom:

q_t,i = g_Zi,t( f_Zi,t( x̄_i ) ) (A.3)

where t is the orbital type (s, p or d orbital). The function f is a conventional feed-forward neural network with hyperbolic tangent activation, which outputs a value between -1 and 1 inclusive, associated with the node state given in the inputs. The indexing indicates that different neural networks are applied to the node states depending on the atomic type of the node, and the desired orbital flavour of the output.
The neural network output is interpolated piecewise by the function g, which maps the output of the neural network to a charge value between q_m and q_M, with g(0) = q_0, as shown in Figure 9. These three interpolation parameters are predetermined and different for each atom and orbital type.
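One possible form of the interpolation function g is sketched below. The patent fixes only the three points g(-1) = q_m, g(0) = q_0 and g(+1) = q_M; the piecewise-linear form between them is an assumption for illustration.

```python
def g(y, q_m, q_0, q_M):
    """Map a network output y in [-1, 1] to a charge in [q_m, q_M], with
    g(0) = q_0.  Piecewise-linear form assumed; only the three anchor
    points are specified in the text."""
    if y < 0:
        return q_0 + y * (q_0 - q_m)   # linear from g(-1) = q_m up to g(0) = q_0
    return q_0 + y * (q_M - q_0)       # linear from g(0) = q_0 up to g(+1) = q_M
```

Using the H s-orbital row of Table A1 (q_m = 0.60, q_0 = 1.00, q_M = 1.10), an output of 0 gives exactly the nominal population of 1 electron.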
The neuron biases of the neural network are constrained to all be zero, making sure that the neural network outputs zero whenever the input node state is also the zero vector. Combined with the appropriate choice of q_0, this ensures that isolated atoms are predicted exactly with their nominal charge population from the periodic table. The other interpolation extrema, q_m and q_M, are chosen so that unphysical electronic populations cannot occur.
The total electronic population Z_TOT of a system (assumed to be electrically neutral) is given by the sum of Z_i over its atoms. However, the machine-learned model will rarely give the expected value exactly, due to a small prediction error on each atomic component.
For this reason, all predicted charges are scaled by a normalization factor that brings the total to Z_TOT.
A.2 Training strategy Training the charge predictor requires finding an optimal set of bias parameters, interaction parameters, and neural network parameters for the neural network, so that the predicted charges closely match reference database values.
First, the hyper-parameters of the model are set, i.e. parameters defining the model behaviour that cannot be learned from the data during training. The size of the node state and bias parameters was set to 8, and consequently the interaction parameters are 8 by 8. Larger values would allow more information to be encoded into the node states, potentially allowing for a more accurate model, while smaller values would imply lower complexity and thus faster training. This choice is a compromise between accuracy and speed, and prior testing with larger node state size revealed no significant accuracy improvement. The neural networks are chosen to have 6 layers of size 8 (input), 128, 64, 32, 16, and 1 (output) respectively: the inventors found this to work well with their training data, however they do not claim this to be the optimal topology.
The above choice of hyper-parameters sets the number of model parameters for the training to find. This example model describes molecules containing up to five atomic species (H, C, N, O and F), and so the bias parameters bring 5x8 = 40 parameters, while all the interaction parameters give 5x5x8x8 = 1,600 parameters. The neural network parameters are the connection weights between neurons (biases are constrained to zero, thus not counted). Since there is a neural network for each type of electron of each species (s/p for H, s/p/d for C, N, O and F) and each neural network has 11,792 weights, there are 165,088 parameters in all the neural networks combined. Thus, the training algorithm has to optimize a total of 166,728 parameters.
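The parameter counts above can be reproduced with a few lines of arithmetic, directly from the hyper-parameters stated in the text:

```python
species = 5                                   # H, C, N, O, F
s = 8                                         # node-state / bias vector size
bias_params = species * s                     # 5 x 8 = 40
interaction_params = species ** 2 * s * s     # 5 x 5 x 8 x 8 = 1,600

layers = [8, 128, 64, 32, 16, 1]              # the stated network topology
weights_per_net = sum(a * b for a, b in zip(layers, layers[1:]))  # 11,792

networks = 2 + 4 * 3                          # s/p for H; s/p/d for C, N, O, F
nn_params = networks * weights_per_net        # 14 x 11,792 = 165,088

total = bias_params + interaction_params + nn_params              # 166,728
```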
Even though the graph neural network prior art (F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner and G. Monfardini, "The Graph Neural Network Model," in IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61-80, Jan. 2009) discusses the application of gradient-based algorithms to train every one of its parts simultaneously (bias parameters, interaction parameters and output neural networks in our specific case), the inventors found these to be particularly difficult to implement and prone to numerical stability issues. Therefore, the inventors designed their own training algorithm.
The training algorithm runs in two separate phases. First, the entire parameter set is optimized using a genetic algorithm. The procedure starts by creating a random population of elements, each represented by a full set of graph neural network parameters. The genetic algorithm then computes the fitness of each element, i.e. a measure of the graph neural network accuracy with that parameter set. At this point, pairs of elements are selected from the population for breeding an offspring element: the parameters of the parent elements are mixed to form a new set. Mutations may also appear, randomizing some values in the resulting set. This is repeated until a new population of the same size as the original one is formed. Parents are picked at random, however, higher fitness implies higher chance of selection, thus steering the evolution towards more accurate parameter sets over long time.
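The selection/crossover/mutation loop described above can be sketched as a toy genetic algorithm. The function name, hyper-parameters and flat-list genome are all illustrative; the actual implementation (population size, mutation scheme, fitness definition) is not specified at this level of detail in the patent.

```python
import random

def evolve(fitness, population, generations=60, mut_rate=0.1, mut_span=0.5):
    """Minimal genetic algorithm: fitness-weighted parent selection,
    per-gene crossover, and random mutation.  fitness must return a
    positive score (higher = better)."""
    for _ in range(generations):
        scores = [fitness(ind) for ind in population]
        offspring = []
        while len(offspring) < len(population):
            # higher fitness implies higher chance of being picked as a parent
            p1, p2 = random.choices(population, weights=scores, k=2)
            # mix the parents' parameters gene by gene
            child = [a if random.random() < 0.5 else b for a, b in zip(p1, p2)]
            # occasionally randomise some values (mutation)
            child = [v + random.uniform(-mut_span, mut_span)
                     if random.random() < mut_rate else v
                     for v in child]
            offspring.append(child)
        population = offspring
    return max(population, key=fitness)
```

In the patent's setting each "individual" would be a full graph neural network parameter set, and fitness a measure of prediction accuracy with that set.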
Around 50,000 generations were required to observe a significant prediction improvement over the initial random models. However, the best accuracy obtained this way is still too low to make the model useful. The reason for this lies in the recurrent nature of the graph neural network: even small changes in the bias parameters or interaction parameters result in significantly different node states, changing the output of the neural network and causing large changes in the overall fitness, which makes improvement through evolution slow and unstable. Training the model with the genetic algorithm alone is possible, but time-consuming and quite demanding on computer resources. For these reasons the evolution is stopped when the genetic algorithm population converges and the fitness of the genetic algorithm reaches a slowly changing or steady value. In principle, a mutation that increases fitness could always appear, but at this point it is a rare event, and waiting for it is not worth the computational resources.
Fortunately, small changes in the node states result in small changes in predicted charges (and accuracy as well), since the neural networks are smooth functions. If the states were known, the neural networks could be trained directly to model the state-charge relationship using a much faster gradient-descent based method.
In the second training phase, all parameters related to the bias parameters and interaction parameters become constants, obtained from the best model output in the previous phase; this effectively establishes a scheme for the computation of the node states from the input molecule graph. Since neural networks are smooth functions, the much faster gradient descent method can be used to optimize their parameters (165,088 values) so that they accurately map states onto atomic charge populations. Moreover, once the node states of all atoms in the database are calculated, the neural networks for each atomic type can be trained independently.
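The second phase can be illustrated with a toy gradient descent fit. Here a linear least-squares model stands in for the per-type neural networks: the node states are fixed (as they are once the graph parameters become constants) and only the output mapping is optimised. All data and names are synthetic illustrations.

```python
import numpy as np

rng = np.random.default_rng(1)
states = rng.normal(size=(200, 8))       # fixed node states from phase one
w_true = rng.normal(size=8)              # hypothetical "true" state-to-charge map
charges = states @ w_true                # reference charges to fit

w = np.zeros(8)
lr = 0.05
for _ in range(2000):
    # gradient of the mean squared error with respect to the weights
    grad = 2.0 * states.T @ (states @ w - charges) / len(charges)
    w -= lr * grad                       # plain gradient descent step
```

Because the loss is smooth in the weights, this converges quickly and reliably, which is the advantage over evolving the whole recurrent model by genetic algorithm.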
A.3 Results The method was tested by training the charge predictor on a database of 134,000 small organic molecules from the GDB9 set. The dataset was randomly shuffled and 70% of it was used as training data, while the other 30% remained for validation. Molecular geometries were taken from the public dataset of ref [RUPPDATA], and recalculated at the coupled cluster singles and doubles level [REFCCSD], with the cc-pvdz localized basis set. Electronic Mulliken populations [REFMKEN] in the s, p, and d orbitals of each atom were computed from the CCSD density matrix.
The atomic charge predictor abstracts the molecular geometry into a non-positional graph, where edges between nodes are placed according to simple cut-off rules, predetermined from organic chemistry knowledge. The cut-off distance H-H is 0.85 A, H-X is 1.20 A (X being C, N, 0 or F), and X-X is 1.70 A. The output boundary parameters are given in the
following Table A1:
atom  orbital  q_m [e]  q_0 [e]  q_M [e]
H     s        0.60     1.00     1.10
H     p        0.00     0.00     0.12
C     s        2.60     3.00     3.80
C     p        1.90     3.00     3.20
C     d        0.00     0.00     0.40
N     s        3.10     3.50     4.10
N     p        2.90     3.50     4.00
N     d        0.00     0.00     0.30
O     s        3.60     3.70     3.90
O     p        4.20     4.30     4.80
O     d        0.00     0.00     0.10
F     s        3.60     3.80     4.00
F     p        5.10     5.20     5.50
F     d        0.00     0.00     0.40
Table A1
The minimum/maximum charge values were estimated from the values found in the database, allowing for a small uncertainty margin. The maximum values are also consistent with the size of the electronic shells. While organic molecules do not really have d electrons, the basis set includes d orbitals for all atoms except H, thus a very small amount of charge inevitably spills into those. The main purpose of these boundary values is to limit the output space of the neural network within the expected range, avoiding occasional catastrophic mis-prediction of outlier molecules, and helping the training algorithm to converge faster.
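The cut-off rules quoted above (H-H 0.85 Å, H-X 1.20 Å, X-X 1.70 Å) can be sketched as a small graph-construction routine. The function and variable names are illustrative; the patent only states the cut-off distances themselves.

```python
import math

def molecule_graph(symbols, coords):
    """Build the edge list of a non-positional molecule graph from atomic
    coordinates (in Å) using the stated cut-offs:
    H-H 0.85 Å, H-X 1.20 Å, X-X 1.70 Å (X being C, N, O or F)."""
    def cutoff(a, b):
        n_h = (a == "H") + (b == "H")      # how many of the pair are hydrogen
        return (1.70, 1.20, 0.85)[n_h]     # 0 H -> X-X, 1 H -> H-X, 2 H -> H-H
    edges = []
    for i in range(len(symbols)):
        for j in range(i + 1, len(symbols)):
            if math.dist(coords[i], coords[j]) <= cutoff(symbols[i], symbols[j]):
                edges.append((i, j))
    return edges
```

For an approximately water-shaped geometry, this places the two O-H edges and correctly omits an H-H edge.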
After training, the charge predictor was tested on molecules in the validation set, giving the regression plot shown in Figure 11. The model is able to estimate the total atomic charges (sum of s, p and d) within a small error for all atomic species.
About 86.6% of the atomic charges are predicted within ± 0.05 e of the expected value as shown in Figure 11. The mean absolute errors (MAE) on the partial charges for each orbital type, and on the total charge, are given in the following Table A2:

atom  MAE s [e]  MAE p [e]  MAE d [e]  MAE pq [e]
H     0.01       0.0007     —          0.01
C     0.03       0.03       0.004      0.04
N     0.02       0.03       0.002      0.03
O     0.01       0.02       0.0007     0.02
F     0.002      0.01       0.0004     0.01
Table A2
Several methods to calculate partial electron populations exist (Mulliken, Bader, Löwdin, etc.) and the resulting values can vary between them. The prediction error of the inventors' machine-learning model is much smaller than the variation expected from the choice of reference method, suggesting that the model could be accurately trained on any of them.
While this invention has been particularly shown and described with references to preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present application as defined by the appended claims. Such variations are intended to be covered by the scope of this application. As such, the foregoing description of embodiments of the present application is not intended to be limiting. Rather, any limitations to the invention are presented in the following claims.

Claims (39)

1. A computer-implemented method for training a graph neural network, the graph neural network being configured to process an input graph and, when the input graph is a molecule graph, being configured to output an estimated atomic charge population for at least one atom in the molecule graph, the method comprising: during a first phase, using a genetic algorithm to optimise an initial population of parameters relating to the graph neural network, the parameters relating to the graph neural network comprising: i) a set of graph parameters, and ii) neural network parameters; and outputting an optimised set of graph parameters; during a second phase: a) setting the set of graph parameters to be constant based on the optimised set of graph parameters output during the first phase, b) for each input graph in a training dataset, calculating node states using the constant set of graph parameters, the node states describing a state of each node in the input graph, and c) using a learning algorithm to learn the neural network parameters that cause the neural network to map the node states to known outputs of the training dataset.
2. The method of claim 1, wherein the input graph comprises: a single node; or a plurality of nodes and at least one edge, and wherein each edge defines an interaction between two nodes of the input graph.
3. The method of claim 1, wherein the set of graph parameters comprises a bias parameter to represent each node in the graph neural network.
4. The method of claim 3, wherein the bias parameter is a vector quantity.
5. The method of claim 4, wherein the bias parameter is a column matrix.
6. The method of claim 5, wherein the column matrix comprises 8 elements.
7. The method of claim 1, wherein the set of graph parameters comprises an interaction parameter to represent an interaction between two nodes connected in the graph.
8. The method of claim 7, wherein the interaction parameter is a matrix.
9. The method of claim 7 or 8, wherein the interaction parameter is a square matrix.
10. The method of any of claims 7 to 9, wherein the interaction parameter has a dimension which matches a dimension of the bias parameter.
11. The method of claim 1, wherein the graph neural network comprises neurons, and the neural network parameters comprise connection weights between neurons.
12. The method of claim 1, wherein the genetic algorithm produces an optimised initial population of parameters when a desired fitness level is reached, and the desired fitness level is reached when the genetic algorithm population reaches a slowly changing or steady value.
13. The method of claim 12, wherein the slowly changing or steady value occurs when the genetic algorithm population converges.
14. The method of claim 1, wherein when setting constant graph parameters based on the optimised graph parameters output during the first phase, a best fitness output of the first phase is used.
15. The method of claim 1, wherein calculating a node state using the constant graph parameter comprises using a scaling constant for each node, the scaling constant comprising one or both of: a total number of nodes connected to that node; and a size of the node state.
  16. 16. The method of 15 when dependent on claims 6 and 11, wherein the size of the node state is related to the size of the bias parameter and the interaction parameter.
  17. 17. The method of claim 1, wherein calculating a node state using the constant graph parameter comprises calculating an intermediate node state for each node using one or more of: the node bias parameter of that node; the interaction parameter from that node to each connected node; the bias parameter for the connected node.
  18. 18. The method of claim 17, when dependent on claim 16, wherein a calculating node state using the constant graph parameter comprises calculating an intermediate node state for each node using the scaling constant.
  19. 19. The method of claim 18, wherein the intermediate node state is rescaled using the scaling constant and unbiased using the node bias parameter of that node to determine the node state of that node.
  20. 20. The method of claim 1, wherein calculating a node state using the constant graph parameter comprises returning a zero vector for a node that is not connected to any other node.
  21. The method of claim 1, wherein calculating a node state using the constant graph parameter comprises using the following equation: x_i = (c_i s) x̃_i − b_zi, wherein x̃_i is an intermediate node state.
  22. The method of claim 21, wherein the intermediate node state is calculated using the following equation: x̃_i = (1 / (c_i s)) (b_zi + Σ_j^conn A_zizj x_j), wherein x̃_i is the intermediate node state of the node i, c_i is a total number of nodes connected to node i, s is the size of the node state, b_zi is a bias parameter for node i, x_j is a node state of a node j connected to node i, and A_zizj is an interaction parameter between the two connected nodes i and j.
  23. The method of claim 1, wherein calculating a node state using the constant graph parameter comprises using the following equation: x_i = (1 / (c_i s)) Σ_j^conn (b_zi + A_zizj x_j), wherein x_i is the intermediate node state of the node i, c_i is a total number of nodes connected to node i, s is the size of the node state, b_zi is a bias parameter for node i, x_j is a node state of a node j connected to node i, and A_zizj is an interaction parameter between the two connected nodes i and j.
  24. The method of claim 1, wherein mapping a node state to known outputs of the training dataset is based on a predetermined degree of accuracy.
  25. The method of claim 24, wherein the predetermined degree of accuracy is defined by the mean relative error.
  26. The method of claim 1, wherein the learning algorithm is a gradient descent algorithm.
  27. The method of claim 1, wherein the graph neural network is configured to output a value of one of: less than or equal to 1; greater than or equal to -1; between -1 and 1 inclusive; a sub-range between -1 and 1 inclusive.
  28. The method of claim 1, further comprising applying an interpolation function to the output of the graph neural network to map the output of the graph neural network to a plurality of predicted node state values of the graph.
  29. The method of claim 28, wherein the interpolation function is a piecewise interpolation function.
  30. The method of claim 29, when dependent on claim 3, wherein the plurality of predicted node state values are a plurality of charge values and are predetermined and different for each available atomic orbital of an atom.
  31. The method of claim 30, wherein the plurality of predicted charge values are scaled by a normalisation factor.
  32. The method of claim 1, wherein the graph neural network comprises a feed-forward neural network.
  33. The method of claim 28, wherein the graph neural network comprises hyperbolic tangent activation.
  34. The method of claim 1, wherein the graph neural network comprises neuron biases.
  35. The method of claim 34, wherein the neuron biases of the graph neural network are constrained to be zero.
  36. The method of claim 1, wherein each node represents an atom in a molecule graph.
  37. The method of any preceding claim, wherein the final node state of each node is used to compute an atomic charge population of a corresponding atom associated with each node of the molecule graph.
  38. The method of claim 37, wherein the atom is one of: hydrogen, carbon, oxygen, nitrogen and fluorine.
  39. The method of claim 1, wherein the input graphs of the training dataset are molecule graphs.
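
By way of illustration only (this sketch is not part of the claims), the node-state calculation recited in claims 19 to 23 can be written out as follows. The NumPy array shapes and the `node_states` helper are assumptions made for this example; the claims themselves do not prescribe a data layout:

```python
import numpy as np

def node_states(adjacency, x0, b, A):
    """Illustrative sketch of the node-state update of claims 19-23.

    adjacency : list of neighbour-index lists, one per node
    x0        : (n, s) current node states
    b         : (n, s) per-node bias vectors (claims 3-6)
    A         : (n, n, s, s) pairwise interaction matrices (claims 7-10)
    """
    n, s = x0.shape
    x_new = np.zeros_like(x0)  # unconnected nodes return a zero vector (claim 20)
    for i in range(n):
        conn = adjacency[i]
        if not conn:
            continue
        c = len(conn)  # scaling constant built from c_i and s (claim 15)
        # intermediate node state x~_i = (b_i + sum_j A_ij x_j) / (c_i * s)  (claim 22)
        x_tilde = (b[i] + sum(A[i, j] @ x0[j] for j in conn)) / (c * s)
        # rescale by (c_i * s) and un-bias with b_i to obtain x_i  (claims 19 and 21)
        x_new[i] = (c * s) * x_tilde - b[i]
    return x_new
```

With identity interaction matrices and zero biases, each connected node simply receives the sum of its neighbours' states, while an isolated node yields the zero vector of claim 20.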
GB1821192.0A 2018-12-24 2018-12-24 A computer-implemented method of training a graph neural network Withdrawn GB2584588A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1821192.0A GB2584588A (en) 2018-12-24 2018-12-24 A computer-implemented method of training a graph neural network


Publications (2)

Publication Number Publication Date
GB201821192D0 GB201821192D0 (en) 2019-02-06
GB2584588A true GB2584588A (en) 2020-12-16

Family

ID=65364377

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1821192.0A Withdrawn GB2584588A (en) 2018-12-24 2018-12-24 A computer-implemented method of training a graph neural network

Country Status (1)

Country Link
GB (1) GB2584588A (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110263227B (en) * 2019-05-15 2023-07-18 创新先进技术有限公司 Group discovery method and system based on graph neural network
US11817184B2 (en) * 2019-05-16 2023-11-14 Robert Bosch Gmbh Graph neural network force field computational algorithms for molecular dynamics computer simulations
CN110659723B (en) * 2019-09-03 2023-09-19 腾讯科技(深圳)有限公司 Data processing methods, devices, media and electronic equipment based on artificial intelligence
CN111458471B (en) * 2019-12-19 2023-04-07 中国科学院合肥物质科学研究院 Water area detection early warning method based on graph neural network
CN111882054B (en) * 2020-05-27 2024-04-12 杭州中奥科技有限公司 Method for cross training of encryption relationship network data of two parties and related equipment
CN112420131B (en) * 2020-11-20 2022-07-15 中国科学技术大学 Molecular Generation Method Based on Data Mining
CN114978708A (en) * 2022-05-25 2022-08-30 上海磐御网络科技有限公司 Honeypot data-based graph neural network attack intention prediction method
CN116186309B (en) * 2023-04-21 2023-07-18 江西财经大学 Graph Convolutional Network Recommendation Method Based on Interaction Interest Graph Fused with User Intent
CN118536542B (en) * 2024-07-25 2024-11-22 杭州心智医联科技有限公司 A method and system for constructing an optimal architecture of a graph neural network
CN119003133B (en) * 2024-10-21 2025-04-18 山东科技大学 A data-distributed batch scheduling method for graph neural network training tasks
CN119558513B (en) * 2024-10-29 2025-10-28 华南理工大学 Intelligent management method and system for fermentation tank group based on optimal fermentation time of soy sauce
CN120278236A (en) * 2025-06-10 2025-07-08 中煤陕西榆林能源化工有限公司 Mine water seal storage pressure prediction method based on deep learning

Non-Patent Citations (1)

Title
None *

Also Published As

Publication number Publication date
GB201821192D0 (en) 2019-02-06


Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)