HK1196688B - Systems and methods for network-based biological activity assessment
- Publication number
- HK1196688B, HK14110134.0A
- Authority
- HK
- Hong Kong
- Prior art keywords
- biological
- score
- processor
- agent
- network
Description
Background
The human body is often perturbed by exposure to potentially harmful agents that can pose serious health risks in the long term. Exposure to these agents can compromise the normal functioning of biological mechanisms within the human body. To understand and quantify the effects of these perturbations on the human body, researchers have investigated the mechanisms by which biological systems respond to exposure to agents. Some groups have made extensive use of in vivo animal testing methods. However, animal testing methods are not always sufficient because of questions about their reliability and relevance. The physiology of different animals differs in numerous ways, so different species may respond differently to exposure to an agent. It is therefore questionable whether the responses obtained from animal tests can be extrapolated to human biology. Other methods include assessing risk through clinical studies on human volunteers. However, these risk assessments are performed a posteriori, and because the disease may take decades to manifest, these assessments may not be sufficient to elucidate the mechanism that links the harmful substance to the disease. Still other methods include in vitro experiments. Although in vitro cell- and tissue-based methods have gained widespread acceptance as a complete or partial alternative to their animal-based counterparts, these methods are of limited value: because in vitro methods focus on specific aspects of cellular and tissue mechanisms, they do not always take into account the complex interactions that occur throughout the overall biological system.
Over the past decade, high-throughput measurements of nucleic acid, protein and metabolite levels in conjunction with traditional dose-related therapeutic and toxicity assays have emerged as a method for elucidating the mechanisms of action of many biological processes. Researchers have attempted to combine information from these disparate measurements with knowledge about biological pathways from the literature to construct meaningful biological models. To this end, researchers have begun to identify potential biological mechanisms of action using mathematical and computational techniques (e.g., clustering and statistical methods) that are capable of mining large amounts of data.
Previous work has also explored the importance of characterizing features that reveal changes in gene expression caused by one or more perturbations to a biological process, and of subsequently scoring the presence of such a feature in an additional data set as a measure of the magnitude of a particular activity of that process. Much of this work has been directed to identifying and scoring features associated with disease phenotypes. These phenotype-derived features provide significant classification capability, but they lack a mechanistic or causal relationship between an individual specific perturbation and the feature. Thus, these features may represent a variety of different unknown perturbations that result in, or are caused by, the same disease phenotype through generally unknown mechanisms.
One challenge is to understand how the activities of various individual biological entities in a biological system enable the activation or inhibition of different biological mechanisms. Because individual entities (e.g., genes) can be involved in multiple biological processes (e.g., inflammation and cell proliferation), measurements of gene activity alone are not sufficient to identify the underlying biological process that triggered the activity. None of the current technologies has been applied to identify the underlying mechanisms responsible for the activity of biological entities at the micro-scale, or to provide a quantitative assessment of the activation of the different biological mechanisms within which these entities act in response to potentially harmful agents and experimental conditions. Accordingly, there is a need for improved systems and methods for analyzing biological data at a system level based on biological mechanisms and for quantifying changes in the biological system as the system responds to agents or environmental changes.
Disclosure of Invention
In one aspect, the systems and methods described herein relate to a computerized method and one or more computer processors for quantifying the response of a biological system to a perturbation by an agent.
In one aspect, a computerized method comprises: receiving, at a first processor, a treatment data set corresponding to a response of a biological system to an agent, wherein the biological system comprises a plurality of biological entities, each biological entity interacting with at least one other biological entity; receiving, at a second processor, a control data set corresponding to the biological system not exposed to the agent; providing, at a third processor, a computational causal network model representative of the biological system and including: nodes representing the biological entities, edges representing relationships between the biological entities, and direction values for the nodes representing expected directions of change between the control data and the treatment data; calculating, with a fourth processor, activity measurements for the nodes representing a difference between the treatment data and the control data; calculating, with a fifth processor, weight values for the nodes, wherein at least one weight value is different from at least one other weight value; and generating, with a sixth processor, a score for the computational model representing the perturbation of the biological system in response to the agent, wherein the score is based on the direction values, the weight values, and the activity measurements. The biological system may be represented by at least one mechanistic hypothesis. The biological system can be represented by a plurality of computational causal network models or by at least one computational causal network model comprising a plurality of mechanistic hypotheses. The method may further include normalizing the scores based on the number of measurable nodes in the respective computational model.
The weight values may represent a confidence in at least one of the treatment data set and the control data set. The weight value may include a local false non-discovery rate. The method may further include calculating, with a seventh processor, an approximate distribution of the activity measurements of the nodes over the model or over a mechanistic hypothesis in the model; calculating, with an eighth processor, an expected value of the activity measurements for the approximate distribution; and generating, with a ninth processor, for each computational model, a score representing the perturbation of the subset of the biological system in response to the agent, wherein the score is based on the expected value. The approximate distribution may be based on the activity measurements. In some implementations, calculating the expected value may include performing a rectangular approximation. The method may further include calculating, with a tenth processor, a positive activation metric and a negative activation metric based on the activity measurements, the positive and negative activation metrics representing, respectively, consistency and inconsistency between the activity measurements and the direction values with respect to the model; and generating, with an eleventh processor, for each computational model, a score representing the perturbation of the subset of the biological system in response to the agent, wherein the score is based on the positive and negative activation metrics. The positive activation metric, the negative activation metric, or both may be based on a local false non-discovery rate. The activity measurements may be fold-change values, and the fold-change value for each node may include a logarithm of the difference between the treatment data and the control data for the biological entity represented by the respective node.
The subset of the biological system may include at least one of a cell proliferation mechanism, a cell stress mechanism, a cell inflammation mechanism, and a DNA repair mechanism. The agent may include at least one of an aerosol generated by heating tobacco, an aerosol generated by burning tobacco, tobacco smoke, or cigarette smoke. The agent may comprise a heterogeneous substance, including molecules or entities that are not present within or derived from the biological system. The agent can include toxins, therapeutic compounds, irritants, relaxants, natural products, manufactured products, and food materials. The treatment data set may include a plurality of treatment data sets such that each measurable node has a plurality of fold-change values defined by a first probability distribution and a plurality of weight values defined by a second probability distribution. The treatment data set may include a plurality of treatment data sets such that each measurable node has a plurality of fold-change values and corresponding weight values. The step of generating a score may comprise a linear or non-linear combination of the activity measurements, weight values, and direction values, and normalization of the combination by a scaling factor. The combination may be an arithmetic combination, and the scaling factor may be the square root of the number of biological entities from which measurement data is received. The score may be generated by a geometric perturbation index scoring technique, a probabilistic perturbation index scoring technique, or an expected perturbation index scoring technique. The method may further include determining a confidence interval for the score based on a parametric or non-parametric computational bootstrapping technique.
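The combination and scaling described above can be illustrated with a short sketch. It assumes the simplest reading of the text: a linear combination in which each node contributes direction × weight × activity, divided by the square root of the number of measured nodes. The function name `perturbation_score` and its argument layout are hypothetical and are not taken from the patent.

```python
import math
from typing import Sequence


def perturbation_score(activities: Sequence[float],
                       weights: Sequence[float],
                       directions: Sequence[int]) -> float:
    """Direction-signed, weighted sum of node activities, scaled by sqrt(N).

    Assumption: a linear combination; the text also allows non-linear ones.
    """
    if not (len(activities) == len(weights) == len(directions)):
        raise ValueError("all inputs need one entry per measured node")
    n = len(activities)
    if n == 0:
        raise ValueError("at least one measured node is required")
    # Each node contributes its activity, signed by the expected direction
    # of change and scaled by the confidence-style weight.
    total = sum(d * w * a for a, w, d in zip(activities, weights, directions))
    # Scaling factor: square root of the number of measured entities.
    return total / math.sqrt(n)
```

For example, four nodes with activities `[1.0, -2.0, 0.5, 1.5]`, weights `[1.0, 0.5, 0.0, 1.0]`, and directions `[1, -1, 1, 1]` give a combined sum of 3.5 and a scaled score of 1.75. A node with weight 0 contributes nothing to the sum but still counts toward N in this sketch.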
In another aspect, a computer system for quantifying a perturbation of a biological system in response to an agent is also described. The system comprises at least one processor configured or adapted to: receive a treatment data set corresponding to a response of a biological system to an agent, wherein the biological system comprises a plurality of biological entities, each biological entity interacting with at least one other biological entity; receive a control data set corresponding to the biological system not exposed to the agent; provide a computational causal network model representing the biological system, the computational causal network model including: nodes representing the biological entities, edges representing relationships between the biological entities, and direction values for the nodes representing expected directions of change between the control data and the treatment data; calculate activity measurements for the nodes representing a difference between the treatment data and the control data; calculate weight values for the nodes, wherein at least one weight value is different from at least one other weight value; and generate a score for the computational model representing the perturbation of the biological system in response to the agent, wherein the score is based on the direction values, the weight values, and the activity measurements. The biological system may be represented by at least one mechanistic hypothesis. The biological system can be represented by a plurality of computational causal network models or by at least one computational causal network model comprising a plurality of mechanistic hypotheses. The computer system may further be configured to normalize the scores based on the number of measurable nodes in the respective computational model. The weight values may represent a confidence in at least one of the treatment data set and the control data set.
The weight value may include a local false non-discovery rate. In some implementations, the computer system is further configured to calculate an approximate distribution of the activity measurements of the nodes over the model or over a mechanistic hypothesis in the model; calculate an expected value of the activity measurements for the approximate distribution; and generate, for each computational model, a score representing the perturbation of the subset of the biological system in response to the agent, wherein the score is based on the expected value. The approximate distribution may be based on the activity measurements. In some implementations of the computer system, calculating the expected value may include performing a rectangular approximation. The system may further be configured to calculate a positive activation metric and a negative activation metric based on the activity measurements, the positive and negative activation metrics representing, respectively, consistency and inconsistency between the activity measurements and the direction values with respect to the model; and to generate, for each computational model, a score representing the perturbation of the subset of the biological system in response to the agent, wherein the score is based on the positive and negative activation metrics. The positive activation metric, the negative activation metric, or both may be based on a local false non-discovery rate. The activity measurements may be fold-change values, and the fold-change value for each node may include a logarithm of the difference between the treatment data and the control data for the biological entity represented by the respective node. The subset of the biological system may include at least one of a cell proliferation mechanism, a cell stress mechanism, a cell inflammation mechanism, and a DNA repair mechanism.
The agent may include at least one of an aerosol generated by heating tobacco, an aerosol generated by burning tobacco, tobacco smoke, or cigarette smoke. The agent may comprise a heterogeneous substance, including molecules or entities that are not present within or derived from the biological system. The agent can include toxins, therapeutic compounds, irritants, relaxants, natural products, manufactured products, and food materials. The treatment data set may include a plurality of treatment data sets such that each measurable node has a plurality of fold-change values defined by a first probability distribution and a plurality of weight values defined by a second probability distribution. The treatment data set may include a plurality of treatment data sets such that each measurable node has a plurality of fold-change values and corresponding weight values. The step of generating a score may comprise a linear or non-linear combination of the activity measurements, weight values, and direction values, and normalization of the combination by a scaling factor. The combination may be an arithmetic combination, and the scaling factor may be the square root of the number of biological entities from which measurement data is received. The score may be generated by a geometric perturbation index scoring technique, a probabilistic perturbation index scoring technique, or an expected perturbation index scoring technique. The system may also determine a confidence interval for the score based on a parametric or non-parametric computational bootstrapping technique.
In certain aspects, a computerized method may include receiving, at a first processor, a treatment data set corresponding to a response of a biological system to an agent, wherein the biological system includes a plurality of biological entities, each biological entity interacting with at least one other biological entity, and receiving, at a second processor, a control data set corresponding to the biological system not exposed to the agent. The computerized method may include providing, at a third processor, a computational causal network model representative of the biological system. The computational model may include nodes representing the biological entities, edges representing relationships between the biological entities, and direction values for the nodes representing expected directions of change between the control data and the treatment data. The computerized method may further include calculating, with a fourth processor, activity measurements for the nodes representing a difference between the treatment data and the control data, and calculating, with a fifth processor, weight values for the nodes, wherein at least one weight value is different from at least one other weight value. The computerized method may further include generating, with a sixth processor, a score for the computational model representing a perturbation of the biological system in response to the agent, wherein the score is based on the direction values, the weight values, and the activity measurements. In certain implementations, the computerized method further includes normalizing the scores based on the number of nodes in the respective computational model. In some implementations, each of the first through sixth processors is contained within a single processor or a single computing device. In other implementations, one or more of the first through sixth processors are distributed across multiple processors or computing devices.
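One possible in-memory representation of such a model is sketched below. The class name `CausalNetworkModel`, the signed-edge encoding, and the `measured_nodes` helper are illustrative assumptions; the text only requires nodes, edges, and a direction value per node.

```python
from dataclasses import dataclass, field


@dataclass
class CausalNetworkModel:
    # direction[node] is +1 if the node is expected to increase under
    # treatment relative to control, and -1 if it is expected to decrease.
    direction: dict[str, int] = field(default_factory=dict)
    # Each edge is (source, target, sign). The sign (+1 "increases",
    # -1 "decreases") is an assumed encoding of the causal relationship.
    edges: list[tuple[str, str, int]] = field(default_factory=list)

    def add_node(self, name: str, direction: int) -> None:
        if direction not in (1, -1):
            raise ValueError("direction must be +1 or -1")
        self.direction[name] = direction

    def add_edge(self, source: str, target: str, sign: int) -> None:
        if source not in self.direction or target not in self.direction:
            raise KeyError("both endpoints must be added as nodes first")
        if sign not in (1, -1):
            raise ValueError("sign must be +1 or -1")
        self.edges.append((source, target, sign))

    def measured_nodes(self, measurements: dict[str, float]) -> list[str]:
        """Nodes of the model for which an activity measurement exists."""
        return [n for n in self.direction if n in measurements]
```

Keeping the direction values on the nodes, separate from the edges, mirrors the wording above, where the score depends on per-node direction values rather than on edge traversal.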
In certain implementations, the computational causal network model includes a set of causal relationships that exist between nodes representing possible causes and nodes representing measured quantities. In such implementations, the activity measurements may include fold changes. A fold change may be a number describing how much a node measurement changes from an initial value to a final value between the control data and the treatment data. The fold-change value may represent the logarithm of the fold change in the activity of the biological entity between the control condition and the treatment condition. The activity measurement for each node may include a logarithm of the difference between the treatment data and the control data for the biological entity represented by the respective node. In such implementations, the weight value may represent a weight to be given to the fold-change value of the node. The weight values may represent the known biological importance of the measured node with respect to the feature or outcome of interest (e.g., a known carcinogen in cancer research). The weight values may represent a confidence in at least one of the treatment data set and the control data set. More particularly, the weight value may include a local false non-discovery rate. In such implementations, the computerized method may generate a score for the computational model by multiplying the activity measurements by the weight values and direction values and summing over the nodes. In certain implementations, the computerized method includes generating, with the processor, a confidence interval for each generated score. Generating the confidence interval may include approximating a distribution of the generated scores.
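As a minimal sketch of the fold-change activity measurement, the function below takes the fold change to be the ratio of mean treatment level to mean control level and uses a base-2 logarithm. Both choices are assumptions made for illustration; the text specifies only "the logarithm of the fold change". Under this reading, the log fold change equals the difference between the log-transformed treatment and control levels.

```python
import math
from statistics import mean
from typing import Sequence


def log2_fold_change(treatment: Sequence[float],
                     control: Sequence[float]) -> float:
    """Log2 ratio of mean treatment level to mean control level.

    Positive values mean the entity is more active under treatment,
    negative values mean it is less active.
    """
    t, c = mean(treatment), mean(control)
    if t <= 0 or c <= 0:
        raise ValueError("measurement levels must be positive to take a log")
    return math.log2(t / c)
```

With replicate levels of 8.0 under treatment and 2.0 under control, the function returns 2.0, i.e. a four-fold increase. The resulting per-node values would then be multiplied by the node's weight and direction value and summed, as described above.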
In another aspect, the systems and methods described herein relate to a computerized method for quantifying a perturbation of a biological system in response to an agent. The computerized method may include receiving, at a first processor, a treatment data set corresponding to a response of a biological system to an agent, wherein the biological system includes a plurality of biological entities, each biological entity interacting with at least one other biological entity, and receiving, at a second processor, a control data set corresponding to the biological system not exposed to the agent. The computerized method may include providing, at a third processor, a computational causal network model representative of the biological system. The computational model may include nodes representing the biological entities, edges representing relationships between the biological entities, and direction values for the nodes representing expected directions of change between the control data and the treatment data. The computerized method may further include calculating, with a fourth processor, activity measurements for the nodes representing differences between the treatment data and the control data, and calculating, with a fifth processor, an approximate distribution of the activity measurements over the nodes. The computerized method may further include calculating, with a sixth processor, an expected value of the approximate distribution. The computerized method may further include generating, with a seventh processor, for each computational model, a score representing the perturbation of the subset of the biological system in response to the agent, wherein the score is based on the expected value. In some implementations, the first through seventh processors are each contained within a single processor or a single computing device.
In other implementations, one or more of the first through seventh processors are distributed across multiple processors or computing devices.
In certain implementations, the computational causal network model includes a set of causal relationships that exist between nodes representing possible causes and nodes representing measured quantities. In such implementations, the activity measurements may include fold changes. A fold change may be a number describing how much a node measurement changes from an initial value to a final value between the control data and the treatment data. The fold-change value may represent the logarithm of the fold change in the activity of the biological entity between the control condition and the treatment condition. The computerized method may include generating, with the processor, a range for the fold-change density, which may represent an approximation of the set of values that the fold-change values can take in the biological system under the treatment conditions. The processor may generate an approximate fold-change density, which may consist of an approximate probability distribution of the fold-change values. In such implementations, the computerized method further includes calculating an approximate expected value of the approximate fold-change density. The computerized method may generate a score for the computational model based on the calculated expected value.
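The expected value of an approximate density can be computed numerically. The sketch below uses a midpoint rectangle rule over a fixed range; how the approximate fold-change density itself is built is not specified here, so `density` is passed in as a hypothetical callable, and the renormalization by the integrated mass is an added safeguard rather than something the text prescribes.

```python
from typing import Callable


def expected_value_rectangle(density: Callable[[float], float],
                             lower: float,
                             upper: float,
                             bins: int = 1000) -> float:
    """Approximate E[X] = integral of x * f(x) dx with midpoint rectangles."""
    if upper <= lower or bins <= 0:
        raise ValueError("need upper > lower and a positive number of bins")
    width = (upper - lower) / bins
    mass = 0.0    # approximates the integral of f(x)
    moment = 0.0  # approximates the integral of x * f(x)
    for i in range(bins):
        x = lower + (i + 0.5) * width  # midpoint of the i-th rectangle
        p = density(x) * width         # probability mass of the rectangle
        mass += p
        moment += x * p
    if mass <= 0:
        raise ValueError("density has no mass on the given range")
    # Renormalize so an unnormalized density still yields a proper mean.
    return moment / mass
```

For a uniform density on the range 0 to 2 this returns approximately 1, and for the density f(x) = 2x on the range 0 to 1 it returns approximately 2/3.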
In some implementations, the approximate distribution can be based substantially on the activity measurements. Additionally and optionally, calculating the expected value may include a rectangular approximation. In certain implementations, the computerized method includes generating, with the processor, a confidence interval for each generated score. Generating the confidence interval may include performing a parametric bootstrapping technique.
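A confidence interval of this kind can be sketched with resampling. The example below uses a non-parametric percentile bootstrap over replicates, which is a swapped-in simplification of the parametric technique mentioned above, and it scores with an unweighted direction-signed sum scaled by the square root of the node count. The function names, the replicate layout (one list of log-scale node values per replicate), and the defaults are all assumptions.

```python
import math
import random
from statistics import mean


def score(treatment, control, directions):
    """Direction-signed sum of mean log-scale differences, scaled by sqrt(N)."""
    n = len(directions)
    diffs = [mean(r[i] for r in treatment) - mean(r[i] for r in control)
             for i in range(n)]
    return sum(d * x for d, x in zip(directions, diffs)) / math.sqrt(n)


def bootstrap_ci(treatment, control, directions,
                 n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval obtained by resampling replicates."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    stats = []
    for _ in range(n_boot):
        # Resample whole replicates with replacement within each group.
        t = rng.choices(treatment, k=len(treatment))
        c = rng.choices(control, k=len(control))
        stats.append(score(t, c, directions))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi
```

A parametric variant would instead draw each resampled value from a fitted distribution, for example a normal distribution with the observed mean and variance of each node, before recomputing the score.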
In yet another aspect, the systems and methods described herein relate to computerized methods for quantifying a perturbation of a biological system in response to an agent. The computerized method may include receiving, at a first processor, a treatment data set corresponding to a response of a biological system to an agent, wherein the biological system includes a plurality of biological entities, each biological entity interacting with at least one other biological entity, and receiving, at a second processor, a control data set corresponding to the biological system not exposed to the agent. The computerized method may include providing, at a third processor, a computational causal network model representative of the biological system. The computational model may include nodes representing the biological entities, edges representing relationships between the biological entities, and direction values for the nodes representing expected directions of change between the control data and the treatment data. The computerized method may further include calculating, with a fourth processor, activity measurements for the nodes representing a difference between the treatment data and the control data, and calculating, with a fifth processor, a positive activation score and a negative activation score based on the activity measurements, the positive and negative activation scores representing, respectively, consistency and inconsistency between the activity measurements and the direction values. The computerized method may further comprise generating, with a sixth processor, for each computational model, a score representing the perturbation of the subset of the biological system in response to the agent, wherein the score is based on the positive and negative activation scores. In some implementations, each of the first through sixth processors is contained within a single processor or a single computing device.
In other implementations, one or more of the first through sixth processors are distributed across multiple processors or computing devices.
In certain implementations, the computational causal network model includes a set of causal relationships that exist between nodes representing possible causes and nodes representing measured quantities. In such implementations, the activity measurements may include fold changes. A fold change may be a number describing how much a node measurement changes from an initial value to a final value between the control data and the treatment data. The fold-change value may represent the logarithm of the fold change in the activity of the biological entity between the control condition and the treatment condition. The computerized method may include generating, with the processor, a range for the fold-change density, which may represent an approximation of the set of values that the fold-change values can take in the biological system under the treatment conditions. The computerized method may include calculating, with the processor, a positive activation score based on the fold-change value and the direction value. Positive and negative activation scores may indicate whether the observed activation or inhibition of the biological entity is consistent or inconsistent with the expected direction of change. In one example, a positive activation score is a probability that a direction value is consistent with an activity measurement. A negative activation score may be a probability that the direction value is inconsistent with the activity measurement. The computerized method may further include generating a score for the computational model by combining the positive and negative activation scores. In some implementations, the score is based on a local false non-discovery rate.
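To make the idea of positive and negative activation scores concrete, the sketch below treats each node's true log fold change as normally distributed around the observed value with a known standard error, and takes the positive activation score as the probability that its sign matches the direction value. The normal model, the standard-error input, and the averaging used to combine the two scores are all assumptions for illustration; they are not the combination the patent defines.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def activation_probabilities(fold_change: float,
                             std_error: float,
                             direction: int) -> tuple[float, float]:
    """(positive, negative) activation scores for a single node."""
    if std_error <= 0:
        raise ValueError("std_error must be positive")
    if direction not in (1, -1):
        raise ValueError("direction must be +1 or -1")
    # Probability that the true change has the expected sign.
    positive = normal_cdf(direction * fold_change / std_error)
    return positive, 1.0 - positive


def activation_score(fold_changes, std_errors, directions) -> float:
    """Mean of (positive - negative) over nodes; lies between -1 and 1."""
    diffs = []
    for fc, se, d in zip(fold_changes, std_errors, directions):
        pos, neg = activation_probabilities(fc, se, d)
        diffs.append(pos - neg)
    if not diffs:
        raise ValueError("at least one node is required")
    return sum(diffs) / len(diffs)
```

Under these assumptions, a node with an observed fold change of zero scores 0.5 for both metrics, and a model whose nodes all move strongly in their expected directions scores close to 1.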
In certain implementations, the subset of the biological system includes at least one of a cell proliferation mechanism, a cell stress mechanism, a cell inflammation mechanism, and a DNA repair mechanism. The agent may include at least one of an aerosol generated by heating tobacco, an aerosol generated by burning tobacco, tobacco smoke, or cigarette smoke. Agents may include cadmium, mercury, chromium, nicotine, and tobacco-specific nitrosamines and their metabolites (4-(methylnitrosamino)-1-(3-pyridyl)-1-butanone (NNK), N'-nitrosonornicotine (NNN), N-nitrosoanatabine (NAT), N-nitrosoanabasine (NAB), and 4-(methylnitrosamino)-1-(3-pyridyl)-1-butanol (NNAL)). In certain implementations, the agent includes a product for nicotine replacement therapy. The agent may comprise a heterogeneous substance, including molecules or entities that are not present within or derived from the biological system. The agent can also include toxins, therapeutic compounds, irritants, relaxants, natural products, manufactured products, and food materials. In certain implementations, the treatment data set includes a plurality of treatment data sets corresponding to certain nodes of the biological network model, where each such node corresponds to a plurality of fold-change values defined by a first probability distribution and a plurality of weight values defined by a second probability distribution.
In yet another aspect, the systems and methods described herein relate to a computerized method and one or more computer processors for quantifying a perturbation of a biological system in response to an agent. The computerized method may include receiving, at a first processor, a treatment data set corresponding to a response of a biological system to an agent, wherein the biological system includes a plurality of biological entities, each biological entity interacting with at least one other biological entity, and receiving, at a second processor, a control data set corresponding to the biological system not exposed to the agent. The computerized method may include providing, at a third processor, a computational causal network model representative of the biological system. The computational model may include nodes representing the biological entities, edges representing relationships between the biological entities, and direction values for the nodes representing expected directions of change between the control data and the treatment data. The computerized method may further include calculating, with a fourth processor, activity measurements for the nodes representing differences between the treatment data and the control data. The computerized method may further include generating, with a fifth processor, a score for the computational model representing the perturbation of the biological system in response to the agent, wherein the score is based on the direction values and the activity measurements. In certain implementations, the computerized method further includes normalizing the scores based on the number of nodes in the respective computational model. The computerized method may further include generating, with a sixth processor, a confidence interval for each generated score.
Generating the confidence interval may include approximating a distribution of the generated scores, and a t statistic may be derived from a variance of the approximated distribution of the generated scores. In some implementations, each of the first through sixth processors is contained within a single processor or a single computing device. In other implementations, one or more of the first through sixth processors are distributed across multiple processors or computing devices.
The computerized methods described herein may be implemented in a computerized system having one or more computing devices, each comprising one or more processors. Generally, the computerized systems described herein may include one or more engines containing one or more processing devices, e.g., computers, microprocessors, logic devices, or other devices or processors, configured with hardware, firmware, and software to perform one or more of the computerized methods described herein. In certain implementations, the computerized system includes a system response profile engine, a network modeling engine, and a network scoring engine. The engines may be interconnected from time to time, and may also be connected from time to time with one or more databases, including perturbation databases, measurables databases, experimental data databases, and literature databases. The computerized systems described herein may include a distributed computerized system having one or more processors and engines that communicate through a network interface. Such implementations may be suitable for distributed computing over a variety of communication systems. In another aspect, a computer program product comprising program code adapted to perform the methods described herein is also described. In another aspect, a computer or computer-recordable medium or device comprising the computer program product is also described.
Drawings
Further features of the disclosure, as well as its nature and various advantages, will become apparent upon consideration of the following detailed description, taken in conjunction with the accompanying drawings, in which like reference characters refer to the same parts throughout the drawings, and in which:
FIG. 1 is a block diagram of an exemplary computerized system for quantifying a response of a biological network to a perturbation.
FIG. 2 is a flow diagram of an exemplary process for quantifying the response of a biological network to a perturbation by calculating a Network Perturbation Amplitude (NPA) score.
FIG. 3 is a graphical representation of data underlying a system response profile, including data for two agents, two parameters, and N biological entities.
FIG. 4 is a diagram of a computational model of a biological network having several biological entities and their relationships.
FIG. 5 is a flow diagram of an exemplary process for generating a Geometric Perturbation Index (GPI) score.
FIG. 6 is a flow diagram of an exemplary process for generating a Probabilistic Perturbation Index (PPI) score.
FIG. 7 is a flow diagram of an exemplary process for generating an Expected Perturbation Index (EPI) score.
FIG. 8 is a flow diagram of an exemplary process for generating a confidence interval for a Geometric Perturbation Index (GPI) score.
FIG. 9 illustrates a biological network model analyzed with the systems and methods disclosed herein.
FIGS. 10-14 show the results of Network Perturbation Amplitude (NPA) scoring for network-based biological mechanisms.
FIG. 15 is a block diagram of an exemplary distributed computerized system for quantifying the effects of biological perturbations; and
FIG. 16 is a block diagram of an exemplary computing device that may be used to implement any of the components in any of the computerized systems described herein.
Detailed Description
The word "comprising" or "comprises" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. Described herein are computational systems and methods for quantitatively evaluating the magnitude of changes within a biological system when the biological system is perturbed by an agent. Certain implementations include a method for calculating a numerical value representing the magnitude of change within a portion of a biological system. The calculation uses as input a data set obtained from a controlled set of experiments in which the biological system is perturbed by an agent. The data is then applied to a network model of a feature of the biological system. The network model is used as the basis for simulation and analysis and represents the biological mechanisms and pathways that enable a feature of interest in the biological system. This feature, or some of its mechanisms and pathways, may contribute to the pathology of a disease and to adverse health effects on the biological system. Prior knowledge of the biological system represented in a database is used to construct the network model, which is populated with data regarding the state of numerous biological entities under various conditions, including under normal conditions and under perturbation by an agent. The network model used is dynamic in that it represents the state changes of various biological entities in response to a perturbation and enables quantitative and objective assessment of the effect of an agent on a biological system. Computer systems for running these computational methods are also provided.
The numerical values generated by the computerized methods of the present invention can be used to determine, among other things, the magnitude of desired or adverse biological effects caused by manufactured products (for safety evaluation or comparison), therapeutic compounds including nutritional supplements (for determining efficacy or health benefits), and environmentally active substances (for predicting the risks of long-term exposure and the relationship to adverse effects and onset of disease).
In one aspect, the systems and methods described herein provide a computed numerical value representing the magnitude of change in a perturbed biological system based on a network model of the perturbed biological mechanism. These numerical values, referred to herein as Network Perturbation Amplitude (NPA) scores, can be used to summarize the state changes of various entities in a defined biological mechanism. The numerical values obtained for different agents or different types of perturbations can be used to compare the relative effects of different agents or perturbations on a biological mechanism that enables, or manifests itself as, a feature of a biological system. Thus, the NPA score can be used to measure the response of a biological mechanism to different perturbations. The term "score" is used generically herein to refer to a value or set of values that provides a quantitative measure of the magnitude of change in a biological system. Such scores are calculated from one or more data sets obtained from a sample or subject, using any of a variety of mathematical and computational algorithms known in the art and in accordance with the methods disclosed herein.
NPA scores can help researchers and clinicians improve diagnosis, experimental design, treatment decisions, and risk assessment. For example, NPA scores can be used to screen a set of candidate biological mechanisms for toxicological analysis to identify those that are likely to be affected by exposure to potentially harmful agents. By providing a measure of network response to perturbations, these NPA scores can allow molecular events (measured by experimental data) to be correlated with phenotypic or biological outcomes occurring at a cellular, tissue, organ, or organism level. The NPA value can be used by a clinician to compare the biological mechanisms affected by an agent to the physiological condition of the patient to determine what health risks or benefits the patient is likely to experience when exposed to the agent (e.g., a patient with compromised immune function may be particularly susceptible to agents that result in a strong immunosuppressive response).
FIG. 1 is a block diagram of a computerized system 100 for quantifying a response of a network model to a perturbation. In particular, the system 100 includes a system response profile engine 110, a network modeling engine 112, and a network scoring engine 114. The engines 110, 112 and 114 are interconnected from time to time and are also connected from time to time with one or more databases, including the perturbation database 102, the measurables database 104, the experimental data database 106 and the literature database 108. As used herein, an engine includes one or more processing devices, such as a computer, microprocessor, logic device, or one or more other devices described with reference to FIG. 16, configured with hardware, firmware, and software to perform one or more computing operations.
FIG. 2 is a flow diagram of a process 200 for quantifying a response of a biological network to a perturbation by calculating a Network Perturbation Amplitude (NPA) score, according to one implementation. The steps of process 200 will be described as being performed by the various components of system 100 of FIG. 1, but any of these steps may be performed by any suitable hardware or software component (local or remote) and may be arranged in any suitable order or performed in parallel. At step 210, the system response profile (SRP) engine 110 receives biological data from a variety of different sources, and the data itself may be of a variety of different types. The data includes data from experiments in which the biological system is perturbed, as well as control data. At step 212, the SRP engine 110 generates system response profiles (SRPs) representing the degree to which one or more entities in the biological system change in response to an agent being introduced into the biological system. At step 214, the network modeling engine 112 provides one or more databases containing a plurality of network models, one of which is selected as being relevant to the agent or feature of interest. The selection can be made on the basis of existing knowledge of the mechanisms underlying the biological functions of the system. In certain implementations, the network modeling engine 112 may use the system response profiles, the networks in the database, and networks previously described in the literature to extract causal relationships between entities within the system, thereby generating, refining, or augmenting the network model. At step 216, the network scoring engine 114 generates an NPA score for each perturbation using the network identified by the network modeling engine 112 at step 214 and the SRPs generated by the SRP engine 110 at step 212.
The NPA score quantifies the biological response (represented by the SRPs) to a perturbation or treatment in the context of the underlying relationships (represented by the network) between biological entities. The following description is divided into subsections for clarity of disclosure and not by way of limitation.
A. Biological system
A biological system in the context of the present invention is an organism or a part of an organism, including functional parts, the organism being referred to herein as a subject. The subject is typically a mammal, including a human. The subject can be an individual human in a human population. The term "mammal" as used herein includes, but is not limited to, humans, non-human primates, mice, dogs, cats, cows, sheep, horses, and pigs. Mammals other than humans can advantageously be used as subjects that provide models of human disease. The non-human subject can be an unmodified animal, a transgenic animal, a genetically modified animal, or an animal carrying one or more gene mutations or silenced genes. The subject can be male or female. Depending on the purpose of the operation, the subject can be one that has been exposed to the agent of interest. The subject can be one that has been exposed to the agent for an extended period of time, optionally including time prior to the present study. The subject can be one that has been exposed to the agent for a period of time but is no longer in contact with the agent. The subject can be one that has been diagnosed or identified as having a disease. The subject can be one that has undergone or is undergoing treatment for a disease or adverse health condition. The subject may also be one that exhibits one or more symptoms of, or risk factors for, a particular health condition or disease. The subject can be one that has been infected but shows no symptoms of the disease. In certain implementations, the disease or health condition in question is associated with exposure to an agent or use of an agent for an extended period of time. According to certain implementations, the system 100 (FIG. 1) contains or generates computerized models of one or more biological systems and their functional mechanisms (collectively, "biological networks" or "network models") that are relevant to the type or outcome of perturbation of interest.
Depending on the context of operation, a biological system can be defined at different levels as it relates to the function of an individual organism in a population, an organism generally, an organ, a tissue, a cell type, an organelle, a cellular component, or the cells of a specific individual. Each biological system includes one or more biological mechanisms or pathways whose operation is manifested as a functional feature of the system. Animal systems that reproduce defined features of a human health condition and are suitable for exposure to an agent of interest are preferred biological systems. Cellular and organotypic systems that reflect the cell types and tissues involved in the etiology or pathology of a disease are likewise preferred biological systems. Priority can be given to primary cell or organ cultures that recapitulate human biology in vivo as closely as possible. It is also important to match the in vitro human cell culture with the most equivalent culture derived from the in vivo animal model. This allows the matched in vitro systems to be used as reference systems to build a translational continuum from animal models in vivo to human biology. Thus, a biological system contemplated for use with the systems and methods described herein can be defined by, without limitation, a functional feature (biological function, physiological function, or cellular function), organelle, cell type, tissue type, organ, stage of development, or a combination of the foregoing. Examples of biological systems include, but are not limited to, the pulmonary, integumentary, skeletal, muscular, nervous (central and peripheral), endocrine, cardiovascular, immune, circulatory, respiratory, urinary, renal, gastrointestinal, colorectal, hepatic and reproductive systems.
Other examples of biological systems include, but are not limited to, various cellular functions in epithelial cells, nerve cells, blood cells, connective tissue cells, smooth muscle cells, skeletal muscle cells, adipocytes, ovum cells, sperm cells, stem cells, lung cells, brain cells, cardiac muscle cells, laryngeal cells, pharyngeal cells, esophageal cells, stomach cells, kidney cells, liver cells, breast cells, prostate cells, pancreatic islet cells, testicular cells, bladder cells, cervical cells, uterine cells, colon cells, and rectal cells. Certain cells may be cells of a cell line, cultured in vitro or maintained indefinitely in vitro under appropriate culture conditions. Examples of cellular functions include, but are not limited to, cell proliferation (e.g., cell division), degeneration, regeneration, senescence, control of cellular activity by the nucleus, cell-to-cell signaling, cell differentiation, cell de-differentiation, secretion, migration, phagocytosis, repair, apoptosis, and developmental programming. Examples of cellular components that can be considered biological systems include, but are not limited to, cytoplasm, cytoskeleton, membranes, ribosomes, mitochondria, nuclei, endoplasmic reticulum (ER), Golgi apparatus, lysosomes, DNA, RNA, proteins, peptidoglycans, and antibodies.
B. Perturbation
A perturbation in a biological system can be caused by one or more agents through exposure to, or contact with, one or more parts of the biological system over a period of time. An agent can be a single substance or a mixture of substances, including mixtures in which not all constituents are identified or characterized. The chemical and physical properties of the agent or its constituents may not be fully characterized. An agent can be defined by its structure, its constituents, or a source that under certain conditions will produce the agent. One example of an agent is a heterologous substance, i.e., a molecule or entity that is not present within or derived from the biological system, together with any intermediates or metabolites produced from it after contact with the biological system. The agent can be a carbohydrate, protein, lipid, nucleic acid, alkaloid, vitamin, metal, heavy metal, mineral, oxygen, ion, enzyme, hormone, neurotransmitter, inorganic compound, organic compound, environmental agent, microorganism, particle, environmental condition, environmental force, or physical force. Non-limiting examples of agents include nutrients, metabolic wastes, poisons, drugs, toxins, therapeutic compounds, stimulants, relaxants, natural products, manufactured products, food substances, pathogens (prions, viruses, bacteria, fungi, protozoa), particles or entities having a size in the micrometer range or below, by-products of the foregoing, and mixtures of the foregoing. Non-limiting examples of physical agents include radiation, electromagnetic waves (including sunlight), increases or decreases in temperature, shear forces, fluid pressure, electrical discharge or its consequences, and trauma.
Some agents do not perturb a biological system unless they reach a threshold concentration, remain in contact with the biological system for a period of time, or both. Exposure to or contact with an agent that results in a perturbation may be quantified in terms of dosage. Thus, a perturbation can be caused by prolonged exposure to the agent. The duration of exposure can be expressed in units of time, by the frequency of exposure, or as a percentage of time within the actual or estimated lifetime of the subject. A perturbation can also be caused by withholding an agent (as described above) from one or more parts of the biological system or by limiting the supply of the agent to one or more parts of the biological system. For example, a perturbation can result from a reduced supply or absence of nutrients, water, carbohydrates, proteins, lipids, alkaloids, vitamins, minerals, oxygen, ions, enzymes, hormones, neurotransmitters, antibodies, cytokines, or light, from restricting the movement of certain parts of the organism, or from constraining or requiring exercise.
The agent may cause different perturbations depending on which part(s) of the biological system are exposed and on the exposure conditions. Non-limiting examples of agents may include any of an aerosol generated by heating tobacco, an aerosol generated by burning tobacco, tobacco smoke, or cigarette smoke, as well as gaseous or particulate constituents thereof. Further non-limiting examples of agents include cadmium, mercury, chromium, nicotine, tobacco-specific nitrosamines and their metabolites (4-(methylnitrosamino)-1-(3-pyridyl)-1-butanone (NNK), N'-nitrosonornicotine (NNN), N-nitrosoanatabine (NAT), N-nitrosoanabasine (NAB), and 4-(methylnitrosamino)-1-(3-pyridyl)-1-butanol (NNAL)), as well as any product used in nicotine replacement therapy. The exposure regimen for an agent or complex stimulus should reflect the range and circumstances of exposure in everyday settings. A set of standard exposure regimens can be designed for systematic application to equally well-defined experimental systems. Each assay can be designed to collect time- and dose-dependent data to capture early and late events and to ensure coverage of a representative dose range. However, it will be understood by those skilled in the art that the systems and methods described herein may be adapted and modified to suit the application being addressed, that the systems and methods described herein may be used in other suitable applications, and that such other additions and modifications do not depart from the scope of the present invention.
In various implementations, high-throughput system-wide measurements of gene expression, protein expression or turnover, microRNA expression or turnover, post-translational modifications, protein modifications, translocations, antibody production, metabolite profiles, or a combination of two or more of the foregoing are generated under various conditions, including the respective controls. Functional outcome measurements are desirable in the methods described herein because they can generally serve as anchors for the assessment and represent clear steps in disease etiology.
As used herein, "sample" refers to any biological sample (e.g., a cell, tissue, organ, or whole animal) that is isolated from a subject or an experimental system. The sample can include, without limitation, a single cell or a plurality of cells, a cellular component, a tissue biopsy, excised tissue, a tissue extract, a tissue culture extract, a tissue culture medium, exhaled breath, whole blood, platelets, serum, plasma, red blood cells, white blood cells, lymphocytes, neutrophils, macrophages, B cells or subsets thereof, T cells or subsets thereof, hematopoietic cell subsets, endothelial cells, synovial fluid, lymphatic fluid, ascites fluid, interstitial fluid, bone marrow, cerebrospinal fluid, pleural fluid, tumor infiltrates, saliva, mucus, sputum, semen, sweat, urine, or any other bodily fluid. Samples can be obtained from a subject by methods including, but not limited to, venipuncture, drainage, biopsy, needle aspiration, lavage, scraping, surgical resection, or other methods known in the art.
During operation, for a given biological mechanism, outcome, perturbation, or combination of the foregoing, the system 100 can generate a network perturbation amplitude (NPA) value, which is a quantitative measure of the changes in state of the biological entities in the network in response to a treatment condition.
The system 100 (FIG. 1) includes one or more computerized network models related to a health condition, disease, or biological outcome of interest. One or more of these network models are based on existing biological knowledge and can be uploaded from external sources and curated within the system 100. The models can also be generated de novo within the system 100 based on measurements. The measurable elements are thus integrated into the biological network models using prior knowledge. Described below are the types of data, representing changes in a biological system of interest or responses to perturbations, that can be used to generate or refine a network model.
Referring to FIG. 2, at step 210, the system response profile (SRP) engine 110 receives biological data. The SRP engine 110 may receive the data from a variety of different sources, and the data itself may be of a variety of different types. The biological data used by the SRP engine 110 may be obtained from the literature, from databases (including data from preclinical, clinical, and post-clinical trials of pharmaceutical products or medical devices), from genomic databases (genomic sequences and expression data, e.g., the Gene Expression Omnibus of the National Center for Biotechnology Information or ArrayExpress of the European Bioinformatics Institute (Parkinson et al., 2010, Nucl. Acids Res., doi:10.1093/nar/gkq1040, PubMed ID 21071405)), from commercially available databases (e.g., Gene Logic of Gaithersburg, MD, USA), or from experimental work. The data may include raw data from one or more different sources, e.g., in vitro, ex vivo, or in vivo experiments using one or more species, specifically designed to study the effects of particular treatment conditions or exposure to particular agents. In vitro experimental systems may include tissue cultures or organotypic cultures (three-dimensional cultures) that represent key aspects of human disease. In such implementations, the agent dosages and exposure regimens used in these experiments may substantially reflect the range and circumstances of exposure that may be expected for humans during normal use or activity conditions, or during special use or activity conditions. Experimental parameters and test conditions may be selected as desired to reflect the nature of the agent and the exposure conditions, the molecules and pathways of the biological system in question, the cell types and tissues involved, the outcome of interest, and aspects of disease etiology.
Molecules, cells or tissues derived from a particular animal model can be matched to a particular human molecule, cell or tissue culture to improve the interpretability of animal-based findings.
The data received by the SRP engine 110, much of which is generated by high-throughput experimental techniques, includes, but is not limited to, data relating to nucleic acids (e.g., methylation patterns as determined by sequencing, by specific hybridization to nucleic acids on a microarray, by quantitative polymerase chain reaction, or by other techniques known in the art), proteins/peptides (e.g., absolute or relative amounts of protein, specific fragments of protein, peptidoglycan, changes in secondary or tertiary structure, or post-translational modifications as determined by methods known in the art), and functional activities under certain conditions (e.g., enzymatic activity, proteolytic activity, transcriptional regulatory activity, transport activity, binding affinity for certain binding partners). Modifications, including post-translational modifications of proteins or peptides, can include, but are not limited to, methylation, acetylation, farnesylation, biotinylation, stearoylation, formylation, myristoylation, palmitoylation, geranylgeranylation, pegylation, phosphorylation, sulfation, glycosylation, sugar modification, lipidation, lipid modification, ubiquitination, sumoylation, disulfide bonding, cysteinylation, oxidation, glutathionylation, carboxylation, glucuronidation, and deamidation. In addition, proteins can also be post-translationally modified by a series of reactions, for example, the Amadori reaction, the Schiff base reaction, and the Maillard reaction, which produce glycated protein products.
The data may also include measured functional outcomes such as, but not limited to, functional outcomes at the cellular level, including cell proliferation, developmental fate, and cell death, and functional outcomes at the physiological level, including lung capacity, blood pressure, and exercise proficiency. The data may also include measurements of disease activity or severity, such as, but not limited to, tumor metastasis, tumor remission, loss of function, and life expectancy at a given stage of the disease. Disease activity can be measured by a clinical assessment whose outcome is a value or set of values obtained under defined conditions from the evaluation of a sample (or population of samples) from one or more subjects. The clinical assessment can also be based on answers provided by the subject in an interview or questionnaire.
Such data may be generated expressly for use in determining a system response profile, or may have been generated in previous experiments or published in the literature. Generally, the data includes information related to a molecule, biological structure, physiological condition, genetic trait, or phenotype. In certain implementations, the data includes a description of the condition, location, amount, activity, or substructure of a molecule, biological structure, physiological condition, genetic trait, or phenotype. As will be described later, in a clinical setting, the data may include raw or processed data obtained from assays performed on samples obtained from human subjects, or from observations of human subjects exposed to the agent.
At step 212, the system response profile (SRP) engine 110 generates system response profiles (SRPs) based on the biological data received at step 210. This step may include one or more of background correction, normalization, fold-change calculation, significance determination, and identification of a differential response (e.g., differentially expressed genes). An SRP is a representation of the degree to which one or more measured entities (e.g., molecules, nucleic acids, peptides, proteins, cells, etc.) within a biological system individually change in response to a perturbation (e.g., exposure to an agent) applied to the biological system. In one example, to generate an SRP, the SRP engine 110 collects a set of measurements for a given set of parameters (e.g., treatment or perturbation conditions) applied to a given experimental system (a "system-treatment" pair). FIG. 3 shows two SRPs: SRP 302, which contains biological activity data for N different biological entities subjected to a first treatment 306 with varying parameters (e.g., dosage and time of exposure to a first treatment agent), and an analogous SRP 304, which contains biological activity data for N different biological entities subjected to a second treatment 308. The data contained within an SRP may be raw experimental data, processed experimental data (e.g., filtered to remove outliers, labeled with confidence estimates, averaged over multiple trials), data generated by a computational biological model, or data taken from the scientific literature. An SRP can represent data in a number of ways, such as absolute values, absolute changes, fold changes, logarithmic changes, functions, and tables. The SRP engine 110 passes the SRPs to the network modeling engine 112.
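As a minimal sketch of the fold-change calculation mentioned for step 212, the following Python snippet builds an SRP as a table of log2 fold changes between treatment and control replicates. The function name `system_response_profile`, the use of replicate means, and the example entities and values are illustrative assumptions; the snippet also assumes strictly positive measurements and omits background correction, normalization, and significance testing.

```python
from math import log2
from statistics import mean


def system_response_profile(treatment, control):
    """Map each measured entity to the log2 fold change of treatment over control."""
    profile = {}
    for entity, treated_values in treatment.items():
        # Average the replicates first, then take the log of the ratio.
        profile[entity] = log2(mean(treated_values) / mean(control[entity]))
    return profile


# Hypothetical replicate measurements for two entities in one system-treatment pair.
treatment = {"gene_a": [8.0, 8.0], "gene_b": [1.0, 3.0]}
control = {"gene_a": [2.0, 2.0], "gene_b": [4.0, 4.0]}

srp = system_response_profile(treatment, control)
# gene_a is up-regulated (positive value), gene_b is down-regulated (negative value).
```

A logarithmic representation treats a doubling and a halving symmetrically, which is one reason it is a common choice among the representations listed above.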
Although the SRPs derived in the previous step represent the experimental data from which the magnitude of network perturbation will be determined, it is the biological network model that forms the basis for computation and analysis. This analysis requires the development of detailed network models of the mechanisms and pathways related to a feature of the biological system. Such a framework provides a layer of mechanistic understanding beyond the examination of gene lists used in more classical gene expression analysis. A network model of a biological system is a mathematical construct that represents a dynamic biological system and is built by assembling quantitative information about various basic properties of the biological system.
Constructing such a network is an iterative process. Delineation of the network boundaries is guided by literature investigation of the mechanisms and pathways related to the process of interest (e.g., cell proliferation in the lung). The causal relationships describing these pathways are taken from prior knowledge to populate the network. Literature-based networks can be validated using high-throughput data sets containing relevant phenotypic endpoints. The SRP engine 110 can be used to analyze the data sets, and the results of this analysis can be used to validate, refine, or generate network models.
C. Network
Referring to FIG. 2, at step 214, the network modeling engine 112 uses the system response profiles from the SRP engine 110 with a network model based on the mechanisms or pathways underlying a feature of the biological system of interest. In certain aspects, the network modeling engine 112 is used to identify networks that have already been generated based on SRPs. The network modeling engine 112 may contain components for receiving updates and changes to the models. The network modeling engine 112 may also iterate the process of network generation, incorporating new data and generating additional or refined network models. The network modeling engine 112 may also facilitate the merging of one or more data sets or the merging of one or more networks. The set of networks taken from a database can be manually supplemented with additional nodes, edges, or entirely new networks (e.g., by mining the literature for additional genes that are directly regulated by a particular biological entity). These networks contain features that may allow process scoring to be performed. The network topology is maintained, so that causal relationships can be traced from any point in the network to measurable entities. Furthermore, the models are dynamic: the assumptions used to build them can be modified or restated, which allows adaptation to different tissue contexts and species. This allows for iterative testing and improvement as new knowledge becomes available. The network modeling engine 112 may remove nodes or edges that have low confidence or that conflict with experimental results in the scientific literature. The network modeling engine 112 may also add nodes or edges that are inferred using supervised or unsupervised learning methods (e.g., metric learning, matrix completion, pattern recognition).
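The removal of low-confidence edges described above can be sketched as a simple filter. The edge representation (dictionaries with `source`, `target`, and `confidence` keys), the numeric confidence scale, and the threshold are assumptions made for illustration; the disclosure does not specify how confidence is encoded.

```python
def prune_low_confidence_edges(edges, min_confidence):
    """Keep edges at or above the threshold and return the nodes they still connect."""
    kept = [edge for edge in edges if edge["confidence"] >= min_confidence]
    nodes = sorted({edge["source"] for edge in kept} | {edge["target"] for edge in kept})
    return nodes, kept


# Hypothetical causal edges with an assumed confidence value between 0 and 1.
edges = [
    {"source": "A", "target": "B", "confidence": 0.9},
    {"source": "B", "target": "C", "confidence": 0.2},
    {"source": "A", "target": "D", "confidence": 0.6},
]

nodes, kept = prune_low_confidence_edges(edges, min_confidence=0.5)
# Node "C" drops out because its only edge fell below the threshold.
```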
In certain aspects, a biological system is modeled as a mathematical graph consisting of vertices (or nodes) and edges connecting the nodes. For example, FIG. 4 shows a simple network 400 with 9 nodes (including nodes 402 and 404) and edges (including edges 406 and 408). A node can represent a biological entity or process in a biological system, such as, but not limited to, a compound, DNA, RNA, gene, protein, peptidoglycan, antibody, cell, tissue, organ, or a cellular or molecular process. The biological entities are not necessarily limited to those for which treatment or control data is received or available. Thus, the biological entities represented by the nodes can include the plurality of biological entities for which data is received, and may include one or more additional biological entities. At least some of the nodes are scorable, and the scores may represent activity levels of the nodes. Many nodes represent biological entities whose activity levels can be measured. However, in some implementations, the computerized method does not necessarily receive data for all such measurable nodes. Thus, the nodes are scorable and/or measurable. In some implementations, most nodes are measurable. The measurable nodes may contain measured data. Edges in the graph can represent various relationships between nodes. For example, an edge may represent a relationship of "binds to", "is expressed in", "is co-regulated with, based on expression profiling", "inhibits", "co-occurs in a manuscript with", or "shares a structural element with". Generally, these types of relationships describe a relationship between a pair of nodes. The nodes in the graph can also represent relationships between nodes. Thus, relationships between relationships, or between a relationship and another type of biological entity represented in the graph, may be represented. For example, a relationship between two nodes representing chemicals may represent a reaction.
The reaction may be a node in the relationship between the reaction and the chemical used to inhibit the reaction.
The graph may be undirected, meaning that there is no distinction between the two vertices associated with each edge. Alternatively, the edges of the graph may be directed from one vertex to another. For example, in a biological context, transcriptional regulatory networks and metabolic networks can be modeled as directed graphs. In a graph model of a transcriptional regulatory network, the nodes represent genes and the edges represent the transcriptional relationships between them. As another example, protein-protein interaction networks describe direct physical interactions between the proteins in the proteome of an organism, and there is generally no direction associated with the interactions in such networks. Thus, these networks can be modeled as undirected graphs. Some networks may have both directed and undirected edges. The entities and relationships (i.e., nodes and edges) that make up the graph may be stored as a network of related nodes in a database in system 100.
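As a minimal sketch of the graph representation described above, the following Python stores nodes and typed edges, some directed and some undirected. The `Edge` and `BioGraph` names, and the example entities, are hypothetical and are not part of the described system 100.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    relation: str           # e.g. "inhibits", "binds to"
    directed: bool = True   # False for symmetric relationships such as binding


@dataclass
class BioGraph:
    """Hypothetical container for a mixed directed/undirected network."""
    nodes: set = field(default_factory=set)
    edges: list = field(default_factory=list)

    def add_edge(self, source, target, relation, directed=True):
        self.nodes.update((source, target))
        self.edges.append(Edge(source, target, relation, directed))

    def neighbors(self, node):
        """Nodes reachable from `node` in one step, honoring edge direction."""
        reachable = set()
        for edge in self.edges:
            if edge.source == node:
                reachable.add(edge.target)
            elif edge.target == node and not edge.directed:
                reachable.add(edge.source)
        return reachable


graph = BioGraph()
# Directed edge: a transcriptional regulatory relationship.
graph.add_edge("TF_A", "gene_B", "transcriptionally regulates")
# Undirected edge: a physical protein-protein interaction.
graph.add_edge("protein_C", "protein_D", "binds to", directed=False)
```

In this sketch an undirected edge can be traversed from either end, while a directed edge can only be followed from its source to its target.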
The knowledge represented in the database can be of various types, taken from various sources. For example, certain data may represent a genomic database, including information about genes and relationships between them. In such an example, a node may represent an oncogene and another node connected to the oncogene node may represent a gene for suppressing the oncogene. The data may represent proteins and their relationships, diseases and their interrelationships, and various disease states. There are many different types of data that can be incorporated into a graphical representation. The computational model may represent a network of relationships between nodes representing knowledge in, for example, a DNA dataset, an RNA dataset, a protein dataset, an antibody dataset, a cell dataset, a tissue dataset, an organ dataset, a medical dataset, epidemiological data, a chemical dataset, a toxicology dataset, a patient dataset, and a demographic dataset. As used herein, a data set is a collection of numerical values resulting from the evaluation of a sample (or a group of samples) under defined conditions. The data set can be obtained by, for example, experimentally measuring a quantifiable entity of the sample; or alternatively, from a service provider (e.g., laboratory, clinical research organization), or from a public or proprietary database. The data sets may contain data and biological entities represented by nodes, and the nodes in each data set may be related to other nodes in the same data set or nodes in other data sets. Moreover, the network modeling engine 112 may generate computational models for representing genetic information in a dataset, such as DNA, RNA, protein, or antibodies, as medical information in a medical dataset, as information about individual patients in a patient dataset, and as information about an entire population in an epidemiological dataset. 
In addition to the various data sets described above, there may be many other data sets, or types of biological information that may be included or contained in generating the computational model. For example, the database can further include or contain medical record data, structural/activity relationship data, information about infectious pathologies, information about clinical trials, exposure pattern data, data related to usage history of the product, and any other type of life sciences related information.
The network modeling engine 112 may generate one or more network models representing, for example, regulatory interactions between genes, interactions between proteins, or complex biochemical interactions within a cell or tissue. The networks generated by the network modeling engine 112 may include static models and dynamic models. The network modeling engine 112 may represent the system using any applicable mathematical scheme, such as hypergraphs and weighted bipartite graphs, in which two types of nodes are used to represent reactions and compounds. The network modeling engine 112 may also use other inference techniques to generate a network model, for example, analysis of the over-representation of functionally related genes among the differentially expressed genes, Bayesian network analysis, graphical Gaussian model techniques, or gene association network techniques, to identify relevant biological networks based on a set of experimental data (e.g., gene expression, metabolite concentrations, cellular responses, etc.). The biological system may be represented by a plurality of network models, including a computational causal network model.
As described above, the network model is based on the mechanisms and pathways that underlie the functional features of a biological system. The network modeling engine 112 can generate or contain models representing outcomes for features of a biological system that are relevant to the study of the long-term health risks or health benefits of an agent. Thus, the network modeling engine 112 may generate or contain network models of various mechanisms of cell function, particularly those mechanisms that relate or contribute to a feature of interest in a biological system, including (but not limited to) cell proliferation, cell stress, cell regeneration, apoptosis, DNA damage/repair, or inflammatory responses. In other embodiments, the network modeling engine 112 may contain or generate computational models related to acute systemic toxicity, carcinogenicity, dermal penetration, cardiovascular disease, pulmonary disease, ecotoxicity, eye irritation/corrosion, genotoxicity, immunotoxicity, neurotoxicity, pharmacokinetics, drug metabolism, organ toxicity, reproductive and developmental toxicity, skin irritation/corrosion, or skin sensitization. In general, the network modeling engine 112 may contain or generate computational models for the states of nucleic acids (DNA, RNA, SNPs, siRNAs, miRNAs, RNAi), proteins, peptidoglycans, antibodies, cells, tissues, organs, and any other biological entities and their respective interactions. In one example, computational network models can be used to represent the state of the immune system and the functioning of various types of leukocytes during an immune response or inflammatory response. In other examples, computational network models can be used to represent the performance of the cardiovascular system and the function and metabolism of endothelial cells.
In certain implementations of the invention, the network is derived from a database of causal biological knowledge. The database may be generated by performing experimental studies on different biological mechanisms to extract relationships (e.g., activation or inhibition relationships) between the mechanisms, some of which may be causal relationships, and may be combined with commercially available databases, such as the Genstruct Technology Platform or the Selventa Knowledgebase, curated by Selventa Inc. of Cambridge, Massachusetts, USA. Using the database of causal biological knowledge, the network modeling engine 112 may identify a network that links the perturbations 102 with the measurables 104. In certain implementations, the network modeling engine 112 uses the systems response profiles from the SRP engine 110 and networks previously generated in the literature to extract causal relationships between biological entities. The database may be further processed, among other processing steps, to remove logical inconsistencies and to generate new biological knowledge by applying homology reasoning between different sets of biological entities.
In certain implementations, the network model extracted from the database is based on Reverse Causal Reasoning (RCR), an automated reasoning technique that processes networks of causal relationships to formulate mechanistic hypotheses and then evaluates those hypotheses against data sets of differential measurements. Each mechanistic hypothesis links a biological entity to the measurable quantities that it can influence. At least one mechanistic hypothesis (e.g., a plurality of mechanistic hypotheses) may be formulated. For example, the measurable quantities can include, among other things, an increase or decrease in the concentration, number, or relative abundance of a biological entity, an activation or inhibition of a biological entity, or a change in the structure, function, or logic of a biological entity. RCR uses a directed network of experimentally observed causal interactions between biological entities as the basis for computation. The directed network may be expressed in Biological Expression Language™ (BEL™), a syntax for recording the interrelationships between biological entities. The RCR computation specifies certain constraints for network model generation, such as, but not limited to, path length (the maximum number of edges connecting an upstream node to a downstream node) and the possible causal paths connecting an upstream node to a downstream node. The output of RCR is a set of mechanistic hypotheses representing upstream controllers of the differences in the experimental measurements, ranked by statistics that evaluate relevance and accuracy. The mechanistic hypotheses output by RCR can be assembled into causal chains and larger networks to interpret the data set at a higher level of interconnected mechanisms and pathways.
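The path-length constraint mentioned above can be illustrated with a short sketch. It is not an implementation of RCR; it only enumerates the directed paths from an upstream node to a downstream node that use at most a given number of edges. The function name `causal_paths` and the toy edge list are assumptions made for illustration.

```python
def causal_paths(edges, upstream, downstream, max_edges):
    """Enumerate simple directed paths from `upstream` to `downstream`
    that contain at most `max_edges` edges.

    edges: iterable of (source, target) pairs of a directed network.
    """
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, []).append(target)

    paths = []

    def walk(node, path):
        if node == downstream:
            paths.append(path)
            return
        if len(path) - 1 >= max_edges:   # path already uses max_edges edges
            return
        for successor in adjacency.get(node, []):
            if successor not in path:    # keep paths simple (no revisits)
                walk(successor, path + [successor])

    walk(upstream, [upstream])
    return paths


toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
short_paths = causal_paths(toy_edges, "A", "D", max_edges=2)
all_paths = causal_paths(toy_edges, "A", "D", max_edges=3)
```

With a limit of two edges only the path through A, C, and D qualifies; raising the limit to three also admits the longer path through B.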
One type of mechanistic hypothesis comprises a set of causal relationships between a node representing a potential cause (the upstream node, or controller) and nodes representing measured quantities (the downstream nodes). Such a mechanistic hypothesis can be used to make predictions: for example, if the abundance of the entity represented by the upstream node increases, then downstream nodes linked by a causal-increase relationship are inferred to increase, and downstream nodes linked by a causal-decrease relationship are inferred to decrease.
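The prediction rule above can be sketched in a few lines. The encoding of relationships as +1 (causal increase) and -1 (causal decrease), the function name, and the gene labels are assumptions for illustration; the symmetric handling of an upstream decrease is also an assumption, since the text only spells out the upstream-increase case.

```python
def predict_downstream(upstream_change, relations):
    """Predict the direction of change of each downstream node.

    upstream_change: +1 if the upstream entity increases, -1 if it decreases.
    relations: dict mapping downstream node -> +1 (causal increase)
               or -1 (causal decrease).
    """
    if upstream_change not in (1, -1):
        raise ValueError("upstream_change must be +1 or -1")
    return {node: upstream_change * sign for node, sign in relations.items()}


# Hypothetical mechanistic hypothesis with two downstream genes.
hypothesis = {"gene_X": +1, "gene_Y": -1}
predicted_on_increase = predict_downstream(+1, hypothesis)
predicted_on_decrease = predict_downstream(-1, hypothesis)
```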
A mechanistic hypothesis represents a relationship between a set of measured data (e.g., gene expression data) and a biological entity that is a known controller of those genes. These relationships also include the sign (positive or negative) of the effect of the upstream entity on the differential expression of the downstream genes. The downstream genes of a mechanistic hypothesis are taken from a database of causal biological knowledge curated from the literature. The causal relationships of the mechanistic hypotheses that link upstream entities to downstream genes, in the form of a computable causal network model, are the basis for calculating network changes by the NPA scoring method. A biological system may be represented by at least one mechanistic hypothesis (e.g., a plurality of mechanistic hypotheses). The at least one computational causal network model may include a plurality of mechanistic hypotheses.
A complex causal network model of a biological entity can be converted into a single scorable causal network model by collecting the individual mechanistic hypotheses representing the entities in the model and regrouping all of their downstream genes under connections to a single upstream process that represents the entire complex causal network model; this is in effect a flattening of the underlying graph structure. In this way, changes in the activity of the biological entities described by the network model can be assessed via the combination of their individual mechanistic hypotheses, such that the underlying gene expression measurements contribute to the network as a whole.
To generate a scorable network for use in the method of the invention, a reference node is first selected from an initial, typically complex, causal network model. The reference node can be any entity in the network whose level or activity is positively correlated with the activity of the network as a whole (as opposed to, for example, an inhibitor, whose activity may be negatively correlated with the activity of the network). Then, the causal relationship between each node in the model and the reference node is determined. This can be done by first requiring the model to be "causally consistent". The sign of the modulation of the downstream measurable entities (gene expression in this example) for each node in the model is adjusted based on the relationship between that node and the reference node. For example, the signs of the downstream gene expressions of a model node having a positive causal relationship with the reference node (i.e., the node is expected to be up-regulated when the reference node increases) are retained. On the other hand, the signs of the downstream gene expressions of a model node having a negative causal relationship with the reference node (i.e., the node is expected to be down-regulated when the reference node increases) are reversed. All downstream gene expressions and their signs are then combined into a single mechanistic hypothesis, and downstream gene expressions that appear with opposite signs (from multiple model nodes) are removed from the mechanistic hypothesis.
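The sign adjustment and merging just described can be sketched as follows. The data layout (dictionaries of +1/-1 signs), the function name `flatten_hypotheses`, and the node and gene labels are hypothetical and serve only to illustrate the procedure under those assumptions.

```python
def flatten_hypotheses(node_to_reference_sign, node_downstream):
    """Merge per-node hypotheses into one hypothesis relative to a reference node.

    node_to_reference_sign: node -> +1 or -1, the causal relationship of the
        node to the reference node.
    node_downstream: node -> {gene: +1 or -1}, the expected sign of regulation
        of each downstream gene by that node.
    Returns {gene: sign} with genes of conflicting sign removed.
    """
    merged = {}
    conflicting = set()
    for node, genes in node_downstream.items():
        relation = node_to_reference_sign[node]
        for gene, sign in genes.items():
            adjusted = sign * relation   # retained if +1, reversed if -1
            if gene in merged and merged[gene] != adjusted:
                conflicting.add(gene)
            merged.setdefault(gene, adjusted)
    for gene in conflicting:             # drop genes seen with opposite signs
        del merged[gene]
    return merged


reference_signs = {"ref": +1, "inhibitor": -1}
downstream = {
    "ref": {"g1": +1, "g2": -1},
    "inhibitor": {"g2": +1, "g3": +1, "g1": +1},
}
single_hypothesis = flatten_hypotheses(reference_signs, downstream)
```

In this example the gene `g1` is supported with sign +1 by the reference node and with sign -1 (after reversal) by the inhibitor, so it is removed; `g2` and `g3` remain with sign -1.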
For a network model to be causally consistent, given an increase of any node in the model, it should be possible to unambiguously map a sign of "positive regulation" or "negative regulation" onto every other node in the model by following the causal relationships between connected nodes. Biological interpretation can be used to resolve ambiguities when constructing a causally consistent model, by considering which process is being scored by the mechanistic hypothesis and with what sign each node is actually related to the reference node. For example, a node at which a negative feedback loop connects back into the model has a particular relationship to the process being scored, and although the negative feedback may regulate that node, it should not change that relationship. Thus, the connection between the negative feedback loop and the node can be removed from the model to obtain causal consistency in a manner consistent with known facts. Variations on the above-described methods are discussed in U.S. patent application publication nos. 2007/0225956 and 2009/0099784, which are incorporated herein by reference in their entirety. Exemplary causal network models are described in Westra JW, Schlage WK, Frushour BP, Gebel S, Catlett NL, Han W, Eddy SF, Hengstermann A, Matthews AL, Mathis C, et al., "Construction of a computable cell proliferation network focused on non-diseased lung cells", BMC Syst Biol 2011, 5:105, which is incorporated herein by reference in its entirety.
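One possible way to test causal consistency is sketched below: signs are propagated outward from the reference node, and the model is flagged as inconsistent if any node would receive both signs. Treating each signed edge as traversable in both directions for the purpose of sign propagation is an assumption of this sketch, as are the function name and the toy network; the source does not prescribe a particular algorithm.

```python
from collections import deque


def check_causal_consistency(edges, reference):
    """Propagate signs from `reference` and report whether they are unambiguous.

    edges: iterable of (source, target, sign), sign = +1 (increases)
           or -1 (decreases).
    Returns (signs, consistent), where signs maps each reachable node to its
    sign relative to the reference node.
    """
    adjacency = {}
    for source, target, sign in edges:
        adjacency.setdefault(source, []).append((target, sign))
        adjacency.setdefault(target, []).append((source, sign))

    signs = {reference: 1}
    consistent = True
    queue = deque([reference])
    while queue:
        node = queue.popleft()
        for neighbor, sign in adjacency.get(node, []):
            implied = signs[node] * sign
            if neighbor not in signs:
                signs[neighbor] = implied
                queue.append(neighbor)
            elif signs[neighbor] != implied:
                consistent = False       # two paths imply opposite signs
    return signs, consistent


# A toy model in which node C feeds back negatively onto the reference node.
with_feedback = [("ref", "B", +1), ("B", "C", +1), ("C", "ref", -1)]
without_feedback = with_feedback[:-1]

_, consistent_before = check_causal_consistency(with_feedback, "ref")
signs_after, consistent_after = check_causal_consistency(without_feedback, "ref")
```

In this toy model, removing the negative feedback connection restores consistency, in line with the resolution strategy described above.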
In certain implementations, the system 100 may contain or generate a computerized model for cell proliferation mechanisms when the cells have been exposed to cigarette smoke. In such instances, the system 100 may also contain or generate one or more network models representing various health conditions associated with cigarette smoke exposure, including (but not limited to) cancer, pulmonary disease, and cardiovascular disease. In certain aspects, these network models are based on at least one of applied perturbations (e.g., exposure to an agent), responses under various conditions, measurable quantities of interest, results being studied (e.g., cell proliferation, cellular stress, inflammation, DNA repair), experimental data, clinical data, epidemiological data, and literature.
As an illustrative example, the network modeling engine 112 may be configured to generate a network model of cellular stress. The network modeling engine 112 may receive networks describing the relevant mechanisms involved in the stress response known from a literature database. The network modeling engine 112 may select one or more networks based on the biological mechanisms known to operate in response to stress in pulmonary and cardiovascular contexts. In some implementations, the network modeling engine 112 identifies one or more functional units in the biological system and builds a larger network model by combining smaller networks based on their functionality. In particular, for a cellular stress model, the network modeling engine 112 can consider functional units associated with responses to oxidative stress, genotoxic stress, hypoxic stress, osmotic stress, exogenous stress, and shear stress. Thus, the network building blocks for the cellular stress model may include exogenous metabolic response, genotoxic stress, endothelial shear stress, hypoxic response, osmotic stress, and oxidative stress. The network modeling engine 112 may also receive content from computational analysis of publicly available transcriptomic data from stress-relevant experiments performed in specific cell contexts.
When generating a network model of a biological mechanism, the network modeling engine 112 may apply one or more rules. Such rules may include rules for selecting network content, node types, and the like. The network modeling engine 112 may select one or more data sets from the experimental data database 106, including a combination of in vitro and in vivo experimental results. The network modeling engine 112 may use the experimental data to verify nodes and edges identified in the literature. In the example of modeling cellular stress, the network modeling engine 112 may select data sets based on how well each experiment represents physiologically relevant stress in non-diseased lung or cardiovascular tissue. The selection of data sets may be based on, for example, the availability of phenotypic stress endpoint data, the statistical rigor of the gene expression profiling experiments, and the relevance of the experimental context to normal, non-diseased lung or cardiovascular biology.
After identifying the collection of related networks, the network modeling engine 112 may also process and refine those networks. For example, in some implementations, multiple biological entities and their connections may be grouped and represented by one or more new nodes (e.g., using clustering or other techniques).
The network modeling engine 112 may also include or contain descriptive information about the identified nodes and edges in the network. A node may be described by its associated biological entity, an indication of whether the associated biological entity is a measurable quantity, or any other descriptor of the biological entity. Some nodes are scorable, and the score may represent the activity level of the node. Many nodes represent biological entities whose activity levels can be measured. However, in some implementations, the computerized method does not necessarily receive data for all such measurable nodes. Thus, the nodes are scorable and/or measurable. In some implementations, most nodes are measurable. Measurable nodes may contain or include measured data. An edge may be described by, for example, the type of relationship it represents (e.g., causal (e.g., up-or down-regulated), relevance, conditional dependent or independent), the strength of the relationship, or a statistical confidence in the relationship. In some implementations, for each process, each node representing a measurable entity is associated with an expected direction of activity change (i.e., increase or decrease) in response to the process. For example, when bronchial epithelial cells are exposed to an agent such as Tumor Necrosis Factor (TNF), the activity of a particular gene may increase. This increase may occur due to direct regulatory relationships known from the literature (and represented by one network identified by the network modeling engine 112), or by tracking numerous regulatory relationships (e.g., autocrine signaling) via edges of one or more networks identified by the network modeling engine 112. In some cases, network modeling engine 112 may identify the expected direction in which each measurable entity changes in response to a particular disturbance. 
When different paths in the network indicate opposite expected directions of change for a particular entity, the two paths may be examined in more detail to determine the net direction of change, or measurements for that particular entity may be discarded. In some embodiments, the direction value of a node may represent an expected direction of change between control data and treatment data. In some embodiments, the direction value of a node may represent an expected change in value between control data and treatment data. In some embodiments, the direction value of a node may represent an expected increase or decrease in value between the control data and the treatment data. Suitably, the change represents a change following treatment.
D. Network Perturbation Amplitude
The computational methods and systems provided herein translate SRPs into Network Perturbation Amplitude (NPA) scores. Experimental measurements identified as downstream effects of a perturbation within the network model are aggregated into a network-specific response score. Thus, at step 216, the network scoring engine 114 generates an NPA score for each perturbation using the networks identified by the network modeling engine 112 at step 214 and the SRPs generated by the SRP engine 110 at step 212. NPA scoring applies one or more defined algorithms to an experimental data set consisting of a series of treatment-control comparisons, in which the experimental data is filtered to represent a particular scope of biology (e.g., a particular set of gene expression relationships) in the context of the defined biological network model. The NPA score quantifies the biological response to a treatment (represented by the SRPs) in the context of the underlying relationships between biological entities (represented by the identified networks). The network scoring engine 114 includes hardware and software components for generating an NPA score for each network included in or identified by the network modeling engine 112.
The network scoring engine 114 may be configured to implement any of a number of scoring techniques. Such techniques include techniques for generating scalar value scores. Such techniques also include techniques for generating a vector value score. The vector value score indicates the size and topological distribution of the network's response to the disturbance.
One described scoring technique is a strength (intensity) scoring technique. The strength score is a scalar-valued score: the mean of the direction-adjusted activity observations for the different entities represented in the SRP. The strength of the network response is calculated as follows:

$$\text{Strength} = \frac{1}{N}\sum_{i=1}^{N} d_i\,\beta_i$$

where $d_i$ represents the expected direction of change of activity of the entity associated with node i, $\beta_i$ represents the logarithm of the fold change in activity between the treatment and control conditions (a fold change being a number describing how much a quantity changes from an initial value to a final value), and N is the number of nodes having associated measured biological entities. A positive strength score indicates that the SRP matches the expected activity changes derived from the identified network, while a negative strength score indicates that the SRP does not match the expected activity changes.
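Reading the strength score as the mean of the direction-signed log fold changes over the N measured nodes, a minimal sketch is given below. The function name and the example values are hypothetical.

```python
def strength_score(directions, log_fold_changes):
    """Mean of d_i * beta_i over the N measured nodes.

    directions: expected directions of change, +1 or -1 per node.
    log_fold_changes: log fold change per node (treatment vs. control).
    """
    if not directions or len(directions) != len(log_fold_changes):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(directions)
    return sum(d * b for d, b in zip(directions, log_fold_changes)) / n


# Hypothetical data: all three observations agree with the expected direction.
d = [+1, -1, +1]
beta = [1.0, -0.5, 0.3]
score = strength_score(d, beta)

# Observations that oppose the expected directions give a negative score.
opposed = strength_score([+1, -1], [-1.0, 1.0])
```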
FIG. 5 is a flow diagram 500 of a Geometric Perturbation Index (GPI) scoring technique that may be implemented by the network scoring engine 114. At step 502, the network scoring engine 114 assembles a fold-change vector β. A fold change is a number describing how much a measurable changes from an initial value to a final value under different conditions (e.g., between perturbation and control conditions). The fold-change vector β has N components, one for each measured biological entity. In some implementations, the component $\beta_i$ represents the logarithm (e.g., base 2) of the fold change of the i-th measured biological entity's activity between perturbation and control conditions (i.e., the logarithm of the factor by which the entity's activity changes between the two conditions). In such an implementation, a $\beta_i$ value of 0 indicates that no change in activity is observed between the perturbation and control conditions. In other implementations, $\beta_i$ represents the fold change in activity between the perturbation and control conditions without the logarithmic operation; in such an implementation, a $\beta_i$ value of 1 indicates that no change in activity is observed between the perturbation and control conditions. It should be understood that fold change is only one possible way of quantifying activity for the network scoring techniques described herein, and other conventional means of expressing changes in measurables may be used. In some embodiments, the step of generating a score may include a linear or non-linear combination of activity measures, weight values, and direction values, and normalization of the combination by a scale factor. The combination may be an arithmetic combination, and the scale factor may be the square root of the number of biological entities for which measured data is received. In some embodiments, the score is not a scalar-valued score.
At step 504, the network scoring engine 114 generates a weight vector r, again having N components, one for each component of the fold-change vector β. The component $r_i$ represents the weight to be given to the observed i-th fold change $\beta_i$. In some implementations, the weight represents a known biological significance of the i-th measured entity with respect to a feature or outcome of interest (e.g., a known carcinogen in a cancer study). In some implementations, the weight represents a confidence in the activity measure of the biological entity associated with node i. Improved laboratory conditions, an increased number of biological replicates, better reproducibility, smaller variance, and stronger signal may all contribute to a higher confidence in a particular $\beta_i$.
One value that can advantageously be used for weighting is the local false non-discovery rate $fndr_i$ (i.e., the probability that the fold-change value $\beta_i$ represents a departure from the underlying null hypothesis of zero fold change, in some cases conditional on the observed p-value), as described by Strimmer et al. in "A general modular framework for gene set enrichment analysis" (BMC Bioinformatics 10:47, 2009) and by Strimmer in "A unified approach to false discovery rate estimation" (BMC Bioinformatics 9:303, 2008), both of which are incorporated herein by reference in their entirety. In certain implementations, $fndr_i$ is calculated as follows:

$$fndr_i = 1 - fdr_i = 1 - v_i(\beta_1,\ldots,\beta_N)\; p\big(\beta_i,\, S_i(\beta_1,\ldots,\beta_N)\big) \qquad (1)$$

where $fdr_i$ is the local false discovery rate (i.e., the probability that the fold-change value $\beta_i$ does not represent a departure from the underlying null hypothesis of zero fold change), $v_i$ is the Benjamini-Hochberg adjustment described by Benjamini et al. in "Controlling the false discovery rate: a practical and powerful approach to multiple testing", Journal of the Royal Statistical Society, Series B 57:289, 1995, which is incorporated herein by reference in its entirety, p is the probability of obtaining a fold change at least as extreme as the actually observed $\beta_i$ (assuming the null hypothesis of zero fold change is true), evaluated under $t_{df}$, the t distribution with df degrees of freedom. Note that p is a function of $\beta_i$ and the standard deviation $S_i$, and that the standard deviation $S_i$ is based on all of the $\beta_i$. In an alternative implementation, no adjustment is made for multiple testing; therefore $v_i(\beta_1,\ldots,\beta_N)$ equals 1, and the components of the weight vector are $r_i = 1 - p(\beta_i, S_i(\beta_1,\ldots,\beta_N))$.
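The weighting described above can be sketched as follows. The sketch assumes that the per-entity p-values have already been computed, and it uses standard Benjamini-Hochberg adjusted p-values as a stand-in for the Benjamini-Hochberg-adjusted quantity that is subtracted from 1. That substitution, the function names, and the example p-values are assumptions made for illustration, not the exact computation of the source.

```python
def bh_adjust(p_values):
    """Standard Benjamini-Hochberg adjusted p-values (capped at 1)."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


def fndr_weights(p_values, adjust=True):
    """Weights r_i = 1 - (adjusted) p_i.

    With adjust=False no multiple-testing adjustment is applied,
    which corresponds to r_i = 1 - p_i.
    """
    values = bh_adjust(p_values) if adjust else list(p_values)
    return [1.0 - value for value in values]


# Hypothetical p-values for four measured entities.
p = [0.01, 0.04, 0.03, 0.5]
weights_unadjusted = fndr_weights(p, adjust=False)
weights_adjusted = fndr_weights(p, adjust=True)
```

Entities with small p-values receive weights close to 1, while an entity with a p-value of 0.5 is down-weighted to 0.5 in both variants.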
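At step 506, the network scoring engine 114 scales the fold-change vector β using the weight vector r. The result is a scaled fold-change vector in which each component $\beta_i$ is multiplied by its associated weight component $r_i$. One way to achieve such scaling computationally is to create an N×N diagonal matrix with the weight components $r_i$ on the diagonal and multiply this matrix by the N×1 vector β, as shown in equation 3:

$$\operatorname{diag}(r)\,\beta = \begin{pmatrix} r_1 & & 0 \\ & \ddots & \\ 0 & & r_N \end{pmatrix} \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_N \end{pmatrix} = \begin{pmatrix} r_1\beta_1 \\ \vdots \\ r_N\beta_N \end{pmatrix} \qquad (3)$$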
At step 508, the network scoring engine 114 identifies the expected direction of change for each component of the fold-change vector β. The network scoring engine 114 may retrieve the expected directions of change from the causal biological network model by querying the network modeling engine 112. The network scoring engine 114 can then assemble these expected directions of change into a vector d of N components, where the i-th component $d_i$ represents the expected direction of change of the i-th measured biological entity (e.g., +1 for increased activity and -1 for decreased activity).
At step 510, the network scoring engine 114 combines the components of the scaled fold-change vector (generated at step 506) with the expected direction of change (identified at step 508) for each component. In some implementations, the combination is an arithmetic combination, in which each scaled fold change $r_i\beta_i$ is multiplied by its corresponding expected direction of change $d_i$ and the results are summed over all N biological entities. Mathematically, this implementation of step 510 can be represented by:

$$\sum_{i=1}^{N} d_i\, r_i\, \beta_i$$

In other implementations, the vectors d, r, and β may be combined in any linear or non-linear manner.
At step 512, the network scoring engine 114 normalizes the combination of step 510. In some implementations, the normalization includes dividing by a predetermined scale factor. One such scale factor is the square root of the number of biological entities N. In this implementation, the GPI score can be represented by the following equation:

$$\mathrm{GPI} = \frac{1}{\sqrt{N}}\sum_{i=1}^{N} d_i\, r_i\, \beta_i$$
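Steps 506 through 512 can be sketched together as follows, assuming the arithmetic combination and a normalization that divides the signed, weighted sum by the square root of N. The function name `gpi_score` and the example values are hypothetical.

```python
import math


def gpi_score(beta, weights, directions):
    """Weighted, direction-signed sum of fold changes, normalized by sqrt(N).

    beta: log fold changes, one per measured entity.
    weights: weight r_i per entity (e.g., a confidence between 0 and 1).
    directions: expected direction d_i per entity, +1 or -1.
    """
    n = len(beta)
    if n == 0 or not (len(weights) == len(directions) == n):
        raise ValueError("inputs must be non-empty and of equal length")
    scaled = [r * b for r, b in zip(weights, beta)]               # step 506
    combined = sum(d * x for d, x in zip(directions, scaled))     # step 510
    return combined / math.sqrt(n)                                # step 512


# Hypothetical data for four measured entities.
beta = [2.0, -1.0, 0.5, 0.0]
r = [1.0, 0.5, 0.8, 0.9]
d = [+1, -1, +1, -1]
gpi = gpi_score(beta, r, d)
```

Multiplying element-wise by the weights is equivalent to the diagonal-matrix product of step 506 and avoids building an N×N matrix.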
Other scale factors, which may or may not be predetermined, may also be used. In certain embodiments, the causal network model (e.g., a mechanistic hypothesis) can be viewed as a unit vector of signs $s = (1, 1, -1, 1, \ldots, -1)/\sqrt{N}$ in the N-dimensional space of downstream measurables (where each dimension represents one downstream measurable of the causal network model, here gene expression). The observed effect of a perturbation on downstream gene expression is also a vector in this space. Geometrically, therefore, the amplitude of the perturbation of the causal network model can be quantified by projecting the differential log2 expression vector onto the unit vector of the hypothesis. However, the downstream measurables of the causal network model are derived from a general model. To account for the specificity of the data supporting the NPA score, each downstream measurable is assigned a confidence of activation, which is set to the local false non-discovery rate ($fndr_i = 1 - fdr_i$). This is equivalent to weighting the dimensions of the downstream gene expression space according to the confidence in each differential expression, and thus to defining the geometry of the gene expression space by a weighted scalar product: $\langle s \mid \beta \rangle_w = s^T \operatorname{diag}(fndr)\, \beta$. Therefore, $\mathrm{GPI} = \left(\sum_i d_i \cdot fndr_i \cdot \beta_i\right)/\sqrt{N}$, where $d_i = \pm 1$ is the sign of the i-th component of s. By weighting the differential log2 expression by the false non-discovery rate, individual differential expression values with little confidence are moved closer to 0 (no change), while values with stronger confidence are minimally reduced. A positive GPI score indicates an up-regulation of the process described by the mechanistic hypothesis, a GPI score of zero indicates that the process is unchanged along the direction of the mechanistic hypothesis, and a negative GPI score indicates that the process is down-regulated.
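As a worked illustration of the weighted projection described above, consider hypothetical values chosen only for arithmetic convenience: N = 4 downstream genes with expected directions d = (+1, -1, +1, -1), weights fndr = (1.0, 0.5, 0.8, 0.9), and log2 fold changes β = (2.0, -1.0, 0.5, 0.0). Writing the hypothesis vector as s = d/√N gives:

```latex
\begin{aligned}
\langle s \mid \beta \rangle_w
  &= s^{T}\,\operatorname{diag}(fndr)\,\beta
   = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} d_i \, fndr_i \, \beta_i \\
  &= \frac{1}{\sqrt{4}} \bigl[ (+1)(1.0)(2.0) + (-1)(0.5)(-1.0)
     + (+1)(0.8)(0.5) + (-1)(0.9)(0.0) \bigr] \\
  &= \frac{2.0 + 0.5 + 0.4 + 0}{2} = 1.45
\end{aligned}
```

The second gene moves in its expected direction with a log2 fold change of magnitude 1.0 but contributes only 0.5 to the sum because its weight is 0.5, which illustrates how low-confidence measurements are pulled toward zero.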
FIG. 6 is a flow diagram 600 of a Probabilistic Perturbation Index (PPI) scoring technique that may be implemented by the network scoring engine 114. As described above with respect to the SRP engine 110 (FIG. 1) and step 212 (FIG. 2) of process 200, each SRP represents measurements of the activity (or change in activity) of biological entities under a treatment condition. Each SRP is thus associated with a number of measured activities, one for each measured biological entity. The PPI quantifies the probability that the biological mechanism represented by the network of interest is activated, given the observed SRP.
In step 602, network scoring engine 114 composes a fold change vector β. Such fold change vectors representing the observed fold changes in the activities of the N measured biological entities may be organized as described above with reference to step 502 of the Geometric Perturbation Index (GPI) scoring technique shown in fig. 5. At step 604, network scoring engine 114 generates a range of fold change densities. The range of fold change densities represents an approximation of the set of values that a fold change value can take in a biological system under the processing conditions, and may be approximated by the range [ -W, W ], where W is the theoretical maximum absolute value of the log2 fold change. By choosing W in this way, all observed fold changes will fall within the range [ -W, W ]. For example, the maximum expected signal for a gene chip (e.g., 16 on the log2 scale) can be used as the value W.
At step 606, network scoring engine 114 identifies the expected direction of change for each component of the fold change vector β. This step may be performed as described above with reference to step 508 of the GPI scoring technique shown in FIG. 5, resulting in a set of expected directions of change d_i corresponding to the observed fold changes β_i.
At step 608, the network scoring engine 114 generates a positive activation metric. In certain implementations, a positive activation metric represents the degree to which the SRP indicates that the observed activation/inhibition of the biological entities is consistent with the expected directions of change d_i. Such consistent behavior is referred to herein as "positive activation". One positive activation metric that may be used is the probability that one or more networks are positively activated. Such a probability (referred to as PPI+) may be calculated according to the following expression:
wherein
wherein fndr_i is the false non-discovery rate discussed above with reference to equation 1. In some implementations, the network scoring engine 114 is configured to numerically integrate the expression of equation 6 using a set of segments that partitions the range between 0 and W. One set of segments that may be used is [d_(i-1)β_(i-1), d_(i)β_(i)], wherein the subscript (·) denotes values taken in order from the smallest fold change to the largest, with the convention d_(0)β_(0) = 0. In such implementations, the network scoring engine 114 calculates an approximate value of the positive activation metric PPI+ as follows:
At step 610, the network scoring engine 114 generates a negative activation metric. In certain implementations, a negative activation metric represents the degree to which the SRP indicates that the observed activation/inhibition of the biological entities is inconsistent with the expected directions of change d_i. Such inconsistent behavior is referred to herein as "negative activation". One negative activation metric that may be used is the probability that one or more networks are negatively activated. Such a probability (referred to as PPI-) can be calculated according to the following expression:
wherein
wherein fndr_i is the false non-discovery rate discussed above with reference to equations 1 and 7. As discussed above with reference to the positive activation metric, in some implementations, the network scoring engine 114 is configured to numerically integrate the expression of equation 9 using a set of segments that partitions the range between -W and 0. One set of segments that may be used is [d_(i-1)β_(i-1), d_(i)β_(i)], wherein the subscript (·) denotes values taken in order from the smallest fold change to the largest, with the convention d_(0)β_(0) = 0. In such implementations, the network scoring engine 114 calculates an approximate value of the negative activation metric PPI- as follows:
At step 612, the network scoring engine 114 combines the positive activation metric (generated at step 608) with the negative activation metric (generated at step 610) to generate a composite metric referred to as the Probabilistic Perturbation Index, or PPI. The combination of step 612 can be any linear or non-linear combination. In certain implementations, the PPI is a weighted linear combination of the positive activation metric and the negative activation metric. For example, the network scoring engine 114 may be configured to generate the PPI according to the following equation:
wherein PPI+ and PPI- are the positive and negative activation metrics described above. The PPI generated according to equation 12 is related to the GPI calculated according to equation 5 in the following manner:
Additionally, the network scoring engine 114 may be configured to calculate the PPI of equation 12 by calculating the L1 norm of a vector whose ith component is defined by:
FIG. 7 is a flow chart 700 of an Expected Perturbation Index (EPI) scoring technique that may be implemented by the network scoring engine 114. As discussed above with respect to the SRP engine 110 (FIG. 1) and step 212 (FIG. 2) of the process 200, each SRP represents a measure of activity (or change in activity) of a biological entity under treatment conditions. If the fold changes β_i are all drawn from a distribution p(·), then the expected value of the distribution is
Since the true theoretical distribution p(·) is not readily known, the network scoring engine 114 can be configured to perform the steps described below to approximate the EPI value based on the observed activities and other information retrieved from the system 100.
At step 702, network scoring engine 114 assembles a fold change vector β. This fold change vector, representing the observed fold changes in activity of the N measured biological entities, may be assembled as described above with reference to step 502 of the Geometric Perturbation Index (GPI) scoring technique shown in fig. 5 or step 602 of the Probabilistic Perturbation Index (PPI) scoring technique shown in fig. 6. At step 704, network scoring engine 114 generates a range of fold change densities. The network scoring engine 114 may generate the range of fold change densities as described above with reference to step 604 of the PPI scoring technique shown in fig. 6.
At step 706, the network scoring engine 114 identifies the expected direction of change for each component of the fold change vector β. This step may be performed as described above with reference to step 508 of the GPI scoring technique shown in FIG. 5, resulting in a set of expected directions of change d_i corresponding to the observed fold changes β_i.
At step 708, the network scoring engine 114 generates an approximate fold change density. If each fold change β_i is drawn from the distribution p(·), then the distribution p(·) can be approximated by:
At step 710, the network scoring engine 114 generates an approximate expected value of the approximate fold change density, thereby producing an EPI score. In some implementations, the network scoring engine 114 applies a computational interpolation technique (e.g., a linear or non-linear interpolation technique) to generate an approximate continuous distribution from the distribution of equation 16, and then uses the expression of equation 15 to compute the expected value of that distribution. In other implementations, the network scoring engine 114 is configured to use the discrete distribution of equation 16 as a rectangular approximation to a continuous distribution, and to calculate the EPI as follows:
In equation 17, the subscript (·) denotes values taken in order from the smallest fold change to the largest, n+ is the number of entities whose activity is expected to increase in response to the treatment (d_iβ_i >= 0) (per step 706), and n- is the number of entities whose activity is expected to decrease in response to the treatment (d_iβ_i <= 0) (per step 706). In an EPI score, higher-valued fold changes are weighted more heavily than lower-valued fold changes, thereby providing a measure of activity with high specificity.
The network scoring engine 114 may also be configured to determine confidence intervals around the network scores, which may be used by clinicians and researchers to evaluate the experimental results reflected in the network scores, and may be used in further data processing steps by other components of the system 100 (e.g., by the aggregation engine 116). One useful method for determining confidence intervals is to evaluate, for a given type I (false positive) error risk α (e.g., α = 0.05), the null hypothesis that the network score is 0 (or another suitable null value representing no difference in activity between treatment and control conditions). Under the null hypothesis, each β_i is assumed to arise from a normal distribution with zero mean and a sample variance S_i^2 based on df degrees of freedom. The network scoring engine may generate these quantities using statistical estimation and testing procedures. For example, the t-statistic and moderated t-statistic generated by the linear model method of the "limma" R package are commonly used for analysis of differential gene expression and are described by Smyth in "Linear models and empirical Bayes methods for assessing differential expression in microarray experiments" (Statistical Applications in Genetics and Molecular Biology, 3:3, 2004), which is incorporated herein by reference in its entirety. … β_i is approximated, assuming that β_i arises from the underlying normal distribution. In implementations where the assumptions for the application of percentile bootstrapping appear to be violated (which may include implementations using EPI), network scoring engine 114 may additionally apply the bias-corrected percentile method described by Efron in "The jackknife, the bootstrap, and other resampling plans" (SIAM, 1982) and by DiCiccio et al. in "A review of bootstrap confidence intervals" (Journal of the Royal Statistical Society, 50:338, 1988), each of which is incorporated herein by reference in its entirety.
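For orientation, a plain percentile bootstrap interval can be sketched as follows. This is a generic illustration, not the procedure of the source: the resampling unit (a list of per-replicate values), the function name `percentile_bootstrap_ci`, and the example data are all assumptions, and the bias-corrected variant cited above is not implemented here.

```python
import random


def percentile_bootstrap_ci(values, statistic, alpha=0.05, n_boot=2000, seed=0):
    """Return a (1 - alpha) percentile bootstrap interval for statistic(values).

    values    -- observed sample (hypothetical resampling unit)
    statistic -- function mapping a list of values to a single score
    """
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Recompute the statistic on n_boot resamples drawn with replacement.
    replicates = sorted(
        statistic([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lower_idx = max(0, round(alpha / 2 * n_boot))
    upper_idx = min(n_boot - 1, round((1 - alpha / 2) * n_boot) - 1)
    return replicates[lower_idx], replicates[upper_idx]


def mean(xs):
    return sum(xs) / len(xs)


# Illustrative values only.
contributions = [0.8, 1.1, 0.9, 1.3, 1.0, 1.2]
lower, upper = percentile_bootstrap_ci(contributions, mean)
```

The interval is read directly from the sorted resampled scores, which is why it relies on the resampling distribution being a reasonable stand-in for the sampling distribution of the score.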
The particular technique used by the network scoring engine 114 to analytically determine the confidence interval will depend on the particular network scoring technique used and on the underlying statistical distribution assumed for β_i.
For example, when the network scoring engine 114 is configured to calculate the strength score (according to equation 1), the network scoring engine 114 treats the strength score as a random variable consisting of a weighted sum of independent, approximately normal random variables. As a result, the strength score is itself an approximately normal random variable with zero mean and a variance calculated as follows:
The network scoring engine 114 can use the variance S_strength^2 to obtain a t-statistic according to the following formula:
The degrees of freedom df are estimated by the Welch-Satterthwaite formula described by Satterthwaite in "An approximate distribution of estimates of variance components" (Biometrics Bulletin, 2:110, 1946) and by Welch in "The generalization of 'Student's' problem when several different population variances are involved" (Biometrika, 34:28, 1947), each of which is incorporated herein by reference in its entirety. Using these quantities, the network scoring engine 114 may generate a (1-α) confidence interval for the strength score as follows:
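The quantities named above can be sketched numerically. The code below is an illustration under assumptions, since the source equations are not reproduced here: the weights `w_i` are hypothetical, the variance of a weighted sum of independent variables is taken as the sum of w_i^2 S_i^2, the Welch-Satterthwaite formula is used in its standard form, and the critical value is supplied by the caller (for example from a t table).

```python
import math


def weighted_sum_variance(weights, variances):
    """Variance of sum(w_i * x_i) for independent x_i with variances S_i^2."""
    return sum(w * w * v for w, v in zip(weights, variances))


def welch_satterthwaite_df(weights, variances, dfs):
    """Effective degrees of freedom for the weighted sum (standard formula)."""
    terms = [w * w * v for w, v in zip(weights, variances)]
    numerator = sum(terms) ** 2
    denominator = sum(t * t / d for t, d in zip(terms, dfs))
    return numerator / denominator


def t_confidence_interval(score, variance, t_crit):
    """Interval score +/- t_crit * sqrt(variance); t_crit comes from a t table."""
    half_width = t_crit * math.sqrt(variance)
    return score - half_width, score + half_width


# Illustrative values only: two entities, equal weights and variances.
weights = [0.5, 0.5]
variances = [4.0, 4.0]
dfs = [10, 10]
s2 = weighted_sum_variance(weights, variances)         # 0.25*4 + 0.25*4
df = welch_satterthwaite_df(weights, variances, dfs)   # 4 / (0.1 + 0.1)
# 2.086 is the 0.975 quantile of a t distribution with 20 degrees of freedom.
ci = t_confidence_interval(1.5, s2, t_crit=2.086)
```

Passing the critical value in keeps the sketch free of third-party dependencies; a statistics library could compute it from `df` and α instead.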
As another example, when the network scoring engine 114 is configured to calculate a GPI score (as discussed above with reference to FIG. 5), the network scoring engine 114 may also be configured to calculate a confidence interval for the GPI score according to the steps of flowchart 800 of FIG. 8. At step 802, the network scoring engine 114 performs a first-order Taylor expansion of the GPI score represented by equation 5 as a function of β_i, according to the following equation:
wherein β̂_i is the measured fold change value. The first-order Taylor approximation of the GPI score retains the first two terms and discards the higher-order remainder term.
At step 804, the network scoring engine 114 evaluates whether the coefficients of the β_i terms in the GPI calculation are functions of β_i. These coefficients include the expected direction term d_i and the weight r_i. When these coefficients do not depend on β_i, the first-order term in equation 21 becomes linear in β_i, and the network scoring engine 114 proceeds to step 808. However, when the coefficients do depend on β_i, the network scoring engine 114 proceeds to step 806 to approximate the first-order term in equation 21. In particular, when the weighting vector r is a function of β_i and the expected direction term d_i is not a function of β_i, the first-order term can be expressed as:
In particular, when the weighting vector r is the vector of false non-discovery rate values fndr_i, as discussed above with reference to equation 2 and step 504 of fig. 5, the network scoring engine 114 may use the following expression for the derivative term of equation 22:
The derivative labeled "term 1" in equation 23 represents the derivative of the Benjamini-Hochberg adjustment factor, while the integral labeled "term 2" represents the p-value of the fold change of the ith biological entity. Because the Benjamini-Hochberg term is most relevant when the p-value is low, the network scoring engine 114 may be configured to approximate the product of term 1 and term 2 as 0 at step 806. As a result, the network scoring engine 114 may apply the fundamental theorem of calculus and use the following approximation of the derivative term of equation 23:
Substituting the approximation of equation 24 into the expression of equation 21 yields the following approximation of the GPI score:
At step 808, the network scoring engine 114 determines the approximate variance of the GPI score using the approximation of the GPI score generated in the preceding steps. If the GPI score has been approximated as a linear combination of the random variables β_i, as in equation 21, the variance of the approximation is a weighted sum of the variances of the β_i, given by:
wherein S_i^2 is the variance of the ith fold change β_i. Thus, the approximate variance of equation 25 can be written as:
wherein the d_i term is dropped when d_i = ±1, because d_i^2 = 1.
At step 810, the network scoring engine 114 evaluates the variance of the GPI score at the observed fold change values (e.g., as represented by equation 27). At step 812, the network scoring engine 114 generates a confidence interval for the GPI score according to the following equation:
wherein S_GPI is calculated as described above with reference to equations 26 and 27. Equation 28 may be adjusted as necessary to determine the variance of the PPI score at the observed fold change values.
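The variance and interval steps can be sketched for the simpler branch of step 804, in which the weights are treated as constants that do not depend on β_i. Under that assumption, and using the GPI form with a 1/√N scaling, the variance is a weighted sum of the fold change variances in which the d_i terms drop out. The function names and example values are hypothetical, and the use of a normal quantile for the interval is an assumption made here because the source equation is not reproduced.

```python
import math
from statistics import NormalDist


def gpi_variance(fndr, fold_change_variances):
    """Approximate Var(GPI) = (1/N) * sum(fndr_i^2 * S_i^2).

    Assumes constant weights fndr_i and directions d_i = +/-1, so d_i^2 = 1.
    """
    n = len(fndr)
    if n == 0 or n != len(fold_change_variances):
        raise ValueError("inputs must be non-empty and equal in length")
    return sum(w * w * v for w, v in zip(fndr, fold_change_variances)) / n


def gpi_confidence_interval(gpi, variance, alpha=0.05):
    """Two-sided interval gpi +/- z * S_GPI (normal quantile is an assumption)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * math.sqrt(variance)
    return gpi - half_width, gpi + half_width


# Illustrative values only: (1^2 * 1 + 0.5^2 * 4) / 2 = 1.0
variance = gpi_variance([1.0, 0.5], [1.0, 4.0])
lower, upper = gpi_confidence_interval(3.0, variance)
```

When the weights do depend on β_i, as in step 806, the coefficients would have to be replaced by the approximated derivative terms, which this sketch does not attempt.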
In addition to or instead of the scalar value scores described above, the network scoring engine 114 may generate vector value scores. One vector value score is a vector of the fold change or absolute change in activity for each measurement node.
In certain implementations, for each perturbation (e.g., exposure to a known or unknown agent), the network scoring engine 114 can generate a plurality of NPA scores. For example, the network scoring engine 114 may generate an NPA score for a particular network, a particular agent dosage, and a particular exposure time.
E. Results of the experiment
The process 200 for quantifying the response of a biological network to a perturbation by calculating Network Perturbation Amplitude (NPA) scores has been used to analyze Normal Human Bronchial Epithelial (NHBE) cells treated with Tumor Necrosis Factor (TNF), using several causal network models. The activation of the stress-responsive and immune-responsive transcription factor NF-κB (nuclear factor kappa-light-chain-enhancer of activated B cells) has been well characterized as a primary mediator of tumor necrosis factor α (TNFα)-induced signaling in various systems. NHBE cells were treated with four different doses of TNFα (0.1, 1, 10 and 100 ng/mL), and total RNA was collected for microarray measurements at four different times after treatment (30 min, 2 hr, 4 hr and 24 hr). All treatments were compared to time-matched controls to obtain 16 contrasts (4 doses x 4 time points). Total RNA was extracted from the harvested cells, and an assay from Promega was also performed. NF-κB nuclear translocation was measured using the Cellomics NF-kB Activation HCS Reagent Kit (Thermo Scientific). The data processing and NPA methods were implemented in the R statistical environment. Raw RNA expression data were analyzed using the affy and limma packages of the Bioconductor suite of microarray analysis tools available in the R statistical environment (Gentleman R: Bioinformatics and computational biology solutions using R and Bioconductor. New York: Springer Science + Business Media, 2005; Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, et al.: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004, 5:R80).
Robust Microarray Analysis (RMA) background correction and quantile normalization were used to generate probe set expression values (Irizarry et al., Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 2003, 4:249-264). An overall linear model was fit to the data for all replicate groups, and the specific contrasts of interest (comparisons of "treated" and "control" conditions) were evaluated to generate raw p-values for each probe set on the expression array. The raw p-values were then corrected for multiple testing effects using the Benjamini-Hochberg False Discovery Rate (FDR).
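The Benjamini-Hochberg correction mentioned above can be illustrated in a few lines. The source performed this step in R; the sketch below is a plain reimplementation of the standard step-up adjustment, with a hypothetical function name and made-up p-values.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Illustrative raw p-values for four probe sets.
raw = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(raw)
```

Each adjusted value is the smallest FDR level at which the corresponding probe set would be called significant.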
Probe sets were matched to RNA abundance nodes in the Selventa knowledge base using the HG-U133_Plus_2.na30 probe set mapping and the following criteria. First, only "_at" or "_s_at" probe sets were considered. Second, probe sets mapped to multiple genes were discarded. Third, when multiple probe sets mapped to the same gene, "_at" probe sets were selected in preference to "_s_at" probe sets. Finally, when multiple probe sets still mapped to the same gene, the probe set with the lowest geometric mean FDR-corrected p-value across all the contrasts of interest was selected. The linear model was then refit for all replicate groups using only those probe sets that mapped to RNA abundance nodes in the knowledge base, and the FDR-corrected p-values were recalculated. The Selventa knowledge base is a repository containing over 150 million nodes (biological concepts and entities) and over 750 million edges (statements about causal and non-causal relationships between nodes). Statements in the Selventa knowledge base were derived from peer-reviewed scientific literature and other public and proprietary databases. In particular, each statement describes an individual experimental observation from an experiment performed in vitro or in vivo in the context of the human, mouse, or rat species. Each statement also captures information about the reference source (e.g., the PubMed ID (PMID) of a journal article listed in MEDLINE), as well as key contextual information including the species (human, mouse, or rat) and the tissue or cell line from which the experimental observation was derived. One example causal statement is that increased transcriptional activity of NFkB (nuclear factor kappa-light-chain-enhancer of activated B cells) leads to increased mRNA expression of CXCL1 (chemokine (C-X-C motif) ligand 1) [HeLa cell line; Human; PMID 16414985].
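The four probe set selection criteria at the start of the paragraph above can be sketched as a small filter. This is an illustration only: the record layout (probe ID, list of mapped genes, list of FDR-corrected p-values per contrast), the function names, and the probe IDs are hypothetical, and p-values are assumed to be strictly positive so that the geometric mean is defined.

```python
import math
import re

# Matches qualified suffixes such as "_s_at", "_x_at" or "_a_at".
QUALIFIED_SUFFIX = re.compile(r"_[a-z]_at$")


def probe_kind(probe_id):
    """Classify a probe set ID as 'at', 's_at', or None (not considered)."""
    if probe_id.endswith("_s_at"):
        return "s_at"
    if probe_id.endswith("_at") and not QUALIFIED_SUFFIX.search(probe_id):
        return "at"
    return None


def geometric_mean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))


def select_probe_sets(probes):
    """Pick one probe set per gene from (probe_id, genes, fdr_pvalues) records."""
    by_gene = {}
    for probe_id, genes, pvalues in probes:
        kind = probe_kind(probe_id)
        # Criteria 1 and 2: keep only "_at"/"_s_at" sets mapped to one gene.
        if kind is None or len(genes) != 1:
            continue
        by_gene.setdefault(genes[0], []).append((probe_id, kind, pvalues))
    chosen = {}
    for gene, candidates in by_gene.items():
        # Criterion 3: prefer plain "_at" probe sets when any exist.
        plain = [c for c in candidates if c[1] == "at"]
        pool = plain or candidates
        # Criterion 4: lowest geometric mean FDR-corrected p-value wins.
        chosen[gene] = min(pool, key=lambda c: geometric_mean(c[2]))[0]
    return chosen


# Hypothetical probe set records.
probes = [
    ("1001_at", ["CXCL1"], [0.01, 0.04]),
    ("1002_at", ["CXCL1"], [0.001, 0.001]),
    ("1003_s_at", ["CXCL1"], [1e-6, 1e-6]),
    ("1004_s_at", ["IL6"], [0.2, 0.2]),
    ("1005_x_at", ["IL6"], [0.001, 0.001]),
    ("1006_at", ["ICAM1", "CCL2"], [0.01, 0.01]),
]
selected = select_probe_sets(probes)
```

In this example the "_s_at" probe set for CXCL1 is passed over despite its lower p-values because a plain "_at" probe set exists, which reflects the order in which the criteria are applied.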
The knowledge base contains causal relationships derived from healthy tissue and disease areas (e.g., inflammation, metabolic disease, cardiovascular injury, liver injury, and cancer).
The GPI, EPI, and PPI scoring methods were first investigated using a causal network model, the NF-κB-directed model, created as a specific measure of NF-κB activation. This model included 155 genes (taken from 247 different references, some supported by multiple references) known to be directly regulated by NF-κB (genes whose expression is controlled in an NF-κB-dependent manner and whose promoter sequences are directly bound by NF-κB). Both scoring methods showed the same pattern in response to TNFα: a consistent dose-dependent response and a time-dependent response that generally saturated over time (see fig. 10a). The EPI method differs qualitatively from GPI in that the EPI score increases continuously from 2 hours to 4 hours to 24 hours, whereas the GPI score plateaus from 4 hours to 24 hours. In addition, the EPI method produced a score close to zero for TNFα at 0.1 ng/mL. In general, the EPI method appears to shrink to 0 (or close to 0) scores that tend to be relatively low under the other methods. For the EPI method, the lowest dose was found not to be specific to the NF-κB-directed network model at any time point except the 2-hour time point.
The NF-κB-directed model scores were then compared to NF-κB nuclear translocation. Upon activation, NF-κB is transported into the nucleus, where it acts to regulate the expression of many genes. A series of feedback loops then results in subsequent translocation of NF-κB back to the cytoplasm, and this oscillation cycle repeats multiple times. Since NF-κB oscillation occurs with slightly different periods in different cells of the population, the first oscillation may be the most reliable aggregate measure of NF-κB activation. Although the time of the first oscillation depends on the dose, 30 minutes after TNFα treatment may be a reasonable time for measuring nuclear translocation of NF-κB at the doses used. All three scoring methods produced a monotonic, and in some cases nearly linear, relationship between score and nuclear translocation, with Pearson correlation coefficients for the GPI and EPI scoring methods ranging from 0.85 to 0.98 (fig. 11). FIG. 11 shows NF-κB-directed NPA scores at 30 minutes plotted against NF-κB nuclear translocation at 30 minutes. The error bars for NF-κB nuclear translocation represent the standard deviations of the mean nuclear translocation for three different fields of view of the same population of cells. Interestingly, this dose-dependent relationship was maintained at various times after TNFα treatment (fig. 13). These findings demonstrate that NPA scores based on a causal network model enable quantification of NF-κB transcriptional activity.
The impact of the scope and composition of the causal network model on the NPA scoring methods was also investigated. First, the effect of manually selecting a set of measurables known to be specifically modulated by NF-κB in a TNFα-dependent manner was evaluated. A submodel was constructed from a set of 20 genes previously measured via reverse transcriptase polymerase chain reaction (RT-PCR) to evaluate NF-κB activity in response to TNFα treatment in 3T3 mouse fibroblasts (omitting 2 genes with no direct human homolog). These genes were measured in 3T3 cells over a 12-hour time course following administration of TNFα (10 different concentrations ranging from 100 ng/mL to 0.005 ng/mL). This submodel produced an activation pattern very similar to that of the NF-κB-directed model (fig. 14), suggesting that the inclusion of genes whose TNFα-dependent expression has not been directly validated has no adverse effect on the quality of the scores. Fig. 14 shows the results of scoring transcriptional data from TNFα-treated NHBE cells using GPI and EPI for (a) the NF-κB-directed model and (b) the submodel comprising 20 NF-κB-regulated genes reported to be TNFα-responsive in 3T3 mouse fibroblasts ("Single-cell NF-κB dynamics reveal digital activation and analogue information processing", 466) (NF kb, CASP4, CCL5, TNFAIP3, CCL2, ZFP36, RIPK2, TNFSF10, NFKBIE, IL6, CCL20, ICAM1, TNFRSF1A, TNFRSF1B, stm1, NRG1, SOD1, IL1 anflatl 1, IL1A, erb 2).
Next, the effect of using causal network models derived from upstream biological entities that are less proximal to the measurements was investigated. To do so, two additional models were constructed: an IKK/NF-κB signaling model comprising 992 genes (taken from 414 different references) known to be modulated by perturbation of proteins in a causal network model of signaling from the IκB kinase (IKK) proteins to NF-κB activation (fig. 9); and a TNF model comprising 1741 genes (taken from 589 different references) known to be modulated by treating cells with TNFα. While the NF-κB-directed model consisted entirely of genes whose expression is directly controlled by a single transcription factor (NF-κB), these two models each contained genes whose direct transcriptional controllers are not necessarily known. The expression of these genes may be controlled by transcription factors not involved in constructing the model. For example, while genes in the IKK/NF-κB signaling model are known to be modulated by perturbation of proteins in the IKK/NF-κB signaling causal network model, some of these genes may be modulated as secondary effects resulting from altered expression of a smaller subset of genes directly modulated by NF-κB. Likewise, TNFα is a ligand and therefore does not directly mediate the transcription of any gene. Treatment of cells with TNFα results in activation of a number of transcription factors, any of which can alter, directly or indirectly (e.g., through autocrine signaling), the expression of each gene in the TNF model.
Figure 9 shows the complete causal network model (top), along with a schematic of the underlying model architecture (middle). CHUK, IKBKB, and IKBKG act as inhibitors of NFKBIA, NFKBIB, and NFKBIE, which in turn are inhibitors of NFKB1, NFKB2, and RELA. The nodes used in the model are listed under each section. Bold nodes represent nodes with measurable downstream gene expression in the knowledge base, and the number of downstream measurables is given in square brackets (these 1227 downstream measurables correspond to 992 unique measurables, because the same downstream measurable can be found under multiple nodes). The notation used is as follows: "CHUK P@S" denotes CHUK phosphorylated at serine (with the residue given if known), "CHUK P@ST" denotes CHUK phosphorylated at serine or threonine (the exact residue is unknown), "kaof(CHUK)" denotes the kinase activity of CHUK, "CHUK:IKBKB" denotes a complex of the CHUK and IKBKB proteins, "IkappaB kinase complex Hs" denotes the complex of the various IκB kinases (CHUK, IKBKB, and IKBKG) in Homo sapiens (Hs), "degradationOf(NFKBIA)" denotes the process of NFKBIA degradation, and "taof(NFKB1)" denotes the transcriptional activity of NFKB1.
The IKK/NF-κB signaling model and the TNF model provide insight into the behavior of mechanistic hypotheses at different levels of proximity to the measurements. The IKK/NF-κB signaling model mainly includes genes regulated by NF-κB (directly or indirectly) (FIG. 9), and it produces a response pattern very similar to that of the NF-κB-directed model (FIG. 10(b)). This similar response pattern suggests that there is no large difference, at the aggregate level, between the behavior of genes known to be directly regulated by a transcription factor and the behavior of genes for which direct regulation is not known. The time- and dose-dependent response seen in the NF-κB-directed model appeared to be slightly less robust in the TNF model (fig. 10(c)), e.g., at the 30-minute time point, but this model also produced a very similar response. Thus, although the general pattern of response is well preserved among the models, small but perceptible differences in response can be observed in models that are less proximal to the entities being measured.
To evaluate the ability of causal network models to respond specifically to relevant perturbations of TNFα signaling, another model was constructed for a key cell cycle component, the transcription factor E2F1, on the assumption that E2F1 is a less direct effector of TNFα signaling than NF-κB. The E2F1-directed model included 80 genes (taken from 54 different references) known to be directly regulated by E2F1 (their expression is under the control of E2F1, and their promoter sequences are bound by E2F1). To provide a comparison with NPA results for biology not directly related to NF-κB signaling, the NPA responses of the four models introduced above (NF-κB-directed, IKK/NF-κB signaling, TNF, and E2F1-directed) were evaluated in response to inhibition of cell cycle progression via a CDK inhibitor. In particular, a publicly available microarray data set (GSE15395) from HCT116 colon cancer cells treated with three different concentrations of the CDK inhibitor R547 was used (Berkofsky-Fessler et al., Preclinical biomarkers for a cyclin-dependent kinase inhibitor translate to candidate pharmacodynamic biomarkers in phase I patients. Mol Cancer Ther 2009, 8:2517-). All three NPA methods showed a dose- and time-dependent decrease in the score of the E2F1-directed model at the 4-hour, 6-hour, and 24-hour time points. The TNF model showed a response pattern similar to that of the E2F1-directed model. In contrast, the scores of the NF-κB-directed model and the IKK/NF-κB signaling model did not show this dose- and time-dependent pattern, suggesting that these more focused models contain few genes involved in cell cycle regulation.
F. Hardware
FIG. 15 is a block diagram of a distributed computerized system 1500 for quantifying the effects of biological perturbations. The components of system 1500 are the same as those in system 100 of fig. 1, but system 1500 is arranged such that each component communicates through the network interface 1510. Such implementations may be suitable for distributed computing via a variety of communication systems, including wireless communication systems that may share access to common network resources, e.g., in a "cloud computing" paradigm.
FIG. 16 is a block diagram of a computing device, such as any of the components of the system 100 of FIG. 1 or the system 1500 of FIG. 15, for performing the processes described with reference to FIGS. 1-10. Each of the components of system 100, including SRP engine 110, network modeling engine 112, network scoring engine 114, aggregation engine 116, and one or more databases (including results databases, perturbation databases, and literature databases), may be implemented on one or more computing devices 1600. In certain aspects, a plurality of the aforementioned components and databases may be included within one computing device 1600. In certain implementations, components and databases may be implemented across several computing devices 1600.
The computing device 1600 includes at least one communication interface unit, an input/output controller 1610, a system memory, and one or more data storage devices. The system memory includes at least one random access memory (RAM 1602) and at least one read-only memory (ROM 1604). All of these elements communicate with a central processing unit (CPU 1606) to facilitate the operation of computing device 1600. The computing device 1600 may be configured in many different ways. For example, computing device 1600 may be a conventional stand-alone computer, or alternatively, the functionality of computing device 1600 may be distributed across multiple computer systems and architectures. The computing device 1600 may be configured to perform some or all of the modeling, scoring, and aggregation operations. In FIG. 16, computing device 1600 is linked to other servers or systems via a network or local network.
The computing device 1600 may be configured in accordance with a distributed architecture, where the database and processor are located in separate units or locations. Some such units perform primary processing functions and contain, at a minimum, a general-purpose controller or processor and system memory. In this regard, these units are each coupled via a communication interface unit 1608 to a communication hub or port (not shown) that serves as a primary communication link with other servers, client or user computers, and other related devices. The communication hub or port itself may have minimal processing power and function primarily as a communication router. Various communication protocols may be part of the system, including, but not limited to: Ethernet, SAP, SAS™, ATP, BLUETOOTH™, GSM and TCP/IP.
CPU 1606 includes a processor, e.g., one or more conventional microprocessors, and one or more auxiliary coprocessors, e.g., a math coprocessor, for offloading workload from the CPU 1606. The CPU 1606 communicates with the communication interface unit 1608 and the input/output controller 1610, through which the CPU 1606 communicates with other devices such as other servers, user terminals, or devices. The communication interface unit 1608 and the input/output controller 1610 may include multiple communication channels for simultaneous communication with, for example, other processors, servers, or client terminals. Devices in communication with each other need not be continually transmitting to each other. Rather, such devices need only transmit to each other as necessary, may actually refrain from exchanging data most of the time, and may require several steps to be performed to establish a communication link between the devices.
The CPU 1606 also communicates with the data storage device. The data storage device may include an appropriate combination of magnetic, optical, or semiconductor memory, and may include, for example, RAM 1602, ROM 1604, a flash drive, an optical disc such as a compact disc, or a hard disk or drive. The CPU 1606 and the data storage device may, for example, each be located entirely within a single computer or other computing device, or be connected to each other by a communication medium, such as a USB port, a serial port cable, a coaxial cable, an Ethernet-type cable, a telephone line, a radio frequency transceiver, or other similar wireless or wired medium, or a combination of the foregoing. For example, the CPU 1606 may be connected to the data storage device via the communication interface unit 1608. The CPU 1606 may be configured to perform one or more particular processing functions.
The data storage device may store, for example, (i) an operating system 1612 for the computing device 1600; (ii) one or more applications 1614 (e.g., computer program code or a computer program product) adapted to direct the CPU 1606 in accordance with the systems and methods described herein, and in particular in accordance with the processes described in detail with regard to the CPU 1606; or (iii) a database 1616 adapted to store information required by the program. In certain aspects, the database 1616 includes a database storing experimental data and published literature models.
The operating system 1612 and applications 1614 may be stored, for example, in a compressed, uncompressed, or encrypted format, and may include computer program code. The instructions of the program may be read into a main memory of the processor from a computer-readable medium other than the data storage device, such as the ROM 1604 or the RAM 1602. While execution of sequences of instructions in the program causes the CPU 1606 to perform the process steps described herein, hard-wired circuitry may be used in place of, or in combination with, software instructions to implement the processes of the present invention. Thus, the systems and methods described are not limited to any specific combination of hardware and software.
Suitable computer program code may be provided for performing one or more functions related to the modeling, scoring, and aggregation described herein. The program may also include program elements such as the operating system 1612, a database management system, and "device drivers" that allow the processor to interface with computer peripheral devices (e.g., a video display, a keyboard, a computer mouse, etc.) via the input/output controller 1610.
The term "computer-readable medium" as used herein refers to any non-transitory medium that provides or participates in providing instructions to the processor of the computing device 1600 (or any other processor of a device described herein) for execution. Such a medium may take many forms, including, but not limited to, non-volatile media and volatile media. Non-volatile media include, for example, optical, magnetic, or opto-magnetic disks, or integrated circuit memory such as flash memory. Volatile media include dynamic random access memory (DRAM), which typically constitutes the main memory. Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape, any other magnetic medium, a CD-ROM, a DVD, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM or EEPROM (electrically erasable programmable read-only memory), a FLASH-EEPROM, any other memory chip or cartridge, or any other non-transitory medium from which a computer can read.
Various forms of computer-readable media may be involved in carrying one or more sequences of one or more instructions to the CPU 1606 (or any other processor of a device described herein) for execution. For example, the instructions may initially be carried on a magnetic disk of a remote computer (not shown). The remote computer can load the instructions into its dynamic memory and send them over an Ethernet connection, a cable line, or even a telephone line using a modem. A communication device local to the computing device 1600 (e.g., a server) can receive the data on the respective communication line and place the data on the system bus for the processor. The system bus carries the data to main memory, from which the processor retrieves and executes the instructions. The instructions received by main memory may optionally be stored in memory either before or after execution by the processor. In addition, the instructions may be received via a communication port as electrical, electromagnetic, or optical signals, which are exemplary forms of wireless communications or data streams that carry various types of information. Further aspects and embodiments are set forth in the following paragraphs:
1. A computerized method for quantifying a perturbation of a biological system in response to an agent, comprising: receiving, at a first processor, a treatment dataset corresponding to a response of the biological system to the agent, wherein the biological system includes a plurality of biological entities, each biological entity interacting with at least one other of the biological entities; receiving, at a second processor, a control dataset corresponding to the biological system not exposed to the agent; providing, at a third processor, a computational causal network model representing the biological system, the computational causal network model including nodes representing the biological entities, edges representing relationships between the biological entities, and direction values for the nodes representing an expected direction of change between the control data and the treatment data; calculating, with a fourth processor, activity measurements for the nodes representing a difference between the treatment data and the control data; calculating, with a fifth processor, weight values for the nodes, wherein at least one weight value is different from at least one other weight value; and generating, with a sixth processor, a score for the computational model representing the perturbation of the biological system to the agent, wherein the score is based on the direction values, the weight values, and the activity measurements.
2. The computerized method of paragraph 1, further comprising normalizing the scores based on a number of nodes in the respective computational model.
3. The computerized method of any of the above paragraphs, wherein the weight values represent a confidence in at least one of the treatment dataset and the control dataset.
4. The computerized method of any of the above paragraphs, wherein the weight values comprise local false non-discovery rates.
5. The computerized method of paragraph 1, further comprising calculating, with a seventh processor, an approximate distribution of activity measurements over the nodes; calculating, with an eighth processor, expected values of the approximate distributions; and generating, with a ninth processor, for each computational model, a score representing the perturbation of the subset of biological systems to the agent, wherein the score is based on the expected value.
6. The computerized method of paragraph 5, wherein the approximate distribution is based on the activity measurements.
7. The computerized method of any of paragraphs 5-6, wherein calculating an expected value comprises performing a rectangle approximation.
8. The computerized method of paragraph 1, further comprising calculating, with a tenth processor, a positive activation score and a negative activation score based on the activity measurements, the positive and negative activation scores representing consistency and inconsistency, respectively, between the activity measurements and the direction values; and generating, with an eleventh processor, for each computational model, a score representing the perturbation of the subset of biological systems to the agent, wherein the score is based on the positive and negative activation scores.
9. The computerized method of paragraph 8, wherein the score is based on a local false non-discovery rate.
10. The computerized method of any of paragraphs 8-9, wherein the activity measurements are fold-change values, and the fold-change value for each node comprises a logarithm of a difference between the treatment data and the control data for the biological entity represented by the respective node.
11. The computerized method of any of the preceding paragraphs, wherein the subset of biological systems comprises at least one of a cell proliferation mechanism, a cell stress mechanism, a cell inflammation mechanism, and a DNA repair mechanism.
12. The computerized method of any of the above paragraphs, wherein the agent comprises at least one of an aerosol generated by heating tobacco, an aerosol generated by burning tobacco, tobacco smoke, or cigarette smoke.
13. The computerized method of any of the above paragraphs, wherein the agent comprises a heterogeneous substance comprising molecules or entities not present within or derived from a biological system.
14. The computerized method of any of the above paragraphs, wherein the agent comprises a toxin, a therapeutic compound, a stimulant, a relaxant, a natural product, an article of manufacture, and a food material.
15. The computerized method of any of the above paragraphs, wherein the treatment dataset comprises a plurality of treatment datasets, such that each node comprises a plurality of fold-change values defined by a first probability distribution and a plurality of weight values defined by a second probability distribution.
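By way of illustration only, the score of paragraphs 1 and 2 may be sketched as follows. The sketch rests on several assumptions that are not fixed by the paragraphs above: the fold-change is taken as the base-2 logarithm of the treatment-to-control ratio (that is, the difference of the log-transformed values), each weight is a confidence value between 0 and 1, the combination is a simple weighted sum, and the normalization divides by the square root of the number of nodes. The names `Node`, `fold_change`, and `perturbation_score` are hypothetical and chosen for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class Node:
    """One node of a causal network model (hypothetical structure)."""
    name: str         # label of the biological entity
    direction: int    # +1 or -1: expected direction of change
    treatment: float  # measured level after exposure to the agent
    control: float    # measured level without exposure
    weight: float     # confidence weight in [0, 1] (assumed range)


def fold_change(node: Node) -> float:
    """Activity measurement, assumed here to be log2(treatment / control)."""
    if node.treatment <= 0 or node.control <= 0:
        raise ValueError("measurements must be positive to take a logarithm")
    return math.log2(node.treatment / node.control)


def perturbation_score(nodes: list[Node]) -> float:
    """Weighted, direction-signed sum of fold-changes, scaled by sqrt(N).

    The weighted sum and the sqrt(N) scaling are illustrative choices,
    not the only combination the method allows.
    """
    if not nodes:
        raise ValueError("at least one node is required")
    total = sum(n.direction * n.weight * fold_change(n) for n in nodes)
    return total / math.sqrt(len(nodes))


# Made-up measurements for four nodes, purely for demonstration.
example_nodes = [
    Node("A", +1, 8.0, 2.0, 1.0),
    Node("B", -1, 1.0, 4.0, 0.5),
    Node("C", +1, 3.0, 3.0, 0.9),
    Node("D", -1, 2.0, 1.0, 0.5),
]
print(perturbation_score(example_nodes))  # prints 1.25
```

In this sketch, a node whose measured change agrees with its direction value raises the score, and a node that changes against its direction value (node D above) lowers it, with each contribution scaled by its weight.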
While implementations of the invention have been particularly shown and described with reference to specific examples, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. The scope of the invention is thus indicated by the appended claims, and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
Claims (43)
1. A computerized method for quantifying a perturbation of a biological system in response to an agent, comprising:
receiving, at a first processor, a treatment data set corresponding to a response of a biological system to an agent, wherein the biological system comprises a plurality of biological entities, each biological entity interacting with at least one other of the biological entities;
receiving, at a second processor, a control data set corresponding to the biological system not exposed to the agent;
providing, at a third processor, a computational causal network model representative of the biological system, the computational causal network model including:
nodes representing the biological entities,
edges representing relationships between the biological entities, and
direction values for the nodes representing an expected direction of change between the control data and the treatment data;
calculating, with a fourth processor, activity measures for the nodes representing a difference between the treatment data and the control data, wherein each activity measure is a fold-change value and the fold-change value for each node includes a logarithm of the difference between the treatment data and the control data for the biological entity represented by the respective node;
calculating, with a fifth processor, weight values for the nodes, wherein at least one weight value is different from at least one other weight value; and
generating, with a sixth processor, a score for the computational model representing a perturbation of the biological system to the agent, wherein the score is based on the direction values, the weight values, and the activity measures.
2. The computerized method of claim 1, wherein the biological system is represented by at least one mechanistic hypothesis.
3. The computerized method of claim 1, wherein the biological system is represented by a plurality of computational causal network models or at least one computational causal network model comprising a plurality of mechanistic hypotheses.
4. The computerized method of claim 1, further comprising normalizing the scores based on a number of measurable nodes in the respective computational model.
5. The computerized method of claim 1, wherein the weight value represents a confidence level for at least one of the treatment dataset and the control dataset.
6. The computerized method of claim 1, wherein the weight value includes a local false non-discovery rate, and wherein calculating the weight value for the node comprises calculating a probability that the activity measure represents a departure from a null hypothesis of zero difference.
7. The computerized method of claim 1, further comprising calculating, with a seventh processor, an approximate distribution of the activity measurements of nodes over a model or over mechanistic hypotheses in a model; calculating, with an eighth processor, an expected value of the activity measurement with respect to the approximate distribution; and generating, with a ninth processor, for each computational model, a score representing a perturbation of the subset of the biological system to the agent, wherein the score is based on an expected value.
8. The computerized method of claim 7, wherein the approximate distribution is based on the activity measurements.
9. The computerized method of claim 7, wherein calculating an expected value comprises performing a rectangular approximation.
10. The computerized method of claim 1, further comprising calculating, with a tenth processor, a positive activation metric and a negative activation metric based on the activity measurements, the positive activation metric and the negative activation metric representing consistency and inconsistency, respectively, between the activity measurements and the direction values with respect to the model; and generating, with an eleventh processor, for each computational model, a score representing a perturbation of the subset of the biological system to the agent, wherein the score is based on the positive activation metric and the negative activation metric.
11. The computerized method of claim 10, wherein the positive activation metric, negative activation metric, or both are based on a local false non-discovery rate.
12. The computerized method of claim 1, wherein the subset of biological systems includes at least one of a cell proliferation mechanism, a cell stress mechanism, a cell inflammation mechanism, and a DNA repair mechanism.
13. The computerized method of claim 1, wherein the agent comprises at least one of an aerosol generated by heating tobacco, an aerosol generated by burning tobacco, tobacco smoke, or cigarette smoke.
14. The computerized method of claim 1, wherein the agent comprises a heterogeneous substance comprising molecules or entities that are not present in or derived from the biological system.
15. The computerized method of claim 1, wherein the agent comprises a toxin, a therapeutic compound, a stimulant, a relaxant, a natural product, an article of manufacture, and a food material.
16. The computerized method of claim 1, wherein the treatment data set includes a plurality of treatment data sets such that each measurable node includes a plurality of fold-change values defined by a first probability distribution and a plurality of weight values defined by a second probability distribution.
17. The computerized method of claim 1, wherein the treatment data set includes a plurality of treatment data sets such that each measurable node includes a plurality of fold-change values and corresponding weight values.
18. The computerized method of claim 1, wherein the step of generating the score comprises forming a linear or non-linear combination of the activity measure, the weight value, and the direction value, and normalizing the combination by a scaling factor.
19. The computerized method of claim 18, wherein the combination is an arithmetic combination and the scaling factor is the square root of the number of biological entities from which measurement data is received.
20. The computerized method of claim 1, wherein the score is generated by a geometric perturbation index scoring technique, a probabilistic perturbation index scoring technique, or an expected perturbation index scoring technique.
21. The computerized method of claim 1, further comprising determining a confidence interval for the score based on a parametric or non-parametric computational bootstrapping technique.
22. A computer system for quantifying a perturbation of a biological system in response to an agent, the system comprising at least one processor configured or adapted to:
receive a treatment data set corresponding to a response of a biological system to an agent, wherein the biological system comprises a plurality of biological entities, each biological entity interacting with at least one other of the biological entities;
receive a control data set corresponding to the biological system not exposed to the agent;
provide a computational causal network model representative of the biological system, the computational causal network model including:
nodes representing the biological entities,
edges representing relationships between the biological entities, and
direction values for the nodes representing an expected direction of change between the control data and the treatment data;
calculate activity measures for the nodes representing a difference between the treatment data and the control data, wherein each activity measure is a fold-change value and the fold-change value for each node includes a logarithm of the difference between the treatment data and the control data for the biological entity represented by the respective node;
calculate weight values for the nodes, wherein at least one weight value is different from at least one other weight value; and
generate a score for the computational model that represents a perturbation of the biological system to the agent, wherein the score is based on the direction values, the weight values, and the activity measures.
23. A computerized apparatus for quantifying a perturbation of a biological system in response to an agent, comprising:
means for receiving, at a first processor, a treatment data set corresponding to a response of a biological system to an agent, wherein the biological system comprises a plurality of biological entities, each biological entity interacting with at least one other of the biological entities;
means for receiving, at a second processor, a control data set corresponding to the biological system not exposed to the agent;
means for providing, at a third processor, a computational causal network model representative of the biological system, the computational causal network model including:
nodes representing the biological entities,
edges representing relationships between the biological entities, and
direction values for the nodes representing an expected direction of change between the control data and the treatment data;
means for calculating, with a fourth processor, activity measures for the nodes representing a difference between the treatment data and the control data, wherein each activity measure is a fold-change value and the fold-change value for each node includes a logarithm of the difference between the treatment data and the control data for the biological entity represented by the respective node;
means for calculating, with a fifth processor, weight values for the nodes, wherein at least one weight value is different from at least one other weight value; and
means for generating, with a sixth processor, a score for the computational model representing a perturbation of the biological system to the agent, wherein the score is based on the direction values, the weight values, and the activity measures.
24. The computerized device of claim 23, wherein said biological system is represented by at least one mechanistic hypothesis.
25. The computerized device of claim 23, wherein the biological system is represented by a plurality of computational causal network models or at least one computational causal network model comprising a plurality of mechanistic hypotheses.
26. The computerized apparatus of claim 23, further comprising means for normalizing said score based on a number of measurable nodes in a respective computational model.
27. The computerized device of claim 23, wherein the weight value represents a confidence level for at least one of the treatment dataset and the control dataset.
28. The computerized apparatus of claim 23, wherein the weight values include local false non-discovery rates, and wherein the means for calculating weight values for the nodes comprises means for calculating a probability that the activity measurements represent a departure from a null hypothesis of zero difference.
29. The computerized device of claim 23, further comprising means for calculating, with a seventh processor, an approximate distribution of said activity measurements of nodes over a model or a mechanistic hypothesis in a model; means for calculating, with an eighth processor, expected values of activity measurements with respect to the approximate distribution; and means for generating, with a ninth processor, for each computational model, a score representing a perturbation of the subset of biological systems to the agent, wherein the score is based on an expected value.
30. The computerized device of claim 29, wherein the approximate distribution is based on the activity measurements.
31. The computerized device of claim 29, wherein the means for calculating expected values comprises means for performing a rectangular approximation.
32. The computerized device of claim 23, further comprising means for calculating, with a tenth processor, a positive activation metric and a negative activation metric based on the activity measurements, the positive activation metric and the negative activation metric representing consistency and inconsistency, respectively, between the activity measurements and the direction values with respect to the model; and means for generating, with an eleventh processor, for each computational model, a score representing a perturbation of the subset of biological systems to the agent, wherein the score is based on the positive activation metric and the negative activation metric.
33. The computerized apparatus of claim 32, wherein the positive activation metric, negative activation metric, or both are based on a local false non-discovery rate.
34. The computerized apparatus of claim 23, wherein the subset of biological systems includes at least one of a cell proliferation mechanism, a cell stress mechanism, a cell inflammation mechanism, and a DNA repair mechanism.
35. The computerized device of claim 23, wherein the agent comprises at least one of an aerosol generated by heating tobacco, an aerosol generated by burning tobacco, tobacco smoke, or cigarette smoke.
36. The computerized device of claim 23, wherein the agent comprises a heterogeneous substance comprising molecules or entities that are not present in or derived from the biological system.
37. The computerized device of claim 23, wherein the agent comprises a toxin, a therapeutic compound, a stimulant, a relaxant, a natural product, an article of manufacture, and a food material.
38. The computerized apparatus of claim 23, wherein the treatment data set includes a plurality of treatment data sets such that each measurable node includes a plurality of fold-change values defined by a first probability distribution and a plurality of weight values defined by a second probability distribution.
39. The computerized device of claim 23, wherein the treatment data set includes a plurality of treatment data sets such that each measurable node includes a plurality of fold-change values and corresponding weight values.
40. The computerized device of claim 23, wherein the means for generating the score comprises means for linearly or non-linearly combining the activity measure, the weight value, and the direction value; and means for normalizing said combination by a scaling factor.
41. The computerized device of claim 40, wherein said combination is an arithmetic combination and said scaling factor is the square root of the number of biological entities from which measurement data is received.
42. The computerized device of claim 23, wherein the score is generated by a geometric perturbation index scoring technique, a probabilistic perturbation index scoring technique, or an expected perturbation index scoring technique.
43. The computerized device of claim 23, further comprising means for determining a confidence interval for the score based on a parametric or non-parametric computational bootstrapping technique.
Applications Claiming Priority (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US61/495,824 | 2011-06-10 | ||
| US61/525,700 | 2011-08-19 | ||
| EP11195417.8 | 2011-12-22 | ||
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1196688A (en) | 2014-12-19 |
| HK1196688B (en) | 2018-05-04 |
Similar Documents
| Publication | Title |
|---|---|
| CN106934253B (en) | System and method for network-based assessment of biological activity |
| US20210397995A1 (en) | Systems and methods relating to network-based biomarker signatures |
| JP6407242B2 (en) | System and method for network-based biological activity assessment |
| US20140207385A1 (en) | Systems and methods for characterizing topological network perturbations |
| HK1196688B (en) | Systems and methods for network-based biological activity assessment |
| HK1196688A (en) | Systems and methods for network-based biological activity assessment |
| HK1197698B (en) | Systems and methods for network-based biological activity assessment |
| HK1197698A (en) | Systems and methods for network-based biological activity assessment |
| HK1211360B (en) | Systems and methods relating to network-based biomarker signatures |
| HK1197483B (en) | Systems and methods for quantifying the impact of biological perturbations |
| HK1197483A (en) | Systems and methods for quantifying the impact of biological perturbations |