HK1197483B - Systems and methods for quantifying the impact of biological perturbations - Google Patents

Info

Publication number
HK1197483B
Authority
HK
Hong Kong
Prior art keywords
biological
score
scores
network
data
Application number
HK14110893.1A
Other languages
Chinese (zh)
Other versions
HK1197483A (en)
Inventor
J.霍恩格
F.马丁
M.派奇
A.塞沃尔
Original Assignee
菲利普莫里斯生产公司
Application filed by 菲利普莫里斯生产公司

Description

System and method for quantifying the effects of biological perturbations
Background
The human body is frequently perturbed by exposure to potentially harmful agents that can pose serious health risks in the long term. Exposure to these agents can compromise the normal functioning of biological mechanisms within the human body. To understand and quantify the effects of these perturbations on the human body, researchers study the mechanisms by which biological systems respond to exposure to agents. Some groups have made extensive use of in vivo animal testing methods. However, animal testing methods are not always sufficient because there are doubts about their reliability and relevance. The physiology of different animals differs in numerous ways, so different species may respond differently to exposure to an agent. It is therefore questionable whether the responses obtained from animal tests can be extrapolated to human biology. Other methods include assessing risk through clinical studies on human volunteers. However, these risk assessments are performed a posteriori, and because disease may take decades to manifest, they may not be sufficient to elucidate the mechanisms that link harmful substances to disease. Still other methods include in vitro experiments. Although in vitro cell and tissue based methods have gained widespread acceptance as a complete or partial alternative to their animal based counterparts, these methods are of limited value: because in vitro methods focus on specific aspects of cellular and tissue mechanisms, they do not always take into account the complex interactions that occur throughout biological systems.
Over the past decade, high throughput measurements of nucleic acid, protein and metabolite levels in conjunction with traditional dose-related efficacy and toxicity assays have emerged as a method for elucidating the mechanisms of action of many biological processes. Researchers have attempted to combine information from these disparate measurements with knowledge about biological pathways from the scientific literature to construct meaningful biological models. To this end, researchers have begun to identify potential biological mechanisms of action using mathematical and computational techniques (e.g., clustering and statistical methods) that are capable of mining large amounts of data.
Previous work has also explored the importance of characterizing features that reveal changes in gene expression caused by one or more perturbations to a biological process, and of subsequently scoring the presence of such a feature in additional data sets as a measure of the magnitude of a particular activity of that process. Much of this work has been directed to identifying and scoring features associated with disease phenotypes. These phenotype-derived features provide significant classification power, but they lack a mechanistic or causal relationship between a single specific perturbation and the feature. Thus, these features may represent a variety of different unknown perturbations that result in, or are caused by, the same disease phenotype through generally unknown mechanisms.
One challenge is to understand how the activities of various individual biological entities in a biological system allow activation or inhibition of different biological mechanisms. Because individual entities (e.g., genes) can be involved in multiple biological processes (e.g., inflammation and cell proliferation), measurements of gene activity are not sufficient to identify the underlying biological process that triggered the activity.
None of the current techniques have been applied to perform predictive risk assessment and address the relationship between short-term exposure to perturbations and long-term disease outcomes. Typically, this problem is addressed by traditional longitudinal epidemiological studies, but such studies may present ethical challenges and fail to meet the current urgent need for risk assessment. Indeed, traditional longitudinal epidemiological studies cannot be used for new agents. Accordingly, there is a need for improved systems and methods to study the effects of perturbations on the human body.
Disclosure of Invention
Systems, methods, and products are described herein for quantifying a response of a biological system to one or more perturbations with activity data measured from a subset of entities of the biological system.
In one aspect, there is provided a computerized method for determining the effect of a perturbation on a biological system, comprising: receiving, at a processor, a first data set corresponding to a response of a biological system to a first treatment, wherein the biological system comprises a plurality of biological entities, wherein each biological entity in the biological system interacts with at least one other biological entity in the biological system; receiving, at the processor, a second data set corresponding to a response of the biological system to a second treatment, the second treatment being different from the first treatment; providing, at the processor, a plurality of computational network models representing the biological system, each model comprising nodes representing a plurality of biological entities and edges representing relationships between the nodes in the model; generating, at the processor, a first set of scores representing a perturbation of the biological system based on the first data set and the plurality of computational network models, and a second set of scores representing a perturbation of the biological system based on the second data set and the plurality of computational network models; and generating, at the processor, one or more biological impact factors representing a biological impact of the perturbation on the biological system based on each of the first set of scores and the second set of scores.
In one embodiment, more than two data sets are received and a corresponding number of score sets are generated. In certain embodiments, more than three, more than four, more than five, more than six, more than seven, more than eight, more than nine, or more than ten data sets are received. In some embodiments, at least as many data sets are received as there are perturbations or treatments.
In one embodiment, a biological impact factor is generated for each treatment.
In one embodiment, at least one data set includes treatment data and corresponding control data.
In one embodiment, at least one of the plurality of networks is a causal network.
In one embodiment, the score within each set of scores is independently calculated by a geometric perturbation index scoring technique, a probabilistic perturbation index scoring technique, or an expected perturbation index scoring technique.
In one embodiment, each score in the first set of scores and the second set of scores comprises a score vector, and the step of generating the biological impact factor further comprises filtering, at the processor, the first score and the second score to decompose each of the first score and the second score into a plurality of projections on a set of basis vectors.
In one embodiment, filtering further comprises removing at least one of the plurality of projections from at least one of the decomposed first score and second score.
In one embodiment, the set of basis vectors includes eigenvectors of a matrix describing the at least one model.
In one embodiment, generating the first set of scores and the second set of scores comprises: assigning, at the processor, a weight to each of the first set of scores and the second set of scores based on the respective computational network model and at least one of the first set of data and the second set of data; aggregating the weighted scores in the first set of scores; aggregating the weighted scores in the second set of scores; wherein the one or more biological impact factors are a function of the aggregated scores of the first set of scores and the second set of scores.
In one embodiment, the one or more biological impact factors are linear combinations, linear transforms, or quadratic functional forms of the aggregated scores of the first set of scores and the second set of scores.
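One way to write out the three functional forms named above is sketched below. The symbols are assumptions introduced for illustration only and do not appear in the source: $a^{(1)}, a^{(2)} \in \mathbb{R}^{K}$ denote the aggregated scores of the first and second sets of scores over $K$ network models, and $c_1$, $c_2$, $M$, and $Q$ are coefficient vectors and matrices chosen by the practitioner.

```latex
% Stacked vector of aggregated scores (assumed notation)
a = \begin{pmatrix} a^{(1)} \\ a^{(2)} \end{pmatrix} \in \mathbb{R}^{2K}

% Linear combination: a scalar impact factor
\mathrm{BIF}_{\mathrm{lin}} = c_1^{\top} a^{(1)} + c_2^{\top} a^{(2)}

% Linear transform: possibly several impact factors at once
\mathrm{BIF}_{\mathrm{transform}} = M a, \qquad M \in \mathbb{R}^{m \times 2K}

% Quadratic form: a scalar impact factor
\mathrm{BIF}_{\mathrm{quad}} = a^{\top} Q \, a, \qquad Q \in \mathbb{R}^{2K \times 2K}
```

With $Q$ chosen so that $a^{\top} Q a = (a^{(1)})^{\top} a^{(2)}$, the quadratic form reduces to the inner product between the two aggregated score vectors.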
In one embodiment, assigning weights to each of the first set of scores and the second set of scores includes selecting weights for each of the plurality of computational models to maximize a difference between the scores within the first set of scores and the scores within the second set of scores.
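The weight selection described above can be illustrated with a minimal sketch. It assumes one scalar score per network model and per treatment, and it assumes the weights are constrained to unit Euclidean norm; under that constraint the weighted difference is maximized by weights proportional to the score difference (Cauchy-Schwarz). The function names and the norm constraint are illustrative assumptions, not the method of the source.

```python
import math


def max_difference_weights(first_scores, second_scores):
    """Return unit-norm weights w that maximize w . (first - second).

    Assumption: one scalar score per network model, weights with ||w||_2 = 1.
    """
    if not first_scores or len(first_scores) != len(second_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    diff = [a - b for a, b in zip(first_scores, second_scores)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        # Identical scores: no direction is preferred, so use equal weights.
        n = len(diff)
        return [1.0 / math.sqrt(n)] * n
    return [d / norm for d in diff]


def aggregate(scores, weights):
    """Weighted sum of per-network scores."""
    return sum(w * s for w, s in zip(weights, scores))


# Hypothetical scores for three network models under two treatments.
first = [0.8, 0.1, 0.5]
second = [0.2, 0.1, 0.9]
weights = max_difference_weights(first, second)
gap = aggregate(first, weights) - aggregate(second, weights)
```

A network whose two scores coincide receives zero weight here, which matches the intuition that it does not help to distinguish the treatments. Other constraints (for example non-negative weights summing to one) would give a different optimum.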
In one embodiment, generating the biological impact factor includes determining an inner product between a first vector representing the aggregated score for the first set of scores and a second vector representing the aggregated score for the second set of scores.
In one embodiment, generating the biological impact factor includes determining a distance between a first surface defined by a first vector representing the aggregated score for a first set of scores and a second surface defined by a second vector representing the aggregated score for a second set of scores.
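The inner-product and surface-distance embodiments can be sketched as follows. The source does not define how a score surface is parameterized or which distance is used; this sketch assumes each surface is sampled on the same rectangular grid (for example dose by time) and uses the root-mean-square difference as the distance. Both choices, and the function names, are assumptions made for illustration.

```python
import math


def inner_product(u, v):
    """Inner product of two aggregated score vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def surface_distance(surface_a, surface_b):
    """Root-mean-square difference between two score surfaces.

    Assumption: each surface is a list of rows sampled on the same grid.
    """
    flat_a = [x for row in surface_a for x in row]
    flat_b = [x for row in surface_b for x in row]
    if not flat_a or len(flat_a) != len(flat_b):
        raise ValueError("surfaces must be non-empty and share one grid")
    squared = sum((a - b) ** 2 for a, b in zip(flat_a, flat_b))
    return math.sqrt(squared / len(flat_a))


# Hypothetical aggregated score vectors for two treatments.
bif_inner = inner_product([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])

# Hypothetical 2 x 2 score surfaces (rows: doses, columns: time points).
bif_distance = surface_distance([[1.0, 2.0], [3.0, 4.0]],
                                [[1.0, 2.0], [3.0, 6.0]])
```

In this toy input the surfaces differ at a single grid point by 2, so the mean squared difference is 1 and the distance is 1.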
In one embodiment, the computational network model is two or more selected from a cell proliferation network, an inflammatory process network, a cellular stress network, and a DNA damage, autophagy, cell death, and aging network.
In another aspect, a computer system is described, the computer system comprising a processor configured to: receive first data corresponding to responses of a set of biological entities to a first treatment, wherein the biological system comprises a plurality of biological entities, the plurality of biological entities comprising the set of biological entities, and wherein each biological entity in the biological system interacts with at least one other biological entity in the biological system; receive second data corresponding to a response of the set of biological entities to a second treatment, the second treatment being different from the first treatment; provide a plurality of computational causal network models representing the biological system, each computational model comprising nodes representing a plurality of biological entities and edges representing relationships between nodes in the plurality of biological entities; generate a first score representing a perturbation of the biological system based on the first data and the plurality of computational models, and generate a second score representing a perturbation of the biological system based on the second data and the plurality of computational models; and generate a biological impact factor based on the first score and the second score.
In one embodiment, each of the first score and the second score comprises a score vector, and the processor is further configured to: filter the first score and the second score to decompose each of the first score and the second score into a plurality of projections on a set of basis vectors; and remove at least one of the plurality of projections from at least one of the first score and the second score.
In one embodiment, the set of basis vectors comprises eigenvectors of a matrix describing the at least one computational model, or wherein generating the biological impact factor comprises determining an inner product between a first vector representing the first score and a second vector representing the second score.
In one embodiment, generating the biological impact factor includes determining a distance between a first surface representing the first score and a second surface representing the second score.
In one embodiment, the biological system comprises at least one of a cell proliferation mechanism, a cell stress mechanism, a cell inflammation mechanism, and a DNA repair mechanism.
In one embodiment, the first treatment comprises at least one of: exposure to an aerosol generated by heating tobacco; exposure to an aerosol generated by burning tobacco; exposure to tobacco smoke; exposure to cigarette smoke; exposure to a heterogeneous substance that is not present in or derived from the biological system; exposure to toxins, therapeutic compounds, stimulants, relaxants, natural products, manufactured products, or food substances; and exposure to one or more of cadmium, mercury, chromium, nicotine, and tobacco-specific nitrosamines and their metabolites (4-(methylnitrosamino)-1-(3-pyridyl)-1-butanone (NNK), N'-nitrosonornicotine (NNN), N-nitrosoanatabine (NAT), N-nitrosoanabasine (NAB), and 4-(methylnitrosamino)-1-(3-pyridyl)-1-butanol (NNAL)).
In another aspect, a computer program product is described, comprising computer code adapted to perform the method disclosed herein.
In another aspect, a computer or computer-readable medium is described, which includes a computer program product.
In another aspect, there is provided a method for determining a biological impact of a perturbation on a biological system, comprising: generating one or more biological impact factors representing a biological impact of the perturbation on the biological system, wherein at least one biological impact factor is determined according to the computerized method described herein; comparing the one or more biological impact factors to one or more biological impact factors that have been obtained in the absence of a perturbation or in the presence of a different perturbation; and wherein the comparison is indicative of a biological effect of the perturbation on the biological system.
In another aspect, there is provided a computerized method for determining a biological impact of a perturbation on a biological system, comprising: generating one or more biological impact factors representing a biological impact of the perturbation on the biological system, wherein at least one biological impact factor is determined according to the computerized method described in any of claims 1-15, 21 or 22; comparing the one or more biological impact factors to one or more biological impact factors that have been obtained in the absence of a perturbation or in the presence of a different perturbation; and wherein the comparison is indicative of a biological effect of the perturbation on the biological system.
In one embodiment, the one or more biological impact factors represent, or are used to estimate or determine, the magnitude of a desired or adverse biological effect caused by a pathogen, a harmful substance, a manufactured product, a product manufactured for safety assessment or for comparison of risk of use, a therapeutic compound, or a change in the environment or an environmentally active substance.
In one embodiment, more than two different perturbations are used to compare the effect of the different perturbations on the biological system.
In one embodiment, the one or more perturbations represent at least two different treatment conditions.
In one embodiment, the at least one treatment comprises at least one of: exposure to an aerosol generated by heating tobacco; exposure to an aerosol generated by burning tobacco; exposure to tobacco smoke; exposure to cigarette smoke; exposure to a heterogeneous substance comprising molecules or entities that are not present in or derived from the biological system; and exposure to toxins, therapeutic compounds, stimulants, relaxants, natural products, manufactured products, or food substances.
In one embodiment, the perturbation is caused by one or more agents.
In one embodiment, the agent is selected from the group consisting of: an aerosol generated by heating tobacco, an aerosol generated by burning tobacco, tobacco smoke, cigarette smoke, and any gaseous or particulate component thereof, cadmium, mercury, chromium, nicotine, tobacco-specific nitrosamines and metabolites thereof (4-(methylnitrosamino)-1-(3-pyridyl)-1-butanone (NNK), N'-nitrosonornicotine (NNN), N-nitrosoanatabine (NAT), N-nitrosoanabasine (NAB), and 4-(methylnitrosamino)-1-(3-pyridyl)-1-butanol (NNAL)), or a combination of one or more of the foregoing.
In one embodiment, the at least one biological impact factor has been predetermined or pre-calculated.
In another aspect, there is provided a computerized method for determining the effect of a perturbation on a biological system, comprising: receiving, at a processor, first data corresponding to responses of a set of biological entities to a first treatment, wherein the biological system includes a plurality of biological entities comprising the set of biological entities, wherein each biological entity in the biological system interacts with at least one other biological entity in the biological system; receiving, at the processor, second data corresponding to a response of the set of biological entities to a second treatment, the second treatment being different from the first treatment; providing, at the processor, a plurality of computational causal network models representing the biological system, each computational model comprising nodes representing a plurality of biological entities and edges representing relationships between entities of the plurality of biological entities; generating, at the processor, a first score representing a perturbation of the biological system based on the first data and the plurality of computational models, and a second score representing a perturbation of the biological system based on the second data and the plurality of computational models; and generating, at the processor, a biological impact factor representing a biological impact of the perturbation on the biological system based on the first score and the second score.
In one embodiment, each of the first score and the second score comprises a score vector, and the step of generating the biological impact factor further comprises filtering, at the processor, the first score and the second score to decompose each of the first score and the second score into a plurality of projections on a set of basis vectors.
In one embodiment, filtering further comprises removing at least one of the plurality of projections from at least one of the decomposed first score and second score.
In one embodiment, the set of basis vectors includes eigenvectors of a matrix describing the at least one computational model.
In one embodiment, generating the first score and the second score comprises: assigning, at the processor, a weight to each of the plurality of computational models based on the respective computational model and at least one of the first data and the second data; generating, at the processor, a plurality of first scores corresponding to the plurality of computational models and based on the first data; generating, at the processor, a plurality of second scores corresponding to the plurality of computational models and based on the second data; combining the plurality of first scores according to the assigned weights; and combining the plurality of second scores according to the assigned weights; wherein the biological impact factor is a function of the combined plurality of first scores and the combined plurality of second scores.
In one embodiment, assigning a weight to each of the plurality of computational models includes selecting a weight for each of the plurality of computational models to maximize a difference between the plurality of first scores and the plurality of second scores.
In one embodiment, generating the biological impact factor includes determining an inner product between a first vector representing the first score and a second vector representing the second score.
In one embodiment, generating the biological impact factor includes determining a distance between a first surface representing the first score and a second surface representing the second score.
In one embodiment, the computational causal network model is two or more selected from a cell proliferation network, an inflammatory process network, a cellular stress network, and a DNA damage, autophagy, cell death, and aging network.
In another aspect, a computer system for determining biological impact factors is provided, comprising an apparatus adapted to implement a computerized method.
In one embodiment, the computer system includes a processor configured to: receive first data corresponding to responses of a set of biological entities to a first treatment, wherein the biological system comprises a plurality of biological entities, the plurality of biological entities comprising the set of biological entities, and wherein each biological entity in the biological system interacts with at least one other biological entity in the biological system; receive second data corresponding to a response of the set of biological entities to a second treatment, the second treatment being different from the first treatment; provide a plurality of computational causal network models representing the biological system, each computational model comprising nodes representing a plurality of biological entities and edges representing relationships between nodes in the plurality of biological entities; generate a first score representing a perturbation of the biological system based on the first data and the plurality of computational models, and generate a second score representing a perturbation of the biological system based on the second data and the plurality of computational models; and generate a biological impact factor based on the first score and the second score.
In one embodiment, each of the first score and the second score comprises a score vector, and the processor is further configured to: filter the first score and the second score to decompose each of the first score and the second score into a plurality of projections on a set of basis vectors; and remove at least one of the plurality of projections from at least one of the first score and the second score.
In one embodiment, the set of basis vectors includes eigenvectors of a matrix describing the at least one computational model.
In one embodiment, generating the biological impact factor includes determining an inner product between a first vector representing the first score and a second vector representing the second score.
In one embodiment, generating the biological impact factor includes determining a distance between a first surface representing the first score and a second surface representing the second score.
In one embodiment, the biological system comprises at least one of a cell proliferation mechanism, a cell stress mechanism, a cell inflammation mechanism, and a DNA repair mechanism. In one embodiment, the first treatment comprises at least one of: exposure to an aerosol generated by heating tobacco; exposure to an aerosol generated by burning tobacco; exposure to tobacco smoke; exposure to cigarette smoke; exposure to a heterogeneous substance comprising molecules or entities that are not present in or derived from the biological system; and exposure to toxins, therapeutic compounds, stimulants, relaxants, natural products, manufactured products, or food substances.
In another aspect, a computer program product is provided comprising program code adapted to perform the computerized method of the invention.
In another aspect, a computer or computer readable medium is provided, which comprises the computer program product of the present invention.
In one aspect, the systems and methods described herein relate to a computerized method (e.g., a computer-implemented method) and one or more computer processors for quantifying the effect of a perturbation on a biological system (e.g., in response to a treatment condition such as an agent exposure, or in response to a plurality of treatment conditions). The processor receives first data corresponding to a response of a set of biological entities to a first treatment. The set of biological entities is a subset of a plurality of biological entities included in the biological system. Each biological entity in the biological system interacts with at least one other biological entity in the biological system. The processor also receives second data corresponding to a response of the set of biological entities to a second treatment, the second treatment being different from the first treatment. The processor also provides a plurality of computational causal network models representing the biological system. Each computational model includes nodes representing a plurality of biological entities and edges representing relationships between entities in the plurality of biological entities.
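The node-and-edge structure of a computational network model described above might be represented as in the following minimal sketch. The class names, the signed-edge encoding, and the placeholder entity names are all assumptions introduced for illustration; the source does not prescribe a data structure. The helper `isolated_nodes` checks the stated condition that each entity interacts with at least one other entity.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edge:
    """Directed relationship between two biological entities."""
    source: str
    target: str
    sign: int  # assumed encoding: +1 = increases, -1 = decreases


@dataclass
class NetworkModel:
    """Toy network model: nodes are entities, edges are relationships."""
    name: str
    nodes: set = field(default_factory=set)
    edges: list = field(default_factory=list)

    def add_edge(self, source, target, sign):
        if sign not in (1, -1):
            raise ValueError("sign must be +1 or -1")
        self.nodes.update((source, target))
        self.edges.append(Edge(source, target, sign))

    def isolated_nodes(self):
        """Entities that do not interact with any other entity."""
        connected = set()
        for edge in self.edges:
            if edge.source != edge.target:  # a self-loop is not an interaction
                connected.update((edge.source, edge.target))
        return self.nodes - connected


# Placeholder entities; real models would use curated biological entities.
model = NetworkModel(name="toy_stress_network")
model.add_edge("entity_a", "entity_b", +1)
model.add_edge("entity_b", "entity_c", -1)
model.nodes.add("entity_d")  # deliberately left unconnected
```

Calling `model.isolated_nodes()` on this toy model flags `entity_d`, which would have to be connected or dropped before the model satisfies the interaction condition.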
The processor then generates a first score representing the perturbation of the biological system based on the first data and the plurality of computational models, and generates a second score representing the perturbation of the biological system based on the second data and the plurality of computational models. The processor then generates a "biological impact factor" or "BIF" based on the first score and the second score. In various embodiments, the computerized method combines a plurality of model scores corresponding to a plurality of treatments (or agents) and generates a BIF that represents the relative biological effect elicited by the treatments (or agents). In some embodiments, generating the biological impact factor includes determining an inner product between a first vector representing the first score and a second vector representing the second score. In some embodiments, generating the biological impact factor includes determining a distance between a first surface representing the first set of scores and a second surface representing the second set of scores.
In some embodiments, each of the first and second scores comprises a score vector, and the step of generating the biological impact factor further comprises filtering, at the processor, the first score and the second score to decompose each of the first score and the second score into a plurality of projections on a set of basis vectors. The filtering may further include removing at least one of the plurality of projections from at least one of the decomposed first score and second score. The set of basis vectors may include eigenvectors of a matrix describing at least one computational model, such as a Laplacian matrix.
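The decompose-and-remove filtering described above can be sketched on a toy example. The sketch assumes a three-node path network, uses the graph Laplacian of that network as the describing matrix, and hardcodes its known orthonormal eigenvectors so that no numerical library is needed. Which projections to remove is left to the caller; dropping the highest-eigenvalue component here is an illustrative choice, not one stated in the source.

```python
import math

# Graph Laplacian of a hypothetical path network A - B - C.
LAPLACIAN = [[1.0, -1.0, 0.0],
             [-1.0, 2.0, -1.0],
             [0.0, -1.0, 1.0]]

# (eigenvalue, orthonormal eigenvector) pairs of LAPLACIAN.
BASIS = [
    (0.0, [1 / math.sqrt(3)] * 3),
    (1.0, [1 / math.sqrt(2), 0.0, -1 / math.sqrt(2)]),
    (3.0, [1 / math.sqrt(6), -2 / math.sqrt(6), 1 / math.sqrt(6)]),
]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def is_eigenpair(matrix, eigenvalue, vector, tol=1e-9):
    """Check that matrix @ vector equals eigenvalue * vector."""
    return all(abs(dot(row, vector) - eigenvalue * x) < tol
               for row, x in zip(matrix, vector))


def decompose(score, basis):
    """Projection coefficient of the score vector on each basis vector."""
    return [dot(score, vec) for _, vec in basis]


def filter_score(score, basis, keep):
    """Rebuild the score from the projections whose eigenvalue passes keep()."""
    filtered = [0.0] * len(score)
    for (eigenvalue, vec), coeff in zip(basis, decompose(score, basis)):
        if keep(eigenvalue):
            filtered = [f + coeff * x for f, x in zip(filtered, vec)]
    return filtered


score = [1.0, 2.0, 6.0]  # hypothetical per-node network response scores
smoothed = filter_score(score, BASIS, keep=lambda lam: lam < 3.0)
restored = filter_score(score, BASIS, keep=lambda lam: True)
```

For this input, removing the eigenvalue-3 projection gives approximately `[0.5, 3.0, 5.5]`, while keeping every projection reproduces the original score vector. For larger models the eigenvectors would normally be computed numerically rather than written out by hand.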
In some embodiments, generating the first and second scores includes assigning a weight to each of the plurality of computational models based on the respective computational model and at least one of the first and second data. A weight may be assigned, for example, to maximize the difference between the first score and the second score. The processor may also generate a plurality of first scores corresponding to the plurality of computational models and based on the first data, and a plurality of second scores corresponding to the plurality of computational models and based on the second data. The processor may then combine the plurality of first scores according to the assigned weights and combine the plurality of second scores according to the assigned weights. In some such embodiments, the biological impact factor is a function of the combined plurality of first scores and the combined plurality of second scores.
In certain embodiments, the biological system includes, but is not limited to, at least one of a cell proliferation mechanism, a cell stress mechanism, a cell inflammation mechanism, a DNA repair mechanism, a DNA damage mechanism, an autophagy mechanism, a cell death mechanism, and an aging mechanism. Treatments may include, but are not limited to, exposure to heterogeneous substances, including molecules or entities that are not present in or derived from the biological system. Treatments may include, but are not limited to, exposure to toxins, therapeutic compounds, stimulants, relaxants, natural products, manufactured products, and food substances. Treatments may include, but are not limited to, exposure to aerosols generated by heating tobacco, aerosols generated by burning tobacco, tobacco smoke, and cigarette smoke. Treatments may include, but are not limited to, exposure to cadmium, mercury, chromium, nicotine, and tobacco-specific nitrosamines and their metabolites (4-(methylnitrosamino)-1-(3-pyridyl)-1-butanone (NNK), N'-nitrosonornicotine (NNN), N-nitrosoanatabine (NAT), N-nitrosoanabasine (NAB), and 4-(methylnitrosamino)-1-(3-pyridyl)-1-butanol (NNAL)). In certain embodiments, the agent comprises a product for nicotine replacement therapy.
The computerized methods described herein may be implemented in a computerized system having one or more computing devices, each comprising one or more processors. Generally, the computerized systems described herein may include one or more engines comprising one or more processing devices, e.g., computers, microprocessors, logic devices, or other devices or processors, configured with hardware, firmware, and software to perform one or more computerized methods described herein. In certain implementations, the computerized system includes a system response curve engine, a network modeling engine, and a network scoring engine. The engines may be interconnected from time to time, and also connected from time to time with one or more databases, including perturbation databases, measurable databases, experimental data databases, and literature databases. The computerized systems described herein may include a distributed computerized system having one or more processors and engines that communicate through a network interface. Such implementations may be suitable for distributed computing via a variety of communication systems.
Drawings
Further features of the disclosure, as well as its nature and various advantages, will become apparent upon consideration of the following detailed description, taken in conjunction with the accompanying drawings, in which like reference characters refer to the same parts throughout the drawings, and in which:
FIG. 1 is a block diagram of an exemplary computerized system for quantifying the effects of biological perturbations.
FIG. 2 is a flow chart of an illustrative process for generating Biological Impact Factors (BIFs).
FIG. 3 is a graphical representation of data underlying a system response curve including data for two agents, two parameters, and N biological entities.
FIG. 4 is a diagram of a computational model of a biological network having several biological entities and their relationships.
FIG. 5 is a block diagram of an exemplary computerized aggregation engine for generating BIFs.
FIG. 6 is a flow diagram of an exemplary process for generating a BIF from a network response score.
FIG. 7 illustrates an exemplary decomposition of a network response score vector.
FIGS. 8A and 8B illustrate a rational filtering operation for the network response score vector.
FIG. 9 shows an example of network weighting during network response score aggregation.
FIG. 10 depicts two network response score surfaces that may be analyzed by the computerized system of FIG. 1.
FIG. 11 is a block diagram of an exemplary distributed computerized system for quantifying the effects of biological perturbations.
FIG. 12 is a block diagram of an exemplary computing device that may be used to implement any of the components in any of the computerized systems described herein.
FIG. 13 depicts experimental results of nasal epithelial tumorigenesis BIF generated in accordance with the systems and methods and illustrative embodiments disclosed herein.
FIG. 14 illustrates a systematic approach to experimental design for biological impact factor aggregation. A set of well-chosen biological systems is exposed to the substance in a time- and dose-dependent manner to generate system-level data that will be interpreted in the context of each biological network associated with disease onset.
FIG. 15 illustrates a computational process for deriving biological impact factors for a given biologically active substance using system-level data analyzed in the context of biological networks associated with disease onset.
Detailed Description
FIG. 1 is a block diagram of a computerized system 100 for quantifying biological effects of one or more perturbations. In particular, the system 100 includes a system response curve engine 110, a network modeling engine 112, a network scoring engine 114, and an aggregation engine 116. Engines 110, 112, 114, and 116 are interconnected from time to time, and are also connected from time to time with one or more databases, including perturbation database 102, measurable database 104, experimental data database 106, and literature database 108. As used herein, an engine comprises one or more processing devices, such as a computer, microprocessor, logic device, or one or more other devices described with reference to fig. 12, configured in hardware, firmware, and software to perform one or more of the computing techniques described herein.
During operation, for a given perturbation, the system 100 generates a Biological Impact Factor (BIF), which is a quantitative measure of the effect, including the long-term effect, of the perturbation on a biological system, including the human body. More specifically, the system 100 generates or provides computerized models (collectively referred to as "biological networks") of one or more biological systems and mechanisms with respect to the type of perturbation, the biological mechanism of interest, or the specific long-term outcome of interest. For example, the system 100 may generate or provide a computational model of the mechanism of cell proliferation when cells have been exposed to cigarette smoke. In such an example, the system 100 may also generate or provide one or more computational models representing different stages of disease, including but not limited to pulmonary disease and cardiovascular disease. In particular aspects, the system 100 generates these computerized models based on at least one of the applied perturbation (e.g., exposure to an agent), measurable quantities of interest, the outcomes studied (e.g., cell proliferation, cell stress, inflammation, DNA repair), experimental results, and knowledge obtained from the scientific literature. The system 100 measures and quantifies the effect of the treatment to generate the BIF. Prediction/verification engine 122 can receive one or more BIF values and can use these BIF values to make outcome predictions (e.g., a reduced incidence or likelihood of cancer after toxic substances are removed from a human environment). Prediction/verification engine 122 may also or alternatively compare BIF values to known biological outcomes to calibrate predictions of BIF values or inflammatory BIF values. An example of calibration and verification is represented by the results shown in FIG. 13 below.
The various components and engines of the system 100 include at least one of software and hardware components and will be further described with reference to fig. 11 and 12.
FIG. 2 is a flow diagram of a process 200 for quantifying the impact of a perturbation on a biological network by computing a Biological Impact Factor (BIF), according to one implementation. The steps of process 200 will be described as being performed by the various components of system 100 of FIG. 1, but any of these steps may be performed by any suitable hardware or software component (local or remote) and may be arranged in any suitable order or performed in parallel. At step 210, the system response curve (SRP) engine 110 receives biological data from a variety of different sources, and the data itself may be of a variety of different types. In some implementations of step 210, the SRP engine 110 receives first data corresponding to a response of a set of biological entities to a first treatment and receives second data corresponding to a response of the set of biological entities to a second treatment, the second treatment being different from the first treatment. For example, the data received at step 210 may include data from experiments in which biological systems are perturbed by exposure to agents or environmental conditions, and may also include control data.
A biological system in the context of the present invention is an organism or a part of an organism, including functional parts, herein referred to as a subject. The subject is typically a mammal, including a human. The subject can be an individual human in a human population. The term "mammal" as used herein includes, but is not limited to, humans, non-human primates, mice, dogs, cats, cows, sheep, horses, and pigs. Mammals other than humans can advantageously be used as subjects to provide models of human disease. The non-human subject can be an unmodified animal or a genetically modified animal (e.g., a transgenic animal or an animal carrying one or more gene mutations or silenced genes). The subject can be male or female. Depending on the purpose of the procedure, the subject can be one who has been exposed to the agent of interest. The subject can be one that has been exposed to the agent for an extended period of time, optionally including the time prior to the study. The subject can be one that has been exposed to the agent for a period of time but is no longer in contact with the agent. The subject can be one that has been diagnosed or identified as having a disease. The subject can be one who has undergone or is undergoing treatment for a disease or an adverse health condition. The subject may also be one who has displayed one or more symptoms or risk factors for a particular health condition or disease. The subject can be one who is predisposed to a disease, and may or may not show symptoms of the disease. In certain implementations, the disease or health condition in question is associated with exposure to an agent or use of an agent for an extended period of time. According to certain implementations, the system 100 (FIG. 1) contains or generates a computerized model of one or more biological systems and their functional mechanisms (collectively, "biological networks" or "network models") that are relevant to the type or outcome of perturbation of interest.
Depending on the context of operation, a biological system can be defined at different levels as it relates to the function of an individual organism in a population, an organism generally, an organ, a tissue, a cell type, an organelle, a cellular component, or the cells of a particular individual. Each biological system includes one or more biological mechanisms or pathways whose operation is manifested as a functional characteristic of the system. Animal systems that reproduce defined characteristics of a human health condition and that are suitable for exposure to agents of interest are preferred biological systems. Cellular and organotypic systems that reflect the cell types and tissues involved in the etiology or pathology of a disease are likewise preferred biological systems. Priority can be given to primary cell or organ cultures that recapitulate as much as possible the human biology in vivo. It is also important to match the in vitro human cell culture with the most equivalent culture derived from the in vivo animal model. This allows the matched in vitro systems to be used as reference systems to create a translational continuum from the animal model in vivo to human biology. Thus, a biological system contemplated for use with the systems and methods described herein can be defined by, without limitation, a functional characteristic (biological function, physiological function, or cellular function), organelle, cell type, tissue type, organ, stage of development, or a combination of the foregoing. Examples of biological systems include, but are not limited to, the pulmonary, integumentary, skeletal, muscular, nervous (central and peripheral), endocrine, cardiovascular, immune, circulatory, respiratory, urinary, renal, gastrointestinal, colorectal, hepatic, and reproductive systems.
Other exemplary biological systems include, but are not limited to, various cellular functions in epithelial cells, neural cells, blood cells, connective tissue cells, smooth muscle cells, skeletal muscle cells, adipocytes, egg cells, sperm cells, stem cells, lung cells, brain cells, cardiac muscle cells, laryngeal cells, pharyngeal cells, esophageal cells, stomach cells, kidney cells, liver cells, breast cells, prostate cells, pancreatic islet cells, testicular cells, bladder cells, cervical cells, uterine cells, colon cells, and rectal cells. Certain cells may be cells of a cell line, cultured in vitro or maintained indefinitely in vitro under appropriate culture conditions. Examples of cellular functions that may also be considered functional features of a biological system include, but are not limited to, cell proliferation (e.g., cell division), degeneration, regeneration, senescence, control of cellular activities by the nucleus, cell-to-cell signaling, cell differentiation, cell de-differentiation, secretion, migration, phagocytosis, repair, apoptosis, and developmental programming. Examples of cellular components that can be considered biological systems include, but are not limited to, cytoplasm, cytoskeleton, membranes, ribosomes, mitochondria, nuclei, endoplasmic reticulum (ER), Golgi apparatus, lysosomes, DNA, RNA, proteins, peptidoglycans, and antibodies.
A perturbation in a biological system can be caused by exposure to, or contact with, one or more agents over a period of time at one or more parts of the biological system. An agent can be a single substance or a mixture of substances, including mixtures in which not all ingredients are identified or characterized. The chemical and physical properties of the agent or its components may not be fully characterized. An agent can be defined by its structure, its ingredients, or a source that under certain conditions will produce the agent. Examples of agents are heterologous substances, i.e., molecules or entities that are not present within or derived from the biological system, and any intermediates or metabolites that are generated therefrom upon contact with the biological system. The agent can be a carbohydrate, protein, lipid, nucleic acid, alkaloid, vitamin, metal, heavy metal, mineral, oxygen, ion, enzyme, hormone, neurotransmitter, inorganic compound, organic compound, environmental agent, microorganism, particle, environmental condition, environmental force, or physical force. Non-limiting examples of agents include nutrients, metabolic wastes, poisons, drugs, toxins, therapeutic compounds, irritants, relaxants, natural products, manufactured products, food materials, pathogens (prions, viruses, bacteria, fungi, protozoa), particles or entities having a size in the micrometer range or below, byproducts of the above items, and mixtures thereof. Non-limiting examples of physical agents include radiation, electromagnetic waves (including sunlight), increases or decreases in temperature, shear forces, fluid pressure, electrical discharge, or consequences thereof, or trauma.
Some agents do not perturb a biological system unless they reach a threshold concentration, are in contact with the biological system for a period of time, or both. The exposure or contact that causes a perturbation may be quantified in terms of dose. Thus, the perturbation can be caused by prolonged exposure to the agent. The duration of exposure can be expressed in units of time, by the frequency of exposure, or as a percentage of time within the subject's actual or estimated lifespan. A perturbation can also be caused by withholding an agent from one or more parts of the biological system (as described above) or by limiting the supply of an agent to one or more parts of the biological system. For example, a perturbation can result from a reduced supply or absence of nutrients, water, carbohydrates, proteins, lipids, alkaloids, vitamins, minerals, oxygen, ions, enzymes, hormones, neurotransmitters, antibodies, cytokines, or light, from restricting the movement of certain parts of the organism, or from restricting or requiring exercise.
The agent may cause different perturbations depending on which part(s) of the biological system are exposed and on the exposure conditions. Non-limiting examples of agents may include any of an aerosol generated by heating tobacco, an aerosol generated by burning tobacco, tobacco smoke, or cigarette smoke, as well as gaseous or particulate components thereof. Further non-limiting examples of agents include cadmium, mercury, chromium, nicotine, and tobacco-specific nitrosamines and their metabolites, such as 4-(methylnitrosamino)-1-(3-pyridyl)-1-butanone (NNK), N'-nitrosonornicotine (NNN), N-nitrosoanatabine (NAT), N-nitrosoanabasine (NAB), and 4-(methylnitrosamino)-1-(3-pyridyl)-1-butanol (NNAL), as well as any product used in nicotine replacement therapy. The exposure regimen or composite stimulus for an agent should reflect the range and circumstances of exposure in everyday settings. A set of standard exposure protocols can be designed for systematic application to equally well-defined experimental systems. Each assay can be designed to collect time- and dose-dependent data to capture early and late events and to ensure coverage of a typical dose range. However, it will be understood by those skilled in the art that the systems and methods described herein may be adapted and modified to suit the application being addressed, that the systems and methods described herein may be used in other suitable applications, and that such other additions and modifications do not depart from the scope of the present invention.
In various implementations, high-throughput system-level measurements of gene expression, protein expression or turnover, microRNA expression or turnover, post-translational modifications, protein modifications, translocations, antibody production, metabolite profiles, or a combination of two or more of the foregoing are generated under various conditions, including the respective controls. Functional outcome measures are desirable in the methods described herein because they can generally serve as anchors for the evaluation and represent clear steps in disease etiology.
As used herein, "sample" refers to any biological sample (e.g., a cell, tissue, organ, or whole animal) that is isolated from a subject or an experimental system. The sample can include, without limitation, a single cell or a plurality of cells, a cellular fraction, a tissue biopsy, excised tissue, a tissue extract, a tissue culture extract, a tissue culture medium, exhaled breath, whole blood, platelets, serum, plasma, red blood cells, white blood cells, lymphocytes, neutrophils, macrophages, B cells or subsets thereof, T cells or subsets thereof, hematopoietic cell subsets, endothelial cells, synovial fluid, lymph, ascites fluid, interstitial fluid, bone marrow, cerebrospinal fluid, pleural fluid, tumor infiltrates, saliva, mucus, sputum, semen, sweat, urine, or any other bodily fluid. Samples can be obtained from a subject by methods including, but not limited to, venipuncture, drainage, biopsy, needle aspiration, lavage, scraping, surgical resection, or other methods known in the art.
During operation, for a given biological mechanism, outcome, perturbation, or combination of the foregoing, the system 100 can generate a network response score value, which is a quantitative measure of the change in state of the biological entities in a network in response to a treatment condition. The number of scores in the score set may correspond to the number of networks.
The system 100 (FIG. 1) includes one or more computerized network models related to a health condition, disease, or biological outcome of interest. One or more of these network models are based on existing biological knowledge and can be uploaded from external sources and managed within the system 100. The models can also be generated de novo within the system 100 based on measurements. The measurable elements are thus integrated into the biological network models using prior knowledge. Described below are types of data that represent changes in a biological system of interest, or responses to perturbations, and that can be used to generate or refine a network model.
Returning to FIG. 2, at step 210, the system response curve (SRP) engine 110 receives the biological data. The SRP engine 110 may receive the data from a variety of different sources, and the data itself may be of a variety of different types. The biological data used by the SRP engine 110 may be obtained from literature databases (including data from preclinical, clinical, and post-clinical trials of pharmaceutical products or medical devices), genomic databases (genomic sequence and expression data, e.g., the Gene Expression Omnibus of the National Center for Biotechnology Information or ArrayExpress of the European Bioinformatics Institute (Parkinson et al., 2010, Nucl. Acids Res., doi:10.1093/nar/gkq1040, Pubmed ID 71405)), commercially available databases (e.g., Gene Logic of Gaithersburg, Md., USA), or experimental work. The data may include raw data from one or more different sources, e.g., in vitro, ex vivo, or in vivo experiments using one or more species and specifically designed to study the effects of particular treatment conditions or exposure to particular agents. In vitro experimental systems may include tissue cultures or organotypic cultures (three-dimensional cultures) that represent key aspects of human disease. In such implementations, the agent dosages and exposure regimens used in these experiments may substantially reflect the range and circumstances of exposure that may be expected for humans during normal use or activity conditions, or during special use or activity conditions. Experimental parameters and experimental conditions may be selected as desired to reflect the nature and exposure conditions of the agent, the molecules and pathways of the biological system in question, the cell types and tissues involved, the outcome of interest, and aspects of the etiology of the disease.
Molecules, cells or tissues derived from a particular animal model can be matched to a particular human molecule, cell or tissue culture to improve the interpretability of animal-based findings.
Much of the data received by the SRP engine 110 is generated by high-throughput experimental techniques and includes, but is not limited to, data relating to nucleic acids (e.g., absolute or relative amounts of a particular DNA or RNA species, changes in DNA sequence, changes in RNA sequence, changes in tertiary structure, or methylation patterns, as determined by sequencing, nucleic acid-specific hybridization on a microarray, quantitative polymerase chain reaction, or other techniques known in the art), proteins/peptides (e.g., absolute or relative amounts of protein, specific fragments of protein, peptidoglycan, changes in secondary or tertiary structure, or post-translational modifications, as determined by methods known in the art), and functional activities under certain conditions (e.g., enzymatic activity, proteolytic activity, translational regulatory activity, trafficking activity, binding affinity to certain binding partners). Modifications, including post-translational modifications of proteins or peptides, can include, but are not limited to, methylation, acetylation, farnesylation, biotinylation, stearoylation, formylation, myristoylation, palmitoylation, geranylgeranylation, pegylation, phosphorylation, sulfation, glycosylation, sugar modification, lipidation, lipid modification, ubiquitination, sumoylation, disulfide bonding, cysteinylation, oxidation, glutathionylation, carboxylation, glucuronidation, and deamidation. In addition, proteins can also be post-translationally modified by a series of reactions, for example, the Amadori reaction, the Schiff base reaction, and the Maillard reaction, which produce glycated protein products.
The data may also include measured functional outcomes such as, but not limited to, functional outcomes at the cellular level, including cell proliferation, developmental fate, and cell death, and functional outcomes at the physiological level, including lung capacity, blood pressure, and exercise proficiency. The data may also include measures of disease activity or severity, such as, but not limited to, tumor metastasis, tumor remission, loss of function, and life expectancy at a certain stage of the disease. Disease activity can be measured by a clinical assessment whose outcome is a value or set of values obtained, under defined conditions, from the evaluation of a sample (or population of samples) from one or more subjects. The clinical assessment can also be based on answers provided by the subject in interviews or questionnaires.
Such data may be generated for explicit use in determining the system response curve, or may be generated in previous experiments or published in the literature. Generally, the data includes information related to a molecule, biological structure, physiological condition, genetic characteristic, or phenotype. In certain implementations, the data includes a description of a condition, location, quantity, activity, or substructure of a molecule, a biological structure, a physiological condition, a genetic characteristic, or a phenotype. As will be described later, in a clinical setting, the data may include raw or processed data obtained from assays performed on samples obtained from human subjects or observations of human subjects exposed to the agent.
At step 212, the system response curve (SRP) engine 110 generates a system response curve (SRP) based on the biological data received at step 210. An SRP is a representation of the extent to which one or more measured entities (e.g., molecules, nucleic acids, peptides, proteins, cells, etc.) within a biological system individually change in response to a perturbation (e.g., exposure to an agent) applied to the biological system. This step may include one or more of background correction, normalization, fold change calculation, significance determination, and identification of differential responses (e.g., differentially expressed genes). In one example, to generate an SRP, the SRP engine 110 collects a set of measurements for a given set of parameters (e.g., treatment or perturbation conditions) applied to a given experimental system (a "system-treatment" pair). FIG. 3 shows two SRPs: SRP 302, comprising biological activity data of N different biological entities subjected to a first treatment 306 with varying parameters (e.g., dose and time of exposure to a first treatment agent), and a similar SRP 304, comprising biological activity data of N different biological entities subjected to a second treatment 308. The data included within the SRP may be raw experimental data, processed experimental data (e.g., filtered to remove outliers, labeled with confidence estimates, averaged over multiple trials), data generated by computational biological models, or data taken from the scientific literature. An SRP can represent data in a number of ways, such as absolute values, absolute changes, fold changes, logarithmic changes, functions, and tables. The SRP engine 110 passes the SRP to the network modeling engine 112.
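The fold change calculation mentioned above can be illustrated with a minimal Python sketch. It is not the implementation of the SRP engine 110; the function names `log2_fold_changes` and `build_srp`, the pseudocount, the entity names, and the toy measurements are all assumptions introduced for illustration. The sketch represents an SRP as a table of log2 fold changes per biological entity, indexed by (dose, time) treatment parameters.

```python
import math
from statistics import mean

def log2_fold_changes(treated, control, pseudocount=1.0):
    """Return {entity: log2 fold change} for entities measured in both inputs.

    treated / control map an entity name to a list of replicate measurements.
    The pseudocount (an illustrative assumption) avoids division by zero.
    """
    changes = {}
    for entity in treated.keys() & control.keys():
        t = mean(treated[entity]) + pseudocount
        c = mean(control[entity]) + pseudocount
        changes[entity] = math.log2(t / c)
    return changes

def build_srp(measurements, control):
    """Build a toy SRP: {(dose, time): {entity: log2 fold change}}."""
    return {cond: log2_fold_changes(data, control)
            for cond, data in measurements.items()}

# Hypothetical replicate measurements for two entities.
control = {"GENE_A": [9.0, 11.0], "GENE_B": [20.0, 20.0]}
measurements = {
    ("low", "4h"): {"GENE_A": [19.0, 23.0], "GENE_B": [20.0, 20.0]},
    ("high", "4h"): {"GENE_A": [40.0, 46.0], "GENE_B": [9.0, 10.0]},
}
srp = build_srp(measurements, control)
```

In this toy data set, GENE_A doubles at the low dose (log2 fold change of 1) and GENE_B halves at the high dose (log2 fold change of -1). Background correction, normalization across samples, and significance determination are omitted here.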
At step 214, the network modeling engine 112 provides a plurality of computational models of the biological system, which include the biological entities for which data has been obtained at step 210. Each computational model includes nodes representing biological entities and edges representing relationships between the biological entities in the biological system. The network modeling engine 112 may derive the computational models from one or more databases that include a plurality of network models, each selected as being relevant to an agent or feature of interest. The selection may be based on prior knowledge of the mechanisms underlying the biological functions of the system. In particular embodiments, network modeling engine 112 may use system response curves, networks in databases, and networks previously described in the literature to extract causal relationships between entities within the system, thereby creating, refining, or expanding a network model.
In some embodiments of step 214, the network modeling engine 112 applies the system response curves from the SRP engine 110 to a network model based on the mechanisms underlying the biological functions of the system. Although the SRP derived in the previous step represents the experimental data from which the magnitude of network perturbation will be determined, it is the biological network model that forms the basis for calculation and analysis. This analysis requires the initial development of detailed network models of the mechanisms and pathways that are related to a feature of the biological system. Such a framework provides a layer of mechanistic understanding beyond the examination of gene lists used in more conventional gene expression analysis. A network model of a biological system is a mathematical construct that represents a dynamic biological system and is built by assembling quantitative information about various basic properties of the biological system.
Constructing such a network may be an iterative process. Delineation of network boundaries is guided by scientific literature investigating the mechanisms and pathways associated with the feature of interest (e.g., cell proliferation in the lung). The causal relationships that describe these pathways are taken from prior knowledge to populate the network. Literature-based networks can be validated using high-throughput data sets containing the associated phenotypic endpoints. The SRP engine 110 can be used to analyze the data sets, and the results of this analysis can be used to validate, refine, or generate network models. In some implementations, the network modeling engine 112 is used to identify networks that have been generated based on SRPs. The network modeling engine 112 may include a means for receiving updates and changes to the models. The network modeling engine 112 may repeat the process of network generation by incorporating new data and generating additional or refined network models. The network modeling engine 112 may also facilitate the merging of one or more data sets or the merging of one or more networks. The set of networks taken from the database can be manually supplemented with additional nodes, edges, or entirely new networks (e.g., by mining the literature for additional genes that are directly regulated by a particular biological entity). These networks contain features that may allow process scoring to be performed. The network topology is maintained; causal relationships can be traced from any point in the network to measurable entities. Furthermore, the models are dynamic, and the assumptions used to build them can be modified or restated, allowing adaptation to different tissue contexts and species. This allows for repeated testing and improvement as new knowledge becomes available.
The network modeling engine 112 may remove nodes or edges that have low confidence or that are the subject of conflicting experimental results in the scientific literature. The network modeling engine 112 may also include additional nodes or edges that may be inferred using supervised or unsupervised learning methods (e.g., metric learning, matrix completion, pattern recognition).
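The removal of low-confidence edges can be sketched in Python as follows. This is only an illustration under stated assumptions: the text does not specify how confidence is represented, so the `confidence` field in [0, 1], the `min_confidence` threshold, the `prune` function, and the example entities are all hypothetical. Nodes left without any remaining edge are dropped as well, which is one possible design choice rather than a requirement of the disclosure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    relation: str
    confidence: float  # hypothetical score in [0, 1]

def prune(edges, min_confidence=0.5):
    """Keep edges at or above the threshold; return (nodes, kept_edges).

    Nodes that no longer participate in any kept edge are not returned.
    """
    kept = [e for e in edges if e.confidence >= min_confidence]
    nodes = {e.source for e in kept} | {e.target for e in kept}
    return nodes, kept

# Hypothetical causal edges with illustrative confidence values.
edges = [
    Edge("TF_1", "GENE_X", "increases", 0.9),
    Edge("GENE_X", "PROTEIN_X", "increases", 0.8),
    Edge("TF_1", "GENE_Y", "decreases", 0.2),  # weakly supported edge
]
nodes, kept = prune(edges, min_confidence=0.5)
```

Here the weakly supported edge to GENE_Y is removed, and GENE_Y disappears from the node set because nothing else connects to it.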
In certain aspects, a biological system is modeled as a mathematical graph consisting of vertices (or nodes) and edges connecting the nodes. For example, FIG. 4 shows a simple network 400 with 9 nodes (including nodes 402 and 404) and edges (406 and 408). Nodes can represent biological entities in biological systems such as, but not limited to, compounds, DNA, RNA, proteins, peptidoglycans, antibodies, cells, tissues, and organs. Edges in the graph can represent various relationships between nodes. For example, an edge may represent a relationship of "binding to", "used to express", "co-regulated based on an expression profile", "inhibited", "co-occurring in manuscript", or "sharing structural elements". Generally, these types of relationships describe relationships between a pair of nodes. The nodes in the graph can also represent relationships between nodes. Thus, relationships between relationships, or between a relationship and another type of biological entity represented in the graph, may be represented. For example, a relationship between two nodes representing chemicals may represent a reaction. The reaction may itself be a node in a relationship between the reaction and a chemical that inhibits the reaction.
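The idea that a relationship can itself be a node is illustrated by the following minimal Python sketch. The node names, the relation labels, and the helper `relations_on` are hypothetical and chosen only for illustration; the sketch is not a data model prescribed by the disclosure. A reaction between two chemicals is stored as its own node, so that a third chemical can be linked to the reaction by an "inhibits" edge.

```python
from collections import namedtuple

Edge = namedtuple("Edge", ["source", "relation", "target"])

nodes = {"chemical_A", "chemical_B", "chemical_C"}
edges = []

# The reaction converting A into B is represented as a node in its own right.
reaction = "reaction_A_to_B"
nodes.add(reaction)
edges.append(Edge("chemical_A", "reactant_of", reaction))
edges.append(Edge(reaction, "produces", "chemical_B"))

# Because the reaction is a node, another entity can relate to it directly.
edges.append(Edge("chemical_C", "inhibits", reaction))

def relations_on(node):
    """Return sorted (source, relation) pairs for edges pointing at node."""
    return sorted((e.source, e.relation) for e in edges if e.target == node)

incoming = relations_on(reaction)
```

Querying the reaction node returns both its reactant and its inhibitor, which would not be expressible if the reaction were stored only as an edge between chemical_A and chemical_B.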
The graph may be undirected, meaning that no direction is associated with the two vertices joined by each edge. Alternatively, the edges of the graph may point from one vertex to another. For example, in a biological context, transcriptional regulatory networks and metabolic networks can be modeled as directed graphs. In a graph model of a transcriptional regulatory network, the nodes represent genes and the edges represent transcriptional relationships between the nodes. As another example, protein-protein interaction networks describe direct physical interactions between proteins in the proteome of an organism, and there is generally no direction associated with the interactions in such networks. Thus, these networks can be modeled as undirected graphs. Some networks may have both directed edges and undirected edges. The entities and relationships (i.e., nodes and edges) that make up the graph may be stored as a network of related nodes in a database in system 100.
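A network that mixes directed and undirected edges can be sketched in Python as shown below. The class `MixedGraph`, its method names, and the example entities are assumptions made for illustration and do not describe how system 100 stores its networks. An undirected edge is stored as a pair of opposite directed edges, so a single traversal routine serves both kinds.

```python
from collections import defaultdict

class MixedGraph:
    """Toy graph holding both directed and undirected edges."""

    def __init__(self):
        self._succ = defaultdict(set)  # node -> set of successor nodes

    def add_directed(self, u, v):
        self._succ[u].add(v)
        self._succ[v]  # touch v so that it is registered as a node

    def add_undirected(self, u, v):
        # An undirected edge is traversable in both directions.
        self._succ[u].add(v)
        self._succ[v].add(u)

    def reachable_from(self, start):
        """Return every node reachable from start, including start."""
        seen, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for nxt in self._succ[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

g = MixedGraph()
g.add_directed("TF_1", "GENE_X")          # transcriptional regulation
g.add_undirected("PROTEIN_X", "PROTEIN_Y")  # physical interaction
```

With these edges, GENE_X is reachable from TF_1 but not the other way round, whereas PROTEIN_X and PROTEIN_Y are each reachable from the other.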
The knowledge represented in the database can be of various types, taken from various sources. For example, certain data may represent a genomic database, including information about genes and relationships between them. In such an example, a node may represent an oncogene and another node connected to the oncogene node may represent a gene for suppressing the oncogene. The data may represent proteins and their relationships, diseases and their interrelationships, and various disease states. There are many different types of data that can be incorporated into a graphical representation. The computational model may represent a network of relationships between nodes representing knowledge in, for example, a DNA dataset, an RNA dataset, a protein dataset, an antibody dataset, a cell dataset, a tissue dataset, an organ dataset, a medical dataset, epidemiological data, a chemical dataset, a toxicology dataset, a patient dataset, and a demographic dataset. As used herein, a data set is a collection of numerical values resulting from the evaluation of a sample (or a group of samples) under defined conditions. The data set can be obtained by, for example, experimentally measuring a quantifiable entity of the sample; or alternatively, from a service provider (e.g., laboratory, clinical research organization), or from a public or proprietary database. The data sets may contain data and biological entities represented by nodes, and the nodes in each data set may be related to other nodes in the same data set or nodes in other data sets. Moreover, the network modeling engine 112 may generate computational models for representing genetic information in a dataset, such as DNA, RNA, protein, or antibodies, as medical information in a medical dataset, as information about individual patients in a patient dataset, and as information about an entire population in an epidemiological dataset. 
In addition to the various data sets described above, there may be many other data sets, or types of biological information that may be included in generating the computational model. For example, the database can further include medical record data, structural/activity relationship data, information about infectious pathologies, information about clinical trials, exposure pattern data, data related to usage history of the product, and any other type of life sciences related information.
Network modeling engine 112 may generate one or more network models representing, for example, regulatory interactions between genes, interactions between proteins, or complex biochemical interactions within a cell or tissue. The networks generated by the network modeling engine 112 may include static and dynamic models. The network modeling engine 112 may represent the system using any applicable scheme, such as hypergraphs and weighted bipartite graphs, in which two types of nodes are used to represent reactions and compounds. The network modeling engine 112 may also use other inference techniques to generate a network model, for example, analysis of over-representation of functionally related genes among the differentially expressed genes, Bayesian network analysis, graphical Gaussian model techniques, or gene association network techniques, to identify relevant biological networks based on a set of experimental data (e.g., gene expression, metabolite concentrations, cellular responses, etc.).
As described above, the network model is based on mechanisms and pathways that underlie the functional characteristics of biological systems. The network modeling engine 112 can generate or contain a model representing an outcome regarding a characteristic of a biological system that is relevant to the study of the long-term health risks or health benefits of an agent. Thus, network modeling engine 112 may generate or contain network models of various mechanisms of cell function, particularly those mechanisms that are related to or contribute to a feature of interest in a biological system, including (but not limited to) cell proliferation, cell stress, cell regeneration, apoptosis, DNA damage/repair, or inflammatory responses. In other embodiments, the network modeling engine 112 may contain or generate computational models related to acute systemic toxicity, carcinogenicity, transdermal penetration, cardiovascular disease, pulmonary disease, ecotoxicity, ocular irritation/corrosion, genetic toxicity, immunotoxicity, neurotoxicity, pharmacokinetics, drug metabolism, organ toxicity, reproductive and developmental toxicity, skin irritation/corrosion, or skin sensitization. In general, the network modeling engine 112 may contain or generate computational models for the states of nucleic acids (DNA, RNA, SNPs, siRNAs, miRNAs, RNAi), proteins, peptidoglycans, antibodies, cells, tissues, organs, and any other biological entities and their respective interactions. In one example, computational network models can be used to represent the state of the immune system and the functioning of various types of leukocytes during an immune response or inflammatory response. In other examples, computational network models can be used to represent the performance of the cardiovascular system and the function and metabolism of endothelial cells.
In certain implementations of the present disclosure, the network is derived from a database of causal biological knowledge. The database may be generated by performing experimental studies on different biological mechanisms to extract relationships (e.g., activation or inhibition relationships) between the mechanisms, some of which may be causal relationships, and may be combined with commercially available databases, such as the Genstruct Technology Platform or the Selventa Knowledgebase, curated by Selventa Inc. of Cambridge, Massachusetts, USA. Using the database of causal biological knowledge, the network modeling engine 112 may identify a network that links the perturbation 102 with the measurables 104. In certain implementations, the network modeling engine 112 uses the systems response profiles (SRPs) from the SRP engine 110 and networks previously generated in the literature to extract causal relationships between biological entities. The database may be further processed to remove logical inconsistencies and to generate new biological knowledge by applying homologous reasoning between different sets of biological entities, among other processing steps.
In certain implementations, the network model built with the information extracted from the database is based on reverse causal reasoning (RCR), an automated reasoning technique that processes networks of causal relationships to formulate mechanistic hypotheses and then evaluates those mechanistic hypotheses against datasets of differential measurements. Each mechanistic hypothesis links a biological entity to measurable quantities that it can influence. For example, measurable quantities can include an increase or decrease in the concentration, number, or relative abundance of a biological entity, the activation or inhibition of a biological entity, or a change in the structure, function, or logic of a biological entity, among others. RCR uses a directed network of experimentally observed causal interactions between biological entities as the basis for computation. The directed network may be expressed in Biological Expression Language™ (BEL™), a syntax for recording the interrelationships between biological entities. The RCR computation specifies certain constraints for network model generation, such as, but not limited to, path length (the maximum number of edges connecting an upstream node to a downstream node) and the possible causal paths that connect an upstream node to a downstream node. The output of RCR is a set of mechanistic hypotheses representing upstream controllers of the differences in the experimental measurements, ranked by statistics that evaluate relevance and accuracy. The mechanistic hypothesis outputs can be assembled into causal chains and larger networks to interpret the dataset at a higher level of interconnected mechanisms and pathways.
One type of mechanistic hypothesis comprises a set of causal relationships that exist between a node representing a potential cause (the upstream node or controller) and nodes representing measured quantities (the downstream nodes). This type of mechanistic hypothesis can be used to make predictions: for example, if the abundance of the entity represented by the upstream node increases, then the downstream nodes linked by causal-increase relationships will be inferred to increase, and the downstream nodes linked by causal-decrease relationships will be inferred to decrease.
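The prediction rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system's implementation: the edge list, the placeholder node names (`gene_A`, `gene_B`, `gene_C`), and the function `predict_downstream` are hypothetical, and encoding causal-increase and causal-decrease relationships as +1 and -1 is a convention chosen for the example.

```python
# Hypothetical signed causal edges: (upstream, downstream, sign).
# +1 encodes a causal-increase relationship, -1 a causal-decrease one.
EDGES = [
    ("TNF", "gene_A", +1),
    ("TNF", "gene_B", -1),
    ("HIF1A", "gene_C", +1),
]


def predict_downstream(edges, upstream, upstream_change):
    """Infer the direction of change of each downstream node.

    upstream_change is +1 (increase) or -1 (decrease). The inferred
    direction of a downstream node is the product of the upstream
    change and the sign of the connecting edge.
    """
    return {
        down: upstream_change * sign
        for up, down, sign in edges
        if up == upstream
    }


# An increase in the upstream node raises gene_A and lowers gene_B.
predictions = predict_downstream(EDGES, "TNF", +1)
```

Reversing the upstream change flips every inferred direction, which is why a single signed edge list suffices for both increase and decrease hypotheses.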
The mechanistic hypothesis represents a relationship between a set of measured data (e.g., gene expression data) and a biological entity that is a known controller of those genes. In addition, these relationships include the sign (positive or negative) of the influence between the upstream entity and the differential expression of the downstream entities (e.g., downstream genes). The downstream genes of a mechanistic hypothesis can be drawn from a database of causal biological knowledge curated from the literature. In particular embodiments, the causal relationships of a mechanistic hypothesis that link the upstream entity to the downstream entities, in the form of a computable causal network model, are the basis for calculating network changes by the network response scoring methods.
In particular embodiments, a scorable complex causal network model of a biological entity can be converted into a single causal network model by gathering the individual mechanistic hypotheses that represent various features of the biological system in the model and regrouping all downstream entities (e.g., downstream genes) with connections to a single upstream entity or process, thereby representing the entire complex causal network model; this is in effect a flattening of the underlying graph structure. Changes in the features and entities of the biological system represented in the network model can thus be assessed by combining the individual mechanistic hypotheses.
In certain implementations, the system 100 may contain or generate a computerized model for cell proliferation mechanisms when the cells have been exposed to cigarette smoke. In such an example, the system 100 may also contain or generate one or more network models representing various health conditions associated with cigarette smoke exposure, including (but not limited to) cancer, lung disease, and cardiovascular disease. In certain aspects, these network models are based on at least one of applied perturbations (e.g., exposure to an agent), responses under various conditions, measurable quantities of interest, results being studied (e.g., cell proliferation, cellular stress, inflammation, DNA repair), experimental data, clinical data, epidemiological data, and literature.
As an illustrative example, the network modeling engine 112 may be configured to generate a network model of cellular stress. The network modeling engine 112 may receive networks describing the relevant mechanisms involved in stress responses known from the literature database. The network modeling engine 112 may select one or more networks based on biological mechanisms known to operate in response to stress in pulmonary and cardiovascular contexts. In some implementations, the network modeling engine 112 identifies one or more functional units in the biological system and builds a larger network model by combining smaller networks based on their functionality. In particular, for cellular stress models, the network modeling engine 112 can consider functional units associated with responses to oxidative stress, genotoxic stress, hypoxic stress, osmotic stress, xenobiotic stress, and shear stress. Thus, the network components used in cellular stress models can include xenobiotic metabolism response, genotoxic stress, endothelial shear stress, hypoxic response, osmotic stress, and oxidative stress. The network modeling engine 112 may also receive content from computational analysis of publicly available transcriptomic data from stress-related experiments performed in specific cell populations.
When generating a network model of a biological mechanism, the network modeling engine 112 may apply one or more rules. Such rules may include rules for selecting network content, node types, and the like. The network modeling engine 112 may select one or more data sets from the experimental data database 106, including a combination of in vitro and in vivo experimental results. The network modeling engine 112 may use the experimental data to verify nodes and edges identified in the literature. In the example of modeling cellular stress, the network modeling engine 112 may select data sets based on how well each experiment represents physiologically relevant stress in non-diseased lung or cardiovascular tissue. The selection of a data set may be based on, for example, the availability of phenotypic stress endpoint data, the statistical rigor of the gene expression profiling experiments, and the relevance of the experimental context to normal, non-diseased lung or cardiovascular biology.
After identifying the collection of related networks, the network modeling engine 112 may also process and refine those networks. For example, in some implementations, multiple biological entities and their connections may be grouped and represented by one or more new nodes (e.g., using clustering or other techniques).
The network modeling engine 112 may also include descriptive information about the identified nodes and edges in the network. As described above, a node may be described by its associated biological entity, an indication of whether the associated biological entity is a measurable quantity, or any other descriptor of the biological entity, while an edge may be described by, for example, the type of relationship it represents (e.g., causal relationship (e.g., up or down), relevance, conditional correlation, or independence), the strength of the relationship, or the statistical confidence in the relationship. In some implementations, for each process, each node representing a measurable entity is associated with an expected direction of activity change (i.e., increase or decrease) in response to the process. For example, when bronchial epithelial cells are exposed to an agent such as Tumor Necrosis Factor (TNF), the activity of a particular gene may increase. This increase may occur due to direct regulatory relationships known from the literature (and represented by one network identified by the network modeling engine 112), or by tracking numerous regulatory relationships (e.g., autocrine signaling) via edges of one or more networks identified by the network modeling engine 112. In some cases, network modeling engine 112 may identify the expected direction in which each measurable entity changes in response to a particular disturbance. When different paths in the network indicate opposite expected directions of change for a particular entity, the two paths may be examined in more detail to determine the net direction of change, or measurements for that particular entity may be discarded. The computational network model may be generated by the system 100, imported into the system 100, or identified within the system 100 (e.g., from a database of biological knowledge).
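One way to picture the handling of conflicting paths is to give each path a net sign and to discard an entity whose paths disagree. The sketch below assumes a sign-product convention (the net direction of a path is the product of its +1/-1 edge signs), which is a common convention for signed graphs but is not stated in the text; `expected_direction` is a hypothetical helper, and discarding is only one of the two options mentioned above.

```python
from math import prod


def expected_direction(paths):
    """Return the net expected direction for one measurable entity.

    paths is a list of paths from the perturbation to the entity, each
    given as a list of edge signs (+1 for increase, -1 for decrease).
    Returns +1 or -1 when every path agrees, and None when the paths
    conflict (or when no path exists), signalling that the measurement
    may be discarded or examined in more detail.
    """
    directions = {prod(signs) for signs in paths}
    return directions.pop() if len(directions) == 1 else None


# Two paths that agree: (+1)(-1) = -1 and a direct -1 edge.
agreeing = expected_direction([[+1, -1], [-1]])
# Two paths that disagree: a direct +1 edge and a direct -1 edge.
conflicting = expected_direction([[+1], [-1]])
```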
Returning to FIG. 2, at step 216, the network scoring engine 114 generates network response scores for each perturbation using the networks identified by the network modeling engine 112 at step 214 and the data, in the form of SRPs, generated by the SRP engine 110 at step 212. A network response score quantifies the biological response to a treatment (represented by the SRP) in the context of the underlying relationships between the biological entities (represented by the identified network). These network response scores may numerically or graphically represent the effect of perturbing a biological system, for example, by exposure to potentially harmful agents. By providing a measure of the network's response to the treatment, these network response scores may allow molecular events (as measured by experimental data) to be correlated with phenotypes that characterize the network at the cellular, tissue, or organ level. The network scoring engine 114 may include hardware and software components for generating a network response score for each network included within or identified by the network modeling engine 112.
The network scoring engine 114 may be configured to implement the techniques described above, such as strength scoring techniques, to generate a scalar-valued score that represents the overall strength of the network's response to a treatment. The strength score is the mean of the activity observations for the different entities represented in the SRP. In some embodiments, the strength of the network response is calculated as follows:

$$\mathrm{Strength} = \frac{1}{\mathrm{NumMeasNodes}} \sum_{i} d_i \, \beta_i$$

wherein $d_i$ represents the expected direction of activity change of the entity associated with node $i$, $\beta_i$ represents the logarithm of the fold change of the activity between the treatment and control conditions (the fold change being a number that describes how much a quantity changes from an initial value to a final value), and NumMeasNodes is the number of nodes with associated measured biological entities. A positive strength score indicates that the SRP matches the expected activity changes derived from the identified network, while a negative strength score indicates that the SRP does not match the expected activity changes.
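The strength score described above, a mean of direction-signed log fold changes over the measured nodes, can be sketched as follows. The function name `strength_score` and the list-based inputs are hypothetical choices for illustration; the sketch assumes expected directions are encoded as +1 or -1 and that one log fold change is supplied per measured node.

```python
def strength_score(directions, log_fold_changes):
    """Mean of d_i * beta_i over the measured nodes.

    directions: expected direction of change per node (+1 or -1).
    log_fold_changes: log fold change per node (treatment vs. control).
    """
    if len(directions) != len(log_fold_changes):
        raise ValueError("one direction is required per fold change")
    if not directions:
        raise ValueError("at least one measured node is required")
    signed = [d * b for d, b in zip(directions, log_fold_changes)]
    return sum(signed) / len(signed)


# Observations agree with the expected directions: positive score.
consistent = strength_score([+1, -1, +1], [2.0, -1.0, 0.0])
# Observations oppose the expected directions: negative score.
inconsistent = strength_score([+1, +1], [-1.0, -3.0])
```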
In addition to or instead of the scalar-valued network scores described above, the network scoring engine 114 may generate vector-valued scores. Examples of methods for calculating values representative of network responses, for example the Geometric Perturbation Index (GPI), the Probabilistic Perturbation Index (PPI), and the Expected Perturbation Index (EPI), are described in U.S. Provisional Patent Application No. 61/525,700, filed August 19, 2011, which is incorporated herein by reference. One vector-valued score is a vector of the fold change or absolute change in activity of each measured node. As described above, the fold change is a number that describes how much a measurable changes from an initial value to a final value under different conditions (e.g., between perturbation and control conditions). In some implementations, Geometric Perturbation Index (GPI) values are used in the methods of the present disclosure. A weight vector $r$ is also included in calculating the GPI, where each component $r_i$ of the weight vector $r$ represents the weight to be assigned to the $i$-th observed fold change $\beta_i$. In certain implementations, the weight represents the known biological significance of the $i$-th measured entity with respect to a feature or outcome of interest (e.g., a known carcinogen in cancer research).
One value that can advantageously be used for weighting is the local false non-discovery rate $\mathrm{fndr}_i$ (i.e., the probability that the fold change value $\beta_i$ represents a departure from the underlying null hypothesis of zero fold change, in some cases conditional on the observed p-value), as described by Strimmer et al. in "A general modular framework for gene set enrichment analysis" (BMC Bioinformatics 10:47, 2009) and by Strimmer in "A unified approach to false discovery rate estimation" (BMC Bioinformatics 9:303, 2008), both of which are incorporated herein by reference in their entireties. The GPI calculation also uses a direction vector $d$, where each component $d_i$ indicates the expected direction of change of the $i$-th measured biological entity (e.g., +1 for increased activity and -1 for decreased activity). In some implementations, the combination is an arithmetic combination, in which the scaled fold changes $r_i \beta_i$ are each multiplied by their corresponding expected direction of change $d_i$ and the results are summed over all $N$ biological entities. Mathematically, this implementation can be represented by:

$$\mathrm{GPI} = \sum_{i=1}^{N} d_i \, r_i \, \beta_i$$
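The arithmetic combination described above (each weighted fold change multiplied by its expected direction, then summed over all N entities) can be sketched as below. The function name `gpi_arithmetic` is hypothetical, the weights are taken as given rather than estimated, and no normalization is applied in this sketch.

```python
def gpi_arithmetic(directions, weights, log_fold_changes):
    """Sum of d_i * r_i * beta_i over all N measured entities.

    directions: expected direction per entity (+1 or -1).
    weights: weight r_i per entity (e.g., a local false non-discovery
        rate, supplied by the caller).
    log_fold_changes: observed log fold change beta_i per entity.
    """
    if not (len(directions) == len(weights) == len(log_fold_changes)):
        raise ValueError("all three vectors must have the same length")
    return sum(
        d * r * b
        for d, r, b in zip(directions, weights, log_fold_changes)
    )


# Two entities: 1 * 0.5 * 2.0 + (-1) * 1.0 * (-3.0) = 4.0
score = gpi_arithmetic([+1, -1], [0.5, 1.0], [2.0, -3.0])
```

Setting every weight to 1 and dividing by N recovers the strength score, which shows how the weights let well-supported measurements dominate the sum.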
In other implementations, the vectors $d$, $r$, and $\beta$ may be combined in any linear or non-linear manner. This combination is normalized by multiplying by a predetermined scaling factor. One such scaling factor is the square root of the number of biological entities $N$. In this implementation, the GPI score can be represented by the following equation:
In certain embodiments, a Probabilistic Perturbation Index (PPI) value is used in the methods of the present disclosure. The PPI is computed by combining a positive activation metric $\mathrm{PPI}^+$ and a negative activation metric $\mathrm{PPI}^-$, for example according to the following:
To calculate the PPI, the fold change vector $\beta$ is assembled and a fold-change density is generated whose range approximates the set of values that a fold change can take in the biological system under the treatment conditions; this range can be approximated by $[-W, W]$, where $W$ is the theoretical expected maximum absolute value of the log2 fold change. The positive activation metric represents the degree to which the SRP indicates that the observed activation or inhibition of the biological entities is consistent with the expected directions of change represented by $d_i$. Network behavior that is consistent with the SRP is referred to herein as "positive activation." One positive activation metric that may be used is the probability that one or more networks are positively activated, denoted $\mathrm{PPI}^+$, which may be calculated according to the following expression:
wherein:
wherein $\mathrm{fndr}_i$ is the false non-discovery rate discussed above. An approximation of the positive activation metric $\mathrm{PPI}^+$ is calculated as follows:
Network behavior that is inconsistent with the SRP is referred to herein as "negative activation." One negative activation metric that may be used is the probability that one or more networks are negatively activated. Such a probability (denoted $\mathrm{PPI}^-$) can be calculated according to the following expression:
wherein
An approximation of the negative activation metric $\mathrm{PPI}^-$ can be calculated according to the following:
Another method of calculating a value representing a network response is the Expected Perturbation Index (EPI) scoring technique. When each SRP represents the activity (or change in activity) of measured biological entities under the treatment conditions, each SRP is associated with a number of activity measurements, one for each measured biological entity. If each fold change $\beta_i$ is drawn from a distribution $p(\cdot)$, then the expected value of the distribution is
Because the true theoretical distribution $p(\cdot)$ is not readily known, the EPI value can be approximated by using the observed activities to generate a fold-change density. If each fold change $\beta_i$ is drawn from the distribution $p(\cdot)$, then the distribution $p(\cdot)$ can be approximated by:
In some implementations, the network scoring engine 114 applies a computational interpolation technique (e.g., a linear or non-linear interpolation technique) to generate an approximately continuous distribution from the distribution given above, and then computes the expected value of that distribution. In other implementations, the network scoring engine 114 is configured to use the discrete distribution as a rectangular approximation of a continuous distribution, and calculates the EPI as follows:
In this formula, the subscript $(\cdot)$ denotes values taken in order from the smallest fold change to the largest, $n^+$ is the number of entities whose activity is expected to increase in response to the treatment ($d_i \beta_i \ge 0$), and $n^-$ is the number of entities whose activity is expected to decrease in response to the treatment ($d_i \beta_i \le 0$). In an EPI score, larger fold changes are weighted more heavily than smaller fold changes, providing a measure of activity with high specificity.
In particular embodiments, for each perturbation (e.g., exposure to a known or unknown agent), network scoring engine 114 may generate a plurality of network response scores that constitute the set of scores for the respective perturbation or treatment. For example, the network scoring engine 114 may generate a network response score for a particular network, a particular agent dosage, and a particular exposure time. The set of all of these network response scores is sent to the aggregation engine 116.
At step 218, the aggregation engine 116 generates a biological impact factor (BIF) based on the plurality of network response scores generated by the network scoring engine 114 at step 216. The aggregation engine 116 may also use other supplemental information, which may be derived from one or more networks, to generate the BIF. In particular embodiments, aggregation engine 116 may generate BIFs directly from SRPs corresponding to different biological networks. In particular embodiments, BIF values may be used to compare the predicted biological outcomes of exposures to different treatments, where the different outcomes may be caused by different mechanisms resulting from the respective treatment conditions. In particular embodiments, the BIF may be regarded as an aggregate measure of the effects of a perturbation on multiple biological networks that may contribute to disease onset or a biological outcome. A number of graph-theoretic computational techniques have been developed for generating a BIF, any of which may be performed by the aggregation engine 116; examples of these techniques are discussed below. In particular embodiments, the score is a vector-valued score. In particular embodiments, the score is not a scalar-valued score. In particular embodiments, the one or more biological impact factors are determined by a linear combination, linear transformation, or quadratic form of the aggregated scores of the first and second sets of scores. The M computational models provided by the network modeling engine 112 are denoted Net-1, Net-2, ..., Net-M, wherein M is greater than or equal to 1.
To generate the BIF, the aggregation engine 116 may use a graph statistics technique that utilizes statistics or features of some or all of the network models, such as the full network structure, the number of nodes, the number of edges, the weights of the nodes or edges (if weighted), any other characteristics of the nodes or edges (e.g., statistical confidence associated with the measurements of the biological entities and the relationships represented by the nodes and edges, respectively), any nodes or edges that are duplicated in different network models, confidence in the structure of the network model (e.g., measurements of how the network structure is consistently replicated in the literature), or any other data represented by the network model provided by the network modeling engine 112. Some of this data may be obtained from calculations performed by the SRP engine 110 (e.g., statistical confidence estimates for the measurements) and may be communicated to the aggregation engine 116 via the network modeling engine 112 or directly from the SRP engine 110 to the aggregation engine 116.
For each treatment and each network model Net-i, aggregation engine 116 also receives one or more vectors of network response scores, $S_i$, from network scoring engine 114. As described above, $S_i$ can include one or more scalar-valued scores that represent the overall strength of the response of Net-i to an agent perturbation; $S_i$ can also include one or more vector-valued scores that represent the topological distribution of the response of Net-i to the agent perturbation. The network response score vectors $S_i$ and $S_j$ associated with different network models Net-i and Net-j, respectively, need not have the same dimension, nor need they be based on the same network response score generation techniques.
In particular embodiments, aggregation engine 116 generates Biological Impact Factors (BIFs) using data from network modeling engine 112 and network response scores from network scoring engine 114. Fig. 5 shows four modules that may be included in the aggregation engine 116: a filtering module 510, a network weighting module 512, an aggregation module 514, and a relative scoring module 516. One or more of these modules 510, 512, 514, 516 may be implemented on at least one of hardware and software, as described with reference to fig. 11 and 12.
The aggregation engine 116 may be configured to perform the illustrative graph-theoretic process 600 depicted in FIG. 6. The steps of process 600 will now be described as being performed by modules 510-516 (FIG. 5) of aggregation engine 116, but it will be understood that these steps may be performed in any suitable order and divided among one or more processing components.
At step 602, the aggregation engine 116 receives information about the computational network models from the network modeling engine 112 and network response scores from the network scoring engine 114. At step 604, the filtering module 510 filters the score vectors $S_1, S_2, \ldots, S_M$. In some implementations, the filtering operation performed at step 604 includes normalizing one or more components of one or more of the score vectors. For example, if the first component of each score vector is a scalar-valued score representing the overall strength of the response of the associated network model, then these first components may be normalized by an appropriate value such that the scores all fall within a desired range. One choice of a suitable normalization value is the maximum of the first components across all of the score vectors; if all of the first component values are non-negative, then dividing each first component by this maximum places the first components within the range [0, 1]. In some embodiments, the filtering operation performed at step 604 includes removing outliers. A component of a score vector may be considered an outlier when its value lies more than a specified amount (e.g., a certain number of standard deviations) from a specified value (e.g., the mean, median, or mode). The specified amount and value may be known a priori, or may be calculated from a combination of the network response score vectors $S_1, S_2, \ldots, S_M$.
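The two filtering operations described above, max-normalization of the first components and outlier flagging by distance from the mean, can be sketched as follows. The function names, the default threshold of two standard deviations, and the choice of the sample mean as the reference value are assumptions made for illustration.

```python
from statistics import mean, stdev


def normalize_first_components(score_vectors):
    """Divide each vector's first component by the largest first component.

    Assumes all first components are non-negative, so the normalized
    values fall within [0, 1]. Other components are left unchanged.
    """
    largest = max(vector[0] for vector in score_vectors)
    if largest <= 0:
        raise ValueError("need a positive maximum first component")
    return [[vector[0] / largest] + list(vector[1:]) for vector in score_vectors]


def flag_outliers(values, num_sd=2.0):
    """Flag values lying more than num_sd standard deviations from the mean."""
    center = mean(values)
    spread = stdev(values)  # sample standard deviation; needs >= 2 values
    return [abs(value - center) > num_sd * spread for value in values]


normalized = normalize_first_components([[2.0, 5.0], [4.0, 1.0]])
flags = flag_outliers([1.0] * 9 + [10.0])
```

Flagged components can then be dropped or examined before aggregation; a median-based reference would be less sensitive to the outliers themselves.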
In some embodiments, the filtering operation performed at step 604 includes a geometric graph technique. One such technique decomposes one or more portions of the score vectors $S_1, S_2, \ldots, S_M$. For illustrative purposes, decomposition of the entire vector $S_i$ is discussed in the following description, but the decomposition may be performed on only certain components of the score vectors. In the decomposition, the vector $S_i$ is written as a combination of two or more vectors. FIG. 7 shows the decomposition of vector 702 into two components 708 and 710. As is known in the art, if $S_i$ has dimension $p$, $S_i$ can be written as a linear combination of $p$ different basis vectors that span the $p$-dimensional vector space in which $S_i$ is embedded, expressed mathematically as:

$$S_i = a_1 v_1 + \cdots + a_p v_p$$

wherein $\{v_1, \ldots, v_p\}$ is a spanning set of vectors and $a_1, \ldots, a_p$ are the corresponding scalar coefficients. The vector $a_1 v_1$ is referred to as the projection of $S_i$ onto $v_1$. In FIG. 7, vectors 704 and 706 are basis vectors, and the projections of vector 702 onto these basis vectors are vectors 708 and 710, respectively. Without loss of generality, $\{v_1, \ldots, v_p\}$ is assumed to be an orthonormal basis. The value of each scalar coefficient can then be calculated as the inner product between $S_i$ and the corresponding basis vector.
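The decomposition described above can be sketched in plain Python: each coefficient is the inner product of the score vector with a basis vector, and each projection is that coefficient times the basis vector. The helper names `dot`, `decompose`, and `recombine` are hypothetical, and the sketch assumes the basis vectors are orthonormal (unit length and mutually orthogonal).

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def decompose(score, basis):
    """Return the projections a_j * v_j of score onto each basis vector.

    Assumes basis is orthonormal, so a_j is simply dot(score, v_j).
    """
    return [[dot(score, v) * component for component in v] for v in basis]


def recombine(projections):
    """Add the projections component-wise to rebuild a score vector."""
    return [sum(components) for components in zip(*projections)]


score = [3.0, 4.0]
basis = [[1.0, 0.0], [0.0, 1.0]]
projections = decompose(score, basis)  # one projection per basis vector
rebuilt = recombine(projections)       # summing all projections restores score
```

Dropping or rescaling entries of `projections` before calling `recombine` gives the filtered score vectors discussed in the surrounding text.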
The aggregation engine 116 may be configured to select (or be preprogrammed with) any of a large number of sets of basis vectors $v_1, \ldots, v_p$. In some embodiments, the basis vectors are determined using the structure of the network model Net-i, for example, using a spectral graph computation technique. Typically, spectral techniques use information derived from an eigen-analysis of a matrix representing the network model. In one particular spectral technique, the basis vectors $v_1, \ldots, v_p$ can be the eigenvectors of the combinatorial Laplacian matrix associated with the network model Net-i. If Net-i represents an undirected network with $n_i$ nodes, then the combinatorial Laplacian is computed as:

$$L = D - A$$

wherein $D$ is the $n_i \times n_i$ diagonal matrix with the degree of each node of Net-i on the diagonal, and $A$ is the $n_i \times n_i$ node-node adjacency matrix of Net-i. Other matrices whose eigenvectors may provide a suitable basis for the decomposition at step 604 include node-node adjacency matrices, node-edge adjacency matrices, normalized Laplacian matrices, Gram matrices, or any other matrix representing the structure of Net-i.
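As a small illustration of the combinatorial Laplacian, the sketch below builds L = D - A for an undirected network given as a list of node-index pairs. The function name and the edge-list input format are assumptions for the example, and the sketch assumes a simple graph with no self-loops or repeated edges. Computing the eigenvectors of the result would require a numerical linear algebra routine, which is omitted here.

```python
def combinatorial_laplacian(num_nodes, edges):
    """Build L = D - A for an undirected simple graph.

    num_nodes: number of nodes, indexed 0 .. num_nodes - 1.
    edges: iterable of (i, j) node-index pairs, each listed once.
    """
    adjacency = [[0] * num_nodes for _ in range(num_nodes)]
    for i, j in edges:
        adjacency[i][j] = 1
        adjacency[j][i] = 1  # undirected: the matrix is symmetric
    degrees = [sum(row) for row in adjacency]
    return [
        [(degrees[i] if i == j else 0) - adjacency[i][j] for j in range(num_nodes)]
        for i in range(num_nodes)
    ]


# A three-node path 0 - 1 - 2.
laplacian = combinatorial_laplacian(3, [(0, 1), (1, 2)])
```

Every row of the result sums to zero, a quick sanity check that the degree on the diagonal matches the number of neighbors.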
Thus, in one embodiment, each score within the first set of scores and the second set of scores comprises a score vector, and the step of generating the biological impact factor further comprises filtering, at the processor, the first score and the second score to decompose each of them into a plurality of projections onto a set of basis vectors. The filtering may further include removing at least one of the plurality of projections from at least one of the decomposed first and second scores. The set of basis vectors may include eigenvectors of a matrix representing the at least one model. In some embodiments, the value of S_i may be adjusted using, for example, geometric filtering techniques or geometric computation techniques. In a particular aspect, using the geometric filtering technique includes adjusting a graphical representation of one or more network models, such as a vector, mesh, or higher-dimensional representation. Two such examples are shown in FIGS. 8A and 8B, respectively. In the first example, the projections of S_i onto some of the basis vectors can be subtracted from S_i ("reducing the dimension of S_i"). This is shown in FIG. 8A: vector 702 is decomposed into vectors 708 and 710, and the filtering module 510 removes vector 708 from vector 702, retaining vector 806. The removed projections may be those with the smallest magnitude (e.g., length). When the basis vectors are generated as eigenvectors of a particular matrix, the removed projections may be those associated with the eigenvectors whose eigenvalues have the smallest magnitude. A fixed number of projections may be removed or retained. Instead of, or in addition to, reducing the dimension of S_i, each projection of S_i can be scaled separately, and the scaled projections are then added together to form a new score vector S_i. This is shown in FIG. 8B: vector 702 is decomposed into vectors 708 and 710, and the filtering module 510 scales vector 708 to form a new vector 812 and scales vector 710 to form a new vector 814. The scaling factor for each projection may be selected in a number of ways, including empirical observation or mathematical modeling based on the relative importance (significance) of each projection. In some embodiments, spectral information is used. For example, when the basis vectors are generated as eigenvectors of a particular matrix, the scaling factor for each projection may be based on the eigenvalue associated with the corresponding eigenvector. For example, the scaling factor applied to the projection of S_i onto eigenvector v_j may be given by the following equation:
where λ_j is the eigenvalue associated with eigenvector v_j. The parameter t is tunable such that a larger value results in a smaller scaled projection.
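The decomposition, projection removal, and scaling described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the scaling equation is not reproduced in this text, so the factor exp(−t·λ_j) is assumed here because it agrees with the statement that a larger t gives a smaller scaled projection (for positive eigenvalues) and with the later reference to exponential scaling. The function names, the two-node network, and the score vector are hypothetical.

```python
import math

def decompose(score, basis):
    """Coefficients of `score` along each vector of an orthonormal basis."""
    return [sum(s * b for s, b in zip(score, vec)) for vec in basis]

def recombine(coeffs, basis):
    """Sum of the coefficient-weighted basis vectors."""
    n = len(basis[0])
    return [sum(c * vec[k] for c, vec in zip(coeffs, basis)) for k in range(n)]

def drop_smallest(coeffs, keep):
    """Zero all but the `keep` projections of largest magnitude."""
    ranked = sorted(range(len(coeffs)), key=lambda j: abs(coeffs[j]), reverse=True)
    kept = set(ranked[:keep])
    return [c if j in kept else 0.0 for j, c in enumerate(coeffs)]

def exp_scale(coeffs, eigenvalues, t):
    """Scale projection j by exp(-t * lambda_j) (ASSUMED form of the scaling)."""
    return [c * math.exp(-t * lam) for c, lam in zip(coeffs, eigenvalues)]

# Hypothetical two-node network with one edge: Laplacian [[1, -1], [-1, 1]],
# whose orthonormal eigenvectors and eigenvalues are known in closed form.
r = 1 / math.sqrt(2)
basis = [[r, r], [r, -r]]
eigenvalues = [0.0, 2.0]
S = [3.0, 1.0]  # hypothetical network response score vector

coeffs = decompose(S, basis)
reduced = recombine(drop_smallest(coeffs, keep=1), basis)          # dimension reduction
scaled = recombine(exp_scale(coeffs, eigenvalues, t=0.5), basis)   # per-projection scaling
```

Recombining the unmodified coefficients returns the original score vector, so either filter can be read as a controlled departure from the identity.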
Returning to FIG. 6, at step 604, the network weighting module 512 may weight the network response score vectors S_1, S_2, ..., S_M associated with each of the M computational models. Generating the first set of scores and the second set of scores may include: assigning, at the processor, a weight to each of the first set of scores and the second set of scores based on the respective computational network model and at least one of the first set of data and the second set of data; aggregating the weighted scores in the first set of scores; and aggregating the weighted scores in the second set of scores; wherein the one or more biological impact factors are a function of the aggregated scores of the first set of scores and the second set of scores. Assigning weights to each of the first set of scores and the second set of scores includes selecting weights for each of the plurality of computational models to maximize a difference between the scores within the first set of scores and the scores within the second set of scores. Such weighting may be based on the data received at step 210 (FIG. 2) and on the corresponding network model. In some embodiments, step 604 includes a graph optimization computation technique. In one such embodiment, the weight associated with each score vector is selected to maximize the difference between the score vectors obtained under treatment conditions that represent relatively "weak" perturbations of the biological system and those obtained under treatment conditions that represent relatively "strong" perturbations of the biological system. FIG. 9 shows an example, which does not limit the scope of the invention. FIG. 9 depicts a treatment condition graph 900 in which a biological system is exposed to a toxic agent for three different exposure times: short 902, medium 904, and long 906. For each exposure time, the SRP engine 110 collects data representing the measured activity of a set of biological entities.
The network modeling engine 112 identifies three different networks, Net-1 908, Net-2 910, and Net-3 912, relevant to the toxic agent and the biological system (including the measured biological entities), and the network scoring engine 114 calculates a scalar network response score for each of the three networks at each of the three exposure times. The network weighting module 512 then selects a set of weights c_1, c_2, and c_3 for the three networks Net-1 908, Net-2 910, and Net-3 912 such that the difference between the weighted sum of the short-exposure network response scores and the weighted sum of the long-exposure network response scores is maximized, with the same weights used in both sums. The weights c_1, c_2, and c_3 may be constrained in some manner (e.g., c_1, c_2, and c_3 must be non-negative and sum to 1). In other words, the network weighting module 512 solves the following optimization problem (using a known computational optimization method):
subject to c_1, c_2, c_3 ≥ 0
and c_1 + c_2 + c_3 = 1
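By way of illustration, the weight selection can be sketched in Python as a brute-force search over the allowed weights. Several things here are assumptions rather than part of the disclosure: the objective is taken to be the absolute difference between the two weighted sums (the objective itself is not reproduced in this text), a coarse grid search stands in for whatever known optimization method is used, and the score values and function names are hypothetical.

```python
from itertools import product

def weighted_gap(weights, short_scores, long_scores):
    """Absolute difference between the weighted long- and short-exposure sums."""
    return abs(sum(c * (l - s) for c, s, l in zip(weights, short_scores, long_scores)))

def best_weights(short_scores, long_scores, steps=20):
    """Grid search over non-negative weights that sum to 1."""
    m = len(short_scores)
    best, best_gap = None, float("-inf")
    for combo in product(range(steps + 1), repeat=m - 1):
        if sum(combo) > steps:
            continue  # the last weight would be negative
        weights = [k / steps for k in combo] + [(steps - sum(combo)) / steps]
        gap = weighted_gap(weights, short_scores, long_scores)
        if gap > best_gap:
            best, best_gap = weights, gap
    return best, best_gap

# Hypothetical scalar network response scores for Net-1, Net-2, and Net-3.
short = [0.2, 0.5, 0.1]
long_ = [0.9, 0.7, 0.3]
weights, gap = best_weights(short, long_)
```

With this assumed objective and only the two constraints shown, the maximum lies at a corner of the feasible region, so all of the weight goes to the single network with the largest short-versus-long difference. Additional constraints or a different objective would spread the weight across networks.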
After the network response score vectors are weighted at step 604, the aggregation module 514 combines the network response score vectors for each treatment condition separately at step 606. These vectors may have been filtered by the filtering module 510, weighted by the network weighting module 512, both, or neither. In some embodiments, step 606 includes concatenating all network response score vectors for a particular treatment condition into a single vector. Let ASV-i represent the aggregated score vector for treatment condition i.
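A minimal Python sketch of aggregation by concatenation follows. The function name `aggregate`, the score vectors, and the weights are hypothetical; applying the weights during concatenation is one possible arrangement and is an assumption of this sketch.

```python
def aggregate(score_vectors, weights=None):
    """Concatenate (optionally weighted) network response score vectors into one ASV."""
    if weights is None:
        weights = [1.0] * len(score_vectors)
    asv = []
    for c, vec in zip(weights, score_vectors):
        asv.extend(c * x for x in vec)  # scale this network's scores, then append
    return asv

# Hypothetical score vectors for three networks under one treatment condition.
asv_1 = aggregate([[1.0, 2.0], [0.5], [4.0, 0.0, 2.0]], weights=[0.5, 0.25, 0.25])
```

The resulting vector has one entry per scored node across all networks, so two treatment conditions scored on the same networks yield vectors of equal length that can be compared directly.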
Thereafter, steps 602 through 606 are repeated for a second treatment condition (represented in FIG. 6 as steps 608 through 610). These steps may be repeated for as many additional treatment conditions as are of interest, but, as disclosed herein, in some embodiments only two treatment conditions are investigated. One of these treatment conditions may include exposure to an agent whose long-term biological effects are reasonably well understood (such as smoke from a standard tobacco cigarette), while the second treatment condition may include exposure to an agent whose long-term biological effects are not well understood (such as an aerosol or vapor from a tobacco-related article). Regardless of how many treatment conditions are studied, at the conclusion of step 606, an aggregated score vector ASV-i has been generated for each treatment condition i.
At step 608, the relative scoring module 516 generates a BIF from the aggregated score vectors. In some implementations, the relative scoring module 516 compares these aggregated score vectors to one another to generate one or more BIFs. As discussed above, a BIF may indicate which biological pathways are similarly activated by different perturbations, which may allow predictions about the long-term effect of one perturbation to be made based on the long-term effects of other perturbations. A number of advantages and uses of the BIF are discussed herein. The relative scoring module 516 may generate a BIF from the set of ASVs in a number of ways. In some embodiments, step 608 includes a geometric technique. For example, the BIF may be generated by calculating an inner product between two ASVs and using the angle associated with the inner product as the BIF measurement. In this embodiment, a smaller BIF value indicates greater agreement between the biological mechanisms activated by the two treatment conditions, suggesting similarity in the long-term outcomes that depend on these mechanisms. Any number of kernels may be used for the inner product calculation, including identity matrices or diagonal matrices with various scaling factors in the diagonal entries. Some such embodiments use spectral information. For example, the relative scoring module 516 may use a block diagonal matrix kernel for the inner product calculation, where the ith block is calculated according to the following formula:
where v_j is the jth eigenvector of the Laplacian matrix of Net-i, and λ_j is the associated jth eigenvalue. Using this kernel to calculate the inner product between the raw score vectors S_1, S_2, ..., S_M is an alternative way for the aggregation engine 116 to implement the eigenvector decomposition and exponential scaling techniques described above.
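The angle-based comparison can be sketched in Python as follows. The kernel formula is not reproduced in this text, so the block used here, the sum over j of exp(−t·λ_j)·v_j·v_jᵀ, is an assumption chosen to be consistent with the eigenvector decomposition and exponential scaling mentioned above. The function names, the two-node network, and the vectors are hypothetical, and a single block is shown; with several networks the blocks would sit along the diagonal of a larger matrix.

```python
import math

def heat_block(eigenvalues, eigenvectors, t):
    """ASSUMED kernel block: sum_j exp(-t * lambda_j) * v_j v_j^T."""
    n = len(eigenvectors[0])
    K = [[0.0] * n for _ in range(n)]
    for lam, v in zip(eigenvalues, eigenvectors):
        w = math.exp(-t * lam)
        for a in range(n):
            for b in range(n):
                K[a][b] += w * v[a] * v[b]
    return K

def inner(x, y, K):
    """Kernel inner product x^T K y."""
    return sum(x[a] * K[a][b] * y[b] for a in range(len(x)) for b in range(len(y)))

def angle_bif(x, y, K):
    """Angle in radians between two aggregated score vectors under kernel K."""
    cos = inner(x, y, K) / math.sqrt(inner(x, x, K) * inner(y, y, K))
    return math.acos(max(-1.0, min(1.0, cos)))  # clamp against rounding error

identity = [[1.0, 0.0], [0.0, 1.0]]
r = 1 / math.sqrt(2)
K = heat_block([0.0, 2.0], [[r, r], [r, -r]], t=0.5)  # two-node, one-edge network

same = angle_bif([1.0, 2.0], [2.0, 4.0], identity)        # parallel vectors
orthogonal = angle_bif([1.0, 0.0], [0.0, 1.0], identity)  # no shared activation
smoothed = angle_bif([1.0, 0.0], [0.0, 1.0], K)           # same vectors, network-aware kernel
```

Under the identity kernel, scores on two different nodes look unrelated. Under the assumed spectral kernel, scores on neighboring nodes are treated as partially aligned, so the angle shrinks.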
In some embodiments, each ASV is used to define a surface (possibly with multiple dimensions), and a BIF is generated by comparing these surfaces. Thus, generating the biological impact factor may comprise determining a distance between at least one first surface, defined by at least one first vector representing the aggregated score of the at least one first set of scores, and at least one second surface, defined by at least one second vector representing the aggregated score of the at least one second set of scores. Such implementations may include, among other things, geometric and optimization techniques. Such a method is illustrated by a simple example in FIG. 10, which is a diagram 1000 depicting a surface 1002 corresponding to a first treatment condition and a surface 1004 corresponding to a second treatment condition. These surfaces are defined over a dose-exposure time space (dose axis 1008 and time axis 1010), and the height of each surface at a particular dose and exposure time is equal to the value of the scalar network response score 1006 (or a scalar-valued aggregate of a vector-valued score or of multiple different scores). The BIF may be generated by a surface comparison performed in any of a number of ways. In some embodiments, the relative scoring module 516 identifies the dose and time at which the two surfaces are closest to each other. The difference in the network response scores at this point (i.e., the difference in the heights of the surfaces) represents the condition under which the biological mechanisms activated by one perturbation are closest to the biological mechanisms activated by the second perturbation at the same dose and exposure time.
In one example, when the first perturbation is exposure to a known toxic substance and the second perturbation is exposure to an unknown substance, the minimum distance comparison represents a "worst case" in which the biological response to the unknown substance may be similar to the biological response to the known toxin. This worst case may be important for research and public health purposes. In some embodiments, the relative scoring module 516 identifies the dose and time at which the two surfaces are furthest apart from each other. Such an embodiment may be useful when studying the beneficial properties of a drug or therapy, as the point of maximum difference may show the "worst case" for the efficacy of a new drug compared to a known effective drug. In some embodiments, the relative scoring module 516 identifies the value of the first surface that is closest to any value of the second surface, regardless of whether the points correspond to the same dose and exposure time. Identifying these closest points may enable a useful comparison between the two perturbations; for example, the effect of the perturbation caused by smoking a conventional cigarette for a particular period of time may be similar to the effect of the perturbation caused by inhaling an aerosol or vapor from a tobacco-related product for a different period of time.
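The three surface comparisons described above can be sketched in Python as follows. Each surface is represented, as an assumption of this sketch, by a dictionary that maps a (dose, time) pair to a scalar network response score sampled on a grid; the grid points, score values, and function names are hypothetical.

```python
def closest_same_condition(a, b):
    """(dose, time) at which the two surfaces differ least, and that difference."""
    shared = sorted(a.keys() & b.keys())
    point = min(shared, key=lambda p: abs(a[p] - b[p]))
    return point, abs(a[point] - b[point])

def furthest_same_condition(a, b):
    """(dose, time) at which the two surfaces differ most, and that difference."""
    shared = sorted(a.keys() & b.keys())
    point = max(shared, key=lambda p: abs(a[p] - b[p]))
    return point, abs(a[point] - b[point])

def closest_any_condition(a, b):
    """Closest pair of heights, allowing a different (dose, time) on each surface."""
    pairs = [(pa, pb) for pa in sorted(a) for pb in sorted(b)]
    pa, pb = min(pairs, key=lambda pr: abs(a[pr[0]] - b[pr[1]]))
    return pa, pb, abs(a[pa] - b[pb])

# Hypothetical scores keyed by (dose, exposure time).
surface_a = {(1, 1): 0.2, (1, 2): 0.5, (2, 1): 0.6, (2, 2): 0.9}
surface_b = {(1, 1): 0.1, (1, 2): 0.2, (2, 1): 0.55, (2, 2): 0.4}

near, near_gap = closest_same_condition(surface_a, surface_b)
far, far_gap = furthest_same_condition(surface_a, surface_b)
pa, pb, any_gap = closest_any_condition(surface_a, surface_b)
```

In this made-up data, the first treatment at dose 1 and time 1 produces the same score as the second treatment at dose 1 and time 2, which is the kind of cross-condition equivalence the last comparison is meant to surface.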
The relative scoring module 516 may represent the relative score in a number of different ways. In some embodiments, the relative scoring module 516 may output a scalar-valued BIF summarizing the foregoing experiments and analysis. For example, if the two surfaces of FIG. 10 are compared by the relative scoring module 516 to find the points at which the values of the two surfaces are most similar, and to identify the corresponding dose and exposure time for the first treatment (dose 1 and time 1, respectively) and the corresponding dose and exposure time for the second treatment (dose 2 and time 2, respectively), the scalar-valued BIF can be calculated according to the following formula:
In the foregoing examples, the BIF was described as being associated with the perturbation of a biological mechanism. The BIF value is described in some aspects as a numerical value that quantifies the long-term outcome of the selected perturbation on the respective biological mechanism. However, the system 100 is not limited to identifying the BIF of a particular perturbation, and may instead be used to generate BIF values for several different perturbations and to predict several different long-term outcomes for one or more of these perturbations.
In addition to, or as an alternative to, perturbations and outcomes, the system 100 may be used to estimate one or more BIF values for one or more other parameters, including disease outcome, disease progression, biological mechanisms, and environmental conditions. For example, multiple BIF values may be generated, each value representing a different level of lung cancer progression: early, mid, and late. System 100 may include hardware and software components for generating and storing multiple BIF values for these different parameters. For example, the system 100 may include a database and storage for storing different BIF values associated with lung cancer progression. Each entry in the database may include different BIF values representing different stages of disease (i.e., lung cancer) progression. The entries in the database may include additional information associated with the BIF, such as a list of relevant biological mechanisms and biological entities. The database may be used for different purposes, e.g., clinical diagnosis and prognosis.
In one example for clinical analysis, the system 100 may be used to study the progression of lung cancer in a patient. System 100 may include a database of BIF values representing different stages of progression of a particular disease (e.g., without limitation, lung cancer). In such an example, the patient may have been exposed to a substance of unknown origin or unknown identity. The patient may inform the clinician of exposure to a substance, possibly a mixture of particulate and gaseous matter, that the patient suspects may affect his or her health, particularly lung health. The clinician may select one or more assays to be run on a biological sample obtained from the patient in order to generate measurable data for the patient. In particular embodiments, system 100 may assist in selecting an assay. For example, when a clinician requests an assay that is informative of lung cancer progression, the system 100 may display a list of one or more recommended assays to the clinician. Data obtained from the patient through one or more assays may be input into the system 100 for calculation. Based on this data, the system 100 can query a database to obtain entries with similar experimental results. For example, for a gene expression assay, the system 100 may query a database to identify entries whose genes or gene expression levels match those obtained from the patient's data. In particular embodiments, system 100 may filter out one or more entries in the database based on other attributes that may not apply to the patient. The system 100 may determine one or more BIF values corresponding to the selected database entries and attribute the one or more BIF values to the patient. Alternatively, the system 100 may use the patient's data to calculate a BIF that is unique to the patient and that may be compared with a BIF value in a database that represents a particular biological outcome.
For example, the database may include BIF values ranging from 0 to 100, each value representing a level of lung cancer progression. In such an example, a number closer to zero may represent an earlier stage of lung cancer, while a number closer to 100 may represent a later stage. The system 100 may determine that the patient's data generates BIF values in the range of 10 to 20 and output the results for display. The clinician or system 100 may interpret the results and inform the patient that the patient has been exposed to a potentially harmful substance and may be at a particular stage of lung cancer. System 100 may include suitable hardware and software components to receive data and to generate and output BIF values.
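The database lookup in this clinical example can be sketched in Python as a nearest-match query. Everything concrete in the sketch is hypothetical and for illustration only: the entry layout, the placeholder gene names, the expression values, the BIF values, and the use of Euclidean distance as the matching criterion.

```python
import math

def nearest_entry(patient_profile, entries):
    """Return the entry whose expression profile is closest to the patient's."""
    def distance(entry):
        shared = patient_profile.keys() & entry["expression"].keys()
        if not shared:
            return math.inf  # no genes in common, so nothing to compare
        return math.sqrt(sum(
            (patient_profile[g] - entry["expression"][g]) ** 2 for g in shared
        ))
    return min(entries, key=distance)

# Hypothetical database entries; none of these numbers come from real data.
database = [
    {"stage": "early", "bif": 15, "expression": {"gene_a": 1.1, "gene_b": 0.9}},
    {"stage": "mid", "bif": 50, "expression": {"gene_a": 2.0, "gene_b": 1.8}},
    {"stage": "late", "bif": 85, "expression": {"gene_a": 3.2, "gene_b": 2.9}},
]
patient = {"gene_a": 1.3, "gene_b": 1.0}
match = nearest_entry(patient, database)
```

The BIF stored with the matched entry would then be attributed to the patient, in the manner described above.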
FIG. 11 is a block diagram of a distributed computerized system 1100 for quantifying the effects of biological perturbations. The components of the system 1100 are the same as those in the system 100 of FIG. 1, but are arranged such that each component communicates through the network interface 1110. Such an implementation may be suitable for distributed computing over a variety of communication systems, including wireless communication systems, that may share access to common network resources, e.g., in a "cloud computing" paradigm.
FIG. 12 is a block diagram of a computing device, such as any of the components of system 100 of FIG. 1 or system 1100 of FIG. 11, for performing the processes described with reference to FIGS. 1-10. Each of the components of the system 100, including the SRP engine 110, the network modeling engine 112, the network scoring engine 114, the aggregation engine 116, and one or more databases (including the results database, the perturbations database, and the literature database), may be implemented on one or more computing devices 1200. In certain aspects, a plurality of the aforementioned components and databases may be included within one computing device 1200. In certain implementations, the components and databases may be implemented across several computing devices 1200.
Computing device 1200 includes at least one communication interface unit, an input/output controller 1210, a system memory, and one or more data storage devices. The system memory includes at least one random access memory (RAM 1202) and at least one read-only memory (ROM 1204). All of these elements communicate with a central processing unit (CPU 1206) to facilitate the operation of computing device 1200. The computing device 1200 may be configured in many different ways. For example, computing device 1200 may be a conventional stand-alone computer, or alternatively, the functionality of computing device 1200 may be distributed across multiple computer systems and architectures. Computing device 1200 may be configured to perform some or all of the modeling, scoring, and aggregation operations. In FIG. 12, computing device 1200 is linked to other servers or systems via a network or local network.
The computing device 1200 may be configured in a distributed architecture, where the database and processor are located in separate units or locations. Some such units perform primary processing functions and contain, at a minimum, a general-purpose controller or processor and system memory. In this regard, these units are each coupled via a communication interface unit 1208 to a communication hub or port (not shown) that serves as a primary communication link with other servers, client or user computers, and other related devices. The communication hub or port may itself have minimal processing power, serving primarily as a communications router. Various communication protocols may be part of the system, including, but not limited to: Ethernet, SAP, SAS™, ATP, BLUETOOTH™, GSM, and TCP/IP.
The CPU 1206 includes a processor, e.g., one or more conventional microprocessors, and one or more auxiliary coprocessors, e.g., math coprocessors for offloading workload from the CPU 1206. The CPU 1206 communicates with the communication interface unit 1208 and the input/output controller 1210, through which the CPU 1206 communicates with other devices such as other servers, user terminals, or devices. The communication interface unit 1208 and the input/output controller 1210 may include multiple communication channels for simultaneous communication with, for example, other processors, servers, or client terminals. Devices in communication with each other need not continually send signals to each other. Rather, such devices need only send signals to each other when necessary, may actually refrain from exchanging data most of the time, and may need to perform several steps to establish a communication link between the devices.
The CPU 1206 also communicates with the data storage device. The data storage device may include a suitable combination of magnetic, optical, or semiconductor memory, and may include, for example, RAM 1202, ROM 1204, a flash drive, an optical disc (e.g., a compact disc), or a hard disk or hard drive. For example, the CPU 1206 and the data storage device may each reside entirely within a single computer or other computing device, or may be connected to each other via a communication medium (e.g., a USB port, a serial port, a coaxial cable, an Ethernet-type cable, a telephone line, a radio frequency transceiver, or other similar wireless or wired medium, or a combination of the above). For example, the CPU 1206 may be connected to the data storage device via the communication interface unit 1208. The CPU 1206 may be configured to perform one or more particular processing functions.
The data storage device may store, for example, (i) an operating system 1212 for computing device 1200; (ii) one or more applications 1214 (e.g., computer program code or a computer program product) adapted to direct the CPU 1206 in accordance with the systems and methods described herein, and in particular in accordance with the processes described in detail with respect to the CPU 1206; or (iii) a database 1216 adapted to store information required by the program. In certain aspects, the database comprises a database for storing experimental data and published literature models.
The operating system 1212 and applications 1214 may be stored in, for example, a compressed, an uncompressed, or an encrypted format, and may include computer program code. Instructions of the program may be read into the main memory of the processor from a computer-readable medium other than the data storage device, such as ROM 1204 or RAM 1202. Although execution of the sequences of instructions in the programs causes the CPU 1206 to perform the process steps described herein, hard-wired circuitry may be used in place of or in combination with software instructions to implement the processes of the present invention. Thus, the systems and methods described are not limited to any specific combination of hardware and software.
Suitable computer program code may be provided for performing one or more functions related to modeling, scoring and aggregation as described herein. Programs may also include program elements such as an operating system 1212, a database management system, and "device drivers" that allow the processor to interface with computer peripheral devices (e.g., video display, keyboard, computer mouse, etc.) via the input/output controller 1210.
The term "computer-readable medium" as used herein refers to any non-transitory medium that provides or participates in providing instructions to the processor of computing device 1200 (or any other processor of the devices described herein) for execution. Such a medium may take many forms, including, but not limited to, non-volatile media and volatile media. Non-volatile media include, for example, optical, magnetic, or opto-magnetic disks, or integrated circuit memory such as flash memory. Volatile media includes Dynamic Random Access Memory (DRAM), which typically constitutes a main memory. Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM or EEPROM (electrically erasable programmable read-only memory), a FLASH-EEPROM, any other memory chip or cartridge, or any other non-transitory medium from which a computer can read.
Various forms of computer-readable media may be involved in carrying one or more sequences of one or more instructions to the CPU 1206 (or any other processor of a device described herein) for execution. For example, the instructions may initially be carried on a magnetic disk of a remote computer (not shown). The remote computer can load the instructions into its dynamic memory and send the instructions over an Ethernet connection, a cable line, or even a telephone line using a modem. A communication device local to the computing device 1200 (e.g., a server) can receive the data on the respective communication line and place the data on the system bus for the processor. The system bus carries the data to main memory, from which the processor retrieves and executes the instructions. The instructions received by main memory may optionally be stored in memory either before or after execution by the processor. Further, the instructions may be received via the communication port as electrical, electromagnetic, or optical signals, which are exemplary forms of wireless communications or data streams that carry various types of information.
As discussed above, the system 100 may be used to construct a network of biological mechanisms to further evaluate the biological impact of a perturbation at the system level. The following paragraphs describe several example networks, each of which may be used to compute BIF scores for different results related to the underlying mechanism.
As a first example, the system 100 may be used to construct a lung-focused cell proliferation network. The lung-focused cell proliferation network was constructed using the Biological Expression Language (BEL), a computable framework for representing biological pathways developed by Selventa (Cambridge, Massachusetts, USA), which enables the network to be applied to the assessment of cell proliferation based on data obtained from high-throughput platforms. The cell proliferation network comprises 854 nodes and 1598 edges (1017 causal edges and 581 non-causal edges) and was constructed using information from 429 literature sources abstracted in PubMed. Exemplary network node types include root protein nodes (e.g., CCNE1), modified protein nodes (e.g., RB1 phosphorylated at specific serine residues), and activity nodes (e.g., the kinase activity of CDK2 (kaof(CDK2)) and the transcriptional activity of RB1 (taof(RB1))). Causal edges are cause-and-effect relationships between biological entities; for example, increased kinase activity of CDK2 causally increases RB1 phosphorylated at serine 373. A non-causal edge connects a different form of a biological entity, such as an mRNA or a protein complex, to its root protein node (e.g., STAT6 phosphorylated at tyrosine (Y) 641 has a non-causal relationship with its root protein node STAT6), without an implied causal relationship. The cell proliferation network is constructed in a modular fashion, with a core cell cycle submodel connected to additional biological pathways that contribute to cell proliferation in the lung. Five building blocks are used, which include: the cell cycle (including the canonical elements of the core machinery regulating entry into and exit from the mammalian cell cycle, including but not limited to cyclins, CDKs, and E2F family members); growth factors (including the common extracellular growth factors involved in regulating lung cell proliferation, i.e., EGF, TGF-beta, VEGF, and members of the FGF family); intracellular and extracellular signaling (including the common intracellular and extracellular pathways involved in regulating lung cell proliferation, including the Hedgehog, Wnt, and Notch signaling pathways, as well as calcium signaling, MAPK, Hox, JAK/STAT, mTOR, prostaglandin E2 (PGE2), clock, and nuclear receptor signaling involved in lung cell proliferation); cell interactions (including the signal transduction pathways leading to cell proliferation that originate from the interactions of common cell adhesion molecules (including ITGB1 complexed with ITGA1-3 chains) and extracellular matrix components (specifically collagen, fibronectin, and laminin)); and epigenetics (including the major known epigenetic regulators of lung cell proliferation, including the histone deacetylase (HDAC) family and the DNA methyltransferase (DNMT) family member DNMT1).
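As an illustrative sketch only, a network with causal and non-causal edges of the kind described above could be held in a small Python data structure. The class name `Edge`, its fields, and the relation labels are hypothetical; the node labels are plain strings paraphrasing the two examples in the text and are not actual BEL syntax.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    source: str     # upstream node label
    target: str     # downstream node label
    relation: str   # hypothetical relation label
    causal: bool    # True for cause-and-effect edges

# The two example relationships mentioned in the text.
edges = [
    Edge("kaof(CDK2)", "RB1 phosphorylated at S373", "increases", causal=True),
    Edge("STAT6 phosphorylated at Y641", "STAT6", "modified form of", causal=False),
]

def summarize(edges):
    """Return (number of nodes, number of causal edges, number of non-causal edges)."""
    nodes = {e.source for e in edges} | {e.target for e in edges}
    n_causal = sum(1 for e in edges if e.causal)
    return len(nodes), n_causal, len(edges) - n_causal

counts = summarize(edges)
```

Applied to the full network, the same summary would report the 854 nodes, 1017 causal edges, and 581 non-causal edges stated above.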
To verify the contents of the network, the system 100 is used to analyze transcriptomic data sets using reverse causal reasoning (RCR), which identifies upstream controllers ("hypotheses") that can explain the significant changes in mRNA state abundance in a given transcriptomic data set. The transcriptomic data sets used to verify and extend the model were taken from public data repositories such as GEO (Gene Expression Omnibus) and ArrayExpress. The data sets used included the EIF4G1 data set (GSE11011), the RHOA data set (GSE5913), the CTNNB1 data set (PMID15186480), and the NR3C1 data set (E-MEXP-861). System 100 was used to perform RCR analysis on each of these four cell proliferation transcriptomic data sets and to evaluate the resulting hypotheses. Predictions for many nodes in the core cell cycle block (including predictions of increased E2F1, E2F2, and E2F3 activity) are consistent with their published roles in regulating cell proliferation in lung-related cell types. Furthermore, the prediction of increased MYC activity in the RHOA and CTNNB1 data sets is consistent with the reported role of MYC in positively regulating cell proliferation in lung and lung-related cell types. In addition to predicting increased activity of positive regulators of cell proliferation in data sets in which cell proliferation was experimentally increased, RCR also predicted decreased activity of negative regulators of proliferation. For example, decreases in the transcriptional activity of RB1 and E2F4 (both of which are known negative regulators of cell cycle progression) were predicted in multiple data sets. Similarly, a decrease in the abundance of CDKN1A or CDKN2A (cell cycle checkpoint proteins with potent antiproliferative effects) was also predicted in all three data sets in which an increase in proliferation was observed.
Many of these hypotheses are pleiotropic signaling molecules that are involved in processes other than proliferation, and they may result from the perturbation of non-proliferative areas of biology in the data sets under study. In addition to verifying the literature model of cell proliferation, RCR on the four cell proliferation data sets was used to identify other mechanisms affecting cell proliferation in the lung. For example, the transcriptional activity of ZBTB17 (MIZ-1) was predicted to increase in the CTNNB1 data set, although ZBTB17 does not yet have a direct, literature-described role in regulating normal lung cell proliferation. Thus, in particular embodiments, the biological effect of an agent on a mammalian subject can be assessed by analyzing data in at least the network model of lung cell proliferation. Suitably, the model of the lung cell proliferation network comprises at least one, or a combination of two or more, of the following sub-models: cell cycle, growth factors, intracellular and extracellular signaling, cell-cell interactions, and epigenetics.
As a second example, the system 100 is used to build a network model of the main pulmonary inflammatory processes (the inflammatory process network, or IPN) by combining a survey of the relevant published literature with computational analysis of multiple transcriptomic data sets. To capture the contribution of multiple cell types to lung inflammation, the system 100 is configured to build the IPN model using a modular architecture, in which a larger network model comprises constituent submodels. At least 23 IPN submodels focus on the major cell types known to be involved in cigarette smoke-induced lung inflammation; in particular, lung epithelial cells, macrophages, neutrophils, T-cell subsets (Th1, Th2, Th17, Tregs, and Tc), NK cells, dendritic cells, megakaryocytes, and mast cells. Within each submodel, an input-output design is used: the submodel inputs are the ligands or triggers that initiate or inhibit signaling cascades within the cell, while the submodel outputs are the cellular or physiological products of these signaling pathways (primarily secreted cytokines or biological processes). The system 100 is used to construct the IPN model according to the process described above, including surveying the scientific literature, extracting causal relationships from the Selventa knowledge base, manually curating additional statements from the literature, and deriving nodes from reverse causal reasoning (RCR) analysis of transcriptomic profiling experiments that evaluate specific inflammation-related processes. The RCR augmentation is based on data sets obtained from the Gene Expression Omnibus (GEO), representing mouse whole-lung exposure to LPS in vivo (GSE18341), dendritic cell activation, monocyte-to-macrophage differentiation, NK cell activation in response to IL15, Th1 differentiation, and Th2 differentiation in vitro (GSE22886), and exposure of lung neutrophils to LPS in vivo (GSE2322).
Thus, in certain embodiments, the biological effect of an agent on a mammalian subject (e.g., a human) can be assessed by analyzing data in at least one network model of pulmonary inflammation. Suitably, the lung inflammatory network model comprises one or a combination of the following sub-models, each described with one or more exemplary nodes (in parentheses):
(1) mucus hypersecretion (hypersecretion and MUC5AC expression in lung epithelial cells in response to cytokines such as IL13, CCL2, TNF, and EGF);
(2) epithelial barrier defense (changes in the barrier function and tight-junction permeability of the epithelium in response to signals such as EGF, TNF, ADAM17, and ROS);
(3) epithelial pro-inflammatory signaling (expression of inflammatory proteins during epithelial activation in response to upstream signals such as TNF, TLR4, ELA2, and IL-1 beta);
(4) neutrophil response (in response to upstream signals such as TNF, CSF3, and FPR1);
(5) macrophage-mediated recruitment of neutrophils (IL-8, serine 1, and leukotriene B4 leading to the chemotaxis and recruitment of neutrophils, in response to upstream signals such as TNF);
(6) neutrophil chemotaxis (modulation of chemotaxis in response to upstream signals such as CSF3, F2, ILA, CXCL12, S100A8, and S100A9);
(7) tissue damage (release of DAMPs and PAMPs as inflammatory triggers after tissue damage, leading to TLR and NFkB signaling);
(8) macrophage activation (NFkB-dependent production of pro-inflammatory molecules in response to upstream signals such as Toll-like receptor ligation);
(9) macrophage differentiation (differentiation in response to upstream signals such as IL-6, IGF-1, and interferon gamma);
(10) Th1 differentiation (Th1 differentiation and IFNG expression in response to upstream signals such as CCL5 and DLL1);
(11) Th1 response (in response to upstream signals such as IFNG, IL2, LTA, and LTB);
(12) Th2 cell differentiation (in response to upstream signals such as IL4, IL25, and VIP);
(13) Th17 differentiation (in response to upstream signals such as TGFB1 and DLL4);
(14) Th17 response (in response to upstream signals such as IL21, IL22, and IL26);
(15) Treg response (regulatory T cell differentiation and IL10 expression in response to upstream signals such as TGFB1 and IL7);
(16) Tc response (induction of FASLG as a cytotoxic T cell response to upstream TCR ligation and IL15);
(17) NK cell activation (induction of NK cell cytolysis of target cells in response to upstream signals such as IL-2, IL-4, IL-7, IL-12, IL-15, TGFbeta, IFNalpha1, and ITGB2);
(18) mast cell activation (in response to upstream signals such as IL4, KITLG, and FcIgE receptors);
(19) dendritic cell activation (production of cytokines and other inflammation-related proteins in response to upstream TLR ligands such as LPS and HMGB1);
(20) dendritic cell migration to tissue (modulation of migration to the site of infection in response to upstream signals such as complement, CCL3, and CCL5);
(21) dendritic cell migration to lymph nodes (modulation of migration to lymph nodes in response to upstream signals such as CXCL9, CXCL10, CXCL11, CCL19, and CCL21);
(22) Th2 response (immune response to upstream signals such as IL-4 and IL-13); and
(23) megakaryocyte differentiation (megakaryocyte differentiation in response to upstream signals such as IL11 and CXCL12).
Thus, the computerized method of the invention for determining biological effects may comprise using a network model of the inflammatory processes of the lung that comprises one or more of the 23 sub-models.
As a third example, system 100 is used to construct an integrated network model that captures the physiological cellular response to endogenous and exogenous stressors in non-diseased mammalian lung and cardiovascular cells. The system 100 is used to build a model of the Cellular Stress Network (CSN) according to the process described above, including surveying the scientific literature, extracting causal relationships from the Selventa knowledge base, and receiving manually curated statements from the literature. The CSN model includes six sub-models: (1) xenobiotic metabolism response (including AHR, cytochrome P450 enzymes, and the various environmental inducers of this response); (2) endoplasmic reticulum (ER) stress (including the unfolded protein response and the pathways downstream of the three stress sensors Perk (Eif2ak3), ATF6 and Ire1alpha (Ern1), while excluding the pro-apoptotic arms of the response); (3) endothelial shear stress (including the effects of laminar (atheroprotective) and turbulent (atherogenic) shear stress on monocyte adhesion, including the NF-κB and nitric oxide pathways); (4) hypoxia response (including Hif1a activation and targets, transcriptional control, protein synthesis, and cross-talk with the oxidative stress, endoplasmic reticulum stress and osmotic stress response pathways); (5) osmotic stress (including Nfat5, aquaporins, and the Cftr pathways downstream of the hyperosmotic response); and (6) oxidative stress (including intracellular free radical management pathways, endogenous and exogenous oxidants (including those induced by exposure to hyperoxic conditions), antioxidants, glutathione metabolism, the P38, Erk, JNK and NF-κB pathways, and NRF2 with its upstream regulators and downstream antioxidant response element (ARE)-mediated gene expression).
Thus, in certain embodiments, the biological effect of an agent on a mammalian subject (e.g., a human) can be assessed by analyzing data in the context of at least one cellular stress network model. Suitably, the cellular stress network model comprises at least one, or a combination of two or more, of the following sub-models: xenobiotic metabolism response, endoplasmic reticulum (ER) stress, endothelial shear stress, hypoxia response, osmotic stress, and oxidative stress.
The system 100 was used to evaluate the CSN model against a data set representing the transcriptional response to cigarette smoke (CS), a prototypical inducer of pleiotropic cellular stress, in the mouse lung (GSE18344). The data set included data from wild-type and NRF2 knock-out animals exposed to CS or to ambient air (sham exposure); the data from 1 day of CS treatment were selected to test the CSN model. Significant mRNA state changes (SCs) were determined for three comparisons: wild type 1-day CS vs. sham exposure, NRF2 knock-out 1-day CS vs. sham exposure, and NRF2 knock-out 1-day CS vs. wild type 1-day CS. The experimental results are consistent with the central role of NRF2 in the lung cell response to CS. In particular, 35% of the SCs induced by 1 day of CS exposure in wild-type mice can be explained by activation of NRF2. When NRF2 knock-out mice exposed to CS for 1 day were compared with wild-type mice, reduced NRF2 transcriptional activity was predicted, consistent with the absence of NRF2 in these mice. Thus, the computerized method of the invention for determining biological impact may comprise using a network model of cellular stress comprising one or more of the six sub-models.
As a further example, the system 100 is used to construct a network model of the DNA damage response, apoptosis, programmed necrosis, autophagy and senescence by combining a survey of the relevant published literature with computational analysis of several transcriptomic data sets. This network is referred to as the DNA damage, autophagy, cell death and senescence (DACS) network. The DACS network is built using a highly modular design, with the larger network divided into sub-models. Discrete mechanisms affecting cell fate (e.g., the pro-survival, NF-κB-mediated transcriptional upregulation of anti-apoptotic genes) are described by 35 sub-models in 5 DACS network areas. In total, the DACS network contains 1052 unique nodes and 1538 unique edges (959 causal edges and 579 non-causal edges), supported by 1231 unique PubMed-referenced literature citations. Nodes in the DACS network are biological entities such as protein abundances, mRNA expression and protein activities. Nodes may also represent biological processes (e.g., apoptosis). Edges are the relationships between nodes and are classified as causal or non-causal. The DACS network is built and populated with content from two major sources: nodes and edges derived from prior knowledge described in the scientific literature, and nodes derived from computational analysis of transcriptomic profiling data by reverse causal reasoning (RCR).
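The node-and-edge structure described above can be illustrated with a minimal sketch. The class names, the relation labels and the three-edge toy fragment are assumptions made for illustration only; they are not the data model or the content of the DACS network itself.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edge:
    source: str      # upstream node label
    target: str      # downstream node label
    relation: str    # hypothetical labels, e.g. "increases", "decreases"
    causal: bool     # edges are classified as causal or non-causal


@dataclass
class NetworkModel:
    name: str
    nodes: set = field(default_factory=set)
    edges: list = field(default_factory=list)

    def add_edge(self, edge):
        # Adding an edge also registers both of its endpoints as nodes.
        self.nodes.update((edge.source, edge.target))
        self.edges.append(edge)

    def causal_edges(self):
        return [e for e in self.edges if e.causal]

    def downstream(self, node):
        """Return all nodes reachable from `node` along causal edges."""
        seen, stack = set(), [node]
        while stack:
            current = stack.pop()
            for e in self.edges:
                if e.causal and e.source == current and e.target not in seen:
                    seen.add(e.target)
                    stack.append(e.target)
        return seen


# Toy fragment (illustrative labels only, not taken from the DACS network).
dacs = NetworkModel("toy DACS fragment")
dacs.add_edge(Edge("act(NFKB)", "exp(BCL2)", "increases", causal=True))
dacs.add_edge(Edge("exp(BCL2)", "bp(apoptosis)", "decreases", causal=True))
dacs.add_edge(Edge("exp(BCL2)", "abundance(BCL2)", "association", causal=False))
```

Keeping the causal flag on each edge lets the same structure report the causal and non-causal edge counts separately and restrict path tracing to causal relationships.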
Suitably, the DACS network model comprises at least one, or a combination of two or more, of the following sub-models: for apoptosis, (1) caspase cascade, (2) ER stress-induced apoptosis, (3) MAPK signaling, (4) NFkappaB signaling, (5) PKC signaling, (6) pro-apoptotic mitochondrial signaling, (7) pro-survival mitochondrial signaling, (8) TNFR/FAS signaling, (9) TP53 transcriptional signature; for autophagy, (10) ATG induction of autophagy, (11) autophagy induction, (12) mTOR signaling, (13) nutrient transporter synthesis, and (14) protein synthesis; for DNA damage, (15) components affecting TP53 activity, (16) components affecting TP63 activity, (17) components affecting TP73 activity, (18) DNA damage to G1/S checkpoint, (19) DNA damage to G2/M checkpoint, (20) double-strand break response, (21) inhibition of DNA repair, (22) NER/XP pathway, (23) single-strand break response, (24) TP53 transcriptional signature; for necroptosis, (25) Fas activation, (26) gene markers, (27) inflammatory mediators, (28) RIPK/ROS-mediated execution, (29) TNFR1 activation; for senescence, (30) oncogene-induced senescence, (31) replicative senescence, (32) stress-induced premature senescence, (33) regulation of p16INK expression, (34) regulation of tumor suppressor genes, and (35) transcriptional regulation of the SASP.
RCR-based augmentation of the DACS network used four transcriptomic data sets (two for DNA damage and two for senescence), referred to as the "building" data sets. Ideally, transcriptomic data sets addressing all 5 DACS areas would be used to maximize network coverage. However, since three of the DACS network areas (apoptosis, autophagy and necroptosis) have not classically been described as being driven by transcriptomic changes, the work focused on transcriptomic data from experiments describing DNA damage responses and the induction of senescence. All four building data sets were derived from in vitro experiments on human or mouse fibroblasts and represent responses to DNA damage induced by UV irradiation or chemical DNA cross-linking agents, replicative senescence induced by serial passaging, and stress-induced premature senescence (SIPS) induced by bleomycin (GSE13330). Thus, in certain embodiments, the biological effect of an agent on a mammalian subject (e.g., a human) can be assessed by analyzing data in the context of at least one DACS network model. Suitably, the DACS network model comprises at least one, or a combination of two or more, of the above-described sub-models.
A plurality of computational causal network models representing biological systems are provided at the processor, each computational model including nodes representing a plurality of biological entities and edges representing relationships between those biological entities. In one embodiment, the computational causal network models are two or more selected from a cell proliferation network, an inflammatory process network, a cellular stress network, and a DNA damage, autophagy, cell death and senescence network. Each network model may contain sub-model components.
In one embodiment, the cell proliferation network is a lung-focused cell proliferation network. Suitably, the sub-model is selected from the group consisting of: cell cycle (including the canonical elements of the core machinery regulating entry into and exit from the mammalian cell cycle, including but not limited to cyclins, CDKs and members of the E2F family); growth factors (including the common extracellular growth factors involved in regulating lung cell proliferation, namely EGF, TGF-beta, VEGF and members of the FGF family); intracellular and extracellular signaling (including the common intracellular and extracellular pathways involved in regulating lung cell proliferation, including the Hedgehog, Wnt and Notch signaling pathways, as well as calcium signaling, MAPK, Hox, JAK/STAT, mTOR, prostaglandin E2 (PGE2), clock and nuclear receptor signaling); cell interaction (including the signal transduction pathways leading to cell proliferation that originate from the interactions of common cell adhesion molecules (including ITGB1 complexes with ITGA1-3 chains) and extracellular matrix components (specifically collagen, fibronectin and laminin)); and epigenetics (including the main known epigenetic regulators of lung cell proliferation, including the histone deacetylase (HDAC) family and the DNA methyltransferase (DNMT) family member DNMT1), or a combination of two or more thereof.
In one embodiment, the inflammatory process network is a pulmonary inflammatory process network. Suitably, the sub-models focus on the main cell types known to be involved in cigarette smoke-induced lung inflammation. In one embodiment, the sub-model is selected from the group consisting of lung epithelial cells, macrophages, neutrophils, T-cell subsets (Th1, Th2, Th17, Treg and Tc), NK cells, dendritic cells, megakaryocytes and mast cells, or a combination of two or more thereof.
In one embodiment, the sub-model of the cellular stress network is selected from the group consisting of: (1) xenobiotic metabolism response (including AHR, cytochrome P450 enzymes, and the various environmental inducers of this response); (2) endoplasmic reticulum (ER) stress (including the unfolded protein response and the pathways downstream of the three stress sensors Perk (Eif2ak3), ATF6 and Ire1alpha (Ern1), while excluding the pro-apoptotic arms of the response); (3) endothelial shear stress (including the effects of laminar (atheroprotective) and turbulent (atherogenic) shear stress on monocyte adhesion, including the NF-κB and nitric oxide pathways); (4) hypoxia response (including Hif1a activation and targets, transcriptional control, protein synthesis, and cross-talk with the oxidative stress, endoplasmic reticulum stress and osmotic stress response pathways); (5) osmotic stress (including Nfat5, aquaporins, and the Cftr pathways downstream of the hyperosmotic response); and (6) oxidative stress (including intracellular free radical management pathways, endogenous and exogenous oxidants (including those induced by exposure to hyperoxic conditions), antioxidants, glutathione metabolism, the P38, Erk, JNK and NF-κB pathways, and NRF2 with its upstream regulators and downstream antioxidant response element (ARE)-mediated gene expression), or a combination of two or more thereof.
In an embodiment of the DACS network model, the sub-model is selected from the group consisting of: for apoptosis, (1) caspase cascade, (2) ER stress-induced apoptosis, (3) MAPK signaling, (4) NFkappaB signaling, (5) PKC signaling, (6) pro-apoptotic mitochondrial signaling, (7) pro-survival mitochondrial signaling, (8) TNFR/FAS signaling, (9) TP53 transcriptional signature; for autophagy, (10) ATG induction of autophagy, (11) autophagy induction, (12) mTOR signaling, (13) nutrient transporter synthesis, and (14) protein synthesis; for DNA damage, (15) components affecting TP53 activity, (16) components affecting TP63 activity, (17) components affecting TP73 activity, (18) DNA damage to G1/S checkpoint, (19) DNA damage to G2/M checkpoint, (20) double-strand break response, (21) inhibition of DNA repair, (22) NER/XP pathway, (23) single-strand break response, (24) TP53 transcriptional signature; for necroptosis, (25) Fas activation, (26) gene markers, (27) inflammatory mediators, (28) RIPK/ROS-mediated execution, (29) TNFR1 activation; for senescence, (30) oncogene-induced senescence, (31) replicative senescence, (32) stress-induced premature senescence, (33) regulation of p16INK expression, (34) regulation of tumor suppressor genes, and (35) transcriptional regulation of the SASP, or a combination of two or more thereof.
In accordance with the systems and methods described herein, a computational model may be used to represent any and all aspects of the operation and structure of a biological system and its components. In particular, the systems and methods described herein are configured to quantify the long-term effects of agents on any and all aspects of the operation and structure of biological systems and their components. Thus, while most of the present specification discusses biochemical data at a physiological level, computational models can be used to represent interactions at the level of ions and atoms (e.g., calcium flux, neurotransmission), the biochemistry of nucleic acids, proteins and metabolites, organelles, subcellular compartments, cells, tissue compartments, tissues, organs, organ systems, individuals, populations, diets, disease states, clinical trials, epidemiology, predator-prey interactions, and parasite-host interactions.
Examples of biological systems in the human context include, but are not limited to: the pulmonary, integumentary, skeletal, muscular, nervous, endocrine, cardiovascular, immune, circulatory, respiratory, digestive, urinary and reproductive systems. In one particular example, a computational model may be used to represent the function and structure of skeletal muscle fibers in the muscular system. In another example, a computational model may be used to represent the neural control of muscle fiber contraction in the skeletal system. In further examples, computational models may be used to represent the function and structure of pathways for visceral motor output in the nervous system, or the function of synaptic communication in neural tissue. In other examples, computational models may be used to represent the function and structure of the control of the cardiac cycle and heart rate in the cardiovascular system. In still other examples, computational models may be used to represent the function and structure of lymphocytes and immune responses in the lymphatic system. In other examples, computational models may be used to represent the appearance of symptoms or adverse health effects and the onset of disease. In certain embodiments, the computational models of the present invention represent diseases such as cardiovascular disease, cancer (particularly lung cancer), chronic obstructive pulmonary disease, asthma, and adverse health conditions associated with cigarette smoking and the consumption of other nicotine-containing compositions. Such computational models can be used in the methods of the invention to predict the biological effects of smoking and of using nicotine-containing compositions.
Other examples of biological systems include, but are not limited to: epithelial cells, nerve cells, blood cells, connective tissue cells, smooth muscle cells, skeletal muscle cells, adipocytes, egg cells, sperm cells, stem cells, lung cells, brain cells, cardiac muscle cells, larynx cells, pharynx cells, esophageal cells, stomach cells, kidney cells, liver cells, breast cells, prostate cells, pancreas cells, testis cells, bladder cells, uterus cells, colon cells, and rectum cells. Examples of cellular functions include, but are not limited to, cell division, cell regulation, control of cellular activity by the nucleus, and cell-to-cell signaling. Computational models can also be used to represent the function and structure of cellular components. Examples of cellular components include, but are not limited to: cytoplasm, cytoskeleton, ribosomes, mitochondria, nucleus, endoplasmic reticulum (ER), Golgi apparatus, and lysosomes.
In certain aspects, computational models can be used to represent the structure, function, and synthesis of proteins. In addition, computational models can be used to represent the components of proteins, including, but not limited to, amino acid sequence, secondary and tertiary structure, post-translational modifications (e.g., phosphorylation), and conformational data. Computational models can also be used to represent protein-related molecules, including, but not limited to, enzymes.
In certain aspects, computational models are used to represent the structure, function, and synthesis of nucleic acids. The nucleic acid is not limited to any particular type, and includes, but is not limited to: total genomic DNA, cDNA, RNA, mRNA, tRNA, and rRNA. In certain aspects, computational models derived from life science information are used to represent the structure and function of DNA replication, DNA repair, and DNA recombination. In another aspect of the systems and methods described herein, the computational model identifies, for example, single nucleotide polymorphisms (SNPs), splice variants, small RNA, double-stranded RNA (dsRNA), small interfering RNA (also known as short interfering RNA or siRNA), RNA interference (RNAi), chromosomes, chromosomal modifications, or silenced genes.
In certain aspects, computational models are used to represent the function of cancer pathways, including, but not limited to, oncogenes and tumor suppressor genes. For example, one or more computational models can be used to represent the expression of the human p53 tumor suppressor gene. In another aspect, the computational model may be used to represent pathways of different types of cancer, including but not limited to cancers of the: blood (e.g., leukemia), mouth, lip, nasal cavity and sinuses, larynx, pharynx, esophagus, stomach, lung, liver, pancreas, prostate, kidney, testis, bladder, uterus, cervix, colon, and rectum.
In certain aspects, computational models are used to represent pathways of various diseases, including, but not limited to, the operation of molecular mechanisms associated with the diseases. Examples of diseases include, but are not limited to: cardiovascular disease, coronary heart disease, and pulmonary, respiratory, hematological, neurological, psychiatric, muscular, skeletal, ophthalmological, gastrointestinal, urogenital, endocrine, dermatological, inflammatory, metabolic, pathogenic and infectious diseases.
In certain aspects, the computational model identifies relationships involving agents. Examples of such relationships include, but are not limited to, the following: agent X inhibits a specific function of molecule Y; agent X acts as a drug; agent X is disclosed in a patent publication; agent X is used to treat disease Y; agent X inhibits the activity of entity Y; and agent X activates ABC activity of entity Y.
In certain aspects, the computational model may be used to represent the function and structure of infectious agents. Examples of such infectious agents include, but are not limited to, viruses, bacteria, yeast, fungi, or other microorganisms such as parasites. In another aspect, the computational model identifies pathogens such as viruses, bacteria, fungi, or prions, together with relational connectors that represent their effect on specific diseases and other characteristics. In other aspects of the disclosure, the computational model identifies that a particular measurable entity is a biomarker of a disease state, of drug efficacy, or for patient stratification, or identifies relationships between model organisms, tissues or other disease models and the related diseases, or between an epidemic and its characteristics.
The following examples are offered by way of illustration and not by way of limitation. The present invention employs, unless otherwise indicated, conventional techniques and procedures known in the art.
Examples of the invention
Described herein is a novel computational method that derives a quantitative measure of biological impact, defined as the Biological Impact Factor (BIF), from system-wide data, using defined causal biological (e.g., molecular) network models as the substrate for data analysis. This approach allows the biological impact of active substances to be assessed a priori at the pharmacological level and their mechanisms of action to be identified through the application of causal biological network models. The effect of a particular biological network perturbation caused by a single biologically active substance, or by a mixture of such substances, can be determined for each molecular entity in the network, thereby identifying the causal mechanisms induced by each substance or mixture. Because the method is based on system-wide experimental data, this quantitative approach takes into account the entire biological system and thus the many biological networks perturbed by the active substances. It therefore allows each molecular entity (or node) in the described biological networks to be assessed quantitatively and objectively, and to be used alone or as part of a signature as a molecular marker that closely reflects the overall perturbation state of each biological network in the system (as compared with controlled activation or inhibition) and its correlation with events such as disease onset or progression. Furthermore, the method enables quantitative comparison of biological impact across individuals and species at the mechanistic level, whereas gene-level comparisons are confounded by genomic and genetic variation. This capability provides a means of translating between in vivo and in vitro model system biology and human biology.
This approach offers potential predictive power, and all of its assumptions are made explicit by a deterministic scoring algorithm. The approach may enable applications of network pharmacology and systems biology beyond toxicological assessment, in areas such as drug development, consumer product testing, and environmental impact analysis. An embodiment of the present invention employing a five-step method is depicted in fig. 2.
Example 1 - Design of experiments for data production
For translating findings to human systems, data collected from clinical studies are the most useful. However, because of the challenges of obtaining large human data sets, it is useful to also consider in vivo non-human models as well as in vitro cell-based models and organotypic (3D) cultures that represent major aspects of human disease. The data obtained from these systems make it possible to gain at least some insight into how substances perturb biological networks, to identify mechanism-specific biomarkers for human studies, and to link these mechanisms to an assessment of their impact on disease onset.
Although both in vitro and in vivo experimental systems are known to have many shortcomings, a systematic approach to their use will minimize these problems (fig. 14). Such a systematic approach may take several considerations into account:
Exposure. The substance or complex stimulus and the route of exposure reflect the range and circumstances of exposure encountered in daily life. A standard set of exposure protocols is defined and applied systematically to equally well-defined experimental systems. In addition, each assay can be designed to collect time-dependent and dose-dependent data, to capture both early and late events and to ensure that a representative dose range is covered.
Experimental system. Where possible, the experimental systems serve two complementary purposes: 1) animal models that reproduce defined characteristics of the human disease and allow adequate exposure, and 2) cellular and organotypic systems selected to reflect the cell types and tissues involved in the etiology of the disease, preferably primary cell or organ cultures that recapitulate in vivo human biology as closely as possible. It is also important to match each in vitro culture with its closest equivalent culture derived from the in vivo animal model. This allows the matched in vitro systems to be used as a "hub" to establish a "translational continuum" from the animal model to in vivo human biology.
Measurements. High-throughput, system-wide measurements of gene expression, protein expression and post-translational modifications (e.g., phosphorylation and metabolite profiles) are generated and correlated with functional outcomes of exposing the system. Functional outcome measurements are useful to the strategy because they serve as anchors for the evaluation and represent clear steps in the etiology of the disease. Although animal models and cellular systems do not always translate completely to human disease, certain steps can be reproduced, and they provide important material for understanding how biological network perturbations lead to disease.
Example 2 - Computing system response curves
The quality-controlled measurements generated in the first step constitute a system response curve (SRP) for each given exposure in a given experimental system. An SRP thus expresses the degree to which each molecular entity changes as a result of exposing the system, and is the product of rigorous quality control and statistical analysis. In this way, different measurements and data types can be combined and analyzed together to provide a more accurate quantitative representation of the biology.
Next, measurable elements (such as mRNA expression) are causally integrated into the biological network models using a priori knowledge. This, combined with the computational methods under development, enables a mechanistic assessment and understanding of how active substances perturb biological networks.
Example 3 - Construction of biological network models
Whereas the SRPs derived in the previous step represent the experimental data from which biological impact is determined, the causal biological network models are the substrate for SRP analysis. Applying this strategy requires the development of detailed causal network models of the mechanistic biological processes relevant to risk assessment. Such a framework provides a layer of mechanistic understanding beyond the examination of gene lists used in more classical toxicogenomics. Using BEL (Biological Expression Language, Selventa's computational framework for biological network representation), a strategy for building such models was developed so that they can be applied to the evaluation of biological processes of interest based on high-throughput data.
The construction of such a network is an iterative process. The selection of the biological boundaries of the network is guided by a literature survey of the signaling pathways relevant to the process of interest (e.g., cell proliferation in the lung). The causal relationships describing these pathways are extracted from the Selventa knowledge base, with the network centered on relationships derived from the relevant cell types. The literature-based network can then be validated using high-throughput data sets for which phenotypic endpoints are available.
One example is microarray analysis of human bronchial epithelial cells perturbed by inhibitors of the key cell cycle regulator CDK1, combined with proliferation assays. These data sets are analyzed using reverse causal reasoning (RCR), a method that predicts the activation states of biological entities (nodes in the network) that are statistically significant and consistent with the measurements in a given high-throughput data set.
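The idea of checking whether a node's predicted activation state is consistent with the measurements can be sketched as a sign-agreement test. This is a simplified illustration only: the function name, the +1/-1 encoding and the one-sided binomial test with a 0.5 chance of agreement are assumptions of this sketch, and a full RCR analysis would involve further statistics not shown here.

```python
from math import comb


def concordance(predicted, observed):
    """Compare predicted and observed directions of change.

    predicted: dict mapping gene -> +1 or -1, the direction expected
               downstream if the upstream node is activated.
    observed:  dict mapping gene -> +1 or -1 for genes with a
               significant measured state change.
    Returns (agreeing, compared, p_value), where p_value is the one-sided
    binomial probability of at least `agreeing` matches arising by chance.
    """
    shared = [g for g in predicted if g in observed]
    n = len(shared)
    agreeing = sum(1 for g in shared if predicted[g] == observed[g])
    if n == 0:
        return 0, 0, 1.0
    # P(X >= agreeing) for X ~ Binomial(n, 0.5)
    p_value = sum(comb(n, k) for k in range(agreeing, n + 1)) / 2 ** n
    return agreeing, n, p_value


# Hypothetical toy input: four downstream genes, all changing as predicted.
predicted = {"geneA": 1, "geneB": 1, "geneC": -1, "geneD": 1}
observed = {"geneA": 1, "geneB": 1, "geneC": -1, "geneD": 1, "geneE": -1}
agreeing, compared, p_value = concordance(predicted, observed)
```

A low p-value here supports the hypothesis that the upstream node is activated; applying the same test with all predicted signs flipped would assess the opposite hypothesis, that the node is inhibited.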
RCR predictions for nodes of the literature-derived network that are consistent with the cell proliferation observed in the experiments used to generate the high-throughput data validate the ability of the network to capture the mechanisms regulating the biological process it represents. Furthermore, network-relevant nodes predicted by RCR that are not represented in the literature-derived network are integrated into it. This approach generates an integrated biological network whose nodes and edges (directed connections between nodes) are derived both from the literature and from relevant high-throughput data sets.
These networks have features that make process scoring possible. The topology is preserved: chains of causal relationships (signaling pathways) can be traced from any point in the network to a measurable entity. Furthermore, the models are dynamic, and the assumptions used to construct them can be modified or restated to accommodate different tissue contexts and species. This allows the models to be tested iteratively and improved as new knowledge becomes available.
Example 4 - Calculation of NPA scores for biological networks from SRPs
To enable quantitative comparison of biological network perturbations, a computational approach was developed to convert SRPs into network response scores. The network response score is computed from experimental data within the context of a causal biological network model. In particular, measurements that are causally mapped to individual elements of the model as downstream effects of a perturbation are aggregated into a biological network-specific score via the techniques described herein. By providing a measure of biological network perturbation, the network response score allows molecular events to be correlated with the phenotypes that characterize the outcome at the cellular, tissue, or organ level.
Example 5 - Computing the biological impact factor of a biological system
A single numerical score can be calculated that represents the system-wide, pan-mechanistic biological impact of a given substance or mixture. A further step in estimating the biological impact of a perturbing agent is to aggregate the network response scores, each of which represents the impact on one biological network, into a single value representing the overall impact on the entire biological system. The network response scores of the contributing networks are aggregated to generate an estimate of biological impact in a process that requires both normalizing the scores across networks and weighting the contribution of each network (fig. 15). The design of the aggregation algorithm therefore has to address the problem of defining the relative contribution of each biological network to the overall state of the system. Finally, when the BIF is used as a predictor of medium-term and long-term disease outcome, it can be calibrated using a combination of experimental and (if available) epidemiological data.
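The normalize-then-weight aggregation described above can be sketched as follows. The function name, the use of a per-network reference value as the normalization, and the equal weights used by default are all assumptions made for illustration; the text does not specify how normalization or weighting is performed.

```python
def aggregate_bif(scores, weights=None, reference=None):
    """Combine per-network response scores into one impact value.

    scores:    dict mapping network name -> network response score.
    weights:   dict mapping network name -> relative contribution
               (hypothetical; defaults to equal weights).
    reference: dict mapping network name -> normalization constant,
               e.g. a maximal score for that network (hypothetical;
               defaults to 1.0, i.e. no rescaling).
    """
    names = list(scores)
    if weights is None:
        weights = {name: 1.0 for name in names}
    if reference is None:
        reference = {name: 1.0 for name in names}
    total_weight = sum(weights[name] for name in names)
    # Weighted mean of the normalized scores.
    weighted_sum = sum(
        weights[name] * scores[name] / reference[name] for name in names
    )
    return weighted_sum / total_weight


# Hypothetical scores for two networks.
equal = aggregate_bif({"proliferation": 2.0, "inflammation": 4.0})
custom = aggregate_bif(
    {"proliferation": 2.0, "inflammation": 4.0},
    weights={"proliferation": 3.0, "inflammation": 1.0},
    reference={"proliferation": 2.0, "inflammation": 8.0},
)
```

Dividing by the total weight keeps the result on the same scale as the normalized scores, so values computed with different weightings remain comparable.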
Example 6 - Quantification of the effect of inhaled chemicals on rat nasal epithelial tumorigenesis
As an example of an application of the graph-theoretic BIF technique disclosed herein, the system 100 is configured to generate biological impact factors (BIFs) using cell proliferation and inflammation networks to quantify the impact of inhaled chemicals on nasal epithelial tumorigenesis in rats. Time-course and dose-response gene expression microarray data from the nasal tissue of rats exposed to formaldehyde by inhalation are publicly available from the Gene Expression Omnibus under accession number GSE23179. To obtain this data set, eight-week-old male F344/CrlBR rats were exposed to formaldehyde by whole-body inhalation at doses of 0, 0.7, 2, 6, 10 and 15 ppm (6 hours per day, 5 days per week). The exposed animals were sacrificed 1, 4 and 13 weeks after the start of exposure. After sacrifice, tissue from the Level II region of the nose was dissected and the epithelial cells were removed with a mixture of proteases. The epithelial cells obtained from this part of the nose consist mainly of transitional epithelium with some respiratory epithelium. Gene expression microarray analysis was performed on the epithelial cells. The systems response profile engine 110 receives transcriptome data from rats exposed to the various doses of formaldehyde for 13 weeks and assembles the data into a systems response profile (SRP). The network modeling engine 112 identifies two networks associated with tumorigenesis: the proliferation network and the inflammation network. For each dose, the network scoring engine 114 evaluates the proliferation and inflammation networks (in particular, the transcriptional behavior predicted by these networks) against the SRP and calculates a network response score for each of the two networks. The aggregation engine 116 then generates a BIF for each dose by averaging the two network response scores, reflecting an assumption that the two network-based mechanisms contribute equally to the outcome of interest (i.e., tumorigenesis).
The prediction/validation engine 122 then compares the BIF value for each dose against the dose-specific tumor incidence taken from the biological literature. This comparison is shown in fig. 13. The results shown in fig. 13 indicate that tumorigenesis first appears above a BIF threshold of 0.4. In some embodiments, the BIF is calibrated against a known or otherwise predicted biological outcome (as represented in fig. 13). In other embodiments, the BIFs are not calibrated, but the BIF values are compared to each other to rank and compare biological outcomes. Initially, these scores are calculated using an intensity algorithm and then confirmed using a geometric perturbation index scoring technique.
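The per-dose averaging and threshold comparison in this example can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, the doses are in arbitrary units, and the network response scores are placeholders that are not taken from GSE23179 or fig. 13. Only the equal-weight averaging of two network scores and the 0.4 threshold come from the text.

```python
from statistics import mean
from typing import Dict, Optional, Tuple

BIF_THRESHOLD = 0.4  # threshold named in the text


def bif_per_dose(network_scores: Dict[float, Tuple[float, float]]) -> Dict[float, float]:
    """Average the two network response scores at each dose (equal weights)."""
    return {dose: mean(pair) for dose, pair in network_scores.items()}


def lowest_dose_above(bifs: Dict[float, float],
                      threshold: float = BIF_THRESHOLD) -> Optional[float]:
    """Return the smallest dose whose BIF exceeds the threshold, if any."""
    above = [dose for dose, value in bifs.items() if value > threshold]
    return min(above) if above else None


# PLACEHOLDER scores keyed by dose in arbitrary units (not experimental data).
# Each tuple is (proliferation score, inflammation score).
placeholder_scores = {
    1.0: (0.1, 0.3),
    2.0: (0.3, 0.4),
    3.0: (0.5, 0.7),
}
bifs = bif_per_dose(placeholder_scores)
first_dose = lowest_dose_above(bifs)
```

In a calibrated setting, the dose returned by lowest_dose_above would be compared against the lowest dose with observed tumor incidence; in an uncalibrated setting, the BIF values would simply be ranked against one another.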
While the disclosure has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention. In accordance with the present invention, the computational model can be used to represent any life science information. Other aspects of the disclosure are set forth in the following paragraphs:
1. a computerized method for determining a biological effect of an agent on a biological system, comprising: receiving, at a network modeling engine, data corresponding to a response of a biological system to an agent, wherein the biological system includes a plurality of biological entities, each biological entity interacting with at least one other biological entity in the biological system; receiving data corresponding to a biological system at a network modeling engine; generating, at a network modeling engine, a plurality of computational models of a portion of a biological system; wherein each computational model comprises nodes representing biological entities and edges representing relationships between the biological entities; generating, at a network scoring engine, at least one first score representing an effect of the agent on the plurality of computational models, and at least one second score representing a computational model of a biological system not exposed to the agent; and generating, at the aggregation engine, an aggregated score representing the biological system.
2. The computerized method of paragraph 1, wherein the data corresponding to the agent comprises a data representation that represents a degree to which one or more biological entities within the biological network have changed as a result of exposure of the biological system to the agent.
3. The computerized method of paragraph 1 or 2, wherein the network modeling engine identifies a biological entity within the biological system that exhibits statistically significant activity in response to the agent based at least in part on data corresponding to at least one of the first agent and the second agent.
4. The computerized method of paragraph 3, wherein the network modeling engine constructs one or more computational models having nodes corresponding to the identified biological entities and edges corresponding to causal connections between one or more of the identified biological entities.
5. The computerized method of any of paragraphs 1-4, wherein one or more of the plurality of computational models comprises one or more directly measurable nodes and the data corresponding to at least one of the first and second agents comprises measurements of one or more biological entities represented by the one or more directly measurable nodes.
6. The computerized method of paragraph 5, wherein the network scoring engine assigns a score to one or more computational models having one or more directly measurable nodes based on numerical values of measurements of the biological entities corresponding to the respective nodes.
7. The computerized method of any of paragraphs 1-6, wherein one or more of the plurality of computational models comprises one or more directly measurable nodes and the data corresponding to at least one of the first agent and the second agent comprises measurements of one or more biological entities causally linked to the one or more directly measurable nodes.
8. The computerized method of paragraph 7, wherein the network scoring engine assigns a score to one or more computational models having one or more directly measurable nodes based on numerical values of measurements of biological entities causally linked to the respective nodes.
9. The computerized method of paragraph 7, wherein the network scoring engine assigns a score to one or more computational models having one or more directly measurable nodes based on a combined value of measurements of biological entities causally linked to each node.
10. The computerized method of any of paragraphs 1-9, further comprising: assigning a weight to each of the plurality of computational models at the aggregation engine based on the effect of the agent on the respective computational model; and generating, at the aggregation engine, a first aggregated score and a second aggregated score by combining the first set of scores and the second set of scores according to the assigned weights, respectively; wherein the relative aggregated score is a function of the first aggregated score and the second aggregated score.
11. The computerized method of paragraph 10, wherein generating the first aggregated score and the second aggregated score comprises applying a geometric computation technique.
12. The computerized method of any of paragraphs 10-11, wherein generating the first aggregated score and the second aggregated score comprises applying a spectral graph computation technique.
13. The computerized method of any of paragraphs 10-12, wherein generating the first aggregated score and the second aggregated score comprises applying a graph optimization computation technique.
14. The computerized method of any of paragraphs 1-13, wherein the at least one first score and the at least one second score comprise vectors, and the step of aggregating further comprises filtering the at least one first score and the at least one second score at an aggregation engine to decompose each of the first score and the second score into a plurality of basis vectors having respective scalar coefficients.
15. The computerized method of any of paragraphs 1-14, wherein filtering further comprises removing at least one of the plurality of basis vectors having a corresponding scalar coefficient.
16. The computerized method of any of paragraphs 1-15, further comprising filtering, at the aggregation engine, the at least one first score and the at least one second score to remove the statistical outliers.
17. The computerized method of any of paragraphs 1-16, further comprising normalizing the at least one first score and the at least one second score at the aggregation engine.
18. The computerized method of any of paragraphs 1-17, further comprising assigning a weight to each of the plurality of computational models at the aggregation engine based on maximizing a difference between the at least one first score and the at least one second score, and generating a relative aggregated score based on the assigned weights at the aggregation engine.
19. A computer system for determining a biological effect of an agent on a biological system, comprising: a network modeling engine to receive data corresponding to a response of the biological system to the agent and data corresponding to the biological system not being exposed to the agent; wherein the biological system comprises a plurality of biological entities, each biological entity interacting with at least one other biological entity; generating a plurality of computational models of a portion of a biological system perturbed by a first agent and a second agent; wherein each computational model comprises nodes representing one or more biological entities and edges representing relationships between the biological entities; a network scoring engine for generating at least one first score representing an effect of the agent on the plurality of computational models, and at least one second score representing a computational model of a biological system not exposed to the agent; and an aggregation engine that generates an aggregated score representing a biological effect of the agent on the biological system.
20. The computer system of paragraph 19, wherein the aggregation engine further comprises: a filtering module for filtering the at least one first score and the at least one second score to generate at least one first filtered score and at least one second filtered score; a network weighting module for assigning a weight to each of a plurality of computational models; and a relative scoring module for generating a relative aggregated score based on the at least one first filtered score and the at least one second filtered score.
21. A computerized method for determining a score representing an effect of an agent on a biological system, comprising: receiving, at a network modeling engine, data corresponding to a response of a biological system to a first agent, wherein the biological system comprises a plurality of biological entities, each biological entity interacting with at least one other biological entity; generating, at a network modeling engine, a plurality of computational models of a portion of a biological system; wherein each computational model comprises nodes representing biological entities and edges representing relationships between the biological entities; generating, at a network scoring engine, an expected response for each node of a plurality of computational models; wherein the expected response is based on exposure to the agent and at least one of nodes and edges of the computational model; receiving data at a network scoring engine; and combining the expected response and the data at the network scoring engine to generate a score representing performance of the computational model on the data.
22. A computerized method for determining a biological effect of a second agent relative to a biological effect of a first agent, comprising: receiving, at a network modeling engine, data representing a response of a biological system to a first agent, wherein the biological system comprises a plurality of biological entities, each biological entity interacting with at least one other biological entity; receiving, at the network modeling engine, data representing a response of the biological system to a second agent; generating, at the network modeling engine, a plurality of computational models of a portion of the biological system; wherein each computational model comprises nodes representing biological entities and edges representing relationships between the biological entities; generating, at a network scoring engine, at least one first score representing an effect of a first agent on a plurality of computational models, and at least one second score representing an effect of a second agent on the plurality of computational models; and generating, at the aggregation engine, a relative aggregated score representing the biological impact of the second agent relative to the biological impact of the first agent based on the at least one first score and the at least one second score.
23. A computer system for determining a biological effect of a second agent relative to a biological effect of a first agent, comprising: a network modeling engine for receiving data representing a response of the biological system to the first agent, and data representing a response of the biological system to the second agent; wherein the biological system comprises a plurality of biological entities, each biological entity interacting with at least one other biological entity; generating a plurality of computational models of a portion of a biological system perturbed by a first agent and a second agent; wherein each computational model comprises nodes representing biological entities and edges representing relationships between the biological entities; a network scoring engine for generating at least one first score representing an effect of a first agent on the plurality of computational models and at least one second score representing an effect of a second agent on the plurality of computational models; and an aggregation engine to generate a relative aggregated score representing the biological impact of the second agent relative to the biological impact of the first agent based on the at least one first score and the at least one second score.
1a. A computerized method for determining the effect of a perturbation on a biological system, comprising:
receiving, at a processor, first data corresponding to responses of a set of biological entities to a first process, wherein the biological system includes a plurality of biological entities comprising a plurality of sets of biological entities, wherein each biological entity in the biological system interacts with at least one other biological entity in the biological system;
receiving, at the processor, second data corresponding to a response of the set of biological entities to a second process, the second process being different from the first process;
providing, at a processor, a plurality of computational causal network models representing a biological system, each computational model comprising nodes representing a plurality of biological entities and edges representing relationships between entities of the plurality of biological entities;
generating, at a processor, a first score representing a perturbation of the biological system based on the first data and the plurality of computational models, and a second score representing a perturbation of the biological system based on the second data and the plurality of computational models; and
generating, at the processor, a biological impact factor representing a biological impact of the perturbation on the biological system based on the first score and the second score.
2a. The computerized method of paragraph 1a, wherein each of the first score and the second score comprises a score vector, and the step of generating the biological impact factor further comprises filtering, at the processor, the first score and the second score to decompose each of the first score and the second score into a plurality of projections on a set of basis vectors, suitably wherein filtering further comprises removing at least one of the plurality of projections from at least one of the decomposed first score and second score.
3a. The computerized method of paragraph 2a, wherein the set of basis vectors includes eigenvectors of a matrix describing the at least one computational model.
4a. The computerized method of any of paragraphs 1a to 3a, wherein generating the first score and the second score comprises: assigning, at the processor, a weight to each of the plurality of computational models based on the respective computational model and at least one of the first data and the second data; generating, at a processor, a plurality of first scores corresponding to a plurality of computational models and based on first data; and generating, at the processor, a plurality of second scores corresponding to the plurality of computational models and based on the second data; combining the plurality of first scores according to the assigned weights; combining the plurality of second scores according to the assigned weights; wherein the biological impact factor is a function of the combined plurality of first scores and the combined plurality of second scores.
5a. The computerized method of paragraph 4a, wherein determining a weight for each of the plurality of computational models comprises selecting a weight for each of the plurality of computational models to maximize a difference between the plurality of first scores and the plurality of second scores.
6a. The computerized method of any of paragraphs 1a to 5a, wherein generating the biological impact factor comprises determining an inner product between a first vector representing the first score and a second vector representing the second score, or wherein generating the biological impact factor comprises determining a distance between a first surface representing the first score and a second surface representing the second score.
7a. The computerized method of any of paragraphs 1a to 6a, wherein the computational causal network model is two or more selected from a cell proliferation network, an inflammatory process network, a cellular stress network, and a DNA damage, autophagy, cell death, and senescence network.
8a. A computer system for determining a biological impact factor, the computer system comprising a processor configured to: receiving first data corresponding to responses of a set of biological entities to a first process, wherein the biological system comprises a plurality of biological entities, the plurality of biological entities comprising a set of biological entities and wherein each biological entity in the biological system interacts with at least one other biological entity in the biological system; receiving second data corresponding to a response of the set of biological entities to a second process, the second process being different from the first process; providing a plurality of computational causal network models representing a biological system, each computational model comprising nodes representing a plurality of biological entities and edges representing relationships between nodes in the plurality of biological entities; generating a first score representing a perturbation of the biological system based on the first data and the plurality of computational models, and generating a second score representing a perturbation of the biological system based on the second data and the plurality of computational models; and generating a biological impact factor based on the first score and the second score.
9a. The computer system of paragraph 8a, wherein each of the first score and the second score comprises a score vector, and wherein the processor is further configured to: filtering the first score and the second score to decompose each of the first score and the second score into a plurality of projections on a set of basis vectors; and removing at least one of the plurality of projections from at least one of the first score and the second score.
10a. The computer system of paragraph 8a or 9a, wherein the set of basis vectors comprises eigenvectors of a matrix describing the at least one computational model.
11a. The computer system of any of paragraphs 8a to 10a, wherein generating the biological impact factor comprises determining an inner product between a first vector representing the first score and a second vector representing the second score.
12a. The computerized method of any of paragraphs 1a to 6a or the computer system of any of paragraphs 8a to 11a, wherein generating the biological impact factor comprises determining a distance between a first surface representing the first score and a second surface representing the second score.
13a. A computerized method according to any of paragraphs 1a to 6a or 12a or a computer system according to any of paragraphs 8a to 12a, wherein the biological system comprises at least one of a cell proliferation mechanism, a cell stress mechanism, a cell inflammation mechanism and a DNA repair mechanism. In one embodiment, the first treatment comprises exposure to an aerosol generated by heating tobacco, exposure to an aerosol generated by burning tobacco, exposure to tobacco smoke, exposure to cigarette smoke, exposure to heterogeneous substances including molecules or entities not present in or derived from the biological system, and exposure to at least one of toxins, therapeutic compounds, stimulants, relaxants, natural products, manufactured products, food substances.
14a. A computer program product comprising program code adapted to perform the method of any of paragraphs 1a to 6a or 12a to 13a.
15a. A computer or computer readable medium comprising the computer program product of paragraph 14a.
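The basis-vector filtering and inner-product comparison recited in the paragraphs above can be illustrated with a minimal Python sketch. It assumes the basis is already available and orthonormal (the paragraphs mention eigenvectors of a matrix describing a computational model as one option; computing them is outside this sketch). The function names and the example vectors are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math
from typing import List, Sequence, Set

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def filter_score(score: Vector, basis: Sequence[Vector], drop: Set[int]) -> List[float]:
    """Project a score vector on an orthonormal basis and rebuild it
    without the projections whose indices are listed in `drop`."""
    kept = [0.0] * len(score)
    for index, basis_vector in enumerate(basis):
        if index in drop:
            continue  # this projection is removed by the filter
        coefficient = dot(score, basis_vector)
        kept = [k + coefficient * x for k, x in zip(kept, basis_vector)]
    return kept


def angle_between(u: Vector, v: Vector) -> float:
    """Angle in radians derived from the inner product of u and v."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("angle is undefined for a zero vector")
    cosine = max(-1.0, min(1.0, dot(u, v) / norm))  # guard against rounding
    return math.acos(cosine)


# Hypothetical example: standard basis in three dimensions, third projection dropped.
basis = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
first_filtered = filter_score((1.0, 0.0, 5.0), basis, drop={2})
second_filtered = filter_score((0.0, 1.0, 5.0), basis, drop={2})
angle = angle_between(first_filtered, second_filtered)
```

Here the shared third component dominates both raw score vectors; once that projection is removed, the two filtered vectors are orthogonal, which shows how the choice of projections to drop can change the resulting comparison.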
While implementations of the invention have been particularly shown and described with reference to specific examples, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. The scope of the invention is thus indicated by the appended claims, and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. All publications mentioned in the above specification are herein incorporated by reference in their entirety.

Claims (31)

1. A computerized method for determining the effect of a perturbation on a biological system, comprising:
receiving, at a processor, a first data set corresponding to a response of a biological system to a first process, wherein the biological system comprises a plurality of biological entities, wherein each biological entity in the biological system interacts with at least one other biological entity in the biological system;
receiving, at the processor, a second data set corresponding to a response of the biological system to a second process, the second process being different from the first process;
providing, at a processor, a plurality of computational network models representing biological systems, each model comprising nodes representing a plurality of biological entities and edges representing relationships between the nodes in the model;
generating, at a processor, a first set of scores representing a perturbation of the biological system based on the first data set and the plurality of models, and a second set of scores representing a perturbation of the biological system based on the second data set and the plurality of computational models; and
generating, at the processor, a numerical biological impact factor having a single scalar value, the biological impact factor being based on each of the first set of scores and the second set of scores and representing an overall biological impact of the perturbation on the biological system.
2. The method of claim 1, wherein a number of more than two data sets are received and a corresponding number of score sets are generated.
3. The method of claim 1, wherein a biological impact factor is generated for each treatment.
4. The method of claim 1, wherein at least one data set comprises process data and corresponding control data.
5. The method of claim 1, wherein at least one of the plurality of networks is a causal network.
6. The method of claim 1, wherein the score within each set of scores is independently calculated by a geometric perturbation index scoring technique, a probabilistic perturbation index scoring technique, or an expected perturbation index scoring technique.
7. The method of claim 1, wherein each score in the first set of scores and the second set of scores comprises a score vector, and the step of generating the biological impact factor further comprises filtering, at the processor, the first score and the second score to decompose each of the first score and the second score into a plurality of projections on the set of basis vectors.
8. The method of claim 7, wherein filtering further comprises removing at least one of the plurality of projections from at least one of the decomposed first and second scores.
9. The method of claim 7, wherein the set of basis vectors includes eigenvectors of a matrix describing the at least one model.
10. The method of claim 1, wherein generating a first set of scores and a second set of scores comprises:
assigning, at the processor, a weight to each score in the first set of scores and the second set of scores based on the respective computational network model and at least one of the first set of data and the second set of data;
aggregating the weighted scores in the first set of scores;
aggregating the weighted scores in the second set of scores;
wherein the one or more biological impact factors are a function of the aggregated scores of the first set of scores and the second set of scores.
11. The method of claim 1, wherein the biological impact factor is a linear combination, linear transformation, or quadratic functional form of the aggregated scores of the first set of scores and the second set of scores.
12. The method of claim 10, wherein assigning weights to each of the first set of scores and the second set of scores comprises selecting weights for each of a plurality of computational models to maximize a difference between the scores within the first set of scores and the scores within the second set of scores.
13. The method of claim 1, wherein generating a biological impact factor comprises determining an inner product between a first vector representing the aggregated score for a first set of scores and a second vector representing the aggregated score for a second set of scores, wherein the biological impact factor is an angle associated with the inner product.
14. The method of claim 1, wherein generating a biological impact factor comprises determining a distance between a first surface defined by a first vector representing a first set of variables and a second surface defined by a second vector representing a second set of variables.
15. The method of any one of the preceding claims, wherein the computational network models are two or more computational network models selected from a cell proliferation network, an inflammatory process network, a cellular stress network, and a DNA damage, autophagy, cell death, and senescence network.
16. The method of claim 1, wherein the biological system comprises at least one of a cell proliferation mechanism, a cell stress mechanism, a cell inflammation mechanism, and a DNA repair mechanism.
17. The method of claim 1, wherein the first process comprises at least one of: exposure to an aerosol generated by heating tobacco, exposure to an aerosol generated by burning tobacco, exposure to tobacco smoke, exposure to cigarette smoke, exposure to heterogeneous substances including molecules or entities not present in or derived from the biological system, exposure to at least one of toxins, therapeutic compounds, stimulants, relaxants, natural products, manufactured products, food substances, and exposure to one or more of cadmium, mercury, chromium, nicotine, tobacco specific nitrosamines, and metabolites of tobacco specific nitrosamines.
18. A method according to claim 17, wherein the metabolites of tobacco specific nitrosamines include 4-(methylnitrosamino)-1-(3-pyridyl)-1-butanone (NNK), N'-nitrosonornicotine (NNN), N-nitrosoanatabine (NAT), N-nitrosoanabasine (NAB), and 4-(methylnitrosamino)-1-(3-pyridyl)-1-butanol (NNAL).
19. The method of claim 1, further comprising:
comparing the biological impact factor to one or more additional biological impact factors that have been obtained in the absence of a perturbation or in the presence of a different perturbation;
wherein the comparison indicates a biological effect of the perturbation on the biological system.
20. The method of claim 1, wherein the biological impact factor represents, or is used to estimate or determine, the magnitude of a desirable or adverse biological impact caused by a pathogen, a hazardous substance, a manufactured product (for safety assessment or risk-of-use comparison), a therapeutic compound, or a change in the environment or an environmentally active substance.
21. The method of claim 19, wherein more than two different perturbations are used to compare the effect of the different perturbations on the biological system.
22. The method of claim 19, wherein the one or more perturbations represent at least two different treatment conditions.
23. The method of claim 22, wherein at least one treatment condition comprises at least one of: exposure to an aerosol generated by heating tobacco, exposure to an aerosol generated by burning tobacco, exposure to tobacco smoke, exposure to cigarette smoke, exposure to heterogeneous substances including molecules or entities not present in or derived from the biological system, and exposure to at least one of toxins, therapeutic compounds, stimulants, relaxants, natural products, manufactured products, food substances.
24. The method of claim 1, wherein the perturbation is caused by one or more agents.
25. The method of claim 24, wherein the one or more agents are selected from the group consisting of: an aerosol generated by heating tobacco, an aerosol generated by burning tobacco, tobacco smoke, cigarette smoke, and any gaseous or particulate component thereof, cadmium, mercury, chromium, nicotine, tobacco specific nitrosamines, metabolites of tobacco specific nitrosamines, or combinations of one or more of the foregoing.
26. A method according to claim 25, wherein the metabolites of tobacco specific nitrosamines include 4-(methylnitrosamino)-1-(3-pyridyl)-1-butanone (NNK), N'-nitrosonornicotine (NNN), N-nitrosoanatabine (NAT), N-nitrosoanabasine (NAB), and 4-(methylnitrosamino)-1-(3-pyridyl)-1-butanol (NNAL).
27. The method of claim 1, wherein the biological impact factor is compared to known biological outcomes to calibrate the value of the biological impact factor.
28. A computer system for determining a biological impact factor, the computer system comprising a processor configured to:
receive first data corresponding to a response of a set of biological entities to a first process, wherein a biological system comprises a plurality of biological entities including the set of biological entities, and wherein each biological entity in the biological system interacts with at least one other biological entity in the biological system;
receive second data corresponding to a response of the set of biological entities to a second process, the second process being different from the first process;
provide a plurality of computational causal network models representing the biological system, each computational model comprising nodes representing the plurality of biological entities and edges representing relationships between the nodes;
generate a first score representing a perturbation of the biological system based on the first data and the plurality of computational models, and generate a second score representing a perturbation of the biological system based on the second data and the plurality of computational models; and
generate a numerical biological impact factor having a single scalar value, the biological impact factor being based on the first score and the second score and representing an overall biological impact of the perturbation on the biological system.
29. The computer system of claim 28, wherein each of the first score and the second score comprises a score vector, and wherein the processor is further configured to:
filter the first score and the second score by decomposing each of the first score and the second score into a plurality of projections on a set of basis vectors; and
remove at least one of the plurality of projections from at least one of the first score and the second score.
30. The computer system of claim 28, wherein the set of basis vectors comprises eigenvectors of a matrix describing the at least one computational model, or wherein generating the biological impact factor comprises determining an inner product between a first vector representing the first score and a second vector representing the second score.
31. The computer system of claim 28, wherein generating the biological impact factor comprises determining a distance between a first surface representing the first score and a second surface representing the second score.
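
The filtering and inner-product steps recited in claims 29 and 30 can be illustrated with a minimal sketch. It is not the claimed implementation: the function names, the two-node network, the example score vectors, and the choice of which projection to drop are all hypothetical. The basis is assumed to be orthonormal and is taken here to be the eigenvectors of the Laplacian matrix [[1, -1], [-1, 1]] of a two-node network, which is only one possible "matrix describing the computational model".

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def decompose(score, basis):
    """Coefficients of `score` along each vector of an orthonormal basis."""
    return [dot(score, b) for b in basis]

def filter_score(score, basis, drop):
    """Rebuild `score` from its projections, skipping basis indices in `drop`."""
    coeffs = decompose(score, basis)
    kept = [0.0] * len(score)
    for i, (c, b) in enumerate(zip(coeffs, basis)):
        if i in drop:
            continue  # this projection is removed from the score
        kept = [k + c * x for k, x in zip(kept, b)]
    return kept

def impact_factor(first, second):
    """Single scalar value: inner product of the two score vectors."""
    return dot(first, second)

# Hypothetical two-node network: orthonormal eigenvectors of its Laplacian
# [[1, -1], [-1, 1]] (eigenvalues 0 and 2, in that order).
s = 1 / math.sqrt(2)
basis = [[s, s], [s, -s]]

# Hypothetical score vectors for the first and second process.
first_score = [3.0, 1.0]
second_score = [2.0, 0.0]

# Assumption for illustration: drop the projection on the second eigenvector.
f1 = filter_score(first_score, basis, drop={1})
f2 = filter_score(second_score, basis, drop={1})

bif_unfiltered = impact_factor(first_score, second_score)
bif_filtered = impact_factor(f1, f2)
```

In this sketch, removing a projection changes the resulting scalar, which shows why the choice of retained basis vectors matters. For larger networks the eigenvectors would normally be computed numerically rather than written out by hand.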
HK14110893.1A 2011-06-10 2012-06-11 Systems and methods for quantifying the impact of biological perturbations HK1197483B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US61/495,824 2011-06-10
US61/525,700 2011-08-19
EP11195417.8 2011-12-22

Publications (2)

Publication Number Publication Date
HK1197483A (en) 2015-01-16
HK1197483B (en) 2018-05-18
