WO2012122516A1 - System and method for converting large data sets and other information into analysis observations to reveal a complex relationship - Google Patents
System and method for converting large data sets and other information into analysis observations to reveal a complex relationship
- Publication number
- WO2012122516A1 (PCT/US2012/028589)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- observations
- analysis
- population
- present
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/06—Asset management; Financial planning or analysis
Definitions
- the present invention relates to analyzing data, and more specifically to a system and method for analyzing data sets to reveal complex relationship linkages within the data sets.
- the present invention is a system and method for analyzing data sets to reveal complex relationship linkages within the data sets.
- the present invention can be used for data analysis and mining in connection with pharmaceutical drug development analysis, diagnostic tools for healthcare professionals, patient information and decision support, related services, and/or consumer downloadable software associated with such services.
- the present invention is typically a stand-alone platform that can be used in conjunction with an associative memory engine such as SaffronSierra, or any other platform offering similar functionality, such as a natural intelligence platform that includes an associative memory engine as well as other functionality.
- references to specific software tools and platforms herein such as FreeLing, Saffron, SaffronSierra, and any others are by way of example only and are not intended to limit operation of the present invention solely to such software tools and platforms; other software applications, platforms, and tools offering similar features and functionality can also be used in connection with the present invention.
- FIG. 1 is a block diagram providing a high level overview of the various components used in connection with the present invention.
- FIG. 2 is a block diagram showing another perspective on the high-level functionality of the present invention.
- FIG. 3 is a block diagram showing another perspective of the present invention.
- FIG. 4 is a graphic depiction of one example of the hierarchy used in hierarchical grammar text.
- FIG. 5 shows a depiction of the initial properties known at the beginning of a clinical drug trial according to one aspect of the present invention.
- FIG. 6 shows an example of an interface according to one aspect of the present invention.
- the present invention is a method of processing large data sets and other information - both numerical and textual.
- the data sets are typically processed into many observations, which are classified into initial properties, outcomes, and outside influencers.
- Numerical data is converted to observations, and then analyzed for affinity. Linkages are found between and amongst the initial properties and the outcomes that correlate to outside influencers.
- Textual data is analyzed and facts extracted to discover connections between numerical data observations and other textual data facts.
- the present invention can be used in connection with data gathered or generated in connection with clinical drug trials.
- the present invention is not limited to use in connection with clinical drug trials and can be used in connection with any other application in which large data sets have been collected or prepared for analysis.
- new facts are learned from analysis of the numerical data, which may be applied during further analysis. In so analyzing, complex linkages are revealed that may not have been uncovered using conventional data analysis techniques.
- the block diagram shown in Fig. 1 provides a high-level overview of the various components, or modules, used in connection with the present invention. As would be understood by one skilled in the art, the modules are described from a functional perspective and can be implemented across one or more servers as appropriate and can be implemented in a variety of different ways without departing from the scope of the present invention.
- the Data Preparation module 1100 (identified as the "AMEDataPrep") takes data in any of a number of file formats and converts it into standard file formats.
- the Data Acquisition module 1200 takes normalized database tables and converts them into observations suitable for input to an associative memory engine (“AME”) 1300 or any other platform offering similar functionality.
- AME: associative memory engine
- the Data Space Manager module 1400 orchestrates the process of data acquisition, AME ingestion, data mining, and query/analysis.
- the Query/Analysis module 1500 uses information provided by the Space Manager 1400 (such as model orientation information) and interacts with the associative memory engine 1300 to search and analyze the clinical trial data (or other data as desired) therein.
- the block diagram shown in Fig. 2 offers another perspective on the high-level functionality of the present invention. From the left of the above diagram, in one example, External Public Data 2100 is available as drug study clinical trial data as well as documents (PDF, web pages, etc.) related to drug trials.
- Pharma Internal Data 2200 is client-specific drug study clinical trial data as well as documents that are kept separate from other clients' drug study data.
- the Emphasize box 2300 is where all gathered data is converted into a standard format. The two target formats are CSV (text spreadsheet) and plain text.
- Numerical data is converted from SAS, Microsoft Excel, and other numerical formats into CSV format. Textual data is converted from PDF, HTML, and other text formats into plain text format.
- the Preprocessing stage 2400 begins, where data is categorized and elaborated into observations for ingestion into an associative memory engine. Numerical data is categorized from a raw form into category and value observations. For example, a weight of 127.5 pounds may translate into "weight: 120-130". Textual data is placed in context. Text such as "increased weight" or other English variations may translate to "sideeffect:weight gain".
- the Exploratory box 2500 represents research with the Query/Analysis tool in conjunction with the associative memory engine. Data affinity of outcomes to initial properties and outside influencers is determined. Amongst text, facts are connected and related observations discovered.
- the Refinement box 2600 is a de-correlation step that distills the affinity and discovery information so that key data relationships stand out. This step also ties the uncovered information back to the original data for integrity and validation purposes.
- the Results box 2700 represents the resulting reports for client viewing.
- the clockwise rotating arrows represent the iterative process: When results are found, it becomes clear that more detail is needed in certain places, so pre-processing is revisited to improve the quality of the observations fed to an associative memory engine.
- the diagram shown in Fig. 3 offers another perspective of the present invention and is conceptually the same as the previous diagram (the RedOak Ingestion engine 3100 is the Emphasize 2300 and Preprocessing 2400 boxes; the AME Engine 3200 is the Exploratory box 2500; the De-correlation Engine 3300 is the Refinement box 2600; the Navigation platform 3400 is the Exploratory 2500, Refinement 2600, and Results 2700 boxes) except that the Learning concept is now introduced.
- the present invention is able to characterize positive and adverse outcomes, based upon initial properties and outside influencers (drugs, biologics, and devices). It learns from this data in order to predict outcomes of clinical trial subjects that are not yet in the system.
- the Data Preparation 1100 module takes data in any of a number of file formats and converts it into standard file formats. Data is typically numerical or textual. Once converted, it is normalized for comparison to other data.
- Numerical data can be provided to the present invention in many different formats, such as MS Excel, SAS7, CSV, SQL, tabular, DBF, 123, HTML, proprietary formats, and any other suitable numerical data format.
- Data inside these file formats is converted to a standard CSV (Comma Separated Values) file format. Once in CSV format, the data is organized into normalized database tables; the CSV data is readily uploaded into these SQL database tables.
- CSV: Comma-Separated Values
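As a rough sketch of this upload step, assuming SQLite as the SQL backend and an illustrative table and column layout (the patent does not specify a schema), CSV data can be loaded into a database table like this:

```python
import csv
import io
import sqlite3

def load_csv_to_table(conn, table, csv_text):
    """Create a table from a CSV header row and upload the data rows.

    A minimal sketch of the CSV-to-SQL step; all columns are stored as
    TEXT for simplicity.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    cols = ", ".join(f'"{c}" TEXT' for c in header)
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)
    conn.commit()

conn = sqlite3.connect(":memory:")
load_csv_to_table(conn, "subjects", "subject_id,weight\nS001,127.5\nS002,132.54")
```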
- Textual data can be provided to the present invention in many forms as well, such as PDF, HTML, plain text, MSWord, proprietary formats, and any other suitable text data format. Data inside these file formats are converted to a standard plain text file format.
- HGT: Hierarchical Grammar Text
- NLP: Natural Language Processing
- an HGT format is used for data processing.
- the HGT utilities are preferably English text processing utilities that interpret English based upon NLP grammar mark-up and context. The utilities are so named because they read, process, and annotate HGT (Hierarchical Grammar Text) files, which are output by an NLP utility, specifically FreeLing or any other suitable NLP tool.
- An HGT file contains one or more English sentences labeled with grammar parts of speech and hierarchically structured to show which words relate or modify which other words.
- An HGT file further contains annotations per word. The HGT utilities according to the present invention add those annotations.
- HGT annotations are added to designate data categories.
- When processing a medical file, the hgtElaborate utility will annotate the word "schizophrenia" with the annotation "diagnosis:schizophrenia". It will also find a phrase such as "weight gain" and annotate it with "sideeffect:weight gain".
- When processing an HGT file, the hgtNaming utility will identify pronouns and short name references, which are subsequently annotated with the individual's full name (as identified in the text).
- the HGT utilities work together to create a meaningful grouping of annotations, which then produce sentence-specific observations or facts. These observations or facts are input into an associative memory engine or Discovery Query/Analysis, which further enhances clinical trial research.
- the first line is the root of the hierarchy; the original text was "gained”; the root verb is "gain”; the part of speech, VBD, indicates the verb was used in past tense.
- the hierarchy can be viewed graphically shown in Fig. 4.
- the technology used for Data Preparation is typically individual tools such as special purpose command line utilities.
- various tools are used to convert numerical data into CSV format:
- csvdatefix converts the data under named columns into the standard database date format.
- these utilities are used to declare/upload text documents to the Document Library, where the document is then converted to plain text, then HGT, then marked with attributes before facts are extracted and uploaded to an associative memory engine:
- the Data Acquisition module 1200 takes the normalized database tables generated in the Data Preparation 1100 module and converts them into observations suitable for input to an associative memory engine 1300. This is conventionally known as ETL (Extract, Transform, and Load). In this case, Data Acquisition involves elaboration, categorization, and normalization. This process varies per client engagement or per clinical trial.
- Data elaboration involves data processing to create new data. This may be as simple as taking a date and calculating the years from that date to today. Or this may be more involved, such as calculating the stock market activity between two particular dates.
- Data categorization is the process of converting data into an observation for comparison. Whereas a data set may contain many unique numbers - such as 132.54 - categorization places a number into a category, such as "130-135", which is readily compared to other similar numbers. Categorization may cross-reference other data; for example, a specific date can be categorized by the phase of the moon on that date, e.g. "first quarter". The primary purpose of data categorization is to place data into discrete, finite buckets (or categories) for reasonable comparison.
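The bucketing step described above can be sketched as follows; the bucket width and the "weight" category label are illustrative choices, not prescribed by the text:

```python
def categorize_number(value, width=5):
    """Place a raw number into a discrete, finite bucket such as "130-135"."""
    lo = int(value // width) * width
    return f"{lo}-{lo + width}"

def observation(category, value, width=5):
    """Produce a category/value observation string, e.g. "weight: 130-135"."""
    return f"{category}: {categorize_number(value, width)}"
```

With a bucket width of 5, the number 132.54 lands in "130-135" as in the example above; a width of 10 reproduces the "weight: 120-130" observation for 127.5 pounds.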
- data categorization is formulaic, such as placing the number 132.54 into the category "130-135" or converting a date to a moon phase category.
- some data categorization is heuristic, such as time series characterization and date range categorization. Still other data categorization is textual (not numeric), such as context categorization.
- Given a time series of data (a set of (xi, ti) data points), the present invention will typically characterize the variability of xi into categories with values (observations).
- the dataprop (data properties) utility, written to perform this categorization, takes CSV data on input with one column designated for time (ti) and another for values (xi). Five category observations are typically produced: trend, timing, trending, fluctuation, and delta.
- the trend category may have a value of increased, decreased, or unchanged.
- the timing category may have a value of early, late, even, or burst, based upon when xi crosses the midpoint of change, (xn - x0)/2: if xi crosses the midpoint of change (and stays on that side) in the first half of the series (before (tn - t0)/2), then it is early. If xi crosses the midpoint (and stays on that side) in the second half of the series (after (tn - t0)/2), then it is late. If xi crosses the midpoint (and stays on that side) around the middle of the series (around (tn - t0)/2), then it is even. If xi crosses the midpoint but bounces back, or if xi moves beyond xn (too high or too low), then it is burst.
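The timing rules can be sketched in code. The tolerance used to decide "around the middle" is an assumption (the text does not specify one), as is treating a crossing that never settles as burst:

```python
def timing_category(xs, ts, tol_frac=0.1):
    """Classify when a series crosses the midpoint of change.

    Returns "early", "late", "even", or "burst". tol_frac (assumed 10%
    of the series duration) defines the "around the middle" window.
    """
    x0, xn = xs[0], xs[-1]
    mid = x0 + (xn - x0) / 2.0
    rising = xn >= x0
    # burst: the series overshoots beyond xn (too high or too low)
    if (rising and max(xs) > xn) or (not rising and min(xs) < xn):
        return "burst"
    past = lambda x: x >= mid if rising else x <= mid
    # first index from which the series stays on the far side of the midpoint
    cross = next((i for i in range(len(xs)) if all(past(y) for y in xs[i:])), None)
    if cross is None:
        return "burst"  # crossed the midpoint but bounced back
    half = (ts[-1] - ts[0]) / 2.0
    offset = ts[cross] - ts[0]
    if abs(offset - half) <= tol_frac * (ts[-1] - ts[0]):
        return "even"
    return "early" if offset < half else "late"
```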
- the trending category may have a value of inconsistent, consistent, or sudden.
- If the sign of the difference xi+1 - xi remains constant (always upward (+) or always downward (-)), then it is consistent.
- If the sign of that difference is sometimes positive and sometimes negative, then it is inconsistent.
- If that difference is mostly zero (within tolerance) except for an occasional difference, then it is sudden.
- the delta value is expressed as a percentage like the fluctuation value, but the percentage is calculated simply as (xn - x0)/x0. Note that the delta value may be negative, e.g. -1%, -2%, -4%, etc.
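A minimal sketch of the trending and delta observations; the tolerance parameter for "mostly zero" steps is an assumption:

```python
def trending_category(xs, tolerance=0.0):
    """Classify step-to-step movement as consistent, inconsistent, or sudden."""
    diffs = [b - a for a, b in zip(xs, xs[1:])]
    signs = {(1 if d > tolerance else -1 if d < -tolerance else 0) for d in diffs}
    nonzero = signs - {0}
    if len(nonzero) == 2:
        return "inconsistent"   # both upward and downward steps occur
    if 0 in signs and len(nonzero) == 1:
        return "sudden"         # mostly flat with an occasional jump
    return "consistent"         # always upward or always downward

def delta_percent(xs):
    """Overall change as a percentage: (xn - x0) / x0. May be negative."""
    return round(100.0 * (xs[-1] - xs[0]) / xs[0], 1)
```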
- Date range categorization is the method of taking two related dates as a date range and extracting observations from a time series for that date range.
- the purpose of the date range and the nature of the observations do not have to be related.
- the date range may be determined by a clinical trial start date for a study subject whereas the time series may be stock market data, which is known inside the date range.
- the date range defines which stock market data to use as a time series.
- the Data Categorization algorithm explained above is then applied to the time series to extract observations such as "stock market increased during subject study date range".
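The date range step can be sketched as follows, assuming a simple (date, value) series and a reduced increased/decreased/unchanged trend rule standing in for the full categorization algorithm:

```python
from datetime import date

def date_range_observation(series, start, end, label="stock market"):
    """Restrict a (date, value) time series to a date range and extract a
    trend observation for it. The trend rule here is a simplified stand-in
    for the full Data Categorization algorithm."""
    window = [v for d, v in series if start <= d <= end]
    if len(window) < 2:
        return None
    if window[-1] > window[0]:
        trend = "increased"
    elif window[-1] < window[0]:
        trend = "decreased"
    else:
        trend = "unchanged"
    return f"{label} {trend} during subject study date range"

# Hypothetical market data; the study date range selects which points to use.
series = [(date(2010, 1, 1), 100), (date(2010, 2, 1), 105),
          (date(2010, 3, 1), 110), (date(2010, 6, 1), 90)]
```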
- Context categorization determines a word's use or meaning based upon its context.
- a pile of marbles gaining weight because more marbles are being added to a bucket is not considered a side effect; however, a study subject who gains weight while taking a study drug would consider weight gain a side effect.
- Context categorization is the two-step process of (1) choosing vocabulary lists of words (or phrases) that relate to the context of the text to analyze and (2) analyzing that text and tagging it with context tags, such as "side effect", "finding", or "diagnosis".
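A minimal sketch of the two-step process; the vocabulary lists here are hypothetical examples, not the actual lists a trial would use:

```python
# Step 1: hypothetical vocabulary lists chosen for the context of the text.
VOCABULARY = {
    "sideeffect": ["weight gain", "nausea", "headache"],
    "diagnosis": ["schizophrenia", "diabetes"],
}

def context_tags(text, vocabulary=VOCABULARY):
    """Step 2: analyze the text and tag it with context tags by matching
    vocabulary phrases (a simple substring match for illustration)."""
    lowered = text.lower()
    return [f"{tag}:{phrase}"
            for tag, phrases in vocabulary.items()
            for phrase in phrases
            if phrase in lowered]
```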
- Data normalization is the process of adjusting data so it can be compared to other possibly related data. This may be a simple mathematical conversion, such as inches to centimeters, or this may be more involved, such as rescaling and adjusting data from one testing score to another similar but standard score. Data normalization is critical for "apples to apples” comparison rather than “apples to oranges” comparison.
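Both kinds of normalization can be sketched briefly; the linear rescaling formula is an illustrative choice, since the text does not specify how scores are adjusted between scales:

```python
def inches_to_cm(value):
    """Simple mathematical unit conversion."""
    return value * 2.54

def rescale_score(x, old_min, old_max, new_min, new_max):
    """Linearly rescale a score from one range onto a standard range so
    that "apples to apples" comparisons are possible."""
    return new_min + (x - old_min) * (new_max - new_min) / (old_max - old_min)
```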
- Textual normalization is typically achieved by using hgtElaborate with configuration files that recognize various English text patterns to produce a common observation. For example:
- the left side identifies an HGT pattern and the right side is the attribute to assert when the pattern matches.
- Data acquisition according to another aspect of the present invention is specific to a set of data, such as a specific clinical drug trial.
- data acquisition will be highly similar as well.
- the global repository contains a voluminous variety of clinical trial data and textual documents that are publicly available from the FDA and other sites.
- the clinical data is converted to CSV data as previously described.
- the textual data is converted to HGT and processed as previously described.
- the technology used for Data Acquisition is web-based, whether PHP (Hypertext Preprocessor) scripts or web services, invoked from the Data Space Manager.
- the Data Space Manager module 1400 typically orchestrates the process of data acquisition, associative memory engine ingestion, data mining, and query/analysis.
- the Data Space Manager module 1400 invokes Data Acquisition code modules to generate a set of observations that are properly categorized and normalized for ingestion into an associative memory engine. The generated observations are stored for later use.
- the same interface is used to query schema and model orientation information. Schema information identifies types of observations to leverage the associative memory engine technology.
- model orientation information is used to enhance and make meaningful the query/analysis.
- the model referenced here is a Solution Model where observation data is typically characterized as (1) an initial property, (2) an outcome (which is further identified as desirable or undesirable), and (3) an outside influencer (or agent of change).
- a Solution Model typically provides a context in which to evaluate a large data set of observations where change is present.
- each item of data (or observation) is typically placed into one of four groups: initial property, outcome, agent (or outside influence), and reference.
- an initial property is a state known from the initial state. This may also include information that is not known until later, but is fixed and therefore not subject to change.
- the outcome is an observation of change; it is not the initial property simply restated.
- the agent is an observation or presence of a fact that may contribute to change. An agent may or may not actually relate to change in outcome.
- the reference is not used for observation comparison, but is used to tie observations to individuals or components in the system.
- Fig. 5 shows Initial Properties 5100 known at the beginning of a clinical drug trial. These may be tested, measured, or observed facts, with more detail preferred over less. During the course of a clinical drug trial, the subject may be further tested, measured, or observed. Changes over time in any of those tests, measurements, or observations are identified as Outcomes 5200. Outside Influences 5300 are anything present during the study period that was not present before the study and that may, in any way, influence outcomes. The present invention seeks to show, given a set of initial properties and outside influences, which of those outside influences affect outcomes, perhaps characterized by (tied to) certain initial properties.
- An outcome is not simply a category such as "weight change"; an outcome includes instances of that category, such as increased, decreased, or unchanged. Every outcome instance is judged and labeled as desired, undesired, or neutral. This labeling is important to improve reporting. By knowing which outcomes are desired and which are not, the Query/Analysis software groups and judges outcomes.
- the Space Manager 1400 handles AME ingestion by uploading the observations using an API offering suitable functionality. For example, it uses a PHP API which in turn uses a web services REST API.
- the Space Manager can be used for data mining since it stores all the observational data. Data mining is useful for verifying data integrity.
- the Space Manager 1400 interfaces directly to the Query/Analysis module.
- the Query/Analysis module 1500 uses information provided by the Space Manager 1400 (such as model orientation information) and interacts with the natural intelligence platform technology to search and analyze the clinical trial data therein.
- information provided by the Space Manager 1400 such as model orientation information
- One of two algorithms is used: affinity or discovery. Affinity analysis enables learning patterns so predictions can be made on new clinical trial data.
- In an Affinity Query/Analysis, large amounts of potentially contradictory data are analyzed to reveal trends where certain initial data shows an affinity to certain outcomes when a particular outside influencer is present.
- a noise reduction algorithm is typically used in connection with this Affinity Query/Analysis. The following description provides an example of such an algorithm.
- Data from database tables of clinical trial patient information is uploaded to the natural intelligence platform, a network database with strong associative analysis capabilities.
- Data characteristics are divided into three groups: initial properties, outside influencers, and outcomes.
- the outcomes group is further divided into two sub-groups: desirable outcomes and undesirable outcomes. All data is tied to key indexes (or pivot points) such as patient id and site id. External data is gathered insofar as it relates to patients or sites and uploaded as more characteristics tied to the key indexes.
- Population D: the set of all patients that have key characteristics (Q) to examine.
- Population A: the subset of Population D that additionally associates to an outside influence characteristic.
- Population B: the subset of Population D that additionally associates to a control characteristic.
- each patient id is associated to all information that relates to that patient.
- Some information - such as initial weight - is common between multiple patients. This is expected and desirable to find common groupings of patients.
- the Noise Reduction Algorithm compares two populations with respect to some relevant data. For example: hair color. When comparing the two populations, is one hair color more prevalent than another? Or is it all evenly distributed?
- one or more relevant items of data are chosen to examine. This may be a single item or a list of items (such as a list of desirable outcomes). Given this relevant data, a population of patients can be found. This is Population D. This may be something such as "all red-heads" or "all red-heads scoring high".
- Population D's data set is then reduced to a subset of patients associated with a specific outside influence, called Population A.
- An example outside influence is a 30mg dose of study drug.
- Population D is then reduced to a different subset containing only patients in the control group, called Population B.
- Although it may be interesting to compare Population A to Population B, they are not immediately compared. The data is pivoted first. To pivot Population A is to find the strong characteristic affinities within the data set that is Population A. By definition, Population A will have a strong affinity to the characteristics Q from above. The process of pivoting reveals other characteristics with strong affinity in Population A. The resulting Population Ap includes information about how often each characteristic (aside from Q) occurs, NAi.
- Population B is pivoted to produce Population Bp, which includes information about how often each characteristic (aside from Q) occurs, NBi.
- a characteristic is known through two labels - a category and a value.
- a category may be something such as “hair color” and a value is a variation within the category such as “red head”, “blond” or “brunette”. We are interested in comparing characteristic values within the same category.
- Population Ap: Population A, pivoted, now including affinity counts, NAi, per characteristic.
- Population Bp: Population B, pivoted, now including affinity counts, NBi, per characteristic.
- each characteristic in one population is compared to the other population for magnitude, as a percentage within the category being compared within a population. For example, if the NAi counts within the category "hair color" are 50 for "red head", 22 for "blond", 23 for "black", and 5 for "brunette", then the percentage values are straightforward to calculate: respectively 50%, 22%, 23%, and 5%. These percentages are PAi. Similarly, if the NBi counts are 333 for "red head", 333 for "blond", 333 for "black", and 1 for "brunette", then the percentage values, PBi, are similarly straightforward to calculate: 33.3%, 33.3%, 33.3%, and 0.1%.
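The percentage calculation can be sketched directly from the counts:

```python
def category_percentages(counts):
    """Convert per-value affinity counts within one category into
    percentages of that category's total."""
    total = sum(counts.values())
    return {value: 100.0 * n / total for value, n in counts.items()}
```

Applied to the hair-color counts above, this reproduces the PAi and PBi percentages.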
- The magnitude Mi (the ratio PAi / PBi) is interpreted as follows:
- Mi ≈ 1: PAi is comparable (roughly equal) to PBi.
- Mi ≈ 2: PAi occurs twice as often as PBi.
- Mi >> 1: PAi occurs many more times as often as PBi.
- Mi ≈ 1/2: PAi occurs half as often as PBi; that is, PBi occurs twice as often as PAi.
- Mi << 1: PAi occurs only fractionally as often as PBi; that is, PBi occurs many more times as often as PAi.
- Magnitude Refinement: Although the above magnitude calculation may return the correct numerical result, it may not be reasonable. For example, using the data from above, the Mi value for "brunette" would be 5% / 0.1%, or 50. It does not seem reasonable to say that "brunette" appears 50 times more often in Population A than in Population B, especially since Population A only has 5 brunettes total (NAi is 5). So, the magnitude Mi is adjusted using these refinements:
- 1. If Mi is greater than PAi and PAi is greater than PBi, then limit Mi to PAi. For example, as above, if there are only 5 brunettes, then reduce Mi down from 50 to 5, meaning brunettes appear only 5 times as often in Population A as in Population B.
- 2. If 1/Mi is smaller than 1/PBi and PBi is greater than PAi, then limit Mi to 1/PBi. This is the complementary condition to (1) above, where Population B has only 5 brunettes.
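A sketch of the magnitude calculation with the refinement caps. Note the second refinement is interpreted here as the mirror image of the first, since the original wording is ambiguous at that point:

```python
def magnitude(pa, pb):
    """Mi = PAi / PBi with the refinement caps.

    pa and pb are percentages (5 means 5%). The second refinement is
    interpreted as the mirror image of the first, an assumption where
    the source text is ambiguous.
    """
    m = pa / pb
    if m > pa > pb:
        m = float(pa)            # refinement 1: cap the ratio at PAi
    elif pb > pa and (1.0 / m) > pb:
        m = 1.0 / pb             # refinement 2 (interpreted): mirror cap
    return m
```

For the brunette example, magnitude(5, 0.1) is capped from 50 down to 5, as the text describes.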
- each characteristic is compared to the other characteristics for variance within the category. In other words, how does one characteristic compare with its peers? Is it much larger or much smaller?
- the mean and standard deviation among the characteristic percentages PAi are calculated as Cc and Sc for the category.
- the variance is then calculated as the distance that PAi is from Cc in standard deviations, or Vi = (PAi - Cc) / Sc, rounded to the nearest integer. Observations from the value of Vi can be made:
- Vi = 0: PAi is comparable to the mean of the values (within one standard deviation).
- Vi = 1: PAi occurs more often than the mean of the values (one standard deviation or above).
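The variance observation can be sketched as follows; using the population standard deviation is an assumption the text leaves open:

```python
from statistics import mean, pstdev

def variance_observation(percentages, pa_i):
    """Vi: the distance of one characteristic's percentage from the
    category mean, in standard deviations, rounded to the nearest
    integer. `percentages` is the list of PAi values for the category."""
    c = mean(percentages)       # Cc for the category
    s = pstdev(percentages)     # Sc for the category
    if s == 0:
        return 0
    return round((pa_i - c) / s)
```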
- When all of the above calculations have been performed, the present invention generates a report of characteristics associated with Population A that differentiate it from Population B. Not all characteristics are displayed; only the characteristics that stand out. The stand-outs are determined using the noise reduction algorithm.
- In a Discovery Query/Analysis 1500, large amounts of non-contradictory facts or observations are analyzed to discover connections between those facts/observations. In this case, the existence of a connection is a desired search result. For example, a report for a drug that addresses schizophrenia may report a side effect of weight gain. These two observations, "diagnosis:schizophrenia" and "sideeffect:weight gain", are used to search the RedOak global repository for published papers and other related material that also include these observations. This provides a researcher with precision document search results that directly relate to the clinical trial.
- Data stored from learning from an Affinity Query/Analysis can be applied to predict outcomes for new or theoretical clinical trial data. For example, a drug study may be successful for Caucasian subjects, but not for Asian subjects; the affinity query will learn this association of initial property (race) to outcome (success). When faced with a theoretical subject, it can then predict success or failure for the outside influencer if the race initial property is known. Further, race and diagnosis can become search terms for Discovery Query/Analysis to determine if other research has found a similar result.
- the top pull-down selection control, Study Data 6200 allows the user to select the clinical drug trial data set.
- the Outcome Stage control 6100 selects which outcome (when multiple time periods are present, e.g. Stage 1, Stage 2, Stage 3, and Final).
- the Outside Influencer control 6300 allows the user to select the drug (or any other change agent) to analyze (e.g. 5mg dose, 10mg dose, 30mg dose).
- the Control control 6400 allows the user to select a control population for comparison (e.g. placebo).
- Clicking the Overview button 6500 produces an Affinity Query/Analysis report that generally compares the outside influencer to the control, outputting the initial properties common to the Outside Influencer that are not common in the control group. These are the stand-outs resulting from the Noise Reduction discussed above.
- the Overview report also readily points out desirable outcomes and undesirable outcomes within the specific Outside Influencer population; in other words, what kinds of outcomes were observed as different from the control group.
- the General Result control offers two choices - desirable outcomes and undesirable outcomes - allowing the user to review initial properties that characterize the desired outcomes (e.g. Caucasian) or the undesired outcomes (e.g. Asian).
- the Result Category control allows the user to select just one outcome category for analysis without the distraction of other outcomes in the report.
- the Specific Result control allows one specific outcome category with one specific result value to be analyzed to characterize the initial property population for that result.
- the AME Interface is an API for interacting with the AME service or any other suitable associative memory engine.
- a PHP API was created to facilitate interchangeability amongst possibly different AME technology.
- This API provides a subroutine per feature function.
- the subroutine communicates with the AME using the AME's specific interface (such as conversion of its parameters into a REST operation, which is a specially formulated URL page request; the URL response is then translated and provided as the return value of the respective subroutine).
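The abstraction layer can be sketched as follows; the endpoint path and parameter names are invented for illustration, and the transport is injected so that a different engine (or a test stub) can be substituted behind the same subroutines, as the text describes:

```python
import json

class AMEInterface:
    """Thin abstraction over an associative memory engine's REST API.

    Hypothetical sketch: the /observe endpoint and its parameters are
    assumptions, not the actual AME interface.
    """

    def __init__(self, base_url, transport):
        self.base_url = base_url
        self.transport = transport  # callable: url -> response body (str)

    def upload_observation(self, study, category, value):
        # Convert the subroutine's parameters into a specially formulated
        # URL page request (a REST operation), then translate the response.
        url = (f"{self.base_url}/observe?study={study}"
               f"&category={category}&value={value}")
        return json.loads(self.transport(url))

# A stub transport standing in for the real HTTP call.
def fake_transport(url):
    return json.dumps({"status": "ok", "url": url})

api = AMEInterface("http://ame.example", fake_transport)
result = api.upload_observation("study9", "weight", "120-130")
```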
- Writing applications that use the AME or any other suitable associative memory engine is thus simplified, providing a layer of abstraction so a different tool could be substituted.
- ANALYSIS FOR LEARNING MODULE ARISTOTLE
- the learning module, also called the Aristotle module, builds on the affinity and discovery analysis described above. Based upon the complex relationships that the analysis produces, general rules are created. For example, these rules may be created:
- Rules may vary by inclusion of other initial properties or outcomes. Simplification rules may determine that certain initial properties imply other initial properties. Similarly, rules may determine that certain outcomes imply other outcomes. Many rules are created, representing patterns that exist in the data. Unlike the original syllogisms, these rules are not absolute but probabilistic: the terms on the left side have a sample size for which the rule holds true. In the example above, sample sizes of 50 out of 80 and 75 out of 80 were expressed. The terms on the right side similarly have a sample size (above, 50 out of 200 and 75 out of 75). These rules are cascaded together. For example, one initial property may imply I35, which implies I49, which can be combined with agent A65 (outside influencer) to produce outcome X32, which also implies outcome X67. These rules might be:
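The probabilistic rules and their cascading can be sketched as follows; the rule identifiers and sample sizes echo the figures above, but the data structure itself is an assumption for illustration:

```python
from dataclasses import dataclass

@dataclass
class Rule:
    """A probabilistic implication lhs -> rhs with its observed sample
    size (e.g. a rule that held for 50 out of 80 subjects)."""
    lhs: str
    rhs: str
    hits: int
    total: int

    @property
    def confidence(self) -> float:
        return self.hits / self.total

def cascade(rules, start):
    """Chain rules together (e.g. I35 -> I49 -> X32 -> X67), multiplying
    per-rule confidences to estimate how strongly the chain holds."""
    by_lhs = {r.lhs: r for r in rules}
    chain, conf = [start], 1.0
    while chain[-1] in by_lhs:
        rule = by_lhs[chain[-1]]
        if rule.rhs in chain:  # guard against rule cycles
            break
        conf *= rule.confidence
        chain.append(rule.rhs)
    return chain, conf

# Sample sizes are illustrative, echoing the 50/80 and 75/80 figures.
rules = [Rule("I35", "I49", 50, 80),
         Rule("I49", "X32", 75, 80),
         Rule("X32", "X67", 75, 75)]
chain, conf = cascade(rules, "I35")
```

Because each link holds only probabilistically, the cascaded chain's confidence is the product of the per-rule confidences rather than a certainty.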
- Clinical trial data is a series of numerical data sets, perhaps stored in SAS or MS Excel. Aside from numerical data, there may be a number of text documents.
- the CSV data is then loaded into an SQL database.
- a PHP application is created from a template to read the database tables and generate observations.
- the PHP application also identifies which observation categories are initial properties, outcomes, or outside influencers.
- the Space Manager runs the PHP application to generate the observations.
- the Space Manager uploads the observations to AME using the API under a study label, e.g. study 9.
- the user accesses the Affinity Query/Analysis interface to research which outside influencers cause which outcomes, which may be triggered by specific initial properties.
- the user accesses the Discovery Query/Analysis interface to research connections of the initial properties or outcomes to textual documents, including not only the documents from step #1, but also the RedOak global repository of documents and learned (other) clinical trial results from the Aristotle Module.
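The pipeline in the steps above can be sketched end to end. The table layout, column names, and category assignments below are assumptions for illustration, and sqlite stands in for the SQL database:

```python
import csv
import io
import sqlite3

# Load the CSV data into an SQL database (illustrative table layout).
csv_text = ("subject_id,race,dose,outcome\n"
            "1,Caucasian,5mg,improved\n"
            "2,Asian,placebo,headache\n")
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trial (subject_id, race, dose, outcome)")
conn.executemany("INSERT INTO trial VALUES (?, ?, ?, ?)",
                 [(r["subject_id"], r["race"], r["dose"], r["outcome"])
                  for r in csv.DictReader(io.StringIO(csv_text))])

# Identify which observation categories are initial properties,
# outside influencers, or outcomes (assignments assumed here).
CATEGORIES = {"race": "initial_property",
              "dose": "outside_influencer",
              "outcome": "outcome"}

def generate_observations(connection):
    """Read the database tables and generate one observation per
    (subject, category, value) triple, tagged with its kind."""
    obs = []
    for subject_id, race, dose, outcome in connection.execute(
            "SELECT subject_id, race, dose, outcome FROM trial"):
        for column, value in (("race", race), ("dose", dose),
                              ("outcome", outcome)):
            obs.append({"subject": subject_id, "kind": CATEGORIES[column],
                        "category": column, "value": value})
    return obs

observations = generate_observations(conn)
# The observations would then be uploaded to the AME under a study label.
```

From there the Space Manager would push each observation through the API layer, after which the Affinity and Discovery interfaces operate on the uploaded study.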
Abstract
A system and method for analyzing data sets to reveal complex relationship linkages within the data sets. The invention can be used for data analysis and mining in connection with pharmaceutical drug development analysis, diagnostic tools for healthcare professionals, patient information and decision support, related services, and/or consumer downloadable software associated with such services.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US201161464873P | 2011-03-10 | 2011-03-10 | |
| US61/464,873 | 2011-03-10 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2012122516A1 true WO2012122516A1 (fr) | 2012-09-13 |
Family
ID=46798583
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2012/028589 Ceased WO2012122516A1 (fr) | 2011-03-10 | 2012-03-09 | System and method for converting large data sets and other information into analysis observations to reveal a complex relationship |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2012122516A1 (fr) |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2016018258A1 (fr) * | 2014-07-29 | 2016-02-04 | Hewlett-Packard Development Company, L.P. | Similarity within a structured data set |
| US11398298B2 (en) * | 2017-09-26 | 2022-07-26 | 4G Clinical, Llc | Systems and methods for demand and supply forecasting for clinical trials |
| CN118839185A (zh) * | 2024-06-03 | 2024-10-25 | 骆国明 | Noise-reduction classification method for incremental big-data update information in search engines |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20070112714A1 (en) * | 2002-02-01 | 2007-05-17 | John Fairweather | System and method for managing knowledge |
| US20080097938A1 (en) * | 1998-05-01 | 2008-04-24 | Isabelle Guyon | Data mining platform for bioinformatics and other knowledge discovery |
| US20080208820A1 (en) * | 2007-02-28 | 2008-08-28 | Psydex Corporation | Systems and methods for performing semantic analysis of information over time and space |
| US7593952B2 (en) * | 1999-04-09 | 2009-09-22 | Soll Andrew H | Enhanced medical treatment system |
| US20100010968A1 (en) * | 2008-07-10 | 2010-01-14 | Redlich Ron M | System and method to identify, classify and monetize information as an intangible asset and a production model based thereon |
Non-Patent Citations (1)
| Title |
|---|
| GUHA ET AL.: "Advances in Cheminformatics Methodologies and Infrastructure to Support the Data Mining of Large, Heterogeneous Chemical Datasets", CURRENT COMPUTER-AIDED DRUG DESIGN, vol. 6, no. 1, March 2010 (2010-03-01), pages 50-67, Retrieved from the Internet <URL:http://grids.ucs.indiana.edu/ptliupages/publications/AdvancesinCheminformaticsMethodolo_3.pdf>, entire document * |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 12755244; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 12755244; Country of ref document: EP; Kind code of ref document: A1 |