TWI913719B - Feature creation method - Google Patents
Info
- Publication number
- TWI913719B (application number TW113114686A)
- Authority
- TW
- Taiwan
- Prior art keywords
- upstream
- feature
- downstream
- features
- test
Description
The present invention relates to a feature creation method, and more particularly to a feature creation method for disease prediction.
Genetic information flows in sequence through the genome, transcriptome, proteome, and metabolome, an ordering commonly described as an upstream-downstream relationship. With advances in molecular biology techniques, researchers use omics data (genome, transcriptome, proteome, metabolome, and so on) to search for biomarkers linking omics to disease, in either a single omics layer or multiple layers. Previous approaches that predicted disease by combining multiple omics extracted the associations between omics and disease in parallel, ignoring the upstream-downstream relationships among the omics layers. Yet these relationships (for example, the genome affects the downstream transcriptome, which in turn affects the downstream proteome) not only may help a model make predictions but also allow researchers to explain disease mechanisms more comprehensively, from the bottom up.
The present invention provides a feature creation method that builds trans-omics features from the upstream-downstream relationships of omics, enriching the feature pool, with the aim of improving disease prediction and of using those relationships to explain pathogenic mechanisms.
The feature creation method of the present invention includes the following steps. The upstream omics and the downstream omics are each tested against the disease to screen for omics features that show significant differences. The downstream omics is then tested against the upstream omics to select significant features. Next, trans-omics features are built according to the variable types of the upstream and downstream omics. Finally, feature selection is performed and a disease prediction model is built.
In one embodiment of the present invention, the disease includes age-related macular degeneration, diabetic retinopathy, dementia, sarcopenia, or frailty.
In one embodiment of the present invention, the upstream omics includes categorical variables and continuous variables.
In one embodiment of the present invention, the categorical variables include single nucleotide polymorphisms (SNPs), and the continuous variables include ribonucleic acid (RNA).
In one embodiment of the present invention, the test of the upstream omics against the disease includes a t-test or a chi-square test; the t-test is used for continuous variables and the chi-square test for categorical variables.
In one embodiment of the present invention, the test of the downstream omics against the disease includes a t-test or a chi-square test; the t-test is used for continuous variables and the chi-square test for categorical variables.
In one embodiment of the present invention, the test of the downstream omics against the upstream omics includes an ANOVA test.
In one embodiment of the present invention, when the upstream omics is a continuous variable, the upstream omics values are standardized and used as weights for adjusting the downstream omics, and each weight is multiplied by the downstream omics value to build the trans-omics feature of the upstream and downstream omics.
In one embodiment of the present invention, feature selection includes testing the trans-omics feature values against the disease to screen for significant features.
In one embodiment of the present invention, feature selection includes selecting among the omics features and the trans-omics features separately using multiple machine learning methods, ranking the features by importance, and determining the number of features using scree plots of AUC and ACC together with the elbow method.
In one embodiment of the present invention, the downstream omics includes proteins.
Based on the above, the feature creation method of the present invention starts from the upstream-downstream relationships of omics, quantifying and building features of genetic information transfer, namely trans-omics features, to build a disease prediction model. This can improve the accuracy of the disease prediction model and offer a more direct explanation of disease mechanisms.
Embodiments are described in detail below with reference to the accompanying drawing, but the embodiments provided are not intended to limit the scope of this disclosure. In addition, terms such as "comprising," "including," and "having" are open-ended terms meaning "including but not limited to."
Figure 1 is a flowchart of a feature creation method according to an embodiment of the present invention.
Referring to Figure 1, in steps S10 and S12, the upstream omics and the downstream omics are each tested against the disease to screen for omics features that show significant differences. In this embodiment, the disease may include metabolic syndrome, eye-related diseases (such as age-related macular degeneration or diabetic retinopathy), or aging-related diseases (such as dementia, sarcopenia, or frailty), but the present invention is not limited to these. The upstream omics may include categorical variables and continuous variables; a categorical variable is, for example, a single nucleotide polymorphism (SNP), a continuous variable is, for example, RNA (the transcriptome), and the downstream omics is, for example, protein. The test of the upstream omics against the disease includes a t-test or a chi-square test; the t-test is used for continuous variables and the chi-square test for categorical variables. The test of the downstream omics against the disease likewise includes a t-test or a chi-square test, with the t-test used for the continuous variables and the chi-square test for the categorical variables. To screen for omics features with significant differences, the significance level is set to, for example, 0.05; a feature whose p-value is below this level is considered significant and passes to the next stage of the process.
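The categorical branch of the screening in steps S10 and S12 can be sketched as follows. This is a minimal illustration using only the Python standard library, not the patent's implementation: the function names and the contingency-table layout (rows are genotype levels, columns are disease and control counts) are assumptions, and the p-value uses closed forms that hold only for 1 or 2 degrees of freedom. The t-test branch for continuous variables is omitted because it needs a t-distribution routine from a statistics library.

```python
import math


def chi_square_independence(table):
    """Return (statistic, degrees_of_freedom) for a contingency table.

    `table` is a list of rows of observed counts. Every row and column
    is assumed to contain at least one observation.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return stat, df


def chi_square_p_value(stat, df):
    """Upper-tail p-value; exact closed forms for df = 1 and df = 2 only."""
    if df == 1:
        return math.erfc(math.sqrt(stat / 2))
    if df == 2:
        return math.exp(-stat / 2)
    raise NotImplementedError("use a statistics library for other df")


def is_significant(table, alpha=0.05):
    """Apply the 0.05 significance threshold described in the text."""
    stat, df = chi_square_independence(table)
    return chi_square_p_value(stat, df) < alpha


# Hypothetical counts: rows = genotypes AA, AB, 00; columns = disease, control.
hypothetical_counts = [[10, 20], [20, 20], [30, 10]]
print(is_significant(hypothetical_counts))
```

The counts above are invented for illustration only; a feature passing this check would move on to step S20.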
Next, referring to Figure 1, in step S20, the downstream omics is tested against the upstream omics to select significant features. When the upstream omics is a categorical variable, for example a single nucleotide polymorphism (SNP), the genotype may be AA, AB, BB, or 00. When the upstream omics is a continuous variable, for example RNA (the transcriptome), it describes the abundance of the molecule in the body. Taking a categorical upstream omics as an example, trans-omics features are built according to the biological upstream-downstream relationship of omics (that is, genes produce RNA through transcription, RNA produces peptides through translation, and the peptides are further assembled into proteins). Based on the idea that different SNP genotypes lead to different protein expression levels, a trans-omics feature of protein expression is built for each SNP genotype. Before the features are built, and so that only upstream-downstream pairs with clear differences are used, the downstream omics is tested against the upstream omics to select significant features, for example by running an ANOVA test between SNP and protein. The significance level is set to, for example, 0.05; a feature whose p-value is below this level is considered significant and passes to the next stage of the process.
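The ANOVA check in step S20 can be sketched as below, assuming a one-way ANOVA of protein expression grouped by SNP genotype. The function names are hypothetical, and the sketch computes only the F statistic with its degrees of freedom; obtaining the p-value to compare against 0.05 requires the F distribution, which a statistics library (for example `scipy.stats.f_oneway`) provides.

```python
from statistics import fmean


def group_by_genotype(genotypes, values):
    """Collect downstream values (e.g. protein levels) per upstream genotype."""
    groups = {}
    for genotype, value in zip(genotypes, values):
        groups.setdefault(genotype, []).append(value)
    return groups


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of sample groups."""
    groups = [list(g) for g in groups if len(g) > 0]
    k = len(groups)
    n = sum(len(g) for g in groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two groups and more samples than groups")
    grand_mean = fmean([x for g in groups for x in g])
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the samples around their own group mean.
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    if ss_within == 0:
        raise ValueError("no within-group variation; F is undefined")
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical data: protein level per patient, grouped by SNP genotype.
genotypes = ["AA", "AA", "AA", "AB", "AB", "AB", "00", "00", "00"]
protein = [1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
groups = group_by_genotype(genotypes, protein)
print(one_way_anova_f(groups.values()))
```

A large F relative to its degrees of freedom indicates that protein expression differs across genotypes, which is the condition the text requires before a SNP-protein pair is turned into a trans-omics feature.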
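Table 1 shows example patient data.
Table 1
Next, referring to Figure 1, in step S30, trans-omics features are built according to the variable types of the upstream and downstream omics.
When the upstream omics feature is a categorical variable, as in the example patient data of Table 1 where the SNP has three genotypes (AA, AB, and 00), three SNP-protein trans-omics features are built (SNP_AA_ProteinX, SNP_AB_ProteinX, and SNP_00_ProteinX) to describe the relationship between the SNP genotype and the downstream protein expression level. For example, for the feature SNP_AA_ProteinX, patient 01 has SNP genotype AA and a protein expression level of 791911, so the value of SNP_AA_ProteinX for patient 01 is 791911. Similarly, patient 02 has SNP genotype AB and a protein expression level of 2424990, but because the genotype is not AA, the value of SNP_AA_ProteinX for patient 02 is 0. Following this rule, trans-omics features that map SNP genotype to protein expression level are built (as shown in Table 2).
Table 2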
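The rule for a categorical upstream feature can be sketched as follows. The function name and its parameters are hypothetical; the genotype levels, the feature naming pattern, and the two patient values are taken from the example in the text.

```python
def build_categorical_trans_omics(genotypes, protein_values,
                                  snp="SNP", protein="ProteinX",
                                  levels=("AA", "AB", "00")):
    """Build one trans-omics feature per genotype level.

    For each patient, the feature for the patient's own genotype carries
    the protein expression level; the features for all other genotypes
    are set to 0.
    """
    if len(genotypes) != len(protein_values):
        raise ValueError("genotypes and protein_values must align per patient")
    features = {f"{snp}_{level}_{protein}": [] for level in levels}
    for genotype, value in zip(genotypes, protein_values):
        for level in levels:
            name = f"{snp}_{level}_{protein}"
            features[name].append(value if genotype == level else 0)
    return features


# Patients 01 and 02 from the example in the text.
features = build_categorical_trans_omics(["AA", "AB"], [791911, 2424990])
print(features["SNP_AA_ProteinX"])  # [791911, 0]
print(features["SNP_AB_ProteinX"])  # [0, 2424990]
```

Each patient contributes a non-zero value to at most one of the three columns, so the features keep the protein level while encoding which genotype it was observed under.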
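When the upstream omics feature is a continuous variable, the upstream omics values are standardized and used as weights for adjusting the downstream omics (conceptually, an impact multiplier), and each weight is multiplied by the downstream omics value to build the trans-omics feature of the upstream and downstream omics (as shown in Table 3).
Table 3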
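The continuous case can be sketched as below. The text does not say which standardization is used, so this sketch assumes a z-score (subtract the mean, divide by the population standard deviation); min-max scaling would be a drop-in alternative. The function name and the sample values are hypothetical.

```python
from statistics import fmean, pstdev


def build_continuous_trans_omics(upstream, downstream):
    """Weight each downstream value by the standardized upstream value.

    `upstream` could be RNA abundance and `downstream` protein expression,
    one entry per patient. Standardization here is a z-score, which is an
    assumption rather than something the source specifies.
    """
    if len(upstream) != len(downstream):
        raise ValueError("upstream and downstream must align per patient")
    mean = fmean(upstream)
    std = pstdev(upstream)
    if std == 0:
        raise ValueError("upstream values are constant; cannot standardize")
    weights = [(u - mean) / std for u in upstream]
    return [w * d for w, d in zip(weights, downstream)]


# Hypothetical RNA and protein values for three patients.
print(build_continuous_trans_omics([1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
```

With z-scores, patients whose upstream value is below the cohort mean receive a negative weight; if the weight is meant to act strictly as a non-negative multiplier, min-max scaling may match the intent more closely.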
Finally, referring to Figure 1, in steps S40 and S50, feature selection is performed and a disease prediction model is built. Feature selection may be carried out, for example, by testing the trans-omics feature values against the disease to screen for significant features, or by selecting among the omics features and the trans-omics features separately using multiple machine learning methods, ranking the features by importance, and determining the number of features using scree plots of AUC and ACC together with the elbow method, so that the features that influence the disease are identified within each type of feature. The disease prediction model is then built.
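The ranking and elbow steps can be sketched as follows. The text does not define how the elbow is located, so this sketch assumes a common heuristic: take the point on the score curve that lies farthest from the straight line joining its first and last points. The function names, importance scores, and AUC values are all hypothetical.

```python
def rank_features(importances):
    """Return feature names sorted from most to least important."""
    return sorted(importances, key=importances.get, reverse=True)


def elbow_index(scores):
    """Index of the elbow in a curve of scores (e.g. AUC for top-1..top-n).

    Uses the unnormalized distance from each point to the chord between
    the first and last points; the normalizing constant does not change
    which point is farthest.
    """
    n = len(scores)
    if n < 3:
        raise ValueError("need at least three points to locate an elbow")
    x0, y0 = 0, scores[0]
    x1, y1 = n - 1, scores[-1]
    best_index, best_distance = 0, -1.0
    for i, y in enumerate(scores):
        distance = abs((y1 - y0) * i - (x1 - x0) * y + x1 * y0 - y1 * x0)
        if distance > best_distance:
            best_index, best_distance = i, distance
    return best_index


# Hypothetical importances and the AUC obtained with the top-k features.
importances = {"SNP_AA_ProteinX": 0.30, "RNA1_ProteinX": 0.25,
               "ProteinX": 0.20, "SNP_AB_ProteinX": 0.10,
               "RNA2": 0.08, "SNP_00_ProteinX": 0.07}
auc_by_k = [0.60, 0.70, 0.78, 0.80, 0.81, 0.815]

ranked = rank_features(importances)
k = elbow_index(auc_by_k) + 1          # number of features to keep
print(ranked[:k])
```

The same procedure can be repeated with ACC in place of AUC, and the two resulting feature counts compared before fixing the final feature set.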
Once the disease prediction model has been built, it can be applied in settings such as diagnostic assistance or early warning. After a specimen is obtained from a subject (for example, drawn blood), current molecular biology techniques can be used to obtain the subject's omics data, such as SNP and RNA, and to build the trans-omics features the model requires. Applying the established prediction model to the subject's data then yields the subject's probability of having the disease, which can inform further clinical diagnosis or treatment. Taking sarcopenia as an example, in addition to the current practice of diagnosing subjects by items such as slow gait speed, weak grip strength, weight loss, low energy, or low activity level, the trans-omics disease prediction model of the present invention offers an omics perspective (that is, biochemical data from within the body) to assist diagnosis. In other words, the present invention also provides a health and disease risk assessment method and a disease prediction method: after a disease prediction model is built using the feature creation method of the present invention, the model is applied to a subject's data to analyze the subject's probability of having the disease, and the result can assist diagnosis and guide subsequent clinical diagnosis or treatment.
In summary, the present invention provides a feature creation method that starts from the upstream-downstream relationships of omics (for example, the genome affects the downstream transcriptome, which in turn affects the downstream proteome), quantifying and building features of genetic information transfer, namely trans-omics features, to enrich the feature pool and build a disease prediction model, with the aim of improving disease prediction and of using those relationships to explain pathogenic mechanisms. This can improve the accuracy of the disease prediction model, offer a more direct explanation of disease mechanisms, and allow researchers to explain disease mechanisms more comprehensively, from the bottom up.
S10, S12, S20, S30, S40, S50: steps
Claims (11)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW113114686A TWI913719B (en) | 2024-04-19 | Feature creation method |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW113114686A TWI913719B (en) | 2024-04-19 | Feature creation method |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| TW202542922A (en) | 2025-11-01 |
| TWI913719B (en) | 2026-02-01 |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20140149138A1 (en) | 2006-05-18 | 2014-05-29 | Caris Mpi, Inc. | System and method for determining individualized medical intervention for a disease state |
| CA2739675C (en) | 2008-10-14 | 2020-12-01 | Caris Mpi, Inc. | Gene and gene expressed protein targets depicting biomarker patterns and signature sets by tumor type |
| TWI730973B (en) | 2015-07-23 | 2021-06-21 | 香港中文大學 | Analysis of fragmentation patterns of cell-free dna |
| CN113192556A (en) | 2021-03-17 | 2021-07-30 | 西北工业大学 | Genotype and phenotype association analysis method in multigroup chemical data based on small sample |