TWI913719B - Feature creation method

Info

Publication number: TWI913719B
Application number: TW113114686A
Authority: TW
Inventors: 涂芝瑋; 詹韻玄; 蔡宗憲
Original assignee: 宏碁股份有限公司
Filing date: 2024-04-19
Publication date: 2026-02-01

Abstract

The invention provides a feature creation method that includes the following steps. Upstream omics and downstream omics are each tested against a disease to screen out omics features that differ significantly. The downstream omics is then tested against the upstream omics to select significant features. Next, trans-omics features are created according to the data types of the upstream omics and the downstream omics. Finally, feature selection is performed and a disease prediction model is built.

Description

Feature creation method

The present invention relates to a feature creation method, and more particularly to a feature creation method for disease prediction.

Genetic information flows sequentially through the genome, transcriptome, proteome, and metabolome, an ordering commonly referred to as the upstream-downstream relationship. With advances in molecular biology techniques, researchers use omics data (genome, transcriptome, proteome, metabolome, and so on) to search for biomarkers linking omics and disease, within a single omics layer or across multiple omics layers. Earlier approaches that combined multi-omics data to predict disease extracted the association between each omics layer and the disease in parallel, ignoring the upstream-downstream relationships among the layers. Yet these relationships (for example, the genome affects the downstream transcriptome, which in turn affects the downstream proteome) may not only help a model make predictions but also let researchers explain disease mechanisms more comprehensively from the bottom up.

The present invention provides a feature creation method that creates trans-omics features from the upstream-downstream relationships among omics layers and thereby enriches the feature pool, with the aim of improving disease prediction and of using those relationships to explain pathogenic mechanisms.

The feature creation method of the present invention includes the following steps. Upstream omics and downstream omics are each tested against a disease to screen out omics features that differ significantly. The downstream omics is then tested against the upstream omics to select significant features. Next, trans-omics features are created according to the data types of the upstream omics and the downstream omics. Finally, feature selection is performed and a disease prediction model is built.

In an embodiment of the present invention, the disease includes age-related macular degeneration, diabetic retinopathy, dementia, sarcopenia, or frailty.

In an embodiment of the present invention, the upstream omics includes categorical variables and continuous variables.

In an embodiment of the present invention, the categorical variables include single nucleotide polymorphisms (SNPs), and the continuous variables include ribonucleic acid (RNA).

In an embodiment of the present invention, the test of the upstream omics against the disease includes a t-test or a chi-square test, with the t-test used for continuous variables and the chi-square test used for categorical variables.

In an embodiment of the present invention, the test of the downstream omics against the disease includes a t-test or a chi-square test, with the t-test used for continuous variables and the chi-square test used for categorical variables.

In an embodiment of the present invention, the test of the downstream omics against the upstream omics includes an ANOVA test.

In an embodiment of the present invention, when the upstream omics is a continuous variable, the upstream omics values are normalized to serve as weights for adjusting the downstream omics, and each weight is multiplied by the corresponding downstream omics value to create the trans-omics feature of the upstream omics and the downstream omics.

In an embodiment of the present invention, the feature selection includes testing the trans-omics feature values against the disease to screen out significant features.

In an embodiment of the present invention, the feature selection includes selecting among the omics features and the trans-omics features separately using multiple machine learning methods, ranking the features by importance, and determining the number of features using scree plots of AUC and ACC together with the elbow method.

In an embodiment of the present invention, the downstream omics includes protein.

Based on the above, the feature creation method of the present invention starts from the upstream-downstream relationships among omics layers, and quantifies and creates features of genetic information transfer, namely trans-omics features, to build a disease prediction model. The accuracy of the disease prediction model can thereby be improved, and disease mechanisms can be explained more directly.

Embodiments are described in detail below with reference to the accompanying drawing, but they are not intended to limit the scope of this disclosure. In addition, the terms "comprising," "including," "having," and the like used herein are open-ended terms meaning "including but not limited to."

Figure 1 is a flowchart of a feature creation method according to an embodiment of the present invention.

Referring to Figure 1, in steps S10 and S12, the upstream omics and the downstream omics are each tested against the disease to screen out omics features that differ significantly. In this embodiment, the disease may include metabolic syndrome, eye-related diseases (such as age-related macular degeneration or diabetic retinopathy), or aging-related diseases (such as dementia, sarcopenia, or frailty), but the invention is not limited thereto. The upstream omics may include categorical variables and continuous variables: a categorical variable is, for example, a single nucleotide polymorphism (SNP), and a continuous variable is, for example, RNA (the transcriptome). The downstream omics is, for example, protein. The test of the upstream omics against the disease includes a t-test or a chi-square test, with the t-test used for continuous variables and the chi-square test used for categorical variables. The test of the downstream omics against the disease likewise includes a t-test for continuous variables or a chi-square test for categorical variables. The screening uses, for example, a significance level of 0.05: an omics feature whose test result falls below 0.05 is considered significant and passes to the next stage.
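The categorical branch of this screening can be sketched in a few lines. This is only an illustration under stated assumptions, not the patent's implementation: the genotype counts are hypothetical, the function names are invented for this example, and the closed-form p-value exp(-x/2) is valid only because a 3 x 2 contingency table has 2 degrees of freedom. The t-test branch for continuous variables is not shown.

```python
import math

ALPHA = 0.05  # significance level given in the text


def chi_square_stat(table):
    """Pearson chi-square statistic for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


def p_value_df2(stat):
    """Upper-tail p-value of a chi-square distribution with exactly 2 degrees of freedom."""
    return math.exp(-stat / 2.0)


# Hypothetical counts (NOT from the patent): rows are genotypes AA, AB, BB;
# columns are (with disease, without disease).
counts = [[30, 10], [20, 20], [10, 30]]
stat = chi_square_stat(counts)
p = p_value_df2(stat)
keep_feature = p < ALPHA  # the SNP passes to the next stage only if significant
```

For tables with other dimensions the degrees of freedom change, so a general chi-square survival function from a statistics library would be needed in place of `p_value_df2`.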

Next, referring to Figure 1, in step S20, the downstream omics is tested against the upstream omics to select significant features. When the upstream omics is a categorical variable, for example a single nucleotide polymorphism (SNP), the genotype may take the values AA, AB, BB, and 00. When the upstream omics is a continuous variable, for example RNA (the transcriptome), it describes the abundance of the molecule in the body. Taking a categorical upstream omics as an example, trans-omics features are created according to the biological upstream-downstream relationship (that is, genes produce RNA through transcription, RNA produces peptides through translation, and the peptides are further assembled into proteins). Based on the idea that different SNP genotypes lead to different protein expression levels, a trans-omics feature of the protein expression level is created for each SNP genotype. Before the features are created, and so that only upstream-downstream pairs with a clear difference are used, the downstream omics is tested against the upstream omics to select significant features, for example by an ANOVA test between the SNP and the protein. The selection uses, for example, a significance level of 0.05: a result below 0.05 is considered significant and passes to the next stage.
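The ANOVA step can be illustrated by computing the one-way F statistic of protein level grouped by SNP genotype. The sketch below is an assumption-laden toy: the function name `one_way_anova_f` is invented here, the four observations are the protein values from Table 1 (far too few for a meaningful test), and the p-value is deliberately omitted because it requires the F distribution, which a statistics library such as SciPy would normally supply.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` maps a group label to a list of numeric observations.
    """
    values = [v for obs in groups.values() for v in obs]
    n_total = len(values)
    k = len(groups)
    grand_mean = sum(values) / n_total

    ss_between = 0.0  # variation of group means around the grand mean
    ss_within = 0.0   # variation of observations around their group mean
    for obs in groups.values():
        mean = sum(obs) / len(obs)
        ss_between += len(obs) * (mean - grand_mean) ** 2
        ss_within += sum((v - mean) ** 2 for v in obs)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Protein levels from Table 1, grouped by SNP genotype.
protein_by_genotype = {
    "AA": [791911, 878780],  # patients 01 and 03
    "AB": [2424990],         # patient 02
    "00": [336298],          # patient 04
}
f_stat, df_b, df_w = one_way_anova_f(protein_by_genotype)
```

A large F value indicates that protein levels differ between genotypes by more than they vary within a genotype; the comparison against the 0.05 level would be made on the corresponding p-value.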

Table 1 shows an example of patient data.

Table 1

| | Upstream omics (categorical variable, SNP) | Upstream omics (continuous variable, RNA) | Downstream omics (protein) |
|---|---|---|---|
| Patient 01 | AA | 76.82 | 791911 |
| Patient 02 | AB | 86.74 | 2424990 |
| Patient 03 | AA | 168.31 | 878780 |
| Patient 04 | 00 | 58.88 | 336298 |

Next, referring to Figure 1, in step S30, trans-omics features are created according to the data types of the upstream omics and the downstream omics.

When the upstream omics feature is a categorical variable, as in the patient data example of Table 1 where the SNP has three genotypes (AA, AB, and 00), three SNP-protein trans-omics features are created (SNP_AA_ProteinX, SNP_AB_ProteinX, and SNP_00_ProteinX) to describe the relationship between the SNP genotype and the downstream protein expression level. For example, patient 01 has SNP genotype AA and a protein expression level of 791911, so the value of patient 01 for the trans-omics feature SNP_AA_ProteinX is 791911. Patient 02 has SNP genotype AB and a protein expression level of 2424990, but because the genotype is not AA, the value of patient 02 for SNP_AA_ProteinX is 0. Following this rule, the trans-omics features that map SNP genotype to protein expression level are created as shown in Table 2.

Table 2

| | SNP_AA_ProteinX | SNP_AB_ProteinX | SNP_00_ProteinX |
|---|---|---|---|
| Patient 01 | 791911 | 0 | 0 |
| Patient 02 | 0 | 2424990 | 0 |
| Patient 03 | 878780 | 0 | 0 |
| Patient 04 | 0 | 0 | 336298 |
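The rule that produces Table 2 from Table 1 can be written as a short function. This is a minimal sketch: the function name `build_categorical_trans_omics` and the dictionary layout are assumptions made for illustration, while the patient values and the feature naming pattern come from Tables 1 and 2.

```python
def build_categorical_trans_omics(snp_by_patient, protein_by_patient,
                                  protein_name="ProteinX"):
    """Create one feature per observed genotype.

    A patient's value is the protein level when the patient carries that
    genotype, and 0 otherwise.
    """
    genotypes = sorted(set(snp_by_patient.values()))
    features = {}
    for patient, genotype in snp_by_patient.items():
        row = {}
        for g in genotypes:
            name = f"SNP_{g}_{protein_name}"
            row[name] = protein_by_patient[patient] if g == genotype else 0
        features[patient] = row
    return features


# Values taken from Table 1.
snp = {"Patient 01": "AA", "Patient 02": "AB",
       "Patient 03": "AA", "Patient 04": "00"}
protein = {"Patient 01": 791911, "Patient 02": 2424990,
           "Patient 03": 878780, "Patient 04": 336298}

table2 = build_categorical_trans_omics(snp, protein)
```

Each patient ends up with exactly one non-zero entry, so the construction amounts to a one-hot encoding of the genotype scaled by the protein level.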

When the upstream omics feature is a continuous variable, the upstream omics values are normalized to serve as weights that adjust the downstream omics (conceptually, a scaling factor for its influence). Each weight is the patient's upstream value divided by the sum of the upstream values over all patients, and it is multiplied by the patient's downstream omics value to create the upstream-downstream trans-omics feature, as shown in Table 3.

Table 3

| | Weight | RNA_ProteinX |
|---|---|---|
| Patient 01 | W₁ = 76.82 / (76.82 + 86.74 + 168.31 + 58.88) | W₁ * 791911 |
| Patient 02 | W₂ = 86.74 / (76.82 + 86.74 + 168.31 + 58.88) | W₂ * 2424990 |
| Patient 03 | W₃ = 168.31 / (76.82 + 86.74 + 168.31 + 58.88) | W₃ * 878780 |
| Patient 04 | W₄ = 58.88 / (76.82 + 86.74 + 168.31 + 58.88) | W₄ * 336298 |
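The weighting in Table 3 can be sketched as follows. The function name `build_continuous_trans_omics` and the dictionary layout are assumptions for illustration; the RNA and protein values are those of Table 1, and the weight is the sum-normalized RNA value exactly as written in Table 3.

```python
def build_continuous_trans_omics(rna_by_patient, protein_by_patient):
    """Return (weights, features) for a continuous upstream omics.

    weight_i  = rna_i / sum of rna over all patients
    feature_i = weight_i * protein_i
    """
    total = sum(rna_by_patient.values())
    weights = {p: v / total for p, v in rna_by_patient.items()}
    features = {p: weights[p] * protein_by_patient[p] for p in rna_by_patient}
    return weights, features


# Values taken from Table 1.
rna = {"Patient 01": 76.82, "Patient 02": 86.74,
       "Patient 03": 168.31, "Patient 04": 58.88}
protein = {"Patient 01": 791911, "Patient 02": 2424990,
           "Patient 03": 878780, "Patient 04": 336298}

weights, rna_proteinx = build_continuous_trans_omics(rna, protein)
```

Because the weights sum to 1 over the cohort, they depend on which patients are included; applying the feature to a new subject would require fixing the normalization constant in some way, a detail the source does not specify.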

Finally, referring to Figure 1, in steps S40 and S50, feature selection is performed and a disease prediction model is built. Feature selection may be carried out, for example, by testing the trans-omics feature values against the disease to screen out significant features, or by selecting among the omics features and the trans-omics features separately using multiple machine learning methods, ranking the features by importance, and determining the number of features using scree plots of AUC and ACC together with the elbow method. In this way the features that influence the disease are identified within each feature type, and the disease prediction model is then built.

Once the disease prediction model has been built, it can be applied in settings such as diagnostic support or early warning. After a specimen is obtained from a subject (for example, drawn blood), current molecular biology techniques can be used to obtain the subject's omics data, such as SNP and RNA, and to create the trans-omics features required by the model. The completed prediction model is then applied to the subject's data to estimate the subject's probability of having the disease, so that appropriate clinical diagnosis or treatment can follow. Taking sarcopenia as an example, in addition to the current diagnosis based on items such as slow gait speed, poor grip strength, weight loss, low energy, or low activity level, the trans-omics disease prediction model of the present invention offers an omics perspective (that is, biochemical data from within the body) to assist diagnosis. In other words, the present invention also provides a health and disease risk assessment method and a disease prediction method: after a disease prediction model is built with the feature creation method of the invention, the model is applied to a subject's data to estimate the subject's probability of having the disease, and the result can assist diagnosis and guide subsequent clinical diagnosis or treatment.

In summary, the present invention provides a feature creation method that starts from the upstream-downstream relationships among omics layers (for example, the genome affects the downstream transcriptome, which in turn affects the downstream proteome). It quantifies and creates features of genetic information transfer, namely trans-omics features, and thereby enriches the feature pool used to build a disease prediction model, with the aim of improving disease prediction and of using the upstream-downstream relationships to explain pathogenic mechanisms. The accuracy of the disease prediction model can thus be improved, disease mechanisms can be explained more directly, and researchers can explain them more comprehensively from the bottom up.

S10, S12, S20, S30, S40, S50: steps

Claims

1. A feature creation method, comprising: testing upstream omics and downstream omics each against a disease to screen out omics features that differ significantly; testing the downstream omics against the upstream omics to select significant features; creating trans-omics features according to the data types of the upstream omics and the downstream omics; and performing feature selection and building a disease prediction model.

2. The feature creation method as described in claim 1, wherein the disease includes age-related macular degeneration, diabetic retinopathy, dementia, sarcopenia, or frailty.

3. The feature creation method as described in claim 1, wherein the upstream omics includes categorical variables and continuous variables.

4. The feature creation method as described in claim 3, wherein the categorical variables include single nucleotide polymorphisms (SNPs) and the continuous variables include ribonucleic acid (RNA).

5. The feature creation method as described in claim 3, wherein the test of the upstream omics against the disease includes a t-test or a chi-square test, wherein the t-test is used for the continuous variables and the chi-square test is used for the categorical variables.

6. The feature creation method as described in claim 3, wherein the test of the downstream omics against the disease includes a t-test or a chi-square test, wherein the t-test is used for the continuous variables and the chi-square test is used for the categorical variables.

7. The feature creation method as described in claim 1, wherein the test of the downstream omics against the upstream omics includes an ANOVA test.

8. The feature creation method as described in claim 3, wherein when the upstream omics is the continuous variable, the upstream omics values are normalized to serve as weights for adjusting the downstream omics, and each weight is multiplied by the downstream omics value to create the trans-omics feature of the upstream omics and the downstream omics.

9. The feature creation method as described in claim 1, wherein the feature selection includes testing the trans-omics feature values against the disease to screen out significant features.

10. The feature creation method as described in claim 1, wherein the feature selection includes selecting among the omics features and the trans-omics features separately using multiple machine learning methods, ranking the features by importance, and determining the number of features using scree plots of AUC and ACC together with the elbow method.

11. The feature creation method as described in claim 1, wherein the downstream omics includes protein.