HK1240281B - Biomarkers for colorectal cancer related diseases
- Publication number
- HK1240281B (HK17113733.6A)
- Authority
- HK
- Hong Kong
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
None
FIELD
The present invention relates to biomarkers and methods for predicting the risk of microbiota-associated diseases, in particular colorectal cancer and advanced adenoma in the colorectum.
BACKGROUND
Colorectal cancer (CRC) is among the three most frequently diagnosed cancers worldwide and is a leading cause of cancer death. Its incidence is higher in more developed countries, but it is rising rapidly in historically low-risk regions such as East Asia, Spain and Eastern Europe, which is attributed to the so-called Western lifestyle. During the development of colorectal cancer, genetic alterations accumulate over many years, typically involving loss of the tumor suppressor gene adenomatous polyposis coli (APC), followed by activating or inactivating mutations in KRAS, PIK3CA and TP53, respectively (Brenner, H., Kloor, M. & Pox, C.P. Colorectal cancer. Lancet 383, 1490-502 (2014), incorporated herein by reference). Although most CRC cases are sporadic, they are usually preceded by abnormal adenomas that can progress to a malignant form, which is known as the adenoma-carcinoma sequence. Early diagnosis of colorectal adenoma and colorectal cancer not only helps to prevent death but also helps to reduce the cost of surgical intervention.
CRC is one of the most studied diseases associated with the gut microbiota. However, causality in this disease is usually studied by administering antibiotic cocktails, which deplete the gut microbiota without revealing the exact microbial strains and genes at work. Fusobacterium has been detected in colorectal cancer tissue relative to normal colon tissue and has been found to be enriched in adenomas. Fusobacterium nucleatum, a periodontal pathogen, has been found to promote myeloid infiltration of intestinal tumors in ApcMin/+ mice and is associated with upregulated expression of pro-inflammatory genes such as Ptgs2 (COX-2), Scyb1 (IL8), Il6, Tnf (TNFα) and Mmp3 in mice and humans (Kostic, A.D. et al., Fusobacterium nucleatum potentiates intestinal tumorigenesis and modulates the tumor-immune microenvironment. Cell Host Microbe 14, 207-215 (2013), incorporated herein by reference). However, it remains unclear whether additional bacteria or archaea can serve as markers of, or contributors to, the etiology of colorectal cancer.
Current tests for CRC, such as flexible sigmoidoscopy and colonoscopy, are invasive, and patients may find the procedure and the bowel preparation uncomfortable or unpleasant. The interplay between the gut microbiota and the immune system plays an important role in many diseases both within and outside the gut (Cho, I. & Blaser, M.J. The human microbiome: at the interface of health and disease. Nature Rev. Genet. 13, 260-270 (2012), incorporated herein by reference). Gut microbiota analysis of fecal DNA has the potential to serve as a non-invasive test for finding specific biomarkers that can be used as screening tools for the early diagnosis of CRC, leading to longer survival and better quality of life.
SUMMARY
Embodiments of the present disclosure seek to solve, at least to some extent, at least one of the problems existing in the prior art.
The present invention is based on the following findings of the inventors:
Assessment and characterization of the gut microbiota has become a major research area in human disease, including colorectal cancer. The inventors performed, for the first time, metagenomic whole-genome shotgun sequencing of stool samples from healthy controls and from patients with colorectal cancer or adenoma. To analyze the gut microbial content of colorectal cancer and adenoma patients, the inventors carried out a Metagenome-Wide Association Study (MGWAS) protocol (Qin, J. et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 490, 55-60 (2012), incorporated herein by reference). To compare the fecal microbiomes of the healthy control group, the advanced adenoma group and the cancer patient group, genes whose relative abundance differed significantly between any two groups were identified (p < 0.05, Kruskal-Wallis test). These marker genes were then clustered into MLGs (metagenomic linkage groups) (Qin et al., 2012, supra) according to their abundance variation across all samples, and the inventors identified the MLG signatures of these tumors.
The inventors then identified and validated 15 MLGs for early and non-invasive diagnosis of colorectal cancer, and 10 MLGs for early and non-invasive diagnosis of colorectal adenoma. To exploit the potential of this gut-microbiota-based classification of CRC, the inventors calculated the probability of disease with random forest models based on the 15 MLGs and the 10 MLGs, respectively. The inventors' data provide insight into the characteristics of the gut metagenome associated with CRC risk, offer a paradigm for future studies of the role of the gut metagenome in the pathophysiology of other related conditions, and reveal the potential of gut-microbiota-based methods for assessing individuals at risk of such conditions.
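The gene-selection step described above (genes whose relative abundance differs between groups at p < 0.05 by the Kruskal-Wallis test) can be illustrated with a small standard-library sketch. It is not the inventors' pipeline: the function names and the abundance values are hypothetical, only two or three groups are supported (so that the chi-square tail has a closed form), and no multiple-testing correction is applied.

```python
import math
from itertools import chain


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of sorted positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis(*groups):
    """Tie-corrected Kruskal-Wallis H and its chi-square p-value (2 or 3 groups)."""
    k = len(groups)
    if k not in (2, 3):
        raise ValueError("this sketch only handles 2 or 3 groups")
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)

    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)

    # Correct H for ties in the pooled data.
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    correction = 1 - sum(c ** 3 - c for c in counts.values()) / (n ** 3 - n)
    if correction == 0:  # every value identical: no evidence of a difference
        return 0.0, 1.0
    h = max(h / correction, 0.0)

    # Chi-square survival function with k - 1 degrees of freedom.
    p = math.exp(-h / 2) if k == 3 else math.erfc(math.sqrt(h / 2))
    return h, p


# Hypothetical relative abundances of one gene in three groups.
control = [0.01, 0.02, 0.03]
adenoma = [0.04, 0.05, 0.06]
cancer = [0.07, 0.08, 0.09]
h_stat, p_value = kruskal_wallis(control, adenoma, cancer)
is_marker_gene = p_value < 0.05
```

In practice a tested statistics library would be preferable to this hand-rolled version, and with real cohorts the asymptotic chi-square approximation is more reliable than it is for the tiny example groups used here.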
It is believed that the above 15 MLGs and 10 MLGs are of great value for improving the early detection of CRC, for the following reasons. First, the markers of the present invention are more specific and sensitive than conventional markers. Second, stool-based analysis is accurate, safe and affordable, and it favors patient compliance. Stool samples are transportable. Compared with colonoscopy, which requires bowel preparation, the present invention involves a comfortable and non-invasive in vitro method, so people are more likely to take part in a given screening program. Third, the markers of the present invention can also be used as a tool for monitoring therapy in CRC patients, in order to detect the response to treatment.
Thus, in a first aspect, the present invention provides a biomarker panel for predicting or diagnosing a microbiota-related disease in a subject, or for determining whether a subject is at risk of developing said disease.
In a second aspect, the present invention provides a kit for predicting or diagnosing a microbiota-related disease in a subject, or for determining whether a subject is at risk of developing said disease, the kit comprising reagents for determining the level or amount of each biomarker of the biomarker panel of the present invention in a sample.
In a third aspect, the present invention provides the use of reagents for determining the level or amount of each biomarker of the biomarker panel of the present invention in the preparation of a kit for predicting or diagnosing a microbiota-related disease in a subject, or for determining whether a subject is at risk of developing said disease.
In a fourth aspect, the present invention provides a method for predicting or diagnosing a microbiota-related disease in a subject, or for determining whether a subject is at risk of developing said disease.
BRIEF DESCRIPTION OF THE DRAWINGS
Various aspects and advantages of the present disclosure will become apparent and readily understood from the following description taken in conjunction with the accompanying drawings, in which:
Figure 1 shows that gut MLGs can distinguish colorectal cancer samples from healthy control samples. (a) Distribution of the errors from five repeats of 10-fold cross-validation in random forest classification of carcinoma as the number of MLGs increases. The model was trained on the relative abundance of MLGs (>100 genes) in control and carcinoma samples (n=55 and 41). The black curve is the average of the five repeats (gray lines). The black straight line marks the number of MLGs in the optimal set (15 MLGs) (Table 2-1, Table 2-2). Even when age and BMI were considered together with the MLGs, the same MLGs were selected. (b) Box-and-whisker plot of the probability of carcinoma in the cross-validated training set according to the model in (a). (c) Receiver operating characteristic (ROC) curve for the training set. At a cut-off of 0.5, the AUC was 98.34% with a 95% confidence interval (CI) of 96.29-100%. (d) Classification results for the test set consisting of 8 control samples (black squares), 47 advanced adenoma samples (open circles) and 5 carcinoma samples (solid black circles). (e) ROC curve for the test set. At a cut-off of 0.5, the AUC was 96% with a 95% CI of 87.88-100%. A subject whose probability of carcinoma is ≥ 0.5 is at risk of colorectal cancer. The results in Figure 1 indicate that the above 15 MLGs can be used, with high sensitivity and specificity, as biomarkers for diagnosing colorectal cancer and/or determining the risk of colorectal cancer.
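To make the quantities reported for Figure 1 concrete, the following standard-library sketch computes an AUC from predicted probabilities and applies the 0.5 decision rule. The probabilities and labels are invented for illustration, and the function names are hypothetical. The AUC is computed from its rank interpretation (the chance that a randomly chosen case scores higher than a randomly chosen control), which is an assumption about presentation rather than a statement of how the inventors computed it.

```python
def roc_auc(scores, labels):
    """AUC as P(case score > control score); tied pairs count one half."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for d in controls:
            if c > d:
                wins += 1.0
            elif c == d:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def at_risk(probability, cutoff=0.5):
    """Decision rule from the figure: probability >= cutoff means 'at risk'."""
    return probability >= cutoff


# Hypothetical model outputs: 1 = carcinoma, 0 = healthy control.
probabilities = [0.91, 0.78, 0.55, 0.32, 0.48, 0.20, 0.10, 0.61]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

auc = roc_auc(probabilities, labels)          # 13 of 16 pairs ranked correctly: 0.8125
calls = [at_risk(p) for p in probabilities]   # per-sample risk calls at cutoff 0.5
```

Note that the AUC summarizes the ranking over all possible cut-offs, whereas the 0.5 cut-off is only what converts a probability into an individual risk call.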
Figure 2 shows that gut MLGs can distinguish advanced adenoma samples from healthy control samples. (a) Distribution of the errors from five repeats of 10-fold cross-validation in random forest classification of advanced adenoma as the number of MLGs increases. The model was trained on the relative abundance of MLGs (>100 genes) in control and advanced adenoma samples (n=55 and 42). The black curve is the average of the five repeats (gray lines). The black straight line marks the number of MLGs in the optimal set (10 MLGs) (Table 6-1, Table 6-2, Table 7). Even when age and BMI were considered together with the MLGs, the same MLGs were selected. (b) Box-and-whisker plot of the probability of advanced adenoma in the cross-validated training set according to the model in (a). (c) Receiver operating characteristic (ROC) curve for the training set. At a cut-off of 0.5, the AUC was 87.38% with a 95% confidence interval (CI) of 80.21-94.55%. (d) Classification results for the test set consisting of 15 control samples (open circles) and 15 advanced adenoma samples (solid black circles). (e) ROC curve for the test set. At the optimal cut-off of 0.4572, the AUC was 90.67%, the true positive rate (TPR) was 1 and the false positive rate (FPR) was 0.2667. A subject whose probability of colorectal adenoma is ≥ 0.4572 (the optimal cut-off) is at risk of colorectal adenoma. The results in Figure 2 show that the above 10 MLGs can be used, with high sensitivity and specificity, as biomarkers for diagnosing advanced adenoma and/or determining the risk of advanced adenoma.
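The true and false positive rates quoted for the Figure 2 test set follow directly from counting samples on either side of the cut-off. The sketch below uses invented probabilities for 15 adenoma and 15 control samples, chosen only so that the counts reproduce the reported rates (15 of 15 adenomas and 4 of 15 controls at or above 0.4572). The individual values are not from the patent, and how the optimal cut-off was selected is not stated in the text.

```python
def rates_at_cutoff(probabilities, labels, cutoff):
    """Return (TPR, FPR) when samples with probability >= cutoff are called positive."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    tp = sum(1 for p, y in zip(probabilities, labels) if y == 1 and p >= cutoff)
    fp = sum(1 for p, y in zip(probabilities, labels) if y == 0 and p >= cutoff)
    return tp / n_pos, fp / n_neg


# Hypothetical predicted probabilities of advanced adenoma.
adenoma_probs = [0.46, 0.50, 0.52, 0.55, 0.58, 0.60, 0.63, 0.66,
                 0.70, 0.72, 0.75, 0.80, 0.84, 0.90, 0.95]
control_probs = [0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.22, 0.25,
                 0.30, 0.35, 0.40, 0.47, 0.51, 0.56, 0.62]

probs = adenoma_probs + control_probs
labels = [1] * len(adenoma_probs) + [0] * len(control_probs)

tpr, fpr = rates_at_cutoff(probs, labels, cutoff=0.4572)
# tpr is 15/15 = 1.0 and fpr is 4/15, which rounds to 0.2667
```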
DETAILED DESCRIPTION
The terms used herein have the meanings commonly understood by a person of ordinary skill in the art to which the present invention pertains. However, for a better understanding of the present invention, the relevant terms are defined and explained as follows.
Terms such as "a," "an," and "the" are not intended to refer to only a singular entity, but include the general class of which a specific example may be used for illustration.
According to the present invention, the term "biomarker" (also called "biological marker") refers to a measurable indicator of a biological state or condition of a subject. Such a biomarker can be any substance in the subject, for example a nucleic acid marker (e.g., DNA), protein marker, cytokine marker, chemokine marker, carbohydrate marker, antigen marker, antibody marker, species marker (species/genus marker) or functional marker (KO/OG marker), as long as it is associated with a specific biological state or condition (e.g., a disease) of the subject. Biomarkers are usually measured and evaluated to examine normal biological processes, pathogenic processes or pharmacological responses to a therapeutic intervention, and they are useful in many scientific fields.
According to the present invention, the term "biomarker panel" refers to a set of biomarkers (i.e., a combination of two or more biomarkers).
According to the present invention, the term "microbiota-related disease" refers to a disease associated with an imbalance of the microbiota in the gut. For example, the disease may be caused, induced or exacerbated by an imbalance of the gut microbiota. Such a disease may be advanced adenoma in the colorectum or colorectal malignancy/carcinoma.
According to the present invention, the term "subject" refers to an animal, particularly a mammal, such as a primate, preferably a human.
According to the present invention, the expression "colorectal cancer" has the same meaning as "colorectal carcinoma".
According to the present invention, the expressions "advanced adenoma" and "colorectal advanced adenoma" have the same meaning as "advanced adenoma in colorectal cancer".
According to the present invention, the expressions "cutoff" and "cut-off" have the same meaning and refer to a predicted threshold value. The predicted cut-off can be obtained by routine experimentation (e.g., by measuring in parallel the relative abundance of the biomarkers in samples from subjects of known physiological status).
According to the present invention, the term "MLG" is defined as a group of genetic material in a metagenome that is probably physically linked as a unit rather than being independently distributed (see Qin, J. et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 490, 55-60 (2012), the entire contents of which are incorporated herein by reference). MLGs remove the need to fully determine the specific microbial species present in a metagenome, which is important because a large number of organisms remain unknown and lateral gene transfer (LGT) between bacteria is frequent. In the present invention, an MLG refers to a group of genes with a consistent abundance level and taxonomic assignment.
According to the present invention, a "specific fragment" of an MLG is a fragment of the MLG that is unique to that MLG. Conventional methods can be used to determine whether a fragment is unique to the MLG from which it is derived. For example, the sequence of the fragment can be submitted to a public database (such as GenBank) and searched with BLAST. The fragment can be considered unique if it is present in only one species in the database (in which case the MLG represents or corresponds to that species), or if the database contains no homolog with at least 90% identity (such as 95% identity) to the fragment (in which case the MLG refers to an unknown species). As discussed above, an MLG generally refers to one specific microbial species (known or unknown), so a "specific fragment" of an MLG can also be regarded as a unique genomic fragment of a specific microbial species (i.e., the fragment is present only in that species).
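The uniqueness rule above can be expressed as a small decision function over database search hits. This is only one reading of the rule, under stated assumptions: the `Hit` record and `is_specific_fragment` are hypothetical names, the hits are assumed to have been parsed already from a BLAST search, and the 90% identity threshold is applied to both branches of the rule (the text states it explicitly only for the "no homolog" branch).

```python
from typing import Iterable, NamedTuple


class Hit(NamedTuple):
    """One database hit for a query fragment (hypothetical record layout)."""
    species: str
    percent_identity: float


def is_specific_fragment(hits: Iterable[Hit], min_identity: float = 90.0) -> bool:
    """Treat a fragment as unique if qualifying hits come from at most one species.

    Zero qualifying species corresponds to an MLG for an unknown species;
    exactly one corresponds to an MLG that represents that species.
    """
    species = {h.species for h in hits if h.percent_identity >= min_identity}
    return len(species) <= 1


# Hypothetical search results for three candidate fragments.
only_one_species = [Hit("species A", 99.1), Hit("species A", 97.4)]
two_species = [Hit("species A", 98.0), Hit("species B", 93.5)]
weak_second_hit = [Hit("species A", 98.0), Hit("species B", 85.0)]

results = [
    is_specific_fragment(only_one_species),
    is_specific_fragment(two_species),
    is_specific_fragment(weak_second_hit),
    is_specific_fragment([]),
]
```

A stricter threshold, such as the 95% mentioned in the text, can be supplied through `min_identity`.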
According to the present invention, the term "identity" refers to the degree of match between two polypeptides or between two nucleic acids. When the two sequences being compared have the same base or amino acid monomer subunit at a given position (for example, each of two DNA molecules has adenine at that position, or each of two polypeptides has lysine at that position), the two molecules are identical at that position. The percent identity between two sequences is the number of matching positions shared by the two sequences, divided by the total number of positions compared, multiplied by 100. For example, if 6 of 10 positions in two sequences match, the two sequences have 60% identity. Likewise, the DNA sequences CTGACT and CAGGTT have 50% identity (3 of 6 positions match). In general, two sequences are compared after being aligned to give maximum identity. Such an alignment can be produced with a computer program based on the method of Needleman et al. (J. Mol. Biol. 48:443-453, 1970), such as the Align program (DNAstar, Inc.).
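The percent identity calculation can be written out directly. The sketch below assumes the two sequences are already aligned and of equal length, with no gaps. The alignment step itself (for example, a Needleman-type global alignment) is not shown, and the function name is illustrative.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent of positions at which two pre-aligned, equal-length sequences match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / len(seq_a)


# The worked example from the text: positions 1, 3 and 6 match.
identity = percent_identity("CTGACT", "CAGGTT")  # 50.0
```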
According to the present invention, the expression "reagent for determining the level or amount of a biomarker" refers to a reagent that can be used to quantify or measure the level or amount of the biomarker in a sample. Based on the biomarker sequences provided by the present invention, such reagents can readily be designed or obtained by conventional methods well known in the art. For example, such reagents include, but are not limited to, PCR primers that can be used to quantify or measure the level or amount of a biomarker by, for example, real-time PCR; probes that can be used to quantify or measure the level or amount of a biomarker by, for example, quantitative Southern blotting; and microarrays (e.g., gene chips) that can be used to quantify or measure the level or amount of a biomarker. In addition, as known in the art, second-generation or third-generation sequencing methods can also be used to quantify or measure the level or amount of a biomarker. Such reagents can therefore also be commercially available reagents for carrying out second-generation or third-generation sequencing.
According to the present invention, the expression a primer "capable of specifically amplifying" a specific nucleic acid or sequence means that, when used for amplification (e.g., PCR amplification), the primer anneals specifically to that nucleic acid or sequence and yields a unique amplification product (i.e., it does not anneal to other nucleic acids or sequences and does not yield other by-products).
According to the present invention, the expression "a probe capable of specifically hybridizing to a specific nucleic acid or sequence" means that, when used for hybridization or detection under stringent conditions, the probe anneals to and hybridizes with that nucleic acid or sequence, but does not anneal to or hybridize with other nucleic acids or sequences.
Designing such primers or probes on the basis of a specific sequence (such as a specific MLG or a specific fragment thereof) is common general knowledge for those skilled in the art. Such knowledge can be found, for example, in various textbooks (see, e.g., J. Sambrook et al., Molecular Cloning: A Laboratory Manual, 2nd ed., Cold Spring Harbor Laboratory Press, 1989; F. M. Ausubel et al., Short Protocols in Molecular Biology, 3rd ed., John Wiley & Sons, Inc.) and in many papers, such as Buck et al. (1999) and Lowe et al. (1990).
According to the present invention, the term "second-generation sequencing method" refers to the new generation of DNA sequencing methods developed in recent years, including, for example, Illumina GA, Roche 454 and ABI SOLiD, as distinct from traditional sequencing methods such as Sanger sequencing. Second-generation methods differ from traditional methods in that they analyze the DNA sequence by sequencing-by-synthesis. Second-generation sequencing has the following advantages: 1) low cost, about 1% of the cost of traditional sequencing; 2) high throughput, with the ability to sequence multiple samples simultaneously, and a single Solexa run can generate approximately 50 billion (50G) bases of data; 3) high accuracy (greater than 98.4%), which effectively resolves the problem of reading multiple repeat sequences. Moreover, when the number of sequences to be sequenced is fixed in advance, high throughput increases the sequencing depth (for example, each sequence can be sequenced many times), which ensures the reliability of the sequencing results.
According to the present invention, the term "third-generation sequencing method" refers to the recently developed new generation of single-molecule sequencing technologies. Third-generation sequencing offers advantages over current sequencing technologies, including (i) higher throughput; (ii) faster turnaround time (e.g., sequencing a metazoan genome at high-fold coverage in minutes); (iii) longer read lengths to enhance de novo assembly and enable direct detection of haplotypes and even whole-chromosome phasing; (iv) higher consensus accuracy to enable rare variant detection; (v) small amounts of starting material (in theory only a single molecule is required for sequencing); and (vi) low cost, where sequencing the human genome at high-fold coverage for less than $100 has become a reasonable goal for the community. For more details on third-generation sequencing methods, see, for example, Eric E. Schadt et al., A window into third-generation sequencing, Human Molecular Genetics, 2010, Vol. 19, Review Issue 2, R227-R240, incorporated herein by reference.
According to the present invention, the term "relative abundance" has its conventional meaning in the art and can be calculated by methods known in the art. For example, the relative abundance of a gene (i.e., a biomarker) or of an MLG can be measured or calculated by the method disclosed in Qin, J. et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 490, 55-60 (2012) (incorporated herein by reference).
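As an illustration of what a relative abundance calculation looks like, the sketch below uses one common length-normalized definition from shotgun metagenomics: the number of reads mapped to a gene is divided by the gene length, and the result is divided by the sum of those ratios over all genes in the sample. Whether this matches the cited method in every detail is an assumption, so the cited reference should be consulted for the exact procedure. The gene names, counts and lengths are hypothetical.

```python
def relative_abundance(read_counts, gene_lengths):
    """Length-normalized relative abundance of each gene within one sample.

    read_counts:  gene id -> number of reads mapped to the gene
    gene_lengths: gene id -> gene length in base pairs
    """
    per_base = {g: read_counts[g] / gene_lengths[g] for g in read_counts}
    total = sum(per_base.values())
    if total == 0:
        raise ValueError("no reads mapped to any gene")
    return {g: value / total for g, value in per_base.items()}


# Hypothetical sample with three genes.
counts = {"gene_a": 100, "gene_b": 300, "gene_c": 50}
lengths = {"gene_a": 1000, "gene_b": 1500, "gene_c": 500}

abundance = relative_abundance(counts, lengths)
# Per-base coverage is 0.1, 0.2 and 0.1, so the shares are about 0.25, 0.5 and 0.25.
```

Dividing by gene length keeps long genes from appearing more abundant simply because they collect more reads.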
Those skilled in the art will understand that the above definitions are provided for a better understanding of the present invention and are not intended to limit the invention, except as outlined in the claims.
In a first aspect, the present invention provides a biomarker panel for predicting or diagnosing a microbiota-related disease in a subject, or for determining whether a subject is at risk of developing said disease, comprising the following biomarkers (all biomarkers are listed in Table 7):
(1) MLG 317 or one or more specific fragments thereof, wherein MLG 317 consists of SEQ ID NOs: 933-1052;
(2) MLG 3770 or one or more specific fragments thereof, wherein MLG 3770 consists of SEQ ID NOs: 1053-1281; and
(3) MLG 3840 or one or more specific fragments thereof, wherein MLG 3840 consists of SEQ ID NOs: 238-639;
Optionally, the biomarker panel further comprises one or more of the following biomarkers:
(4) MLG 665 or one or more specific fragments thereof, wherein MLG 665 consists of SEQ ID NOs: 640-932;
(5) MLG 721 or one or more specific fragments thereof, wherein MLG 721 consists of SEQ ID NOs: 120-237;
(6) MLG 1738 or one or more specific fragments thereof, wherein MLG 1738 consists of SEQ ID NOs: 1471-2436;
(7) MLG 1340 or one or more specific fragments thereof, wherein MLG 1340 consists of SEQ ID NOs: 2893-3067;
(8) MLG 5954 or one or more specific fragments thereof, wherein MLG 5954 consists of SEQ ID NOs: 1-119;
(9) MLG 711 or one or more specific fragments thereof, wherein MLG 711 consists of SEQ ID NOs: 1282-1470; and
(10) MLG 4668 or one or more specific fragments thereof, wherein MLG 4668 consists of SEQ ID NOs: 2437-2892.
In a preferred embodiment, the biomarker panel of the present invention comprises the biomarkers as defined in (1)-(8).
As known to those skilled in the art, a specific fragment may be of any length, as long as the fragment is unique to the MLG from which it is derived or to the species represented by that MLG (i.e., the fragment is not present in other MLGs or other species). For convenience, however, the specific fragment may be at least 30 bp, or at least 40 bp, or at least 50 bp, or at least 60 bp, or at least 70 bp, or at least 80 bp, or at least 90 bp, or at least 100 bp, or at least 150 bp, or at least 200 bp, or at least 250 bp, or at least 300 bp, or at least 350 bp, or at least 400 bp, or at least 450 bp, or at least 500 bp, or at least 600 bp, or at least 700 bp, or at least 800 bp, or at least 900 bp, or at least 1000 bp, or at least 1500 bp, or at least 2000 bp in length.
例如,在优选实施方案中,本发明的生物标志物组还可由以下项中的任一项或多项表征:For example, in preferred embodiments, the biomarker panels of the present invention may also be characterized by any one or more of the following:
(1)所述MLG 5954的一个或多个特异性片段选自SEQ ID NO:1-119或其任意组合;(1) the one or more specific fragments of MLG 5954 are selected from SEQ ID NOs: 1-119 or any combination thereof;
(2)所述MLG 721的一个或多个特异性片段选自SEQ ID NO:120-237或其任意组合;(2) the one or more specific fragments of MLG 721 are selected from SEQ ID NOs: 120-237 or any combination thereof;
(3)所述MLG 3840的一个或多个特异性片段选自SEQ ID NO:238-639或其任意组合;(3) the one or more specific fragments of MLG 3840 are selected from SEQ ID NOs: 238-639 or any combination thereof;
(4)所述MLG 665的一个或多个特异性片段选自SEQ ID NO:640-932或其任意组合;(4) the one or more specific fragments of MLG 665 are selected from SEQ ID NOs: 640-932 or any combination thereof;
(5)所述MLG 317的一个或多个特异性片段选自SEQ ID NO:933-1052或其任意组合;(5) the one or more specific fragments of MLG 317 are selected from SEQ ID NOs: 933-1052 or any combination thereof;
(6)所述MLG 3770的一个或多个特异性片段选自SEQ ID NO:1053-1281或其任意组合;(6) the one or more specific fragments of MLG 3770 are selected from SEQ ID NOs: 1053-1281 or any combination thereof;
(7)所述MLG 711的一个或多个特异性片段选自SEQ ID NO:1282-1470或其任意组合;(7) the one or more specific fragments of MLG 711 are selected from SEQ ID NOs: 1282-1470 or any combination thereof;
(8)所述MLG 1738的一个或多个特异性片段选自SEQ ID NO:1471-2436或其任意组合;(8) the one or more specific fragments of MLG 1738 are selected from SEQ ID NOs: 1471-2436 or any combination thereof;
(9)所述MLG 4668的一个或多个特异性片段选自SEQ ID NO:2437-2892或其任意组合;和(9) the one or more specific fragments of MLG 4668 are selected from SEQ ID NOs: 2437-2892 or any combination thereof; and
(10)所述MLG 1340的一个或多个特异性片段选自SEQ ID NO:2893-3067或其任意组合。(10) the one or more specific fragments of MLG 1340 are selected from SEQ ID NOs: 2893-3067 or any combination thereof.
在优选实施方案中,所述疾病是结直肠中的进展性腺瘤。In a preferred embodiment, the disease is advanced adenoma in the colorectum.
在优选实施方案中,所述受试者是哺乳动物,诸如灵长类动物,优选为人。In preferred embodiments, the subject is a mammal, such as a primate, preferably a human.
在优选实施方案中,本发明的生物标志物组用于区分患有进展性腺瘤的患者与健康受试者。In a preferred embodiment, the biomarker panels of the present invention are used to distinguish patients with advanced adenomas from healthy subjects.
在第二方面,本发明提供了试剂盒,所述试剂盒用于在受试者中预测或诊断与微生物群相关的疾病,或确定受试者是否具有处于形成所述疾病的风险,其包含用于测定根据本发明的生物标志物组的每种生物标志物在样品中的水平或其量的试剂。In a second aspect, the present invention provides a kit for predicting or diagnosing a disease associated with a microbiota in a subject, or determining whether a subject is at risk of developing the disease, comprising reagents for determining the level or amount of each biomarker of a biomarker panel according to the present invention in a sample.
在优选实施方案中,用于测定所述生物标志物组的每种生物标志物的水平或其量的试剂选自:In a preferred embodiment, the reagents used to determine the level or amount of each biomarker of the biomarker panel are selected from the group consisting of:
(a)引物组,其包含:(a) A primer set comprising:
(a1)能够特异性扩增MLG 317或其一个或多个特异性片段的一种或多种引物,所述MLG 317由SEQ ID NO:933-1052组成;(a1) one or more primers capable of specifically amplifying MLG 317 or one or more specific fragments thereof, wherein MLG 317 consists of SEQ ID NOs: 933-1052;
(a2)能够特异性扩增MLG 3770或其一个或多个特异性片段的一种或多种引物,所述MLG 3770由SEQ ID NO:1053-1281组成;和(a2) one or more primers capable of specifically amplifying MLG 3770 or one or more specific fragments thereof, said MLG 3770 consisting of SEQ ID NOs: 1053-1281; and
(a3)能够特异性扩增MLG 3840或其一个或多个特异性片段的一种或多种引物,所述MLG 3840由SEQ ID NO:238-639组成;(a3) one or more primers capable of specifically amplifying MLG 3840 or one or more specific fragments thereof, wherein MLG 3840 consists of SEQ ID NOs: 238-639;
任选地,所述引物组还包含以下引物的一种或多种:Optionally, the primer set further comprises one or more of the following primers:
(a4)能够特异性扩增MLG 665或其一个或多个特异性片段的一种或多种引物,所述MLG 665由SEQ ID NO:640-932组成;(a4) one or more primers capable of specifically amplifying MLG 665 or one or more specific fragments thereof, wherein MLG 665 consists of SEQ ID NOs: 640-932;
(a5)能够特异性扩增MLG 721或其一个或多个特异性片段的一种或多种引物,所述MLG 721由SEQ ID NO:120-237组成;(a5) one or more primers capable of specifically amplifying MLG 721 or one or more specific fragments thereof, wherein MLG 721 consists of SEQ ID NOs: 120-237;
(a6)能够特异性扩增MLG 1738或其一个或多个特异性片段的一种或多种引物,所述MLG 1738由SEQ ID NO:1471-2436组成;(a6) one or more primers capable of specifically amplifying MLG 1738 or one or more specific fragments thereof, wherein MLG 1738 consists of SEQ ID NOs: 1471-2436;
(a7)能够特异性扩增MLG 1340或其一个或多个特异性片段的一种或多种引物,所述MLG 1340由SEQ ID NO:2893-3067组成;(a7) one or more primers capable of specifically amplifying MLG 1340 or one or more specific fragments thereof, wherein MLG 1340 consists of SEQ ID NOs: 2893-3067;
(a8)能够特异性扩增MLG 5954或其一个或多个特异性片段的一种或多种引物,所述MLG 5954由SEQ ID NO:1-119组成;(a8) one or more primers capable of specifically amplifying MLG 5954 or one or more specific fragments thereof, wherein MLG 5954 consists of SEQ ID NOs: 1-119;
(a9)能够特异性扩增MLG 711或其一个或多个特异性片段的一种或多种引物,所述MLG 711由SEQ ID NO:1282-1470组成;和(a9) one or more primers capable of specifically amplifying MLG 711 or one or more specific fragments thereof, wherein MLG 711 consists of SEQ ID NOs: 1282-1470; and
(a10)能够特异性扩增MLG 4668或其一个或多个特异性片段的一种或多种引物,所述MLG 4668由SEQ ID NO:2437-2892组成;(a10) one or more primers capable of specifically amplifying MLG 4668 or one or more specific fragments thereof, wherein MLG 4668 consists of SEQ ID NOs: 2437-2892;
(b)探针组,其包含:(b) a probe set comprising:
(b1)能够与MLG 317或其一个或多个特异性片段特异性杂交的一种或多种探针,所述MLG 317由SEQ ID NO:933-1052组成;(b1) one or more probes capable of specifically hybridizing to MLG 317 or one or more specific fragments thereof, wherein MLG 317 consists of SEQ ID NOs: 933-1052;
(b2)能够与MLG 3770或其一个或多个特异性片段特异性杂交的一种或多种探针,所述MLG 3770由SEQ ID NO:1053-1281组成;和(b2) one or more probes capable of specifically hybridizing to MLG 3770 or one or more specific fragments thereof, said MLG 3770 consisting of SEQ ID NOs: 1053-1281; and
(b3)能够与MLG 3840或其一个或多个特异性片段特异性杂交的一种或多种探针,所述MLG 3840由SEQ ID NO:238-639组成;(b3) one or more probes capable of specifically hybridizing to MLG 3840 or one or more specific fragments thereof, wherein MLG 3840 consists of SEQ ID NOs: 238-639;
任选地,所述探针组还包含以下探针的一种或多种:Optionally, the probe set further comprises one or more of the following probes:
(b4)能够与MLG 665或其一个或多个特异性片段特异性杂交的一种或多种探针,所述MLG 665由SEQ ID NO:640-932组成;(b4) one or more probes capable of specifically hybridizing to MLG 665 or one or more specific fragments thereof, wherein MLG 665 consists of SEQ ID NOs: 640-932;
(b5)能够与MLG 721或其一个或多个特异性片段特异性杂交的一种或多种探针,所述MLG 721由SEQ ID NO:120-237组成;(b5) one or more probes capable of specifically hybridizing to MLG 721 or one or more specific fragments thereof, wherein MLG 721 consists of SEQ ID NOs: 120-237;
(b6)能够与MLG 1738或其一个或多个特异性片段特异性杂交的一种或多种探针,所述MLG 1738由SEQ ID NO:1471-2436组成;(b6) one or more probes capable of specifically hybridizing to MLG 1738 or one or more specific fragments thereof, wherein MLG 1738 consists of SEQ ID NOs: 1471-2436;
(b7)能够与MLG 1340或其一个或多个特异性片段特异性杂交的一种或多种探针,所述MLG 1340由SEQ ID NO:2893-3067组成;(b7) one or more probes capable of specifically hybridizing to MLG 1340 or one or more specific fragments thereof, wherein MLG 1340 consists of SEQ ID NOs: 2893-3067;
(b8)能够与MLG 5954或其一个或多个特异性片段特异性杂交的一种或多种探针,所述MLG 5954由SEQ ID NO:1-119组成;(b8) one or more probes capable of specifically hybridizing to MLG 5954 or one or more specific fragments thereof, wherein MLG 5954 consists of SEQ ID NOs: 1-119;
(b9)能够与MLG 711或其一个或多个特异性片段特异性杂交的一种或多种探针,所述MLG 711由SEQ ID NO:1282-1470组成;和(b9) one or more probes capable of specifically hybridizing to MLG 711 or one or more specific fragments thereof, wherein MLG 711 consists of SEQ ID NOs: 1282-1470; and
(b10)能够与MLG 4668或其一个或多个特异性片段特异性杂交的一种或多种探针,所述MLG 4668由SEQ ID NO:2437-2892组成;(b10) one or more probes capable of specifically hybridizing to MLG 4668 or one or more specific fragments thereof, wherein MLG 4668 consists of SEQ ID NOs: 2437-2892;
(c)包含(a)的引物组和/或(b)的探针组的微阵列;(c) a microarray comprising the primer set of (a) and/or the probe set of (b);
(d)进行第二代测序方法或第三代测序方法的试剂;和(d) reagents for performing a second generation sequencing method or a third generation sequencing method; and
(e)(a)-(d)的任意组合。(e) Any combination of (a)-(d).
在优选实施方案中,所述引物组包含如(a1)-(a8)中定义的引物。In a preferred embodiment, the primer set comprises primers as defined in (a1) to (a8).
在优选实施方案中,所述探针组包含如(b1)-(b8)中定义的探针。In a preferred embodiment, the probe set comprises probes as defined in (b1) to (b8).
在优选实施方案中,该试剂盒通过包括以下步骤的方法,在受试者中预测或诊断与微生物群相关的疾病,或确定受试者是否具有形成所述疾病的风险:In a preferred embodiment, the kit predicts or diagnoses a disease associated with a microbiota in a subject, or determines whether the subject has a risk of developing the disease, by a method comprising the following steps:
(1)使用所述试剂盒来测定来自所述受试者的样品中的根据本发明的生物标志物组的每种生物标志物的水平或其量;(1) using the kit to determine the level or amount of each biomarker of the biomarker panel according to the present invention in a sample from the subject;
(2)通过使用多元统计模型(诸如随机森林模型)将所述样品中的每种生物标志物的水平或其量与训练数据集进行比较来计算所述疾病的概率;(2) calculating the probability of the disease by comparing the level or amount of each biomarker in the sample with a training data set using a multivariate statistical model (such as a random forest model);
其中当所述疾病的概率大于临界值时,表明所述受试者患有所述疾病或具有形成所述疾病的风险。When the probability of the disease is greater than a critical value, it indicates that the subject suffers from the disease or has a risk of developing the disease.
在优选实施方案中,所述训练数据集包含关于多个患有所述疾病的受试者和多个健康受试者的每种生物标志物的水平或其量的数据。In a preferred embodiment, the training dataset comprises data on the level of each biomarker or the amount thereof in a plurality of subjects suffering from the disease and a plurality of healthy subjects.
在优选实施方案中,所述训练数据集包含表8中的数据,并且当概率大于临界值0.4572时,表明所述受试者患有所述疾病或具有形成所述疾病的风险。In a preferred embodiment, the training data set comprises the data in Table 8, and when the probability is greater than a critical value of 0.4572, it indicates that the subject has the disease or has a risk of developing the disease.
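The decision step described above can be sketched as a simple threshold test. This is a minimal illustration only: the function name `at_risk` is hypothetical, and the probability is assumed to come from a multivariate model (such as a random forest) that has already been trained elsewhere. The cutoff of 0.4572 is the value stated in the text for the Table 8 training data set.

```python
# Hypothetical helper, not part of the patent: applies the stated critical value
# to a disease probability produced by an already-trained model.
CUTOFF = 0.4572  # critical value given in the text for the Table 8 training set


def at_risk(probability: float, cutoff: float = CUTOFF) -> bool:
    """Return True when the disease probability is greater than the cutoff."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return probability > cutoff


if __name__ == "__main__":
    print(at_risk(0.61))  # True
    print(at_risk(0.30))  # False
```

The comparison is strict ("greater than"), following the wording of this embodiment; a probability exactly equal to the cutoff is not flagged.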
在优选实施方案中,所述受试者是哺乳动物,例如灵长类动物,优选为人。In preferred embodiments, the subject is a mammal, such as a primate, preferably a human.
在优选实施方案中,所述样品是粪便样品。In a preferred embodiment, the sample is a stool sample.
在优选实施方案中,所述每种生物标志物的水平或其量是所述样品中每种生物标志物的相对丰度。In a preferred embodiment, the level of each biomarker or the amount thereof is the relative abundance of each biomarker in the sample.
在优选实施方案中,所述疾病是结直肠中的进展性腺瘤。In a preferred embodiment, the disease is advanced adenoma in the colorectum.
在优选实施方案中,所述试剂盒还包含另外的试剂,诸如用于处理所述样品的试剂(例如无菌水),用于进行PCR扩增的试剂(例如聚合酶、dNTP和扩增缓冲液),以及用于进行杂交的试剂(诸如标记缓冲液、杂交缓冲液和洗涤缓冲液)。In a preferred embodiment, the kit further comprises additional reagents, such as reagents for processing the sample (e.g., sterile water), reagents for performing PCR amplification (e.g., polymerase, dNTPs, and amplification buffer), and reagents for performing hybridization (such as labeling buffer, hybridization buffer, and wash buffer).
在第三方面,本发明提供了用于测定根据本发明的生物标志物组的每种生物标志物的水平或其量的试剂在制备试剂盒中的用途,所述试剂盒用于在受试者中预测或诊断与微生物群相关的疾病或确定受试者是否具有形成所述疾病的风险。In a third aspect, the present invention provides the use of a reagent for determining the level or amount of each biomarker of a biomarker panel according to the present invention in the preparation of a kit for predicting or diagnosing a disease associated with a microbiome in a subject or determining whether the subject has a risk of developing the disease.
在优选实施方案中,所述用于测定所述生物标志物组的每种生物标志物的水平或其量的试剂是如上所定义的。In a preferred embodiment, the reagents for determining the level or amount of each biomarker of the biomarker panel are as defined above.
在优选实施方案中,所述试剂盒通过包括以下步骤的方法,在受试者中预测或诊断与微生物群相关的疾病,或确定受试者是否具有形成所述疾病的风险:In a preferred embodiment, the kit predicts or diagnoses a disease associated with a microbiota in a subject, or determines whether the subject has a risk of developing the disease, by a method comprising the following steps:
(1)使用所述试剂盒来测定样品中根据本发明的生物标志物组的每种生物标志物的水平或其量;(1) using the kit to determine the level or amount of each biomarker of the biomarker panel according to the present invention in a sample;
(2)通过使用多元统计模型(诸如随机森林模型)将所述样品中的每种生物标志物的水平或其量与训练数据集进行比较来计算所述疾病的概率;(2) calculating the probability of the disease by comparing the level or amount of each biomarker in the sample with a training data set using a multivariate statistical model (such as a random forest model);
其中当所述疾病的概率大于临界值时,表明所述受试者患有所述疾病或具有形成所述疾病的风险。When the probability of the disease is greater than a critical value, it indicates that the subject suffers from the disease or has a risk of developing the disease.
在优选实施方案中,所述训练数据集包含关于多个患有所述疾病的受试者和多个健康受试者的每种生物标志物的水平或其量的数据。In a preferred embodiment, the training dataset comprises data on the level of each biomarker or the amount thereof in a plurality of subjects suffering from the disease and a plurality of healthy subjects.
在优选实施方案中,所述训练数据集包含表8中的数据,并且当所述疾病的概率大于临界值0.4572时,表明所述受试者患有所述疾病或具有形成所述疾病的风险。In a preferred embodiment, the training data set comprises the data in Table 8, and when the probability of the disease is greater than a critical value of 0.4572, it indicates that the subject has the disease or has a risk of developing the disease.
在优选实施方案中,所述受试者是哺乳动物,例如灵长类动物,优选为人。In preferred embodiments, the subject is a mammal, such as a primate, preferably a human.
在优选实施方案中,所述样品是粪便样品。In a preferred embodiment, the sample is a stool sample.
在优选实施方案中,所述每种生物标志物的水平或其量是所述样品中每种生物标志物的相对丰度。In a preferred embodiment, the level of each biomarker or the amount thereof is the relative abundance of each biomarker in the sample.
在优选实施方案中,所述疾病是结直肠中的进展性腺瘤。In a preferred embodiment, the disease is advanced adenoma in the colorectum.
在优选实施方案中,所述试剂盒还包含另外的试剂,诸如用于处理样品的试剂(例如无菌水),用于进行PCR扩增的试剂(例如聚合酶、dNTP和扩增缓冲液),以及用于进行杂交的试剂(诸如标记缓冲液、杂交缓冲液和洗涤缓冲液)。In a preferred embodiment, the kit further comprises additional reagents, such as reagents for handling the sample (e.g., sterile water), reagents for performing PCR amplification (e.g., polymerase, dNTPs, and amplification buffer), and reagents for performing hybridization (such as labeling buffer, hybridization buffer, and wash buffer).
在第四方面,本发明提供了用于在受试者中预测或诊断与微生物群相关的疾病或确定受试者是否具有形成所述疾病的风险的方法,其包括以下步骤:In a fourth aspect, the present invention provides a method for predicting or diagnosing a disease associated with a microbiota in a subject or determining whether the subject has a risk of developing the disease, comprising the steps of:
(1)测定来自所述受试者的样品中根据权利要求1至6中任一项所述的生物标志物组的每种生物标志物的水平或其量;(1) determining the level or amount of each biomarker of the biomarker panel according to any one of claims 1 to 6 in a sample from the subject;
(2)通过使用多元统计模型(如随机森林模型)将所述样品中的每个生物标志物的水平或其量与训练数据集进行比较来计算所述疾病的概率;(2) calculating the probability of the disease by comparing the level or amount of each biomarker in the sample with a training dataset using a multivariate statistical model (such as a random forest model);
其中当所述疾病的概率大于临界值时,表明所述受试者患有所述疾病或具有形成所述疾病的风险。When the probability of the disease is greater than a critical value, it indicates that the subject suffers from the disease or has a risk of developing the disease.
在优选实施方案中,所述训练数据集包含关于多个患有所述疾病的受试者以及多个健康受试者的每种生物标志物的水平或其量的数据。In a preferred embodiment, the training dataset comprises data on the level of each biomarker or the amount thereof in a plurality of subjects suffering from the disease and a plurality of healthy subjects.
在优选实施方案中,所述训练数据集包含表8中的数据,并且当所述疾病的概率大于临界值0.4572时,表明所述受试者患有所述疾病或具有形成所述疾病的风险。In a preferred embodiment, the training data set comprises the data in Table 8, and when the probability of the disease is greater than a critical value of 0.4572, it indicates that the subject has the disease or has a risk of developing the disease.
在优选实施方案中,在步骤(1)中使用如上定义的试剂盒或如上定义的试剂。In a preferred embodiment, a kit as defined above or a reagent as defined above is used in step (1).
在优选实施方案中,所述受试者是哺乳动物,例如灵长类动物,优选为人。In preferred embodiments, the subject is a mammal, such as a primate, preferably a human.
在优选实施方案中,所述样品是粪便样品。In a preferred embodiment, the sample is a stool sample.
在优选实施方案中,所述每种生物标志物的水平或其量是所述样品中每种生物标志物的相对丰度。In a preferred embodiment, the level of each biomarker or the amount thereof is the relative abundance of each biomarker in the sample.
在优选实施方案中,所述疾病是结直肠中的进展性腺瘤。In a preferred embodiment, the disease is advanced adenoma in the colorectum.
在优选实施方案中,在体外进行所述方法。In a preferred embodiment, the method is performed in vitro.
在以下非限制性实施例中进一步举例说明本发明。除非另有说明,否则部分和百分比以重量计,度为摄氏度。所用的试剂皆是商购可得的。对于本领域普通技术人员而言显而易见的是,这些实施例虽然表示本发明的优选实施方案,但仅以说明的方式给出。The present invention is further illustrated in the following non-limiting examples. Unless otherwise indicated, parts and percentages are by weight and degrees are in degrees Celsius. All reagents used are commercially available. It will be apparent to one of ordinary skill in the art that these examples, while representing preferred embodiments of the present invention, are provided by way of illustration only.
实施例Example
实施例1.鉴定和验证用于评估CRC相关疾病风险的生物标志物Example 1. Identification and validation of biomarkers for assessing CRC-related disease risk
1.样品收集和测序1. Sample Collection and Sequencing
1.1受试者和患者1.1 Subjects and Patients
在依照CRC国家筛选建议(Stadlmayr,A.et al.Nonalcoholic fatty liverdisease:an independent risk factor for colorectal neoplasia.J Intern Med270,41-49(2011),通过引用并入本文)进行的一个健康筛查程序中的那些参与者以及2010年至2012年期间在Oberndorf医院内科(奥地利萨尔斯堡Paracelsus医科大学的教学医院)进行过结肠镜检查(作为临床检查的部分)的那些疑似患有CRC的患者中进行研究。本研究获得当地伦理委员会(Ethikkommission des Landes Salzburg,批准号415-E/1262/2-2010)的批准,并获得所有参与者的知情同意书。The study was conducted in participants of a health screening program according to national screening recommendations for CRC (Stadlmayr, A. et al. Nonalcoholic fatty liver disease: an independent risk factor for colorectal neoplasia. J Intern Med 270, 41-49 (2011), incorporated herein by reference) and in patients suspected of having CRC who underwent a colonoscopy as part of a clinical workup at the Department of Internal Medicine of Oberndorf Hospital (teaching hospital of the Paracelsus Medical University of Salzburg, Austria) between 2010 and 2012. The study was approved by the local ethics committee (Ethikkommission des Landes Salzburg, approval number 415-E/1262/2-2010), and informed consent was obtained from all participants.
将泻药(含有聚乙二醇59.0g、硫酸钠5.68g、碳酸氢钠1.68g、NaCl1.46g和氯化钾0.74g;Norgine,Marburg,德国)用于肠道准备,然后进行结肠镜检查。基于肉眼检查和组织学检测结果的组合分析,结肠镜检查结果被分为管状腺瘤、进展性腺瘤(即绒毛状或管状绒毛状特征,大小≥1cm或高度发育异常)或癌(Bond,J.H.Polyp guideline:diagnosis,treatment,and surveillance for patients with colorectal polypsACGColorectal Polyp Guideline.Am.J.Gastroenterol.95,3053-3063(2000),Winawer SJ&AG.,Z.The advanced adenoma as the primary target of screening.GastrointestEndosc Clin N Am12,1-9(2002),通过引用并入本文)。根据位置(即右结肠(包括盲肠、升结肠和横结肠),左结肠(从脾曲到乙状结肠),以及单独的直肠)对病灶进行分类。A laxative (containing polyethylene glycol 59.0 g, sodium sulfate 5.68 g, sodium bicarbonate 1.68 g, NaCl 1.46 g, and potassium chloride 0.74 g; Norgine, Marburg, Germany) was used for bowel preparation before colonoscopy. Based on a combined analysis of macroscopic and histological findings, colonoscopy findings were classified as tubular adenoma, advanced adenoma (i.e., villous or tubulovillous features, size ≥1 cm or high-grade dysplasia), or carcinoma (Bond, J.H. Polyp guideline: diagnosis, treatment, and surveillance for patients with colorectal polyps ACG Colorectal Polyp Guideline. Am. J. Gastroenterol. 95, 3053-3063 (2000), Winawer SJ & AG., Z. The advanced adenoma as the primary target of screening. Gastrointest Endosc Clin N Am 12, 1-9 (2002), incorporated herein by reference). Lesions were classified according to location (ie, right colon (including cecum, ascending colon, and transverse colon), left colon (from splenic flexure to sigmoid colon), and rectum alone).
初始分析囊括了来自147名年龄在45至86岁之间的白种人的数据,其中包括57名健康对照(24名男性,33名女性),44例进展性腺瘤患者(女性22例,男性22例)和46例癌患者(18例男性,28例女性)(表1-1)。另外9个样品(6个健康对照,3个进展性腺瘤样品,表1-1)也被用于基于MLG的癌分类器的测试集(图1d)。到目前为止,还没有研究以可比较的方式探究过上述给定的主题;因此,无法进行用于样品量计算的正式效能分析(formal poweranalysis)。但是,根据以前的16S和宏基因组鸟枪法测序对病人的粪便微生物的研究来判断,这是合理的样品量。将受试者在性别、年龄和体重指数(BMI)方面进行分层,以使得三组(对照组、进展性腺瘤组、癌组)在这些变量上可比较。在进展性腺瘤组中,14例的病灶位于右结肠(包括盲肠、升结肠和横结肠),15例的病灶位于左结肠(从脾曲至乙状结肠),15例的病灶位于直肠。在癌组中,8例的病灶位于右结肠,11例的病灶位于左结肠,27例的病灶位于直肠。结直肠癌由美国癌症联合委员会(AJCC)TNM分期系统(Greene,F.L.Current TNMstaging of colorectal cancer.Lancet.Oncol.8,572-3(2007),通过引用并入本文)进行分类。The initial analysis included data from 147 Caucasian individuals aged 45 to 86 years, including 57 healthy controls (24 men, 33 women), 44 patients with advanced adenoma (22 women, 22 men), and 46 patients with cancer (18 men, 28 women) (Table 1-1). An additional 9 samples (6 healthy controls, 3 advanced adenoma samples, Table 1-1) were also used in the test set of the MLG-based cancer classifier (Figure 1d). To date, no studies have explored the given topic in a comparable manner; therefore, a formal power analysis for sample size calculation could not be performed. However, based on previous studies of patient fecal microbiota using 16S and metagenomic shotgun sequencing, this was a reasonable sample size. Subjects were stratified by sex, age, and body mass index (BMI) to make the three groups (control group, advanced adenoma group, and cancer group) comparable on these variables. In the advanced adenoma group, 14 cases had lesions located in the right colon (including the cecum, ascending colon, and transverse colon), 15 cases had lesions located in the left colon (from the splenic flexure to the sigmoid colon), and 15 cases had lesions located in the rectum. In the carcinoma group, 8 cases had lesions located in the right colon, 11 cases had lesions located in the left colon, and 27 cases had lesions located in the rectum. 
Colorectal cancer is classified by the American Joint Committee on Cancer (AJCC) TNM staging system (Greene, F.L. Current TNM staging of colorectal cancer. Lancet. Oncol. 8, 572-3 (2007), incorporated herein by reference).
表1-1:所有156个样品的临床资料Table 1-1: Clinical data of all 156 samples
1.2粪便样品1.2 Stool samples
从所有患者和受试者收集新鲜粪便样品。样品用无菌刮刀机械匀化,然后使用Sarstedt粪便取样系统(Sarstedt,Nümbrecht,德国)取4份等分试样。每个等分试样含有1g粪便并放置在无菌12ml冻存管中。然后将粪便等分试样储存在-20℃家用冰箱中,并在收集后的48小时内将其放置在冷藏包中运送至实验室,然后立即将其储存在-80℃。所有患者和受试者在过去3个月内没有接受过益生菌或抗生素。Fresh stool samples were collected from all patients and subjects. The samples were mechanically homogenized with a sterile spatula, and four aliquots were collected using a Sarstedt stool sampling system (Sarstedt, Nümbrecht, Germany). Each aliquot contained 1 g of stool and was placed in a sterile 12-ml cryovial. The stool aliquots were then stored in a −20°C home freezer and transported to the laboratory in cold packs within 48 hours of collection, where they were immediately stored at −80°C. None of the patients and subjects had received probiotics or antibiotics within the previous 3 months.
1.3 DNA的提取1.3 DNA extraction
将粪便样品在冰上解冻,并按照制造商的说明书使用Qiagen QIAamp DNA StoolMini试剂盒(Qiagen)进行DNA提取。提取物用不含DNA酶的RNA酶处理以消除RNA污染。使用NanoDrop分光光度计,Qubit荧光计(使用Quant-iTTMdsDNA BR Assay试剂盒)和凝胶电泳测定DNA量。Fecal samples were thawed on ice and DNA was extracted using the Qiagen QIAamp DNA Stool Mini Kit (Qiagen) according to the manufacturer's instructions. The extracts were treated with DNase-free RNase to eliminate RNA contamination. DNA quantity was determined using a NanoDrop spectrophotometer, a Qubit fluorometer (using the Quant-iT™ dsDNA BR Assay Kit), and gel electrophoresis.
1.4宏基因组测序和基因目录的构建1.4 Metagenomic sequencing and gene catalog construction
在Illumina平台(插入片段大小为350bp,读段(read)长度为100bp)上进行双末端宏基因组测序(paired-end metagenomic sequencing),并且如前所述(Qin等,2012,同上)使用SOAPdenovo v2.04(除对于-K 51 -M 3 -F -u外,使用缺省参数)(Luo,R.等SOAPdenovo2:an empirically improved memory-efficient short-read de novo assembler.Gigascience 1,18(2012),通过引用并入本文)对测序读段(read)进行质量控制并从头组装成重叠群(contig)。从头组装高质量的测序读段(平均每个样品含有5GB数据量),将鉴定的基因编入3.5M非冗余基因集,这使得平均每个样品中有76.3%的读段可以匹配上。Paired-end metagenomic sequencing was performed on the Illumina platform (insert size 350 bp, read length 100 bp). The sequencing reads were quality-controlled and assembled de novo into contigs as previously described (Qin et al., 2012, supra) using SOAPdenovo v2.04 (default parameters except for -K 51 -M 3 -F -u) (Luo, R. et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience 1, 18 (2012), incorporated herein by reference). The high-quality sequencing reads (on average 5 GB of data per sample) were assembled de novo, and the identified genes were compiled into a non-redundant set of 3.5 million genes, to which an average of 76.3% of the reads in each sample could be mapped.
使用GeneMark v2.7d对组装的重叠群进行基因预测。使用BLAT(Kent,W.J.BLAT--the BLAST-like alignment tool.Genome Res.12,656-64(2002),通过引用并入本文),除去冗余基因,其中以90%重叠度和95%同一性(不允许有缺口)作为临界值。通过使用与Qin等2012(同上)中相同的程序,将高质量的测序读段与基因目录进行比对来测定基因的相对丰度。Gene prediction was performed on the assembled contigs using GeneMark v2.7d. Redundant genes were removed using BLAT (Kent, W.J. BLAT--the BLAST-like alignment tool. Genome Res. 12, 656-64 (2002), incorporated herein by reference), with cutoffs of 90% overlap and 95% identity (no gaps allowed). The relative abundance of each gene was determined by aligning the high-quality sequencing reads to the gene catalog, using the same procedure as in Qin et al. 2012 (supra).
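The redundancy filter described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used: the `Hit` record, the function name `non_redundant`, and the greedy longest-first strategy are all hypothetical, and the pairwise alignments are assumed to have been produced beforehand (for example by BLAT). Only the two cutoffs, 95% identity and 90% overlap, come from the text.

```python
from typing import Dict, Iterable, NamedTuple, Set


class Hit(NamedTuple):
    """One pairwise gene-to-gene alignment (hypothetical record layout)."""
    query: str       # gene being tested for redundancy
    target: str      # gene it aligns to
    identity: float  # percent identity of the alignment
    overlap: float   # percent of the query covered by the alignment


def non_redundant(gene_lengths: Dict[str, int], hits: Iterable[Hit],
                  min_identity: float = 95.0,
                  min_overlap: float = 90.0) -> Set[str]:
    """Greedy filter: visit genes from longest to shortest and drop any gene
    that matches an already-kept gene at or above both cutoffs."""
    matches: Dict[str, Set[str]] = {}
    for h in hits:
        if (h.query != h.target and h.identity >= min_identity
                and h.overlap >= min_overlap):
            matches.setdefault(h.query, set()).add(h.target)
    kept: Set[str] = set()
    for gene in sorted(gene_lengths, key=lambda g: (-gene_lengths[g], g)):
        if not (matches.get(gene, set()) & kept):
            kept.add(gene)
    return kept


if __name__ == "__main__":
    lengths = {"g1": 900, "g2": 850, "g3": 300}
    hits = [Hit("g2", "g1", 97.0, 95.0),   # redundant with g1
            Hit("g3", "g1", 99.0, 60.0)]   # overlap too low, so g3 is kept
    print(sorted(non_redundant(lengths, hits)))  # ['g1', 'g3']
```

Keeping the longest representative of each redundant group is a common convention in gene-catalog construction, but the text does not say which member is retained, so that choice is an assumption here.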
根据IMG数据库(v400),并使用先前详述(Qin等2012,同上)的内部流程,利用80%的重叠和65%的同一性,前10%的评分(BLASTN v2.2.24,-e 0.01 -b 100 -K 1 -F T -m 8)对预测基因进行分类学分配。分配至门时,临界值为65%的同一性,分配至属时,临界值为85%同一性,分配至种时,临界值为95%的同一性;如果存在多次命中(multiple hits),则对于存在该疑问的分类群,其临界值为≥50%的一致性。The predicted genes were assigned taxonomically against the IMG database (v400) using an in-house pipeline described previously (Qin et al. 2012, supra), requiring 80% overlap and 65% identity and retaining hits within the top 10% of scores (BLASTN v2.2.24, -e 0.01 -b 100 -K 1 -F T -m 8). The identity cutoffs were 65% for assignment to phylum, 85% for genus, and 95% for species; where multiple hits were present, a consensus of ≥50% was required for the taxon in question.
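The rank cutoffs above can be expressed as a small lookup, sketched here for illustration. The function names `deepest_rank` and `consensus_taxon` are hypothetical, and the reading of the ≥50% rule as "the most frequent taxon among the hits must account for at least half of them" is an assumption about how the in-house pipeline resolves multiple hits.

```python
from collections import Counter
from typing import List, Optional

# Percent-identity cutoffs from the text, checked from the most specific rank down.
RANK_CUTOFFS = [("species", 95.0), ("genus", 85.0), ("phylum", 65.0)]


def deepest_rank(identity: float) -> Optional[str]:
    """Return the most specific rank whose identity cutoff is met, or None."""
    for rank, cutoff in RANK_CUTOFFS:
        if identity >= cutoff:
            return rank
    return None


def consensus_taxon(taxa: List[str], min_fraction: float = 0.5) -> Optional[str]:
    """Given the taxon labels of several hits at one rank, return the label
    supported by at least min_fraction of the hits, or None if there is none."""
    if not taxa:
        return None
    taxon, count = Counter(taxa).most_common(1)[0]
    return taxon if count / len(taxa) >= min_fraction else None


if __name__ == "__main__":
    print(deepest_rank(96.2))   # species
    print(deepest_rank(70.0))   # phylum
    print(consensus_taxon(["Bacteroides", "Bacteroides", "Prevotella"]))  # Bacteroides
```

An exact 50/50 split between two taxa is ambiguous under this reading; a production pipeline would need an explicit tie rule, which the text does not specify.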
2.宏基因组关联分析(MGWAS)2. Metagenome-Wide Association Study (MGWAS)
为了比较健康对照、进展性腺瘤和癌症患者的粪便微生物群系,鉴定了相对丰度在上述任意两组之间展现出显著差异的基因(Benjamini-Hochberg q值<0.1,Kruskal-Wallis检验)。然后将这些标志物基因根据其在所有三组样品中的丰度变化聚类成MLG,这使得能够鉴定每组的微生物物种特征(Qin等,2012,同上)。147个样品中有9个含有超过20%的埃希氏杆菌属(Escherichia)(2个对照、2个腺瘤、5个癌症样品),随后该样品仅用在用于基于MLG的癌分类器的测试集中(图1d)。另外9个样品(6个健康对照、3个进展性腺瘤样品,表1-1)也用在用于上述分类器的测试集中。To compare the fecal microbiomes of healthy controls, advanced adenoma patients, and cancer patients, genes whose relative abundance differed significantly between any two of these groups were identified (Benjamini-Hochberg q value <0.1, Kruskal-Wallis test). These marker genes were then clustered into MLGs according to their abundance variation across the samples of all three groups, which enabled the identification of the microbial species characteristic of each group (Qin et al., 2012, supra). Nine of the 147 samples contained more than 20% Escherichia (2 controls, 2 adenomas, 5 cancer samples); these samples were subsequently used only in the test set for the MLG-based cancer classifier (Figure 1d). Another 9 samples (6 healthy controls, 3 advanced adenoma samples, Table 1-1) were also used in the test set for that classifier.
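For readers unfamiliar with the q-value filter used above, the Benjamini-Hochberg adjustment can be sketched in a few lines. This is the standard step-up procedure, written here as an illustration; the function name `bh_qvalues` is hypothetical, and the p-values are assumed to have been computed beforehand (for example by a Kruskal-Wallis test per gene).

```python
from typing import List, Sequence


def bh_qvalues(pvalues: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values (q-values), in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that each q-value is
    # the minimum of p * m / rank over all ranks at or above its own.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q


if __name__ == "__main__":
    p = [0.01, 0.04, 0.03, 0.20]
    q = bh_qvalues(p)
    # Genes passing the q < 0.1 filter used in the text:
    print([i for i, value in enumerate(q) if value < 0.1])  # [0, 1, 2]
```

With these four p-values the q-values are approximately 0.04, 0.053, 0.053, and 0.20, so the first three genes would pass the q < 0.1 threshold.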
如前所述,根据分类学及其组成基因的相对丰度进行MLG的分类学分配和丰度特征谱表征(Qin等2012,同上)。简而言之,将MLG分配至种需要MLG中超过90%的基因能够以超过95%的同一性,以及超过70%的查询重叠度比对到该种的基因组。将MLG分配至属,需要该MLG中超过80%的基因能够在DNA和蛋白质序列上均以至少85%同一性比对到该属的基因组上。As previously described, MLGs were assigned taxonomically and characterized based on the relative abundance of their constituent genes (Qin et al. 2012, supra). Briefly, assignment of an MLG to a species requires that >90% of the genes in the MLG be mapped to the genome of the species with >95% identity and >70% query overlap. Assignment of an MLG to a genus requires that >80% of the genes in the MLG be mapped to the genome of the genus with at least 85% identity in both DNA and protein sequences.
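The species-level rule above reduces to a proportion test, sketched below. The record type `GeneAlignment` and the function name `assigns_to_species` are hypothetical, and the sketch assumes one best alignment per gene against a single candidate species genome; genes with no alignment are simply absent from the list but still counted in `n_genes`.

```python
from typing import List, NamedTuple


class GeneAlignment(NamedTuple):
    """Best alignment of one MLG gene to a candidate species genome (hypothetical)."""
    identity: float  # percent identity
    overlap: float   # percent of the query gene covered


def assigns_to_species(alignments: List[GeneAlignment], n_genes: int) -> bool:
    """True when more than 90% of the MLG's genes align to the species genome
    with more than 95% identity and more than 70% query overlap."""
    if n_genes <= 0:
        raise ValueError("n_genes must be positive")
    passing = sum(1 for a in alignments
                  if a.identity > 95.0 and a.overlap > 70.0)
    return passing / n_genes > 0.90


if __name__ == "__main__":
    good = GeneAlignment(identity=98.0, overlap=90.0)
    weak = GeneAlignment(identity=92.0, overlap=90.0)
    print(assigns_to_species([good] * 19 + [weak], n_genes=20))      # True
    print(assigns_to_species([good] * 9 + [weak], n_genes=10))       # False
```

The genus-level rule has the same shape with different thresholds (more than 80% of genes at 85% identity or higher, at both the DNA and protein level), so it could reuse this function with parameters if needed.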
为了探索健康或肿瘤样品中的肠道微生物群系的特征,本发明人鉴定了在三组的任意两组中均显示出显著的丰度差异的130,715个基因(Kruskal-Wallis检验,Benjamini-Hochberg q值<0.1)。除了血清铁蛋白和对红肉的摄取状况外,除肿瘤状态以外没有一种表型在对照、腺瘤和癌症患者中均显示出显著的差异(p<0.05,Kruskal-Wallis检验,表1-2)。与健康和进展性腺瘤样品相比,58.9%的基因标志物在癌症样品中显著升高,表明它们对结直肠癌是特异的;另外24.3%的基因在癌症样品中的丰度显著高于对照样品,但在进展性腺瘤样品中具有中等水平。在具有下降趋势的基因中,与健康和进展性腺瘤样品相比,5388个基因(占总数的4.1%)在癌症样品中显著降低;2601个基因(占总数的2.0%)的丰度在癌症样品中显著低于对照样品,在进展性腺瘤样品中具有中等水平。这些在对照样品中富集的基因,而非那些在腺瘤或癌样品中富集的基因,被更多地匹配至京都基因与基因组百科全书(KEGG)通路。递增和递减基因数目的差异表明,在癌发展过程中病生菌(pathobionts)的增加比有益细菌的减少更为明显。根据各个基因在所有样品中丰度的共变化,将显著不同的基因聚类成126个MLG,这使得能够鉴定每组的微生物物种特征(Qin等,2012,同上)。To explore the characteristics of the gut microbiome in healthy and tumor samples, the present inventors identified 130,715 genes that showed significant differences in abundance between any two of the three groups (Kruskal-Wallis test, Benjamini-Hochberg q value <0.1). Apart from serum ferritin and red meat intake, no phenotype other than tumor status differed significantly among controls, adenoma patients, and cancer patients (p < 0.05, Kruskal-Wallis test, Table 1-2). Compared with healthy and advanced adenoma samples, 58.9% of the gene markers were significantly elevated in cancer samples, indicating that they are specific to colorectal cancer; another 24.3% of the genes were significantly more abundant in cancer samples than in control samples, with intermediate levels in advanced adenoma samples. Among the genes with a downward trend, 5388 genes (4.1% of the total) were significantly decreased in cancer samples compared with healthy and advanced adenoma samples; 2601 genes (2.0% of the total) were significantly less abundant in cancer samples than in control samples, with intermediate levels in advanced adenoma samples. The genes enriched in control samples, rather than those enriched in adenoma or cancer samples, were more frequently matched to Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. The difference between the numbers of increasing and decreasing genes suggests that, during cancer development, the increase in pathobionts is more pronounced than the decrease in beneficial bacteria. Based on the co-variation of the abundance of each gene across all samples, the significantly different genes were clustered into 126 MLGs, which enabled the identification of the microbial species characteristic of each group (Qin et al., 2012, supra).
3.结直肠癌或腺瘤的基于MLG的分类3. MLG-based classification of colorectal cancer or adenoma
为了评价结直肠癌的粪便微生物群系的诊断价值,本发明人构建了可检测癌症样品的随机森林分类器。使用训练队列(集)的MLG丰度特征谱来训练随机森林模型(R 2.14,randomForest 4.6-7软件包)(Liaw,Andy&Wiener,Matthew.Classification and Regression by randomForest,R News(2002),第2/3卷,第18页,通过引用并入本文),从而选择MLG标志物的最优集。在测试集上测试该模型,并测定了预测误差。关于该随机森林模型,通过使用R version 2.14中的“randomForest 4.6-7包”,输入训练数据集(即训练样品中所选MLG的相对丰度特征谱)、样品疾病状态(训练样品的样品疾病状态为矢量,1表示病例,0表示对照)和测试数据集(即在测试集中所选MLG的相对丰度特征谱)。然后本发明人使用R软件中的randomForest包的randomForest函数来构建分类,并且使用预测函数来预测测试集。输出的是预测结果(患病的概率;临界值是指最佳临界值,如果疾病的概率≥最佳临界值,则受试者处于疾病的风险中)。To evaluate the diagnostic value of the fecal microbiome for colorectal cancer, the inventors constructed a random forest classifier that can detect cancer samples. The MLG abundance profiles of the training cohort (set) were used to train a random forest model (R 2.14, randomForest 4.6-7 package) (Liaw, Andy & Wiener, Matthew. Classification and Regression by randomForest, R News (2002), Vol. 2/3, p. 18, incorporated herein by reference) and thereby to select the optimal set of MLG markers. The model was tested on the test set and the prediction error was determined. For the random forest model, the randomForest 4.6-7 package in R version 2.14 was given three inputs: the training data set (i.e., the relative abundance profiles of the selected MLGs in the training samples), the sample disease status (a vector over the training samples in which 1 denotes a case and 0 denotes a control), and the test data set (i.e., the relative abundance profiles of the selected MLGs in the test set). The inventors then used the randomForest function of the randomForest package to build the classifier and the predict function to predict the test set. The output is the prediction result (the probability of disease; the critical value refers to the optimal cutoff, and if the probability of disease ≥ the optimal cutoff, the subject is at risk of the disease).
The random forest model (R 3.0.2, randomForest 4.6-7 package) was subjected to 10-fold cross-validation using the MLG abundance profiles of the control, advanced adenoma, or cancer samples. The cross-validation error curves from 5 repetitions of 10-fold cross-validation (each curve being the average over 10 test sets) were averaged, and the minimum error of the averaged curve plus the standard deviation at that point was used as the cutoff. All sets of MLG markers (≤50) with an error below this cutoff were listed, and the set with the fewest MLGs was selected as the optimal set. The probability of adenoma or cancer was calculated using this MLG set, and the ROC curve was plotted (R 3.0.2, pROC3 package). The model was further tested on the test set and the prediction error was determined.
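The marker-set selection rule described above can be sketched as follows. The sketch assumes that each cross-validation repetition yields an error curve keyed by the number of MLG markers, and that the "standard deviation at that point" is the sample standard deviation across the repetitions; both are assumptions, since the text does not spell them out. The function name and the error values are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List


def pick_marker_count(curves: List[Dict[int, float]]) -> int:
    """Return the smallest marker count whose mean CV error is below
    (minimum mean error + standard deviation at that minimum)."""
    sizes = sorted(curves[0])
    avg = {n: mean([c[n] for c in curves]) for n in sizes}
    sd = {n: stdev([c[n] for c in curves]) for n in sizes}
    best = min(sizes, key=lambda n: avg[n])   # marker count with the minimum mean error
    threshold = avg[best] + sd[best]          # the error cutoff from the text
    return min(n for n in sizes if avg[n] < threshold)


# Hypothetical error curves from 5 repetitions (marker count -> CV error).
curves = [
    {5: 0.30, 10: 0.20, 15: 0.15, 20: 0.19},
    {5: 0.28, 10: 0.22, 15: 0.19, 20: 0.21},
    {5: 0.32, 10: 0.21, 15: 0.19, 20: 0.20},
    {5: 0.29, 10: 0.19, 15: 0.23, 20: 0.18},
    {5: 0.31, 10: 0.23, 15: 0.19, 20: 0.22},
]
chosen = pick_marker_count(curves)
print(chosen)
```

In this toy input the minimum mean error occurs at 15 markers, but the 10-marker set already falls below the cutoff, so the smaller set is chosen.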
The optimal selection of 15 MLG markers (Table 2-1, Table 2-2) was obtained by 5 repetitions of 10-fold cross-validation (i.e., 50 tests) on a training set consisting of 55 controls and 41 cancer samples (Table 1-1). In each test, the random forest ranked the importance of every MLG. The inventors selected the top 15 important MLGs and ranked the MLGs by number of occurrences. These top 15 MLGs were used to construct the classifier. Table 5 lists the importance ranking of the MLGs.
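One way to read the ranking step above is that each of the 50 tests contributes its list of top-ranked MLGs, and MLGs are then ordered by how often they appear across tests. The sketch below follows that reading, which is an assumption; the identifiers `mlg-A`, `mlg-B`, `mlg-C` and the alphabetical tie-break are hypothetical.

```python
from collections import Counter
from typing import List


def rank_by_occurrence(top_lists: List[List[str]], keep: int) -> List[str]:
    """Order markers by how many per-test top lists they appear in."""
    counts = Counter(marker for top in top_lists for marker in top)
    # Sort by descending count; break ties alphabetically so the result is deterministic.
    ranked = sorted(counts, key=lambda m: (-counts[m], m))
    return ranked[:keep]


# Hypothetical top-2 lists from three tests (the patent uses 50 tests and the top 15).
runs = [["mlg-A", "mlg-B"], ["mlg-A", "mlg-C"], ["mlg-B", "mlg-A"]]
top = rank_by_occurrence(runs, keep=2)
print(top)
```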
Our results showed that these 15 MLGs performed well on the training set, with an area under the receiver operating characteristic curve (AUC) of 98.34% (cutoff = 0.5; Figures 1a, 1b, 1c; Tables 3-1, 3-2, 4, and 5). The classification error on the test set (8 control samples, 47 advanced adenoma samples, and 5 cancer samples) was low, with an AUC of 96% (advanced adenomas were counted as non-cancer; cutoff = 0.5; Figures 1d, 1e; Table 4), consistent with the mostly benign nature of advanced adenomas. The MLG markers included mlg-75 and mlg-84, which are likely oral anaerobes; the former showed a high odds ratio for adenoma (Table 2-2), suggesting an early role in pathogenesis. Other MLG markers included Bacteroides massiliensis, mlg-2985, mlg-121, and 10 additional taxonomically undefined MLGs (Table 2-2). Thus, the MLGs selected by the cancer classifier display important features of the gut microbiome that contribute to disease progression in adenoma and carcinoma, and have great potential for the early and noninvasive diagnosis of these tumors.
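For readers unfamiliar with the AUC figures quoted above, the following sketch computes an AUC directly from its rank interpretation: the fraction of case/control pairs in which the case receives the higher score, with ties counted as one half. It is a generic illustration rather than the pROC computation used in the patent, and the scores are hypothetical.

```python
from typing import Sequence


def auc(case_scores: Sequence[float], control_scores: Sequence[float]) -> float:
    """AUC as the probability that a random case outscores a random control."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical predicted probabilities (not patent data).
cases = [0.9, 0.8, 0.4]
controls = [0.7, 0.3, 0.2]
value = auc(cases, controls)
print(round(value, 4))
```

Here 8 of the 9 case/control pairs are ordered correctly, so the AUC is 8/9.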
In addition, the results in Table 5 show the following AUCs for combinations of the top-ranked MLGs: top 2 (MLG 5045 and MLG 121), 0.91751663; top 3 (MLG 5045, MLG 121, and MLG 75), 0.970731707; top 4, 0.959645233; top 5, 0.975609756; top 6, 0.978713969; top 7, 0.980044346; top 8, 0.985365854; top 9, 0.984035477; top 10, 0.981818182; top 11, 0.980931264; top 12, 0.979157428; top 13, 0.987583149; top 14, 0.986696231; and top 15, 0.983370288. These results indicate that MLG 5045 and MLG 121 are the most important biomarkers among the 15 MLGs, and that the combination of MLG 5045 and MLG 121 is sufficient for diagnosing colorectal cancer and/or determining the risk of colorectal cancer with high sensitivity and specificity (AUC = 0.91751663); incorporating other MLG biomarkers from the 15 MLGs can improve the sensitivity and specificity of diagnosis or prediction to a certain extent. In particular, these results also indicate that the combination of the top 13 MLGs can be considered the optimal biomarker panel, with the best sensitivity and specificity for the diagnosis or prediction of colorectal cancer (AUC = 0.987583149).
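The choice of the top 13 MLGs as the optimal panel can be checked directly from the Table 5 AUC values quoted above. The sketch below simply looks up the panel size with the highest AUC; the variable names are hypothetical, while the numbers are those reported in the text.

```python
# AUC by number of top-ranked MLGs in the panel, as reported for Table 5.
table5_auc = {
    2: 0.91751663, 3: 0.970731707, 4: 0.959645233, 5: 0.975609756,
    6: 0.978713969, 7: 0.980044346, 8: 0.985365854, 9: 0.984035477,
    10: 0.981818182, 11: 0.980931264, 12: 0.979157428, 13: 0.987583149,
    14: 0.986696231, 15: 0.983370288,
}

# Panel size with the highest reported AUC.
best_size = max(table5_auc, key=table5_auc.get)
print(best_size, table5_auc[best_size])
```

The values are not monotonic in panel size (for example, the AUC dips from the top 3 to the top 4 MLGs), which is why the best panel is found by lookup rather than by taking the largest panel.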
These results fully support that the MLGs identified herein (particularly MLG 5045 and MLG 121, optionally in combination with one or more of the other MLGs among these 15 MLGs) can be used as biomarkers for diagnosing colorectal cancer and/or determining the risk of developing colorectal cancer with high sensitivity and specificity.
The inventors further directly investigated the utility of gut MLGs for identifying adenomas, which are harder to screen for than colorectal cancer but are important for early intervention.
Similarly, 5 repetitions of 10-fold cross-validation (i.e., 50 tests) were performed on a training set consisting of 55 controls and 42 adenoma samples. In each test, the random forest ranked the importance of every MLG. The inventors selected the top 10 important MLGs and ranked them by number of occurrences. These top 10 MLGs were used to construct the classifier. Table 10 lists the importance ranking of the MLGs.
After 5 repetitions of 10-fold cross-validation, the random forest model selected 10 MLGs (Table 6-1, Table 6-2, Figure 2a) that allowed optimal classification of the training set (55 controls and 42 advanced adenomas; Tables 8 and 9; Figure 2b), with an AUC of 0.8738 (cutoff = 0.5, Figure 2c). In the test set (new, previously unused samples comprising 15 controls and 15 advanced adenomas), all advanced adenoma samples were correctly classified (cutoff = 0.4572; Figures 2d, 2e; Table 9). Fecal MLGs therefore offer a new opportunity for the noninvasive detection of colorectal advanced adenomas.
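The statement that all advanced adenoma samples were correctly classified corresponds to a sensitivity of 100% at the stated cutoff. A minimal sketch of how sensitivity and specificity follow from predicted probabilities and a cutoff is given below. The cutoff 0.4572 is taken from the text, but the probabilities and labels are hypothetical, so the resulting specificity says nothing about the patent's test set.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    probs: Sequence[float], labels: Sequence[int], cutoff: float
) -> Tuple[float, float]:
    """Sensitivity and specificity when probability >= cutoff is called a case."""
    tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= cutoff)
    fn = sum(1 for p, y in zip(probs, labels) if y == 1 and p < cutoff)
    tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p < cutoff)
    fp = sum(1 for p, y in zip(probs, labels) if y == 0 and p >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical predictions: 1 = advanced adenoma, 0 = control.
probs = [0.90, 0.60, 0.46, 0.30, 0.50, 0.10]
labels = [1, 1, 1, 0, 0, 0]
sens, spec = sensitivity_specificity(probs, labels, cutoff=0.4572)
print(sens, round(spec, 3))
```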
In addition, the results in Table 10 show the following AUCs for combinations of the top-ranked MLGs: top 2 (MLG 317 and MLG 3770), 0.782251082; top 3 (MLG 317, MLG 3770, and MLG 3840), 0.805194805; top 4, 0.773160173; top 5, 0.795238095; top 6, 0.780952381; top 7, 0.895670996; top 8, 0.896536797; top 9, 0.884848485; and top 10, 0.873809524. These results indicate that MLG 317, MLG 3770, and MLG 3840 are the most important biomarkers among the 10 MLGs, and that the combination of MLG 317, MLG 3770, and MLG 3840 is sufficient for diagnosing colorectal advanced adenoma and/or determining the risk of developing colorectal advanced adenoma with high sensitivity and specificity (AUC = 0.805194805); incorporating other MLG biomarkers from the 10 MLGs can improve the sensitivity and specificity of diagnosis or prediction to a certain extent. In particular, these results also indicate that the combination of the top 8 MLGs can be considered the optimal biomarker panel, with the best sensitivity and specificity for the diagnosis or prediction of colorectal advanced adenoma (AUC = 0.896536797).
These results fully support that the MLGs identified herein (particularly MLG 317, MLG 3770, and MLG 3840, optionally in combination with one or more of the other MLGs among the above 10 MLGs) can be used as biomarkers for diagnosing colorectal advanced adenoma and/or determining the risk of developing colorectal advanced adenoma with high sensitivity and specificity.
Thus, using a random forest model based on the relevant gene markers, the inventors identified and validated 15 MLGs for the early and noninvasive diagnosis of colorectal cancer and 10 MLGs for the early and noninvasive diagnosis of colorectal adenoma. The inventors also constructed a method for assessing the risk of colorectal cancer and adenoma based on these associated gut microbiota.
Claims (16)
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1240281A1 (en) | 2018-05-18 |
| HK1240281B (en) | 2022-05-13 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| CN107208149B (en) | Biomarkers for colorectal cancer-related diseases | |
| KR20170053617A (en) | Methods for Evaluating Lung Cancer Status | |
| WO2015018307A1 (en) | Biomarkers for colorectal cancer | |
| CN109477145A (en) | Biomarkers of Inflammatory Bowel Disease | |
| EP3245298B1 (en) | Biomarkers for colorectal cancer related diseases | |
| US10793911B2 (en) | Host DNA as a biomarker of Crohn's disease | |
| WO2016050110A1 (en) | Biomarkers for rheumatoid arthritis and usage thereof | |
| EP2909335B1 (en) | Prognostic of diet impact on obesity-related co-morbidities | |
| CN106399304A (en) | Breast cancer related SNP marker | |
| CN110747274B (en) | Gene methylation panel and kit for diagnosing and predicting colorectal cancer curative effect and prognosis | |
| CN108064273B (en) | Biomarkers of colorectal cancer-related diseases | |
| US20150284779A1 (en) | Determination of a tendency to gain weight | |
| CN116377070B (en) | Novel microbial markers for predicting colorectal cancer or colorectal adenoma risk | |
| CN107400708A (en) | Purposes of the XRCC1 gene pleiomorphisms in rheumatic arthritis diagnoses validity | |
| WO2024183478A1 (en) | Novel microbial biomarker for predicting risk of colorectal cancer or colorectal adenoma | |
| HK1240281B (en) | Biomarkers for colorectal cancer related diseases | |
| HK1249134B (en) | Biomarkers for colorectal cancer related diseases | |
| CN111108199B (en) | Biomarkers for atherosclerotic cardiovascular disease | |
| RU2771080C2 (en) | Method for determining response of a patient diagnosed with skin melanoma to anti-pd1-therapy | |
| Tagawa et al. | Clinical Utility of Circulating Cell‐Free DNA as a Liquid Biopsy in Cats With Various Tumours | |
| CN113151512A (en) | Using gut bacteria to detect early-stage lung cancer | |
| CN106520957B (en) | DHRS7 susceptibility SNP site detection reagent and kit for its preparation | |
| JP2025171299A (en) | Method, marker and test kit for predicting the prognosis of inflammatory bowel disease or for testing for inflammatory bowel disease, and method for screening for preventive/ameliorating agents for inflammatory bowel disease | |
| Nwosu | Identification of gut microbiota markers for Inflammatory Bowel Diseases in children: Early diagnostic potentials | |
| HK1240284A1 (en) | Biomarkers for colorectal cancer related diseases |