WO2016049927A1 - Biomarkers for obesity related diseases

Info

Publication number: WO2016049927A1
Application number: PCT/CN2014/088056
Authority: WO
Inventors: Qiang FENG; Dongya ZHANG; Longqing TANG; Jun Wang
Original assignee: BGI Shenzhen Co Ltd
Current assignee: BGI Shenzhen Co Ltd
Priority date: 2014-09-30
Filing date: 2014-09-30
Publication date: 2016-04-07
Anticipated expiration: 2017-03-30
Also published as: CN106795481B; CN106795481A

Abstract

Biomarkers and methods for predicting the risk of a disease related to microbes, in particular obesity or related diseases, are provided.

Description

BIOMARKERS FOR OBESITY RELATED DISEASES

CROSS-REFERENCE TO RELATED APPLICATION

None

FIELD

The present invention relates to biomarkers and methods for predicting the risk of a disease related to microbes, in particular obesity or related diseases.

BACKGROUND

Obesity, which is prevalent in developed countries, has increased considerably worldwide (de Carvalho Pereira et al., 2013). The worldwide prevalence of overweight and obesity combined reportedly rose by 27.5% for adults and 47.1% for children between 1980 and 2013. The number of overweight individuals increased from 857 million in 1980 to 2.1 billion in 2013, and of these, 671 million are affected by obesity. More than 50% of obese individuals live in ten countries; the USA has the largest number of obese individuals, followed by China (Ng et al., 2014).

There is a growing body of evidence suggesting that patients who are told by their physician that they are overweight are more likely to lose weight than those who are not. However, the low rates of physician diagnosis and advice for behavioral health risk factors related to obesity are concerning (Bleich et al., 2011).

In children, the diagnosis of obesity is based on age- and gender-specific body mass index (BMI) cut-points. This is in contrast to adults, in whom an obesity diagnosis is based on BMI regardless of age or gender. Because the pediatric diagnostic criteria are more complicated and the terminology for pediatric obesity has changed, fewer obese children are accurately diagnosed (Walsh et al., 2013). Moreover, the limitations of BMI in identifying different populations should be considered (Nevill et al., 2006). Waist circumference (WC) can be considered a reliable and useful tool for epidemiological studies to assess abdominal adiposity, but this measurement seems to be harder to perform (Miguel-Etayo et al., 2014). Furthermore, regional studies of the diagnosis of pediatric obesity using the International Classification of Diseases, Ninth Revision (ICD-9), the National Ambulatory Medical Care Survey (NAMCS), and the National Hospital Ambulatory Medical Care Survey (NHAMCS) have shown relatively low sensitivity of a clinical diagnosis (Walsh et al., 2013).

Recent insight suggests that the human gut microbiota could play an important role in obesity. An early report, based on sequencing of amplified 16S rRNA genes, indicated a much higher ratio of Firmicutes to Bacteroidetes in faecal samples from 12 obese humans than in two lean controls (Ley et al., 2006). Recent observational studies using metagenomic sequencing in human obesity have demonstrated reduced bacterial diversity, a relative depletion of Bacteroidetes, and enrichment in genes involved in carbohydrate and lipid metabolism (Allin and Pedersen, 2014). These correlative findings suggest that an altered gut microbiota may be a causal factor in the pathogenesis of obesity, which in turn suggests that characteristics of the gut microbiota could be used as criteria for diagnosing obesity.

In summary, there are considerable missed opportunities and low sensitivity in the diagnosis of obesity. A more valid (less biased) assessment of overweight and/or obesity needs to be developed.

SUMMARY

Embodiments of the present disclosure seek to solve at least one of the problems existing in the prior art to at least some extent.

The present invention is based on the following findings by the inventors:

Assessment and characterization of the gut microbiota has become a major research area in human disease, including obesity. To analyze the gut microbial content of obesity patients, the inventors carried out a protocol for a Metagenome-Wide Association Study (MGWAS) (Qin, J. et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 490, 55–60 (2012), incorporated herein by reference) based on deep shotgun sequencing of the gut microbial DNA from 158 individuals. The inventors identified and validated 396,100 obesity-associated gene markers. To exploit the potential of the gut microbiota for obesity classification, the inventors developed a disease classifier system based on 54 gene markers that are defined as an optimal gene set by a minimum redundancy-maximum relevance (mRMR) feature selection method. For intuitive evaluation of the risk of obesity based on these 54 gut microbial gene markers, the inventors calculated a healthy index. The inventors' data provide insight into the characteristics of the gut metagenome related to obesity risk, a paradigm for future studies of the pathophysiological role of the gut metagenome in other relevant disorders, and the potential usefulness of a gut-microbiota-based approach for assessment of individuals at risk of such disorders.

It is believed that gene markers of the intestinal microbiota are valuable for increasing obesity detection at earlier stages for the following reasons. First, the markers of the present invention are more specific and sensitive than conventional markers. Second, analysis of stool promises accuracy, safety, affordability, and patient compliance, and stool samples are transportable. Thus, the present invention relates to an in vitro method that is comfortable and noninvasive, so people will participate in a given screening program more easily. Third, the markers of the present invention may also serve as tools for therapy monitoring in cancer patients to detect the response to therapy.

In one aspect of the present disclosure, there is provided a biomarker set for predicting a disease related to microbiota in a subject, consisting of:

at least a partial sequence of SEQ ID NO: 1 to 54.

According to embodiments of the present disclosure, the disease is obesity or a related disease.

Using these biomarkers, a disease related to microbiota in a subject may be analyzed; for example, obesity or a related disease may be determined based on a sample from the subject, such as a fecal sample.

In another aspect of the present disclosure, there is provided a kit for determining the gene marker set described above, comprising primers used for PCR amplification and designed according to the DNA sequence as set forth in at least a partial sequence of SEQ ID NO: 1 to 54.

In another aspect of the present disclosure, there is provided a kit for determining the gene marker set described above, comprising one or more probes designed according to the genes as set forth in SEQ ID NO: 1 to 54.

In another aspect of the present disclosure, there is provided use of the gene marker set described above for predicting the risk of obesity or a related disorder in a subject. According to embodiments of the present disclosure, the risk of obesity or a related disorder in a subject may be predicted by the following steps:

(1) collecting a sample j from a subject;

(2) determining the relative abundance information of each of SEQ ID NO: 1 to 54 in the DNA of the sample; and

(3) calculating an index of sample j, denoted by I_j, by the formula below:

A_ij is the relative abundance of marker i in sample j, wherein i refers to each of the gene markers in said gene marker set;

N is a first subset of all patient-enriched markers in selected biomarkers related to the abnormal condition,

M is a second subset of all control-enriched markers in selected biomarkers related to the abnormal condition,

|N| and |M| are numbers of the biomarkers respectively in the first and second subsets,

wherein

an index greater than a cutoff indicates that the subject has or is at risk of developing the abnormal condition.

According to some embodiments of present disclosure, |N| is 24, and |M| is 30.

According to some embodiments of present disclosure, the cutoff is at least 0.5834.

In another aspect of the present disclosure, there is provided use of the gene marker set described above for preparation of a kit for predicting the risk of obesity or a related disorder in a subject. According to embodiments of the present disclosure, the risk of obesity or a related disorder in a subject may be predicted by the following steps:

(1) collecting a sample j from a subject;

(2) determining the relative abundance information of each of SEQ ID NO: 1 to 54 in the DNA of the sample; and

(3) calculating an index of sample j, denoted by I_j, by the formula below:

wherein

According to some embodiments of present disclosure, |N| is 24, and |M| is 30.

In another aspect of the present disclosure, there is provided a method of diagnosing whether a subject has an abnormal condition related to microbiota or is at risk of developing an abnormal condition related to microbiota, comprising:

determining the relative abundance of the biomarkers described above in a sample from the subject, and

determining whether the subject has an abnormal condition related to microbiota or is at risk of developing an abnormal condition related to microbiota based on the relative abundance.

According to embodiments of the present disclosure, the risk of obesity or a related disorder in a subject may be predicted by the following steps:

(1) collecting a sample j from a subject;

(2) determining the relative abundance information of each of SEQ ID NO: 1 to 54 in the DNA of the sample; and

(3) calculating an index of sample j, denoted by I_j, by the formula below:

wherein

According to some embodiments of present disclosure, |N| is 24, and |M| is 30.

According to embodiments of present disclosure, the abnormal condition related to microbiota is obesity or related disorder.

BRIEF DESCRIPTION OF DRAWINGS

These and other aspects and advantages of the present disclosure will become apparent and more readily appreciated from the following descriptions taken in conjunction with the drawings, in which:

Fig. 1 The P-value distribution from the obesity association analysis shows a disproportionate over-representation of strongly associated markers at lower P-values.

Fig. 2 ROC curve drawn from the probability of illness in the training set; AUC = 0.9651.

Fig. 3 ROC curve drawn from the obesity index in the training set; AUC = 0.9784.

Fig. 4 ROC curve drawn from the obesity index in the test set (42 samples); AUC = 0.8729.

Fig. 5 ROC curve drawn from the obesity index in the test set (22 samples); AUC = 0.9487.

Examples

Terms used herein have meanings as commonly understood by a person of ordinary skill in the areas relevant to the present invention. Terms such as “a” , “an” and “the” are not intended to refer to only a singular entity, but include the general class of which a specific example may be used for illustration. The terminology herein is used to describe specific embodiments of the invention, but their usage does not delimit the invention, except as outlined in the claims.

The present invention is further exemplified in the following non-limiting Examples. Unless otherwise stated, parts and percentages are by weight and degrees are Celsius. As apparent to one of ordinary skill in the art, these Examples, while indicating preferred embodiments of the invention, are given by way of illustration only, and the agents were all commercially available.

Example 1. Identifying biomarkers for evaluating obesity risk

1.1 Sample collection

Fecal samples from 158 Chinese subjects, including 78 obesity patients and 80 control subjects (training set), were collected by Rui Jin Hospital, Shanghai Jiao Tong University School of Medicine, in 2012. Obesity patients were aged 18 to 30 with a BMI over 25. Subjects were asked to collect fresh fecal samples at the hospital. Collected samples were put in sterile tubes and immediately stored at -80℃ until further analysis.

Complete ethical approval was obtained, and all the patients gave written informed consent. The study was approved by the Institutional Review Board of Rui Jin Hospital, Shanghai Jiao Tong University School of Medicine.

1.2 DNA extraction

Fecal samples were thawed on ice and DNA extraction was performed using the QIAamp DNA Stool Mini Kit (Qiagen) according to the manufacturer's instructions. Extracts were treated with DNase-free RNase to eliminate RNA contamination. DNA quantity was determined using a NanoDrop spectrophotometer, a Qubit Fluorometer (with the Quant-iT™ dsDNA BR Assay Kit) and gel electrophoresis.

1.3 DNA library construction and sequencing of fecal samples

DNA library construction was performed following the manufacturer's instructions (Illumina, insert size 350 bp, read length 100 bp). The inventors used the same workflow as described previously to perform cluster generation, template hybridization, isothermal amplification, linearization, blocking and denaturation, and hybridization of the sequencing primers. The inventors constructed one paired-end (PE) library with an insert size of 350 bp for each sample, followed by high-throughput sequencing to obtain around 30 million PE reads of length 2x100 bp. High-quality reads were obtained by removing low-quality reads with ambiguous "N" bases, adapter contamination and human DNA contamination from the Illumina raw reads, and by simultaneously trimming low-quality terminal bases of reads.

In total, the inventors obtained about 5.9 Gb of high-quality clean fecal microbiota sequencing data per sample (Table 1) from 158 samples (78 cases and 80 controls) on the Illumina HiSeq 2000 platform.

Table 1 Summary of metagenomic data. Fourth column reports results from Wilcoxon rank-sum tests.

1.4 Metagenomic data processing and analysis

1.4.1 Reads mapping

The inventors used the updated human gut gene catalogue established in Li, J. et al. An integrated catalog of reference genes in the human gut microbiome. Nat. Biotechnol. (2014) (incorporated herein by reference) and mapped the high-quality reads to it with the alignment criterion identity ≥ 90%. The average read mapping rate is shown in Table 1. This mapping rate was close to that of the samples in Li, J. et al. 2014, supra, which indicated that it was sufficient for further study. After read mapping, the inventors derived the gene profile (9.9 million genes) from the mapping result using the same method as Li, J. et al. 2014, supra.

Taxonomic assignment of genes. Taxonomic assignment of the predicted genes was performed using an in-house pipeline that has been described in the published paper (Li, J. et al. 2014, supra).

1.4.2 Data profile construction

Gene profile. Based on the read mapping results, the inventors used the same method described in the published T2D paper (Qin et al. 2012, supra) to compute the relative gene abundance.
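The relative gene abundance computation is only cited here, not spelled out. The following is a minimal sketch, assuming a length-normalized definition of the kind described for Qin et al. 2012 (reads mapped to a gene divided by gene length, then scaled so that all genes in a sample sum to 1). The function name and the toy numbers are hypothetical and are not taken from this document.

```python
def relative_gene_abundance(read_counts, gene_lengths):
    """Return per-gene relative abundance for one sample.

    Assumption (not stated in this text): abundance is the mapped-read
    count divided by gene length, normalized to sum to 1 across genes.
    """
    copy_numbers = {
        gene: read_counts[gene] / gene_lengths[gene] for gene in read_counts
    }
    total = sum(copy_numbers.values())
    if total == 0:
        return {gene: 0.0 for gene in copy_numbers}
    return {gene: value / total for gene, value in copy_numbers.items()}


# Hypothetical toy sample: mapped-read counts and gene lengths (bp).
toy_counts = {"gene_a": 300, "gene_b": 100, "gene_c": 0}
toy_lengths = {"gene_a": 1500, "gene_b": 500, "gene_c": 900}
toy_profile = relative_gene_abundance(toy_counts, toy_lengths)
```

In this toy case gene_a and gene_b have the same reads-per-base, so they receive equal relative abundance even though gene_a has three times as many reads.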

1.4.3 Analysis of factors influencing gut microbiota gene profile

The inventors used permutational multivariate analysis of variance (PERMANOVA) to assess the effect of 6 clinical parameters (age, sex, height, weight, BMI and obese status) based on the gene profile. The inventors performed the analysis using the method implemented in the "vegan" package in R, and the permuted p-value was obtained from 10,000 permutations. The inventors also corrected for multiple testing using "p.adjust" in R with the Benjamini-Hochberg method to get the q-value for each test. PERMANOVA identified three factors significantly associated with the gut microbiota (based on gene profiles) (q < 0.05, Table 2). The analysis indicated that weight, BMI and obese status were strongly associated factors, supporting that disease (obese) status was the major determinant of the composition of the gut microbiota.

Table 2 PERMANOVA of the gene profile based on Euclidean distance. The analysis tested whether clinical parameters and obese status have a significant impact on the gut microbiota (q-value < 0.05).

Phenotype	Df	Sums Of Sqs	Mean Sqs	F.Model	R²	Pr(>F)
Age	1	0.317034738	0.317034738	1.004112579	0.006395454	0.4094
Sex	1	0.377329497	0.377329497	1.196542903	0.007611763	0.1727
Height	1	0.331409667	0.331409667	1.049947284	0.006685435	0.3291
Weight	1	0.969536515	0.969536515	3.111941857	0.019558192	1.00E-04
BMI	1	0.954186893	0.954186893	3.0617069	0.019248548	1.00E-04
Obese	1	0.972185352	0.972185352	3.120613959	0.019611626	2.00E-04
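The multiple-testing step can be illustrated on the permuted p-values listed in Table 2. The sketch below is a plain-Python version of the Benjamini-Hochberg adjustment; it is an illustration, not the inventors' R code, and the function and variable names are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda idx: p_values[idx])
    q_values = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        q_values[idx] = running_min
    return q_values


# Permuted p-values, Pr(>F), as listed in Table 2.
table2_p = {
    "Age": 0.4094, "Sex": 0.1727, "Height": 0.3291,
    "Weight": 1.00e-4, "BMI": 1.00e-4, "Obese": 2.00e-4,
}
q = dict(zip(table2_p, benjamini_hochberg(list(table2_p.values()))))
significant = [name for name, value in q.items() if value < 0.05]
```

With these six p-values, only Weight, BMI and Obese end up with q < 0.05, which agrees with the three significant factors reported in the text.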

1.4.4 Identification of obesity associated markers

Identification of obesity-associated genes. To identify associations between the metagenomic profile and obesity, a two-tailed Wilcoxon rank-sum test was applied to the profiles of 9,879,897 high-occurrence genes (genes present in fewer than 10 of the 158 samples were removed). A total of 396,100 gene markers enriched in either cases or controls were obtained, with p-value < 0.01 and FDR = 3.8% (Fig. 1).
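The two steps in this paragraph, the occurrence filter and the per-gene rank-sum test, can be sketched as follows. This is an illustrative plain-Python version, assuming that "present" means a non-zero abundance and using a tie-corrected normal approximation for the p-value; the function names and toy values are hypothetical, and a statistics package may use an exact or continuity-corrected variant instead.

```python
import math


def keep_high_occurrence(profile, min_samples=10):
    """Keep genes with non-zero abundance in at least `min_samples` samples."""
    return {
        gene: values
        for gene, values in profile.items()
        if sum(1 for v in values if v > 0) >= min_samples
    }


def rank_sum_p_value(cases, controls):
    """Two-tailed Wilcoxon rank-sum p-value via the normal approximation.

    Assumption: tie-corrected variance, no continuity correction.
    """
    n1, n2 = len(cases), len(controls)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in cases] + [(v, 1) for v in controls])
    rank_sum_cases = 0.0
    tie_term = 0.0
    start = 0
    while start < n:
        end = start
        while end < n and pooled[end][0] == pooled[start][0]:
            end += 1
        mean_rank = (start + 1 + end) / 2.0  # average of ranks start+1..end
        size = end - start
        tie_term += size ** 3 - size
        cases_in_group = sum(1 for k in range(start, end) if pooled[k][1] == 0)
        rank_sum_cases += mean_rank * cases_in_group
        start = end
    u_stat = rank_sum_cases - n1 * (n1 + 1) / 2.0
    variance = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return 1.0
    z = (u_stat - n1 * n2 / 2.0) / math.sqrt(variance)
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical abundances for one gene in five cases and five controls.
p_value = rank_sum_p_value([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
```

A gene would be kept as a marker when its p-value falls below the chosen threshold (0.01 in the text), with the direction of enrichment read from which group has the higher ranks.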

Estimating the false discovery rate (FDR) . Instead of a sequential p-value rejection method, the inventors applied the “q-value” method proposed in a previous study to estimate the FDR (Storey, J. D. A direct approach to false discovery rates. Journal of the Royal Statistical Society 64, 479-498 (2002) , incorporated herein by reference) .

Receiver Operator Characteristic (ROC) analysis. The inventors applied the ROC analysis to assess the performance of the obesity classification based on metagenomic markers. The inventors then used the “pROC” package in R to draw the ROC curve.

1.4.5 Construction of MLG and identification of obesity associated MLG species markers

237 MLG species based on the profile of the 396,100 obesity-associated marker genes. The inventors used the 396,100 gene markers to build metagenomic linkage groups (MLGs) using the same method described in the published T2D paper (Qin et al. 2012, supra). All 396,100 genes were annotated by aligning them to the 4,653 reference genomes in IMG v400. An MLG was assigned to a genome if more than 50% of its constituent genes were annotated to that genome; otherwise it was termed unclassified. In total, 237 MLGs with a gene number > 100 were selected (P-value < 0.01). To estimate the relative abundance of an MLG species, the inventors averaged the abundance of the genes of the MLG species after removing the 5% lowest and 5% highest abundant genes (Qin et al. 2012, supra).
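The MLG abundance estimate described above is a trimmed mean. A minimal sketch follows, assuming the number of genes trimmed at each end is 5% of the gene count rounded down (the rounding rule is not given in the text); the function name and toy values are hypothetical.

```python
def mlg_abundance(gene_abundances, trim_fraction=0.05):
    """Average gene abundance after dropping the lowest and highest 5%.

    Assumption: the number of genes dropped at each end is rounded down.
    """
    if not gene_abundances:
        raise ValueError("an MLG needs at least one gene")
    ordered = sorted(gene_abundances)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k] if k > 0 else ordered
    return sum(kept) / len(kept)


# Hypothetical MLG with 100 genes, one of which is an extreme outlier.
toy_genes = list(range(1, 100)) + [1e6]
toy_mlg_abundance = mlg_abundance(toy_genes)
```

Trimming makes the estimate robust to a few genes with unusually high or low abundance; in the toy data the outlier has no effect on the result.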

1.5 MLG-based classifier

A random forest model (R 2.14, randomForest 4.6-7 package) (Liaw, Andy & Wiener, Matthew. Classification and Regression by randomForest, R News (2002), Vol. 2/3 p. 18, incorporated herein by reference) was trained using the MLG abundance profile of the training cohort (158 samples) to select the optimal set of MLG markers. The model was tested on one or more testing sets and the prediction error was calculated.

For the random forest model, the randomForest 4.6-7 package in R version 2.14 was used. The input is a training dataset (namely, the relative abundance profiles of the selected MLGs in the training samples), the sample disease status (a vector over the training samples, 1 for obesity and 0 for control), and a test set (the relative abundance profiles of the selected MLGs in the test samples). The inventors then used the randomForest function from the randomForest package in R to build the classifier, and the predict function to predict the test set. The output is the prediction result (probability of illness; the cutoff is 0.5, and if the probability of illness is ≥ 0.5, the subject is at risk of obesity).

54 MLG species marker identification. To identify MLG species markers, the inventors applied the randomForest 4.6-7 package in R version 2.14 to the 237 obesity-associated MLG species. First, the inventors sorted all 237 MLG species by the importance given by the "randomForest" method (Liaw, Andy & Wiener, Matthew. Classification and Regression by randomForest, R News (2002), Vol. 2/3 p. 18, incorporated herein by reference). MLG marker sets were constructed by creating incremental subsets of the top-ranked MLG species, starting from 1 MLG species and ending at all 237 MLG species. For each MLG marker set, the inventors calculated the false prediction ratio in the 158 samples. Finally, the set of 54 MLG species with the lowest false prediction ratio was selected as the MLG species markers (Table 3-1). Furthermore, the inventors drew the ROC curve using the OOB (out-of-bag) predicted probability of illness from the randomForest model based on the selected MLG species markers (Table 3-2), and the area under the ROC curve (AUC) was 0.9651 in the 158 samples (Fig. 2). At the best cutoff of 0.5294, the true positive rate (TPR) was 0.8625 and the false positive rate (FPR) was 0.07692, indicating that the 54 MLG markers could be used to accurately classify obese individuals.

Table 3-1 54 most discriminant MLGs (species markers) associated with obesity

Table 3-2 Prediction results of 54 MLGs in 158 samples

1.6 Methods for selecting 54 best markers from biomarkers (Maximum Relevance Minimum Redundancy (mRMR) feature selection framework)

To identify an optimal gene set, a minimum redundancy-maximum relevance (mRMR) feature selection method (for detailed information, see Peng, H., Long, F. & Ding, C. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27, 1226-1238, doi: 10.1109/TPAMI.2005.159 (2005), which is incorporated herein by reference) was applied to the 54 MLG markers. For each of the 54 MLG markers, the inventors ranked the genes within the MLG by the mRMR method and chose the first-ranked gene to represent that MLG. The inventors thus obtained 54 gene markers, which are shown in Table 4 and Table 5. The gene ids are from the published reference gene catalogue (Li, J. et al. 2014, supra).

Table 4. Enrichment information of the 54 optimal gene markers

Gene id	Enrichment (1 = obesity, 0 = control)
4388	0
34851	0
58832	0
67971	1
85995	0
95220	0
156675	0
197527	0
201254	0
223292	0
232979	0
266046	1
331594	0
447739	0
515920	1
563367	0
565364	0
622928	0
727087	0
859299	1
903584	1
963362	0
1015556	0
1571678	1
1583339	1
1801041	0
2150523	0
2273710	0
2285506	0
2291624	1
2390685	0
2397559	1
2414703	1
2687923	1
2940024	0
3111759	0
3179344	1
3239706	1
3381319	0
3449966	1
3984550	0
4202903	1
5243950	1
5459014	1
5486089	1
5659078	1
6616419	0
6692162	1
7136991	1
7209512	0
7775163	0
8202342	1
8846481	1
9454625	1

Table 5. SEQ ID of the 54 optimal gene markers

Gene id	SEQ ID NO:
gene_id:85995	1
gene_id:5659078	2
gene_id:8846481	3
gene_id:95220	4
gene_id:3239706	5
gene_id:727087	6
gene_id:5243950	7
gene_id:1015556	8
gene_id:4388	9
gene_id:2397559	10
gene_id:2414703	11
gene_id:3381319	12
gene_id:201254	13
gene_id:7209512	14
gene_id:1801041	15
gene_id:7775163	16
gene_id:2273710	17
gene_id:447739	18
gene_id:5459014	19
gene_id:6616419	20
gene_id:3111759	21
gene_id:4202903	22
gene_id:859299	23
gene_id:3449966	24
gene_id:963362	25
gene_id:565364	26
gene_id:34851	27
gene_id:1583339	28
gene_id:8202342	29
gene_id:622928	30
gene_id:515920	31
gene_id:2390685	32
gene_id:7136991	33
gene_id:2291624	34
gene_id:331594	35
gene_id:2687923	36
gene_id:5486089	37
gene_id:156675	38
gene_id:1571678	39
gene_id:3984550	40
gene_id:232979	41
gene_id:266046	42
gene_id:223292	43
gene_id:67971	44
gene_id:6692162	45
gene_id:9454625	46
gene_id:3179344	47
gene_id:2940024	48
gene_id:197527	49
gene_id:58832	50
gene_id:903584	51
gene_id:2285506	52
gene_id:2150523	53
gene_id:563367	54

1.7 Gut healthy index (obesity index)

To exploit the potential ability of disease classification by gut microbiota, the inventors developed a disease classifier system based on the 54 gene markers that the inventors defined. For intuitive evaluation of the risk of disease based on these gut microbial gene markers, the inventors calculated a gut healthy index (obesity index).

To evaluate the effect of the gut metagenome on obesity, the inventors defined and calculated the gut healthy index for each individual on the basis of the selected 54 gene markers as described above. For each individual sample, the gut healthy index of sample j that denoted by I_j was calculated by the formula below:

A_ij is the relative abundance of marker i in sample j.

N is a subset of all patient-enriched markers in selected biomarkers related to the abnormal condition (namely, a subset of all obesity-enriched markers in these 54 selected gene markers),

M is a subset of all control-enriched markers in selected biomarkers related to the abnormal condition (namely, a subset of all control-enriched markers in these 54 selected gene markers),

|N| and |M| are the numbers (sizes) of biomarkers in these two sets, respectively, wherein |N| is 24 and |M| is 30,

wherein an index greater than a cutoff indicates that the subject has or is at the risk of developing obesity.
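The formula for I_j is not reproduced in this text, so the sketch below is only an assumed form that is consistent with the definitions given: the mean of a transform f of A_ij over the obesity-enriched set N, minus the mean of f(A_ij) over the control-enriched set M. The transform f is an assumption and is left as a parameter (identity by default; a log-type transform is another possibility). The function names, marker names and toy abundances are all hypothetical.

```python
def gut_healthy_index(abundance, patient_enriched, control_enriched,
                      transform=lambda a: a):
    """Assumed index: mean f(A_ij) over N minus mean f(A_ij) over M.

    `abundance` maps marker id -> relative abundance A_ij in one sample j.
    The transform f is an assumption, not taken from this document.
    """
    n_mean = sum(transform(abundance[i]) for i in patient_enriched) / len(patient_enriched)
    m_mean = sum(transform(abundance[i]) for i in control_enriched) / len(control_enriched)
    return n_mean - m_mean


def is_at_risk(index, cutoff=0.5834):
    """Apply the rule stated in the text: an index greater than the cutoff."""
    return index > cutoff


# Hypothetical toy sample with two markers in each set.
toy_abundance = {"n1": 0.4, "n2": 0.2, "m1": 0.1, "m2": 0.1}
toy_index = gut_healthy_index(toy_abundance, ["n1", "n2"], ["m1", "m2"])
```

In the document's setting, N would hold the 24 obesity-enriched markers and M the 30 control-enriched markers from Table 4.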

1.8 Gut-microbiota-based obesity classification

The inventors computed an obesity index based on the relative abundance of these 54 gene markers, which clearly separated the obesity patient microbiomes from the control microbiomes (Table 6). Classification of the 78 obesity patient microbiomes against the 80 control microbiomes using the obesity index exhibited an area under the receiver operating characteristic (ROC) curve of 0.9784 (Fig. 3). At the best index cutoff of 0.5834, the true positive rate (TPR) was 0.9103, the false positive rate (FPR) was 0.075, and the error rate was 8.86% (14/158), indicating that the 54 gene markers could be used to accurately classify obese individuals.

Table 6. 158 samples’ calculated gut healthy index (obesity patients and non-obesity controls)

Example 2. Validating the 54 gene biomarkers in 42 samples (test set)

The inventors validated the discriminatory power of the obesity classifier using a new independent study group, including 17 obesity patients and 25 non-obesity controls, which were also collected at Rui Jin Hospital, Shanghai Jiao Tong University School of Medicine.

For each sample, DNA was extracted and a DNA library was constructed followed by high throughput sequencing as described in Example 1. The inventors calculated the gene abundance profile for these samples using the same method as described in Qin et al. 2012, supra. Then the gene relative abundance of each of the markers as set forth in SEQ ID NOs: 1-54 was determined. Then the index of each sample was calculated by the formula below:

A_ij is the relative abundance of marker i in sample j.

N is a subset of all patient-enriched markers in selected biomarkers related to the abnormal condition (namely, a subset of all obesity-enriched markers in these 54 selected gene markers) ,

M is a subset of all control-enriched markers in selected biomarkers related to the abnormal condition (namely, a subset of all control-enriched markers in these 54 selected gene markers) ,

Table 7 shows the calculated index of each sample and Table 8 shows the gene relative abundances of a representative sample, DB68A. In this assessment, at the cutoff of 0.5834 (the best index cutoff in the 158 training samples above), the error rate was 26.19% (11/42), validating that the 54 gene markers can classify obese individuals. Most obesity patients (13/17) were correctly diagnosed as obese. In addition, the ROC curve for the test set was drawn from the obesity index, with AUC = 0.8729 (Fig. 4). At the best cutoff of 0.7769, the true positive rate (TPR) was 0.7647 and the false positive rate (FPR) was 0.04.

Table 7. Calculated gut healthy index of the 42 samples

Table 8. Gene relative abundance of Sample DB68A

Example 3. Validating the 54 gene biomarkers in 22 samples (test set)

The inventors validated the discriminatory power of the obesity classifier using another 22 samples (Table 9), including 9 case samples and 13 control samples (5 samples taken 1 month after operation and 8 samples taken 3 months after operation), which were also collected at Rui Jin Hospital, Shanghai Jiao Tong University School of Medicine. Case samples were taken before the operation; control samples were taken 1 month or 3 months after the operation.

Table 9. Information on the 22 samples

*Before: before the operation; 1-M: one month after the operation; 3-M: three months after the operation.

A_ij is the relative abundance of marker i in sample j.

Table 10 shows the calculated index of each sample and Table 11 shows the gene relative abundances of a representative sample, DB62. In this assessment, at the cutoff of 0.5834 (the best index cutoff in the 158 training samples above), the error rate was 18.18% (4/22), validating that the 54 gene markers can classify obese individuals. Most obesity patients (7/9) were correctly diagnosed as obese. In addition, the ROC curve for the test set was drawn from the obesity index, with AUC = 0.9487 (Fig. 5). At the best cutoff of 0.02538, the true positive rate (TPR) was 1 and the false positive rate (FPR) was 0.1538.

Table 10. Calculated gut healthy index of the 22 samples

Samples (DB: obesity)	Obesity index
DB62	1.191905591
DB67	0.025381992
DB68	1.757974404
DB78	1.344989391
DB85	1.796053682
DB124	0.072164965
DB125	1.162137206
DB126	0.979123077
DB01	0.686585017
DB.S1.62	0.879906331
DB.S1.68	-0.274438487
DB.S1.85	0.0154326
DB.S1.124	-0.750440603
DB.S1.125	-0.893868407
DB.S3.62	0.711881869
DB.S3.67	-0.007230148
DB.S3.68	-0.029903064
DB.S3.78	-0.761996663
DB.S3.124	-0.588485485
DB.S3.125	-0.575369569
DB.S3.126	-0.398672766
DB.S3.01	-0.420476048
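The validation figures reported for the 22 samples can be checked directly against the index values in Table 10. The following is a minimal sketch, not the inventors' own procedure: it assumes that a sample is called obese when its index is strictly greater than the cutoff, and that the AUC can be estimated as the fraction of case/control pairs in which the case has the higher index (a rank-based estimate; the text does not state how the ROC curve was computed). The names `CASES`, `CONTROLS`, `rates`, `error_rate` and `auc` are hypothetical.

```python
# Obesity index values transcribed from Table 10.
# Cases: samples taken before the operation (DB62 ... DB01).
CASES = [1.191905591, 0.025381992, 1.757974404, 1.344989391, 1.796053682,
         0.072164965, 1.162137206, 0.979123077, 0.686585017]
# Controls: samples taken 1 month (DB.S1.*) and 3 months (DB.S3.*) after the operation.
CONTROLS = [0.879906331, -0.274438487, 0.0154326, -0.750440603, -0.893868407,
            0.711881869, -0.007230148, -0.029903064, -0.761996663,
            -0.588485485, -0.575369569, -0.398672766, -0.420476048]


def rates(cutoff):
    """Return (TPR, FPR), calling a sample obese when index > cutoff (assumed rule)."""
    tpr = sum(x > cutoff for x in CASES) / len(CASES)
    fpr = sum(x > cutoff for x in CONTROLS) / len(CONTROLS)
    return tpr, fpr


def error_rate(cutoff):
    """Fraction of all samples misclassified at the given cutoff."""
    missed_cases = sum(x <= cutoff for x in CASES)
    false_alarms = sum(x > cutoff for x in CONTROLS)
    return (missed_cases + false_alarms) / (len(CASES) + len(CONTROLS))


def auc():
    """Rank-based AUC estimate: share of case/control pairs ranked correctly (ties count half)."""
    wins = sum((c > k) + 0.5 * (c == k) for c in CASES for k in CONTROLS)
    return wins / (len(CASES) * len(CONTROLS))


if __name__ == "__main__":
    print("error rate at 0.5834:", round(error_rate(0.5834), 4))
    print("TPR, FPR at 0.5834:", rates(0.5834))
    print("TPR, FPR at 0.02538:", rates(0.02538))
    print("AUC:", round(auc(), 4))
```

Under these assumptions the sketch gives 4 misclassified samples out of 22 at the cutoff 0.5834 (7 of 9 cases called correctly), a TPR of 1 and an FPR of 2/13 at the cutoff 0.02538, and an AUC of 111/117, which agree with the figures stated above.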

Table 11. Gene relative abundance of Sample DB62

Thus, the inventors have identified and validated a set of 54 markers, selected by a minimum redundancy-maximum relevance (mRMR) feature selection method, based on 54 obesity-associated gut microbes. The inventors have also built a gut healthy index to evaluate the risk of obesity based on these 54 gut microbial gene markers.

Although explanatory embodiments have been shown and described, it would be appreciated by those skilled in the art that the above embodiments cannot be construed to limit the present disclosure, and changes, alternatives, and modifications can be made in the embodiments without departing from the spirit, principles and scope of the present disclosure.

Claims

A biomarker set for predicting a disease related to microbiota in a subject consisting of:

at least a partial sequence of SEQ ID NO: 1 to 54.
The biomarker set for predicting a disease related to microbiota in a subject of claim 1, wherein the disease is obesity or related disease.
A kit for determining the gene marker set of claim 1, comprising primers used for PCR amplification and designed according to the DNA sequence as set forth in claim 1.
A kit for determining the gene marker set of claim 1, comprising one or more probes designed according to at least a partial sequence of SEQ ID NO: 1 to 54.
Use of the gene marker set of claim 1 for predicting the risk of obesity or related disorder in a subject, comprising:

(1) collecting a sample j from a subject;

(2) determining the relative abundance information of each of SEQ ID NO: 1 to 54 in the DNA of the sample; and

(3) calculating an index of sample j denoted by I_j by a formula below:

A_ij is the relative abundance of marker i in sample j, wherein i refers to each of the gene markers in said gene marker set;

N is a first subset of all patient-enriched markers in selected biomarkers related to the abnormal condition,

M is a second subset of all control-enriched markers in selected biomarkers related to the abnormal condition,

|N| and |M| are numbers of the biomarkers respectively in the first and second subsets,

wherein

an index greater than a cutoff indicates that the subject has or is at the risk of developing abnormal condition.
The use according to claim 5, wherein |N| is 24, and |M| is 30.
The use according to claim 5, wherein the cutoff is at least 0.5834.
Use of the gene marker set of claim 1 for preparation of a kit for predicting the risk of obesity or related disorder in a subject, comprising:

(1) collecting a sample j from a subject;

(2) determining the relative abundance information of each of SEQ ID NO: 1 to 54 in the DNA of the sample; and

(3) calculating an index of sample j denoted by I_j by a formula below:

A_ij is the relative abundance of marker i in sample j, wherein i refers to each of the gene markers in said gene marker set;

N is a first subset of all patient-enriched markers in selected biomarkers related to the abnormal condition,

M is a second subset of all control-enriched markers in selected biomarkers related to the abnormal condition,

|N| and |M| are numbers of the biomarkers respectively in the first and second subsets,

wherein

an index greater than a cutoff indicates that the subject has or is at the risk of developing abnormal condition.
The use according to claim 8, wherein |N| is 24, and |M| is 30.
The use according to claim 9, wherein the cutoff is at least 0.5834.
A method of diagnosing whether a subject has an abnormal condition related to microbiota or is at the risk of developing an abnormal condition related to microbiota, comprising:

determining the relative abundance of the biomarkers of claim 1 in a sample from the subject, and

determining whether a subject has an abnormal condition related to microbiota or is at the risk of developing an abnormal condition related to microbiota based on the relative abundance.
The method according to claim 11, comprising:

(1) collecting a sample j from a subject;

(2) determining the relative abundance information of each of SEQ ID NO: 1 to 54 in the DNA of the sample; and

(3) calculating an index of sample j denoted by I_j by a formula below:

A_ij is the relative abundance of marker i in sample j, wherein i refers to each of the gene markers in said gene marker set;

N is a first subset of all patient-enriched markers in selected biomarkers related to the abnormal condition,

M is a second subset of all control-enriched markers in selected biomarkers related to the abnormal condition,

|N| and |M| are numbers of the biomarkers respectively in the first and second subsets,

wherein

an index greater than a cutoff indicates that the subject has or is at the risk of developing abnormal condition.
The method according to claim 12, wherein |N| is 24, and |M| is 30.
The method according to claim 13, wherein the cutoff is at least 0.5834.
The method according to claim 11, wherein the abnormal condition related to microbiota is obesity or related disorder.