HK1190758B

HK1190758B - Noninvasive detection of fetal genetic abnormality

Info

Publication number: HK1190758B
Application number: HK14103891.8A
Authority: HK
Inventors: 蒋馥蔓; 陈会飞; 柴相花; 袁玉英; 张秀清; 陈芳
Original assignee: 深圳华大基因股份有限公司
Filing date: 2011-06-29
Publication date: 2015-01-30

Abstract

The current invention is directed to methods for noninvasive detection of fetal genetic abnormalities by large-scale sequencing of nucleotides from a maternal biological sample. Further provided are methods to remove GC bias from the sequencing results according to the difference in GC content of a chromosome. The current invention not only makes the detection much more accurate but also represents a comprehensive method for fetal aneuploidy detection, including sex chromosome disorders such as XO, XXX, XXY, and XYY.

Description

Noninvasive detection of fetal genetic abnormalities

Technical Field

The present invention relates to a non-invasive method for detecting fetal genetic abnormalities by DNA sequencing of a sample from a pregnant woman. More particularly, the invention relates to data analysis to remove GC bias introduced by amplification and sequencing of DNA samples. The invention also relates to statistical analysis aimed at detecting genetic abnormalities of the fetus, such as chromosomal abnormalities including aneuploidy.

Background

Conventional prenatal diagnostic methods involving invasive procedures such as chorionic villus sampling and amniocentesis pose potential risks to both the fetus and the mother. Noninvasive screening for fetal aneuploidy using maternal serum markers and ultrasound is feasible, but has limited sensitivity and specificity (Kagan, et al., Human Reproduction (2008) 23: 1968-1975).

Recent studies have demonstrated that noninvasive detection of fetal aneuploidy by massively parallel sequencing of DNA molecules in the plasma of pregnant women is feasible. Fetal DNA has been detected and quantified in maternal plasma and serum (Lo, et al., Lancet (1997) 350: 485-487; Lo, et al., Am. J. Hum. Genet. (1998) 62: 768-775). A variety of fetal cell types occur in the maternal circulation, including fetal granulocytes, lymphocytes, nucleated red blood cells, and trophoblasts (Pertl and Bianchi, Obstetrics and Gynecology (2001) 98: 483-490). Fetal DNA can be detected in serum at week 7 of pregnancy and increases with gestation. Fetal DNA is present in maternal serum and plasma at concentrations comparable to those obtained from fetal cell isolation procedures.

Circulating fetal DNA has been used to determine the gender of the fetus (Lo, et al., Am. J. Hum. Genet. (1998) 62: 768-775). Fetal rhesus D genotype has also been determined using fetal DNA. However, diagnostic and clinical applications of circulating fetal DNA are limited to genes present in the fetus but not in the mother (Pertl and Bianchi, Obstetrics and Gynecology (2001) 98: 483-490). Thus, there remains a need for a non-invasive method that can determine fetal DNA sequences and provide a definitive diagnosis of fetal chromosomal abnormalities.

The discovery of fetal cells and cell-free fetal nucleic acids in maternal blood, together with the application of high-throughput shotgun sequencing to maternal plasma cell-free DNA over the past decades, has made it feasible to detect the small changes in chromosome representation that an aneuploid fetus causes in a maternal plasma sample. Noninvasive detection of trisomy 13, trisomy 18, and trisomy 21 pregnancies has been achieved.

However, as some studies have shown, GC bias introduced by amplification and sequencing creates operational limitations on the sensitivity of aneuploidy detection. GC bias can be introduced during sample preparation and sequencing under different conditions, such as reagent composition, cluster density, and temperature, which causes significant bias in the sequencing data for DNA molecules of different GC compositions and for chromosomes that are GC-rich or GC-poor.

To improve sensitivity, methods have been developed to remove the effects of GC bias. Fan and Quake developed a method that computationally removes GC bias by assigning a weight to each GC density based on local genomic GC content, and then correcting the number of reads mapped to each bin by multiplying it by the corresponding weight (Fan and Quake, PLoS ONE (2010) 5: e10439). However, this method has difficulty in dealing with sex chromosome disorders, particularly Y chromosome-related disorders, because it may cause slight distortion of the data, which may interfere with the accuracy of the detection.

In this context, the inventors describe a method of computationally removing GC bias, thereby achieving higher sensitivity of fetal genetic abnormality detection while avoiding data distortion. The method defines parameters for statistical tests based on GC content. In addition, the inventors introduce the estimated fetal fraction into the diagnosis through a binary hypothesis test, which shows higher sensitivity and specificity. The inventors' method also shows that, for maternal samples containing low fetal DNA fractions, the sensitivity of noninvasive detection of fetal genetic abnormalities can be increased to a predetermined precision by sequencing more polynucleotide fragments. Resampling maternal plasma at a later gestational week may also increase diagnostic sensitivity.

Disclosure of Invention

The present invention relates to a method for noninvasive detection of fetal genetic abnormalities by large-scale sequencing of nucleotides from maternal biological samples. Further provided are methods for removing GC bias from sequencing results due to differences in chromosomal GC content.

Accordingly, in one aspect, provided herein is a method for establishing a relationship between coverage depth and GC content of a chromosome, the method comprising: obtaining sequence information for a plurality of polynucleotide fragments encompassing the chromosome and another chromosome from more than one sample; assigning the fragments to chromosomes based on the sequence information; calculating a depth of coverage and GC content of the chromosome based on the sequence information for each sample; and determining a relationship between the depth of coverage and GC content of the chromosome.

In one embodiment, the polynucleotide fragments range in length from about 10 to about 1000 bp. In another embodiment, the polynucleotide fragments range in length from about 15 to about 500 bp. In yet another embodiment, the polynucleotide fragments range in length from about 20 to about 200 bp. In yet another embodiment, the polynucleotide fragments range in length from about 25 to about 100 bp. In another embodiment, the polynucleotide fragment is about 35bp in length.

In one embodiment, the sequence information is obtained by parallel genome sequencing. In another embodiment, assigning the fragments to chromosomes is performed by comparing the sequences of the fragments to a human genome reference sequence. The human genome reference sequence may be any suitable and/or published build of the human genome, such as hg18 or hg19. Fragments that are assigned to more than one chromosome, or that cannot be assigned to any chromosome, may be ignored.
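The assignment step described above can be sketched as follows. This is only an illustration, assuming alignments are already available as (read_id, chromosome) pairs; the function name `count_unique_assignments` and the input layout are hypothetical and are not taken from the disclosure.

```python
from collections import Counter, defaultdict


def count_unique_assignments(alignments):
    """Count fragments per chromosome, ignoring ambiguous or unassigned reads.

    `alignments` is an iterable of (read_id, chromosome) pairs. A read with
    no alignment is given as chromosome None (an assumed input layout).
    """
    hits = defaultdict(set)
    for read_id, chrom in alignments:
        if chrom is not None:
            hits[read_id].add(chrom)

    counts = Counter()
    for chroms in hits.values():
        # Keep only reads that were assigned to exactly one chromosome.
        if len(chroms) == 1:
            counts[next(iter(chroms))] += 1
    return counts
```

In this sketch a read that aligns to two chromosomes contributes to neither count, which mirrors the statement that such fragments may be ignored.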

In one embodiment, the depth of coverage of a chromosome is the ratio between the number of fragments assigned to the chromosome and the number of reference unique reads of the chromosome. In another embodiment, the depth of coverage is normalized. In yet another embodiment, normalization is calculated with respect to the coverage of all other autosomes. In yet another embodiment, normalization is calculated with respect to the coverage of all other chromosomes.
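A minimal sketch of the coverage depth ratio and one possible normalization is given below. The text does not say how the coverage of the other autosomes is combined, so dividing by their mean is an assumption, and the function names are hypothetical.

```python
def coverage_depth(assigned_counts, reference_unique_reads):
    """Depth per chromosome = assigned fragments / reference unique reads."""
    return {
        chrom: assigned_counts[chrom] / reference_unique_reads[chrom]
        for chrom in assigned_counts
    }


def normalize_to_autosomes(depth):
    """Divide each chromosome's depth by the mean depth of the other autosomes.

    Using the mean is an assumption made for illustration only.
    """
    autosomes = [c for c in depth if c not in ("X", "Y")]
    normalized = {}
    for chrom, d in depth.items():
        others = [depth[a] for a in autosomes if a != chrom]
        normalized[chrom] = d / (sum(others) / len(others))
    return normalized
```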

In one embodiment, the relationship is the following equation:

$$cr_{i,j} = f(GC_{i,j}) + \varepsilon_{i,j}, \quad j = 1, 2, \ldots, 22, X, Y,$$

wherein $f(GC_{i,j})$ represents the relationship between the normalized coverage depth of sample i, chromosome j and the corresponding GC content, and $\varepsilon_{i,j}$ represents the residual of sample i, chromosome j. In some embodiments, the relationship between depth of coverage and GC content is calculated by local polynomial regression. In some embodiments, the relationship may be a non-strong linear relationship. In some embodiments, the relationship is determined by a loess algorithm.
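The local regression mentioned above can be illustrated with a small pure-Python estimator. This is a sketch of locally weighted linear regression with tricube weights, one common form of loess; the span value and the function name `loess_fit` are assumptions, and the inventors' actual settings are not stated here.

```python
def loess_fit(x, y, x0, span=0.75):
    """Estimate f(x0) by locally weighted linear regression (tricube weights).

    x: GC contents of the reference samples for one chromosome
    y: normalized coverage depths of the same samples
    span: fraction of samples used as the local neighbourhood (assumed value)
    """
    n = len(x)
    k = min(n, max(2, round(span * n)))
    # Bandwidth: distance to the k-th nearest reference point.
    h = sorted(abs(xi - x0) for xi in x)[k - 1]
    if h == 0:
        h = 1e-12
    w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
    sw = sum(w)
    if sw == 0:
        # Degenerate case: all points sit on the bandwidth edge.
        w = [1.0] * n
        sw = float(n)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)
```

Calling `loess_fit(gc_values, depths, gc_of_test_sample)` would give a fitted coverage depth for a test sample's GC content under these assumptions.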

In some embodiments, the method further comprises calculating the fitted depth of coverage according to the following formula:

In some embodiments, the method further comprises calculating the standard deviation according to the following formula:

where ns represents the number of reference samples.
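The formulas for the fitted depth, the standard deviation, and the t-statistic are not legible in this text. The sketch below therefore rests on assumptions: the standard deviation is taken over the residuals between observed and fitted coverage depth of the ns reference samples with an ns - 1 denominator, and the t-statistic is the residual of a test sample divided by that standard deviation. Function names are hypothetical.

```python
import math


def residual_std(observed, fitted):
    """Standard deviation of residuals over the ns reference samples."""
    ns = len(observed)
    ss = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    return math.sqrt(ss / (ns - 1))  # the ns - 1 denominator is an assumption


def t_statistic(cr, cr_fitted, std):
    """Residual of one sample scaled by the reference standard deviation."""
    return (cr - cr_fitted) / std
```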

In some embodiments, the method further comprises calculating the student t-statistic according to the formula:

In one embodiment, the GC content of a chromosome is the average GC content of all fragments that are assigned to the chromosome. The GC content of a fragment can be calculated by dividing the number of G/C nucleotides of the fragment by the total number of nucleotides of the fragment. In another embodiment, the GC content of a chromosome is the aggregate GC content of the reference unique reads of the chromosome.
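The per-fragment and per-chromosome GC content described above can be computed as in this short sketch; the function names are illustrative only.

```python
def fragment_gc(seq):
    """Number of G/C nucleotides divided by the fragment length."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def chromosome_gc(fragments):
    """Average GC content of all fragments assigned to one chromosome."""
    return sum(fragment_gc(s) for s in fragments) / len(fragments)
```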

In some embodiments, at least 2, 5, 10, 20, 50, 100, 200, 500, or 1000 samples are used. In some embodiments, the chromosome is chromosome 1, 2, ..., 22, X, or Y.

In one embodiment, the sample is from a pregnant female subject. In another embodiment, the sample is from a male subject. In yet another embodiment, the sample is from both a pregnant female subject and a male subject.

In some embodiments, the sample is a biological sample. In some embodiments, the sample is a peripheral blood sample.

Also provided herein is a method of detecting a fetal genetic abnormality, the method comprising: a) obtaining sequence information for a plurality of polynucleotide fragments from a sample; b) assigning the fragments to chromosomes based on the sequence information; c) calculating a coverage depth and a GC content of the chromosome based on the sequence information; d) calculating a fitted coverage depth for the chromosome using the GC content of the chromosome and the established relationship between coverage depth and GC content of the chromosome; and e) comparing the fitted coverage depth to the coverage depth of the chromosome, wherein a difference between them is indicative of a fetal genetic abnormality.

In some embodiments, the method further comprises the step of f) determining the gender of the fetus. The fetal gender may be determined according to the following formula:

wherein $cr.a_{i,X}$ and $cr.a_{i,Y}$ are the normalized relative coverage of the X and Y chromosomes, respectively.

In some embodiments, the method further comprises the step of g) estimating the fetal fraction. The fetal fraction may be calculated according to the following formula:

wherein one term is the fitted coverage depth calculated from the relationship between the Y chromosome coverage depth and the corresponding GC content of a sample from a pregnant woman carrying a female fetus, and the other term is the fitted coverage depth calculated from the relationship between the Y chromosome coverage depth and the corresponding GC content of a male subject. Further, the fetal fraction may be calculated according to the following formula:

wherein one term is the fitted coverage depth calculated from the relationship between the X chromosome coverage depth and the corresponding GC content of a sample from a pregnant woman carrying a female fetus, and the other term is the fitted coverage depth calculated from the relationship between the X chromosome coverage depth and the corresponding GC content of a male subject. Further, the fetal fraction may be calculated according to the following formula:

wherein the four fitted terms are, respectively: the fitted coverage depth calculated from the relationship between the X chromosome coverage depth and the corresponding GC content of a sample from a pregnant woman carrying a female fetus; the fitted coverage depth calculated from the relationship between the Y chromosome coverage depth and the corresponding GC content of a sample from a pregnant woman carrying a female fetus; the fitted coverage depth calculated from the relationship between the X chromosome coverage depth and the corresponding GC content of a male subject; and the fitted coverage depth calculated from the relationship between the Y chromosome coverage depth and the corresponding GC content of a male subject.
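The fetal fraction formulas are not legible in this text. As a rough illustration only, the sketch below assumes the observed sex chromosome coverage of a maternal sample is a linear mixture of the female-fetus fitted value and the male fitted value, so that the fetal fraction is the position of the observed value between the two. The function name and the example numbers are hypothetical.

```python
def interpolated_fraction(observed, fitted_female, fitted_male):
    """Fraction f such that observed = (1 - f) * fitted_female + f * fitted_male.

    The linear-mixture form is an assumption, not a formula from the text.
    """
    return (observed - fitted_female) / (fitted_male - fitted_female)


# Illustrative values only: Y-based and X-based estimates for one sample.
fy = interpolated_fraction(0.05, fitted_female=0.0, fitted_male=0.5)
fx = interpolated_fraction(0.95, fitted_female=1.0, fitted_male=0.5)
```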

In one embodiment, the genetic abnormality is a chromosomal abnormality. In another embodiment, the genetic abnormality is aneuploidy. In yet another embodiment, the fetal aneuploidy is an autosomal disorder selected from the group consisting of trisomy 13, trisomy 18, and trisomy 21. In yet another embodiment, the fetal aneuploidy is a sex chromosome disorder selected from the group consisting of XO, XXX, XXY, and XYY.

In some embodiments, comparing the fitted coverage depth to the coverage depth of the chromosome is performed by statistical hypothesis testing, wherein one hypothesis is that the fetus is euploid (H0) and another hypothesis is that the fetus is aneuploid (H1). Statistics may be calculated for both hypotheses. In some embodiments, Student's t-statistics t1 and t2 are calculated for H0 and H1, respectively, according to formulas in which fxy is the fetal fraction. In some embodiments, the log-likelihood ratio of t1 and t2 is calculated according to the following formula: $L_{i,j} = \log(p(t1_{i,j}, \mathrm{degree} \mid D)) / \log(p(t2_{i,j}, \mathrm{degree} \mid T))$, where degree refers to the degrees of freedom of the t distribution, D refers to diploidy, T refers to trisomy, and $p(t_{i,j}, \mathrm{degree} \mid *)$, with * = D or T, represents the conditional probability density given the degrees of freedom of the t distribution.
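The two-hypothesis comparison can be sketched as below. The t1 and t2 formulas are not legible in this text, so the sketch assumes that under H1 (trisomy) the expected coverage is the fitted coverage scaled by (1 + fxy/2); that scaling and the function names are assumptions. The ratio is computed exactly as written above, as a quotient of two log densities.

```python
import math


def t_pdf(t, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))


def hypothesis_statistics(cr, cr_fitted, std, fxy):
    """Return (t1, t2) for H0 (euploid) and H1 (trisomy, assumed scaling)."""
    t1 = (cr - cr_fitted) / std
    t2 = (cr - cr_fitted * (1 + fxy / 2)) / std  # assumption, see lead-in
    return t1, t2


def log_likelihood_ratio(t1, t2, df):
    """L = log p(t1, df | D) / log p(t2, df | T), as written in the text."""
    return math.log(t_pdf(t1, df)) / math.log(t_pdf(t2, df))
```

Because both log densities are negative, a value of L well above 1 means that the observation is far less probable under the diploid hypothesis than under the trisomy hypothesis.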

In one embodiment, the fetal gender is female and the Student's t-statistic t1 is calculated according to a formula in which the fitted term is the fitted coverage depth calculated from the relationship between the X chromosome coverage depth and the corresponding GC content of a sample from a pregnant woman carrying a female fetus. In some embodiments, |t1| > 3.13 indicates that the fetus is XXX or XO. In some embodiments, |t1| > 5 indicates that the fetus is XXX or XO.

In another embodiment, the fetal gender is male and the Student's t-statistic t2 is calculated according to a formula in which the fitted term is the fitted coverage depth calculated from the relationship between the X chromosome coverage depth and the corresponding GC content of a sample from a pregnant woman carrying a female fetus. In some embodiments, |t2| > 3.13 indicates that the fetus may be XXY or XYY. In some embodiments, |t2| > 5 indicates that the fetus is XXY or XYY.
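The threshold rules for the sex chromosomes can be expressed as a small screening helper. This is a sketch only: the function name and the returned labels are hypothetical, and the t value is assumed to have been computed already for the relevant fetal gender.

```python
def screen_sex_chromosomes(fetal_gender, t, threshold=3.13):
    """Flag a possible sex chromosome aneuploidy from a t-statistic.

    threshold may be 3.13 or the stricter value 5 mentioned in the text.
    """
    if abs(t) <= threshold:
        return "no abnormality flagged"
    if fetal_gender == "female":
        return "XXX or XO"
    return "XXY or XYY"
```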

Further provided herein is a method of determining a fetal genetic abnormality, the method comprising: a) obtaining sequence information for a plurality of polynucleotide fragments covering a chromosome and another chromosome from more than one normal sample; b) assigning the fragments to chromosomes based on the sequence information; c) calculating the coverage depth and GC content of the chromosome based on the sequence information of the normal sample; d) determining the relationship between the depth of coverage and GC content of the chromosome; e) obtaining sequence information for a plurality of polynucleotide fragments from a biological sample; f) assigning the fragments to chromosomes based on the sequence information from the biological sample; g) calculating a depth of coverage and GC content of the chromosome based on the sequence information of the biological sample; h) calculating a fitted coverage depth for the chromosome using the GC content of the chromosome and the relationship between coverage depth and GC content of the chromosome; and i) comparing the fitted coverage depth to the coverage depth of the chromosome, wherein a difference between them is indicative of a fetal genetic abnormality.

In another aspect, provided herein is a computer readable medium containing a plurality of instructions for performing prenatal diagnosis of a fetal genetic abnormality, comprising the steps of: a) receiving sequence information for a plurality of polynucleotide fragments from a sample; b) assigning the polynucleotide fragments to chromosomes based on the sequence information; c) calculating a coverage depth and a GC content of the chromosome based on the sequence information; d) calculating a fitted coverage depth for the chromosome using the GC content of the chromosome and the established relationship between coverage depth and GC content of the chromosome; and e) comparing the fitted coverage depth to the coverage depth of the chromosome, wherein a difference between them is indicative of a fetal genetic abnormality.

In yet another aspect, provided herein is a system for determining a fetal genetic abnormality, comprising: a) means for obtaining sequence information for a plurality of polynucleotide fragments from a sample; and b) a computer readable medium comprising a plurality of instructions for performing prenatal diagnosis of a fetal genetic abnormality. In some embodiments, the system further comprises a biological sample obtained from a pregnant female subject, wherein the biological sample comprises a plurality of polynucleotide fragments.

Drawings

FIG. 1 shows a schematic diagram for calculating the depth of coverage and GC content by using sequence information of polynucleotide fragments.

Figure 2 shows the correlation between normalized coverage depth and GC content established by using data from 300 reference cases. The normalized coverage depth for each case was plotted against the GC content of the sequences. Crosses indicate cases with euploid female fetuses and squares indicate cases with euploid male fetuses. The solid line is the fitted line of coverage depth versus GC content.

FIG. 3 shows the trend between normalized coverage depth and corresponding GC content, with chromosomes arranged in order of increasing intrinsic GC content. Here the intrinsic GC content of each chromosome refers to the average GC content of the sequence tags from that chromosome in the 300 reference cases.

Figure 4 shows the different composition of GC classes for each chromosome. The GC content of each 35 bp read of the reference unique reads was calculated for each chromosome, the GC content was ranked into 36 levels, and the percentage of each level was calculated as the GC composition of each chromosome. The chromosomes were then plotted as a heat map and clustered hierarchically.
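The GC composition described for Figure 4 can be sketched as follows, assuming the 36 levels correspond to the 36 possible G/C counts (0 through 35) of a 35 bp read. That correspondence and the function name are assumptions.

```python
from collections import Counter


def gc_composition(reads, read_length=35):
    """Fraction of reads at each G/C count level, 0..read_length.

    Reads of any other length are skipped. Assumes one level per possible
    G/C count, which gives 36 levels for 35 bp reads.
    """
    counts = Counter()
    for read in reads:
        if len(read) != read_length:
            continue
        read = read.upper()
        counts[read.count("G") + read.count("C")] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads of the expected length")
    return [counts[level] / total for level in range(read_length + 1)]
```

The resulting vector per chromosome could then be used as one row of a heat map.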

FIG. 5 shows that the correlation shown in FIG. 2 is introduced by sequencing bias, as demonstrated by artificially simulating sequencer preference.

Figure 6 plots the standard deviation relative to the total number of polynucleotide fragments sequenced. The corrected standard deviation for each chromosome showed a linear relationship with the inverse of the square root of the number of unique reads in 150 samples.

Fig. 7 shows a plot of each chromosome residual calculated by equation 3, showing a linear relationship with a normal distribution.

Fig. 8 shows a histogram of the Y chromosome coverage depth. There are two peaks indicating that case gender can be distinguished by depth of coverage of the Y chromosome. The curve is the distribution of relative coverage depths of the Y chromosome obtained by kernel density estimation with Gaussian kernel function.
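A Gaussian kernel density estimate of the kind mentioned for Fig. 8 can be sketched in a few lines. The bandwidth is a free parameter here; how it was chosen for the figure is not stated, and the function name is hypothetical.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function estimated with a Gaussian kernel."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density
```

With Y chromosome coverage values from a mix of female-fetus and male-fetus cases, the estimated density would be expected to show two peaks, which is the separation the figure relies on.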

Fig. 9 shows a schematic of a method for diagnosing fetal chromosomal abnormalities for 903 test samples.

Figure 10 shows the results for aneuploidy: trisomy 13, trisomy 18 and trisomy 21 as well as XO, XXY, and XYY cases and normal cases. Fig. 10A shows a plot of GC content versus normalized coverage depth for chromosomes 13, 18, and 21. FIG. 10B shows a plot for the X and Y chromosomes. Circles represent the relative coverage depth and GC content of normal female fetuses, and dots represent normal male fetuses. The solid line is the fitted line of relative coverage and GC content; the remaining lines mark absolute t values of 1, 2, and 3.

Fig. 11 compares the confidence values of different diagnostic methods.

FIG. 12 shows the relationship between fetal DNA fraction and gestational age. The fraction of fetal DNA in maternal plasma correlates with gestational age. The fetal DNA fraction is estimated from X and Y together. There was a statistically significant correlation between mean fetal DNA fraction and gestational age (P < 0.001). Note that the R² value, the square of the correlation coefficient, is small. The minimum fraction was 3.49%.

Fig. 13 shows the relationship between the standard deviation and the number of cases required for detection. The standard deviation for each chromosome was calculated by equation 5 as a function of the number of different samples. The standard deviation becomes stable when the number of samples exceeds 100.

Figure 14 shows an estimate of the number of unique reads used for fetal aneuploidy detection in cell-free plasma as a function of fetal DNA fraction. The assessment for aneuploidy of chromosomes 13, 18, 21 and X and even Y (from the relationship between X and Y), each having different lengths, is based on a t-value confidence level of not less than 3. As the fraction of fetal DNA decreases, the total number of shotgun sequences required increases. Using a sequencing throughput of 4 million sequence reads per channel on a flow cell (flowcell), trisomy 21 could be detected if 3.5% of the cell-free DNA was fetal. When the fraction and number of unique reads is small, e.g., 4% and 5 million reads, aneuploidy of the X chromosome is not readily detected. Different chromosomes require different levels of fetal DNA fraction and unique read numbers, which may be due to the GC structure of the chromosome.

Fig. 15 shows a contour plot of sensitivity, computed for each gestational week and each data volume point, as a function of data volume and gestational age (weeks) for detecting trisomy 13 in a female fetus.

Fig. 16 shows a contour plot of sensitivity, computed for each gestational week and each data volume point, as a function of data volume and gestational age (weeks) for detecting trisomy 18 in a female fetus.

Fig. 17 shows a contour plot of sensitivity, computed for each gestational week and each data volume point, as a function of data volume and gestational age (weeks) for detecting trisomy 21 in a female fetus.

Fig. 18 shows a contour plot of sensitivity, computed for each gestational week and each data volume point, as a function of data volume and gestational age (weeks) for detecting X chromosome trisomy in a female fetus.

FIG. 19 shows a plot of sensitivity as a function of data volume and gestational age (weeks) for detecting trisomy 13 in a male fetus. For each gestational week and each data volume point, the inventors calculated the empirical distribution of the fetal DNA fraction and the standard deviation for the given data volume, compared the fractions estimated by XY or Y, and then calculated the sensitivity for each type of aneuploidy.

FIG. 20 shows a plot of sensitivity as a function of data volume and gestational age (weeks) for detecting trisomy 18 in a male fetus.

FIG. 21 shows a plot of sensitivity as a function of data volume and gestational age (weeks) for detecting trisomy 21 in a male fetus.

Detailed Description

The invention relates to methods for noninvasive detection of fetal genetic abnormalities by large-scale sequencing of polynucleotide fragments from maternal biological samples. Further provided are methods for removing GC bias in sequencing results due to differences in GC content of chromosomes, based on the relationship between the depth of coverage of chromosomes and the corresponding GC content. Thus, provided herein is a method of computationally correcting for GC content, in which the reference parameters for the Student's t-statistic are obtained by fitting the coverage depth of each chromosome in each sample against the GC content of the polynucleotide fragments using locally weighted polynomial regression.

Also provided herein is a method of determining a fetal genetic abnormality by statistical analysis using statistical hypothesis testing. Additionally, methods are provided to calculate Data Quality Control (DQC) criteria that can be used to determine the amount of clinical sample required for a particular level of statistical significance.

I. Definitions

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. All patents, patent applications, published patent applications, and other publications referred to herein are incorporated by reference in their entirety. If a definition set forth in this section is contrary to or otherwise inconsistent with a definition set forth in the patents, patent applications, published patent applications, and other publications that are herein incorporated by reference, the definition set forth in this section prevails over the definition that is incorporated herein by reference.

As used herein, the singular forms "a", "an" and "the" include plural referents unless the content clearly dictates otherwise. For example, a "dimer" includes one or more dimers.

The term "chromosomal abnormality" refers to a deviation between the structure of a subject's chromosome and a normal homologous chromosome. The term "normal" refers to the predominant karyotype or banding pattern present in a healthy individual of a particular species. Chromosomal abnormalities can be numerical or structural and include, but are not limited to, aneuploidy, polyploidy, inversion, trisomy, monosomy, duplication, deletion, partial chromosomal deletion, addition, partial chromosomal addition, insertion, chromosomal fragment, chromosomal region, chromosomal rearrangement, and translocation. Chromosomal abnormalities may be associated with the presence of a pathological condition or a predisposition to a pathological condition. A single nucleotide polymorphism ("SNP"), as defined herein, is not a chromosomal abnormality.

Monosomy X (XO, deletion of the entire X chromosome) is the most common type of Turner syndrome, occurring in 1 of every 2500 to 3000 newborn girls (Sybert and McCauley, N Engl J Med (2004) 351: 1227-). XXY syndrome is a condition in which human males have an extra X chromosome, occurring in approximately 1 out of every 1000 males (Bock, Understanding Klinefelter Syndrome: A Guide for XXY Males and Their Families, NIH Pub. No. 93-3202 (1993)). XYY syndrome is a sex chromosome aneuploidy in which human males have an extra Y chromosome, giving 47 chromosomes instead of the normal 46; it affects 1 out of 1000 males born and may lead to male sterility (Aksglaede, et al., J Clin Endocrinol Metab (2008) 93: 169-).

Turner syndrome includes several conditions, of which monosomy X (XO, absence of an entire sex chromosome, the Barr body) is the most common. Females usually have two X chromosomes, but in Turner syndrome one of these sex chromosomes is missing. Occurring in 1 of every 2000 to 5000 phenotypic females, the syndrome manifests itself in various ways. Klinefelter syndrome is a condition in which human males have an extra X chromosome. Klinefelter syndrome is the most common sex chromosome disorder in humans, and the second most common condition caused by the presence of extra chromosomes. It occurs in approximately 1 out of every 1000 males. XYY syndrome is a sex chromosome aneuploidy in which a human male has one extra Y chromosome, giving a total of 47 chromosomes instead of the normal 46 and producing a 47,XYY karyotype. The condition is usually asymptomatic, affects 1 in 1000 males born, and may lead to male infertility.

Trisomy 13 (Patau syndrome), trisomy 18 (Edwards syndrome) and trisomy 21 (Down syndrome) are the most clinically important autosomal trisomies, and how to detect them has been a hot topic. Detection of the above fetal chromosomal aberrations is of great importance in prenatal diagnosis (Ostler, Diseases of the eye and skin: a color atlas. Lippincott Williams & Wilkins. pp. 72. ISBN 9780781749992 (2004); Driscoll and Gross, N Engl J Med (2009) 360: 2556-2562; Kagan, et al., Human Reproduction (2008) 23: 1968-1975).

The term "reference unique reads" refers to chromosome fragments having unique sequences. Thus, such fragments can be unambiguously assigned to a single chromosomal location. Reference unique reads of chromosomes can be constructed based on published genomic reference sequences such as hg18 or hg19.

The terms "polynucleotide", "oligonucleotide", "nucleic acid", and "nucleic acid molecule" are used interchangeably herein to refer to a polymeric form of nucleotides of any length, and may include ribonucleotides, deoxyribonucleotides, analogs thereof, or mixtures thereof. The term refers only to the primary structure of the molecule. Thus, the term includes triple-, double- and single-stranded deoxyribonucleic acid ("DNA") as well as triple-, double- and single-stranded ribonucleic acid ("RNA"). It also includes modified forms (e.g., by alkylation and/or by capping) and unmodified forms of the polynucleotide. More specifically, the terms "polynucleotide", "oligonucleotide", "nucleic acid" and "nucleic acid molecule" include polydeoxyribonucleotides (containing 2-deoxy-D-ribose), polyribonucleotides (containing D-ribose), including tRNA, rRNA, hRNA, and mRNA, whether spliced or unspliced, any other type of polynucleotide that is an N- or C-glycoside of a purine or pyrimidine base, and other polymers containing a nonnucleotidic backbone, such as polyamide (e.g., peptide nucleic acid ("PNA")) and polymorpholino (commercially available from Anti-Virals, Inc., Corvallis, OR) polymers, and other synthetic sequence-specific nucleic acid polymers, provided that the polymers contain nucleobases in a configuration that allows base pairing and base stacking, such as is found in DNA and RNA.
Thus, these terms include, for example, 3'-deoxy-2',5'-DNA, oligodeoxyribonucleotide N3'-P5' phosphoramidates, 2'-O-alkyl-substituted RNA, hybrids between DNA and RNA or between PNAs and DNA or RNA, and also include known types of modifications, for example, labels, alkylation, "caps", substitution of one or more of the nucleotides with an analog, and internucleotide modifications such as those with uncharged linkages (e.g., methyl phosphate, phosphotriester, phosphoramidate, carbamate, etc.), those with negatively charged linkages (e.g., phosphorothioate, phosphorodithioate, etc.), those with positively charged linkages (e.g., aminoalkyl phosphoramidate, aminoalkyl phosphotriester), those containing pendant moieties such as proteins (including enzymes (e.g., nucleases), toxins, antibodies, signal peptides, poly-L-lysine, etc.), those with intercalators (e.g., acridine, psoralen, etc.), those containing chelates (e.g., metals, radioactive metals, boron, metal oxides, etc.), those containing alkylators, and those with modified linkages (e.g., alpha anomeric nucleic acids), as well as unmodified forms of the polynucleotide or oligonucleotide.

"massively parallel sequencing" refers to a technique for sequencing millions of nucleic acid fragments, for example, by attaching randomly fragmented genomic DNA onto a transparent plane and performing solid phase amplification to form a high-density sequencing flow cell with millions of clusters, each cluster containing about 1000 copies of template per square centimeter. These templates were sequenced using 4-color DNA sequencing-by-synthesis techniques. See the products provided by Illumina, inc., San Diego, Calif. The sequencing used in the present invention is preferably performed without a pre-amplification or cloning step, but may be combined with an amplification-based method having reaction chambers on a microfluidic chip that can be used for PCR and sequencing based on microscopic templates. Only about 30bp of random sequence information is required to identify a chromosomal sequence belonging to a particular human. Longer sequences can uniquely identify more specific targets. In this example, a large number of 35bp reads were obtained. Further description of massively parallel sequencing methods is given in Rogers and Ventner, Nature (2005) 437: 326-327.

As used herein, "biological sample" refers to any sample obtained from a living or viral source, or from another source of macromolecules and biomolecules, and includes any cell type or tissue of a subject from which nucleic acids, proteins or other macromolecules can be obtained. The biological sample may be a sample obtained directly from a biological source, or a processed sample. For example, isolated nucleic acid that has been amplified constitutes a biological sample. Biological samples include, but are not limited to, bodily fluids such as blood, plasma, serum, cerebrospinal fluid, synovial fluid, urine, and sweat; tissue and organ samples from animals and plants; and processed samples thereof.

It is to be understood that the aspects and embodiments of the invention described herein include "consisting of" and/or "consisting essentially of" aspects and embodiments.

Other objects, advantages and features of the present invention will become apparent from the following detailed description taken in conjunction with the accompanying drawings.

Establishing a relationship between depth of coverage and GC content

Provided herein is a method for establishing a relationship between coverage depth and GC content of a chromosome, the method comprising: obtaining sequence information for a plurality of polynucleotide fragments encompassing the chromosome and another chromosome from more than one sample; assigning the fragments to chromosomes based on the sequence information; calculating a depth of coverage and GC content of the chromosome based on the sequence information for each sample; and determining a relationship between the depth of coverage and GC content of the chromosome. The steps of operations may be performed in an unspecified order. In some embodiments, the method may be performed in the following order: a) obtaining sequence information for a plurality of polynucleotide fragments encompassing the chromosome and another chromosome from more than one sample; b) assigning the fragments to chromosomes based on the sequence information; c) calculating a depth of coverage and GC content of the chromosome based on the sequence information for each sample; and d) determining a relationship between the depth of coverage and GC content of the chromosome.

To calculate the depth of coverage and GC content of a chromosomal location, sequence information of the polynucleotide fragments is obtained by sequencing template DNA obtained from the sample. In one embodiment, the template DNA comprises both maternal and fetal DNA. In another embodiment, the template DNA is obtained from the blood of a pregnant woman. Blood may be collected using any conventional technique for drawing blood, including but not limited to venipuncture. For example, blood may be drawn from a vein on the inside of the elbow or on the back of the hand. Blood samples may be taken from the pregnant woman at any time during gestation. For example, a blood sample may be taken at weeks 1-4, 4-8, 8-12, 12-16, 16-20, 20-24, 24-28, 28-32, 32-36, 36-40, or 40-44 of gestation, preferably at weeks 8-28 of gestation.

The polynucleotide fragments are assigned to chromosomal locations based on the sequence information. A genomic reference sequence is used to obtain the reference unique reads. The term "reference unique read" as used herein refers to all unique polynucleotide fragments that have been assigned to a specific genomic location based on a genomic reference sequence. In some embodiments, the reference unique reads have the same length, e.g., about 10, 12, 15, 20, 25, 30, 35, 40, 50, 100, 200, 300, 500, or 1000 bp. In other embodiments, the human genome version hg18 or hg19 may be used as the genomic reference sequence. A chromosomal location may be a contiguous window on a chromosome having a length of about 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10,000 or more kb. The chromosomal location may also be a single chromosome.

The term "depth of coverage" as used herein refers to the ratio between the number of fragments assigned to a chromosomal location and the number of reference unique reads for that chromosomal location, using the following formula:

$$c_{i,j} = \frac{n_{i,j}}{N_j}, \quad j = 1, 2, \ldots, 22, X, Y \tag{1}$$

wherein $n_{i,j}$ is the number of unique sequence reads in sample i that map to chromosome j; $c_{i,j}$ is the depth of coverage of chromosome j in sample i; and $N_j$ is the number of reference unique reads in chromosome j.

In some embodiments, polynucleotide fragments that cannot be assigned to a single chromosomal location, or that are assigned to multiple chromosomal locations, are ignored. In some embodiments, the coverage depth is normalized based on the coverage depth of another chromosomal location, the coverage depth of another chromosome, the average coverage depth of all other autosomes, the average coverage depth of all other chromosomes, or the average coverage depth of all chromosomes. In some embodiments, the average depth of coverage of the 22 autosomes is used as a normalization constant to account for differences in the total number of sequence reads obtained for different samples:

$$cr_{i,j} = \frac{c_{i,j}}{\frac{1}{22}\sum_{k=1}^{22} c_{i,k}}, \quad j = 1, 2, \ldots, 22, X, Y \tag{2}$$

wherein $cr_{i,j}$ represents the relative depth of coverage of chromosome j in sample i. The "relative depth of coverage" of each chromosome is thus a normalized value that is used to compare different samples and for subsequent analysis.
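The coverage depth and its normalization by the autosomal mean can be illustrated with a minimal sketch. It assumes per-chromosome read counts are already available; the function names and the toy counts are hypothetical and are not taken from the inventors' data.

```python
AUTOSOMES = [str(k) for k in range(1, 23)]

def coverage_depth(mapped_reads, reference_unique_reads):
    """Equation 1: c_j = n_j / N_j for every chromosome j of one sample."""
    return {chrom: mapped_reads[chrom] / reference_unique_reads[chrom]
            for chrom in mapped_reads}

def relative_coverage_depth(depth):
    """Divide each chromosome's depth by the mean depth of the 22 autosomes."""
    autosome_mean = sum(depth[c] for c in AUTOSOMES) / len(AUTOSOMES)
    return {chrom: value / autosome_mean for chrom, value in depth.items()}

# Toy data (hypothetical): 1000 reference unique reads per chromosome.
reference = {c: 1000 for c in AUTOSOMES + ["X", "Y"]}
sample = {c: 100 for c in AUTOSOMES}
sample["21"] = 110   # slight excess of reads on chromosome 21
sample["X"] = 90
sample["Y"] = 5

depth = coverage_depth(sample, reference)
relative = relative_coverage_depth(depth)
```

By construction the relative coverage depths of the 22 autosomes average to 1, so samples sequenced to different total depths become directly comparable.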

The GC content of a chromosomal location can be calculated as the average GC percentage of that location, based either on the reference unique reads of the chromosomal location or on the sequenced polynucleotide fragments assigned to it. The GC content of a chromosome can be calculated using the following formula:

$$GC_{i,j} = \frac{NGC_{i,j}}{BASE_{i,j}} \tag{3}$$

wherein i represents sample i, j represents chromosome j, $NGC_{i,j}$ represents the number of G and C bases on chromosome j in sample i, and $BASE_{i,j}$ represents the total number of bases on chromosome j in sample i.
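As a small illustration of equation 3, the sketch below counts G and C bases over the reads assigned to one chromosome. The function name and the toy reads are hypothetical.

```python
def gc_content(reads):
    """Equation 3: number of G and C bases divided by the total number of bases."""
    gc_bases = sum(read.upper().count("G") + read.upper().count("C") for read in reads)
    total_bases = sum(len(read) for read in reads)
    if total_bases == 0:
        raise ValueError("no bases to evaluate")
    return gc_bases / total_bases

# Toy reads assigned to one chromosome of one sample (hypothetical).
reads_chr_j = ["ACGTACGT", "GGGCCCAT", "ATATATAT"]
gc = gc_content(reads_chr_j)
```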

Depth of coverage and GC content can be based on the sequence information of polynucleotide fragments obtained from a single sample or from multiple samples. To establish a relationship between the coverage depth and GC content of a chromosomal location, the calculation can be based on the sequence information of polynucleotide fragments obtained from at least 1, 2, 5, 10, 20, 50, 100, 200, 500, or 1000 samples.

In some embodiments, the relationship between depth of coverage and GC content is not strongly linear.

The loess algorithm, i.e., locally weighted polynomial regression, can be used to evaluate a non-linear relationship (correlation) between paired values, e.g., between the depth of coverage and the GC content.
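A loess fit can be sketched in a few lines. The version below is a simplified local linear (degree 1) fit with tricube weights, without the robustness iterations of full loess implementations; the `span` default and the toy data are assumptions, and the text does not state which loess settings the inventors used.

```python
def loess(xs, ys, x0, span=0.75):
    """Locally weighted linear fit evaluated at x0, using tricube weights."""
    n = len(xs)
    k = max(2, int(round(span * n)))              # number of neighbours in the window
    h = sorted(abs(x - x0) for x in xs)[k - 1]    # bandwidth: distance to k-th neighbour
    h = h if h > 0 else 1e-12
    w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
    sw = sum(w)
    if sw == 0:                                   # degenerate window: fall back to the mean
        return sum(ys) / n
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)

# Toy reference cases (hypothetical): relative coverage falls as GC content rises.
gc_values = [0.36 + 0.005 * i for i in range(21)]
coverage = [1.2 - 0.5 * g for g in gc_values]
fitted = loess(gc_values, coverage, 0.41)
```

For a test sample, the fitted coverage depth is obtained by evaluating the fit at that sample's GC content.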

Determining fetal genetic abnormalities

Also provided herein is a method of determining a fetal genetic abnormality, the method comprising: a) obtaining sequence information for a plurality of polynucleotide fragments from a sample; b) assigning the fragments to chromosomes based on the sequence information; c) calculating a coverage depth and a GC content of the chromosome based on the sequence information; d) calculating a fitted coverage depth for the chromosome using the GC content of the chromosome and an established relationship between coverage depth and GC content of the chromosome; and e) comparing the fitted coverage depth with the coverage depth of the chromosome, wherein a difference between them is indicative of a fetal genetic abnormality.

The methods can be used to detect fetal chromosomal abnormalities, and are particularly useful for detecting aneuploidy, polyploidy, monosomy, trisomy 21, trisomy 13, trisomy 14, trisomy 15, trisomy 16, trisomy 18, trisomy 22, triploidy, tetraploidy, and sex chromosome abnormalities including XO, XXY, XYY, and XXX. Specific regions of the human genome may also be examined by the method to identify partial monosomies and partial trisomies. For example, the method may involve analysis of sequence data in defined chromosomal sliding "windows", such as contiguous, non-overlapping 50 kb regions distributed across a chromosome. Partial trisomies of 13q, 8p (8p23.1), 7q, distal 6p, 5p, 3q (3q25.1), 2q, 1q (1q42.1 and 1q21-qter), partial Xp and monosomy 4q35.1 have been reported, among others. For example, a partial duplication of the long arm of chromosome 18, as in the case of a duplication of 18q21.1-qter, can result in Edwards syndrome (Mewar, et al., Am J Hum Genet. (1993) 53: 1269-78).

In some embodiments, the fetal fraction is estimated based on the sequence information obtained from the polynucleotide fragments of a sample. The depth of coverage and GC content of the X and Y chromosomes can be used to estimate the fetal fraction. In some embodiments, the fetal gender is determined based on the sequence information obtained from the polynucleotide fragments of the sample. The depth of coverage and GC content of the X and Y chromosomes can be used to determine the fetal gender.

In some embodiments, the fitted coverage depth is compared with the coverage depth of the chromosome by statistical hypothesis testing, wherein one hypothesis is that the fetus is euploid (H0) and the other is that the fetus is aneuploid (H1). In some embodiments, the Student t-statistic is calculated separately for the two hypotheses as t1 and t2. In some embodiments, the log likelihood ratio of t1 and t2 is calculated. In some embodiments, a log likelihood ratio greater than 1 indicates fetal trisomy.

Computer readable media and systems for diagnosing fetal genetic abnormalities

In another aspect, provided herein is a computer readable medium comprising a plurality of instructions for performing prenatal diagnosis of a fetal genetic abnormality, comprising the steps of: a) receiving the sequence information; b) assigning the polynucleotide fragments to chromosomes based on the sequence information; c) calculating a coverage depth and a GC content of the chromosome based on the sequence information; d) calculating a fitted coverage depth for the chromosome using the GC content of the chromosome and the established relationship between coverage depth and GC content of the chromosome; and e) comparing the fitted coverage depth to the coverage depth of the chromosome, wherein a difference between them is indicative of a genetic abnormality.

In yet another aspect, provided herein is a system for determining fetal aneuploidy, comprising: a) means for obtaining sequence information for the plurality of polynucleotide fragments; and b) a computer readable medium comprising a plurality of instructions for performing prenatal diagnosis of a fetal genetic abnormality. In some embodiments, the system further comprises a biological sample obtained from a pregnant female subject, wherein the biological sample comprises a plurality of polynucleotide fragments.

It will be apparent to those skilled in the art that a number of different sequencing methods and variations may be used. In one embodiment, the sequencing is performed using massively parallel sequencing. Massively parallel sequencing, as achievable, for example, on the 454 platform (Roche) (Margulies, et al., Nature (2005) 437: 376-380), the Illumina Genome Analyzer (or Solexa™ platform), the SOLiD system (Applied Biosystems), the Helicos True Single Molecule DNA sequencing technology (Harris, et al, Science (2008) 320: 106-), and nanopore sequencing (Soni and Meller, Clin Chem (2007) 53: 1996-2001), allows many nucleic acid molecules isolated from a sample to be sequenced in parallel at high orders of multiplexing (Dear, Brief Funct Genomic (2003) 1: 397-416). Each of these platforms can sequence clonally amplified or even unamplified single molecules of nucleic acid fragments. Commercially available sequencing equipment can be used to obtain the sequence information of the polynucleotide fragments.

V. Examples

The following examples are provided to illustrate the invention, but not to limit it.

Example 1 analysis of factors affecting the sensitivity of detection: GC bias and gender

A schematic framework of the steps for calculating the depth of coverage and GC content is shown in fig. 1. First, the inventors used software to generate reference unique reads by cutting the hg18 reference sequence into l-mers (an l-mer here being a read artificially decomposed from the human reference sequence with the same length "l" as the sample sequencing reads) and collected the l-mers that are unique as the reference unique reads. Second, the inventors mapped the sequencing reads of each sample to the reference unique reads of each chromosome. Third, the inventors removed outliers by applying a quintile outlier cutoff to obtain a clean data set. Finally, the inventors calculated the depth of coverage of each chromosome for each sample and the GC content of the sequenced unique reads mapped to each chromosome for each sample.
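The first two steps can be sketched as follows, assuming exact matching on a single strand and a toy reference string in place of hg18. The function names are hypothetical, and the inventors' actual software is not described in the text.

```python
from collections import Counter

def reference_unique_reads(reference, read_length):
    """Return {l-mer: position} for l-mers that occur exactly once in the reference."""
    kmers = [reference[i:i + read_length]
             for i in range(len(reference) - read_length + 1)]
    counts = Counter(kmers)
    return {kmer: pos for pos, kmer in enumerate(kmers) if counts[kmer] == 1}

def count_mapped(reads, unique_index):
    """Count sample reads that match a reference unique read exactly (no mismatches)."""
    return sum(1 for read in reads if read in unique_index)

toy_reference = "ACGTACGTTTGACC"   # hypothetical stand-in for one chromosome
index = reference_unique_reads(toy_reference, 4)
n_mapped = count_mapped(["CGTT", "ACGT", "TGAC", "GGGG"], index)
```

In this toy case "ACGT" occurs twice in the reference, so it is not a reference unique read and sample reads matching it are not counted. A real pipeline would also need to consider the reverse strand and uniqueness across the whole genome rather than within one chromosome.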

To investigate how GC content affects the data, the inventors selected 300 cases with euploid karyotype results and plotted the coverage depth of their sequencing reads against the associated GC content as a scatter plot, which revealed a strong correlation between the two, a phenomenon not previously reported (fig. 2). In fig. 2, the depth of coverage is strongly correlated with GC content, showing a clear downward trend in some chromosomes, e.g., 4 and 13, and an upward trend in others, e.g., 19 and 22. When all chromosomes are arranged in ascending order of their intrinsic GC content, as shown in fig. 3, the downward trend appears in the chromosomes of the lower GC content group and the upward trend in those of the higher GC content group. This can be explained as follows: if the polynucleotide fragments sequenced for one sample have a higher GC content than those of other samples, the coverage depth of that sample, compared with the other samples, decreases in the chromosomes of the lower GC content group and increases in those of the higher GC content group.

A possible explanation for these differing trends is that the differences in GC composition among chromosomes, shown in FIG. 4, combine with the GC bias introduced during sequencing. The GC content of each 35-mer reference unique read of each chromosome was classified into 36 levels. The percentage of reads at each level was calculated as the GC composition of each chromosome and then used to draw a heat map with the Heatmap2 software. Taking chromosome 13 as an example, it is composed mostly of lower GC content segments and only in small part of higher GC content segments. If conditions during sequencing or PCR favor segments with higher GC content, a large portion of chromosome 13, which has low GC content, will be difficult to sequence, resulting in a lower coverage depth of chromosome 13 in the sample. In contrast, for a chromosome in the higher GC content group, such as chromosome 19, the coverage depth in the sample becomes higher, because most of chromosome 19 has the higher GC content preferred by the sequencer. Whichever segments, GC-poor or GC-rich, are difficult to sequence, the effect of the GC bias differs among chromosomes with different GC compositions. Each reference chromosome was divided into 1 kb segments, and the GC content of the reference unique reads in each segment was calculated. The GC content of the segments was binned into intervals over the range [0.3, 0.6] with a step size of 0.001, and the relative coverage of each interval was then calculated. FIG. 5 shows a plot of relative coverage against GC content for each chromosome.

The effect of fetal gender on the data was analyzed using a two-independent-sample t-test. Essentially no significant difference was found for the autosomes at the same GC content, but there was a significant difference in UR% between females and males (Chiu et al., (2008) Proc Natl Acad Sci USA 105: 20458-.

Example 2 statistical model

Building on the phenomenon discussed above, the inventors fitted the relationship between coverage depth and the corresponding GC content using a local polynomial. The coverage depth is modeled as the sum of a function of GC content and a normally distributed residual:

$$cr_{i,j} = f(GC_{i,j}) + \varepsilon_{i,j}, \quad j = 1, 2, \ldots, 22, X, Y \tag{4}$$

wherein $f(GC_{i,j})$ represents the function relating the coverage depth of chromosome j in sample i to the corresponding GC content, and $\varepsilon_{i,j}$ represents the residual of chromosome j in sample i.

Because the relationship between the coverage depth and the corresponding GC content is not strongly linear, the inventors fitted the coverage depth to the corresponding GC content using the loess algorithm, from which they calculated a value important to the model, namely the fitted coverage depth:

$$\hat{cr}_{i,j} = f(GC_{i,j}), \quad j = 1, 2, \ldots, 22, X, Y \tag{5}$$

With the fitted coverage depth, the standard deviation and the Student t-statistic are calculated according to equations 6 and 7.
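The standard deviation and Student t calculation can be illustrated with a short sketch. It assumes that the standard deviation of a chromosome is the sample standard deviation (n - 1 denominator) of the residuals between observed and fitted relative coverage across the reference cases, and that t is the residual of a test case divided by that standard deviation. Both forms, the function names and the toy values are assumptions made for illustration.

```python
import math

def residual_std(observed, fitted):
    """Sample standard deviation of residuals across reference cases (assumed n - 1)."""
    residuals = [o - f for o, f in zip(observed, fitted)]
    return math.sqrt(sum(r * r for r in residuals) / (len(residuals) - 1))

def student_t(observed, fitted, std):
    """Residual of one test case expressed in units of the reference standard deviation."""
    return (observed - fitted) / std

# Toy reference cases for one chromosome (hypothetical values).
reference_observed = [1.01, 0.99, 1.02, 0.98]
reference_fitted = [1.00, 1.00, 1.00, 1.00]
std_j = residual_std(reference_observed, reference_fitted)

# Test case whose relative coverage is 0.05 above the fitted value.
t_value = student_t(1.05, 1.00, std_j)
```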

Example 3 fetal fraction estimation

Since the fetal fraction is important for the inventors' test, the inventors estimated the fetal fraction prior to the testing step. As noted previously, the inventors sequenced 19 adult males; comparing their coverage depth with that of cases carrying a female fetus, the inventors found that the X chromosome coverage depth of the males was approximately 1/2 that of the female-fetus cases, and that the Y chromosome coverage depth of the males was approximately 0.5 greater than that of the female-fetus cases. Therefore, the inventors can estimate the fetal fraction from the coverage depth of the X and Y chromosomes, taking the GC correlation into account, as in equations 8, 9 and 10:

wherein four fitted coverage depths are used: that obtained by regressing the X chromosome coverage depth of cases carrying a female fetus on the corresponding GC content; that obtained by regressing the Y chromosome coverage depth of cases carrying a female fetus on the corresponding GC content; that obtained by regressing the X chromosome coverage depth of adult males on the corresponding GC content; and that obtained by regressing the Y chromosome coverage depth of adult males on the corresponding GC content. To simplify the calculation, two pairs of the quantities in these equations are set equal to each other.
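One way to see how the X and Y coverage depths carry fetal fraction information is a linear mixture model. The sketch below assumes that, for a male fetus, the observed coverage depth lies between the fitted value for female-fetus cases (fetal fraction 0) and the fitted value for adult males (fetal fraction 1) in proportion to the fetal fraction. This linear form, the function names and the numbers are assumptions for illustration and are not a restatement of equations 8 to 10.

```python
def fetal_fraction_from_y(cr_y, fitted_y_female, fitted_y_male):
    """Fraction of the way from the female-fetus Y baseline to the adult male Y level."""
    return (cr_y - fitted_y_female) / (fitted_y_male - fitted_y_female)

def fetal_fraction_from_x(cr_x, fitted_x_female, fitted_x_male):
    """X coverage drops with a male fetus, so the direction is reversed."""
    return (fitted_x_female - cr_x) / (fitted_x_female - fitted_x_male)

# Hypothetical fitted values: male X is about half of female X,
# and male Y is about 0.5 above the female-fetus Y baseline.
f_y = fetal_fraction_from_y(cr_y=0.07, fitted_y_female=0.02, fitted_y_male=0.52)
f_x = fetal_fraction_from_x(cr_x=0.95, fitted_x_female=1.00, fitted_x_male=0.50)
```

Under this model the two estimates should agree for a normal male fetus, and a marked disagreement between them is informative about the sex chromosomes.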

Example 4 calculation of the residual error for each chromosome

FIG. 6 shows that, at a given total number of unique reads, the standard deviation of each chromosome (see equation 3) is affected by the number of reference cases involved. With a total of 1.7 million unique reads sequenced for each case, the standard deviation hardly increased once the number of selected cases exceeded 150. However, the standard deviation differs among chromosomes. After accounting for the GC bias, the inventors' method had a modest standard deviation for the following chromosomes: chromosome 13 (0.0063), chromosome 18 (0.0066) and chromosome 21 (0.0072). The standard deviation of the X chromosome is higher than that of the chromosomes mentioned above, so additional strategies are needed for accurate abnormality detection.

FIG. 7 shows a Q-Q plot in which the residuals follow a normal distribution, indicating that the Student t calculation is reasonable.

Example 5 fetal gender differentiation

To detect sex chromosome disorders, the fetal gender must first be determined. When the inventors studied the frequency distribution of the Y chromosome coverage depth in the 300 cases, there were two distinct peaks, suggesting that gender can be distinguished by the coverage depth of the Y chromosome. Cases with a coverage depth below 0.04 can be considered as carrying a female fetus, cases above 0.051 as carrying a male fetus, and cases between 0.04 and 0.051 as gender uncertain, as shown in fig. 8. For these gender-uncertain cases and for aneuploid cases, logistic regression was used to predict the gender, as shown in equation 11 (Fan, et al, Proc Natl Acad Sci USA (2008) 42: 16266-:

wherein $cr.a_{i,x}$ and $cr.a_{i,y}$ are the normalized relative coverage of X and Y, respectively.
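The threshold rule on the Y chromosome coverage depth can be written as a small function. The cutoffs 0.04 and 0.051 come from the text; the function and constant names are hypothetical, and the logistic regression step for uncertain cases is not sketched because its coefficients are not given here.

```python
FEMALE_MAX = 0.04    # Y coverage depth below this: female fetus
MALE_MIN = 0.051     # Y coverage depth above this: male fetus

def classify_fetal_gender(y_coverage_depth):
    """Three-way call from the Y chromosome coverage depth."""
    if y_coverage_depth < FEMALE_MAX:
        return "female"
    if y_coverage_depth > MALE_MIN:
        return "male"
    return "uncertain"   # to be resolved by a further model

calls = [classify_fetal_gender(v) for v in (0.01, 0.045, 0.09)]
```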

Compared with the karyotyping results, the inventors' method of determining fetal gender achieved 100% accuracy in the 300 reference cases, while only one case was misclassified in the 901-case group; the Y chromosome coverage depth of this misclassified case was between 0.04 and 0.051.

Example 6 diagnostic performance of the GC-correlation t-test method

Sample recruitment

The 903 participants were prospectively recruited, with their karyotype results, from Shenzhen People's Hospital and Shenzhen maternal and child healthcare centers. Approval was obtained from the institutional review board of each recruitment site, and all participants signed informed consent. Maternal age and gestational week were recorded at the time of blood draw. The 903 cases included 2 trisomy 13 cases, 15 trisomy 18 cases, 16 trisomy 21 cases, 3 XO cases, 2 XXY cases, and 1 XYY case. The distribution of karyotype results is shown in FIG. 9.

Maternal plasma DNA sequencing

Peripheral venous blood (5 ml) was collected from each participating pregnant woman into EDTA tubes and centrifuged at 1,600 g for 10 minutes within 4 hours. The plasma was transferred to a microcentrifuge tube and centrifuged at 16,000 g for 10 minutes to remove residual cells. Cell-free plasma was stored at -80 °C until DNA extraction. Each plasma sample was frozen and thawed only once.

For massively parallel genome sequencing, DNA libraries were constructed according to a modified Illumina protocol using DNA extracted from 600 microliters of maternal plasma. Briefly, end repair of the maternal plasma DNA fragments was performed using T4 DNA polymerase, Klenow™ polymerase and T4 polynucleotide kinase. After addition of a terminal A residue, commercially available adapters (Illumina) were ligated to the DNA fragments. The adapter-ligated DNA was then amplified by 17 cycles of PCR with standard multiplex primers. The PCR products were purified using the Agencourt AMPure™ 60 ml kit (Beckman). The size distribution of the sequencing libraries was analyzed on a 2100 Bioanalyzer™ with a DNA 1000 kit (Agilent), and the libraries were quantified by real-time PCR. Sequencing libraries with different indexes were then pooled in equal amounts and loaded onto the Illumina GAII™ cluster station for single-end sequencing.

19 male euploid samples were sequenced for subsequent fetal DNA fraction estimation. The inventors developed a new GC correlation t-test method for diagnosing trisomy 13, trisomy 18, trisomy 21 and sex chromosome abnormalities, and compared the diagnostic performance of the new method with that of the other three methods mentioned below.

Example 7 detection of fetal aneuploidies such as trisomies 13, 18 and 21

To determine whether the copy number of a chromosome in a patient case deviates from normal, the coverage depth of that chromosome is compared with the coverage depth in all reference cases. All previous studies used only one null hypothesis. The inventors introduced a binary hypothesis for the first time by using two null hypotheses. One null hypothesis (H0: the fetus is euploid) assumes that the mean coverage depth of the patient case distribution equals the mean coverage depth of the distribution of all normal reference cases, which means that the patient case is euploid if this hypothesis is accepted. Using the Student t-test, t1 can be calculated as in equation 12:

The other null hypothesis (H1: the fetus is aneuploid) assumes that the mean coverage depth of the patient case distribution, given a roughly estimated fetal fraction, equals the mean coverage depth of the distribution of aneuploid cases with the same fetal fraction, which means that the patient case is aneuploid if this hypothesis is accepted. The Student t-statistic t2 is calculated as in equation 13:

|t1| > 3 and |t2| < 3 indicate an aneuploid case in most situations, especially when the distributions of euploid and aneuploid cases are fully separated, whereas under other conditions, such as insufficient precision or insufficient fetal fraction, |t1| may be less than 3 although the fetus is abnormal. Combining t1 and t2 helps to make a more accurate decision, so the inventors apply the log likelihood ratio of t1 and t2 as in equation 14:

$$L_{i,j} = \frac{\log\left(p(t1_{i,j}, \mathrm{degree} \mid D)\right)}{\log\left(p(t2_{i,j}, \mathrm{degree} \mid T)\right)} \tag{14}$$

wherein $L_{i,j}$ is the log likelihood ratio. If the ratio is greater than 1, the inventors conclude that the fetus is likely to have a trisomy.
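The binary hypothesis test can be sketched as follows. Several points are assumptions made for illustration: p is taken to be the Student t density, "degree" its degrees of freedom (here 300 reference cases minus one), D and T the disomy and trisomy hypotheses, t1 the residual divided by the standard deviation, and t2 the same statistic with the fitted coverage raised by half the fetal fraction, as expected for a trisomic chromosome. The function names and toy values are hypothetical.

```python
import math

def student_t_pdf(t, dof):
    """Density of Student's t distribution with `dof` degrees of freedom."""
    log_coef = math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
    return (math.exp(log_coef) / math.sqrt(dof * math.pi)
            * (1 + t * t / dof) ** (-(dof + 1) / 2))

def t_statistics(cr, fitted, std, fetal_fraction):
    """t1 under H0 (euploid) and t2 under H1 (trisomy); the t2 form is an assumption."""
    t1 = (cr - fitted) / std
    t2 = (cr - fitted * (1 + fetal_fraction / 2)) / std
    return t1, t2

def log_likelihood_ratio(t1, t2, dof):
    """Equation 14: log p(t1 | disomy) divided by log p(t2 | trisomy)."""
    return math.log(student_t_pdf(t1, dof)) / math.log(student_t_pdf(t2, dof))

DOF = 299  # assumed: 300 reference cases minus one
t1, t2 = t_statistics(cr=1.03, fitted=1.0, std=0.0072, fetal_fraction=0.06)
ratio = log_likelihood_ratio(t1, t2, DOF)
likely_trisomy = ratio > 1
```

Because a t density never exceeds about 0.4, both logarithms are negative, and the ratio exceeds 1 exactly when the observation is less probable under the euploid hypothesis than under the trisomy hypothesis.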

However, in the case of a female fetus, it is difficult to estimate the fetal fraction, and the calculation therefore cannot be performed directly. From the empirical distribution of fetal fractions, however, the inventors can adopt a reference value (RV) of 7% for the fetal fraction.

903 cases were studied, of which 866 carried euploid fetuses, and 300 of these were randomly selected to develop the GC correlation Student t method. In addition, 2 trisomy 13 cases, 12 trisomy 18 cases, 16 trisomy 21 cases, 4 XO cases (consisting of 3 XO cases and 1 mosaic 45,XO/46,XX (27:23) case), 2 XXY cases and 1 XYY case participated in the inventors' study. After alignment, the inventors obtained an average of 1.7 million (SD = 306,185) uniquely aligned reads (no mismatches) per case. All trisomy 13 cases (2 of 2) were successfully identified, and 901 of the 901 non-trisomy 13 cases were correctly classified using the inventors' newly developed GC correlation Student t-test (fig. 10A). The sensitivity and specificity of the method were 100% and 100% (table 1).

For trisomy 18, 12 of the 12 trisomy 18 cases and 888 of the 891 non-trisomy 18 cases were correctly identified (fig. 10A). The sensitivity and specificity of the method were 100% and 99.66%, respectively. For trisomy 21, 16 of the 16 trisomy 21 cases and 887 of the 887 non-trisomy 21 cases were correctly detected (fig. 10A). The sensitivity and specificity of the method were 100% and 100%, respectively.
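The reported percentages follow directly from the case counts. As a quick check, the sketch below recomputes the trisomy 18 figures from the counts given in the text; the function names are hypothetical.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of affected cases that were detected."""
    return true_positives / (true_positives + false_negatives)

def specificity(true_negatives, false_positives):
    """Fraction of unaffected cases that were correctly classified."""
    return true_negatives / (true_negatives + false_positives)

# Trisomy 18: 12 of 12 affected cases detected, 888 of 891 unaffected cases correct.
t18_sensitivity = sensitivity(true_positives=12, false_negatives=0)
t18_specificity = specificity(true_negatives=888, false_positives=891 - 888)
t18_specificity_percent = round(100 * t18_specificity, 2)
```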

Example 8 detection of XO, XXX, XXY, XYY

While the inventors contemplate the detection of autosomal trisomies hereinabove, sex chromosome disorders such as XO, XXX, XXY and XYY may also be detected by the methods of the inventors.

First, the fetal gender is determined by gender differentiation. If the test case is confirmed as carrying a female fetus, the Student t value t1 is calculated for XXX or XO detection, wherein the fitted coverage depth of the X chromosome and std_Xf are the same as in formula 10. If t1 is greater than 3.13 or less than -3.13, the case may be XXX or XO. However, considering that the precision for the X chromosome is limited by the large deviation of its coverage depth, the inventors sampled the plasma again and repeated the experiment for cases with |t1| < 5, even when |t1| > 3.13, in order to make a more reliable decision. Cases with |t1| > 5 were considered confirmed aneuploid. All of the inventors' test methods presuppose that the data meet standard quality control.
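The decision rule for cases carrying a female fetus can be summarized in a short function. The cutoffs 3.13 and 5 come from the text; the function name and the labels it returns are hypothetical.

```python
def classify_x_statistic(t1, suspect_cutoff=3.13, confirm_cutoff=5.0):
    """Three-way decision on the X chromosome t1 value of a female-fetus case."""
    magnitude = abs(t1)
    if magnitude > confirm_cutoff:
        return "aneuploid"            # considered confirmed
    if magnitude > suspect_cutoff:
        return "resample"             # possible XXX or XO: draw plasma again and repeat
    return "no abnormality detected"

decisions = [classify_x_statistic(t) for t in (1.2, -4.0, 6.5)]
```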

If the test sample is confirmed as carrying a male fetus, the fetal DNA fraction is first estimated from Y and X. The inventors can then extrapolate the fitted coverage depth of the X chromosome using the fetal DNA fraction estimated from the Y chromosome coverage depth alone, and calculate t2. If t2 is too large (greater than 5) or too small (less than -5), the fetus may be XXY or XYY. In addition, the difference between the fetal fractions estimated independently from X and Y provides information for detecting sex chromosome disorders.

In the XO detection, 3 of the 4 XO cases were detected; the case that could not be identified was the mosaic case (fig. 10B). The sensitivity and specificity of the method were 75% (100% if the mosaic case is disregarded) and 99.55%, respectively. For XXY, both cases were successfully identified and 901 of the 901 non-XXY cases were correctly classified (fig. 10B), giving 100% sensitivity and 100% specificity. For XYY, the single case was correctly identified (fig. 10B), with sensitivity and specificity of 100% and 100%, respectively.

To evaluate whether the new method of the present invention has any advantage over the other two reported methods (z-value and GC corrected z-value), the inventors analyzed their 900 cases with all 3 methods, using the same 300 cases as the reference group for each method. The precision of a measurement is generally reflected in its coefficient of variation (CV). In the inventors' study, the CV of the standard z-value method was larger than that of the other methods for the clinically relevant chromosomes 18 and 21 (fig. 11), resulting in lower sensitivity for trisomies 18 and 21 (table 1).

TABLE 1 comparison of sensitivity and specificity of different methods

For the GC corrected z-value method, the CV for chromosome 13 is 0.0066, with 100% sensitivity and 100% specificity. For the new GC correlation Student t method discussed herein, the CV for chromosome 13 is 0.0063, with 100% sensitivity and 100% specificity. For chromosome 18, the CVs of the two methods are 0.0062 and 0.0066, respectively; both have 100% sensitivity, and their specificities are 99.89% and 99.96%, respectively. For chromosome 21, the CVs of the two methods are similar: 0.0088 and 0.0072, respectively. Both achieved the same 100% sensitivity and the same 100% specificity in the inventors' small case group study. Moreover, both methods outperform the standard z-value method. The inventors' newly developed GC correlation method not only performs well compared with the GC correction method, it also has a further advantage in detecting sex chromosome abnormalities such as XO, XXY and XYY. The inventors' data show that, when the GC correction method is applied, correcting the number of sequence tags by multiplying by a weighting factor introduces a bias into the sex chromosome data that makes it difficult to distinguish fetal gender, and thus detection of sex chromosome disorders appears difficult.

Example 9 theoretical performance of the GC correlation t-test method considering data size, gestational week and fetal DNA fraction

Since the high background of maternal DNA (Fan, et al, Proc Natl Acad Sci USA (2008) 42: 16266-. However, there has been no major breakthrough in clinically determining the minimum fetal DNA fraction prior to MPGS testing, especially for female fetuses, and the only clinical clue related to fetal DNA fraction is the gestational week. A statistically significant correlation between fetal DNA fraction and gestational age has been reported previously (Lo, et al., Am. J. Hum. Genet. (1998) 62: 768-775). To study the relationship between the estimated fetal DNA fraction and gestational age, the inventors plotted in fig. 12 the fetal DNA fractions, estimated according to equation 10, of all participating cases carrying male fetuses (427 cases in total). The fetal DNA fraction estimated for each sample correlated with gestational week (P less than 0.0001). The plot also shows that even at a gestational age of 20 weeks there were still 4 cases with a fetal DNA fraction below 5%, which would adversely affect detection accuracy. To evaluate the fetal fraction estimation method, the inventors selected cases stratified by estimated fetal fraction and used Q-PCR to calculate an independent fetal fraction for comparison. The resulting correlation curve showed a strong correlation between the two estimates, demonstrating that the fetal fraction estimated by the inventors' method is reliable.

At the same time, sequencing depth (the total number of unique reads) is another important factor that affects the accuracy of aneuploidy detection through the standard deviation. Once the number of reference cases reached 150, the standard deviation of each chromosome used in the inventors' GC correlation method stabilized for a given sequencing depth (fig. 13). To investigate how sequencing depth affects the standard deviation of each chromosome, the inventors sequenced 150 cases not only at the 1.7 million unique-read level of the present invention, but also at a second sequencing depth with up to 5 million total unique reads (SD 1.7 million). From these two sets, the inventors found that the standard deviation is linearly related to the inverse of the square root of the total number of unique reads, as shown in fig. 6.
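As a rough illustration of how sequencing depth enters the standard deviation, the following sketch rescales a standard deviation measured at one depth to another depth. It assumes that the standard deviation scales with the inverse square root of the total number of unique reads, as expected from counting statistics. The function name scale_sd and all numeric values are hypothetical and are not taken from the inventors' data.

```python
import math


def scale_sd(sd_ref: float, reads_ref: float, reads_new: float) -> float:
    """Rescale a per-chromosome standard deviation to a new sequencing depth.

    Assumption (not a fitted result): sd is proportional to 1 / sqrt(N),
    where N is the total number of unique reads.
    """
    if sd_ref <= 0 or reads_ref <= 0 or reads_new <= 0:
        raise ValueError("all inputs must be positive")
    return sd_ref * math.sqrt(reads_ref / reads_new)


if __name__ == "__main__":
    # Hypothetical standard deviation of 0.006 at 1.7 million unique reads,
    # rescaled to 5 million unique reads.
    print(scale_sd(0.006, 1.7e6, 5.0e6))
```

Under this assumption, quadrupling the number of unique reads halves the standard deviation.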

For a given fetal DNA fraction, the inventors can estimate the total number of unique reads required for the method of the invention to detect a deviation from normal chromosomal copy number at t1 equal to 3 (fig. 14). The lower the fetal DNA fraction, the greater the sequencing depth required. With the 1.7 million unique-read set of the present invention, the method can detect aneuploid fetuses for chromosomes 13 and X at a fetal DNA fraction above 4.5%, and for chromosomes 21 and 18 at a fraction above 4%; with the 5 million unique-read reference set, the method can detect trisomy 18 and trisomy 21 even at a fetal DNA fraction of about 3%. To identify fetuses with an X chromosome abnormality such as XXX or XO at a fetal fraction of about 4%, the total number of unique reads for these cases and the corresponding reference cases should reach 5 million. If the fetal DNA fraction is below 3.5%, the required sequencing depth exceeds 20 million. If the fetal DNA fraction is lower still, the test becomes unreliable and difficult to perform, so the inventors propose another strategy: re-sampling maternal plasma at a later gestational age, repeating the experiment and re-analyzing the data, since the fetal DNA fraction is likely to increase with gestational age. The same strategy can be applied to samples suspected of having a low fetal DNA fraction.
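The dependence of required sequencing depth on fetal DNA fraction can be sketched as follows. This is a simplified model under stated assumptions, not the inventors' calculation behind fig. 14: the relative coverage of a normal chromosome is taken as 1, a trisomy is assumed to shift it by half the fetal fraction, and the standard deviation is assumed to scale with the inverse square root of the total number of unique reads. The function name required_reads and the reference values are hypothetical.

```python
def required_reads(fetal_fraction: float, sd_ref: float, reads_ref: float,
                   t_cutoff: float = 3.0) -> float:
    """Estimate the unique reads needed for the expected t statistic to reach t_cutoff.

    Assumptions (illustrative only):
      * expected shift in relative coverage for a trisomy = fetal_fraction / 2
      * sd at depth N = sd_ref * sqrt(reads_ref / N)
    """
    if not 0.0 < fetal_fraction <= 1.0:
        raise ValueError("fetal_fraction must be in (0, 1]")
    if sd_ref <= 0 or reads_ref <= 0 or t_cutoff <= 0:
        raise ValueError("sd_ref, reads_ref and t_cutoff must be positive")
    shift = fetal_fraction / 2.0
    sd_needed = shift / t_cutoff          # largest sd that still gives t >= cutoff
    return reads_ref * (sd_ref / sd_needed) ** 2


if __name__ == "__main__":
    # Hypothetical reference: sd of 0.007 at 1.7 million unique reads.
    for ff in (0.05, 0.04, 0.03):
        print(ff, round(required_reads(ff, 0.007, 1.7e6)))
```

In this model the required number of reads grows with the inverse square of the fetal fraction, so halving the fetal fraction quadruples the sequencing depth needed.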

Even though the method of the present invention works well, the result is not convincing without a large set of abnormal cases. To estimate the sensitivity of the GC correlation student t method applied in the present invention, the inventors derived the theoretical sensitivity for different gestational ages and different sequencing depths.

The inventors calculated the theoretical sensitivity for aneuploidy by the following procedure.

First, the inventors applied regression analysis to fit fetal DNA fraction against gestational age, giving the mean fitted fetal DNA fraction at the i-th gestational age gsa_i. The distribution of fetal DNA fraction was approximated by Gaussian kernel density estimation (Birke, (2008) Journal of Statistical Planning and Inference 139: 2851-2862), applied mainly to the fetal DNA fractions estimated at gestational weeks 19 and 20. The fetal DNA fraction distribution at other weeks was then extrapolated from the relationship between fetal DNA fraction and gestational age, giving the fitted probability density of fetal DNA fraction at the i-th gestational age, where X is the data for gestational weeks 19 and 20 (fig. 12).

Second, the inventors estimated the standard deviation from the total number of unique reads (tuqn), using the relationship described above.

Finally, to calculate the sensitivity at each gestational age for a given sequencing depth from the fetal DNA fraction distribution and the standard deviation estimated at that depth, the inventors calculated the probability density of false negatives at each fetal DNA fraction (assuming here that the fetal DNA fraction is normally distributed) and integrated over all fetal DNA fraction levels to obtain the false negative rate (FNR) for chromosome j at that gestational age. The theoretical sensitivity at that gestational age and sequencing depth is then 1 - FNR. Figures 15-21 show the curves calculated by the inventors.

A student t value greater than 3 was used to identify aneuploidy in female fetuses. For male fetuses, when calculating the probability density of false negatives at each fetal DNA fraction, a log-likelihood greater than 1 was used as the cutoff, as mentioned by the inventors in the binary hypothesis; this cutoff gives higher sensitivity than for female fetuses.

However, the inventors' estimate is relatively conservative, because with small-scale sampling it is difficult to obtain a distribution that closely approximates the true distribution of fetal DNA fraction at each gestational age, especially at early gestational ages.
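The sensitivity calculation described in this example can be sketched numerically as follows. This is a minimal illustration under simplifying assumptions, not the inventors' implementation: the fetal DNA fraction at a given gestational age is modeled as a normal distribution with a hypothetical mean and standard deviation (instead of the kernel density estimate), the t statistic is approximated by a normal variable with unit variance centred on the expected shift, a trisomy is assumed to shift relative coverage by half the fetal fraction, and only the one-sided cutoff t > 3 is modeled (the log-likelihood cutoff used for male fetuses is not). All names and numbers are hypothetical.

```python
import math


def normal_pdf(x: float, mean: float, sd: float) -> float:
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def normal_cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def theoretical_sensitivity(ff_mean: float, ff_sd: float, cov_sd: float,
                            t_cutoff: float = 3.0, steps: int = 2000) -> float:
    """Return 1 - FNR, integrating the miss probability over fetal fraction.

    ff_mean, ff_sd: assumed normal distribution of fetal DNA fraction
    cov_sd: standard deviation of relative coverage at the chosen depth
    """
    lo = max(0.0, ff_mean - 6.0 * ff_sd)   # fetal fraction cannot be negative
    hi = ff_mean + 6.0 * ff_sd
    width = (hi - lo) / steps
    fnr = 0.0
    mass = 0.0
    for k in range(steps):
        ff = lo + (k + 0.5) * width                 # midpoint rule
        weight = normal_pdf(ff, ff_mean, ff_sd) * width
        expected_t = (ff / 2.0) / cov_sd            # assumed trisomy shift / sd
        miss = normal_cdf(t_cutoff - expected_t)    # P(t below cutoff)
        fnr += weight * miss
        mass += weight
    return 1.0 - fnr / mass                         # renormalise truncated density


if __name__ == "__main__":
    # Hypothetical inputs: mean fetal fraction 8%, spread 3%, coverage sd 0.007.
    print(theoretical_sensitivity(0.08, 0.03, 0.007))
```

Lowering cov_sd (deeper sequencing) or raising ff_mean (later gestational age) increases the computed sensitivity, which is the qualitative behaviour the procedure above is designed to capture.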

References

1. Virginia P. Sybert, Elizabeth McCauley (2004). Turner's Syndrome. N Engl J Med 2004; 351: 1227-1238.

2. Robert Bock (1993). Understanding Klinefelter Syndrome: A Guide for XXY Males and Their Families. NIH Pub. No. 93-3202, August 1993.

3. Aksglaede, Lise; Skakkebaek, Niels E.; Juul, Anders (January 2008). "Abnormal sex chromosome constitution and longitudinal growth: serum levels of insulin-like growth factor (IGF)-I, IGF binding protein-3, luteinizing hormone, and testosterone in 109 males with 47,XXY, 47,XYY, or sex-determining region of the Y chromosome (SRY)-positive 46,XX karyotypes". J Clin Endocrinol Metab 93(1): 169-176. doi: 10.1210/jc.2007-1426. PMID 17940117.

4. H. Bruce Ostler (2004). Diseases of the eye and skin: a color atlas. Lippincott Williams & Wilkins. pp. 72. ISBN 9780781749992.

5. Driscoll DA, Gross S (2009) Clinical practice. Prenatal screening for aneuploidy. N Engl J Med 360: 2556-2562.

6. Karl O. Kagan, Dave Wright, Catalina Valencia, et al. (2008). Screening for trisomies 21, 18 and 13 by maternal age, fetal nuchal translucency, fetal heart rate, free b-hCG and pregnancy-associated plasma protein-A. Human Reproduction Vol. 23, No. 9, pp. 1968-1975, 2008. doi: 10.1093/humrep/den224.

7. Malone FD, et al. (2005) First-trimester or second-trimester screening, or both, for Down's syndrome. N Engl J Med 353: 2001-2011.

8. Fan HC, Quake SR (2010) Sensitivity of Noninvasive Prenatal Detection of Fetal Aneuploidy from Maternal Plasma Using Shotgun Sequencing Is Limited Only by Counting Statistics. PLoS ONE 5(5): e10439. doi: 10.1371/journal.pone.0010439.

9. Chiu RW, Chan KC, Gao Y, Lau VY, Zheng W, et al. (2008) Noninvasive prenatal diagnosis of fetal chromosomal aneuploidy by massively parallel genomic sequencing of DNA in maternal plasma. Proc Natl Acad Sci USA 105: 20458-20463.

10. McCullagh, P. and Nelder, J. A. (1989), Generalized Linear Models, London, UK: Chapman & Hall/CRC.

11. Fan HC, Blumenfeld YJ, et al. (2008) Noninvasive diagnosis of fetal aneuploidy by shotgun sequencing DNA from maternal blood. Proc Natl Acad Sci USA 42: 16266-16271.

12. Melanie Birke. (2008) Shape constrained kernel density estimation. Journal of Statistical Planning and Inference, Volume 139, Issue 8, 1 August 2009, Pages 2851-2862.

13. Lo et al., Lancet 350: 485-487 (1997).

14. Lo et al., Am. J. Hum. Genet. 62: 768-775 (1998).

15. Pertl and Bianchi, Obstetrics and Gynecology 98: 483-490 (2001).

16. Rogers and Venter, "Genomics: Massively parallel sequencing," Nature, 437, 326-327 (15 Sep. 2005).

17. Mewar et al., "Clinical and molecular evaluation of four patients with partial duplications of the long arm of chromosome 18," Am J Hum Genet. 1993 December; 53(6): 1269-78.

18. Margulies et al., (2005) Nature 437: 376-380.

19. Harris et al., (2008) Science, 320: 106-109.

20. Soni and Meller, (2007) Clin Chem 53: 1996-2001.

21. Dear, (2003) Brief Funct Genomic Proteomic 1: 397-416.

Claims

1. A system for determining a fetal genetic abnormality, comprising:

a) means for obtaining sequence information for a plurality of polynucleotide fragments from a sample; and

b) a computer readable medium having a plurality of instructions for performing the following steps, wherein

b1) receiving sequence information for a plurality of polynucleotide fragments from a sample;

b2) assigning the polynucleotide fragments to chromosomes based on the sequence information;

b3) calculating a coverage depth and a GC content of the chromosome based on the sequence information;

b4) calculating a fitted depth of coverage for the chromosome using the GC content of the chromosome and the established relationship between depth of coverage and GC content for the chromosome; and

b5) comparing the fitted coverage depth to the coverage depth of the chromosome, wherein a difference therebetween indicates a genetic abnormality;

wherein, in step b4),

the relationship is the following formula: cr_{i,j} = f(GC_{i,j}) + ε_{i,j}, j = 1, 2, …, 22, X, Y, where f(GC_{i,j}) is a function representing the relationship between the depth of coverage of chromosome j in sample i and the corresponding GC content, and ε_{i,j} represents the residual of sample i and chromosome j;

the fitted depth of coverage is calculated according to the following formula:

further comprising:

c) a biological sample obtained from a pregnant female subject, wherein the biological sample includes a plurality of polynucleotide fragments.

2. The system of claim 1, wherein the means for obtaining sequence information is parallel genome sequencing.

3. The system according to claim 1, wherein, in step b1), the polynucleotide fragments are in the interval of 10-1000bp in length.

4. The system of claim 3, wherein the polynucleotide fragments are between 15-500bp in length.

5. The system of claim 4, wherein the polynucleotide fragments are in the interval of 20-200bp in length.

6. The system of claim 5, wherein the polynucleotide fragments are between 25-100bp in length.

7. The system of claim 6, wherein the polynucleotide fragment is 35bp in length.

8. The system of claim 1, wherein in step b2), the assigning is performed by comparing the sequence of the fragments to a human genome reference sequence.

9. The system of claim 8, wherein the human genomic reference sequence is hg18 or hg19.

10. The system of claim 8, wherein the fragments assigned to more than one chromosome are ignored.

11. The system of claim 8, wherein the fragments not assigned to any chromosome are ignored.

12. The system according to claim 1, wherein in step b3), the depth of coverage of the chromosome is the ratio between the number of fragments allocated to the chromosome and the number of reference unique reads of the chromosome.

13. The system of claim 12, wherein the depth of coverage is normalized.

14. The system of claim 13, wherein the normalization is calculated with respect to the coverage of another chromosome.

15. The system of claim 13, wherein the normalization is calculated with respect to the coverage of all other autosomes.

16. The system of claim 13, wherein the normalization is calculated with respect to the coverage of all other chromosomes.

17. The system of claim 1, wherein the GC content of the chromosome is the average GC content of all fragments assigned to the chromosome.

18. The system of claim 1, wherein at least 2, 5, 10, 20, 50, 100, 200, 500, or 1000 samples are used.

19. The system of claim 1, wherein the chromosome is chromosome 1, 2, …, 22, X, or Y.

20. The system of claim 1, wherein the relationship between depth of coverage and GC content is calculated by local polynomial regression.

21. The system of claim 20, wherein the relationship is a non-strong linear relationship.

22. The system of claim 21, wherein the relationship is determined by a loess algorithm.

23. The system according to claim 1, wherein in step b5), the comparison is performed by statistical hypothesis testing.

24. The system of claim 23, wherein one hypothesis is that the fetus is normal (H0), and another hypothesis is that the fetus is abnormal (H1).

25. The system of claim 24, wherein the student t-statistic is computed for both hypotheses.

26. The system of claim 25, further comprising calculating a standard deviation according to the following equation:

where ns represents the number of reference samples.

27. The system of claim 26, further comprising computing a student t-statistic according to the formula: and where fxy is the fetal fraction.

28. The system of claim 27, wherein the log-likelihood ratio of t1 and t2 is calculated according to the formula: L_{i,j} = log(p(t1_{i,j}, degree | D)) / log(p(t2_{i,j}, degree | T)), where L_{i,j} is the log-likelihood ratio, degree refers to the degree of the t distribution, D refers to diploidy, T refers to trisomy, and p(t1_{i,j}, degree | D) and p(t2_{i,j}, degree | T) represent the conditional probability densities given the degree of the t distribution.

29. The system of claim 1, wherein the fetal genetic abnormality is a chromosomal abnormality.

30. The system of claim 29, wherein the fetal genetic abnormality is aneuploidy.

31. The system of claim 30, wherein the aneuploidy is an autosomal disorder selected from the group consisting of trisomy 13, trisomy 18, and trisomy 21.

32. The system of claim 31, wherein the aneuploidy is a sex chromosome disorder selected from the group consisting of XO, XXX, XXY, and XYY.

33. The system of claim 32, wherein the fetal sex is female and the student t-statistic is calculated according to the formula:

wherein the fitted coverage depth is calculated from the relationship between the X-chromosome coverage depth and the corresponding GC content of a sample from a pregnant woman carrying a female fetus.

34. The system of claim 33, wherein | t1| >3.13 indicates that the fetus is XXX or XO.

35. The system of claim 33, wherein | t1| >5 indicates that the fetus is XXX or XO.

36. The system of claim 32, wherein the fetal sex is male and the student t-statistic is calculated according to the formula: wherein the fitted coverage depth is calculated from the relationship between the X-chromosome coverage depth and the corresponding GC content of a sample from a pregnant woman carrying a female fetus.

37. The system of claim 36, wherein | t2| >3.13 indicates that the fetus is likely XXY or XYY.

38. The system of claim 36, wherein | t2| >5 indicates that the fetus is XXY or XYY.

39. The system of claim 1, wherein the sample is a peripheral blood sample.