HK1170531B - Methods and compositions for long fragment read sequencing - Google Patents
- Publication number: HK1170531B (HK12111249.2A)
- Authority: Hong Kong (HK)
Description
Cross reference to related applications
This application claims priority to U.S. Patent Application No. 61/187,162, filed on June 15, 2009, which is incorporated herein by reference in its entirety.
Background
Large-scale genomic sequence analysis is a key step that helps understand a variety of biological phenomena. The need for low cost, high throughput sequencing and resequencing has led to the development of new sequencing methods that employ parallel analysis of multiple nucleic acid targets simultaneously.
Conventional sequencing methods are generally limited in that only tens of nucleotides can be determined before the signal degrades significantly, which greatly limits overall sequencing efficiency. Conventional sequencing methods are also often limited by signal-to-noise ratios, making such methods unsuitable for single-molecule sequencing.
The field would benefit greatly from methods and compositions that improve both the efficiency of sequencing reactions and the efficiency of assembling complete sequences from shorter reads.
Summary of The Invention
Thus, the present invention provides sequencing reaction methods and compositions.
In an exemplary embodiment, the present invention provides a method of fragmenting a double-stranded target nucleic acid. The method comprises: (a) providing genomic DNA; (b) dividing the DNA into a number of separate aliquots; (c) amplifying the DNA in the separate aliquots in the presence of a population of dNTPs comprising dNTP analogs, such that a number of nucleotides in the DNA are replaced with dNTP analogs; (d) removing the dNTP analogs to form gapped DNA; and (e) treating the gapped DNA to translate the gaps until gaps on opposite strands converge, thereby creating blunt-ended DNA fragments. In a further embodiment, substantially every fragment in a separate aliquot is non-overlapping with every other fragment of the same aliquot.
In yet another embodiment and in accordance with any of the above, the present invention provides a method for fragmenting a nucleic acid, comprising the steps of: (a) providing at least two genome equivalents of DNA for at least one genome; (b) dividing the DNA into a first tier of separate mixtures; (c) amplifying the DNA in the separate mixtures, wherein the amplification is performed with a population of dNTPs comprising a predetermined ratio of dUTP to dTTP, such that a number of thymines in the DNA are replaced by uracils, and a predetermined ratio of 5-methyl dCTP to dCTP, such that a number of cytosines are replaced by 5-methylcytosines; (d) removing the uracils and 5-methylcytosines to form gapped DNA; and (e) treating the gapped DNA to translate the gaps until gaps on opposite strands converge, thereby creating blunt-ended DNA fragments, wherein the blunt-ended fragments have less GC bias and less coverage bias than fragments generated in the absence of 5-methylcytosine.
In yet another embodiment, the present invention provides a method of fragmenting a double-stranded target nucleic acid, comprising the steps of: (a) providing genomic DNA; (b) dividing the DNA into separate aliquots; (c) amplifying the DNA in the separate aliquots to form a plurality of amplicons, wherein the amplification is performed with a population of dNTPs comprising dNTP analogs, such that a number of nucleotides in the amplicons are replaced with dNTP analogs, and wherein the amplification is performed in the presence of an additive selected from the group consisting of glycogen, DMSO, ET SSB, betaine, and any combination thereof; (d) removing the dNTP analogs from the amplicons to form gapped DNA; and (e) treating the gapped DNA to translate the gaps until gaps on opposite strands converge, thereby creating blunt-ended DNA fragments, wherein the blunt-ended fragments have less GC bias than fragments generated in the absence of the additive.
In yet another embodiment, the present invention provides a method of obtaining sequence information from a genome, comprising the steps of: (a) providing a population of first fragments of the genome; (b) preparing emulsion droplets of the first fragments, such that each emulsion droplet comprises a subset of the population of first fragments; (c) obtaining a population of second fragments within each emulsion droplet, such that the second fragments are shorter than the first fragments from which they are derived; (d) combining the emulsion droplets of the second fragments with emulsion droplets of adaptor tags; (e) ligating the second fragments to the adaptor tags to form tagged fragments; (f) combining the tagged fragments into a single mixture; and (g) obtaining sequence reads from the tagged fragments, wherein the sequence reads comprise sequence information from the adaptor tags and from the fragments, so that fragments from the same emulsion droplet can be identified, thereby providing sequence information about the genome.
Brief Description of Drawings
FIG. 1 is a schematic diagram of an embodiment of a method for fragmenting nucleic acids.
FIG. 2 is a schematic diagram of an embodiment of a method for fragmenting nucleic acids.
FIG. 3 is a graph of the effect of primer concentration on GC bias in MDA reactions.
FIG. 4 shows the effect of DMSO and primer concentrations on variability (FIG. 4A) and GC bias (FIG. 4B) in MDA reactions.
FIG. 5 shows the effect of SSB (FIG. 5A) and betaine (FIG. 5B) on GC bias in MDA reactions.
FIG. 6 is a schematic of an embodiment of the present invention for generating a circular nucleic acid template comprising a plurality of adaptors.
FIG. 7 is a schematic of an embodiment of the invention for controlling the orientation of an adaptor inserted into a target nucleic acid.
FIG. 8 is a schematic of exemplary embodiments of the different orientations in which an adaptor and a target nucleic acid molecule can be ligated to each other.
FIG. 9 is a schematic of one aspect of a method for assembling a nucleic acid template of the invention.
FIG. 10 is a schematic of an adaptor member that can be used to control the manner in which such adaptors are inserted into a target nucleic acid.
FIG. 11 is a schematic diagram of an embodiment of an arm-by-arm ligation process for inserting adaptors into a target nucleic acid. FIG. 11A shows an exemplary embodiment of the arm-by-arm ligation process, and FIG. 11B shows exemplary components of the adaptor arms used in the process.
FIG. 12 is a schematic view of a possible orientation of adaptor insertion.
FIG. 13 is a schematic diagram of one embodiment of a nick translation ligation method.
FIG. 14 is a schematic diagram of one embodiment of a method for inserting a plurality of adaptors.
FIG. 15 is a schematic diagram of one embodiment of a nick translation ligation method.
FIG. 16 is a schematic diagram of one embodiment of a nick translation ligation method.
FIG. 17 is a schematic diagram of one embodiment of a nick translation ligation process using nick translation loop inversion (FIG. 17A) and nick translation loop inversion coupled with uracil degradation (FIG. 17B).
FIG. 18 is a schematic diagram of one embodiment of a nick translation ligation method.
FIG. 19 is a schematic diagram of one embodiment of a method for inserting a plurality of adaptors.
FIG. 20 is a schematic diagram of one embodiment of a method for inserting a plurality of adaptors.
FIG. 21 is a schematic diagram of one embodiment of a method for inserting a plurality of adaptors.
FIG. 22 is a schematic diagram of one embodiment of a method for inserting a plurality of adaptors.
FIG. 23 is a schematic representation of one embodiment of a combinatorial probe-anchor ligation method.
FIG. 24 is a schematic representation of one embodiment of a combinatorial probe-anchor ligation method.
FIG. 25 is a schematic representation of one embodiment of a combinatorial probe-anchor ligation method.
FIG. 26 is a schematic representation of one embodiment of a combinatorial probe-anchor ligation method.
FIG. 27 is a schematic diagram of one embodiment of a method for tagging nucleic acid fragments.
FIG. 28(A)-(F) are schematic overviews of the steps of one embodiment of the long fragment read technology of the present invention.
FIG. 29 is a schematic overview of the use of one embodiment of the long fragment read technology of the present invention to define haplotypes.
FIG. 30A is a schematic overview of one embodiment of the long fragment read technology of the present invention. FIG. 30B is a schematic overview of an exemplary method of preparing fragments for the long fragment read technology.
Detailed Description
Unless otherwise specified, the present invention may be practiced using conventional techniques and descriptions from the fields of organic chemistry, polymer technology, molecular biology (including recombinant techniques), cell biology, biochemistry, and immunology. These conventional techniques include polymer array synthesis, hybridization, ligation, and detection of hybridization using a label. Suitable techniques are illustrated in detail by the examples below. Of course, other equivalent conventional procedures may also be used. Such conventional techniques and descriptions can be found in standard laboratory manuals such as Genome Analysis: A Laboratory Manual Series (Vols. I-IV), Using Antibodies: A Laboratory Manual, Cells: A Laboratory Manual, PCR Primer: A Laboratory Manual, and Molecular Cloning: A Laboratory Manual (all from Cold Spring Harbor Laboratory Press); Stryer, L. (1995) Biochemistry (4th Ed.), Freeman, New York; Gait, "Oligonucleotide Synthesis: A Practical Approach" 1984, IRL Press, London; Nelson and Cox (2000), Lehninger, Principles of Biochemistry, 3rd Ed., W.H. Freeman Pub., New York, N.Y.; and Berg et al. (2002) Biochemistry, 5th Ed., W.H. Freeman Pub., New York, N.Y., all of which are incorporated herein by reference.
It is noted that, herein and in the appended claims, the singular forms "a," "an," and "the" include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to "a polymerase" refers to one reagent or a mixture of such reagents, reference to "a method" includes equivalent steps and methods known to those skilled in the art, and so forth.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. All publications mentioned herein are incorporated herein by reference to describe and disclose the devices, compositions, formulations and methodologies which are described in the publications and which may be used in the invention described herein.
Where a range of values is provided, it is understood that each intervening value between the upper and lower limits of that range, to the tenth of the unit of the lower limit unless the context clearly dictates otherwise, and any other stated or intervening value in that stated range, is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included in the smaller ranges, and are also encompassed within the invention, subject to any specifically excluded limit in the stated range. Where a stated range includes one or both of the limits, ranges excluding either or both of those limits are also included in the invention.
In the following description, numerous details are set forth in order to provide a more thorough understanding of the present invention. It will be apparent, however, to one skilled in the art, that the present invention may be practiced without one or more of these specific details. In other instances, features and procedures well known to those skilled in the art have not been described in order to avoid obscuring the present invention.
Although the present invention has been described primarily with reference to specific embodiments, it is envisioned that other embodiments will become apparent to those skilled in the art upon reading the present disclosure, and it is intended that such embodiments be included in the methods of the present invention.
I. Overview
The present invention relates to compositions and methods for nucleic acid identification and detection, which find use in a wide variety of applications as described herein. Such applications include sequencing of whole genomes, sequencing of multiple whole genomes, and detection of specific target sequences, including single nucleotide polymorphisms (SNPs) and gene targets of interest.
The present invention provides compositions and methods for isolating and fragmenting nucleic acids from a sample. For some applications, fragments are generated using a Controlled Random Enzymatic (CoRE) fragmentation method. In general, the CoRE fragmentation method involves replacing a number of nucleotides in a target nucleic acid with modified nucleotides or nucleotide analogs. The modified or analog nucleotides are then removed enzymatically to produce gapped nucleic acids. Further enzymatic treatment translates those gaps along the nucleic acid until gaps on opposite strands converge, producing blunt-ended nucleic acid fragments. The length, bias, and coverage of the fragments generated according to the present invention can be reproducibly controlled.
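The convergence of gaps on opposite strands can be illustrated with a toy model. The sketch below is not the method of the invention: it assumes (hypothetically) that gaps on the top strand move rightward and gaps on the bottom strand move leftward at the same rate, so that a break forms at the midpoint wherever a top-strand gap is immediately followed by a bottom-strand gap. The function names `core_break_points` and `fragment_lengths` are illustrative only.

```python
def core_break_points(top_gaps, bottom_gaps):
    """Return break positions under a toy model of gap translation.

    Assumption: top-strand gaps move right and bottom-strand gaps move left
    at equal speed, so a double-strand break forms at the midpoint between a
    top-strand gap and the next gap to its right, if that gap is on the
    bottom strand. Gaps that never meet a partner run off the fragment end.
    """
    # Encode strand as 0 (top) or 1 (bottom); at equal positions top sorts first.
    events = sorted([(p, 0) for p in top_gaps] + [(p, 1) for p in bottom_gaps])
    breaks = []
    for (pos_a, strand_a), (pos_b, strand_b) in zip(events, events[1:]):
        if strand_a == 0 and strand_b == 1:
            breaks.append((pos_a + pos_b) // 2)
    return breaks


def fragment_lengths(breaks, molecule_length):
    """Convert sorted break positions into the lengths of the resulting fragments."""
    edges = [0] + list(breaks) + [molecule_length]
    return [right - left for left, right in zip(edges, edges[1:])]


# Hypothetical gap positions on a 40-nucleotide duplex.
breaks = core_break_points(top_gaps=[0, 20], bottom_gaps=[10, 30])
print(breaks, fragment_lengths(breaks, 40))
```

In this model, two gaps that face away from each other (a bottom-strand gap to the left of a top-strand gap) never converge, which is why only top-then-bottom neighbors produce a break.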
One method of replacing nucleotides in a target nucleic acid according to the CoRE fragmentation method is via amplification of an initial population of target nucleic acids. This amplification is typically performed in the presence of a population of dNTPs, wherein the population comprises a predetermined ratio of dNTP analogs to naturally occurring nucleotides. For example, in a CoRE method in which thymine is replaced with deoxyuracil, the target nucleic acid is amplified using a population of dNTPs containing a predetermined ratio of dUTP to dTTP. The number of thymines replaced (and thus the length of the resulting fragments) can be controlled by adjusting the ratio of dUTP to dTTP. Similarly, a CoRE method that replaces cytosine with 5-methylcytosine or adenine with inosine would utilize a population of dNTPs that includes a predetermined proportion of 5-methylcytosine or inosine. As will be appreciated, the CoRE method may also utilize any combination of deoxyuracil, 5-methylcytosine, and inosine to replace multiple nucleotides within a nucleic acid.
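How the dUTP:dTTP ratio relates to the spacing of substituted positions can be estimated with a back-of-the-envelope calculation. The sketch below rests on assumptions that do not come from the text: the polymerase incorporates dUTP and dTTP with equal efficiency, and thymine makes up a fixed fraction of strand positions (0.25 by default). The function name `expected_analog_spacing` is hypothetical.

```python
def expected_analog_spacing(dutp_to_dttp_ratio: float, thymine_fraction: float = 0.25) -> float:
    """Mean distance, in nucleotides, between uracil-substituted positions on one strand.

    dutp_to_dttp_ratio: molar ratio of dUTP to dTTP in the dNTP pool.
    thymine_fraction:   assumed fraction of strand positions that are thymine.
    Assumes equal incorporation efficiency for dUTP and dTTP.
    """
    if dutp_to_dttp_ratio <= 0:
        raise ValueError("dutp_to_dttp_ratio must be positive")
    if not 0.0 < thymine_fraction <= 1.0:
        raise ValueError("thymine_fraction must be in (0, 1]")
    # Probability that a given thymine position receives uracil instead.
    substitution_probability = dutp_to_dttp_ratio / (1.0 + dutp_to_dttp_ratio)
    # Probability that an arbitrary position is a substituted thymine.
    per_position_probability = substitution_probability * thymine_fraction
    return 1.0 / per_position_probability


for ratio in (0.01, 0.05, 0.2):
    print(f"dUTP:dTTP = {ratio}: about {expected_analog_spacing(ratio):.0f} nt between uracils")
```

Under these assumptions, lowering the ratio increases the mean spacing between substituted positions, which is consistent with the statement above that the ratio controls the resulting fragment length. The actual fragment length also depends on how gaps on opposite strands converge.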
Amplification for CoRE, or for any of the nucleic acid constructs described herein, can use any of a number of amplification methods known in the art. In some applications, Multiple Displacement Amplification (MDA) is used to amplify nucleic acids for use in sequencing and in other applications described in further detail herein. The present invention provides MDA compositions and methods that reduce the GC bias inherent in many amplification methods, particularly whole genome amplification methods. In some applications, the methods of the invention include MDA methods that utilize additives such as betaine, glycerol, and single-stranded binding proteins to prevent or reduce GC bias.
Nucleic acids, including nucleic acid fragments generated according to the present invention, can be used in a number of sequencing applications. In certain applications, Long Fragment Read (LFR) sequencing is used to obtain sequence information from nucleic acid fragments. Such methods involve physically separating long fragments of genomic DNA across many different aliquots, such that the maternal and paternal components of any given region of the genome are rarely represented in the same aliquot. By placing a unique identifier in each aliquot and analyzing many aliquots in aggregate, long fragments of DNA can be assembled into a diploid genome; e.g., the sequence of each parental chromosome can be obtained. In certain LFR applications, emulsion droplets are used, where each droplet contains a small number of fragments, and all of the emulsion droplets together contain fragments representing one or more copies or equivalents of the entire genome. Emulsion droplets containing nucleic acid fragments are combined with emulsion droplets containing adaptors. The combined droplets provide an enclosed space in which the adaptors are ligated to the fragments, such that different combined droplets contain fragments tagged with different adaptors. In some applications, two or more adaptor tag components are combined in an adaptor droplet, such that unique combinatorial tags are ligated to the fragments after the adaptor droplets are combined with the droplets containing the nucleic acid fragments. In applications utilizing droplets, reagents such as ligase and buffer can be included in the emulsion droplets containing the nucleic acid fragments, in the droplets containing the adaptors, or in separate droplets that are then combined with the fragment and adaptor droplets. An advantage of using emulsion droplets is that the reaction volume is reduced to the picoliter level, which reduces the cost and time associated with generating LFR libraries.
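The claim that maternal and paternal copies of a region rarely share an aliquot can be checked with a simple probability estimate. The sketch below assumes, for illustration only, that every fragment covering a locus lands in an aliquot uniformly and independently; the function name `mixed_haplotype_fraction` and the example numbers (5 copies per haplotype, 384 aliquots) are hypothetical.

```python
def mixed_haplotype_fraction(copies_per_haplotype: int, n_aliquots: int) -> float:
    """Fraction of occupied aliquots that contain both parental copies of one locus.

    Assumes each fragment covering the locus is placed in one of n_aliquots
    uniformly at random, independently of all other fragments.
    """
    if copies_per_haplotype < 1 or n_aliquots < 1:
        raise ValueError("copies_per_haplotype and n_aliquots must be at least 1")
    # Probability that a given aliquot receives no copy of one haplotype.
    miss = (1.0 - 1.0 / n_aliquots) ** copies_per_haplotype
    p_both = (1.0 - miss) ** 2   # aliquot holds maternal AND paternal copies
    p_any = 1.0 - miss ** 2      # aliquot holds at least one copy of the locus
    return p_both / p_any


# Hypothetical example: 10 genome equivalents (5 per haplotype) over 384 aliquots.
print(f"{mixed_haplotype_fraction(5, 384):.4f}")
```

Increasing the number of aliquots, or reducing the number of genome equivalents, lowers the fraction of aliquots in which both haplotypes of a locus are present.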
Aliquots of nucleic acids may also be dispensed between different containers or vessels, such as different wells in a multi-well microtiter plate, for LFR sequencing.
Regardless of the method of generation and tagging of the different LFR aliquot libraries, the resulting nucleic acids can then be sequenced using methods known in the art and described in further detail herein. Sequence reads from individual fragments can be assembled using sequence information from their associated tag adaptors to identify fragments from the same aliquot.
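Grouping reads by their tag can be sketched as follows. The read layout used here (a fixed-length tag at the start of each read, followed by the fragment sequence) is a hypothetical simplification rather than the actual read structure, and `group_reads_by_tag` is an illustrative name.

```python
from collections import defaultdict


def group_reads_by_tag(reads, tag_length):
    """Group fragment sequences by the adaptor tag at the start of each read.

    reads:      iterable of strings; the first tag_length bases are assumed
                (hypothetically) to be the aliquot-specific adaptor tag.
    tag_length: number of leading bases that make up the tag.
    Returns a dict mapping each tag to the list of fragment sequences carrying it.
    """
    groups = defaultdict(list)
    for read in reads:
        tag, fragment = read[:tag_length], read[tag_length:]
        groups[tag].append(fragment)
    return dict(groups)


# Hypothetical reads: 4-base tag followed by fragment sequence.
reads = ["ACGTTTGACC", "ACGTGGATCA", "TTAGCCATGA"]
for tag, fragments in group_reads_by_tag(reads, tag_length=4).items():
    print(tag, fragments)
```

Reads that share a tag are treated as coming from the same aliquot and can then be assembled together.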
II. Preparation of nucleic acids
The present invention includes methods and compositions for isolating nucleic acids from a sample. By "nucleic acid" or "oligonucleotide" or "polynucleotide" or grammatical equivalents is meant herein at least two nucleotides covalently linked together. The nucleic acid can be DNA (both genomic and cDNA), RNA, or a hybrid, wherein the nucleic acid contains any combination of deoxyribose and ribose nucleotides, as well as any combination of bases (including uracil, adenine, thymine, cytosine, guanine, inosine, xanthine, hypoxanthine, isocytosine, isoguanine, etc.). As used herein, the term "nucleotide" encompasses both nucleotides and nucleosides and nucleotide analogs, as well as modified nucleotides such as amino-modified nucleotides. In addition, nucleotides include non-naturally occurring analog structures. Thus, for example, individual units of a peptide nucleic acid (each of which contains a base) may be referred to herein as nucleotides.
As discussed further herein, nucleotide analogs are used in many embodiments of the present invention. Nucleotide analogs include any nucleotide that can be incorporated into genomic DNA and that allows subsequent cleavage, whether enzymatic or chemical. Thus, dUTP is considered a nucleotide analog because uracil is not generally found in DNA. Inosine and 5-methylcytosine are also considered modified nucleotides or nucleotide analogs. In addition, as described further below, a number of RNA bases may be incorporated into the genomic DNA to allow subsequent cleavage by RNase H; in such embodiments, those RNA bases are considered analogs for purposes of the present invention. Nucleotide analogs may also include abasic residues such as 2'-deoxyribosylformamide, 2'-deoxyribose, 1',2'-dideoxyribofuranose, or propanediol.
Nucleic acids of the invention typically contain phosphodiester bonds, although in some cases, as outlined below (e.g., in the construction of primers and probes such as label probes), nucleic acid analogs are included that may have alternative backbones, comprising, for example, phosphoramidates (Beaucage et al., Tetrahedron 49(10):1925 (1993) and references therein; Letsinger, J. Org. Chem. 35:3800 (1970); Sprinzl et al., Eur. J. Biochem. 81:579 (1977); Letsinger et al., Nucl. Acids Res. 14:3487 (1986); Sawai et al., Chem. Lett. 805 (1984); Letsinger et al., J. Am. Chem. Soc. 110:4470 (1988); and Pauwels et al., Chemica Scripta 26:141 (1986)), phosphorothioate (Mag et al., Nucleic Acids Res. 19:1437 (1991); and U.S. Pat. No. 5,644,048), phosphorodithioate (Briu et al., J. Am. Chem. Soc. 111:2321 (1989)), O-methylphosphoroamidite linkages (see Eckstein, Oligonucleotides and Analogues: A Practical Approach, Oxford University Press), and peptide nucleic acid (also referred to herein as "PNA") backbones and linkages (see Egholm, J. Am. Chem. Soc. 114:1895 (1992); Meier et al., Chem. Int. Ed. Engl. 31:1008 (1992); Nielsen, Nature 365:566 (1993); Carlsson et al., Nature 380:207 (1996), all of which are incorporated herein by reference). Other nucleic acid analogs include those having bicyclic structures, including locked nucleic acids (also referred to herein as "LNA") (Koshkin et al., J. Am. Chem. Soc. 120:13252-3 (1998)); positively charged backbones (Denpcy et al., Proc. Natl. Acad. Sci. USA 92:6097 (1995)); non-ionic backbones (U.S. Pat. Nos. 5,386,023, 5,637,684, 5,602,240, 5,216,141 and 4,469,863; Kiedrowshi et al., Angew. Chem. Intl. Ed. English 30:423 (1991); Letsinger et al., J. Am. Chem. Soc. 110:4470 (1988); Letsinger et al., Nucleoside & Nucleotide 13:1597 (1994); Chapters 2 and 3, ASC Symposium Series 580, "Carbohydrate Modifications in Antisense Research", Ed. Y.S. Sanghui and P. Dan Cook; Mesmaeker et al., Bioorganic & Medicinal Chem. Lett. 4:395 (1994); Jeffs et al., J. Biomolecular NMR 34:17 (1994); Tetrahedron Lett. 37:743 (1996)); and non-ribose backbones, including those described in U.S. Pat. Nos. 5,235,033 and 5,034,506, and Chapters 6 and 7, ASC Symposium Series 580, "Carbohydrate Modifications in Antisense Research", Ed. Y.S. Sanghui and P. Dan Cook. Nucleic acids containing one or more carbocyclic sugars are also included within the definition of nucleic acids (see Jenkins et al., Chem. Soc. Rev. (1995), pp. 169-176). Several nucleic acid analogs are described in Rawls, C & E News, Jun. 2, 1997, page 35. "Locked nucleic acids" (LNA™) are also included within the definition of nucleic acid analogs. LNAs are a class of nucleic acid analogs in which the ribose ring is "locked" by a methylene bridge connecting the 2'-O atom and the 4'-C atom. These references are expressly incorporated herein by reference for all purposes, and in particular for all teachings relating to nucleic acids. These modifications of the ribose-phosphate backbone may be made to increase the stability and half-life of such molecules in physiological environments. For example, PNA:DNA and LNA:DNA hybrids may exhibit greater stability and thus may be used in certain embodiments.
The target nucleic acid can be obtained from a sample using methods known in the art. The term "target nucleic acid" refers to a nucleic acid of interest and, unless otherwise specified, is used interchangeably with the terms "nucleic acid" and "polynucleotide". It is understood that the sample may comprise any number of substances, including, but not limited to: bodily fluids (including, but not limited to, blood, urine, serum, lymph, saliva, anal and vaginal secretions, perspiration, and semen) of virtually any organism, with mammalian samples being preferred and human samples particularly preferred; environmental samples (including, but not limited to, air, agricultural, water, and soil samples); biological warfare agent samples; research samples (i.e., in the case of nucleic acids, the sample may be the product of an amplification reaction, including both target and signal amplification as generally described in PCT/US99/01705, such as the product of a PCR amplification reaction); purified samples, such as purified genomic DNA, RNA, proteins, etc.; and crude samples (bacteria, viruses, genomic DNA, etc.). As will be appreciated by those skilled in the art, virtually any experimental manipulation may have been performed on the sample. In one aspect, the nucleic acid constructs of the invention are formed from genomic DNA. In certain embodiments, the genomic DNA is obtained from whole blood, or from cell preparations from whole blood or cell cultures.
In one aspect of the invention, the target nucleic acid is a genomic nucleic acid, although other target nucleic acids, including mRNA (and corresponding cDNAs, etc.), can be used. Target nucleic acids include naturally occurring, genetically altered, and synthetically prepared nucleic acids (e.g., genomes from mammalian disease models). The target nucleic acid can be obtained from almost any source and can be prepared using methods known in the art. For example, a target nucleic acid can be isolated directly without amplification, or isolated by amplification using methods known in the art, including, but not limited to, Polymerase Chain Reaction (PCR), Multiple Displacement Amplification (MDA) (which encompasses and is used interchangeably with the term strand displacement amplification (SDA)), Rolling Circle Amplification (RCA) (which encompasses and is used interchangeably with the term Rolling Circle Replication (RCR)), and other amplification methods. Target nucleic acids can also be obtained by cloning, including, but not limited to, cloning into vectors such as plasmids, yeast artificial chromosomes, and bacterial artificial chromosomes.
In certain aspects, the target nucleic acids comprise mRNAs or cDNAs. In particular embodiments, the target DNA is produced using transcripts isolated from a biological sample. The isolated mRNA can be reverse transcribed into cDNAs using conventional techniques, as also described in Genome Analysis: A Laboratory Manual Series (Vols.I-IV) or Molecular Cloning: A Laboratory Manual.
The target nucleic acid may be single-stranded or double-stranded, as specified, or may contain portions of both double-stranded and single-stranded sequence. Depending on the application, the nucleic acid may be DNA (including genomic DNA and cDNA), RNA (including mRNA and rRNA), or a hybrid, where the nucleic acid contains any combination of deoxyribo- and ribonucleotides and any combination of bases, including uracil, adenine, thymine, cytosine, guanine, inosine, xanthine, hypoxanthine, isocytosine, isoguanine, and the like.
In some embodiments, the target nucleic acid is genomic DNA, in some embodiments mammalian genomic DNA, and in particular human genomic DNA. In some cases, genomic DNA may be obtained from normal somatic tissue, germline tissue, or diseased tissue such as tumor tissue. As outlined herein, many embodiments use a number of genome equivalents, typically 1 to 30 or 5 to 20. Many embodiments utilize 10 genome equivalents. A genome equivalent may comprise the entire genome from one or more cells, or may comprise an amount of DNA that encompasses the genome of one or more cells (i.e., a single diploid cell has 2 genome equivalents of DNA). In some embodiments, at least two genome equivalents are used in the methods of the invention to completely cover a diploid genome.
In an exemplary embodiment, the genomic DNA is isolated from a target organism. By "target organism" is meant an organism of interest; as will be appreciated, the term includes any organism from which a nucleic acid can be obtained, particularly mammals, including humans, although in certain embodiments the target organism is a pathogen (e.g., when a bacterial or viral infection is to be detected). Methods for obtaining nucleic acids from target organisms are well known in the art. Samples comprising human genomic DNA are useful in many aspects and embodiments of the invention. In certain aspects, such as whole genome sequencing, it is preferred to obtain DNA equivalent to about 1 to about 100 or more genomes to ensure that the population of target DNA fragments is sufficient to cover the entire genome. The number of genome equivalents obtained may depend in part on the method used to further prepare the genomic DNA fragments of the invention. For example, in the long fragment read (LFR) methods described further below, about 1 to about 50 genome equivalents are typically used. In still other embodiments, about 2-40, 3-30, 4-20, or 5-10 genome equivalents are used in the methods of the invention. In still other embodiments, about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20 genome equivalents are used. For some methods, about 1000 to about 100,000 genome equivalents are typically utilized. For some methods that do not amplify prior to fragmentation, about 100,000 to about 1,000,000 genome equivalents are used.
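Converting a cell count or a DNA mass into genome equivalents can be sketched with a short calculation. The mass-based conversion uses an assumed value of roughly 3.3 pg of DNA per haploid human genome, an approximate figure that does not come from the text. The cell-based conversion follows the statement above that a single diploid cell has 2 genome equivalents. The function names are hypothetical.

```python
PG_PER_HAPLOID_HUMAN_GENOME = 3.3  # assumed approximate value, not from the text


def genome_equivalents_from_cells(n_diploid_cells: int) -> int:
    """Each diploid cell contributes two genome equivalents."""
    if n_diploid_cells < 0:
        raise ValueError("n_diploid_cells must be non-negative")
    return 2 * n_diploid_cells


def genome_equivalents_from_mass(mass_pg: float,
                                 pg_per_genome: float = PG_PER_HAPLOID_HUMAN_GENOME) -> float:
    """Estimate genome equivalents from a DNA mass given in picograms."""
    if mass_pg < 0 or pg_per_genome <= 0:
        raise ValueError("mass_pg must be non-negative and pg_per_genome positive")
    return mass_pg / pg_per_genome


print(genome_equivalents_from_cells(5))                 # cells to genome equivalents
print(round(genome_equivalents_from_mass(33.0), 1))     # picograms to genome equivalents
```

Under this assumption, the 10 genome equivalents mentioned above correspond to about 5 diploid cells, or on the order of tens of picograms of human DNA.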
Libraries containing nucleic acid constructs or fragments generated from populations containing one or more genomic equivalents will contain target nucleic acids whose sequences, once identified and assembled, will provide most or all of the sequence of the entire genome.
The target nucleic acid is isolated using conventional techniques, e.g., as disclosed in Sambrook and Russell, Molecular Cloning: A Laboratory Manual (cited above).
In some embodiments, the target nucleic acids are treated to protect them during subsequent chemical or mechanical manipulation. For example, in certain embodiments, target nucleic acids are isolated in the presence of (or combined after isolation with) spermidine or polyvinylpyrrolidone 40 (PVP40) to protect them from shearing during mechanical manipulations such as pipetting. Such protection is particularly useful for applications that utilize long nucleic acid fragments, such as the LFR methods described in further detail below. In some cases, when only a small amount of sample DNA is available and DNA may be lost through non-specific binding (e.g., to container walls), it is beneficial to provide carrier DNA (e.g., unrelated circular synthetic double-stranded DNA) to be mixed and used with the sample DNA.
II.A. Fragmentation of target nucleic acids
In some aspects of the invention, the target nucleic acid is fragmented. The fragment size of the target nucleic acid may vary depending on the source target nucleic acid and the library construction method used. For some applications, longer fragments are used in the present invention. Such longer fragments may range in size from about 100,000 to about 1,000,000 nucleotides in length. In yet another embodiment, the longer fragments are about 50,000, 100,000, 150,000, 200,000, 250,000, 300,000, 350,000, 400,000, 450,000, 500,000, 700,000, 900,000, 1,000,000, or 1,500,000 nucleotides in length. In still other embodiments, the longer fragments range in length from about 150,000-950,000, 200,000-900,000, 250,000-850,000, 300,000-800,000, 350,000-750,000, 400,000-700,000, 450,000-650,000, or 500,000-600,000 nucleotides. For certain applications, fragments ranging from about 50 to about 600 nucleotides in length may be used in the methods of the invention. In yet other embodiments, the fragments are about 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1200, 1400, 1600, 1800, or 2000 nucleotides in length. In still other embodiments, the length of the fragments is 10-100, 50-100, 50-300, 100-200, 200-300, 50-400, 100-400, 200-400, 300-400, 400-600, 500-500, 50-1000, 100-1000, 200-1000, 300-1000, 400-1000, 500-1000, 600-1000, 700-900, 700-800, 800-1000, 900-1000, 1500-2000, 1750-2000, or 50-2000 nucleotides.
Many mechanical and enzymatic fragmentation methods are well known in the art. In many embodiments, the shear forces generated during lysis and extraction will mechanically generate fragments in the desired range. Other mechanical fragmentation methods include sonication and nebulization. Mechanical fragmentation methods have the advantage of generating fragments of a particular size range in a predictable manner. However, mechanical fragmentation methods typically require large amounts (>2 μg) or large volumes (>200 μL) of input nucleic acid. As such, mechanical fragmentation methods are generally suited only to processing single samples.
Enzymatic fragmentation methods can also be used to generate nucleic acid fragments, particularly shorter fragments of 1-5kb in size. Methods of enzyme fragmentation include the use of endonucleases. The enzymatic method can be used with moderate nucleic acid mass and volume and is more suitable for multiple sample processing than the mechanical fragmentation method. However, enzymatic fragmentation methods are inherently prone to variability in the degree of fragmentation, as the enzyme activity, substrate amount and concentration, and digestion time need to be extremely carefully controlled in order to achieve a consistent fragment size distribution in such methods.
In some embodiments, fragments of a particular size or within a particular size range are isolated. Such methods are well known in the art. For example, gel fractionation can be used to generate a population of fragments of a particular size within a range of base pairs, e.g., 500 base pairs ± 50 base pairs.
In some cases, particularly where it is desired to isolate long fragments (such as fragments of about 150 to about 750 kilobases in length), the present invention provides methods in which cells are lysed and the intact nuclei are pelleted with a gentle centrifugation step. Nucleic acids, typically genomic DNA, are then released over several hours via enzymatic digestion using, for example, proteinase K and RNase digestion. The resulting material is then dialyzed overnight or directly diluted to reduce the concentration of remaining cellular waste. Because such methods of isolating nucleic acids do not involve many disruptive processes (such as ethanol precipitation, centrifugation, and vortexing), the genomic nucleic acid remains largely intact, yielding a majority of fragments in excess of 100 kilobases.
II.A.1. CoRE fragmentation
As discussed above, the fragmentation methods used in the present invention include both mechanical and enzymatic fragmentation methods, as well as combinations of enzymatic and mechanical fragmentation methods. In one aspect, the invention provides a fragmentation process referred to herein as controlled random enzyme (CoRE) fragmentation. The CoRE fragmentation methods described herein can be used alone or in combination with other mechanical and enzymatic fragmentation methods known in the art.
Generally, the CoRE fragmentation method involves replacing a number of nucleotides in a target nucleic acid with nucleotide analogs. The nucleic acid containing the nucleotide analogs is then enzymatically or chemically treated to produce a gapped nucleic acid. In certain embodiments, the enzymatic/chemical treatment removes the nucleotide analogs from the nucleic acid to form the gapped nucleic acid. In certain embodiments, the enzymatic/chemical treatment creates a nick just 3' or 5' of each nucleotide analog to form the gapped nucleic acid. A gapped nucleic acid is typically a double-stranded nucleic acid that contains single-nucleotide or multiple-nucleotide gaps or nicks in at least one strand.
Further enzymatic treatment of the gapped nucleic acid translates those nicks along the nucleic acid until nicks on opposite strands converge, resulting in blunt-ended nucleic acid fragments. The length, bias, and coverage of the fragments generated according to the invention can be reproducibly controlled. CoRE fragmentation has the advantages of enzymatic fragmentation (such as the ability to use low amounts and/or volumes of DNA) without many of its disadvantages (including sensitivity to variations in substrate or enzyme concentration and sensitivity to digestion time).
In still other embodiments, the nucleotide analog is introduced into the nucleic acid by amplifying the nucleic acid in the presence of dNTPs comprising a predetermined ratio of nucleotide analog to naturally occurring nucleotide. Amplification with this mixed population of nucleotides and nucleotide analogs generates many amplicons in which naturally occurring nucleotides are replaced with nucleotide analogs. The number of nucleotides replaced by the analog is controlled by controlling the predetermined ratio of analog to naturally occurring nucleotide in the dNTPs used in the amplification method. This predetermined ratio is the ratio of analog to natural nucleotide required to generate a fragment of the desired length. For example, if the starting nucleic acid is about 100,000 bases in length, the predetermined analog-to-nucleotide ratio can be adjusted to replace the desired number of nucleotides, ultimately producing (in a non-limiting example) fragments that are 10,000 bases in length (after processing to produce a nicked nucleic acid, then further processing to produce double-stranded fragments).
The number of nucleotides replaced by nucleotide analogs in the amplicon is controlled by manipulating the ratio of nucleotide analogs to naturally occurring nucleotides in the dNTP population used in the amplification method. In some embodiments, the population of dNTPs used to generate amplicons in which nucleotides are replaced with nucleotide analogs during amplification comprises from about 0.05% to about 30% nucleotide analogs. In still other embodiments, the dNTP population comprises about 0.1%-0.5%, 0.5%-0.7%, 1%-25%, 5%-20%, or 10%-15% nucleotide analogs. In still other embodiments, the dNTP population comprises at least about 0.5%, 0.75%, 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 11%, 12%, 13%, 14%, or 15% nucleotide analogs.
In some embodiments, about 0.01-5% of one or more nucleotide species (A, C, G, and/or T) are replaced with nucleotide analogs according to the methods described herein. In still other embodiments, about 0.05%-4%, 0.1%-3%, 0.2%-2%, 0.3%-1%, 0.4%-0.9%, 0.5%-0.8%, and 0.6%-0.7% of one or more nucleotide species are replaced with a nucleotide analog according to the methods described above. In still other embodiments, at least about 0.1%, 0.2%, 0.25%, 0.3%, 0.4%, 0.5%, 0.6%, 0.7%, 0.75%, 0.8%, 0.9%, 1%, 2%, 3%, 4%, and 5% of one or more nucleotide species are replaced with a nucleotide analog according to the methods described above.
After amplification of a nucleic acid in the presence of dNTPs containing a predetermined ratio of nucleotide analogs, the resulting amplicon has some of the naturally occurring nucleotides replaced with nucleotide analogs. The amplicon is then treated chemically or with one or more enzymes to remove the nucleotide analogs or to generate a nick in the amplicon 5' or 3' of each nucleotide analog, producing a nicked nucleic acid. The nicked nucleic acid is then treated with an enzyme, typically a polymerase, to translate the nicks along the length of the nucleic acid until the nicks on opposite strands converge. This results in a population of blunt-ended double-stranded fragments.
In some embodiments, the invention provides a CoRE method in which thymine is replaced with uracil or deoxyuracil and a target nucleic acid is amplified using a population of dNTPs containing a predetermined ratio of dUTP to dTTP. As discussed above, the number of thymines that are replaced (and thus the length of the resulting fragments) can be controlled by manipulating the ratio of dUTP to dTTP; e.g., a higher ratio of dUTP to dTTP results in a greater number of thymines in the target nucleic acid being replaced with uracil. Subsequent processing to remove dUTP (or to nick 3' or 5' of dUTP) will then generate shorter fragments, because substitutions occur with greater frequency along the nucleic acid. Similarly, a CoRE approach that replaces cytosine with 5-methylcytosine or adenine with inosine would utilize a population of dNTPs that incorporates a predetermined proportion of 5-methylcytosine or inosine. As should be appreciated, the CoRE method according to the present invention can utilize any combination of deoxyuracil, 5-methylcytosine, and inosine to replace multiple nucleotide species along a nucleic acid with analogs.
In still other embodiments, the nucleic acid is amplified using a population of dNTPs comprising 4% dUTP relative to dTTP to generate amplicons in which a proportion of the thymines is replaced with deoxyuracil. Such concentrations of dUTP will generally result in about 0.05% to 0.1% of the thymines in the resulting amplicon being replaced with deoxyuracil. As discussed above, the amount of deoxyuracil incorporated into the amplicon can be fine-tuned through the ratio of dUTP to dTTP contained in the dNTPs used to amplify the nucleic acid. In certain embodiments, the dNTP population comprises about 0.1%-0.5%, 0.5%-0.8%, 1%-25%, 5%-20%, or 10%-15% dUTP relative to dTTP. In still other embodiments, the dNTP population comprises at least about 0.5%, 0.75%, 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 11%, 12%, 13%, 14%, or 15% dUTP.
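The link between the fraction of thymines replaced and the spacing of analogs along a strand can be illustrated with a short calculation. The sketch below is illustrative only and rests on assumptions that are not part of the disclosure: replaced positions are treated as independent and uniformly distributed, thymine is taken to make up 25% of the bases, and the function name `mean_analog_spacing` is hypothetical. Under those assumptions the mean spacing on one strand is 1 / (thymine fraction × fraction replaced). This is the spacing of analogs on a single strand, not the final fragment length, which also depends on where nicks on the opposite strand lie.

```python
def mean_analog_spacing(fraction_replaced, base_fraction=0.25):
    """Mean distance, in bases, between analogs on one strand.

    Assumes replaced positions are independent and uniformly distributed
    (an illustrative simplification, not a statement from the source).
    fraction_replaced: fraction of the chosen base species replaced (0-1).
    base_fraction: fraction of all bases that are that species (assumed 0.25).
    """
    if not 0 < fraction_replaced <= 1:
        raise ValueError("fraction_replaced must be in (0, 1]")
    if not 0 < base_fraction <= 1:
        raise ValueError("base_fraction must be in (0, 1]")
    return 1.0 / (fraction_replaced * base_fraction)


# Replacement levels quoted in the text: about 0.05% to 0.1% of thymines.
for percent in (0.05, 0.1):
    spacing = mean_analog_spacing(percent / 100.0)
    print(f"{percent}% of T replaced: roughly {spacing:,.0f} bases between analogs")
```

Doubling the replaced fraction halves the mean spacing, which matches the qualitative statement that more dUTP yields shorter fragments.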
In some embodiments, a combination of nucleotide analogs is used in the amplification step of the CoRE method, such that two different nucleotide species are replaced with nucleotide analogs in the resulting amplicon. For example, in some embodiments, both thymine and cytosine are replaced with nucleotide analogs. In yet other embodiments, thymine is replaced with deoxyuracil and cytosine is replaced with 5-methylcytosine. As discussed above, a range of ratios of analogs to naturally occurring nucleotides can be used to control the size of the fragments produced when amplicons are processed to form gapped nucleic acids and the gapped nucleic acids are then processed to form double-stranded fragments. In certain embodiments, the same ratio relative to the naturally occurring nucleotide is used for dUTP and for 5-methylcytosine. In other words, amplicons are created using a population of dNTPs comprising about 0.05%-25% dUTP relative to dTTP and 0.05%-25% 5-methylcytosine relative to cytosine, wherein a proportion of the thymines and cytosines is replaced with the corresponding analog. In still other embodiments, the dNTP population comprises about 4-5% 5-methylcytosine and 0.75-1% dUTP. In yet another embodiment, the population of dUTP relative to dTTP and the population of 5-methylcytosine relative to cytosine comprise about 0.1%-0.5%, 0.5%-0.8%, 1%-25%, 5%-20%, or 10%-15% analog. In still other embodiments, the population of dUTP relative to dTTP and the population of 5-methylcytosine relative to cytosine comprise at least about 0.5%, 0.75%, 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 11%, 12%, 13%, 14%, or 15% analog. As will be appreciated, the ratio of dUTP to dTTP may be the same as or different from the ratio of 5-methylcytosine to cytosine in this embodiment of the invention.
If different ratios are used when using different nucleotide analogs, any combination of the ratios listed above can be used to generate amplicons in which at least a portion of the naturally occurring nucleotides are replaced with nucleotide analogs.
An exemplary method of CoRE fragmentation is shown in FIG. 1. First, nucleic acid 101 is subjected to enzyme-catalyzed Multiple Displacement Amplification (MDA) in the presence of dNTPs to which dUTP or UTP is added at a defined ratio to dTTP. This results in the T positions on both strands of the amplification product being replaced by deoxyuracil ("dU") or uracil ("U") at a defined and controllable ratio (103). The U moieties are then excised (104), typically by use of one or more enzymes, including but not limited to UDG, EndoIV, EndoVIII, and T4PNK, to create single-base gaps (also referred to herein as nicks) with functional 5' phosphate and 3' hydroxyl termini (105). The average interval at which single-base gaps occur is determined by the frequency of dU in the MDA product. Treatment of the gapped nucleic acid (105) with a polymerase having exonuclease activity causes translation of the gaps or nicks along the length of the nucleic acid until the nicks on opposite strands converge, thereby forming double-strand breaks and resulting in a population of double-stranded fragments that are relatively uniform in size (107). The exonuclease activity of a polymerase such as Taq polymerase excises the short stretch of DNA strand adjacent to the nick, while the polymerase activity fills in the nick and the subsequent nucleotides in the strand (essentially, Taq moves along the strand, excising bases with its exonuclease activity and adding the same bases back, with the result that the nick is displaced along the strand until the enzyme reaches the end of the strand). The size distribution of the double-stranded fragments (107) is determined by the ratio of dTTP to dUTP or UTP used in the MDA reaction, not by the duration or extent of the enzymatic treatment. That is, the higher the amount of dUTP, the shorter the resulting fragments.
As such, the CoRE fragmentation method produces a high degree of fragmentation reproducibility compared to other enzymatic or mechanical fragmentation methods.
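The convergence of nicks on opposite strands described above can be sketched as a toy simulation. This is a simplified illustration under stated assumptions, not the patented procedure: nicks are placed independently at random on each strand, nicks on the "top" strand travel toward higher coordinates and nicks on the "bottom" strand toward lower coordinates at the same speed, and a double-strand break forms at the midpoint wherever a top-strand nick is immediately followed by a bottom-strand nick. Nicks with no converging partner are assumed to run off a fragment end. All function names are hypothetical.

```python
import random


def place_nicks(length, nick_prob, seed=0):
    """Randomly place nicks on both strands; returns (position, strand) pairs
    sorted by position. nick_prob is the per-base, per-strand probability."""
    rng = random.Random(seed)
    nicks = []
    for pos in range(length):
        if rng.random() < nick_prob:
            nicks.append((pos, "top"))
        if rng.random() < nick_prob:
            nicks.append((pos, "bottom"))
    return nicks


def convergent_breaks(nicks):
    """Break positions for a position-sorted list of nicks.

    A top-strand nick followed directly by a bottom-strand nick converges
    at their midpoint (equal translation speed is an assumption).
    """
    breaks = []
    for (pos_a, strand_a), (pos_b, strand_b) in zip(nicks, nicks[1:]):
        if strand_a == "top" and strand_b == "bottom":
            breaks.append((pos_a + pos_b) / 2)
    return breaks


def fragment_lengths(length, breaks):
    """Lengths of the fragments produced by cutting at each break."""
    edges = [0] + list(breaks) + [length]
    return [right - left for left, right in zip(edges, edges[1:])]


# A higher nick frequency (more analog incorporated) gives shorter fragments.
for prob in (0.0005, 0.005):
    frags = fragment_lengths(100_000, convergent_breaks(place_nicks(100_000, prob)))
    print(f"nick probability {prob}: {len(frags)} fragments, "
          f"mean length {100_000 / len(frags):.0f}")
```

In this model the fragment size distribution depends only on where the nicks fall, not on how long translation runs, which mirrors the statement that fragment size is set by the dUTP ratio rather than by the duration of the enzymatic treatment.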
As will be appreciated, in the above exemplary embodiments and in any embodiment of the CoRE method, a number of amplification methods may be used in this step to replace nucleotides with modified nucleotides or nucleotide analogs. Such amplification methods are described in more detail below, and may include, but are not limited to, Polymerase Chain Reaction (PCR), Multiple Displacement Amplification (MDA), Rolling Circle Amplification (RCA) (for circularized fragments), and any other applicable amplification method known in the art. As will also be discussed in more detail below, in certain embodiments, the methods and compositions of the amplification reaction used in this step of the CoRE method can also reduce bias and improve coverage of the resulting fragments.
Another exemplary embodiment of the CoRE fragmentation method is shown in FIG. 2. In this exemplary embodiment, two different nucleotides are replaced with nucleotide analogs: thymine is replaced by uracil and cytosine is replaced by 5-methylcytosine. As shown in FIG. 2, nucleic acid 201 is subjected to enzyme-catalyzed Multiple Displacement Amplification (MDA) in the presence of dNTPs to which dUTP or UTP is added at a defined ratio to dTTP. 5-methyl-dCTP is also added to the dNTPs at a defined ratio to dCTP. This results in the replacement of the T and C positions on both strands of the DNA product by dU and 5-methyl dC at a defined (and controllable) ratio (203). Next, the U moieties and the regions near the 5-methyl C moieties are excised (204), in one non-limiting example by a combination of McrBC, UDG, EndoIV or EndoVIII, and T4PNK, to create single-base gaps with functional 5' PO4 and 3' OH termini (or double-strand nicks in the case of McrBC), the mean interval of which is defined by the frequency of uracil and 5-methylcytosine in the MDA product (203). Single-base gaps are created at average intervals defined by the frequency of dU in the MDA product. Treatment of the gapped nucleic acid (205) with a polymerase such as Taq polymerase or E. coli DNA pol I (206) causes the nicks to translate until the nicks on opposite strands converge, thereby creating double-strand breaks (207). Treatment with E. coli DNA pol I also fills in or removes any overhangs created by McrBC cleavage of the duplex. As in the method shown in FIG. 1, this exemplary embodiment of CoRE results in double-stranded fragments whose length can be reproducibly controlled by varying the proportion of nucleotide analogs contained in the dNTP population during amplification.
The introduction of the additional nucleotide analog (5-methylcytosine) in this embodiment of CoRE improves fragmentation in GC-rich regions of the genome compared to methods that introduce only a single nucleotide analog species into the target nucleic acid. For example, the embodiment of CoRE shown in FIG. 1 can show a bias toward higher fragmentation in AT-rich regions of the genome. Embodiments of CoRE in which more than one nucleotide analog is introduced, such as the embodiment shown in FIG. 2, reduce the coverage bias that can be observed in embodiments using only a single nucleotide analog species or in other enzymatic and/or mechanical fragmentation methods.
As will be appreciated, nucleic acid fragments can be generated according to the CoRE method described above using any nucleotide analogs and modified nucleotides known in the art. In addition to the uracil and 5-methylcytosine nucleotide analogs discussed above, other exemplary modified nucleotides and nucleotide analogs that can be used in the CoRE method of the invention include, but are not limited to, peptide nucleotides, modified phosphate-sugar backbone nucleotides, N-7-methylguanine, deoxyuracil, and deoxy-3'-methyladenine.
Further enzymatic and chemical treatment of fragments
In some embodiments, after fragmentation, the target nucleic acids are further modified to prepare them for later use, such as in preparing nucleic acid constructs, as discussed in more detail below. Such modifications are desirable because the fragmentation process may leave the ends of the resulting target nucleic acids unable to participate in certain reactions, particularly those using enzymes such as ligases and polymerases. As with all steps outlined herein, this further modification step is optional and may be combined with any other steps in any order.
In an exemplary embodiment, after fragmentation, the target nucleic acids typically have a combination of blunt and overhanging ends and a combination of terminal phosphate and hydroxyl chemistries. Such fragments can be treated with several enzymes to create blunt ends with specific chemistry. In one embodiment, any 5' single-stranded overhangs are filled in with a polymerase and dNTPs to create blunt ends. A polymerase having 3' exonuclease activity (typically but not always the same enzyme as the polymerase with 5' activity, such as T4 polymerase) is used to remove 3' overhangs. Suitable polymerases include, but are not limited to: T4 polymerase, Taq polymerase, E. coli DNA polymerase I, Klenow fragment, reverse transcriptase, Φ29-related polymerases (including wild-type Φ29 polymerase and derivatives of the polymerase), T7 DNA polymerase, T5 DNA polymerase, and RNA polymerase. These techniques can be used to create blunt ends for a variety of uses.
In further optional embodiments, the chemistry of the ends is altered to prevent target nucleic acids from ligating to one another. For example, in addition to a polymerase, a protein kinase may be used in the process of generating blunt ends, using its 3' phosphatase activity to convert 3' phosphate groups into hydroxyl groups. Such kinases include, but are not limited to, commercial kinases such as T4 kinase, and kinases that are not yet commercially available but have the desired activity.
Similarly, the phosphate group at the end can be converted to a hydroxyl group by a phosphatase. Suitable phosphatases include, but are not limited to, alkaline phosphatase (including calf intestinal alkaline phosphatase (CIP)), Antarctic phosphatase, apyrase, pyrophosphatase, inorganic (yeast) thermostable inorganic pyrophosphatase, and the like, which are known in the art and commercially available from, for example, New England Biolabs.
One skilled in the art will appreciate that for all steps outlined herein, any combination of these steps and enzymes may be used. For example, certain enzymatic fragmentation techniques, such as the use of restriction enzymes, may render one or more of these enzymatic "end-repair" steps superfluous.
The modifications described above can prevent the formation of nucleic acid templates containing different fragments linked in an unknown conformation, thus reducing and/or eliminating errors in sequence identification and assembly caused by such undesirable templates.
In still other embodiments, the DNA fragments are denatured after fragmentation to generate single-stranded fragments.
II.C. amplification
In one embodiment, after fragmentation (and indeed before or after any of the steps outlined herein), the fragmented nucleic acid population may be subjected to an amplification step to ensure that a sufficiently large concentration of all fragments is available for subsequent use. Such amplification methods are well known in the art and include, but are not limited to, Polymerase Chain Reaction (PCR), ligase chain reaction (sometimes referred to as oligonucleotide ligase amplification, OLA), cycling probe technology (CPT), Multiple Displacement Amplification (MDA), Transcription Mediated Amplification (TMA), Nucleic Acid Sequence Based Amplification (NASBA), Rolling Circle Amplification (RCA) (for circularized fragments), and Strand Displacement Amplification (SDA).
II.C.1. Multiple Displacement Amplification (MDA)
In one aspect of the invention, MDA is used to amplify fragments or nucleic acid constructs generated according to the methods described herein. MDA generally involves contacting at least one primer, a DNA polymerase, and a target sample, and incubating the target sample under conditions that promote amplification of the target sequence. If one primer (e.g., a Watson primer, complementary to a Crick target) is used, multiple copies of one strand of the double-stranded target (e.g., Crick) are generated; if a second primer (e.g., Crick), which is complementary to the second strand of the target (e.g., Watson), is also used, amplification of both strands occurs. Replication of the target sequence generates a replicated strand such that, during replication, the replicated strand is displaced from the target sequence by strand displacement replication of another replicated strand. In some embodiments of MDA, a random primer set is used to randomly prime a sample of genomic nucleic acid (or another sample of highly complex nucleic acid). By selecting a sufficiently large set of primers with random or partially random sequences, the primers in the set will be collectively and randomly complementary to nucleic acid sequences distributed throughout the nucleic acids in the sample. Amplification proceeds by replication with a highly processive polymerase, initiated at each primer, and continues until spontaneous termination. A key feature of this method is the displacement of intervening primers by the polymerase during replication. Thus, multiple duplicate copies of the whole genome can be synthesized in a short time. General methods for MDA are known in the art and are disclosed, for example, in U.S. patent No. 7,074,600, which is hereby incorporated by reference in its entirety for all purposes and for all teachings relating specifically to MDA.
One weakness of conventional MDA methods, particularly when used for whole genome amplification, is that bias is often introduced into the amplification product. In many cases, this bias is a GC bias, in which a larger number of copies is made of GC-rich regions of the genomic sequence. In some cases, AT bias is seen, in which AT-rich regions of the genome are amplified in greater amounts than other sequences. The present invention provides compositions and methods that ameliorate or prevent the bias that can result from amplification reactions, particularly MDA reactions.
In some embodiments, random 8-mer primers are used to reduce amplification bias in a population of fragments, as opposed to the random hexamers conventionally used in MDA reactions. In addition, primers used in MDA reactions can be designed to have a lower GC content, which also has the effect of reducing GC bias. For example, FIG. 3 shows the effect of primer concentration on GC bias. In FIG. 3, points above the x-axis represent sequences biased toward AT-rich, while points below the x-axis show sequences biased toward GC-rich. The low-GC-content 6-mer (squares in FIG. 3) shows relatively low bias across a broad concentration range in an MDA reaction at 30 °C for 90 minutes.
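The idea of restricting random primers to a lower GC content can be sketched as a simple filter over randomly generated sequences. The example below is a minimal illustration only: the function names, the default 8-mer length, and the 37.5% GC ceiling (at most 3 G/C bases in an 8-mer) are assumptions chosen for the example, not values taken from the disclosure.

```python
import random


def gc_fraction(seq):
    """Fraction of bases in seq that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def random_low_gc_primers(count, length=8, max_gc=0.375, seed=0):
    """Draw random primers, keeping only those at or below max_gc.

    max_gc is a hypothetical threshold; it must be >= 0 so that
    all-A/T primers can always pass and the loop terminates.
    """
    if max_gc < 0:
        raise ValueError("max_gc must be non-negative")
    rng = random.Random(seed)
    primers = []
    while len(primers) < count:
        candidate = "".join(rng.choice("ACGT") for _ in range(length))
        if gc_fraction(candidate) <= max_gc:
            primers.append(candidate)
    return primers


for primer in random_low_gc_primers(5):
    print(primer, f"GC = {gc_fraction(primer):.3f}")
```

Rejection sampling keeps the accepted primers uniformly distributed over all sequences that satisfy the GC ceiling, which preserves randomness within the allowed set.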
In still other embodiments, certain enzymes may be added to the MDA reaction to reduce amplification bias. For example, low concentrations of non-processive 5' exonucleases can reduce GC bias.
In still other embodiments, additives are included in the MDA reaction to prevent or ameliorate GC bias. Such additives include, but are not limited to, single-stranded binding proteins (SSB), betaine, DMSO, trehalose, and glycerol.
FIG. 4 shows that DMSO reduces the GC bias in MDA reactions caused by higher primer concentrations (see FIG. 4B). As will be appreciated, a wide range of DMSO concentrations may be used in accordance with the present invention. In an exemplary, non-limiting embodiment, about 0.5% to about 10% DMSO is used as an additive in the MDA reaction of the invention. In still other embodiments, about 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, or 10% DMSO is used in the methods of the invention. In yet other embodiments, about 1%-2%, 2%-4%, 5%-8%, and 3%-6% DMSO is used.
FIG. 5 shows that both SSB (FIG. 5A) and betaine (FIG. 5B) can reduce GC bias over a wide range of concentrations. The experiments of FIGS. 4 and 5 were carried out at 30 °C for 90 minutes. As should be appreciated, a wide range of SSB and betaine concentrations may be used in accordance with the present invention. In some embodiments, about 1 to about 5000 ng SSB is used in accordance with the present invention. In yet other embodiments, about 1-10, 20-4000, 30-3000, 40-2000, 50-1000, 60-500, 70-400, 80-300, 90-200, 10-100, 15-90, 20-80, 30-70, or 40-60 ng SSB is used. In some embodiments, from about 0.1 to about 5 μM betaine is used in accordance with the present invention. In still other embodiments, about 0.2-4, 0.5-3, and 1-2 μM betaine is used. In still other embodiments, about 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, and 1.5 μM betaine is used.
In certain embodiments, the nucleic acid fragments are combined with spermidine prior to amplification with MDA to provide protection from shearing during pipetting or other physical manipulations. However, high concentrations of spermidine can interfere with MDA. In certain embodiments, the nucleic acid fragments are denatured in the presence of high concentrations (about 100mM) of spermidine prior to MDA. The mixture is then diluted to yield a final concentration of 1mM spermidine, which is then amplified using MDA or other amplification methods known in the art.
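The dilution from about 100 mM to 1 mM spermidine corresponds to a 100-fold dilution, following C1 × V1 = C2 × V2. The short sketch below works through that arithmetic; the function names and the 2 μL sample volume are hypothetical values used only for illustration.

```python
def dilution_factor(stock_mM, final_mM):
    """Fold dilution needed to go from stock_mM to final_mM."""
    if stock_mM <= 0 or final_mM <= 0 or final_mM > stock_mM:
        raise ValueError("require 0 < final_mM <= stock_mM")
    return stock_mM / final_mM


def diluent_volume(sample_uL, stock_mM, final_mM):
    """Volume of diluent (uL) to add to sample_uL, from C1*V1 = C2*V2."""
    return sample_uL * (dilution_factor(stock_mM, final_mM) - 1)


factor = dilution_factor(100, 1)
print(f"dilution factor: {factor:.0f}-fold")
print(f"diluent for a 2 uL sample: {diluent_volume(2, 100, 1):.0f} uL")
```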
As will be appreciated, methods to prevent or ameliorate bias in MDA reactions can be used with any method for fragmenting nucleic acids or generating nucleic acids for the generation of DNA nanospheres, where those methods include one or more amplification steps.
Preparation of circular constructs
In one aspect, the circular nucleic acid template construct can be generated using nucleic acid fragments generated as described above. These circular constructs can serve as templates for the generation of DNA nanospheres (which are described in more detail below). The invention provides circular nucleic acid template constructs comprising a target nucleic acid and a plurality of discrete adaptors. Nucleic acid template constructs are assembled by inserting adaptor molecules at multiple sites throughout each target nucleic acid fragment. The dispersed adaptors allow sequence information to be obtained from multiple sites in the target nucleic acid, either sequentially or simultaneously.
Although the embodiments of the invention described herein are generally described in terms of circular nucleic acid template constructs, it will be appreciated that the nucleic acid template constructs may also be linear. Furthermore, the nucleic acid template constructs of the invention may be single-stranded or double-stranded, the latter being preferred in some embodiments. As used herein, unless otherwise noted, the terms "target nucleic acid" and "target nucleic acid fragment" and all grammatical equivalents may be used interchangeably.
The nucleic acid templates of the invention (also referred to herein as "nucleic acid constructs" and "library constructs") comprise a target nucleic acid and an adaptor. The term "adaptor" as used herein refers to an oligonucleotide of known sequence. The adaptor used in the present invention may comprise a variety of elements. The type and number of elements (also referred to herein as "features") included in the adaptor depend on the intended use of the adaptor. Adaptors useful in the present invention generally include, but are not limited to, recognition and/or cleavage sites for restriction endonucleases (particularly type IIs recognition sites, as described below, which allow endonucleases to bind to recognition sites located inside the adaptor but cleave outside the adaptor), primer binding sites (for amplifying nucleic acid constructs), anchor primer (sometimes also referred to herein as "anchor probe") binding sites (for sequencing target nucleic acids in nucleic acid constructs), nickase sites, and the like. In some embodiments, the adaptor comprises a single recognition site for a restriction endonuclease, while in other embodiments, the adaptor comprises two or more recognition sites for one or more restriction endonucleases. As outlined herein, a recognition site is often (but not necessarily) present at the end of an adaptor, so that cleavage of the double-stranded construct occurs at the farthest possible position from the end of the adaptor.
In some embodiments, the adaptor does not include a recognition site for any restriction endonuclease.
In some embodiments, adaptors of the invention are about 10 to about 250 nucleotides in length, depending on the number and size of features included in the adaptor. In certain embodiments, the adaptors of the present invention are about 50 nucleotides in length. In other embodiments, adaptors used in the present invention are from about 20 to about 225, from about 30 to about 200, from about 40 to about 175, from about 50 to about 150, from about 60 to about 125, from about 70 to about 100, and from about 80 to about 90 nucleotides in length.
In other embodiments, the adaptor optionally comprises elements that can be attached as two "arms" to the target nucleic acid. One or both of these arms may comprise the complete recognition site for a restriction endonuclease, or both arms may comprise a partial recognition site for a restriction endonuclease. In the latter case, the construct containing the target nucleic acid will have adapter arms attached to each end, the circularization of which will constitute the complete recognition site.
In still other embodiments, adaptors useful in the present invention comprise different anchor molecule binding sites at their 5' and 3' ends. As further described herein, such anchor molecule binding sites may be used for sequencing applications, including the combinatorial probe-anchor ligation (cPAL) sequencing methods described herein and in U.S. patent applications 60/992,485, 61/026,337, 61/035,914, 61/061,134, 61/116,193, 61/102,586, 12/265,593 and 12/266,385, 11/938,106, 11/938,096, 11/982,467, 11/981,804, 11/981,797, 11/981,793, 11/981,767, 11/981,761, 11/981,730, 11/981,685, 11/981,661, 11/981,607, 11/981,605, 11/927,388, 11/927,356, 11/679,124, 11/541,225, 10/547,214 and 11/451,691, all of which are incorporated herein by reference in their entirety, in particular for the disclosure relating to sequencing by ligation.
In one aspect, the adaptors of the present invention are dispersed adaptors. "Dispersed adaptors" (also referred to herein as "discrete adaptors") as used herein means oligonucleotides inserted at spaced-apart positions internal to a target nucleic acid. In one aspect, "internal" with respect to a target nucleic acid means a site within the target nucleic acid prior to processing, such as circularization and cleavage, that may introduce sequence inversions or similar transformations that disrupt the order of nucleotides in the target nucleic acid.
II.D.1. Overview of template construction procedure
The nucleic acid template constructs of the invention contain a plurality of discrete adaptors inserted into a target nucleic acid and in a particular orientation. As discussed further herein, a target nucleic acid is generated from nucleic acid isolated from one or more cells, including from 1 to millions of cells. These nucleic acids are then fragmented using mechanical or enzymatic methods. In particular embodiments, nucleic acid fragments generated using the CoRE method described herein are used to generate the nucleic acid template constructs of the invention.
The target nucleic acid that becomes part of the nucleic acid template construct of the present invention may have discrete adaptors inserted at predetermined positions, at intervals, within a continuous region of the target nucleic acid. The intervals may be the same or different. In some aspects, the spacing between discrete adaptors may be known only to within one to a few nucleotides. In other aspects, the spacing of the adaptors is known, and the orientation of each adaptor relative to the other adaptors in the library construct is known. That is, in many embodiments, the adaptors are inserted at known distances so that the target sequence at one end is contiguous, in the native genomic sequence, with the target sequence at the other end. For example, for a type II restriction endonuclease that cleaves 16 bases from the recognition site, if 3 of those bases are located within the adaptor, the endonuclease cleaves 13 bases from the end of the adaptor. When a second adaptor is inserted, the target sequence "upstream" of the adaptor and the target sequence "downstream" of the adaptor are actually contiguous sequences in the original target sequence.
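The arithmetic in the example above (16 bases of reach, 3 of them inside the adaptor, 13 in the target) can be written out as a short sketch. The function name and its parameters are illustrative only and do not appear in the specification.

```python
def cut_distance_from_adaptor_end(reach: int, bases_inside_adaptor: int) -> int:
    """Distance (in bases) from the end of the adaptor to the cleavage point.

    reach: number of bases between the recognition site and the cleavage point.
    bases_inside_adaptor: how many of those bases still lie within the adaptor.
    Both names are hypothetical labels for the quantities in the text.
    """
    if not 0 <= bases_inside_adaptor <= reach:
        raise ValueError("bases_inside_adaptor must be between 0 and reach")
    return reach - bases_inside_adaptor


# The worked example from the text: 16-base reach, 3 bases inside the adaptor.
print(cut_distance_from_adaptor_end(16, 3))
```

Because the distance is fixed by the enzyme and by the placement of the recognition site in the adaptor, the position of the next adaptor relative to the first follows directly from these two numbers.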
The invention provides a nucleic acid template comprising a target nucleic acid comprising a plurality of discrete adaptors. In yet another embodiment, nucleic acid templates formed from a plurality of genomic fragments can be used to generate a library of nucleic acid templates. Such libraries of nucleic acid templates in certain embodiments encompass target nucleic acids that together can cover all or part of the entire genome. That is, by using a sufficient number of starting genomes (e.g., cells), in combination with random fragmentation, the resulting target nucleic acid of a particular size used to generate the circular templates of the invention can effectively cover the genome, although it will be appreciated that in rare cases, biases may be introduced that prevent the entire genome from being represented.
The nucleic acid template constructs of the invention comprise a plurality of discrete adaptors, which in certain aspects comprise one or more recognition sites for a restriction endonuclease. In another aspect, the adaptor comprises recognition sites for nicking endonucleases, type I endonucleases, type II endonucleases and/or type III endonucleases such as EcoP1 and EcoP15. In another aspect, the adaptor comprises a recognition site for a type II endonuclease. Type II and type III endonucleases are generally commercially available and are well known in the art. Such endonucleases recognize specific nucleotide base pair sequences in a double-stranded polynucleotide sequence. When the sequence is recognized, the type II endonuclease will cleave the polynucleotide sequence, typically leaving an overhang, or "sticky end," of one strand of the sequence. Type II and III endonucleases typically cleave outside their recognition site, at a distance that may be between about 2 and 30 nucleotides from the recognition site, depending on the particular endonuclease. Certain type II endonucleases are "precision cutters" that cut a known number of bases away from the recognition site. In certain embodiments, the type II endonuclease used is not a "precision cutter," but cuts within a specific range (e.g., 6 to 8 nucleotides). Typically, the cleavage site of a type II restriction endonuclease used in the present invention is separated from its recognition site by at least 6 nucleotides (i.e., the number of nucleotides between the end of the recognition site and the nearest cleavage point). Exemplary type II restriction endonucleases include, but are not limited to, Eco57M I, Mme I, Acu I, Bpm I, BceA I, Bbv I, BciV I, BpuE I, BseM II, BseR I, Bsg I, BsmF I, BtgZ I, Eci I, EcoP15I, Fok I, Hga I, Hph I, Mbo II, Mnl I, SfaN I, TspDT I, TspDW I, Taq II, and the like.
In certain exemplary embodiments, the restriction endonuclease used in the present invention is the type II endonuclease AcuI, which has a cut length of approximately 16 bases and generates a 2-base 3' overhang, or the type III endonuclease EcoP15, which has a cut length of about 25 bases and generates a 2-base 5' overhang. As will be discussed further below, the inclusion of the above type II and type III sites in the adaptors in the nucleic acid template constructs of the invention provides a means for facilitating the insertion of multiple adaptors at defined locations in a target nucleic acid.
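The relationship between cut length and overhang type for the two enzymes named above can be illustrated with a small model. This is a sketch under a stated assumption: the quoted cut length is taken to be the top-strand cut, so that a 3' overhang places the bottom-strand cut closer to the recognition site and a 5' overhang places it further away. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutsideCutter:
    """Toy description of an endonuclease that cuts outside its recognition site."""
    name: str
    top_reach: int     # bases from recognition site to top-strand cut (assumed convention)
    overhang: int      # length of the single-stranded overhang, in bases
    overhang_end: str  # "3'" or "5'"

    def bottom_reach(self) -> int:
        """Bases from the recognition site to the bottom-strand cut."""
        if self.overhang_end == "3'":
            return self.top_reach - self.overhang
        if self.overhang_end == "5'":
            return self.top_reach + self.overhang
        raise ValueError("overhang_end must be \"3'\" or \"5'\"")


# Approximate values as quoted in the text above.
ACUI = OutsideCutter("AcuI", top_reach=16, overhang=2, overhang_end="3'")
ECOP15 = OutsideCutter("EcoP15", top_reach=25, overhang=2, overhang_end="5'")

for enzyme in (ACUI, ECOP15):
    print(enzyme.name, enzyme.top_reach, enzyme.bottom_reach())
```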
It will be appreciated that the adaptor may also comprise other elements, including recognition sites for other (non-type II) restriction endonucleases, primer binding sites for amplification, and binding sites for probes used in sequencing reactions ("anchor probes"), as further described herein. In addition, adaptors used in the present invention may contain palindromic sequences, which may be used to facilitate intramolecular binding if a nucleic acid template containing such adaptors is used to generate concatemers, as discussed in more detail below.
The ability to control the spacing and direction of insertion of each subsequent adaptor has a number of advantages over randomly inserting discrete adaptors. In particular, the methods described herein improve the efficiency of the adaptor insertion process, thus reducing the need to introduce an amplification step in the insertion of each subsequent adaptor. In addition, controlling the spacing and orientation of each added adaptor ensures that the restriction endonuclease recognition site normally contained in each adaptor is oriented such that subsequent cleavage and ligation steps occur at the appropriate site in the nucleic acid construct, thereby further increasing the efficiency of the process by reducing or eliminating the formation of nucleic acid templates containing adaptors in an inappropriate position or orientation. In addition, controlling the position and orientation of each subsequently added adaptor is beneficial for certain uses of the resulting nucleic acid construct, as the adaptors serve multiple functions in sequencing applications, including serving as reference points for sequence knowledge, thereby facilitating confirmation of the relative spatial position of bases identified at a particular site in a target nucleic acid. Such uses of adaptors in sequencing applications are further described herein.
The 5' and 3' ends of the double-stranded fragment are optionally adjusted as described above. For example, many techniques for fractionating nucleic acids produce fragment ends that vary in length and chemistry. For example, the ends may contain overhangs, whereas for many purposes blunt-ended double-stranded fragments are preferred. Blunt ends can be produced using known techniques, such as treatment with a polymerase and dNTPs. Similarly, fractionation techniques may also yield various termini, such as 3' and 5' hydroxyl groups and/or 3' and 5' phosphate groups. In certain embodiments, it may be desirable to enzymatically alter these termini, as described below. For example, to prevent ligation of multiple fragments that do not contain adaptors, it may be desirable to alter the chemistry of the termini so that the correct phosphate and hydroxyl group orientation is not present, thereby preventing "polymerization" of the target sequence. The chemistry of the termini can be controlled using methods known in the art. For example, in some cases, all phosphate groups are removed using a phosphatase so that all the termini contain hydroxyl groups. Each end can then be selectively altered to allow the desired components to be linked together.
In addition, amplification is optionally performed as needed using a number of known techniques to increase the number of genomic fragments for subsequent manipulation, although in many embodiments, no amplification step is required at this stage.
In some embodiments, if amplification is used to increase the number of fragments before or after any step of constructing a nucleic acid template, the amplification is an MDA reaction that uses one or more additives described above to reduce bias that may otherwise result from amplification.
After fractionation and optional end adjustment, a set of adaptor "arms" is added to the ends of the genomic fragments. Two adaptor arms, when joined together, form a first adaptor. For example, as depicted in FIG. 6, circularization of a linear construct with one adaptor arm at each end (605) joins the two arms together to form a complete adaptor (606) and a circular construct (607). Thus, a first adaptor arm (603) of the first adaptor is added to one end of the genomic fragment and a second adaptor arm (604) of the first adaptor is added to the other end of the genomic fragment. Generally, one or both of the adaptor arms will contain a recognition site for a type II endonuclease, depending on the system desired, as described more fully below. Alternatively, the adaptor arms may each contain a partial recognition site, so that the complete recognition site is reconstituted when the arms are ligated.
In order to ligate subsequent adaptors in a desired position and orientation for sequencing, the present invention provides methods in which a type II restriction endonuclease binds to a recognition site within a first adaptor of a circular nucleic acid construct and then cleaves at a point in the genomic fragment (also referred to herein as a "target nucleic acid") that is outside of the first adaptor. A second adaptor is then attached at this point where the cleavage occurs (again typically by adding two adaptor arms to the second adaptor). In order to cleave the target nucleic acid at a known site, it may be desirable to block any other recognition site of the same enzyme that may be randomly contained in the target nucleic acid, so that the only site to which the restriction endonuclease can bind is within the first adaptor, thus avoiding unwanted cleavage of the construct. In general, the recognition site in the first adaptor is first protected from inactivation, and then any other unprotected recognition site in the construct is inactivated, typically by methylation. That is, the methylated recognition site will not bind to the enzyme and therefore no cleavage will occur. Only the unmethylated recognition sites in the adaptors are able to bind to the enzyme and subsequent cleavage occurs.
One way to protect the recognition site in the first adaptor from inactivation is to make the site single-stranded, since the methylase does not bind to single-stranded DNA. Thus, one method of protecting the recognition site in the first adaptor is to amplify a linear genomic fragment ligated to both first adaptor arms by using uracil-modified primers. The primers are complementary to the adaptor arms and are uracil modified, so that during amplification (typically by PCR) the resulting linear construct contains uracil embedded in a recognition site of the first adaptor arm. Degradation of the uracil using known techniques renders the first adaptor arm (or any uracil-containing fragment) single-stranded. The linear construct is then treated with a sequence-specific methylase that methylates all double-stranded recognition sites for the same endonuclease as the site contained in the first adaptor. This sequence-specific methylase is unable to methylate the single-stranded recognition site in the first adaptor arm, and therefore the recognition site in the first adaptor arm is protected from inactivation by methylation. As described below, if a restriction site is methylated, it will not be cleaved by a restriction endonuclease.
As will be more fully described below, in some cases a single adaptor may contain two identical recognition sites to enable cleavage "upstream" and "downstream" from the same adaptor. In this embodiment, as illustrated in FIG. 7, the primer and uracil positions are suitably selected so that the "upstream" or "downstream" recognition site can be selectively protected from inactivation or inactivated. For example, in FIG. 7, two different adapter arms (shown as rectangles) each contain a recognition site for a restriction endonuclease (shown as circles in one adapter arm and triangles in the other). If it is desired to protect the adaptor arm with the recognition site represented by a circle using the uracil degradation method described above, uracil modified amplification primers are designed to introduce uracil into the recognition site. Upon uracil degradation, the adaptor arm then becomes single stranded (shown as a half rectangle), thereby protecting the recognition site from inactivation.
After protecting the recognition site in the first adaptor arm from methylation, the linear construct is circularized by using, for example, a bridge oligonucleotide and T4 ligase. Circularization allows the restriction endonuclease recognition site in the first adaptor arm to become double-stranded again. In certain embodiments, the bridge oligonucleotide has a blocked end, so that circularization by the bridge oligonucleotide results in ligation of the unblocked end and leaves a nick near the recognition site. This nick may be further utilized as discussed below. The restriction endonuclease is then used to generate a second linear construct comprising a first adaptor located inside the target nucleic acid and ends comprising (depending on the enzyme) a two-base overhang.
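The protect, methylate, circularize and cleave logic described above can be summarized as a toy state model. This is only a sketch of the bookkeeping, not a simulation of the chemistry; all class and function names are hypothetical, and the two "chance" sites in the target are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class RecognitionSite:
    where: str                    # label for where the site sits in the construct
    double_stranded: bool = True
    methylated: bool = False


def degrade_uracil(site: RecognitionSite) -> None:
    """Uracil degradation leaves the uracil-containing site single-stranded."""
    site.double_stranded = False


def methylate(sites: list[RecognitionSite]) -> None:
    """The methylase acts only on double-stranded recognition sites."""
    for site in sites:
        if site.double_stranded:
            site.methylated = True


def circularize(sites: list[RecognitionSite]) -> None:
    """Circularization restores the double-stranded form of every site."""
    for site in sites:
        site.double_stranded = True


def cleavable(sites: list[RecognitionSite]) -> list[str]:
    """Only double-stranded, unmethylated sites can be bound and cut."""
    return [s.where for s in sites if s.double_stranded and not s.methylated]


sites = [
    RecognitionSite("first adaptor"),
    RecognitionSite("target, chance site 1"),  # hypothetical site in the target
    RecognitionSite("target, chance site 2"),  # hypothetical site in the target
]
degrade_uracil(sites[0])  # protect the adaptor site
methylate(sites)          # inactivate every unprotected site
circularize(sites)        # adaptor site becomes double-stranded again
print(cleavable(sites))
```

In this model the only site left available to the endonuclease is the one in the first adaptor, which is the outcome the procedure is designed to achieve.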
The second set of adaptor arms, which form the second adaptor, is ligated to the second linear construct. In some cases, when nicking is used, to ensure that the adaptors are ligated in the proper orientation, the nick in the first adaptor is "translated" (or "displaced") by using a polymerase having exonuclease activity. The exonuclease activity of a polymerase (such as Taq polymerase) will excise the short DNA strand adjacent to the nick, while the polymerase activity will "fill in" the nick and subsequent nucleotides on that strand (essentially, Taq moves along the strand, excising a base using its exonuclease activity and adding the same base using its polymerase activity, with the result that the nick is displaced along the strand until the enzyme reaches the end of the strand).
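The nick displacement described above can be sketched as a simple loop: at each step the base just downstream of the nick is excised and replaced by the base templated by the opposite strand, so the sequence is unchanged while the nick moves one position. This is a minimal illustration with hypothetical names; the template is written so that `template[i]` pairs with `nicked_strand[i]`.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def nick_translate(nicked_strand: str, template: str, nick: int, max_steps=None):
    """Move a nick along a strand, one base per step.

    nick is the index of the first base downstream of the nick.
    Returns the (unchanged) strand sequence and the final nick position.
    """
    if len(nicked_strand) != len(template):
        raise ValueError("strands must have the same length")
    bases = list(nicked_strand)
    position = nick
    steps = 0
    while position < len(bases) and (max_steps is None or steps < max_steps):
        # Exonuclease removes the base; polymerase adds the templated base.
        bases[position] = COMPLEMENT[template[position]]
        position += 1
        steps += 1
    return "".join(bases), position


strand = "ACGTAC"    # hypothetical nicked strand
template = "TGCATG"  # its complement, position by position
print(nick_translate(strand, template, nick=2))
print(nick_translate(strand, template, nick=2, max_steps=3))
```

Without a step limit the nick runs to the end of the strand, matching the description that displacement continues until the enzyme reaches the strand end.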
In addition, one end of the construct is modified with one base in order to create asymmetry in the template. For example, some polymerases, such as Taq, will perform a non-templated nucleotide addition, adding one nucleotide to the 3' end of a blunt DNA duplex and thereby producing a 3' overhang. One skilled in the art understands that any base can be added depending on the concentration of dNTPs in solution. In a particular embodiment, the polymerase used is capable of adding only a single nucleotide. For example, Taq polymerase can add a single G or A. Other polymerases may also be used to add other nucleotides to create overhangs. In one embodiment, an excess of dGTP is used, resulting in the non-templated addition of a guanine at the 3' end of one strand. This "G tail" at the 3' end of the second linear construct creates an end asymmetry and therefore enables ligation with a second adaptor arm bearing a C-tail, such that the second adaptor arm anneals to the 3' end of the second linear construct. The adaptor intended to be ligated to the 5' end carries a C-tail positioned such that it can be ligated to the 5' G-tail. Following ligation of the second adaptor arms, the construct is circularized to produce a second circular construct comprising two adaptors. The second adaptor typically contains a recognition site for a type II endonuclease, which may be the same as or different from the recognition site contained in the first adaptor, which is useful in a number of applications.
A third adaptor may be inserted on the other side of the first adaptor by cleavage with a restriction endonuclease that binds to a recognition site in the second arm of the first adaptor (i.e. a recognition site that was initially inactivated by methylation). To make this recognition site available, a third linear construct is generated by amplifying the circular construct with a uracil-modified primer complementary to the recognition site in the first adaptor, so that the third linear construct contains uracil embedded in the second restriction recognition site. Degradation of uracil renders the first adaptor single-stranded, thus protecting the recognition site in the adaptor from methylation. All unprotected recognition sites are then inactivated with a sequence-specific methylase. Once the construct is circularized, the recognition site in the first adaptor is reformed, and application of the restriction endonuclease will cleave the circle, creating a site in the third linear construct into which the third adaptor can be inserted. Ligation of the third adaptor arms to the third linear construct follows the same basic procedure as described above: the third linear construct will bear an A- or G-tail and the third adaptor arms a T- or C-tail, enabling the adaptor arms to anneal to the third linear construct and be ligated. The linear construct comprising the third adaptor arms is then circularized to form a third circular construct. As with the second adaptor, the third adaptor typically comprises a different restriction endonuclease recognition site than the recognition site contained in the first adaptor.
The fourth adaptor may be added by using type II restriction endonucleases with recognition sites in the second and third adaptors. Cleavage with these restriction endonucleases will produce a fourth linear construct, which is then ligated to the fourth adaptor arms. Circularization of the fourth linear construct ligated to the fourth adaptor arms will result in the nucleic acid template construct of the invention. Other adaptors may also be added, as will be appreciated by those skilled in the art. Thus, the methods described herein allow two or more adaptors to be added in an orientation-dependent, and sometimes distance-dependent, manner.
The present invention also provides a method to control the direction of insertion of each subsequently added adaptor. Such "nick translation" methods provide a means of controlling the manner in which a target nucleic acid is ligated to an adaptor. These methods also prevent the formation of spurious nucleic acid constructs by preventing ligation of adaptors to other adaptors and of target nucleic acid molecules to other target nucleic acid molecules (i.e., by avoiding "polymerization" of adaptors and of target nucleic acid molecules, respectively). FIG. 8 illustrates examples of the different orientations in which an adaptor and a target nucleic acid molecule can be ligated. The target nucleic acids 801 and 802 are preferably ligated to adaptors 803 and 804 in a desired orientation (as shown in the figure, the desired orientation is that in which ends of the same shape, a circle or a square, are ligated to each other). Modifying the ends of the molecules avoids undesirable conformations 807, 808, 809, and 810, in which the target nucleic acids are ligated to each other and adaptors are ligated to each other. In addition, as will be discussed in more detail below, the direction of each adaptor-target nucleic acid ligation can be controlled by controlling the chemistry of the termini of the adaptor and the target nucleic acid. The chemistry of the termini can be controlled using methods known in the art. For example, in some cases, a phosphatase is used to remove all phosphate groups so that all termini contain hydroxyl groups. Each end can then be selectively modified to allow ligation between the desired components. These and other methods of end modification and of controlling adaptor insertion in the nick translation method of the present invention are described in more detail below.
These nucleic acid template constructs ("monomers" comprising target sequences interspersed with these adaptors) can then be used to generate concatemers, which in turn can form nucleic acid nanoballs for downstream applications such as sequencing and detection of specific target sequences.
The invention provides methods of forming a nucleic acid template construct, wherein the template construct comprises a plurality of interspersed adaptors inserted into a target nucleic acid. As discussed further herein, the methods of the invention allow for the insertion of each subsequent adaptor by utilizing the recognition site of a type II restriction endonuclease contained in the adaptor. In order to insert multiple adaptors in a desired order and/or orientation, it may be necessary to block the restriction endonuclease recognition sites contained in the target nucleic acid so that only the recognition sites in the adaptors are available for enzyme binding and subsequent cleavage. One advantage of such methods is that the same restriction endonuclease site can be used in each adaptor, which simplifies the production of the circular template that is ultimately used to make the concatemer, and adaptor insertion can use the previously inserted adaptor as a "stepping stone" for the next adaptor, with each new adaptor being added by "walking" along the fragment. Controlling the recognition sites available to restriction enzymes also avoids the excision of certain sequences, which would otherwise leave only limited sequence representation (as may be the case if sites within the target nucleic acid are accessible).
II.D.2. addition of first adapter
As a first step in the production of a nucleic acid template of the invention, a first adaptor is ligated to a target nucleic acid. The entire first adaptor may be added to one end, or two portions of the first adaptor, referred to herein as "adaptor arms," may be ligated to the two ends of the target nucleic acid, respectively. The first adaptor arms are designed so that, when joined, they reconstitute a complete first adaptor. As described in detail above, the first adaptor typically comprises one or more type II restriction endonuclease recognition sites. In certain embodiments, the type II restriction endonuclease recognition site is split between the two adaptor arms, such that the site is available for restriction endonuclease binding only after ligation of the two adaptor arms has occurred.
FIG. 6 is a schematic of one aspect of a method of assembling adaptor/target nucleic acid templates (also referred to herein as "target library constructs", "library constructs", and all grammatical equivalents). DNA, such as genomic DNA 601, is isolated and fragmented into target nucleic acids 602 using standard techniques described above. The fragmented target nucleic acid 602 is then repaired such that the 5' and 3' ends of each strand are flush or blunt ends. After this reaction, a single A is added to the 3' end of each strand of the fragmented target nucleic acid using a polymerase without a proofreading function, so that each fragment is "A-tailed". The addition of an A-tail is typically accomplished by using a polymerase (such as Taq polymerase) and providing only adenine nucleotides, such that the polymerase is forced to add one or more A's to the end of the target nucleic acid in a template-sequence-independent manner.
In the exemplary method shown in FIG. 6, a first arm (603) and a second arm (604) of a first adaptor are ligated to each target nucleic acid, resulting in a target nucleic acid with adaptor arms ligated to each end. In one embodiment, the adaptor arms are "T-tailed" and thus complementary to the A-tail of the target nucleic acid, which facilitates ligation by allowing the adaptor arms first to anneal to the target nucleic acid and then to be joined to it using a ligase.
In other embodiments, the invention provides for the ligation of adaptors to each fragment in a manner that minimizes the generation of intra- or intermolecular ligation artifacts. This is beneficial because spurious ligation of random fragments of a target nucleic acid to each other can complicate the sequence alignment process by creating false genomic proximity relationships between target nucleic acid fragments. Using A and T tails to attach adaptors to DNA fragments prevents random intramolecular or intermolecular association of adaptors and fragments, which reduces the artifacts that would result from self-ligation (adaptor-adaptor or fragment-fragment ligation).
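The way complementary single-base tails restrict which molecules can be joined can be illustrated by enumerating the possible pairings. This is a minimal sketch that considers only whether the single-base tails are complementary; the function names and the dictionary of molecule types are hypothetical.

```python
from itertools import combinations_with_replacement

# Single-base tails that can anneal to each other.
COMPLEMENTARY_TAILS = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}


def can_anneal(tail_1: str, tail_2: str) -> bool:
    """True if two single-base 3' tails are complementary."""
    return (tail_1, tail_2) in COMPLEMENTARY_TAILS


def allowed_ligations(tails_by_molecule: dict[str, str]) -> list[tuple[str, str]]:
    """List every pair of molecule types whose tails permit ligation."""
    allowed = []
    for first, second in combinations_with_replacement(sorted(tails_by_molecule), 2):
        if can_anneal(tails_by_molecule[first], tails_by_molecule[second]):
            allowed.append((first, second))
    return allowed


# A-tailed fragments and T-tailed adaptor arms, as in the text.
molecules = {"fragment": "A", "adaptor arm": "T"}
print(allowed_ligations(molecules))
```

In this model fragment-fragment and adaptor-adaptor pairings are excluded because identical tails cannot anneal, leaving only the desired adaptor-fragment junction. The same check applies to G-tailed constructs and C-tailed adaptor arms.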
As an alternative to A/T tailing (or G/C tailing), various other methods can be used to prevent the formation of ligation artifacts of the target nucleic acid and adaptor, as well as to orient the adaptor arms relative to the target nucleic acid, including using complementary NN overhangs in the target nucleic acid and adaptor arms, or using blunt-end ligation with an appropriate target nucleic acid to adaptor ratio to optimize the single-fragment nucleic acid/adaptor arm ligation ratio.
After generating a linear construct comprising the target nucleic acid and adaptor arms at each end, the linear target nucleic acid is circularized (605) (this process is discussed in more detail herein) to generate a circular construct 607 comprising the target nucleic acid and the adaptor. Note that the circularization process results in the first and second arms of the first adaptor being brought together to form a continuous first adaptor in the circular construct (606). In certain embodiments, circular construct 607 is amplified, such as by circle-dependent amplification, using, for example, random hexamers and Φ29 polymerase or helicase. Alternatively, the target nucleic acid/adaptor structure can remain linear, with amplification performed by PCR primed from sites within the adaptor arms. Amplification is preferably a controlled amplification process using a high-fidelity, proofreading polymerase to produce a sequence-accurate library of amplified target nucleic acid/adaptor constructs in which the genome, or the one or more portions of the genome being interrogated, is sufficiently represented.
II.D.3. addition of multiple adaptors
As discussed above, FIG. 6 is a schematic of one aspect of a method of assembling an adaptor/target nucleic acid template (also referred to herein as a "target library construct", "library construct", and all grammatical equivalents). DNA, such as genomic DNA 601, is isolated and fragmented into target nucleic acids 602 using standard techniques. The fragmented target nucleic acid 602 is then repaired in certain embodiments (as described herein) such that the 5' and 3' ends of each strand are flush or blunt ends.
In the exemplary method shown in FIG. 6, a first arm (603) and a second arm (604) of a first adaptor are ligated to each target nucleic acid, resulting in a target nucleic acid with adaptor arms ligated to each end.
After generating a linear construct comprising the target nucleic acid and adaptor arms at each end, the linear target nucleic acid is circularized (605) (this process is discussed in more detail herein) to generate a circular construct 607 comprising the target nucleic acid and the adaptor. Note that the circularization process results in the first and second arms of the first adaptor being brought together to form a continuous first adaptor in the circular construct (606). In certain embodiments, circular construct 607 is amplified, such as by circle-dependent amplification, using, for example, random hexamers and Φ29 polymerase or helicase. Alternatively, the target nucleic acid/adaptor structure can remain linear, with amplification performed by PCR primed from sites within the adaptor arms. Amplification is preferably a controlled amplification process using a high-fidelity, proofreading polymerase to produce a sequence-accurate library of amplified target nucleic acid/adaptor constructs in which the genome, or the one or more portions of the genome being interrogated, is sufficiently represented.
Similar to the process of adding the first adaptor, a second set of adaptor arms (610) and (611) can be added to each end of the linear molecule (609), and then ligated (612) to form the complete adaptor (614) and circular molecule (613). Similarly, a third adaptor can be added to the other side of the adaptor (609) by using a type II endonuclease that cleaves on the other side of the adaptor (609), and then a third set of adaptor arms (617) and (618) can be ligated to each end of the linearized molecule. Finally, a fourth adaptor is added by cutting the circular construct again and adding a fourth set of adaptor arms to the linearized molecules. The embodiment depicted in FIG. 6 is a method in which the circular construct is cleaved using type II endonucleases whose recognition sites are in adaptors (620) and (614). The recognition sites in the adaptors (620) and (614) may be the same or different. Similarly, the recognition sites in all of the adaptors shown in FIG. 6 can be the same or different.
As shown generally in FIG. 9, the circular construct comprising the first adaptor may contain two type II restriction endonuclease recognition sites in the adaptor that are positioned such that the target nucleic acid outside the recognition sequence (outside the adaptor) is cleaved (910). Arrows around structure 910 indicate recognition sites and restriction sites. In process 911, a type II restriction endonuclease EcoP15 is used to cleave the circular construct. Note that in the aspect shown in FIG. 9, the portion of each library construct that is mapped to a portion of the target nucleic acid will be cleaved from the construct (the portion of the target nucleic acid between the arrows in structure 910). Restriction cleavage of the library construct with EcoP15 in process 911 generates a library of linear constructs containing a first adaptor located within the ends of linear construct 912. The size of the resulting linear library construct is determined by the distance between the endonuclease recognition site and the endonuclease restriction site plus the size of the adaptor. In process 913, linear construct 912, like fragmented target nucleic acid 904, is treated by conventional methods to produce blunt or flush ends, an A-tail comprising a single A is added to the 3' end of the linear library construct using a polymerase with no proofreading activity, and the first and second arms of the second adaptor are ligated to the ends of the linearized library construct 913 by A-T annealing and ligation. The resulting library construct comprises a structure as seen at 914, where the first adaptor is located within the ends of the linear construct, and the target nucleic acid is flanked on one end by the first adaptor and on the other end by the first or second arm of the second adaptor.
In process 915, the double-stranded linear library constructs are processed into single strands 916, and the single-stranded library constructs 916 are ligated (917) to form a single-stranded circle of the target nucleic acid 918 interspersed with two adaptors. The ligation/circularization process 917 is performed under conditions that optimize intramolecular ligation. At particular concentrations and reaction conditions, local intramolecular ligation of the ends of each nucleic acid construct is favored over ligation between molecules.
II.D.4. control of the ligation orientation between target nucleic acid and adaptor
One aspect of the invention provides methods in which ligation of an adaptor to a target nucleic acid is performed in a desired orientation, as described above. This directional control is beneficial because the false ligation artifacts of random fragments of the target nucleic acid create spurious genomic proximity relationships between target nucleic acid fragments, complicating the sequence alignment process.
There are several methods that can be used to control the orientation in which an adaptor is inserted. As described above, the chemistry of the ends of the target nucleic acid and of the adaptor can be altered so that ligation occurs only in the correct orientation. Alternatively, a "nick translation method" may be performed, which also relies on the chemistry of the ends, as outlined below. Finally, methods involving amplification with specifically selected primers can be employed, as described below.
FIG. 12 illustrates the different orientations in which a second adaptor may be added to a nucleic acid construct. Again, process 1200 begins with a circular library construct 1202 containing an inserted first adaptor 1210. The first adaptor 1210 has a particular orientation, with triangles marking the outer strand of the first adaptor ("outer strand") and diamonds marking the inner strand of the first adaptor ("inner strand") (Ad1 orientation 1210). The tail of arrow 1201 indicates the type II restriction endonuclease site in the first adaptor 1210, and the head of the arrow indicates the cleavage site. Process 1203 involves cleaving with a type II restriction endonuclease, ligating the first and second arms of the second adaptor, and recircularizing. As can be seen from the resulting library constructs 1204 and 1206, the second adaptor can be inserted in two different ways with respect to the first adaptor. In the desired orientation 1204, the oval is inserted into the outer strand with the triangle and the bow tie is inserted into the inner strand with the diamond (Ad2 orientation 1220). In the undesired orientation, the oval is inserted into the inner strand with the diamond and the bow tie is inserted into the outer strand with the triangle (Ad2 orientation 1230).
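The two possible orientations of a second adaptor relative to a first adaptor can be checked on sequence data with a sketch like the following. It treats the construct as a linearized string and simply asks whether the second adaptor reads forward or as its reverse complement on the strand where the first adaptor reads forward. The adaptor and construct sequences are made-up placeholders, and which outcome counts as "desired" depends on the adaptor design, so the labels here are only "same" and "inverted".

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def adaptor_orientation(construct: str, ad1: str, ad2: str) -> str:
    """Classify Ad2 relative to Ad1 as 'same', 'inverted', or 'ambiguous'."""
    # Normalize to the strand on which Ad1 reads in the forward direction.
    if ad1 in construct:
        strand = construct
    elif reverse_complement(ad1) in construct:
        strand = reverse_complement(construct)
    else:
        raise ValueError("first adaptor not found in construct")
    forward = ad2 in strand
    inverted = reverse_complement(ad2) in strand
    if forward and not inverted:
        return "same"
    if inverted and not forward:
        return "inverted"
    return "ambiguous"  # Ad2 absent, or present in both orientations


if __name__ == "__main__":
    ad1, ad2 = "GATTACA", "CCGGAAT"  # hypothetical adaptor sequences
    construct = "TTTT" + ad1 + "ACGTACGT" + ad2 + "TTTT"
    print(adaptor_orientation(construct, ad1, ad2))
```

For a truly circular construct, an adaptor that spans the arbitrary linearization point would be missed; concatenating the string with itself before searching is one simple way to handle that case.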
Although for clarity the following discussion and reference is made primarily to the insertion of a second adaptor relative to a first adaptor, it will be understood that the process discussed below applies to adaptors added after the second adaptor, which will result in library constructs with three, four, five, six, seven, eight, nine, ten or more inserted adaptors.
In one embodiment, A-tailing and T-tailing are used to attach adaptors to nucleic acid fragments. For example, the ends of the fragments are repaired according to the modifications described above, and each fragment is "A-tailed" by adding a single A to the 3' end of each strand of the fragmented target nucleic acid using a polymerase that does not have proofreading activity. A-tailing typically utilizes a polymerase (such as Taq polymerase) and provides only adenine nucleotides (or an excess of adenine nucleotides), so that the polymerase is forced to add one or more A's to the end of the target nucleic acid in a template-sequence-independent manner. In embodiments employing A-tailing, ligation to the adaptor (or adaptor arm) is achieved by adding a "T-tail" to the adaptor or adaptor arm that complements the A-tail of the target nucleic acid. This facilitates ligation by allowing the adaptor arm first to anneal to the target nucleic acid and then to be joined to it by a ligase.
Because aspects of the invention are optimized when the nucleic acid template is of a desired size and comprises a target nucleic acid derived from a single fragment, it is beneficial to ensure that the circularization reactions that produce the nucleic acid template occur intramolecularly throughout the process. That is, it is beneficial to ensure that target nucleic acids do not ligate to one another during ligation of the first, second, third, etc. adaptors. FIG. 10 illustrates one embodiment of controlling the circularization process. As shown in FIG. 10, blocking oligonucleotides 1017 and 1027 are used to block binding regions 1012 and 1022, respectively. Blocking oligonucleotide 1017 is complementary to binding sequence 1016 and blocking oligonucleotide 1027 is complementary to binding sequence 1026. In the schematic representation of the 5' adaptor arm and the 3' adaptor arm, the underlined bases are dideoxycytosine (ddC) and the bold bases are phosphorylated. Blocking oligonucleotides 1017 and 1027 are not covalently bound to the adaptor arms and can be "melted off" after ligation of the adaptor arms to the library construct and prior to circularization; in addition, the dideoxynucleotide (here ddC, or alternatively another non-ligatable nucleotide) prevents ligation of the blocking molecule to the adaptor. Additionally or alternatively, in certain aspects, the blocking oligonucleotide-adaptor arm hybrid contains a gap of one or more bases between the adaptor arm and the blocking molecule to reduce the likelihood of ligation of the blocking molecule to the adaptor. In certain aspects, the Tm of the blocking molecule/binding region hybrid is approximately 37 ℃, so that the blocking sequence is easily melted off before the adaptor arms are ligated to each other (circularization).
II.D.5. Controlling the orientation of ligation: arm-to-arm ligation
In one aspect, directional insertion of adaptors can be controlled using an "arm-to-arm" ligation method without modifying the ends of the target nucleic acid. In general, this is a two-step ligation process in which an adaptor arm is added to the target nucleic acid, primer extension with strand displacement produces two double-stranded molecules, each with an adaptor arm at one end, and a second adaptor arm can then be added to the end lacking an adaptor arm. This process can prevent the production of nucleic acid molecules with identical adaptor arms at both ends; for example, as shown in FIG. 11A, the arm-to-arm ligation process can prevent the formation of nucleic acid molecules in which both ends are occupied by adaptor A or both by adaptor B. In many embodiments, it is preferred that each end of the target nucleic acid is ligated to a different adaptor arm, such that when the two arms are ligated together they form a complete adaptor. This is particularly useful in reducing the number of amplification steps required after addition of each adaptor arm, since arm-to-arm ligation reduces the number of unwanted molecules produced in each ligation reaction.
FIG. 11 shows one embodiment of the arm-to-arm ligation method. In this embodiment, one strand of the first adaptor arm A is added to both strands of a dephosphorylated target nucleic acid. This adaptor arm is blocked at one end (shown as a closed circle), typically by treatment with alkaline phosphatase. Primer exchange can be used to replace the strand with the blocked end. Primer extension with strand displacement (which may be achieved in an exemplary embodiment by using phi29 or Pfu polymerase) proceeds from both ends across the entire insert, resulting in two double-stranded nucleic acid molecules, each with an adaptor arm A at one end and a blunt end at the other. In an alternative embodiment, adaptor arm A may be pre-hybridized to a primer upstream of the blocked strand to initiate primer extension without the need for a primer exchange reaction. After the strand displacement polymerase reaction, a second adaptor arm B is ligated, generally to the blunt end of the target nucleic acid rather than to the end already carrying an adaptor arm. This arm-to-arm ligation process can prevent the formation of target nucleic acids that contain identical adaptor arms at both ends.
II.D.6. Controlling the orientation of ligation: nick translation method
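The requirement that the blocking oligonucleotide melt off near 37 ℃ can be screened roughly with the Wallace rule, Tm ≈ 2(A+T) + 4(G+C) ℃, which is a coarse rule of thumb for short oligonucleotides and does not account for salt or strand concentration. The sketch below is only an illustration of that estimate; the example sequence is a made-up placeholder, not a blocking oligonucleotide from the specification.

```python
def wallace_tm(oligo: str) -> int:
    """Rough melting temperature in degrees C by the Wallace rule.

    Tm = 2 * (A + T) + 4 * (G + C). Intended only for short oligos;
    nearest-neighbor models are more accurate.
    """
    seq = oligo.upper()
    if set(seq) - set("ACGT"):
        raise ValueError("sequence must contain only A, C, G, T")
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


if __name__ == "__main__":
    # Hypothetical 12-mer; prints an estimate close to the 37 degree target.
    print(wallace_tm("ACGTACGTACGT"))
```

An estimate like this would only narrow the candidates; the actual melting behavior would still need to be confirmed under the ligation buffer conditions in use.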
In one embodiment, the invention provides a "nick translation method" for constructing a nucleic acid molecule. In one embodiment, the nick translation method is used to ligate nucleic acid molecules in a desired orientation. In another embodiment, a nick translation method is used to insert the adaptor in a desired orientation. These methods typically involve modification of one or both ends of one or both of the nucleic acid molecules to be ligated. For example, when an adaptor is ligated to a target nucleic acid, one or both ends of one or both of the target nucleic acid and the adaptor to be ligated are modified. Following such modifications, "displacement" or "translation" of the nick inserted into one strand of the construct provides the ability to control the final orientation of the ligated adaptor-target nucleic acid construct. As described in more detail below, the "nick translation methods" described herein may also include primer extension or gap filling methods. Although the following discussion is in terms of controlling ligation of an adaptor to a target nucleic acid, it will be appreciated that the methods are not limited to ligation of an adaptor to a target nucleic acid, and that the methods can also be used to control ligation of any two nucleic acid molecules. For example, nick translation methods and any other methods of controlling ligation described herein can be part of genetic and/or DNA engineering methods, such as the construction of new plasmids or other DNA vectors, genetic or genomic synthesis or modification, and components for the construction of nanotechnology constructs.
FIG. 13 illustrates this "nick translation" type of process. Construct 1306 in FIG. 13 is formed using the methods discussed herein, and contains an interspersed adaptor 1304, a restriction endonuclease recognition site (tail of the arrow in FIG. 13), and a cleavage site. In FIG. 14, the library constructs are not circularized, but rather are branched concatemers of alternating target nucleic acid fragments 1406 (containing restriction endonuclease recognition sites 1404) and adaptors 1412; however, the nick translation type of process shown in FIG. 13 can also be performed on such library construct configurations. The term "library construct" as used herein refers to a nucleic acid construct comprising one or more adaptors, and is interchangeable with the term "nucleic acid template".
The library construct with the inserted first adaptor is digested with a restriction endonuclease (process 1301), in some aspects a type II restriction endonuclease, that cleaves the target nucleic acid to generate a 3' nucleotide overhang 1308. In FIG. 13, a two-nucleotide overhang (NN-3') 1308 is shown, although in different aspects the number of overhanging nucleotides will vary depending at least in part on the restriction endonuclease used. Library construct 1310 is linearized, with the first inserted adaptor shown at 1304. The first inserted adaptor 1304 is engineered either to contain a nick 1312 at the boundary of the adaptor segment, or to contain a recognition site for a nicking endonuclease so that a nick 1314 can be introduced inside the adaptor. In both cases, the library construct is treated (1303) with a polymerase 1316, which is capable of extending the top strand of library construct 1310 from nick 1312 or 1314 to the end of the bottom strand, to form a construct with a 3' overhang at one end and a blunt end at the other end. In process 1305, this library construct 1310 is ligated to a second adaptor 1318, which has a degenerate nucleotide overhang at one end and a single 3' nucleotide (e.g., dT) overhang at the other end, to form library construct 1320. Library construct 1320 is then treated in process 1307 to add a 3' dA to its blunt end. The resulting library construct 1322 may then be amplified by PCR using, for example, primers containing uracil. Alternatively, library construct 1322 can be circularized in process 1309, in which case CDA can be performed (such as in step 1421 in FIG. 14). The process discussed herein in conjunction with the nick translation process shown in FIG. 13 allows selection of the relative position and relative orientation of subsequently added adaptors with respect to any previously inserted adaptors of the library construct.
In order to utilize nick translation type procedures, it may be beneficial to modify one or both ends of the target nucleic acid and/or the adaptor as discussed above. In an exemplary embodiment, the first arm of the adaptor that is intended to be ligated to the 3' end of the target nucleic acid may be designed such that its 3' end is blocked, so that only the 5' end of the adaptor arm is available for ligation to the 3' end of the target nucleic acid. Similarly, the second arm intended to be ligated to the 5' end of the target nucleic acid may be designed such that its 5' end is blocked, so that only the 3' end of the second arm can be ligated to the 5' end of the target nucleic acid. Methods of blocking one end of an adaptor arm and/or a target nucleic acid are known in the art. For example, a target nucleic acid (also referred to herein as a "nucleic acid insert", "DNA insert", or "insert") is treated with enzymes that produce specific functional ends and remove the phosphates from the 3' and 5' ends, as discussed above. All phosphate groups are removed so that the target nucleic acid molecules cannot be ligated to each other. The adaptor in this embodiment is also designed such that one strand can be ligated (e.g., by creating or retaining a 5' phosphate group) while the 3' end of the complementary strand is protected from ligation. Generally, protection of the 3' end is achieved by inactivating the 3' end with a dideoxynucleotide. Thus, when the modified target nucleic acid has no phosphate groups at either end, and the modified adaptor carries a phosphate group on one 5' end while the complementary strand is 3' blocked (e.g., with a dideoxynucleotide), the only ligation product that can form is the target nucleic acid ligated to the 5' end of the adaptor that carries the phosphate group. Following this ligation step, the protected 3' end of the adaptor may be replaced with a strand containing a functional 3' end.
This exchange is usually achieved by taking advantage of the fact that the 3' protected strand is generally short and easily denatured. The replacement strand with a functional 3' end is longer and therefore binds the complementary strand more efficiently; in other embodiments, the strand with the functional end is also added at a higher concentration, further driving the reaction toward replacement of the protected strand by the strand with the functional end. The strand with the functional 3' end is then used as a primer by adding a DNA polymerase with nick translation activity, which exonucleolytically removes bases from the 5' end of the target nucleic acid, thereby exposing a functional 5' phosphate. This newly generated 5' phosphate can be ligated to the extension product by a ligase. (If no ligase is present during the extension reaction, two polymerase molecules will nick translate from each end of the target nucleic acid until they meet, resulting in a broken molecule.) For example, as shown in FIG. 2, the target nucleic acid (insert) is first end-repaired to form defined functional ends, preferably blunt ends. Then, to avoid the formation of concatemers from the insert, the 5' terminal phosphates are removed. The insert is then mixed with a DNA ligase and a DNA adaptor. The DNA adaptor contains two oligonucleotides which, when hybridized together, form a blunt end and a sticky end. The blunt-ended side contains a "top strand" with a protected/inactivated 3' end and a "bottom strand" with a functional 5' terminal phosphate, and the adaptor therefore cannot ligate to itself. The only possible ligation product is thus one insert with one "bottom strand" at each blunt end. The "top strand" with the protected 3' end is then exchanged for an oligonucleotide containing a functional 3' end, which can serve as a primer in a polymerase extension reaction.
Upon addition of polymerase and ligase, a second oligonucleotide can be inserted by nick translation and ligation. When the polymerase extends into the insert, it introduces a nick with a functional 5' phosphate, which can be recognized and sealed by DNA ligase. The resulting inserts with adaptors or adaptor arms at each end of each strand can then be subjected to PCR using adaptor-specific primers.
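The end-chemistry rules described above (a fully dephosphorylated insert, and an adaptor whose blunt side carries one 5' phosphate and one blocked 3' end) can be checked with a toy model. The sketch assumes only the general rule that a ligase joins a 5' phosphate to an adjacent free 3' hydroxyl; the class and function names are illustrative and do not come from the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BluntEnd:
    """One blunt duplex end: it exposes one 5' terminus and one 3' terminus."""
    name: str
    five_prime_phosphate: bool  # 5' terminus carries a phosphate
    three_prime_hydroxyl: bool  # 3' terminus is a free OH (False if dideoxy-blocked)


def sealable_junctions(a: BluntEnd, b: BluntEnd) -> int:
    """Count strand junctions (0, 1, or 2) a ligase could seal when a abuts b."""
    first = a.five_prime_phosphate and b.three_prime_hydroxyl
    second = b.five_prime_phosphate and a.three_prime_hydroxyl
    return int(first) + int(second)


# Phosphatase-treated insert: no 5' phosphate, free 3' OH.
INSERT = BluntEnd("insert", five_prime_phosphate=False, three_prime_hydroxyl=True)
# Adaptor blunt side: bottom strand 5' phosphate, top strand 3' end blocked.
ADAPTOR = BluntEnd("adaptor", five_prime_phosphate=True, three_prime_hydroxyl=False)

if __name__ == "__main__":
    for x, y in [(INSERT, INSERT), (ADAPTOR, ADAPTOR), (INSERT, ADAPTOR)]:
        print(x.name, y.name, sealable_junctions(x, y))
```

In this model only the insert-adaptor pairing yields a sealable junction, and only on one strand; the unsealed junction on the other strand corresponds to the nick from which the polymerase later extends.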
Typically in nick translation reactions such as those described above, active ligase is present or is added to the mixture prior to or simultaneously with the addition of polymerase. In certain embodiments, it may be beneficial to use low activity polymerase (slow nick translation) conditions. Both the addition of ligase prior to or simultaneously with the polymerase and the low activity conditions help to ensure that the translating nick is sealed before it reaches the opposite end of the DNA fragment. In certain embodiments, this may be achieved by incubating Taq polymerase and T4 ligase at 37 ℃ (a temperature that typically results in low polymerase activity and high ligase activity). The reaction may then be continued at a higher temperature (e.g., 50-60 ℃) to ensure that most or all of the constructs in the reaction complete nick translation ligation.
In other embodiments, the invention provides methods of forming a nucleic acid template construct comprising a plurality of interspersed adaptors. The methods of the invention include methods of inserting a plurality of adaptors such that each subsequent adaptor is inserted at a defined position relative to the previously added adaptors. Certain methods of inserting a plurality of interspersed adaptors are known in the art, for example as discussed in U.S. patent applications 60/992,485, 61/026,337, 61/035,914, 61/061,134, 61/116,193, 61/102,586, 12/265,593, 12/266,385, 11/679,124, 11/981,761, 11/981,661, 11/981,605, 11/981,793 and 11/981,804, which are incorporated herein by reference in their entirety for all purposes, particularly in relation to methods and compositions for generating nucleic acid templates comprising a plurality of interspersed adaptors, and for all teachings of methods of use of such nucleic acid templates. Inserting known adaptor sequences into the target sequence, such that a contiguous target sequence is interrupted by a plurality of interspersed adaptors, provides the ability to sequence both "upstream" and "downstream" of each adaptor, thus increasing the amount of sequence information that can be generated from each nucleic acid template. The present invention provides further methods of inserting each subsequent adaptor at a defined position relative to one or more previously added adaptors.
Nick translation ligation is typically performed by adding at least a polymerase to the reaction after the first strand is ligated. In certain embodiments, the nick translation reaction may be performed as a one-step reaction by adding all of the components at once, while in other embodiments the reaction steps are performed sequentially. There are many possible embodiments of the "one-step" nick translation reaction. For example, a single mixture containing the primers can be used, with Taq added at the beginning of the reaction. The use of a thermostable ligase provides the ability to perform primer exchange and nick translation ligation (and PCR, if desired) simply by raising the temperature. In another exemplary embodiment, the reaction mixture contains a minimal concentration of a non-processive nick-translating polymerase and a weak 3' exonuclease capable of activating the 3' blocked strand.
In other embodiments, T4 polynucleotide kinase (PNK) or alkaline phosphatase is used to alter the 3' ends of the adaptor and/or the target nucleic acid in preparation for nick translation. For example, an adaptor may be inserted as part of the circularization reaction. The end-repaired and alkaline phosphatase-treated target nucleic acid is ligated to adaptors, which in this exemplary embodiment are designed to form self-complementary hairpin units (FIG. 16). The hairpins are designed to contain modifications at a given position that can be recognized and cleaved by enzymes or chemicals. For example, if the hairpin contains deoxyuridine, it can be recognized and cleaved by UDG/EndoVIII. After cleavage, the two hairpins become single-stranded, each with a phosphate at its 3' end. These 3' phosphates can then be removed by T4 polynucleotide kinase (PNK) or shrimp alkaline phosphatase (SAP) to enable nick translation as further described herein. In an exemplary embodiment, such as the embodiment illustrated in FIG. 4A, the two hairpins are designed to be partially complementary to each other, and thus can form a circular molecule by intramolecular hybridization. Finally, the circularized molecule undergoes a nick translation process in which the polymerase extends into the insert, introducing a nick with a functional 5' terminal phosphate which can be recognized and sealed by a DNA ligase.
In addition to utilizing hairpin structures as described above, circularization can also be performed using a pair of double-stranded adaptors that are partially complementary to each other. One adaptor of the pair contains deoxyuridine on one strand, which can be recognized and cleaved by UDG/EndoVIII. Other methods of nicking a strand may also be used, including, but not limited to, nicking enzymes, incorporation of inosine-modified DNA that can be recognized by an endonucleolytic enzyme, and incorporation of RNA modifications into the DNA that can be recognized by an RNA endonuclease. The target nucleic acid and the adaptor can be prepared for controlled ligation as described above, for example by treating the target nucleic acid with alkaline phosphatase to produce blunt ends that cannot be ligated to other target nucleic acids. Circularization is activated by denaturing the short 3' protected strand of the adaptor from the strand that is ligated to the target nucleic acid, leaving two partially complementary single-stranded ends, one at each end of the target nucleic acid insert. These ends are then joined by intramolecular hybridization, followed by nick translation and ligation, to form a covalently closed circle. The circles are then treated with UDG/EndoVIII to prepare them for directional insertion of the next adaptor.
FIG. 15 shows yet another embodiment, in which a linear target nucleic acid is treated with shrimp alkaline phosphatase (SAP) to remove the 5' phosphates. The target nucleic acid is then ligated to one arm of an adaptor (arm A), which comprises one strand with a 5' phosphate and a shorter complementary strand with a protected 3' end. The ligation product is then subjected to nick translation. The nick generated in the circularization reaction is located on the top strand of the first adaptor and can serve as a primer for the polymerase in the nick translation reaction. The polymerase extends the top strand to the nick at the adaptor-insert junction, releasing one of the adaptor A arms and creating a blunt end or an A or G overhang. The resulting polymerase-generated insert end is then ligated to the second adaptor arm (arm B). By designing the first adaptor to generate a nick in the circularization reaction, subsequent adaptors can be added in a predetermined orientation. This strategy can be applied to all type II restriction enzymes and to other enzymatic or non-enzymatic fragmentation methods, whether they produce products with blunt ends, 3' overhangs, or 5' overhangs. Subsequent primer exchange, extension, ligation, and PCR are similar to those described for FIG. 2. Non-amplification methods can also be used to close the circle, including melting off the blocking oligonucleotide and then circularizing the DNA via nick translation ligation.
Both polymerases with proofreading activity (having 3'-5' exonuclease activity, such as Pfu polymerase) and polymerases without proofreading activity (lacking 3'-5' exonuclease activity, such as Taq polymerase) can be used for the nick translation and strand synthesis processes, including strand displacement, described herein. Polymerases with proofreading activity can efficiently generate blunt ends during nick translation, but have the disadvantage of also degrading unprotected 3' overhangs. The resulting nick translation product will have two blunt ends and therefore cannot be ligated to a subsequent adaptor in a defined orientation. One solution is to protect the 3' end of the ligated adaptor (e.g., arm A in FIG. 15) from degradation, for example with a dideoxynucleotide introduced at the 3' end using dideoxyribonucleoside triphosphates (ddNTPs). However, ddNTP protection also prevents subsequent extension from the 3' end, which limits how the adaptor can be carried forward in a direct circularization procedure. Another potential solution is to protect the 3' end from degradation by the polymerase using a modification on the 3' end (e.g., a 3' phosphate) that can be removed prior to nick translation circularization (e.g., using alkaline phosphatase). Another approach is to use hairpin-shaped adaptors in combination with polymerases having proofreading activity in nick translation reactions. These adaptors are protected from degradation, but have the disadvantage of requiring an additional UDG/EndoVIII step. Furthermore, the inventors found that Pfu polymerase, a polymerase having proofreading activity, can efficiently produce blunt ends without degrading unprotected 3' overhangs, indicating that it has a lower 3'-5' exonuclease activity.
Polymerases without proofreading activity, such as Taq polymerase, can generate either blunt ends or single-base overhangs during nick translation (Taq can generate template-independent A- and G-tails in addition to blunt ends). An advantage of using a polymerase that lacks 3'-5' exonuclease activity for nick translation is that unprotected 3' overhangs remain intact. Subsequent adaptors can therefore be ligated in a defined orientation without protecting the 3' overhang from degradation. A potential drawback is that many polymerases without proofreading activity add a single nucleotide to the 3' end in a template-independent process. This process is difficult to control and often produces a mixed population of 3' ends, resulting in low adaptor-to-insert ligation yields. Generally, methods using blunt-end ligation are more efficient than single-base overhang ligation.
In one embodiment, rather than forming a circle and then cleaving with a type II endonuclease having a recognition site in the first adaptor (which is a step in certain embodiments of the invention for generating a nucleic acid template, such as the embodiments illustrated in FIGS. 6 and 9), a variation of the nick translation method is used to add the second adaptor after ligation of the first adaptor. An illustrative embodiment of this variation is shown in FIG. 17. Generally, as described in detail above and shown in FIGS. 6 and 9, these embodiments begin by adding a first adaptor to a target nucleic acid, followed by circularization. In the embodiment shown in FIG. 17A, nick translation is performed using a polymerase having 5'-3' exonuclease activity (such as Taq polymerase), resulting in an inverted circle with the first adaptor located in the interior of the target nucleic acid. This product can then be end-repaired and ligated to adaptor 2 (using the methods described in detail above). One disadvantage of this embodiment is that the target nucleic acid may be longer than required for sequencing, and such long templates may readily form secondary structures in any nucleic acid concatemers generated from the template (concatemers generated from the nucleic acid templates of the invention are discussed in more detail below). One way to overcome this drawback is to shorten the target nucleic acid; an exemplary embodiment of this method is depicted in FIG. 17B. In this embodiment, the first adaptor is modified with uracil using the methods described herein. Following nick translation inversion of the circle containing the first adaptor, adaptor arm C is added to both ends of the end-repaired molecule. The uracil-modified adaptor 1 is treated to remove the uracil, creating a gap, and is further treated to create an activated 3' end.
Typically, uracil is removed by using a UDG/EndoVIII enzyme mixture, and the 3 'phosphate is removed with PNK and/or alkaline phosphatase to produce an activated 3' terminus. The activated 3 'end of adapter 1 and the 3' end of adapter arm C are recognized by nick-translating polymerase (i.e., a polymerase having 5 '-3' exonuclease activity) to produce a product in which adapter 1 is surrounded by target nucleic acid that has been trimmed to about half its original length. If the adaptor 1 is modified by other nicking modifications (including but not limited to the introduction of inosine, RNA modifications, etc.), this polymerase cleavage procedure can be repeated to further reduce the size of the target nucleic acid.
In other embodiments, as shown in FIG. 17C, the nick translation methods shown in FIGS. 17A and 17B may be extended to insert multiple adaptors. By modifying the adaptors, nicks or gaps and functional 3' ends can be formed that prime nick translation reactions from multiple adaptors simultaneously. As shown in FIG. 17C, a nucleic acid construct comprising a target nucleic acid and two adaptors, each having a uracil modification on one strand, is circularized. The circle is then treated with an enzyme mixture such as UDG/EndoVIII to remove the uracils and introduce gaps. These gaps can be nick translated simultaneously to invert the circle so that the construct can be ligated to additional adaptors. By adding multiple modifications to the same adaptor, successive rounds of nicking/gapping and nick translation inversion can be performed to introduce multiple adaptors. In certain embodiments, uracil can be added back at the same location in the adaptor, making the adaptor suitable for further nick translation reactions. Uracil can be added back by, for example, incubating the nick translation reaction with uracil alone to reconstitute the adaptor, and then adding higher concentrations of unmodified nucleotides to fill in the rest of the construct.
In yet other embodiments, shown in FIG. 17D, target nucleic acids can be shortened by controlling the speed of the nick-translating enzyme. For example, the nick-translating enzyme can be slowed by changing the temperature or limiting the reagents, which can result in two nicks that have been moved by nick translation from their original sites in the adaptor into the circularized insert. Similarly, the use of a strand-displacing polymerase (such as phi29) will cause the nicks to be moved, creating branch points where a segment of nucleic acid has been displaced. These nicks or branch points can be recognized by a variety of enzymes (including, but not limited to, S1 endonuclease, Bal31, T7 endonuclease, mung bean endonuclease, and combinations of enzymes, such as a 5'-3' exonuclease, for example T7 exonuclease, together with S1 or mung bean endonuclease) that cleave the strand opposite the nick, producing a linear product. The product can then be end-repaired (if necessary) and ligated to the next adaptor. The size of the remaining target nucleic acid is controlled by the rate of the nick translation reaction, again by, for example, reducing the concentration of reagents (such as dNTPs) or performing the reaction at a less than optimal temperature. The size of the target nucleic acid can also be controlled by the incubation time of the nick translation reaction.
In other embodiments, the nucleic acid template may be formed using a nick translation method without passing through a circularization step. An exemplary embodiment of such a method is shown in FIG. 18, which shows ligation of a hairpin-shaped first adaptor 1801 to a target nucleic acid 1802 using the ligation methods described above, such as by treating the target nucleic acid with shrimp alkaline phosphatase to remove the phosphate groups and thereby control which ends of the target nucleic acid are available for ligation to the first adaptor. Following ligation of the first adaptor, a controlled double-strand-specific 5'-3' exonuclease reaction is performed to generate a single-stranded 3' end. In certain embodiments, the exonuclease reaction is carried out using T7 exonuclease, although it will be appreciated that other double-strand-specific exonucleases may be used in these embodiments of the invention. In other embodiments, the exonuclease reaction produces a single-stranded 3' end of about 100 to about 3000 bases in length. In still other embodiments, the exonuclease reaction produces single-stranded 3' ends of about 150 to about 2500, about 200 to about 2000, about 250 to about 1500, about 300 to about 1000, about 350 to about 900, about 400 to about 800, about 450 to about 700, or about 500 to about 600 bases in length.
It is understood that the nick translation process described herein can be used in conjunction with any of the other methods of adding adaptors described herein. For example, the arm-to-arm ligation process described above and illustrated in FIG. 11A can be used in conjunction with the nick translation process to prepare constructs for PCR amplification.
In other embodiments, the adaptor arm A used in the arm-to-arm ligation reaction can be designed to circularize directly without PCR, with the circle then sealed by nick translation ligation. In an exemplary embodiment, for direct circularization, adaptor arm A can be designed as depicted in FIG. 11B. Segment 1101 is designed to be complementary to adaptor arm B. The construct in FIG. 11B can be primer extended directly by a strand displacing polymerase (such as phi29) without the need for a primer exchange reaction to remove the blocked ends (the polymerase does not extend across the 3' phosphate on segment 1102). This construct also provides a 3' overhang for circularization. Segment 1102 prevents adaptor arm A from hybridizing to adaptor arm B prior to circularization. In certain embodiments, segment 1102 may not be needed to prevent hybridization to arm B (such as when adaptor arm B is at a very high concentration), or segment 1102 may be part of the design of adaptor arm B rather than adaptor arm A.
After the single-stranded 3' end is generated, the second adaptor 1803 is hybridized to the single-stranded 3' end of the target nucleic acid and ligated to the first adaptor by nick translation ligation (in one embodiment, the nick translation ligation is a "primer extension" or "gap filling" reaction). The second adaptor is provided with a 5' phosphate and a 3' block (identified as vertical line 1804). In certain embodiments, the 3' block may be a removable block, such as a 3' phosphate, which in certain exemplary embodiments may be removed using polynucleotide kinase (PNK) and/or shrimp alkaline phosphatase. The second adaptor in certain embodiments carries degenerate bases at the 3' and/or 5' end. In certain exemplary embodiments, the second adaptor has about 2-6 degenerate bases at the 5' end and 4-9 degenerate bases at the 3' end, but it is understood that the invention encompasses second adaptors with any number of combinations of degenerate bases at one or both ends. In the illustrated embodiment of FIG. 18, the second adaptor comprises 3 degenerate bases at the 5' end ("N3") and 7 degenerate bases at the 3' end ("N7"). In certain embodiments, ligation of a first adaptor to a second adaptor can be achieved under reaction conditions that favor hybridization of the adaptors to the target nucleic acid. In certain exemplary embodiments, such reaction conditions may include a temperature of from about 20 to about 40 °C. Polymerases that can be used under such reaction conditions include, but are not limited to, phi29, Klenow, T4 polymerase, and PolI.
The ligation product 1805 is then denatured and/or further treated with 5'-3' exonuclease, followed by a re-annealing step to form two single-stranded nucleic acid molecules (indicated as "x2" in FIG. 18). During re-annealing, the N7 portion of the second adaptor can hybridize to a segment at a random distance from the first hybridizing sequence motif, forming a single-stranded loop 1806. In certain embodiments, the N7 end of the second adaptor may not hybridize until denatured to produce a long single-stranded nucleic acid region 1807. The average distance between two captured genome segments (which are typically about 20 to about 200 bases in length) is in many embodiments between about 0.5 and about 20 kb. This average distance depends in part on the number of degenerate bases ("Ns") in the adaptors and the stringency of the hybridization conditions. The re-annealing step may then be followed by another round of adaptor hybridization and nick translation ligation. The final adaptor (in FIG. 18, this final adaptor is shown as a third adaptor 1808, but it will be appreciated that the final adaptor may be a fourth, fifth, sixth, seventh or later adaptor inserted according to any of the methods described herein) is similar to the second adaptor, but in many embodiments lacks degenerate bases at the 3' end. In other embodiments, the final adaptor may comprise a binding site for an amplification reaction primer, such as a PCR primer.
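The dependence of capture spacing on the number of degenerate bases can be illustrated with a short calculation. The sketch below is only a toy model and not part of the described method: it assumes random sequence with equiprobable bases and perfect-match hybridization on one strand, and the function names are hypothetical. Relaxed stringency (tolerating mismatches) would shorten the spacing relative to this estimate.

```python
def degenerate_variants(n_degenerate: int) -> int:
    """Number of distinct sequences covered by n fully degenerate (N) positions."""
    return 4 ** n_degenerate


def expected_match_spacing(n_degenerate: int) -> float:
    """Mean spacing, in bases, between perfect matches of one fixed n-mer.

    Toy assumption: random sequence, equiprobable bases, one strand only,
    so a match starts at any given position with probability 4**-n.
    """
    return float(4 ** n_degenerate)


if __name__ == "__main__":
    # N3 and N7 are the degenerate stretches named for the FIG. 18 embodiment.
    for label, n in (("N3", 3), ("N7", 7)):
        print(label, degenerate_variants(n), expected_match_spacing(n))
```

Under these assumptions a fixed 7-mer recurs roughly every 16 kb, which falls inside the stated range of about 0.5 to about 20 kb.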
In other embodiments, an amplification reaction, such as a PCR reaction (see 1809 in fig. 18), may be performed by utilizing primer binding sites contained in the first and last adaptors. In still other embodiments, the first and last adaptors may be two arms of the same adaptor, and more than one adaptor may be inserted before the last adaptor is added. In still other embodiments, the amplification product may be used to form a circular double-stranded nucleic acid molecule for further insertion of adaptors using any of the procedures described herein or known in the art.
II.D.7. Controlled insertion of subsequent adaptors: protection of restriction endonuclease recognition sites
In addition to controlling the direction of an adaptor inserted into a target nucleic acid as described above, a plurality of adaptors may be inserted into a target nucleic acid at specific positions relative to previously inserted adaptors. Such methods include embodiments in which certain restriction endonuclease recognition sites, particularly recognition sites contained in previously inserted adaptors, are protected from inactivation. In order to ligate subsequent adaptors in a desired position and orientation, the invention provides a method in which a type II restriction endonuclease binds to a recognition site within a first adaptor in a circular nucleic acid construct and then cleaves at a point within a genomic fragment (also referred to herein as a "target nucleic acid") outside the first adaptor. A second adaptor can then be ligated (again typically by adding the two adaptor arms of the second adaptor) at the point where the cleavage occurs. In order to cleave the target nucleic acid at a known point, it is necessary to block any other recognition sites of the same enzyme that may be randomly contained in the target nucleic acid, so that the only site to which the restriction endonuclease can bind is within the first adaptor, thereby avoiding unwanted cleavage of the construct. Typically, the recognition site in the first adaptor is first protected from inactivation, and then any other unprotected recognition site in the construct is inactivated, typically by methylation. "inactivation" of a restriction endonuclease recognition site in this context means that the recognition site is rendered incapable of being bound by the restriction endonuclease in a manner that prevents downstream cleavage steps of the enzyme. For example, methylated recognition sites cannot bind to restriction endonucleases and therefore no cleavage occurs. 
Once all the unprotected recognition sites in the nucleic acid construct are methylated, only unmethylated recognition sites within the adaptor allow for enzyme binding and subsequent cleavage. Other methods of inactivating the recognition site include, but are not limited to, the use of methylase blockers for the recognition site, blocking the recognition site with blocking oligonucleotides, blocking the recognition site with other blocking molecules such as zinc finger proteins, and nicking the recognition site to prevent methylation. Such methods of protecting the desired recognition site are described in U.S. patent application 12/265,593 filed on 5.11.2008 and 12/266,385 filed on 6.11.2008, both of which are incorporated herein by reference in their entirety for all purposes, particularly with respect to the insertion of multiple loosely distributed adaptors into a target nucleic acid.
It will be appreciated that the above-described method for controlling the direction in which an adaptor is attached to a target nucleic acid may also be used in combination with the below-described method for controlling the spacing of each subsequently added adaptor.
One aspect of the invention provides a method of protecting a recognition site in a first adaptor from inactivation by rendering the recognition site in the first adaptor single-stranded, such that a methylase that is only capable of methylating double-stranded molecules is not capable of methylating the protected recognition site. One method of rendering the recognition site in the first adaptor single-stranded is to amplify a linear genomic fragment ligated to both first adaptor arms using uracil modified primers. The primers are complementary to the adaptor arms and are modified with uracil so that upon amplification (typically by PCR), the resulting linear construct contains uracil embedded in a recognition site of one of the adaptor arms. The primers produce a PCR product in which uracil is adjacent to a type II restriction endonuclease recognition site in the first and/or second arm of the first adaptor. Digestion of the uracil renders the region of the adaptor arm that includes the type II recognition site single-stranded and thereby protected. The linear construct is then treated with a sequence-specific methylase that methylates all double-stranded recognition sites for the same endonuclease as the site contained in the first adaptor. This sequence-specific methylase is unable to methylate the single-stranded recognition site in the first adaptor arm, and therefore the recognition site within the first adaptor arm is protected from inactivation by methylation.
In some cases, as described more fully below, a single adaptor may have two identical recognition sites, which may allow for "upstream" and "downstream" cleavage from the same adaptor. In this embodiment, as illustrated in fig. 7, the primer and uracil positions are suitably selected such that the "upstream" or "downstream" recognition site is selectively protected from inactivation or is inactivated.
A third adaptor may be inserted on the other side of the first adaptor by cleavage with a restriction endonuclease that binds to a recognition site in the second arm of the first adaptor (i.e., the recognition site that was initially inactivated by methylation). To make this recognition site available, the circular construct is amplified using a uracil modified primer (which is complementary to the second recognition site in the first adaptor) to produce a third linear construct, in which the first adaptor comprises uracil embedded in the second restriction recognition site. Degrading the uracil makes the recognition site in the first adaptor single-stranded, thereby protecting it from methylation. All unprotected recognition sites are then inactivated using a sequence-specific methylase. Upon circularization, the recognition site in the first adaptor is reconstituted, and the circle is cut using the restriction endonuclease to create a location where a third adaptor can be inserted into the third linear construct. Ligation of the third adaptor arms to the third linear construct follows the same general procedure described above: the third linear construct is A- or G-tailed and the third adaptor arms are T- or C-tailed, so that the adaptor arms anneal to the third linear construct and are ligated. The linear construct comprising the third adaptor arms is then circularized to form a third circular construct. Like the second adaptor, the third adaptor typically comprises a recognition site for a restriction endonuclease that differs from the recognition site contained in the first adaptor.
A fourth adaptor may be added using a type II restriction endonuclease in which the second and third adaptors contain their recognition sites. Cleavage with these restriction endonucleases generates a fourth linear construct, which is then ligated to a fourth adaptor arm. Circularization of the fourth linear construct ligated to the fourth adaptor arm will result in the nucleic acid template construct of the invention.
In general, the methods of the invention provide a means of specifically protecting the type II endonuclease recognition site from inactivation such that once all other unprotected recognition sites in the construct are inactivated, the addition of a type II endonuclease results in binding to the protected site, thereby allowing control over where in the construct subsequent cleavage occurs. The above described method provides one embodiment of how to protect the desired recognition site from inactivation. It is understood that the above-described methods may be modified using techniques known in the art, and that such modified methods are also encompassed by the present invention.
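The protect-then-inactivate logic can be sketched as a small simulation. This is an illustrative model only, under loudly stated assumptions: the motif `ACGTA` is a made-up placeholder rather than the recognition sequence of any particular enzyme, only one strand of a linear string is scanned, and the names `Site`, `find_sites`, `methylate_unprotected`, and `cleavable` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Site:
    position: int            # start index of the recognition motif
    in_adaptor: bool         # True if the motif lies wholly inside the adaptor
    methylated: bool = False


def find_sites(construct: str, motif: str, adaptor_start: int, adaptor_end: int) -> list:
    """Locate every occurrence of the motif and flag those inside the adaptor.

    adaptor_end is exclusive, as in Python slicing.
    """
    sites = []
    start = construct.find(motif)
    while start != -1:
        inside = adaptor_start <= start and start + len(motif) <= adaptor_end
        sites.append(Site(start, inside))
        start = construct.find(motif, start + 1)
    return sites


def methylate_unprotected(sites: list) -> list:
    """Inactivate every site that is not protected (here: not in the adaptor)."""
    for site in sites:
        if not site.in_adaptor:
            site.methylated = True
    return sites


def cleavable(sites: list) -> list:
    """Positions the endonuclease can still bind (unmethylated sites)."""
    return [site.position for site in sites if not site.methylated]


if __name__ == "__main__":
    MOTIF = "ACGTA"                      # placeholder motif, not a real enzyme site
    target_left, adaptor, target_right = "TTACGTATT", "CCACGTACC", "GGGG"
    construct = target_left + adaptor + target_right
    a_start = len(target_left)
    a_end = a_start + len(adaptor)
    sites = find_sites(construct, MOTIF, a_start, a_end)
    print(cleavable(sites))              # both sites bindable before methylation
    print(cleavable(methylate_unprotected(sites)))  # only the adaptor site remains
```

In this toy construct the motif occurs once in the target and once in the adaptor; after the unprotected site is methylated, the adaptor site is the only one left for binding, which is the control over cleavage position that the text describes.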
In an exemplary embodiment, the recognition site is protected from inactivation by methods combined with the insertion method of each subsequently inserted adaptor. FIG. 19 illustrates an embodiment in which a second adaptor is inserted in a desired position relative to a first adaptor using a process that combines uracil degradation with nickase to methylate and protect against methylation. FIG. 19 shows that the genomic DNA of interest 1902 carries a type II restriction endonuclease recognition site located at 1904. The genomic DNA is fractionated or fragmented in process 1905 to produce fragments 1906 with type II restriction endonuclease recognition sites 1904. In process 1907 adaptor arms 1908 and 1910 are ligated to fragments 1906. In process 1911, fragment 1906 is PCR amplified with first and second adaptor arms 1908 and 1910 (library constructs) using uracil-modified primers 1912 that are complementary to adaptor arms 1908 and 1910. The primers produce a PCR product with uracil adjacent to the recognition site for type II restriction endonucleases. In process 1913, uracil is specifically degraded using, for example, uracil-DNA glycosylase (Krokan, et al, (1997) biochem. J.325:1-16), leaving the PCR product single-stranded in the region of the type II restriction endonuclease recognition site. As has been shown, introduction and degradation of uracil can be used to make type II restriction endonuclease recognition sites single-stranded; however, as further described herein, other methods may be employed, including limited digestion with 3 'or 5' exonucleases to render these regions single-stranded.
In process 1915, each double-stranded type II restriction endonuclease recognition site is nicked using a sequence-specific nicking enzyme to protect the sites from recognition by the type II restriction endonuclease. However, the single-stranded type II restriction endonuclease recognition site portions of the first and second adaptor arms 1908 and 1910 are not cleaved, and once circularized and ligated (1917), the type II restriction endonuclease recognition sites in the first and second adaptor arms are reformed, which can be restriction digested. When a nicking enzyme and a type II restriction endonuclease are selected for this process, it is preferred that the two enzymes recognize the same sequence or that one enzyme recognizes a subsequence (a sequence within a sequence) of the other enzyme. Alternatively, the nickase may recognize a different sequence, but the sequence is located within the adaptor, so that the nickase cuts within the type II restriction endonuclease recognition site. The use of uracil or 3 'or 5' degradation allows the use of a nicking enzyme for the entire procedure. Alternatively, more than one sequence-specific nicking enzyme may be employed. The circularized construct is then cleaved with a type II restriction endonuclease at process 1919, where the type II restriction endonuclease recognition site is indicated at 1922, the construct is cleaved at 1920, the nicks are indicated at 1918, and the resulting linear construct can be used for the second set of adapter arms to ligate to add to the construct at process 1921.
Ligation process 1921 the first (1924) and second (1926) adaptor arms of the second adaptor are added to the linearized construct and a second amplification is performed by PCR in process 1923, again using uracil modified primers 1928 complementary to adaptor arms 1924 and 1926. As above, the primers produce PCR products with uracil adjacent to the recognition site for type II restriction endonucleases. In process 1925, uracil is specifically degraded, leaving the PCR product single-stranded at the type II restriction endonuclease recognition sites in the first and second adaptor arms 1924 and 1926 of the second adaptor. The ligation process 1921 can also repair the nick 1918 in the type II restriction site 1904 in the target nucleic acid fragment 1906. In process 1927, the target nucleic acid fragments (where cleavage 1914 of type II restriction endonuclease recognition sites 1904 occurs) and the bases of the double-stranded type II restriction endonuclease recognition sites in first adaptor 1930 are cleaved, again with a sequence-specific nickase, to protect these sites from recognition by the type II restriction endonuclease.
The nicked construct is then circularized and ligated in process 1929, where the type II restriction endonuclease recognition sites in first and second arms 1924 and 1926 of the second adaptor are reformed (1932), and the process is repeated, and the circularized construct is cleaved again with type II restriction endonuclease in process 1931 to generate another linearized construct to which the first and second adaptors have been added for ligation of a third pair of adaptor arms 1936 and 1938 into the construct. The type II restriction endonuclease recognition site is shown as 1922, the restriction site is shown as 1920, the type II restriction endonuclease recognition site that is cleaved in the target nucleic acid fragment is shown as 1918, and the nick in the first adaptor is shown as 1934. This process can be repeated to add the desired number of adaptors. As shown herein, the first adaptor added contains a type II restriction endonuclease recognition site; however, in other aspects, the first adapter added may contain two type II restriction endonuclease recognition sites to precisely select the desired target nucleic acid size for the construct.
In one aspect, the adaptor may be designed to contain a sequence-specific nickase site that surrounds or partially overlaps with a type II restriction endonuclease recognition site. By using a nickase, the type II restriction endonuclease recognition site in each adaptor can be selectively protected from methylation. In other embodiments, the nicking enzyme may recognize another sequence or site, but nicks at the type II restriction endonuclease recognition site. Nickases are endonucleases that recognize specific recognition sequences in double-stranded DNA and are capable of cleaving one strand at a specific position relative to the recognition sequence, thereby causing a single-strand break in the duplex DNA, and include, but are not limited to, nb. By using a sequence specific nicking enzyme in combination with a type II restriction endonuclease, all type II restriction endonuclease recognition sites in the target nucleic acid as well as those in any previously inserted adaptors can be protected from digestion (provided of course that the type II restriction endonuclease is nick sensitive, i.e., does not bind to recognition sites that have been nicked).
FIG. 20 illustrates an embodiment of the method of the invention in which the desired relative positions of the second adaptor to the first adaptor are selected using methylation and sequence specific nicking enzymes. FIG. 20 shows genomic DNA of interest 2002 with a type II restriction endonuclease recognition site located at 2004. The genomic DNA is fractionated or fragmented in process 2005 to produce fragments 2006 with type II restriction endonuclease recognition sites 2004. Adapter arms 2008 and 2010 are ligated to fragments 2006 in process 2007. The fragments 2006 (library constructs) with adaptor arms 2008 and 2010 are circularized in process 2009 and amplified by loop-dependent amplification in process 2011, resulting in a highly branched concatemer of target nucleic acid fragments 2006 (where the type II restriction endonuclease recognition site is located at 2004) alternating with first adaptors 2012.
In process 2013, a sequence-specific nicking enzyme 2030 is used to nick nucleic acids in or near specific type II restriction endonuclease recognition sites in adaptors in library constructs, thereby preventing methylation of these sites. Here, the type II restriction endonuclease recognition sites in the adaptor arms 2012 and 2014 are cleaved by the sequence-specific nicking enzyme 2030. In process 2015, the type II restriction endonuclease recognition sites in the construct that are not cut are methylated (here, methylation 2016 of type II restriction endonuclease recognition site 2004) to protect these sites from recognition by the type II restriction endonuclease. However, the type II restriction endonuclease recognition sites in adaptors 2012 and 2014 are not methylated due to the presence of the nicks.
In process 2017, the nicks in the library constructs are repaired, and the resulting type II restriction endonuclease recognition sites in the adaptors 2012 in the library constructs can be used to recognize and restrict digestion 2018, whereas the type II restriction endonuclease recognition sites in the genomic fragments 2004 cannot. The methylated construct is then ligated to a second pair of adaptor arms, circularized, and amplified by loop-dependent amplification in process 2021 to give a concatemer of target nucleic acid fragments 2006 (type II restriction endonuclease recognition sites in 2004), first adaptors 2012 and second adaptors 2020, alternating. Then, in process 2023, sequence specific nicking is performed again, this time with a sequence specific nicking enzyme that recognizes a site in the second adaptor 2020, thereby preventing methylation of the type II restriction endonuclease recognition site in the second adaptor 2020, but not contributing to other type II restriction endonuclease recognition sites in the construct (i.e., the type II restriction endonuclease recognition site 2004 in the fragment and the type II restriction endonuclease recognition site in the first adaptor 2012). The process continues with methylation 2015, and if necessary, further adaptor arms can be added. Different sequence-specific nickase sites were used in each different adaptor to allow sequence-specific cleavage throughout the process.
FIG. 21 is a schematic representation of a process in which methylation and sequence specific methylase blockers are used to select the desired relative position of the second adaptor to the first adaptor. FIG. 21 shows genomic DNA of interest (target nucleic acid) 2102 with a type II restriction endonuclease recognition site at 2104. The genomic DNA is fractionated or fragmented in process 2105 to produce fragments 2106 with type II restriction endonuclease recognition sites 2104. The adaptor arms 2108 and 2110 are ligated to the fragments 2106 in process 2107. Fragments 2106 (library constructs) with adaptor arms 2108 and 2110 are circularized in process 2109 and amplified by loop-dependent amplification in process 2111 to give a highly branched concatemer of target nucleic acid fragments 2106 (in which the type II restriction endonuclease recognition site is at 2104) alternating with first adaptors 2112.
In process 2113, sequence-specific methylase blocker 2130 (e.g., zinc finger) is used to block methylation of specific type II restriction endonuclease recognition sites in the library constructs. Here, the type II restriction endonuclease recognition sites in adaptor arms 2112 and 2114 are blocked by methylase blocker 2130. When choosing methylase blockers and type II restriction endonucleases for this process, it is not necessary that the two entities recognize the same site sequence or that one entity recognizes a subsequence of the other entity. The blocker sequence may be upstream or downstream of the type II restriction endonuclease recognition site, but in a configuration in which the methylase blocker blocks the site (such as a zinc finger or other nucleic acid binding protein or other entity). In process 2115, the unprotected type II restriction endonuclease recognition sites in the construct are methylated-here, the methylation 2116 of the type II restriction endonuclease recognition site 2104) -protecting these sites from recognition by the type II restriction endonuclease. However, the type II restriction endonuclease recognition sites in adaptors 2112 and 2114 are not methylated due to the presence of the methylase blocker.
In process 2117, the methylase blocker is released from the library construct, so that the type II restriction endonuclease recognition site in adaptor 2112 of the resulting library construct is available for recognition and restriction digestion 2118, whereas the type II restriction endonuclease recognition site in genomic fragment 2104 is not. The methylated construct is then ligated to the second pair of adaptor arms, circularized, and amplified by loop-dependent amplification in process 2121 to yield a concatemer of target nucleic acid fragments 2106 (with type II restriction endonuclease recognition sites at 2104), first adaptors 2112, and second adaptors 2120. Then, in process 2123, methylase blocking is performed again, this time with a methylase blocker that recognizes a site in the second adaptor 2120 to block methylation of the type II restriction endonuclease recognition site in the second adaptor 2120, but not of the other type II restriction endonuclease recognition sites in the construct (i.e., the type II restriction endonuclease recognition site 2104 in the fragment and the type II restriction endonuclease recognition site in the first adaptor 2112). The process continues with methylation 2115, and if desired, further adaptor arms may be added. Different methylase blocker sites are used in each different adaptor so that sequence specific methylase blocking can be performed throughout the process. Although FIGS. 9 and 21 show insertion of a second adaptor relative to a first adaptor, it will be appreciated that this process may be applied to adaptors added after the second adaptor, resulting in library constructs with up to four, six, eight, ten or more inserted adaptors.
FIG. 22 illustrates a process in which methylation and uracil degradation are used to select a desired relative position of a second adaptor to a first adaptor. FIG. 22 shows genomic DNA of interest 2202 with a type II restriction endonuclease recognition site located at 2204. The genomic DNA is fractionated or fragmented in process 2205 to produce fragments 2206 with type II restriction endonuclease recognition sites 2204. Adapter arms 2208 and 2210 are attached to segment 2206 in process 2207. Fragment 2206 (library construct) with first and second adaptor arms 2208 and 2210 is amplified by PCR in process 2211 using uracil modified primer 2212 complementary to adaptor arms 2208 and 2210. The primers produce PCR products with uracils at or near the recognition site for type II restriction endonucleases. In process 2213, uracil is specifically degraded using, for example, uracil-DNA glycosylase (Krokan, et al, (1997) biochem. J.325:1-16), leaving the PCR product single-stranded in the region of the type II restriction endonuclease recognition site. As has been shown, the introduction and degradation of uracil can be used to make type II restriction endonuclease recognition sites single-stranded; however, as further described herein, other methods may be employed including limited digestion with 3 'or 5' exonucleases to render these regions single-stranded.
In process 2215, the bases in each double-stranded type II restriction endonuclease recognition site (here methylation 2214 of type II restriction endonuclease recognition site 2204) are methylated using a sequence-specific methylase to protect these sites from recognition by the type II restriction endonuclease. However, the single-stranded type II restriction endonuclease recognition sites in the first and second adaptor arms 2208 and 2210 are not methylated, and once circularized and ligated 2217, the type II restriction endonuclease recognition sites reform 2216, and thus the type II restriction endonuclease recognition sites can be restriction digested. However, when choosing methylases and type II restriction endonucleases for this process, both enzymes need to recognize the same sequence or one enzyme recognizes a subsequence (a sequence within a sequence) of the other enzyme. The circularized construct is then cleaved by type II restriction endonuclease cleavage at process 2219, where the type II restriction endonuclease recognition site is shown at 2218 and the construct is cleaved at 2220, resulting in a linearized construct that can be added to the construct in process 2221 for a second set of adapter arm ligations.
Ligation process 2221 first (2222) and second (2224) adaptor arms of a second adaptor are added to the linearized construct and second amplification by PCR is performed in process 2223 again using uracil modified primers 2226 complementary to the adaptor arms 2222 and 2224. As above, the primers produce PCR products with uracil adjacent to the recognition site for type II restriction endonucleases. In process 2225, uracil is specifically degraded, leaving the PCR product single-stranded in the type II restriction endonuclease recognition site regions in the first and second adaptor arms 2222 and 2224 of the second adaptor. In process 2227, the bases of the double-stranded type II restriction endonuclease recognition sites in the target nucleic acid fragment (again, this is methylation 2214 of type II restriction endonuclease recognition site 2204) and the bases of the type II restriction endonuclease recognition sites in the first adaptor 2228 are methylated, again using a sequence-specific methylase, to protect these sites from recognition by the type II restriction endonuclease. The methylated construct is then circularized in process 2229, where the type II restriction endonuclease recognition sites in the first and second arms 2222 and 2224 of the second adaptor are reformed 2230, and this process is repeated, again cleaving the circularized construct with the type II restriction endonuclease in process 2219 to generate another linear construct (which has been added to the first and second adaptors) for ligation of a third pair of adaptor arms to the construct. This process can be repeated to add the desired number of adaptors. As shown herein, the first adaptor added contains a type II restriction endonuclease recognition site; however, in other aspects, the first adapter added may contain two type II restriction endonuclease recognition sites to precisely select the desired target nucleic acid size for the construct.
In addition to the above-described method of controlling the insertion of a plurality of discretely distributed adaptors, constructs comprising adaptors in a particular orientation may be further selected by enriching the population of constructs carrying adaptors in the desired orientation. Such enrichment methods are described in U.S. patent application 60/864,992 (filed 11/09/06), 11/943,703 (filed 11/02/07), 11/943,697 (filed 11/02/07), 11/943,695 (filed 11/02/07), and PCT/US07/835540 (filed 11/02/07), all of which are incorporated herein by reference for all purposes, particularly for all teachings relating to methods and compositions for selecting adaptors in a particular orientation.
Preparation of DNB
In one aspect, the nucleic acid templates of the present invention are used to make nucleic acid nanospheres, which are also referred to herein as "DNA nanospheres", "DNBs", and "amplicons". Although the nucleic acid nanospheres of the present invention can be made from any nucleic acid molecule using the methods described herein, these nucleic acid nanospheres are typically concatemers comprising multiple copies of the nucleic acid template of the invention.
In one aspect, Rolling Circle Replication (RCR) is utilized to generate concatemers of the invention. The RCR procedure has been used to prepare successive copies of the M13 genome (Blanco, et al, (1989) J Biol Chem 264: 8935-8940). In this method, nucleic acids are replicated in a linear concatameric manner. Guidance for selecting conditions and reagents for RCR reactions can be found by those skilled in the art in a number of references, including U.S. patent nos. 5,426,180, 5,854,033, 6,143,495 and 5,871,921, which are incorporated herein by reference in their entirety for all purposes, particularly for all teachings relating to the preparation of concatemers using RCR or other methods.
Typically, the RCR reaction components include a single-stranded DNA circle, one or more primers capable of annealing to the DNA circle, a DNA polymerase having strand displacement activity that extends the 3' ends of the primers annealed to the DNA circle, nucleoside triphosphates, and a conventional polymerase reaction buffer. These components are combined under conditions that allow the primers to anneal to the DNA circle. Extension of these primers by the DNA polymerase forms concatemers of the complementary strand of the DNA circle. In certain embodiments, the nucleic acid templates of the invention are double-stranded circles that are denatured to form single-stranded circles that can be used in an RCR reaction.
In certain embodiments, amplification of circular nucleic acids can be achieved by successive ligation of short oligonucleotides (e.g., 6-mers) from a mixture containing all possible sequences, or, if the circles are synthetic, from a limited mixture of these short oligonucleotides containing selected sequences for circle replication, a process known as "circle-dependent amplification" (CDA). "Circle-dependent amplification" or "CDA" refers to multiple displacement amplification of a double-stranded circular template with primers that anneal to both strands of the circular template, resulting in a cascade of multiple-hybridization, primer extension, and strand displacement events that generates products representing both strands of the template. This results in an exponential increase in the number of primer binding sites, and as a result, the amount of product produced also increases exponentially over time. The primers used may be random sequences (e.g., random hexamers) or may have specific sequences to select for amplification of the desired product. CDA results in the formation of a set of concatemeric double-stranded fragments.
In the presence of a bridging template DNA complementary to both the beginning and end of the target molecule, concatemers can also be generated by ligating the target DNA. A population of different target DNAs can be converted into concatemers by means of a mixture of corresponding bridging templates.
In certain embodiments, a subset of the population of nucleic acid templates can be isolated according to a particular characteristic, such as a desired number or type of adaptors. This population may be separated or otherwise processed (e.g., sized) using conventional techniques (e.g., conventional spin columns, etc.) to form a population from which concatemer populations may be generated using techniques such as RCR.
Methods of forming the DNBs of the present invention are described in published patent applications WO2007120208, WO2006073504, WO2007133831 and US2007099208, as well as US patent applications 60/992,485, 61/026,337, 61/035,914, 61/061,134, 61/116,193, 61/102,586, 12/265,593, 12/266,385, 11/938,096, 11/981,804, 11/981,797, 11/981,793, 11/981,767, 11/981,761, 11/981,730 (filed 10/31/2007), 11/981,685, 11/981,661, 11/981,607, 11/981,605, 11/927,388, 11/927,356, 11/679,124, 11/541,225, 10/547,214, 11/451,692 and 11/451,691, which are all incorporated herein by reference in their entirety for all purposes, particularly for all teachings relating to the formation of DNBs.
Method for obtaining sequence information
Nucleic acids, nucleic acid fragments, and template nucleic acid constructs isolated and generated according to any of the methods described herein can be used in applications where sequence information is obtained. Such methods include sequencing and detecting a particular sequence in a target nucleic acid (e.g., detecting a particular target sequence (e.g., a particular gene) and/or identifying and/or detecting a SNP). Nucleic acid rearrangements and copy number changes can also be detected using the methods described herein. Nucleic acid quantification can also be achieved using the methods described herein, such as digital gene expression (i.e., analysis of the complete transcriptome: all mRNA present in a sample) and detection of the number of specific sequences or groups of sequences in a sample.
In one aspect, fragments and nucleic acid constructs generated according to the invention provide the advantage of combining shorter sequence reads to provide sequence information about longer contiguous regions of a target nucleic acid (contiguous nucleic acid segments comprising two or more nucleotides in a row are also referred to herein as contigs). As used herein, "sequence read" refers to identifying or determining the identity of one or more nucleotides in a target nucleic acid region. Generally, sequence reads provide sequence information about a nucleic acid segment comprising two or more contiguous nucleotides. In certain aspects, sequence information is generated using unchained base reads, as described in Drmanac et al, (2010), Science, 327:78-81 and supplementary online materials, which is incorporated herein in its entirety and specifically for all teachings relating to methods and compositions for sequencing nucleic acids.
III.A. LFR
In one aspect, the Long Fragment Read (LFR) sequencing method is used with any of the fragments, nucleic acid template constructs, or DNA nanoballs described herein. Although described below primarily with respect to genomic nucleic acid fragments, it will be appreciated that any nucleic acid molecule is suitable for use in the methods described below. General LFR methods are described in U.S. patent application No. 11/451,692 (now U.S. patent No. 7,709,197), filed on 13.6.2006, and U.S. patent application No. 12/329,365, filed on 5.12.2008, each of which is incorporated herein in its entirety and specifically for all teachings relating to LFR and sequencing using the LFR method.
Generally, the LFR method involves physically separating long genomic DNA fragments among many different aliquots, so that the probability that a given region of the genome from both the maternal and paternal components occurs in the same aliquot is very low. By placing a unique identifier in each aliquot and analyzing many aliquots in the aggregate, long fragments of DNA can be assembled into a diploid genome; for example, the sequence of each parental chromosome can be obtained.
Aliquots of LFR fragments are also referred to herein as LFR libraries and LFR aliquot libraries. These LFR libraries may include tagged and untagged fragments.
LFR provides a new and inexpensive way of DNA preparation and tagging, with associated algorithms and software, to achieve precise assembly (i.e., complete haplotyping) of the separate sequences of the parental chromosomes in a diploid genome (such as in human embryonic or adult somatic cells) at significantly reduced experimental and computational cost (less than $1000). This approach is universally applicable to any existing genomic or metagenomic sequencing technology, including future longer-read (about 1 kb) approaches, and is in many ways equivalent to sequencing single DNA molecules greater than 100 kb in length, which is a technically challenging proposition. The proposed Long Fragment Read (LFR) method does not require expensive, less accurate and lower yield single molecule detection. The LFR method is based on the random physical separation of long genomic fragments (100-1000 kb) into many aliquots in such a way that each aliquot contains 10% or less of a haploid genome.
The LFR method as described herein is particularly useful when the initial amount of DNA to be analyzed is low. In some embodiments, the LFR methods of the invention are used to analyze the genome of individual cells. In still other embodiments, the LFR method of the invention is used to analyze genomes from 1-100 cells. In still other embodiments, the LFR methods of the invention are used to analyze genomes from 1-5, 5-10, 2-90, 3-80, 4-70, 5-60, 6-50, 7-40, 8-30, 9-20, or 10-15 cells. The method of isolating DNA when using small numbers of cells is similar to the method described above, but is carried out in a smaller volume. As will be appreciated, the LFR method of the invention may also be used when the initial amount of DNA is high (i.e., greater than the equivalent from 50-100 cells).
In some embodiments, after isolating the DNA and before dividing it into different aliquots (such as into wells of a multiwell plate or into different emulsion droplets, as described in more detail below), the genomic DNA must be carefully fragmented to avoid loss of material, particularly loss of the terminal sequence of each fragment, as loss of such material can result in gaps in final genomic assembly. In some cases, loss of sequence is avoided by using rare nicking enzymes that generate polymerase (such as phi29 polymerase) start sites that are approximately 100kb apart. As the polymerase produces new DNA strands, the old strands are displaced, with the net result that overlapping sequences are present near the polymerase start site, leaving few sequence deletions.
In particular embodiments, fragments generated according to one or more of the CoRE embodiments described above are used in the LFR methods described herein. Generally, methods of isolating DNA from a sample produce fragments of about 100 kb. These fragments can then be further fragmented, or used to generate shorter fragments, before separation into different aliquots, after separation, or both, using the methods described herein (including CoRE).
In some embodiments, DNA is isolated from a sample and then aliquoted into a number of different separate mixtures (such separate mixtures are interchangeably referred to herein as aliquots). After aliquoting, DNA in the separate mixtures is then fragmented using any of the methods described herein (including any of the embodiments of CoRE fragmentation discussed above). Shorter fragments can also be generated using DNA in separate mixtures as templates by using controlled DNA synthesis or amplification. Such synthesis and amplification methods are known in the art, and typically use a plurality of spaced apart primers corresponding to different regions of DNA in a separate mixture to replicate and/or amplify the DNA. In such embodiments, a second population of DNA fragments is formed that are of shorter length than the longer fragments from which they were derived. In yet other embodiments, the DNA in the separate mixture is fragmented (or used as a template to generate shorter fragments) multiple times. In yet other embodiments, after one or more rounds of fragmentation, the DNA in each aliquot is tagged with an adaptor tag according to the methods described herein.
In one embodiment, the genomic fragments (either before or after fragmentation) are divided into aliquots and the nucleic acids are diluted to a concentration of approximately 10% of a haploid genome per aliquot. At this dilution level, approximately 95% of the base pairs in a particular aliquot are non-overlapping. This method of aliquoting, also referred to herein as the Long Fragment Read (LFR) fragmentation method, may be used in certain embodiments for large molecular weight fragments isolated according to the methods described above and further herein. LFR is typically initiated by brief treatment of genomic nucleic acid, typically genomic DNA, with a 5' exonuclease to produce 3' single-stranded overhangs. These single-stranded overhangs serve as initiation sites for Multiple Displacement Amplification (MDA). The 5' exonuclease treated DNA is then diluted to subgenomic concentrations and dispersed among many aliquots. In some embodiments, the aliquots are dispersed among a plurality of wells in a multiwell plate. In other embodiments, the aliquots are contained in different emulsion droplets, as described in more detail below. Fragments in each aliquot are typically amplified using an MDA method that includes one or more of the additives described above for reducing or preventing amplification bias.
As discussed above, to properly separate the fragments, the DNA is typically divided into aliquots and diluted to a concentration of about 1-15% of a haploid genome per aliquot. In yet another embodiment, the DNA is split into aliquots to a concentration of about 10% of a haploid genome per aliquot. At such concentrations, 95% of the base pairs in the aliquots do not overlap. Dilution into subgenomic aliquots results in statistical separation such that maternal and paternal fragments generally fall in different aliquots. It will be appreciated that the dilution factor may depend on the original size of the fragments. Techniques that are capable of producing larger fragments require fewer aliquots, while techniques that produce shorter fragments may require a greater number of aliquots.
In still other embodiments, the DNA is diluted (i.e., divided into aliquots) to a concentration of about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, or 15% of a haploid genome per aliquot. In yet other embodiments, the DNA is diluted to a concentration of less than 1% of a haploid genome per aliquot. In yet other embodiments, the DNA is diluted to about 0.1-1%, 0.2-0.9%, 0.3-0.8%, 0.4-0.7%, or 0.5-0.6% of a haploid genome per aliquot.
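The relationship between dilution and overlap can be illustrated with a short calculation. The following is only a sketch under a stated assumption that is not part of the description above: fragments are assumed to land at random, so that the number of fragments covering any given base in an aliquot is Poisson-distributed with a mean equal to the fraction of a haploid genome in that aliquot. The function name is hypothetical.

```python
import math


def single_copy_fraction(coverage: float) -> float:
    """Fraction of the bases present in an aliquot that are covered by exactly one fragment.

    Assumption (not from the description): per-base depth is Poisson with mean
    `coverage`, the haploid genome equivalents per aliquot (0.10 for 10%).
    """
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    p_exactly_one = coverage * math.exp(-coverage)
    p_at_least_one = -math.expm1(-coverage)  # 1 - exp(-coverage), computed stably
    return p_exactly_one / p_at_least_one


# Compare several dilution levels in the 1-15% range mentioned in the text.
for percent in (1, 5, 10, 15):
    fraction = single_copy_fraction(percent / 100)
    print(f"{percent:>2}% haploid genome per aliquot: {fraction:.3f} non-overlapping")
```

Under this simplified model, the 10% case gives a non-overlapping fraction of about 0.95, which is consistent with the figure quoted above; more dilute aliquots give higher values.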
In some embodiments, the fragments are amplified before, after, or both before and after separation into aliquots. In still other embodiments, the fragments in each aliquot are then further fragmented and then tagged with an adaptor tag such that fragments from the same aliquot will all contain the same tagged adaptor, see e.g., US2007/0072208, which is hereby incorporated by reference in its entirety, and in particular with regard to the discussion of additional aliquots and overlays. In certain embodiments, fragments are not amplified after separation into aliquots, but are further fragmented using any of the methods discussed herein and known in the art. In certain embodiments, the DNA is not amplified prior to aliquoting, but rather fragmented and amplified after aliquoting the DNA in separate aliquots, and in still other embodiments fragmented and amplified multiple times.
In still other embodiments, multiple rounds of aliquoting are used in the LFR methods of the invention. The aliquots in one or more rounds can be tagged such that the aliquots in each subsequent round can be identified by their originating aliquot in the previous round. The fragments in each round of aliquoting may or may not be amplified and/or further fragmented prior to the next round of aliquoting.
In yet other embodiments, the sequence information obtained from LFR aliquots is assembled using bioinformatics techniques that comprehensively utilize information from a large number of aliquots of about 10 Mb each, which reduces the computational investment (i.e., the capital cost of computers) by about 100-fold. The added cost of reading 10-base tags (10% on sequencing reagents and instrument time for 2x50-base mate-pair reads) is offset several-fold by these computational savings and by increased sequence accuracy.
In yet another embodiment, the methods of the invention are combined with high-throughput, low-cost short-read DNA sequencing techniques, such as those described in published patent application nos. WO2007120208, WO2006073504, WO2007133831, and US2007099208, and U.S. patent application nos. 11/679,124, 11/981,761, 11/981,661, 11/981,605, 11/981,793, 11/981,804, 11/451,691, 11/981,607, 11/981,767, 11/982,467, 11/451,692, 11/541,225, 11/927,356, 11/927,388, 11/938,096, 11/938,106, 10/547,214, 11/981,730, 11/981,685, 11/981,797, 11/934,695, 11/934,697, 11/934,703, 12/265,593, 11/938,213, 11/938,221, 12/325,922, 12/252,280, 12/266,385, 12/329,365, 12/335,168, 12/335,188, and 12/361,507 (all of which are incorporated herein by reference in their entirety for all purposes, and particularly for all teachings relating to DNA sequencing).
III.A.1. Tagging
Fragments in different aliquots can be tagged with one or more adaptor tags to identify fragments contained in the same aliquot. Adaptor tags are sometimes referred to as tagging sequences, tags, or barcodes (note that these are also referred to as adaptors in U.S. provisional application No. 61/187,162, filed 6/15/2009).
As outlined above, some embodiments of LFR do not require an adaptor tag; in these embodiments, LFR aliquots are placed into different vessels, such as in the microtiter plate embodiments discussed herein. In these embodiments, the LFR fragments may be fragmented again without adding adaptor tags, as long as the source of each aliquot is tracked.
Alternatively, as described in detail below, aliquots are tagged with adaptor tags to identify fragments contained in the same aliquot. The adaptor tag may be added in a variety of ways, as outlined below. In some cases, the adaptor tag may be added (as with the other adaptor additions described herein) in such a way that aggregation of the adaptor tag is prevented.
In embodiments utilizing tagging, the fragments in each aliquot are tagged with one or more adaptor tags. In some embodiments, the adaptor tag is designed in two segments. The first segment is common to all wells and is blunt-end ligated directly to the fragments using methods described further herein. The second segment is unique to each well and may also contain a barcode sequence, such that when the contents of the wells are combined, the fragments from each well can be identified. FIG. 27 shows some exemplary barcode adaptor tags that may be added to fragments in this aspect of the invention.
In many aspects of the invention, it is useful to repair the fragments so that they have blunt ends, and in some cases it may be desirable to alter the terminal chemistry so that the correct orientation of phosphate and hydroxyl groups is not present, thus preventing polymerization of the target sequences. Control of the end chemistry can be provided using methods known in the art and described in more detail above with respect to further processing of the fragments and with respect to ligating adaptors to target nucleic acids. Such methods are also applicable to controlling the directionality of ligation of adaptor tags to fragments in the methods described herein. An additional method for controlling the orientation of the adaptor tag is shown in FIG. 7, where the primer and uracil positions are selected such that either the upstream or the downstream recognition site can be selectively protected from inactivation. For example, in FIG. 7, two different adaptor tag arms (represented by squares) each contain a recognition site for a restriction endonuclease (represented by circles in one adaptor arm and triangles in the other). If the adaptor tag arm with the recognition site represented by a circle needs to be protected using the uracil degradation method described above, the uracil-modified amplification primer is designed to incorporate uracil into that recognition site. The adaptor tag arm is then rendered single-stranded (represented by a half-square) after uracil degradation, thus protecting the recognition site from inactivation.
In some cases, all phosphate groups are removed using a phosphatase such that all termini contain hydroxyl groups. Each end can then be selectively altered to facilitate connection between the desired components. One end of the fragment may then be "activated", in some embodiments, by treatment with alkaline phosphatase.
FIG. 27 provides a schematic diagram of some embodiments of an adaptor tag design for use as a tag according to the LFR method described herein. Typically, the adaptor tag is designed as two segments. One segment is common to all the wells and is blunt-end ligated directly to the fragment using the methods described further herein. A common adaptor tag may be used as a control for any potential concentration differences between aliquots. In the embodiment shown in FIG. 27, the added "common" adaptor tag has two adaptor tag arms: one arm is blunt-end ligated to the 5' end of the fragment and the other arm is blunt-end ligated to the 3' end of the fragment. The second segment of the adaptor tag is a "barcode" segment that is unique to each well. The barcode is typically a unique nucleotide sequence, and each fragment in a particular well is given the same barcode. Then, when tagged fragments from all aliquots are recombined for sequencing applications, fragments from the same aliquot can be identified by identifying the barcode adaptor tags. In the embodiment illustrated in FIG. 27, the barcode is attached to the 5' end of the common adaptor tag arm. The common adaptor tag and the barcode adaptor tag may be ligated to the fragments sequentially or simultaneously. As will be described in further detail herein, the ends of the common adaptor tag and the barcode adaptor tag may be modified so that each adaptor segment is ligated in the correct orientation and to the appropriate molecule. Such modifications prevent "aggregation" of the adaptor tag segments by ensuring that the fragments are not ligated to each other and that the adaptor tag segments can only be ligated in the desired orientation. Such modifications are also discussed in detail in the section above on controlling the ligation of adaptors to target nucleic acids to produce the nucleic acid template constructs of the invention.
In other embodiments, a three-segment design may be employed to tag the fragments in each well with an adaptor tag. This embodiment is similar to the barcode adaptor tag design described above, except that the barcode adaptor tag segment is divided into two segments (see FIG. 27). This design allows a wider range of possible barcodes, because different barcode segments are ligated together to form a complete barcode segment, thus creating a combinatorial barcode adaptor tag segment. This combinatorial design provides a larger repertoire of possible barcode adaptor tags while reducing the number of complete barcode adaptor tags that need to be generated.
In one embodiment, construction of LFR libraries of multiple aliquots of tagged fragments involves the use of different adaptor tag sets. The A and B adaptor tags are easily modified to each contain different half-barcode sequences to produce thousands of combinations. In certain embodiments, the half-barcode sequences are incorporated into the same adaptor tag. This can be achieved by splitting the B adaptor tag into two parts, each with a half-barcode sequence, separated by a common overlapping sequence used for ligation (FIG. 28E). The two tag components each have 4-6 bases. An 8-base (2x4 bases) tag set is able to uniquely tag 65,000 aliquots. One additional base (2x5 bases) would allow for error detection, and a 12-base tag (2x6 bases, 12 million unique barcode sequences) can be designed using a Reed-Solomon design to allow for substantial error detection and correction in 10,000 or more aliquots. Methods for designing adaptor tags are further disclosed in U.S. patent application No. 12/697,995, filed 2/1/2010, which is hereby incorporated by reference in its entirety for all purposes and in particular for teachings relating to the Reed-Solomon algorithm and its use in designing adaptor tags, which are also referred to as adaptors in that application.
In still other embodiments, ligation of the adaptor tags is orientation-controlled; that is, the invention provides for directed ligation of the adaptor tags. Such directed ligation can utilize any of the methods described herein for ligating adaptors to target nucleic acids. In an exemplary embodiment, half-adaptor tags (also referred to herein as tag components and adaptor tag segments) are ligated to each side of a DNA fragment in two separate steps. The first half-adaptor tag is blocked at its 3' end by the incorporation of dideoxynucleotides on one strand, thus allowing ligation only to the 3' end of the DNA fragment. Thus, double-stranded fragments have a half-adaptor tag attached to the 3' end of each strand of the fragment (i.e., there is a half-adaptor tag attached to the 3' end of the Watson strand and to the 3' end of the Crick strand). These half-tagged fragments are then denatured and combined with a primer complementary to the ligated adaptor tag and a polymerase to generate double-stranded DNA from each strand of the DNA fragments ligated to the first half-adaptor tag. In certain embodiments, the first half-adaptor tag comprises a barcode or a half-barcode, as discussed in more detail herein. A second half-adaptor tag (which in some embodiments does not contain a barcode) can then be ligated to the newly created 3' end of the replicated fragment containing the first half-adaptor tag. The advantage of this sequential method of adding each half-adaptor tag to the fragments is that only those fragments ligated to the first half-adaptor tag will then undergo ligation to the second half-adaptor tag. As should be appreciated, multiple half-adaptor tags may be added during each cycle: in other words, 1 or more tag components can be directionally ligated to the selected end of each fragment, and then after denaturation and replication, 1 or more additional tag components can be added to the newly created 3' end.
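The tag-length figures above follow from the size of the four-letter sequence space, and the idea of spending extra bases on error detection can be shown with a toy scheme. The sketch below is an illustration only: the single check base used here is a simple sum-modulo-4 checksum chosen for brevity, not the Reed-Solomon design referred to in the text, and all function names are hypothetical.

```python
BASES = "ACGT"


def barcode_space(total_bases: int) -> int:
    """Number of distinct sequences of the given length over A, C, G, T."""
    return len(BASES) ** total_bases


def add_check_base(tag: str) -> str:
    """Append one base so that the base values of the whole tag sum to 0 modulo 4.

    Illustrative checksum only; it stands in for, and is much weaker than,
    a Reed-Solomon code.
    """
    total = sum(BASES.index(base) for base in tag)
    return tag + BASES[(-total) % 4]


def is_valid(tagged: str) -> bool:
    """True if the checksum holds; any single-base substitution breaks it."""
    return sum(BASES.index(base) for base in tagged) % 4 == 0


print(barcode_space(8))    # raw sequence space of a 2x4-base tag
print(barcode_space(12))   # raw sequence space of a 2x6-base tag

protected = add_check_base("ACGTTGCA")
corrupted = "C" + protected[1:]  # simulate a single read error in the first base
print(protected, is_valid(protected), is_valid(corrupted))
```

A checksum of this kind can only flag a tag as damaged. Correcting errors, as described for the 12-base design, requires a code with a larger minimum distance between valid tags, which in turn means that only a subset of the raw sequence space is used.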
As such, different tag component sets can be used in various combinations to generate a combined tag that tags fragments.
In still other embodiments, the first half-adaptor tag is blocked at the 5' end, allowing ligation only to the 5' end of the DNA fragment, and the second half-adaptor tag is blocked at the 3' end, allowing ligation only to the 3' end of the DNA fragment. Thus, in this embodiment both halves of the adaptor tag may be ligated to the fragment simultaneously.
In still other embodiments, adaptor tags or other tags are added to fragments in accordance with the methods of adding adaptors disclosed in WO2007120208, WO2006073504, WO2007133831, and US2007099208, and U.S. patent application nos. 11/679,124, 11/981,761, 11/981,661, 11/981,605, 11/981,793, 11/981,804, 11/451,691, 11/981,607, 11/981,767, 11/982,467, 11/451,692, 11/541,225, 11/927,356, 11/927,388, 11/938,096, 11/938,106, 10/547,214, 11/981,730, 11/981,685, 11/981,797, 11/934,695, 11/934,697, 11/934,703, 12/265,593, 11/938,213, 11/938,221, 12/325,922, 12/252,280, 12/266,385, 12/329,365, 12/335,168, 12/335,188, and 12/361,507 (each of which is incorporated by reference for all purposes, and specifically for all teachings relating to adaptors).
After tagging the fragments in each well, all aliquots can be combined to form a single population in some embodiments. Sequence information obtained from these tagged fragments can be identified as belonging to a particular aliquot based on the barcode adaptor tag attached to each fragment.
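Assigning pooled reads back to their aliquots can be sketched as a simple lookup. The example below makes several assumptions that are not taken from the description: each read is a plain string that begins with a fixed-length barcode, barcodes are matched exactly, and the well names and sequences are invented for illustration.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_aliquot, barcode_len):
    """Group pooled reads by aliquot using a leading barcode.

    Returns (groups, unassigned): `groups` maps an aliquot name to the reads
    with the barcode trimmed off; `unassigned` holds reads whose barcode is
    not in the lookup table. Exact matching only (an assumption of this sketch).
    """
    groups = defaultdict(list)
    unassigned = []
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        aliquot = barcode_to_aliquot.get(barcode)
        if aliquot is None:
            unassigned.append(read)
        else:
            groups[aliquot].append(insert)
    return dict(groups), unassigned


# Hypothetical barcodes and reads, for illustration only.
barcodes = {"ACGT": "well_A1", "TGCA": "well_A2"}
pooled = ["ACGTGGGTTT", "TGCACCCAAA", "ACGTAAACCC", "NNNNGGGG"]

groups, unassigned = demultiplex(pooled, barcodes, barcode_len=4)
for aliquot, inserts in sorted(groups.items()):
    print(aliquot, inserts)
print("unassigned:", unassigned)
```

In practice a barcode set with error detection or correction would let some damaged barcodes be rescued rather than discarded, which this exact-match sketch does not attempt.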
III.A.2. Multiwell Format LFR
In many embodiments, each aliquot is contained in a different well of a multiwell plate (e.g., a 384 or 1536 well microtiter plate). It should be appreciated that while the following LFR discussion is provided in terms of multi-well plates, many different types of containers and systems may be used to contain the different aliquots generated in this method. Such containers and systems are well known in the art and it will be apparent to those skilled in the art what type of container and system would be suitable for use in accordance with this aspect of the invention.
In some embodiments, a 10% genome equivalent is aliquoted into each well of a multiwell plate. If a 384-well plate is used, a 10% genome equivalent in each well results in each plate containing a total of about 38 genomes. In still other embodiments, 5-50% of a genome equivalent is aliquoted into each well. As noted above, the number of aliquots and genome equivalents used in the LFR method of the invention may depend on the initial fragment size.
After separation into a plurality of wells, the fragments in each well can be amplified, typically using the MDA method. In certain embodiments, the MDA reaction is a modified Phi29 polymerase-based amplification reaction. Although the discussion herein is primarily in terms of MDA reactions, it will be understood by those skilled in the art that many different types of amplification reactions can be used in the present invention, which are well known in the art, and are summarized in Maniatis et al, Molecular Cloning: A Laboratory Manual, 2nd edition, 1989, and Short Protocols in Molecular Biology, Ausubel et al, which are incorporated herein by reference. In certain embodiments, the MDA method used before or after each aliquoting step may include an additive to reduce amplification bias, as discussed in more detail above.
After amplification of the fragments in each well, the amplification products may be subjected to another round of fragmentation. In some embodiments, the fragments in each well are further fragmented after amplification using the CoRE method described above. As discussed above, to use the CoRE method, the MDA reaction used to amplify the fragments in each well is designed to introduce uracil or other nucleotide analogs into the MDA product.
III.A.3. Emulsion droplets
In certain LFR applications, emulsion droplets are used in the aliquoting and tagging process. Methods for generating emulsion droplets containing nucleic acids and/or reagents for enzymatic reactions are well known in the art, see, e.g., Weizmann et al, (2006), Nature Methods, volume 3, issue 7, pages 545-550, which is hereby incorporated in its entirety for all purposes and in particular for all teachings relating to the formation of emulsions and the enzymatic reactions carried out within emulsion droplets.
In some embodiments, the emulsion droplets comprise nucleic acids or nucleic acid fragments isolated from the sample, including fragments generated using the CoRE method described herein. In such embodiments, each droplet typically contains a small number of fragments. In the LFR method for whole genome sequencing, a population of emulsion droplets will collectively contain fragments representing one or more genome equivalents. In still other embodiments, the population of emulsion droplets will collectively contain fragments representing 5-15 genomic equivalents. In still other embodiments, the population of emulsion droplets collectively will contain fragments representing 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19, or 20 genomic equivalents.
In still other embodiments, the emulsion droplets further comprise two or more adaptor tag components. For purposes of clarity, emulsion droplets containing target nucleic acid fragments are referred to as target nucleic acid droplets, and emulsion droplets containing adaptor tags are referred to as adaptor tag droplets.
In certain embodiments, enzymes such as ligases and other reagents such as buffers and cofactors are also contained within the target nucleic acid droplet and/or in the adaptor tag droplet. The fragments or adaptor tags may be prevented from chaining within the same droplet by altering the ends as described in more detail above such that ligation occurs between the fragments and the adaptors in only a preferred orientation. Ligases and other reagents may also be included in different groups of emulsion droplets.
In still other embodiments, individual target nucleic acid droplets are combined with individual adaptor tag droplets such that the droplets merge. In embodiments where the target nucleic acid droplet or the adaptor tag droplet contains a ligase and/or other reagents for the ligation reaction, the nucleic acid fragment will be ligated to one or more adaptor tags after the adaptor tag and the nucleic acid droplet have been combined. In embodiments where the ligase and other reagents are contained in different sets of emulsion droplets, ligation will occur after the individual target nucleic acid droplets, individual adaptor tag droplets and ligase/reagent droplets are merged.
In embodiments where the adaptor tag droplets contain two or more "half-adaptors" (also referred to herein as "tag components"), pooling of the droplets results in ligation of the target nucleic acid fragments in each droplet to a unique combined adaptor tag (fig. 28A-B). Two sets of 100 half-barcodes are sufficient to uniquely identify 10,000 aliquots (FIG. 2E). However, increasing the number of half-barcode adaptors to over 300 may allow barcode droplets to be added at random and combined with sample DNA with a low probability that any two aliquots contain the same barcode combination. This has the advantage that tens of thousands of unique combinatorial barcode adaptor tag droplets can be generated in large quantities and stored in a single tube for use as reagents for thousands of different LFR libraries.
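The counting behind combinatorial tagging can be sketched as follows. This is a minimal illustration, not the method of the invention: it assumes two sets of equal size, that "over 300" refers to the size of each set, and that randomly added barcode droplets are drawn uniformly and independently. The function names are hypothetical.

```python
def combined_tags(half_barcodes_per_set: int, sets: int = 2) -> int:
    """Number of distinct combined tags from `sets` sets of half-barcodes."""
    return half_barcodes_per_set ** sets


def pair_collision_probability(half_barcodes_per_set: int) -> float:
    """Chance that two given aliquots draw the same combination
    (assumes uniform, independent random draws)."""
    return 1.0 / combined_tags(half_barcodes_per_set)


def fraction_sharing_a_tag(half_barcodes_per_set: int, aliquots: int) -> float:
    """Expected fraction of aliquots whose combination is also carried by
    at least one other aliquot (same uniform-draw assumption)."""
    c = combined_tags(half_barcodes_per_set)
    return 1.0 - (1.0 - 1.0 / c) ** (aliquots - 1)


if __name__ == "__main__":
    print(combined_tags(100))                      # 100 x 100 combinations
    print(pair_collision_probability(300))         # per-pair collision chance
    print(fraction_sharing_a_tag(300, 10_000))     # stricter, library-wide view
```

The per-pair probability is the quantity closest to the "low probability" statement above; the library-wide fraction is a different and stricter question, and it depends strongly on how many aliquots are drawn relative to the number of available combinations.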
In some embodiments, 10,000 to 100,000 or more aliquot libraries (i.e., emulsion droplets) are used in the methods of the invention. In still other embodiments, the emulsion LFR method is scaled up by increasing the number of initial half-barcode adaptor tags. These combinatorial adaptor tag droplets are then fused one-to-one with droplets containing ligated DNA representing less than 1% of the haploid genome (fig. 28D). Using a conservative estimate of 1 nl per droplet and 10,000 droplets, the entire LFR library represents a total volume of 10 μl, so a volume reduction of about 400-fold, and thus a cost reduction, may be possible. In such embodiments, emulsion droplets provide the ability to miniaturize LFR aliquots from microliters to nanoliters, and to increase the number of aliquots typically used in such methods from hundreds to thousands (reducing the DNA per aliquot from 10% to less than 1% of the haploid genome). Such systems with 10,000 or more emulsion droplets open the possibility of performing complete genome sequencing starting with only one cell.
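The volume estimate above is simple arithmetic, sketched here for convenience. The comparison volume is not stated in the text; the value computed below is only what a 400-fold reduction would imply and is an inference, not a figure from the description.

```python
NL_PER_UL = 1000.0  # nanoliters per microliter


def library_volume_ul(droplets: int, nl_per_droplet: float) -> float:
    """Total emulsion library volume in microliters."""
    return droplets * nl_per_droplet / NL_PER_UL


def implied_reference_volume_ul(library_ul: float, fold_reduction: float) -> float:
    """Volume a non-emulsion library would need for the stated fold reduction
    (derived value, not given in the text)."""
    return library_ul * fold_reduction


if __name__ == "__main__":
    total = library_volume_ul(10_000, 1.0)
    print(total, "ul for the emulsion library")
    print(implied_reference_volume_ul(total, 400), "ul implied for comparison")
```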
In still other embodiments, 1,000 to 500,000 fragments and adaptor tag droplets are used in the methods of the invention. In still other embodiments, 10,000-400,000, 20,000-300,000, 30,000-200,000, 40,000-150,000, 50,000-100,000, or 60,000-75,000 fragments and adaptor tag droplets are used in the methods of the invention. In still other embodiments, at least 1,000, at least 10,000, at least 30,000, or at least 100,000 fragments and adaptor tag droplets are used in the methods of the invention.
In still other embodiments where the adaptor tag droplets contain at least 2, 3, 4, 5, 6, 7, 8, 9, or 10 different adaptor tag groups or building blocks, combining these adaptor tag droplets with droplets of nucleic acid fragments produces at least a portion of the resulting combined droplets having fragments tagged with different tag component combinations. In still other embodiments, at least 1,000, at least 10,000, at least 30,000, or at least 100,000 different droplets contain fragments that are tagged with different tag component combinations. In still other embodiments, 1,000 to 500,000 droplets contain fragments that are tagged with different tag component combinations. In still other embodiments, 10,000-400,000, 20,000-300,000, 30,000-200,000, 40,000-150,000, 50,000-100,000, or 60,000-75,000 droplets contain fragments that are tagged with different tag component combinations.
In some embodiments, nucleic acids or nucleic acid fragments from a sample generated using any of the methods described herein are contained within an emulsion droplet, as discussed above. The nucleic acids or fragments within each nucleic acid droplet are fragmented using any of the methods described herein prior to combining with and tagging the adaptor tag droplets. Such fragmentation and then subsequent tagging allows identification of fragments contained in the same droplet and that can also be consecutive segments of the same region of the genome. In this manner, the identification results of the attached tags can be used to assemble and collate the sequence information of the tagged target nucleic acid fragments. In certain embodiments, sequencing of fragments comprises obtaining information about the adaptor tags to which they are attached.
In certain embodiments, the size of the emulsion droplets is controlled using methods known in the art to prevent shearing and thus further fragmentation of target nucleic acid fragments when they are contained within the droplets. In some embodiments, 1 nl droplets (i.e., droplets with a volume of (100 μm)³) are used. 50 kb lambda dsDNA has been shown to form a ball of about 1 μm³, and thus 200 kb human genomic dsDNA would be expected to form a ball of about 2 μm³, which would be easily contained in 1 nl droplets with minimal shear due to the incorporation (emulsification) process. Single-stranded DNA, which is the starting material for the initial step of MDA and, in embodiments where DNA is amplified before or after aliquoting, is the material typically used to form the droplets of the invention, is even more compact or flexible because it has a persistence length of about one tenth that of dsDNA. In addition, and as discussed in more detail above, adding an element such as spermidine to the DNA during the pipetting process also helps protect the DNA from shearing, which is likely (without being limited to theory) due to the ability of substances such as spermidine to compress the DNA.
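The droplet dimensions quoted above can be checked with a short unit conversion. This sketch only restates the figures in the text (a 100 μm cube, a DNA ball of about 2 μm³); treating the droplet as a cube is a simplification used for the arithmetic, and the function names are hypothetical.

```python
UM3_PER_NL = 1e6  # 1 nl = 1e-12 m^3 and 1 um^3 = 1e-18 m^3


def cube_volume_nl(edge_um: float) -> float:
    """Volume in nanoliters of a cube with the given edge length in micrometers."""
    return edge_um ** 3 / UM3_PER_NL


def occupied_fraction(dna_ball_um3: float, droplet_nl: float) -> float:
    """Fraction of the droplet volume taken up by one compacted DNA molecule."""
    return dna_ball_um3 / (droplet_nl * UM3_PER_NL)


if __name__ == "__main__":
    print(cube_volume_nl(100.0))          # volume of a (100 um)^3 droplet in nl
    print(occupied_fraction(2.0, 1.0))    # share of a 1 nl droplet used by a 2 um^3 ball
```

Under these figures a single 200 kb molecule occupies on the order of a few millionths of the droplet volume, which is consistent with the statement that it is easily contained.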
There are currently several types of microfluidic (e.g., Advanced Liquid Logic) or pico/nanoliter droplet (e.g., RainDance Technologies) devices that can be modified to accept LFR reagents and methods. These instruments currently have fully operational pico/nanoliter droplet generation, fusion (3000/sec) and collection functions. Such small volumes may help prevent bias introduced by the amplification method, and may also reduce background amplification.
The advantage of using emulsion droplets is that reducing the reaction volume to microliter, nanoliter and picoliter levels provides a reduction in the cost and time associated with generating LFR libraries.
III.A.4. Advantages and illustrative applications of LFR
In one aspect, the DNBs are generated using fragments from an LFR aliquot library according to the methods described above. These DNBs can then be used in sequencing methods known in the art and described in more detail herein.
In yet another aspect, the initial long DNA fragments are aliquoted, then fragmented, and tagged in each aliquot. These tagged fragments are then pooled together, and at least a portion of the fragments are subsequently sequenced without amplification. In certain embodiments, about 30%-80% of the fragments are sequenced. In still other embodiments, about 35%-70%, 40%-65%, 45%-60%, or 50%-55% of the fragments are sequenced. In yet another embodiment, at least 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of the aliquoted and tagged fragments are sequenced without amplification.
In other embodiments, the fragments are amplified, and then about 35%-70%, 40%-65%, 45%-60%, or 50%-55% of the amplified fragments are sequenced. In yet another embodiment, at least 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of the aliquoted and tagged fragments are sequenced after amplification.
In one aspect, sequence reads from LFR fragments are assembled to provide sequence information about a contiguous region of the initial target nucleic acid that is longer than the individual sequence reads. Sequence reads may be on the order of 20-200 bases or, in some methods, 200-2,000 bases or longer. As discussed in more detail herein, the aliquoted fragments are typically about 20-200 kb or even longer than 1 Mb. In yet another aspect, such assembly relies on the identity of each fragment's tag to identify fragments contained in the same aliquot. In still other embodiments, the tags are oligonucleotide adaptor tags and individual tags are identified by determining at least a portion of the sequence of the tags. The identity of the tag is used to identify the source aliquot of the attached fragment, and can also be used to sort sequence reads from individual fragments and to distinguish haplotypes. For example, as discussed above, the process of aliquoting fragments in LFR generally results in dividing the corresponding parental DNA fragments into different aliquots, such that as the number of aliquots increases, the number of aliquots containing both maternal and paternal haplotypes becomes negligibly small. In this manner, sequence reads from fragments in the same aliquot can be assembled and collated. The longer fragments used in this method also help bridge segments lacking heterozygous loci or resolve long segmental duplications.
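The role of the tag in sorting reads can be illustrated with a minimal sketch. It assumes each read has already been split into its tag sequence and its genomic sequence; the read representation, the example sequences, and the function name are hypothetical and are not taken from the description.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_reads_by_tag(reads: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Collect read sequences under the tag (i.e., source aliquot) they carry."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for tag, sequence in reads:
        groups[tag].append(sequence)
    return dict(groups)


if __name__ == "__main__":
    # Made-up (tag, read) pairs for illustration only.
    reads = [("AAC", "ACGT"), ("GGT", "TTGA"), ("AAC", "CGTA")]
    for tag, seqs in group_reads_by_tag(reads).items():
        print(tag, seqs)
```

Reads grouped under one tag come from the same aliquot and can then be assembled together, which is the step the paragraph above relies on.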
A further advantage of LFR is that sequence information obtained from longer fragments can be used to assemble sequences containing genomic regions with repeats whose length is greater than that of the individual sequence reads, regardless of the sequencing method used. Such advantages and applications of LFR are also discussed in U.S. patent application No11/451,692 (now U.S. patent No7,709,197), filed on 13.2006, and U.S. patent application No12/329,365, filed on 5.12.2008, each incorporated by reference herein in its entirety and specifically for all teachings relating to LFR and sequencing using the LFR method.
It should be recognized that advances in bioscience (including with respect to agriculture and biofuel production) and medicine are critically dependent on accurate, low-cost and high-throughput genome and transcriptome sequencing. To achieve these benefits, the cost of accurate sequencing of an individual's genome should be very low, such as less than $ 1000. This cost should include all components of the method such as DNA preparation reagents, depreciation of sequencing instruments, and calculations.
The current LFR invention can also be used in the absence of a reference sequence (e.g., metagenomics) for rapid complete de novo assembly. First, a partial assembly can be achieved within each aliquot. Then, limited alignment of the assembled contigs is used to find aliquots with overlapping fragments to complete the assembly of the shared DNA segment. The assembly of segments is then propagated in both directions. The large number of LFR aliquots, each with less than 0.1% of the genome, ensures uniqueness of the short overlaps of short reads in de novo assembly (i.e., a 12-base overlap is sufficient for a unique read overlap in 0.1% of the genome, compared with the 17 bases required for the complete genome), resulting in longer sequence contigs at lower read coverage. Read coverage generally refers to fractional or fold coverage of the genome.
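The 12-base versus 17-base comparison can be made concrete by comparing the number of possible sequences of a given length with the number of positions in the region being assembled. This is a rough sketch that assumes a haploid genome of about 3 Gb (a value not stated in this passage) and treats sequence as random; the function name is hypothetical.

```python
HAPLOID_GENOME_BASES = 3_000_000_000  # assumed approximate human haploid genome size


def possible_kmers_per_position(k: int, region_bases: float) -> float:
    """Ratio of distinct k-base sequences (4**k) to positions in the region.
    A ratio above 1 suggests a random k-base overlap is likely to be unique."""
    return 4 ** k / region_bases


if __name__ == "__main__":
    aliquot_bases = HAPLOID_GENOME_BASES * 0.001  # 0.1% of the genome per aliquot
    print(possible_kmers_per_position(12, aliquot_bases))
    print(possible_kmers_per_position(17, HAPLOID_GENOME_BASES))
```

Under these assumptions the two ratios are nearly equal, so a 12-base overlap within a 0.1% aliquot offers about the same margin of uniqueness as a 17-base overlap across the whole genome.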
In one aspect, the present invention encompasses software and algorithms that perform the schemes according to the exemplary methods described above with high efficiency.
In yet another aspect, genomic methylation analysis is performed using the methods and compositions of the invention. There are several methods currently available for global genomic methylation analysis. The most economical method involves bisulfite treatment of genomic DNA and sequencing of repetitive elements or parts of the genome obtained by methylation-specific restriction enzyme fragmentation. This technique yields information about global methylation, but does not provide locus-specific data. The next higher level of resolution utilizes DNA arrays and is limited by the number of features on the chip. Finally, the most thorough and most expensive approach requires bisulfite treatment followed by sequencing of the entire genome. Using the LFR technique of the present invention, it is possible to sequence all the bases of a genome and assemble a complete diploid genome with numerical information about the methylation level of each cytosine position in the human genome (i.e., 5-base sequencing). In addition, LFR allows blocks of methylated sequence of 100 kb or greater to be linked to a sequence haplotype, providing methylation haplotyping, i.e., information that cannot be obtained with any currently available method.
In one non-limiting exemplary embodiment, the methylation state is obtained in a process in which genomic DNA is first aliquoted and denatured for MDA. Next, the DNA is treated with bisulfite (a step that requires the DNA to be denatured). The remaining preparation follows the methods described in, for example, U.S. application serial nos. 11/451,692, filed 6/13/2006, and 12/335,168, filed 12/15/2008, each of which is incorporated by reference herein for all purposes and in particular for all teachings relating to nucleic acid analysis of fragment mixtures in accordance with long fragment read techniques.
In one aspect, MDA will independently amplify each strand of a particular fragment, so that for any given cytosine position 50% of the reads are unaffected by bisulfite (i.e., they read the base opposite the cytosine, a guanine, which is unaffected by bisulfite) and 50% provide the methylation state. The reduced DNA complexity per aliquot aids in the precise mapping and assembly of reads that carry less information, being composed primarily of 3 bases (A, T, G).
Bisulfite treatment has historically been found to fragment DNA. However, denaturation and careful titration with bisulfite buffer can avoid excessive fragmentation of genomic DNA. LFR can tolerate 50% conversion of cytosine to uracil, allowing the exposure of DNA to bisulfite to be reduced in order to minimize fragmentation. In some embodiments, some degree of fragmentation after aliquoting is acceptable because it does not affect haplotyping.
In one aspect, the methods of the invention generate quality genome data from a single cell. The ability to sequence single cells opens many new pathways in genomic research and diagnostics. Assuming no DNA loss, starting with a small number of cells (10 or fewer) is beneficial compared with using an equal amount of DNA from a large preparation. Starting with fewer than 10 cells and aliquoting all of the DNA ensures uniform coverage in long fragments of any given region of the genome. Starting with 5 or fewer cells allows for four-fold or greater coverage per 100 kb DNA fragment in each aliquot without increasing the total number of reads above 120 Gb (20-fold coverage of a 6 Gb diploid genome). However, when testing samples obtained from a small number of cells, a large number of aliquots (10,000 or more) and longer DNA fragments (greater than 200 kb) can be useful, because for any given sequence there are only as many overlapping fragments as there are starting cells, and the occurrence of overlapping fragments from both parental chromosomes in one aliquot can cause a devastating loss of information.
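The coverage figures above follow from dividing the total sequence yield by the amount of template DNA. The sketch below restates that arithmetic; it assumes no DNA loss and uniform amplification, and the function names are hypothetical.

```python
def total_reads_gb(fold_coverage: float, diploid_genome_gb: float) -> float:
    """Total sequence needed for a given fold coverage of one diploid genome."""
    return fold_coverage * diploid_genome_gb


def coverage_per_fragment(total_gb: float, diploid_genome_gb: float, cells: int) -> float:
    """Average fold coverage of each original fragment when the reads are
    spread evenly over the genomes of all starting cells."""
    return total_gb / (diploid_genome_gb * cells)


if __name__ == "__main__":
    total = total_reads_gb(20, 6)                 # 20-fold coverage of a 6 Gb genome
    print(total, "Gb of reads in total")
    print(coverage_per_fragment(total, 6, 5), "fold per fragment with 5 cells")
    print(coverage_per_fragment(total, 6, 10), "fold per fragment with 10 cells")
```

With a fixed read budget, doubling the number of starting cells halves the coverage available to each individual fragment, which is why fewer cells give deeper per-fragment coverage.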
The LFR technique of the present invention addresses the problem of small input DNA amounts, as it is effective when only about 10 cells' worth of starting input genomic DNA is available. In still other embodiments, LFR is performed on nucleic acids obtained from about 1-20, 2-18, 3-16, 4-14, 5-12, 6-10, or 7-8 cells. In still other embodiments, LFR may also be used with nucleic acids obtained from single cells, as the first step in LFR is generally low-bias whole genome amplification, which may be particularly useful in single cell genome analysis. Even single molecule sequencing methods may require some level of DNA amplification from single cells due to DNA strand breaks and DNA loss in the process. The difficulty in sequencing single cells comes from trying to faithfully amplify the entire genome. Studies performed on bacteria using MDA have suffered from a loss of about half of the genome in the final assembled sequence and fairly high variation in coverage between the regions that were sequenced. This can be explained in part by the initial genomic DNA having nicks and strand breaks that cannot be replicated at the ends and are thus lost during the MDA process. In certain aspects, LFR provides a solution to this problem because it includes steps to generate long overlapping fragments of the genome prior to whole genome amplification methods such as MDA. As discussed in more detail above, these long fragments are generated in some embodiments using a mild method of isolating genomic DNA from the cells used. The predominantly intact genomic DNA is then gently treated with a common nicking enzyme to generate a semi-randomly nicked genome. Then, using the strand displacement capability of Φ29 polymerase to polymerize from the nicks, very long (greater than 200 kb) overlapping fragments are created. These fragments are then used as starting templates for the LFR process.
In other embodiments, the long fragment is generated prior to MDA using a CoRE fragmentation technique as discussed above. As should be appreciated, a combination of CoRE and other methods known in the art for generating fragments may also be utilized to provide materials for the steps of the LFR methods discussed herein.
There are two basic approaches to advanced genome sequencing: using amplified DNA or relying on single molecule detection. Generally, the first group is expected to have lower detection costs (higher throughput), while the second group is expected to have lower costs in DNA preparation and reagents. To achieve accurate measurements, single molecule sequencing may require 100-fold more measurements than using amplified DNA because of asynchronous base reads and/or longer detection times. In contrast, amplified DNA arrays have demonstrated reduced reagent cost via miniaturization while still maintaining high quality, low cost detection, and further reagent reduction via microfluidic devices is entirely feasible. Therefore, advanced miniaturized methods using amplified DNA are likely to be the first systems to provide low cost medical genome sequencing.
For diagnostic medical applications, the accuracy and completeness of the sequence cannot be compromised for the sake of low cost. In addition to high per-base accuracy, an important component of the accuracy and completeness of human genome sequencing is the assembly of independent and precise sequences (including methylation haplotype states) for the two parental chromosomes of a diploid cell. This can be important for accurate prediction of the primary structure of the synthesized protein or RNA alleles and of their corresponding levels of expression. Consensus sequence information cannot support these predictions, because the enhancers and other sequences responsible for the allelic expression level may be more than 100 kb upstream of the gene of interest, or because two adjacent SNPs affecting the protein amino acid sequence may reside on different alleles of the gene of interest.
To achieve haplotyping at the chromosome level, simulation experiments have shown that allelic linkage information in the range of at least 70-100 kb is required. This cannot be achieved with techniques using amplified DNA. These techniques are most likely limited to reads of less than 1000 bases due to difficulties in consistent amplification of long DNA molecules and loss of linkage information in sequencing. Mate-pair techniques can provide the equivalent of an extended read length, but are limited to less than 10 kb due to the inefficiency of generating such DNA libraries (i.e., circularization of DNA longer than a few kb is very difficult). This approach also requires extreme read coverage to link all heterozygous sites. If it is feasible to process such long molecules, and if the accuracy of single molecule sequencing is high and the detection/instrumentation costs are low, then an ideal technique for this would be single molecule sequencing of DNA fragments greater than 100 kb. This is very difficult to achieve with high yield even for shorter molecules, let alone for 100 kb fragments.
LFR provides a universal solution equivalent to inexpensive sequencing of long single DNA molecules, which would make both current shorter-read amplified-DNA technologies and potential future longer-read single-molecule technologies less costly and more accurate in obtaining and assembling genomic sequence data. At the same time, this approach would provide complete haplotype resolution in complex diploid genomes, and allow assembly of metagenomic mixtures.
In one aspect, the invention is based on an effective read length of about 100-1000 kb. In addition, LFR can also significantly reduce the computational requirements and associated costs of any short-read technology. Importantly, LFR eliminates the need to extend the sequencing read length if doing so reduces overall yield. With low cost short-read technologies such as cPAL (combinatorial probe anchor ligation) chemistry based on DNA nanoarrays (described in, for example, published patent application nos. WO2007120208, WO2006073504, WO2007133831, and US2007099208, and US patent application nos. 11/679,124,11/981,761,11/981,661,11/981,605,11/981,793,11/981,804,11/451,691,11/981,607,11/981,767,11/982,467,11/451,692,11/541,225,11/927,356,11/927,388,11/938,096,11/938,106,10/547,214,11/981,730,11/981,685,11/981,797,11/934,695,11/934,697,11/934,703,12/265,593,11/938,213,11/938,221,12/325,922,12/252,280,12/266,385,12/329,365,12/335,168,12/335,188, and 12/361,507, all of which are incorporated herein by reference in their entirety for all purposes and in particular for all teachings relating to sequencing technology), LFR provides a complete solution to sequencing of the human genome for medical and research applications at affordable costs.
LFR provides the ability to obtain the actual sequence of individual chromosomes, in contrast to only the consensus sequence of the parental or related chromosomes, despite their high degree of similarity and the presence of long repeats and segmental duplications. To generate such data, sequence continuity is generally established over a long DNA range, such as 100 kb to 1 Mb. Traditionally, such information has been obtained by BAC cloning, an expensive and unreliable method (e.g., because of unclonable sequences). Most sequencing techniques produce relatively short reads of DNA (100 bases to several kilobases). Furthermore, it is very difficult to maintain long fragments through multiple processing steps. Thus, one advantage of LFR is that it provides a versatile in vitro method to obtain such information at a lower cost.
LFRs with 10,000 or more aliquots incur substantially reduced computational costs and complexity of genome assembly via short read length sequencing techniques. This can be particularly important to reduce the overall cost of human genome sequencing below $ 1000.
LFR provides a reduction in the relatively high rate of erroneous or questionable base calls, typically one in 100 kb, or 30,000 false positive calls and a similar number of undetected variants per human genome, that affects current genome sequencing technologies. Such error rates may be reduced 10-1000 fold using the methods of the invention, in order to minimize follow-up confirmation of detected variants and to allow for diagnostic applications of human genome sequencing.
LFR using emulsion droplets is particularly useful in reducing cost and improving efficiency. By reducing the total reaction volume of the LFR process by more than 1000-fold, increasing the number of aliquots to about 10,000, and improving the quality of the data, the total cost of the complete genome processed via methods such as those described and published patent application nos. WO2007120208, WO2006073504, WO2007133831, and US2007099208, and US patent application nos. 11/679,124,11/981,761,11/981,661,11/981,605,11/981,793,11/981,804,11/451,691,11/981,607,11/981,767,11/982,467,11/451,692,11/541,225,11/927,356,11/927,388,11/938,096,11/938,106,10/547,214,11/981,730,11/981,685,11/981,797,11/934,695,11/934,697,11/934,703,12/265,593,11/938,213,11/938,221,12/325,922,12/252,280,12/266,385,12/329,365,12/335,168,12/335,188, and 12/361,507 (all patents are incorporated herein by reference for all purposes and specifically for all teachings relating to sequencing and nucleic acid preparation).
In addition to being applicable to all sequencing platforms, LFR-based sequencing can be applied to all major applications of low-cost, high-throughput sequencing beyond standard individual genome analysis (e.g., structural rearrangements in cancer genomes, complete methylome analysis including haplotypes of methylated sites, and de novo assembly applications for metagenomics or novel genome sequencing, even of complex polyploid genomes such as those found in plants).
Due to the general nature and cost effectiveness in providing linkage information for sequences that are separated by 100-. One of the important goals in various genomic applications is to generate sufficient genomic sequence data with a high degree of accuracy and completeness to be able to develop knowledge about the various genomic codes that drive complex genetic regulatory networks. The present invention encompasses LFR kits, tools and software for all genomics and sequencing platform applications.
LFR provides the ability to understand the genetic basis of thousands of diseases, particularly a large number of sporadic genetic diseases (with new or combined genetic defects) for which only a small number of patients are available for study. In these cases, the integrity of the genomic sequence (including all sequence variants and complete haplotyping of methylation status) allows the discovery of the actual genetic defect that leads to such rare diseases.
In some embodiments, the invention is useful in the genetic medical diagnostics of cancer genomes and in individual genome sequencing. In addition to helping to better understand tumor formation, complete sequencing of the cancer genome can be crucial to selecting the best personalized cancer therapy. Low cost, accurate and complete sequence data from a small number of cells can be useful in this important health application. Second, individual genome sequencing for personalized disease diagnosis, prevention and treatment must be complete (including complete chromosomal haplotypes), accurate and affordable in order to be effective. The present invention significantly improves all three of these measures of success. Such low cost universal genetic tests can be performed as part of an in vitro fertilization procedure (in which only one or two cells are available), as part of prenatal diagnosis or neonatal screening, and as part of routine adult health care. Once implemented at a scale where its impact is realized (more than 1 million genomes sequenced per year), this genetic test can significantly reduce healthcare costs via preventive measures and appropriate drug use.
The present invention can generate haplotype reads exceeding 100 kb. In some aspects, a cost reduction of about 10 fold can be achieved by reducing the volume to the sub-microliter level. This is made possible by the methods, compositions and reaction conditions of the present invention, which allow all six enzymatic steps to be performed in the same well without DNA purification. In some embodiments, the invention includes the use of a commercial automated pipetting approach in a 1536 well format. Rapid and low cost pipetting can be performed using nanoliter (nl) dispensing tools (e.g., Hamilton Robotics nanoliter pipette heads, TTP LabTech Mosquito, etc.) that provide 50-100 nl of non-contact pipetting to generate tens of genomic libraries in parallel. A 4-fold increase in the number of aliquots results in a large reduction in the complexity of the genome within each well, reducing the overall cost of computation by more than 10-fold and improving data quality. In addition, automation of this process increases throughput and reduces the hands-on cost of generating libraries.
In still other embodiments, and as discussed in greater detail above, unique identification of each aliquot is achieved with barcode adaptor tags. In embodiments utilizing multi-well plates, the same number of adaptor tags as wells (384 and 1536 in two non-limiting examples) is also used. In still other embodiments, the costs associated with generating adaptor tags are reduced via a new combinatorial tagging approach based on two sets of 40 half-barcode adaptor tags.
Reducing the volume to picoliter levels in 10,000 aliquots can achieve even greater cost reductions, perhaps as much as a 30-400 fold reduction in reagent cost and an additional 10 fold reduction in computational cost (over 100 fold in total). In some embodiments, this level of cost reduction and extensive aliquoting is achieved by combining the LFR method with combinatorial tagging in emulsion or microfluidic type devices. Furthermore, the development in the present invention of conditions for performing all six enzymatic steps in the same reaction without DNA purification provides the ability to miniaturize and automate the process and to adapt it to a wide variety of platforms and sample preparation methods.
Another advantage of LFR is that whole genome amplification can be much more efficient and show significantly less bias due to the small volumes and long fragments used in LFR. Many studies have examined the extent of unwanted amplification bias, background product formation, and chimeric artifacts introduced via Φ29-based MDA, but many of these drawbacks have occurred under extreme amplification conditions (greater than 1 million fold). LFR requires only one percent of that level of amplification. In addition, LFR starts with long DNA fragments (about 100 kb), which are critical for efficient MDA.
In one aspect, the invention provides diploid genome sequencing techniques that allow for calling parental haplotypes. LFR solves the problem of determining parental haplotypes by dividing corresponding parental DNA fragments greater than 100 kb in length into physically separate subgenomic aliquots. As the number of aliquots increases, e.g., to 1536, and the fraction of the genome per aliquot decreases to about 1% of the haploid genome, the statistical support for haplotypes increases significantly because the chance co-occurrence of both maternal and paternal haplotypes in the same well is reduced. Thus, a large number of small aliquots, with a negligible frequency of mixed haplotypes in each aliquot, allows for the use of fewer cells. Similarly, longer fragments (e.g., 300 kb or longer) help bridge segments lacking heterozygous loci.
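The effect of the aliquot count on mixed haplotypes can be illustrated with a simple probability model. This is a sketch under stated assumptions: at a given locus each starting cell contributes one maternal and one paternal fragment, and every fragment lands in an aliquot uniformly and independently. The cell count used in the example is arbitrary and the function name is hypothetical.

```python
def p_mixed_with_other_parent(aliquots: int, cells: int) -> float:
    """Probability that a given maternal fragment shares its aliquot with at
    least one of the `cells` paternal fragments covering the same locus
    (uniform, independent placement assumed)."""
    return 1.0 - (1.0 - 1.0 / aliquots) ** cells


if __name__ == "__main__":
    for wells in (96, 384, 1536, 10_000):
        print(wells, p_mixed_with_other_parent(wells, cells=10))
```

The probability falls roughly in proportion to the number of aliquots, which is the trend the paragraph above describes.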
An efficient algorithm for haplotyping can be performed by calculating the Percentage of Shared Aliquots (PSA) for a pair of adjacent alleles (fig. 29). This method tolerates aliquots with mixed haplotypes and instances of uncalled alleles in some aliquots. For 100 kb fragments from 20 cells aliquoted into a 1536 well plate, the average PSA for pairs representing true haplotypes decreases from nearly 100% to 21% as the distance between adjacent heterozygous sites increases from 0 to 80 kb. The PSA for a pair of alleles from a false haplotype can, in rare cases (less than 1%), reach 5-10% (1-2 out of 20 aliquots, which is close to the PSA for alleles separated by 80 kb in a true haplotype) because of the random chance that both haplotypes are present in the same aliquot. Thus, fragments even longer than 100 kb are required for haplotyping adjacent heterozygous loci that are more than 80 kb apart.
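A PSA calculation can be sketched as follows. The exact definition used for fig. 29 is not given in this passage, so the sketch assumes PSA is the number of aliquots containing both alleles divided by the smaller of the two per-allele aliquot counts. The second function is a crude geometric approximation (fragments of fixed length placed uniformly at random), not the simulation behind the figures quoted above. All names are hypothetical.

```python
from typing import Set


def psa(aliquots_a: Set[int], aliquots_b: Set[int]) -> float:
    """Percentage of shared aliquots for two alleles (assumed definition:
    shared aliquots relative to the smaller of the two sets)."""
    if not aliquots_a or not aliquots_b:
        return 0.0
    shared = len(aliquots_a & aliquots_b)
    return 100.0 * shared / min(len(aliquots_a), len(aliquots_b))


def expected_psa_true_haplotype(fragment_kb: float, distance_kb: float) -> float:
    """Share of fragments covering one site that also reach a second site
    `distance_kb` away, for uniformly placed fragments of fixed length."""
    return 100.0 * max(0.0, fragment_kb - distance_kb) / fragment_kb


if __name__ == "__main__":
    print(psa({1, 2, 3, 4, 5}, {5, 9}))
    print(expected_psa_true_haplotype(100, 0))
    print(expected_psa_true_haplotype(100, 80))
```

The geometric approximation gives a value near the 21% quoted for sites 80 kb apart, and it shows why the PSA of a true haplotype approaches the chance-level PSA of a false pair once the distance nears the fragment length.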
In one aspect, the methods and compositions of the invention provide a complete diploid genome sequencing technique that allows polymorphic loci to be called homozygous. Due to random sampling, there is a significant possibility that only one of the parental chromosomes has been sequenced at any given region of the genome. An expensive solution, and the method commonly employed in conventional sequencing techniques, is to provide high average read coverage across the entire genome. The present invention significantly reduces this problem because it requires much less sequence coverage than is required in conventional techniques. As a non-limiting example, consider a homozygous position in the human genome detected with 5 overlapping reads (the reference allele in 99.9% of cases). If such positions are called homozygous, the call would be incorrect in 0.5⁵, or 1/32, of cases (about 3%), given a probability of 0.5 for each read; that is to say, in 1/32 of cases all 5 reads are from the same chromosome and none are from the other. Because of this, it is generally preferable to indicate all of these positions as no-calls or half-calls. That results in millions of half-call positions per genome. If the method of the invention is used (1536 or more aliquots), 32/33 cases can be considered as actual homozygous positions (some of the 5 reads come from aliquots of each parent), while only the remaining 3% will be indicated as half-calls (all reads come from aliquots of one parent). To achieve this improvement, homozygous reference or SNP positions are called after haplotype phasing.
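The half-call reasoning can be sketched in a few lines. The first function restates the 0.5-per-read arithmetic from the example above; the second is a hypothetical decision rule that assumes each read has already been assigned to a parental haplotype through its aliquot, which is what phasing before calling makes possible. The labels and function names are illustrative only.

```python
from typing import Iterable


def p_all_reads_from_given_chromosome(n_reads: int) -> float:
    """Chance that every read comes from one given chromosome, with a
    probability of 0.5 per read."""
    return 0.5 ** n_reads


def call_position(read_parents: Iterable[str]) -> str:
    """Call 'homozygous' only when identical reads are backed by aliquots of
    both parents; otherwise fall back to a half-call."""
    parents = set(read_parents)
    if {"maternal", "paternal"} <= parents:
        return "homozygous"
    return "half-call"


if __name__ == "__main__":
    print(p_all_reads_from_given_chromosome(5))
    print(call_position(["maternal", "paternal", "maternal", "maternal", "maternal"]))
    print(call_position(["maternal"] * 5))
```

Only positions where every read traces back to a single parent remain half-calls, which is the small residual fraction described in the paragraph above.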
Similar advantages can be achieved in reducing the rate of false positive calls. Most false calls have lower, but still sufficient, read support for the second allele. Using LFR data, false positive cases can be identified by determining that the better supported allele is present in aliquots from both parents. For example, a common situation encountered in sequencing is a region covered by 7 reads, of which 5 correspond to A at a particular locus and 2 correspond to G. If the two G reads are false (e.g., from a mutation introduced during DNA processing), they will most likely come from the same aliquot, whereas the 5 A reads will come from multiple aliquots belonging to both parents. This would indicate homozygous A at the locus in question.
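The 7-read example can be sketched in Python as follows. This is only an illustration under stated assumptions: each read is assumed to carry the identifier of its aliquot and the parental haplotype ("P1" or "P2") assigned to that aliquot during phasing, and the function names and decision rule are hypothetical rather than taken from the text.

```python
from collections import defaultdict


def allele_aliquot_support(reads):
    """Group the reads at one locus by allele.

    reads: iterable of (allele, aliquot_id, parent) tuples, where parent is the
    haplotype ("P1" or "P2") assigned to the aliquot during phasing.
    Returns {allele: {"aliquots": set, "parents": set}}.
    """
    support = defaultdict(lambda: {"aliquots": set(), "parents": set()})
    for allele, aliquot, parent in reads:
        support[allele]["aliquots"].add(aliquot)
        support[allele]["parents"].add(parent)
    return dict(support)


def call_locus(reads):
    """Call a genotype from aliquot-level support (hypothetical decision rule)."""
    support = allele_aliquot_support(reads)
    # An allele seen in aliquots of both parents is present on both chromosomes.
    both = sorted(a for a, s in support.items() if len(s["parents"]) == 2)
    if len(both) == 1:
        return f"{both[0]}/{both[0]}"
    # Two alleles, each confined to a different parent, indicate a heterozygote.
    single = {a: next(iter(s["parents"])) for a, s in support.items()
              if len(s["parents"]) == 1}
    if not both and len(single) == 2 and len(set(single.values())) == 2:
        first, second = sorted(single)
        return f"{first}/{second}"
    return "no-call"


# Five A reads from aliquots of both parents; two G reads from a single aliquot.
reads = [("A", 3, "P1"), ("A", 17, "P1"), ("A", 42, "P2"), ("A", 58, "P2"),
         ("A", 90, "P1"), ("G", 42, "P2"), ("G", 42, "P2")]
genotype = call_locus(reads)
```

Here the two G reads come from one aliquot while A is supported by aliquots of both parents, so the sketch reports the locus as homozygous A.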
Mapping short reads to a reference genome, although computationally less complicated than de novo assembly, requires substantial computation, particularly in the presence of divergent or new sequences created by multiple mutations, insertions, and/or deletions. Such genomic segments require local or global de novo assembly of short sequence reads. Coupled with the reduction in reagent and imaging costs for a new generation of DNA arrays with 3-6 billion spots per microscope slide (1-4 genomes per slide), the computational effort of sequence assembly quickly becomes the major cost of genome sequencing. One way to reduce the costs associated with whole genome sequencing is to reduce these computational requirements.
The present invention provides LFR methods (with more than 1500 aliquots) that address the computational problems of short read sequencing at multiple levels: (a) fast mapping of reads to the reference sequence, (b) minimization of the number of loci that require extensive local assembly, and (c) local and global de novo assembly that is faster by an order of magnitude. This is achieved in part because less than 1% of the genome is assembled in each local assembly. In essence, human genome assembly is reduced to the equivalent of 1000 bacterial genome assemblies. In one aspect, the following sequence assembly process is used:
1. Mapping less than 1% of the reads to the whole genome reference
2. Defining a 3-10 Mb reference sequence for each aliquot (for 10,000 aliquots)
3. Mapping all reads from each aliquot to the short aliquot reference
4. Calling about 80% of the apparent heterozygous positions
5. Establishing parental chromosome haplotypes by phasing heterozygous loci
6. Calling all homozygous reference (no variation) positions, SNPs, short indels, and low-coverage heterozygous positions
7. Defining the sequence of the remaining approximately 40K regions (1 out of about 1 million bases) that require extensive (including de novo) assembly
As an example of reducing the cost of mapping (a), consider sequencing and mapping DNA from 5 cells that has been divided into 10,000 aliquots, each containing 0.1% of a haploid human genome (3 Mb, or thirty 100 kb fragments). If each aliquot is sequenced to 4-fold coverage with 120-base reads, there will be about 100,000 reads per aliquot (3 Mb × 4/120). Each 100 kb fragment in an aliquot will be covered by about 3,300 reads. By mapping 500 (or 0.5%) of the reads in an aliquot against the entire human reference (step 1), about 15 reads per fragment in total, a reference segment corresponding to each fragment in the aliquot is defined (step 2). The remaining reads are then mapped to the 0.1-0.2% of the reference (3-6 Mb) that makes up the composite reference uniquely defined for each aliquot (step 3). This method uses only about 1% of the total mapping effort required in the absence of LFR, that is, a 100-fold reduction in the computational cost of mapping. In one embodiment, the invention includes software for rapid collection and indexing of aliquot reference sequences.
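The arithmetic of this example can be reproduced with a short Python sketch. The function and parameter names are illustrative, and the relative mapping effort is computed under the simplifying assumption (not stated in the text) that the cost of mapping a read is proportional to the size of the reference it is mapped against.

```python
def lfr_mapping_budget(cells=5, n_aliquots=10_000, genome_bp=3_000_000_000,
                       fragment_bp=100_000, coverage=4, read_bp=120,
                       seed_reads=500, aliquot_ref_bp=6_000_000):
    """Reproduce the per-aliquot numbers of the mapping example."""
    # Each diploid cell contributes two haploid genomes.
    dna_per_aliquot_bp = cells * 2 * genome_bp // n_aliquots
    fragments = dna_per_aliquot_bp // fragment_bp
    reads_per_aliquot = dna_per_aliquot_bp * coverage // read_bp
    reads_per_fragment = fragment_bp * coverage // read_bp
    seed_fraction = seed_reads / reads_per_aliquot
    # Assumption: mapping cost per read scales with the size of the reference.
    relative_effort = (seed_fraction
                       + (1 - seed_fraction) * aliquot_ref_bp / genome_bp)
    return {
        "dna_per_aliquot_bp": dna_per_aliquot_bp,
        "fragments": fragments,
        "reads_per_aliquot": reads_per_aliquot,
        "reads_per_fragment": reads_per_fragment,
        "seed_fraction": seed_fraction,
        "seed_reads_per_fragment": seed_reads / fragments,
        "relative_effort": relative_effort,
    }


budget = lfr_mapping_budget()
for name, value in budget.items():
    print(f"{name}: {value}")
```

With the default values, which are taken from the example (using the upper 6 Mb figure for the aliquot reference), the sketch gives 3 Mb and 30 fragments per aliquot, 100,000 reads per aliquot, and a relative mapping effort below 1% of mapping every read against the whole genome.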
The present invention improves the efficiency of diploid genome sequencing by first defining haplotypes (steps 4 and 5) and then using aliquot-haplotype pairing to achieve accurate and computationally efficient base (variant) calling for most of the remaining bases (step 6). For example, almost 3 billion base positions in an individual human genome are in the reference/reference homozygous state; without LFR haplotype information, more than 100 million of these positions cannot be called on both chromosomes without extensive evaluation of new sequences. With LFR, most of these positions can be accurately called as reference/reference without any de novo type sequence assembly. This results in an approximately 1000-fold reduction in computation for this genome assembly step. In addition, 99.9% of all variants in the genome (e.g., SNPs and 1-2 base indels) would be accurately called at this step, and the remaining 0.1% (about 4 thousand of the approximately 4 million variants found per individual human genome), representing more complex changes, would be resolved in step 7.
Assuming standard 40-fold coverage of the haploid genome (1 billion 120-base reads), about 100,000 reads (from about 10 of the 10,000 aliquots) can be used for de novo assembly of sequences containing unresolved positions in the parental chromosomes (step 7). This is much more efficient than using the more than 100 million (greater than 10%) unused reads expected in a standard assembly without LFR. In addition, false assemblies are minimized even when the overlap between consecutive reads is short. In this manner, a cost reduction of over 100-fold per de novo assembly site can be achieved.
The ability of the LFR technique of the present invention to sequence and assemble very long (greater than 100 kb) fragments of the genome makes it well suited for sequencing complete cancer genomes. It has been suggested that more than 90% of cancers contain significant losses or gains of regions of the human genome, termed aneuploidy, and some individual cancers have been observed to contain more than 4 copies of some chromosomes. This increased complexity in the copy number of chromosomes, and of regions within chromosomes, may make sequencing by methods other than LFR untenable.
In still other embodiments, the invention utilizes automation to further reduce the costs associated with whole genome sequencing. The methods and compositions of the present invention also include miniaturization, which can be achieved by a number of techniques, including the use of nanoliter droplets. In yet other embodiments, droplets of about 10-20 nanoliters are deposited in a plate or on a glass slide using modified nanoliter-scale pipetting or acoustic droplet ejection techniques (e.g., LabCyte Inc.) or using a microfluidic device capable of processing up to 9216 individual reaction wells in 3072-.
In one aspect, the invention encompasses software capable of processing data from more than 10,000 aliquots. Because aliquot mapping is performed against a reference of only a few megabases, the Smith-Waterman algorithm can be used, instead of fast indexing, for reads with indels that fail to map. This allows accurate alignment, in a cost-effective manner, even of reads to reference sequences with multiple changes or indels.
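For illustration, a minimal Python sketch of Smith-Waterman local alignment is given below. It uses a linear gap penalty, and the scoring values (match 2, mismatch -1, gap -2) and the function name are assumptions chosen for the example, not parameters from the text.

```python
def smith_waterman(read, ref, match=2, mismatch=-1, gap=-2):
    """Local alignment of `read` against `ref` with a linear gap penalty.

    Returns (score, aligned_read, aligned_ref, ref_start), where ref_start is
    the 0-based position in `ref` at which the alignment begins.
    """
    rows, cols = len(read) + 1, len(ref) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if read[i - 1] == ref[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best-scoring cell until a zero cell is reached.
    i, j = best_pos
    aligned_read, aligned_ref = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if read[i - 1] == ref[j - 1] else mismatch)
        if H[i][j] == diag:
            aligned_read.append(read[i - 1])
            aligned_ref.append(ref[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            aligned_read.append(read[i - 1])   # base present only in the read
            aligned_ref.append("-")
            i -= 1
        else:
            aligned_read.append("-")           # base present only in the reference
            aligned_ref.append(ref[j - 1])
            j -= 1
    return best, "".join(reversed(aligned_read)), "".join(reversed(aligned_ref)), j


# A read carrying a one-base insertion relative to the reference.
score, aln_read, aln_ref, start = smith_waterman("ACGTTGCA", "TTACGTGCATT")
```

The dynamic-programming table has one cell per read base and reference base, so the cost grows with the product of the two lengths. This is why the approach becomes practical when each aliquot's reference is only a few megabases rather than the whole genome.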
III.B. Other sequencing methods
As will be appreciated, the nucleic acids of the invention (including fragments and DNBs in a library of LFR aliquots) may be used in any sequencing method known in the art, including, but not limited to, sequencing by ligation, sequencing by hybridization, sequencing by synthesis (including sequencing by primer extension), chain sequencing by ligation of cleavable probes, and the like.
Methods similar to those described herein for sequencing can also be used to detect specific sequences in target nucleic acids, including detection of Single Nucleotide Polymorphisms (SNPs). In such methods, sequencing probes that will hybridize to a particular sequence, such as a sequence containing a SNP, may be used. Such sequencing probes can be differentially labeled to identify which SNP is present in a target nucleic acid. Anchor probes may also be used in combination with such sequencing probes to provide further stability and specificity.
In one aspect, the methods and compositions of the invention are used in combination with techniques such as those described in WO2007120208, WO2006073504, WO2007133831, and US2007099208 and U.S. patent application Nos. 60/992,485, 61/026,337, 61/035,914, 61/061,134, 61/116,193, 61/102,586, 12/265,593, 12/266,385, 11/938,096, 11/981,804, 11/981,797, 11/981,793, 11/981,767, 11/981,761, 11/981,730, 11/981,685, 11/981,661, 11/981,607, 11/981,605, 11/927,388, 11/927,356, 11/679,124, 11/541,225, 10/547,214, 11/451,692, and 11/451,691, all of which are incorporated herein by reference in their entirety for all purposes, and in particular for all teachings relating to sequencing, especially nucleic acid sequencing.
In yet another aspect, the sequence of the nucleic acid is identified using sequencing methods known in the art, including but not limited to hybridization-based methods, such as Drmanac, U.S. patents 6,864,052, 6,309,824, and 6,401,267, and Drmanac et al., U.S. patent publication 2005/0191656; sequencing-by-synthesis methods, such as Nyren et al., U.S. Pat. No. 6,210,891, Ronaghi, U.S. Pat. No. 6,828,100, Ronaghi et al. (1998), Science, 281: 363-; and ligation-based methods, such as Shendure et al. (2005), Science, 309: 1728-.
III.B.1. cPAL
Although described below with respect to sequencing DNA, any of the sequencing methods described herein can also be applied to target nucleic acid fragments, such as those generated by the LFR sequencing methods described above. As should be further appreciated, the present invention also encompasses a combination of sequencing methods.
In one aspect, the sequence of the DNB is identified using a method referred to herein as combinatorial probe-anchor ligation (cPAL) and variations thereof, as described below. Briefly, cPAL involves identifying a nucleotide at a specific detection position in a target nucleic acid by detecting a probe ligation product formed by ligation of at least one anchor probe that is fully or partially hybridized to an adaptor and a sequencing probe that contains a specific nucleotide at an "interrogation site" corresponding to (e.g., able to hybridize to) the detection position. If the nucleotide at the interrogation site is complementary to the nucleotide at the detection position, ligation can occur, and the resulting ligation product contains a unique label that can be detected. A description of various exemplary embodiments of the cPAL method is provided below. It is to be understood that the following description is not intended to be limiting, and that variations of the embodiments described below are contemplated in the present invention.
"Complementary" or "substantially complementary" refers to hybridization or base pairing or duplex formation between nucleotides or nucleic acids, such as, for example, between two strands of a double-stranded DNA molecule or between an oligonucleotide primer and a primer binding site on a single-stranded nucleic acid. Complementary nucleotides are typically A and T (or A and U) or C and G. Two single-stranded RNA or DNA molecules are said to be substantially complementary when the nucleotides of one strand, optimally aligned and compared and with appropriate nucleotide insertions or deletions, pair with at least about 80%, usually at least about 90% to about 95%, and even about 98% to 100% of the other strand.
"Hybridization" as used herein refers to the process by which two single-stranded polynucleotides bind non-covalently to form a stable double-stranded polynucleotide. The resulting (usually) double-stranded polynucleotide is a "hybrid" or "duplex". "Hybridization conditions" generally include salt concentrations of less than about 1 M, more usually less than about 500 mM, and possibly less than about 200 mM. A "hybridization buffer" is a buffered salt solution, such as 5× SSPE or other such buffers known in the art. Hybridization temperatures can be as low as 5 ℃ but are generally above 22 ℃, more typically above about 30 ℃, and generally above 37 ℃. Hybridization is generally performed under stringent conditions, i.e., conditions under which a probe will hybridize to its target subsequence but not to other, non-complementary sequences. Stringent conditions are sequence dependent and differ in different circumstances. For example, longer fragments may require higher hybridization temperatures than shorter fragments for specific hybridization. Because other factors, including base composition and length of the complementary strands, the presence of organic solvents, and the extent of base mismatching, may affect the stringency of hybridization, the combination of parameters is more important than the absolute measure of any one parameter alone. Typically, stringent conditions are selected to be about 5 ℃ below the Tm of the specific sequence at a defined ionic strength and pH. Exemplary stringent conditions include a salt concentration of at least 0.01 M to no more than 1 M sodium ion (or other salt), a pH of about 7.0 to about 8.3, and a temperature of at least 25 ℃. For example, conditions of 5× SSPE (750 mM NaCl, 50 mM sodium phosphate, 5 mM EDTA, pH 7.4) and 30 ℃ are suitable for allele-specific probe hybridization.
Other examples of stringent conditions are known in the art; see, e.g., Sambrook J et al. (2001), Molecular Cloning, A Laboratory Manual (3rd Ed., Cold Spring Harbor Laboratory Press).
The term "Tm" as used herein generally refers to the temperature at which half of the double-stranded nucleic acid molecules in a population are dissociated into single strands. Formulas for calculating the Tm of nucleic acids are well known in the art. As indicated in standard references, when the nucleic acid is in an aqueous solution having a cation concentration of 0.5 M or less and the (G+C) content is between 30% and 70%, a simple estimate of the Tm value can be calculated by the formula $T_m = 81.5 + 16.6\,\log_{10}[\mathrm{Na}^+] + 0.41\,(\%[\mathrm{G+C}]) - 675/n - 1.0\,m$, where n is the number of bases and m is the percentage of mismatched base pairs (see, e.g., Sambrook J et al. (2001), Molecular Cloning, A Laboratory Manual, 3rd Ed., Cold Spring Harbor Laboratory Press). Other references contain more complex computational methods that take structural and sequence properties into account when calculating Tm (see also Anderson and Young (1985), Quantitative Filter Hybridization, in Nucleic Acid Hybridization; and Allawi and SantaLucia (1997), Biochemistry 36: 10581-94).
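As a worked illustration of this simple estimate, the following Python sketch evaluates the formula for a given sequence. The function name and the choice to raise an error outside the stated validity range (cation concentration of 0.5 M or less, G+C content between 30% and 70%) are assumptions made for the example; the mismatch term is taken as a percentage.

```python
import math


def estimate_tm(seq, na_molar, mismatch_percent=0.0):
    """Simple Tm estimate (deg C): 81.5 + 16.6*log10[Na+] + 0.41*(%G+C) - 675/n - 1.0*m."""
    seq = seq.upper()
    n = len(seq)
    gc_percent = 100.0 * (seq.count("G") + seq.count("C")) / n
    # The estimate is stated only for [Na+] <= 0.5 M and 30% <= %(G+C) <= 70%.
    if not 0 < na_molar <= 0.5:
        raise ValueError("cation concentration must be in (0, 0.5] M")
    if not 30.0 <= gc_percent <= 70.0:
        raise ValueError("G+C content must be between 30% and 70%")
    return (81.5 + 16.6 * math.log10(na_molar) + 0.41 * gc_percent
            - 675.0 / n - 1.0 * mismatch_percent)


# A 20-base sequence with 50% G+C in 0.1 M sodium, without and with 5% mismatch.
tm_perfect = estimate_tm("ACGTACGTACGTACGTACGT", 0.1)
tm_mismatch = estimate_tm("ACGTACGTACGTACGTACGT", 0.1, mismatch_percent=5.0)
```

For this 20-mer the terms are 81.5 - 16.6 + 20.5 - 33.75, giving an estimate of about 51.65 ℃, and each percent of mismatch lowers the estimate by 1 ℃.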
In one example of the cPAL method, referred to herein as "single cPAL" and shown in FIG. 23, an anchor probe 2302 hybridizes to a complementary region in an adaptor 2308 of DNB 2301. The anchor probe 2302 hybridizes to the adaptor region directly adjacent to the target nucleic acid 2309, but in some cases the anchor probe can be designed to "reach" into the target nucleic acid by incorporating a desired number of degenerate bases at the end of the anchor probe, as illustrated in FIG. 24 and described further below. A pool of differentially labeled sequencing probes 2305 hybridizes to complementary regions of the target nucleic acid, and sequencing probes that hybridize adjacent to the anchor probes are ligated, typically by a ligase, to form probe ligation products. Sequencing probes are typically sets or pools of oligonucleotides comprising two parts: different nucleotides at the interrogation site, and all possible bases (or a universal base) at the other positions; each pool therefore represents each base type at a particular position. The sequencing probes are labeled with a detectable label that distinguishes each sequencing probe from sequencing probes containing other nucleotides at that position. Thus, in the example shown in FIG. 23, a sequencing probe 2310 that hybridizes adjacent to anchor probe 2302 and is ligated to it identifies the base at the position 5 bases from the adaptor in the target nucleic acid as a "G". FIG. 23 depicts a situation in which the interrogated base is 5 bases away from the ligation site, but as described more fully below, the interrogated base may be closer to the ligation site, and in some cases at the point of ligation. After ligation, unligated anchor and sequencing probes are washed away, and the ligation products present on the array are detected using the label.
Multiple cycles of hybridization and ligation of anchor probes and sequencing probes can be used to identify a desired number of bases of the target nucleic acid on each side of each adaptor in the DNB. Hybridization of the anchor probe and the sequencing probe can occur sequentially or simultaneously. The fidelity of the base call depends in part on the fidelity of the ligase, which generally will not ligate if there is a mismatch near the ligation site.
The invention also provides methods that use two or more anchor probes in each hybridization-ligation cycle. FIG. 25 shows an example of a "double cPAL with overhang" method in which a first anchor probe 2502 and a second anchor probe 2505 each hybridize to a complementary region of an adaptor. In the example shown in FIG. 25, first anchor probe 2502 is fully complementary to a first region of adaptor 2511, and second anchor probe 2505 is complementary to a second region of the adaptor adjacent to the site of hybridization of the first anchor probe. The second anchor probe also comprises degenerate bases at the end that is not adjacent to the first anchor probe. In this way, the second anchor probe is able to hybridize to a region of the target nucleic acid 2512 adjacent to adaptor 2511 (the "overhang"). The second anchor probe is generally too short to remain stably hybridized in a duplex on its own, but upon ligation to the first anchor probe it forms a longer anchor probe that hybridizes stably in subsequent steps. As discussed above for the "single cPAL" method, a pool of sequencing probes 2508, representing each base type at a detection position of the target nucleic acid and labeled with a detectable label capable of distinguishing each sequencing probe from sequencing probes containing other nucleotides at that position, hybridizes to the adaptor-anchor probe duplex and is ligated to the terminal 5' or 3' base of the ligated anchor probes. In the example shown in FIG. 25, the sequencing probe is designed to interrogate the base 5 nucleotides 5' of the ligation point between sequencing probe 2514 and ligated anchor probes 2513. Because second anchor probe 2505 has 5 degenerate bases at its 5' end, it reaches 5 bases into the target nucleic acid 2512, allowing the sequencing probe to interrogate a base a full 10 bases from the boundary between target nucleic acid 2512 and adaptor 2511.
In certain variations of the double cPAL method described above, if the first anchor probe terminates closer to the end of the adaptor, the second anchor probe will be proportionately more degenerate and will therefore have a greater potential not only to ligate to the end of the first anchor probe but also to ligate to other second anchor probes at multiple sites on the DNB. To prevent such ligation artifacts, the second anchor probes can be selectively activated so that they engage in ligation with a first anchor probe or a sequencing probe. Such activation methods are described in more detail below and include, for example, selectively modifying the ends of the anchor probes so that they can only be ligated to a particular anchor probe or sequencing probe in a particular orientation relative to the adaptor.
Similar to the dual cPAL approach described above, it is understood that the use of three or more anchor probes is also encompassed by the present invention.
In addition, sequencing reactions can be performed at one or both ends of each adaptor. For example, a sequencing reaction can be "unidirectional", with detection occurring 3' or 5' of the adaptor, or "bidirectional", with bases detected at detection positions both 3' and 5' of the adaptor. Bidirectional sequencing reactions can be carried out simultaneously, i.e., the bases on both sides of the adaptor are detected at the same time, or sequentially in any order.
Multiple cycles of cPAL (whether single, double, triple, etc.) identify multiple bases in the regions of the target nucleic acid adjacent to the adaptors. Briefly, the cPAL method is repeated to interrogate multiple adjacent bases within a target nucleic acid by cycling anchor probe hybridization and enzymatic ligation reactions with pools of sequencing probes designed to detect nucleotides at different positions away from the junction between the adaptor and the target nucleic acid. In any given cycle, the sequencing probes used are designed such that the identity of one or more bases at one or more positions is correlated with the identity of the label attached to that sequencing probe. Once the ligated sequencing probe (and hence the base at the interrogation site) is detected, the ligated complex is stripped off the DNB, and a new cycle of anchor probe and sequencing probe hybridization and ligation is performed.
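The cycling logic can be illustrated with a highly simplified Python simulation. It is a sketch under stated assumptions only: ligation is treated as error-free, strand orientation and anchor probe hybridization are ignored, one position is read per cycle, and the dye names and function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Hypothetical dye assigned to each base at the probe's interrogation site.
DYE_FOR_PROBE_BASE = {"A": "dye1", "T": "dye2", "C": "dye3", "G": "dye4"}
PROBE_BASE_FOR_DYE = {dye: base for base, dye in DYE_FOR_PROBE_BASE.items()}


def ligated_dye(target, position):
    """Dye detected when the interrogation site lies `position` bases (1-based)
    from the adaptor: only the probe whose interrogation base is complementary
    to the target base at that position is assumed to ligate."""
    probe_base = COMPLEMENT[target[position - 1]]
    return DYE_FOR_PROBE_BASE[probe_base]


def read_bases(target, n_positions):
    """Call `n_positions` bases adjacent to the adaptor, one cycle per position."""
    calls = []
    for position in range(1, n_positions + 1):
        dye = ligated_dye(target, position)      # hybridize, ligate, wash, detect
        probe_base = PROBE_BASE_FOR_DYE[dye]     # decode the label
        calls.append(COMPLEMENT[probe_base])     # target base pairs with probe base
        # The ligated complex would be stripped here before the next cycle.
    return "".join(calls)


called = read_bases("GATTACA", 5)
```

Each pass through the loop stands for one complete hybridization, ligation, detection, and stripping cycle, with the probe pool for that cycle determining which position is interrogated.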
It will be appreciated that, in addition to the cPAL method described above, the DNBs of the present invention may be used in other sequencing methods, including other sequencing-by-ligation methods as well as other sequencing methods, including but not limited to sequencing by hybridization, sequencing by synthesis (including sequencing by primer extension), chained sequencing by ligation of cleavable probes, and the like.
Sequencing methods similar to those described above can also be used to detect specific sequences in target nucleic acids, including the detection of single nucleotide polymorphisms (SNPs). In such methods, sequencing probes that are capable of hybridizing to a particular sequence (e.g., a sequence containing a SNP) are used. The sequencing probes can be differentially labeled to identify which SNP is present in the target nucleic acid. Anchor probes can also be used in combination with such sequencing probes to provide greater stability and specificity.
The target nucleic acid used in the sequencing method of the present invention comprises a target sequence having a plurality of detection positions. The term "detection position" refers to a position in a target nucleic acid for which sequence information is desired. As will be appreciated by those skilled in the art, typically a target sequence contains a plurality of detection positions for which sequence information is desired, for example, sequencing of the entire genome as described herein. In some cases, for example in SNP analysis, it may be desirable to read only a single SNP in a particular region.
As discussed above, the present invention provides sequencing methods that use an anchor probe and a sequencing probe in combination. As used herein, a "sequencing probe" refers to an oligonucleotide designed to provide the identity of a nucleotide at a specific detection position in a target nucleic acid. The sequencing probe hybridizes to a domain within the target sequence, e.g., a first sequencing probe may hybridize to a first target domain and a second sequencing probe to a second target domain. The terms "first target domain" and "second target domain" or grammatical equivalents herein mean two portions of a target sequence within a nucleic acid under test. The first target domain may be adjacent to the second target domain, or the first and second target domains may be separated by an intervening sequence (e.g., an adaptor). The terms "first" and "second" are not intended to convey the orientation of the sequences with respect to the 5'-3' orientation of the target sequence. For example, assuming a 5'-3' orientation of the complementary target sequence, the first target domain may be located either 5' or 3' of the second domain. The sequencing probes may overlap, e.g., a first sequencing probe may hybridize to the first 6 bases adjacent to one end of the adaptor, and a second sequencing probe may hybridize to the 3rd to 9th bases from the end of the adaptor (e.g., when the anchor probe has three degenerate bases). Alternatively, the first sequencing probe may hybridize to the 6 bases adjacent to the "upstream" end of the adaptor and the second sequencing probe may hybridize to the 6 bases adjacent to the "downstream" end of the adaptor.
Sequencing probes will typically comprise a number of degenerate bases and specific nucleotides located at specific positions within the probe to allow interrogation of the detection site (also referred to herein as an "interrogation site").
In general, when degenerate bases are used, a pool of sequencing probes is used. That is, a probe having the sequence "NNNANN" is actually a set of probes containing all possible combinations of the four nucleotide bases at five positions (i.e., 1024 sequences), with adenine at the one fixed position. (As indicated herein, this terminology also applies to anchor probes: for example, when an anchor probe contains "three degenerate bases", it is actually a set of anchor probes that contain the sequence corresponding to the anchor site and all possible combinations at 3 positions, i.e., a pool of 64 probes.)
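The size of such a pool can be checked by enumerating it. The following Python sketch expands a pattern in which "N" stands for any of the four bases; the function name is illustrative only.

```python
from itertools import product


def expand_probe(pattern, alphabet="ACGT"):
    """List every sequence represented by a probe pattern, where N is any base."""
    choices = [alphabet if ch == "N" else ch for ch in pattern]
    return ["".join(bases) for bases in product(*choices)]


pool = expand_probe("NNNANN")     # sequencing probe pool with A at the fixed position
overhang = expand_probe("NNN")    # three degenerate bases of an anchor probe
print(len(pool), len(overhang))
```

Five degenerate positions give 4^5 = 1024 sequences and three give 4^3 = 64, matching the pool sizes stated above.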
In some embodiments, for each interrogation site, four differently labeled pools can be combined into a single pool and used in a sequencing step. Thus, in any particular sequencing step, 4 pools are used, each with a different specific base at the interrogation site and with a different label corresponding to the base at the interrogation site. That is, the sequencing probes are also labeled, wherein a particular nucleotide at a particular interrogation site is associated with a label that is different from the label of a sequencing probe with a different nucleotide at the same interrogation site. For example, four pools, NNNANN-dye 1, NNNTNN-dye 2, NNNCNN-dye 3, and NNNGNN-dye 4, can be used in one step, as long as the dyes are optically resolvable. In certain embodiments, for example for SNP detection, it may only be necessary to include two pools, since the SNP will be only, e.g., a C or an A. Similarly, some SNPs have three possibilities. Alternatively, in certain embodiments, if the reactions are performed sequentially rather than simultaneously, the same dye can be used in different steps: for example, the NNNANN-dye 1 probe pool can be used alone in a reaction, a signal detected or not, and the probes washed away; then a second pool, NNNTNN-dye 1, can be introduced.
In any of the sequencing methods described herein, the sequencing probes can be of varying lengths, including from about 3 to about 25 bases. In other embodiments, the length of the sequencing probes may be in the range of about 5 to about 20, about 6 to about 18, about 7 to about 16, about 8 to about 14, about 9 to about 12, and about 10 to about 11 bases.
The sequencing probes of the invention are designed to be complementary, and typically fully complementary, to sequences in the target sequence, such that hybridization between a portion of the target sequence and the probes of the invention can occur. In particular, it is important that the interrogation site base and the detection position base be fully complementary; the methods of the invention do not produce a signal unless they are.
In many embodiments, the sequencing probes and the target sequences to which they hybridize are fully complementary; that is, the assay is performed under conditions that favor the formation of complete base pairing as is known in the art. It will be appreciated by those skilled in the art that a sequencing probe that is fully complementary to a first domain of a target sequence can only be substantially complementary to a second domain of the same target sequence; that is, the present invention relies in many cases on the use of a set of probes, e.g., a set of hexamers that are perfectly complementary to some target sequences, but not to others.
In some embodiments, depending on the particular application, the complementarity between the sequencing probe and the target sequence need not be perfect; there may be any number of base pair mismatches that interfere with hybridization between the target sequence and the single-stranded nucleic acids of the invention. However, if the number of mismatches is so great that no hybridization can occur under even the least stringent hybridization conditions, the sequence is not complementary to the target sequence. Thus, "substantially complementary" as used herein means that the sequencing probe is complementary to the target sequence to an extent sufficient to hybridize under normal reaction conditions. However, for most applications, conditions are set to favor probe hybridization only if complete complementarity exists. Alternatively, there should be sufficient complementarity for the ligase reaction to occur; i.e., some parts of the sequence may have mismatches, but the interrogation site base should allow ligation only if it is perfectly complementary at that site.
In some cases, probes of the invention may use universal bases that hybridize to more than one base, in addition to or in place of degenerate bases. For example, inosine may be used. Any combination of these systems and probe compositions may be employed.
Sequencing probes used in the methods of the invention are typically detectably labeled. As used herein, "label" or "labeled" means that a compound has at least one element, isotope, or chemical compound attached to it to enable detection of the compound. Generally, labels useful in the present invention include, but are not limited to, isotopic labels, which can be radioactive or heavy isotopes, magnetic labels, electronic labels, heat-sensitive labels, chromogenic and luminescent dyes, enzymes, magnetic particles, and the like. The dyes used in the invention may be chromophores, phosphors, or fluorescent dyes, whose strong signals provide a good signal-to-noise ratio for decoding. Sequencing probes may also be labeled with quantum dots, fluorescent nanobeads, or other structures containing more than one molecule of the same fluorophore. Labels comprising multiple molecules of the same fluorophore generally provide a stronger signal and are less sensitive to quenching than labels comprising a single fluorophore molecule. Any discussion herein of labels comprising fluorophores should be understood as applicable to labels comprising single or multiple fluorophore molecules.
Many embodiments of the present invention involve the use of fluorescent labels. Dyes suitable for use in the present invention include, but are not limited to, fluorescent rare earth complexes (including those of europium and terbium), fluorescein, rhodamine, tetramethylrhodamine, eosin, erythrosine, coumarin, methylcoumarin, pyrene, Malachite green, stilbene, Lucifer Yellow, Cascade Blue™, Texas Red, and the other dyes described in the 6th edition of the Molecular Probes Handbook by Richard P. Commercially available fluorescent dyes that can be used with any nucleotide for incorporation into nucleic acids include, but are not limited to, Cy3 and Cy5 (Amersham Biosciences, Piscataway, New Jersey, USA), fluorescein, tetramethylrhodamine, Texas Red, Cascade Blue, BODIPY FL-14, BODIPY R, BODIPY TR-14, Rhodamine Green™, Oregon Green 488, BODIPY 630/650, BODIPY 650/665, Alexa Fluor 488, Alexa Fluor 532, Alexa Fluor 568, Alexa Fluor 594, Alexa Fluor 546 (Molecular Probes, Inc., Eugene, OR, USA), Quasar 570, Quasar 670, and Cal Red 610 (Biosearch Technologies, Novato, CA). Other fluorophores that can be attached post-synthetically include Alexa Fluor 350, Alexa Fluor 532, Alexa Fluor 546, Alexa Fluor 568, Alexa Fluor 594, Alexa Fluor 647, BODIPY 493/503, BODIPY FL, BODIPY R6G, BODIPY 530/550, BODIPY TMR, BODIPY 558/568, BODIPY 564/570, BODIPY 576/589, BODIPY 581/591, BODIPY 630/650, BODIPY 650/665, Cascade Blue, Cascade Yellow, Dansyl, lissamine rhodamine B, Marina Blue, Oregon Green 488, Oregon Green 514, Pacific Blue, rhodamine 6G, rhodamine green, rhodamine red, tetramethylrhodamine, and Texas Red (available from Molecular Probes, Inc., Eugene, OR, USA), and Cy2, Cy3.5, Cy5.5, and Cy7 (Amersham Biosciences, Piscataway, NJ, USA), among others. In certain embodiments, labels including fluorescein, Cy3, Texas Red, Cy5, Quasar 570, Quasar 670, and Cal Red 610 are used in the methods of the invention.
Labels can be attached to nucleic acids to form the labeled sequencing probes of the invention, and can be attached at various positions on the nucleoside, using methods known in the art. For example, attachment may be at one or both ends of the nucleic acid, or at an internal position, or both. In one embodiment, the label may be attached to the ribose of the ribose-phosphate backbone at the 2' or 3' position via an amide or amine bond (the latter case being used for end labeling). Attachment can also be via a phosphate in the ribose-phosphate backbone, or to the base of a nucleotide. Labels may be attached to one or both ends of the probe, or to any one of the nucleotides along the probe.
The structure of the sequencing probe varies depending on the desired interrogation site. For example, for sequencing probes that are labeled with fluorophores, one site in each sequencing probe will correspond to the identity of the fluorophore used for the labeled probe. In general, the fluorophore molecule will be attached to the sequencing probe opposite the end to which the anchor probe will be attached.
"Anchor probe" as used herein means an oligonucleotide designed to be complementary to at least a portion of an adaptor (referred to herein as an "anchor site"). As described herein, an adaptor may contain multiple anchor sites for hybridization to multiple anchor probes. As discussed further herein, an anchor probe for use in the present invention may be designed to hybridize to an adaptor such that at least one end of the anchor probe is flush with one end of the adaptor (either "upstream" or "downstream" or both). In other embodiments, the anchor probe can be designed to hybridize to at least a portion of the adaptor (the first adaptor site) and at least one nucleotide ("overhang") in the target nucleic acid adjacent to the adaptor. As shown in fig. 24, anchor probe 2402 comprises a sequence complementary to a portion of an adaptor. Anchor probe 2402 also contains 4 degenerate bases at one end. This degeneracy allows a portion of the population of anchor probes to completely or partially match the target nucleic acid sequence adjacent to the adapter and allows the anchor probes to hybridize to and extend into the target nucleic acid adjacent to the adapter regardless of the nucleotide identity of the target nucleic acid adjacent to the adapter. The terminal base of the anchor probe is moved into the target nucleic acid so that the base site to be determined is closer to the ligation site, thereby maintaining the fidelity of the ligase. In general, a ligase is able to ligate a probe more efficiently if the probe is perfectly complementary to the region of the target nucleic acid to which it hybridizes, but the fidelity of the ligase decreases as the distance from the ligation site increases. Thus, in order to reduce and/or prevent errors caused by incorrect pairing between the sequencing probe and the target nucleic acid, it may be useful to maintain the distance between the nucleotide to be detected and the ligation site of the sequencing and anchor probes. 
By designing the anchor probe to extend into the target nucleic acid, ligase fidelity can be maintained, but still a greater number of nucleotides can be identified that are ligated to each adaptor. Although FIG. 24 shows an example in which a sequencing probe hybridizes to a region of a target nucleic acid on one side of an adapter, it is understood that embodiments in which a sequencing probe hybridizes to the other side of an adapter are also encompassed by the invention. In FIG. 24, "N" represents a degenerate base, and "B" represents a nucleotide of an undetermined sequence. As can be appreciated, in certain embodiments universal bases can be used rather than degenerate bases.
The anchor probe of the invention may comprise any sequence that enables the anchor probe to hybridize to a DNB, typically an adaptor on the DNB. Such anchor probes may comprise sequences such that when the anchor probe is hybridized to the adaptor, the full length of the anchor probe is contained within the adaptor. In certain embodiments, the anchor probe can comprise a sequence complementary to at least a portion of the adaptor, and further comprise degenerate bases capable of hybridizing to a target nucleic acid adjacent to the adaptor. In certain exemplary embodiments, the anchor probe is a hexamer comprising 3 bases complementary to the adaptor and 3 degenerate bases. In certain exemplary embodiments, the anchor probe is an 8-mer comprising 3 bases complementary to the adaptor and 5 degenerate bases. In other embodiments, particularly where multiple anchor probes are used, the first anchor probe comprises a plurality of bases complementary to the adaptor at one end and degenerate bases at the other end, and the second anchor probe comprises all degenerate bases designed to be ligated to the end of the first anchor probe comprising degenerate bases. It will be appreciated that these are exemplary embodiments and that various combinations of known and degenerate bases may be used to generate an anchor probe suitable for use in the present invention.
In certain aspects, the ligation sequencing methods of the invention comprise providing different combinations of an anchor probe and a sequencing probe that, when hybridized to adjacent regions on a DNB, can be ligated to form a probe ligation product. The probe ligation product is then detected, which can provide the identity of one or more nucleotides in the target nucleic acid. As used herein, "ligation" refers to any method of joining two or more nucleotides to each other. Ligation may include chemical and enzymatic ligation. Generally, the ligation sequencing methods discussed herein utilize a ligase to perform enzymatic ligation. Such ligases for use in the invention may be the same as or different from the ligases discussed above for use in forming the nucleic acid template. Such ligases include, but are not limited to, DNA ligase I, DNA ligase II, DNA ligase III, DNA ligase IV, E. coli DNA ligase, T4 RNA ligase 1, T4 RNA ligase 2, T7 ligase, T3 DNA ligase, and thermostable ligases (including, but not limited to, Taq ligase), among others. As discussed above, sequencing by ligation often relies on the fidelity of the ligase to ligate only probes that are perfectly complementary to the nucleic acids to which they hybridize. This fidelity decreases as the distance between the base at a particular site in the probe and the ligation point between the two probes increases. Therefore, conventional sequencing-by-ligation methods can identify only a limited number of bases. As further described herein, the present invention employs multiple probe sets to increase the number of bases that can be identified.
A variety of hybridization conditions can be used for the ligation sequencing method and other sequencing methods discussed herein. These conditions include high, medium, and low stringency conditions; see, e.g., Maniatis et al., Molecular Cloning: A Laboratory Manual, 2d Edition, 1989, and Short Protocols in Molecular Biology, ed. Stringent conditions are sequence dependent and will differ in different circumstances. Longer sequences hybridize specifically at higher temperatures. A comprehensive guide to nucleic acid hybridization can be found in Tijssen, Techniques in Biochemistry and Molecular Biology--Hybridization with Nucleic Acid Probes, "Overview of principles of hybridization and the strategy of nucleic acid assays" (1993). Typically, stringent conditions are selected to be about 5-10 °C lower than the thermal melting point (Tm) of the specific sequence at a defined ionic strength and pH. The Tm is the temperature (under defined ionic strength, pH, and nucleic acid concentration) at which 50% of the probes complementary to the target hybridize to the target sequence at equilibrium (because the target sequence is present in excess, at Tm, 50% of the probes are occupied at equilibrium). Stringent conditions may be those in which the salt concentration is less than about 1.0 M sodium ion, typically about 0.01 to 1.0 M sodium ion concentration (or other salt), at pH 7.0 to 8.3, and the temperature is at least about 30 °C for short probes (e.g., 10 to 50 nucleotides) and at least about 60 °C for long probes (e.g., more than 50 nucleotides). Stringent conditions may also be achieved by the addition of a helix-destabilizing agent such as formamide. As is known in the art, hybridization conditions may also vary when a non-ionic backbone, e.g., PNA, is used. In addition, a cross-linking agent may be added to cross-link, i.e., covalently attach, the two strands of the hybridization complex after target binding.
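The temperature guidance in the preceding paragraph can be sketched as a small calculation. This is only an illustration: the function names and the example Tm of 62 °C are hypothetical, and the Tm itself is assumed to have been measured or estimated elsewhere for the relevant ionic strength and pH.

```python
def stringent_window(tm_celsius: float) -> tuple[float, float]:
    """Return the (low, high) temperature range lying about 5-10 degrees C below Tm."""
    return (tm_celsius - 10.0, tm_celsius - 5.0)


def minimum_temperature(probe_length: int) -> float:
    """Lower temperature bound quoted in the text for stringent conditions.

    About 30 C for short probes (10 to 50 nucleotides) and about 60 C for
    long probes (more than 50 nucleotides).
    """
    if probe_length > 50:
        return 60.0
    if probe_length >= 10:
        return 30.0
    raise ValueError("the text gives no bound for probes shorter than 10 nucleotides")


# Hypothetical 25-nucleotide probe with an assumed Tm of 62 C.
low, high = stringent_window(62.0)
print(low, high)                # 52.0 57.0
print(minimum_temperature(25))  # 30.0
```

In practice the window would be adjusted for salt, formamide, and backbone chemistry, as the paragraph notes.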
For any sequencing method known in the art and described herein that utilizes the nucleic acids of the invention (including LFR aliquot fragments and DNBs), the invention provides a method for determining at least about 10 to about 200 bases in a target nucleic acid. In other embodiments, the invention provides methods for determining at least about 20 to about 180, about 30 to about 160, about 40 to about 140, about 50 to about 120, about 60 to about 100, and about 70 to about 80 bases in a target nucleic acid. In still other embodiments, sequencing methods are used to identify at least 5, 10, 15, 20, 25, 30, or more bases adjacent to one or both ends of each adaptor in a nucleic acid template of the invention.
Any sequencing method described herein and known in the art can be applied to nucleic acids in solution or on a surface and/or array.
B1(a) Single cPAL
One aspect of the present invention provides methods for identifying the sequences of DNBs by utilizing combinations of sequencing probes and anchor probes that hybridize to adjacent regions of the DNB and are ligated together, typically by a ligase. This method is generally referred to herein as the cPAL (combinatorial probe-anchor ligation) method. In one aspect, the cPAL method of the invention produces a probe ligation product comprising a single anchor probe and a single sequencing probe. The cPAL approach using only a single anchor probe is referred to herein as "single cPAL".
FIG. 23 shows one embodiment of single cPAL. Monomeric unit 2301 of the DNB comprises target nucleic acid 2309 and adaptor 2308. Anchor probe 2302 hybridizes to a complementary region on adaptor 2308. In the example shown in FIG. 23, anchor probe 2302 hybridizes to the region of the adaptor directly adjacent to target nucleic acid 2309, although the anchor probe can also be designed to extend into the target nucleic acid adjacent to the adaptor by introducing a desired number of degenerate bases at the end of the anchor probe, as discussed further herein. A set of differentially labeled sequencing probes 2306 hybridizes to complementary regions in the target nucleic acid. Sequencing probe 2310, which hybridizes to the region of target nucleic acid 2309 adjacent to anchor probe 2302, is ligated to the anchor probe to form a probe ligation product. The efficiency of hybridization and ligation is increased when the base at the interrogation site of the probe is complementary to the unknown base at the detection site of the target nucleic acid. This increased efficiency favors ligation of perfectly complementary sequencing probes to anchor probes over ligation of mismatched probes. As discussed above, ligation is typically accomplished enzymatically using a ligase, although other ligation methods suitable for the present invention may also be used. In FIG. 23, "N" represents a degenerate base, and "B" represents a nucleotide of undetermined sequence. It will be appreciated that in certain embodiments universal bases may be used in place of degenerate bases.
As also discussed above, the sequencing probes may be oligonucleotides representing each base type at a particular position and labeled with a detectable label that distinguishes each sequencing probe from the sequencing probes carrying other nucleotides at that position. Thus, in the example shown in FIG. 23, sequencing probe 2310, which hybridizes adjacent to anchor probe 2302 and is ligated to it, identifies the base five positions from the adaptor in the target nucleic acid as a "G". Multiple cycles of anchor probe and sequencing probe hybridization and ligation can be used to identify the desired number of bases in the target nucleic acid on each side of each adaptor in the DNB.
It will be appreciated that the hybridization of the anchor probe and the sequencing probe in any of the cPAL methods described herein can be sequential or simultaneous.
In the embodiment shown in FIG. 23, sequencing probe 2310 hybridizes to a region "upstream" of the adaptor, but it will be appreciated that the sequencing probe may also hybridize "downstream" of the adaptor. The terms "upstream" and "downstream" refer to the regions 5' and 3' of the adaptor, depending on the orientation of the system. In general, "upstream" and "downstream" are relative terms and are not limiting; they are used only for ease of understanding. As shown in FIG. 6, sequencing probe 607 can hybridize downstream of adaptor 604, thereby identifying a nucleotide 4 bases away from the junction of the adaptor and target nucleic acid 603. In other embodiments, sequencing probes can hybridize both upstream and downstream of an adaptor to identify nucleotides at sites on both sides of the adaptor. These embodiments allow multiple data points to be generated from each adaptor per hybridization-ligation-detection cycle in a single cPAL approach.
In certain embodiments, an anchor probe used in a single cPAL method may contain from about 3 to about 20 bases corresponding to the adaptor and from about 1 to about 20 degenerate bases (i.e., it is in fact a set of anchor probes). These anchor probes may also comprise universal bases, as well as combinations of degenerate and universal bases.
In certain embodiments, an anchor probe containing degenerate bases may have approximately 1-5 mismatches with the adaptor sequence in order to increase the stability of perfect-match hybridization at the degenerate bases. Such a design provides an alternative way to control the stability of ligated anchor and sequencing probes so as to favor those probes that are perfectly matched to the target (unknown) sequence. In other embodiments, several bases in the degenerate portion of the anchor probe may be replaced with abasic sites (i.e., sites with no base on the sugar) or other nucleotide analogs to affect the stability of the hybridized probe, thereby favoring the formation of a perfectly matched hybrid at the distal end of the degenerate portion of the anchor probe (the end that will participate in the ligation reaction with the sequencing probe as described herein). Such modifications can be introduced at internal bases, particularly in anchor probes comprising a large number (i.e., more than 5) of degenerate bases. In addition, as described further below, certain degenerate or universal bases at the distal end of the anchor probe may be designed to be cleavable after hybridization (e.g., by introducing uracil) to create a ligation site for the sequencing probe or for a second anchor probe.
In other embodiments, hybridization of the anchor probe may be controlled by manipulating reaction conditions, such as the stringency of hybridization. In exemplary embodiments, anchor probe hybridization can begin under high-stringency conditions (higher temperature, lower salt concentration, higher pH, higher formamide concentration, etc.), which can then be relaxed gradually or stepwise. This may require successive hybridization cycles in which different sets of anchor probes are removed and then added back in subsequent cycles. Such methods result in a higher percentage of the target nucleic acid being occupied by fully complementary anchor probes, particularly at the distal site to which the sequencing probe will be ligated. Hybridization times under each stringency condition can also be controlled to obtain a greater number of perfect-match hybrids.
B1(b) Dual (and above) cPAL
In still other embodiments, the invention provides cPAL methods that use two ligated anchor probes per hybridization-ligation cycle. See, e.g., U.S. patent applications 60/992,485, 61/026,337, 61/035,914, and 61/061,134, which are incorporated by reference herein in their entirety, particularly the examples and claims. FIG. 25 shows an example of a "dual cPAL" method in which first anchor probe 2502 and second anchor probe 2505 hybridize to complementary regions of an adaptor; i.e., the first anchor probe hybridizes to the first anchor site and the second anchor probe hybridizes to the second anchor site. In the example shown in FIG. 25, first anchor probe 2502 is fully complementary to a region of adaptor 2511 (the first anchor site) and second anchor probe 2505 is complementary to a region of the adaptor adjacent to the hybridization site of the first anchor probe (the second anchor site). Generally, the first and second anchor sites are adjacent.
The second anchor probe optionally also comprises degenerate bases at the end that is not adjacent to the first anchor probe, so it will hybridize to a region of the target nucleic acid 2512 that is adjacent to the adaptor 2511. This enables sequence information to be obtained for target nucleic acid bases further away from the adaptor/target interface. Also, as summarized herein, the term "degenerate base" when used in reference to a probe means that the probe actually comprises a set of probes, which is a combination of all possible sequences at degenerate positions. For example, if the anchor probe is 9 bases in length, 6 known bases and 3 degenerate bases, the anchor probe is actually a collection of 64 probes.
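The statement that a probe with 3 degenerate positions is really a collection of 64 probes follows from 4 × 4 × 4 = 64 and can be checked by enumeration. The sketch below is illustrative only; the 9-mer sequence and the function name are hypothetical and do not come from the specification.

```python
from itertools import product

BASES = "ACGT"


def expand_degenerate(probe: str) -> list[str]:
    """Expand every degenerate position ('N') into all four bases."""
    choices = [BASES if base == "N" else base for base in probe]
    return ["".join(combo) for combo in product(*choices)]


# Hypothetical 9-mer anchor probe: 6 known bases followed by 3 degenerate bases.
pool = expand_degenerate("GATCCANNN")
print(len(pool))  # 64
```

The same function shows how quickly the pool grows: a fully degenerate hexamer expands to 4^6 = 4096 distinct sequences.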
The second anchor probe is generally too short to maintain duplex hybridization on its own, but upon ligation to the first anchor probe it forms a longer anchor probe that remains stable in subsequent steps. In certain embodiments, the second anchor probe comprises about 1 to about 5 bases complementary to the adaptor and about 5 to about 10 bases of degenerate sequence. As discussed above for the single cPAL method, a set of sequencing probes 2508, representing each base type at the detection site of the target nucleic acid and labeled with a detectable label (capable of distinguishing each sequencing probe from sequencing probes with other nucleotides at that site), is hybridized (2509) to the adaptor-anchor probe duplex and ligated to the terminal 5' or 3' base of the ligated anchor probe. In the example shown in FIG. 25, the sequencing probe is designed to interrogate the base located five positions 5' of the junction between sequencing probe 2514 and ligated anchor probe 2513. Because second anchor probe 2505 has 5 degenerate bases at its 5' end, it extends 5 bases into target nucleic acid 2512, allowing the sequencing probe to interrogate a base a full 10 bases away from the boundary between target nucleic acid 2512 and adaptor 2511. In FIG. 25, "N" represents a degenerate base, and "B" represents a nucleotide of undetermined sequence. It will be appreciated that in certain embodiments universal bases may be used in place of degenerate bases.
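The position arithmetic of the FIG. 25 example can be sketched as follows. The function and parameter names are hypothetical, and the interrogation offset of five positions for FIG. 25 is inferred from the stated totals (a 5-base overhang and an interrogated base 10 bases from the boundary) rather than quoted from the figure.

```python
def interrogated_distance(anchor_overhang: int, probe_offset: int) -> int:
    """Distance in bases from the adaptor/target boundary to the interrogated base.

    anchor_overhang: number of degenerate anchor bases extending into the target
        (0 when the anchor probe ends flush with the adaptor).
    probe_offset: position of the interrogation site within the sequencing
        probe, counted from the ligation junction (1 = base next to the junction).
    """
    if anchor_overhang < 0 or probe_offset < 1:
        raise ValueError("overhang must be >= 0 and offset must be >= 1")
    return anchor_overhang + probe_offset


# Single cPAL as in FIG. 23: flush anchor probe, base five positions from the adaptor.
print(interrogated_distance(0, 5))  # 5
# Dual cPAL as in FIG. 25: 5-base degenerate overhang plus the same probe offset.
print(interrogated_distance(5, 5))  # 10
```

The sketch makes the design point explicit: extending the anchor probe moves the ligation junction into the target, so the interrogated base stays close to the ligation site while lying further from the adaptor.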
In certain embodiments, the second anchor probe may contain about 5-10 bases corresponding to the adaptor and about 5-15 bases corresponding to the target nucleic acid, which are typically degenerate bases. The second anchor probe may be hybridized first, under conditions optimized to favor a high percentage of targets occupied by probes that are perfectly matched at the few bases around the junction of the two anchor probes. The first anchor probe and/or the sequencing probe may then be hybridized and ligated to the second anchor probe in a single step or sequentially. In certain embodiments, the first and second anchor probes may have, at their ligation point, about 5 to about 50 bases that are complementary to each other but not to the adaptor, thus forming a "branched" hybrid. This design allows for adaptor-specific stabilization of the hybridized second anchor probe. In certain embodiments, the second anchor probe is ligated to the sequencing probe prior to hybridization with the first anchor probe; in certain embodiments, the second anchor probe is ligated to the first anchor probe prior to hybridization with the sequencing probe; in certain embodiments, the first and second anchor probes and the sequencing probe hybridize simultaneously, and ligation between the first and second anchor probes and between the second anchor probe and the sequencing probe occurs simultaneously or substantially simultaneously; while in other embodiments, ligation between the first and second anchor probes and between the second anchor probe and the sequencing probe occurs sequentially in any order. Stringent washing conditions can be used to remove unligated probes (e.g., buffers with an optimal temperature, pH, salt concentration, and formamide concentration can be used, wherein the optimal conditions and/or concentrations are determined using methods known in the art). This approach is particularly useful in methods that use a second anchor probe with a large number of degenerate bases that hybridize beyond the junction between the anchor probe and the target nucleic acid.
In certain embodiments, the dual cPAL approach utilizes ligation of two anchor probes, one of which is fully complementary to the adaptor and the second of which is fully degenerate (and is, again, effectively a probe set). FIG. 26 shows an example of such a dual cPAL approach, in which first anchor probe 2602 hybridizes to adaptor 2611 of DNB 2601. Second anchor probe 2605 consists entirely of degenerate bases and is therefore capable of hybridizing to unknown nucleotides in the region of the target nucleic acid adjacent to adaptor 2611. The second anchor probe is designed to be too short to maintain duplex hybridization on its own, but it forms a longer, ligated anchor probe construct upon ligation to the first anchor probe, providing the stability required for subsequent steps in the cPAL process. A fully degenerate second anchor probe may be about 5 to about 20 bases long in some embodiments. For longer lengths (i.e., greater than 10 bases), hybridization and ligation conditions can be modified to reduce the effective Tm of the degenerate anchor probe. A shorter second anchor probe will typically bind non-specifically to the target nucleic acid and to the adaptor, but its short length affects hybridization kinetics, so that generally only those second anchor probes that are fully complementary to the region adjacent to the adaptor and the first anchor probe are stable enough to allow the ligase to join the first and second anchor probes into a longer ligated anchor probe construct. A second anchor probe that hybridizes non-specifically does not remain hybridized to the DNB long enough to be ligated subsequently to any adjacently hybridized sequencing probe. In certain embodiments, after ligation of the second and first anchor probes, any unligated anchor probes are removed, typically by a washing step. In FIG. 26, "N" represents a degenerate base, and "B" represents a nucleotide of undetermined sequence. It will be appreciated that in certain embodiments, universal bases may be used in place of degenerate bases.
In other exemplary embodiments, the first anchor probe is a hexamer comprising 3 bases complementary to the adaptor and 3 degenerate bases, while the second anchor probe comprises only degenerate bases, and the first and second anchor probes are designed such that only the end of the first anchor probe bearing degenerate bases is capable of ligation to the second anchor probe. In other exemplary embodiments, the first anchor probe is an 8-mer comprising 3 bases complementary to the adaptor and 5 degenerate bases, and likewise the first and second anchor probes are designed such that only the ends of the first anchor probe with the degenerate bases are capable of ligation to the second anchor probe. It will be appreciated that these are exemplary embodiments and that a wide variety of combinations of known and degenerate bases may be used in the design of the first and second (and in some embodiments, third and/or fourth) anchor probes.
In the modified version of the dual cPAL approach described above, if the first anchor probe terminates closer to the end of the adaptor, the second anchor probe will proportionally contain more degenerate bases and therefore will be more likely to ligate not only to the end of the first anchor probe, but also to additional second anchor probes at multiple sites on the DNB. To prevent such ligation artifacts, the second anchor probe can be selectively activated to limit its ligation to the first anchor probe or the sequencing probe. Such activation involves selectively modifying the ends of the anchor probes so that they can only be ligated to specific anchor probes or sequencing probes in a specific orientation relative to the adapter. For example, 5 'and 3' phosphate groups can be introduced into the second anchor probe such that a modified second anchor probe can be ligated to the 3 'end of the first anchor probe hybridized to the adaptor, but two second anchor probes cannot be ligated to each other (since the 3' end is phosphorylated, which prevents enzymatic ligation). Once the first and second anchor probes are ligated together, the 3 'end of the second anchor probe may be activated by removing the 3' phosphate group (e.g., with T4 polynucleotide kinase or phosphatases such as shrimp alkaline phosphatase and calf intestinal phosphatase).
If it is desired that ligation occur between the 3' end of the second anchor probe and the 5' end of the first anchor probe, the first anchor probe can be designed and/or modified such that its 5' end is phosphorylated and the second anchor probe can be designed and/or modified such that it does not carry a 5' or 3' phosphate. Likewise, a second anchor probe will be capable of ligation to a first anchor probe, but not to other second anchor probes. Following ligation of the first and second anchor probes, a 5' phosphate group may be introduced at the free end of the second anchor probe (e.g., by using T4 polynucleotide kinase) to make it available for ligation to a sequencing probe in a subsequent step of the cPAL process.
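The two end-modification schemes described above can be checked with a small model of ligation chemistry. The model rests on the standard requirement that a DNA ligase joins a 3' hydroxyl to a 5' phosphate; the class, field, and function names are hypothetical, and the end states assumed for the first anchor probe and the sequencing probe are illustrative assumptions, not details given in the text.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Probe:
    name: str
    five_prime_phosphate: bool
    three_prime_phosphate: bool  # a 3' phosphate blocks ligation at that end


def can_ligate(upstream: Probe, downstream: Probe) -> bool:
    """True if upstream's 3' end can be joined to downstream's 5' end.

    Requires a free 3' hydroxyl on the upstream probe and a 5' phosphate
    on the downstream probe.
    """
    return (not upstream.three_prime_phosphate) and downstream.five_prime_phosphate


# Scheme 1: second anchor probe carries both 5' and 3' phosphates.
first = Probe("first anchor", five_prime_phosphate=False, three_prime_phosphate=False)
second = Probe("second anchor", five_prime_phosphate=True, three_prime_phosphate=True)
print(can_ligate(first, second))   # True: joins the 3' end of the first anchor
print(can_ligate(second, second))  # False: the 3' phosphate blocks self-ligation

# Activation: removing the 3' phosphate frees that end for the next ligation.
activated = replace(second, three_prime_phosphate=False)
seq_probe = Probe("sequencing probe", five_prime_phosphate=True, three_prime_phosphate=False)
print(can_ligate(activated, seq_probe))  # True

# Scheme 2: first anchor is 5'-phosphorylated; second anchor has no phosphates.
first_5p = Probe("first anchor", five_prime_phosphate=True, three_prime_phosphate=False)
second_bare = Probe("second anchor", five_prime_phosphate=False, three_prime_phosphate=False)
print(can_ligate(second_bare, first_5p))     # True
print(can_ligate(second_bare, second_bare))  # False: no 5' phosphate available
```

In scheme 2, adding a 5' phosphate to the free end after the first ligation (the kinase step in the text) corresponds to `replace(second_bare, five_prime_phosphate=True)` in this model.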
In certain embodiments, two anchor probes are added to the DNBs simultaneously. In certain embodiments, the two anchor probes are added to the DNBs sequentially, allowing one anchor probe to hybridize to the DNBs before the other. In certain embodiments, the two anchor probes are ligated to each other prior to ligation of the second anchor probe to the sequencing probe. In certain embodiments, the anchor probes and the sequencing probe are ligated in one step. In embodiments where the two anchor probes and the sequencing probe are ligated in one step, the second anchor probe may be designed with sufficient stability to maintain its position until all three probes (the two anchor probes and the sequencing probe) are in place for ligation. For example, a second anchor probe comprising 5 bases complementary to the adaptor and 5 degenerate bases for hybridization to the region of the target nucleic acid adjacent to the adaptor can be used. Such a second anchor probe may be stable enough to remain hybridized through low-stringency washes, so that no ligation step is required between the second anchor probe hybridization step and the sequencing probe hybridization step. In the subsequent ligation of the sequencing probe to the second anchor probe, the second anchor probe will also be ligated to the first anchor probe, resulting in a duplex with greater stability than that of either anchor probe or the sequencing probe alone.
As with the dual cPAL method described above, it is understood that cPAL methods using three or more anchor probes are also encompassed by the present invention. These anchor probes can be designed according to the methods described herein and known in the art such that, upon hybridization to an adaptor region, one end of one of the anchor probes is available for ligation to a sequencing probe that hybridizes adjacent to that terminal anchor probe. In an exemplary embodiment, three anchor probes are provided: two complementary to different sequences within the adaptor and a third comprising degenerate bases to hybridize to sequences within the target nucleic acid. In other embodiments, one of the two anchor probes complementary to sequences within the adaptor may further comprise one or more degenerate bases at its terminus such that the anchor probe extends into the target nucleic acid and is ligated to the third anchor probe. In other embodiments, one of the anchor probes may be fully or partially complementary to the adaptor, and the second and third anchor probes are fully degenerate for hybridization to the target nucleic acid. In other embodiments, a fourth or further fully degenerate anchor probe may be sequentially ligated to the three ligated anchor probes, thereby extending the assay further into the target nucleic acid sequence. In an exemplary embodiment, a first anchor probe comprising 12 bases complementary to an adaptor can be ligated to a second, hexamer anchor probe, all 6 bases of which are degenerate. A third anchor probe, which is also a fully degenerate hexamer, can then be ligated to the second anchor probe and extend further into the unknown sequence of the target nucleic acid. A fourth, fifth, sixth, etc. anchor probe may also be added to extend still further into the unknown sequence.
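How far a chain of ligated anchor probes extends into the target can be tallied with a few lines of code. The sketch assumes, for illustration only, that the 12-base first anchor probe ends flush with the adaptor/target boundary; the function names are hypothetical.

```python
def anchor_chain_reach(target_bases_per_probe: list[int]) -> int:
    """Total number of target bases covered by a chain of ligated anchor probes.

    Each entry is the number of bases of one anchor probe that hybridize to
    the target rather than to the adaptor.
    """
    if any(n < 0 for n in target_bases_per_probe):
        raise ValueError("base counts must be non-negative")
    return sum(target_bases_per_probe)


def pool_size(n_degenerate: int) -> int:
    """Number of distinct sequences in a probe with n degenerate positions."""
    return 4 ** n_degenerate


# First anchor: 12 adaptor bases, 0 target bases (assumed flush with the boundary).
# Second and third anchors: fully degenerate hexamers.
print(anchor_chain_reach([0, 6, 6]))     # 12
print(anchor_chain_reach([0, 6, 6, 6]))  # 18 with a fourth hexamer
print(pool_size(6))                      # 4096 sequences per degenerate hexamer
```

Each added hexamer moves the ligation site for the sequencing probe six bases deeper into the unknown sequence, at the cost of another 4096-member probe pool.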
In still other embodiments, the one or more anchor probes may comprise one or more labels for "tagging" the anchor probes and/or for identifying the particular anchor probe hybridized on the adapter of the DNB according to any of the cPAL methods described herein.
B1(c) detection of fluorescently labeled sequencing probes
As discussed above, sequencing probes useful in the present invention can be detectably labeled with a variety of labels. Although the following description is primarily directed to embodiments in which the sequencing probe is labeled with a fluorophore, it will be appreciated that similar embodiments using sequencing probes comprising other types of labels are also encompassed by the present invention.
Multiple cycles of cPAL (whether single, double, triple, etc.) will identify multiple bases within the region of the target nucleic acid adjacent to the adaptor. Briefly, to interrogate multiple bases within the target nucleic acid, the cPAL method is repeated by cycling the anchor probe hybridization and enzymatic ligation reactions with sets of sequencing probes designed to detect nucleotides at different distances from the junction of the adaptor and the target nucleic acid. The sequencing probes used in any given cycle are designed such that the identity of one or more bases at one or more sites corresponds to the identity of the label attached to the sequencing probe. Once the ligated sequencing probe (and hence the base at the interrogation site) has been detected, the ligation complex is stripped from the DNB and a new round of anchor probe and sequencing probe hybridization and ligation is performed.
Typically, four fluorophores are used to identify the base at the interrogation site within a sequencing probe, and one base is interrogated per hybridization-ligation-detection cycle. However, it is understood that embodiments using 8, 16, 20, and 24 or more fluorophores are also encompassed by the present invention. Increasing the number of fluorophores will increase the number of bases that can be identified in any one cycle.
In an exemplary embodiment, a set of 7-mer sequencing probes having the following structure is employed:
3’-F1-NNNNNNAp
3’-F2-NNNNNNGp
3’-F3-NNNNNNCp
3’-F4-NNNNNNTp
wherein "p" represents a phosphate available for ligation and "N" represents a degenerate base. F1-F4 represent four different fluorophores, so that each fluorophore is associated with a specific base. This exemplary set of probes is capable of detecting the base immediately adjacent to the adaptor upon ligation of the sequencing probe to the anchor probe hybridized to the adaptor. To the extent that the ligase used to ligate the sequencing probe and the anchor probe is capable of discriminating complementarity between the base of the probe interrogation site and the base of the target nucleic acid detection site, the fluorescent signal that will be detected upon hybridization and ligation of the sequencing probe provides the identity of the base of the target nucleic acid detection site.
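The relationship between label and base in the exemplary probe set can be sketched as a simple lookup. The following is only an illustrative sketch: the label names F1-F4 follow the probe structures listed above, while the function name and the use of Watson-Crick complementarity to go from the probe interrogation base to the target base are assumptions made for illustration.

```python
# Label-to-base assignment taken from the exemplary probe set above.
FLUOR_TO_PROBE_BASE = {"F1": "A", "F2": "G", "F3": "C", "F4": "T"}

# Assumption: the ligase only joins probes whose interrogation base is
# complementary to the target base, so the target base is the complement.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def target_base_from_fluor(fluor: str) -> str:
    """Return the target base implied by the detected fluorophore (hypothetical helper)."""
    probe_base = FLUOR_TO_PROBE_BASE[fluor]
    return COMPLEMENT[probe_base]
```

For example, under these assumptions a detected F1 signal (probe base A) would indicate a T at the target detection site.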
In certain embodiments, a set of sequencing probes will comprise three differentially labeled sequencing probes, leaving the fourth optional sequencing probe unlabeled.
After the hybridization-ligation-detection cycle is performed, the anchor probe-sequencing probe ligation product is stripped off and a new cycle is started. In certain embodiments, accurate sequence information can be obtained 6 or more bases from the point of attachment between the anchor probe and the sequencing probe, and 12 or more bases from the boundary between the target nucleic acid and the adaptor. The number of bases that can be identified can be increased using the methods described herein, including the use of anchor probes with degenerate ends that can extend further into the target nucleic acid.
Image acquisition may be performed using methods known in the art, including using a commercial imaging software package such as Metamorph (Molecular Devices, Sunnyvale, Calif.). Data extraction may be performed by a series of binaries written in, for example, C/C++, and base determination (base calling) and read mapping may be performed by a series of Matlab and Perl scripts.
In an exemplary embodiment, the DNBs arrayed on the surface are subjected to a round of cPAL as described herein, wherein the sequencing probes used are labeled with four different fluorophores (each corresponding to a particular base at the interrogation site within the probe). To determine the identity of the bases of each DNB arrayed on the surface, each field of view ("frame") is imaged at four wavelengths corresponding to the four fluorescently labeled sequencing probes. All images obtained in each cycle are stored in a cycle directory, where the number of images is four times the number of frames (when four fluorophores are used). The cycle image data can then be stored in a directory structure organized for downstream data processing.
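The per-cycle image bookkeeping described above can be sketched as follows. This is a minimal illustration only: the directory and file naming scheme (cycle_NN, frame_NNNN_X.tif) and the function name are hypothetical and are not taken from the disclosure.

```python
from pathlib import PurePosixPath

# One image per fluorophore (wavelength) for every frame.
CHANNELS = ("A", "G", "C", "T")


def cycle_image_paths(root, cycle, n_frames):
    """List the image files expected in one cycle directory (hypothetical layout)."""
    cycle_dir = PurePosixPath(root) / f"cycle_{cycle:02d}"
    return [
        cycle_dir / f"frame_{frame:04d}_{channel}.tif"
        for frame in range(n_frames)
        for channel in CHANNELS
    ]
```

With four fluorophores, the number of paths returned for a cycle is four times the number of frames, as stated above.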
In certain embodiments, data extraction relies on two types of image data, brightfield images to distinguish the location of all DNBs on the surface, and sets of fluorescence images acquired in each sequencing cycle. All objects can be identified in the brightfield image using data extraction software, and for each such object the average fluorescence value for each sequencing cycle is calculated using software. For any given cycle, there are four data points that correspond to the four images taken at different wavelengths to query whether the base is A, G, C or T. These raw data points (also referred to herein as "base measurements") are collated to produce a discontinuous sequencing result for each DNB.
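The conversion of the four data points per cycle into a base measurement can be illustrated with a simple rule that picks the brightest channel. The sketch below is an assumption made for illustration and not the base-determination procedure of the disclosure; the channel order, the `min_ratio` threshold, and the use of "N" for an undetermined base are all hypothetical.

```python
BASES = ("A", "G", "C", "T")  # assumed order of the four wavelength channels


def call_base(intensities, min_ratio=1.5):
    """Pick the base whose channel is brightest; return "N" if it is not clearly dominant."""
    ranked = sorted(zip(intensities, BASES), reverse=True)
    (top, base), (second, _) = ranked[0], ranked[1]
    if top <= 0 or top < min_ratio * second:
        return "N"
    return base


def call_read(per_cycle_intensities):
    """Collate one base measurement per cycle into a read for a single DNB."""
    return "".join(call_base(values) for values in per_cycle_intensities)
```

Each element passed to `call_read` holds the four average fluorescence values measured for one DNB in one sequencing cycle.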
The identified groups of bases can then be assembled to provide sequence information for the target nucleic acid and/or to identify the presence or absence of a particular sequence in the target nucleic acid. In certain embodiments, the identified bases are assembled into a complete sequence by alignment of overlapping sequences obtained from multiple sequencing cycles performed on multiple DNBs. The term "complete sequence" as used herein refers to sequences of part or the entire genome as well as sequences of part or the entire target nucleic acid. In other embodiments, the assembly method relies on the fact that overlapping sequences can be "spliced" together to provide the complete sequence. In still other embodiments, a reference table is utilized to assist in the assembly of the identified sequences into complete sequences. The reference table can be compiled using existing sequencing data for the selected organism. For example, human genomic data can be obtained from the National Center for Biotechnology Information (ftp.ncbi.nih.gov/refseq/release) or the J. Craig Venter Institute (http://www.jcvi.org/researchhuref/). The entire human genomic information, or a subset thereof, can be used to make a reference table for a particular sequencing query. In addition, a specific reference table may be constructed from empirical data derived from a particular population, including genetic sequences from a human population defined by a particular ethnicity, geographic heritage, religion, or culture, as differences within the human genome may skew such data depending on the source of the information contained in the reference data.
In any of the embodiments of the invention discussed herein, the nucleic acid template and/or the population of DNBs may comprise a plurality of target nucleic acids so as to cover substantially the entire genome or the entire target polynucleotide. "Substantially covers" as used herein means that the number of nucleotides (i.e., target sequences) analyzed is equivalent to at least two copies of the target polynucleotide; or in another aspect, at least ten copies; or in another aspect, at least twenty copies; or in another aspect, at least 100 copies. The target polynucleotide may include DNA fragments (including genomic DNA fragments and cDNA fragments) and RNA fragments. Guidance as to the steps for reconstructing a target polynucleotide sequence can be found in the following references, which are incorporated herein by reference: Lander et al., Genomics, 2:231-239 (1988); Vingron et al., J. Mol. Biol., 235:1-12 (1994); and similar references.
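The idea of joining overlapping sequences can be illustrated with a toy example that merges two reads at their longest suffix-prefix overlap. This is a deliberately simplified sketch and not the assembly method of the disclosure; the function name and the `min_overlap` parameter are hypothetical, and real assemblers must also handle errors, repeats, and both strands.

```python
def merge_overlap(left, right, min_overlap=3):
    """Merge two reads if a suffix of `left` equals a prefix of `right`.

    Returns the merged sequence, or None when no overlap of at least
    `min_overlap` bases exists. The longest overlap is preferred.
    """
    max_k = min(len(left), len(right))
    for k in range(max_k, min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return left + right[k:]
    return None
```

For instance, the reads ACGTAC and TACGGA share the three-base overlap TAC and merge into ACGTACGGA.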
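The definition of "substantially covers" amounts to a fold-coverage calculation, which may be sketched as below. The function names are hypothetical, and the sketch assumes that coverage is computed simply as the total number of analyzed nucleotides divided by the length of the target polynucleotide.

```python
def fold_coverage(read_lengths, target_length):
    """Total analyzed nucleotides expressed as copies of the target polynucleotide."""
    return sum(read_lengths) / target_length


def substantially_covers(read_lengths, target_length, min_copies=2):
    """True when the analyzed nucleotides amount to at least `min_copies` copies."""
    return fold_coverage(read_lengths, target_length) >= min_copies
```

Under this assumption, 1,000 reads of 70 bases each analyzed against a 35,000-base target correspond to two copies of the target.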
III.B.1(d). Probe sets
As can be appreciated, different combinations of sequencing and anchor probes can be used according to the various cPAL methods described above. The following description of the sets of probes (also referred to herein as "probe sets") used in the present invention is an exemplary embodiment, and it is to be understood that the present invention is not limited to these combinations.
In one aspect, a probe set is designed to identify nucleotides at a site that is a specific distance from an adaptor. For example, a set of probes can be used to identify bases up to 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or more positions from an adaptor. As discussed above, an anchor probe with degenerate bases at one end can be designed to extend into the target nucleic acid adjacent to an adaptor, allowing the sequencing probe to be ligated at a position further away from the adaptor, thereby providing the identity of a base further away from the adaptor.
In an exemplary embodiment, a set of probes comprises at least two anchor probes designed to hybridize to a region adjacent to an adaptor. In one embodiment, the first anchor probe is fully complementary to the adaptor region and the second anchor probe is complementary to the adaptor-adjacent region. In certain embodiments, the second anchor probe comprises one or more degenerate nucleotides that extend and hybridize to nucleotides of the target nucleic acid adjacent to the adaptor. In exemplary embodiments, the second anchor probe comprises at least 1-10 degenerate bases. In other exemplary embodiments, the second anchor probe comprises 2-9, 3-8, 4-7, and 5-6 degenerate bases. In still other exemplary embodiments, one or both ends of the second anchor probe and/or regions within its sequence comprise one or more degenerate bases.
In other embodiments, a set of probes can further comprise one or more populations of sequencing probes for determining the base at one or more detection sites within a target nucleic acid. In one embodiment, the set of probes comprises a sufficient population of different sequencing probes to be able to identify about 1 to about 20 sites within a target nucleic acid. In other exemplary embodiments, the set of probes comprises a sufficient population of sequencing probes to be able to identify about 2 to about 18, about 3 to about 16, about 4 to about 14, about 5 to about 12, about 6 to about 10, and about 7 to about 8 sites within a target nucleic acid.
In other exemplary embodiments, a collection of 10 labeled or tagged probes is used in accordance with the invention. In still other embodiments, a probe set comprises two or more anchor probes that differ in sequence. In still other embodiments, a probe set comprises 3,4,5,6,7,8,9,10,11,12,13,14,15, or more anchor probes that differ in sequence.
In other exemplary embodiments, a set of probes is provided that includes one or more sequencing probe populations and three anchor probes. The first anchor probe is complementary to a first adaptor region, the second anchor probe is complementary to a second adaptor region, and the first and second adaptor regions are contiguous. The third anchor probe comprises three or more degenerate nucleotides capable of hybridizing to nucleotides within the target nucleic acid adjacent to the adaptor. In certain embodiments, the third anchor probe may also be complementary to a third adaptor region, which may be adjacent to the second region such that the second anchor probe is flanked by the first and third anchor probes.
In certain embodiments, the set of anchoring and/or sequencing probes comprises a different concentration of each probe, and the concentration depends in part on the degenerate bases that may be contained in the anchoring probe. For example, probes with lower hybridization stability, such as probes with more A and/or T, may be present in higher relative concentrations to compensate for their lower stability. In other embodiments, the difference in relative concentrations is achieved by separately preparing small probe sets and then mixing these separately prepared probe sets in appropriate amounts.
III.B.1(e). two-stage sequencing
In one aspect, the invention provides a "two-stage" sequencing method, also referred to herein as "shotgun sequencing". Such a method is described in U.S. patent application 12/325,922 filed on 12.1.2008, which is incorporated herein by reference in its entirety for all purposes, particularly for all teachings relating to two-stage or shotgun sequencing.
Generally, the two-stage sequencing method used in the present invention comprises the following steps: (a) sequencing a target nucleic acid to produce a primary target nucleic acid sequence comprising one or more target sequences; (b) synthesizing a plurality of target-specific oligonucleotides, wherein each of the plurality of target-specific oligonucleotides corresponds to at least one target sequence; (c) providing a library of target nucleic acid fragments (or constructs comprising such fragments and further comprising adaptors and other sequences such as described herein) that hybridize to the plurality of target-specific oligonucleotides; and (d) sequencing the library of fragments (or constructs comprising such fragments) to produce a secondary target nucleic acid sequence. To fill gaps left by missing sequence or to resolve low-confidence base determinations in the primary sequence of genomic DNA (such as human genomic DNA), the number of target-specific oligonucleotides synthesized for use in these methods can range from about 10,000 to about 1 million; the present invention therefore contemplates the use of at least about 10,000 target-specific oligonucleotides, or about 25,000, or about 50,000, or about 100,000, or about 20,000, or about 50,000, or about 100,000, or about 200,000 or more target-specific oligonucleotides.
By "corresponding" to at least one target sequence, it is meant that the target-specific oligonucleotide is designed to hybridize to the target nucleic acid in proximity to, including but not limited to adjacent to, the target sequence, such that there is a high probability that a target nucleic acid fragment to which the oligonucleotide hybridizes will contain the target sequence. The target-specific oligonucleotides can thus be used in hybrid capture methods to generate libraries of fragments enriched in target sequences, as sequencing primers for sequencing target sequences, as amplification primers for amplifying target sequences, or for other purposes.
One skilled in the art will readily appreciate that, after assembly of the sequence determinations obtained by shotgun sequencing and other sequencing methods of the present invention, gaps may exist in the assembled sequence, or one or more bases or a string of bases at a particular position in the sequence may be of low confidence. Target sequences that may contain such gaps, sequences of low confidence, or simply sequences that differ at a particular position (i.e., a change in one or more nucleotides of the target sequence) may also be identified by comparing the primary target nucleic acid sequence to a reference sequence.
According to one embodiment of these methods, sequencing the target nucleic acid to produce a primary target nucleic acid sequence comprises computerized input of sequence determinations and computerized assembly of those sequence determinations to produce the primary target nucleic acid sequence. In addition, the design of the target-specific oligonucleotides can be computerized, and such computerized synthesis of the target-specific oligonucleotides can be integrated with the computerized input and assembly of the sequence determinations and with the design of the target-specific oligonucleotides. This is particularly useful since the number of target-specific oligonucleotides to be synthesized may be in the tens or hundreds of thousands for the genomes of higher organisms, such as humans. The invention thus allows for automated integration of the process of generating a collection of oligonucleotides from the determined sequences and the regions identified for further processing. In certain embodiments, a computer-driven process uses the identified regions and the determined sequence adjacent to or contiguous with the identified regions to design oligonucleotides for use in isolating and/or generating new fragments that cover these regions. The fragments can then be isolated, with oligonucleotides as described herein, from the first sequencing library, from precursors of the first sequencing library, from a different sequencing library generated from the same target nucleic acid, directly from the target nucleic acid, etc. In other embodiments, the automated integration of identifying regions for further analysis and isolating and/or generating a second library defines the oligonucleotide sequences within the collection of oligonucleotides and directs synthesis of these oligonucleotides.
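The identification of regions for a second stage of sequencing can be sketched as a scan of the primary sequence for gaps, low-confidence bases, and differences from a reference. The sketch below is illustrative only: the function name, the use of "N" for a gap, the per-base confidence values, and the `min_conf` threshold are assumptions, and the primary and reference sequences are assumed to be already aligned and of equal length.

```python
def flag_regions(primary, confidence, reference, min_conf=0.9):
    """Return half-open (start, end) intervals of the primary sequence that
    contain gaps ("N"), low-confidence bases, or differences from the reference."""
    flagged = [
        i
        for i, (base, conf, ref) in enumerate(zip(primary, confidence, reference))
        if base == "N" or conf < min_conf or base != ref
    ]
    regions = []
    for i in flagged:
        if regions and regions[-1][1] == i:
            regions[-1][1] = i + 1  # extend the current interval
        else:
            regions.append([i, i + 1])  # start a new interval
    return [tuple(region) for region in regions]
```

Each returned interval marks a candidate target sequence for which target-specific oligonucleotides could then be designed.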
In certain embodiments of the two-stage sequencing method of the invention, the hybrid capture process is followed by a release step, and in other aspects of the technique, the second sequencing process is preceded by an amplification step.
In still other embodiments, some or all of the regions are identified in the identifying step by comparing the determined sequence to a reference sequence. In certain aspects, the second shotgun sequencing library is isolated using a collection of oligonucleotides comprising oligonucleotides based on the reference sequence. Likewise, in certain aspects, the collection of oligonucleotides comprises at least 1000 oligonucleotides that differ in sequence, and in other aspects, the collection of oligonucleotides comprises at least 10,000, 25,000, 50,000, 75,000, or 100,000 or more oligonucleotides that differ in sequence.
In certain aspects of the invention, one or more of the sequencing processes employed in the two-stage sequencing method are performed by ligation sequencing; in other aspects, one or more sequencing processes are performed by sequencing by hybridization or sequencing by synthesis.
In certain aspects of the invention, about 1 to about 30% of the complex target nucleic acid is identified as requiring re-sequencing in stage II of the method; in other aspects, about 1 to about 10% of the complex target nucleic acid is identified as requiring re-sequencing in stage II of the method. In certain aspects, the coverage used for this identification of the complex target nucleic acid is about 25x to about 100x.
In other aspects, from 1 to about 10 target-specific selection oligonucleotides are identified and synthesized for each region of the target nucleic acid re-sequenced in stage II of the method; in other aspects, about 3 to about 6 target-specific selection oligonucleotides are determined for each region of the target nucleic acid re-sequenced in stage II of the method.
In still further aspects of the technology, target-specific selection oligonucleotides are synthesized by automated procedures in which the process of identifying missing or less reliable regions of a complex nucleic acid sequence and the process of determining the sequence of a target-specific selection oligonucleotide are communicated to oligonucleotide synthesis software and hardware. In other aspects of the technology, the target-specific selection oligonucleotide is about 20 to about 30 bases in length, and in some aspects is unmodified.
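Selecting several oligonucleotide sequences of about 20 to about 30 bases next to an identified region might be sketched as follows. This is a hypothetical illustration, not the automated design procedure of the disclosure: the function name, the staggering `step`, and the choice to draw candidates only from the determined sequence immediately upstream of the region are assumptions, and strand orientation and hybridization properties are ignored.

```python
def design_oligos(sequence, region_start, n_oligos=3, length=25, step=5):
    """Return up to `n_oligos` candidate sequences of `length` bases taken from
    the determined sequence immediately upstream of `region_start`, staggered
    by `step` bases."""
    oligos = []
    for k in range(n_oligos):
        end = region_start - k * step
        start = end - length
        if start < 0:
            break  # not enough determined sequence upstream
        oligos.append(sequence[start:end])
    return oligos
```

The defaults of three oligonucleotides of 25 bases fall within the ranges given above, but any values in those ranges could be used.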
Not all regions of the complex target nucleic acid identified for further analysis may actually be present. One reason for a predicted lack of coverage of a region may be that a region expected to be present in the complex target nucleic acid is not actually present (e.g., the region may be deleted or rearranged in the target nucleic acid), and thus not all of the oligonucleotides generated for the collection may isolate a fragment for inclusion in the second shotgun sequencing library. In certain embodiments, at least one oligonucleotide is designed and prepared for each region identified for further analysis. In other embodiments, three or more oligonucleotides are provided on average for each region identified for further analysis. It is a feature of the present invention that the collection of oligonucleotides can be used directly to generate the second shotgun sequencing library by extending the oligonucleotides with a polymerase using a template derived from the target nucleic acid. Another feature of the invention is that the collection of oligonucleotides can be used directly to generate amplicons via circle dependent replication. It is a further feature of the invention that the method can provide sequence information to identify missing target regions, such as predicted regions that are identified for further analysis but do not actually exist due to, for example, deletion or rearrangement.
The embodiments of the two-stage sequencing methods described above can be used in combination with any of the nucleic acid constructs and sequencing methods described herein and known in the art.
III.B.1(f). SNP detection
The methods and compositions discussed above may be used in other embodiments to detect specific sequences in nucleic acid constructs such as DNBs. In particular, the cPAL method using sequencing and anchor probes can be used to detect polymorphisms or sequences associated with gene mutations, including Single Nucleotide Polymorphisms (SNPs). For example, to detect the presence of a SNP, two sets of differentially labeled sequencing probes can be used, such that detection of one, but not the other, indicates the presence or absence of a polymorphism in the sample. Such sequencing probes can be used in combination with anchoring probes similar to those in the cPAL approach described above, further improving the specificity and efficiency of SNP detection.
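The logic of using two differentially labeled probe sets can be sketched as a comparison of the two label signals at a site. The sketch is an assumption made for illustration: the function name, the `threshold` value, and the "both" and "no call" outcomes are hypothetical and are not described in the disclosure.

```python
def call_snp(ref_signal, alt_signal, threshold=100.0):
    """Classify a site from the signals of two differentially labeled probe sets."""
    ref_on = ref_signal >= threshold
    alt_on = alt_signal >= threshold
    if ref_on and not alt_on:
        return "reference"  # only the reference-allele label is detected
    if alt_on and not ref_on:
        return "variant"  # only the polymorphism-specific label is detected
    if ref_on and alt_on:
        return "both"  # assumption: both labels detected, e.g. a mixed sample
    return "no call"
```

Detection of one label but not the other then indicates the presence or absence of the polymorphism in the sample.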
Determination of
In one aspect, nucleic acids (including LFR aliquot fragments and DNBs) are arrayed on a surface to form a random array of individual molecules. Nucleic acids can be immobilized on a surface by a variety of techniques, including covalent attachment and non-covalent attachment. Non-covalent attachment includes hydrogen bonding, van der Waals forces, electrostatic attraction, and the like.
Methods of forming arrays of the present invention are described in published patent applications WO2007120208, WO2006073504, WO2007133831 and US2007099208, as well as US patent applications 60/992,485; 61/026,337; 61/035,914; 61/061,134; 61/116,193; 61/102,586; 12/265,593; 12/266,385; 11/938,096; 11/981,804; 11/981,797; 11/981,793; 11/981,767; 11/981,761; 11/981,730; 11/981,685; 11/981,661; 11/981,607; 11/981,605; 11/927,388; 11/927,356; 11/679,124; 11/541,225; 10/547,214; 11/451,692 and 11/451,691, which are all incorporated herein by reference for all purposes, particularly for all teachings relating to the formation of DNB arrays.
In some embodiments, the patterned substrate is formed by forming a silicon dioxide layer on the surface of a standard silicon wafer. A layer of metal, such as titanium, is deposited over the silicon dioxide, and the titanium layer is patterned with fiducial markings using conventional photolithography and dry etching techniques. A layer of hexamethyldisilazane (HMDS) (Gelest Inc., Morrisville, PA) may then be added to the substrate surface by vapor deposition, and a deep-UV, positive-tone photoresist material applied to the surface by centrifugal force (spin coating). The photoresist may then be exposed with the array pattern using a 248 nm lithography tool and the resist developed to produce an array of discrete regions with exposed HMDS. The HMDS layer in the holes can be removed, in some embodiments with a plasma-etch process, and functional moieties can be vapor deposited in the holes to provide attachment sites for the nucleic acids. In certain embodiments, these functional moieties are aminosilane moieties that provide a positive charge that can be used to non-covalently immobilize nucleic acids via electrostatic attraction. In some embodiments, the surface may be further coated with a photoresist layer after deposition of the aminosilane moieties and cut into substrates of a predetermined size. For example, in some embodiments, a 75 mm x 25 mm substrate is useful in aspects of the invention. In still other embodiments, the photoresist material may be stripped from the individual substrates using methods known in the art, including sonication. In still other embodiments, the regions between discrete aminosilane features are inert to prevent binding of nucleic acids to the spaces between the discrete regions. For example, the aminosilane features patterned onto the substrate according to embodiments described herein serve as nucleic acid binding sites, while the remaining HMDS inhibits nucleic acid binding between the features.
In yet other embodiments, a mixture of polystyrene beads and polyurethane glue is applied to each diced substrate in a series of parallel lines, and a coverslip is pressed into the glue lines to form a six-lane gravity/capillary-driven flow slide. In certain embodiments, the polystyrene beads are 50 μm beads. Nucleic acids can be loaded into a flow slide by pipetting the nucleic acids onto the slide. In certain embodiments, a greater number of nucleic acids than the number of binding sites present on the slide is applied to the slide. In yet another exemplary embodiment, 2-20 times more nucleic acid single molecules than binding sites are applied to the slide. In yet another embodiment, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20 times more nucleic acid single molecules than binding sites are applied to the slide.
As will be appreciated, a wide range of densities of the nucleic acids of the invention can be placed on a surface comprising discrete regions to form an array. Generally, the nucleic acids are immobilized to the discrete regions by a variety of methods known in the art and described in more detail below. In particular embodiments, the nucleic acids are immobilized via non-covalent electrostatic interactions to discrete regions on the array.
In a preferred embodiment, at least a majority of the discrete regions comprise single molecules attached thereto, and the discrete regions and/or the single molecules are distributed such that at least a majority of the single molecules immobilized to the discrete regions are optically resolvable. In still other embodiments, at least 50% to 100% of the discrete regions have a single molecule attached thereto. In still other embodiments, at least 55% to 95%, 60% to 90%, 65% to 85%, or 70% to 80% of the discrete regions on the array have a single molecule attached thereto. In yet other embodiments, at least 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, or 99% of the discrete regions on the array have a single molecule attached thereto.
In still other embodiments, at least 50% to 100% of the single molecules on the random array of the invention are optically resolvable. In still other embodiments, at least 55% to 95%, 60% to 90%, 65% to 85%, or 70% to 80% of the single molecules on the random array of the invention are optically resolvable. In still other embodiments, at least 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, or 99% of the single molecules on the random array of the invention are optically resolvable.
In some embodiments, the discrete regions have an area of less than 1 μm², and in some embodiments the discrete region area ranges from 0.04 μm² to 1 μm², and in some embodiments the discrete region area ranges from 0.2 μm² to 1 μm². In still other embodiments, the discrete regions have an area of about 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 1.5, 2, or 2.5 μm². In embodiments where the discrete regions are approximately circular or square in shape, such that their size can be indicated by a single linear dimension, the size of such regions may range from 125 nm to 250 nm, or from 200 nm to 500 nm. In some embodiments, the nearest neighbor center-to-center distance of the discrete regions ranges from 0.25 μm to 20 μm, and in some embodiments, such distance ranges from 1 μm to 10 μm, or ranges from 50 to 1000 nm. In still other embodiments, the nearest neighbor center-to-center distance of the discrete regions ranges from about 100-. In still other embodiments, the nearest neighbor center-to-center distance of the discrete region ranges from about 650-. In certain embodiments, the nearest neighbor center-to-center distance of the discrete region is 707 nm. Generally, the discrete regions are designed such that most of the discrete regions on the surface are optically resolvable. In some embodiments, the regions can be arranged on the surface in substantially any pattern, wherein each region has a defined position. As described in more detail above, in certain embodiments, a single nucleic acid is attached to each of at least a majority of the discrete regions on the surface.
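The relationship between center-to-center distance and site density can be worked out with a short calculation. The sketch below assumes, for illustration only, that the discrete regions lie on a square grid so that each site occupies an area of pitch squared; the function names are hypothetical.

```python
def sites_per_square_micron(pitch_nm):
    """Site density for a square grid with the given center-to-center pitch."""
    pitch_um = pitch_nm / 1000.0
    return 1.0 / (pitch_um ** 2)


def sites_per_square_millimeter(pitch_nm):
    """Same density expressed per square millimeter (1 mm² = 10^6 μm²)."""
    return sites_per_square_micron(pitch_nm) * 1_000_000
```

Under this assumption, a 707 nm pitch corresponds to roughly 2 sites per square micron, or about 2 million sites per square millimeter.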
In some embodiments, the arrays of the invention comprise 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 single molecules per square micron.
In some embodiments, the nucleic acid array is provided at a density of at least 0.5, 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 million molecules per square millimeter.
In some embodiments, the nucleic acids are randomly arranged on a substrate as described herein and known in the art at a density such that each discrete region comprises a single nucleic acid molecule immobilized thereto. In yet other embodiments, the nucleic acids are disposed on the substrate at a density of 100, 200, 500, 750, 1000, 2000, 3000, 4000, 5000, 10,000, 50,000, 100,000, 250,000, 500,000, 750,000, or 1,000,000 molecules per square micron.
In some embodiments, the surface may have reactive functionalities that react with complementary functionalities on the polynucleotide molecules to form covalent linkages, for example via the same techniques used to attach cDNA to microarrays, such as Smirnov et al. (2004), Genes, Chromosomes & Cancer, 40:72-77, Beaucage (2001), Current Medicinal Chemistry, 8: 1213-. Nucleic acids can also be efficiently attached to hydrophobic surfaces, such as clean glass surfaces, that have low concentrations of various reactive functionalities, such as -OH groups. Attachment via covalent bonds formed between the polynucleotide molecules and reactive functionalities on the surface is also referred to herein as chemical attachment.
In one aspect, the nucleic acids on the surface are confined to the area of a discrete region. The discrete regions may be incorporated into the surface using methods known in the art and described further below. As will be appreciated, the nucleic acids of the invention can be immobilized to the discrete regions via non-specific interactions, or via non-covalent interactions, such as hydrogen bonding, van der Waals forces, electrostatic attraction, and the like. The nucleic acids can also be attached to the discrete regions via the use of capture probes or via covalent interactions with reactive functionalities, as is known in the art and described in more detail herein. As will be appreciated, attachment may also include wash steps of varying stringency to remove incompletely attached single molecules or other reagents remaining from earlier preparation steps whose presence is undesirable or that are non-specifically bound to the surface.
The discrete regions may have defined locations in a regular array (which may correspond to a rectilinear pattern, a hexagonal pattern, etc.). A regular array of such regions is advantageous for detection and data analysis of signals collected from the array during analysis. In addition, first and/or second stage amplicons limited to the limited area of the discrete region provide a more concentrated or intense signal, particularly when using fluorescent probes in analytical procedures, thereby providing higher signal to noise ratio values. In some embodiments, the nucleic acids are randomly distributed over discrete regions, such that a given region is equally likely to accept any different single molecule. In other words, the resulting array is not spatially addressable immediately after fabrication, but can be made so by performing identification, sequencing, and/or decoding operations. Thus, the identity of the polynucleotide molecules of the invention distributed on the surface is discernible, but not initially known after their placement on the surface. In some embodiments, the discrete areas and attachment chemistries, macromolecular structures employed, etc., are selected to correspond to the size of the single molecules of the present invention, such that when a single molecule is applied to a surface, each region is actually occupied by only one single molecule. In some embodiments, the nucleic acids are arranged in a patterned manner on a surface comprising discrete regions such that a particular nucleic acid (identified, in one exemplary embodiment, by a tag adaptor or other marker) is arranged on a particular discrete region or set of discrete regions.
In other embodiments, the molecules are directed to the discrete regions on the surface because the areas between the discrete regions (referred to herein as "inter-regional areas") are inert, in the sense that concatemers or other macromolecular structures do not bind to them. In certain embodiments, such inter-regional areas may be treated with blocking agents, e.g., DNA unrelated to concatemer DNA, other macromolecules, etc.
There are a variety of supports that may be utilized to form random arrays with the compositions and methods of the present invention. In one aspect, the support is a rigid solid having a surface, preferably a substantially planar region, such that the single molecules to be interrogated lie in the same plane. The latter feature allows for efficient signal collection by, for example, detection of light. In another aspect, the support comprises a bead, in which case the bead surface contains reactive functional groups or capture probes that can be used to immobilize polynucleotide molecules.
In a further aspect, the solid support of the invention is non-porous, particularly when the random array of single molecules is to be analysed by hybridization reactions that require small volumes. Suitable solid support materials include materials such as glass, polyacrylamide-coated glass, ceramics, silica, silicon, quartz, various plastics, and the like. In one aspect, the area of the planar region may be in the range of 0.5 to 4 cm². In one aspect, the solid support is glass or quartz, such as a microscope slide with a uniformly silanized surface. This can be achieved using conventional protocols, e.g., acid treatment followed by immersion at 80 °C in a solution of 3-glycidoxypropyltrimethoxysilane, N,N-diisopropylethylamine and dry xylene (8:1:24 v/v) to form an epoxysilanized surface (e.g., Beattie et al. (1995), Molecular Biotechnology, 4: 213). Such surfaces are readily treated for attachment to the ends of the capture oligonucleotides, for example by providing the capture oligonucleotides with a 3' or 5' triethylene glycol phosphoryl spacer (see Beattie et al., cited above) prior to application to the surface. Other embodiments of surface functionalization and further preparation for use in the present invention are described in, for example, U.S. patent applications 60/992,485, 61/026,337, 61/035,914, 61/061,134, 61/116,193, 61/102,586, 12/265,593, 12/266,385, 11/938,096, 11/981,804, 11/981,797, 11/981,793, 11/981,767, 11/981,761, 11/981,730, 11/981,685, 11/981,661, 11/981,607, 11/981,605, 11/927,388, 11/927,356, 11/679,124, 11/541,225, 10/547,214, 11/451,692 and 11/451,691, which are incorporated herein by reference in their entirety for all purposes, and in particular for all teachings relating to the preparation of array-forming surfaces and all teachings relating to the formation of arrays, particularly nucleic acid arrays.
In embodiments of the present invention requiring discrete regions in a particular pattern, such patterns can be produced on a variety of surfaces using photolithography, electron beam lithography, nanoimprint lithography, and nanoprinting, as described in, e.g., Pirrung et al., U.S. Pat. No. 5,143,854; Fodor et al., U.S. Pat. No. 5,774,305; and Guo (2004), Journal of Physics D: Applied Physics, 37: R123-141, which are incorporated herein by reference.
In one aspect, the surface comprising a plurality of discrete regions is fabricated by photolithography. A commercially available, optically flat quartz substrate is spin-coated with a photoresist layer 100-500 nm thick. The photoresist layer is then baked onto the quartz substrate. Using a stepper, a reticle image with a pattern of regions to be activated is projected onto the surface of the photoresist layer. After exposure, the photoresist layer is developed to remove the areas of the projected pattern that were exposed to the UV source. This is achieved by plasma etching, a dry development technique that can produce very fine detail. The substrate is then baked to strengthen the remaining photoresist layer. After baking, the quartz wafer is ready for functionalization. The wafer is then subjected to vapor deposition of 3-aminopropyldimethylethoxysilane. By varying the concentration of the monomer and the exposure time of the substrate, the density of the amino-functionalized monomer can be tightly controlled. Only the quartz regions that were exposed by the plasma etching process can react with and capture the monomer. The substrate is then baked again to cure the monolayer of amino-functionalized monomer onto the exposed quartz. After baking, the remaining photoresist is removed with acetone. Because of the difference in attachment chemistry between the photoresist and the silane, the aminosilane-functionalized areas on the substrate remain intact during the acetone rinse. These areas can be further functionalized by reaction with p-phenylene diisothiocyanate dissolved in a solution of pyridine and N,N-dimethylformamide. The substrate can then be reacted with amine-modified oligonucleotides. Alternatively, the oligonucleotides may be prepared with a 5'-carboxy-modifier-c10 linker molecule (Glen Research). This technique allows the oligonucleotides to be attached directly to the amine-modified support, thereby avoiding additional functionalization steps.
In another aspect, the surface comprising a plurality of discrete regions is fabricated by nanoimprint lithography (NIL). To prepare DNA arrays, a quartz substrate is spin coated with a layer of photoresist, commonly referred to as a transfer layer. A second type of photoresist, commonly referred to as an imprint layer, is then applied over the transfer layer. The master embossing tool then leaves an impression on the embossed layer. The total thickness of the imprint layer is then reduced by plasma etching until the lower regions of the imprint layer encounter the transfer layer. Because the transfer layer is more difficult to remove than the imprinting layer, it is substantially unaffected. The imprinting layer and the transfer layer are then hardened by heating. The substrate is then placed in a plasma etcher until the lower region of the imprinting layer encounters quartz. The substrate is then derivatized by vapor deposition as described above.
In another aspect, the surface comprising a plurality of discrete regions is fabricated by nanoprinting. This process uses photolithography, imprint lithography, or e-beam lithography to create a master mold, which is a negative image of the desired feature pattern on the print head. The print head is typically made of a soft, flexible polymer, such as polydimethylsiloxane (PDMS). Such a material, or layers of materials with different properties, is spin-coated onto the quartz substrate. The mold is then used to emboss the feature pattern into the surface layer of the photoresist material under controlled temperature and pressure conditions. The print head is then subjected to a plasma-based etching process to increase the aspect ratio of the print head and eliminate deformation of the print head due to relaxation of the embossed material over time. Random array substrates are fabricated by nanoprinting a pattern of amine-modified oligonucleotides onto a homogeneously derivatized surface. These oligonucleotides act as capture probes for nucleic acids. One possible advantage of nanoprinting is the ability to print an interlaced pattern of different capture probes onto a random array support. This can be achieved by successive printing with a plurality of print heads, each carrying a different pattern, with all of the patterns fitting together to form the final structured support pattern. Such methods allow for some positional encoding of DNA elements in random arrays. For example, control concatemers containing specific sequences can be bound at regular intervals on a random array.
In yet another aspect, a high-density array of submicron-sized capture oligonucleotide spots is prepared using a print head or imprint master prepared from one or more bundles of about 10,000 to 1 million optical fibers, each comprising a core and a cladding material. A unique material is produced by drawing and fusing the optical fibers, containing cores of about 50-1000 nm separated by cladding material of similar size, or 2-5 times smaller or larger. A nano-print head containing a very large number of nanometer-scale posts is obtained by differential etching (dissolution) of the cladding material. Such a print head can be used to deposit oligonucleotides or other biological compounds (proteins, oligopeptides, DNA, aptamers) or chemical compounds, such as silanes with various reactive groups. In one embodiment, a glass fiber tool is used as a patterned support for the deposition of oligonucleotides or other biological or chemical compounds. In this case, only the posts produced by etching come into contact with the material to be deposited. Alternatively, a flat-cut fused fiber bundle may be used to guide light through the cores, allowing light-induced chemistry to occur only on the tip surfaces of the cores and thus eliminating the need for etching. In both cases, the same support may then serve as a light guide/collection device for imaging the fluorescent labels used to tag the oligonucleotides or other reactants. The device provides a large field of view with a large numerical aperture (possibly > 1). From 2 to 100 different oligonucleotides may be printed in an interlaced pattern using a stamp or printing tool that deposits the active material or oligonucleotides. This process requires precise positioning of the print head to within about 50-500 nm. This type of oligonucleotide array can be used to attach 2 to 100 different populations of DNA, such as DNAs from different sources.
They can also be used for parallel reading of sub-optical resolution spots by using DNA-specific anchor molecules or tags. Information can be obtained by DNA-specific tags (e.g., 16 specific anchor molecules for 16 DNAs), and 2 bases can be read by a combination of 5-6 colors, using 16 ligation cycles or one ligation cycle and 16 decoding cycles. This way of preparing the array is efficient if each segment only requires limited information (e.g. a small number of cycles), so more information can be provided per cycle or more cycles can be done per surface.
In one aspect, multiple arrays of the present invention can be placed on a single surface. For example, patterned array substrates can be produced to match standard 96- or 384-well plate formats. The production format may be 6 mm x 6 mm arrays in an 8 x 12 pattern at a 9 mm pitch, or 3.33 mm x 3.33 mm arrays in a 16 x 24 pattern at a 4.5 mm pitch, on a single piece of glass, plastic, or other optically compatible material. In one example, each 6 mm x 6 mm array consists of 36 million square regions of 250-500 nm spaced 1 micron apart. Hydrophobic or other surface or physical barriers may be used to prevent mixing of different reactions between unit arrays.
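The region count in the example above follows directly from the array dimensions. The following is a minimal Python sketch of that arithmetic, assuming a square grid and treating the 1 micron spacing as a center-to-center pitch (an assumption; the text does not state whether the spacing is a pitch or an edge-to-edge gap). The function names are illustrative only.

```python
def regions_per_array(side_mm: float, pitch_um: float) -> int:
    """Number of grid positions on a square array with the given side length.

    Assumes a rectilinear grid whose pitch (center-to-center spacing)
    is pitch_um microns; this is an assumption, not stated in the text.
    """
    per_side = int(round(side_mm * 1000 / pitch_um))  # convert mm to microns
    return per_side * per_side


def arrays_per_substrate(rows: int, cols: int) -> int:
    """Number of unit arrays in a plate-style layout (e.g. 8 x 12)."""
    return rows * cols


print(regions_per_array(6.0, 1.0))    # 36000000
print(arrays_per_substrate(8, 12))    # 96
print(arrays_per_substrate(16, 24))   # 384
```

Under these assumptions a 6 mm x 6 mm array at a 1 micron pitch holds 36 million regions, which matches the figure given above.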
Other methods of forming molecular arrays are known in the art and can be used to form nucleic acid arrays.
Exemplary embodiments
The following provides certain illustrative embodiments of the invention. It will be appreciated that these embodiments may be altered or augmented using methods known to those skilled in the art. Since many aspects can be made without departing from the spirit and scope of the presently described technology, the appropriate scope is to be determined by the appended claims. Accordingly, other aspects are contemplated. Moreover, it should be understood that any operations may be performed in any order, unless explicitly claimed otherwise or a specific order is inherently necessitated by the claim language.
In an exemplary embodiment, the present invention provides a method of fragmenting a double stranded target nucleic acid. The method comprises (a) providing genomic DNA; (b) dividing the DNA into a plurality of separate aliquots; (c) amplifying the DNA in the divided aliquots in the presence of a population of dNTPs comprising dNTP analogs such that a number of nucleotides in the DNA are replaced with dNTP analogs; (d) removing the dNTP analogs to form nicked DNA; and (e) treating the nicked DNA to translate the nicks until the nicks on opposite strands converge, thereby creating blunt-ended DNA fragments. In yet another embodiment, substantially each fragment in a separate mixture is non-overlapping with every other fragment of the same aliquot.
In yet another embodiment and in accordance with the above, the dNTP analogues are selected from the group comprising inosine, uracil and 5-methylcytosine.
In yet another embodiment and in accordance with any of the above, the dNTP analogs include both deoxy-uracil and 5-methylcytosine.
In yet another embodiment and in accordance with any of the above, the method of the present invention comprises the further step of obtaining a plurality of sequence reads from fragments of each of the separate mixtures.
In yet another embodiment and in accordance with any of the above, the fragments are used to generate DNA nanospheres prior to obtaining a sequence read.
In yet another embodiment and in accordance with any of the above, the separate mixtures comprise an average of less than about 0.1%,0.3%,1%, or 3% of the genome.
In yet another embodiment and in accordance with any of the above, the present invention provides a method for fragmenting a nucleic acid, comprising the steps of: (a) providing at least two DNA genome equivalents for at least one genome; (b) separating the DNA into a first layer of separated mixtures; (c) amplifying the DNA in the separated mixtures, wherein the amplification is performed with a population of dNTPs comprising a predetermined ratio of dUTP to dTTP (such that a plurality of thymines in the DNA are replaced by uracils) and a predetermined ratio of 5-methyl dCTP to dCTP (such that a plurality of cytosines are replaced by 5-methylcytosines); (d) removing the uracils and 5-methylcytosines to form gapped DNA; and (e) treating the gapped DNA to translate the gaps until the gaps on opposite strands converge, thereby creating blunt-ended DNA fragments, wherein the blunt-ended fragments have smaller GC bias and smaller coverage bias than fragments generated in the absence of 5-methylcytosine.
In yet another embodiment and in accordance with any of the above, sequence reads are obtained from the fragments of each separate mixture of the first layer.
In yet another embodiment and in accordance with any of the above, the separated mixture of fragments is further separated into a second layer of separated mixture. In yet another embodiment, sequence reads are obtained from fragments of each separate mixture in the second layer.
In yet another embodiment and in accordance with any of the above, the separated mixture in the first, second or larger layer that is aliquoted and/or fragmented has a volume of less than 1 μ Ι,100nl,10nl,1nl or 100 pl.
In yet another embodiment and in accordance with any of the above, the amplifying is performed in the presence of a member selected from the group consisting of: glycogen, DMSO, ET SSB, betaine, and any combination thereof.
In yet another embodiment and in accordance with any of the above, after one or more rounds of fragmentation, the fragments have a length of about 100kb to about 1 mb.
In yet another embodiment and in accordance with any of the above, the present invention provides a method of fragmenting a double stranded target nucleic acid comprising the steps of: (a) providing genomic DNA; (b) dividing the DNA into separate aliquots; (c) amplifying the DNA in the divided aliquots to form a plurality of amplicons, wherein the amplifying is performed with a population of dNTPs comprising dNTP analogs such that a number of nucleotides in the amplicons are replaced with dNTP analogs, and wherein the amplifying is performed in the presence of an additive selected from the group consisting of: glycogen, DMSO, ET SSB, betaine, and any combination thereof; (d) removing the dNTP analogs from the amplicons to form gapped DNA; and (e) treating the gapped DNA to translate the gaps until the gaps on opposite strands converge, thereby creating blunt-ended DNA fragments that have less GC bias than fragments generated in the absence of the additive.
In yet another embodiment and in accordance with any of the above, a plurality of sequence reads are obtained from fragments of each separate mixture.
In yet another embodiment and in accordance with any of the above, the step of obtaining a sequence read is preceded or followed by a second amplification of fragments of each separate mixture.
In yet another embodiment and in accordance with any of the above, the dNTP analogs are selected from the group consisting of inosine, uracil, and 5-methylcytosine.
In yet another embodiment and in accordance with any of the above, the dNTP analogs include both deoxy-uracil and 5-methylcytosine.
In yet another embodiment and in accordance with any of the above, the fragment has a length of about 10,000 to about 200,000 bp.
In yet another embodiment and in accordance with any of the above, the fragment has a length of about 100,000 bp.
In yet another embodiment and in accordance with any of the above, the present invention provides a method of obtaining sequence information from a genome, comprising the steps of: (a) providing a first population of fragments of said genome; (b) preparing emulsion droplets of the first segment such that each emulsion droplet comprises a subset of the first segment population; (c) obtaining a second population of fragments within each emulsion droplet such that the second fragments are shorter than the first fragments from which they were derived; (d) combining the emulsion droplets of the second segment with the emulsion droplets of the adaptor tag; (e) ligating the second fragments with an adaptor tag to form tagged fragments; (f) combining the tagged fragments into a single composition, (g) obtaining sequence reads from the tagged fragments, wherein the sequence reads comprise sequence information from the adaptor tags and the fragments to identify the fragments from the same emulsion droplet, thereby providing sequence information about the genome.
In yet another embodiment and in accordance with any of the above, the emulsion droplets of the adaptor tags comprise at least two sets of different tag components, such that the fragments in at least some of the emulsion droplets are tagged with different combinations of tag components in the ligating step (e).
In yet another embodiment and in accordance with any of the above, at least 1000 different emulsion droplets comprise fragments tagged with different combinations of tag components.
In yet another embodiment and in accordance with any of the above, at least 10,000,30,000, or 100,000 different emulsion droplets comprise fragments tagged with different combinations of tag components.
In yet another embodiment and in accordance with any of the above, the tag component is from a set of more than 1000 unique barcodes prepared as a population of droplets in oil.
In yet another embodiment and in accordance with any of the above, the emulsion droplets of the first fragments comprise only 1-5 first fragments in each droplet.
In yet another embodiment and in accordance with any of the above, the emulsion droplets of the fragments or the emulsion droplets of the adaptors further comprise a ligase and/or other reagents required for the ligation reaction.
Examples
Example 1 overview of LFR technology
As shown in FIG. 30(A), genomic DNA is released from 1-100 cells and maintained as long fragments of 100 kb to 1 Mb in size. If several cells are used, the DNA is replicated. Blue represents the maternal fragment and red the paternal fragment of the selected locus. In FIG. 30(B), the long genomic DNA is divided into 1000 to 100,000 aliquots (e.g., 1536- or 6144-well plates, or more than 10,000 nanoliter droplets, such as in the RainDance or Advanced Liquid Logic systems), each containing from 1% to as little as 0.01% of a haploid genome (1-1,000 fragments per aliquot). In FIG. 30(C), the DNA is amplified by phi29 polymerase (not necessary for some platforms; the resulting DNA may be shorter than the starting material), enzymatically fragmented to 100-10,000 bp (typically 500 bp), and uniquely barcoded in each aliquot via combinatorial ligation of DNA adaptors carrying unique 6- to 12-mer sequences. In FIG. 30(D), the aliquots are combined into a single reaction. In FIG. 30(E), the barcoded DNA is carried through standard library preparation and both the DNA and the barcodes are sequenced. Minimal mapping of tagged reads to the entire reference determines which regions of the genome are used as a short composite reference for rapid read assembly in each individual aliquot. The computational cost of read mapping is thereby reduced by a factor of 100. In FIG. 30(F), the tagged reads are used to assemble the maternal and paternal 100+ kb fragments of the genome independently. Overlapping 100+ kb fragments (e.g., from aliquots 3 and 77) are identified by shared SNP alleles and used to assemble the sequences of the maternal and paternal chromosomes independently. Ten cells provide fragments that overlap on average by more than 90 kb, with approximately 60 heterozygous variants ensuring correct parental mapping.
Example 2 miniaturization of LFR
As shown in FIG. 28(A), 96-384 uniquely barcoded half-adaptors from each of group A and group B are combined pairwise into about 10K-150K distinct combined-adaptor water-in-oil droplets. In FIG. 28(B), up to 10 billion combined-adaptor droplets in 10 ml are formed (over a few days) and stored. This amount is sufficient to process over 1000 human samples. In FIG. 28(C), the combined-adaptor droplets from (B) are injected into a microfluidic device and merged one-to-one with droplets of amplified, fragmented DNA generated from sub-genomic aliquots of fragments greater than 100 kb. In FIG. 28(D), the fragmented DNA in 10,000 or more emulsion droplets is ligated with unique combinatorial adaptors. In FIG. 28(E), an enlarged view of the combined adaptor is shown. Yellow represents the 4-6 bp building blocks of the barcode sequence, and blue and red represent the adaptor sequences common to groups A and B, respectively. The group A and B adaptors have 2-4 bp complementary sequences for improved directed ligation. B is a block ("-") linkage to genomic DNA (black). In FIG. 28(F), after adaptor ligation, the individual emulsion droplets are broken and the DNA fragments are pooled for standard library preparation.
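The figure of about 10K-150K distinct combined adaptors follows from pairing every group A half-adaptor with every group B half-adaptor. The following Python sketch illustrates this; the barcode sequences and function names are hypothetical placeholders, and simple concatenation stands in for the actual ligated adaptor structure, which is not specified here.

```python
from itertools import product


def combined_barcodes(group_a, group_b):
    """Pair every group A barcode with every group B barcode.

    Concatenation is an illustrative stand-in for the combined adaptor.
    """
    return [a + b for a, b in product(group_a, group_b)]


def n_combinations(n_a: int, n_b: int) -> int:
    """Number of distinct A/B pairs available from two half-adaptor sets."""
    return n_a * n_b


# Hypothetical 4-base building blocks, for illustration only.
example_a = ["ACGT", "TTAG"]
example_b = ["GGCA", "CATC"]
print(combined_barcodes(example_a, example_b))
# ['ACGTGGCA', 'ACGTCATC', 'TTAGGGCA', 'TTAGCATC']

print(n_combinations(96, 96))     # 9216
print(n_combinations(384, 384))   # 147456
```

With 96 half-adaptors per group the pairing yields 9,216 combinations (about 10K), and with 384 per group it yields 147,456 (about 150K).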
Example 3: defining haplotypes using LFR data
An example of a consensus chromosomal sequence with 4 heterozygous sites separated by variable distances of 3 to 35 kb is depicted in FIG. 29. Starting from the left, the Percent Shared Aliquots (PSA) is calculated for each pair of adjacent alleles. The values for the 4 possible pairs are written in the following order: top-top, top-bottom, bottom-top, and bottom-bottom; for the 7 kb segment, the values 7, 87, 83, and 0 correspond to the A-C, A-T, G-C, and G-T pairs, respectively. If 20 cells are used, each allele can be found in 20 or fewer aliquots. For the A-C and A-T pairs, only aliquots containing A but lacking G are used. For the G-C and G-T pairs, only aliquots containing G but lacking A are used. For the A-T pair, if A without G is present in 15 aliquots, T is present in 17 aliquots, and A and T are present together in 13 aliquots, then PSA = 13/15 = 87%.
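The PSA calculation can be sketched in Python as follows. This is a minimal illustration that follows the worked A-T example above, where the denominator is the number of aliquots containing the first allele without its alternative; the aliquot identifiers and function name are hypothetical.

```python
def percent_shared_aliquots(first_allele_aliquots: set, second_allele_aliquots: set) -> float:
    """PSA: aliquots shared by both alleles, as a percentage of the
    aliquots containing the first allele (already filtered to exclude
    aliquots that also contain the first site's alternative allele)."""
    if not first_allele_aliquots:
        raise ValueError("first allele was not observed in any aliquot")
    shared = first_allele_aliquots & second_allele_aliquots
    return 100.0 * len(shared) / len(first_allele_aliquots)


# Hypothetical aliquot IDs reproducing the worked example:
# A (without G) in 15 aliquots, T in 17 aliquots, 13 in common.
a_without_g = set(range(0, 15))   # aliquots 0..14
t_aliquots = set(range(2, 19))    # aliquots 2..18

psa = percent_shared_aliquots(a_without_g, t_aliquots)
print(round(psa))  # 87
```

A high PSA for one pairing (here A-T) together with a low PSA for the alternative (A-C) suggests that the two alleles lie on the same parental chromosome.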
Example 4: Φ 29-mediated overlapping genomic fragments
Long fragments of genomic DNA can be treated with low concentrations of infrequently cutting nicking enzymes. The Φ29 polymerase molecule extends the DNA from the nick while simultaneously displacing the strand ahead of it. Complete extension results in long overlapping fragments without loss of DNA at the ends of the fragments.
Example 5: sequencing cancer samples
Four cancer samples and matched normal cells were sequenced using the LFR technique discussed herein. Either the emulsion technique or 3072-6144 aliquots per library were used. Complete methylome data were also generated. Depending on the cost reduction achieved, more than 120 Gb of data can be obtained per genome. Results from the experiments indicate the completeness and quality of the sequences, and the nature of the genetic and epigenetic changes in the cancer tissues analyzed.
Example 6: MDA reaction for insertion of uracil for CoRE
Aliquots of DNA were diluted to 1 ng/μL. Excess pipetting was avoided to help preserve long fragment lengths. The mixture was not vortexed at any point in the preparation for the reaction.
1/5 diluted denaturing buffer was generated from the concentrated frozen stock solution. The denaturation buffer contained:
5 ng (5 μL) of 1 ng/μL DNA was diluted in 45 μL of 1x glycogen water.
DNA was denatured by the addition of 50 μL of 1/5 diluted denaturation buffer (the total volume is now 100 μL). The final concentration of this mixture is 50 pg/μL.
The mixture was incubated for 5 minutes.
The amount of DNA required for the number of wells or aliquots, at a target concentration of 0.025 genome equivalents per μL (i.e., 0.0825 pg/μL), was removed and placed in a tube, well, or other aliquot container. In embodiments using wells, the amount is determined using the following calculation: DNA (μL) = [(0.0825 pg/μL) x (2 μL) x (number of aliquots or wells)] / (50 pg/μL).
An appropriate amount of 1 mM 9-mer primer (0.03 μL per well) was added to the denatured DNA from the above step and incubated for 1 minute. The appropriate amount is calculated from the number of aliquots that will be used. For example, for 405 wells, this would be equal to 0.03 μL x 405 = 12.2 μL.
The reaction was neutralized with the appropriate amount of 1/45 diluted neutralization buffer (using 1/2 volume of denatured DNA from the removal step described above). The neutralization buffer contained the following:
The reaction was then diluted to 0.025 genome equivalents per μL in distilled water with 1x glycogen. For embodiments using the multi-well format, the volume of diluent is [(number of wells x 2 μL) - (μL denatured DNA + μL buffer N + μL 9-mer)]; for a 405-well plate this would be (405 x 2) - (1.33 + 0.67 + 12.2) = 796 μL.
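The volume calculations in the preceding steps can be collected into one short Python sketch. The default values are taken from the text (0.0825 pg/μL target, 2 μL per well, 50 pg/μL denatured stock, 0.03 μL of 9-mer per well, neutralization buffer at half the DNA volume); the function and key names are hypothetical and introduced only for illustration.

```python
def mda_setup_volumes(n_wells: int,
                      target_pg_per_ul: float = 0.0825,
                      well_volume_ul: float = 2.0,
                      stock_pg_per_ul: float = 50.0,
                      primer_ul_per_well: float = 0.03) -> dict:
    """Volumes (in microliters) for the denature/neutralize/dilute steps."""
    # Denatured DNA needed to reach the target concentration in all wells.
    dna_ul = target_pg_per_ul * well_volume_ul * n_wells / stock_pg_per_ul
    # 9-mer primer, added per well.
    primer_ul = primer_ul_per_well * n_wells
    # Neutralization buffer: half the volume of denatured DNA removed.
    buffer_n_ul = dna_ul / 2
    # Glycogen water making up the remainder of the total volume.
    diluent_ul = n_wells * well_volume_ul - (dna_ul + buffer_n_ul + primer_ul)
    return {
        "denatured_dna_ul": dna_ul,
        "primer_ul": primer_ul,
        "buffer_n_ul": buffer_n_ul,
        "diluent_ul": diluent_ul,
    }


volumes = mda_setup_volumes(405)
for name, value in volumes.items():
    print(f"{name}: {value:.2f}")
```

For 405 wells the computed values agree, to within rounding, with the figures given in the text (about 1.33 μL denatured DNA, 0.67 μL buffer N, 12.2 μL 9-mer, and 796 μL diluent).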
A4.0% dUTP-MDA mixture was created according to the protocol outlined below (examples for 405 wells are shown).
The 3x main mixture contains the following:
Prior to MDA, 0.0375 μL of Φ29 per well was added to the 3X master mix (i.e., for a 384-well plate, 14.4 μL of Φ29 was added to the master mix). 0.03 μL of 1 mM random 9-mers per well was added directly to the DNA during the denaturation step.
1 μL of MDA mixture was added to each well and the plate was spun down briefly. Aliquots were incubated at 26 °C for about 120 minutes to achieve about 10,000- to 30,000-fold amplification, to 3-10 ng/well.
Φ 29 was inactivated by incubation at 45-65 ℃ for 5 minutes.
Example 7: Complete diploid genomic sequence from a human Yoruban female using LFR
The LFR method eliminates some of the problems associated with short read sequencing because it is equivalent to single molecule sequencing of fragments greater than 10 kb (up to 1 Mb is possible). This is achieved by randomly dividing the corresponding parental DNA fragments into physically distinct pools. As the genome fraction in each pool decreases to less than a haploid genome, the statistical likelihood of having fragments from both parental chromosomes in the same pool decreases significantly (i.e., at 0.1 genome equivalents per well, there is a 10% chance that two fragments will overlap and a 50% chance that those fragments will originate from separate parental chromosomes, resulting in a 5% overall chance that a particular well will be uninformative for a given fragment). Likewise, the more individual pools that are interrogated, the greater the number of times fragments from the maternal and paternal complements will be analyzed (i.e., a 384-well plate with 0.1 genome equivalents in each well results in a theoretical 19X coverage of both the maternal and paternal alleles for each fragment). Finally, all chromosomes from one parent are expected to be separated from the corresponding chromosomes of the other parent in most of the aliquots sequenced.
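The figures quoted above can be reproduced with the following Python sketch. It uses the same simple approximation as the text (the chance of an overlapping fragment in a well is taken to equal the genome equivalents per well) and is not a rigorous statistical model; the function names are hypothetical.

```python
def uninformative_well_fraction(genome_equiv_per_well: float,
                                p_opposite_parent: float = 0.5) -> float:
    """Approximate chance that a well holds overlapping fragments from
    both parental chromosomes, making it uninformative for that region.

    Approximation from the text: P(overlap) ~ genome equivalents per well.
    """
    p_overlap = genome_equiv_per_well
    return p_overlap * p_opposite_parent


def coverage_per_parental_allele(n_wells: int,
                                 genome_equiv_per_well: float) -> float:
    """Expected number of wells covering each parental copy of a region,
    with the total input split evenly between the two parental copies."""
    return n_wells * genome_equiv_per_well / 2


print(f"{uninformative_well_fraction(0.1):.2f}")        # 0.05
print(f"{coverage_per_parental_allele(384, 0.1):.1f}")  # 19.2
```

At 0.1 genome equivalents per well, this gives a 5% chance of an uninformative well and about 19X coverage of each parental allele for a 384-well plate, consistent with the values stated above.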
Several preparation steps were used to generate these physically separated fragments for analysis by any short read sequencing platform. First, highly uniform amplification using modified Φ29-based Multiple Displacement Amplification (MDA) was performed to increase the number of copies of each fragment per well to greater than 1000. This step can be omitted for single molecule sequencing methods. Next, the DNA was fragmented and ligated to barcode adaptors via 5 enzymatic steps per well without any intervening purification steps. Briefly, the long DNA molecules were fragmented into blunt-ended 300-1,300 bp segments by a novel method of controlled random enzymatic fragmentation (CoRE). CoRE fragments the DNA via removal, by uracil DNA glycosylase and endonuclease IV, of uridine bases that are incorporated at a predetermined frequency during MDA. Nick translation with E. coli polymerase I resolves the fragments and produces blunt ends. Then, a high-yield, low-chimera ligation protocol is used to ligate unique barcode adaptors to the fragmented DNA in each well; the adaptors are designed to reduce any bias caused by differences in the sequence and concentration of each barcode. At this point, all 384 wells are combined and, if necessary to generate enough template for a short read sequencing platform, a non-saturating polymerase chain reaction using primers common to the ligated adaptors is employed.
To demonstrate the ability of LFR to determine diploid genomic sequences, libraries were generated starting from high molecular weight genomic DNA of an immortalized B cell line from HapMap sample NA19240, a human Yoruban female. NA19240 has been extensively interrogated as part of HapMap and as part of a trio in the 1000 Genomes Project (NA19240 is the child of samples NA19238 and NA19239). Thus, highly accurate haplotype information can be generated based on the sequence data of the parental samples NA19238 and NA19239. A total of about 130 picograms of DNA (equal to about 20 cells) was aliquoted into a 384-well plate. The DNA in each well was tagged with a unique 6-base sequence and sequenced using the DNA nanoarray sequencing platform of Complete Genomics. A custom alignment algorithm was used to map 35-base mate-paired reads to a reference genome, yielding 236 Gb of mapped data and 86-fold average genome coverage.
The mapped reads from each well were then grouped based on the unique 6-base barcode identifiers and assembled into maternal and paternal chromosome fragments. These fragments had a median size of about 90 kb and a maximum of greater than 180 kb. Large contigs, with an N50 of 373 kb and an upper bound of 2.63 Mb, were assembled using a two-step custom haplotyping algorithm that uses overlapping heterozygous SNPs between fragments from the same parental chromosome located in different wells. A total of almost 2.7 million heterozygous SNPs were phased, and approximately 86% of the NA19240 genome was covered by LFR haplotypes.
To confirm the accuracy of LFR haplotype calling, low-coverage BAC libraries were generated and 10 clones with an average 83 kb overlap with the LFR contigs were selected for further validation. Sequencing was performed on approximately 10 different heterozygous SNPs spread across each BAC. 128 of the 130 informative SNPs were completely consistent with the LFR calls, yielding a discordance rate of only 1.5%. To further confirm the LFR results, the SNP phasing data were compared to those generated from parental sequencing. Overall, the two sets of data were highly concordant.
To generate complete haplotypes for all NA19240 chromosomes (each parental chromosome being a single contig containing almost all heterozygous SNPs), we combined the LFR data with haplotypes derived from the sequences of the mother and father. To achieve this, informative variants from one or both parents and NA19240 were used to create a genome-wide sparse haplotype. This allowed phasing of about 1.8 million SNPs. The haplotype contigs generated by LFR were then phased using this chromosome scaffold, resulting in high-density whole-chromosome haplotypes encompassing 2.6 million SNPs. It is estimated that about 5% of heterozygous SNPs were detected but remain unphased, and about 5% were undetected.
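Two of the bookkeeping steps mentioned above, grouping reads by well barcode and summarizing contig sizes by N50, can be sketched in Python as follows. This is a minimal illustration and not the custom alignment or haplotyping algorithm itself; the input layout (pairs of barcode and read) and all names and example values are hypothetical.

```python
from collections import defaultdict


def group_reads_by_barcode(tagged_reads):
    """Group reads by their well barcode.

    tagged_reads: iterable of (barcode, read) pairs; this layout is an
    assumption made for illustration.
    """
    wells = defaultdict(list)
    for barcode, read in tagged_reads:
        wells[barcode].append(read)
    return dict(wells)


def n50(lengths):
    """Largest length L such that contigs of length >= L together
    contain at least half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contig lengths given")


# Hypothetical 6-base barcodes and read names.
reads = [("AACGTT", "read1"), ("GGTCAA", "read2"), ("AACGTT", "read3")]
print(group_reads_by_barcode(reads))
# {'AACGTT': ['read1', 'read3'], 'GGTCAA': ['read2']}

print(n50([100, 200, 300, 400]))  # 300
```

Grouping by barcode restricts each assembly to the reads from a single well, in which a given region is most likely represented by only one parental chromosome.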
Example 8: ligation of combinatorial adaptors to DNA fragments
In a first step, adaptor A was ligated to both sides of the genomic DNA fragment in a reaction using T4 ligase. The ligation was carried out at 14 ℃ for 2 hours. The DNA adaptor ratio was about 30: 1. The following concentrations of reactants were used for this first step of the process.
The partially tagged DNA fragments are denatured and then annealed with primers complementary to adaptor a. The polymerase extends from the primer to generate double-stranded fragments, each tagged at one end with an adaptor. The following concentrations of reactants were used for this step of the process.
The protocol used with the above reactions was 3 minutes at 95 ℃, 1 minute at 55 ℃, and 10 minutes at 72 ℃, followed by a drop to 4 ℃.
The next step of the method was to ligate adaptor B to the blunt ends generated during primer extension. Again, the mixture was incubated at 14 ℃ for 2 hours. The DNA adaptor B ratio was about 15:1. The following concentrations of reactants were used for this step of the process.
The present specification provides a complete description of the methodology, systems and/or structures of the technology described herein and their use in the examples. Although various aspects of the technology have been described above with a certain degree of particularity, or with respect to one or more individual aspects, those skilled in the art could make numerous alterations to the disclosed aspects without departing from the spirit or scope of this technology. Since many modifications may be made without departing from the techniques described herein, the proper scope of the invention resides in the claims hereinafter appended. Other aspects are therefore also contemplated. Further, it should be understood that any operations may be performed in any order, unless explicitly stated otherwise or a specific order is required by the language of the claims. It is intended that all matter contained in the above description and shown in the accompanying drawings shall be interpreted as illustrative only of particular aspects and not limiting to the above-described embodiments. Unless otherwise clear from the context or explicitly stated, any concentration values given herein are generally in terms of mixed liquor values or percentages, and do not take into account any transformations upon or after addition of the particular components of the mixture. All published references and patent documents referred to in this disclosure are incorporated by reference herein in their entirety for all purposes to the extent not already expressly incorporated herein. Changes in detail or structure may be made without departing from the basic elements of the inventive technique as defined in the following claims.
Claims (20)
1. A method of obtaining sequence information from a DNA fragment, the method comprising:
(a) physically separating the DNA fragments into separate aliquots;
(b) treating the aliquot from step (a) by mechanical shearing, enzymatic treatment, chemical treatment or a CoRE fragmentation method to produce shorter fragments of the DNA;
(c) ligating each of the shorter fragments in the aliquot with one or more adaptor tags, thereby generating tagged fragments, wherein the adaptor tags have a sequence or combination of sequences that is unique for each aliquot;
(d) combining the tagged fragments from different aliquots into a mixture;
(e) obtaining sequence reads for each tagged fragment in the mixture, wherein each of the sequence reads comprises sequence information for both an adaptor tag and the DNA;
(f) combining sequence reads from tagged fragments having the same adaptor tag into a sequence of a longer contiguous region; and
(g) collecting the assembled sequence reads as sequence information for the DNA, wherein the sequence information comprises the sequence of the longer contiguous region.
2. A method of obtaining sequence information from a DNA fragment, the method comprising:
(a) physically separating the DNA fragments into separate aliquots such that the probability of a given region of the genome in the maternal and paternal components being in the same aliquot is less than 1%;
(b) treating the aliquot from step (a) by mechanical shearing, enzymatic treatment, chemical treatment or a CoRE fragmentation method to produce shorter fragments of the DNA;
(c) ligating the shorter fragments with one or more adaptor tags, thereby generating tagged fragments, wherein the adaptor tags have a sequence or combination of sequences that is unique for each aliquot;
(d) combining the tagged fragments from different aliquots into a mixture;
(e) obtaining a sequence read for each tagged fragment in the mixture, wherein the sequence read comprises sequence information for both the adaptor tag and the DNA;
(f) assembling the sequence reads to generate assembled sequence information for the DNA, wherein the assembled sequence information comprises heterozygous loci; and
(g) phasing the heterozygous loci using sequence information from the adaptor tags to identify the aliquots from which the sequence reads originate.
3. The method of claim 1 or 2, wherein the aliquot is an emulsion droplet.
4. The method of claim 3, further comprising combining the tagged fragments into a single mixture.
5. The method of claim 4, wherein the droplets containing the shorter fragments or the droplets containing the adaptor tags further comprise a ligase.
6. The method of claim 1 or 2, wherein the fragments are amplified before, after, or both before and after dividing into aliquots.
7. The method of claim 1 or 2, wherein step (b) comprises digesting the fragments with an endonuclease thereby generating shorter fragments.
8. The method of claim 1 or 2, wherein each of the shorter fragments has an adaptor tag with a sequence that is unique for each aliquot.
9. The method of claim 1 or 2, wherein each adaptor tag is designed as two segments, wherein one segment is common to all aliquots and the other segment is unique to each aliquot.
10. The method of claim 1 or 2, wherein the unique sequence or combination of sequences is capable of uniquely labeling 65,000 aliquots.
11. The method of claim 1 or 2, which is carried out in a microfluidic device.
12. The method of claim 1 or 2, wherein the sequence read in step (e) is obtained by combinatorial probe-anchored ligation in which nucleotides at specific detection positions in the DNA are detected by means of a probe ligation product formed by ligation of at least one anchor probe to a sequencing probe, wherein the anchor probe is fully or partially hybridized to an adaptor and the sequencing probe is hybridized to the tag fragment at a position adjacent to the adaptor.
13. The method of claim 1 or 2, wherein a plurality of rows of aliquots is used, wherein each row of aliquots is tagged such that an aliquot in each subsequent row can be identified by its originating aliquot in the previous row.
14. The method of claim 1 or 2, wherein 95% of the base pairs in an aliquot do not overlap.
15. The method of claim 1 or 2, wherein the aliquots in step (a) each comprise less than 1% of a haploid genome.
16. The method of claim 1 or 2, wherein at least 10,000 different aliquots contain fragments that are tagged with different tag component combinations.
17. The method of claim 1 or 2, which produces haplotype reads that exceed 100kb in length.
18. The method of claim 1 or 2, wherein the long pieces of DNA assemble into a diploid genome.
19. The method of claim 1 or 2, which is a method of haplotyping a diploid chromosome.
20. Use of sequence information from sequence reads of tagged fragments for phasing heterozygous loci of a target DNA, wherein the sequence information is obtained by:
(a) physically separating the DNA fragments into separate aliquots such that the probability of a given region of the genome in the maternal and paternal components being in the same aliquot is less than 1%;
(b) treating the aliquot from step (a) by mechanical shearing, enzymatic treatment, chemical treatment or a CoRE fragmentation method to produce shorter fragments of the DNA;
(c) ligating the shorter fragments with one or more adaptor tags, thereby generating tagged fragments, wherein the adaptor tags have a sequence or combination of sequences that is unique for each aliquot;
(d) combining the tagged fragments from different aliquots into a mixture;
(e) obtaining a sequence read for each tagged fragment in the mixture, wherein the sequence read includes sequence information for both the adaptor tag and the DNA.
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US61/187,162 | 2009-06-15 | | |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1170531A | 2013-03-01 |
| HK1170531B | 2018-03-29 |