NUCLEOTIDE SEQUENCE ANALYSIS BY QUANTIFICATION
OF MUTAGENESIS
FIELD OF THE INVENTION
[0001] THIS INVENTION relates generally to sequence analysis. More particularly, the present invention relates to a method for wholly or partially determining the sequence of a polymer of subunits. Even more particularly, the present invention relates to the construction of a plurality of secondary polymers, each varying from a primary polymer by the substitution of at least one subunit with a subunit of a different species, and to their use for inferring information about the primary polymer.
BACKGROUND OF THE INVENTION
[0002] DNA sequencing using DNA polymerase based extension products is frequently impaired by sequence motifs present within the template DNA that form secondary structures or other structural forms that impede the processive extension of DNA polymerase catalysed products through the region of the motif (Tabor and Richardson, 1987 Proc. Natl. Acad. Sci. USA 84:4767-4771; Donlin and Johnson, 1994 Biochemistry 33:14908-14917; Weinshenker et al., 1998 Biotechniques 25:68-72). These motifs are typically sequence motifs such as CG-rich motifs which present high thermal and structural stability (Mizusawa et al., 1986 Nucleic Acids Research 14:1319-1324; McConlogue et al., 1988 Nucleic Acids Research 16:9869; Liu and Sommer, 1998 Biotechniques 25:1022-1028; Haqqi et al., 1988 Nucleic Acids Research 16:11844; Fernandez-Rachubinski et al., 1990 DNA Sequence 1:137-140; Perng et al., 1994 Journal of Virological Methods 46:111-116; Motz et al., 2000 Biotechniques 29:268-270). Similarly, AT-rich sequence motifs (Quail, 2001 DNA Sequence 12:355-359; Glockner et al., 2002 Nature 418:79-85; Gardner et al., 1998 Science 282:1126-1132) and other repetitive sequence motifs (Baran, Lapidot & Manor, 1991 Proc. Natl. Acad. Sci. USA 88:507-511; Bieth et al., 1997 Gene 194:97-105; Razin et al., 2001 Journal of Molecular Biology 307:481-486; Voet et al., 1997 Yeast 13:177-182; Thoraval et al., 1996 Proc. Natl. Acad. Sci. USA 93:4442-4447; Kaiser et al., 2002 Clinical Biochemistry 35:49-56) that potentially allow extension from either aligned triplex strands or from misaligned, partially replicated duplex primed ends, or from other forms of simple repeat, homopolymer, and other potentially fold-back and stem-loop forming or kinked-DNA forming regions that conform to these patterns (Razin et al., 2001 Journal of Molecular Biology 307:481-486; Kurahashi et al., 2000 Human Molecular Genetics 9:1665-1670; Kang and Cox, 1996 Genomics 35:189-195; Moreau et al., 1998 International Microbiology 1:35-43) are well known from the literature to limit or to prevent entirely the processive synthesis of DNA by DNA polymerases. Indeed, Schlötterer and Tautz (1992 Nucleic Acids Research 20:211-215) reported that it is possible to synthesise all variant types of repetitious di- and tri-nucleotide simple sequence DNA motifs starting from short primers, a simple sequence template and a DNA polymerase in vitro.
[0003] Incorporation of non-mutagenic nucleotide analogues has been employed widely to replace particular specific cognate native deoxyribonucleotides in order to effect a reduction in the thermal stability of the duplex DNA within a refractory sequence motif. Such incorporation has been found to improve the ability to sequence through particular types of nucleotide structure or sequence motif that are recalcitrant to sequencing. The nucleotide analogues used for these purposes have particular properties, which typically are: (i) the ability to reduce the stability of nucleotide-to-nucleotide base pairing across the DNA duplex; (ii) the ability to be efficiently incorporated into DNA by preferred DNA polymerases; (iii) the ability to be incorporated adjacent to particular cognate bases such that the sequence of the DNA duplex is not altered, except for the presence of the nucleotide analogue in place of a particular cognate base; and (iv) the ability of the nucleotide analogue to be "read" as the same nucleotide base as the base it replaces. The replacement of certain nucleotides with nucleotide analogues achieves a more uniform distribution of local thermal stability of DNA across the entire DNA region under investigation, particularly when local regions of CG-rich DNA are present.
[0004] The nucleotide analogue 2'-deoxyinosine triphosphate (dITP), which is incorporated efficiently into DNA by DNA polymerases, is used widely to alter the stability of DNA pairing (Kawase et al., 1986 Nucleic Acids Research 14:7727-7736; Bergstrom et al., 1997 Nucleic Acids Research 25:1935-1942; Strobel and Shetty, 1997 Proc. Natl. Acad. Sci. USA 94:2903-2908; Wong and McClelland, 1991 Nucleic Acids Research 19:1081-1085; Ikeda et al., 1992 Journal of Biological Chemistry 267:6291-6296). Although it is known to use error-prone PCR for randomly mutating genes by altering the concentrations of respective dNTPs in the presence of dITP (Leung et al., 1989 Nucleic Acid Res. 17:1177-1195; Caldwell and Joyce, 1992 PCR Methods and Applications 2:28-33; Kuipers, 1996 Methods in Molecular Biology 57:351-356; Spee et al., 1993 Nucleic Acids Research 21:777-778), dITP induces only low-level mutation of DNA (at a frequency of approximately 4 × 10⁻³) after elevated cycles of PCR amplification (Spee et al., 1993 Nucleic Acids Research 21:777-778; Shibata, 1994 Nippon Rinsho 52:1665-1673; Kuipers, 1996 Methods in Molecular Biology 57:351-356) and effects mainly A → G and T → C transitions as well as infrequent C → G transversions.
[0005] In addition, although widely used to modify the thermal stability of DNA, the analogue 7-deaza-dGTP is not reported to be mutagenic to DNA (McClary et al., 1991 DNA Sequence 1:173-180; Baran, Lapidot & Manor, 1991 Proc. Natl. Acad. Sci. USA 88:507-511; Haqqi et al., 1988 Nucleic Acids Research 16:11844).
[0006] Although nucleotide analogues have heretofore been used in methods to reduce the thermal stability of difficult-to-sequence regions of DNA in order to improve their ability to be sequenced, these methods do not introduce mutations at an intensity sufficient to substantially alter the structural characteristics of these regions and are, therefore, often inefficient.
[0007] In work leading up to the present invention, a novel strategy for sequencing a target region of a polynucleotide was developed to deal with problematic structural features as, for example, described above. The strategy involves producing a plurality of variant polynucleotides that vary from the original polynucleotide in the target region by the substitution of at least one nucleotide with a nucleotide of a different species and quantifying each species of nucleotide at individual positions in the target region of all the variant polynucleotides to thereby determine the nucleotide species located at the same positions in the original polynucleotide. This novel strategy has been reduced to practice in methods for analysing multi-subunit sequences, including the analysis of molecules whose local sequence characteristics render them refractory to sequence analysis, as described hereinafter.
SUMMARY OF THE INVENTION
[0008] The present invention provides methods that generally take advantage of altering a sequence of subunits selected from a finite set of possible subunit species to produce a multiplicity of secondary subunit sequences, from which the unaltered sequence can be determined. In particular, in one aspect, the present invention provides methods for wholly or partially determining the sequence of a target region of a primary polymer of subunits from a multiplicity of secondary polymers that vary from the primary polymer in the target region by the substitution of at least one subunit, including a first subunit at a first position, with a subunit of a different species, wherein each species of subunit at the first position correlates with a distinct detectable signal. These methods generally comprise analysing the detectable signals that correlate with the first position of all the secondary polymers, collectively, to determine the species of subunit, which is in higher abundance than other species of subunit at the first position, and which corresponds to the species of subunit at the first position in the target region of the primary polymer. Typically, individual secondary polymers vary from other secondary polymers at the position(s) of variation with the primary polymer. The polymers are suitably selected from nucleic acid polymers and amino acid polymers. In specific embodiments exemplified herein, the polymers are nucleic acid polymers. Advantageously, the secondary polymers are generated by mutagenesis, which is typically random. In specific embodiments, an individual secondary polymer is formed using the primary polymer as a template for polymerisation or using another secondary polymer that has been directly or indirectly formed using the primary polymer as a template for polymerisation.
[0009] In certain embodiments, the subunit of a different species is a naturally-occurring subunit species. Typically, the naturally-occurring subunit species is incorporated into an individual secondary polymer by polymerising that polymer in the presence of another secondary polymer having a mutagenic subunit species at the first position, wherein the mutagenic subunit species serves as a template for incorporating at least two naturally-occurring subunit species at the first position of the individual secondary polymer. Desirably, the mutagenic subunit species induces mutation at a frequency generally greater than about 1 × 10⁻².
[0010] In some embodiments, the secondary polymers vary from the primary polymer in the target region by the substitution of a second subunit at a second position with a subunit of a different species, wherein each species of subunit at the second position correlates with a distinct detectable signal. In these embodiments, the methods further comprise analysing the detectable signals that correlate with the second position of all the secondary polymers, collectively, to determine the species of subunit, which is in higher abundance than other species of subunit at the second position, and which corresponds to the species of subunit at the second position in the target region of the primary polymer.
[0011] In some embodiments, the secondary polymers vary from the primary polymer in the target region by the substitution of subunits at a multiplicity of positions with subunits of different species, wherein each species of subunit at an individual position correlates with a distinct detectable signal. In these embodiments, the methods further comprise analysing the detectable signals that correlate with an individual position of all the secondary polymers, collectively, to determine the species of subunit, which is in higher abundance than other species of subunit at the individual position, and which corresponds to the species of subunit at the individual position in the target region of the primary polymer.
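By way of illustration only, the position-by-position determination described above can be sketched as a simple tally over aligned secondary sequences. The function name `consensus`, the example sequences and the assumption that the secondary polymers are already available as aligned strings of equal length are hypothetical and are not taken from the specification; ties between equally abundant species are not handled in this sketch.

```python
from collections import Counter


def consensus(variants):
    """Return the most abundant subunit at each position across aligned variants.

    Assumption: every variant is a string of the same length, already aligned
    to the target region of the primary polymer.
    """
    if not variants:
        return ""
    length = len(variants[0])
    if any(len(v) != length for v in variants):
        raise ValueError("variants must be aligned to the same length")
    called = []
    for position in range(length):
        # Count each subunit species observed at this position.
        counts = Counter(v[position] for v in variants)
        # Keep the species in higher abundance than the others.
        subunit, _ = counts.most_common(1)[0]
        called.append(subunit)
    return "".join(called)


# Hypothetical secondary polymers, each differing from "ACGTAC" at one position.
secondary = ["ACGTAC", "ACGTGC", "ATGTAC", "ACGCAC", "GCGTAC"]
print(consensus(secondary))  # ACGTAC
```

Although no single variant need be identical to the primary polymer, the most abundant species at each position still recovers the unaltered sequence, provided substitutions at any one position remain in the minority.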
[0012] In some embodiments, the detectable signals are analysed by: measuring for each species of subunit at an individual position at least one parameter that correlates with that subunit; and processing the measured parameter(s) to determine the abundance of each subunit species relative to other subunit species at the individual position.
[0013] In some embodiments, the measured parameters are further processed by comparing them to determine the species of subunit that is in higher abundance than the other species of subunit at the individual position.
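A minimal sketch of the measuring and comparing steps described above follows, assuming that a single measured parameter per subunit species (for example, a fluorescence intensity) is available at an individual position and that it scales in proportion to abundance. The function names and the numerical values are hypothetical; in practice, label-specific response factors may need to be applied before comparison.

```python
def relative_abundance(signals):
    """Convert measured parameters per subunit species into relative abundances.

    Assumption: each measured value is proportional to the abundance of the
    corresponding subunit species at this position.
    """
    total = sum(signals.values())
    if total <= 0:
        raise ValueError("no signal measured at this position")
    return {species: value / total for species, value in signals.items()}


def call_subunit(signals):
    """Return the subunit species in higher abundance than the others."""
    abundance = relative_abundance(signals)
    return max(abundance, key=abundance.get)


# Hypothetical measurements at one position of the target region.
measured = {"A": 120.0, "C": 15.0, "G": 60.0, "T": 5.0}
print(relative_abundance(measured)["A"])  # 0.6
print(call_subunit(measured))             # A
```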
[0014] In certain embodiments, the parameter is a label-associated parameter, which includes, but is not restricted to, parameters relating to fluorescence emission, luminescence, phosphorescence, infrared radiation, electromagnetic scattering including light and x-ray scattering, light transmittance, light absorbance, electrical impedance and molecular mass.
[0015] In some embodiments, the variant sequence of a secondary polymer is produced by mutagenesis of the target sequence of the primary polymer. In other embodiments, the variant sequence of a secondary polymer is produced by mutagenesis of the variant sequence of another secondary polymer. In certain embodiments, a parent target sequence is mutagenised to produce at least one variant sequence in which at least 2, 5, 10, 15, 20, 25, 30 or 35% of subunits are different than the parent target sequence. Suitably, the mutagenesis of the parent target sequence is random.
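Random mutagenesis of a parent target sequence, and the fraction of subunits that differ from the parent, can be modelled in silico as sketched below. This is an illustrative simulation only: the uniform choice among the three alternative bases, the per-position substitution rate and the function names are assumptions, and real mutagenic analogues typically favour particular transitions.

```python
import random

BASES = "ACGT"


def mutagenise(parent, rate, rng):
    """Substitute each position independently with probability `rate`.

    Assumption: a substituted position receives one of the three other
    bases with equal probability.
    """
    out = []
    for base in parent:
        if rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def fraction_different(parent, variant):
    """Fraction of positions at which the variant differs from the parent."""
    differing = sum(1 for p, v in zip(parent, variant) if p != v)
    return differing / len(parent)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
parent = "GCGCGCGCGGCCGCGCGCGC"
variant = mutagenise(parent, 0.20, rng)
print(variant, fraction_different(parent, variant))
```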
[0016] In another aspect, the invention encompasses the whole or partial sequence of a target region of a primary polymer, as determined by the methods broadly described above.
[0017] Advantageously, the target region of the primary polymer is refractory to sequence analysis or repeat-length analysis and the variation in the corresponding regions of the secondary polymers is associated with the abrogation, inhibition or amelioration of the refractory behaviour. For example, in the case of nucleic acid polymers, local sequence characteristics, including inverted repeats or palindromes, which may be present in the target region of the primary nucleic acid polymer, may be modified in the secondary or variant nucleic acid polymers to change the structure of the target region in whole or in part such that formation of stem-and-loop structures, for example, is prevented, reduced or otherwise weakened. Sequencing of several sequence variants simultaneously as disclosed herein can permit the deduction of the whole or partial sequence of the target region. Similarly, fragment repeat-length analysis of several sequence variants simultaneously as disclosed herein can permit the deduction of the whole or partial repeat-length of the target region.
[0018] Thus, certain embodiments of the present invention relate to the mutagenesis of a target sequence of a parent polynucleotide. In illustrative examples of this type, the target sequence is mutagenised using a repair deficient host, which is desirably a bacterium. In other illustrative examples, the target sequence is mutagenised using a low fidelity nucleic acid amplification reaction and an error prone DNA polymerase, which is suitably thermostable. In specific embodiments, the target sequence is mutagenised using a nucleic acid amplification reaction and a DNA polymerase, which is suitably thermostable. In still other illustrative examples, the target sequence is mutagenised using an isothermal nucleic acid amplification reaction and a processive "rolling circle amplification" DNA polymerase. In still other illustrative examples, the target sequence is mutagenised using an isothermal nucleic acid amplification reaction and an error prone DNA polymerase, e.g., using a "sloppier-copier polymerase" or other "Y-family polymerase" in concert with a processive "rolling circle amplification" DNA polymerase. In still other illustrative examples, the target sequence is mutagenised using a nucleic acid amplification reaction and an RNA polymerase, wherein the template used for amplification is RNA. In other illustrative examples, the target sequence is mutagenised using a nucleic acid amplification reaction and an error prone DNA polymerase, which is suitably a "Reverse Transcriptase" DNA polymerase.
[0019] In specific embodiments, the target sequence is mutagenised by incorporation of mutagenic nucleotide analogues. In illustrative examples of this type, the mutagenesis facilitates random replacement of nucleotides in the target sequence with at least one nucleotide analogue, which, through its adoption of different tautomeric forms that base pair with alternative nucleotides, results in transition and/or transversion mutagenesis of the target sequence to produce a mixture of randomly mutated polynucleotides (secondary polynucleotides) that vary from the parent polynucleotide in the target sequence by the substitution of at least one naturally-occurring nucleotide with a different naturally-occurring nucleotide. The mutagenesis suitably produces polynucleotides selected from:
[0020] (i) a mixture of polynucleotides, the sequence of individual polynucleotides being mutated randomly with a single mutagenic nucleotide analogue;
[0021] (ii) a mixture of polynucleotides, the sequence of individual polynucleotides being mutated randomly with a plurality of mutagenic nucleotide analogues;
[0022] (iii) a mixture of polynucleotides including a plurality of polynucleotide subsets, individual polynucleotides of each subset being mutated randomly and independently with a distinct mutagenic nucleotide analogue;
[0023] (iv) a mixture of greater than 5 polynucleotides, preferably a mixture of greater than 7 polynucleotides, more preferably a mixture of greater than 10 polynucleotides, more preferably a mixture of greater than 20 polynucleotides, even more preferably a mixture of greater than 50 polynucleotides and even more preferably a mixture of greater than 100 polynucleotides, the sequence of individual polynucleotides being mutated randomly with a single mutagenic nucleotide analogue;
[0024] (v) a mixture of greater than 5 polynucleotides, preferably a mixture of greater than 7 polynucleotides, more preferably a mixture of greater than 10 polynucleotides, more preferably a mixture of greater than 20 polynucleotides, even more preferably a mixture of greater than 50 polynucleotides and even more preferably a mixture of greater than 100 polynucleotides, the sequence of individual polynucleotides being mutated randomly with a plurality of mutagenic nucleotide analogues;
[0025] (vi) a mixture of greater than 5 polynucleotides, preferably a mixture of greater than 7 polynucleotides, more preferably a mixture of greater than 10 polynucleotides, more preferably a mixture of greater than 20 polynucleotides, even more preferably a mixture of greater than 50 polynucleotides and even more preferably a mixture of greater than 100 polynucleotides including a plurality of polynucleotide subsets, individual polynucleotides of each subset being mutated randomly and independently with a distinct mutagenic nucleotide analogue;
[0026] (vii) a mixture of polynucleotides, the sequence of individual polynucleotides being mutated randomly with a single mutagenic nucleotide analogue and further altered at one or more positions by the substitution of a single non-mutagenic nucleotide analogue for a cognate, typically naturally-occurring, nucleotide, examples of non-mutagenic nucleotide analogues including, but not restricted to, 7-deaza-dGTP, 7-deaza-dATP, 8-aza-7-deaza-dATP, 5-methyl-dCTP, N4-methyl-dCTP, dITP, 7-deaza-7-nitro-dATP, 7-deaza-7-nitro-dGTP, 5-hydroxy-dCTP, 5-hydroxy-dUTP and borano-deoxynucleotide analogues;
[0027] (viii) a mixture of polynucleotides, the sequence of individual polynucleotides being mutated randomly with a plurality of mutagenic nucleotide analogues and further altered at one or more positions by the substitution of a single non-mutagenic nucleotide for a cognate nucleotide;
[0028] (ix) a mixture of polynucleotides including a plurality of polynucleotide subsets, individual polynucleotides of each subset being mutated randomly and independently by a distinct mutagenic nucleotide analogue and further altered at one or more positions by the substitution of a single non-mutagenic nucleotide for a cognate nucleotide;
[0029] (x) a mixture of greater than 5 polynucleotides, preferably a mixture of greater than 7 polynucleotides, more preferably a mixture of greater than 10 polynucleotides, more preferably a mixture of greater than 20 polynucleotides, even more preferably a mixture of greater than 50 polynucleotides and even more preferably a mixture of greater than 100 polynucleotides, the sequence of individual polynucleotides being mutated randomly with a single mutagenic nucleotide analogue and further altered at one or more positions by the substitution of a single non-mutagenic nucleotide for a cognate nucleotide;
[0030] (xi) a mixture of greater than 5 polynucleotides, preferably a mixture of greater than 7 polynucleotides, more preferably a mixture of greater than 10 polynucleotides, more preferably a mixture of greater than 20 polynucleotides, even more preferably a mixture of greater than 50 polynucleotides and even more preferably a mixture of greater than 100 polynucleotides, the sequence of individual polynucleotides being mutated randomly with a plurality of mutagenic nucleotide analogues and further altered at one or more positions by the substitution of a single non-mutagenic nucleotide for a cognate nucleotide;
[0031] (xii) a mixture of greater than 5 polynucleotides, preferably a mixture of greater than 7 polynucleotides, more preferably a mixture of greater than 10 polynucleotides, more preferably a mixture of greater than 20 polynucleotides, even more preferably a mixture of greater than 50 polynucleotides and even more preferably a mixture of greater than 100 polynucleotides including a plurality of polynucleotide subsets, individual polynucleotides of each subset being mutated randomly and independently with a distinct mutagenic nucleotide analogue and further altered at one or more positions by the substitution of a single non-mutagenic nucleotide for a cognate nucleotide;
[0032] (xiii) a mixture of polynucleotides, the sequence of individual polynucleotides being mutated randomly with a single mutagenic nucleotide analogue and further altered at one or more positions by the introduction of modified nucleotides which have increased chemical reactivity, examples of which include, but are not restricted to, 7-deaza-7-nitro-dATP, 7-deaza-7-nitro-dGTP, 5-methyl, 5-ethyl, 5-bromo or 5-iodo substitution for the 5-hydrogen of cytosine forming 2'-deoxycytidine 5'-(alpha-P-borano) triphosphates, 5-hydroxy-dCTP, 5-hydroxy-dUTP and dITP;
[0033] (xiv) a mixture of polynucleotides, the sequence of individual polynucleotides being mutated randomly with a plurality of mutagenic nucleotide analogues and further altered at one or more positions by the introduction of modified nucleotides which have increased chemical reactivity;
[0034] (xv) a mixture of polynucleotides including a plurality of polynucleotide subsets, individual polynucleotides of each subset being mutated randomly and independently by a distinct mutagenic nucleotide analogue and further altered at one or more positions by the introduction of modified nucleotides which have increased chemical reactivity;
[0035] (xvi) a mixture of greater than 5 polynucleotides, preferably a mixture of greater than 7 polynucleotides, more preferably a mixture of greater than 10 polynucleotides, more preferably a mixture of greater than 20 polynucleotides, even more preferably a mixture of greater than 50 polynucleotides and even more preferably a mixture of greater than 100 polynucleotides, the sequence of individual polynucleotides being mutated randomly with a single mutagenic nucleotide analogue and further altered at one or more positions by the introduction of modified nucleotides which have increased chemical reactivity;
[0036] (xvii) a mixture of greater than 5 polynucleotides, preferably a mixture of greater than 7 polynucleotides, more preferably a mixture of greater than 10 polynucleotides, more preferably a mixture of greater than 20 polynucleotides, even more preferably a mixture of greater than 50 polynucleotides and even more preferably a mixture of greater than 100 polynucleotides, the sequence of individual polynucleotides being mutated randomly with a plurality of mutagenic nucleotide analogues and further altered at one or more positions by the introduction of modified nucleotides which have increased chemical reactivity; and
[0037] (xviii) a mixture of greater than 5 polynucleotides, preferably a mixture of greater than 7 polynucleotides, more preferably a mixture of greater than 10 polynucleotides, more preferably a mixture of greater than 20 polynucleotides, even more preferably a mixture of greater than 50 polynucleotides and even more preferably a mixture of greater than 100 polynucleotides including a plurality of polynucleotide subsets, individual polynucleotides of each subset being mutated randomly and independently by a distinct mutagenic nucleotide analogue and further altered at one or more positions by the introduction of modified nucleotides which have increased chemical reactivity.
[0038] When the above mixtures of (secondary) polynucleotides are used as templates in further rounds of polymerisation to produce progeny polynucleotides, or are processed to form sequencing polynucleotide fragments, using, for example, a polymerase and conventional nucleotide triphosphates (dATP, dCTP, dGTP, dTTP and dUTP), the progeny polynucleotides or sequencing polynucleotide fragments thus produced, which are examples of secondary polynucleotides as defined herein, will have substituted for the nucleotide analogue at a given position a naturally-occurring nucleotide that base pairs or complements with that nucleotide analogue. This means that for an individual position so substituted, the position will vary between the progeny polynucleotides or between the sequencing polynucleotide fragments; most of these will contain at the individual position the correct nucleotide which is present in the unmutagenised (or parent) polynucleotide, whereas some will contain an incorrect or mutant nucleotide at the same position.
[0039] Identification of the correct nucleotide at a specified position is predicated in part on the random incorporation of incorrect nucleotides in the secondary polynucleotides (e.g., progeny polynucleotides or sequencing polynucleotide fragments) at a frequency of no more than 70%, 65%, 60%, 55%, 50%, 45%, 40%, 35%, 30%, 25%, 20%, 15%, 10%, 5% or 1%. Desirably, the frequency of incorporation of incorrect nucleotides is chosen so that, across the collection of secondary polynucleotides, there are more correct nucleotides at a specified position than incorrect nucleotides. Thus, quantification of each species of nucleotide at a particular position within a target sequence will reveal the nucleotide species which is in higher abundance than other species of nucleotides at that position and which corresponds to the correct nucleotide species in the target sequence of the parent polynucleotide.
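The dependence of a correct determination on the frequency of incorrect incorporation and on the number of secondary polynucleotides can be illustrated with a simple binomial calculation. The sketch below assumes, purely for illustration, that each secondary polynucleotide independently carries either the correct nucleotide or a single alternative nucleotide at the position of interest; the function name and the example values are hypothetical.

```python
from math import comb


def p_correct_in_majority(n, error_rate):
    """Probability that correct nucleotides outnumber incorrect ones.

    Assumptions: n independent secondary polynucleotides, each carrying an
    incorrect nucleotide at this position with probability `error_rate`,
    and only one alternative nucleotide species at the position.
    """
    correct = 1.0 - error_rate
    # Sum the binomial terms in which the correct count k exceeds n / 2.
    return sum(
        comb(n, k) * correct**k * error_rate ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


for n in (5, 21, 101):
    print(n, p_correct_in_majority(n, 0.30))
```

Under these assumptions, an incorporation frequency below 50% gives a probability that approaches 1 as the number of secondary polynucleotides increases, which is consistent with the preference for larger mixtures.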
[0040] Accordingly, in another aspect, the invention provides methods for wholly or partially determining the sequence of a target region of a primary polynucleotide from a multiplicity of secondary polynucleotides that vary from the primary polynucleotide in the target region by the substitution of at least one nucleotide, including a first nucleotide at a first position, with a nucleotide of a different species, wherein each species of nucleotide at the first position correlates with a distinct detectable signal. These methods generally comprise analysing the detectable signals that correlate with the first position of all the secondary polynucleotides, collectively, to determine the species of nucleotide, which is in higher abundance than other species of nucleotide at the first position, and which corresponds to the species of nucleotide at the first position in the target region of the primary polynucleotide. In some embodiments, the different species of nucleotide is a naturally-occurring nucleotide species that is incorporated into an individual secondary polynucleotide using another secondary polynucleotide as a template for polymerisation, wherein the secondary polynucleotide comprises a mutagenic nucleotide analogue at the first position that complements the naturally-occurring nucleotide and at least one other naturally-occurring nucleotide.
[0041] In yet another aspect, the invention contemplates methods for wholly or partially determining the sequence of a target region of a primary polynucleotide from a multiplicity of secondary polynucleotides that vary from the primary polynucleotide in the target region by the substitution of at least one nucleotide, including a first nucleotide at a first position, with a nucleotide of a different species. These methods generally comprise: (a) separating sequencing polynucleotide fragments formed from the secondary polynucleotides and having lengths indicative of the positions of the nucleotides within the target region, including fragments having lengths indicative of the first position, as a function of fragment length, wherein each species of nucleotide, whose position is indicated by a respective fragment, correlates with a distinct detectable signal; (b) detecting the detectable signals during, or at the completion of, the separation; (c) processing the detectable signals to produce a data set containing a plurality of peaks reflecting the positions and species of the nucleotides in the secondary polynucleotides, the plurality of peaks including a first group of peaks representing at least two species of nucleotide at the first position; and (d) processing the peaks of the first group, collectively, to determine the species of nucleotide, which is in higher abundance than the other species of nucleotide, at the first position, and which corresponds to the species of nucleotide at the first position in the target region of the primary polynucleotide.
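Steps (c) and (d) above can be sketched as follows, assuming that the detected signals have already been reduced to a data set of peaks, each recorded as a (position, nucleotide species, peak height) tuple. The tuple layout, the use of peak height as the measure of abundance and the function name are assumptions made for illustration only.

```python
from collections import defaultdict


def call_from_peaks(peaks):
    """Group peaks by position and keep the species with the greatest signal.

    Assumption: `peaks` is an iterable of (position, species, height) tuples
    and peak height is proportional to the abundance of that species.
    """
    groups = defaultdict(dict)
    for position, species, height in peaks:
        # Accumulate signal in case a species is reported more than once.
        groups[position][species] = groups[position].get(species, 0.0) + height
    return {
        position: max(by_species, key=by_species.get)
        for position, by_species in sorted(groups.items())
    }


# Hypothetical peak data: positions 1 and 2 each show two nucleotide species.
peaks = [
    (1, "A", 900.0), (1, "G", 200.0),
    (2, "C", 750.0), (2, "T", 310.0),
    (3, "G", 640.0),
]
calls = call_from_peaks(peaks)
print("".join(calls[p] for p in sorted(calls)))  # ACG
```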
[0042] In even yet another aspect, the invention contemplates methods for wholly or partially determining the sequence of a target region of a primary polynucleotide from a multiplicity of secondary polynucleotides that vary from the primary polynucleotide in the target region by the substitution of at least one nucleotide, including a first nucleotide at a first position, with a nucleotide of a different species. These methods generally comprise (a) separating sequencing polynucleotide fragments formed from the secondary polynucleotides and having lengths indicative of the positions of the nucleotides within the target region, as a function of molecular mass, wherein each species of nucleotide, whose position is indicated by a respective fragment, correlates with a distinct detectable signal, wherein the fragments are typically separated by a mass spectroscopic technique and especially by differential time of flight; (b) detecting the detectable signals during, or at the completion of, the separation; (c) processing the detectable signals to produce a data set containing a plurality of peaks reflecting the positions and species of the nucleotides in the secondary polynucleotides, the plurality of peaks including a first group of peaks representing at least two species of nucleotide at the first position; and (d) processing the peaks of the first group, collectively, to determine the species of nucleotide, which is in higher abundance than the other species of nucleotide, at the first position, and which corresponds to the species of nucleotide at the first position in the target region of the primary polynucleotide.
[0043] In a further aspect, the invention contemplates methods for determining the repeat-length of a target region of a primary polynucleotide from a multiplicity of secondary polynucleotides that are formed directly or indirectly using the primary polynucleotide as a template for polymerisation and that vary from the primary polynucleotide in the target region by the substitution of at least one nucleotide with a nucleotide of a different species. These methods generally comprise: (a) fractionating the secondary polynucleotides according to their length, size or mass wherein each secondary polynucleotide correlates with a detectable signal; (b) detecting the detectable signals during, or at the completion of, the fractionation; and (c) processing the detectable signals to determine the repeat-length of the polynucleotide. Desirably, the secondary polynucleotides are generated in a nucleic acid amplification reaction. In some embodiments, the secondary polynucleotides are fractionated using gel electrophoresis or mass spectrometry.
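One possible form of processing step (c) is sketched below. It assumes that fractionation yields a signal for each fragment length, that the most intense signal reflects the length of the unaltered target region (substitutions do not change length), and that the length of the flanking sequence and of the repeat unit are known. These assumptions, the function name and the example values are hypothetical and are not specified above.

```python
def repeat_length(length_signals, flank_length, unit_length):
    """Estimate the repeat tract length and the number of repeat units.

    Assumptions: `length_signals` maps fragment length (nucleotides) to
    detected signal; the modal length is taken as the true fragment length;
    `flank_length` is the total non-repeat sequence in the fragment.
    """
    modal_length = max(length_signals, key=length_signals.get)
    tract = modal_length - flank_length
    if tract < 0 or tract % unit_length != 0:
        raise ValueError("fragment length is inconsistent with the assumptions")
    return tract, tract // unit_length


# Hypothetical fractionation result for a trinucleotide repeat region.
signals = {158: 40.0, 160: 950.0, 162: 35.0}
print(repeat_length(signals, flank_length=100, unit_length=3))  # (60, 20)
```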
[0044] In still another aspect, the invention contemplates the use of a mutagenic nucleotide analogue in the manufacture of a kit for wholly or partially determining the sequence of a target region of a primary polynucleotide from a multiplicity of secondary polynucleotides that vary from the primary polynucleotide in the target region by the substitution of at least one nucleotide, as broadly described above.
[0045] In another aspect, the invention contemplates the use of a mutagenic nucleotide analogue in the manufacture of a kit for wholly or partially determining the repeat-length of a target region of a primary polynucleotide from a multiplicity of secondary polynucleotides that are formed directly or indirectly using the primary polynucleotide as a template for polymerisation and that vary from the primary polynucleotide in the target region by the substitution of at least one nucleotide, as broadly described above.
[0046] In still another aspect, the invention provides computer program products for wholly or partially deducing the sequence of a target region of a primary polymer of subunits from a multiplicity of secondary polymers that vary from the primary polymer in the target region by the substitution of at least one subunit, including a first subunit at a first position, with a subunit of a different species, wherein each species of subunit at the first position correlates with a distinct detectable signal. These computer program products generally include computer executable code which when implemented on a suitable processing system causes the processing system to process the detectable signals that correlate with the first position of all the secondary polymers, collectively, to determine the species of subunit, which is in higher abundance than other species of subunit at the first position, and which corresponds to the species of subunit at the first position in the target region of the primary polymer.
[0047] In a further aspect, the invention provides processing systems for wholly or partially deducing the sequence of a target region of a primary polymer of subunits from a multiplicity of secondary polymers that vary from the primary polymer in the target region by the substitution of at least one subunit, including a first subunit at a first position, with a subunit of a different species, wherein each species of subunit at the first position correlates with a distinct detectable signal. These processing systems are generally adapted to process the detectable signals that correlate with the first position of all the secondary polymers, collectively, to determine the species of subunit, which is in higher abundance than other species of subunit at the first position, and which corresponds to the species of subunit at the first position in the target region of the primary polymer.
[0048] The detectable signals are desirably processed to produce a data set containing a plurality of peaks reflecting the positions and species of the subunit in the secondary polymers, the plurality of peaks including a first group of peaks representing at least two species of subunit at the first position. In some embodiments of this type, the processing systems are further adapted to process the peaks of the first group, collectively, to determine the species of subunit, which is in higher abundance than the other species of subunit, at the first position, and which corresponds to the species of subunit at the first position in the target region of the primary polymer. In some embodiments, the processing systems further comprise a store for storing the data.
[0049] Suitably, the processing systems are further adapted to generate an indication of the sequence of the target region of the primary polymer. In illustrative examples of this type, the processing systems comprise a display, which displays the indication.
[0050] In yet a further aspect, the invention provides computer program products for wholly or partially deducing the sequence of a target region of a primary polynucleotide from a multiplicity of secondary polynucleotides that vary from the primary polynucleotide in the target region by the substitution of at least one nucleotide, including a first nucleotide at a first position, with a nucleotide of a different species, wherein each species of nucleotide at the first position correlates with a distinct detectable signal. These computer program products generally include computer executable code which when implemented on a suitable processing system causes the processing system to (a) process the detectable signals to produce a data set containing a plurality of peaks reflecting the positions and species of the nucleotides in the secondary polynucleotides, the plurality of peaks including a first group of peaks representing at least two species of nucleotide at the first position; and (b) process the peaks of the first group, collectively, to determine the species of nucleotide, which is in higher abundance than the other species of nucleotide, at the first position, and which corresponds to the species of nucleotide at the first position in the target region of the primary polynucleotide.
[0051] In still a further aspect, the invention provides processing systems for wholly or partially deducing the sequence of a target region of a primary polynucleotide from a multiplicity of secondary polynucleotides that vary from the primary polynucleotide in the target region by the substitution of at least one nucleotide, including a first nucleotide at a first position, with a nucleotide of a different species, wherein each species of nucleotide at the first position correlates with a distinct detectable signal. These processing systems are generally adapted to (i) process the detectable signals to produce a data set containing a plurality of peaks reflecting the positions and species of the nucleotides in the secondary polynucleotides, the plurality of peaks including a first group of peaks representing at least two species of nucleotide at the first position; and (ii) process the peaks of the first group, collectively, to determine the species of nucleotide, which is in higher abundance than the other species of nucleotide, at the first position, and which corresponds to the species of nucleotide at the first position in the target region of the primary polynucleotide.
BRIEF DESCRIPTION OF THE DRAWINGS
[0052] Figure 1 is a diagrammatic representation illustrating mutant configurations: (A) Star; (B) Path; (C) Octopus; and (D) Binary Tree.
[0053] Figure 2 is a graphical representation of an indicative calculated relative probability of miscalling of a nucleotide at any individual position varying in relationship with the number of different mutant polynucleotides in the range of 1 to 15 polynucleotides, wherein the polynucleotides are each mutated randomly at frequencies of 2%, 10%, 20%, 30% and 40%, and the relative probability is calculated on the assumption that 70% of the nucleotides at any position need be identical for detection.
[0054] Figure 3 illustrates a pair of electropherograms representative of direct cycle sequencing reactions of specific "difficult to sequence" polynucleotide fragment W51. Figure 3A represents an electropherogram of sequencing fragments derived from W51 using standard Applied Biosystems Incorporated BigDye™ dideoxy-terminator sequencing chemistry, version 3.0. In contrast, Figure 3B represents an electropherogram of sequencing fragments derived from an individual copy of W51 mutated using PCR with dPTP and sequenced using the standard ABI BigDye™ dideoxy-terminator sequencing chemistry, version 3.0.
[0055] Figure 4 illustrates a pair of electropherograms representative of direct cycle sequencing reactions of specific "difficult to sequence" polynucleotide fragment D36. Figure 4A represents an electropherogram of sequencing fragments derived from D36 using standard Applied Biosystems Incorporated BigDye™ dideoxy-terminator sequencing chemistry, version 3.0. In contrast, Figure 4B represents an electropherogram of sequencing fragments derived from an individual copy of D36 mutated using PCR with dPTP and sequenced using the standard ABI BigDye™ dideoxy-terminator sequencing chemistry, version 3.0.
[0056] Figure 5 illustrates a pair of electropherograms representative of cycle sequencing reaction of a "difficult-to-sequence" polynucleotide region from a fragment of the human RP11-167L9 BAC. The bar indicates the polyA motif. Figure 5A represents an electropherogram of sequencing fragments derived from the wild-type region using standard Applied Biosystems Incorporated BigDye™ dideoxy-terminator sequencing chemistry, version 2.0. In contrast, Figure 5B represents the sequence of the region mutated using TempliPhi™ and the analogue 8-oxo-dGTP and then sequenced using standard Applied Biosystems Incorporated BigDye™ dideoxy-terminator sequencing chemistry, version 2.0.
[0057] Figure 6 illustrates a quartet of electropherograms representative of direct cycle sequencing reactions of specific "difficult-to-sequence" polynucleotide fragment from human BAC RP11-167L9. The bar indicates the polyA motif. Figure 6A represents sequence derived from the wild type BAC RP11-167L9 fragment using standard Applied Biosystems Incorporated BigDye™ dideoxy-terminator sequencing chemistry, version 3.0. Figure 6B represents the mean sequence read derived from a mixture of nine different 8-oxo-dGTP mutated copies of the BAC RP11-167L9 fragment, sequenced using the standard ABI BigDye™ dideoxy-terminator sequencing chemistry, version 3.0. Further, Figures 6C and 6D represent the mean or 'composite' sequence read derived from a mixture of eight different dPTP mutated copies of the BAC RP11-167L9 fragments, sequenced using the standard ABI BigDye™ dideoxy-terminator sequencing chemistries version 3.0 and version 3.1, respectively.
[0058] Figure 7 is a schematic representation of a computer system useful in the practice of the present invention.
[0059] Figure 8 illustrates an electropherogram representative of cycle sequencing reaction of an individual copy of the specific polynucleotide fragment pTEST™ mutated using TempliPhi™ and the analogue 5-Br-dUTP and then sequenced using standard Applied Biosystems Incorporated BigDye™ dideoxy-terminator sequencing chemistry, version 3.0. The dots represent individual base substitutions in the sequence.
[0060] Figure 9 illustrates a pair of electropherograms representative of direct cycle sequencing reactions of specific polynucleotide fragment from pTEST™. Figure 9A represents sequence derived from the wild type pTEST™ using standard Applied Biosystems Incorporated BigDye™ dideoxy-terminator sequencing chemistry, version 3.0. Figure 9B represents the mean sequence read derived from a mixture of fifteen different dPTP mutated copies of pTEST™, sequenced using the standard ABI BigDye™ dideoxy-terminator sequencing chemistry, version 3.0.
[0061] Figure 10 illustrates a quartet of electropherograms representative of direct cycle sequencing reactions of specific "difficult-to-sequence" polynucleotide fragment from D12. Figure 10A represents sequence derived from the wild type D12 using standard Applied Biosystems Incorporated BigDye™ dideoxy-terminator sequencing chemistry, version 3.1. Figure 10B represents the mean sequence derived from a mixture of different dPTP mutated copies of D12 sequenced using the non-standard ABI BigDye™ dideoxy-terminator chemistry with the addition of 38 μM dPTP. Figures 10C and 10D represent the mean or 'composite' sequence of a mixture of different dPTP mutated copies of D12 sequenced using the non-standard ABI BigDye™ dideoxy-terminator chemistry with the addition of 76 μM and 150 μM dPTP, respectively.
[0062] Figure 11 illustrates a pair of electropherograms representative of direct cycle sequencing reactions of specific "difficult-to-sequence" polynucleotide fragment from D4. Figure 11A represents sequence derived from the wild type D4 using standard Applied Biosystems Incorporated BigDye™ dideoxy-terminator sequencing chemistry, version 3.1. Figure 11B represents the mean sequence derived from a mixture of different dPTP mutated copies of D4 sequenced using the non-standard ABI BigDye™ dideoxy-terminator chemistry with the addition of 38 μM dPTP.
[0063] Figure 12 illustrates a pair of electropherograms representative of direct cycle sequencing reactions of specific "difficult-to-sequence" polynucleotide fragment from human BAC clone RP11-167L9. The bar indicates the polyA motif. Figure 12A represents sequence derived from the wild type fragment using standard Applied Biosystems Incorporated BigDye™ dideoxy-terminator sequencing chemistry, version 3.0, whilst Figure 12B represents the mean sequence derived from a mixture of different dPTP mutated copies of RP11-167L9 sequenced using the non-standard ABI BigDye™ dideoxy-terminator chemistry, v 3.1 with the addition of 38 μM dPTP.
[0064] Figure 13 represents the mean fragment repeat lengths of nine different Simple Tandem Repeat (STR) PCR products derived from a mixture of different 8-oxo-dGTP mutated copies of the Amelogenin (non STR marker, XY marker), DYS14, D21S11, D13S317, D13S258, D13S631, D18S51, D18S851 and D18S391 loci simultaneously amplified using non-standard PCR chemistry, with the addition of 76 μM 8-oxo-dGTP. All alleles were amplified; however, only seven genotype loci are visible here (D21S11, D13S317, D13S258, D13S631, D18S51, D18S851 and D18S391) as two fall below the cut-off intensity.
DETAILED DESCRIPTION OF THE INVENTION
1. Definitions
[0065] Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by those of ordinary skill in the art to which the invention belongs. Although any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention, preferred methods and materials are described. For the purposes of the present invention, the following terms are defined below.
[0066] The articles "a" and "an" are used herein to refer to one or to more than one (i.e. to at least one) of the grammatical object of the article. By way of example, "an element" means one element or more than one element.
[0067] The term "about" is used herein to refer to parameters (e.g., amounts, concentrations, time etc) that vary by as much as 30%, 20%, 15%, 10%, 5% or even 4%, 3%, 2%, 1% to a specified parameter.
[0068] The term "complementary" refers to the topological capability or matching together of interacting surfaces of an oligonucleotide probe and its target oligonucleotide, which may be part of a larger polynucleotide. Thus, the target and its probe can be described as complementary, and furthermore, the contact surface characteristics are complementary to each other. Complementary includes base complementarity such as A is complementary to T or U, and C is complementary to G in the genetic code. However, this invention also encompasses situations in which there is non-traditional base-pairing such as Hoogsteen base pairing which has been identified in certain transfer RNA molecules and postulated to exist in a triple helix. In the context of the definition of the term "complementary", the terms "match" and "mismatch" as used herein refer to the hybridisation potential of paired nucleotides in complementary nucleic acid strands. Matched nucleotides hybridise efficiently, such as the classical A-T and G-C base pairs mentioned above. Mismatches are other combinations of nucleotides that hybridise less efficiently.
[0069] Throughout this specification, unless the context requires otherwise, the words "comprise", "comprises" and "comprising" will be understood to imply the inclusion of a stated step or element or group of steps or elements but not the exclusion of any other step or element or group of steps or elements.
[0070] By "gene" is meant a genomic nucleic acid sequence at a particular genetic locus.
[0071] By "mutagenic nucleotide analogue," or "mutagenic analogue" is meant a nucleotide analogue that is incorporated by a polymerase into a nucleic acid polymer, wherein the analogue replaces a first naturally-occurring nucleotide with which it complements and wherein the analogue complements a second nucleotide, which is other than the first nucleotide, and which is suitably a naturally-occurring nucleotide (i.e., which occurs naturally in naturally-occurring nucleic acid polymers).
[0072] The terms "modified nucleotide" and "derivatised nucleotide" mean synthetic bases, i.e., non-naturally-occurring nucleotides and nucleosides, particularly modified or derivatised adenine, guanine, cytosine, thymidine, uracil and minor bases. Although there is substantial overlap between the terms "modified" and "derivatised," modification tends to relate broadly to any difference or alteration compared to a corresponding natural base, whereas derivatisation refers more specifically to the addition or presence of different chemical groups, i.e., modification by the addition of chemical groups, functional groups and/or molecules.
[0073] "Nucleotide analogue" means a molecule that can be used in place of a naturally-occurring base in nucleic acid synthesis and processing, typically enzymatic as well as chemical synthesis and processing, particularly modified nucleotides capable of base pairing, including synthetic bases that do not comprise adenine, guanine, cytosine, thymidine, uracil or minor bases. Although there is also overlap between the terms "modified nucleotide" and "nucleotide analogue" as used herein, "modified nucleotide" typically refers to congeners of adenine, guanine, cytosine, thymidine, uracil and minor bases, whereas "nucleotide analogue" further refers to synthetic bases that may not comprise adenine, guanine, cytosine, thymidine, uracil or minor bases, i.e., novel bases. Illustrative nucleotide analogues include nucleotides in which the pentose sugar and/or one or more of the phosphate esters is replaced with its respective analogue and includes modified and derivatised nucleotides. Exemplary pentose sugar analogues are those in which one or more of the carbon atoms are each independently substituted with one or more of the same or different -R, -OR, -NRR or halogen groups, where each R is independently hydrogen, (C1-C6) alkyl or (C5-C14) aryl. The pentose sugar may be saturated or unsaturated. Illustrative pentose sugars include, but are not limited to, ribose, 2'-deoxyribose, 2'-(C1-C6)alkoxyribose, 2'-(C5-C14)aryloxyribose, 2',3'-dideoxyribose, 2',3'-didehydroribose, 2'-deoxy-3'-haloribose, 2'-deoxy-3'-fluororibose, 2'-deoxy-3'-chlororibose, 2'-deoxy-3'-aminoribose, 2'-deoxy-3'-(C1-C6)alkylribose, 2'-deoxy-3'-(C1-C6)alkoxyribose, 2'-deoxy-3'-(C5-C14)aryloxyribose, 2',3'-dideoxy-3'-haloribose and 2',3'-dideoxy-3'-fluororibose. Exemplary phosphate ester analogues include, but are not limited to, alkylphosphonates, methylphosphonates, phosphoramidates, phosphotriesters, phosphorothioates, phosphorodithioates, phosphoroselenoates, phosphorodiselenoates, phosphoroanilothioates, phosphoroanilidates, phosphoroamidates, boronophosphates, etc., including any associated counterions, if present.
[0074] The "naturally occurring nucleotides" in RNA contain the nucleobases adenine (A), guanine (G), cytosine (C) or uracil (U) and in DNA contain the nucleobases adenine (A), guanine (G), cytosine (C) or thymine (T). Nucleotides which are complementary to one another are those that tend to form complementary hydrogen bonds between them and, specifically, the natural complement to A is U or T, the natural complement to T or U is A, the natural complement to C is G and the natural complement to G is C.
[0075] The term "nucleobase" refers to a substituted or unsubstituted nitrogen-containing parent heteroaromatic ring of a type that is commonly found in polynucleotides. Typically, but not necessarily, the nucleobase is capable of forming Watson-Crick and/or Hoogsteen hydrogen bonds with an appropriately complementary nucleobase. Exemplary nucleobases include, but are not limited to, purines such as 2-aminopurine, 2,6-diaminopurine, adenine (A), ethenoadenine, N6-Δ2-isopentenyladenine (6iA), N6-Δ2-isopentenyl-2-methylthioadenine (2ms6iA), N6-methyladenine, guanine (G), isoguanine, N2-dimethylguanine (dmG), 7-methylguanine (7mG), 2-thiopyrimidine, 6-thioguanine (6sG), hypoxanthine and O6-methylguanine; 7-deaza-purines such as 7-deazaadenine (7-deaza-A) and 7-deazaguanine (7-deaza-G); pyrimidines such as cytosine (C), 5-propynylcytosine, isocytosine, thymine (T), 4-thiothymine (4sT), 5,6-dihydrothymine, O4-methylthymine, uracil (U), 4-thiouracil (4sU) and 5,6-dihydrouracil (dihydrouracil; D); indoles such as nitroindole and 4-methylindole; pyrroles such as nitropyrrole; nebularine; base (Y); etc. Additional exemplary nucleobases can be found in Fasman, 1989 Practical Handbook of Biochemistry and Molecular Biology, pp. 385-394, CRC Press, Boca Raton, Fla., and the references cited therein. Exemplary nucleobases are purines, 7-deazapurines and pyrimidines. Particularly desirable nucleobases are the normal nucleobases, defined infra, and their common analogues, e.g., 2ms6iA, 6iA, 7-deaza-A, D, 2dmG, 7-deaza-G, 7mG, hypoxanthine, 4sT, 4sU and Y.
[0076] By "obtained from" is meant that a sample such as, for example, a polynucleotide extract is isolated from, or derived from, a particular source of the host. For example, the extract can be obtained from a tissue or a biological fluid isolated directly from the host.
[0077] The term "oligonucleotide" as used herein refers to a polymer composed of a multiplicity of nucleotide residues (deoxyribonucleotides or ribonucleotides, or related structural variants or synthetic analogues thereof) linked via phosphodiester bonds (or related structural variants or synthetic analogues thereof). Thus, while the term "oligonucleotide" typically refers to a nucleotide polymer in which the nucleotide residues and linkages between them are naturally occurring, it will be understood that the term also includes within its scope various analogues including, but not restricted to, peptide nucleic acids (PNAs), phosphoramidates, phosphorothioates, methyl phosphonates, 2-O-methyl ribonucleic acids, and the like. The exact size of the molecule can vary depending on the particular application. An oligonucleotide is typically rather short in length, generally from about 8 to 30 nucleotides, more preferably from about 10 to 20 nucleotides and still more preferably from about 11 to 17 nucleotides, but the term can refer to molecules of any length, although the term "polynucleotide" or "nucleic acid" is typically used for large oligonucleotides. Oligonucleotides may be prepared using any suitable method, such as, for example, the phosphotriester method as described in an article by Narang et al. (1979 Methods Enzymol. 68:90) and U.S. Patent No. 4,356,270. Alternatively, the phosphodiester method as described in Brown et al. (1979 Methods Enzymol. 68:109) may be used for such preparation.
[0078] The terms "polymerase", "DNA polymerase", "RNA polymerase", "reverse transcriptase" and the like refer to an enzyme of interest (e.g., a single enzyme or group of enzymatic subunits) or a group of enzymes (e.g., a family of polymerases) that can catalyse the assembly of a polynucleotide from its appropriate nucleotide subunits. The target polynucleotide can designate mRNA, RNA, cRNA, cDNA, single strand DNA or double strand DNA.
[0079] The term "polymerase chain reaction" or "PCR" as used herein designates DNA or RNA which is amplified by a method in which an oligonucleotide primer is hybridised to the 5' end of each complementary strand of the double-stranded target polynucleotide or nucleic acid as described in US Patent Nos. 4,683,195 and 4,683,202. The primers are extended from the 5' end forward in the 3' direction by a DNA polymerase which incorporates free nucleotides into a nucleic acid sequence complementary to each strand of the target nucleic acid. After dissociation of the extension products from the target nucleic acid strands, the extension products become target sequences for the next cycle of primer hybridisation and subsequent extension. Repeated cycles must be carried out in order to obtain sufficient amounts of the amplified DNA, between which cycles the complementary DNA strands must be denatured under elevated temperature.
[0080] The term "polynucleotide" or "nucleic acid" as used herein designates mRNA, RNA, cRNA, cDNA or DNA. The term typically refers to oligonucleotides greater than 30 nucleotides in length. Polynucleotides or nucleic acids are understood to encompass complementary strands as well as alternative backbones described herein.
[0081] The terms "polynucleotide variant" and "variant" refer to polynucleotides that are distinguished from a reference polynucleotide by the substitution, addition or deletion of at least one nucleotide.
[0082] "Polypeptide", "peptide" and "protein" are used interchangeably herein to refer to a polymer of amino acid residues and to variants and synthetic analogues of the same. Thus, these terms apply to amino acid polymers in which one or more amino acid residues is a synthetic non-naturally occurring amino acid, such as a chemical analogue of a corresponding naturally occurring amino acid, as well as to naturally-occurring amino acid polymers.
[0083] The term "polypeptide variant" refers to polypeptides that vary from a reference polypeptide by the substitution, addition or deletion of at least one amino acid residue.
[0084] Reference herein to a "subsequence" refers to a contiguous sequence of a particular unit, value, variable or entity, that exists in part or in whole within a larger contiguous sequence of that particular unit, value, variable or entity. In this context a subsequence can refer to a contiguous sequence of nucleotides or amino acid residues within, or that is part of, a larger contiguous sequence of nucleotides or amino acid residues, respectively.
[0085] The term "target region" refers to at least a subsequence of a polymer of interest, which generally contains a structural element that interferes with sequence analysis of the subsequence.
2. A new paradigm for sequence analysis
[0086] The present invention provides a new paradigm, designated SAQOM (Sequence Analysis via Quantification Of Mutation), for determining a sequence of subunits in a target region of a polymer of interest, which is suitably, but not exclusively, difficult or impossible to sequence by conventional means. SAQOM is predicated in part on the provision or generation of a multiplicity of variant (or mutant) polymers that vary from the polymer of interest (or parent polymer) in the target region by the substitution of at least one subunit with a subunit of a different species. Typically, these variant polymers are generated using the polymer of interest as template for their polymerisation. There are two reasons for providing or generating the variant polymers. The first reason is that the variants may contain fewer problem regions than the polymer of interest and should, therefore, be easier to sequence. For example, genomic DNA is highly repetitive, so random mutation is more likely to destroy repeats than to create them. The second reason is that each variant polymer may contain a different pattern of problem regions. Thus, the substitution of subunits in the target region with subunits of different species is used to effect a change in structure of the target region, which abrogates, inhibits or otherwise ameliorates the refractory behaviour of the target region, in whole or in part, to sequence analysis.
[0087] In the case of nucleic acid polymers, which represents a specific embodiment of the present invention, local sequence characteristics, including inverted repeats or palindromes, which may be present in the target region of a primary nucleic acid polymer, may be modified in the variant nucleic acid polymers to change the structure of the target region such that formation of stem-and-loop structures, for example, is prevented, reduced or otherwise weakened.
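As a minimal illustration of how a substitution can weaken an inverted repeat, the following Python sketch tests whether a DNA string equals its own reverse complement. The function names and the example sequences are hypothetical and chosen only for illustration; they are not taken from the specification, and a reverse-complement check is only a crude proxy for stem-and-loop formation.

```python
# Hypothetical helper names; sequences are illustrative only.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def is_palindromic(seq: str) -> bool:
    """True if the sequence reads the same as its reverse complement."""
    return seq == reverse_complement(seq)


parent = "GAATTC"   # equals its own reverse complement
variant = "GAGTTC"  # one substitution (A -> G at the third position)

print(is_palindromic(parent))   # palindromic parent
print(is_palindromic(variant))  # the substitution breaks the symmetry
```

In this toy case a single substitution is enough to destroy the perfect inverted repeat, which is the kind of structural change the paragraph above relies on.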
[0088] In accordance with the present invention, individual variant polymers will differ from other variant polymers in the target region at the position(s) of subunit substitution. However, the frequency of substitution is typically chosen so that the subunit species, which is located at a specified position within the target sequence of the primary polymer (i.e., the correct subunit species), is represented at that position by more variant polymers than polymers containing other species of subunit at the same position. In other words, the correct species of subunit must be more abundant than other subunit species at the specified position across all the variant polymers.
Typically, the substituent (or mutant) frequency of individual variant polymers is in the range of between about 1% and about 70%. Usually, the substituent frequency of individual variant polymers is no more than 70%, more suitably no more than 65%, preferably no more than 60%, more preferably no more than 55%, even more preferably no more than 50%, even more preferably no more than 45%, even more preferably no more than 40%, even more preferably no more than 35%, even more preferably no more than 30%, even more preferably no more than 25%, even more preferably no more than 20%, even more preferably no more than 15%, and still even more preferably no more than 10%. Depending upon the number of variant polymers analysed, as described in more detail below, the substituent frequency of individual variant polymers may be as low as 9%, 8%, 7%, 6%, 5%, 4%, 3%, 2% or even 1%. In certain embodiments of this type, the substitution is random.
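The random-substitution embodiment can be sketched in Python as follows. This is a simplified model, not the method of the specification: it assumes every position is substituted independently with the same probability and that the replacement is drawn uniformly from the other three bases, whereas real mutagens may have base and site preferences. The function name `mutate` and its parameters are hypothetical.

```python
import random


def mutate(seq: str, frequency: float, rng: random.Random,
           alphabet: str = "ACGT") -> str:
    """Return a variant of seq in which each position is independently
    replaced, with probability `frequency`, by a different base chosen
    uniformly from the alphabet (an assumed, idealised mutation model)."""
    out = []
    for base in seq:
        if rng.random() < frequency:
            # Substitute with any base other than the current one.
            out.append(rng.choice([b for b in alphabet if b != base]))
        else:
            out.append(base)
    return "".join(out)


rng = random.Random(0)           # fixed seed so runs are repeatable
parent = "ACGTACGTACGTACGTACGT"
variant = mutate(parent, 0.10, rng)
differences = sum(a != b for a, b in zip(parent, variant))
print(variant, differences)
```

Because substitution never changes the length, variants generated this way stay aligned with the parent position by position.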
[0089] Generally, each species of subunit in the variant polymers is detectably distinct from other species of subunit to permit interrogation of the identity of each species of subunit at individual interrogation positions in the target region. The variant polymers are analysed, collectively, at each interrogation position to determine the identity and quantity of each subunit species at that position and to thereby deduce the species of subunit which is in higher abundance relative to other species of subunit at the interrogated position. The highest abundant species so identified represents the species of subunit located at the same position in the target region of the parent polymer. Similar analyses of other interrogation positions will reveal a 'composite' sequence of highest abundant subunits, which corresponds to the sequence of at least a portion of the target region of the primary polymer. The sequence information so obtained may also provide the means to identify local sequence characteristics (e.g., repeat sequences and palindromic sequences) underlying any refractory behaviour of the target region in the primary polymer to sequence analysis.
[0090] The variant polymers may already exist and could, therefore, constitute naturally occurring variants (e.g., different alleles of a gene, different polymorphic forms of a polymorphic site, homologous or orthologous genes in different organisms). Alternatively, the variant polymers may be produced by mutagenesis techniques as for example described infra.
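The collective, per-position analysis described in paragraph [0089] can be sketched in Python as a simple most-abundant-species call. The sketch assumes the variants have already been read as aligned strings of equal length, and it uses base counts as a stand-in for measured peak intensities; the function name `composite_sequence` and the choice of "N" for ties are illustrative assumptions.

```python
from collections import Counter


def composite_sequence(variants: list[str], ambiguous: str = "N") -> str:
    """Call, at each position, the subunit species that is most abundant
    across all variants; positions with a tie for first place are
    reported with the `ambiguous` symbol."""
    if len({len(v) for v in variants}) != 1:
        raise ValueError("variants must be non-empty and of equal length")
    calls = []
    for column in zip(*variants):          # one column per position
        ranked = Counter(column).most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            calls.append(ambiguous)        # no single most abundant species
        else:
            calls.append(ranked[0][0])
    return "".join(calls)


# Five hypothetical variants, each differing from ACGTAC at one position.
variants = ["ACGTAC", "ACGAAC", "TCGTAC", "ACGTGC", "ACCTAC"]
print(composite_sequence(variants))  # prints ACGTAC
```

Each variant carries a different substitution, yet the composite recovers the parent because the correct species remains the most abundant at every position.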
[0091] The subunits of the polymers are selected from a finite series of possible subunit species. Suitably, the subunits are selected from nucleic acid subunits, amino acid subunits or carbohydrate subunits, or combinations thereof. In a preferred embodiment, the subunits are selected from nucleic acid or amino acid subunits. In specific embodiments, the subunits are selected from nucleic acid subunits.
3. Variant or mutant configurations
[0092] The variants or mutants (e.g., M1, M2, ..., Mn) may be related in various ways to the original or parent sequence G. Several possible configurations are illustrated in Figure 1. The first is the star, in which each mutant is generated directly from the original sequence. The second is the path, in which each mutant is generated from the previous mutant. The octopus and the binary tree are two generalisations, combining features of both the star and the path. If the mutants are naturally occurring (in different lineages) then they may be derived from a common ancestor of unknown sequence.
[0093] The advantage of the path configuration is that the final mutant M„ has a high rate of mutation relative to the original sequence G, and it is therefore highly unlikely that any of problematic regions in G will appear in M„. A disadvantage of the path is that it takes longer to generate mutants in this configuration because they must be generated sequentially. Generalisations such as the octopus and binary tree should combine advantages of both the star and the path.
4. Factors influencing the number of mutants required
[0094] Factors that influence the number of mutants required to determine the sequence of a target region of a polymer of interest include, but are not restricted to: the intensity of mutation (proportion of positions or 'sites' affected); the base-specificity of mutation (e.g., some mutagens target a single type of subunit, others target all subunit types, but have varying preferences); the site-specificity of mutation (some mutagens target specific sites preferentially); the configuration of mutants (star, path, etc.); and the need for obtaining a composite sequence.
[0095] In general, there are two main issues to consider when estimating the number of mutants required for SAQOM. The first is that there must be sufficient mutations to ensure that a substantial proportion of the problem regions in a target region will be rendered amenable to sequencing in at least one mutant. The second issue is relevant only for regions where a composite sequence is required: i.e., there must be a sufficient number of mutants to obtain a composite sequence that reflects the sequence of the target region in the polymer of interest.
[0096] With regard to the first issue, if a higher intensity of mutation is used, then fewer mutants will be required because more problem regions will be modified per mutant. However, with regard to the second issue, if a higher intensity of mutation is used, then more mutants will be required to achieve an accurate composite sequence. Both issues must, therefore, be taken into account.
[0097] To accurately calculate the number of mutants needed to achieve successful results in a particular application of SAQOM, detailed information regarding the base and site specificities of the mutagenesis technique would be required. The following two examples illustrate the type of calculations which might be appropriate. These examples consider the number of mutants needed to achieve an accurate composite sequence.
[0098] First, suppose that the mutagenesis techniques used are neither base-specific nor site-specific. Further suppose that the only mutations are substitutions, with all substitutions being equally likely. Then, if the mutation intensity is p and the number of mutants is n, the probability that a majority of mutants have been modified at a given site may be calculated using the binomial formula. The probability that exactly k mutants are different from the original target sequence at a given site is:
[0099] $x_k = \binom{n}{k}\, p^{k} (1-p)^{n-k}$
[0100] and the probability that the majority of mutants have been modified at that site is the sum of $x_k$ over all k greater than or equal to n/2. As an example, if 20 mutants are generated, each with a mutation intensity of 0.1, then the probability that a majority of mutants are altered at any given site is less than 1 in 100,000. This means that an incorrect or uncertain composite sequence will be obtained at fewer than one site in 100,000 on average under these assumptions.
[0101] Note that it is often possible to correctly determine the base appearing at a given site even when that site has been modified in a majority of mutants. The base that appears at that site in the greatest number of mutants will typically be the correct one, even if it appears in fewer than half of the mutants. Thus the above method overestimates the number of mutants required to achieve an accurate composite sequence. Also note that it is easier to calculate the degree of accuracy given the number of mutants than vice versa. To select the number of mutants, it might be desirable to create and use a table of composite sequence accuracies for various numbers of mutants and mutation intensities.
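The binomial calculation above, and the table of accuracies suggested in paragraph [0101], can be sketched in a few lines of Python. This is a minimal illustration under the stated assumptions only (substitutions that are neither base-specific nor site-specific, independent across mutants); the function names and the particular grid of n and p values are hypothetical.

```python
from math import comb


def prob_exactly_k(n, k, p):
    """x_k: probability that exactly k of n mutants are modified at a site."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_majority_modified(n, p):
    """Sum of x_k over all k greater than or equal to n/2."""
    k_min = (n + 1) // 2  # smallest integer k satisfying k >= n/2
    return sum(prob_exactly_k(n, k, p) for k in range(k_min, n + 1))


# Worked example from the text: 20 mutants, mutation intensity 0.1.
print(prob_majority_modified(20, 0.1))

# Illustrative table: rows are numbers of mutants, columns are intensities.
intensities = (0.05, 0.1, 0.2, 0.3)
for n in (5, 10, 20, 40):
    row = "  ".join(f"{prob_majority_modified(n, p):.2e}" for p in intensities)
    print(f"n={n:>2}: {row}")
```

For n = 20 and p = 0.1 the computed value falls below 1 in 100,000, in agreement with the figure quoted in paragraph [0100].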
[0102] These issues are illustrated in Figure 2 and are particularly relevant when a low number of mutants and high levels of mutation are considered. In the particular example depicted in Figure 2, as the number of polynucleotide mutants in the mixture is increased, the probability of coincident mutation at the same nucleotide position in more than one mutant polynucleotide increases. Simultaneously, the probability that the majority of nucleotides at any nucleotide position reflect the original (unmutated) nucleotide also increases. Figure 2 graphically displays the probability of miscalling (reading error) any nucleotide within the composite polynucleotide signal, assuming for illustration that a shared identity of 70% of the nucleotides at each position is necessary for correct detection. With low numbers of polynucleotides in a mixture (here illustrated at between 1 and 15 polynucleotides), the probability of miscalling passes through maxima and minima as the number of polynucleotides in the mixture changes, irrespective of the frequency of mutation in each polynucleotide in the mixture. Figure 2 also teaches that the probability of miscalling tends to decrease with increasing numbers of polynucleotides in the mixture, irrespective of the frequency of mutation in each polynucleotide in the mixture. Further, Figure 2 teaches that, for each number of polynucleotides in a mixture, the probability of miscalling is higher as the frequency of mutation increases, indicating that at higher levels of mutation, greater numbers of polynucleotides must be present in a mixture to avoid miscalling of a nucleotide at any position.
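The qualitative behaviour attributed to Figure 2 can be explored with a simple model. The sketch below is not a reconstruction of Figure 2 itself; it assumes, purely for illustration, that each polynucleotide is mutated independently at a site with probability p and that a call is correct only when at least 70% of the polynucleotides retain the original nucleotide. The function name and parameter names are hypothetical.

```python
from math import comb


def miscall_probability(n, p, threshold_percent=70):
    """P(fewer than threshold_percent % of n polynucleotides keep the original base).

    Assumes independent mutation with probability p per polynucleotide per site.
    """
    # Smallest number of unmutated copies that meets the threshold
    # (integer ceiling of threshold_percent * n / 100).
    needed = (threshold_percent * n + 99) // 100
    q = 1 - p  # probability that a given copy is unmutated at the site
    return sum(comb(n, k) * q**k * p ** (n - k) for k in range(needed))


# Tabulate mixtures of 1 to 15 polynucleotides at three mutation frequencies.
frequencies = (0.05, 0.1, 0.2)
for n in range(1, 16):
    row = "  ".join(f"{miscall_probability(n, p):.4f}" for p in frequencies)
    print(f"n={n:>2}: {row}")
```

Because the required number of unmutated copies is a whole number, the threshold moves in steps as n grows. Under this model that stepping produces local maxima and minima at small n, on top of an overall downward trend, and the values rise with p at every n.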
[0103] Second, suppose that the only substitutions that may occur are the four transitions A to G, G to A, C to T and T to C. Further suppose that the probability that a given A will be mutated to a G is 0.23, the probability that a given G will be mutated to an A is 0.08, the probability that a given C will be mutated to a T is 0.05, and the probability that a given T will be mutated to a C is 0.23. (These probabilities are approximately those observed in the dPTP mutants presented in Example 1.) Those skilled in the art can calculate the Bayesian probabilities that the original base was an A, a C, a G or a T, given the bases quantified for a relevant site in the mutants. The most probable original base can thus be determined. For example, if a G is quantified at a particular site as having an arbitrary abundance value of 12, and an A is quantified at that site as having an arbitrary abundance value of 8, then the Bayesian probability that the original base was an A is approximately 0.81, and the Bayesian probability that it was a G is approximately 0.19. Those skilled in the art can also determine the probability that the most probable original base according to such calculations is not the correct base. For example, if 20 mutant sequences are sequenced simultaneously, then the probability that an A will be misidentified as a G using such Bayesian calculations is approximately 0.00007. The number of mutants can thus be selected to achieve any desired accuracy.
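One way to carry out the Bayesian calculation described above is sketched below. The sketch makes two assumptions that the text does not state: the four possible original bases are given equal prior probability, and the arbitrary abundance values are treated as counts of independent mutant copies. The function names are hypothetical.

```python
from math import comb

# Transition probabilities quoted in the text: original base -> (product, probability).
TRANSITION = {"A": ("G", 0.23), "G": ("A", 0.08), "C": ("T", 0.05), "T": ("C", 0.23)}


def emission(original, observed):
    """Probability that a mutant shows `observed` where the original base is `original`."""
    product, p = TRANSITION[original]
    if observed == original:
        return 1.0 - p
    if observed == product:
        return p
    return 0.0  # only the four transitions are allowed


def posterior(counts):
    """Posterior probability of each original base, assuming a uniform prior."""
    weights = {}
    for original in "ACGT":
        w = 0.25  # assumed uniform prior
        for observed, n in counts.items():
            w *= emission(original, observed) ** n
        weights[original] = w
    total = sum(weights.values())
    if total == 0:
        raise ValueError("observations are impossible under a transition-only model")
    return {base: w / total for base, w in weights.items()}


def misidentification_probability(n, true_base="A", wrong_base="G"):
    """P(the posterior favours wrong_base) when n mutants derive from true_base.

    wrong_base must be the transition product of true_base.
    """
    _, p = TRANSITION[true_base]
    total = 0.0
    for k in range(n + 1):  # k mutants show wrong_base, n - k show true_base
        post = posterior({true_base: n - k, wrong_base: k})
        if post[wrong_base] > post[true_base]:
            total += comb(n, k) * p**k * (1 - p) ** (n - k)
    return total


print(posterior({"G": 12, "A": 8}))
print(misidentification_probability(20))
```

Under these assumptions the first call gives a posterior of approximately 0.81 for A and 0.19 for G, and the second gives a misidentification probability of roughly 0.00007, consistent with the values quoted in paragraph [0103].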
5. Mutagenesis of polynucleotide sequences
[0104] Several mutagenesis techniques may be employed advantageously to produce mutant polymers. For example, when the polymers are nucleic acid polymers (or polynucleotides), the mutagenesis facilitates the random replacement of nucleotides in a target polynucleotide with nucleotide analogues. Through their adoption of different tautomeric forms, the nucleotide analogues will base pair with at least two species of conventional or naturally-occurring nucleotides, resulting in the random transition and/or transversion mutagenesis of the target polynucleotide. A mixture of randomly mutated polynucleotide molecules is thereby produced in which the sequence of each of the randomly mutated secondary polynucleotides is materially dissimilar to the sequence of the target polynucleotide.
[0105] The mutagenic nucleotide analogues may be introduced into a target polynucleotide using any suitable method known to persons of skill in the art. For example:
[0106] (i) the analogue(s) may be introduced by a nucleic acid amplification process of the target polynucleotide, examples of which include, but are not restricted to, PCR-directed amplification or rolling-circle amplification (e.g., by using a Φ29 DNA polymerase) of the target polynucleotide or by any other similar DNA polymerase directed DNA replication process. The molar ratio of nucleotide analogue to the corresponding conventional nucleotide in the amplification reaction is generally about 1:10, more usually about 1:5, more usually from about 1:3 to about 1:2 and preferably from about 1:1.5 to about 1:1.
[0107] (ii) the analogue(s) may be introduced during a PCR-directed cycle sequencing amplification of the target polynucleotide or by any other DNA polymerase directed DNA sequencing process. The molar ratio of nucleotide analogue to the corresponding conventional nucleotide in the PCR reaction is generally about 1:10, more usually about 1:5, more usually from about 1:3 to about 1:2 and preferably from about 1:1.5 to about 1:1.
[0108] (iii) the analogue(s) may be introduced by co-transformation of the target polynucleotide simultaneously with the nucleotide analogue(s) into host cells such as E. coli; and
[0109] (iv) the analogues may be introduced by growth of host cells such as E. coli transformed with the target polynucleotide in the presence of nucleotide analogue(s).
5.1 Nucleotide analogues that induce simple transition mutations
[0110] Nucleotide analogues are compounds that may mimic natural nucleotides in structural associations with natural nucleotides in DNA or RNA. These analogues may possess the ability to (i) replace particular natural nucleotides within the DNA duplex; (ii) be introduced into the DNA molecule during DNA synthesis by DNA polymerases; (iii) replace particular natural nucleotides within the RNA polynucleotide; and/or (iv) be introduced into the RNA molecule during RNA synthesis by RNA polymerases. Mutagenic nucleotide analogues have the additional property of replacing several natural nucleotides, rather than a single cognate nucleotide base. This has the effect of inducing transition and transversion mutations in subsequent rounds of DNA replication when the novel cognate base, initially introduced opposite the mutagenic nucleotide analogue, itself forms base pairs or complements with its natural cognate nucleotide.
[0111] The mutagenic properties of mutagenic nucleotide analogues are generally complex and dependent upon several factors. The mutagenic nucleotide analogue must be able to replace several natural nucleotides, rather than a single nucleotide. Typically an individual nucleotide analogue mimics a naturally-occurring nucleotide to a major extent, firstly in terms of its selection by polymerases to be introduced as a cognate base opposite a particular natural nucleotide, and secondly in the selection and introduction of a particular cognate nucleotide by the polymerase opposite the analogue when the analogue occurs in the template strand of a replicating nucleic acid molecule. The property of the mutagenic nucleotide analogue to functionally mimic several natural nucleotides is believed to be the result of the physical structural state that the analogue may assume, either as a free nucleotide precursor in solution, or while constrained within a nucleic acid molecule.
[0112] The effect of the analogues on sequence analysis of DNA is illustrated in Figures 3, 4 and 5. Figure 3 compares the sequence analysis of a "difficult-to-sequence" polynucleotide W51 with the sequence analysis of a mutated copy of that polynucleotide. Figure 4 compares the sequence analysis of a "difficult-to-sequence" polynucleotide D36 with the sequence analysis of a mutated copy of that polynucleotide. Figure 5 compares the sequence analysis of a "difficult-to-sequence" polynucleotide region from human BAC RP11-167L9 with the sequence analysis of a mutated copy of that polynucleotide. Specifically, in Figures 3A, 4A and 5A, a mixture containing multiple identical copies of the respective difficult-to-sequence polynucleotides was sequenced using a standard Applied Biosystems International (ABI) BigDye™ version 3.0 and version 2.0 cycle sequencing chemistry and an ABI 3700 capillary sequencer, resulting in a failed sequence determination. By contrast, in Figures 3B, 4B and 5B, a mixture containing multiple identically mutated copies of the respective difficult-to-sequence polynucleotides was sequenced using the same conditions but, in this instance, resulting in a high quality "sequence read" as the sequence of these copies of W51, D36 and RP11-167L9 had each been altered.
5.2 Mutagenic analogues:
[0113] The ability to introduce transition mutations into DNA or RNA is dependent upon tautomerisation states that the mutagenic nucleotide analogue may assume. Some examples of nucleotide-like analogues capable of introducing mutations into nucleic acid polymers are:
[0114] 1. dPTP (or dQ6TP) [6-(2-deoxy-β-D-ribofuranosyl)-3,4-dihydro-8H-pyrimido-[4,5-C]oxazin-7-one triphosphate] behaves as thymine in the majority, and as cytosine in a minority, of DNA copying events, inducing A→G and T→C transitions (Zaccolo et al., 1996 Journal of Molecular Biology 255:589-603; Hill et al., 1998b Proc. Natl. Acad. Sci. USA 95:4258-4263). The tautomerisation constant of P is assumed to be about 0.03 (Brown et al., 1968 Journal of the Chemical Society C:1925-1929; Kierdaszuk et al., 1983 FEBS Letters 158:128-130; Moore et al., 1995 Journal of Molecular Biology 251:665-673; Williams et al., US Patent 6,153,745).
[0115] 2. 8-oxo-dGTP preferentially causes A→C and T→G transversions (Zaccolo et al., 1996 Journal of Molecular Biology 255:589-603; Nampalli and Kumar, 2000 Bioorganic and Medicinal Chemistry Letters 10:1677-1679; Bebenek et al., 1999 Mutation Research 429:149-158) although the tautomerisation constant and attendant frequency of mutation is low.
[0116] 3. N6-methoxy-2,6-diaminopurine (dK) behaves as adenine in the majority, and as guanine in a minority, of DNA copying events, preferentially causing A→G and T→C transitions (Hill et al., 1998a Nucleic Acids Research 26:1144-1149; Hill et al., 1998b Proc. Natl. Acad. Sci. USA 95:4258-4263).
[0117] 4. N6-methoxyadenine (dZ) behaves as adenine in the majority, and as guanine in a minority, of DNA copying events, preferentially causing A→G and T→C transitions (Hill et al., 1998a Nucleic Acids Research 26:1144-1149; Hill et al., 1998b Proc. Natl. Acad. Sci. USA 95:4258-4263).
[0118] 5. 5-bromo-2'-deoxyuridine (BrdU), 5-fluoro-2'-deoxyuridine (FldU) and 5-formyl-2'-deoxyuridine (fdU) each behave as thymine in the majority of DNA copying events and as cytosine in a minority, principally inducing T→C and A→G transitions. These mutations can be influenced by the ionization state of the analogue and are enhanced at elevated pH values. The ratio of transitions to transversions is altered if DNA amplification is performed at elevated pH (Driggers and Beattie, 1988 Biochemistry 27:1729-1735; Sowers et al., 1986 Proc. Natl. Acad. Sci. USA 83:5434-5438; Yu et al., 1993 Journal of Biological Chemistry 268:15935-15943; Yoshida et al., 1997 Nucleic Acids Research 25:1570-1577).
[0119] 6. GC→AT transitions are induced when 8-oxo-dATP is introduced by co-transfection into E. coli (Inoue et al., 1998 Journal of Biological Chemistry 273:11069-11074). The ratio of transitions to transversions is altered if DNA amplification is performed at elevated pH (Driggers and Beattie, 1988 Biochemistry 27:1729-1735; Sowers et al., 1986 Proc. Natl. Acad. Sci. USA 83:5434-5438; Yu et al., 1993 Journal of Biological Chemistry 268:15935-15943; Yoshida et al., 1997 Nucleic Acids Research 25:1570-1577) as a result of the formation of particular tautomeric forms at elevated pH.
[0120] 7. N4-methoxycytosine induces purine transition mutagenesis. When methoxylated dCTP is incorporated into DNA it behaves as thymine in the majority of DNA copying events and as cytosine in a minority. However, once incorporated in the template strand methoxycytosine directs the incorporation of adenine above guanosine in the majority of cases (Brown et al., 1968 Journal of the Chemical Society Section C:1925-1929; Reeves and Beattie, 1985 Biochemistry 24:2262-2268; Hossain et al., 2001a Nucleic Acids Research 29:3949-3954; Hossain et al., 2001b Journal of Biochemistry 130:9-12).
[0121] 8. 5-formylcytosine (5-fC), an oxidation product of 5-methylcytosine (5-mC), is mutagenic, with mutation frequencies in double-stranded DNA of 0.03-0.28%. The mutation spectrum of 5-fC was broad, and included targeted (5-fC→G, 5-fC→A, and 5-fC→T) and untargeted mutations. These results suggest that the oxidation of 5-mC results in mutations at and around the modified sites (Kamiya et al., 2002 Journal of Biochemistry (Tokyo) 132:551-555).
[0122] 9. The triphosphate derivative (dYTP) of the analogue dY (1-(2-deoxy-β-D-ribofuranosyl)-imidazole-4-carboxamide) is preferentially incorporated as dATP only with elevated dYTP and reduced dATP. dYTP can also be incorporated as a dGTP with elevated dYTP and reduced dGTP (Sala et al., 1996 Nucleic Acids Research 24:3302-3306; Strobel et al., 2002 Nucleic Acids Research 30:1869-1878).
[0123] 10. The mutagenic specificity of 5-methyl-N4-hydroxydeoxycytidine appears to be opposite to that of 2-aminopurine. This suggests that the former can induce transition of CG→TA, while the latter TA→CG (Janion 1978 Mutation Research 56:225-234; Chu et al., 1974 Mutation Research 23:267-273; Popowska and Janion 1974 Biochemical and Biophysical Research Communications 56:459-466).
[0124] 11. The analogues 5-hydroxy-deoxycytosine and 4-methyl, 5-hydroxy-deoxycytosine. 5-OH-dCTP and 4-Me-5-OH-dCTP can replace dCTP, and to a lesser extent dTTP. Once incorporated, the analogues can template for particular nucleotides. dG is predominantly incorporated opposite 5-OH-dC, with low dA incorporation also seen. 5-OH-dCTP has the principal mutagenic potential for G→A transitions (Purmal et al., 1994 Nucleic Acids Research 22:72-78; Purmal et al., 1994 Nucleic Acids Research 22:3930-3935; Loeb et al., 1999 Proc. Natl. Acad. Sci. USA 96:1492-1497).
[0125] 12. Although its level of mutation is low, the nucleotide analogue N2,3-ethenoguanine causes virtually only G→ A transition mutations (Cheng et al, 1991 Proc. Natl. Acad. Sci. USA 88:9974-9978).
[0126] Suitably, the mutagenic nucleotide analogues of the present invention exclude dITP.
[0127] A person of skill in the art will recognise that various factors can influence the rate of substitution and degree of randomness of mutagenesis. For example, although the introduction of efficient nucleotide analogues such as dPTP into nascent DNA has been reported to be nearly 'random' (Zaccolo et al., 1996 Journal of Molecular Biology 255:589-603), several studies have indicated that an effect of nearest neighbour base pair stability can influence the efficiency of substitution of a nucleotide analogue for a native nucleotide at a particular locus (Sinha and Haimes, 1981 Journal of Biological Chemistry 256:10671-10683; Patten, So and Downey, 1984 Biochemistry 23:1613-1618; Pless et al., 1981 Biochemistry 20:6235-6244). Patten et al. (1984 Biochemistry 23:1613-1618) showed that nucleotide analogues are incorporated (misincorporation) "with increased frequency when the nearest-neighbour nucleotides form more stable base pairs with the corresponding nucleotides in the template and is decreased when they form less stable base pairs." Further, they revealed that the stability of the base pair formed by a nucleotide either preceding (5' to) or following (3' to) a misincorporated nucleotide also influences the misincorporation frequency. The stability of base pairs formed by preceding nucleotides affects the rate of insertion of mismatched nucleotide but does not protect the mismatched nucleotide from removal by the 3' to 5' exonuclease activity. By contrast, the stability of a base pair formed by a following nucleotide determines whether a misincorporated nucleotide is extended or excised by affecting the ability of the enzyme to edit errors of incorporation. Pless et al. (1981) similarly showed that 2-amino-purine incorporation is disfavoured and that a greater bias exists with those polymerases containing an active 3'-exonuclease. This bias against 2-amino-purine incorporation is alleviated after strong base pairs, and particularly following guanine, possibly due to stabilising vertical stacking interactions. Under conditions of nucleotide limitation during DNA synthesis the frequency of G-T and A-C mispairs reveals that most AT→GC transition mutations occur through G-T mispairs. Measurement of the frequency of the mispairs required to induce transversion mutations reveals that these occur primarily through purine-purine mispairs. Accordingly, to address possible nearest neighbour effects, which may cause biases in the incorporation of nucleotide analogues, when preparing mixtures of mutant polynucleotides for sequence analysis, it is desirable to use at least one mixture of polynucleotides mutated using at least one mutagenic nucleotide analogue, more preferably at least two mixtures of polynucleotides, wherein each mixture is mutated using different types of mutagenic nucleotide analogue for the mutagenesis, and even more preferably more than two mixtures of polynucleotides, wherein each mixture is mutated using different types of mutagenic nucleotide analogue for the mutagenesis. The choice of analogue types and the numbers of differently mutated polynucleotides in the mutant mixture may depend upon the nature of the target sequence and the frequency of mutation of the polynucleotide mutants within the mixture. The choice of mutagenic analogue types and conditions for optimisation of mutagenesis is well within the realm of the practitioner in the art.
[0128] The present inventors have also found unexpectedly that mutagenic analogues such as dPTP and others can be used in an analogous fashion to non-mutagenic analogues such as dITP, to improve the ability to sequence "difficult to sequence polynucleotides," using for example chain-terminating sequence analysis (e.g., PCR cycle sequencing). The present invention, therefore, also extends to the use of mutagenic analogues in combination with chain-extending nucleotides, chain-terminating nucleotides and a polymerase for sequencing a polynucleotide of interest. The molar ratio of mutagenic nucleotide analogue to the corresponding conventional chain-extending nucleotide in a sequencing reaction can be about 1:10, more usually about 1:5, more usually from about 1:3 to about 1:2 and preferably from about 1:1.5 to about 1:1.
[0129] Desirably, the mutagenic analogues induce mutation at a frequency generally greater than 1×10⁻², more usually at a frequency greater than 2×10⁻², 3×10⁻², 4×10⁻², 5×10⁻², 6×10⁻², 7×10⁻², 8×10⁻², 9×10⁻², 1×10⁻¹, 1.1×10⁻¹, 1.2×10⁻¹, 1.3×10⁻¹, 1.4×10⁻¹, 1.5×10⁻¹, 1.6×10⁻¹, 1.7×10⁻¹, 1.8×10⁻¹, 1.9×10⁻¹, 2×10⁻¹, 2.1×10⁻¹, 2.2×10⁻¹, 2.3×10⁻¹, 2.4×10⁻¹, 2.5×10⁻¹, 2.6×10⁻¹, 2.7×10⁻¹, 2.8×10⁻¹, 2.9×10⁻¹, or even greater than 3×10⁻¹, as determined in Example 16.
6. Methods of producing mutant or variant polymers
[0130] Any suitable mutagenesis technique for mutagenising polymers is contemplated for use in SAQOM. Currently, two general approaches are commonly used to mutate nucleic acids: low fidelity PCR amplification of a DNA element using conditions to promote mis-incorporation of nucleotides, and the chemically-induced mutagenesis of DNA followed by repair and recovery of mutants either by PCR or other polymerase catalysed polynucleotide synthesis, or by biological systems (reviewed in Ling & Robertson, 1997 Analytical Biochemistry 254:157-178; Leppard, 1999 Mutagenesis of DNA virus genomes, in DNA Viruses: A Practical Approach Series 214 (ed., Alan J. Cann), IRL Press, Oxford; Kamiya, 2003 Nucleic Acids Research 31:517-531).
[0131] A number of different mutagenesis schemes could potentially be used to produce suitable variant sequences for use in SAQOM. For example, an original or parent polynucleotide can be mutated using random mutagenesis (e.g., PCR mediated mutagenesis) or oligonucleotide-mediated (or site-directed) mutagenesis.
[0132] Oligonucleotide-mediated mutagenesis can be used for preparing suitable nucleotide substitution variants of a primary polynucleotide. This technique is well known in the art as, for example, described by Adelman et al. (1983 DNA 2:183-193). Briefly, a polynucleotide is altered by hybridising an oligonucleotide encoding the desired mutation to a template DNA, wherein the template is the single-stranded form of a plasmid or bacteriophage containing the unaltered or parent DNA sequence. After hybridisation, a DNA polymerase is used to synthesise an entire second complementary strand of the template that will thus incorporate the oligonucleotide primer, and will code for the selected alteration in the parent DNA sequence.
[0133] Generally, oligonucleotides of at least 25 nucleotides in length are used. An optimal oligonucleotide will have 12 to 15 nucleotides that are completely complementary to the template on either side of the nucleotide(s) coding for the mutation. This ensures that the oligonucleotide will hybridise properly to the single-stranded DNA template molecule.
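The length rule in paragraph [0133] can be illustrated with a short sketch that builds a single-substitution oligonucleotide flanked by 12 to 15 perfectly matched nucleotides on each side. The function names, the 0-based position convention and the example sequence are hypothetical. The sketch returns the oligonucleotide in the same sense as the supplied parent sequence; which strand is actually required depends on which strand is present in the single-stranded template, so a reverse-complement helper is included.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def design_mutagenic_oligo(parent, position, new_base, flank=12):
    """Return parent[position - flank : position + flank + 1] with one substitution.

    `position` is 0-based; `flank` matched nucleotides are kept on either side.
    """
    if not 12 <= flank <= 15:
        raise ValueError("flank should be 12 to 15 nucleotides")
    if position - flank < 0 or position + flank >= len(parent):
        raise ValueError("not enough flanking sequence around the mutation")
    if parent[position] == new_base:
        raise ValueError("new_base must differ from the parent base")
    left = parent[position - flank:position]
    right = parent[position + 1:position + 1 + flank]
    return left + new_base + right


# Hypothetical 40-nucleotide parent sequence; substitute the base at index 20.
parent = "ACGT" * 10
oligo = design_mutagenic_oligo(parent, 20, "T")
print(oligo, len(oligo))
print(reverse_complement(oligo))
```

With the minimum flank of 12 the oligonucleotide is 25 nucleotides long, which matches the lower length bound given in the text.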
[0134] The DNA template can be generated by those vectors that are either derived from bacteriophage M13 vectors, or those vectors that contain a single-stranded phage origin of replication as described by Vieira et al. (1987 Methods Enzymol. 153:3-11). Thus, the DNA that is to be mutated may be inserted into one of the vectors to generate single-stranded template. Production of single-stranded template is described, for example, in Sections 4.21-4.41 of Sambrook et al. MOLECULAR CLONING. A LABORATORY MANUAL (Cold Spring Harbor Press, 1989).
[0135] Alternatively, the single-stranded template may be generated by denaturing double- stranded plasmid (or other DNA) using standard techniques.
[0136] For alteration of the native DNA sequence, the oligonucleotide is hybridised to the single-stranded template under suitable hybridisation conditions. A DNA polymerising enzyme, usually the Klenow fragment of DNA polymerase I, is then added to synthesise the complementary strand of the template using the oligonucleotide as a primer for synthesis. A heteroduplex molecule is thus formed such that one strand of DNA encodes the mutated form of the polypeptide or fragment under test, and the other strand (the original template) encodes the native unaltered sequence of the polypeptide or fragment under test. This heteroduplex molecule is then transformed into a suitable host cell, usually a prokaryote such as E. coli. After the cells are grown, they are plated onto agarose plates and screened using the oligonucleotide primer having a detectable label to identify the bacterial colonies having the mutated DNA. The resultant mutated DNA fragments are then cloned into suitable expression hosts such as E. coli using conventional technology and clones that retain the desired antigenic activity are detected. Where the clones have been derived using random mutagenesis techniques, positive clones would have to be sequenced in order to detect the mutation.
[0137] Alternatively, linker-scanning mutagenesis of DNA may be used to introduce clusters of point mutations throughout a sequence of interest that has been cloned into a plasmid vector. For example, reference may be made to Ausubel et al., CURRENT PROTOCOLS IN MOLECULAR BIOLOGY (John Wiley & Sons, Inc. 1994-1998), in particular, Chapter 8.4, which describes a first protocol that uses complementary oligonucleotides and requires a unique restriction site adjacent to the region that is to be mutagenised. A nested series of deletion mutations is first generated in the region.
A pair of complementary oligonucleotides is synthesised to fill in the gap in the sequence of interest between the linker at the deletion endpoint and the nearby restriction site. The linker sequence actually provides the desired clusters of point mutations as it is moved or "scanned" across the region by its position at the varied endpoints of the deletion mutation series. An alternate protocol is also described by Ausubel et al., supra, which makes use of site directed mutagenesis procedures to introduce small clusters of point mutations throughout the target region. Briefly, mutations are introduced into a sequence by annealing a synthetic oligonucleotide containing one or more mismatches to the sequence of interest cloned into a single-stranded M13 vector. This template is grown in an E. coli dut ung strain, which allows the incorporation of uracil into the template strand. The oligonucleotide is annealed to the template and extended with T4 DNA polymerase to create a double-stranded heteroduplex. Finally, the heteroduplex is introduced into a wild-type E. coli strain, which will prevent replication of the template strand due to the presence of apurinic sites (generated where uracil is incorporated), thereby resulting in plaques containing only mutated DNA.
[0138] Methods for generating abundant mutations are advantageous. Examples of such methods are based on exposing an original or target polynucleotide to mutagenising chemicals (Leppard, 1999 supra; Warnecke et al., 1998 Genomics 51:182-190). The chemicals preferentially modify specific base residues or damage the base structurally. The modified DNA may then be PCR amplified, with novel nucleotide bases pairing opposite the modified bases (Fromenty et al., 2000), or recovered using in vivo repair mechanisms at a lower frequency of mutation.
[0139] Of several different chemical mutagens, bisulphite is preferred because it modifies
DNA without significant levels of strand cleavage (Warnecke et al., 1998 supra). Bisulphite converts the cytosines of single strand DNA into thymines, with the complementary base guanine also changing to adenine in the complementary strand (Olek et al., 1996 Nucleic Acids Res. 24:5064-5066).
[0140] The chemical treatment of DNA cloned in suitable viral, BAC or YAC vectors and the subsequent in vivo recovery of mutants is also useful for the mutagenesis of larger DNA fragments (Cocchia et al., 2000 Nucleic Acids Research 28:e81). An alternative procedure involves the bisulphite-treatment of long DNA clones that may then be used to template the PCR amplification of smaller internal DNA fragments that are then cloned and sequenced.
[0141] SAQOM can potentially be used to sequence fragments ranging in length from a few hundred bases up to an entire genome. Some of the above-mentioned mutagenesis techniques rely on PCR amplification, which is currently limited to DNA fragments of about 40 kb or shorter (Cheng et al., 1995; Fromenty et al., 2000). This is long enough to enable some exciting applications of SAQOM, but techniques suitable for longer fragments would greatly empower the technique, as for example described infra. The main advantage of mutating an entire genome or a large segment of a genome is that this would need to be done only once at the beginning of a sequencing project. Consequently, mutagenesis would not be a major limiting factor in the time and resource requirements of such a project.
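The net sequence change produced by the bisulphite treatment described in paragraph [0139] can be modelled in silico. The sketch below is an idealisation: it assumes complete conversion of every cytosine in the treated single strand and ignores any protection of modified cytosines. The function names and the example sequence are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def bisulphite_convert(strand):
    """Model complete conversion: every C in the treated strand reads as T."""
    return strand.upper().replace("C", "T")


# Hypothetical single-stranded target.
top = "ACGTCCGA"
converted = bisulphite_convert(top)

# The strand synthesised opposite the converted template carries A wherever
# the untreated complementary strand would have carried G.
print(converted)
print(reverse_complement(top))
print(reverse_complement(converted))
```

Comparing the last two printed strands shows the G to A changes in the complementary strand that accompany the C to T changes in the treated strand.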
[0142] Exemplary methods of mutagenesis for use in SAQOM include one or more of the following: (1) DNA replication with mutagenic nucleotide analogues and damaged nucleotides; (2) nucleic acid shuffling protocols based on in vitro or in vivo homologous recombination of pools of nucleic acid fragments or polynucleotides; (3) in vitro DNA replication with low fidelity polymerases and high processivity polymerases; (4) propagation of damaged DNA in repair-deficient E. coli hosts; (5) chemical mutagens; and (6) Degenerate Oligonucleotide Primed PCR. For example, these methods can be applied to two groups of DNA targets: small (1-10 kb) and large (>50 kb) DNA elements. The methods differ somewhat for the two targets, and are described infra:
[0143] Small DNA elements can be mutated by the misincorporation of bases during a nucleic acid amplification reaction. Such reactions are well known to the skilled artisan and include polymerase chain reaction (PCR) as for example described in Ausubel et al. (supra); isothermal
strand displacement amplification (SDA) as for example described in U.S. Patent No 5,422,252; rolling circle replication (RCR) as for example described in Liu et al. (1996 and International Application WO 92/01813), Laskins et al. (U.S. Patent No. 6,323,009), Auerbach et al. (U.S. Patent No. 6,448,017) and Lizardi et al. (International Application WO 97/19193 and U.S. Patent No. 6,344,329); "error-prone" translesion synthesis as for example described by Goodman (2002 Annual Review of Biochemistry 71:17-50) and by Ohmori et al. (2001 Molecular Cell 8:7-8); nucleic acid sequence-based amplification (NASBA) as for example described by Sooknanan et al. (1994); and Q-β replicase amplification as for example described by Tyagi et al. (1996). The polymerases used for these processes are suitably selected from Taq DNA polymerase, Pfo DNA polymerase, Pwo DNA polymerase, Tfl DNA polymerase, Tth DNA polymerase, Pfu DNA polymerase or Exo-Pfu DNA polymerase, Hot Tub DNA polymerase, Vent DNA polymerase or Deep Vent DNA polymerase, E. coli DNA polymerase, the Klenow fragment of E. coli DNA polymerase, T4 or T7 DNA polymerase, AmpliTaq™ DNA polymerase Stoffel fragment, AmpliTaq™ Gold DNA polymerase, Qβ DNA polymerase, Φ29 DNA polymerase, E. coli DNA polymerase V, Y-family DNA polymerases, RNA polymerase and reverse transcriptase.
[0144] In certain embodiments, the mutagenesis is carried out using PCR-directed mutagenesis. In this method, small DNA elements can be mutated efficiently (1-20%) by using non-standard base analogues (Kamiya et al., 1994 Nucleosides and Nucleotides 13:1483-1492; Zaccolo et al., 1996 Journal of Molecular Biology 255:589-603), or less efficiently by limiting the provision of some bases (Cline et al., 1996 Nucleic Acids Research 24:3546-3551), or by chemically reducing polymerase fidelity (Rice et al., 1992 Proc. Natl. Acad. Sci. USA 89:5467-5471; Vartanian et al., 1996 Nucleic Acids Research 24:2627-2631; Shafikhani et al., 1997 Biotechniques 23:304-310). As will be appreciated by those of skill in the art, particular nucleotide analogues are incorporated by Taq DNA polymerase and cause a known range of mutations. For example, dPTP [6-(2-deoxy-β-D-ribofuranosyl)-3,4-dihydro-8H-pyrimido-[4,5-C]oxazin-7-one triphosphate] induces A→ G and T→ C transitions, while 8-oxo-dGTP preferentially causes A→ C and T→ G transversions (Zaccolo et al., 1996 Journal of Molecular Biology 255:589-603). Other nucleoside analogues, such as N6-methoxy-2,6-diaminopurine (dK) and N6-methoxyoxyaminopurine (dZ) (Hill et al., 1998a, Nucleic Acids Research 26:1144-1149; Hill et al., 1998b, Proc. Natl. Acad. Sci. USA 95:4258-4263) also induce particular mutations. The inventors have found that small DNAs modified by analogues in vitro may be recovered with a controlled mutation frequency of 1-30%. As high levels of modification can introduce mismatches between the specific PCR primers and the modified template, short discriminatory primers can be used to recover specific products (Mitchelson et al., 1999 Nucleic Acids Research 27:e28). Polymerase co-elements that aid processivity during PCR can also be used to increase the attainable size of amplified products (Motz et al., 2002 Journal of Biological Chemistry 277:16179-16188).
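By way of illustration only, the analogue-directed mutagenesis described above can be sketched as a simple simulation. The sketch assumes that each susceptible base is substituted independently with a fixed probability; the function names, the example template and the 10% per-base rate are hypothetical and are not taken from the specification. Only the A→G and T→C spectrum attributed above to dPTP is drawn from the text.

```python
import random

# Transition spectrum attributed to dPTP in the text above (A->G and T->C).
DPTP_SPECTRUM = {"A": "G", "T": "C"}


def mutate(template, spectrum, rate, rng):
    """Return a copy of `template` in which every base listed in `spectrum`
    is substituted independently with probability `rate` (an assumed model)."""
    out = []
    for base in template:
        if base in spectrum and rng.random() < rate:
            out.append(spectrum[base])
        else:
            out.append(base)
    return "".join(out)


def observed_frequency(template, mutant):
    """Fraction of all positions at which `mutant` differs from `template`."""
    diffs = sum(1 for a, b in zip(template, mutant) if a != b)
    return diffs / len(template)


# Hypothetical 1 kb template and a population of 20 mutant copies.
rng = random.Random(0)
template = "ATGCGTACCTTAGGCATCGA" * 50
mutants = [mutate(template, DPTP_SPECTRUM, 0.10, rng) for _ in range(20)]
frequencies = [observed_frequency(template, m) for m in mutants]
```

Note that `rate` here applies per susceptible base, so the overall frequency reported by `observed_frequency` is lower than `rate` whenever the template also contains bases outside the spectrum.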
[0145] Small DNA elements can also be mutated using recombination techniques, as for example disclosed by Stemmer in U.S. Pat. No. 6,344,356, 6,323,030, 6,297,053, 6,291,242, 6,297,861, 6,277,638, 6,180,406, 6,165,793 and 6,117,679, which employ repeated cycles of mutagenesis, shuffling and selection, and which allow for the directed molecular evolution in vitro or in vivo of nucleic acid sequences. In this method, mixtures of related nucleic acid sequences or polynucleotides are randomly fragmented, and reassembled to yield a library or mixed population of recombinant nucleic acid molecules or polynucleotides.
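The fragmentation and reassembly step can be pictured with the following coarse sketch. It assumes that the related parent sequences are already aligned and of equal length, and it replaces homology-driven reassembly with a random choice of donor for each fixed-length segment; it is therefore an abstraction for illustration and not the shuffling protocol of the cited patents. All names and parameters are hypothetical.

```python
import random


def shuffle_parents(parents, segment_length, rng):
    """Build one recombinant by taking each successive segment from a
    randomly chosen parent (parents are assumed aligned and equal in length)."""
    length = len(parents[0])
    if any(len(p) != length for p in parents):
        raise ValueError("parents must be aligned and of equal length")
    child = []
    for start in range(0, length, segment_length):
        donor = rng.choice(parents)
        child.append(donor[start:start + segment_length])
    return "".join(child)


def shuffled_library(parents, segment_length, size, seed=0):
    """Return `size` recombinant sequences built from `parents`."""
    rng = random.Random(seed)
    return [shuffle_parents(parents, segment_length, rng) for _ in range(size)]


# Three artificial parents chosen so that the donor of each segment is visible.
parents = ["AAAAAAAAAAAA", "CCCCCCCCCCCC", "GGGGGGGGGGGG"]
library = shuffled_library(parents, segment_length=4, size=10)
```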
[0146] A mutant DNA polymerase with lowered fidelity for incorporation of correct complementary nucleotides during DNA synthesis, and which is desirably thermostable, is suitably employed in such nucleic acid amplification-directed mutagenesis protocols. For example, a mutant Taq polymerase has been found to produce significant levels of random mutation during PCR amplification (U.S. Patent No 6,329,178; Suzuki et al., 1997 Journal of Biological Chemistry 272:11228-11235). This mutant polymerase can also incorporate nucleotide analogues as efficiently as, or more efficiently than, native Taq polymerase. In specific embodiments, rounds of mutagenesis with the low-fidelity polymerase (e.g., mutant Taq or Pfo) and mutagenic nucleotide analogues are used to effect modification in genomic sub-fragments and other small DNAs. Additionally, superior performance is known for a family B-type DNA polymerase, VentR exo-, which is able to fully synthesise a 300-bp DNA product when all natural dNTPs are completely replaced by their biotin-labelled dNTP analogues (Tasara et al., 2003 Nucleic Acids Research 31:2636-2646). The length of DNA that can be mutated exhaustively is limited only by the PCR procedure, which can routinely amplify 10-20 kb fragments, aided by E. coli exonuclease III (Fromenty et al., 2000 Nucleic Acids Research 28:e50) and other protein factors (Motz et al., 2002 supra).
[0147] (Low fidelity) PCR amplification can be carried out by methods which include, but are not restricted to, degenerate oligonucleotide primer PCR and shotgun mutagenesis.
[0148] Bacterial strains, which are deficient in enzymes of excision repair pathways that catalyse different steps in DNA sanitation, are suitably employed and these are well known to practitioners versed in the art. Examples include E. coli strains that fail to remove oxidatively damaged and deaminated nucleotides efficiently, post-replication (Miller 1992 in A Short Course in Bacterial Genetics, CSH Press; Yonezawa et al., 2001 Mutation Research 490:21-26; Kamiya and Kasai, 2000 Nucleic Acids Research 28:1640-1646). In certain embodiments, DNA and nucleotide analogues, as for example described above, are co-transfected into repair-deficient bacteria, which results in increased levels of mutation, as mispaired bases are not thoroughly removed (Inoue et al., 1998 Journal of Biological Chemistry 273:11069-11074; Fujikawa et al., 1998 Nucleic Acids Research 26:4582-4587). The co-transfection of nucleotide analogues and DNAs into repair-deficient host strains can also be used to mutate random shotgun libraries at low mutation frequencies.
[0149] Larger DNA elements can be mutated efficiently using mutagenic nucleotide analogues and repair-deficient bacteria. For example, mutagenic nucleotide analogues and larger DNA elements such as BACs can be co-transfected into repair-deficient host strains to generate mutant BACs. The in vivo functionality of the modified BACs may be recovered efficiently in E. coli by homologous recombination (Nefedov et al., 2000 Nucleic Acids Research [Methods on Line] 28:e79).
[0150] Alternatively, rolling circle amplification (RCA) can be employed for mutating larger
DNA elements. RCA polymerases, including Φ29 DNA polymerase and other polymerases, permit the synthesis of large circular double strand DNA molecules such as large plasmids and BACs (Dean et al., 2001 Genome Research 11:1095-1099; Amersham Biosciences, 2002 TempliPhi, Amersham Technical note; Zhang et al., 2001 Gene 274:209-216). The ability to replicate large DNAs in vitro permits mutation to higher levels, without the functional limits imposed by replication in bacterial hosts. In accordance with the present invention, this technique is used in concert with mutagenic nucleotide analogues as herein described and other deoxynucleotide triphosphate analogues to incorporate the mutagenic analogues directly into DNA templates. Clones harbouring mutant DNAs can then be recovered in a suitable host (e.g., E. coli) by homologous recombination, or by other in vitro recombinant techniques.
[0151] An error-prone repair DNA polymerase with lowered fidelity for incorporation of correct complementary nucleotides during DNA synthesis is suitably employed in such nucleic acid amplification-directed mutagenesis protocols. For example, E. coli DNA polymerase V has been found to produce significant levels of random mutation during DNA repair (Goodman, 2002 Annual Review of Biochemistry 71:17-50; Silvian et al., 2001 Nature Structural Biology 8:984-989). This variant polymerase can also incorporate nucleotide analogues as efficiently as replicative DNA polymerase. In specific embodiments, rounds of mutagenesis with the low-fidelity polymerase (e.g., DNA polymerase V), a processive polymerase such as Φ29 DNA polymerase and mutagenic nucleotide analogues are used to effect modification in genomic sub-fragments and other small DNAs. The length of DNA that can be mutated exhaustively is limited only by the rolling circle procedure, which can routinely amplify 80-120 kb fragments (Liu et al., 1996; Dean et al., 2001 supra).
[0152] Larger DNA elements can also be mutated advantageously using RNA polymerase amplifications. In this context, the high processivity of RNA polymerases and RNA reverse transcriptases can be used to amplify DNA fragments (Iwata et al., 2000 Bioorganic and Medicinal Chemistry 8:2185-2194; Bebenek et al., 1999 Mutation Research 429:149-158) with incorporation of ribo-nucleotide or deoxynucleotide analogues. Ribo-nucleotide analogues are mutagenic and some are incorporated into both RNA (U.S. Patent No 6,132,776; U.S. Patent No 5,512,431; Moriyama et al., 1998 Nucleic Acids Research 26:2105-2111; Moriyama et al., 2001 Nucleic Acids
Research Supplement :255-256) and DNA (Müller et al., 1978 Journal of Molecular Biology 124:343-358), and deoxynucleotide analogues are incorporated by reverse transcriptase into DNA (Lutz et al., 1998 Bioorganic and Medicinal Chemistry Letters 8:499-504; Bebenek et al., 1999 Mutation Research 429:149-158). RNA products incorporating ribonucleotide analogues can be copied from cloned DNAs residing in suitable plasmid vectors possessing RNA polymerase promoters. The RNA products can be used to create mutated cDNA, based on G→ A hypermutation resulting from retroviral reverse transcription in the presence of highly biased dNTP concentrations (Martinez et al., 1995 Nucleic Acids Research 23:2573-2578), which are then subsequently subcloned and sequenced individually.
[0153] Chemical mutagens can also be used advantageously to mutate larger DNA elements.
For example, chemicals such as nitrous acid, glyoxal, bisulphite, peroxide, hydrazine and other mutagenic agents modify several different base residues, or damage the base structurally (Burney et al., 1999 Mutation Research 424:37-49; Wagner et al., 1992 Proc. Natl. Acad. Sci. USA 89:3380-3384; Rodriguez et al., 1999 Biochemistry 38:16578-16588; Murata-Kamiya et al., 1997 Mutation Research 377:3-16). The damaged DNA can be recovered by in vivo recovery of the target in plasmid or BAC vectors (Ling and Robinson, 1997 Analytical Biochemistry 254:157-178; Leppard, 1999 in DNA Viruses: A Practical Approach, vol 214, IRL Press) with low-level chemical modification.
[0154] The present invention also contemplates whole genome mutagenesis. Several routes to random mutation of whole genomes are known, and these generally fall into two major categories: (i) induced mutagenesis in biological systems or whole cell lines and (ii) (low fidelity) PCR amplification and replication of DNA elements using conditions to promote mis-incorporation of nucleotides, or analogues of nucleotides.
[0155] Induced mutagenesis can be carried out by methods which include, but are not limited to, whole cell mutation, large cloned element mutagenesis, degenerate oligonucleotide primer PCR and shotgun mutagenesis.
[0156] Whole cell mutation involves the induction of mutation in stable cell lines from an organism, or in hybrid cell lines that carry an individual chromosome from the organism under study within the cell of another organism. The advantage of this approach is the potential to isolate individual mutant cell lines that may be used as a recurrent source of a particular mutated DNA sequence, while retaining the larger chromosomal context of that sequence. Efficient in vivo incorporation of nucleotide analogues has been described for mammalian cell lines exposed to 5-bromo-2'-deoxyuridine (Bick and Davidson, 1974 Proc. Natl. Acad. Sci. USA 71:2082-2086) and other brominated base analogues such as 8-bromo-2'-deoxypurines and 5-bromo-2'-deoxypyrimidines (Stewart et al., 1968 Experimental Cell Research 49:293-299). Similarly, a few
examples of non-bromine base analogues that have been incorporated in this manner include 2-aminopurine (Glickman, 1985 Basic Life Sciences 31:353-79), 5-propynyloxy-2'-deoxyuridine, and 5-ethynyl-2'-deoxyuridine (Balzarini et al., 1984 Biochemical Journal 217:245-252).
[0157] Mutation of large cloned elements that have been used in the sequencing of the human genome (International Human Genome Sequencing Consortium 2001 Nature 409:860-921), and the genomes of many other organisms is also contemplated. In this procedure, the genome is subcloned into a set of overlapping fragments of 100-200 kb supported in BAC and other large element vectors. Mutagenic nucleotide analogues might conveniently be introduced into large BAC or cosmid clones using nick translation, which uses Escherichia coli DNA polymerase I for the sequential addition of nucleotide residues to the 3'-hydroxyl terminus of a nick [created by pancreatic Deoxyribonuclease (DNAse) I], simultaneous with the elimination of nucleotides from the adjacent 5'-phosphoryl terminus of the nicked polynucleotide strand (Langer et al., 1981 Proc. Natl. Acad. Sci. USA 78:6633-6637; Holtke et al., 1990 Mol. Gen. Hoppe-Seyler 371:929-938). DNA polymerase I efficiently fills in the strand breaks as rapidly as they are formed by DNAse I nuclease nicking, incorporating the desired nucleotides into the original strands and modifying both of the parent strands (Meinkoth and Wahl, 1987 Methods in Enzymology 152:91-94; Rigby et al., 1977 Journal of Molecular Biology 113:237-251). Many different species of modified bases have been incorporated into duplex polynucleotides as large as 7 kb by nick translation (Gillam and Tener, 1986 Analytical Biochemistry 157:199-207; Gebeyehu et al., 1987 Nucleic Acids Research 15:4513-4534; Meffert and Dose, 1988 FEBS Letters 239:190-194; Meffert et al., 2001 Methods in Molecular Biology 148:323-335; Yu et al., 1994 Nucleic Acids Research 22:3226-3232). A minimum overlapping set of these large elements that represent the genome is then further subcloned into plasmids (1-3 kb inserts), forming and stabilising the mutations. The plasmid-contained mutated genomic DNA elements are then sequenced.
Mapping and fingerprinting techniques, such as BAC insert end-sequencing, restriction fingerprinting, STR fingerprinting, hybridisations with cDNA and other cloned and sequenced DNA elements, as well as cross-hybridisation between BAC elements, are used to identify the genomic elements and to create contigs of the overlapping large cloned elements. Mutation of the genome can be performed segmentally on the large element clones preferably using methods for mutagenesis of large DNA elements, as for example described supra.
[0158] Random amplification by degenerate oligonucleotide primer (DOP) PCR can be used to recover essentially random DNA fragments of 0.5 kb to 2 kb from limiting amounts of genomic DNA and from individual cytometric flow-sorted chromosomes (Zhou et al., 2000 Biotechniques 2:766-767; Hirose et al., 2001 Journal of Molecular Diagnostics 3:62-67). Nucleotide analogues are incorporated efficiently into fragments of these sizes by PCR, to as high as ~20% mutation. DOP-PCR can be used to amplify from whole genomes, individual chromosomes, or to amplify from large DNA
fragments such as cloned BACs to limit the sequence complexity. Such random amplified mutant fragments can then be sub-cloned to form a representative mutant library.
[0159] Shotgun cloning of entire genomes has also been used in the sequencing of the human genome (Venter et al., 2001 Science 291:1304-1351). The method involves the subcloning and DNA sequencing of a selection of randomly broken, short, overlapping DNA elements that collectively represent the original genome and the reconstruction of the original sequence by the computer-aided alignment of the resulting multiple overlapping sequence reads. This principle could be applied to the cloning of chemically-modified DNA, in which nucleotide damage internal to the random fragments will result in recovery of a mutated shotgun clone library.
[0160] Chemical modification of DNA can be achieved by several different methods. For example, random chemical modification of nucleotide bases (with attendant double strand and single strand breakage of the genome into smaller fragments) can be used for shotgun-mutagenesis. Desirably, such chemical modification is combined with processes for efficient fragment end-repair and sub-cloning of the damaged DNAs. End repair enzymes such as E. coli endonuclease IV (Levin et al., 1988 Journal of Biological Chemistry 263:8066-8071; Demple and Harrison 1994 Annual Review Biochemistry 63:915-948) and endonuclease III (Masson and Ramotar, 1997 Molecular Microbiology 24:711-721) are used to remove 3'-phosphoglycolates and different 3'-phosphates that may arise at the termini of chemically broken DNA fragments, and additionally conventional DNA polymerases (such as Klenow enzyme or T4 DNA polymerase) and polynucleotide kinases (e.g., T4 polynucleotide kinase) are used to 'fill-out' single strand fragment termini and to phosphorylate 5'-termini, respectively. Unless repaired, the ragged and blocked DNA fragment termini would together prevent ligation of the fragment into the plasmid vector.
This method of introducing mutagenesis into genomic DNA prior to subcloning fragments has the implicit advantage that DNA fragments, which may possess sequence motifs that prevent convenient cloning of the element, may be altered or disrupted by the mutation and hence may be cloned and represented within the library. Lower levels of chemical modification of genomic DNA may also be sought prior to the shotgun cloning to provide DNA fragments which are more intact and which need to be sheared mechanically to produce 2-4 kb fragments suitable for inclusion in the random library. End repair of such sheared fragments is readily achieved and subsequent cloning of these fragments is efficient.
[0161] Chemical modification of DNA can also be achieved by conventional shotgun cloning followed by subsequent mutation of the random genomic sub-fragments. The subsequent mutation may be carried out by library mutagenesis, or individual sub-clone mutagenesis, which has the advantage that subclones of genomic DNA may first be created and cloned efficiently without chemical damage to the termini requiring particular repair steps. Library mutagenesis is suitably achieved either by the above methods for small element mutagenesis in
which the entire random representative library is subjected to a mutagenic procedure and subsequently, random mutant clones are chosen from the resultant library, or collections of clones from the random library, e.g., 96 clones, are collectively mutated by nucleotide analogue PCR and the resultant amplicons are re-cloned to make a sub-set mutant library that can be conveniently related back to the original individual unmodified 96 clones.
[0162] If biological systems are used for the mutation and the recovery stages as described above as applied to small element mutagenesis, lower levels of mutation might be necessary to preserve biological functions of the vector and host cells. If a general PCR-based mutagenic procedure is used on the whole shotgun library, techniques to achieve high levels of mutation of the amplified mutant library elements such as the incorporation of mutagenic nucleotide analogues could be employed, and the mutant elements again sub-cloned into plasmids to produce a random, highly-mutant fragment library. The same mutagenic methods may be employed for individual shotgun clones or with small collections of clones so as to minimise the sequence complexity of the resultant mutant library or libraries.
[0163] In a preferred method, which may avoid the loss of DNA elements that are otherwise difficult to clone in conventional plasmid vectors and host E. coli (including low copy plasmids, read-through truncated plasmids, and E. coli host cells lacking restriction and/or recombination functions), short, sheared random genomic DNA elements are ligated into new plasmid vectors, which are directly used as a random template for a PCR-based nucleotide analogue mutation protocol to generate amplicons from the total library, before the loss of potentially unclonable elements through cloning in E. coli. The oligonucleotide primers for PCR are preferably complementary to vector sequences flanking the ligated elements and possess (rare) restriction sites that are either all C:G or A:T. Nucleotide analogues that preferentially target either A:T or C:G base pairs for sequence mutation are then employed, which leave one of the two types of restriction sites essentially unaltered and thus available for convenient regeneration of restriction termini for cloning of the amplicons into the new plasmid vectors.
In this manner, a fully representative genome library could be efficiently mutated before passage through E. coli cells.
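The selection of primer restriction sites by base-pair composition, as described above, can be illustrated with the following sketch. The enzyme names and recognition sequences (NotI, PacI, EcoRI) are included only as familiar examples and are an assumption of this sketch; the specification does not name particular enzymes. The function names are likewise hypothetical.

```python
def site_class(site):
    """Classify a recognition sequence as all-C:G ('CG'), all-A:T ('AT') or 'mixed'."""
    bases = set(site.upper())
    if bases <= {"G", "C"}:
        return "CG"
    if bases <= {"A", "T"}:
        return "AT"
    return "mixed"


def surviving_sites(sites, targeted_pair):
    """Names of sites expected to remain essentially unaltered when the
    analogue preferentially targets `targeted_pair` ('AT' or 'CG')."""
    if targeted_pair not in ("AT", "CG"):
        raise ValueError("targeted_pair must be 'AT' or 'CG'")
    keep = "CG" if targeted_pair == "AT" else "AT"
    return [name for name, seq in sites.items() if site_class(seq) == keep]


# Illustrative recognition sequences (assumed for this sketch).
sites = {"NotI": "GCGGCCGC", "PacI": "TTAATTAA", "EcoRI": "GAATTC"}
at_targeting = surviving_sites(sites, "AT")   # sites usable with an A:T-targeting analogue
cg_targeting = surviving_sites(sites, "CG")   # sites usable with a C:G-targeting analogue
```

Sites of mixed composition, such as the EcoRI example, are excluded in both cases because either class of analogue could alter them.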
[0164] Desirable hosts and/or vectors for cloning parent or mutagenised sequences are those which have been engineered to ameliorate difficulties in cloning otherwise difficult-to-clone nucleic acid molecules. For example, bacterial strains, particularly strains of E. coli, and engineered plasmid vectors are known to practitioners in the art, which have been selected or engineered to overcome such difficulties. Illustrative strains for this purpose include, but are not restricted to: E. coli strains engineered to limit recombination of DNA, such as JM110 cells that accept repetitive DNA, as for example disclosed by Troester et al. (2000 Gene 258:95-108), E. coli strains engineered to be methylation-tolerant (mcrA- mcrB-) that limit the restriction of unmethylated or 'incorrectly' methylated DNA and thus accept mammalian DNA-containing
clones, as for example disclosed by Doherty et al. (1991 Gene 98:77-82) and Williamson et al. (1993 Gene 124:37-44; Stratagene Corp., SURE cells); and E. coli strains engineered to be deficient in DNA sanitation enzyme(s) that promote the mutation of introduced DNA as for example described by Deng and Nickoloff (1992 Analytical Biochemistry 200:81-88), Greener and Callahan (1994 Strategies 7:32-34), Cox and Horner (1986 Journal of Molecular Biology 190:113-117) and Miller (1992 in A Short Course in Bacterial Genetics, CSH Press). Suitable plasmids include, but are not limited to: plasmid vectors that have been engineered to prevent read-through transcription (e.g., the CloneSmart™ vector system from Lucigen Corporation, Middleton WI 53562, USA, which is a gap-free cloning system available for sequencing recalcitrant or unclonable DNA), low copy plasmids that replicate in E. coli hosts to 1-10 copies per cell in which repeat DNA elements may be maintained (e.g., pBRm and its derivatives as for example described by Mitchelson and Moss, 1987 Nucleic Acids Research 15:9577-9596; and pEV-vrf3 as for example described by Perng et al., 1994 Journal of Virological Methods 46:111-116).
[0165] It will be understood that SAQOM can be applied to the sequencing of any 'problematic' polymer including, for example, polypeptides and carbohydrates. Variant or mutant polypeptides can be produced using any suitable technique. For example, mutant polypeptides may be produced from mutant polynucleotides prepared by rational or random mutagenesis methods as, for example, described supra.
[0166] Sequencing of a polypeptide may be performed by site-directed or random cleavage of the polypeptide using, for example endopeptidases or CNBr, to produce a set of polypeptide fragments and subsequent sequencing of the polypeptide fragments by, for example, Edman sequencing or mass spectrometry, as is known in the art. Alternatively, the polypeptide probes or polypeptide fragments could be sequenced by use of antibody probes as for example described by Fodor et al in U.S. Patent Serial No. 5,871,928. Briefly, such antibody probes specifically recognise particular subsequences (e.g., at least three contiguous amino acids) found on a polypeptide. Optimally, these antibodies would not recognise any sequences other than the specific desired subsequence and the binding affinity should be insensitive to flanking or remote sequences found on a target molecule.
7. Methods for interrogating individual positions within a target sequence
[0167] The sequence of the mutant polymers can be determined by any suitable technique that is capable of interrogating corresponding positions of a multiplicity of mutant polymers, collectively, for the identity and quantity of different species of subunit. For example, mutant polynucleotides may be sequenced using the chain termination method, which involves combining the target polynucleotide with a sequencing primer that hybridises with the target polynucleotide, and extending the sequencing primer in the presence of normal nucleotide precursors (dATP,
dCTP, dGTP, and dTTP). A chain-terminating nucleotide, such as a dideoxynucleotide triphosphate, of one particular base type (A, C, G, T) is added to the reaction, to effect a termination of DNA chains at random positions along the sequence. The nested series of DNA fragments produced in this reaction is then separated according to size (i.e., fragment length) typically by electrophoresis in a separation medium, to produce a series of bands in the profile of that medium. A set of four reactions (with chain termination occurring via ddA, ddC, ddG, ddT incorporation) is required for explicit determination of the positions of all four bases in the sequence. This process results in fragments of DNA of varying sizes that end with a different base (A, T, C, or G). The determination of DNA sequence in these methods depends on separating the DNA fragments produced by order of size and identifying the terminal base of each fragment either by the lane in which it appears (when each lane has only one reaction product) or by what tag is detected (e.g., a fluorescent or chromophoric tag) if all four reaction products are in one lane as in commercially available automated DNA sequencers. If the shortest fragment ends in A, then the first base in the sequence is A. If the next longest fragment ends in T, then the next base in the DNA sequence is T and so on. This is the basic algorithm for "base calling", i.e., determining the sequence of purine and pyrimidine bases in a strand of DNA.
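The basic base-calling rule just described can be sketched as follows. The sketch assumes idealised data in which each of the four termination reactions is reported as a list of fragment lengths counted in nucleotides added beyond the primer; the function names and the use of "N" for a length at which no fragment was observed are conventions of this sketch only.

```python
def call_bases(lanes):
    """Read a sequence from four chain-termination reactions.

    `lanes` maps each terminating base to the fragment lengths observed in
    that reaction. The fragment of length n identifies the n-th base."""
    calls = {}
    for base, lengths in lanes.items():
        for n in lengths:
            if n in calls:
                raise ValueError(f"length {n} observed in more than one lane")
            calls[n] = base
    if not calls:
        return ""
    # Walk from the shortest fragment to the longest; gaps are reported as N.
    return "".join(calls.get(n, "N") for n in range(1, max(calls) + 1))


def lanes_from_sequence(sequence):
    """Idealised inverse: the fragment lengths each reaction would produce."""
    lanes = {"A": [], "C": [], "G": [], "T": []}
    for n, base in enumerate(sequence, start=1):
        lanes[base].append(n)
    return lanes


example = call_bases({"A": [1, 4], "C": [2], "G": [3], "T": []})
```

Real traces require peak detection and mobility correction before a rule of this kind can be applied; those steps are outside the scope of the sketch.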
[0168] In some embodiments, the sequencing procedure employs four different fluorescent tags, one for each sequencing reaction, as for example described in U.S. Pat. No. 5,171,534. In this procedure, four fluorescent tags are used to visualise DNA fragments. Examples of fluorescent tags include, but are not restricted to, fluorescein-5-isothiocyanate (FITC), which has an emission or fluorescence peak at about 525 nm, Texas Red, which has a fluorescence peak at about 620 nm, tetramethyl rhodamine isothiocyanate (TRITC), which has a fluorescence peak at about 580 nm, and 4-fluoro-7-nitro-benzofurazan (NBD-fluoride), which has a fluorescence peak at about 540 nm. Commercially, four universal primers, respectively labelled with dyes called FAM (C), TAMRA (G), JOE (A), and ROX (T), are available from Applied Biosystems, Inc. (ABI) of Foster City, Calif. Alternatively, different chromophoric tags can be used to substitute for the fluorescent tags, wherein the chromophores have well resolved absorption maxima. The tags bind on the residual fragments in accordance with the exposed end base, if using dye terminator chemistry, or are attached to primers that are used to initiate the sequencing reaction, if using dye primer chemistry. When using fluorescent tags, the sequence is read by causing the fluorescent markers to fluoresce. The four fluorescent tags generally are selected to have a strong fluorescence peak that is separated from the strong fluorescence peak of the remaining tags. An optical instrument detects the emitted fluorescence signals. The fragments developed in the A, G, C and T sequencing reactions are then recombined and introduced together onto a separation matrix. A system of optical filters is used to individually detect the fluorescent tags as they pass the detector.
[0169] Alternatively, the sequencing procedure disclosed in International Patent Publication
No. WO97/40184 may be employed. In this procedure, each sample is first divided into four aliquots which are combined with four sequencing reaction mixtures. Each sequencing reaction mixture contains a polymerase enzyme, a primer for hybridising with the target nucleic acid, nucleotide precursors and a different dideoxynucleotide. This results in the formation of an A-mixture, a G-mixture, a T-mixture and a C-mixture for each sample containing product oligonucleotide fragments of varying lengths. The product oligonucleotide fragments are labelled with fluorescent tags, and these tags will generally be the same for all four sequencing reactions for a sample. However, the fluorescent tags used for each sample are distinguishable one from the other on the basis of their excitation or emission spectra. Next, the A-mixtures for each sample are combined to form a combined A-mixture, the G-mixtures are combined to form a combined G-mixture and so on for all four mixtures. The combined mixtures are loaded onto a separation matrix at separate loading sites and an electric field is applied to cause the product oligonucleotide fragments to migrate within the separation matrix. The separated product oligonucleotide fragments having the different fluorescent tags are detected as they migrate within the separation matrix.
[0170] Wolfe et al. (2002 Proc. Natl. Acad. Sci. USA 99:11073-11078) have developed a method for sequencing a polynucleotide, which entails incorporation of a chemically labile nucleotide by PCR followed by specific chemical cleavage of the resulting amplicon at the modified bases. The identity of the cleaved fragments determines the genotype or sequence of the DNA. This method utilises modified nucleotides 7-deaza-7-nitro-dATP, 7-deaza-7-nitro-dGTP, 5-hydroxy-dCTP, and 5-hydroxy-dUTP. Thus, one analogue is substituted for the corresponding nucleotide during PCR, generating amplicons that contain nucleotide analogues at each occurrence of the selected base throughout the target DNA except for the primer sequences. Subsequent chemical cleavage at each site of modification produces fragments of different lengths and/or molecular weights that may be analysed by mass spectrometry, which employs, for example, MALDI-TOF techniques and secondary post-source decay, to determine the mass-identity of nucleotides within the fragments of the polynucleotide (Abdi et al., 2002 Genome Research 12:1135-1141). These data are then analysed to reconstruct the target polynucleotide sequence.
[0171] Koster, in U.S. 6,238,871, describes a method that assembles sequence information by analysing nested fragments obtained by base-specific chain termination according to their different molecular masses using mass spectrometry, as for example, MALDI or electrospray (ES) mass spectrometry. Through the separate determination of the molecular weights of the four base-specifically terminated fragment families obtained using the chain termination method, the DNA sequence can be assigned via superposition (e.g., interpolation) of the molecular weight peaks of the four individual experiments. Alternatively, the molecular weights of the four specifically
terminated fragment families can be determined simultaneously by MS, either by mixing the products of all four reactions run in at least two separate reaction vessels (i.e., all run separately, or two together, or three together) or by running one reaction having all four chain-terminating nucleotides (e.g., a reaction mixture comprising dT7W, ddTTP, dATP, ddATP, dCTP, ddCTP, dGTP, ddGTP) in one reaction vessel. By simultaneously analysing all four base-specifically terminated reaction products, the molecular weight values can, in effect, be inteφolated. Comparison of the mass difference measured between fragments with the known masses of each chain-terminating nucleotide allows the assignment of sequence to be carried out.
[0172] Ulmer, in U.S. 6,296,810, describes a single molecule sequencing method in which a single terminal nucleotide on a single DNA strand is cleaved by a processive exonuclease. The cleaved nucleotide is transported away from the DNA strand and is incorporated in a fluorescence-enhancing matrix. The single nucleotide is irradiated, which causes it to fluoresce, and the fluorescence is detected to identify the single nucleotide. These steps are repeated until the DNA strand is fully cleaved or until a desired length of the DNA is sequenced. This method can be adapted to a multiplex format so that a plurality of secondary polynucleotides as herein described can be analysed collectively to obtain a composite sequence, as herein described.
[0173] Alternatively, He et al. (1999 Nucleic Acids Research 27:1788-1794) describe a method for direct PCR sequencing with boronated nucleotides in which the positions of boranophosphate-modified deoxynucleotides incorporated randomly into DNA during PCR can be revealed directly by exonuclease digestion to give sequencing ladders. When these analogues are introduced into polynucleotides, they allow preferential fragmentation or uniform reactivity at individual nucleotides resulting in the cleavage of the phosphoribose backbone of the polynucleotide at the positions at which the analogues are located. They further allow fractionation of the resulting polynucleotide fragment mixture with respect to fragment size or fragment mass and thus permit the determination of the sequence of the target polynucleotide from the sum of its fragments. Such analogue incorporation and cleavage chemistries are widely applicable to different nucleic acid analysis platforms, including gel electrophoresis, mass spectrometry and Sequencing By Hybridisation (SBH).
[0174] Alternatively, single DNA molecule sensitivity and single-molecule imaging, together, can provide simultaneous analysis of up to 100,000 distinct molecules every second. Miniaturised chip CE systems with nano-channels <1 μm allow analysis to be undertaken on limiting numbers of molecules held at pico- and nano-molar concentrations, with amplification and detection of signals from single template molecules (Krishnan et al., 2001 Current Opinion in Biotechnology 12:92-98; Koutny et al., 2000 Analytical Chemistry 72:3388-3391; Paegel et al., 2003 Current Opinion in Biotechnology 14:42-50; Chen et al., 2002 Analytical Chemistry 74:1772-1778). "Single molecule sequencing with exonuclease" comprises the serial digestion of a single DNA strand, which is attached to a solid surface or microchannel surface (Marziali and Akeson, 2001 Annual Review of Biomedical Engineering 3:195-223; Jett et al., 1989 Journal of Biological Structure & Dynamics 7:301-309). The fluorescent-tagged nucleotide subunits are sequentially released, then collected and detected. The methods demand highly efficient enzymatic incorporation of labelled analogues at each subunit nucleotide position such as with use of nick translation (Jett et al., 1995 US Patent 5,405,747; Gillam and Tener 1986 Analytical Biochemistry 157:199-207; Gebeyehu et al., 1987 Nucleic Acids Research 15:4513-34), or highly efficient in vivo incorporation as has been described for cell lines exposed to 5-bromo-2'-deoxypyrimidines (Bick and Davidson, 1974 Proc. Natl. Acad. Sci. USA 71:2082-2086). Error synthesis due to polymerase stutter, incorporation of an incorrect nucleotide, or a physical barrier to efficient synthesis could limit the technique, as could the efficiency of single nucleotide release reactions.
[0175] Alternatively, methods of sequencing employing physical techniques and nano-scale fluidics combined with ultra-sensitive optical systems also permit the analysis of one molecule at a time (Chen, 2002 US Patent 6,355,420; Chen, 2002 US Patent 6,403,311). The nano-devices sort DNA molecules by use of openings or pores or forests of pillars that are less than the "radius of gyration" of the DNA fragments, just large enough for DNA molecules to run through in single file. When a DNA molecule is placed outside a "forest" of tiny pillars arranged in a square grid and the molecules are forced to move into the grid by an electric field, the DNA must uncoil to pass through. By varying the electric field pulse, the length of the DNA strands that are collected in the grid can be controlled and the strands thus separated by size (length). When a mix of DNA fragments is driven electrically along a channel with several such barriers, fragments of different lengths will arrive at the far end in a series of bands not unlike those seen in conventional gel electrophoresis separation. Single-molecule sequencing (SMS) promises to radically improve DNA sequencing as it is potentially 10,000 times faster than current production systems that rely on single lanes. It can potentially start with genomic DNA, reducing the need for sample preparation (Sauer et al., 2001 Journal of Biotechnology 86:181-201). SMS read lengths are also potentially significantly longer than those obtained from gel electrophoresis systems. Longer read-lengths will simplify sequence reconstruction and reduce the total number of runs required to get full coverage of the genome. Uniquely, SMS can directly detect haplotypes over several polymorphisms.
[0176] In another example, "Sequencing by Synthesis" is a method common to primer extension methods such as "single nucleotide primer extension" (SNuPE) and pyrosequencing (Ronaghi et al., 1999 Analytical Biochemistry 267:65-71; Nordstrom et al., 2000 Analytical Biochemistry 282:186-193). Ideally the strand extensions are continued for 30 nucleotides or more, and solid phase parallel micro-array analysis is employed in which 10⁸ features (molecules) are sequenced simultaneously, coupled with a unitary base addition chemistry that allows single nucleotide additions to growing chains to be monitored on each feature. Oligonucleotides are anchored to glass slides at densities of up to 10⁸ molecules per cm² and used to capture complementary genomic DNA fragments. Primed molecules are attached to the array such that they can be efficiently extended by polymerases with the addition of 4 differentially labelled terminating nucleotides (Braslavsky et al., 2003 Proc. Natl. Acad. Sci. USA 100:3960-3964). The extended fragments are then all simultaneously detected following the addition of one nucleotide; the terminating moiety and fluorescent tag are then removed chemically from each attached nucleotide analogue, ready for the addition of the next nucleotide to each chain (Li et al., 2003 Proc. Natl. Acad. Sci. USA 100:414-419; Mitra et al., 2003 Analytical Biochemistry 320:55-65). Two innovations set this method apart: the first requiring a zero-mode wave guide, a device confining optical excitation and detection to the few zeptolitres of fluid surrounding the polymerase, and the second the identification of nucleotides by fluorescent labels attached to the γ-phosphate leaving group of the dNTP. These systems measure an individual nucleotide after it is incorporated into the growing strand, and sequence is read out from the pattern of fluorescence signals from each molecule. These analyses are envisioned to develop sequencing rates of 10s base reads per day. The net effect is the equivalent of a billion-lane sequencer that reads the sequence of each molecule at the speed of the addition reaction. Although currently the efficiency and uniformity of extension is poor, it is expected that if each molecule could be extended by an average of 50 nucleotides, it will allow parallel discovery and detection of genetic variation on 10⁸ molecules which can be aligned to a known reference sequence (such as the human genome).
Software and alignment programs that permit de novo sequencing from multiple short sequence reads could also allow direct sequencing of entire genomes of low repetition, such as phage, virus and bacterial genomes using 'sequencing by synthesis'.
[0177] Recently, Church and colleagues (Mitra and Church, 1999 Nucleic Acids Research 27:e34; Mitra et al., 2003 Analytical Biochemistry 320:55-65) have integrated solid-state DNA isolation, amplification, and sequencing by use of "polony technology", involving polymerase colonies (polonies) and cycles of fluorescent dNTP incorporation with high signal sensitivity. As the PCR proceeds, the PCR products diffuse radially within the gel from the immobilized template (e.g. genomic DNA), giving rise to a circular PCR product, or "polymerase colony", which can be scanned with a microarray scanner. "Polony technology" involves a polymerase trapping technique which enables efficient nucleotide extension by DNA polymerase in a polyacrylamide matrix, and eliminates loss of enzyme during sequencing cycles. Novel types of reversibly dye-labelled nucleotide analogues are used for each extension cycle. These analogues can be efficiently incorporated by DNA polymerase, and their dyes can be removed by thiol reduction or light exposure following nucleotide addition, permitting sequencing of multiple 'polonies' in parallel. A high density of polonies can be achieved with minimal overlap between adjacent polonies by limiting the concentration of free primer in the 'polony' amplification reactions. Large-scale arrays of discrete 'polonies' can be sequentially and cyclically nucleotide-extended. Software has also been developed for automated image alignment and sequence calling from randomly-sited 'polonies'.
[0178] Accordingly, numerous methods are known for interrogating a polynucleotide sequence for the identity of nucleotide species at individual positions within the polynucleotide. However, in accordance with the present invention, it is necessary to interrogate corresponding positions of a multiplicity of mutant polynucleotides, collectively, to determine the identity and quantity of each species of nucleotide located at individual interrogation positions. Suitably, the nucleotide species at an individual interrogation position are each detectably labelled. In illustrative embodiments of this type, each species of nucleotide is associated with the same label and is resolved in space or in time from other species of nucleotides as used, for example, in conventional chain terminating sequencing, in which DNA sequencing fragments are separated according to size and what base they contain (i.e., typically separating fragments in four lanes, each lane resolving fragments terminating in a single species of nucleotide). In other illustrative embodiments, individual species of nucleotide are each distinctly labelled and are resolved according to their labels (i.e., typically separating four fragment subsets in a single lane, each fragment subset terminating in a distinct species of nucleotide).
[0179] From the foregoing, several methods are known for interrogating a polynucleotide sequence for the identity of nucleotide species at individual positions within the polynucleotide. However, in accordance with the present invention, it is necessary to interrogate corresponding positions of a multiplicity of mutant polynucleotides, collectively, to determine the identity and quantity of each species of nucleotide located at individual interrogation positions. Suitably, the nucleotide species at an individual interrogation position are each detectably labelled. In illustrative embodiments of this type, each species of nucleotide is associated with the same label and is resolved in space or in time from other species of nucleotides as used, for example, in conventional chain terminating sequencing, in which DNA sequencing fragments are separated according to size and what base they contain (i.e., typically separating fragments in four lanes, each lane resolving fragments terminating in a single species of nucleotide). In other illustrative embodiments, individual species of nucleotide are each distinctly labelled and are resolved according to their labels (i.e., typically separating four fragment subsets in a single lane, each fragment subset terminating in a distinct species of nucleotide).
[0180] Individual nucleotide positions are then interrogated to measure for each species of nucleotide at least one parameter that correlates with that nucleotide species. The parameters are suitably label-associated parameters, which include, but are not restricted to, parameters relating to fluorescence emission, luminescence, phosphorescence, infrared radiation, electromagnetic scattering including light and x-ray scattering, light transmittance, light absorbance, electrical
impedance and molecular mass. In a preferred embodiment, the parameter is signal intensity (e.g., fluorescence intensity, light intensity, radiation intensity, etc).
[0181] The measured parameters are typically compared to each other to determine the species of nucleotide that is in higher abundance than the other species of nucleotide at individual interrogation positions.
[0182] In certain embodiments, the detectable signals associated with different species of nucleotide at individual interrogation positions are processed to produce a data set containing a plurality of peaks reflecting the positions and species of the nucleotides in the mutant polynucleotides which are the subject of the analysis. Examples of automatic sequencing apparatus and methods that can be used for this purpose include, but are not restricted to, U.S. Pat. No. 4,811,218 to Hunkapiller et al. and U.S. Pat. No. 5,556,790 to Pettit. These methods and such commercially available instruments as the ABI 3730® and 3730xl® as discussed above, the MegaBACE 4000® from Amersham Biosciences, Inc, the Pharmacia A.L.F.® from Pharmacia, Inc., and the Licor® Sequencer from Licor, all produce a chromatogram data file from an analog signal.
[0183] In accordance with the present invention, a group of peaks will result for each interrogation position at which a plurality of different species of nucleotide reside. The peaks of each group are then processed collectively to determine the species of nucleotide, which is in higher abundance relative to the other species of nucleotide, at a respective interrogation position. This processing may suitably involve extracting a vector from one or more peak features for each peak, i.e., a vector may quantify such peak characteristics as peak height and area under the peak, and comparing the vector(s) derived for each peak with the vector(s) of other peaks to deduce the species of nucleotide that is in higher abundance at an interrogation position than other nucleotide species. Advantageously, prior to feature extraction, the detectable signals may be corrected for certain distortions such as peak clipping and contextual influences in order to more accurately identify and "base call" a distinct unique detectable signal from a composite signal. This operation is equivalent to noise removal as performed in speech recognition software programs. Peaks may then be detected on the corrected, if opted for, detectable signals.
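By way of non-limiting illustration only, the peak comparison described above may be sketched as follows. The sketch assumes that each peak in a group has already been reduced to a height and an area, and that the two features are combined with equal weights after normalisation within the group; the names Peak, feature_vector and call_major_base, the weighting scheme and the numerical values are hypothetical and are not prescribed by the present description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peak:
    """One peak of the group observed at a single interrogation position."""
    base: str      # nucleotide species the peak is assigned to
    height: float  # peak height, arbitrary signal units
    area: float    # area under the peak, arbitrary signal units


def feature_vector(peak):
    """Vector quantifying the peak: (height, area)."""
    return (peak.height, peak.area)


def call_major_base(peaks, weights=(0.5, 0.5)):
    """Return the species whose feature vector scores highest in the group.

    Each feature is divided by its total over the group so that height
    and area contribute on the same scale (an illustrative assumption).
    """
    if not peaks:
        raise ValueError("no peaks at this interrogation position")
    vectors = [feature_vector(p) for p in peaks]
    totals = [sum(v[i] for v in vectors) or 1.0 for i in range(len(weights))]

    def score(peak):
        v = feature_vector(peak)
        return sum(w * v[i] / totals[i] for i, w in enumerate(weights))

    return max(peaks, key=score).base


# Hypothetical group of peaks at one position of a mutant mixture.
group = [Peak("A", 820.0, 4100.0), Peak("G", 150.0, 700.0), Peak("C", 40.0, 180.0)]
major = call_major_base(group)  # -> "A"
```

Other scoring rules (for example, height alone, or a distance measure between vectors) could be substituted without changing the overall approach.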
[0184] Figure 6 compares the sequence analysis of a "difficult-to-sequence" polynucleotide motif from human BAC RP11-167L9 with the sequence analysis of a mixture containing a plurality of mutants of that polynucleotide. Specifically, in Figure 6A, a mixture containing multiple identical copies of the difficult-to-sequence motif was sequenced using standard ABI BigDye™ version 3.0 cycle sequencing chemistry and an ABI 3730xl capillary sequencer, resulting in a failed sequence determination of an internal polyA tract which weakly reads 21 deoxyadenosine subunits (A) and 3 indeterminate deoxynucleosides (N) within the tract. By contrast, in Figures 6C and 6D, a mixture containing a plurality of different mutants of that motif was sequenced under the same conditions, which resulted in a high quality "sequence read" identical to the sequence of the original unmutated sequence, except for the polyA tract that was now correctly determined to have a total of 20 deoxyadenosine subunits. Further, in Figure 6B, a mixture containing a plurality of different mutants of that motif was sequenced under the same conditions, which also resulted in a high quality "sequence read" identical to the sequence of the original unmutated sequence, except for the identification of one deoxycytidine subunit (C) within the tract replacing one deoxyadenosine subunit, and the polyA tract that was now correctly determined to have a total of 20 deoxyadenosine subunits.
8. Computer-related embodiments
[0185] The present invention discloses methods for sequence analysis, which may be conveniently implemented by a processing system such as a computer system. These methods are predicated in part on the provision or detection of signals representing distinct species of subunit resolved as a function of subunit position in a plurality of variant sequences that vary from a target sequence by the substitution of at least one subunit with a subunit of a different species. In one embodiment, the signals are generated by resolving sequencing products or fragments of varying length according to their size and tags which indicate the positions of the selected species of subunit within a common region of interest in the variant sequences. A suitable detection means is suitably provided to detect the tags. For example, if fluorescent tags are used, an electromagnetic wave source may be used to induce the emission of electromagnetic energy (fluorescence by the tags), and the emitted energy is detected by a detector to produce an analog signal. Advantageously, the analog signal is sampled and the sampled values transmitted to a data file, which typically represents a chromatogram that includes a plurality of peaks reflecting the positions and species of the subunits in the variant sequences that are the subject of the analysis. The ready use of these data preferably, but not essentially, requires that they be stored in a format that is usable by a processing system which is adapted to generate or deduce, on the basis of those data, all or part of the target sequence. Thus, in accordance with specific embodiments of the present invention, data representing a chromatogram, as described above, may be stored in a data store, which preferably includes a database, for use by a processing system in operable communication with the data store. 
The data store may have stored therein the above described plurality of peaks including, for individual positions of the common region of interest, a group of peaks representing a plurality of different species of subunit.
[0186] Suitably, the processing system is adapted to process the data in the data store to generate a comparison of peak features. This processing typically involves extracting a vector from one or more peak features for each peak, wherein the vector may quantify such peak characteristics
as peak height and area under the peak, and comparing the vector(s) derived for each peak with the vector(s) of other peaks to deduce the species of subunit that is in higher abundance at a respective position than other species of subunit. Optionally, the processing may involve, prior to feature extraction, correction of the signals for certain distortions such as peak clipping and contextual influences.
[0187] Any general or special purpose processing system is contemplated by the present invention and includes, but is not limited to, a processor in operable (e.g., electrical) communication with both a memory and at least one input/output device, such as a keyboard and a display. Such a system may include, but is not limited to, personal computers, workstations or mainframes. The processor may be a general purpose processor or microprocessor or a specialised processor executing programs located in RAM memory. The programs may be placed in memory (e.g., RAM) from a storage device, such as a disk or pre-programmed ROM memory. The RAM memory in one embodiment is used both for data storage and program execution. The processing system also embraces systems where the processor and memory reside in different physical entities but which are in operable communication by means of a network. For example, a processing system having the overall characteristics set forth in Figure 7 may be useful in the practice of the instant invention. More specifically, Figure 7 is a schematic representation of a processing system (100) having in operable communication (101) with one another via, for example, an internal bus or external network, a processor (102), a memory (103), an input/output device (104) such as a keyboard and display and a data store (105), which typically includes a database (106). For example, the data store may be in the form of an external storage device such as but not limited to a diskette, CD ROM, or magnetic tape. It will, therefore, be appreciated that the processing system 100 may be formed from any suitable processing system, which is capable of operating applications software to enable the processing of the data, such as a suitably programmed personal computer.
[0188] When in electrical communication with an external network, the processing system (100) will preferably be formed from a server, such as a network server, web-server, or the like, allowing the analysis to be performed from remote locations. In this case, the processing system includes an interface (107), such as a network interface card, allowing the processing system to be connected to remote processing systems, such as via the Internet as will be described in more detail below.
[0189] In the practice of some embodiments of the present invention, the processing system executes a sequence analysis program that includes computer executable code which when implemented on the processing system causes the system to receive data representing a chromatogram, as described above, which includes a plurality of peaks reflecting the positions and species of the subunits in the variant sequences that are the subject of the analysis. The chromatogram data may be obtained from a number of sources, such as manual input via the I/O device 104 or received from an external processing system via the interface 107; or by accessing subunit sequences stored in the database 106. The system is also caused to process the chromatogram data to generate a comparison of peak features to deduce the species of subunit that is in higher abundance at a respective position than other species of subunit. In specific embodiments, the system is caused (1) to extract a vector from one or more peak features for each peak, wherein the vector suitably quantifies peak features such as peak height and area under the peak, and (2) to compare the vector(s) derived for each peak with the vector(s) of other peaks to deduce the species of subunit that is in higher abundance at an individual position of the common region than other species of subunit. The system is further caused to effect this process for all positions in the common region to thereby deduce the species of subunit which is in highest abundance for each position of the corresponding region of the target sequence. Optionally, the processing means analyses base statistics at each site to determine the quality of the sequence so deduced, and to identify possible sites at which there are sequencing errors or polymorphisms.
[0190] In certain embodiments, the processing system is further adapted to generate an indication of the target sequence, which is suitably displayed by a display means that is part of the processing system.
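As a minimal sketch only, the position-by-position processing described above may be illustrated as follows, assuming that the chromatogram has already been reduced to one signal intensity per subunit species at each position of the common region. The function name deduce_sequence, the use of a simple intensity fraction as the per-site base statistic, and the 0.6 threshold are hypothetical choices made for illustration and are not taken from the present description.

```python
def deduce_sequence(position_signals, min_fraction=0.6):
    """Deduce the most abundant subunit at every position.

    position_signals: list with one dict per position, mapping a subunit
    species (e.g. "A") to its measured signal intensity.
    Returns the deduced sequence and the indices of positions whose
    leading species accounts for less than min_fraction of the total
    signal; such sites may reflect sequencing errors or polymorphisms.
    """
    sequence = []
    flagged = []
    for index, signals in enumerate(position_signals):
        total = sum(signals.values())
        if total <= 0:
            # No usable signal: report an indeterminate subunit.
            sequence.append("N")
            flagged.append(index)
            continue
        base, intensity = max(signals.items(), key=lambda item: item[1])
        sequence.append(base)
        if intensity / total < min_fraction:
            flagged.append(index)
    return "".join(sequence), flagged


# Hypothetical intensities for three positions of a common region.
signals = [
    {"A": 80.0, "G": 20.0},
    {"C": 55.0, "T": 45.0},
    {"G": 90.0, "A": 5.0, "C": 5.0},
]
deduced, low_quality_sites = deduce_sequence(signals)  # -> ("ACG", [1])
```

In this sketch the second position is still called as C but is flagged, since its leading species carries only 55% of the signal.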
9. Kits
[0191] All the essential materials and reagents required for SAQOM (including, but not limited to, mutagens, polymerases, chain-elongating nucleotides and chain-terminating nucleotides) may be assembled together in a kit. The kits may also optionally include appropriate reagents for detection of labels, positive and negative controls, dilution buffers and the like. For example, a nucleic acid-based SAQOM kit may include at least one, and preferably at least two, of the following: (i) a set of mutagenic nucleotide analogues such as dPTP, (ii) chain-elongating nucleotides, (iii) chain-terminating nucleotides, (iv) a polymerase such as Taq polymerase, (v) a polymerase such as Rolling Circle φ29 DNA polymerase or an error-prone DNA polymerase, (vi) buffer, (vii) adaptor polynucleotide primers, (viii) sequencing polynucleotide primers, (ix) adaptor restriction endonucleases and (x) a parent polynucleotide and a mixture of variant polynucleotides according to the invention, which may be used as a positive control. Such kits also generally will comprise, in suitable means, distinct containers for each individual reagent and instructions for use of the kit.
[0192] In order that the invention may be readily understood and put into practical effect, particular preferred embodiments will now be described by way of the following non-limiting example.
EXAMPLES
EXAMPLE 1
Nucleotide Analogue PCR Amplification
[0193] The target sequence was an undefined DNA fragment pTEST of 1.5 kb in length cloned into pUC19. Amplification conditions were essentially as described by Zaccolo et al. (1996 supra). Using a concentration of 400 μM of each dATP, dCTP, dGTP, dTTP, and a mutagenic nucleotide analogue such as dPTP in a non-standard PCR reaction, mutations were incorporated up to a frequency of 1 in 5. Universal M13 primers (FSP-21, FSP-40, RSP-26, RSP-48) were used for PCR amplification and sequencing. All amplification reactions described herein were performed with an Applied Biosystems GeneAmp™ PCR System 9700 or 9600 thermal cycler.
Reaction conditions:
[0194] 2 ng DNA template, 1x AmpliTaq™ Gold buffer, 400 μM dNTPs, 2 mM magnesium chloride, 0.4 μM each primer, 1 unit of AmpliTaq™ Gold, in a total of 25 μL. Reactions were performed as follows: 1 cycle of 94° C for 15 min, 30 cycles of 94° C for 1 min, 50° C for 0.5 min, 72° C for 5 min, 1 cycle of 72° C for 10 min. This regimen yields PCR products incorporating analogue bases.
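The following simulation is offered purely as an illustrative sketch of why a pool of mutants generated at a frequency of up to 1 in 5 can still report the parent sequence when positions are interrogated collectively. It assumes, for simplicity, that every position mutates independently with probability 0.2 and that a mutated position is replaced by any of the three other bases with equal likelihood; real analogues such as dPTP are not expected to substitute uniformly, and the pool size, sequence length and random seed are arbitrary assumptions.

```python
import random
from collections import Counter

BASES = "ACGT"


def mutate(parent, frequency, rng):
    """Return a copy of parent in which each base is substituted with
    probability `frequency` by one of the three other bases (uniform
    substitution is a simplifying assumption)."""
    mutant = []
    for base in parent:
        if rng.random() < frequency:
            mutant.append(rng.choice([b for b in BASES if b != base]))
        else:
            mutant.append(base)
    return "".join(mutant)


def majority_sequence(variants):
    """Most abundant base at each position across equal-length variants."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*variants)
    )


rng = random.Random(0)  # fixed seed so the sketch is repeatable
parent = "".join(rng.choice(BASES) for _ in range(100))
pool = [mutate(parent, 0.2, rng) for _ in range(200)]
recovered = majority_sequence(pool)
```

Under these assumptions each individual mutant differs from the parent at roughly one position in five, yet the parent base remains the most abundant species at every position of the pool.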
EXAMPLE 2
Standard PCR Amplification (Recovery)
[0195] The target sequence was an undefined DNA fragment pTEST™ of 1.5 kb in length cloned into pUC19. Amplification conditions were essentially as described by Zaccolo et al. (1996 supra). Using a concentration of 200 μM of each dATP, dCTP, dGTP, dTTP and universal M13 primers (FSP-21, FSP-40, RSP-26, RSP-48), the PCR amplified products could be used directly for sequencing, or could be cloned and individual clones could be used for sequencing.
Reaction conditions:
[0196] 2 ng DNA template, 1x AmpliTaq™ Gold buffer, 200 μM dNTPs, 2 mM magnesium chloride, 0.4 μM each primer, 1 unit of AmpliTaq™ Gold, in a total of 25 μL. Reactions were performed as follows: 1 cycle of 94° C for 15 min, 30 cycles of 94° C for 1 min, 50° C for 0.5 min, 72° C for 5 min, 1 cycle of 72° C for 10 min. This regimen yields PCR products that are not further mutated but comprise the same mutations incorporated in the amplification of Example 1.
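As a simple worked calculation, the programmed hold time of the cycling regimen above can be totalled as sketched below. The function name total_hold_minutes and the representation of the program as (cycles, step durations) pairs are hypothetical; ramping time between temperatures depends on the thermal cycler and is not included.

```python
def total_hold_minutes(program):
    """Sum the programmed hold times of a thermal cycling program.

    program: list of (number_of_cycles, [step durations in minutes]).
    Ramp times between temperatures are not included.
    """
    return sum(cycles * sum(steps) for cycles, steps in program)


pcr_program = [
    (1, [15.0]),            # 1 cycle: 94 C for 15 min
    (30, [1.0, 0.5, 5.0]),  # 30 cycles: 94 C 1 min, 50 C 0.5 min, 72 C 5 min
    (1, [10.0]),            # 1 cycle: 72 C for 10 min
]
hold_minutes = total_hold_minutes(pcr_program)  # -> 220.0
```

The regimen therefore holds for 220 minutes in total, that is, 3 hours 40 minutes, before ramping time is added.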
EXAMPLE 3
Standard DNA Sequencing by Cycle PCR Amplification (Big Dye version 3.0)
[0197] The target sequence was typically an undefined DNA fragment cloned into pDRIVE™ or pUC19. Cycle sequencing amplification conditions were essentially as described by Applied Biosystems Incorporated (2002) for BigDye™ Terminator v3.0 and v3.1 Cycle Sequencing Kits. Using a concentration of 200 μM of each dATP, dCTP, dGTP, dTTP, ABI proprietary concentrations of the dye terminators ddATP, ddCTP, ddGTP and ddTTP, and universal M13 primers (FSP-21, FSP-40, RSP-26, RSP-48), the amplified products were used directly for sequencing.
Reaction conditions:
[0198] 10 ng DNA template, 1x ABI PRISM BigDye™ Terminator version 3.0 sequencing buffer, 200 μM dNTPs, approximately 200 μM ddNTPs, 2 mM magnesium chloride, 0.4 μM each primer, 1 unit of AmpliTaq® DNA Polymerase, FS in a total of 20 μL. Reactions were performed as follows: 1 cycle of 95° C for 5 min, 50 cycles of a rapid thermal ramping (1° C/sec) to 95° C for 0.5 min, 50° C for 10 sec, 60° C for 4 min, then to 4° C and hold until ready to purify. This regimen yields standard BigDye™ Terminator version 3.0 sequencing products.
EXAMPLE 4
Purification of PCR products
Qiagen Mini-Elute™ PCR purification kit protocol
[0199] 1. Add 5 volumes of Buffer PB to 1 volume of the PCR reaction and mix.
[0200] 2. Place a Mini-Elute™ column in a provided 2ml collection tube in a suitable rack.
[0201] 3. To bind DNA, apply the sample to the Mini-Elute™ column and centrifuge for 1 minute at 13,000 g.
[0202] 4. Discard the flow through and place the Mini-Elute column back in the same tube and centrifuge the column for an additional 1 minute at 13,000 g.
[0203] 5. Place the Mini-Elute column in a clean 1.5 mL microfuge tube.
[0204] 6. To elute DNA, add 10 μL of Buffer EB or H2O to the centre of the membrane, let the column stand for 1 minute and centrifuge for 1 minute at 13,000 g.
Clone into pDRIVE™ Plasmid
[0205] Vector DNA (pGEM T-Easy) 1 μL
[0206] Insert DNA (50-100ng) 3 μL
[0207] 2 x Buffer 5 μL
[0208] T4 DNA Ligase μL
[0209] TOTAL 10 μL
[0210] Place the samples on a PCR block at 16° C overnight, or for 12 hr on a PCR block at 16° C followed by 1 hour on the bench at room temperature, to complete the ligation reaction.
EXAMPLE 5
Chemical Transformation Procedure
[0211] • Chill sterile 15 ml Falcon tubes on ice.
[0212] • Thaw frozen competent E. coli cells (JM109) on ice.
[0213] • Gently mix the thawed competent cells. Transfer 50 μL to each of the chilled Falcon tubes.
[0214] • Add 3 μL of DNA per 50 μL of competent cells. Quickly flick the tube several times. Note: do not mix by pipetting up and down.
[0215] • Immediately return the tubes to ice for 10 minutes.
[0216] • Heat shock the cells for 45-50 seconds in a water bath at exactly 42° C.
[0217] • Add 900 μL of cold (4° C) SOC Medium to all tubes.
[0218] • Incubate the suspension for 60 minutes at 225 rpm and 37° C.
[0219] • Plate 500 μL of cells on LB Agar + Amp (Xgal, IPTG) plates. The transformation mix can be stored at 4° C and plated out the next day.
[0220] • Incubate plates in a 37° C oven overnight; allow 17 hours of growth for blue/white selection.
EXAMPLE 6
Growing Bacterial Cultures
[0221] • Fill each well of a square well block with 1.2 mL of growth medium + antibiotic (TB Broth + Ampicillin).
[0222] • Inoculate each well with a single white bacterial colony and incubate the cultures for 16 hours at 37° C at 220 rpm.
EXAMPLE 7
Qiagen - R.E.A.L. Prep 96™ Plasmid Protocol
[0223] • Harvest bacterial cells by centrifugation for 5 minutes at 2,500 g in a centrifuge. Remove medium by inverting the block.
[0224] • Resuspend each bacterial pellet in 0.3 mL of Buffer R1. Tape the block and mix by vortexing. (Ensure that RNase A has been added to Buffer R1 before use).
[0225] • Add 0.3 mL of buffer R2, seal the block with new tape and invert block 10 times to mix. (Do not allow the lysis solution to proceed for more than 5 minutes).
[0226] • Add 0.3 mL of buffer R3, seal the block with new tape and invert block 10 times to mix.
[0227] • Place a QIAfilter™ 96 plate (yellow) in the top plate of the QIAvac™ 96 manifold. Place a new square well block into the base and reassemble the manifold.
[0228] • Apply the lysates from the step above to the well of the QIAfilter™ 96 plate
[0229] • Apply vacuum (-200 to -300 mbar) until the lysates are completely transferred to the square-well block in the QIAvac™ base.
[0230] • Take the square well block containing the cleared lysates from the vacuum manifold. Add 0.63 mL of room temperature isopropanol to each well, tape the box and mix by immediately inverting 3 times.
[0231] • Centrifuge at 2,500 g for 15 minutes at room temperature to pellet the DNA. Remove the supernatant by inverting the block over a waste container, then tapping firmly upside down on a paper towel.
[0232] • Wash each DNA pellet by adding 0.5 mL of 70% cold ethanol and centrifuging at 2,500 g for 2 minutes. Remove the supernatant by inverting the block over a waste container, then tapping firmly upside down on a paper towel.
[0233] • Place the block in a 37° C oven for half an hour to dry and to remove any residual ethanol.
[0234] • Redissolve the DNA pellets in 35-40 μL of MilliQ™ water. Store at -20° C.
EXAMPLE 8
Recovery of individual mutated DNA by PCR reamplification and cloning:
PROTOCOL 1
[0235] Viable mutated DNAs were recovered by re-amplification of 1 μL of a 1 in 1000-fold dilution of the nucleotide analogue modified PCR reaction products with nested primers and the four conventional dNTPs and standard PCR conditions.
Cloning of PCR products and sequencing:
[0236] The mutant PCR products were gel purified and then cloned into the pGEM T-EASY™ vector (Promega) or pDRIVE™ (Promega) and transformed into E. coli. Plasmid DNA of individual clones was sequenced using standard sequencing conditions. The effect of the analogues on the sequence of DNA is illustrated in Figures 3B, 4B and 5B for individual mutant clones.
Standard TempliPhi™ Amplification for Purified Plasmid or M13 Samples
[0237] The target sequence was typically 1 pg-10 ng of an undefined DNA fragment cloned into pDRIVE™ or pUC19. Isothermal amplification conditions were essentially as described by Amersham BioSciences PLC (2003) for TempliPhi™ DNA Amplification Kits and by Dean et al. (2001 Genome Research 11, 1095-1099). Using an Amersham proprietary concentration of approximately 200 μM of each dATP, dCTP, dGTP, dTTP, and Amersham proprietary random hexamer primers, the amplified DNA products were used directly for sequencing.
STANDARD TEMPLIPHI™ REACTION CONDITIONS:
[0238] 1 pg to 10 ng of DNA template (increasing amounts are used as template length increases), 1x denaturing buffer with 0.4 μM random hexamer primers, 1x TempliPhi™ DNA Amplification buffer, 200 μM dNTPs, 2 mM magnesium chloride, 1 unit of φ29 DNA polymerase in a total of 20 μL. Reactions were performed as follows: Mix template DNA and denaturing buffer with random hexamer primers in a volume of 10 μL and denature with 1 cycle of 95° C for 3 min, then cool to room temperature or 4° C. Add 10 μL of TempliPhi™ premix (with dNTPs, TempliPhi™ incubation buffer and φ29 DNA polymerase) to the cooled sample and mix briefly, then incubate at 30° C for 8-12 hr (recommended range 4-18 h). Heat-inactivate the enzyme at 65° C for 10 min, then cool to 4° C. Dilute the amplified DNA approximately 1000-fold and use 10 ng DNA (typically 2-3 μL) as template for a Standard DNA Sequencing reaction.
Mutagenic Analogue TempliPhi™ Amplification of Purified Plasmid Samples
[0239] The target sequence was typically 1 pg-10 ng of an undefined DNA fragment cloned into pDRIVE™ or pUC19. Isothermal amplification conditions were essentially as described by Amersham BioSciences PLC (2003) for TempliPhi™ DNA Amplification Kits and GenomiPhi™ DNA Amplification Kits. Using an Amersham proprietary concentration of approximately 200 μM of each dATP, dCTP, dGTP, dTTP, mutagenic nucleotide analogues and Amersham proprietary random hexamer primers, the amplified DNA products were used directly for sequencing.
MUTAGENIC TEMPLIPHI™ REACTION CONDITIONS:
[0240] 1 pg to 10 ng of DNA template (increasing amounts are used as template length increases), 1x denaturing buffer with 0.4 μM random hexamer primers, 1x TempliPhi™ DNA Amplification buffer, 200 μM dNTPs, 50 to 100 μM analogue dNTP, 2 mM magnesium chloride, 1 unit of φ29 DNA polymerase in a total of 20 μL. Reactions were performed as follows: Mix template DNA and denaturing buffer with random hexamer primers in a volume of 10 μL and denature with 1 cycle of 95° C for 3 min, then cool to room temperature or 4° C. Add 1-2 μL of 1 mM analogue dNTP and 10 μL of TempliPhi™ premix (with dNTPs, TempliPhi™ incubation buffer and φ29 DNA polymerase) to the cooled sample and mix briefly, then incubate at 30° C for 8-12 hr (recommended range 4-18 h). Heat-inactivate the enzyme at 65° C for 10 min, then cool to 4° C. Dilute the amplified DNA approximately 5-fold and use 10 ng DNA (typically 2-3 μL) as template for a Standard DNA Sequencing reaction.
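The analogue concentration quoted in the mutagenic reaction conditions follows from simple dilution arithmetic. The Python sketch below is illustrative only and forms no part of the described protocol. The function names are hypothetical, and a 1 mM analogue stock in a nominal 20 μL reaction is an assumption, chosen because 1-2 μL of such a stock reproduces the stated 50 to 100 μM.

```python
def final_concentration_uM(stock_mM: float, added_uL: float, total_uL: float) -> float:
    """Final concentration (micromolar) of a stock diluted into a reaction.

    Applies C1 * V1 = C2 * V2. All names here are hypothetical.
    """
    if total_uL <= 0 or added_uL < 0 or added_uL > total_uL:
        raise ValueError("require total_uL > 0 and 0 <= added_uL <= total_uL")
    return stock_mM * 1000.0 * added_uL / total_uL


def stock_volume_uL(stock_mM: float, target_uM: float, total_uL: float) -> float:
    """Volume of stock (microlitres) needed to reach target_uM in total_uL."""
    if stock_mM <= 0:
        raise ValueError("stock_mM must be positive")
    return target_uM * total_uL / (stock_mM * 1000.0)


if __name__ == "__main__":
    # Assumed 1 mM analogue stock, nominal 20 uL reaction.
    for volume in (1.0, 2.0):
        conc = final_concentration_uM(1.0, volume, 20.0)
        print(f"{volume} uL of 1 mM stock in 20 uL gives {conc} uM")
```

The same helper can be used in reverse (`stock_volume_uL`) to find the volume of stock required for a chosen analogue concentration within the stated range.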
MUTAGENIC ALKALINE TEMPLIPHI™ REACTION CONDITIONS:
[0241] Using a concentration of 200 μM of each dATP, dCTP, dGTP, dTTP, 50 to 100 μM analogue dNTP such as 5-Bromo-dUTP in a mutagenic TempliPhi™ reaction modified by addition of Glycine KOH, pH 9.5 to a final pH of 8.5 to 9.0, 2 mM magnesium chloride, 1 unit of φ29 DNA polymerase in a total of 20 μL. Reactions were performed as follows: Mix template DNA and denaturing buffer with random hexamer primers in a volume of 10 μL and denature with 1 cycle of 95° C for 3 min, then cool to room temperature or 4° C. Add 1-2 μL of 1 mM analogue dNTP and 10 μL of TempliPhi™ premix (with dNTPs, TempliPhi™ incubation buffer and φ29 DNA polymerase) to the cooled sample and mix briefly, then incubate at 30° C for 8-12 hr (recommended range 4-18 h). Heat-inactivate the enzyme at 65° C for 10 min, then cool to 4° C. Dilute the amplified DNA approximately 5-fold and use 10 ng DNA (typically 2-3 μL) as template for a Standard DNA Sequencing reaction.
EXAMPLE 9
Recovery of individual Rolling Circle mutated DNA by PCR reamplification and cloning: PROTOCOL 2
[0242] Plasmid pTEST™ was mutated using the analogue 5-Br-dUTP with the modified alkaline TempliPhi™ reaction. The reaction products were recovered by PCR re-amplification of 1 μL of a 1 in 1000-fold dilution with nested primers, the four natural or conventional dNTPs and standard PCR conditions. PCR amplified fragments were cloned and individual clones were sequenced; the effect of the analogues on the sequence is illustrated in Figure 8 for an individual mutant pTEST™ clone, as described above. Here mutations were incorporated up to a frequency of 1 in 25. A further example is illustrated in Figure 5B for an individual mutant fragment of the previously unsequencable human genomic BAC clone RP11-167L9, as described above.
EXAMPLE 10
Mutation using Taq DNA polymerase for primary mutation
[0243] Amplification conditions were essentially as described by Zaccolo et al. (1996 supra). Using a concentration of 400 μM of each dATP, dCTP, dGTP, dTTP, and 400 μM of a mutagenic nucleotide analogue such as dPTP in a Nucleotide Analogue PCR Amplification, mutations were incorporated up to a frequency of 1 in 5. Universal M13 primers (FSP-21, FSP-40, RSP-26, RSP-48) were used for PCR amplification.
Cloning of PCR products and SAQOM sequencing:
[0244] The mutant PCR products were gel purified and then cloned into the pGEM T-EASY™ vector (Promega) and transformed into E. coli. Plasmids from individual clones were restricted with selected restriction endonucleases to determine the orientation of the cloned mutated DNA inserted in the plasmid. Plasmid DNA from selected individual clones with common insert orientation was combined in an approximately equimolar mixture and sequenced using Standard sequencing conditions.
EXAMPLE 11
Recovery of Original sequence from a mixture of SAQOM mutated DNA:
PROTOCOL 3
[0245] Mutated DNAs were recovered by PCR re-amplification of 1 μL of a 1 in 1000-fold dilution of the nucleotide analogue modified PCR reaction products with nested primers and the four natural dNTPs and standard PCR conditions.
Orientation of Cloned PCR-recovered products and Sequencing:
[0246] The PCR recovered products of mutant TempliPhi™ amplification were gel purified and then cloned into the pGEM T-EASY™ vector (Promega) and transformed into E. coli. Plasmids from individual clones were restricted with selected restriction endonucleases to determine the orientation of the cloned mutated DNA inserted in the plasmid. Plasmid DNA from selected individual clones with common insert orientation was combined in an approximately equimolar mixture and sequenced using Standard sequencing conditions.
[0247] Clones with the same insert orientation were determined by restriction analysis. Examples of mixed sequence reads are the nine mutated clones of the previously unsequencable human fragment RP11-167L9 of the same insert orientation illustrated in Figure 6B, and the eight mutated clones of the previously unsequencable human fragment RP11-167L9 of the same insert orientation illustrated in Figures 6C and 6D.
[0248] A further example is the mixture of 15 different mutated pTEST™ clones of the same insert orientation that was sequenced directly using Standard DNA Sequencing by Cycle PCR Amplification (Big Dye version 3.0) illustrated in Figure 9B. Here mutations are typically incorporated in individual molecules up to a frequency of 1 in 5 (see EXAMPLE 1 above).
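The idea that a pooled read of independently mutated clones tends toward the original sequence can be illustrated in silico. The Python sketch below is a simplified analogue, not the method of this specification and not the SAM reconstruction of WO2002/079502. It assumes the individual clone sequences are already aligned, of equal length and in the same orientation, and takes the most common base at each position. In practice the composite is read from a single mixed sequencing trace. The function name and the example sequences are hypothetical.

```python
from collections import Counter
from typing import Sequence


def composite_sequence(reads: Sequence[str]) -> str:
    """Return the most common base at each position of pre-aligned reads.

    Ties are resolved in favour of the base seen first in that column,
    which is an arbitrary choice made for this sketch.
    """
    if not reads:
        raise ValueError("at least one read is required")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("all reads must have the same length")

    consensus = []
    for column in zip(*reads):  # one tuple of bases per position
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


if __name__ == "__main__":
    # Three hypothetical clones, each carrying a different substitution.
    clones = ["ACGTAC", "ACGTGC", "ATGTAC"]
    print(composite_sequence(clones))
```

Because each clone carries substitutions at different positions, any single position is mutated in only a minority of the pool, so the majority base at that position is the unmutated one.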
EXAMPLE 12
SAQOM Sequencing by Direct Cycle PCR Amplification (Big Dye v3.0 and v3.1) with Nucleotide Analogues
[0249] The target sequence was typically an undefined DNA fragment cloned into pDRIVE™ or pUC19. Cycle sequencing amplification conditions were essentially as described by Applied Biosystems Incorporated (2002) for BigDye™ Terminator v2.0, v3.0 and v3.1 Cycle Sequencing Kits. Using a concentration of 200 μM of each dATP, dCTP, dGTP, dTTP, ABI proprietary concentrations of the dye terminators ddATP, ddCTP, ddGTP and ddTTP, 30-200 μM of an appropriate mutagenic nucleotide analogue and universal M13 primers (FSP-21, FSP-40, RSP-26, RSP-48), the amplified products were used directly for sequencing.
V3.0 Reaction conditions:
[0250] 10 ng DNA template, 1x ABI PRISM BigDye™ Terminator version 3.0 sequencing buffer, 200 μM dNTPs, approximately 200 μM ddNTPs, 30 to 200 μM of mutagenic analogue dNTP, 2 mM magnesium chloride, 0.4 μM each primer, 1 unit of AmpliTaq® DNA Polymerase, FS in a total of 20 μL. Reactions were performed as follows: 1 cycle of 95° C for 5 min, 50 cycles of rapid thermal ramping (1° C/sec) to 95° C for 0.5 min, 50° C for 10 sec, 60° C for 8 min, then to 4° C and hold until ready to purify the sequencing products. This regimen yields standard SAQOM modified sequencing products. The standard number of 50 thermal cycles was reduced with strongly mutagenic nucleotide analogues to 30 cycles.
V3.1 Reaction conditions:
[0251] 10 ng DNA template in 4 μL, 4 μL ABI BigDye™ Terminator Cycle Sequencing Kit version 3.1 with 4 μL M13-40 primer (0.8 pmole/μL), BDT, 1 μL of 4 mM dPTP or other mutagenic nucleotide analogue, in a final reaction volume of 13 μL. Reactions were performed as follows: 1 cycle of 95° C for 5 min, 30 cycles of rapid thermal ramping (1° C/sec) to 96° C for 10 sec, 50° C for 5 sec, 60° C for 8 min, then to 4° C and hold until ready to purify the sequencing products. The standard number of 50 thermal cycles was reduced with strongly mutagenic nucleotide analogues to 30 cycles.
Protocol for Cycle Sequencing Large DNA Templates:
[0252] 1. Place the tubes in a thermal cycler and set the volume to 20 μL.
[0253] 2. Heat the tubes at 95° C for 5 minutes.
[0254] 3. Repeat the following for 50 cycles:
[0255] Rapid thermal ramp to 95° C;
[0256] 95° C for 30 sec;
[0257] Rapid thermal ramp to 50-55° C (depending on template);
[0258] 50-55° C for 10 sec;
[0259] Rapid thermal ramp to 60° C (i.e., 1° C/sec); and
[0260] 60° C for 4 min.
[0261] 4. Rapid thermal ramp to 4° C and hold until ready to purify.
[0262] 5. Spin down the contents of the tubes in a microcentrifuge.
[0263] If required, the number of cycles can be increased.
[0264] The cycle sequencing protocol provided above works for a variety of templates. However, the following modifications may be made:
[0265] For short PCR products, a reduced number of cycles can be used (e.g., 20 cycles for a 300-bp or smaller fragment).
[0266] If the Tm of a primer is >60° C, the annealing step can be eliminated.
[0267] If the Tm of a primer is <50° C, increase the annealing time to 30 seconds or decrease the annealing temperature to 48° C.
[0268] For templates with high GC content (>70%), heat the tubes at 98° C for 5 minutes before cycling to help denature the template.
[0269] For sequencing large DNA templates such as BAC DNA, cosmid DNA, and genomic DNA, heat the tubes at 98° C for 5 minutes before cycling to help denature the template.
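The cycling protocol and its modifications can be expressed as a small rule set. The Python sketch below is illustrative only; the class and function names are hypothetical. Two interpretive assumptions are made and marked in the comments: the 98° C heating is treated as replacing the initial 95° C hold, and for a primer Tm below 50° C the longer annealing time is chosen rather than the lower annealing temperature. Ramp times are not modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    temp_c: float
    seconds: int


@dataclass(frozen=True)
class Program:
    initial: Step
    cycles: int
    cycle_steps: tuple
    hold_temp_c: float = 4.0

    def hold_seconds(self) -> int:
        """Programmed hold time only; ramping time is not included."""
        per_cycle = sum(step.seconds for step in self.cycle_steps)
        return self.initial.seconds + self.cycles * per_cycle


def build_program(primer_tm_c: float,
                  gc_fraction: float = 0.5,
                  large_template: bool = False,
                  fragment_bp: Optional[int] = None,
                  anneal_temp_c: float = 50.0) -> Program:
    """Build a cycle sequencing program from the rules in the text."""
    if not 50.0 <= anneal_temp_c <= 55.0:
        raise ValueError("annealing temperature is 50-55 C in the protocol")

    # Assumption: the 98 C step replaces the initial 95 C hold.
    hot_start = gc_fraction > 0.70 or large_template
    initial = Step(98.0 if hot_start else 95.0, 5 * 60)

    # 20 cycles for fragments of 300 bp or smaller, otherwise 50.
    short = fragment_bp is not None and fragment_bp <= 300
    cycles = 20 if short else 50

    steps = [Step(95.0, 30)]
    if primer_tm_c > 60.0:
        pass  # annealing step eliminated
    elif primer_tm_c < 50.0:
        # Assumption: lengthen annealing rather than lower it to 48 C.
        steps.append(Step(anneal_temp_c, 30))
    else:
        steps.append(Step(anneal_temp_c, 10))
    steps.append(Step(60.0, 4 * 60))

    return Program(initial, cycles, tuple(steps))


if __name__ == "__main__":
    program = build_program(primer_tm_c=55.0)
    print(program.cycles, "cycles,", program.hold_seconds() / 3600, "h of holds")
```

Holding the rules in one function makes it easy to see how the template-specific modifications combine, for example a short, GC-rich product amplified with a high-Tm primer.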
EXAMPLE 13
Recovery of original sequence by Mutagenic cycle sequencing:
PROTOCOL 4
[0270] Mutated DNAs were sequenced using non-Standard cycle sequencing conditions.
Reaction conditions:
[0271] 10 ng DNA template, 1x ABI PRISM BigDye™ Terminator version 3.0 sequencing buffer, 200 μM dNTPs, approximately 200 μM ddNTPs, 35 to 200 μM mutagenic analogue dNTP, 2 mM magnesium chloride, 0.4 μM each primer, 1 unit of AmpliTaq® DNA Polymerase, FS in a total of 20 μL. Reactions were performed as follows: 1 cycle of 95° C for 5 min, 50 cycles of rapid thermal ramping (1° C/sec) to 95° C for 0.5 min, 50° C for 10 sec, 60° C for 8 min, then to 4° C and hold until ready to purify the sequencing products. This regimen yields SAQOM modified sequencing products. The effect of a mixture of different cycle-sequence amplified mutated fragments (SAQOM) on the composite sequence of a previously "unsequencable" clone D12 is illustrated in Figure 10, as described above. Two further examples of direct SAQOM cycle-sequence amplified and mutated fragments of previously "unsequencable" DNA motifs D4 and RP11-167L9 are illustrated in Figures 11 and 12, respectively.
EXAMPLE 14
TempliPhi™ amplification with standard incorporation of mutagenic analogues:
PROTOCOL 5
[0272] The reaction comprises 5 μL denaturing sample buffer and 1 μL of plasmid DNA diluted 1/1000. Heat at 95° C for 3 min, then cool to room temperature or place on ice. On ice add: 5 μL reaction buffer, 2 μL 25 mM magnesium chloride, 2 μL 10 mM dPTP, 0.2 μL TempliPhi™ DNA polymerase, then incubate at 30° C for 8 hrs. Inactivate the enzyme at 65° C for 10 minutes. The amplified DNA does not require further purification to perform cycle sequencing. Due to the viscous nature of the amplified product, a dilution step is recommended prior to transfer. Add 40 μL of water or TE (10 mM Tris-HCl, pH 8.0, 1 mM EDTA) to the amplified product and mix well by pipetting up and down, or vortexing if required. After mixing, transfer 2-5 μL (100-250 ng) of diluted sample per 20 μL sequencing reaction. If the incubation period was less than 12 h, add 4-8 μL of diluted product.
EXAMPLE 15
Elimination of Taq DNA polymerase stutter by SAQOM Genotyping:
PROTOCOL 6
[0273] The problem of DNA polymerase "stutter" during the amplification and analysis of length polymorphism of microsatellite or simple tandem repeat (STR) or variable number tandem repeat (VNTR) alleles arises because of the repetitive nature of the sequence motifs (Baran, Lapidot & Manor, 1991 Proc. Natl. Acad. Sci. USA 88:507-511; Kaiser et al., 2002 Clinical Biochemistry 35:49-56) that potentially allow extension from either aligned triplex strands or from misaligned, partially replicated duplex primed ends, or from other forms of simple repeat or homopolymer. These sequence motifs are known to limit or to prevent entirely the processive synthesis of DNA by DNA polymerases.
[0274] Such simple repeat motifs can be simultaneously genotyped by direct PCR amplification from human genomic DNA template using PCR amplification in the presence of a mutagenic nucleotide analogue and multiplex primer pairs. Single locus genotypes can also be determined using single locus primer pairs. The method is advantageous to eliminate Taq DNA polymerase "stutter products" from simple repeat motifs comprising homopolymer and dinucleotide sequences.
Reaction conditions:
[0275] 2 μg of human genomic DNA template, 1x AmpliTaq™ Gold buffer, 200 μM dNTPs, 2 mM magnesium chloride, 0.4 μM each primer (e.g., a pair of primers flanking a polymorphic microsatellite locus, sometimes referred to as simple tandem repeat (STR) primers), 1 unit of AmpliTaq™ Gold, and about 30 to 200 μM of a mutagenic nucleotide analogue such as 8-oxo-dGTP, in a Nucleotide Analogue PCR Amplification in a total of 25 μL. Reactions were performed as follows: 1 cycle of 94° C for 15 min, 30 cycles of 94° C for 1 min, 50° C for 0.5 min, 72° C for 5 min, 1 cycle of 72° C for 10 min. This regimen yields PCR products incorporating analogue bases. Locus specific STR primers were used simultaneously to direct the PCR amplification of a multiplicity of specific genomic fragments. The effect of a mixture of different PCR amplified mutated STR fragments on the composite fragment repeat-length of each locus specific fragment is illustrated in Figure 13. As illustrated in this figure, the peaks defining allelic variants of the locus are generally single in nature and do not include erroneous peaks obtained due to polymerase stutter.
EXAMPLE 16
Standard assay for determining mutagenic frequency
Standard PCR amplification for substitution mutation
[0276] 2 ng DNA template, being a pUC19 or pDRIVE™ plasmid containing the 171 base pair Lactococcus lactis nisZ gene [SEQ ID NO:1], was combined with 1x AmpliTaq™ Gold buffer, 200 μM of three dNTPs, 14 μM of a fourth dNTP, 200 μM of a mutagenic nucleotide analogue, 5 mM magnesium chloride, 0.4 μM each primer, 1 unit of AmpliTaq™ Gold, in a total of 25 μL. Reactions were performed as follows: 1 cycle of 94° C for 15 min, 30 cycles of 94° C for 1 min, 50° C for 0.5 min, 72° C for 5 min, 1 cycle of 72° C for 10 min. This regimen yields PCR products incorporating analogue bases.
[0277] Universal M13 primers (FSP-21, FSP-40, RSP-26, RSP-48) were used for PCR amplification and sequencing. All amplification reactions described herein were performed with an Applied Biosystems GeneAmp™ PCR System 9700 or 9600 thermal cycler.
Standard PCR Amplification (Recovery)
[0278] The Lactococcus lactis nisZ gene mutated in the above amplification is then subjected to recovery amplification for subsequent cloning into pUC19 or pDRIVE™. Amplification conditions were essentially as described by Zaccolo et al. (1996, supra). Using a concentration of 200 μM of each dATP, dCTP, dGTP, dTTP and universal M13 primers (FSP-21, FSP-40, RSP-26, RSP-48), the PCR amplified products could be used directly for sequencing, or could be cloned and individual clones could be used for sequencing using the protocols described in Examples 2-7.
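Once an individual mutant clone has been sequenced, a mutation frequency of the kind quoted in the Examples (for instance "1 in 25" or "1 in 5") can be estimated by comparing it with the known template. The specification does not state how the frequency was computed, so the Python sketch below is an assumption: it counts base substitutions between two pre-aligned sequences of equal length and ignores insertions and deletions. The function names and the example sequences are hypothetical and are not the nisZ sequence.

```python
from typing import Optional, Tuple


def mutation_frequency(original: str, mutant: str) -> Tuple[int, float]:
    """Count substitutions between two pre-aligned, equal-length sequences.

    Returns (number of substitutions, substitutions per base).
    """
    if not original or len(original) != len(mutant):
        raise ValueError("sequences must be non-empty and of equal length")
    pairs = zip(original.upper(), mutant.upper())
    substitutions = sum(1 for ref, alt in pairs if ref != alt)
    return substitutions, substitutions / len(original)


def one_in_n(frequency: float) -> Optional[int]:
    """Express a per-base frequency as '1 in N'; None if no mutations."""
    if frequency <= 0:
        return None
    return round(1 / frequency)


if __name__ == "__main__":
    # Hypothetical 20-base template and a clone with four substitutions.
    template = "ACGTACGTACGTACGTACGT"
    clone = "GCGTATGTACATACGCACGT"
    count, freq = mutation_frequency(template, clone)
    print(f"{count} substitutions, about 1 in {one_in_n(freq)}")
```

For a real assay the per-clone counts would be pooled over many clones before quoting a frequency, since a single short clone gives a noisy estimate.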
[0279] The disclosure of every patent, patent application, and publication cited herein is hereby incorporated herein by reference in its entirety.
[0280] The citation of any reference herein should not be construed as an admission that such reference is available as "Prior Art" to the instant application.
[0281] Throughout the specification the aim has been to describe the preferred embodiments of the invention without limiting the invention to any one embodiment or specific collection of features. Those of skill in the art will therefore appreciate that, in light of the instant disclosure, various modifications and changes can be made in the particular embodiments exemplified without departing from the scope of the present invention. All such modifications and changes are intended to be included within the scope of the appended claims.
PROTOCOL 1
1. Original DNA template [plasmid or BAC clone]
2. Amplify the region in the presence of mutagenic nucleotide analogue using PCR amplification
3. Perform "Recovery PCR" with normal dNTPs
4. Clone amplified mutated region in recoverable plasmid vector
5. Sequence individual plasmid inserts to determine the Mutated Sequence
6. Determine the original wild type sequence by application of SAM (Sequence Analysis via Mutagenesis) reconstruction, as described in International Publication WO2002/079502, using data from a defined number of individual mutant sequences.
PROTOCOL 2
1. Original DNA template [plasmid or larger DNA fragment, e.g. BAC clone]
2. Amplify the region in the presence of mutagenic nucleotide analogue using TempliPhi™ amplification
3. Perform "Recovery PCR" with normal dNTPs
4. Clone amplified mutated region in recoverable plasmid vector
5. Restrict plasmid clones to determine orientation of mutated fragment insert
6. Determine the original wild type sequence by application of SAM (Sequence Analysis via Mutagenesis) reconstruction, as described in International Publication WO2002/079502, using data from a defined number of individual mutant sequences.
SAQOM PROTOCOL 3
1. Original DNA template [plasmid or larger DNA fragment, e.g. BAC clone]
2. Amplify the region in the presence of mutagenic nucleotide analogue using either PCR amplification or TempliPhi™ amplification
3. Perform "Recovery PCR" with normal dNTPs
4. Select a multiplicity of clones with a common mutated insert orientation
5. Determine the mean or composite sequence of the pool of fragments using nested inner primers
SAQOM PROTOCOL 4
1. Original DNA template [plasmid or larger DNA fragment, e.g. BAC clone]
2. Amplify the region in the presence of mutagenic nucleotide analogue using Cycle Sequencing PCR Amplification with appropriate sequencing primer
3. Determine the mean or composite sequence of the pool of fragments
SAQOM PROTOCOL 5
1. Original DNA template [plasmid or BAC clone or genomic DNA]
2. Amplify the region in the presence of mutagenic nucleotide analogue using TempliPhi™ amplification
3. Determine the mean or composite sequence of the pool of fragments using appropriate secondary sequencing primer and Standard Cycle Sequencing reaction
SAQOM PROTOCOL 6
1. Original DNA template [e.g. genomic DNA template] containing simple tandem repeat motif for repeat-length genotyping
2. Amplify copies of the region in the presence of mutagenic nucleotide analogue using PCR amplification
3. Determine the mean repeat length of the pool of SSR fragments