
WO2001062958A1 - Assessment of the functional significance of a nucleotide sequence by subjecting the reading frame data to data processing - Google Patents


Info

Publication number
WO2001062958A1
Authority
WO
WIPO (PCT)
Prior art keywords
codons
nucleotide sequence
base
sequence
codon
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/AU2001/000193
Other languages
French (fr)
Inventor
Richard John Epstein
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
HUGHES EDWARD JOHN LANGFORD
National Cancer Centre of Singapore Pte Ltd
Original Assignee
HUGHES EDWARD JOHN LANGFORD
National Cancer Centre of Singapore Pte Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by HUGHES EDWARD JOHN LANGFORD, National Cancer Centre of Singapore Pte Ltd
Priority to AU35255/01A (patent AU3525501A)
Publication of WO2001062958A1

Links

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N: MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00: Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09: Recombinant DNA-technology
    • C12N15/10: Processes for the isolation, preparation or purification of DNA or RNA
    • C: CHEMISTRY; METALLURGY
    • C07: ORGANIC CHEMISTRY
    • C07F: ACYCLIC, CARBOCYCLIC OR HETEROCYCLIC COMPOUNDS CONTAINING ELEMENTS OTHER THAN CARBON, HYDROGEN, HALOGEN, OXYGEN, NITROGEN, SULFUR, SELENIUM OR TELLURIUM
    • C07F9/00: Compounds containing elements of Groups 5 or 15 of the Periodic Table
    • C07F9/02: Phosphorus compounds
    • C07F9/28: Phosphorus compounds with one or more P-C bonds
    • C07F9/54: Quaternary phosphonium compounds
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Definitions

  • the present invention relates generally to a method for the genetic analysis of nucleic acid molecules and expression products encoded therefrom.
  • the present invention relates to the analysis of the genetic code, and to the analysis of individual codons that comprise a nucleotide sequence.
  • the present invention relates to the third base position of a codon for a particular amino acid, and to the functional significance of that particular amino acid. More particularly, the present invention provides a means for predicting the functional significance of a nucleotide sequence or a segment of a larger nucleotide sequence and to a means for predicting the functional significance of an expression product of said nucleotide sequence.
  • the present invention further relates to the designation of a particular trinucleotide codon for the specification of an amino acid, and to the preferential use and evolutionary conservation of the nucleotide bases guanine and cytosine in the third position of a trinucleotide codon.
  • the present invention further combines software programming with genetics to provide a method for the analysis of a nucleotide sequence to determine the functional significance of the amino acids encoded by that nucleotide sequence.
  • the functional significance of a nucleotide sequence includes the relative potential for that sequence to be associated with a critical cellular function such as the likelihood that a particular sequence encodes an expression product associated with genetic replication and/or cell survival, growth, development and/or maintenance.
  • G guanine
  • C cytosine
  • T thymine
  • A adenine
  • U uracil
  • nucleotide sequencing techniques have not been associated with a corresponding ability to assess the significance of particular sequences. This is a particular problem associated with genome sequencing projects for human, animal, insect and plant genomes. There is a need, therefore, to be able to predict that a particular nucleotide sequence is of potential importance such as for cellular development and genetic regulation.
  • an amino acid is specified by a trinucleotide codon.
  • the appropriate anticodon containing amino-acylated tRNA is brought into base pairing alignment with the codon of the mRNA and the amino acid is added to the growing peptide chain in a peptidyl transferase reaction.
  • a trinucleotide codon specifies the appropriate amino acid primarily through the information present within the first and second bases within the trinucleotide codon.
  • the third base position may be specified by more than one nucleotide, and in the case of the amino acids serine, leucine, proline, threonine, valine and alanine, the third base of the trinucleotide codon may be any nucleotide.
  • the third base of the trinucleotide triplet has generally been regarded as being substantially redundant with the same amino acid being encoded by four separate codons.
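The third-base redundancy described above can be checked directly against a codon table. The following is a minimal sketch, assuming the standard genetic code; the names `CODON_TABLE` and `fourfold_degenerate_amino_acids` are illustrative and do not come from the specification. It lists every amino acid that has at least one codon family in which any of the four bases in the third position encodes the same residue. Under the standard code, this set contains the six amino acids named above and also arginine and glycine.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in one-letter symbols, with codons ordered
# TTT, TTC, TTA, TTG, TCT, ... GGG ("*" marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def fourfold_degenerate_amino_acids(table=CODON_TABLE):
    """Return amino acids having a codon family where the third base is free."""
    result = set()
    for first, second in product(BASES, repeat=2):
        # Collect what the four codons sharing this two-base prefix encode.
        encoded = {table[first + second + third] for third in BASES}
        if len(encoded) == 1:
            result |= encoded
    return result


if __name__ == "__main__":
    print(sorted(fourfold_degenerate_amino_acids()))
```

For these families the first two bases alone fix the amino acid, which is why a third-base preference cannot be explained by the encoded residue.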
  • the author has used methods of statistical analysis to study the variation of the third base position in a large number of transcriptionally active genes, including the family of human receptor tyrosine kinase genes, β-glucuronidase and glyceraldehyde-3-phosphate dehydrogenase genes, and a range of housekeeping genes such as β-actin.
  • the inventor identified an unexpected correlation between codon third-base content and functional significance: the presence of guanine or cytosine in the third nucleotide position of a trinucleotide codon indicates that the codon, and the amino acid it encodes, is functionally significant.
  • the instant inventor used this unexpected correlation to develop a method by which the functional significance of both the nucleotide sequence and its expression product may be determined.
  • One aspect contemplates a method for assessing the functional significance of a nucleotide sequence which encodes an expression product which expression product comprises a sequence of nucleotides or amino acid residues, said method comprising determining reading frame data which comprises codons encoding nucleotide triplets or amino acid residues and subjecting the reading frame data to data processing means in order to identify the third-base in each codon or the overall third-base in some or all codons wherein the identification of a codon or codons having a particular base or type of base in the third-base position is indicative of the functional significance of said nucleotide sequence.
  • Another aspect of the present invention provides a method for assessing the functional significance of an amino acid sequence or nucleotide sequence in an expression product of a DNA sequence, said method comprising determining reading frame data on the DNA which comprises codons encoding residue of said amino acid sequence or nucleotide of said nucleotide sequence of the expression product and subjecting the reading frame data to data processing means in order to identify amino acids or nucleotide triplets in an expression product encoded by one or more codons wherein the third-base of said codon or codons is determined wherein the identification of a codon or codons including a sequence of codons having a particular base or type of base in the third-base position is indicative of the functional significance of said expression product.
  • a further aspect of the present invention contemplates a method for determining the likelihood or otherwise that a nucleotide sequence of an open reading frame encodes a peptide, polypeptide or protein involved in or otherwise associated with cell survival, growth, maintenance or development or other physiological process, said method comprising subjecting the nucleotide sequence of the open reading frame to data processing means such that the third-base in each codon or group of codons making up the open reading frame is identified or otherwise assigned a value wherein a first value is provided to an A and/or T and a second value is assigned to a G and/or C wherein the identification of a codon or group of contiguous or non-contiguous codons comprising a bias towards an A and/or T composition or G and/or C composition at the third position of said one or more codons provides a predictive indicator as to the nature and/or function of said nucleotide sequence or its expression product.
  • Yet another aspect of the present invention provides a method for functionally analyzing nucleotide sequences putatively defining a gene such as identified in a genome sequencing project, said method comprising identifying transcriptional unit data comprising codons each encoding an amino acid, subjecting the transcriptional unit data to data processing means to identify the composition of the third-base in one or more of said codons wherein a bias towards an A and/or T composition or G and/or C composition at the third position of said one or more codons provides a predictive indicator as to the nature and/or function of said nucleotide sequence or its expression product towards said nucleotide sequence or expression product being involved inter alia in cell survival, growth, development and/or maintenance.
  • Still another aspect of the present invention is directed to data processing means for assessing the functional significance of a nucleotide sequence defining a gene, said data processing means comprising means for assigning an identifier to the third-base in one or more codons
  • Still yet another aspect of the present invention contemplates the use of chemical analogues of amino acids that have been identified as having a functional significance by the method of the present invention.
  • Another aspect of the present invention contemplates that codons identified as functionally significant by the method of the present invention provide a means to produce a combinatorial library of peptides, polypeptides or proteins which are enriched or decreased in one or more functionalities associated with an amino acid encoded by a triplet which has been identified as a functionally significant codon by the method of the present invention.
  • a further aspect of the present invention is directed to a recombinant or artificially created nucleic acid molecule or peptide, polypeptide or protein comprising a sequence of nucleotides or amino acids rationally designed based on the composition of the codon third-base.
  • Yet another aspect of the present invention provides a means of interpreting the likely functional significance, if any, of codon third-base polymorphisms.
  • Still another aspect of the present invention is directed to the analysis and identification of regulatory sequences.
  • Figure 1 is a representation of the functional continuity of nucleic acids and amino acids. (A) Arrangement of codon third bases in palindromic manner so as to minimize the mutational distance between codons. (B) Functional significance to amino acid distribution of codon third bases.
  • Figure 2 is a diagrammatical representation of a model whereby genomic evolution proceeds as a Darwinian balance between molecular function (hypomethylation and GC retention) and loss-of-function (methylation, leading to GC conversion to AT)
  • the present invention is predicated in part on the determination by the inventor that variations in the third-base composition of codons can be used to predict the relative functional significance of the nucleotide sequence, the particular codon and/or the expression product.
  • variations in the third-base composition of codons are used to interface with data processing means to facilitate the prediction of the relative functional significance of different codons or corresponding expression products of the codons.
  • the present invention enables, therefore, the design of genes or expression products such as proteins with particular utilities.
  • one aspect of the present invention contemplates a method for assessing the functional significance of a nucleotide sequence which encodes an expression product which expression product comprises a sequence of nucleotides or amino acid residues, said method comprising determining reading frame data which comprises codons encoding nucleotide triplets or amino acid residues and subjecting the reading frame data to data processing means in order to identify the third-base in each codon or the overall third-base in some or all codons wherein the identification of a codon or codons having a particular base or type of base in the third-base position is indicative of the functional significance of said nucleotide sequence.
  • the base identified is of the type G or C or of the type A or T.
  • the recognition of individual or overall G/C or A/T content of the third base position is indicative of the functional significance of the nucleotide sequence.
  • the presence of a G or C at the third-base position is more likely to indicate a functionally significant nucleotide sequence relative to an A or T at the same position
  • the present invention further contemplates the functional significance of the absence of a G or a C in the third-base position of a codon, i.e. the presence of an A or T.
  • An example of the latter includes heterodimerization of different AT-rich receptors which may decrease the rate of degradation of receptors thereby prolonging their signalling.
  • the instant method may be used to predict the likely signalling strength of certain receptors
  • Reference herein to "data processing means” includes any electronic hardware device, in combination with one or more software applications, which participates in or otherwise facilitates the analysis, calculation, computation or reconfiguration of information
  • the hardware device generally comprises an assemblage of electronic components and includes a computer system
  • the data processing means executes a method of assigning a value or an identifier to the third nucleotide in some or all codons
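The value assignment described above can be sketched in a few lines. This is a minimal illustration, not the inventor's implementation: the numeric identifiers (0 for A or T, 1 for G or C) and the function names `codons`, `third_base_values` and `gc3_fraction` are assumptions chosen for the example, and U is treated as T so that mRNA input is accepted.

```python
# Hypothetical identifiers: the "first value" 0 for A/T, the "second value" 1 for G/C.
THIRD_BASE_VALUE = {"A": 0, "T": 0, "G": 1, "C": 1}


def codons(orf):
    """Split an in-frame sequence into complete codons; trailing bases are dropped."""
    seq = orf.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def third_base_values(orf):
    """Assign a value to the third base of each codon, skipping ambiguous bases."""
    return [
        THIRD_BASE_VALUE[codon[2]]
        for codon in codons(orf)
        if codon[2] in THIRD_BASE_VALUE
    ]


def gc3_fraction(orf):
    """Fraction of scored codons whose third base is G or C (0.0 if none scored)."""
    values = third_base_values(orf)
    return sum(values) / len(values) if values else 0.0


if __name__ == "__main__":
    # Codons ATG, GCC, AAG, TTA have third bases G, C, G, A.
    print(third_base_values("ATGGCCAAGTTA"))
    print(gc3_fraction("ATGGCCAAGTTA"))
```

A fraction near 1 indicates a G/C bias at the third position and a fraction near 0 an A/T bias; how either is interpreted is a matter for the method described in the specification, not for this sketch.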
  • a “codon” is a unit of three nucleotides which, when transcribed, produces a particular amino acid residue or a triplet of ribonucleotides such as in mRNA
  • the term “third-base” in relation to a codon is the third nucleotide in a three nucleotide codon
  • the terms “nucleotide” and “base” are used interchangeably throughout the subject specification
  • an “expression product” encompasses both an amino acid sequence such as in a peptide, polypeptide or protein including a single amino acid residue as well as a mRNA molecule comprising triplets of nucleotides each corresponding to a codon of a DNA sequence
  • An “expression product” may also be a hybrid complex of, for example, tRNA, mRNA and/or amino acid residues. When the expression product is mRNA, the nucleotide may include U instead of T.
  • An expression product may also be considered as other RNA sequences including synthetic components such as an engineered structural component including but not limited to a triple helix forming sequence or a stem loop structure
  • An expression product includes topological folds and structures of RNA that are derived from either typical or atypical A form geometry and may include internal bulges, loops and non-Watson and Crick base pairs
  • an "expression product” refers to an expression product derived from a polynucleotide sequence or gene
  • An "expression product” includes complementary and antisense RNA sequences that can be expressed from a particular polynucleotide sequence or gene, and any mRNA sequence expressed from a gene in a given transcriptional reading frame.
  • An expression product is also considered to be a peptide, polypeptide or protein that can be translated from an expressed RNA sequence.
  • gene is used in its broadest sense and includes cDNA corresponding to the exons of a gene. Accordingly, reference herein to a “gene” is to be taken to include:-
  • a classical genomic gene consisting of transcriptional and/or translational regulatory sequences and/or a coding region and/or non-translated sequences (i.e. introns, 5'- and 3'- untranslated sequences); or
  • gene is also used to describe synthetic or fusion molecules encoding all or part of an expression product.
  • nucleic acid molecule and “gene” may be used interchangeably.
  • a gene comprises, when read in a particular reading frame, a series of codons each encoding a particular amino acid residue or triplet of nucleotides or a mRNA molecule wherein the series of codons provides a particular sequence of amino acids or sequence of mRNA triplet nucleotides.
  • a “peptide”, “polypeptide” and “protein” refer to molecules comprising a sequence of amino acid residues or derivatives or analogues thereof. Accordingly, a peptide, polypeptide or protein may comprise chemical analogues of amino acids.
  • the "functional significance" of a nucleotide sequence includes the attributes of the nucleotide sequence including attributes affecting fidelity, longevity, stability or efficiency of replication, transcription and/or translation as well as those attributes affecting physiological processes within or by a cell carrying the nucleotide sequence. Physiological processes include but are not limited to cell survival, maintenance, development and growth, differentiation and/or proliferation capacities, and immunopotential and mitogenic potential such as the potential to develop or proliferate to a cancerous condition.
  • the functional significance of a codon or group of codons within a nucleotide sequence may be determined once the third-base composition is identified. This may be, for example, by determining G/C content or A/T content.
  • the present invention may be practised at the level of the nucleic acid and/or expression product.
  • the present invention is particularly suitable for use in DNA microarray analysis of gene expression and in differential display and subtractive hybridization analysis of gene expression. It is intended that the present invention may be practised when sequence information of a gene, or expressed sequence, permits the identification of a codon in which the third base is indicative of a functionally significant amino acid in accordance with the method of the present invention.
  • another aspect of the present invention provides a method for assessing the functional significance of an amino acid sequence or nucleotide sequence in an expression product of a DNA sequence, said method comprising determining reading frame data on the DNA which comprises codons encoding residue of said amino acid sequence or nucleotide of said nucleotide sequence of the expression product and subjecting the reading frame data to data processing means in order to identify amino acids or nucleotide triplets in an expression product encoded by one or more codons wherein the third-base of said codon or codons is determined wherein the identification of a codon or codons including a sequence of codons having, for example, a cytosine or guanine in the third-base position is indicative of the functional significance of said expression product.
  • This aspect of the present invention may also be practised by identifying an A or T/U or a combination of all four/five nucleotides.
  • the expression product is a peptide, polypeptide or protein
  • the expression product is an mRNA molecule such as an antisense mRNA molecule, including but not limited to chemical analogues and derivatives of nucleotides such as but not limited to phosphorothioates
  • the present invention is predicated in part on the surprising determination that genes characterized by being particularly physiologically significant such as having high transcriptional and/or mitogenic and/or replicative activity are generally more GC-rich than those having relatively lower transcriptional, mitogenic and/or replicative activity. Accordingly, third-base GC composition is more common in codons specifying amino acids in peptides, polypeptides or proteins or triplet nucleotides in mRNA molecules which are critical for physiological processes such as but not limited to cell survival, maintenance, development, growth, differentiation, proliferation, mitogenic potential and/or immunopotential.
  • mitogenic potential includes the potential of cells to develop into cancers or tumours
  • the present invention enables, therefore, a method for complementing genetic analysis of nucleotide sequences by providing predictive means to assess the functional significance of the nucleotide sequence. This is particularly useful, for example, in assessing the potential functional significance of particular nucleotide sequences identified in the human genome project or similar projects for plants, animals, insects and
  • the present invention contemplates, therefore, in a preferred embodiment, a method for determining the likelihood or otherwise that a nucleotide sequence of an open reading frame encodes a peptide, polypeptide or protein involved in or otherwise associated with cell survival, growth, maintenance or development, said method comprising subjecting the nucleotide sequence of the open reading frame to data processing means such that the third-base in each codon or group of codons making up the open reading frame is identified or otherwise assigned a value wherein a first value is provided to an A and/or T and a second value is assigned to a G and/or C wherein the identification of a codon or group of contiguous or non-contiguous codons comprising a bias towards an A and/or T composition or G and/or C composition at the third position of said one or more codons provides a predictive indicator as to the nature and/or function of said nucleotide sequence or its expression product.
  • the method of the present invention is particularly useful in assessing the potential functional significance of nucleotide sequences identified during, for example, the human genome project.
  • a method for functionally analyzing nucleotide sequences putatively defining a gene such as identified in a genome sequencing project comprising identifying transcriptional unit data comprising codons each encoding an amino acid, subjecting the transcriptional unit data to data processing means to identify the composition of the third-base in one or more of said codons wherein a bias towards an A and/or T composition or G and/or C composition at the third position of said one or more codons provides a predictive indicator as to the nature and/or function of said nucleotide sequence or its expression product towards said nucleotide sequence or expression product being involved inter alia in cell survival, growth, development and/or maintenance.
  • the gene may be involved in transcription, translation, replication, proliferation and/or differentiation
  • the method of the present invention can at least be partly implemented using a suitably programmed computer
  • the preferred data processing means comprises a suitably programmed computer and the steps of the method are preferably performed using the suitably programmed computer
  • the input information may take the form of values, identifiers or other data in respect of the identity of the third-base of a codon
  • the input data may be digitized. Alternatively, for implementation of the invention, a dedicated Fast Fourier transform chip can be employed as at least part of the processing means.
  • representative measurements are made identifying or valuing the third-base in the codon in selected locations within a sequence.
  • the selected locations define a gene or gene sequence.
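Measuring third-base composition at selected locations within a sequence can be illustrated with a sliding window over codons. The sketch below is an assumption about how such a measurement might be organised, not a procedure taken from the specification: the function name `windowed_gc3`, the `window` parameter and its default are hypothetical, and any third base other than G or C (including ambiguous bases) is counted as not G/C.

```python
def windowed_gc3(orf, window=5):
    """Return the G/C fraction at the third codon position for each run of
    `window` consecutive codons along an in-frame sequence."""
    seq = orf.upper().replace("U", "T")
    # Third base of every complete codon.
    thirds = [seq[i + 2] for i in range(0, len(seq) - 2, 3)]
    flags = [1 if base in "GC" else 0 for base in thirds]
    return [
        sum(flags[i:i + window]) / window
        for i in range(len(flags) - window + 1)
    ]


if __name__ == "__main__":
    # Three GC-ending codons followed by three A-ending codons.
    example = "GCC" * 3 + "GCA" * 3
    for start, fraction in enumerate(windowed_gc3(example, window=3)):
        print(f"codons {start}-{start + 2}: GC3 = {fraction:.2f}")
```

A profile of this kind shows where within a gene a bias is concentrated, which a single whole-sequence fraction would average away.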
  • another aspect of the invention is directed to a data processing means for assessing the functional significance of a nucleotide sequence defining a gene, at least in part by assigning an identifier to the third-base in one or more codons.
  • data processing means executes the steps of:-
  • the subsets referred to in part (iv) above include identifying a guanine alone, cytosine alone or either a guanine or cytosine.
  • the input program and general methodology of the present invention are useful in a range of applications including designing genes and proteins having a particular ability or function. For example, if it is desired to modify the function of any gene or protein, whether natural or synthetic, the present invention may be used to identify codons and encoded amino acids in which an alteration of codon third-base composition could yield the desired change in function. Furthermore, once codon third-bases of interest are so identified, routine genetic techniques are employed to introduce site-specific changes of nucleotide composition at the designated base positions.
  • expression products may be generated by the selective mutagenesis of a codon in which the third base is indicative of a functionally significant amino acid
  • the identified codons may be targets for mutagenesis or, alternatively, be retained in the expression product
  • selective mutagenesis or retention of the identified codons provide a method for altering the functionality of the expressed product
  • Modulation of functionality of an expressed product may be useful in regulating physiological and/or biochemical processes such as those associated with ligand binding, dimerization, including oligomerization, catalytic activity, phosphorylation state, glycosylation state and protein stability
  • the present invention is used to identify key codons encoding either the ligand/substrate or the receptor/enzyme domains, such that selective mutagenesis (and/or retention) of codons gives the greatest likelihood of producing a particular antagonistic (or functional) effect
  • a mixture of such effects may also be generated, such as, for example, codons which retain effective ligand binding or where other codons may be mutagenized to abolish, for example, dimerization, thus creating a competitive inhibitor
  • This approach may be scaled-up, such as using combinatorial chemistry to produce libraries of compounds or leads which are thus enriched for potential commercially valuable products
  • the present invention further contemplates the use of chemical analogues of amino acids that have been identified as having a functional significance by the method of the present invention.
  • a functionally significant amino acid may be identified by the determination of the third base composition of a codon or a group of codons
  • the instant method is used to improve the functional characteristics of a peptide, polypeptide or protein via site-specific incorporation of particular amino acids or chemically modified amino acids or amino acid analogues
  • Such chemical analogues are particularly useful due to their inter alia conformational stability, resistance to enzymatic degradation and/or their stability in diagnostic and therapeutic tests
  • Chemical analogues of a functionally significant amino acid contemplated herein include but are not limited to modifications to side chains, incorporation of unnatural amino acids and/or their derivatives during peptide, polypeptide or protein synthesis and the use of crosslinkers and other methods which impose conformational constraints on the proteinaceous molecule or their analogues
  • side chain modifications contemplated by the present invention include modifications of amino groups such as by reductive alkylation by reaction with an aldehyde followed by reduction with NaBH4; amidination with methylacetimidate; acylation with acetic anhydride; carbamoylation of amino groups with cyanate; trinitrobenzylation of amino groups with 2,4,6-trinitrobenzene sulphonic acid (TNBS); acylation of amino groups with succinic anhydride and tetrahydrophthalic anhydride; and pyridoxylation of lysine with pyridoxal-5-phosphate followed by reduction with NaBH4
  • the guanidine group of arginine residues may be modified by the formation of heterocyclic condensation products with reagents such as 2,3-butanedione, phenylglyoxal and glyoxal
  • the carboxyl group may be modified by carbodiimide activation via O-acylisourea formation followed by subsequent derivatization, for example, to a corresponding amide
  • Sulphydryl groups may be modified by methods such as carboxymethylation with iodoacetic acid or iodoacetamide; performic acid oxidation to cysteic acid; formation of mixed disulphides with other thiol compounds; reaction with maleimide, maleic anhydride or other substituted maleimide; formation of mercurial derivatives using 4-chloromercuribenzoate, 4-chloromercuriphenylsulphonic acid, phenylmercury chloride, 2-chloromercuri-4-nitrophenol and other mercurials; carbamoylation with cyanate at alkaline pH
  • Tryptophan residues may be modified by, for example, oxidation with N-bromosuccinimide or alkylation of the indole ring with 2-hydroxy-5-nitrobenzyl bromide or sulphenyl halides
  • Tyrosine residues, on the other hand, may be altered by nitration with tetranitromethane to form a 3-nitrotyrosine derivative
  • Modification of the imidazole ring of a histidine residue may be accomplished by alkylation with iodoacetic acid derivatives or N-carbethoxylation with diethylpyrocarbonate
  • Examples of incorporating unnatural amino acids and derivatives during peptide synthesis include, but are not limited to, use of norleucine, 4-amino butyric acid, 4-amino-3-hydroxy-5-phenylpentanoic acid, 6-aminohexanoic acid, t-butylglycine, norvaline, phenylglycine, ornithine, sarcosine, 4-amino-3-hydroxy-6-methylheptanoic acid, 2-thienyl alanine and/or D-isomers of amino acids
  • A list of unnatural amino acids contemplated herein is shown in Table 1
  • peptides can be conformationally constrained by, for example, incorporation of Cα and Nα-methylamino acids, introduction of double bonds between Cα and Cβ atoms of amino acids and the formation of cyclic peptides or analogues by introducing covalent bonds such as forming an amide bond between the N and C termini, between two side chains or between a side chain and the N or C terminus.
  • identification of functionally significant codons by the method of the present invention provides a means to produce a combinatorial library of peptides, polypeptides or proteins which are enriched or decreased in one or more functionalities associated with an amino acid encoded by a triplet which has been identified as functionally significant by the method of the present invention.
  • the expression product of a mutated/or retained form of a nucleotide sequence may be used in biochemical screens of combinatorial libraries of molecules and high throughput screening methodologies to identify modulators of an activity present in an expression product including an expression product from a mutant or a retained form of the nucleotide sequence identified by the method of the present invention.
  • another aspect of the present invention is directed to a recombinant or artificially created nucleic acid molecule or peptide, polypeptide or protein comprising a sequence of nucleotides or amino acids rationally designed based on the composition of the codon third-base.
  • Yet another aspect of the present invention provides a means of interpreting the likely functional significance, if any, of codon third-base polymorphisms. Codon third-base polymorphisms provide a means for diagnosing or determining the likelihood of development of a disease condition or a particular phenotypic trait. It is also useful for susceptibility testing.
  • the method of the present invention may be used to characterize the probable functional significance of third-base polymorphisms and their association with disease-predisposing mutations. Such mutations may affect the same or different (i.e. upstream or downstream) genes.
  • the present invention is further directed to the analysis and identification of regulatory sequences. Promoter methylation is a mechanism of the regulation of gene expression.
  • the present invention provides an analysis of the GC content and compositions of genetic regulatory regions
  • the present invention further contemplates the identification of such regions where expression is regulated by DNA methylation and acetylation particularly where chromatin conformation will determine the efficiency of expression of foreign genetic sequences such as those used in gene therapy
  • a highly AT-rich upstream sequence may be identified by the method of the present invention
  • the subject invention contemplates the replacement of such regions or sequences with more GC-rich sequences to facilitate or prolong the expression of introduced gene therapy products
  • the present invention provides a method for assessing the functional significance of a nucleotide sequence, said method comprising subjecting the nucleotide sequence to data processing means in order to identify the composition bias for GC or AT wherein the identification of a GC- or AT-rich region is indicative of the functional significance of said nucleotide sequence
  • the nucleotide sequence is an upstream sequence of the gene
  • the nucleotide sequence comprises a promoter
  • the present invention provides a method of treatment comprising subjecting a nucleotide sequence to data processing means in order to identify the base composition and administering hypomethylating drugs if it is a GC-rich region to modify expression of a nucleotide sequence operably linked to said first mentioned nucleotide sequence
  • the nucleotide sequence comprises a promoter
  • the hypomethylating drug is 5-azacytidine
  • the present invention may be used to predict not only the likely functionality of any individual amino acid but its potential significance or influence within a region of peptide sequence. Without intending to limit the present invention to any one theory or mode of action, it is proposed that once function is well established in evolutionary terms, there is micro-evolutionary tolerance to methylation-dependent GC-to-AT switching. Indeed, other aspects of the data indicate a regional significance to the instant invention. For example, TM sequences which are abundant in function (as measured by TMPRED score) exhibit a decline in the GC-richness of non-hydrophobic-encoding codons as well as of codons encoding the hydrophobic amino acids phenylalanine, leucine and valine (FLV), albeit from a lower starting value. In mechanistic terms, this could indicate transcription-coupled repair playing a regional role in nucleotide composition in addition to the selectionist (purifying) effect of mutations affecting individual residues.
  • FLV phenylalanine-leucine-valine
  • Genbank sequences were obtained from Genbank, and correlative functional data sourced from Medline. Genbank sequences for analysis of GC:AT content (the total number of G and C bases expressed relative to total A and T bases) were eliminated if no informative comparison could be made due to an inadequate number of human gene homologs, or similar GC:AT ratios between gene family members.
  • the NCBI data bank was searched using the following key words: tyrosine kinase receptor; growth factor receptor; and epidermal growth factor receptor, ErbB2, insulin receptor, insulin-like growth factor receptor, platelet-derived growth factor receptor, fibroblast growth factor receptor, EphA2, EphB2, Ret, Met, VEGF receptor, TrkA, TrkB.
  • This search downloaded 1116 entries which were filtered by several methods to yield 141 sequences as the initial dataset, which was further refined by exclusion of repeat sequences to yield 50 unique mRNAs. Transmembrane regions from the corresponding amino acid sequences were then identified using the TMPRED algorithm.
  • ⁇ C platelet-derived growth factor contain two Cys ' residues in the first extracellular immunoglobulin-like domain, which is within the region binding the common ligand PDGF-B (Miyazawa et al, 1998).
  • PDGF-B common ligand PDGF-B
  • PDGFR- ⁇ contains four Cys ⁇ ⁇ amino acids
  • PDGFR-β contains four Cys G residues (p < 0.03). Since PDGFR-α appears less mitogenic than PDGFR-β (Inui et al., 1994), inhibition of PDGF-B binding via selective
  • Cys ' binding of PDGF-A would be expected to prevent formation of α/β receptor heterodimers and thus to impair cell proliferation in the presence of PDGF-B.
  • TGT-encoded Cys ' " 4 in the insulin receptor forms a class 1 disulfide bond linking the two α-chains but mutation of this cysteine neither alters insulin affinity nor abolishes receptor homodimerization (Bilan and Yip, 1994).
  • Yes and Src are even more dissimilar: expression of Yes is prominent in terminally differentiated cells, inversely correlated with cell growth and unnecessary for tumorigenesis (Park and Cartwright, 1995), and this gene is also unusual among non-receptor tyrosine kinases in its homology to a pseudogene of similar GC:AT content (Semba et al, 1988).
  • TAC:TAT ratio characterizes the non-catalytic molecule Cbl, a growth-inhibitory tyrosine phosphoprotein (Murphy et al, 1998) which acts by binding (Lupher et al, 1997) and antagonizing the GC-rich tyrosine kinase Syk (Ota and Samelson, 1997); the tumor-suppressive potency of this TAT-rich molecule is illustrated by the hyperplastic and oncogenic sequelae of its knockout and viral inhibition respectively (Murphy et al, 1998).
  • TAC:TAT ratios of several human receptor genes are strikingly different in rodent cells, with marked increases in Fms and PDGFR- ⁇ (Table 5). Discrepancies between the perceived behaviour of cognate gene products within these two cell systems could then reflect evolutionary differences in gene behaviour related to differences in host longevity and long-term tumor risk.
  • TAT-encoded PLC-γ-binding site in PDGFR-β may not impair PDGF-dependent growth significantly (Hill et al, 1990); Tyr 751 is not essential for PI 3'-kinase binding in PDGFR-β (Kashishian et al, 1992); the interaction of Cbl with PI 3-kinase may be associated with growth suppression and apoptosis (Kitanaka et al, 1996); and the transforming activity of Ret does not require the presence of Tyr 1096 (Asai et al, 1996).
  • Tyrosine codons exhibit non-random patterns of distribution similar to those noted above for cysteine codons.
  • 15 of 18 TAT codons encode tyrosines within the N-terminal 330 amino acids of the cytoplasmic domain
  • TAC-encoded tyrosines are also non-randomly distributed in extracellular domains, however, consistent with a methylation-dependent mechanism for discarding deleterious TGC codons: if loss of function becomes selectable (e.g. to permit a change of ligand specificity) then it is more efficient to mutate from TGC to TAC than to TGT and thence to TAT.
  • the third tyrosine in the ErbB2 kinase domain is encoded by TAT and corresponds to a TTC-encoded phenylalanine in EGFR, suggesting that both sites may have evolved by mutation from a parental TAC codon. Since conservation of amino acid repeats arguably implies a site-specific function, the codon composition of extracellular cysteine and intracellular tyrosine doublets (Cys-Cys or Tyr-
  • the inventor created a database of 50 genes encoding single-transmembrane-domain growth factor receptors and 60 G-protein-coupled receptor sequences, yielding 470 sequences encoding transmembrane domains.
  • codon conservation may occur via two distinct routes: (1) positive or negative selection based on mitogenic and/or evolutionary advantage or disadvantage respectively, or (2) default mutation of non-critical motifs towards a mutationally stable endpoint.
  • the latter mechanism may explain why mutation of conserved codons is often associated with little discernible effect (Bilan and Yip, 1994, Ronco et al, 1995).
  • the association of polymorphisms with disease may prove to have mechanistic significance hitherto unappreciated.
  • Table 9 presents an expanded summary of the transmembrane domain database, which now includes data approximately 10-fold more numerous than those presented in Epstein et al, 2000 due to addition of a G-protein-coupled receptor (GPCR) database to the former receptor tyrosine kinase (RTK) statistics.
  • GPCR G-protein-coupled receptor
  • RTK receptor tyrosine kinase
  • Table 10 shows the GC:AT ratio in transmembrane domains which are categorized according to their TMPRED score (this detects the "fit" of the sequence with that predicted for transmembrane domains and correlates with total hydrophobicity).
  • Table 12 is an analysis of the GPCR database and Table 13 relates to RTKs.
  • GPCRs have seven transmembrane (TM) domains, whereas RTKs have only one.
  • TM transmembrane
  • Table 11 shows that GPCRs have 42.5% hydrophobic (defined here as FLV) residues and 2.3% hydrophilic (KRED) residues, whereas RTKs have 47.8% FLV vs 0.3% KRED. Hence, the FLV:KRED ratio is 18.5 for GPCRs but 159 for RTKs (a significant difference). Since the RTK has only one TM domain to anchor it in the membrane, whereas the GPCR has seven, it makes functional sense that the RTK would have a "better" (more functional) TM domain in this way. Hence, these data confirm that the functionality of the RTK TM domain is greater (closer to redundancy) than is that of the individual GPCR TM domains. Comparison with Table 9 confirms that this relative functional redundancy of RTK TM domains is associated with a lower GC:AT ratio (3.17) than for GPCRs (4.55).
  • Codon nucleotide analyses were undertaken using either published sequences or Genbank data; statistical significance (see text) was determined using a two-sample t-test
  • Codon numbers were derived manually
  • GC:AT ratios were obtained from Genbank mRNA sequences; statistical significance (see text) was determined using non-parametric Spearman's correlation coefficient
  • TABLE 5 Nucleotide composition of non-receptor tyrosine phosphoproteins and c-Cbl
  • GC:AT ratios are derived from the published Genbank sequences
  • TAC:TAT quotients are followed by numerical codon ratios in parentheses
  • Autophosphorylation sites were identified by Medline search using the string (tyrosine and phosphorylation and site and (mapped or mapping or tryptic or phosphotryptic) and (Grb2 or phospholipase or phosphatidylinositol or Cbl or Shp-2 or PTP) and (growth factor or receptor tyrosine kinase)); numbered site designates the amino acid location of the receptor phosphotyrosine residue.
  • Statistical significance for the codon identity of phosphotyrosines binding SHP-2 relative to the other second messengers was derived using Fisher's exact test.
  • Frequency represents the proportion of amino acids within the predicted transmembrane sequences of the 50 receptor tyrosine kinases
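
The GC:AT ratio defined in the notes above (total G and C bases relative to total A and T bases) and the FLV:KRED ratios quoted for Table 11 can be checked with a few lines of arithmetic. The following is a minimal sketch only: the function names `gc_at_ratio` and `percentage_ratio` are hypothetical, and the short test sequence is invented for illustration rather than taken from Genbank.

```python
def gc_at_ratio(seq: str) -> float:
    """Total G and C bases expressed relative to total A and T bases.

    U is treated as T so that mRNA sequences can be passed directly.
    """
    s = seq.upper().replace("U", "T")
    gc = s.count("G") + s.count("C")
    at = s.count("A") + s.count("T")
    if at == 0:
        raise ValueError("sequence contains no A or T bases")
    return gc / at


def percentage_ratio(numerator_pct: float, denominator_pct: float) -> float:
    """Ratio of two residue frequencies given as percentages."""
    return numerator_pct / denominator_pct


# Percentages as quoted in the text for Table 11.
gpcr_flv_kred = percentage_ratio(42.5, 2.3)   # GPCR TM domains
rtk_flv_kred = percentage_ratio(47.8, 0.3)    # RTK TM domains

print(round(gpcr_flv_kred, 1), round(rtk_flv_kred))
print(gc_at_ratio("GGCCAT"))  # invented sequence: 4 G/C bases, 2 A/T bases
```

The two percentage ratios round to the 18.5 and 159 stated in the text.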

Landscapes

  • Chemical & Material Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Organic Chemistry (AREA)
  • Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biochemistry (AREA)
  • Physics & Mathematics (AREA)
  • Biotechnology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Zoology (AREA)
  • Biophysics (AREA)
  • Wood Science & Technology (AREA)
  • Evolutionary Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Analytical Chemistry (AREA)
  • Medical Informatics (AREA)
  • Plant Pathology (AREA)
  • Microbiology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Peptides Or Proteins (AREA)

Abstract

The present invention relates generally to a method for the genetic analysis of nucleic acid molecules and expression products encoded therefrom. The present invention relates to the analysis of the genetic code, and to the analysis of individual codons that comprise a nucleotide sequence. The present invention relates to the third base position of a codon for a particular amino acid, and to the functional significance of that particular amino acid. More particularly, the present invention provides a means for predicting the functional significance of a nucleotide sequence or a segment of a larger nucleotide sequence and to a means for predicting the functional significance of an expression product of said nucleotide sequence. The present invention further relates to the designation of a particular trinucleotide codon for the specification of an amino acid, and to the preferential use and evolutionary conservation of the nucleotide bases guanine and cytosine in the third position of a trinucleotide codon. The present invention further combines software programming with genetics to provide a method for the analysis of a nucleotide sequence to determine the functional significance of the amino acids encoded by a nucleotide sequence. The functional significance of a nucleotide sequence includes the relative potential for that sequence to be associated with a critical cellular function such as the likelihood that a particular sequence encodes an expression product associated with genetic replication and/or cell survival, growth, development and/or maintenance.

Description

ASSESSMENT OF THE FUNCTIONAL SIGNIFICANCE OF A NUCLEOTIDE SEQUENCE BY SUBJECTING THE READING FRAME DATA TO DATA PROCESSING
FIELD OF THE INVENTION
The present invention relates generally to a method for the genetic analysis of nucleic acid molecules and expression products encoded therefrom. The present invention relates to the analysis of the genetic code, and to the analysis of individual codons that comprise a nucleotide sequence. The present invention relates to the third base position of a codon for a particular amino acid, and to the functional significance of that particular amino acid. More particularly, the present invention provides a means for predicting the functional significance of a nucleotide sequence or a segment of a larger nucleotide sequence and to a means for predicting the functional significance of an expression product of said nucleotide sequence. The present invention further relates to the designation of a particular trinucleotide codon for the specification of an amino acid, and to the preferential use and evolutionary conservation of the nucleotide bases guanine and cytosine in the third position of a trinucleotide codon. The present invention further combines software programming with genetics to provide a method for the analysis of a nucleotide sequence to determine the functional significance of the amino acids encoded by a nucleotide sequence. The functional significance of a nucleotide sequence includes the relative potential for that sequence to be associated with a critical cellular function such as the likelihood that a particular sequence encodes an expression product associated with genetic replication and/or cell survival, growth, development and/or maintenance.
BACKGROUND OF THE INVENTION
Reference to any prior art in this specification is not, and should not be taken as, an acknowledgment or any form of suggestion that this prior art forms part of the common general knowledge in Australia or any other country.
Bibliographic details of the publications referred to by author in this specification are collected at the end of the description.
In the following description, single letter abbreviations are used for guanine (G), cytosine (C), thymine (T), adenine (A) and uracil (U).
The increasing sophistication of recombinant DNA technology is greatly facilitating research and development in the medical and allied health fields as well as the horticultural and agricultural industries. Significant advances have been made in nucleotide sequencing techniques. However, the amount of nucleotide sequence information increasingly available has not been associated with a corresponding ability to assess the significance of particular sequences. This is a particular problem associated with genome sequencing projects for human, animal, insect and plant genomes. There is a need, therefore, to be able to predict that a particular nucleotide sequence is of potential importance such as for cellular development and genetic regulation.
Amino acids exhibit structural and functional continuity with the encoding nucleotides (Epstein, 1967) which implies that codons are organized such that the most frequent base substitutions have the least deleterious consequences for protein function. Conceivably, then, each codon may possess a selectable phenotype based on mutational stability. Knowledge of prevailing mutational patterns could then permit deduction of the evolutionary and, hence, functional, pedigree of a codon.
Two such patterns dominate mammalian genome evolution. First, most mutations are transitions rather than transversions, that is, purines tend to substitute for purines and pyrimidines for pyrimidines. Second, pseudogene studies have illustrated that base transitions most often occur in a GC → AT direction, reflecting the propensity of cytosine to undergo methylation followed by deamination to thymine; progressive replacement of complementary GC base pairs by AT is likewise common in transcriptionally inactive genomic regions which tend to be methylated. Conversely, genes which are heavily transcribed should incur less GC → AT switching due to demethylation as well as enhanced DNA repair in transcribed sequences, consistent with the known hypomethylation status of CpG islands adjacent to housekeeping genes (Cooper and Krawczak, 1989).
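
The two mutational patterns just described can be expressed as a simple classification of single-base substitutions. The sketch below is illustrative only and is not part of the specification; the function names `substitution_type` and `is_gc_to_at` are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(ref: str, alt: str) -> str:
    """Classify a single-base DNA substitution.

    A transition keeps the base class (purine to purine, or pyrimidine to
    pyrimidine); a transversion swaps it.
    """
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError("bases must be one of A, C, G, T")
    if ref == alt:
        raise ValueError("ref and alt are identical; no substitution")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def is_gc_to_at(ref: str, alt: str) -> bool:
    """True when a G or C is replaced by an A or T (the GC -> AT direction)."""
    return ref.upper() in {"G", "C"} and alt.upper() in {"A", "T"}


# Deamination of methylated cytosine to thymine is a GC -> AT transition.
print(substitution_type("C", "T"), is_gc_to_at("C", "T"))
```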
Generally, an amino acid is specified by a trinucleotide codon. During translation of an mRNA, the appropriate anticodon containing amino-acylated tRNA is brought into base pairing alignment with the codon of the mRNA and the amino acid is added to the growing peptide chain in a peptidyl transferase reaction. A trinucleotide codon specifies the appropriate amino acid primarily through the information present within the first and second bases within the trinucleotide codon. The third base position may be specified by more than one nucleotide, and in the case of the amino acids serine, leucine, proline, threonine, valine and alanine, the third base of the trinucleotide codon may be any nucleotide. Thus, the third base of the trinucleotide triplet has generally been regarded as being substantially redundant with the same amino acid being encoded by four separate codons.
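
Third-base redundancy can be checked directly against the standard genetic code. The sketch below is offered as an illustration and is not part of the specification: it builds the standard codon table and lists the codon families in which all four third bases encode the same amino acid. The names `CODON_TABLE` and `fourfold_degenerate_families` are hypothetical. Under the standard code, the families found are those of the six amino acids named above together with arginine (CGN) and glycine (GGN).

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G ordering
# (first base varies slowest, third base fastest); "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def fourfold_degenerate_families() -> dict[str, str]:
    """Map each two-base prefix to its amino acid when the third base is free."""
    families = {}
    for first, second in product(BASES, repeat=2):
        prefix = first + second
        encoded = {CODON_TABLE[prefix + third] for third in BASES}
        if len(encoded) == 1:
            families[prefix] = encoded.pop()
    return families


for prefix, aa in sorted(fourfold_degenerate_families().items()):
    print(f"{prefix}N -> {aa}")
```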
In work leading up to the present invention, the author has used methods of statistical analysis to study the variation of the third base position in a large number of transcriptionally active genes, including the family of human receptor tyrosine kinase genes, β glucuronidase and glyceraldehyde-3-phosphate dehydrogenase genes, and a range of housekeeping genes such as and β-actin.
The inventor identified an unexpected correlation between codon third-base content and function: the presence of guanine or cytosine in the third nucleotide position of a trinucleotide codon indicates that the codon, and the amino acid it encodes, is functionally significant. The inventor used this unexpected correlation to develop a method by which the functional significance of both the nucleotide sequence and its expression product may be determined.
SUMMARY OF THE INVENTION
Throughout this specification, unless the context requires otherwise, the word "comprise", or variations such as "comprises" or "comprising", will be understood to imply the inclusion of a stated element or integer or group of elements or integers but not the exclusion of any other element or integer or group of elements or integers.
One aspect contemplates a method for assessing the functional significance of a nucleotide sequence which encodes an expression product which expression product comprises a sequence of nucleotides or amino acid residues, said method comprising determining reading frame data which comprises codons encoding nucleotide triplets or amino acid residues and subjecting the reading frame data to data processing means in order to identify the third-base in each codon or the overall third-base in some or all codons wherein the identification of a codon or codons having a particular base or type of base in the third-base position is indicative of the functional significance of said nucleotide sequence.
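
As a rough illustration of what identifying "the third-base in each codon or the overall third-base in some or all codons" might involve, the following sketch extracts third bases from an in-frame sequence and summarizes their composition. It is a minimal sketch under stated assumptions, not the inventor's implementation: the function names are hypothetical, the reading frame is assumed to start at the first base, and the example sequence is invented.

```python
from collections import Counter


def third_bases(orf: str) -> list[str]:
    """Return the third base of every codon in an in-frame sequence.

    Assumes the reading frame starts at position 0; U is treated as T.
    """
    s = orf.upper().replace("U", "T")
    if len(s) % 3 != 0:
        raise ValueError("sequence length must be a multiple of three")
    return [s[i + 2] for i in range(0, len(s), 3)]


def third_base_composition(orf: str) -> Counter:
    """Count how often each base occupies the third codon position."""
    return Counter(third_bases(orf))


def gc3_fraction(orf: str) -> float:
    """Fraction of codons with G or C in the third position.

    Any base other than G or C (including ambiguity codes) counts as non-GC.
    """
    bases = third_bases(orf)
    if not bases:
        raise ValueError("sequence contains no codons")
    return sum(b in ("G", "C") for b in bases) / len(bases)


# Invented example: codons ATG TGC TAC TGT TAT.
example = "ATGTGCTACTGTTAT"
print(third_bases(example))
print(third_base_composition(example))
print(gc3_fraction(example))
```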
Another aspect of the present invention provides a method for assessing the functional significance of an amino acid sequence or nucleotide sequence in an expression product of a DNA sequence, said method comprising determining reading frame data on the DNA which comprises codons encoding residue of said amino acid sequence or nucleotide of said nucleotide sequence of the expression product and subjecting the reading frame data to data processing means in order to identify amino acids or nucleotide triplets in an expression product encoded by one or more codons wherein the third-base of said codon or codons is determined wherein the identification of a codon or codons including a sequence of codons having a particular base or type of base in the third-base position is indicative of the functional significance of said expression product.
A further aspect of the present invention contemplates a method for determining the likelihood or otherwise that a nucleotide sequence of an open reading frame encodes a peptide, polypeptide or protein involved in or otherwise associated with cell survival, growth, maintenance or development or other physiological process, said method comprising subjecting the nucleotide sequence of the open reading frame to data processing means such that the third-base in each codon or group of codons making up the open reading frame is identified or otherwise assigned a value wherein a first value is provided to an A and/or T and a second value is assigned to a G and/or C wherein the identification of a codon or group of contiguous or non-contiguous codons comprising a bias towards an A and/or T composition or G and/or C composition at the third position of said one or more codons provides a predicative indicator as to the nature and/or function of said nucleotide sequence or its expression product.
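
The value-assignment step in this aspect can be sketched as follows. This is an illustration only: the choice of 0 for A/T and 1 for G/C, the sliding-window summary over contiguous codons, and the names `third_base_values` and `windowed_bias` are all assumptions made for the example and are not specified in the text.

```python
def third_base_values(orf: str, at_value: int = 0, gc_value: int = 1) -> list[int]:
    """Assign a first value to A/T and a second value to G/C at each third base.

    Assumes the reading frame starts at position 0; U is treated as T.
    """
    s = orf.upper().replace("U", "T")
    if len(s) % 3 != 0:
        raise ValueError("sequence length must be a multiple of three")
    values = []
    for i in range(0, len(s), 3):
        base = s[i + 2]
        if base in ("G", "C"):
            values.append(gc_value)
        elif base in ("A", "T"):
            values.append(at_value)
        else:
            raise ValueError(f"unexpected base {base!r} at position {i + 2}")
    return values


def windowed_bias(values: list[int], window: int) -> list[float]:
    """Mean assigned value over each run of `window` contiguous codons.

    With the default 0/1 values this is the local fraction of G/C third bases.
    """
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and the number of codons")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Invented example: codons GCC GCG GCA TTT AAA.
values = third_base_values("GCCGCGGCATTTAAA")
print(values)
print(windowed_bias(values, window=2))
```

A window mean near 1 marks a run of codons biased towards G/C at the third position, and a mean near 0 marks a run biased towards A/T.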
Yet another aspect of the present invention provides a method for functionally analyzing nucleotide sequences putatively defining a gene such as identified in a genome sequencing project, said method comprising identifying transcriptional unit data comprising codons each encoding an amino acid, subjecting the transcriptional unit data to data processing means to identify the composition of the third-base in one or more of said codons wherein a bias towards an A and/or T composition or G and/or C composition at the third position of said one or more codons provides a predicative indicator as to the nature and/or function of said nucleotide sequence or its expression product towards said nucleotide sequence or expression product being involved inter alia in cell survival, growth, development and/or maintenance.
Still another aspect of the present invention is directed to data processing means for assessing the functional significance of a nucleotide sequence defining a gene, said data processing means comprising means for assigning an identifier to the third-base in one or more codons
Even yet another aspect of the present invention contemplates the use of chemical analogues of amino acids that have been identified as having a functional significance by the method of the present invention. In another aspect of the present invention, the identification of functionally significant codons by the method of the present invention provides a means to produce a combinatorial library of peptides, polypeptides or proteins which are enriched or decreased in one or more functionalities associated with an amino acid encoded by a triplet which has been identified as a functionally significant codon by the method of the present invention.
A further aspect of the present invention is directed to a recombinant or artificially created nucleic acid molecule or peptide, polypeptide or protein comprising a sequence of nucleotides or amino acids rationally designed based on the composition of the codon third-base.
Yet another aspect of the present invention provides a means of interpreting the likely functional significance, if any, of codon third-base polymorphisms.
Still another aspect of the present invention is directed to the analysis and identification of regulatory sequences.
The following abbreviations are used with the subject specification:-
[Figure imgf000008_0001: table of abbreviations, supplied as an image]
BRIEF DESCRIPTION OF THE FIGURES
Figure 1 is a representation of the functional continuity of nucleic acids and amino acids. (A) Arrangement of codon third bases in palindromic manner so as to minimize the mutational distance between codons. (B) Functional significance to amino acid distribution of codon third bases.
Figure 2 is a diagrammatical representation of a model whereby genomic evolution proceeds as a Darwinian balance between molecular function (hypomethylation and GC retention) and loss-of-function (methylation, leading to GC conversion to AT).
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
The present invention is predicated in part on the determination by the inventor that variations in the third-base composition of codons can be used to predict the relative functional significance of the nucleotide sequence, the particular codon and/or the expression product. Generally, but not exclusively, variations in the third-base composition of codons are used to interface with data processing means to facilitate the prediction of the relative functional significance of different codons or corresponding expression products of the codons. The present invention enables, therefore, the design of genes or expression products such as proteins with particular utilities.
Accordingly, one aspect of the present invention contemplates a method for assessing the functional significance of a nucleotide sequence which encodes an expression product which expression product comprises a sequence of nucleotides or amino acid residues, said method comprising determining reading frame data which comprises codons encoding nucleotide triplets or amino acid residues and subjecting the reading frame data to data processing means in order to identify the third-base in each codon or the overall third-base in some or all codons wherein the identification of a codon or codons having a particular base or type of base in the third-base position is indicative of the functional significance of said nucleotide sequence.
Generally, but not exclusively, the base identified is of the type G or C or of the type A or T. The recognition of individual or overall G/C or A/T content of the third base position is indicative of the functional significance of the nucleotide sequence. In accordance with a preferred aspect of the present invention, the presence of a G or C at the third-base position is more likely to indicate a functionally significant nucleotide sequence relative to an A or T at the same position. The present invention further contemplates the functional significance of the absence of a G or a C in the third-base position of a codon, i.e. the presence of an A or T. An example of the latter includes heterodimerization of different AT-rich receptors which may decrease the rate of degradation of receptors thereby prolonging their signalling. The instant method may be used to predict the likely signalling strength of certain receptors.
Reference herein to "data processing means" includes any electronic hardware device, in combination with one or more software applications, which participates in or otherwise facilitates the analysis, calculation, computation or reconfiguration of information. The hardware device generally comprises an assemblage of electronic components and includes a computer system. In a preferred embodiment, the data processing means executes a method of assigning a value or an identifier to the third nucleotide in some or all codons.
A "codon" is a unit of three nucleotides which, when transcribed, produces a particular amino acid residue or a triplet of ribonucleotides such as in mRNA. The term "third-base" in relation to a codon is the third nucleotide in a three nucleotide codon. The terms "nucleotide" and "base" are used interchangeably throughout the subject specification.
An "expression product" encompasses both an amino acid sequence such as in a peptide, polypeptide or protein including a single amino acid residue as well as a mRNA molecule comprising triplets of nucleotides each corresponding to a codon of a DNA sequence. An "expression product" may also be a hybrid complex of, for example, tRNA, mRNA and/or amino acid residues. When the expression product is mRNA, the nucleotide may include U instead of T.
An expression product may also be considered as other RNA sequences including synthetic components such as an engineered structural component including but not limited to a triple helix forming sequence or a stem loop structure. An expression product includes topological folds and structures of RNA that are derived from either typical or atypical A form geometry and may include internal bulges, loops and non-Watson and Crick base pairs.
Reference herein to an "expression product" refers to an expression product derived from a polynucleotide sequence or gene. An "expression product" includes complementary and antisense RNA sequences that can be expressed from a particular polynucleotide sequence or gene, and any mRNA sequence expressed from a gene in a given transcriptional reading frame. An expression product is also considered to be a peptide, polypeptide or protein that can be translated from an expressed RNA sequence.
The term "gene" is used in its broadest sense and includes cDNA corresponding to the exons of a gene. Accordingly, reference herein to a "gene" is to be taken to include:-
(i) a classical genomic gene consisting of transcriptional and/or translational regulatory sequences and/or a coding region and/or non-translated sequences (i.e. introns, 5'- and 3'- untranslated sequences); or
(ii) mRNA or cDNA corresponding to the coding regions (i.e. exons) and 5'- and 3'- untranslated sequences of the gene.
The term "gene" is also used to describe synthetic or fusion molecules encoding all or part of an expression product. In particular embodiments, the term "nucleic acid molecule" and "gene" may be used interchangeably.
Generally, a gene comprises, when read in a particular reading frame, a series of codons each encoding a particular amino acid residue or triplet of nucleotides of an mRNA molecule, wherein the series of codons provides a particular sequence of amino acids or sequence of mRNA triplet nucleotides.
A "peptide", "polypeptide" and "protein" refer to molecules comprising a sequence of amino acid residues or derivatives or analogues thereof. Accordingly, a peptide, polypeptide or protein may comprise chemical analogues of amino acids.
The "functional significance" of a nucleotide sequence includes the attributes of the nucleotide sequence, including attributes affecting fidelity, longevity, stability or efficiency of replication, transcription and/or translation, as well as those attributes affecting physiological processes within or by a cell carrying the nucleotide sequence. Physiological processes include but are not limited to cell survival, maintenance, development and growth, differentiation and/or proliferation capacities, and immunopotential and mitogenic potential such as the potential to develop or proliferate to a cancerous condition. The functional significance of a codon or group of codons within a nucleotide sequence may be determined once the third-base composition is identified. This may be, for example, by determining G/C content or A/T content.
The present invention may be practised at the nucleic acid and/or expression product level. In this regard, the present invention is particularly suitable for use in DNA microarray analysis of gene expression and in differential display and subtractive hybridization analysis of gene expression. It is intended that the present invention may be practised when sequence information of a gene, or expressed sequence, permits the identification of a codon in which the third base is indicative of a functionally significant amino acid in accordance with the method of the present invention.
Methods for subtractive hybridization and differential display and for the construction of DNA microarrays are well known in the art.
Accordingly, another aspect of the present invention provides a method for assessing the functional significance of an amino acid sequence or nucleotide sequence in an expression product of a DNA sequence, said method comprising determining reading frame data on the DNA which comprises codons encoding a residue of said amino acid sequence or a nucleotide of said nucleotide sequence of the expression product and subjecting the reading frame data to data processing means in order to identify amino acids or nucleotide triplets in an expression product encoded by one or more codons wherein the third base of said codon or codons is determined, wherein the identification of a codon or codons, including a sequence of codons, having, for example, a cytosine or guanine in the third-base position is indicative of the functional significance of said expression product. This aspect of the present invention may also be practised by identifying an A or T/U or a combination of all four/five nucleotides.
Preferably, the expression product is a peptide, polypeptide or protein. Preferably, the expression product is an mRNA molecule such as an antisense mRNA molecule, including but not limited to chemical analogues and derivatives of nucleotides such as but not limited to phosphorothioates.
The present invention is predicated in part on the surprising determination that genes characterized by being particularly physiologically significant, such as having high transcriptional and/or mitogenic and/or replicative activity, are generally more GC-rich than those having relatively lower transcriptional, mitogenic and/or replicative activity. Accordingly, third-base GC composition is more common in codons specifying amino acids in peptides, polypeptides or proteins or triplet nucleotides in mRNA molecules which are critical for physiological processes such as but not limited to cell survival, maintenance, development, growth, differentiation, proliferation, mitogenic potential and/or immunopotential. The term "mitogenic potential" includes the potential of cells to develop into cancers or tumours. The present invention enables, therefore, a method for complementing genetic analysis of nucleotide sequences by providing predictive means to assess the functional significance of the nucleotide sequence. This is particularly useful, for example, in assessing the potential functional significance of particular nucleotide sequences identified in the human genome project or similar projects for plants, animals, insects and micro-organisms. The significance of third-base composition may also be determined by determining A/T[U] content.
The present invention contemplates, therefore, in a preferred embodiment, a method for determining the likelihood or otherwise that a nucleotide sequence of an open reading frame encodes a peptide, polypeptide or protein involved in or otherwise associated with cell survival, growth, maintenance or development, said method comprising subjecting the nucleotide sequence of the open reading frame to data processing means such that the third base in each codon or group of codons making up the open reading frame is identified or otherwise assigned a value, wherein a first value is assigned to an A and/or T and a second value is assigned to a G and/or C, wherein the identification of a codon or group of contiguous or non-contiguous codons comprising a bias towards an A and/or T composition or G and/or C composition at the third position of said one or more codons provides a predictive indicator as to the nature and/or function of said nucleotide sequence or its expression product.
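The value assignment described above can be illustrated with a minimal Python sketch. The function names and the particular values (0 for A/T[U], 1 for G/C) are assumptions chosen for illustration; the specification does not prescribe them.

```python
from typing import List, Optional

# Hypothetical values: the text only requires that A/T and G/C receive
# two distinct values; 0 and 1 are chosen here for convenience.
AT_VALUE = 0
GC_VALUE = 1


def third_base_values(orf: str) -> List[Optional[int]]:
    """Return one value per complete codon; None if the third base is ambiguous."""
    seq = orf.upper().replace("U", "T")
    values: List[Optional[int]] = []
    # Ignore any trailing partial codon.
    for i in range(0, len(seq) - len(seq) % 3, 3):
        third = seq[i + 2]
        if third in "GC":
            values.append(GC_VALUE)
        elif third in "AT":
            values.append(AT_VALUE)
        else:
            values.append(None)
    return values


def gc3_fraction(orf: str) -> float:
    """Fraction of scored codons whose third base is G or C."""
    scored = [v for v in third_base_values(orf) if v is not None]
    if not scored:
        raise ValueError("no scorable codons in sequence")
    return sum(scored) / len(scored)
```

A fraction well above or below 0.5 would correspond to the G/C or A/T third-base bias referred to in the text; any threshold for calling a bias is left to the analyst.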
The method of the present invention is particularly useful for assessing the potential functional significance of nucleotide sequences identified during, for example, the human genome project.
According to this embodiment, there is provided a method for functionally analyzing nucleotide sequences putatively defining a gene such as identified in a genome sequencing project, said method comprising identifying transcriptional unit data comprising codons each encoding an amino acid, subjecting the transcriptional unit data to data processing means to identify the composition of the third base in one or more of said codons, wherein a bias towards an A and/or T composition or G and/or C composition at the third position of said one or more codons provides a predictive indicator as to the nature and/or function of said nucleotide sequence or its expression product towards said nucleotide sequence or expression product being involved inter alia in cell survival, growth, development and/or maintenance.
Alternatively, or in addition, the gene may be involved in transcription, translation, replication, proliferation and/or differentiation.
The method of the present invention can be at least partly implemented using a suitably programmed computer. In particular, the preferred data processing means comprises a suitably programmed computer and the steps of the method are preferably performed using the suitably programmed computer. In various forms of the invention, the input information may take the form of values, identifiers or other data in respect of the identity of the third base of a codon. The input data may be digitized. Alternatively, for implementation of the invention, a dedicated Fast Fourier transform chip can be employed as at least part of the processing means.
In a preferred form of the invention, representative measurements are made identifying or valuing the third-base in the codon in selected locations within a sequence. The selected locations define a gene or gene sequence.
Accordingly, another aspect of the invention is directed to a data processing means for assessing the functional significance of a nucleotide sequence defining a gene, at least in part by assigning an identifier to the third base in one or more codons.
More particularly, data processing means executes the steps of:-
(i) partitioning a gene sequence into codons using the correct reading frame;
(ii) analyzing the data obtained in (i) to identify codons of potential interest;
(iii) analyzing a protocol to distinguish whether the third-base of codons of interest comprises a G, C, A or T[U] nucleotide;
(iv) optionally characterizing the third-base content of target codons into subsets; and
(v) analyzing the data to attribute functional characteristics to the codons and encoded amino acids.
The subsets referred to in part (iv) above include identifying a guanine alone, cytosine alone or either a guanine or cytosine.
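Steps (i) to (iv) can be sketched in Python as follows. This is an illustrative outline only: the function names, the 0-based `frame` offset and the choice of exact codon matching for step (ii) are assumptions, and step (v), the attribution of functional characteristics, is left to the analyst.

```python
from collections import Counter
from typing import Dict, Iterable, List


def partition_codons(seq: str, frame: int = 0) -> List[str]:
    """Step (i): split a sequence into complete codons from a 0-based frame offset."""
    s = seq.upper().replace("U", "T")[frame:]
    return [s[i:i + 3] for i in range(0, len(s) - len(s) % 3, 3)]


def select_codons(codons: Iterable[str], codons_of_interest: Iterable[str]) -> List[str]:
    """Step (ii): keep only codons that belong to the set of interest."""
    wanted = set(codons_of_interest)
    return [c for c in codons if c in wanted]


def third_base_counts(codons: Iterable[str]) -> Counter:
    """Step (iii): count how often each nucleotide occupies the third position."""
    return Counter(c[2] for c in codons)


def third_base_subsets(counts: Counter) -> Dict[str, int]:
    """Step (iv): guanine alone, cytosine alone, either G or C, and A or T."""
    return {
        "G": counts["G"],
        "C": counts["C"],
        "G_or_C": counts["G"] + counts["C"],
        "A_or_T": counts["A"] + counts["T"],
    }
```

For example, passing `{"TGC", "TGT"}` as the codons of interest restricts the analysis to cysteine codons, so that the resulting subsets compare TGC against TGT usage.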
The input program and general methodology of the present invention are useful in a range of applications including designing genes and proteins having a particular ability or function. For example, if it is desired to modify the function of any gene or protein, whether natural or synthetic, the present invention may be used to identify codons and encoded amino acids in which an alteration of codon third-base composition could yield the desired change in function. Furthermore, once codon third bases of interest are so identified, routine genetic techniques are employed to introduce site-specific changes of nucleotide composition at the designated base positions.
In a particularly preferred embodiment, expression products may be generated by the selective mutagenesis of a codon in which the third base is indicative of a functionally significant amino acid. In this aspect of the present invention, the identified codons may be targets for mutagenesis or, alternatively, be retained in the expression product.
According to this aspect of the present invention, selective mutagenesis or retention of the identified codons provides a method for altering the functionality of the expressed product. Modulation of functionality of an expressed product may be useful in regulating physiological and/or biochemical processes such as those associated with ligand binding, dimerization, including oligomerization, catalytic activity, phosphorylation state, glycosylation state and protein stability.
In one embodiment, where a ligand or substrate is known to bind a respective ligand- or substrate-binding domain of a receptor or enzyme, the present invention is used to identify key codons encoding either the ligand/substrate or the receptor/enzyme domains, such that selective mutagenesis (and/or retention) of codons gives the greatest likelihood of producing a particular antagonistic (or functional) effect. A mixture of such effects may also be generated, such as, for example, codons which retain effective ligand binding or where other codons may be mutagenized to abolish, for example, dimerization, thus creating a competitive inhibitor. This approach may be scaled up, such as by using combinatorial chemistry to produce libraries of compounds or leads which are thus enriched for potential commercially valuable products.
The present invention further contemplates the use of chemical analogues of amino acids that have been identified as having a functional significance by the method of the present invention. According to this method, a functionally significant amino acid may be identified by the determination of the third-base composition of a codon or a group of codons. In a preferred embodiment, the instant method is used to improve the functional characteristics of a peptide, polypeptide or protein via site-specific incorporation of particular amino acids or chemically modified amino acids or amino acid analogues. Such chemical analogues are particularly useful due to, inter alia, their conformational stability, resistance to enzymatic degradation and/or their stability in diagnostic and therapeutic tests. Chemical analogues of a functionally significant amino acid contemplated herein include but are not limited to modifications to side chains, incorporation of unnatural amino acids and/or their derivatives during peptide, polypeptide or protein synthesis and the use of crosslinkers and other methods which impose conformational constraints on the proteinaceous molecule or their analogues.
Examples of side chain modifications contemplated by the present invention include modifications of amino groups such as by reductive alkylation by reaction with an aldehyde followed by reduction with NaBH4; amidination with methylacetimidate; acylation with acetic anhydride; carbamoylation of amino groups with cyanate; trinitrobenzylation of amino groups with 2,4,6-trinitrobenzene sulphonic acid (TNBS); acylation of amino groups with succinic anhydride and tetrahydrophthalic anhydride; and pyridoxylation of lysine with pyridoxal-5-phosphate followed by reduction with NaBH4.
The guanidine group of arginine residues may be modified by the formation of heterocyclic condensation products with reagents such as 2,3-butanedione, phenylglyoxal and glyoxal.
The carboxyl group may be modified by carbodiimide activation via O-acylisourea formation followed by subsequent derivatization, for example, to a corresponding amide.
Sulphydryl groups may be modified by methods such as carboxymethylation with iodoacetic acid or iodoacetamide; performic acid oxidation to cysteic acid; formation of mixed disulphides with other thiol compounds; reaction with maleimide, maleic anhydride or other substituted maleimide; formation of mercurial derivatives using 4-chloromercuribenzoate, 4-chloromercuriphenylsulphonic acid, phenylmercury chloride, 2-chloromercuri-4-nitrophenol and other mercurials; carbamoylation with cyanate at alkaline pH.
Tryptophan residues may be modified by, for example, oxidation with N-bromosuccinimide or alkylation of the indole ring with 2-hydroxy-5-nitrobenzyl bromide or sulphenyl halides. Tyrosine residues, on the other hand, may be altered by nitration with tetranitromethane to form a 3-nitrotyrosine derivative.
Modification of the imidazole ring of a histidine residue may be accomplished by alkylation with iodoacetic acid derivatives or N-carbethoxylation with diethylpyrocarbonate.
Examples of incorporating unnatural amino acids and derivatives during peptide synthesis include, but are not limited to, use of norleucine, 4-amino butyric acid, 4-amino-3-hydroxy-5-phenylpentanoic acid, 6-aminohexanoic acid, t-butylglycine, norvaline, phenylglycine, ornithine, sarcosine, 4-amino-3-hydroxy-6-methylheptanoic acid, 2-thienyl alanine and/or D-isomers of amino acids. A list of unnatural amino acids contemplated herein is shown in Table 1.
TABLE 1
Non-conventional amino acid    Code
α-aminobutyric acid    Abu
L-N-methylalanine    Nmala
α-amino-α-methylbutyrate    Mgabu
L-N-methylarginine    Nmarg
aminocyclopropane-carboxylate    Cpro
L-N-methylasparagine    Nmasn
L-N-methylaspartic acid    Nmasp
aminoisobutyric acid    Aib
L-N-methylcysteine    Nmcys
aminonorbornyl-carboxylate    Norb
L-N-methylglutamine    Nmgln
L-N-methylglutamic acid    Nmglu
cyclohexylalanine    Chexa
L-N-methylhistidine    Nmhis
cyclopentylalanine    Cpen
L-N-methylisoleucine    Nmile
D-alanine    Dal
L-N-methylleucine    Nmleu
D-arginine    Darg
L-N-methyllysine    Nmlys
D-aspartic acid    Dasp
L-N-methylmethionine    Nmmet
D-cysteine    Dcys
L-N-methylnorleucine    Nmnle
D-glutamine    Dgln
L-N-methylnorvaline    Nmnva
D-glutamic acid    Dglu
L-N-methylornithine    Nmorn
D-histidine    Dhis
L-N-methylphenylalanine    Nmphe
D-isoleucine    Dile
L-N-methylproline    Nmpro
D-leucine    Dleu
L-N-methylserine    Nmser
D-lysine    Dlys
L-N-methylthreonine    Nmthr
D-methionine    Dmet
L-N-methyltryptophan    Nmtrp
D-ornithine    Dorn
L-N-methyltyrosine    Nmtyr
D-phenylalanine    Dphe
L-N-methylvaline    Nmval
D-proline    Dpro
L-N-methylethylglycine    Nmetg
D-serine    Dser
L-N-methyl-t-butylglycine    Nmtbug
D-threonine    Dthr
L-norleucine    Nle
D-tryptophan    Dtrp
L-norvaline    Nva
D-tyrosine    Dtyr
α-methyl-aminoisobutyrate    Maib
D-valine    Dval
α-methyl-γ-aminobutyrate    Mgabu
D-α-methylalanine    Dmala
α-methylcyclohexylalanine    Mchexa
D-α-methylarginine    Dmarg
α-methylcyclopentylalanine    Mcpen
D-α-methylasparagine    Dmasn
α-methyl-α-naphthylalanine    Manap
D-α-methylaspartate    Dmasp
α-methylpenicillamine    Mpen
D-α-methylcysteine    Dmcys
N-(4-aminobutyl)glycine    Nglu
D-α-methylglutamine    Dmgln
N-(2-aminoethyl)glycine    Naeg
D-α-methylhistidine    Dmhis
N-(3-aminopropyl)glycine    Norn
D-α-methylisoleucine    Dmile
N-amino-α-methylbutyrate    Nmaabu
D-α-methylleucine    Dmleu
α-naphthylalanine    Anap
D-α-methyllysine    Dmlys
N-benzylglycine    Nphe
D-α-methylmethionine    Dmmet
N-(2-carbamylethyl)glycine    Ngln
D-α-methylornithine    Dmorn
N-(carbamylmethyl)glycine    Nasn
D-α-methylphenylalanine    Dmphe
N-(2-carboxyethyl)glycine    Nglu
D-α-methylproline    Dmpro
N-(carboxymethyl)glycine    Nasp
D-α-methylserine    Dmser
N-cyclobutylglycine    Ncbut
D-α-methylthreonine    Dmthr
N-cycloheptylglycine    Nchep
D-α-methyltryptophan    Dmtrp
N-cyclohexylglycine    Nchex
D-α-methyltyrosine    Dmty
N-cyclodecylglycine    Ncdec
D-α-methylvaline    Dmval
N-cyclododecylglycine    Ncdod
D-N-methylalanine    Dnmala
N-cyclooctylglycine    Ncoct
D-N-methylarginine    Dnmarg
N-cyclopropylglycine    Ncpro
D-N-methylasparagine    Dnmasn
N-cycloundecylglycine    Ncund
D-N-methylaspartate    Dnmasp
N-(2,2-diphenylethyl)glycine    Nbhm
D-N-methylcysteine    Dnmcys
N-(3,3-diphenylpropyl)glycine    Nbhe
D-N-methylglutamine    Dnmgln
N-(3-guanidinopropyl)glycine    Narg
D-N-methylglutamate    Dnmglu
N-(1-hydroxyethyl)glycine    Nthr
D-N-methylhistidine    Dnmhis
N-(hydroxyethyl)glycine    Nser
D-N-methylisoleucine    Dnmile
N-(imidazolylethyl)glycine    Nhis
D-N-methylleucine    Dnmleu
N-(3-indolylethyl)glycine    Nhtrp
D-N-methyllysine    Dnmlys
N-methyl-γ-aminobutyrate    Nmgabu
N-methylcyclohexylalanine    Nmchexa
D-N-methylmethionine    Dnmmet
D-N-methylornithine    Dnmorn
N-methylcyclopentylalanine    Nmcpen
N-methylglycine    Nala
D-N-methylphenylalanine    Dnmphe
N-methylaminoisobutyrate    Nmaib
D-N-methylproline    Dnmpro
N-(1-methylpropyl)glycine    Nile
D-N-methylserine    Dnmser
N-(2-methylpropyl)glycine    Nleu
D-N-methylthreonine    Dnmthr
D-N-methyltryptophan    Dnmtrp
N-(1-methylethyl)glycine    Nval
D-N-methyltyrosine    Dnmtyr
N-methyl-α-naphthylalanine    Nmanap
D-N-methylvaline    Dnmval
N-methylpenicillamine    Nmpen
γ-aminobutyric acid    Gabu
N-(p-hydroxyphenyl)glycine    Nhtyr
L-t-butylglycine    Tbug
N-(thiomethyl)glycine    Ncys
L-ethylglycine    Etg
penicillamine    Pen
L-homophenylalanine    Hphe
L-α-methylalanine    Mala
L-α-methylarginine    Marg
L-α-methylasparagine    Masn
L-α-methylaspartate    Masp
L-α-methyl-t-butylglycine    Mtbug
L-α-methylcysteine    Mcys
L-methylethylglycine    Metg
L-α-methylglutamine    Mgln
L-α-methylglutamate    Mglu
L-α-methylhistidine    Mhis
L-α-methylhomophenylalanine    Mhphe
L-α-methylisoleucine    Mile
N-(2-methylthioethyl)glycine    Nmet
L-α-methylleucine    Mleu
L-α-methyllysine    Mlys
L-α-methylmethionine    Mmet
L-α-methylnorleucine    Mnle
L-α-methylnorvaline    Mnva
L-α-methylornithine    Morn
L-α-methylphenylalanine    Mphe
L-α-methylproline    Mpro
L-α-methylserine    Mser
L-α-methylthreonine    Mthr
L-α-methyltryptophan    Mtrp
L-α-methyltyrosine    Mtyr
L-α-methylvaline    Mval
L-N-methylhomophenylalanine    Nmhphe
N-(N-(2,2-diphenylethyl)carbamylmethyl)glycine    Nnbhm
N-(N-(3,3-diphenylpropyl)carbamylmethyl)glycine    Nnbhe
1-carboxy-1-(2,2-diphenylethylamino)cyclopropane    Nmbc
Crosslinkers can be used, for example, to stabilize 3D conformations, using homo-bifunctional crosslinkers such as the bifunctional imido esters having (CH2)n spacer groups with n=1 to n=6, glutaraldehyde, N-hydroxysuccinimide esters and hetero-bifunctional reagents which usually contain an amino-reactive moiety such as N-hydroxysuccinimide and another group specific-reactive moiety such as maleimido or dithio moiety (SH) or carbodiimide (COOH). In addition, peptides can be conformationally constrained by, for example, incorporation of Cα and Nα-methylamino acids, introduction of double bonds between Cα and Cβ atoms of amino acids and the formation of cyclic peptides or analogues by introducing covalent bonds such as forming an amide bond between the N and C termini, between two side chains or between a side chain and the N or C terminus.
In a further aspect of the present invention, identification of functionally significant codons by the method of the present invention provides a means to produce a combinatorial library of peptides, polypeptides or proteins which are enriched or decreased in one or more functionalities associated with an amino acid encoded by a triplet which has been identified as functionally significant by the method of the present invention.
The expression product of a mutated and/or retained form of a nucleotide sequence may be used in biochemical screens of combinatorial libraries of molecules and high throughput screening methodologies to identify modulators of an activity present in an expression product including an expression product from a mutant or a retained form of the nucleotide sequence identified by the method of the present invention.
Accordingly, another aspect of the present invention is directed to a recombinant or artificially created nucleic acid molecule or peptide, polypeptide or protein comprising a sequence of nucleotides or amino acids rationally designed based on the composition of the codon third base. Yet another aspect of the present invention provides a means of interpreting the likely functional significance, if any, of codon third-base polymorphisms. Codon third-base polymorphisms provide a means for diagnosing or determining the likelihood of development of a disease condition or a particular phenotypic trait. The invention is also useful for susceptibility testing. The method of the present invention may be used to characterize the probable functional significance of third-base polymorphisms and their association with disease-predisposing mutations. Such mutations may affect the same or different (i.e. upstream or downstream) genes.
The use of the instant invention to determine third-base polymorphisms provides a more efficacious screening means than does in vitro mutagenesis or disease mutation analysis, particularly if the latter is very diverse in its sequence characteristics.
Particular preferred embodiments of the invention are described below.
Consider a newly discovered gene (X) encoding a protein (Y) which has potential therapeutic application. To develop drugs, antibodies, oligonucleotides or other commercial reagents to inhibit the function of Y, until the advent of the present invention, it would first be necessary to characterize which are the critical functional targets within Y so that a specific inhibitor could be developed. This may be accomplished by first electing to make a series of site-directed mutants, attempting to identify in vitro phenotypic correlates of such mutations and then confirming such observations using antisense, neutralizing antibody and/or in vivo transgene knock out/knock in approaches. Use of the present invention obviates these time-consuming and labour-intensive steps by allowing the identification from X of a subset of target codons with the desired functional importance.
Consider a gene (A) encoding a protein (B) which needs to be delivered effectively in animal models using gene therapy. However, the fidelity, efficiency and longevity of A expression using a particular delivery system may be sub-optimal, leading to transient and/or sub-therapeutic levels of B. By using the invention to identify target codons in A which are able to be modified in the desired manner using methodologies well known in the art, A can be modified to improve therapeutic expression of B to the desired level.
The present invention is further directed to the analysis and identification of regulatory sequences. Promoter methylation is a mechanism for the regulation of gene expression. The present invention provides an analysis of the GC content and compositions of genetic regulatory regions. The present invention further contemplates the identification of such regions where expression is regulated by DNA methylation and acetylation, particularly where chromatin conformation will determine the efficiency of expression of foreign genetic sequences such as those used in gene therapy. A highly AT-rich upstream sequence may be identified by the method of the present invention. The subject invention contemplates the replacement of such regions or sequences with more GC-rich sequences to facilitate or prolong the expression of introduced gene therapy products.
Accordingly, the present invention provides a method for assessing the functional significance of a nucleotide sequence, said method comprising subjecting the nucleotide sequence to data processing means in order to identify the composition bias for GC or AT, wherein the identification of a GC- or AT-rich region is indicative of the functional significance of said nucleotide sequence. Preferably, the nucleotide sequence is an upstream sequence of the gene. Preferably, the nucleotide sequence comprises a promoter.
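One way to locate GC- or AT-rich regions in an upstream sequence is a sliding-window scan, sketched below in Python. The window size, step and the 0.6/0.4 thresholds are hypothetical parameters chosen for illustration; the specification does not state numerical cut-offs.

```python
from typing import List, Optional, Tuple


def gc_fraction(seq: str) -> Optional[float]:
    """GC / (GC + AT) for a sequence; None if it contains no A, C, G or T/U."""
    s = seq.upper().replace("U", "T")
    gc = s.count("G") + s.count("C")
    at = s.count("A") + s.count("T")
    return gc / (gc + at) if gc + at else None


def windowed_gc(seq: str, window: int, step: int) -> List[Tuple[int, float]]:
    """(start, GC fraction) for each full window; uncountable windows are skipped."""
    out: List[Tuple[int, float]] = []
    for start in range(0, len(seq) - window + 1, step):
        frac = gc_fraction(seq[start:start + window])
        if frac is not None:
            out.append((start, frac))
    return out


def flag_regions(seq: str, window: int = 100, step: int = 50,
                 gc_rich: float = 0.6, at_rich: float = 0.4) -> List[Tuple[int, str]]:
    """Label windows as GC-rich or AT-rich using the assumed thresholds."""
    flagged: List[Tuple[int, str]] = []
    for start, frac in windowed_gc(seq, window, step):
        if frac >= gc_rich:
            flagged.append((start, "GC-rich"))
        elif frac <= at_rich:
            flagged.append((start, "AT-rich"))
    return flagged
```

Windows falling between the two thresholds are left unlabelled, so the output lists only the regions showing a composition bias in either direction.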
In another embodiment, the present invention provides a method of treatment comprising subjecting a nucleotide sequence to data processing means in order to identify the base composition and administering hypomethylating drugs if it is a GC-rich region to modify expression of a nucleotide sequence operably linked to said first mentioned nucleotide sequence. Preferably, the nucleotide sequence comprises a promoter. Preferably, the hypomethylating drug is 5-azacytidine.
The present invention may be used to predict not only the likely functionality of any individual amino acid but its potential significance or influence within a region of peptide sequence. Without intending to limit the present invention to any one theory or mode of action, it is proposed that once function is well established in evolutionary terms, there is micro-evolutionary tolerance to methylation-dependent GC-to-AT switching. Indeed, other aspects of the data indicate a regional significance to the instant invention; for example, TM sequences which are abundant in function (as measured by TMPRED score) exhibit a decline in the GC-richness of non-hydrophobic-encoding codons as well as hydrophobic amino acids phenylalanine-leucine-valine (FLV)-encoding codons, albeit from a lower starting value. In mechanistic terms, this could indicate transcription-coupled repair playing a regional role in nucleotide composition in addition to the selectionist (purifying) effect of mutations affecting individual residues.
The present invention is further described by the following non-limiting Examples.
EXAMPLE 1 Preliminary gene sequence analysis
Gene sequences were obtained from Genbank, and correlative functional data sourced from Medline. Genbank sequences for analysis of GC:AT content (the total number of G and C bases expressed relative to total A and T bases) were eliminated if no informative comparison could be made due to an inadequate number of human gene homologs, or similar GC:AT ratios between gene family members.
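The GC:AT ratio as defined above (total G and C bases relative to total A and T bases) can be computed with a short Python function. The function name is illustrative and not part of the specification.

```python
def gc_at_ratio(seq: str) -> float:
    """Total G + C bases divided by total A + T bases (U is counted as T)."""
    s = seq.upper().replace("U", "T")
    gc = s.count("G") + s.count("C")
    at = s.count("A") + s.count("T")
    if at == 0:
        raise ValueError("sequence contains no A or T bases; ratio undefined")
    return gc / at
```

A ratio r relates to the GC fraction f of the counted bases by f = r / (1 + r), so a ratio above 1 indicates a sequence in which G and C bases outnumber A and T bases.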
EXAMPLE 2
Systematic analysis of receptor tyrosine kinase gene sequences
For construction of a single-spanning transmembrane receptor gene sequence database, the NCBI data bank was searched using the following key words: tyrosine kinase receptor; growth factor receptor; and epidermal growth factor receptor, ErbB2, insulin receptor, insulin-like growth factor receptor, platelet-derived growth factor receptor, fibroblast growth factor receptor, EphA2, EphB2, Ret, Met, VEGF receptor, TrkA, TrkB. This search downloaded 1116 entries which were filtered by several methods to yield 141 sequences as the initial dataset, which was further refined by exclusion of repeat sequences to yield 50 unique mRNAs. Transmembrane regions from the corresponding amino acid sequences were then identified using the TMPRED algorithm (http://www.ch.embnet.org/software/TMPRED_form.html), which predicts membrane-spanning regions and orientation based on statistical analysis of the TMbase database of transmembrane proteins.
EXAMPLE 3 Selection of codons for comparative third-base analysis
Third-base codon composition was compared in three contrasting functional subsets of transmembrane amino acids: hydrophobic (phenylalanine, F; leucine, L; valine, V), hydrophilic (lysine, K; arginine, R; glutamate, E; aspartate, D) and neutral (proline, P; alanine, A; serine, S; threonine, T). In particular circumstances, analysis was restricted to hydrophobic amino acids (which, being lipophilic, are functionally implicated in membrane insertion) encoded by equal (even) numbers of synonymous codons, thereby excluding methionine, isoleucine and tryptophan, which are biased in their third-base codon composition. Neutral amino acids were selected exclusively from those residues able to undergo transitional (default) mutation to a hydrophobic amino acid.
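The comparison across the three amino acid subsets can be sketched in Python as follows. The codon assignments are those of the standard genetic code; the function and variable names are illustrative, and the sketch assumes the transmembrane codons have already been extracted in the correct reading frame.

```python
from typing import Dict, List

# Standard genetic code entries for the amino acids named in the text.
CODONS: Dict[str, List[str]] = {
    "F": ["TTT", "TTC"],
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
    "V": ["GTT", "GTC", "GTA", "GTG"],
    "K": ["AAA", "AAG"],
    "R": ["CGT", "CGC", "CGA", "CGG", "AGA", "AGG"],
    "E": ["GAA", "GAG"],
    "D": ["GAT", "GAC"],
    "P": ["CCT", "CCC", "CCA", "CCG"],
    "A": ["GCT", "GCC", "GCA", "GCG"],
    "S": ["TCT", "TCC", "TCA", "TCG", "AGT", "AGC"],
    "T": ["ACT", "ACC", "ACA", "ACG"],
}
SUBSETS: Dict[str, str] = {"hydrophobic": "FLV", "hydrophilic": "KRED", "neutral": "PAST"}

# Map each codon to the subset of the amino acid it encodes.
CODON_TO_SUBSET: Dict[str, str] = {
    codon: name
    for name, residues in SUBSETS.items()
    for aa in residues
    for codon in CODONS[aa]
}


def gc3_by_subset(codons: List[str]) -> Dict[str, float]:
    """Fraction of G/C third bases per subset; subsets with no codons are omitted."""
    tallies = {name: [0, 0] for name in SUBSETS}  # [GC third-base count, total]
    for raw in codons:
        codon = raw.upper().replace("U", "T")
        name = CODON_TO_SUBSET.get(codon)
        if name is None:
            continue  # codon encodes an amino acid outside the three subsets
        tallies[name][1] += 1
        if codon[2] in "GC":
            tallies[name][0] += 1
    return {name: gc / total for name, (gc, total) in tallies.items() if total}
```

Codons for methionine, isoleucine and tryptophan are absent from the table and are therefore skipped, in line with their exclusion in the text.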
EXAMPLE 4 Nucleotide composition of genes varying in transcriptional activity
Analysis of GC:AT nucleotide ratios in eight human housekeeping genes (mean GC:AT = 1.48 ± 0.07) and their pseudogene homologs (mean GC:AT = 0.89 ± 0.07) supports a relationship between gene functionality - as indicated here by transcriptional frequency - and GC content (Table 2, p = 0.007). To clarify the significance of this association, nucleotide composition was then analyzed in tumor-associated genes undergoing either loss-of-function mutation, amplification or chromosomal translocation (which typically causes constitutive gain-of-function via 5' juxtaposition of a housekeeping gene regulatory sequence).
EXAMPLE 5
Nucleotide composition of genes with opposite mitogenic significance
Table 3 shows that genes undergoing loss-of-function mutation tend to be GC-poor (GC:AT = 0.84 ± 0.05) whereas those involved in translocations tend to be GC-rich (GC:AT = 1.58 ± 0.06); amplified genes are also GC-rich with the exception of a subset involved in brain tumors (overall GC:AT = 1.23 ± 0.10). Since tumors are defined by a clonal survival advantage, this difference between gain-of-function and loss-of-function genotypes (p < 0.0001) implies that GC-rich genes tend to promote cell survival whereas AT-rich genes more often negatively regulate cell proliferation.
EXAMPLE 6 Third-base codon composition of extracellular domain cysteines
The above observations relating to gene nucleotide composition were, therefore, assessed in terms of codon third-base identity, commencing with codons specifying the sulfhydryl-containing amino acid cysteine. A survey of mitogenic transmembrane receptors indicates that extracellular domain cysteines are more often encoded by TGC than TGT (Table 4; p = 0.0013), a trend which correlates positively though not significantly with overall GC:AT content (p = 0.09). To determine the significance, if any, of individual codon variants - and thereby to exclude the confounding possibility of regional genomic GC retention - correlations were sought with functional differences reported due to extracellular domain cysteine mutations. The Ret proto-oncogene is activated in human tumors by five reported mutations affecting Cys(TGC) but none involving Cys(TGT), and five to seven tumorigenic ErbB2 mutations in a transgenic animal model disrupt TGC codons whereas only one affects TGT (Donis-Keller et al, 1993; Siegel et al, 1994). The 51 extracellular cysteines of the epidermal growth factor receptor (EGFR) are perfectly conserved in ErbB2, but the latter receptor - which fails to bind a ligand - has converted seven of the 39 TGC codons of EGFR to TGT, five of which lie within a region encoding ErbB2 amino acids 500-580 (p = 0.037). Similarly, the homologous α- and β-receptors for platelet-derived growth factor (PDGF) contain two Cys(TGC) residues in the first extracellular immunoglobulin-like domain, which is within the region binding the common ligand PDGF-B (Miyazawa et al, 1998). In immunoglobulin-like domains 2 and 3, which are implicated in PDGF-A binding, however, PDGFR-α contains four Cys(TGT) amino acids whereas PDGFR-β contains four Cys(TGC) residues (p < 0.03). Since PDGFR-α appears less mitogenic than PDGFR-β (Inui et al., 1994), inhibition of PDGF-B binding via selective Cys(TGT) binding of PDGF-A would be expected to prevent formation of α/β receptor heterodimers and thus to impair cell proliferation in the presence of PDGF-B.
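The TGC versus TGT comparison for cysteine codons can be sketched in Python as below. The specification does not state which statistical test produced its p values, so the two-sided exact binomial test against an equal-usage null hypothesis shown here is an assumption, as are the function names.

```python
from math import comb
from typing import Iterable, Tuple


def cys_codon_counts(codons: Iterable[str]) -> Tuple[int, int]:
    """Return (number of TGC codons, number of TGT codons)."""
    norm = [c.upper().replace("U", "T") for c in codons]
    return norm.count("TGC"), norm.count("TGT")


def binomial_two_sided_p(k: int, n: int) -> float:
    """Exact two-sided binomial p value for k successes in n trials, null p = 0.5.

    Sums the probabilities of all outcomes that are no more likely than
    the observed one (a common definition of the two-sided exact test).
    """
    probs = [comb(n, i) / 2 ** n for i in range(n + 1)]
    observed = probs[k]
    # Small relative tolerance guards against floating-point ties.
    return min(1.0, sum(p for p in probs if p <= observed * (1 + 1e-9)))
```

As a usage illustration, the EGFR figures quoted in the text (39 TGC codons among 51 extracellular cysteines) would be tested with `binomial_two_sided_p(39, 51)`.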
TGT-encoded Cys524 in the insulin receptor forms a class 1 disulfide bond linking the two α-chains, but mutation of this cysteine neither alters insulin affinity nor abolishes receptor homodimerization (Bilan and Yip, 1994). Further work has revealed the presence of additional α-α bonds between Cys682 (TGC) and Cys683 (TGC); moreover, the sole β-α disulfide bond responsible for insulin receptor heterotetramerization occurs between Cys647 (TGC) and Cys872 (TGC) (Sparrow et al, 1997). Mutation of four cysteines (2 TGC, 2 TGT) in the hormone-binding domain of the estrogen receptor causes no detectable change in hormone affinity, but a marked impairment of hormone-dependent transactivation is confined to the TGC mutants (Reese and Katzenellenbogen, 1991). The extracellular domain of the GM-CSF receptor contains two disulfide bonds encoded by TGT-TGT codons (Cys, 26-Cys' 6, Cys'6'1-Cysl 7s); disruption of these sites impairs low-affinity ligand binding, but mutant receptor binding to a preformed β-chain heterodimer is not abolished, nor is ligand-dependent growth (Ronco, Doyle et al, 1995).
The reported effects of site-directed mutagenesis on extracellular TGC- and TGT-encoded cysteine function in other molecules thus support the conclusion that Cys(TGT) residues function similarly to Cys(TGC), but that Cys(TGC) residues tend to be more critically located for functions relevant to cell survival.
Mistranslation of AAT codons occurs 6-8 times more frequently than that of AAC codons (Parker et al, 1983), and guanine is more accurately translated than other nucleotides (Negre et al, 1988). Moreover, A and T bases destabilize duplex nucleic acid binding in TATA boxes and poly-A tails, raising the possibility that the triple-hydrogen-bonding properties of guanine and cytosine permit more stable mRNA-anti-codon binding and thus more faithful protein translation. If codons terminating in T or A are significantly more prone to mistranslation, GC third-base variants could be selected over generations due to a higher probability of wild-type protein function and hence more efficient gene action. Codons important for cell survival become GC-rich in this model, whereas functionally neutral codons revert in a methylation-dependent manner to a third-base AT mutational ground state.
EXAMPLE 7 Third-base codon composition of intracellular domain tyrosine residues
Similar analysis can be applied to patterns of TAC and TAT codons specifying tyrosine residues. A wide spectrum of GC:AT and TAC:TAT ratios is evident in a sample of cytosolic tyrosine kinases (Table 5). The kinases reported to be transforming in this group (Src, Abl and Fyn) have higher GC:AT and TAC:TAT ratios, whereas the AT-rich kinases Btk, Fak and Yes are not primarily associated with cell proliferation: Btk activation plays a central role in B-cell differentiation and apoptosis, and Fak regulates integrin-dependent signalling events within focal adhesions. The functions of Yes and Src (the two extremes of the TAC:TAT distribution) are even more dissimilar: expression of Yes is prominent in terminally differentiated cells, inversely correlated with cell growth and unnecessary for tumorigenesis (Park and Cartwright, 1995), and this gene is also unusual among non-receptor tyrosine kinases in its homology to a pseudogene of similar GC:AT content (Semba et al, 1988). An even lower TAC:TAT ratio characterizes the non-catalytic molecule Cbl, a growth-inhibitory tyrosine phosphoprotein (Murphy et al, 1998) which acts by binding (Lupher et al, 1997) and antagonizing the GC-rich tyrosine kinase Syk (Ota and Samelson, 1997); the tumor-suppressive potency of this TAT-rich molecule is illustrated by the hyperplastic and oncogenic sequelae of its knockout and viral inhibition respectively (Murphy et al, 1998).
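The TAC:TAT comparison described above amounts to counting two synonymous codons in a reading frame. The following is a minimal sketch of that count, assuming an in-frame coding sequence supplied as a plain string; the function names and the toy sequence are hypothetical and are not taken from the specification.

```python
from collections import Counter


def codon_counts(cds: str) -> Counter:
    """Count codons in an in-frame coding sequence (DNA or RNA letters)."""
    seq = cds.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return Counter(seq[i:i + 3] for i in range(0, usable, 3))


def tac_tat_ratio(cds: str):
    """Return (TAC count, TAT count, TAC:TAT quotient or None if no TAT)."""
    counts = codon_counts(cds)
    tac, tat = counts["TAC"], counts["TAT"]
    return tac, tat, (tac / tat if tat else None)


# Toy sequence invented for illustration only (not a real gene).
toy_cds = "ATGTACTACTATGGCTACTAA"
print(tac_tat_ratio(toy_cds))  # (3, 1, 3.0)
```

The same counter can be reused for TGC and TGT when cysteine codons are of interest.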
EXAMPLE 8 Third-base codon composition of growth factor receptor tyrosine sites
This apparent correlation between mitogenic activity and the TAC:TAT ratio is supported by the codon composition of intracellular tyrosines in transmembrane receptors (Table 6). Strongly mitogenic receptors retain a majority of Tyr(TAC) intracellular residues and autophosphorylation sites, whereas TAT-rich receptors are less clearly linked to growth. With respect to the latter receptor group: ErbB3 has acquired mutations in its kinase domain (Kraus et al, 1989) which render it functionally inactive (Guy et al, 1994); c-Met functions as a transmembrane docking site distinguished by its effects on cell motility rather than growth; the mitogenic impairment of PDGFR-α noted above is attributable at least in part to failure of Src co-activation (Hooshmand-Rad et al, 1998), a major mechanism of PDGFR-β-induced growth (Hansen et al, 1996); and Fms, like Kit, mediates hemopoietic lineage determination in vivo, with its four autophosphorylation sites (three encoded by TAT) all implicated in ligand-dependent cell differentiation (Rohrschneider et al, 1997). Of note, the TAC:TAT ratios of several human receptor genes are strikingly different in rodent cells, with marked increases in Fms and PDGFR-α (Table 5). Discrepancies between the perceived behaviour of cognate gene products within these two cell systems could then reflect evolutionary differences in gene behaviour related to differences in host longevity and long-term tumor risk.
EXAMPLE 9 Third-base codon composition of tyrosines binding second messengers
To determine the relative functional significance of tyrosines encoded by TAC and TAT, the codon identity of phosphotyrosines interacting with downstream signalling molecules was established by literature search. Table 7 shows that the second messengers phospholipase C-γ, phosphatidylinositol 3'-kinase and Grb2 associate almost exclusively with TAC-encoded tyrosines. An opposite pattern of codon-specific binding is seen with the tyrosine phosphatase SHP-2, with Tyr(TAT) being the preferred docking site (p = 0.0001); conversely, consistent with its implication as a steric inhibitor of mitogenic substrate action, Cbl binding preferentially involves Tyr(TAC) residues (which presumably bind at other times to pro-mitogenic proteins). Loss of the TAT-encoded PLC-γ-binding site in PDGFR-β may not impair PDGF-dependent growth significantly (Hill et al, 1990); Tyr751 is not essential for PI 3'-kinase binding in PDGFR-β (Kashishian et al, 1992); the interaction of Cbl with PI 3'-kinase may be associated with growth suppression and apoptosis (Kitanaka et al, 1996); and the transforming activity of Ret does not require the presence of Tyr1096 (Asai et al, 1996). These findings further support the hypothesis that TAC-encoded tyrosines tend to have a pro-mitogenic significance.
Tyrosine codons exhibit non-random patterns of distribution similar to those noted above for cysteine codons. In the PDGFR-α gene, for example, 15 of 18 TAT codons encode tyrosines within the N-terminal 330 amino acids of the cytoplasmic domain, whereas 9 of 11 TAC codons encode residues confined to the hydrophilic C-terminal 150 residues (p = 0.001). A similarly non-random TAC-TAT distribution corresponds to the N-terminal 190 amino acids of estrogen receptor-α (p = 0.005). TAC-encoded tyrosines are also non-randomly distributed in extracellular domains, however, consistent with a methylation-dependent mechanism for discarding deleterious TGC codons: if loss of function becomes selectable (e.g. to permit a change of ligand specificity) then it is more efficient to mutate from TGC to TAC than to TGT and thence to TAT. A testable prediction of this hypothesis is that TAT will tend to be the least frequently conserved of the four codons in the extracellular domain of homologous mitogenic receptors, and this is true for the insulin and IGF1 receptors (67% TGC, 48% TGT, 41% TAC, 25% TAT conserved; p = 0.001). A second prediction is that Tyr(TAC) and Cys(TGC) residues will tend to exhibit a topographically reciprocal relationship, as demonstrable between TGC and TTC (phenylalanine) codons in adjacent extracellular domains of EGFR (p = 0.001), reinforcing the impression that selection occurs at the level of codons.
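Codon-by-region comparisons of the kind reported above can be tested on a 2x2 contingency table. The sketch below implements a two-sided Fisher's exact test using only the standard library. The text does not state which test produced each p-value in this passage, and the example table assumes that the two protein regions are complementary and that the two remaining TAC codons fall in the N-terminal region; both are assumptions made for illustration.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are no more likely than the observed one.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= observed * (1 + 1e-9))


# Hypothetical table: rows = (TAT, TAC), columns = (N-terminal, C-terminal).
p = fisher_exact_two_sided(15, 3, 2, 9)
print(f"p = {p:.4g}")
```

The small tolerance factor in the comparison guards against floating-point ties between equally likely tables.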
In human genes which are important for long term survival of cells and/or organisms, conservation of codons containing GC third bases appears stronger than that of AT third bases. With respect to the conserved tyrosine kinase domains of EGFR and ErbB2, none of the 10 kinase domain tyrosines is encoded by TAT in both receptors despite the presence of five TAT codons, whereas five TAC-encoded sites are identically conserved (Semba et al, 1985; p = 0.007). Relevant to this, the third tyrosine in the ErbB2 kinase domain is encoded by TAT and corresponds to a TTC-encoded phenylalanine in EGFR, suggesting that both sites may have evolved by mutation from a parental TAC codon. Since conservation of amino acid repeats arguably implies a site-specific function, the codon composition of extracellular cysteine and intracellular tyrosine doublets (Cys-Cys or Tyr-Tyr sequences) was then assessed in a group of 20 signalling molecules. TGT-TGT and TAT-TAT doublets are rare in this series, occurring at only one-tenth and one-fifteenth the frequency of TGC-TGC or TAC-TAC doublets, respectively.
EXAMPLE 10 Third-base codon composition of transmembrane domain amino acids
To determine the third-base GC and AT content in a prospective series of functionally characterized codons, the inventor created a database of 50 genes encoding single-transmembrane-domain growth factor receptors and 60 G-protein-coupled receptor sequences, yielding 470 sequences encoding transmembrane domains. All hydrophobic (FLV) and hydrophilic (KRED) amino acids encoded by equal numbers of synonymous codons were examined, as were neutral amino acids capable of transitional mutation to a hydrophobic amino acid (PAST). The functional importance of hydrophobic amino acids for stable membrane insertion is supported by the higher proportion of hydrophobic vs hydrophilic residues; the third bases of codons encoding these same hydrophobic amino acids are correspondingly enriched for G or C (χ2 = 134.20, Table 8). These data further support the hypothesis that codons containing GC third bases tend to encode residues of greater functional importance than those encoded by codons containing AT third bases.
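A chi-square comparison of GC3 and AT3 counts between two amino acid groups can be sketched as follows. This is a minimal illustration assuming a 2x2 table with non-zero margins and no continuity correction; the counts are invented, and the specification does not say whether a correction was applied.

```python
from math import erfc, sqrt


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for [[a, b], [c, d]] (no correction)."""
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat


def p_value_df1(stat: float) -> float:
    """Upper-tail probability of a chi-square variate with 1 degree of freedom."""
    return erfc(sqrt(stat / 2.0))


# Hypothetical counts: rows = (hydrophobic, other), columns = (GC3, AT3).
stat = chi_square_2x2(300, 100, 200, 200)
print(round(stat, 2), p_value_df1(stat) < 0.005)  # 53.33 True
```

With one degree of freedom the statistic exceeds the 0.005 critical point of 7.87944 whenever `p_value_df1` returns less than 0.005.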
EXAMPLE 11
Significance of the data
Depending upon the relative intensities of environmental selection and mutation pressures, the above data suggest that codon conservation may occur via two distinct routes: (1) positive or negative selection based on mitogenic and/or evolutionary advantage or disadvantage respectively, or (2) default mutation of non-critical motifs towards a mutationally stable endpoint. The latter mechanism may explain why mutation of conserved codons is often associated with little discernible effect (Bilan and Yip, 1994; Ronco et al, 1995). Conversely, the association of polymorphisms with disease may prove to have mechanistic significance hitherto unappreciated.
EXAMPLE 12
Table 9 presents an expanded summary of the transmembrane domain database, which now includes data approximately 10-fold more numerous than those presented in Epstein et al, 2000, due to addition of a G-protein-coupled receptor (GPCR) database to the former receptor tyrosine kinase (RTK) statistics. In addition, the inventor has included further tests of statistical significance, which support the validity of the invention while also illustrating its application.
EXAMPLE 13
A further analysis of this database illustrates the utility of the invention. Table 10 shows the GC:AT ratio in transmembrane domains which are categorized according to their TMPRED score (this detects the "fit" of the sequence with that predicted for transmembrane domains and correlates with total hydrophobicity). The analysis shows that hydrophobic amino acids (FLV) are most GC-rich (GC:AT = 6.41) in the third-base position when they are fewest in number (lowest TMPRED score) and, conversely, are least GC-rich (GC:AT = 2.82) when most abundant (TMPRED >3000). A similar trend is apparent when considering the third-base composition of hydrophilic residues (KRED) in nuclear localization sequences (NLS; Table 11): this shows that NLS scores <2.0 are associated with GC:AT = 2.29, whereas NLS scores >3.5 are associated with GC:AT = 1.38. These new unpublished data confirm the prediction of the invention that functional redundancy (abundance, reduced need) of an amino acid is associated with a relative reduction of third-base GC content, whereas functional scarcity of (increased need for) an amino acid is associated with a relative increase in the third-base GC content.
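The third-base GC:AT ratio for an amino acid group such as FLV or KRED can be computed directly from a coding sequence. The sketch below assumes the standard genetic code and an in-frame DNA string; the function name and the toy sequence are hypothetical, and binning by TMPRED or NLS score is not shown.

```python
# Standard-code codons for the two amino acid groups named in the text.
GROUP_CODONS = {
    "FLV": {"TTT", "TTC",                                # Phe
            "TTA", "TTG", "CTT", "CTC", "CTA", "CTG",    # Leu
            "GTT", "GTC", "GTA", "GTG"},                 # Val
    "KRED": {"AAA", "AAG",                               # Lys
             "CGT", "CGC", "CGA", "CGG", "AGA", "AGG",   # Arg
             "GAA", "GAG",                               # Glu
             "GAT", "GAC"},                              # Asp
}


def gc3_at3(cds: str, group: str):
    """Return (GC3, AT3) codon counts for one amino acid group."""
    codons = GROUP_CODONS[group]
    seq = cds.upper().replace("U", "T")
    gc3 = at3 = 0
    for i in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[i:i + 3]
        if codon in codons:
            if codon[2] in "GC":
                gc3 += 1
            else:
                at3 += 1
    return gc3, at3


# Toy sequence invented for illustration: Leu, Val, Phe, Lys, Asp.
toy = "CTGGTCTTTAAAGAC"
print(gc3_at3(toy, "FLV"), gc3_at3(toy, "KRED"))  # (2, 1) (1, 1)
```

Dividing GC3 by AT3 gives the GC:AT quotient quoted in the text, provided AT3 is non-zero.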
EXAMPLE 14
Similarly, another data set which illustrates the validity of the invention is presented in Tables 12 and 13. Table 12 is an analysis of the GPCR database and Table 13 relates to RTKs: GPCRs have seven transmembrane (TM) domains, whereas RTKs have only one.
Table 11 shows that GPCRs have 42.5% hydrophobic (defined here as FLV) residues and 2.3% hydrophilic (KRED) residues, whereas RTKs have 47.8% FLV vs 0.3% KRED. Hence, the FLV:KRED ratio is 18.5 for GPCRs but 159 for RTKs (a significant difference). Since the RTK has only one TM domain to anchor it in the membrane, whereas the GPCR has seven, it makes functional sense that the RTK would have a "better" (more functional) TM domain in this way. Hence, these data confirm that the functionality of the RTK TM domain is greater (closer to redundancy) than is that of the individual GPCR TM domains. Comparison with Table 9 confirms that this relative functional redundancy of RTK TM domains is associated with a lower GC:AT ratio (3.17) than for GPCRs (4.55).
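The FLV:KRED ratios quoted above follow directly from the stated percentages, as this short check shows. The helper name is hypothetical; the input figures are those given in the text.

```python
def flv_kred_ratio(flv_percent: float, kred_percent: float) -> float:
    """Ratio of hydrophobic (FLV) to hydrophilic (KRED) residue percentages."""
    return flv_percent / kred_percent


gpcr = flv_kred_ratio(42.5, 2.3)  # GPCR transmembrane domains
rtk = flv_kred_ratio(47.8, 0.3)   # RTK transmembrane domains
print(round(gpcr, 1), round(rtk))  # 18.5 159
```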
Those skilled in the art will appreciate that the invention described herein is susceptible to variations and modifications other than those specifically described. It is to be understood that the invention includes all such variations and modifications. The invention also includes all of the steps, features, compositions and compounds referred to or indicated in this specification, individually or collectively, and any and all combinations of any two or more of said steps or features.
TABLE 2 Nucleotide composition of housekeeping genes and pseudogene homologs
Figure imgf000037_0001
"Housekeeping genes" were identified by searching Medline using this string
Gene and pseudogene nucleotide contents were derived directly from Genbank. Statistical significance of the differences in GC:AT content (see text) was determined using Fisher's exact test
TABLE 3 Nucleotide composition of human tumor-associated genes
Figure imgf000038_0001
Figure imgf000039_0001
Gene abnormalities and corresponding tumor types were identified by Medline literature search using a group of terms (for example, "mutation and {loss of function or loss-of-function} and {cancer or tumor} and human and gene").
Codon nucleotide analyses were undertaken using either published sequences or Genbank data; statistical significance (see text) was determined using a two-sample t-test
TABLE 4 Frequency of cell surface receptor codons specifying extracellular cysteines
Figure imgf000040_0001
Codon numbers were derived manually
GC:AT ratios were obtained from Genbank mRNA sequences; statistical significance (see text) was determined using non-parametric Spearman's correlation coefficient
TABLE 5 Nucleotide composition of non-receptor tyrosine phosphoproteins and c-Cbl
Figure imgf000041_0001
Data relating to identity of autophosphorylation sites were derived from Medline
Genes are listed in descending order of total TAC:TAT content
GC:AT ratios are derived from the published Genbank sequences
TABLE 6 TAC:TAT ratios in growth factor receptor genes
Figure imgf000042_0001
For catalytic domain sites (centre) and entire intracellular domain (right), TAC:TAT quotients are followed by numerical codon ratios in parentheses
Values are given for rodent intracellular domain TAC:TAT quotients and codon ratios
TABLE 7 Codon correlation with phosphotyrosine-binding effectors
Figure imgf000043_0001
Figure imgf000044_0001
Autophosphorylation sites were identified by Medline search using the string {tyrosine and phosphorylation and site and (mapped or mapping or tryptic or phosphotryptic) and (Grb2 or phospholipase or phosphatidylinositol or Cbl or Shp-2 or PTP) and (growth factor or receptor tyrosine kinase)}; numbered site designates the amino acid location of the receptor phosphotyrosine residue. Statistical significance for the codon identity of phosphotyrosines binding SHP-2 relative to the other second messengers (see text) was derived using Fisher's exact test.
As predicted from consensus sequences (Lupher et al, 1997)
TABLE 8 Single-transmembrane-domain codon analysis from receptor tyrosine kinases
Figure imgf000045_0001
Frequency represents the proportion of amino acids within the predicted transmembrane sequences of the 50 receptor tyrosine kinases
GC3 = number of codons having a third-base either G or C; AT3 = number of codons having a third-base either A or T
Chi-square statistical analysis relates to the differential GC3:AT3 frequency
TABLE 9
Human Transmembrane Protein Codon Usage Analysis Database
[Last updated: May- 19-2000]
Analysis Summary
TMD distribution analysis:
Figure imgf000046_0002
The statistical tests for GC3/AT3 proportion in GPCRs TM domain:
1. H0: The proportion of GC3/AT3 is the same for all 5 groups.
df | critical point | computed value | result
4 | 14.8602 | 3959.16567667912 | Reject the hypothesis at p-value <= 0.005
2. H0: The proportion of GC3/AT3 is the same for group FLV and others.
Figure imgf000046_0001
The statistical tests for GC3/AT3 proportion in GPCRs groups:
o Group: FLV
1. H0: The proportion of GC3/AT3 is the same for all 3 domains.
df | critical point | computed value | result
2 | 10.5966 | 1138.84049509516 | Reject the hypothesis at p-value <= 0.005
2. H0: The proportion of GC3/AT3 is the same for TM and IC + EC.
df | critical point | computed value | result
1 | 7.87944 | 49.7794268368964 | Reject the hypothesis at p-value <= 0.005
o Group: PAST
1. H0: The proportion of GC3/AT3 is the same for all 3 domains.
df | critical point | computed value | result
2 | 10.5966 | 438.207493670886 | Reject the hypothesis at p-value <= 0.005
2.
Figure imgf000047_0001
o Group: KRED
1. H0: The proportion of GC3/AT3 is the same for all 3 domains.
df | critical point | computed value | result
2 | 10.5966 | 1909.91571314882 | Reject the hypothesis at p-value <= 0.005
2. H0: The proportion of GC3/AT3 is the same for TM and IC + EC.
df | critical point | computed value | result
1 | 7.87944 | 3160.18593450644 | Reject the hypothesis at p-value <= 0.005
o Group: CHNGQY
1. H0: The proportion of GC3/AT3 is the same for all 3 domains.
df | critical point | computed value | result
2 | 10.5966 | 295.314048749581 | Reject the hypothesis at p-value <= 0.005
2. H0: The proportion of GC3/AT3 is the same for TM and IC + EC.
df | critical point | computed value | result
1 | 7.87944 | 37.22239985658 | Reject the hypothesis at p-value <= 0.005
o Group: other
1. H0: The proportion of GC3/AT3 is the same for all 3 domains.
df | critical point | computed value | result
2 | 10.5966 | 224.604212685204 | Reject the hypothesis at p-value <= 0.005
2. H0: The proportion of GC3/AT3 is the same for TM and IC + EC.
df | critical point | computed value | result
1 | 7.87944 | 32.2623086080694 | Reject the hypothesis at p-value <= 0.005
TABLE 10 • TM trend analysis
Figure imgf000048_0001
TABLE 11 NLS trend analysis
Figure imgf000048_0002
TABLE 12 • Frequency distribution for GPCRs' TMD:
Figure imgf000049_0001
TABLE 13 • Frequency distribution for RTKs' TMD:
Figure imgf000049_0002
BIBLIOGRAPHY
Asai et al. (1996) J Biol Chem 271:17644-17649
Bilan et al. (1994) Biochem Biophys Res Comm 205:1891-1898
Cooper et al. (1989) Hum Genet 83:181-188
Donis-Keller et al. (1993) Hum Mol Genetics 2:851-856
Epstein, C.J. (1967) Nature 215:355-359
Epstein et al. (2000) Gene 245:291-298
Guy et al. (1994) Proc Natl Acad Sci USA 91:8132-8136
Hansen et al. (1996) EMBO J 15:5299-5313
Hill et al. (1990) Science 248:1660-2
Hooshmand-Rad et al. (1998) J Cell Sci 111:607-614
Inui et al. (1994) J Biol Chem 269:30546-30552
Kashishian et al. (1992) EMBO J 11:1373-1382
Kitanaka et al. (1996) Blood 88:590-598
Kraus et al. (1989) Proc Natl Acad Sci USA 86:9193-9197
Lupher et al. (1997) J Biol Chem 272:33140-33144
Miyazawa et al. (1998) J Biol Chem 273:25495-25502
Murphy et al. (1998) Mol Cell Biol 18:4872-4882
Negre et al. (1988) Biochimie 70:1715-1718
Nembrot et al. (1990) Biochem Biophys Res Comm 166:601-607
Ota, Y. and Samelson, L.E. (1997) Science 276:418-420
Park, J. and Cartwright, C.A. (1995) Mol Cell Biol 15:2374-2382
Parker et al. (1983) J Biol Chem 258:10007-10012
Reese, J.C. and Katzenellenbogen, B.S. (1991) J Biol Chem 266:10880-10887
Rohrschneider et al. (1997) Mol Reprod Dev 46:96-103
Ronco et al. (1995) J Immunol 154:3444-3453
Semba et al. (1985) Proc Natl Acad Sci USA 82(October):6497-6501
Siegel et al. (1994) Mol Cell Biol 14:7068-7077
Sparrow et al. (1997) J Biol Chem 272:29460-29467

Claims

1. A method for assessing the functional significance of a nucleotide sequence which encodes an expression product which expression product comprises a sequence of nucleotides or amino acid residues, said method comprising determining reading frame data which comprises codons encoding nucleotide triplets or amino acid residues and subjecting the reading frame data to data processing means in order to identify the third- base in each codon or the overall third-base in some or all codons wherein the identification of a codon or codons having a particular base or type of base in the third- base position is indicative of the functional significance of said nucleotide sequence.
2. A method according to Claim 1 wherein the presence of a G or C at the third-base position indicates a functionally significant nucleotide sequence.
3. A method according to Claim 1 or 2 wherein the data processing means executes a method of assigning a value or an identifier to the third nucleotide in some or all of the codons.
4. A method according to any one of the preceding claims wherein the nucleotide sequence is a DNA sequence.
5. A method according to Claim 1 or 4 wherein the expression product is a peptide, polypeptide or protein.
6. A method according to Claim 1 or 4 wherein the expression product is an RNA molecule.
7. A method according to Claim 1 or 4 wherein the expression product is an mRNA molecule.
8. A method for assessing the functional significance of an amino acid sequence or nucleotide sequence in an expression product of a DNA sequence, said method comprising determining reading frame data on the DNA which comprises codons encoding residue of said amino acid sequence or nucleotide of said nucleotide sequence of the expression product and subjecting the reading frame data to data processing means in order to identify amino acids or nucleotide triplets in an expression product encoded by one or more codons wherein the third-base of said codon or codons is determined wherein the identification of a codon or codons including a sequence of codons having a cytosine or guanine in the third-base position is indicative of the functional significance of said expression product.
9. A method according to Claim 8 wherein the expression product is a peptide, polypeptide or protein.
10. A method according to Claim 8 wherein the expression product is an RNA molecule.
11. A method according to Claim 8 wherein the expression product is an mRNA molecule.
12. A method for functionally analyzing nucleotide sequences putatively defining a gene such as identified in a genome sequencing project, said method comprising identifying transcriptional unit data comprising codons each encoding an amino acid, subjecting the transcriptional unit data to data processing means to identify the composition of the third-base in one or more of said codons wherein a bias towards an A and/or T composition or G and/or C composition at the third position of said one or more codons provides a predictive indicator as to the nature and/or function of said nucleotide sequence or its expression product towards said nucleotide sequence or expression product being involved inter alia in cell survival, growth, development and/or maintenance.
13. A method according to Claim 12 wherein the gene is alternatively involved in transcription, translation, replication, proliferation and/or differentiation.
14. A method according to Claim 1 or 8 or 12 wherein the data processing means executes the steps of:-
(i) partitioning a gene sequence into codons using the correct reading frame;
(ii) analyzing the data obtained in (i) to identify codons of potential interest;
(iii) analyzing a protocol to distinguish whether the third-base of codons of interest comprises a G, C, A or T[U] nucleotide;
(iv) optionally characterizing the third-base content of target codons into subsets; and
(v) analyzing the data to attribute functional characteristics to the codons and encoded amino acids.
15. A method according to Claim 14 wherein part (iv) is performed and identifies a guanine alone, cytosine alone or either a guanine or cytosine.
16. A recombinant or artificially created nucleic acid molecule or peptide, polypeptide or protein comprising a sequence of nucleotides or amino acids rationally designed based on the composition of the codon third-base.
17. Use of data processing means in order to identify the third-base in each codon or the overall third-base in some or all codons wherein the identification of a codon or codons having a particular base or type of base in the third-base position is indicative of the functional significance of said nucleotide sequence.
18. A method for assessing the functional significance of a nucleotide sequence, said method comprising subjecting the nucleotide sequence to data processing means in order to identify the composition bias for GC or AT wherein the identification of a GC- or AT-rich region is indicative of the functional significance of said nucleotide sequence.
19. A method according to Claim 18 wherein the nucleotide sequence comprises the upstream sequence of a gene.
20. A method according to Claim 18 or 19 wherein the nucleotide sequence comprises a promoter.
21. A method of treatment comprising subjecting a nucleotide sequence to data processing means in order to identify the base composition and administering hypomethylating drugs if it is a GC-rich region to modify expression of a nucleotide sequence operably linked to said first mentioned nucleotide sequence.
22. A method according to Claim 21 wherein the nucleotide sequence comprises a promoter.
23. A method according to Claim 21 wherein the hypomethylating drug is 5-azacytidine.
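The data-processing steps (i) to (iv) recited in Claim 14 can be sketched as a small pipeline. This is an illustrative sketch only, assuming the reading frame is already known and the codons of interest are supplied by the user; the function names and the toy sequence are hypothetical, and step (v), attributing functional characteristics, is left to the analyst because the claims give no numerical threshold.

```python
from collections import Counter


def partition_codons(seq: str, frame: int = 0) -> list:
    """Step (i): split a sequence into codons in the given reading frame."""
    s = seq.upper().replace("U", "T")[frame:]
    return [s[i:i + 3] for i in range(0, len(s) - len(s) % 3, 3)]


def third_base_profile(seq: str, codons_of_interest: set, frame: int = 0) -> dict:
    """Steps (ii)-(iv): select codons of interest and tally their third bases."""
    selected = [c for c in partition_codons(seq, frame)
                if c in codons_of_interest]           # step (ii)
    counts = Counter(c[2] for c in selected)          # step (iii)
    gc3 = counts["G"] + counts["C"]                   # step (iv): GC subset
    at3 = counts["A"] + counts["T"]                   # step (iv): AT subset
    return {"counts": dict(counts), "GC3": gc3, "AT3": at3}


# Toy sequence invented for illustration: cysteine codons TGC, TGT, TGC.
profile = third_base_profile("ATGTGCTGTTGCTACTAA", {"TGC", "TGT"})
print(profile)
```

Keeping the GC3 and AT3 subsets separate from the raw per-base counts allows the optional characterization of Claim 15 (guanine alone, cytosine alone, or either) to be read from the same result.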
PCT/AU2001/000193 2000-02-23 2001-02-23 Assessment of the functional significance of a nucleotide sequence by subjecting the reading frame data to data processing Ceased WO2001062958A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU35255/01A AU3525501A (en) 2000-02-23 2001-02-23 Assessment of the functional significance of a nucleotide sequence by subjecting the reading frame data to data processing

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
AUPQ5803 2000-02-23
AUPQ5803A AUPQ580300A0 (en) 2000-02-23 2000-02-23 Genetic analysis

Publications (1)

Publication Number Publication Date
WO2001062958A1 true WO2001062958A1 (en) 2001-08-30

Family

ID=3819911

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/AU2001/000193 Ceased WO2001062958A1 (en) 2000-02-23 2001-02-23 Assessment of the functional significance of a nucleotide sequence by subjecting the reading frame data to data processing

Country Status (2)

Country Link
AU (1) AUPQ580300A0 (en)
WO (1) WO2001062958A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110556163B (en) * 2019-09-04 2022-12-30 广州基迪奥生物科技有限公司 Analysis method of long-chain non-coding RNA translation small peptide based on translation group

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000065076A2 (en) * 1999-04-26 2000-11-02 K.U. Leuven Research & Development Synthetic gene for expressing active retroviral protein in eukaryotes

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000065076A2 (en) * 1999-04-26 2000-11-02 K.U. Leuven Research & Development Synthetic gene for expressing active retroviral protein in eukaryotes

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
ALVAREZ F. ET AL.: "Evolution of codon usage and base contents in kinetoplastid protozoans", MOLECULAR BIOLOGY AND EVOLUTION, vol. 11, no. 5, 1994, pages 790 - 802 *
ANTEZANA M.A. ET AL.: "The nonrandom location of synonymous codons suggests that reading frame-independent forces have patterned codon preferences", JOURNAL OF MOLECULAR EVOLUTION, vol. 49, 1999, pages 36 - 43 *
ARGENTINE J.A. ET AL.: "Codon preference of aedes aegypti and aedes albopictus", INSECT MOLECULAR BIOLOGY, vol. 1, no. 4, 1993, pages 189 - 194 *
D'ONOFRIO G. ET AL.: "Correlations between compositional properties of the human genes, codon usage and amino acid composition of proteins", JOURNAL OF MOLECULAR EVOLUTION, vol. 32, 1991, pages 504 - 510 *
FENNOY S.L. ET AL.: "Synonymous codon usage in zea mays L. Nuclear genes is varied by levels of C and G-ending codons", NUCLEIC ACIDS RESEARCH, vol. 21, no. 23, 1993, pages 5294 - 5300 *
LIO P. ET AL.: "Third codon G+C periodicity as a possible signal for an "internal" selective constraint", JOURNAL OF THEORETICAL BIOLOGY, vol. 171, 1994, pages 215 - 223 *
NISHIZAWA M. ET AL.: "Biased usages of arginines and lysines in proteins are correlated with local-scale fluctuations of the G+C content of DNA sequences", JOURNAL OF MOLECULAR EVOLUTION, vol. 47, 1998, pages 385 - 393 *
RODRIGUEZ-TRELLES F. ET AL.: "Switch in codon bias and increased rates of amino acid substitution in the drosophila saltans species group", GENETICS, vol. 153, 1999, pages 339 - 350 *
XIA X.: "Maximising transcription efficiency causes codon usage bias", GENETICS, vol. 144, 1996, pages 1309 - 1320 *

Also Published As

Publication number Publication date
AUPQ580300A0 (en) 2000-03-16


Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP