WO2016164530A1

WO2016164530A1 - Compositions and methods for high throughput protein sequencing

Info

Publication number: WO2016164530A1
Application number: PCT/US2016/026354
Authority: WO
Inventors: Shuwei Li; John Marino; Thomas Fuerst
Original assignee: National Institute of Standards and Technology NIST; University of Maryland College Park
Current assignee: National Institute of Standards and Technology NIST; University of Maryland College Park
Priority date: 2015-04-07
Filing date: 2016-04-07
Publication date: 2016-10-13
Anticipated expiration: 2017-10-07

Abstract

Provided are compositions and methods for elucidating the amino acid sequences of a plurality of polypeptides in high-throughput approaches. The compositions include unique binding partners that have specificity for N-terminal amino acids. The methods include using the unique binding partners in processes that involve obtaining a plurality of polypeptides, binding detectably labeled N-terminal amino acid binding agents to the N-terminal amino acid of the polypeptides, detecting the N-terminal amino acid binding agents to identify the N-terminal amino acid for some or all of the polypeptides, removing the N-terminal amino acid binding agents, liberating the N-terminal amino acid to reveal the next amino acid in the polypeptides in the N->C terminal direction, and repeating the process to determine some or all of the amino acid sequence of the polypeptides. The approaches can be used to analyze polypeptides from a wide diversity of sources, and for detecting proteins of low abundance from any particular source.

Description

COMPOSITIONS AND METHODS FOR HIGH THROUGHPUT PROTEIN SEQUENCING

CROSS REFERENCE TO RELATED APPLICATIONS

[0001] This application claims priority to U.S. provisional patent application no. 62/144,204, filed April 7, 2015, the disclosure of which is incorporated herein by reference.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH

[0002] This invention was made with United States government support from the National Institute of Standards and Technology under grant no. 70NANB12H159. The government has certain rights in the invention.

FIELD

[0003] The present disclosure relates generally to protein analysis and more specifically to high-throughput protein sequence determination.

BACKGROUND

[0004] Despite great promise for the field of proteomics, current technology for identifying and quantifying low abundance peptides and proteins is quite limited. Mass spectrometry (MS) is the most widely used approach for peptide/protein identification and quantification and is the method of choice for clinical biomarker discovery. Although MS technology has rapidly advanced, it remains limited in dynamic range (10⁴ ~ 10⁵) and has difficulty in detecting low abundance proteins and providing absolute quantification. These represent significant drawbacks in analysis of complex biological samples such as serum where proteins can span a dynamic range of ~10¹⁰ and most valuable biomarkers are believed to be low abundance proteins at a concentration of a pmol/L or less. As a result, despite extensive research efforts and large financial investments during the last 10+ years for biomarker discovery, current technologies have mostly failed to provide clinically validated biomolecules that can be reliably applied for disease diagnosis and treatment. Thus, there is an ongoing need for improved compositions and methods for identifying and determining protein sequences. The present disclosure addresses these and other needs.

SUMMARY

[0005] The present disclosure is related to improved compositions and methods for elucidating the amino acid sequences of a plurality of polypeptides. It is suitable for a variety of high-throughput approaches, and can be used to analyze polypeptides from a wide diversity of sources, and for detecting proteins of low abundance from any particular source. The compositions comprise, among other components, novel binding partners that have specificity for N-terminal amino acids. In various aspects the N-terminal amino acid binding agent is detectably labeled.

[0006] Methods provided herein generally comprise obtaining a plurality of polypeptides, binding detectably labeled N-terminal amino acid binding agents to the N-terminal amino acid of the polypeptides, detecting the N-terminal amino acid binding agents to identify the N-terminal amino acid for some or all of the polypeptides, removing the N-terminal amino acid binding agents, liberating the N-terminal amino acid to reveal the next amino acid in the polypeptides in the N->C terminal direction, and repeating the process to determine some or all of the amino acid sequence of the polypeptides.

[0007] All combinations of each and every binding agent described herein are encompassed by the disclosure. In certain approaches, at least one N-terminal amino acid binding agent used in the methods, compositions, and/or kits of this disclosure comprises a sequence selected from SEQ ID NO:3, SEQ ID NO:4, SEQ ID NO:5, SEQ ID NO:6, and SEQ ID NO:8, or the sequence of SEQ ID NO:1 comprising any one or any combination of changes selected from the group consisting of L47N, V53D, F61C, D115V, and Y301C, or the sequence of SEQ ID NO:2, comprising any one or any combination of changes selected from the group consisting of M178I, D229V, and L227R, or the sequence of SEQ ID NO:9, comprising an I496F change. In certain embodiments the binding agents may comprise binding partners such as antibodies or specific N-terminal amino-acid binding fragments thereof. In a non-limiting example the disclosure comprises using a binding partner having a complementarity determining region (CDR) selected from (L1) SGDALPKKYAY (SEQ ID NO:3); (L2) EDVKRLS (SEQ ID NO:4); (L3) YSNSKTGNYNV (SEQ ID NO:5); (H1) GYTFTDYWIS (SEQ ID NO:6); (H2) QIAMTNSATVYGPSFQG (SEQ ID NO:7); (H3) DYSDNYYNDTYS (SEQ ID NO:8). The disclosure includes but is not limited to using combinations of L1 and H1, L1 and H2, L1 and H3, L2 and H1, L2 and H2, L2 and H3, L3 and H1, L3 and H2, and L3 and H3.

[0008] Methods of the disclosure are suitable for determining the sequence of all or a portion of the polypeptides in or derived from, for example, any biological sample. The disclosure includes generating a report, including but not necessarily limited to a printed report, or a digitized report that includes some or all of the peptide sequence information that is generated using the compositions and methods described herein.

[0009] In another aspect the disclosure includes a complex comprising a polypeptide and an N-terminal amino acid binding agent, wherein the binding agent comprises a sequence selected from SEQ ID NO:3, SEQ ID NO:4, SEQ ID NO:5, SEQ ID NO:6, and SEQ ID NO:8, the sequence of SEQ ID NO:1 comprising any one or any combination of changes selected from the group consisting of L47N, V53D, F61C, D115V, and Y301C, or the sequence of SEQ ID NO:2, comprising any one or any combination of changes selected from the group consisting of M178I, D229V, and L227R, or the sequence of SEQ ID NO:9, comprising an I496F change. The polypeptides may be provided in physical association with a solid substrate. The polypeptides may be present in an array of polypeptides.

[0010] The disclosure also includes kits that are useful for performing methods of the disclosure. The kits comprise at least one sealed container comprising at least one of the N-terminal amino acid binding agents that are described herein, and may comprise 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20 distinct binding partners. The kits can further comprise printed material describing use of N-terminal amino acid binding agents in a process for determining amino acid sequences of a plurality of polypeptides.

DESCRIPTION OF THE FIGURES

[0011] Figure 1 depicts an illustration of single molecule peptide sequencing according to one embodiment of this disclosure.

[0012] Figure 2 depicts the crystal structure (top) and a depiction (bottom) of an N-terminus Met occupying the active site of a methionyl-tRNA synthetase.

[0013] Figure 3 provides a flowchart showing an affinity selection cycle using a phage display protein library.

[0014] Figure 4A provides a depiction of one illustrative embodiment of the disclosure showing covalent immobilization of trypsin digested peptides onto a glass surface coated with methylisourea through a C-terminus lysine. In this embodiment the surface is coated with a compound, such as polyethylene glycol, to inhibit nonspecific binding.

[0015] Figure 4B provides a graphical summary of data obtained from determining expression of distinct tRNA synthetases expressed on M13 phage as Flag-aaRS-pIII fusion proteins. A higher OD450 represents better expression.

[0016] Figure 5A. Representative scheme depicting labeling and purification of NAA-binding proteins with organic dyes using, as an illustrative example, bio-orthogonal click chemistry. As a non-limiting example of an embodiment of this disclosure, the excitation and emission maxima of four organic dyes are indicated.

[0017] Figure 5B. Graphical depiction of data showing the specificity of PheRS for N-terminus phenylalanine over leucine. In an experiment used to generate these data, a 96-well plate was first coated with NeutrAvidin (100 μL of 10 μg/mL for each well), then incubated with a peptide containing N-terminal Phe and C-terminal biotin connected with 11 polyethylene glycol (PEG) units. 100 μL of serially diluted M13 phage expressing wild-type PheRS-pIII on the surface was bound to the plate. The bound phages were detected with the HRP-conjugated anti-M13 antibody and TMB substrate.

[0018] Figure 6. Graphical depiction of data demonstrating binding improvement of TyrRS mutants resulting from three cycles of panning. The phage-displayed TyrRS library was generated by error prone PCR. The library was incubated with immobilized peptide containing N-terminus tyrosine and C-terminal biotin connected with 11-PEG units. After three cycles, individual clones were isolated and sequenced. Their binding to immobilized N-terminal tyrosine and leucine (a control) peptides was measured as described for Figure 5B. The cartoon demonstrates an aspect of library screening. The table provides an indication of amino acid changes in the TyrRS and demonstrates distinct substrate differences that correlate with the amino acid changes.

[0019] Figure 7. Graphical depiction of data obtained from binding assays of scFv towards N- and C-terminal tyrosine. A naive phage-display scFv library was selected for binding with an immobilized peptide containing N-terminal tyrosine. Three cycles were performed. The enriched clones were tested with ELISA as described herein. Clone p807/C2 shows specific binding to N-terminal tyrosine.

[0020] Figure 8. Graphical depiction of results obtained from PheRS Screening performed generally as outlined in Figure 5B and its description. Two clones (A1 with M178I, D229V mutations; C1 with L227R mutation) show binding preference to N-terminus Phe peptides over control N-terminus Leu peptides (A1, 1.33 fold; C1, 1.96 fold).

[0021] Figure 9. Graphical depiction of results obtained from LeuRS Screening performed generally as outlined in Figure 5B and its description, but with LeuRS. One clone (D2 with an I496F mutation) shows binding preference to N-terminus Leu peptides over control N-terminus Phe peptides (1.54 fold).

DETAILED DESCRIPTION

[0022] Unless defined otherwise herein, all technical and scientific terms used in this disclosure have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure pertains.

[0023] Every numerical range given throughout this specification includes its upper and lower values, as well as every narrower numerical range that falls within it, as if such narrower numerical ranges were all expressly written herein.

[0024] The disclosure includes all polynucleotide and amino acid sequences described herein, and every polynucleotide sequence referred to herein includes its complementary DNA sequence, and also includes the RNA equivalents thereof to the extent an RNA sequence is not given. Every DNA and RNA sequence encoding polypeptides disclosed herein is encompassed by this disclosure.

[0025] The present disclosure provides compositions and reagents designed for determining amino acid sequences of a plurality of polypeptides. The compositions comprise novel binding partners that have specificity for N-terminal amino acids. In general, the method comprises obtaining a plurality of polypeptides, binding detectably labeled N-terminal amino acid binding agents to the N-terminal amino acid of the polypeptides, detecting the N-terminal amino acid binding agents to identify the N-terminal amino acid, removing the N-terminal amino acid binding agents, liberating the N-terminal amino acid to reveal the next amino acid in the polypeptides in the N->C terminal direction, and repeating the process to determine some or all of the amino acid sequence of the polypeptides. This process and reagents suitable for use in performing it are described further below, in part with reference to Figure 1.

[0026] In certain non-limiting embodiments, proteins are separated from a sample using any suitable approaches, and denatured using conventional techniques. In embodiments, they may be modified, such as by being alkylated on cysteine residues, and digested by any one or a combination of known proteases, including but not necessarily limited to trypsin. If trypsin is used, the average peptide length will be from 7-25 residues. Thus, peptides analyzed according to the present disclosure can in certain embodiments comprise or consist of between 7-25 amino acid residues, inclusive, and including all integers and ranges there between.

[0027] In embodiments, sequencing 7-10 residues is expected to be sufficient to identify a unique peptide. In embodiments, the compositions and methods are adapted for the identification and quantification of peptides/proteins regardless of their relative abundance in any particular sample or set of samples. In embodiments, the compositions and methods are adapted for concurrent or sequential sequencing of a plurality of proteins ranging from two proteins, up to and including a billion proteins, or more. In embodiments, by determining amino acid sequences, the amount of one or more distinct peptides in a plurality of proteins is determined. Thus, the disclosure includes but is not necessarily limited to use of amino acid sequence determination to measure the abundance of one or more distinct proteins relative to other proteins in a sample, and/or absolute peptide quantification, such as by determining the mass, molarity, or number of any one or more distinct polypeptides in a sample, or from more than one sample. In certain embodiments one or more low abundance proteins can be determined to have been present in a sample that is analyzed according to methods of the present disclosure.
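
As a non-limiting illustration of why a short N-terminal read can be sufficient to identify a peptide, the following Python sketch digests toy protein sequences in silico and reports how many N-terminal residues are needed to distinguish each resulting peptide from the others. The protein sequences and the function names are hypothetical and are not part of the disclosure. The cleavage rule (after K or R, but not before P) is a common simplification of trypsin specificity and is an assumption of this sketch.

```python
import re

def tryptic_digest(protein):
    """Cleave after K or R unless the next residue is P (simplified rule, an assumption)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]

def min_prefix_to_identify(peptides):
    """For each distinct peptide, return the number of N-terminal residues that must be
    read before no other peptide in the set shares the same prefix (None if the whole
    peptide is itself a prefix of another peptide)."""
    peptides = set(peptides)
    result = {}
    for pep in peptides:
        result[pep] = None
        for n in range(1, len(pep) + 1):
            if sum(other.startswith(pep[:n]) for other in peptides) == 1:
                result[pep] = n
                break
    return result

# Hypothetical toy proteins differing at a single residue (I vs. L).
toy_proteins = ["MKTAYIAKQR", "MKTAYLAKQR"]
all_peptides = [pep for prot in toy_proteins for pep in tryptic_digest(prot)]
print(min_prefix_to_identify(all_peptides))
```

In this toy set the two peptides that differ only at their fourth residue require four sequencing cycles to resolve, while the others are resolved after one. A real proteome would require a correspondingly larger reference set.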

[0028] Polypeptides analyzed using the compositions and methods of this disclosure can be from any source that contains or is expected to contain polypeptides. Embodiments of the disclosure are adaptable for use in, for example, large-scale, massively parallel peptide sequencing. Thus the present disclosure has broad applicability for use in, for example, identification of low-abundance polypeptides from a diversity of sources, wherein the polypeptides may have significance in a wide variety of areas, including but not necessarily limited to the human and veterinary health areas, diagnostics and therapeutics, forensics, agricultural products and processes, food science-based technologies, microbiomics, proteomics-based analysis of whole organisms, organs, systems, tissues, microorganisms and viruses, including but not limited to pathogens, biofilms, and cell populations, including but not limited to cultured cells or cells obtained or derived from an organism, or populations of cells enriched for one or more cell types, and biological fluids which include but are not limited to mucosa, serum, blood, lymph, urine, cerebrospinal fluid, semen, saliva, tears, and in any other composition of matter in which the identification of proteins would be desirable.

[0029] In one aspect the disclosure includes analyzing the sequence of a plurality of proteins from a test sample, and comparing one or more proteins analyzed in the test sample to any suitable reference. By comparing results obtained by analyzing one or more proteins in the test sample to a reference, a difference in the test sample can be identified and used to characterize the test sample. As one non-limiting embodiment provided only to illustrate one of the many expected utilities of the instant invention, the disclosure includes determining a plurality of protein sequences from a biological sample obtained or derived from an individual. By comparing the analysis of one or more proteins from the test sample to the reference, such as a suitable control sample or a standardized reference, one or more differences in the test sample relative to the reference can be identified. Such differences if present may be used for a variety of purposes, such as phenotyping, or diagnosing or aiding in a physician's diagnosis of a condition or disorder, for staging a disease and/or making a prognosis, for making a treatment recommendation, for monitoring the progress of a treatment, or for identification of a source of one or more of the polypeptides identified by performance of the method. In an embodiment, the method is used for analysis of a sample obtained or derived from an individual who is at risk for, is suspected of having, or has been diagnosed with a condition or disorder, non-limiting examples of which include any form of cancer, an infectious disease, an auto-immune disorder, a muscular or neuromuscular disorder, a non-cancer blood disorder, a disorder confined to a particular organ or tissue, or any other condition wherein a protein biomarker may be present.
In certain embodiments the compositions and methods are used to identify novel polypeptides that are, for example, previously unknown proteins, isoforms, splice variants, and the like.
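
The comparison of a test sample to a reference can be sketched, purely for illustration, as a comparison of per-peptide read counts. The Python example below is a minimal sketch under stated assumptions: each sequenced molecule contributes one read, abundance is expressed as a fraction of total reads, and a two-fold change is used as an arbitrary threshold. The function names, the threshold, and the toy reads are hypothetical and do not come from the disclosure.

```python
from collections import Counter

def relative_abundance(reads):
    """Fraction of all reads attributable to each distinct peptide sequence."""
    counts = Counter(reads)
    total = sum(counts.values())
    return {pep: n / total for pep, n in counts.items()}

def differences(test_reads, reference_reads, min_fold=2.0):
    """Return peptides whose relative abundance differs between test and reference
    by at least min_fold, or that are detected in only one of the two samples."""
    test = relative_abundance(test_reads)
    ref = relative_abundance(reference_reads)
    flagged = {}
    for pep in set(test) | set(ref):
        t, r = test.get(pep, 0.0), ref.get(pep, 0.0)
        if t == 0.0 or r == 0.0:
            flagged[pep] = (t, r)          # seen in only one sample
        elif max(t / r, r / t) >= min_fold:
            flagged[pep] = (t, r)          # abundance shifted by >= min_fold
    return flagged

# Hypothetical reads: one string per sequenced molecule.
test_sample = ["TAYI"] * 3 + ["QR"]
reference = ["TAYI"] + ["QR"] * 3
print(differences(test_sample, reference))
```

Any statistical treatment of such differences (replicates, normalization, multiple-testing correction) is outside the scope of this sketch.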

[0030] The binding partners that have specificity for N-terminal amino acids can be any suitable moiety, compound or composition of matter that can distinguish one of the 20 naturally occurring amino acids from the others. The disclosure includes single and combinations of novel N-terminal amino acid binding partners, methods of using them, as well as kits that comprise them. The novel N-terminal binding partners may be combined with, for example, previously available or naturally occurring N-terminal amino acid binding partners that, given the benefit of this disclosure, will be recognized by those skilled in the art as being adaptable for certain aspects of the protein sequencing approaches described further herein. In embodiments, selective evolution of binding agents is employed to provide improved N-terminal amino acid binding reagents. The N-terminal amino acid binding agents are abbreviated as "NAA" from time to time in this disclosure.

[0031] In one aspect, the disclosure includes novel N-terminal amino acid binding agents that are modified or otherwise engineered proteins. In one aspect, protein-based binding agents are developed by modifying naturally occurring proteins that bind amino acids with specificity, such as aminoacyl-tRNA synthetases (referred to herein from time to time as "RS"), and/or amino-acid binding fragments thereof. In embodiments, the RS fragments comprise or consist of the amino acid binding pocket of the RS, with modifications thereof. Any RS of any origin can be adapted for use in the present disclosure. In embodiments, the RS is a prokaryotic RS, or a eukaryotic RS. In embodiments, the RS is a class I or a class II RS. In embodiments, a combination of an RS and a tRNA can be used.

[0032] Also provided are protein-based novel N-terminal amino acid binding agents that are based on immunological molecules, such as antibodies or fragments thereof that can bind with specificity to distinct N-terminal amino acids. In certain embodiments, the immunological N-terminal binding agent can comprise antibody Fab fragments, Fab' fragments, F(ab')2 fragments, Fv fragments, scFv fragments, antibody-based aptamers, nanobodies, llama bodies, diabodies, or any other N-terminal amino acid binding portion of an immunological molecule so long as adequate complementarity determining regions (CDRs) are included to achieve the requisite specificity. In embodiments, the disclosure provides N-terminal amino acid binding agents in the form of peptide aptamers which generally comprise a variable peptide loop attached at both ends to a protein scaffold, wherein the combination of the loop and the scaffold imparts N-terminal amino acid binding specificity.

[0033] Also provided are nucleic acid-based novel N-terminal amino acid binding agents. These include but are not limited to RNA, DNA, hybrid RNA/DNA molecules, or XNA as a modified form of a polynucleotide. In embodiments, these agents comprise single-stranded DNA or RNA (ssDNA or ssRNA) molecules provided as aptamers. In embodiments, the modified nucleic acids are altered to, for example, be resistant to degradation, or for enhanced amino acid recognition, or a combination thereof. Modified nucleic acids include but are not necessarily limited to polynucleotides which comprise modified nucleotides, and/or modified phosphodiester linkages. In embodiments phosphodiester linkages are changed to a synthetic linkage, i.e., a linkage other than a phosphodiester linkage. Thus, polynucleotide based N-terminal amino acid binding agents can comprise, in non-limiting embodiments, an inter-nucleoside linkage that is an alkylphosphonate, phosphorothioate, phosphorodithioate, phosphate ester, alkylphosphonothioate, phosphoramidate, carbamate, carbonate, morpholino, phosphate triester, acetamidate, and/or carboxymethyl ester, or combinations thereof. Further, the polynucleotide based N-terminal amino acid binding agents can comprise modified nucleotides. For ribonucleotides, the modifications will generally be at the 2' position of the ribose and include but are not limited to 2'-O-methyl, 2'-O-(2-methoxyethyl), and 2'-O-(2-aminopropyl) modifications, and combinations thereof. In embodiments, the nucleic acid-based novel N-terminal amino acid binding agents are selected from aptamers, ribozymes, or modified riboswitches. In certain embodiments, a modified lysine riboswitch, which is a lysine-binding RNA molecule that regulates lysine biosynthesis, is used. In embodiments, a combination of a peptide and polynucleotide is used as an N-terminal amino acid binding agent.
In embodiments, if the N-terminal amino acid binding agent comprises a polynucleotide, the presence of the particular amino acid at the N-terminus can be established by determining all or part of the sequence of the polynucleotide.

[0034] All polynucleotides that are disclosed herein, all amino acid sequences, and polynucleotides encoding the peptide sequences are encompassed in this disclosure. Likewise, all methods of making the N-terminal amino acid binding agents, polynucleotides encoding the protein-based N-terminal amino acid binding agents, expression systems comprising those polynucleotides, and cell cultures comprising those expression vectors, are also included in this disclosure. In one aspect the disclosure includes expressing a recombinant, modified N-terminal amino acid binding agent in a cell culture, and separating the modified N-terminal amino acid binding agent from the cell culture to obtain an isolated and/or purified modified N-terminal amino acid binding agent.

[0035] In a further embodiment, kits for use in performing one or more methods or steps of this disclosure are provided. In general the kits comprise one or more containers, such as glass or plastic containers that can be sealed to contain, for example, one or more of the modified N-terminal amino acid binding agents. The kits can comprise up to 20 modified N-terminal amino acid binding agents provided in one or more of the containers. The binding agents can be provided in solution, such as in a stabilized buffer, or as a frozen solution, or they can be provided in a dry form to be reconstituted for use in the method. The kits can further include a solid substrate, such as a planar surface, or a spherical surface such as a plurality of beads, for use in the method. The solid substrate can be modified such that it is suitable for binding the C-termini of a plurality of polypeptides, as will be further outlined below. The kits can also include reagents for use in fixing polypeptides to the substrate, and/or for labeling and/or detecting the binding agents. Further, the binding agents may be provided pre-labeled for use in the method, and if desired segregated into groups of binding agents having distinct labels for stepwise determination of polypeptide sequences. The kits can include printed material providing instructions for using the modified N-terminal amino acid binding agents in a method of this disclosure.

[0036] The principle behind the single molecule polypeptide sequencing of this disclosure is illustrated in Figure 1, which provides a non-limiting and illustrative example of one embodiment. As outlined in Figure 1, a plurality of peptides from enzymatic digestion of a sample, such as a proteome sample, can be randomly immobilized on a glass surface through their C-terminus amino acids. In one embodiment, the N-terminus amino acids of these peptides are bound to detectably labeled proteins, each of which recognizes one of the 20 NAAs with high affinity and specificity. In embodiments, the proteins bind to the NAA with a nanomolar affinity. In embodiments, and without intending to be constrained by any particular theory, assuming diffusion limited on-rates of ~10⁶ M⁻¹ sec⁻¹, an exemplary minimum off-rate is less than 10⁻¹ to 10^[…] sec⁻¹.

[0037] In embodiments, 20 distinct N-terminus detectably labeled amino acid binding partners are used which allows for detection of the N-terminus of every bound peptide. In alternative embodiments, fewer than 20 distinct N-terminus detectably labeled amino acid binding partners are used, and sequential detection steps are employed.

[0038] In the embodiment shown in Figure 1, the binding partners are fluorescently labeled, and one or more fluorescent images of single molecule resolution is acquired by a CCD camera (illustrative Step 1). Those skilled in the art will recognize that, depending on the choice of signal, alternative detection methods can be used, such as fluorometers, scanning lasers, and microfluidic based imaging devices.

[0039] In the exemplary Figure 1, after fluorescently labeled proteins are washed off, the NAAs of peptides are removed by employing any suitable technique, such as the well-known Edman degradation chemistry to expose a new set of NAAs (illustrative Step 2). This procedure is repeated for multiple cycles. Finally, all acquired fluorescent images (or other signals) are assembled by, for example, a computer to identify the sequences of all peptides. As will be recognized by those skilled in the art, the iterative sequence approach of this disclosure can be integrated with a variety of devices for detection of the NAAs, and for separating the labeled NAAs from the bound peptides.
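
The iterative cycle of illustrative Steps 1 and 2 and the final assembly of per-cycle images into sequences can be sketched as follows. This Python example is an idealized simulation only: it assumes every binding agent reports its cognate N-terminal residue without error and every degradation step removes exactly one residue. The spot coordinates, peptide strings, and function name are hypothetical.

```python
def sequence_by_cycles(immobilized, n_cycles):
    """Simulate cyclic N-terminal readout.

    immobilized: dict mapping a surface position (spot) to a peptide written N->C;
                 the C-terminus is the end attached to the surface.
    Returns a dict mapping each spot to the sequence read over n_cycles.
    """
    remaining = dict(immobilized)
    images = []  # one simulated "image" per cycle: spot -> residue called at that spot

    for _ in range(n_cycles):
        # Step 1: labeled binding agents report the current N-terminal residue.
        images.append({spot: pep[0] for spot, pep in remaining.items() if pep})
        # Step 2: binding agents are washed off and the N-terminal residue is removed,
        # exposing the next residue in the N->C direction.
        remaining = {spot: pep[1:] for spot, pep in remaining.items()}

    # Assembly: concatenate, per spot, the residue called in each successive image.
    return {
        spot: "".join(image[spot] for image in images if spot in image)
        for spot in immobilized
    }

# Hypothetical surface with two immobilized peptides at (x, y) grid positions.
surface = {(0, 0): "TAYIAK", (3, 1): "QR"}
print(sequence_by_cycles(surface, n_cycles=4))
```

A spot whose peptide is shorter than the number of cycles simply stops contributing calls, which is how the sketch represents a fully consumed peptide.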

[0040] It will be recognized from Figure 1 that in various aspects of this disclosure, the peptides to be sequenced are placed in physical association with a substrate, illustrated as glass in Figure 1. The peptides may be covalently or non-covalently attached to the substrate, and in either case the attachment may be reversible or irreversible. The peptides are attached via their C-terminus, and are either attached directly to the substrate, or attached by an intermediate, such as a linking group, a functionalized group, or any other suitable composition of matter that will keep the peptide in place to perform the method of the invention. A linking group may be monofunctional, specifically binding a peptide to the surface, or bifunctional to include an additional moiety that could be used in proximity ligation-type assays that would increase the discrimination of a true NAA binding signal over non-specific background noise. In certain embodiments, the substrate, the peptide, or both may be functionalized to facilitate peptide attachment. The substrate may be blocked prior to attaching the peptides to reduce artifacts. In certain embodiments, with respect to peptide immobilization and glass surface chemistry, proteins can be extracted from cells and digested with proteases such as trypsin into short peptides. To covalently immobilize peptides on functionalized glass surfaces, in non-limiting embodiments, cysteine containing peptides can be specifically conjugated onto glass coated with a bromoacetyl group. Alternatively, the C-terminus lysine of trypsin digested peptides can be selectively immobilized on glass derivatized with a methylisourea functional group (see, e.g., Figure 4A). The N-terminus amine of a peptide, except for N-terminus glycine, typically does not react with methylisourea, therefore C-terminus specific immobilization via lysine can be used.
It is possible for peptides to adsorb nonspecifically to an untreated glass surface, leading to inaccessible N-termini. This can be reduced by, for example, using well-established surface derivatization methods to passivate the surface against non-specific protein adsorption. As one non-limiting example, coating a glass surface with polyethylene glycol can significantly reduce nonspecific binding and has been widely used in microarray-based technologies. The use of the high-affinity NAA-binding proteins described herein will also allow stringent washing steps to remove non-specifically bound NAA-binding proteins, if desired.

[0041] Those skilled in the art will recognize it is desirable to control the density of peptides on the substrate to achieve spacing comparable to the resolution of a fluorescence microscope, which is about 200 nm to 250 nm (0.20 μm to 0.25 μm). In embodiments, super-resolution fluorescence microscopes with 20 nm to 50 nm resolution may also be adapted for use with this disclosure. In certain embodiments, peptides will be immobilized at an average density of 4 peptides per 1 μm² by controlling the density of available functional groups on the glass surface. The distance between two peptides (500 nm) is large enough for visualization by a conventional optical system, while a 2 cm x 2 cm surface can hold up to 1.6 billion peptides.
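
The spacing and capacity figures given above follow from simple arithmetic, reproduced in the short Python sketch below. The sketch assumes, for illustration only, that peptides sit on an idealized square grid, so that the spacing is the square root of the area available per peptide. Actual immobilization is random, and the function names are hypothetical.

```python
import math

def mean_spacing_nm(density_per_um2):
    """Center-to-center spacing on a square grid (an assumption) at a given density."""
    # Each peptide occupies 1/density square micrometers; the grid pitch is its square root.
    return 1000.0 / math.sqrt(density_per_um2)

def surface_capacity(side_cm, density_per_um2):
    """Number of peptides on a square surface of the given side length."""
    side_um = side_cm * 1e4  # 1 cm = 10,000 micrometers
    return side_um ** 2 * density_per_um2

print(mean_spacing_nm(4))        # 500.0 nm between neighboring peptides
print(surface_capacity(2, 4))    # 1.6e9 peptides on a 2 cm x 2 cm surface
```

At 4 peptides per μm², the 500 nm spacing is roughly twice the 200 nm to 250 nm resolution quoted for a conventional fluorescence microscope.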

[0042] It will be recognized that, in order to detect an N-terminal amino acid, the N-terminal amino acid binding agents of this disclosure may be modified so that they are detectably labeled. Any moiety or other compound or composition that is capable of producing a detectable signal is included in this disclosure. In embodiments, the detectable label produces a fluorescent signal as generally outlined in Figure 1. In other embodiments, the N-terminal amino acid binding agents are adapted for detection by way of a proton or photon release assay, or a fluorescence detection assay such as Fluorescence Resonance Energy Transfer (FRET). The invention thus encompasses any detectable label, instrumentation, software, and devices that can be used to detect the signal from the detectable label.

[0043] In certain embodiments the NAA-binding partners are fused with fluorescent proteins, e.g., green fluorescent protein and its derivatives, red fluorescent protein and its derivatives, yellow fluorescent proteins and others known in the art. With respect to N-terminal amino acid binding partners labeled with fluorophores or fluorochromes, in certain embodiments, the labels comprise bright fluorophores that have adequate photostability, high quantum yields, and narrow emission spectra so that they can be used for certain embodiments that employ multicolor detection. In embodiments, the detectable label will comprise one or more organic dyes. In embodiments, the four fluorescent dyes used in well-known Sanger DNA sequencing methods can be used. These labels display high quantum yields, reasonable photostability, and well-resolved emission spectra for four-color detection. In this non-limiting approach, 20 NAA-binding proteins will be equally divided into five groups and the four proteins in each group will each be labeled with a different one of these organic dyes, thus meeting both the practical limit in number of available fluorophores as well as channels for single molecule detection. Because sequencing one position of all peptides needs to cover the entire 20 NAAs, incubations of peptides with all five sets of fluorescently labeled NAA-binding proteins after every cycle of Edman degradation will be used, while each group of proteins is simultaneously detected by the four-color mode. However, using a combinatorial two-color detection scheme by labeling each NAA-binding protein with two fluorophores, embodiments of the disclosure include detecting up to 10 NAA-binding proteins in a single cycle. In this regard, non-limiting embodiments of suitable detectable fluorescent moieties include use of one or a combination of Acridine dyes, Cyanine dyes, Fluorone dyes, Oxazine dyes, Phenanthridine dyes, or Rhodamine dyes. In particular embodiments, the detectable labels are Xanthene derivatives, including but not limited to fluorescein, rhodamine, Oregon green, eosin, and Texas red; Cyanine derivatives, including but not limited to cyanine, indocarbocyanine, oxacarbocyanine, thiacarbocyanine, and merocyanine; Squaraine derivatives and ring-substituted squaraines, including but not limited to Seta, SeTau, and Square dyes; Naphthalene derivatives; Coumarin derivatives; Oxadiazole derivatives, including but not limited to pyridyloxazole, nitrobenzoxadiazole and benzoxadiazole; Anthracene derivatives, including but not limited to anthraquinones, including DRAQ5, DRAQ7 and CyTRAK Orange; Pyrene derivatives, including but not limited to cascade blue; Oxazine derivatives, including but not limited to Nile red, Nile blue, cresyl violet, oxazine 170; Acridine derivatives, including but not limited to proflavin, acridine orange, acridine yellow; Arylmethine derivatives, including but not limited to auramine, crystal violet, malachite green; and Tetrapyrrole derivatives, including but not limited to porphin, phthalocyanine and bilirubin.
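The two labeling schemes described in paragraph [0043] can be sketched in a few lines of Python. This is a hedged illustration, not the disclosed implementation: the dye names are placeholders, the function names are hypothetical, and the two-color count of 10 is reproduced here by assuming that a binding protein may carry either two different dyes or two copies of the same dye (unordered pairs with repetition from four dyes), which is only one reading consistent with the number stated above.

```python
from itertools import combinations_with_replacement

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")        # the 20 standard residues
DYES = ["dye_1", "dye_2", "dye_3", "dye_4"]       # placeholder dye names


def four_color_groups(residues, dyes):
    """Split the binding proteins into groups, one dye per protein per group."""
    groups = []
    for start in range(0, len(residues), len(dyes)):
        chunk = residues[start:start + len(dyes)]
        groups.append(dict(zip(dyes, chunk)))     # dye -> residue recognized
    return groups


def two_color_codes(dyes):
    """All unordered dye pairs, including same-dye pairs (assumption)."""
    return list(combinations_with_replacement(dyes, 2))


groups = four_color_groups(AMINO_ACIDS, DYES)
codes = two_color_codes(DYES)
print(len(groups), "incubations per cycle with single-dye labels")
print(len(codes), "distinguishable two-dye codes per incubation")
```

Under these assumptions, single-dye labeling needs five incubations per Edman cycle to cover all 20 residues, while the two-dye scheme would cover them in two.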

[0044] In certain embodiments, in order to label the NAA-binding proteins that are RS or modified RS, the NAA-binding proteins can be expressed so that they contain an unnatural azide group at a position away from the NAA-binding site. This approach can genetically incorporate an unnatural amino acid bearing the azide into the NAA-binding protein site-specifically. This azide will allow the NAA-binding proteins to be labeled with organic dyes containing a terminal alkyne group through bio-orthogonal "click" chemistry (the reaction between azide and alkyne). The labeled proteins can be purified using any suitable techniques, such as on immobilized antibodies against fluorescent dyes.

[0045] In certain embodiments and as described above, the N-terminal amino acid binding agents can be detected by imaging a signal that can be interpreted by a machine, such as a camera or other detecting machine that is configured to function with, for example, a charge-coupled device (CCD), i.e., a CCD camera. In another approach a microfluidic fluorescence imager can be used. Thus, a confocal microscope coupled with a microfluidic liquid handling system can be adapted for use with the compositions and methods of this disclosure. Confocal imaging can detect fluorescence from a thin focal plane, so fluorescent signals of all peptides immobilized on the inner wall of a glass tube can be recorded by CCD cameras from different angles. This geometry facilitates more efficient design of a microfluidic system to deliver reagents and washing buffers at high speed and consume minimal amounts of fluorescently labeled NAA-binding proteins. In embodiments, certain aspects of a suitable microfluidic fluorescence imager can be adapted from, as one example, the single molecule DNA sequencer available from Helicos (www.helicosbio.com). For example, the Helicos DNA sequencer uses a single laser (635 nm) for the detection of four nucleotides one by one. For the peptide sequencer, multiple lasers (e.g. diode pumped solid state lasers at 488 nm, 589 nm, and 642 nm) and filters can be added for simultaneous detection of the four NAA-binding proteins, and to increase throughput and speed for sequencing each residue. Software can be developed to function with 20 amino acids, rather than four nucleotides. Thus, in various embodiments the disclosure includes a system for determining a plurality of polypeptide sequences. The system can comprise the NAAs, the substrate for immobilizing polypeptides, and/or one or more devices for detecting the NAAs when bound to the N-termini of immobilized peptides. The device(s) can be integrated with a digital processor and/or software to interpret the position, amount, frequency, etc. of the NAAs over one or successive rounds of NAA determination, and assemble data to generate contiguous amino acid sequences for the polypeptides in a sample. In embodiments the system can include a microfluidic component, and/or a device for capturing images.
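The data-assembly step described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that image analysis has already reduced each sequencing cycle to a mapping from a fixed spot coordinate to a one-letter residue call (or no call); the function name, the coordinate keys, and the use of "X" for an undetermined position are all hypothetical choices rather than features of the disclosure.

```python
def assemble_sequences(cycles):
    """Build one N->C read per immobilized peptide from per-cycle calls.

    cycles: list of dicts, one per Edman cycle, mapping a spot identifier
    (here an (x, y) tuple) to a one-letter residue call or None.
    """
    # Collect every spot seen in any cycle so reads stay position-aligned.
    spots = set()
    for calls in cycles:
        spots.update(calls)

    reads = {}
    for spot in sorted(spots):
        # A missing or empty call in a cycle is recorded as "X".
        reads[spot] = "".join(calls.get(spot) or "X" for calls in cycles)
    return reads


example_cycles = [
    {(0, 0): "M", (1, 5): "G"},    # cycle 1
    {(0, 0): "A", (1, 5): None},   # cycle 2: no confident call at (1, 5)
    {(0, 0): "S"},                 # cycle 3: spot (1, 5) not detected
]
reads = assemble_sequences(example_cycles)
print(reads)
```

Keeping an explicit placeholder for uncalled positions preserves the register of each read, so later residues are not shifted toward the N-terminus when a cycle fails for a given peptide.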

[0046] The present disclosure also contemplates addressing sequencing errors at the single molecule level caused by the combination of several factors, such as incomplete Edman degradation and possible posttranslational modifications. For small posttranslational modifications such as methylation, acetylation, and phosphorylation, specific NAA-binding proteins can be evolved for their recognition and used in the approaches of this disclosure.

[0047] In another aspect the present disclosure comprises recording of the amino acid present at the N-terminus of a plurality of peptides over successive rounds of sequencing as described above, and fixing such amino acid sequences in a tangible medium of expression, such as a digitized file. In embodiments, the disclosure includes generating a report that comprises such amino acid sequences. In embodiments the disclosure comprises conveying the report to a third party, for example, a party from which a protein-containing sample is received and analyzed according to the method of this disclosure. Thus, in embodiments, the disclosure includes receiving a protein-containing sample, analyzing the sample to determine amino acid sequences in the sample, generating a report describing the amino acid sequences, and communicating the amino acid sequences and/or the report to a party.

[0048] With respect to particular specific NAA-binding proteins, representative examples are described below. In certain embodiments, additional agents will be derived from the catalytic domain of other RS by affinity selection. A representative example of an RS is shown in Figure 2 where the active site of a methionyl-tRNA synthetase is occupied by a Met residue. Both the α-amine and the side chain of Met are buried deeply, while its carboxylate is more exposed to the surface. Without intending to be bound by any particular theory, it is considered that as a result, these enzymes specifically interact with the NAA of a peptide, but not the same residues embedded inside the peptide, because only the NAA has a free α-amine, which avoids a spatial clash. However, many wild-type aminoacyl-tRNA synthetases are not adequate for use with the instant disclosure due to the lack of required stringent specificity and affinity. For example, since Val and Ile are similar amino acids, isoleucyl-tRNA synthetase can mischarge the Ile-specific tRNA with a Val at ~1 in 200 events, although other proofreading mechanisms of the enzyme can reduce this mis-incorporation rate to ~1 in 3000. To overcome this problem, we have employed phage display libraries and screened mutants of aminoacyl-tRNA synthetases to select sequence variants with enhanced specificity and affinity for use as the NAA-binding proteins. As an illustration of one approach, a phage display library based on M13 bacteriophage, a bacterial virus composed of a circular single-strand DNA, is shown in Figure 3. When an exogenous gene is inserted into the M13 DNA, it can be expressed as a fusion protein on the surface of M13 phage. As a result, when a library of genes is engineered into M13 DNA, a library of corresponding proteins can be produced, each of which is present on an individual phage surface as multiple copies and can be identified by sequencing the insertion on phage DNA. For instance, the gene encoding the catalytic domain of the tyrosyl-tRNA synthetase and its mutants at random positions, which are generated by error-prone PCR and DNA shuffling, can be inserted into M13 DNA to make a library of tyrosyl-tRNA synthetase carrying random point mutations. This library can be incubated with isoleucine immobilized on solid support beads through its carboxylate. After unbound phages are washed away, the remaining phages are stripped off from beads, amplified by infecting bacterial cells, and subjected to the next round of selection. Free phenylalanine, which is quite similar to tyrosine, can be added into the incubation solution as a competing agent to remove phages capable of binding with N-terminus phenylalanine. This multi-cycle scheme will improve the affinity of tyrosyl-tRNA synthetase variants to N-terminus tyrosine, while reducing their binding to N-terminus phenylalanine. After the selection is done, the sequences of the tyrosyl-tRNA synthetase mutants with the improved specificity and affinity required by single molecule peptide sequencing are determined by DNA sequencing. The corresponding protein is then expressed in bacterial cells. This protocol will be applied to other aminoacyl-tRNA synthetases to improve their performance as well. As proof of principle, we have developed representative modified RS and demonstrated selective binding to cognate amino acids. As will be recognized from the figures and data presented herein, and in particular Figures 6A, 7 and 8, only some mutations to the wild type RS sequence improve amino acid specificity. As an additional proof of principle, we also have produced an scFv that has specificity for N-tyrosine over C-tyrosine.

[0049] The following Examples are intended to illustrate certain embodiments of the disclosure but are not intended to be limiting.

Example 1

[0050] Expression of various tRNA synthetases on M13 phage surface

[0051] The catalytic domains of several tRNA synthetases have been successfully expressed on M13 phage surface. As shown in Figure 4B, M13 phages expressing wild-type Flag (sequence: DYKDDDDK (SEQ ID NO: 10)) tagged TyrRS-pIII, PheRS-pIII or LeuRS-pIII were serially diluted to several concentrations as indicated (pfu/mL, plaque forming unit/mL). Then 100 μl of these diluted phage solutions for each aaRS were used to coat individual wells on a 96-well plate, washed and then incubated with HRP-conjugated anti-Flag antibody. M13K07 helper phage was also included as a negative control. TMB substrate was added and the absorbance at 450 nm (OD450) was monitored to detect the level of Flag-RS-pIII fusion proteins on M13 phage surface.

[0052] Expression of wild-type PheRS-pIII on M13 phage surface with detectable affinity and specificity towards N-terminal phenylalanine

[0053] Thanks to the diverse structures and functions of all naturally occurring aaRS, wild-type enzymes may exhibit different degrees of specificity and affinity towards their respective N-terminal amino acids. When TyrRS, PheRS, and LeuRS expressed on M13 phage surface were tested for their ability to interact with immobilized peptides with cognate N-terminal amino acids, they indeed displayed such variation. For instance, wild-type PheRS can specifically bind to N-terminal phenylalanine, but not leucine (Figure 5A), while neither TyrRS nor LeuRS showed such specific binding. By using a directed evolution approach we generated an improved reagent, starting with the E. coli wild-type TyrRS sequence having the amino acid sequence: MASS LIKQLQERGLVAQVTDEEALAERLAQGPIALYCGFDPTADSLHLGHLVPLLCLK RFQQAGHKPVALVGGATGLIGDPSFKAAERKLNTEETVQEWVDKIRKQVAPFLDFDCG ENS AIAANNYDWFG MNVLTFLRDIGKHF SVNQMINKEAVKQRLNREDQGISFTEF S Y NLLQGYDFACLNKQYGVVLQIGGSDQWGNITSGIDLTRRLHQNQVFGLTVPLITKADG TKFGKTEGGAVWLDPKKTSPYKFYQFWINTADADVYRFLKFFTFMSIEEINALEEEDKN SGKAPRAQYVLAEQVTRLVHGEEGLQAAKRITECLFSGSLSALSEADFEQLAQDGVPM VEMEKGADLMQALVDSELQPSRGQARKTIASNAITINGEKQSDPEYFFKEEDRLFGRFT LLRRGKKNYCLICWK (SEQ ID NO: 1).

[0054] Although wild-type TyrRS did not show specific binding to N-terminal tyrosine, directed evolution with an M13-expressed TyrRS library was successfully performed to improve its affinity. After three cycles of panning, the binding of the TyrRS library towards N-terminal tyrosine became apparent. Six clones were sequenced and their respective fusion proteins were tested. One of them, H2 with mutations of L47N, V53D, F61C, D115V, Y301C, demonstrated specific binding towards N-terminal tyrosine over leucine (Figure 6), suggesting that phage-display based affinity selection was indeed able to enhance the binding of aaRS. Thus, a modified TyrRS having the sequence of SEQ ID NO: 1 or a Tyr binding fragment thereof, comprising any one or any combination of L47N, V53D, F61C, D115V, Y301C changes is included in this disclosure.
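Point mutations in the notation used above (wild-type residue, 1-based position, replacement residue) can be applied to a sequence programmatically, and the phrase "any one or any combination" can be enumerated. The Python sketch below is illustrative only: the function names are hypothetical, and a short made-up toy sequence is used rather than SEQ ID NO: 1.

```python
import re
from itertools import combinations

MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def apply_mutations(sequence, mutations):
    """Apply substitutions such as 'L47N' (1-based) to a protein sequence."""
    residues = list(sequence)
    for mutation in mutations:
        match = MUTATION_RE.match(mutation)
        if match is None:
            raise ValueError(f"unrecognized mutation format: {mutation}")
        wild_type, position, replacement = (
            match.group(1), int(match.group(2)), match.group(3))
        if not 1 <= position <= len(residues):
            raise ValueError(f"{mutation}: position outside the sequence")
        # Guard against numbering errors by checking the expected residue.
        if residues[position - 1] != wild_type:
            raise ValueError(f"{mutation}: found {residues[position - 1]}")
        residues[position - 1] = replacement
    return "".join(residues)


def all_combinations(mutations):
    """Every non-empty subset of the listed mutations."""
    return [subset
            for size in range(1, len(mutations) + 1)
            for subset in combinations(mutations, size)]


TOY_SEQUENCE = "MKTLV"  # hypothetical five-residue sequence
mutant = apply_mutations(TOY_SEQUENCE, ["L4N", "V5D"])
h2_changes = ["L47N", "V53D", "F61C", "D115V", "Y301C"]
variants = all_combinations(h2_changes)
print(mutant, len(variants))
```

For the five H2 changes listed in paragraph [0054], the enumeration yields 31 distinct non-empty combinations.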

Example 2

[0055] We constructed PheRS mutants. The wild type E. coli PheRS sequence is:

MSHLAELVASAKAAISQASDVAALDNVRVEYLGKKGHLTLQMTTLRELPPEERPAAGA VINEAKEQVQQALNARKAELESAALNARLAAETIDVSLPGRRIENGGLHPVTRTIDRIES FFGELGFTVATGPEIEDDYHNFDALNIPGHHPARADHDTFWFDTTRLLRTQTSGVQIRT MKAQQPPIRIIAPGRVYRNDYDQTHTPMFHQMEGLIVDTNISFTNLKGTLHDFLRNFFEE DLQIRFRPSWPFTEPSAEVDVMGKNGKWLEVLGCGMVHPNVLRNVGIDPEVYSGFAF GMGMERLTMLRYGVTDLRSFFENDLRFLKQFK (SEQ ID NO:2). As shown in Figure 8, two clones, A1 and C2, showed enhanced properties over the wild type sequence. The changes are, in A1: M178I and D229V, and in C2: L227R. Thus, the disclosure includes a modified PheRS comprising any one or any combination of these three mutations, or a Phe binding fragment thereof.

Example 3

[0056] We constructed LeuRS mutants. The E. coli wild type LeuRS sequence is: MNNPGnSTSSARKAVLTRAFGLCYADLK HINATFVAVLKTGPLAAMQEQYRPEEIES K VQLHWDEKRTFE VTEDE SKEK Y YCL SMLP YP S GREHMGHVRN YTIGD VI AR YQRML GKNVLQPIGWDAFGLPAEGAAVKNNTAPAPWTYDNIAYMKNQLKMLGFGYDWSREL ATCTPEYYRWEQKFFTELYKKGLVYKKTSAVNWCP DQTVLA EQVIDGCCWRCDTK VERKEffQWFIKITAYADELL DLDKLDHWPDTVKTMQRNWIGRSEGVEITFNV DYD NTLTVYTTRPDTFMGCTYLAVAAGHPLAQKAAEN PELAAFIDECRNTKVAEAEMAT MEKKGVDTGFKAVHPLTGEEIPVWAA FVLMEYGTGAVMAVPGHDQRDYEFASKYG LNIKPVILAADGSEPDLSQQALTEKGVLFNSGEFNGLDHEAAFNAIADKLTAMGVGERK VNYRLRDWGVSRQRYWGAPIPMVTLEDGTVMPTPDDQLPVILPEDVVMDGITSPIKAD PEWAKTTVNGMPALRETDTFDTFMESSWYYARYTCPEYKEGMLDSKAANYWLPVDIY IGGffiHAIMHLLYFRFFHKLMRDAGMVNSDEPAKQLLCQGMVLADAFYYVGENGERN WVSPVDAIVERDEKGRIVKAKDAAGHELVYTGMSKMSKSKNNGIDPQVMVERYGADT VRLFMMFASPADMTLEWQESGVEGA RFLKRVWKLVYEHTAKGDVAALNVDALTED QKALRRDVHKTIAKVTDDIGRRQTFNTAIAAIMELMNKLAKAPTDGEQDRALMQEALL A VVRML PF TPHICF TL WQELKGEGDIDN AP WP V ADEK AM VED S TL V V VQ VNGK VR A KITVPVDATEEQVRERAGQEHLVAKYLDGVTVRKVIYVPGKLL LVVG (SEQ ID NO:9)

[0057] As shown in Figure 9, mutant D2 comprising the I496F change has enhanced Leu binding properties. Thus, the disclosure includes a modified LeuRS comprising this mutation, or a Leu binding fragment thereof.

Example 4

[0058] Affinity maturation of a scFv preferentially binding to N-tyrosine over C-tyrosine.

[0059] scFv antibody, an engineered form of native IgG antibody, has broad applications due to its small size and ease of affinity selection. We used a naive phage-display scFv library to select scFv that can bind with N-terminal tyrosine. One of the clones, p807/C2, indeed exhibited preferential binding towards N-tyrosine over C-tyrosine (Figure 7), indicating that phage-displayed scFv antibodies may complement aaRS as NAA-binding reagents. The CDR sequences of Clone 807/C2 are: (L1) SGDALPKKYAY (SEQ ID NO:3); (L2) EDVKRLS (SEQ ID NO:4); (L3) YSNSKTGNYNV (SEQ ID NO:5); (H1) GYTFTDYWIS (SEQ ID NO:6); (H2) QIAMTNSATVYGPSFQG (SEQ ID NO:7); (H3) DYSDNYYNDTYS (SEQ ID NO:8).

Binding partners including any one or any combination of these distinct amino acid sequences are encompassed in this disclosure, including but not limited to binding partners that include L1 and H1, L1 and H2, L1 and H3, L2 and H1, L2 and H2, L2 and H3, L3 and H1, L3 and H2, and L3 and H3.

[0060] While the invention has been described through specific embodiments, routine modifications will be apparent to those skilled in the art and such modifications are intended to be within the scope of the present invention.

Claims

What is claimed is:

1. A method for determining amino acid sequences of a plurality of polypeptides comprising: i) associating a plurality of polypeptides with a substrate via the C-termini of the polypeptides; ii) forming a complex between N-termini of at least some of the polypeptides with one or more detectably labeled N-terminal amino acid binding agents; iii) detecting the detectably labeled N-terminal amino acid binding agents to determine the N-terminal amino acid of the polypeptides and recording the N-terminal amino acid; iv) removing the N-terminal amino acid from the polypeptide; vi) repeating steps i)-iv) to determine contiguous amino acids of the polypeptides.

2. The method of claim 1, wherein more than one of the detectably labeled N-terminal amino acid binding agents are used in ii).

3. The method of claim 1, further comprising generating a report comprising a description of the contiguous amino acid sequences determined in step vi).

4. The method of claim 1, wherein the plurality of polypeptides comprises a population of polypeptides obtained from in vitro enzymatic digestion of a pool of polypeptides.

5. The method of claim 1, wherein at least one of the N-terminal amino acid binding agents comprises a sequence selected from SEQ ID NO:3, SEQ ID NO:4, SEQ ID NO:5, SEQ ID NO:6, and SEQ ID NO:8, the sequence of SEQ ID NO: 1 comprising any one or any combination of changes selected from the group consisting of L47N, V53D, F61C, D115V, Y301C, or the sequence of SEQ ID NO:2, comprising any one or any combination of changes selected from the group consisting of M178I, D229V, and L227R, or the sequence of SEQ ID NO:9, comprising an I496F change.

6. The method of any one of claims 1-5, wherein the plurality of polypeptides comprises all or a portion of the polypeptides in or derived from a biological sample.

7. The method of claim 6, further comprising generating a report comprising a description of the contiguous amino acid sequences.

8. A complex comprising at least one polypeptide and an N-terminal amino acid binding agent which comprises a sequence selected from SEQ ID NO:3, SEQ ID NO:4, SEQ ID NO:5, SEQ ID NO:6, and SEQ ID NO:8, the sequence of SEQ ID NO: 1 comprising any one or any combination of changes selected from the group consisting of L47N, V53D, F61C, D115V, Y301C, or the sequence of SEQ ID NO:2, comprising any one or any combination of changes selected from the group consisting of M178I, D229V, and L227R, or the sequence of SEQ ID NO:9, comprising an I496F change.

9. The complex of claim 8, wherein the polypeptide is in physical association with a solid substrate.

10. The complex of claim 9, wherein the N-terminal amino acid binding agent is detectably labeled.

11. The complex of claim 9, wherein the polypeptide was obtained from an in vitro enzymatic digestion of a pool of polypeptides.

12. The complex of claim 9, wherein the pool of polypeptides comprised all or a portion of polypeptides in or derived from a biological sample.

13. The complex of any one of claims 8-13, wherein the complex is in an array comprising additional complexes each comprising a polypeptide and the N-terminal binding agent, wherein the N-terminal binding agent is detectably labeled.

14. A recombinant protein comprising an amino acid sequence selected from the sequence of SEQ ID NO:3, SEQ ID NO:4, SEQ ID NO:5, SEQ ID NO:6, and SEQ ID NO:8, the sequence of SEQ ID NO: 1 comprising any one or any combination of changes selected from the group consisting of L47N, V53D, F61C, D115V, Y301C, or the sequence of SEQ ID NO:2, comprising any one or any combination of changes selected from the group consisting of M178I, D229V, and L227R, or the sequence of SEQ ID NO:9, comprising an I496F change.

15. A kit for performing the method of claim 1, the kit comprising at least one sealed container comprising at least one of the N-terminal amino acid binding agents.

16. The kit of claim 15, wherein the at least one of the N-terminal amino acid binding agents is detectably labeled, or wherein at least one detectable label is provided with the kit.

17. The kit of claim 15, further comprising a solid substrate suitable for attachment of polypeptides.

18. The kit of any one of claims 15-17, wherein the at least one of the N-terminal amino acid binding agents comprises a sequence selected from the sequence of SEQ ID NO:3, SEQ ID NO:4, SEQ ID NO:5, SEQ ID NO:6, and SEQ ID NO:8, the sequence of SEQ ID NO: 1 comprising any one or any combination of changes selected from the group consisting of L47N, V53D, F61C, D115V, Y301C, or the sequence of SEQ ID NO:2, comprising any one or any combination of changes selected from the group consisting of M178I, D229V, and L227R, or the sequence of SEQ ID NO:9, comprising an I496F change.

19. The kit of claim 18, comprising at least two of the N-terminal amino acid binding agents.

20. The kit of claim 18, further comprising printed material describing use of N-terminal amino acid binding agents in a process for determining amino acid sequences of a plurality of polypeptides.