
WO2025207810A1 - Oligonucleotides, sequences, methods, and systems thereof for analyzing nucleic acid molecules - Google Patents

Oligonucleotides, sequences, methods, and systems thereof for analyzing nucleic acid molecules

Info

Publication number
WO2025207810A1
Authority
WO
WIPO (PCT)
Prior art keywords
nucleotide
nucleotides
energy
hamming
optimized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2025/021612
Other languages
French (fr)
Inventor
Wei Chen
Ashwin Gopinath
David Yu Zhang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Biostate Ai Inc
Original Assignee
Biostate Ai Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Biostate Ai Inc


Classifications

    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12NMICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09Recombinant DNA-technology
    • C12N15/10Processes for the isolation, preparation or purification of DNA or RNA
    • C12N15/1096Processes for the isolation, preparation or purification of DNA or RNA cDNA Synthesis; Subtracted cDNA library construction, e.g. RT, RT-PCR
    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6806Preparing nucleic acids for analysis, e.g. for polymerase chain reaction [PCR] assay

Definitions

  • the present disclosure relates generally to oligonucleotides, sequences, methods, and systems for analyzing nucleic acid molecules, and more specifically for preparing next generation sequencing libraries, such as those used for determining RNA sequences.
  • RNAseq Next generation sequencing
  • biological samples such as blood and tissue.
  • RNAseq provides more information at higher accuracy than older expression profiling technologies such as microarrays, in part because RNAseq samples from nearly all RNA molecules in a solution, without a complete prior knowledge of the RNA species in the solution (e.g., RNA splice isoforms).
  • NGS libraries such as RNAseq libraries
  • command high costs, can be extremely labor intensive, provide limited throughput, and may not accurately reflect the expression of different RNA species within the sample.
  • drawbacks are not limited to the preparation of only RNAseq libraries, but concern NGS library preparation methods, in general.
  • manual NGS library preparation can typically take two days for processing only sixteen samples.
  • critical reagents for purifying the NGS libraries such as SPRI purification beads, can be exorbitantly expensive.
  • the molecular tools used in NGS methods can be susceptible to off-target hybridization, such as to unintended regions of the genome or the transcriptome, or themselves (e.g., self-binding).
  • some NGS methods such as RNAseq, have primarily been performed at a small scale. Indeed, according to public datasets on the GEO database, only tens to hundreds of samples are sequenced per RNA sequencing study. In general, improved methods are needed for increasing the scalability and efficiency of NGS library preparation. For example, improvements in the molecular tools used for NGS library preparation would be highly beneficial. The present disclosure addresses these needs.
  • NGS libraries, such as RNAseq libraries.
  • Existing methods for preparing NGS libraries can be costly, labor intensive, and low throughput. These drawbacks stem, in part, from the use of expensive and inefficient reagents, as well as workflows that are difficult to scale with increased sample size.
  • the methods and systems herein describe oligonucleotides and nucleic acid sequences, and methods and systems relating to the oligonucleotides and sequences, that improve multiple aspects of NGS library preparation.
  • an oligonucleotide comprising an engineered hairpin structure that mitigates hybridizing of the oligonucleotide to undesirable nucleic acid sequences.
  • error-checking barcode nucleotide sequences based on Hamming codes are described.
  • energy-optimized mixamers featuring more uniform Gibbs standard free energies of binding to their corresponding nucleic acid targets, relative to traditional hexamer sequences, are disclosed.
  • the combined or non-combined tools can greatly improve the costs, scalability, and efficiency of NGS library preparation, e.g., RNAseq library preparation.
  • Methods and systems of generating and/or using the oligonucleotides, error-checking barcodes, and energy-optimized mixamers are also disclosed.
  • an oligonucleotide comprising a helper sequence that comprises a hairpin structure configured to mitigate untargeted hybridizing of a hybridizing region to an off-target nucleotide sequence.
  • a Hamming barcode sequence comprising information-encoding nucleotides and error-checking nucleotides and configured to check for nucleotide errors in the Hamming barcode sequence
  • the Hamming barcode sequence is based on a system of linear equations, values of the system of linear equations comprising: predetermined information values assigned to the information-encoding nucleotides, expected predetermined constant values for the system of linear equations, and error-checking values assigned to the error-checking nucleotides, based on solving the system of linear equations comprising the predetermined information values and the predetermined constant values.
  • a unifying key advantage of the embodiments described herein is that in the case of processing RNA molecules for NGS library preparation, e.g., RNAseq, the barcoding reaction can be performed directly on the RNA molecules.
  • oligonucleotides e.g., primers
  • the oligonucleotides used to bind to a plurality of RNA molecules are biased for hybridizing to certain RNA sequences, and are prone to forming unintended secondary structures that prevent at least a portion of the oligonucleotide from binding to the RNA molecules, thus further biasing the hybridizing.
  • a considerable proportion of the RNA-binding oligonucleotide sequences can display unfavorable binding kinetics or thermodynamics to the RNA molecules, which also contributes to biases when preparing the RNAseq libraries.
  • any of the embodiments described herein addresses the biased hybridizing of oligonucleotides to RNA molecules for e.g., RNAseq processing.
  • the embodiments described herein can be used to improve the efficiency and costs of NGS library preparation, in general, and are not limited to only RNAseq library processing.
  • an oligonucleotide comprising a helper sequence that comprises a hairpin structure configured to mitigate untargeted hybridizing of a hybridizing region to an off-target nucleotide sequence.
  • the hairpin structure can comprise a barcode sequence, such as a Hamming barcode sequence.
  • the Hamming barcode sequence includes biological sequences, such as nucleotide sequences.
  • the barcode sequence can reside in the loop portion (e.g., the non-palindromic sequence portions) of the helper sequence. By placing the barcode sequence in the loop of the hairpin structure, the barcode sequence cannot bias the hybridizing of the oligonucleotide to nucleic acid targets.
  • the stem of the hairpin structure can comprise a universal sequence that is common to multiple oligonucleotide sequences. That is, multiple oligonucleotides can comprise a common universal sequence that allows for the self-hybridizing necessary for generating the stem of the hairpin structure, whereas the sequence that resides in the loop of the hairpin structure, e.g., the barcode sequence, can vary across oligonucleotides.
  • the oligonucleotide can also comprise energy-optimized mixamers as the target nucleotide-binding sequences. Described herein also are methods of using the oligonucleotide for NGS library preparation, e.g., RNAseq library preparation.
  • the methods can comprise directly reverse transcribing RNA transcripts using primers comprising the disclosed oligonucleotides.
  • the methods can also comprise reverse transcribing RNA transcripts using primers that do not comprise oligonucleotides comprising the helper sequences, but do comprise DBCO functional groups or azide groups.
  • the reverse transcribed DBCO-functionalized cDNA molecules or DBCO-functionalized barcode molecules can then ligate azide-functionalized barcodes or azide-functionalized cDNA molecules, which can be Hamming barcodes, via a click chemistry reaction.
  • the resulting click chemistry products can comprise barcoded cDNA molecules.
  • the click chemistry is implemented with DBCO on the mixamer primer and azide on the sample barcode.
  • the click chemistry is implemented with azide on the mixamer primer and DBCO on the sample barcode.
  • the information-encoding nucleotides can encode the nucleotides used for the actual barcode sequence, e.g., the sequence that the error-checking nucleotides are designed to protect, where the protecting refers to being able to detect or correct an error if one of the information-encoding nucleotides erroneously changes.
  • the system of linear equations can comprise a system of modular mathematics linear equations.
  • any one or more of the embodiments of the present disclosure can, but need not, be combined to improve the preparing of NGS libraries. Either alone or in combination, the embodiments of the present disclosure improve the scalability, costs, labor intensity, or accuracy of NGS library preparation, including the preparation of RNAseq libraries.
  • reagents and methods for affordable and high-throughput NGS library preparation from RNA samples by integrating sample-specific barcodes either during or immediately after the reverse transcription step. Barcoded cDNA products are pooled before any subsequent purification, adaptor ligation, or other steps, avoiding significant labor and reagent costs inherent in standard workflows.
  • the universal first sequence exhibits a G/C content of between 40% and 100%.
  • the primer set comprises at least 8 distinct BIRT primer pairs each with a distinct barcode sequence.
  • the primer set comprises at least 24 distinct BIRT primer pairs each with a distinct barcode sequence.
  • the barcode sequences have pairwise Hamming distances of at least 2 nucleotides for all possible pairs.
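The pairwise-distance requirement above can be checked mechanically. The following is a minimal sketch, not the applicant's implementation; the function names and the four example barcodes are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes: list[str]) -> int:
    """Smallest Hamming distance over all possible barcode pairs."""
    return min(hamming_distance(a, b) for a, b in combinations(barcodes, 2))


# Hypothetical 4-nt barcodes (illustration only, not taken from the disclosure).
example_barcodes = ["aaaa", "aagg", "ggaa", "gggg"]

# A set satisfies the stated requirement when the minimum is at least 2.
meets_requirement = min_pairwise_distance(example_barcodes) >= 2
```

A minimum distance of 2 means that any single-nucleotide error turns a valid barcode into a sequence that is not another valid barcode, so the error can at least be detected.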
  • a method for preparing a barcoded RNA sequencing library from a set of at least 24 biological samples comprising: (1) extracting RNA from each of the biological samples into a separate container for each RNA sample; (2) mixing each RNA sample with a different BIRT primer, reverse transcriptase, and aqueous buffers/reagents to enable reverse transcription; (3) incubating the resulting mixture to allow reverse transcription; (4) pooling all reverse transcription reaction products in a single container to form the barcoded RNA sequencing library, wherein the BIRT primers are as described herein.
  • step (6) of adding adapters and index sequences is performed using ligation. In other variations, step (6) of adding adapters and index sequences is performed using PCR.
  • a method for preparing a barcoded RNA sequencing library from a set of at least 8 biological samples comprising: (1) extracting RNA from each of the biological samples into a separate container for each RNA sample; (2) mixing each RNA sample with a DBCO-functionalized degenerate oligonucleotide primer or an azide-functionalized degenerate oligonucleotide primer, reverse transcriptase, and aqueous buffers/reagents to enable reverse transcription; (3) incubating the resulting mixture to allow reverse transcription; (4) adding to each mixture a distinct azide-functionalized barcode oligonucleotide or a distinct DBCO-functionalized barcode oligonucleotide; (5) incubating the resulting mixture to allow click chemistry to proceed; and (6) pooling all reverse transcription reaction products in a single container to form the barcoded RNA sequencing library.
  • the method further comprises: (5.5) addition of free azide (e.g. sodium azide) or an azide-functionalized molecule that is distinct from the azide-functionalized barcode oligonucleotides, after the click chemistry incubation step (5) and before the pooling step (6).
  • “About” and “approximately” shall generally mean an acceptable degree of error for the quantity measured given the nature or precision of the measurements. Exemplary degrees of error are within 20 percent (%), typically, within 10%, and more typically, within 5% of a given value or range of values. Exemplary degrees of error may also include an absolute value range for values, such as within 1 kcal/mol or within 0.5 kcal/mol for Gibbs standard free energies.
  • the terms “comprising” (and any form or variant of comprising, such as “comprise” and “comprises”), “having” (and any form or variant of having, such as “have” and “has”), “including” (and any form or variant of including, such as “includes” and “include”), or “containing” (and any form or variant of containing, such as “contains” and “contain”), are inclusive or open-ended and do not exclude additional, un-recited additives, components, integers, elements, or method steps.
  • ordinal terms such as “first”, “second”, “third”, etc., in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another or the temporal order in which acts of a method are performed, but are used merely as labels to distinguish one claim element having a certain name from another element having a same name (but for use of the ordinal term) to distinguish the claim elements.
  • use of a), b), etc., or i), ii), etc. does not by itself connote any priority, precedence, or order of steps in the claims.
  • the use of these terms in the specification does not by itself connote any required priority, precedence, or order.
  • Oligonucleotides and sequences for analyzing nucleic acid molecules are provided.
  • compositions for analyzing nucleic acid molecules including: an oligonucleotide comprising a helper sequence that comprises a hairpin structure configured to mitigate untargeted hybridizing of a hybridizing region to an off-target nucleotide sequence; a Hamming barcode sequence comprising information-encoding nucleotides and error-checking nucleotides and configured to check for nucleotide errors in the Hamming barcode sequence, wherein the Hamming barcode sequence is based on a system of linear equations, values of the system of linear equations comprising: predetermined information values assigned to the information-encoding nucleotides, expected predetermined constant values for the system of linear equations, and error-checking values assigned to the error-checking nucleotides, based on solving the system of linear equations comprising the predetermined information values and the predetermined constant values.
  • Oligonucleotide comprising a helper sequence
  • the hairpin structure of the helper sequence can comprise exclusively or near exclusively nucleotides from the helper sequence. That is, the helper sequence is, for the most part, not bound to other portions of a broader nucleotide sequence that the helper sequence may be a member of.
  • the helper sequence is part of a primer comprising both the helper sequence and a region that hybridizes to a target nucleotide sequence, e.g., an energy-optimized mixamer
  • the helper sequence’s hairpin structure is restricted to being comprised of only nucleotides from the helper sequence, and is, for the most part, not comprised of nucleotides from the region that hybridizes to the target nucleotide sequence, e.g., the energy-optimized mixamer.
  • the hairpin structure may comprise nucleotides that are not from the helper sequence, provided that those nucleotides do not prohibit the function of the broader sequence, e.g., primer sequence, that the helper sequence is a part of.
  • the hairpin structure of the helper sequence can comprise a single nucleotide from the region of the primer that hybridizes to the target nucleotide sequence, provided that the single nucleotide being part of the hairpin structure does not prohibit the primer from hybridizing to the target nucleotide sequence.
  • a nucleotide mismatch in the hairpin structure can refer to two nucleotide bases in the hairpin structure that are positioned in direct physical opposition to each other, but are not complementary to each other.
  • the nucleotide mismatch can result in a bulge in the hairpin structure.
  • the bulge can also result when the two opposing sequences in the stem of the hairpin structure are of unequal lengths, such that most of the nucleotides of one sequence are complementary to most of the nucleotides of the opposing sequence, except for one or more nucleotides.
  • Those one or more nucleotides that are not complementarily hybridized to the opposing sequence can be considered to be a part of the bulge.
  • the information-encoding nucleotides can comprise a cluster in the Hamming barcode sequence.
  • the cluster can refer to most of the information-encoding nucleotides being adjacent to one another. In some instances, the clustering together can refer to all of the information-encoding nucleotides being adjacent to one another.
  • the cluster from the clustering together of the information-encoding nucleotides can be found on one end of a sequence, such as the Hamming barcode sequence. See FIG. 11B for a schematic depiction of Hamming barcode sequences with clustered information-encoding nucleotides and clustered error-checking nucleotides.
  • the energy-optimized mixamer can be a nucleotide sequence.
  • the equilibrium Gibbs standard free energy can refer to the Gibbs standard free energy at equilibrium at a prespecified temperature and salinity conducive to reverse transcription.
  • the thresholds used to select the energy-optimized mixamers (that is, the mean equilibrium Gibbs standard free energy threshold and the standard deviation equilibrium Gibbs standard free energy threshold) need not be predetermined thresholds. Instead, the thresholds can be determined based on the properties of the distribution of predicted equilibrium Gibbs standard free energies for the mixamer sequences.
  • FIG. 6A provides a schematic overview of current standard RNAseq library preparation.
  • Samples 1, 2, through N are subject to individual adaptor ligating and indexing reactions per RNA sample, and the indexed reactions can be performed immediately after, per RNA sample, to generate indexed RNAseq libraries.
  • the RNAseq libraries can then be pooled together, enriched for the RNA molecule types of interest, and sequenced.
  • FIG. 6B provides a schematic overview of an RNAseq library preparation according to one or more embodiments in the present disclosure.
  • the RNA transcripts are directly barcoded, before any intermediate reaction, during the reverse transcription process.
  • the barcoding reaction is not limited to a single RNA sample. Instead, a single barcoding reaction can be used across multiple samples, such as samples 1 to 100, as shown in FIG. 6B.
  • scaling the preparation of RNAseq libraries is improved according to the embodiments described herein.
  • the workflow shown in FIG. 6B scales the indexing reactions as a function of a group of barcoded samples.
  • for RNA samples 1-100, a single indexing reaction can be performed, and for samples 101-200, another indexing reaction is performed.
  • Indexing 200 RNA samples according to conventional methods, as shown in FIG. 6A, requires 200 indexing reactions.
  • the barcodes depicted in FIG. 6B which are aligned with the embodiments provided herein, are separate from standard NGS indices. Therefore, the number of identifiers per RNA molecule across one or more samples can be combinatorially expanded, based on both traditional NGS indices and barcodes directly integrated via reverse transcription.
  • 96 distinct primers configured to directly reverse transcribe RNA molecules can be used in conjunction with 12 indices, to result in an analysis directed to 1152 RNA samples, within a single NGS run.
  • 384 distinct primers configured to directly reverse transcribe RNA molecules can be used in conjunction with 24 indices, to result in an analysis directed to 9216 RNA samples, within a single NGS run.
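The scaling arithmetic in the preceding passages can be reproduced with a short sketch. The function names and the `samples_per_barcoded_pool` parameter are hypothetical labels introduced here; the numbers are the ones stated in the text.

```python
import math


def samples_per_run(num_barcoded_primers: int, num_indices: int) -> int:
    """Distinct samples identifiable when RT barcodes and NGS indices combine."""
    return num_barcoded_primers * num_indices


def indexing_reactions(num_samples: int, samples_per_barcoded_pool: int = 1) -> int:
    """Indexing reactions needed when each pool shares one indexing reaction.

    A pool size of 1 models the conventional workflow (one reaction per sample).
    """
    return math.ceil(num_samples / samples_per_barcoded_pool)


capacity_small = samples_per_run(96, 12)     # 1152 samples
capacity_large = samples_per_run(384, 24)    # 9216 samples
conventional = indexing_reactions(200)       # 200 reactions
pooled = indexing_reactions(200, 100)        # 2 reactions
```

The sketch only restates the multiplication and the ceiling division implied by the text; it does not model sequencing depth per sample, which would also limit how many samples fit in one run.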
  • FIG. 8 provides a schematic overview of using oligonucleotides in accordance with an embodiment of the present disclosure for NGS library preparation, such as oligonucleotides comprising a helper sequence.
  • FIG. 8 describes an approach where the barcodes are integrated during reverse transcription, using a set of hairpin primer sequences with a degenerate randomer sequence at the 3’ end.
  • the 3’ end sequence that hybridizes to the target molecule e.g., the RNA molecule
  • N_6 random hexamer
  • the stem of the primer hairpin is a universal sequence, U, and can be the same for all primers in the set.
  • the loop of the primer hairpin can be the barcode sequence, and can be unique for each primer in the set.
  • the barcode sequence can be between 4 and 14 nucleotides long, inclusive.
  • the universal sequence comprising half of the hairpin stem can be between 10 and 20 nucleotides long, inclusive. Because of the relatively long length of the primers (between 30 and 60 nucleotides, inclusive), the primer may not be fully removed from the cDNA product, if using size selection. The random hexamer sequences can be replaced with energy-optimized mixamers.
  • FIG. 9 provides a schematic overview of an additional method of NGS library preparation in accordance with an embodiment of the present disclosure.
  • the method depicted in FIG. 9 is based on click chemistry, and uses a short degenerate randomer primer with a 5’ chemical modification conducive to click chemistry.
  • the 5’ chemical modification is a DBCO functional group.
  • the 5’ chemical modification is an azide functional group.
  • the 3’ end of the primer sequence is a random hexamer (N_6). After the primer is extended via reverse transcription to form cDNA, one of a set of barcode oligonucleotides is added to each cDNA sample.
  • FIG. 10 provides example primer sequences comprising helper sequences and random hexamer (N_6) sequences; the random hexamer sequences can be replaced with energy-optimized mixamers.
  • U denotes a universal sequence
  • U* denotes a complementary universal sequence
  • B denotes a barcode sequence.
  • the universal sequence, U, refers to a portion of the hairpin structure that binds to complementary bases that are in physical opposition to U.
  • the complementary bases that are in physical opposition to U are referred to as U*.
  • the helper sequence in contrast, can refer to the entire hairpin structure.
  • FIG. 10A shows a short primer design with a length of 30 nt. It should be understood that “nt” refers to “nucleotides”.
  • the universal sequence is 10 nt
  • the barcode is 4 nt
  • the priming sequence is N_6.
  • FIG. 10B shows a moderate length primer of 37 nt.
  • the universal sequence is extended to 12 nt to further ensure that the designed hairpin structure is the most stable configuration of the primer.
  • the barcode has been increased to 7 nt to provide error-detecting/correcting capabilities.
  • FIG. 10C shows a variant moderate length primer of 41 nt.
  • the universal sequence is extended to 14 nt, but the two sequences that form the hairpin stem are mismatched by 1 nt in the middle.
  • the intentional mismatch in the hairpin stem can be beneficial in downstream library preparation steps in preventing the DNA polymerase from getting stuck during polymerase chain reaction (PCR) amplification.
  • the helper sequence length can be determined by subtracting the length of the region of the molecule that is configured to hybridize to the target nucleotide (in this case, the random hexamer sequence) from the length of the entire molecule (in this case, the primer molecule).
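The primer layouts described for FIGS. 10A to 10C can be sketched as a simple string assembly. This is an illustrative sketch only: the 5'-U-B-U*-N_6-3' ordering is an assumption inferred from the description (stem U/U*, barcode in the loop, randomer at the 3' end), the universal and barcode sequences below are hypothetical placeholders, and the intentional 1-nt stem mismatch of FIG. 10C is not modeled.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_hairpin_primer(universal: str, barcode: str, priming: str = "NNNNNN") -> str:
    """Assemble U + B + U* + priming region (assumed 5'-to-3' order)."""
    return universal + barcode + reverse_complement(universal) + priming


def helper_length(primer: str, priming_length: int = 6) -> int:
    """Helper length = whole primer minus the target-hybridizing region."""
    return len(primer) - priming_length


# Hypothetical 10-nt universal sequence and 4-nt barcode (FIG. 10A dimensions).
primer_a = build_hairpin_primer("GCCTAGGTCA", "ACGT")
length_a = len(primer_a)            # 10 + 4 + 10 + 6 = 30
helper_a = helper_length(primer_a)  # 30 - 6 = 24
```

With a 12-nt universal sequence and a 7-nt barcode the same assembly gives 37 nt, and with a 14-nt universal sequence and a 7-nt barcode it gives 41 nt, matching the stated lengths for FIGS. 10B and 10C (the 7-nt barcode for FIG. 10C is an assumption consistent with the 41-nt total).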
  • FIG. 11 depicts an example implementation of error correcting barcode sequences using Hamming encodings.
  • every adenine (“a”) nucleotide is mapped to 0, guanine (“g”) is mapped to 1, thymine (“t”) is mapped to 2, and cytosine (“c”) is mapped to 3, in modulus 4 (mod 4) arithmetic. Note that in mod 4, 3 is equivalent to -1.
  • FIGS. 11B and 11C depict two implementations of a Hamming barcode sequence, where both sequences are based on the same system of linear equations used for generating the values of the information-encoding and error-checking nucleotides.
  • the 4 information-bearing nucleotides x1, x2, x3, and x4 are clustered together and listed first, followed by c1, c2, and c3.
  • the 4 information-encoding nucleotides are interspersed with the error-checking nucleotides. That is, the information-encoding nucleotides do not comprise a cluster in the Hamming barcode sequence.
  • FIG. 11C can be alternatively described as the 3 error-checking nucleotides being interspersed across the Hamming barcode sequence. Longer Hamming barcodes can also be used.
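The encoding described above can be illustrated with a small sketch. The nucleotide-to-integer mapping (a=0, g=1, t=2, c=3, mod 4) and the 4 information plus 3 error-checking layout come from the text; the specific linear equations are NOT given in this excerpt, so the parity-check matrix `H` below is an assumption (the standard [7,4] Hamming checks carried over to mod 4, with all constants set to 0), and every function name is hypothetical.

```python
NT_TO_VAL = {"a": 0, "g": 1, "t": 2, "c": 3}  # mapping stated in the text
VAL_TO_NT = {v: k for k, v in NT_TO_VAL.items()}

# ASSUMED check equations over positions (x1, x2, x3, x4, c1, c2, c3):
# each row must sum to 0 (mod 4) for a valid barcode.
H = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]


def encode(info: str) -> str:
    """Append 3 error-checking nucleotides to 4 information nucleotides."""
    x = [NT_TO_VAL[n] for n in info]
    checks = [(-sum(h * v for h, v in zip(row[:4], x))) % 4 for row in H]
    return info + "".join(VAL_TO_NT[c] for c in checks)


def syndrome(barcode: str) -> list[int]:
    """Evaluate each check equation; all zeros means no error detected."""
    vals = [NT_TO_VAL[n] for n in barcode]
    return [sum(h * v for h, v in zip(row, vals)) % 4 for row in H]


def correct(barcode: str) -> str:
    """Correct a single substituted nucleotide, if one is present."""
    s = syndrome(barcode)
    if not any(s):
        return barcode
    vals = [NT_TO_VAL[n] for n in barcode]
    for pos in range(7):
        for err in (1, 2, 3):
            # A single error of size err at pos yields syndrome err * column.
            if all((err * row[pos]) % 4 == s_r for row, s_r in zip(H, s)):
                vals[pos] = (vals[pos] - err) % 4
                return "".join(VAL_TO_NT[v] for v in vals)
    raise ValueError("syndrome does not match any single-nucleotide error")


barcode = encode("agtc")       # "agtcact" (clustered layout, as in FIG. 11B)
repaired = correct("aggcact")  # third nucleotide t was misread as g
```

Because the seven columns of `H` are distinct and contain only 0 and 1, every single-nucleotide substitution produces a unique nonzero syndrome, which is what lets `correct` locate and undo it. An interspersed layout like the one described for FIG. 11C would only permute the positions, not change the equations.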
  • FIG. 13 depicts an example workflow for click chemistry-based RNAseq library preparation.
  • a 96-well plate is loaded with extracted RNA samples.
  • a mixture of DBCO-random hexamer (DBCO-N_6) degenerate randomer primer, reverse transcriptase, and reverse transcriptase buffer/reagents can be added to each well, as shown in 1302, and then the resultant solutions can be incubated at a temperature suitable for reverse transcription (e.g. 37°C or 42°C), as shown in 1304.
  • DBCO-N_6 degenerate randomer primer, reverse transcriptase, and reverse transcriptase buffer/reagents
  • a distinct azide-functionalized barcode oligonucleotide can be added to each well, as shown in 1306, and then the solutions can be further incubated to allow the click chemistry reaction to proceed, as shown in 1308.
  • excess DBCO Dibenzocyclooctyne-amine
  • the reaction products from the 96 wells can be combined into a single mixed library.
  • the click chemistry-based workflow shown in FIG. 13 suffers from requiring more labor and reagents before pooling. The benefits of the click chemistry-based workflow shown in FIG.
  • N_6 primers are used, which due to their shorter lengths, can result in their efficient removal via size-based purification.
  • unused primers are purified, and at 1316, further NGS processing reactions, such as the adding of reagents for ligating adaptors and/or indices, are provided.
  • the random hexamer (N_6) can be replaced with energy-optimized mixamers.
  • FIG. 14 provides a schematic comparing random hexamer sequences against energy-optimized mixamers, for NGS library preparation, such as for reverse transcribing RNA molecules.
  • the traditional random hexamer (N_6) used for processing nucleic acid molecules, such as RNA molecules is synthesized as a single oligonucleotide, incorporating roughly an equal mix of all 4 bases (adenine, thymine, guanine, and cytosine) at each position.
  • random hexamers are synthesized aiming to achieve a roughly equal concentration distribution of all 4,096 DNA oligonucleotides 6 nt long.
  • the 4,096 random hexamer sequences exhibit a wide range of hybridization thermodynamics, due to the differences in G/C content, resulting in potential biases in the RNAseq results — namely, disfavoring RNA species with lower G/C fraction.
  • the present disclosure depicts energy-optimized mixamers, such as the example sequence motif of five units of either a strong nucleotide or two adjacent weak nucleotides, denoted as (S/WW)_5.
  • the strong nucleotide, S corresponds to a G nucleotide or a C nucleotide
  • the weak nucleotide, W corresponds to an A or a T nucleotide.
  • WW refers to AA, AT, TA, or TT nucleotides.
  • the energy-optimized mixamers comprising the sequence motif (S/WW)_5 can be synthesized as 32 independent oligonucleotides with degenerate mixed bases as shown.
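The count of 32 oligonucleotides follows from choosing either S or WW for each of the five units (2^5 = 32). The sketch below enumerates those degenerate patterns and expands each into concrete sequences; the function names are hypothetical, and S and W are used here as the usual degenerate base symbols (S = G or C, W = A or T), consistent with the definitions in the text.

```python
from itertools import product

UNITS = ("S", "WW")  # one strong nucleotide, or two adjacent weak nucleotides


def mixamer_patterns(n_units: int = 5) -> list[str]:
    """All degenerate patterns for the (S/WW)_n motif, e.g. 'SWWSSWW'."""
    return ["".join(units) for units in product(UNITS, repeat=n_units)]


def expand(pattern: str) -> list[str]:
    """Concrete sequences covered by one degenerate pattern."""
    choices = ["GC" if symbol == "S" else "AT" for symbol in pattern]
    return ["".join(bases) for bases in product(*choices)]


patterns = mixamer_patterns()
pattern_lengths = sorted({len(p) for p in patterns})   # lengths from 5 to 10 nt
total_sequences = sum(len(expand(p)) for p in patterns)
```

Each pattern pairs every G/C position with a one-nucleotide unit and every A/T stretch with a two-nucleotide unit, which is the mechanism the text relies on to narrow the spread of binding energies relative to N_6.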
  • RNA molecules include methods of reverse transcribing RNA molecules, a method of checking nucleotide errors in a Hamming barcode sequence, and a method of generating energy-optimized mixamers.
  • FIG. 1 shows an exemplary schematic showing a general process 100 for reverse transcribing RNA molecules.
  • the method can include: extracting the RNA molecules from a sample (102); and reverse transcribing the RNA molecules into cDNA molecules, with primers comprising a helper sequence, the helper sequence comprising a hairpin structure, the hairpin structure configured to mitigate untargeted hybridizing of the primers to an off-target nucleotide sequence (104).
  • FIG. 2A shows an exemplary schematic showing a general process 200A for reverse transcribing RNA molecules.
  • the method can include: extracting the RNA molecules from a sample (202A); reverse transcribing the RNA molecules into cDNA molecules, with primers comprising DBCO functional groups on the 5’ ends of the primers, to generate DBCO-functionalized cDNA molecules (204A); providing azide-functionalized barcode oligonucleotides to the DBCO-functionalized cDNA molecules (206A); and ligating the azide-functionalized barcode oligonucleotides to the DBCO-functionalized cDNA molecules via a click reaction, thereby generating barcoded cDNA molecules (208A).
  • FIG. 2B also shows an exemplary schematic showing a general process 200B for reverse transcribing RNA molecules.
  • the method can include: extracting the RNA molecules from a sample (202B); reverse transcribing the RNA molecules into cDNA molecules, with primers comprising azide functional groups on the 5’ ends of the primers, to generate azide-functionalized cDNA molecules (204B); providing DBCO-functionalized barcode oligonucleotides to the azide-functionalized cDNA molecules (206B); and ligating the DBCO-functionalized barcode oligonucleotides to the azide-functionalized cDNA molecules via a click reaction, thereby generating barcoded cDNA molecules (208B).
  • FIG. 3 shows an exemplary schematic showing a general process 300 for checking nucleotide errors in a Hamming barcode sequence.
  • the method can include: creating a system of linear equations, using one or more processors, based on the Hamming barcode sequence, values of the system of linear equations comprising: predetermined information values assigned to information-encoding nucleotides, expected predetermined constant values, and error-checking values assigned to error-checking nucleotides (302); and solving for the error-checking values, using the one or more processors, from the system of linear equations (304).
  • FIG. 5 shows an exemplary schematic showing a general process 500 for generating energy-optimized mixamers.
  • the method can include: determining mixamer sequences comprising at least two adjacent weak nucleotides or at least one strong nucleotide (502); determining a distribution of predicted equilibrium Gibbs standard free energies for the mixamer sequences hybridizing to one or more target nucleic acid sequences at a predetermined temperature and a predetermined salinity conducive to reverse transcription (504); and selecting energy-optimized mixamers based on the distribution, when the distribution comprises a standard deviation equilibrium Gibbs standard free energy less than or equal to a standard deviation equilibrium Gibbs standard free energy threshold (506).
  • Process 300, 400, or 500 can be performed, for example, using one or more electronic devices implementing a software platform.
  • process 300, 400, or 500 is performed using a client-server system, and the blocks of process 300, 400, or 500 are divided up in any manner between the server and a client device.
  • the blocks of process 300, 400, or 500 are divided up between the server and multiple client devices.
  • portions of process 300, 400, or 500 are described herein as being performed by particular devices of a client-server system, it will be appreciated that process 300, 400, or 500 is not so limited.
  • process 300, 400, or 500 is performed using only a client device or only multiple client devices.
  • RNA molecules comprising: cDNA molecules reverse transcribed from RNA molecules, the RNA molecules having been extracted from a sample, and the cDNA molecules having been reverse transcribed with primers that comprise a helper sequence, the helper sequence comprising a hairpin structure, and the hairpin structure configured to mitigate untargeted hybridizing of the primers to an off-target nucleotide sequence.
  • RNA molecules comprising: RNA molecules, the RNA molecules having been extracted from a sample; and primers comprising DBCO functional groups on the 5’ ends of the primers.
  • systems for reverse transcribing RNA molecules comprising: cDNA molecules reverse transcribed from RNA molecules having been extracted from a sample, using primers comprising DBCO functional groups on the 5’ ends of the primers.
  • RNA molecules comprising: cDNA molecules reverse transcribed from the RNA molecules having been extracted from a sample, using primers comprising DBCO functional groups on the 5’ ends of the primers; and azide-functionalized barcode oligonucleotides.
  • systems for reverse transcribing RNA molecules comprising: barcoded cDNA molecules generated from a click reaction comprising primers that comprise DBCO functional groups on the 5’ ends of the primers and azide-functionalized barcode oligonucleotides.
  • systems for checking nucleotide errors in a Hamming barcode sequence are disclosed herein.
  • the systems may comprise, e.g., one or more processors, and a memory unit communicatively coupled to the one or more processors and configured to store instructions that, when executed by the one or more processors, cause the system to: create a system of linear equations based on the Hamming barcode sequence, values of the system of linear equations comprising: predetermined information values assigned to information-encoding nucleotides, expected predetermined constant values, and error-checking values assigned to error-checking nucleotides; and solve for the error-checking values, using the one or more processors, from the system of linear equations.
  • the system may comprise further instructions that cause the system to: solve for error-checked predetermined constant values, using the one or more processors, based on the predetermined information values and the error-checking values; and determine a difference between the expected predetermined constant values and the error-checked predetermined constant values.
  • FIG. 15 illustrates an example of a computing device or system in accordance with one embodiment.
  • Device 1500 can be a host computer connected to a network.
  • Device 1500 can be a client computer or a server.
  • device 1500 can be any suitable type of microprocessor-based device, such as a personal computer, workstation, server or handheld computing device (portable electronic device) such as a phone or tablet.
  • the device can include, for example, one or more processor(s) 1510, input devices 1520, output devices 1530, memory or storage devices 1540, communication devices 1560, and nucleic acid sequencers 1570.
  • Software 1550 residing in memory or storage device 1540 may comprise, e.g., an operating system as well as software for executing the methods described herein.
  • Input device 1520 and output device 1530 can generally correspond to those described herein, and can either be connectable or integrated with the computer.
  • Input device 1520 can be any suitable device that provides input, such as a touch screen, keyboard or keypad, mouse, or voice-recognition device.
  • Output device 1530 can be any suitable device that provides output, such as a touch screen, haptics device, or speaker.
  • Storage 1540 can be any suitable device that provides storage (e.g., an electrical, magnetic or optical memory including a RAM (volatile and non-volatile), cache, hard drive, or removable storage disk).
  • Communication device 1560 can include any suitable device capable of transmitting and receiving signals over a network, such as a network interface chip or device.
  • Software module 1550, which can be stored as executable instructions in storage 1540 and executed by processor(s) 1510, can include, for example, an operating system and/or the processes that embody the functionality of the methods of the present disclosure (e.g., as embodied in the devices as described herein).
  • Software module 1550 can also be stored and/or transported within any non-transitory computer-readable storage medium for use by or in connection with an instruction execution system, apparatus, or device, such as those described herein, that can fetch instructions associated with the software from the instruction execution system, apparatus, or device and execute the instructions.
  • a computer-readable storage medium can be any medium, such as storage 1540, that can contain or store processes for use by or in connection with an instruction execution system, apparatus, or device. Examples of computer-readable storage media may include memory units like hard drives, flash drives, and distributed modules that operate as a single functional unit.
  • various processes described herein may be embodied as modules configured to operate in accordance with the embodiments and techniques described above. Further, while processes may be shown and/or described separately, those skilled in the art will appreciate that the above processes may be routines or modules within other processes.
  • Software module 1550 can also be propagated within any transport medium for use by or in connection with an instruction execution system, apparatus, or device, such as those described above, that can fetch instructions associated with the software from the instruction execution system, apparatus, or device and execute the instructions.
  • a transport medium can be any medium that can communicate, propagate or transport programming for use by or in connection with an instruction execution system, apparatus, or device.
  • the transport readable medium may include, for example, an electronic, magnetic, optical, electromagnetic or infrared wired or wireless propagation medium.
  • Device 1500 may be connected to a network (e.g., network 1604, as shown in FIG. 16 and/or described below), which can be any suitable type of interconnected communication system.
  • the network can implement any suitable communications protocol and can be secured by any suitable security protocol.
  • the network can comprise network links of any suitable arrangement that can implement the transmission and reception of network signals, such as wireless network connections, T1 or T3 lines, cable networks, DSL, or telephone lines.
  • Device 1500 can be implemented using any operating system, e.g., an operating system suitable for operating on the network.
  • Software module 1550 can be written in any suitable programming language, such as C, C++, Java or Python.
  • application software embodying the functionality of the present disclosure can be deployed in different configurations, such as in a client/server arrangement or through a Web browser as a Web-based application or Web service, for example.
  • the operating system is executed by one or more processors, e.g., processor(s) 1510.
  • FIG. 16 illustrates an example of a computing system in accordance with one embodiment.
  • device 1500 (e.g., as described above and illustrated in FIG. 15) is connected to network 1604, which is also connected to device 1606.
  • Devices 1500 and 1606 may communicate, e.g., using suitable communication interfaces via network 1604, such as a Local Area Network (LAN), Virtual Private Network (VPN), or the Internet.
  • network 1604 can be, for example, the Internet, an intranet, a virtual private network, a cloud network, a wired network, or a wireless network.
  • Devices 1500 and 1606 may communicate, in part or in whole, via wireless or hardwired communications, such as Ethernet, IEEE 802.11b wireless, or the like. Additionally, devices 1500 and 1606 may communicate, e.g., using suitable communication interfaces, via a second network, such as a mobile/cellular network.
  • One or all of devices 1500 and 1606 generally include logic (e.g., http web server logic) or are programmed to format data, accessed from local or remote databases or other sources of data and content, for providing and/or receiving information via network 1604 according to various examples described herein.
  • the mixamer set used to generate data shown in FIG. 15 ranged from 5 to 10 nucleotides in length and comprised 7,776 sequences in total.
  • the bottom panels show the predicted, e.g., calculated, thermodynamics of DNA hybridization for the random hexamers (N_6) versus the energy-optimized mixamers (S/WW)_5, at 37 °C in 0.18M sodium (Na+), based on thermodynamic parameters from SantaLucia and Hicks, (2004) Annu Rev Biophys Biomol Struct. 33:415-40. That is, the predicted hybridization thermodynamic values, e.g., the predicted hybridization equilibrium Gibbs standard free energy values, were calculated based on a biophysical simulation.
  • Energy-optimized mixamers (S/WW)_5 can replace hexamers (N_6) in applications where the random hexamers are used during NGS library preparation, such as during the reverse transcription of RNA molecules.
  • Barcode reads are defined as sequencing reads that contain the exact BIRT barcode sequence, including both the universal sequence located in the stem region and the unique barcode sequence within the loop region of the BIRT primer.
  • the barcode fraction is calculated as the percentage of total sequencing reads from the pooled library that contain a valid, full-length barcode sequence. It reflects the efficiency of barcode incorporation during reverse transcription.
  • the barcode rate is defined as the cumulative barcode fraction across all unique barcode sequences used in a pooled experiment. For example, if Barcode 1, 2, and 3 yield barcode fractions of 20%, 25%, and 22% respectively, the overall barcode rate is 67%.
  • the alignment rate is defined as the percentage of total sequencing reads that successfully align to the reference genome.
  • a high alignment rate indicates accurate reverse transcription and minimal artifacts. It serves as a key metric for evaluating the fidelity of the BIRT-mediated cDNA synthesis and the overall quality of the resulting RNA-Seq libraries.
  • Standard RNA-Seq protocols typically use incubation conditions of 25 °C for 10 minutes followed by 42 °C for 15 minutes.
  • BIRT RNA-Seq achieves improved performance with extended incubation:
  • BIRT RNA-Seq uses a universal primer design that enables reverse transcription from a wide range of RNA molecules, including highly degraded RNA such as that extracted from formalin-fixed paraffin-embedded (FFPE) tissues.
  • poly-N or poly-A priming methods which are either general (poly-N) or mRNA-specific (poly-A).
  • poly-A priming requires high RNA integrity (e.g., RNA Integrity Number [RIN] > 5), whereas BIRT is tolerant of low-quality or fragmented RNA.
  • BIRT RNA-Seq demonstrates high sensitivity, enabling RNA-Seq library preparation from low-input RNA samples.
  • Step 1 Mix each RNA sample with the corresponding BIRT primer, dNTPs, reverse transcriptase, and reaction buffer.
  • Step 3 For each sample, collect the entire 5 µL of first-strand cDNA product. Pool all 16 samples (each with a unique barcode) to obtain 80 µL of pooled cDNA.
  • Step 4 Take 20 µL from the pooled mixture and proceed with clean-up using standard methods.
  • Step 5 Proceed with standard RNA-Seq library construction, including end-repair, dA-tailing, adaptor ligation, and indexing PCR.
  • Step 6 Sequence the final library using commercially available platforms.
  • Step 1 Barcode Demultiplexing
  • Each demultiplexed dataset is then analyzed using a standard RNA-Seq pipeline, including quality control, alignment, quantification, and differential expression analysis as needed.
  • FIG. 10D illustrates the actual BIRT primer structure used for all validation experiments.
  • FIGS. 19A-19C show high consistency with standard RNAseq.
  • FIG. 19B shows good concordance across different barcodes.
  • FIG. 19C shows good uniformity across different barcodes, and good reproducibility across different operators.
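The barcode fraction and barcode rate defined above can be illustrated with a short calculation. The following is a minimal sketch, not the disclosed analysis pipeline: the function name `barcode_metrics`, the barcode labels, and the read counts are hypothetical, and each read is assumed to have already been assigned either an exact, full-length barcode or no barcode (`None`).

```python
from collections import Counter


def barcode_metrics(read_barcodes, valid_barcodes):
    """Return (per-barcode fraction in percent, overall barcode rate in percent).

    read_barcodes holds one entry per sequencing read in the pooled library:
    the barcode found in that read, or None when no exact barcode was found.
    """
    total = len(read_barcodes)
    counts = Counter(b for b in read_barcodes if b in valid_barcodes)
    # Barcode fraction: share of all pooled reads carrying a given barcode.
    fractions = {b: 100.0 * counts[b] / total for b in valid_barcodes}
    # Barcode rate: cumulative fraction across all barcodes in the pool.
    rate = sum(fractions.values())
    return fractions, rate


# Hypothetical pool of 100 reads mirroring the 20% / 25% / 22% example above.
reads = ["BC1"] * 20 + ["BC2"] * 25 + ["BC3"] * 22 + [None] * 33
fractions, rate = barcode_metrics(reads, ["BC1", "BC2", "BC3"])
print(fractions, rate)
```

With these illustrative counts, the three barcode fractions sum to an overall barcode rate of 67%, matching the worked example in the definition.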

Abstract

The present disclosure relates to oligonucleotides and sequences, methods, and systems for improved NGS library preparation, including RNAseq library preparation. Described is an oligonucleotide comprising an engineered hairpin structure that mitigates hybridizing of the oligonucleotide to undesirable nucleic acid sequences. In addition, error-checking barcode nucleotide sequences based on Hamming codes are described. Furthermore, energy-optimized mixamers featuring more uniform Gibbs standard free energies relative to traditional hexamer sequences are disclosed. Methods and systems of generating and/or using the oligonucleotides, error-checking barcodes, and energy-optimized mixamers are also disclosed.

Description

OLIGONUCLEOTIDES, SEQUENCES, METHODS, AND SYSTEMS THEREOF FOR
ANALYZING NUCLEIC ACID MOLECULES
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority to U.S. Provisional Patent Application No. 63/570,082, filed March 26, 2024, which is incorporated herein by reference in its entirety.
REFERENCE TO AN ELECTRONIC SEQUENCE LISTING
[0002] The contents of the electronic sequence listing (300402000740SEQLIST.xml; Size: 53,414 bytes; and Date of Creation: March 25, 2025) are herein incorporated by reference in their entirety.
FIELD
[0003] The present disclosure relates generally to oligonucleotides, sequences, methods, and systems for analyzing nucleic acid molecules, and more specifically for preparing next generation sequencing libraries, such as those used for determining RNA sequences.
BACKGROUND
[0004] Next generation sequencing (NGS) methods have been crucial for myriad biotechnological applications, as well as for clinical uses, such as diagnostics and therapeutics. For example, high-throughput RNA sequencing (RNAseq) using random hexamer primers has been a powerful method of characterizing the RNA expression patterns of biological samples, such as blood and tissue. RNAseq provides more information at higher accuracy than older expression profiling technologies such as microarrays, in part because RNAseq samples from nearly all RNA molecules in a solution, without a complete prior knowledge of the RNA species in the solution (e.g., RNA splice isoforms).
[0005] The preparation of NGS libraries, such as RNAseq libraries, however, commands high costs, can be extremely labor intensive, provides limited throughput, and may not accurately reflect the expression of different RNA species within the sample. Such drawbacks are not limited to the preparation of only RNAseq libraries, but concern NGS library preparation methods, in general. For example, manual NGS library preparation can typically take two days for processing only sixteen samples. In addition, the critical reagents for purifying the NGS libraries, such as SPRI purification beads, can be exorbitantly expensive. Moreover, the molecular tools used in NGS methods, such as primers, can be susceptible to off-target hybridization, such as to unintended regions of the genome or the transcriptome, or to themselves (e.g., self-binding). For these reasons, some NGS methods, such as RNAseq, have primarily been performed at a small scale. Indeed, according to public datasets on the GEO database, only tens to hundreds of samples are sequenced per RNA sequencing study. In general, improved methods are needed for increasing the scalability and efficiency of NGS library preparation. For example, improvements in the molecular tools used for NGS library preparation would be highly beneficial. The present disclosure addresses these needs.
BRIEF SUMMARY
[0006] Disclosed herein are molecules, methods, and systems for preparing NGS libraries, such as RNAseq libraries. Existing methods for preparing NGS libraries can be costly, labor intensive, and low throughput. These drawbacks stem, in part, from the use of expensive and inefficient reagents, as well as workflows that are difficult to scale with increased sample size.
[0007] In some aspects, the methods and systems herein describe oligonucleotides and nucleic acid sequences, and methods and systems relating to the oligonucleotides and sequences, that improve multiple aspects of NGS library preparation. In one aspect, disclosed herein is an oligonucleotide comprising an engineered hairpin structure that mitigates hybridizing of the oligonucleotide to undesirable nucleic acid sequences. In addition, error-checking barcode nucleotide sequences based on Hamming codes are described. Furthermore, energy-optimized mixamers featuring more uniform Gibbs standard free energies of binding to their corresponding nucleic acid targets, relative to traditional hexamer sequences, are disclosed. Any of the three oligonucleotides and sequences can be combined with one another, e.g., the oligonucleotide can be combined with the error-checking barcode nucleotide sequences and/or the energy-optimized mixamers. Use of one or more of the disclosed tools can improve the simultaneous processing of NGS libraries by approximately 100-fold or more or improve the accuracy of RNA expression profiling, without significant increases in cost or time. Averaged out, the improvements can result in up to 100-fold reductions of NGS library preparation costs per sample, e.g., RNA sample, especially as the number of samples increases. The combined or non-combined tools can greatly improve the costs, scalability, and efficiency of NGS library preparation, e.g., RNAseq library preparation. Methods and systems of generating and/or using the oligonucleotides, error-checking barcodes, and energy-optimized mixamers are also disclosed.
[0008] In some aspects, provided is an oligonucleotide comprising a helper sequence that comprises a hairpin structure configured to mitigate untargeted hybridizing of a hybridizing region to an off-target nucleotide sequence.
[0009] In some aspects, provided is a Hamming barcode sequence comprising information-encoding nucleotides and error-checking nucleotides and configured to check for nucleotide errors in the Hamming barcode sequence, wherein: the Hamming barcode sequence is based on a system of linear equations, values of the system of linear equations comprising: predetermined information values assigned to the information-encoding nucleotides, expected predetermined constant values for the system of linear equations, and error-checking values assigned to the error-checking nucleotides, based on solving the system of linear equations comprising the predetermined information values and the predetermined constant values.
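The relationship between information values, constant values, and error-checking values can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed encoding: the nucleotide-to-value mapping (A=0, C=1, G=2, T=3), the use of arithmetic modulo 4, the Hamming(7,4)-style coverage of four information positions by three equations, and the all-zero constants are all hypothetical choices made for the example.

```python
# Assumed mapping of nucleotides to values; arithmetic is modulo 4 (assumption).
NUC_TO_VAL = {"A": 0, "C": 1, "G": 2, "T": 3}
VAL_TO_NUC = {v: n for n, v in NUC_TO_VAL.items()}

# Assumed layout: each equation covers three of the four information positions
# plus its own error-checking nucleotide, and must equal its constant mod 4.
COVERAGE = [(0, 1, 3), (0, 2, 3), (1, 2, 3)]
CONSTANTS = [0, 0, 0]  # expected predetermined constants (assumed)


def encode(info):
    """Solve each equation for its error-checking value and append the checks."""
    vals = [NUC_TO_VAL[n] for n in info]
    checks = [(c - sum(vals[i] for i in cov)) % 4
              for cov, c in zip(COVERAGE, CONSTANTS)]
    return info + "".join(VAL_TO_NUC[v] for v in checks)


def syndrome(barcode):
    """Recompute each constant from a read; return (recomputed - expected) mod 4."""
    vals = [NUC_TO_VAL[n] for n in barcode]
    info, checks = vals[:4], vals[4:]
    return [(sum(info[i] for i in cov) + chk - c) % 4
            for cov, chk, c in zip(COVERAGE, checks, CONSTANTS)]


barcode = encode("ACGT")
print(barcode, syndrome(barcode), syndrome("AGGTATG"))
```

In this sketch an error-free barcode yields an all-zero difference, whereas a substitution at the second information nucleotide yields non-zero differences only in the two equations that cover that position, which is how the pattern of differences can point to the erroneous nucleotide.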
[0010] In some aspects, provided is an energy-optimized mixamer sequence comprising an equilibrium Gibbs free energy sampled from a distribution of predicted equilibrium Gibbs standard free energies for hybridizing nucleic acid sequences, the distribution comprising a mean of about -5 to -7 kcal/mol and a standard deviation of about 0.4 to 0.6 kcal/mol. It should be understood that the Gibbs standard free energy is provided at a given temperature and salinity. In variations of the foregoing, the temperature and salinity are about 40 °C and about 0.2 M equivalent of sodium ions, respectively.
[0011] In some aspects, provided is a method of reverse transcribing RNA molecules comprising: extracting the RNA molecules from a sample; and reverse transcribing the RNA molecules into cDNA molecules, with primers comprising a helper sequence, the helper sequence comprising a hairpin structure, the hairpin structure configured to mitigate untargeted hybridizing of the primers to an off-target nucleotide sequence.
[0012] In some aspects, provided is a method of reverse transcribing RNA molecules comprising: extracting the RNA molecules from a sample; reverse transcribing the RNA molecules into cDNA molecules, with primers comprising DBCO functional groups on the 5’ ends of the primers, to generate DBCO-functionalized cDNA molecules; providing azide-functionalized barcode oligonucleotides to the DBCO-functionalized cDNA molecules; and ligating the azide-functionalized barcode oligonucleotides to the DBCO-functionalized cDNA molecules via a click reaction, thereby generating barcoded cDNA molecules. In some variations of the foregoing, the method provided does not require or can avoid purification before pooling.
[0013] In some aspects, provided is a method of reverse transcribing RNA molecules comprising: extracting the RNA molecules from a sample; reverse transcribing the RNA molecules into cDNA molecules, with primers comprising azide functional groups on the 5’ ends of the primers, to generate azide-functionalized cDNA molecules; providing DBCO-functionalized barcode oligonucleotides to the azide-functionalized cDNA molecules; and ligating the DBCO-functionalized barcode oligonucleotides to the azide-functionalized cDNA molecules via a click reaction, thereby generating barcoded cDNA molecules.
[0014] In some aspects, provided is a method of checking nucleotide errors in a Hamming barcode sequence, comprising: creating a system of linear equations, using one or more processors, based on the Hamming barcode sequence, values of the system of linear equations comprising: predetermined information values assigned to information-encoding nucleotides, expected predetermined constant values, and error-checking values assigned to error-checking nucleotides; and solving for the error-checking values, using the one or more processors, from the system of linear equations.
[0015] In some aspects, provided is a method of generating energy-optimized mixamers comprising: determining mixamer sequences comprising at least two adjacent weak nucleotides or at least one strong nucleotide; determining a distribution of predicted equilibrium Gibbs standard free energies for the mixamer sequences hybridizing to one or more target nucleic acid sequences; and selecting energy-optimized mixamers based on the distribution, when the distribution comprises a standard deviation equilibrium Gibbs standard free energy less than or equal to a standard deviation equilibrium Gibbs standard free energy threshold. The distribution can be based on standard conditions, e.g., conditions comprising a predetermined temperature and a predetermined buffer salinity for reverse transcription.
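The three steps of generating energy-optimized mixamers (enumerate candidates, compute an energy distribution, apply a standard deviation threshold) can be sketched as below. This is an illustration under loud assumptions: the candidate set is taken to be five units that are each either one strong nucleotide (C or G) or two adjacent weak nucleotides (A or T), the energy function `toy_energy` is a crude placeholder (-1.0 per strong and -0.5 per weak nucleotide, arbitrary units) rather than a nearest-neighbor thermodynamic model, and `STD_THRESHOLD` is a hypothetical value.

```python
from itertools import product
from statistics import mean, pstdev

STRONG = ("C", "G")                    # single strong nucleotide units
WEAK_PAIRS = ("AA", "AT", "TA", "TT")  # two adjacent weak nucleotides


def toy_energy(seq):
    """Placeholder energy in arbitrary units; NOT a nearest-neighbor model."""
    return sum(-1.0 if n in "CG" else -0.5 for n in seq)


def summarize(seqs):
    """Return (mean, population standard deviation) of the toy energies."""
    energies = [toy_energy(s) for s in seqs]
    return mean(energies), pstdev(energies)


# Step 1: enumerate candidates (five S-or-WW units) and random hexamers.
mixamers = ["".join(units) for units in product(STRONG + WEAK_PAIRS, repeat=5)]
hexamers = ["".join(p) for p in product("ACGT", repeat=6)]

# Step 2: summarize the energy distribution of each set.
mix_mean, mix_std = summarize(mixamers)
hex_mean, hex_std = summarize(hexamers)

# Step 3: keep the candidate set only if its spread is within the threshold.
STD_THRESHOLD = 0.5  # hypothetical threshold, same arbitrary units
selected = mixamers if mix_std <= STD_THRESHOLD else []
print(len(mixamers), mix_std, len(hexamers), round(hex_std, 3), len(selected))
```

Under this enumeration the candidate set contains 6^5 = 7,776 sequences of 5 to 10 nucleotides. The toy model makes every candidate energetically identical by construction, so the zero spread it reports only illustrates why pairing weak nucleotides narrows the distribution; a realistic model at a defined temperature and salinity would be expected to give a narrower, but non-zero, spread.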
BRIEF DESCRIPTION OF THE DRAWINGS
[0016] Various aspects of the disclosed methods, devices, and systems are set forth with particularity in the appended claims. A better understanding of the features and advantages of the disclosed methods, devices, and systems will be obtained by reference to the following detailed description of illustrative embodiments and the accompanying drawings, of which:
[0017] FIGS. 1, 2A, and 2B provide exemplary methods for reverse transcribing RNA molecules into cDNA molecules.
[0018] FIGS. 3 and 4 provide exemplary methods for generating Hamming barcode sequences.
[0019] FIG. 5 provides an exemplary method for selecting energy-optimized mixamers.
[0020] FIGS. 6A and 6B provide a schematic overview of RNAseq library preparation methods.
[0021] FIG. 7A provides a schematic of a standard primer used for NGS library preparation. FIGS. 7B and 7C provide a schematic of problems observed in primers used for NGS library preparation.
[0022] FIG. 8 provides a schematic overview of using oligonucleotides in accordance with an embodiment of the present disclosure for NGS library preparation.
[0023] FIG. 9 provides a schematic overview of an additional method of NGS library preparation in accordance with an embodiment of the present disclosure.
[0024] FIGS. 10A-10D provide schematics of an oligonucleotide in accordance with various embodiments of the present disclosure for NGS library preparation.
GTGTCCTCTGAAAACAGAGGACACNNNNNN (SEQ ID NO:1);
CGTGTCCTCTGCAAAAGGGGCAGAGGACACGNNNNNN (SEQ ID NO:2);
CGTGTCCACTCTGCAAAAGGGGCAGAGGCGACACGNNNNNN (SEQ ID NO:3); and
GTGTCCTCTGAAAACAGAGGACACNNNNNN (SEQ ID NO:4).
[0025] FIG. 11A provides a schematic of a system of linear equations for generating Hamming nucleotide barcodes. FIGS. 11B and 11C provide schematics of exemplary Hamming nucleotide barcodes.
[0026] FIGS. 12 and 13 provide a schematic overview of workflows for RNAseq library preparation in accordance with various embodiments of the present disclosure.
[0027] FIG. 14 provides a schematic comparing random hexamer sequences against energy-optimized mixamers, for NGS library preparation.
[0028] FIG. 15 depicts an exemplary computing device or system in accordance with an embodiment of the present disclosure.
[0029] FIG. 16 depicts an exemplary computer system or computer network, in accordance with some instances of the systems described herein.
[0030] FIG. 17A provides example data of a histogram for equilibrium Gibbs standard free hybridizing energies of random hexamers. FIG. 17B provides example data of a histogram for equilibrium Gibbs standard free hybridizing energies for energy-optimized mixamers.
[0031] FIG. 18 provides an exemplary experimental design. The starting materials include (1) human reference RNA, (2) mouse reference RNA, and (3) rat reference RNA. 16 samples and 16 BIRT primers were used. The samples were pooled after first-strand cDNA synthesis, followed by a single library preparation and sequencing.
[0032] FIG. 19A is a graph showing high consistency with standard RNAseq. FIG. 19B is a graph showing good concordance across different barcodes. FIG. 19C is a graph showing good uniformity across different barcodes, and good reproducibility across different operators.
[0033] FIG. 20A is a graph showing barcode fraction. FIG. 20B is a graph showing alignment rate to genome sequence. These results demonstrate a low RNA input requirement (e.g., as low as 1 ng) without sacrificing performance, compared with standard RNA-Seq, which typically requires 100-1000 ng.
DETAILED DESCRIPTION
[0034] In some aspects, compositions, e.g., oligonucleotides and sequences, methods, and systems for improved NGS library preparation, including RNAseq library preparation, are described. For one, disclosed herein is an oligonucleotide comprising an engineered hairpin structure that mitigates hybridizing of the oligonucleotide to undesirable nucleic acid sequences. In addition, error-checking barcode nucleotide sequences based on Hamming codes are described. Furthermore, energy-optimized mixamers featuring more uniform Gibbs standard free energies relative to traditional hexamer sequences are disclosed. Methods and systems of generating and/or using the oligonucleotides, error-checking barcodes, and energy-optimized mixamers are also disclosed.
[0035] Existing methods for preparing NGS libraries, including RNAseq libraries, can be costly, labor intensive, low throughput, and inaccurately reflect the expression of different RNA species within a sample. These drawbacks arise, at least, from the use of expensive and inefficient reagents, as well as workflows that are difficult to scale with increased sample size. Disclosed herein are compositions, methods, and systems for improving NGS library preparation with respect to scale, efficiency, cost, and accuracy. For example, off-target hybridizing of oligonucleotides, e.g., primers, inefficient hybridizing due to poor oligonucleotide design (e.g., as a result of unintended primer dimerization and other forms of unintended self-hybridization), biased hybridizing due to poor oligonucleotide design, and non-correcting barcode sequences are addressed by the embodiments of the present disclosure. The embodiments disclosed herein can also be combined with one another to improve the preparing of NGS libraries. For example, primer sets used for the amplifying or reverse transcribing of RNA transcripts for RNAseq can comprise the oligonucleotide featuring the engineered hairpin structure, an error-checking Hamming barcode, and energy-optimized mixamers that reduce hybridization bias. The combining of the embodiments disclosed herein is not necessary, however, to improve the preparing of the NGS library.
[0036] A unifying key advantage of the embodiments described herein is that in the case of processing RNA molecules for NGS library preparation, e.g., RNAseq, the barcoding reaction can be performed directly on the RNA molecules. Traditionally such an approach may not be viable, because the oligonucleotides, e.g., primers, used to bind to a plurality of RNA molecules are biased for hybridizing to certain RNA sequences, and are prone to forming unintended secondary structures that prevent at least a portion of the oligonucleotide from binding to the RNA molecules, thus further biasing the hybridizing, In addition, a considerable proportion of the RNA-binding oligonucleotide sequences can display unfavorable binding kinetics or thermodynamics to the RNA molecules, which also contributes to biases when preparing the RNAseq libraries. Any of the embodiments described herein, either alone or in combination with one another, addresses the biased hybridizing of oligonucleotides to RNA molecules for e.g., RNAseq processing. Furthermore, the embodiments described herein can be used to improve the efficiency and costs of NGS library preparation, in general, and are not limited to only RNAseq library processing.
[0037] In some aspects, described herein is an oligonucleotide comprising a helper sequence that comprises a hairpin structure configured to mitigate untargeted hybridizing of a hybridizing region to an off-target nucleotide sequence. The hairpin structure can comprise a barcode sequence, such as a Hamming barcode sequence. In some variations, the Hamming barcode sequence includes biological sequences, such as nucleotide sequences. The barcode sequence can reside in the loop portion (e.g., the non-palindromic sequence portions) of the helper sequence. By placing the barcode sequence in the loop of the hairpin structure, the barcode sequence cannot bias the hybridizing of the oligonucleotide to nucleic acid targets. The stem of the hairpin structure can comprise a universal sequence that is common to multiple oligonucleotide sequences. That is, multiple oligonucleotides can comprise a common universal sequence that allows for the self-hybridizing necessary for generating the stem of the hairpin structure, whereas the sequence that resides in the loop of the hairpin structure, e.g., the barcode sequence, can vary across oligonucleotides. The oligonucleotide can also comprise energy-optimized mixamers as the target nucleotide-binding sequences. Described herein also are methods of using the oligonucleotide for NGS library preparation, e.g., RNAseq library preparation. The methods can comprise directly reverse transcribing RNA transcripts using primers comprising the disclosed oligonucleotides. The methods can also comprise reverse transcribing RNA transcripts using primers that do not comprise oligonucleotides comprising the helper sequences, but do comprise DBCO functional groups or azide groups. The reverse transcribed DBCO-functionalized cDNA molecules or DBCO-functionalized barcode molecules can then ligate azide-functionalized barcodes or azide-functionalized cDNA molecules, which can be Hamming barcodes, via a click chemistry reaction.
The resulting click chemistry products can comprise barcoded cDNA molecules. In one variation, the click chemistry is implemented with DBCO on the mixamer primer and azide on the sample barcode. In another variation, the click chemistry is implemented with azide on the mixamer primer and DBCO on the sample barcode.
[0038] In certain aspects, described herein is a Hamming barcode sequence, and a method of generating the Hamming barcode sequence. Generating the Hamming barcode sequence can comprise creating a system of linear equations, the values of the system of linear equations comprising: predetermined information values assigned to information-encoding nucleotides, expected predetermined constant values, and error-checking values assigned to error-checking nucleotides; and solving for the error-checking values from the system of linear equations. Generating the Hamming barcode sequence can further comprise solving for error-checked predetermined constant values, based on the predetermined information values and the error-checking values; and determining a difference between the expected predetermined constant values and the error-checked predetermined constant values. The information-encoding nucleotides can encode the nucleotides used for the actual barcode sequence, e.g., the sequence that the error-checking nucleotides are designed to protect, where the protecting refers to being able to detect or correct an error, if one of the information-encoding nucleotides erroneously changes. The system of linear equations can comprise a system of modular mathematics linear equations.
[0039] In certain aspects, described herein is an energy-optimized mixamer, and a method of generating the energy-optimized mixamers. The method can comprise: determining mixamer sequences comprising at least two adjacent weak nucleotides or at least one strong nucleotide; determining a distribution of predicted equilibrium Gibbs standard free energies for the mixamer sequences hybridizing to one or more target nucleic acid sequences; and selecting energy-optimized mixamers based on the distribution, when the distribution comprises a standard deviation equilibrium Gibbs standard free energy less than or equal to a standard deviation equilibrium Gibbs standard free energy threshold. The distribution can be based on standard conditions comprising a predetermined temperature and predetermined salinity conditions conducive to reverse transcription. The standard conditions can comprise a temperature of about 40 °C and a salinity of about 0.2 M Na+. The Na+ can be replaced with a functionally equivalent ion, such as K+.
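In one non-limiting illustration, the selection step described above can be sketched in Python as follows. The function `select_energy_optimized`, the pool names, the example sequences, and the threshold value are hypothetical and are not part of the disclosure. In particular, `toy_dg` is a placeholder scoring function and not a thermodynamic model; in practice it would be replaced by a predictor of the equilibrium Gibbs standard free energy under the chosen temperature and salinity conditions.

```python
from statistics import pstdev

def toy_dg(seq):
    """HYPOTHETICAL placeholder, NOT a thermodynamic model:
    scores -1.0 per strong base (G/C) and -0.5 per weak base (A/T)."""
    return -sum(1.0 if base in "GC" else 0.5 for base in seq)

def select_energy_optimized(pools, predict_dg, sd_threshold):
    """Keep each candidate pool whose predicted free-energy distribution
    has a standard deviation at or below sd_threshold."""
    selected = {}
    for name, seqs in pools.items():
        dgs = [predict_dg(s) for s in seqs]
        if pstdev(dgs) <= sd_threshold:
            selected[name] = seqs
    return selected

# Illustrative candidate pools (sequences chosen only for this sketch).
pools = {
    "mixed_hexamers": ["GCGCGC", "ATATAT", "GATCGA"],
    "s_ww_units": ["GAAC", "TAGC", "CCG", "ATTTTA"],  # three S or WW units each
}
selected = select_energy_optimized(pools, toy_dg, sd_threshold=0.5)
print(list(selected))
```

Under the placeholder scoring, only the pool built from S and WW units passes the threshold, because each unit contributes the same score; whether this holds for real predicted free energies depends on the predictor and conditions used.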
[0040] Any one or more of the embodiments of the present disclosure can, but need not, be combined to improve the preparing of NGS libraries. Either alone or in combination, the embodiments of the present disclosure improve the scalability, costs, labor intensity, or accuracy of NGS library preparation, including the preparation of RNAseq libraries.
[0041] In one aspect, provided are reagents and methods for affordable and high-throughput NGS library preparation from RNA samples, by integrating sample-specific barcodes either during or immediately after the reverse transcription step. Barcoded cDNA products are pooled before any subsequent purification, adaptor ligation, or other steps, avoiding significant labor and reagent costs inherent in standard workflows.
[0042] In some embodiments, provided is a set of barcode-integrated reverse transcription (BIRT) primer DNA oligonucleotides, each BIRT primer comprising from 5' to 3': a universal first sequence comprising between 8 and 30 nucleotides; a barcode sequence comprising between 3 and 20 nucleotides; a universal second sequence comprising between 8 and 30 nucleotides that is 85% reverse complementary to the universal first sequence; and a degenerate randomer sequence comprising between 4 and 16 nucleotides. In some variations, the degenerate randomer sequence comprises N_k, corresponding to a roughly equal mix of all 4 nucleotides (A, T, C, G) at each N position, and k corresponds to the number of nucleotides. In certain variations, k=6 nucleotides. In certain variations, the degenerate randomer sequence comprises (S/WW)_k, corresponding to a roughly equal mix of G and C nucleotides for S, a roughly equal mix of AA, AT, TA, and TT dinucleotides for WW, a roughly equal mix of S or WW for each repeating (S/WW) unit, and k corresponds to the number of repeating (S/WW) units. In certain variations, k=4, 5, 6, or 7 units. In certain variations, the universal first sequence exhibits a G/C content of between 40% and 100%. In certain variations, the primer set comprises at least 8 distinct BIRT primer pairs each with a distinct barcode sequence. In certain variations, the primer set comprises at least 24 distinct BIRT primer pairs each with a distinct barcode sequence. In certain variations, the barcode sequences have pairwise Hamming distances of at least 2 nucleotides for all possible pairs.
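As a non-limiting sketch, the pairwise Hamming distance condition on a barcode set can be checked as shown below in Python. The function names and the example barcodes are hypothetical and chosen only for illustration.

```python
from itertools import combinations

def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def min_pairwise_distance(barcodes):
    """Smallest Hamming distance over all possible pairs in the set."""
    return min(hamming_distance(a, b) for a, b in combinations(barcodes, 2))

# Hypothetical 4-nt barcode set used only for illustration.
barcodes = ["AACC", "AGGT", "CATG", "TTAA"]
print(min_pairwise_distance(barcodes) >= 2)
```

A set passes when the minimum over all pairs is at least the required distance, so a single close pair is sufficient to reject the set.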
[0043] In some embodiments, provided is a method for preparing a barcoded RNA sequencing library from a set of at least 24 biological samples (in which case, each of the 24 samples can correspond to each of 24 distinct BIRT primer pairs) or 8 biological samples (in which case, each of the 8 samples can correspond to each of 8 distinct BIRT primers), the method comprising: (1) extracting RNA from each of the biological samples into a separate container for each RNA sample; (2) mixing each RNA sample with a different BIRT primer, reverse transcriptase, and aqueous buffers/reagents to enable reverse transcription; (3) incubating the resulting mixture to allow reverse transcription; (4) pooling all reverse transcription reaction products in a single container to form the barcoded RNA sequencing library, wherein the BIRT primers are as described herein. In some variations, the method further comprises (3.5) heating the resulting mixture to deactivate the reverse transcriptase, after the incubation step (3) and before the pooling step (4). In some variations, the method specifically excludes any purification steps between steps (2) and (4). In certain variations, the avoided purification steps comprise size selection or affinity binding. In certain variations, the method further comprises: (5) purification to remove enzymes and primers; (6) addition of adapters and index sequences; (7) further pooling with additional libraries with different index sequences, after the pooling step (4). In certain variations, step (6) of adding adapters and index sequences is performed using ligation. In other variations, step (6) of adding adapters and index sequences is performed using PCR.
[0044] In some embodiments, provided is a method for preparing a barcoded RNA sequencing library from a set of at least 8 biological samples, the method comprising: (1) extracting RNA from each of the biological samples into a separate container for each RNA sample; (2) mixing each RNA sample with a DBCO-functionalized degenerate oligonucleotide primer or an azide-functionalized degenerate oligonucleotide primer, reverse transcriptase, and aqueous buffers/reagents to enable reverse transcription; (3) incubating the resulting mixture to allow reverse transcription; (4) adding to each mixture a distinct azide-functionalized barcode oligonucleotide or a distinct DBCO-functionalized barcode oligonucleotide; (5) incubating the resulting mixture to allow click chemistry to proceed; and (6) pooling all reverse transcription reaction products in a single container to form the barcoded RNA sequencing library. In some variations, the method further comprises: (5.5) addition of free azide (e.g. sodium azide) or an azide-functionalized molecule that is distinct from the azide-functionalized barcode oligonucleotides, after the click chemistry incubation step (5) and before the pooling step (6). In some variations, the DBCO-functionalized degenerate oligonucleotide primer comprises N_k, corresponding to a roughly equal mix of all 4 nucleotides (A, T, C, G) at each N position, and k corresponds to the number of nucleotides. In some variations, k=6 nucleotides. In some variations, the DBCO-functionalized degenerate oligonucleotide primer comprises (S/WW)_k, corresponding to a roughly equal mix of G and C nucleotides for S, a roughly equal mix of AA, AT, TA, and TT dinucleotides for WW, a roughly equal mix of S or WW for each repeating (S/WW) unit, and k corresponds to the number of repeating (S/WW) units. In some variations, k=5 units.
Definitions
[0045] Unless otherwise defined, all of the technical terms used herein have the same meaning as commonly understood by one of ordinary skill in the art in the field to which this disclosure belongs.
[0046] As used in this specification and the appended claims, the singular forms “a”, “an”, and “the” include plural references unless the context clearly dictates otherwise. Any reference to “or” herein is intended to encompass “and/or” unless otherwise stated.
[0047] “About” and “approximately” shall generally mean an acceptable degree of error for the quantity measured given the nature or precision of the measurements. Exemplary degrees of error are within 20 percent (%), typically, within 10%, and more typically, within 5% of a given value or range of values. Exemplary degrees of error may also include an absolute value range for values, such as within 1 kcal/mol or within 0.5 kcal/mol for Gibbs standard free energies.
[0048] As used herein, the terms "comprising" (and any form or variant of comprising, such as "comprise" and "comprises"), "having" (and any form or variant of having, such as "have" and "has"), "including" (and any form or variant of including, such as "includes" and "include"), or "containing" (and any form or variant of containing, such as "contains" and "contain"), are inclusive or open-ended and do not exclude additional, un-recited additives, components, integers, elements, or method steps.
[0049] As used herein, the use of ordinal terms such as “first”, “second”, “third”, etc., in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another or the temporal order in which acts of a method are performed; such terms are used merely as labels to distinguish one claim element having a certain name from another element having a same name (but for use of the ordinal term). Similarly, use of a), b), etc., or i), ii), etc. does not by itself connote any priority, precedence, or order of steps in the claims. Similarly, the use of these terms in the specification does not by itself connote any required priority, precedence, or order.
[0050] The section headings used herein are for organizational purposes only and are not to be construed as limiting the subject matter described. The description is presented to enable one of ordinary skill in the art to make and use the invention and is provided in the context of a patent application and its requirements. Various modifications to the described embodiments will be readily apparent to those persons skilled in the art and the generic principles herein may be applied to other embodiments. Thus, the present invention is not intended to be limited to the embodiment shown but is to be accorded the widest scope consistent with the principles and features described herein.
[0051] The figures illustrate processes according to various embodiments. In the exemplary processes, some blocks are, optionally, combined, the order of some blocks is, optionally, changed, and some blocks are, optionally, omitted. In some examples, additional steps may be performed in combination with the exemplary processes. Accordingly, the operations as illustrated (and described in greater detail below) are exemplary by nature, and, as such, should not be viewed as limiting.
Oligonucleotides and sequences for analyzing nucleic acid molecules
[0052] Disclosed herein are compositions for analyzing nucleic acid molecules, including: an oligonucleotide comprising a helper sequence that comprises a hairpin structure configured to mitigate untargeted hybridizing of a hybridizing region to an off-target nucleotide sequence; a Hamming barcode sequence comprising information-encoding nucleotides and error-checking nucleotides and configured to check for nucleotide errors in the Hamming barcode sequence, wherein the Hamming barcode sequence is based on a system of linear equations, values of the system of linear equations comprising: predetermined information values assigned to the information-encoding nucleotides, expected predetermined constant values for the system of linear equations, and error-checking values assigned to the error-checking nucleotides, based on solving the system of linear equations comprising the predetermined information values and the predetermined constant values.
Oligonucleotide comprising a helper sequence
[0053] The hairpin structure of the helper sequence can comprise exclusively or near exclusively nucleotides from the helper sequence. That is, the helper sequence is, for the most part, not bound to other portions of a broader nucleotide sequence that the helper sequence may be a member of. For example, if the helper sequence is part of a primer comprising both the helper sequence and a region that hybridizes to a target nucleotide sequence, e.g., an energy-optimized mixamer, the helper sequence’s hairpin structure is restricted to being comprised of only nucleotides from the helper sequence, and is, for the most part, not comprised of nucleotides from the region that hybridizes to the target nucleotide sequence, e.g., the energy-optimized mixamer. The hairpin structure may comprise nucleotides that are not from the helper sequence, provided that those nucleotides do not prohibit the function of the broader sequence, e.g., primer sequence, that the helper sequence is a part of. For example, the hairpin structure of the helper sequence can comprise a single nucleotide from the region of the primer that hybridizes to the target nucleotide sequence, provided that the single nucleotide being part of the hairpin structure does not prohibit the primer from hybridizing to the target nucleotide sequence.
[0054] A nucleotide mismatch in the hairpin structure can refer to two nucleotide bases in the hairpin structure that are positioned in direct physical opposition to each other, but are not complementary to each other. The nucleotide mismatch can result in a bulge in the hairpin structure. The bulge can also result when the two opposing sequences in the stem of the hairpin structure are of unequal lengths, such that most of the nucleotides for one sequence are complementary to most of the nucleotides of the opposing sequence, except for one or more nucleotides. Those one or more nucleotides that are not complementarily hybridizing to the opposing sequence can be considered to be a part of the bulge. The bulge can be in the middle of one of the two sequences in the stem of the hairpin structure. For example, for the hairpin structure of the helper sequence, the universal sequence, U, can comprise the sequence “cacacaca” and the complementary universal sequence, U*, can comprise the sequence “tgtgAtgtg,” in which the unpaired “A” forms the bulge.
Hamming barcode sequence
[0055] The Hamming barcode sequence for checking nucleotide errors checks for nucleotide errors within the Hamming barcode sequence. That is, the Hamming barcode sequence does not check for nucleotide errors outside of the Hamming barcode sequence. The checking for nucleotide errors can include correcting nucleotide errors and/or detecting nucleotide errors.
[0056] A nucleotide error can, but need not be, a mutation. The nucleotide error can be the result of an artificial machine-level error, such as an incorrect sequencing read, rather than the result of the nucleotide base changing due to biological processes, e.g., polymerase proofreading error.
[0057] Each linear equation in the system of modular mathematics linear equations can comprise a corresponding predetermined constant value. For example, of the system of linear equations, a first linear equation can correspond to a K1 constant value, a second linear equation can correspond to a K2 constant value, and a third linear equation can correspond to a K3 constant value.
[0058] The checking for errors can comprise detecting nucleotide errors or correcting nucleotide errors. In the case of detecting nucleotide errors, a nucleotide error in the Hamming barcode sequence indicates that an error exists, but the original nucleotide base before the onset of the error, cannot be ascertained. In contrast, in the case of correcting nucleotide errors, a nucleotide error in the Hamming barcode sequence indicates that not only does an error exist, but the original nucleotide base before the onset of the error can be ascertained via the solving of the system of linear equations used to generate the Hamming barcode sequence.
[0059] Each predetermined information value from the predetermined information values can correspond to an information-encoding nucleotide from the information-encoding nucleotides. For example, the information-encoding nucleotides can comprise A, T, G, and C nucleotides. The A can encode a predetermined information value of 0, the G can encode a predetermined information value of 1, the T can encode a predetermined information value of 2, and the C can encode a predetermined information value of 3 or -1. The C can encode a predetermined information value of 3 or -1, because 3 modulo 4 is equal to -1 modulo 4.
[0060] The information-encoding nucleotides can comprise a cluster in the Hamming barcode sequence. The cluster can refer to most of the information-encoding nucleotides being adjacent to one another. In some instances, the clustering together can refer to all of the information-encoding nucleotides being adjacent to one another. The cluster from the clustering together of the information-encoding nucleotides can be found on one end of a sequence, such as the Hamming barcode sequence. See FIG. 11B for a schematic depiction of Hamming barcode sequences with clustered information-encoding nucleotides and clustered error-checking nucleotides.
[0061] The error-checking nucleotides can be interspersed across the Hamming barcode sequence. The interspersing can refer to most of the error-checking nucleotides not being adjacent to one another. In some instances, none of the error-checking nucleotides are adjacent to one another. See FIG. 11C for a schematic depiction of Hamming barcode sequences with interspersed error-checking nucleotides. The Hamming barcode sequence can include information-encoding nucleotides that comprise a cluster, as well as error-checking nucleotides that are interspersed across the barcode sequence.
Energy-optimized mixamers
[0062] The energy-optimized mixamer can be a nucleotide sequence. The equilibrium Gibbs standard free energy can refer to the Gibbs standard free energy at equilibrium at a prespecified temperature and salinity conducive to reverse transcription. The thresholds used to select the energy-optimized mixamers — that is, the mean equilibrium Gibbs standard free energy threshold and the standard deviation equilibrium Gibbs standard free energy threshold — need not be predetermined thresholds. Instead, the thresholds can be determined based on the properties of the distribution of predicted equilibrium Gibbs standard free energies for the mixamer sequences.
[0063] The energy-optimized mixamers can comprise at least two adjacent weak nucleotides or at least one strong nucleotide, where the weak nucleotide can comprise an adenine (A) nucleotide, a thymine (T) nucleotide, and/or a uracil nucleotide, and the strong nucleotide can comprise a cytosine (C) nucleotide and/or a guanine (G) nucleotide. The energy-optimized mixamers can comprise at least four, at least five, at least six, or at least seven of the two adjacent weak nucleotides or the at least one strong nucleotide.
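For illustration only, the set of sequences covered by an (S/WW)_k design, in which each repeating unit is either one strong nucleotide or two adjacent weak nucleotides, can be enumerated as in the Python sketch below. The function name `sww_mixamers` is hypothetical. The sketch lists the distinct sequences and does not model the relative abundance of each sequence in a synthesized pool.

```python
from itertools import product

S_UNITS = ("G", "C")                 # one strong nucleotide
WW_UNITS = ("AA", "AT", "TA", "TT")  # two adjacent weak nucleotides

def sww_mixamers(k):
    """All sequences made of k repeating units, each unit being S or WW."""
    units = S_UNITS + WW_UNITS
    return ["".join(combo) for combo in product(units, repeat=k)]

mixamers = sww_mixamers(2)
print(len(mixamers), sorted({len(m) for m in mixamers}))
```

Because each unit has 6 possibilities, k units give 6^k sequences whose lengths range from k nucleotides (all S) to 2k nucleotides (all WW).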
[0064] FIG. 6A provides a schematic overview of current standard RNAseq library preparation. Samples 1, 2, through N are subject to individual adaptor ligating and indexing reactions per RNA sample, and the indexed reactions can be performed immediately after, per RNA sample, to generate indexed RNAseq libraries. The RNAseq libraries can then be pooled together, enriched for the RNA molecule types of interest, and sequenced.
[0065] FIG. 6B provides a schematic overview of an RNAseq library preparation according to one or more embodiments in the present disclosure. In contrast to FIG. 6A, the RNA transcripts are directly barcoded, before any intermediate reaction, during the reverse transcription process. In further contrast to FIG. 6A, the barcoding reaction is not limited to a single RNA sample. Instead, a single barcoding reaction can be used across multiple samples, such as samples 1 to 100, as shown in FIG. 6B. Thus, scaling the preparation of RNAseq libraries is improved according to the embodiments described herein. In contrast to FIG. 6A, the workflow shown in FIG. 6B scales the indexing reactions as a function of a group of barcoded samples. That is, for samples 1-100 a single indexing reaction can be performed, and for samples 101-200, another indexing reaction is performed. By comparison, indexing 200 RNA samples according to conventional methods, as shown in FIG. 6A, requires 200 indexing reactions. The barcodes depicted in FIG. 6B, which are aligned with the embodiments provided herein, are separate from standard NGS indices. Therefore, the number of identifiers per RNA molecule across one or more samples can be combinatorially expanded, based on both traditional NGS indices and barcodes directly integrated via reverse transcription. For example, 96 distinct primers configured to directly reverse transcribe RNA molecules can be used in conjunction with 12 indices, to result in an analysis directed to 1152 RNA samples, within a single NGS run. In another example, 384 distinct primers configured to directly reverse transcribe RNA molecules can be used in conjunction with 24 indices, to result in an analysis directed to 9216 RNA samples, within a single NGS run.
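The arithmetic in the examples above can be reproduced with the short Python sketch below. The function names are hypothetical and are used only to make the two calculations explicit: the number of samples addressable in one run, and the number of indexing reactions needed for a given pooling group size.

```python
def samples_per_run(n_barcoded_primers, n_indices):
    """Samples distinguishable by combining RT-integrated barcodes with NGS indices."""
    return n_barcoded_primers * n_indices

def indexing_reactions(n_samples, samples_per_pool=1):
    """Indexing reactions needed when samples are pooled in groups before indexing.
    samples_per_pool=1 corresponds to one indexing reaction per sample."""
    return -(-n_samples // samples_per_pool)  # ceiling division

print(samples_per_run(96, 12))       # 96 primers x 12 indices
print(samples_per_run(384, 24))      # 384 primers x 24 indices
print(indexing_reactions(200, 100))  # pooled groups of 100 samples
print(indexing_reactions(200))       # one reaction per sample
```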
[0066] FIG. 7A provides a schematic of a standard oligonucleotide, e.g., primer, used for NGS library preparation. A naive approach using a standard primer comprising a barcode sequence B at the 5’ end of a random hexamer-comprising primer will likely introduce strong bias in the reverse transcription step, resulting in dropout of some sequences and overrepresentation of other sequences. These two effects occur for two reasons. First, as depicted in FIG. 7B, the random hexamer binds to different subsequences of the barcode sequence, resulting in a formation of hairpin structures that render the primer less effective at reverse transcription, with variable effect based on the barcode sequence. Second, as depicted in FIG. 7C, the barcode sequence will bind randomly to different RNA sequences, increasing the representation of those positively biased RNA sequences.
[0067] FIG. 8 provides a schematic overview of using oligonucleotides in accordance with an embodiment of the present disclosure for NGS library preparation, such as oligonucleotides comprising a helper sequence. FIG. 8 describes an approach where the barcodes are integrated during reverse transcription, using a set of hairpin primer sequences with a degenerate randomer sequence at the 3’ end. In some instances, the 3’ end sequence that hybridizes to the target molecule, e.g., the RNA molecule, is a random hexamer (N_6). The stem of the primer hairpin is a universal sequence, U, and can be the same for all primers in the set. The loop of the primer hairpin can be the barcode sequence, and can be unique for each primer in the set. In some embodiments, the barcode sequence can be between 4 and 14 nucleotides long, inclusive. In some embodiments, the universal sequence comprising half of the hairpin stem can be between 10 and 20 nucleotides long, inclusive. Because of the relatively long length of the primers (between 30 and 60 nucleotides, inclusive), the primer may not be fully removed from the cDNA product, if using size selection. The random hexamer sequences can be replaced with energy-optimized mixamers.
[0068] FIG. 9 provides a schematic overview of an additional method of NGS library preparation in accordance with an embodiment of the present disclosure. The method depicted in FIG. 9 is based on click chemistry, and uses a short degenerate randomer primer with a 5’ chemical modification conducive to click chemistry. In some embodiments, the 5’ chemical modification is a DBCO functional group. In other embodiments, the 5’ chemical modification is an azide functional group. In some embodiments, the 3’ end of the primer sequence is a random hexamer (N_6). After the primer is extended via reverse transcription to form cDNA, one of a set of barcode oligonucleotides is added to each cDNA sample. Each barcode oligonucleotide is functionalized with a 3’ chemical modification conducive to click chemistry in conjunction with the 5’ modification of the primer. In some embodiments, the 3’ chemical modification of the barcode oligonucleotide is an azide where the 5’ chemical modification of the primer sequence is a DBCO functional group. In some embodiments, there is an additional spacer DNA sequence at the 3’ end of the barcode sequence between the barcode sequence and the 3’ functional group. In some embodiments, the additional spacer sequence is the same sequence for all barcode oligonucleotides. The random hexamer sequences can be replaced with energy-optimized mixamers.
[0069] FIG. 10 provides example primer sequences comprising helper sequences and random hexamer (N_6) sequences; the random hexamer sequences can be replaced with energy-optimized mixamers. For FIGS. 10A-10C, U denotes a universal sequence, U* denotes a complementary universal sequence, and B denotes a barcode sequence. In some variations, the universal sequence, U, as shown in these figures, refers to a portion of the hairpin structure that binds to complementary bases that are in physical opposition to U. The complementary bases that are in physical opposition to U are referred to as U*. The helper sequence, in contrast, can refer to the entire hairpin structure. That is, the helper sequence can comprise U, U*, and the portion of the hairpin structure that does not participate in complementary nucleotide binding. The portion of the hairpin structure that does not participate in complementary nucleotide binding can be a barcode, B. The helper sequence can comprise U, U*, and B. The helper sequence comprises the universal sequence, U. The universal sequence, U, can comprise a length of at least 8 and at most 25 nucleotides. The universal sequence, U, can, however, comprise a length greater than 25 nucleotides. The universal sequence need not be limited to any theoretical length, but may be impacted by the manufacturing capabilities used for manufacturing an oligonucleotide, e.g., primer, comprising the universal sequence.
[0070] FIG. 10A shows a short primer design with a length of 30 nt. It should be understood that “nt” refers to “nucleotides”. The universal sequence is 10 nt, the barcode is 4 nt, and the priming sequence is N_6. When the universal sequence is significantly shorter than 10 nt, the hairpin stem will be less stable, and may result in the N_6 priming sequence forming an alternative undesired hairpin with other sequences of the primer. FIG. 10B shows a moderate length primer of 37 nt. The universal sequence is extended to 12 nt to further ensure that the designed hairpin structure is the most stable configuration of the primer. The barcode has been increased to 7 nt to provide error-detecting/correcting capabilities. FIG. 10C shows a variant moderate length primer of 41 nt. Here, the universal sequence is extended to 14 nt, but the two sequences that form the hairpin stem are mismatched by 1 nt in the middle. The intentional mismatch in the hairpin stem can be beneficial in downstream library preparation steps in preventing the DNA polymerase from getting stuck during polymerase chain reaction (PCR) amplification. For FIGS. 10A, 10B, and 10C, the helper sequence length can be determined by subtracting, from the length of the entire molecule (in this case, the primer molecule), the length of the region of the molecule that is configured to hybridize to the target nucleotide (in this case, the random hexamer sequence). For example, in FIG. 10A, the length of the entire primer molecule is 30 nt, and subtracting the length of the random hexamer sequence, which is 6 nt, gives 30 nt - 6 nt = 24 nt as the length of the helper sequence. That is, the helper sequence, in the case of the primer molecules represented by the schematics in FIGS. 10A, 10B, and 10C, is the sequence that does not bind to the target nucleic acid.
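As a non-limiting sketch, the 5’ to 3’ layout U, B, U*, N_k and the length arithmetic described above can be expressed in Python as follows. The function names and the specific U and B sequences are hypothetical and are not taken from FIG. 10; U* is assumed here to be the exact reverse complement of U, so the intentionally mismatched stem of FIG. 10C is not modeled.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]

def hairpin_primer(universal, barcode, randomer_len=6):
    """Return (primer, helper_length) for the layout U + B + U* + N_k."""
    helper = universal + barcode + reverse_complement(universal)
    primer = helper + "N" * randomer_len
    return primer, len(helper)

# Hypothetical sequences with the lengths described for FIG. 10A and FIG. 10B.
short_primer, short_helper_len = hairpin_primer("CACACACACA", "ACGT")
mid_primer, mid_helper_len = hairpin_primer("CACACACACACA", "GATCGCT")
print(len(short_primer), short_helper_len)
print(len(mid_primer), mid_helper_len)
```

The helper length returned by the sketch equals the full primer length minus the randomer length, matching the subtraction described for FIG. 10A.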
[0071] FIG. 11 depicts an example implementation of error-correcting barcode sequences using Hamming encodings. In the example depicted in FIG. 11A, every adenine (“a”) nucleotide is mapped to 0, guanine (“g”) is mapped to 1, thymine (“t”) is mapped to 2, and cytosine (“c”) is mapped to 3, in modulus 4 (mod 4) arithmetic. Note that in mod 4, 3 is equivalent to -1. The 4 nucleotide positions denoted by x1, x2, x3, and x4 are the information-encoding nucleotides, and the 3 nucleotide positions denoted by c1, c2, and c3 are the error-checking nucleotides. Such a configuration can be notated as a (4,7) Hamming barcode sequence, where the 4 represents the number of information-encoding nucleotides and the 7 represents the total number of nucleotides in the barcode sequence, including the error-checking nucleotides. To implement error correction, the x1, x2, x3, and x4 values can each be arbitrarily set as 0, 1, 2, or 3, and the values of c1, c2, and c3 can be computed using the listed formulas. In the example depicted in FIG. 11A, K1, K2, and K3 are constants. In some embodiments, K1=K2=K3=1. In the interpretation of a 7-letter barcode sequence, any one nucleotide sequencing error can be corrected by identifying the value change that restores the 3 equations listed. For example, an error reporting x1 as “t” instead of “g” would uniquely result in the first two equations requiring K1=K2=0, rather than K1=K2=1. A 7-nucleotide barcode sequence can be structured arbitrarily. FIGS. 11B and 11C depict two implementations of a Hamming barcode sequence, where both sequences are based on the same system of linear equations used for generating the values of the information-encoding and error-checking nucleotides. In FIG. 11B, the 4 information-encoding nucleotides x1, x2, x3, and x4 are clustered together and listed first, followed by c1, c2, and c3. That is, the information-encoding nucleotides comprise a cluster in the Hamming barcode sequence. In the example of FIG. 11C, the 4 information-encoding nucleotides are interspersed with the error-checking nucleotides. FIG. 11C can be alternatively described as the 3 error-checking nucleotides being interspersed across the Hamming barcode sequence. Longer Hamming barcodes can also be used.
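By way of non-limiting illustration, the encoding and single-error correction described for FIG. 11 can be sketched in Python as follows. The nucleotide-to-value mapping, the mod 4 arithmetic, the clustered layout of FIG. 11B, and K1=K2=K3=1 come from the description above. The three parity equations in `CHECKS` are an assumption (a conventional Hamming layout), because the formulas listed in FIG. 11A are not reproduced in the text; the function names `encode` and `correct` are likewise hypothetical.

```python
# Mapping from the description: a=0, g=1, t=2, c=3 (mod 4 arithmetic).
NUC_TO_VAL = {"a": 0, "g": 1, "t": 2, "c": 3}
VAL_TO_NUC = {v: n for n, v in NUC_TO_VAL.items()}

# ASSUMPTION: which positions (x1, x2, x3, x4, c1, c2, c3) enter each equation.
# Equation i reads: sum of the selected positions == K[i] (mod 4).
CHECKS = [
    (1, 1, 0, 1, 1, 0, 0),
    (1, 0, 1, 1, 0, 1, 0),
    (0, 1, 1, 1, 0, 0, 1),
]
K = (1, 1, 1)  # K1 = K2 = K3 = 1, as in some embodiments


def encode(info):
    """Append c1, c2, c3 to four information nucleotides (FIG. 11B layout)."""
    x = [NUC_TO_VAL[n] for n in info]
    checks = [
        (K[i] - sum(h * v for h, v in zip(CHECKS[i][:4], x))) % 4
        for i in range(3)
    ]
    return info + "".join(VAL_TO_NUC[c] for c in checks)


def correct(barcode):
    """Return the barcode with at most one substitution repaired, else None."""
    vals = [NUC_TO_VAL[n] for n in barcode]
    syndrome = tuple(
        (sum(h * v for h, v in zip(row, vals)) - k) % 4
        for row, k in zip(CHECKS, K)
    )
    if syndrome == (0, 0, 0):
        return barcode  # all three equations already hold
    # Find the single position and value shift that explains the syndrome.
    for pos in range(7):
        for shift in (1, 2, 3):
            if tuple((shift * row[pos]) % 4 for row in CHECKS) == syndrome:
                vals[pos] = (vals[pos] - shift) % 4
                return "".join(VAL_TO_NUC[v] for v in vals)
    return None  # not explainable by a single substitution


word = encode("gatc")
print(word, correct("t" + word[1:]))
```

Under this assumed layout, each of the seven positions participates in a distinct subset of the three equations, so the pattern of violated equations identifies the erroneous position and the size of the violation identifies the substituted value.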
[0072] FIG. 12 depicts an example workflow for primer-based RNAseq library preparation. A 96-well plate can be pre-loaded with a different barcoded primer in each well, along with lyophilized enzyme and reagents for reverse transcription. First, RNA samples can be dispensed into each well of the 96-well plate, possibly using a multi-channel pipettor, as shown in 1202. The resulting solutions can be incubated at a temperature suitable for reverse transcription (e.g., 37 °C or 42 °C), as shown in 1204, and then pooled into a single mixed library, as shown in 1206. Subsequently, the mixed library can be purified using standard size selection or affinity binding reagents, as shown in 1208, and next-generation sequencing (NGS) adapters and indexes can be appended, as shown in 1210. Importantly, all reagents, consumables, and labor for steps after the formation of the mixed library are reduced 96-fold. The mixed library after indexing can be further pooled with other indexed libraries, including other indexed mixed libraries generated in the same manner.
[0073] FIG. 13 depicts an example workflow for click chemistry-based RNAseq library preparation. A 96-well plate is loaded with extracted RNA samples. A mixture of DBCO-random hexamer (DBCO-N_6) degenerate randomer primer, reverse transcriptase, and reverse transcriptase buffer/reagents can be added to each well, as shown in 1302, and then the resultant solutions can be incubated at a temperature suitable for reverse transcription (e.g., 37 °C or 42 °C), as shown in 1304. Subsequently, a distinct azide-functionalized barcode oligonucleotide can be added to each well, as shown in 1306, and then the solutions can be further incubated to allow the click chemistry reaction to proceed, as shown in 1308. Next, at 1310, excess DBCO (dibenzocyclooctyne-amine) can be added to terminate the reaction and prevent unintended cross-barcode reaction in subsequent steps. At 1312, the reaction products from the 96 wells can be combined into a single mixed library. Compared to the RNAseq library preparation workflow shown in FIG. 12, the click chemistry-based workflow shown in FIG. 13 requires more labor and reagents before pooling. The benefit of the click chemistry-based workflow shown in FIG. 13, however, is that shorter barcode oligonucleotides and N_6 primers can be used, which, due to their shorter lengths, can be efficiently removed via size-based purification. At 1314, unused primers are removed by purification, and at 1316, further NGS processing reactions, such as the adding of reagents for ligating adaptors and/or indices, are provided. In any of the described contexts, the random hexamer (N_6) can be replaced with energy-optimized mixamers.
[0074] FIG. 14 provides a schematic comparing random hexamer sequences against energy-optimized mixamers for NGS library preparation, such as for reverse transcribing RNA molecules. The traditional random hexamer (N_6) used for processing nucleic acid molecules, such as RNA molecules, is synthesized as a single oligonucleotide, incorporating a roughly equal mix of all 4 bases (adenine, thymine, guanine, and cytosine) at each position. In total, random hexamers are synthesized with the aim of achieving a roughly equal concentration distribution of all 4,096 DNA oligonucleotides that are 6 nt long. The 4,096 random hexamer sequences, however, exhibit a wide range of hybridization thermodynamics due to differences in G/C content, resulting in potential biases in the RNAseq results, namely disfavoring RNA species with a lower G/C fraction. The present disclosure, as shown in FIG. 14, depicts energy-optimized mixamers, such as the example sequence motif of five units of either a strong nucleotide or two adjacent weak nucleotides, denoted as (S/WW)_5. The strong nucleotide, S, corresponds to a G nucleotide or a C nucleotide, whereas the weak nucleotide, W, corresponds to an A or a T nucleotide. Thus, WW refers to AA, AT, TA, or TT nucleotides. The energy-optimized mixamers comprising the sequence motif (S/WW)_5 can be synthesized as 32 independent oligonucleotides with degenerate mixed bases as shown.
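By way of non-limiting illustration, the (S/WW)_5 motif can be enumerated with a short Python sketch. The motif definition (S is G or C; WW is AA, AT, TA, or TT; five units) comes from the description above, while the function names are hypothetical. Each unit has 6 possible realizations, so the motif expands to 6^5 = 7,776 sequences of 5 to 10 nt, and the choice of S or WW at each of the five units gives 2^5 = 32 degenerate oligonucleotide patterns.

```python
from itertools import product

STRONG = ["G", "C"]                                 # S: one strong nucleotide
WEAK_PAIRS = [a + b for a in "AT" for b in "AT"]    # WW: AA, AT, TA, TT
UNITS = STRONG + WEAK_PAIRS                         # 6 realizations per unit


def enumerate_mixamers(n_units=5):
    """All concrete sequences matching (S/WW)_n."""
    return ["".join(units) for units in product(UNITS, repeat=n_units)]


def degenerate_patterns(n_units=5):
    """All degenerate patterns, one per independently synthesized oligo."""
    return ["".join(p) for p in product(["S", "WW"], repeat=n_units)]


mixamers = enumerate_mixamers()
patterns = degenerate_patterns()
print(len(mixamers), len(patterns))
print(min(map(len, mixamers)), max(map(len, mixamers)))
```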
Methods for analyzing nucleic acid molecules
[0075] Disclosed herein are methods for analyzing nucleic acid molecules, including methods of reverse transcribing RNA molecules, a method of checking nucleotide errors in a Hamming barcode sequence, and a method of generating energy-optimized mixamers.
[0076] FIG. 1 shows an exemplary schematic showing a general process 100 for reverse transcribing RNA molecules. The method can include: extracting the RNA molecules from a sample (102); and reverse transcribing the RNA molecules into cDNA molecules, with primers comprising a helper sequence, the helper sequence comprising a hairpin structure, the hairpin structure configured to mitigate untargeted hybridizing of the primers to an off-target nucleotide sequence (104).
[0077] FIG. 2A shows an exemplary schematic showing a general process 200A for reverse transcribing RNA molecules. The method can include: extracting the RNA molecules from a sample (202A); reverse transcribing the RNA molecules into cDNA molecules, with primers comprising DBCO functional groups on the 5’ ends of the primers, to generate DBCO-functionalized cDNA molecules (204A); providing azide-functionalized barcode oligonucleotides to the DBCO-functionalized cDNA molecules (206A); and ligating the azide-functionalized barcode oligonucleotides to the DBCO-functionalized cDNA molecules via a click reaction, thereby generating barcoded cDNA molecules (208A). FIG. 2B also shows an exemplary schematic showing a general process 200B for reverse transcribing RNA molecules. The method can include: extracting the RNA molecules from a sample (202B); reverse transcribing the RNA molecules into cDNA molecules, with primers comprising azide functional groups on the 5’ ends of the primers, to generate azide-functionalized cDNA molecules (204B); providing DBCO-functionalized barcode oligonucleotides to the azide-functionalized cDNA molecules (206B); and ligating the DBCO-functionalized barcode oligonucleotides to the azide-functionalized cDNA molecules via a click reaction, thereby generating barcoded cDNA molecules (208B).
[0078] The general process 200A or 200B can further comprise adding a free azide molecule. The free azide molecule can be sodium azide. The azide-functionalized molecule is not identical to the azide-functionalized barcode nucleic acid molecules. The click reaction, i.e., click chemistry reaction, used to generate the barcoded cDNA molecules at 208A can be between the azide functional group of the azide-functionalized barcode oligonucleotides and the DBCO functional group of the DBCO-functionalized cDNA molecules. Similarly, the click reaction used to generate the barcoded cDNA molecules at 208B can be between the azide functional group of the azide-functionalized cDNA molecules and the DBCO functional group of the DBCO-functionalized barcode nucleic acid molecules.
[0079] FIG. 3 shows an exemplary schematic showing a general process 300 for checking nucleotide errors in a Hamming barcode sequence. The method can include: creating a system of linear equations, using one or more processors, based on the Hamming barcode sequence, values of the system of linear equations comprising: predetermined information values assigned to information-encoding nucleotides, expected predetermined constant values, and error-checking values assigned to error-checking nucleotides (302); and solving for the error-checking values, using the one or more processors, from the system of linear equations (304).
[0080] FIG. 4 shows an exemplary schematic showing a general process 400 for checking nucleotide errors in a Hamming barcode sequence. The method can include: creating a system of linear equations, using one or more processors, based on the Hamming barcode sequence, values of the system of linear equations comprising: predetermined information values assigned to information-encoding nucleotides, expected predetermined constant values, and error-checking values assigned to error-checking nucleotides (402); solving for the error-checking values, using the one or more processors, from the system of linear equations (404); solving for error-checked predetermined constant values, using the one or more processors, based on the predetermined information values and the error-checking values (406); and determining a difference between the expected predetermined constant values and the error-checked predetermined constant values.
[0081] FIG. 5 shows an exemplary schematic showing a general process 500 for generating energy-optimized mixamers. The method can include: determining mixamer sequences comprising at least two adjacent weak nucleotides or at least one strong nucleotide (502); determining a distribution of predicted equilibrium Gibbs standard free energies for the mixamer sequences hybridizing to one or more target nucleic acid sequences at a predetermined temperature and a predetermined salinity conducive to reverse transcription (504); and selecting energy-optimized mixamers based on the distribution, when the distribution comprises a standard deviation equilibrium Gibbs standard free energy less than or equal to a standard deviation equilibrium Gibbs standard free energy threshold (506).
[0082] Process 300, 400, or 500 can be performed, for example, using one or more electronic devices implementing a software platform. In some examples, process 300, 400, or 500 is performed using a client-server system, and the blocks of process 300, 400, or 500 are divided up in any manner between the server and a client device. In other examples, the blocks of process 300, 400, or 500 are divided up between the server and multiple client devices. Thus, while portions of process 300, 400, or 500 are described herein as being performed by particular devices of a client-server system, it will be appreciated that process 300, 400, or 500 is not so limited. In other examples, process 300, 400, or 500 is performed using only a client device or only multiple client devices. In process 300, 400, or 500, some blocks are, optionally, combined, the order of some blocks is, optionally, changed, and some blocks are, optionally, omitted. In some examples, additional steps may be performed in combination with the process 300, 400, or 500. Accordingly, the operations as illustrated (and described in greater detail below) are exemplary by nature and, as such, should not be viewed as limiting.
Systems
[0083] In some aspects, disclosed herein are systems for reverse transcribing RNA molecules comprising: the RNA molecules, the RNA molecules having been extracted from a sample; and primers comprising a helper sequence, the helper sequence comprising a hairpin structure, and the hairpin structure configured to mitigate untargeted hybridizing of the primers to an off-target nucleotide sequence.
[0084] In some aspects, disclosed herein are systems for reverse transcribing RNA molecules comprising: cDNA molecules reverse transcribed from RNA molecules, the RNA molecules having been extracted from a sample, and the cDNA molecules having been reverse transcribed with primers that comprise a helper sequence, the helper sequence comprising a hairpin structure, and the hairpin structure configured to mitigate untargeted hybridizing of the primers to an off-target nucleotide sequence.
[0085] In some aspects, disclosed herein are systems for reverse transcribing RNA molecules comprising: RNA molecules, the RNA molecules having been extracted from a sample; and primers comprising DBCO functional groups on the 5’ ends of the primers. In some aspects, disclosed herein are systems for reverse transcribing RNA molecules comprising: cDNA molecules reverse transcribed from RNA molecules having been extracted from a sample, using primers comprising DBCO functional groups on the 5’ ends of the primers. In some aspects, disclosed herein are systems for reverse transcribing RNA molecules comprising: cDNA molecules reverse transcribed from the RNA molecules having been extracted from a sample, using primers comprising DBCO functional groups on the 5’ ends of the primers; and azide-functionalized barcode oligonucleotides. In some aspects, disclosed herein are systems for reverse transcribing RNA molecules comprising: barcoded cDNA molecules generated from a click reaction comprising primers that comprise DBCO functional groups on the 5’ ends of the primers and azide-functionalized barcode oligonucleotides.
[0086] In some aspects, disclosed herein are systems for checking nucleotide errors in a Hamming barcode sequence. The systems may comprise, e.g., one or more processors, and a memory unit communicatively coupled to the one or more processors and configured to store instructions that, when executed by the one or more processors, cause the system to: create a system of linear equations based on the Hamming barcode sequence, values of the system of linear equations comprising: predetermined information values assigned to information-encoding nucleotides, expected predetermined constant values, and error-checking values assigned to error-checking nucleotides; and solve for the error-checking values, using the one or more processors, from the system of linear equations. The system may comprise further instructions that cause the system to: solve for error-checked predetermined constant values, using the one or more processors, based on the predetermined information values and the error-checking values; and determine a difference between the expected predetermined constant values and the error-checked predetermined constant values.
[0087] In some aspects, disclosed herein are systems for generating energy-optimized mixamers. The systems may comprise, e.g., one or more processors, and a memory unit communicatively coupled to the one or more processors and configured to store instructions that, when executed by the one or more processors, cause the system to: determine mixamer sequences comprising at least two adjacent weak nucleotides or at least one strong nucleotide; determine a distribution of predicted equilibrium Gibbs standard free energies for the mixamer sequences hybridizing to one or more target nucleic acid sequences at a temperature and salinity conducive to reverse transcription; and select energy-optimized mixamers based on the distribution, when the distribution comprises a standard deviation equilibrium Gibbs standard free energy less than or equal to a standard deviation equilibrium Gibbs standard free energy threshold.
[0088] In some instances, the disclosed systems may further comprise sample processing and library preparation workstations, microplate-handling robotics, fluid dispensing systems, temperature control modules, environmental control chambers, additional data storage modules, data communication modules (e.g., Bluetooth®, WiFi, intranet, or internet communication hardware and associated software), display modules, one or more local and/or cloud-based software packages (e.g., instrument/system control software packages, sequencing data analysis software packages), etc., or any combination thereof. In some instances, the systems may comprise, or be part of, a computer system or computer network as described elsewhere herein.
[0089] In certain aspects, also disclosed herein are systems of molecular components for analyzing a nucleic acid molecule from a subject. The system can comprise the components of a chemical reaction, such as the products and/or the reactants of a chemical reaction. In such cases, the chemical reaction may not necessarily react to completion, and a proportion of reactants may remain at reaction equilibrium. Amounts of reactants that remain during equilibrium can be effectively ignored from the system, given that the system has reached chemical reaction equilibrium. The remaining reactant amounts need not be trace amounts that are undetectable to instrumentation.
Computer systems and networks
[0090] FIG. 15 illustrates an example of a computing device or system in accordance with one embodiment. Device 1500 can be a host computer connected to a network. Device 1500 can be a client computer or a server. As shown in FIG. 15, device 1500 can be any suitable type of microprocessor-based device, such as a personal computer, workstation, server or handheld computing device (portable electronic device) such as a phone or tablet. The device can include, for example, one or more processor(s) 1510, input devices 1520, output devices 1530, memory or storage devices 1540, communication devices 1560, and nucleic acid sequencers 1570. Software 1550 residing in memory or storage device 1540 may comprise, e.g., an operating system as well as software for executing the methods described herein. Input device 1520 and output device 1530 can generally correspond to those described herein, and can either be connectable or integrated with the computer.
[0091] Input device 1520 can be any suitable device that provides input, such as a touch screen, keyboard or keypad, mouse, or voice-recognition device. Output device 1530 can be any suitable device that provides output, such as a touch screen, haptics device, or speaker.
[0092] Storage 1540 can be any suitable device that provides storage (e.g., an electrical, magnetic or optical memory including a RAM (volatile and non-volatile), cache, hard drive, or removable storage disk). Communication device 1560 can include any suitable device capable of transmitting and receiving signals over a network, such as a network interface chip or device. The components of the computer can be connected in any suitable manner, such as via wired media (e.g., a physical system bus 1580, Ethernet connection, or any other wire transfer technology) or wirelessly (e.g., Bluetooth®, Wi-Fi®, or any other wireless technology).
[0093] Software module 1550, which can be stored as executable instructions in storage 1540 and executed by processor(s) 1510, can include, for example, an operating system and/or the processes that embody the functionality of the methods of the present disclosure (e.g., as embodied in the devices as described herein).
[0094] Software module 1550 can also be stored and/or transported within any non-transitory computer-readable storage medium for use by or in connection with an instruction execution system, apparatus, or device, such as those described herein, that can fetch instructions associated with the software from the instruction execution system, apparatus, or device and execute the instructions. In the context of this disclosure, a computer-readable storage medium can be any medium, such as storage 1540, that can contain or store processes for use by or in connection with an instruction execution system, apparatus, or device. Examples of computer-readable storage media may include memory units like hard drives, flash drives, and distributed modules that operate as a single functional unit. Also, various processes described herein may be embodied as modules configured to operate in accordance with the embodiments and techniques described above. Further, while processes may be shown and/or described separately, those skilled in the art will appreciate that the above processes may be routines or modules within other processes.
[0095] Software module 1550 can also be propagated within any transport medium for use by or in connection with an instruction execution system, apparatus, or device, such as those described above, that can fetch instructions associated with the software from the instruction execution system, apparatus, or device and execute the instructions. In the context of this disclosure, a transport medium can be any medium that can communicate, propagate or transport programming for use by or in connection with an instruction execution system, apparatus, or device. The transport readable medium may include, for example, an electronic, magnetic, optical, electromagnetic or infrared wired or wireless propagation medium.
[0096] Device 1500 may be connected to a network (e.g., network 1604, as shown in FIG. 16 and/or described below), which can be any suitable type of interconnected communication system. The network can implement any suitable communications protocol and can be secured by any suitable security protocol. The network can comprise network links of any suitable arrangement that can implement the transmission and reception of network signals, such as wireless network connections, T1 or T3 lines, cable networks, DSL, or telephone lines.
[0097] Device 1500 can be implemented using any operating system, e.g., an operating system suitable for operating on the network. Software module 1550 can be written in any suitable programming language, such as C, C++, Java or Python. In various embodiments, application software embodying the functionality of the present disclosure can be deployed in different configurations, such as in a client/server arrangement or through a Web browser as a Web-based application or Web service, for example. In some embodiments, the operating system is executed by one or more processors, e.g., processor(s) 1510.
[0098] FIG. 16 illustrates an example of a computing system in accordance with one embodiment. In system 1600, device 1500 (e.g., as described above and illustrated in FIG. 15) is connected to network 1604, which is also connected to device 1606.
[0099] Devices 1500 and 1606 may communicate, e.g., using suitable communication interfaces via network 1604, such as a Local Area Network (LAN), Virtual Private Network (VPN), or the Internet. In some embodiments, network 1604 can be, for example, the Internet, an intranet, a virtual private network, a cloud network, a wired network, or a wireless network. Devices 1500 and 1606 may communicate, in part or in whole, via wireless or hardwired communications, such as Ethernet, IEEE 802.11b wireless, or the like. Additionally, devices 1500 and 1606 may communicate, e.g., using suitable communication interfaces, via a second network, such as a mobile/cellular network. Communication between devices 1500 and 1606 may further include or communicate with various servers such as a mail server, mobile server, media server, telephone server, and the like. In some embodiments, devices 1500 and 1606 can communicate directly (instead of, or in addition to, communicating via network 1604), e.g., via wireless or hardwired communications, such as Ethernet, IEEE 802.11b wireless, or the like. In some embodiments, devices 1500 and 1606 communicate via communications 1608, which can be a direct connection or can occur via a network (e.g., network 1604).
[0100] One or all of devices 1500 and 1606 generally include logic (e.g., http web server logic) or are programmed to format data, accessed from local or remote databases or other sources of data and content, for providing and/or receiving information via network 1604 according to various examples described herein.
EXAMPLES
[0101] The following examples further demonstrate to one skilled in the art how to make and use the methods and systems described herein, and are not intended to limit the scope of the claimed invention.
Example 1
[0102] The mixamer set used to generate data shown in FIG. 15 ranged from 5 to 10 nucleotides in length and in total comprised 7,776 sequences. The bottom panels show the predicted, e.g., calculated, thermodynamics of DNA hybridization for the random hexamers (N_6) versus the energy-optimized mixamers (S/WW)_5, at 37 °C in 0.18 M sodium (Na+), based on thermodynamic parameters from SantaLucia and Hicks, (2004) Annu Rev Biophys Biomol Struct. 33:415-40. That is, the predicted hybridization thermodynamic values, e.g., the predicted hybridization equilibrium Gibbs standard free energy values, were calculated based on a biophysical simulation. The simulated temperature and salinity conditions for the biophysical simulation can be considered standard conditions, and can be conducive to reverse transcription, either in silico or in vitro. The distributions of the equilibrium Gibbs standard free energies were then plotted as histograms for both the random hexamers, as shown in FIG. 17A, and the energy-optimized mixamers, as shown in FIG. 17B. The energy-optimized mixamers exhibited a standard deviation of the ΔG° of hybridization that is 59% lower than that of the random hexamers (0.52 kcal/mol for the energy-optimized mixamers vs. 1.26 kcal/mol for the random hexamers). Energy-optimized mixamers (S/WW)_5 can replace random hexamers (N_6) in applications where the random hexamers are used during NGS library preparation, such as during the reverse transcription of RNA molecules.
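By way of non-limiting illustration, the comparison in Example 1 can be approximated with a simplified nearest-neighbor calculation in Python. This sketch is not the simulation used to generate FIGS. 17A and 17B. It assumes the unified DNA/DNA nearest-neighbor ΔG° values at 37 °C and 1 M NaCl as commonly tabulated for the model cited above (the values should be verified against that reference before use), each oligonucleotide pairing with its perfect complement, and no salt correction to 0.18 M Na+ and no symmetry term. The absolute numbers are therefore not expected to match the reported 0.52 and 1.26 kcal/mol; all function names are hypothetical.

```python
from itertools import product
from statistics import pstdev

# ASSUMED unified nearest-neighbor dG at 37 C, 1 M NaCl (kcal/mol), keyed by
# the 5'->3' dinucleotide; the other six steps follow by reverse complement.
NN_DG37 = {
    "AA": -1.00, "AT": -0.88, "TA": -0.58, "CA": -1.45, "GT": -1.44,
    "CT": -1.28, "GA": -1.30, "CG": -2.17, "GC": -2.24, "GG": -1.84,
}
INITIATION = 1.96    # duplex initiation
TERMINAL_AT = 0.05   # penalty per terminal A/T base pair
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def duplex_dg37(seq):
    """Predicted dG (kcal/mol) of seq bound to its perfect complement."""
    dg = INITIATION
    for i in range(len(seq) - 1):
        step = seq[i:i + 2]
        if step not in NN_DG37:
            step = reverse_complement(step)
        dg += NN_DG37[step]
    dg += TERMINAL_AT * sum(end in "AT" for end in (seq[0], seq[-1]))
    return dg


def dg_standard_deviation(seqs):
    """Population standard deviation of predicted dG across a primer set."""
    return pstdev([duplex_dg37(s) for s in seqs])


hexamers = ["".join(p) for p in product("ACGT", repeat=6)]
units = ["G", "C", "AA", "AT", "TA", "TT"]   # realizations of S or WW
mixamers = ["".join(p) for p in product(units, repeat=5)]

hex_sd = dg_standard_deviation(hexamers)
mix_sd = dg_standard_deviation(mixamers)
print(f"N_6 sd = {hex_sd:.2f} kcal/mol; (S/WW)_5 sd = {mix_sd:.2f} kcal/mol")
```

A selection step of the kind described for process 500 could then compare `mix_sd` against a chosen standard deviation threshold and accept the mixamer set only when the threshold is met.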
Example 2
[0103] The experimental design for this example is set forth in FIG. 18.
Materials and Methods
1. Evaluation Metrics for BIRT Performance
1.1 Barcode Reads
[0104] “Barcode reads” are defined as sequencing reads that contain the exact BIRT barcode sequence, including both the universal sequence located in the stem region and the unique barcode sequence within the loop region of the BIRT primer.
1.2 Non-barcode Reads
[0105] Reads that lack the complete universal sequence or only contain a partial version are classified as non-barcode reads.
1.3 Barcode Fraction
[0106] The barcode fraction is calculated as the percentage of total sequencing reads from the pooled library that contain a valid, full-length barcode sequence. It reflects the efficiency of barcode incorporation during reverse transcription.
1.4 Barcode Rate
[0107] The barcode rate is defined as the cumulative barcode fraction across all unique barcode sequences used in a pooled experiment. For example, if Barcodes 1, 2, and 3 yield barcode fractions of 20%, 25%, and 22%, respectively, the overall barcode rate is 67%.
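By way of non-limiting illustration, the barcode fraction and barcode rate defined above can be computed as in the following Python sketch. The function names and the read counts are hypothetical; the counts are chosen only so that the fractions reproduce the 20%, 25%, and 22% of the example.

```python
def barcode_fractions(reads_per_barcode, total_reads):
    """Percentage of all pooled reads carrying each full-length barcode."""
    return {
        barcode: 100.0 * count / total_reads
        for barcode, count in reads_per_barcode.items()
    }


def barcode_rate(fractions):
    """Cumulative barcode fraction across all barcodes in the pool."""
    return sum(fractions.values())


# Hypothetical counts out of 1,000 total reads in the pooled library.
counts = {"Barcode 1": 200, "Barcode 2": 250, "Barcode 3": 220}
fractions = barcode_fractions(counts, total_reads=1000)
print(fractions, barcode_rate(fractions))
```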
1.5 Alignment Rate
[0108] The alignment rate is defined as the percentage of total sequencing reads that successfully align to the reference genome. A high alignment rate indicates accurate reverse transcription and minimal artifacts. It serves as a key metric for evaluating the fidelity of the BIRT-mediated cDNA synthesis and the overall quality of the resulting RNA-Seq libraries.
2. Experimental Conditions for Optimal BIRT Performance
2.1 BIRT Primer Concentration
[0109] BIRT primers were tested across a concentration range of 0.1 µM to 100 µM. Concentrations between 10 and 20 µM yielded the highest barcode rates, representing the saturation range. Lower concentrations (e.g., 0.1-3.5 µM) are still functional but result in a modest reduction in barcoding efficiency. The method is robust across a wide primer concentration range.
2.2 Reverse Transcription Incubation Time
[0110] Standard RNA-Seq protocols typically use incubation conditions of 25 °C for 10 minutes followed by 42 °C for 15 minutes. In contrast, BIRT RNA-Seq achieves improved performance with extended incubation:
• Optimized condition: 25 °C for 20 minutes, then 42 °C for 30 minutes
• Tested working range: 25 °C for 10-40 minutes and 42 °C for 15-60 minutes
[0111] These adjustments allow more efficient primer annealing and barcode incorporation, especially in degraded RNA samples.
2.3 BIRT Primer Synthesis Purity
[0112] The performance of BIRT primers is positively correlated with synthesis purity. The following order of purification method performance was observed:
[0113] Dual HPLC > Single HPLC > PAGE > Desalting
[0114] While all methods yield functional primers, higher-purity primers (HPLC-based) consistently result in higher barcode rates.
2.4 RNA Input Type and Quality
[0115] BIRT RNA-Seq uses a universal primer design that enables reverse transcription from a wide range of RNA molecules, including highly degraded RNA such as that extracted from formalin-fixed paraffin-embedded (FFPE) tissues. This differs from standard poly-N or poly-A priming methods, which are either general (poly-N) or mRNA-specific (poly-A). Notably, poly-A priming requires high RNA integrity (e.g., RNA Integrity Number [RIN] > 5), whereas BIRT is tolerant of low-quality or fragmented RNA.
2.5 Low RNA Input Requirement
[0116] BIRT RNA-Seq demonstrates high sensitivity, enabling RNA-Seq library preparation from low-input RNA samples.
• Poly-N priming: >1 ng required; 10-100 ng recommended
• Poly-A priming: >10 ng required; 100-1000 ng recommended
• BIRT priming: functional at >0.3 ng; 3-10 ng recommended for optimal results
[0117] This low input tolerance makes BIRT particularly suited for precious or limited samples.
3. Exemplary Experimental Workflow
[0118] The following steps were performed.
[0119] Step 1. Mix each RNA sample with the corresponding BIRT primer, dNTPs, reverse transcriptase, and reaction buffer.
[0120] Step 2. Incubate the reaction mixture using the optimized protocol: 25 °C for 20 minutes, then 42 °C for 30 minutes, then 70 °C for 15 minutes (enzyme inactivation).
[0121] Step 3. For each sample, collect the entire 5 µL of first-strand cDNA product. Pool all 16 samples (each with a unique barcode) to obtain 80 µL of pooled cDNA.
[0122] Step 4. Take 20 µL from the pooled mixture and proceed with clean-up using standard methods.
[0123] Step 5. Proceed with standard RNA-Seq library construction, including end-repair, dA-tailing, adaptor ligation, and indexing PCR.
[0124] Step 6. Sequence the final library using commercially available platforms.
4. Data Analysis Workflow
[0125] Step 1: Barcode Demultiplexing
[0126] Identify the universal sequence in each read. Extract and match the associated barcode sequence to assign reads to their original sample.
[0127] Step 2: Downstream RNA-Seq Processing
[0128] Each demultiplexed dataset is then analyzed using a standard RNA-Seq pipeline, including quality control, alignment, quantification, and differential expression analysis as needed.
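By way of non-limiting illustration, the barcode demultiplexing step above can be sketched in Python as follows. The universal sequence, barcode length, barcode-to-sample table, and function name are all hypothetical placeholders and are not the sequences of Table 1. The sketch also assumes that the barcode immediately follows the universal sequence in the read and that exact matching is used.

```python
from collections import defaultdict

# HYPOTHETICAL placeholders, not sequences from the disclosure.
UNIVERSAL = "ACGTTGCAAC"
BARCODE_LEN = 4
SAMPLE_BY_BARCODE = {"AACC": "sample_1", "GGTT": "sample_2"}


def demultiplex(reads):
    """Group reads by sample using the barcode that follows UNIVERSAL."""
    assigned = defaultdict(list)
    for read in reads:
        start = read.find(UNIVERSAL)
        if start < 0:
            # No complete universal sequence: treated as a non-barcode read.
            assigned["non_barcode"].append(read)
            continue
        bc_start = start + len(UNIVERSAL)
        barcode = read[bc_start:bc_start + BARCODE_LEN]
        # Universal sequence present but barcode not in the table.
        sample = SAMPLE_BY_BARCODE.get(barcode, "unassigned")
        assigned[sample].append(read)
    return dict(assigned)


reads = [
    "TT" + UNIVERSAL + "AACC" + "GATTACA",
    "GGGGCCCC",
    UNIVERSAL + "GGTT" + "CC",
    UNIVERSAL + "TTTT",
]
for sample, group in demultiplex(reads).items():
    print(sample, len(group))
```

Each resulting group of reads could then be passed to the downstream RNA-Seq pipeline as a separate dataset.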
5. Representative Sequence
[0129] Representative sequences are provided in Table 1 below. Table 1.
[0130] FIG. 10D illustrates the actual BIRT primer structure used for all validation experiments.
[0131] The results of this example are provided in FIGS. 19A-19C. FIG. 19A shows high consistency with standard RNAseq. FIG. 19B shows good concordance across different barcodes. FIG. 19C shows good uniformity across different barcodes, and good reproducibility across different operators.
[0132] FIGS. 20A and 20B show a low input requirement. Specifically, 1 ng can be used without sacrificing performance. This is in contrast to standard RNA-seq, which typically requires 100-1000 ng.
[0133] It should be understood from the foregoing that, while particular implementations of the disclosed methods and systems have been illustrated and described, various modifications can be made thereto and are contemplated herein. It is also not intended that the invention be limited by the specific examples provided within the specification. While the invention has been described with reference to the aforementioned specification, the descriptions and illustrations of the preferable embodiments herein are not meant to be construed in a limiting sense.
Furthermore, it shall be understood that all aspects of the invention are not limited to the specific depictions, configurations or relative proportions set forth herein which depend upon a variety of conditions and variables. Various modifications in form and detail of the embodiments of the invention will be apparent to a person skilled in the art. It is therefore contemplated that the invention shall also cover any such modifications, variations and equivalents.

Claims

CLAIMS What is claimed is:
1. An oligonucleotide comprising a helper sequence that comprises a hairpin structure configured to mitigate untargeted hybridizing of a hybridizing region to an off-target nucleotide sequence.
2. The oligonucleotide of claim 1, the hairpin structure comprising nucleotides from exclusively the helper sequence.
3. The oligonucleotide of claim 1 or 2, wherein the helper sequence comprises at least 20 and at most 35 nucleotides.
4. The oligonucleotide of any of claims 1-3, wherein the oligonucleotide comprises the hybridizing region and the helper sequence.
5. The oligonucleotide of claim 4, wherein the oligonucleotide comprises a primer.
6. The oligonucleotide of any of claims 1-5, wherein the hybridizing region comprises a hexamer sequence.
7. The oligonucleotide of any of claims 1-6, wherein the hybridizing region comprises an energy-optimized mixamer.
8. The oligonucleotide of claim 7, wherein the energy-optimized mixamer comprises an equilibrium Gibbs standard free energy sampled from a distribution of predicted equilibrium Gibbs standard free energies for hybridizing nucleic acid sequences, the distribution comprising a standard deviation of about 0.4 to 0.6 kcal/mol.
9. The oligonucleotide of claim 7 or 8, wherein the distribution of predicted equilibrium Gibbs standard free energies further comprises a mean of about -5 to -7 kcal/mol.
10. The oligonucleotide of any of claims 7-9, wherein the energy-optimized mixamer comprises at least two adjacent weak nucleotides or at least one strong nucleotide.
11. The oligonucleotide of claim 10, wherein the energy-optimized mixamer comprises at least four of: the at least two adjacent weak nucleotides, the at least one strong nucleotide, or a combination thereof.
12. The oligonucleotide of claim 10 or 11, wherein a weak nucleotide of the at least two adjacent weak nucleotides comprises an adenine (A) nucleotide or a thymine (T) nucleotide or a uracil nucleotide.
13. The oligonucleotide of any of claims 10-12, wherein the at least one strong nucleotide comprises a guanine (G) nucleotide or a cytosine (C) nucleotide.
14. The oligonucleotide of any of claims 8-13, wherein standard conditions for the distribution of predicted equilibrium Gibbs standard free energies is determined at a temperature of about 40 °C and a salinity of about 0.2M Na+.
15. The oligonucleotide of any of claims 1-14, comprising a barcode sequence.
16. The oligonucleotide of any of claims 1-15, wherein the hairpin structure comprises the barcode sequence.
17. The oligonucleotide of claim 15 or 16, wherein the barcode sequence comprises a Hamming barcode sequence for checking nucleotide errors.
18. The oligonucleotide of claim 17, wherein the Hamming barcode sequence comprises information-encoding nucleotides and error-checking nucleotides and is configured to check for nucleotide errors in the Hamming barcode sequence, wherein the Hamming barcode sequence is based on a system of linear equations, values of the system of linear equations comprising: predetermined information values assigned to the information-encoding nucleotides, expected predetermined constant values for the system of linear equations, and error-checking values assigned to the error-checking nucleotides, based on solving the system of linear equations comprising the predetermined information values and the predetermined constant values.
19. The oligonucleotide of claim 18, wherein the information-encoding nucleotides comprise a cluster in the Hamming barcode sequence.
20. The oligonucleotide of claim 18 or 19, wherein the error-checking nucleotides are interspersed across the Hamming barcode sequence.
21. The oligonucleotide of any of claims 18-20, wherein the checking for errors comprises detecting nucleotide errors or correcting nucleotide errors.
22. The oligonucleotide of any of claims 18-21, wherein the checking for errors comprises solving for error-checked predetermined constant values based on the predetermined information values and the error-checking values, and the error-checked predetermined constant values being different from the expected predetermined constant values.
23. The oligonucleotide of any of claims 1-22, wherein the hairpin structure comprises a universal sequence and a complementary universal sequence.
24. The oligonucleotide of claim 23, wherein the universal sequence comprises a G/C percentage between about 40% and about 100%.
25. The oligonucleotide of claim 23 or 24, wherein the universal sequence comprises an identical nucleotide length to a nucleotide length of the complementary universal sequence.
26. The oligonucleotide of any of claims 23-25, wherein the universal sequence comprises at least 8, and optionally at most 25 nucleotides.
27. The oligonucleotide of any of claims 1-26, comprising a nucleotide mismatch or bulge in the hairpin structure.
28. The oligonucleotide of any of claims 1-27, wherein the off-target nucleotide sequence comprises RNA sequences.
29. The oligonucleotide of claim 28, wherein the RNA sequences comprise mRNA sequences.
30. A Hamming barcode sequence comprising information-encoding nucleotides and error-checking nucleotides and configured to check for nucleotide errors in the Hamming barcode sequence, wherein the Hamming barcode sequence is based on a system of linear equations, values of the system of linear equations comprising: predetermined information values assigned to the information-encoding nucleotides, expected predetermined constant values for the system of linear equations, and error-checking values assigned to the error-checking nucleotides, based on solving the system of linear equations comprising the predetermined information values and the predetermined constant values.
31. The Hamming barcode sequence of claim 30, wherein the information-encoding nucleotides comprise a cluster in the Hamming barcode sequence.
32. The Hamming barcode sequence of claim 30 or 31, wherein the error-checking nucleotides are interspersed across the Hamming barcode sequence.
33. The Hamming barcode sequence of any of claims 30-32, wherein the checking for errors comprises detecting nucleotide errors or correcting nucleotide errors.
34. The Hamming barcode sequence of any of claims 30-33, wherein the checking for errors comprises solving for error-checked predetermined constant values based on the predetermined information values and the error-checking values, and the error-checked predetermined constant values being different from the expected predetermined constant values.
35. The Hamming barcode sequence of any of claims 30-34, wherein each predetermined information value from the predetermined information values corresponds to an information-encoding nucleotide from the information-encoding nucleotides.
36. The Hamming barcode sequence of any of claims 30-35, wherein the information-encoding nucleotides comprise A, T, G, or C nucleotides.
37. The Hamming barcode sequence of any of claims 30-36, wherein the predetermined information values comprise 0, 1, 2, 3 or -1.
38. The Hamming barcode sequence of any of claims 30-37, wherein the system of linear equations comprises three linear equations.
39. The Hamming barcode sequence of any of claims 30-38, wherein the system of linear equations comprises four predetermined information values, three expected predetermined constants, and three error-checking values.
40. The Hamming barcode sequence of any of claims 30-39, wherein the expected predetermined constant values are equal in value to each other.
41. The Hamming barcode sequence of any of claims 30-40, wherein the expected predetermined constant values are equal to 1.
42. The Hamming barcode sequence of any of claims 30-41, wherein the Hamming barcode sequence comprises 7 nucleotides.
43. The Hamming barcode sequence of any of claims 30-42, wherein the Hamming barcode sequence comprises 4 information-encoding nucleotides.
44. The Hamming barcode sequence of any of claims 30-43, wherein the Hamming barcode sequence comprises 3 error-checking nucleotides.
45. The Hamming barcode sequence of any of claims 30-44, wherein a linear equation from the system of linear equations comprises a modulo operation, wherein a modulus of the modulo operation is equal to a number of unique information-encoding nucleotides.
46. The Hamming barcode sequence of claim 45, wherein the modulus is 4.
47. The Hamming barcode sequence of any of claims 30-46, wherein the system of linear equations comprises a system of modular mathematics linear equations.
48. The Hamming barcode sequence of any of claims 30-47, wherein a plurality of Hamming barcode sequences comprise the Hamming barcode sequence and from the plurality of Hamming barcode sequences, all pairs of Hamming barcode sequences comprise a Hamming distance of at least 2 nucleotides.
49. The Hamming barcode sequence of any of claims 30-48, wherein an oligonucleotide comprises the Hamming barcode sequence.
50. The Hamming barcode sequence of claim 49, wherein the oligonucleotide comprises a primer.
51. The Hamming barcode sequence of claim 49 or 50, wherein the oligonucleotide comprises a helper sequence.
52. The Hamming barcode sequence of claim 51, wherein the helper sequence comprises a hairpin structure, the hairpin structure configured to mitigate untargeted hybridizing of a hybridizing region to an off-target nucleotide sequence.
53. The Hamming barcode sequence of claim 52, wherein the hairpin structure comprises nucleotides from exclusively the helper sequence.
54. The Hamming barcode sequence of any of claims 49-53, wherein the oligonucleotide comprises a hexamer sequence.
55. The Hamming barcode sequence of any of claims 49-54, wherein the oligonucleotide comprises an energy-optimized mixamer.
56. The Hamming barcode sequence of claim 55, wherein the energy-optimized mixamer comprises an equilibrium Gibbs standard free energy sampled from a distribution of predicted equilibrium Gibbs standard free energies for hybridizing nucleic acid sequences, the distribution comprising a standard deviation of about 0.4 to 0.6 kcal/mol.
57. The Hamming barcode sequence of claim 56, wherein the distribution of predicted equilibrium Gibbs standard free energies further comprises a mean of about -5 to -7 kcal/mol.
58. The Hamming barcode sequence of claim 56 or 57, wherein the energy-optimized mixamer comprises at least two adjacent weak nucleotides or at least one strong nucleotide.
59. The Hamming barcode sequence of claim 58, wherein the energy-optimized mixamer comprises at least four of: the at least two adjacent weak nucleotides, the at least one strong nucleotide, or a combination thereof.
60. The Hamming barcode sequence of claim 58 or 59, wherein a weak nucleotide of the at least two adjacent weak nucleotides comprises an adenine (A) nucleotide or a thymine (T) nucleotide or a uracil nucleotide.
61. The Hamming barcode sequence of any of claims 58-60, wherein the at least one strong nucleotide comprises a guanine (G) nucleotide or a cytosine (C) nucleotide.
62. The Hamming barcode sequence of any of claims 56-61, wherein standard conditions for the distribution of predicted equilibrium Gibbs standard free energies is determined at a temperature of about 40 °C and a salinity of about 0.2M Na+.
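By way of non-limiting illustration, the seven-nucleotide barcode recited in claims 37-47 (four information values, three error-checking values, three equations each expected to equal 1, modulus 4) can be sketched in Python as follows. The nucleotide-to-value mapping, the choice of which information positions each equation covers (borrowed here from the classic binary Hamming(7,4) code), and the placement of the error-checking nucleotides at positions 1, 2 and 4 are assumptions made only for this sketch and are not taken from the claims. The names `encode`, `CHECK_COVERAGE` and `hamming_distance` are hypothetical.

```python
from itertools import product

# Assumed mapping: the claims name the nucleotides and the values 0-3,
# but not which nucleotide carries which value.
NUC_TO_VAL = {"A": 0, "T": 1, "G": 2, "C": 3}
VAL_TO_NUC = {v: n for n, v in NUC_TO_VAL.items()}

MODULUS = 4   # claim 46
CONSTANT = 1  # claim 41: every equation is expected to equal 1

# Assumed coverage (classic Hamming(7,4) pattern): each equation sums one
# error-checking value and three of the four information values (0-based).
CHECK_COVERAGE = [(0, 1, 3), (0, 2, 3), (1, 2, 3)]


def encode(info):
    """Return a 7-nt barcode for a 4-nt information word such as 'AAAA'."""
    d = [NUC_TO_VAL[n] for n in info]
    # Solve each equation  check + sum(covered info) = CONSTANT (mod 4).
    p = [(CONSTANT - sum(d[i] for i in cover)) % MODULUS
         for cover in CHECK_COVERAGE]
    # Assumed layout: checks at 1-based positions 1, 2 and 4, info elsewhere.
    values = [p[0], p[1], d[0], p[2], d[1], d[2], d[3]]
    return "".join(VAL_TO_NUC[v] for v in values)


def hamming_distance(a, b):
    """Number of positions at which two equal-length barcodes differ."""
    return sum(x != y for x, y in zip(a, b))


# Enumerate all 4**4 = 256 barcodes and measure their minimum pairwise distance.
barcodes = [encode("".join(word)) for word in product("ATGC", repeat=4)]
min_distance = min(
    hamming_distance(a, b)
    for i, a in enumerate(barcodes)
    for b in barcodes[i + 1:]
)
```

Under these assumptions, `encode("AAAA")` returns `TTATAAA`, and `min_distance` evaluates to 3 over all 256 barcodes, which is consistent with the pairwise distance of at least 2 nucleotides recited in claim 48.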
63. An energy-optimized mixamer sequence comprising an equilibrium Gibbs standard free energy sampled from a distribution of predicted equilibrium Gibbs standard free energies for hybridizing nucleic acid sequences, the distribution comprising a standard deviation of about 0.4 to 0.6 kcal/mol.
64. The energy-optimized mixamer of claim 63, wherein the distribution of predicted equilibrium Gibbs standard free energies further comprises a mean of about -5 to -7 kcal/mol.
65. The energy-optimized mixamer of claim 63 or 64, wherein the energy-optimized mixamer comprises at least two adjacent weak nucleotides or at least one strong nucleotide.
66. The energy-optimized mixamer of any of claims 63-65, wherein the energy-optimized mixamer comprises at least four of: the at least two adjacent weak nucleotides, the at least one strong nucleotide, or a combination thereof.
67. The energy-optimized mixamer of any of claims 63-66, wherein the energy-optimized mixamer comprises at least five of: the at least two adjacent weak nucleotides, the at least one strong nucleotide, or a combination thereof.
68. The energy-optimized mixamer of any of claims 63-67, wherein the energy-optimized mixamer comprises at least six of: the at least two adjacent weak nucleotides, the at least one strong nucleotide, or a combination thereof.
69. The energy-optimized mixamer of claim 68, wherein the energy-optimized mixamer comprises at least seven of: the at least two adjacent weak nucleotides, the at least one strong nucleotide, or a combination thereof.
70. The energy-optimized mixamer of claim 68 or 69, wherein a weak nucleotide of the at least two adjacent weak nucleotides comprises an adenine (A) nucleotide or a thymine (T) nucleotide or a uracil nucleotide.
71. The energy-optimized mixamer of any of claims 68-70, wherein the at least one strong nucleotide comprises a guanine (G) nucleotide or a cytosine (C) nucleotide.
72. The energy-optimized mixamer of any of claims 63-71, wherein the energy-optimized mixamer comprises at least 4 and at most 8 nucleotides.
73. The energy-optimized mixamer of any of claims 63-72, wherein the energy-optimized mixamer comprises at least 5 and at most 10 nucleotides.
74. The energy-optimized mixamer of any of claims 63-73, wherein the energy-optimized mixamer comprises at least 6 and at most 12 nucleotides.
75. The energy-optimized mixamer of any of claims 63-74, wherein the energy-optimized mixamer comprises at least 7 and at most 14 nucleotides.
76. The energy-optimized mixamer of any of claims 63-75, wherein the energy-optimized mixamer is of any of mixamer forms: SSSSSS, SSSSWW, SSSWWS, SSWWSS, SWWSSS, WWSSSS, SSSWWW, SSWWSWW, SWWSSWW, WWSSSWW, SSWWWWS, SWWSWWS, WWSSWWS, SWWWWSS, WWSWWSS, WWWWSSS, SSWWWWW, SWWSWWWW, SWWWWSWW, SWWWWWWS, WWSSWWWW, WWSWWSWW, WWSWWWWS, WWWWSSWW, WWWWSWWS, WWWWWWSS, SWWWWWWW, WWSWWWWWW, WWWWSWWWW, WWWWWWSWW, WWWWWWWWS, or WWWWWWWWWW, wherein S is a G nucleotide or a C nucleotide and W is an A nucleotide or a T nucleotide.
77. The energy-optimized mixamer of any of claims 63-76, wherein the energy-optimized mixamer is configured to hybridize to an RNA sequence.
78. The energy-optimized mixamer of claim 77, wherein the RNA sequence is an mRNA sequence.
79. The energy-optimized mixamer of any of claims 63-78, wherein the energy-optimized mixamer comprises at least a portion of a primer sequence.
80. The energy-optimized mixamer of any of claims 63-79, wherein the mean is about -6.13 kcal/mol.
81. The energy-optimized mixamer of any of claims 63-80, wherein the standard deviation is about 0.52 kcal/mol.
82. The energy-optimized mixamer of any of claims 63-81, wherein standard conditions for the distribution of predicted equilibrium Gibbs standard free energies is determined at a temperature of about 40 °C and a salinity of about 0.2M Na+.
83. The energy-optimized mixamer of any of claims 63-82, wherein the energy-optimized mixamer is 3’ of a helper sequence.
84. The energy-optimized mixamer of claim 83, wherein the helper sequence comprises a hairpin structure, the hairpin structure configured to mitigate untargeted hybridizing of a hybridizing region to an off-target nucleotide sequence.
85. The energy-optimized mixamer of claim 84, wherein the hairpin structure comprises nucleotides from exclusively the helper sequence.
86. The energy-optimized mixamer of any of claims 63-85, wherein the helper sequence comprises a barcode sequence.
87. The energy-optimized mixamer of claim 86, wherein the barcode sequence comprises a Hamming barcode sequence for checking nucleotide errors.
88. The energy-optimized mixamer of claim 87, wherein the Hamming barcode sequence comprises information-encoding nucleotides and error-checking nucleotides and is configured to check for nucleotide errors in the Hamming barcode sequence, wherein the Hamming barcode sequence is based on a system of linear equations, values of the system of linear equations comprising: predetermined information values assigned to the information-encoding nucleotides, expected predetermined constant values for the system of linear equations, and error-checking values assigned to the error-checking nucleotides, based on solving the system of linear equations comprising the predetermined information values and the predetermined constant values.
89. The energy-optimized mixamer of claim 88, wherein the information-encoding nucleotides comprise a cluster in the Hamming barcode sequence.
90. The energy-optimized mixamer of claim 88 or 89, wherein the error-checking nucleotides are interspersed across the Hamming barcode sequence.
91. The energy-optimized mixamer of any of claims 88-90, wherein the checking for errors comprises detecting nucleotide errors or correcting nucleotide errors.
92. The energy-optimized mixamer of any of claims 88-91, wherein the checking for errors comprises solving for error-checked predetermined constant values based on the predetermined information values and the error-checking values, and the error-checked predetermined constant values being different from the expected predetermined constant values.
93. A method of reverse transcribing RNA molecules comprising: extracting the RNA molecules from a sample; and reverse transcribing the RNA molecules into cDNA molecules, with primers comprising a helper sequence, the helper sequence comprising a hairpin structure, the hairpin structure configured to mitigate untargeted hybridizing of the primers to an off-target nucleotide sequence.
94. The method of claim 93, wherein the helper sequence comprises a barcode sequence.
95. The method of claim 93 further comprising incorporating barcode sequences into the RNA molecules.
96. The method of claim 94, the incorporating the barcode sequences comprising amplifying the cDNA molecules with primers comprising the barcode sequences.
97. The method of any of claims 93-96, further comprising pooling the cDNA molecules.
98. The method of any of claims 93-97, further comprising incorporating adapter sequences or index sequences into the cDNA molecules to generate prepared cDNA molecules.
99. The method of any of claims 93-98, further comprising purifying for the prepared cDNA molecules.
100. The method of any of claims 94-99, wherein the barcode sequence comprises a Hamming barcode sequence.
101. The method of claim 100, wherein the Hamming barcode sequence comprises information-encoding nucleotides and error-checking nucleotides and is configured to check for nucleotide errors in the Hamming barcode sequence, wherein the Hamming barcode sequence is based on a system of linear equations, values of the system of linear equations comprising: predetermined information values assigned to the information-encoding nucleotides, expected predetermined constant values for the system of linear equations, and error-checking values assigned to the error-checking nucleotides, based on solving the system of linear equations comprising the predetermined information values and the predetermined constant values.
102. The method of claim 101, wherein the information-encoding nucleotides comprise a cluster in the Hamming barcode sequence.
103. The method of claim 101 or 102, wherein the error-checking nucleotides are interspersed across the Hamming barcode sequence.
104. The method of any of claims 101-103, wherein the checking for errors comprises detecting nucleotide errors or correcting nucleotide errors.
105. The method of any of claims 101-104, wherein the checking for errors comprises solving for error-checked predetermined constant values based on the predetermined information values and the error-checking values, and the error-checked predetermined constant values being different from the expected predetermined constant values.
106. The method of any of claims 93-105, wherein the primers comprise an energy-optimized mixamer.
107. The method of claim 106, wherein the energy-optimized mixamer comprises an equilibrium Gibbs standard free energy sampled from a distribution of predicted equilibrium Gibbs standard free energies for hybridizing nucleic acid sequences, the distribution comprising a standard deviation of about 0.4 to 0.6 kcal/mol at a temperature of about 40 °C and a salinity of about 0.2M Na+.
108. The method of claim 106 or 107, wherein the energy-optimized mixamer comprises at least two adjacent weak nucleotides or at least one strong nucleotide.
109. The method of claim 108, wherein the energy-optimized mixamer comprises at least five of: the at least two adjacent weak nucleotides, the at least one strong nucleotide, or a combination thereof.
110. The method of claim 108 or 109, wherein a weak nucleotide of the at least two adjacent weak nucleotides comprises an adenine (A) nucleotide or a thymine (T) nucleotide or a uracil nucleotide.
111. The method of any of claims 108-110, wherein the at least one strong nucleotide comprises a guanine (G) nucleotide or a cytosine (C) nucleotide.
112. The method of any of claims 93-111, wherein the primers comprise a mixture of at least two different primer sequences.
113. The method of claim 112, wherein the mixture comprises at least 8 different primer sequences.
114. The method of claim 112 or 113, wherein the mixture comprises at least 24 different primer sequences.
115. A system for reverse transcribing RNA molecules comprising: the RNA molecules, the RNA molecules having been extracted from a sample; and primers comprising a helper sequence, the helper sequence comprising a hairpin structure, and the hairpin structure configured to mitigate untargeted hybridizing of the primers to an off-target nucleotide sequence.
116. A system for reverse transcribing RNA molecules comprising: cDNA molecules reverse transcribed from RNA molecules, the RNA molecules having been extracted from a sample, and the cDNA molecules having been reverse transcribed with primers that comprise a helper sequence, the helper sequence comprising a hairpin structure, and the hairpin structure configured to mitigate untargeted hybridizing of the primers to an off-target nucleotide sequence.
117. The system of claim 115 or 116, further comprising a reverse transcriptase.
118. The system of any of claims 115-117, further comprising an aqueous buffer.
119. A method of reverse transcribing RNA molecules comprising: extracting the RNA molecules from a sample; reverse transcribing the RNA molecules into cDNA molecules, with primers comprising DBCO functional groups on the 5’ ends of the primers, to generate DBCO-functionalized cDNA molecules; purifying for the DBCO-functionalized cDNA molecules; providing azide-functionalized barcode oligonucleotides to the DBCO-functionalized cDNA molecules; and ligating the azide-functionalized barcode oligonucleotides to the DBCO-functionalized cDNA molecules via a click reaction, thereby generating barcoded cDNA molecules.
120. The method of claim 119, wherein the azide-functionalized barcode oligonucleotides comprise the azide functional groups on single ends of the azide-functionalized barcode oligonucleotides.
121. A method of reverse transcribing RNA molecules comprising: extracting the RNA molecules from a sample; reverse transcribing the RNA molecules into cDNA molecules, with primers comprising azide functional groups on the 5’ ends of the primers, to generate azide-functionalized cDNA molecules; providing DBCO-functionalized barcode oligonucleotides to the azide-functionalized cDNA molecules; and ligating the DBCO-functionalized barcode oligonucleotides to the azide-functionalized cDNA molecules via a click reaction, thereby generating barcoded cDNA molecules.
122. The method of any of claims 119-121, further comprising adding a free azide molecule or an azide-functionalized molecule.
123. The method of any of claims 119-122, further comprising pooling the barcoded cDNA molecules.
124. The method of any of claims 119-123, wherein a barcode sequence of the azide-functionalized barcode oligonucleotides comprises a Hamming barcode sequence.
125. The method of claim 124, wherein the Hamming barcode sequence comprises information-encoding nucleotides and error-checking nucleotides and is configured to check for nucleotide errors in the Hamming barcode sequence, wherein the Hamming barcode sequence is based on a system of linear equations, values of the system of linear equations comprising: predetermined information values assigned to the information-encoding nucleotides, expected predetermined constant values for the system of linear equations, and error-checking values assigned to the error-checking nucleotides, based on solving the system of linear equations comprising the predetermined information values and the predetermined constant values.
126. The method of claim 125, wherein the information-encoding nucleotides comprise a cluster in the Hamming barcode sequence.
127. The method of claim 125 or 126, wherein the error-checking nucleotides are interspersed across the Hamming barcode sequence.
128. The method of any of claims 125-127, wherein the checking for errors comprises detecting nucleotide errors or correcting nucleotide errors.
129. The method of any of claims 125-128, wherein the checking for errors comprises solving for error-checked predetermined constant values based on the predetermined information values and the error-checking values, and the error-checked predetermined constant values being different from the expected predetermined constant values.
130. The method of any of claims 119-129, wherein the primers comprising the DBCO functional groups comprise an energy-optimized mixamer.
131. The method of claim 130, wherein the energy-optimized mixamer comprises an equilibrium Gibbs standard free energy sampled from a distribution of predicted equilibrium Gibbs standard free energies for hybridizing nucleic acid sequences, the distribution comprising a standard deviation of about 0.4 to 0.6 kcal/mol, at a temperature of about 40 °C and a salinity of about 0.2 M Na+.
132. The method of claim 130 or 131, wherein the energy-optimized mixamer comprises at least two adjacent weak nucleotides or at least one strong nucleotide.
133. The method of claim 132, wherein the energy-optimized mixamer comprises at least four of: the at least two adjacent weak nucleotides, the at least one strong nucleotide, or a combination thereof.
134. The method of claim 132 or 133, wherein a weak nucleotide of the at least two adjacent weak nucleotides comprises an adenine (A) nucleotide or a thymine (T) nucleotide or a uracil nucleotide.
135. The method of any of claims 132-134, wherein the at least one strong nucleotide comprises a guanine (G) nucleotide or cytosine (C) nucleotide.
136. The method of any of claims 131-135, wherein standard conditions for the distribution of predicted equilibrium Gibbs standard free energies is determined at a temperature of about 40 °C and a salinity of about 0.2M Na+.
137. A system for reverse transcribing RNA molecules comprising: RNA molecules, the RNA molecules having been extracted from a sample; and primers comprising DBCO functional groups on the 5’ ends of the primers.
138. A system for reverse transcribing RNA molecules comprising: cDNA molecules reverse transcribed from RNA molecules having been extracted from a sample, using primers comprising DBCO functional groups on the 5’ ends of the primers.
139. A system for generating barcoded cDNA molecules comprising: cDNA molecules reverse transcribed from the RNA molecules having been extracted from a sample, using primers comprising DBCO functional groups on the 5’ ends of the primers; and azide-functionalized barcode oligonucleotides.
140. A system for generating barcoded cDNA molecules comprising: barcoded cDNA molecules generated from a click reaction comprising primers that comprise DBCO functional groups on the 5’ ends of the primers and azide-functionalized barcode oligonucleotides.
141. The system of any of claims 137-140, further comprising a reverse transcriptase.
142. The system of any of claims 137-141, further comprising an aqueous buffer.
143. The system of any of claims 137-142, further comprising a free azide molecule or an azide-functionalized molecule.
144. A method of checking nucleotide errors in a Hamming barcode sequence, comprising: creating a system of linear equations, using one or more processors, based on the Hamming barcode sequence, values of the system of linear equations comprising: predetermined information values assigned to information-encoding nucleotides, expected predetermined constant values, and error-checking values assigned to error-checking nucleotides; and solving for the error-checking values, using the one or more processors, from the system of linear equations.
145. The method of claim 144, further comprising: solving for error-checked predetermined constant values, using the one or more processors, based on the predetermined information values and the error-checking values; and determining a difference between the expected predetermined constant values and the error-checked predetermined constant values.
146. The method of claim 144 or 145, wherein the information-encoding nucleotides comprise a cluster in the Hamming barcode sequence.
147. The method of any of claims 144-146, wherein the error-checking nucleotides are interspersed across the Hamming barcode sequence.
148. The method of any of claims 144-147, wherein the checking for errors comprises detecting nucleotide errors or correcting nucleotide errors.
149. A system for checking nucleotide errors in a Hamming barcode sequence, comprising: a) one or more processors; and b) a memory communicatively coupled to the one or more processors and configured to store instructions that, when executed by the one or more processors, cause the system to: create a system of linear equations based on the Hamming barcode sequence, values of the system of linear equations comprising: predetermined information values assigned to information-encoding nucleotides, expected predetermined constant values, and error-checking values assigned to error-checking nucleotides; and solve for the error-checking values, using the one or more processors, from the system of linear equations.
150. The system of claim 149 comprising further instructions that, when executed by the one or more processors, cause the system to: solve for error-checked predetermined constant values, using the one or more processors, based on the predetermined information values and the error-checking values; and determine a difference between the expected predetermined constant values and the error-checked predetermined constant values.
151. A non-transitory computer-readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by one or more processors of a system, cause the system to: create a system of linear equations based on the Hamming barcode sequence, values of the system of linear equations comprising: predetermined information values assigned to information-encoding nucleotides, expected predetermined constant values, and error-checking values assigned to error-checking nucleotides; and solve for the error-checking values, using the one or more processors, from the system of linear equations.
152. The non-transitory computer-readable storage medium of claim 151, comprising further instructions that, when executed by the one or more processors, cause the system to: solve for error-checked predetermined constant values, using the one or more processors, based on the predetermined information values and the error-checking values; and determine a difference between the expected predetermined constant values and the error-checked predetermined constant values.
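By way of non-limiting illustration, the checking steps recited in claims 144-152 (solving for error-checked constant values from an observed barcode and determining their difference from the expected constant values) can be sketched in Python as follows. The sketch assumes a seven-nucleotide barcode with three equations modulo 4 and expected constants of 1. The nucleotide-to-value mapping and the positions summed by each equation (the classic Hamming(7,4) pattern) are assumptions made only for this sketch, and the names `error_checked_constants` and `check` are hypothetical.

```python
# Assumed mapping between nucleotides and values.
NUC_TO_VAL = {"A": 0, "T": 1, "G": 2, "C": 3}
VAL_TO_NUC = {v: n for n, v in NUC_TO_VAL.items()}

MODULUS = 4
EXPECTED = (1, 1, 1)  # expected predetermined constant values

# Assumed equations: 0-based barcode positions summed by each equation.
EQUATIONS = [(0, 2, 4, 6), (1, 2, 5, 6), (3, 4, 5, 6)]


def error_checked_constants(barcode):
    """Evaluate each equation on the observed barcode (mod 4)."""
    v = [NUC_TO_VAL[n] for n in barcode]
    return tuple(sum(v[i] for i in eq) % MODULUS for eq in EQUATIONS)


def check(barcode):
    """Return (status, barcode); status is 'ok', 'corrected' or 'uncorrectable'."""
    observed = error_checked_constants(barcode)
    diff = tuple((o - e) % MODULUS for o, e in zip(observed, EXPECTED))
    if not any(diff):
        return "ok", barcode
    failing = {k for k, d in enumerate(diff) if d}
    shifts = {d for d in diff if d}
    # A single substitution shifts every equation containing that position
    # by the same nonzero amount, so exactly one distinct shift is expected.
    if len(shifts) == 1:
        shift = next(iter(shifts))
        for pos in range(len(barcode)):
            covering = {k for k, eq in enumerate(EQUATIONS) if pos in eq}
            if covering == failing:
                fixed = (NUC_TO_VAL[barcode[pos]] - shift) % MODULUS
                return "corrected", barcode[:pos] + VAL_TO_NUC[fixed] + barcode[pos + 1:]
    return "uncorrectable", barcode
```

Under these assumptions, `check("TTATAAA")` reports `ok`, and `check("TTATGAA")` reports `corrected` and restores `TTATAAA`. A barcode with two substitutions may be flagged as uncorrectable or, when both substitutions shift the equations by the same amount, may be miscorrected, so the sketch should be read as single-error correction only.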
153. A method of generating energy-optimized mixamers comprising: determining mixamer sequences comprising at least two adjacent weak nucleotides or at least one strong nucleotide; determining a distribution of predicted equilibrium Gibbs standard free energies for the mixamer sequences hybridizing to one or more target nucleic acid sequences; and selecting energy-optimized mixamers based on the distribution, when the distribution comprises a standard deviation equilibrium Gibbs standard free energy less than or equal to a standard deviation equilibrium Gibbs standard free energy threshold.
154. The method of claim 153, wherein the distribution further comprises a mean equilibrium Gibbs standard free energy less than or equal to a mean equilibrium Gibbs standard free energy threshold.
155. The method of claim 154, wherein the mean equilibrium Gibbs standard free energy threshold is about -5 kcal/mol.
156. The method of any of claims 153-155, wherein the standard deviation equilibrium Gibbs standard free energy threshold is about 0.6 kcal/mol.
157. The method of any of claims 153-156, wherein an energy-optimized mixamer of the one or more energy-optimized mixamers comprises at least four of: the at least two adjacent weak nucleotides, the at least one strong nucleotide, or a combination thereof.
158. The method of any of claims 153-157, wherein the weak nucleotide of the at least two adjacent weak nucleotides comprises an adenine (A) nucleotide, a thymine (T) nucleotide, or a uracil (U) nucleotide.
159. The method of any of claims 153-158, wherein the at least one strong nucleotide comprises a guanine (G) nucleotide or a cytosine (C) nucleotide.
160. The method of any of claims 157-159, wherein the energy-optimized mixamer comprises at least 4 and at most 8 nucleotides.
161. The method of any of claims 157-160, wherein the energy-optimized mixamer comprises at least 5 and at most 10 nucleotides.
162. The method of any of claims 157-161, wherein the energy-optimized mixamer comprises at least 6 and at most 12 nucleotides.
163. The method of any of claims 157-162, wherein the energy-optimized mixamer comprises at least 7 and at most 14 nucleotides.
164. The method of any of claims 157-163, wherein the energy-optimized mixamer is of any of mixamer forms: SSSSSS, SSSSWW, SSSWWS, SSWWSS, SWWSSS, WWSSSS, SSSWWW, SSWWSWW, SWWSSWW, WWSSSWW, SSWWWWS, SWWSWWS, WWSSWWS, SWWWWSS, WWSWWSS, WWWWSSS, SSWWWWW, SWWSWWWW, SWWWWSWW, SWWWWWWS, WWSSWWWW, WWSWWSWW, WWSWWWWS, WWWWSSWW, WWWWSWWS, WWWWWWSS, SWWWWWWW, WWSWWWWWW, WWWWSWWWW, WWWWWWSWW, WWWWWWWWS, or WWWWWWWWWW, wherein S is a G nucleotide or a C nucleotide and W is an A nucleotide or a T nucleotide.
165. The method of any of claims 153-164, wherein the energy-optimized mixamers comprise a mixture of any of the mixamer forms.
166. The method of any of claims 153-164, wherein the energy-optimized mixamers are configured to hybridize to RNA sequences.
167. The method of claim 166, wherein the RNA sequence is an mRNA sequence.
168. The method of any of claims 153-167, wherein the energy-optimized mixamer comprises at least a portion of a primer sequence.
169. The method of any of claims 153-168, wherein the portion of the primer sequence comprises a helper sequence.
170. The method of claim 169, wherein the portion of the primer sequence comprises a barcode sequence.
171. The method of claim 170, wherein the barcode sequence comprises a Hamming barcode sequence.
172. A system for generating energy-optimized mixamers comprising: a) one or more processors; and b) a memory communicatively coupled to the one or more processors and configured to store instructions that, when executed by the one or more processors, cause the system to: determine mixamer sequences comprising at least two adjacent weak nucleotides or at least one strong nucleotide; determine a distribution of predicted equilibrium Gibbs standard free energies for the mixamer sequences hybridizing to one or more target nucleic acid sequences at a temperature and salinity conducive to reverse transcription; and select energy-optimized mixamers based on the distribution, when the distribution comprises a standard deviation equilibrium Gibbs standard free energy less than or equal to a standard deviation equilibrium Gibbs standard free energy threshold.
173. A non-transitory computer-readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by one or more processors of a system, cause the system to: determine mixamer sequences comprising at least two adjacent weak nucleotides or at least one strong nucleotide; determine a distribution of predicted equilibrium Gibbs standard free energies for the mixamer sequences hybridizing to one or more target nucleic acid sequences; and select energy-optimized mixamers based on the distribution, when the distribution comprises a standard deviation equilibrium Gibbs standard free energy less than or equal to a standard deviation equilibrium Gibbs standard free energy threshold.
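As an illustration of claims 150-152, the following minimal Python sketch builds a small system of linear equations for a barcode, solves it for the error-checking values, and then reports the difference between the expected and the recomputed constants. Everything concrete here is an assumption made for illustration and is not taken from the claims: the nucleotide-to-value mapping (`VALUE`), the use of modulo-4 arithmetic, the choice of four information-encoding and three error-checking nucleotides, the `COVERAGE` sets, and the all-zero expected constants (`EXPECTED`).

```python
# Hypothetical mapping of nucleotides to integer values (an assumption).
VALUE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = {v: k for k, v in VALUE.items()}

# Hypothetical layout: each equation sums a subset of the four
# information-encoding positions plus one error-checking value.
COVERAGE = [(0, 1, 3), (0, 2, 3), (1, 2, 3)]
# Hypothetical expected predetermined constants, one per equation.
EXPECTED = (0, 0, 0)


def solve_checks(info):
    """Solve each equation sum(info subset) + check = constant (mod 4) for check."""
    vals = [VALUE[b] for b in info]
    checks = []
    for cov, const in zip(COVERAGE, EXPECTED):
        subtotal = sum(vals[i] for i in cov)
        checks.append(BASE[(const - subtotal) % 4])
    return "".join(checks)


def constant_differences(info, checks):
    """Recompute each constant from the read barcode and subtract the expected one."""
    vals = [VALUE[b] for b in info]
    diffs = []
    for cov, const, chk in zip(COVERAGE, EXPECTED, checks):
        observed = (sum(vals[i] for i in cov) + VALUE[chk]) % 4
        diffs.append((observed - const) % 4)
    return tuple(diffs)


if __name__ == "__main__":
    checks = solve_checks("ACGT")
    print(checks)                                # ATG
    print(constant_differences("ACGT", checks))  # (0, 0, 0): consistent
    print(constant_differences("AGGT", checks))  # (1, 0, 1): second base altered
```

In this toy layout, a nonzero difference appears only in the equations that cover the altered position, which is what lets a Hamming-style code localize a single substitution. A real design would choose the coverage sets and constants to suit its barcode length.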
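The selection described in claims 153-156 can be sketched as follows: expand a mixamer form into its concrete sequences, predict a free energy for each, and accept the form when the standard deviation (and, per claim 154, the mean) of the distribution falls at or below a threshold. This is a minimal sketch under stated assumptions, not the applicants' implementation. The function names, the flat initiation term `INIT_DG`, and the use of a population standard deviation are hypothetical. The stacking table `NN` holds placeholder values loosely based on published DNA/DNA nearest-neighbor parameters at 37 °C; a real pipeline would use a model for the actual duplex type (for example DNA/RNA) at the reaction temperature and salinity. Only the threshold defaults (about -5 kcal/mol and about 0.6 kcal/mol) come from claims 155 and 156.

```python
from itertools import product
from statistics import mean, pstdev

# S = strong (G or C), W = weak (A or T), as defined in claim 164.
ALPHABET = {"S": "GC", "W": "AT"}

# Placeholder stacking energies in kcal/mol, keyed by 5'->3' dinucleotide.
# Illustrative only; not the energy model used in the application.
NN = {
    "AA": -1.00, "TT": -1.00, "AT": -0.88, "TA": -0.58,
    "CA": -1.45, "TG": -1.45, "GT": -1.44, "AC": -1.44,
    "CT": -1.28, "AG": -1.28, "GA": -1.30, "TC": -1.30,
    "CG": -2.17, "GC": -2.24, "GG": -1.84, "CC": -1.84,
}
INIT_DG = 2.0  # hypothetical flat duplex-initiation penalty, kcal/mol


def expand_form(form):
    """List every concrete sequence matching a form such as 'SSWWSS'."""
    return ["".join(p) for p in product(*(ALPHABET[c] for c in form.upper()))]


def predicted_dg(seq):
    """Sum the stacking terms along the sequence and add the initiation term."""
    return INIT_DG + sum(NN[seq[i:i + 2]] for i in range(len(seq) - 1))


def summarize_form(form):
    """Return (mean, standard deviation) of predicted energies for a form."""
    energies = [predicted_dg(s) for s in expand_form(form)]
    return mean(energies), pstdev(energies)


def is_energy_optimized(mean_dg, sd_dg, mean_threshold=-5.0, sd_threshold=0.6):
    """Apply the standard-deviation test of claim 153 and the mean test of claim 154."""
    return sd_dg <= sd_threshold and mean_dg <= mean_threshold


if __name__ == "__main__":
    for form in ("SSSSSS", "SSWWSWW", "WWWWWWWWWW"):
        m, sd = summarize_form(form)
        print(form, round(m, 2), round(sd, 2), is_energy_optimized(m, sd))
```

Because the placeholder table is not the application's energy model, which forms pass here should not be read as reproducing the list in claim 164. The sketch only shows the shape of the computation.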
PCT/US2025/021612 2024-03-26 2025-03-26 Oligonucleotides, sequences, methods, and systems thereof for analyzing nucleic acid molecules Pending WO2025207810A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202463570082P 2024-03-26 2024-03-26
US63/570,082 2024-03-26

Publications (1)

Publication Number Publication Date
WO2025207810A1 (en) 2025-10-02

Family

ID=97218820

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2025/021612 Pending WO2025207810A1 (en) 2024-03-26 2025-03-26 Oligonucleotides, sequences, methods, and systems thereof for analyzing nucleic acid molecules

Country Status (1)

Country Link
WO (1) WO2025207810A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180051277A1 (en) * 2015-02-24 2018-02-22 Trustees Of Boston University Protection of barcodes during dna amplification using molecular hairpins
US20210054369A1 (en) * 2019-08-20 2021-02-25 Fluent Biosciences Inc. Hairpin primer design for sequential pcr production of targeted sequencing libraries

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application
Ref document number: 25779187; Country of ref document: EP; Kind code of ref document: A1