WO2018085779A1 - Methods for assessing genetic variant screen performance - Google Patents

Info

Publication number
WO2018085779A1
Authority
WO
WIPO (PCT)
Prior art keywords
real
sequencing reads
synthetic
interest
region
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2017/060222
Other languages
French (fr)
Inventor
Genevieve M. GOULD
Xin Wang
Peter V. GRAUMAN
Gregory John HOGAN
Alexander De Jong ROBERTSON
Jared Robert MAGUIRE
Hyunseok KANG
Imran Saeedul HAQUE
Eric Andrew EVANS
Kevin R. HAAS
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Counsyl Inc
Original Assignee
Counsyl Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Counsyl Inc
Publication of WO2018085779A1
Anticipated expiration
Ceased

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 - ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 - ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 - ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/10 - Ploidy or copy number detection
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 - ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 - Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B35/00 - ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B35/00 - ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
    • G16B35/20 - Screening of libraries
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 - ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30 - Unsupervised data analysis
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C - COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 - Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/60 - In silico combinatorial chemistry

Definitions

  • the present invention relates to methods of assessing the performance of a genetic variant screen
  • the performance of a genetic variant screen is assessed for concordance with known reference samples. Assessment of the screen can be performed using a large number of positive controls with known genetic variants, and a summary statistic (such as sensitivity or specificity) for the screen can be determined. However, when a large number of positive controls is unavailable, such as for controls with rare genetic variant events, the performance of the genetic variant calling algorithm (i.e., a "caller") or assay cannot be accurately assessed. While large numbers of positive controls having single nucleotide variants (SNVs) are commonly available, positive control samples having insertion, deletion, inversion, or copy number variants are less frequent.
  • Described herein is a method of assessing the performance of a genetic variant screen, comprising generating a plurality of synthetic copy number variants comprising a synthetic number of copies of a region of interest represented by a synthetic number of sequencing reads from one or more segments within the region of interest; calling a number of copies of the region of interest using a genetic variant caller for each synthetic copy number variant based on the synthetic number of sequencing reads from the one or more segments and a number of real sequencing reads from the one or more segments from one or more real reference samples; and determining a summary statistic for the genetic variant screen based on the differences between the called number of copies and the synthetic number of copies in the synthetic copy number variants, thereby assessing the performance of the genetic variant screen.
  • the summary statistic is sensitivity, specificity, precision, recall, accuracy, positive predictive value, or negative predictive value.
  • the synthetic number of sequencing reads from each of the one or more segments is generated by increasing or decreasing a number of real sequencing reads from the one or more segments from a real test sample in proportion to a number (which may be an integer number or non-integer number) of copies of the region of interest.
  • the predetermined number of copies is an integer number of copies. In some embodiments, the predetermined number of copies is a non-integer number of copies.
  • the synthetic number of sequencing reads from the one or more segments within the region of interest is generated by sampling a binomial distribution with a success probability equal to m/x and a number of trials equal to a real number of sequencing reads from a real test sample, wherein m is the synthetic number of copies of the region of interest, and x is an assumed number of copies of the region of interest from the real test sample.
  • the synthetic number of sequencing reads from the one or more segments within the region of interest is generated by sampling a number of sequencing reads as a negative binomial distribution with a success probability equal to m/x and a number of successes equal to a real number of sequencing reads from a real test sample, wherein m is the synthetic number of copies of the region of interest, and x is an assumed number of copies of the region of interest from the real test sample, and adding the sampled number of sequencing reads to the real number of sequencing reads from the test sample.
  • the synthetic number of sequencing reads from the one or more segments within the region of interest is generated by sampling a number of sequencing reads as an expectation of the negative binomial distribution.
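  • The sampling schemes above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, and only the Python standard library is used. The binomial branch keeps each real read with probability m/x (a deletion, m < x). The negative binomial branch takes the per-trial success probability as x/m, an assumption made here so that the expected total equals the m/x scaling, since a probability cannot exceed 1 when m > x.

```python
import math
import random


def binomial_reads(real_reads: int, m: float, x: float, rng: random.Random) -> int:
    """Deletion (m <= x): keep each real read independently with probability m/x."""
    if not 0 <= m <= x:
        raise ValueError("binomial thinning needs 0 <= m <= x")
    p = m / x
    return sum(1 for _ in range(real_reads) if rng.random() < p)


def negative_binomial_reads(real_reads: int, m: float, x: float, rng: random.Random) -> int:
    """Duplication (m > x): add a negative-binomial number of extra reads.

    ASSUMPTION: success probability p = x/m with real_reads successes, so the
    expected number of added reads is real_reads * (m - x) / x.
    """
    if m <= x:
        raise ValueError("this sketch only handles m > x")
    p = x / m
    log_q = math.log(1.0 - p)
    extra = 0
    for _ in range(real_reads):
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
        extra += int(math.log(u) / log_q)  # geometric: failures before one success
    return real_reads + extra


def expected_reads(real_reads: int, m: float, x: float) -> float:
    """Expectation of either scheme: the real count scaled by m/x."""
    return real_reads * m / x


if __name__ == "__main__":
    rng = random.Random(0)
    print(binomial_reads(1000, 1, 2, rng))           # roughly 500
    print(negative_binomial_reads(1000, 3, 2, rng))  # roughly 1500
    print(expected_reads(1000, 3, 2))
```

  • Sampling keeps realistic read-count noise in the synthetic variant, whereas using the expectation gives a deterministic count.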
  • the number of real sequencing reads from each of the one or more segments from the real test sample is normalized by dividing the number of real sequencing reads from each segment from the real test sample by the average number of real sequencing reads from a corresponding segment from the one or more real reference samples. In some embodiments, the number of real sequencing reads from each of the one or more segments from the real test sample is normalized by dividing the number of real sequencing reads from each segment from the real test sample by the average number of real sequencing reads from the one or more segments within the region of interest from the real test sample.
  • the average number of real sequencing reads from a corresponding segment from the one or more real reference samples is a median number of real sequencing reads from a corresponding segment from the one or more real reference samples. In some embodiments, the average number of real sequencing reads from the one or more segments within the region of interest from the real test sample is the median number of real sequencing reads from the one or more segments within the region of interest from the real test sample.
  • the number of real sequencing reads is normalized for GC content bias or mappability.
  • the region of interest is more than about 100 bases in length. In some embodiments, the region of interest is at least one exon. In some embodiments, the region of interest is at least one gene. In some embodiments, the region of interest is at least one chromosome.
  • the method further comprises generating the real sequencing reads from the real test sample. In some embodiments, the method further comprises generating the real sequencing reads from the one or more real reference samples.
  • a method of assessing the performance of a genetic variant screen comprising calling, using a genetic variant caller, for an inverse of a synthetic variant in each test sequence in a plurality of test sequences, each test sequence comprising a plurality of sequencing reads aligned with a reference sequence comprising a synthetic variant; and determining a summary statistic for the genetic variant screen based on a called inverse of the synthetic variant in the plurality of test sequences and the synthetic variant in the reference sequence, thereby assessing the performance of the genetic variant screen.
  • the summary statistic is sensitivity, specificity, precision, recall, accuracy, positive predictive value, or negative predictive value.
  • the method further comprises aligning the plurality of sequencing reads with the reference sequence comprising the synthetic variant.
  • the synthetic variant comprises an insertion, a deletion, an inversion, a translocation, a SNV, or a combination thereof.
  • the synthetic variant comprises an insertion or a deletion.
  • the synthetic variant comprises an insertion and the inverse of the synthetic variant is a deletion. In some embodiments, the synthetic variant comprises a deletion and the inverse of the synthetic variant is an insertion.
  • the synthetic variant is an inversion and the inverse of the synthetic variant is an inversion.
  • the synthetic variant is between 1 base in length and about 1 chromosome in length. In some embodiments, the synthetic variant is between about 20 bases in length and about 1000 bases in length. In some embodiments, the synthetic variant is the length of a full gene.
  • the method further comprises generating the sequencing reads from the one or more test samples.
  • the method further comprises reporting the summary statistic.
  • the method further comprises displaying the summary statistic on a monitor.
  • the method is implemented by a program executed on a computer system.
  • the method further comprises storing the summary statistic in a database.
  • Also described herein is a computer readable storage medium comprising instructions for carrying out any one of the methods described above.
  • a system comprising a processor, and a memory, wherein the memory comprises computer readable instructions operable to cause the processor to carry out any one of the methods described above.
  • FIG. 1A presents the frequency (percent occurrence) of a copy number deletion or duplication event for several genes among approximately 56,000 samples.
  • FIG. 1B shows the fraction of nucleotides exhibiting an insertion or deletion event occurring in at least one of the same approximately 56,000 samples.
  • FIG. 2 is a schematic showing binomial sampling of a real number of sequencing reads from a real test sample to obtain a synthetic number of sequencing reads for a synthetic copy number deletion variant.
  • FIG. 3 depicts an exemplary computing system configured to perform any one of the processes described herein, including the various exemplary methods of assessing the performance of a genetic variant screen.
  • FIG. 4 presents a theoretical copy number of the X chromosome plotted against the sequencing depth (i.e., absolute number of sequencing reads prior to normalization) for each segment within the X chromosome of real female samples, real male samples, and synthetic copy number variants.
  • the theoretical copy number is based on double-normalized sequencing reads from the male and female samples, as well as the double-normalized sequencing reads that have been decreased to reflect a single copy of the X chromosome in the simulated male samples.
  • FIG. 5 presents the sensitivity of a genetic variant screen for deletion copy number variants and duplication copy number variants before and after the modification of a genetic variant screen.
  • the sensitivities were determined by constructing synthetic copy number variants having a synthetic copy number of a single exon from various genes, and using a genetic variant caller to call the number of copies of the exon in the synthetic copy number variants. Error bars show the confidence interval of the sensitivity given the number of simulated deletions and the observed false negatives and true positives.
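  • As an illustration of how a sensitivity estimate and its confidence interval could be derived from simulated true positives and false negatives, the sketch below uses a Wilson score interval. The source does not state which interval was used for the error bars, so the choice of the Wilson interval, the 95% level (z = 1.96), and the function name are all assumptions.

```python
import math


def wilson_interval(true_positives: int, false_negatives: int, z: float = 1.96):
    """Return (sensitivity, lower, upper) using a Wilson score interval.

    ASSUMPTION: the interval type and confidence level are illustrative only.
    """
    n = true_positives + false_negatives
    if n == 0:
        raise ValueError("no simulated variants to evaluate")
    p = true_positives / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    sensitivity, lower, upper = wilson_interval(true_positives=95, false_negatives=5)
    print(f"sensitivity={sensitivity:.3f}, interval=({lower:.3f}, {upper:.3f})")
```

  • Because synthetic variants are cheap to generate, the number of simulated deletions can be increased until the interval is as narrow as needed.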
  • FIG. 6 presents the sensitivity of a genetic variant screen for deletion copy number variants and duplication copy number variants before and after the modification of a genetic variant screen.
  • the sensitivities were determined by constructing synthetic copy number variants having a synthetic copy number for one exon, two exons, four exons, or whole genes from 36 different genes.
  • a genetic variant caller was used to call the number of copies of exons in the synthetic copy number variants.
  • Described herein is a method of assessing the performance of a genetic variant screen, comprising generating a plurality of synthetic copy number variants comprising a synthetic number of copies of a region of interest represented by a synthetic number of sequencing reads from one or more segments within the region of interest; calling a number of copies of the region of interest using a genetic variant caller for each synthetic copy number variant based on the synthetic number of sequencing reads from the one or more segments and a number of real sequencing reads from the one or more segments from one or more real reference samples; and determining a summary statistic (i.e., a summary statistic of performance) for the genetic variant screen based on the differences between the called number of copies and the synthetic number of copies in the synthetic copy number variants, thereby assessing the performance of the genetic variant screen.
  • a summary statistic i.e., a summary statistic of performance
  • the synthetic number of sequencing reads for each of the one or more segments can be generated by increasing or decreasing a number of real sequencing reads from the one or more segments from a real test sample in proportion to a number (which may be an integer number or a non-integer number) of copies of the region of interest.
  • a synthetic copy number variant having three copies of a region of interest can be generated by increasing the number of real sequencing reads for one or more segments within the region of interest taken from a real test sample having two copies of the region of interest by a number of sequencing reads representing one copy of the region of interest (for example, by adding sequencing reads representing one copy of the region of interest to the real number of sequencing reads or by multiplying the number of real sequencing reads by 1.5).
  • a synthetic copy number variant having one copy of a region of interest can be generated by decreasing the number of real sequencing reads for one or more segments within the region of interest taken from a real test sample having two copies of the region of interest by a number of sequencing reads representing one copy of the region of interest (for example, by removing sequencing reads representing one copy of the region of interest from the real number of sequencing reads or by multiplying the number of real sequencing reads by 0.5).
  • the number of copies may be an integer number or a non-integer number of copies.
  • a non-integer number of copies can be used, for example, to analyze mosaic copy number variants or sub-population somatic copy number variants.
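  • The proportional adjustment described above (multiplying by 1.5 for three copies or 0.5 for one copy, starting from two) can be sketched as below. The helper names are hypothetical, and the way a mosaic sample is converted to a non-integer copy number (a weighted average over the fraction of cells carrying the variant) is an assumption for illustration.

```python
def scale_reads(real_reads, synthetic_copies, assumed_copies=2.0):
    """Scale per-segment read counts by synthetic_copies / assumed_copies."""
    if assumed_copies <= 0:
        raise ValueError("assumed_copies must be positive")
    factor = synthetic_copies / assumed_copies
    return [count * factor for count in real_reads]


def mosaic_copies(fraction_variant_cells, variant_copies, normal_copies=2.0):
    """ASSUMPTION: effective copy number is the cell-fraction-weighted average."""
    return (fraction_variant_cells * variant_copies
            + (1.0 - fraction_variant_cells) * normal_copies)


if __name__ == "__main__":
    segment_counts = [100, 200]
    print(scale_reads(segment_counts, 3))  # duplication: counts times 1.5
    print(scale_reads(segment_counts, 1))  # deletion: counts times 0.5
    # A deletion present in 30% of cells gives a non-integer copy number.
    print(scale_reads(segment_counts, mosaic_copies(0.3, 1)))
```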
  • the sequencing reads are normalized against the one or more real reference samples (for example, the number of real sequencing reads from each of the one or more segments from the real test sample can be normalized by dividing the number of real sequencing reads from each segment from the real test sample by the average (such as mean or median) number of real sequencing reads from a corresponding segment from the one or more real reference samples).
  • the number of real sequencing reads from each of the one or more segments from the real test sample is normalized against the other segments within the region of interest (for example, by dividing the number of real sequencing reads from each segment from the real test sample by the average (such as mean or median) number of real sequencing reads from the one or more segments within the region of interest from the real test sample).
  • the number of real sequencing reads is normalized using both methods, in either order.
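  • A minimal sketch of the two normalizations and their combination is shown below. It assumes per-segment read counts are held in plain lists, uses the median as the average, and applies the reference normalization first; the function names and data layout are hypothetical.

```python
from statistics import median


def normalize_by_reference(test_counts, reference_counts):
    """Divide each segment count by the median count of that segment across references.

    test_counts: list of counts, one per segment.
    reference_counts: list of per-sample lists with the same segment order.
    """
    normalized = []
    for i, count in enumerate(test_counts):
        ref_median = median(sample[i] for sample in reference_counts)
        normalized.append(count / ref_median)
    return normalized


def normalize_within_sample(counts):
    """Divide each segment count by the median count over the sample's segments."""
    center = median(counts)
    return [c / center for c in counts]


def double_normalize(test_counts, reference_counts):
    """Apply both normalizations (reference first, then within-sample)."""
    return normalize_within_sample(normalize_by_reference(test_counts, reference_counts))


if __name__ == "__main__":
    test = [100, 200, 300]
    references = [[100, 100, 100], [200, 200, 200], [300, 300, 300]]
    print(double_normalize(test, references))
```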
  • Also described herein is a method of assessing the performance of a genetic variant screen, comprising calling, using a genetic variant caller, for an inverse of a synthetic variant in each test sequence in a plurality of test sequences, each test sequence comprising a plurality of sequencing reads aligned with a reference sequence comprising a synthetic variant; and determining a summary statistic (i.e., a summary statistic of performance) for the genetic variant screen based on a called inverse of the synthetic variant in the plurality of test sequences and the synthetic variant in the reference sequence, thereby assessing the performance of the genetic variant screen.
  • the method further comprises aligning the plurality of sequencing reads with the reference sequence comprising the synthetic variant.
  • the variant can be, for example, an insertion, a deletion, an inversion, a translocation, a SNV, or a combination thereof.
  • the variant is an inversion with a deletion, an insertion, or a deletion and an insertion.
  • the genetic variant caller calls for the inverse variant.
  • the reference sequence comprises an insertion and the genetic variant caller calls for a deletion.
  • the reference sequence comprises a deletion and the genetic variant caller calls for an insertion. Since the inverse of an inversion is itself an inversion, in some embodiments, the reference sequence comprises an inversion and the genetic variant caller calls for an inversion (although the called inversion is in the opposite direction as the inversion in the reference sequence).
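  • The inverse relationship can be illustrated on plain strings. In this sketch a synthetic variant is introduced into a toy reference, and the function returns the call that a caller would be expected to make for unmodified reads aligned to that altered reference. The function names and the tuple layout of the expected call are hypothetical; no aligner or caller is involved.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def add_synthetic_deletion(reference: str, start: int, length: int):
    """Delete bases from the reference; normal reads then look like an insertion."""
    removed = reference[start:start + length]
    modified = reference[:start] + reference[start + length:]
    return modified, ("insertion", start, removed)


def add_synthetic_insertion(reference: str, start: int, inserted: str):
    """Insert bases into the reference; normal reads then look like a deletion."""
    modified = reference[:start] + inserted + reference[start:]
    return modified, ("deletion", start, len(inserted))


def add_synthetic_inversion(reference: str, start: int, length: int):
    """Invert a segment; normal reads then look like the opposite inversion."""
    segment = reference[start:start + length]
    modified = reference[:start] + reverse_complement(segment) + reference[start + length:]
    return modified, ("inversion", start, length)


if __name__ == "__main__":
    ref = "ACGTACGTAC"
    modified, expected_call = add_synthetic_deletion(ref, 2, 3)
    print(modified, expected_call)
```

  • Comparing the caller's output against the expected call for each test sequence gives the true positives and false negatives that feed the summary statistic.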
  • the summary statistic of performance (also referred to herein as a "summary statistic") can be, but is not limited to, sensitivity, specificity, positive predictive value, negative predictive value, precision, recall, accuracy, or any other metric of concordance.
  • the summary statistic of performance is useful for assessing the performance of the genetic variant screen.
  • the summary statistic can be above a predetermined threshold, which indicates that the genetic variant screen performs as desired. If the summary statistic is below the predetermined threshold, for example, the genetic variant screen does not perform as desired.
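  • The listed summary statistics follow from counts of true and false positives and negatives, and the threshold check is a single comparison. The sketch below uses the standard definitions; the function names are hypothetical, and treating a value exactly equal to the threshold as not passing is an assumption.

```python
def summary_statistics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard concordance metrics from confusion-matrix counts."""
    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),                # also called recall
        "specificity": ratio(tn, tn + fp),
        "positive_predictive_value": ratio(tp, tp + fp),  # also called precision
        "negative_predictive_value": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


def performs_as_desired(value: float, threshold: float) -> bool:
    """True when the statistic is above the predetermined threshold."""
    return value > threshold


if __name__ == "__main__":
    stats = summary_statistics(tp=90, fp=5, tn=95, fn=10)
    for name, value in stats.items():
        print(f"{name}: {value:.3f}")
    print(performs_as_desired(stats["sensitivity"], 0.85))
```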
  • the summary statistic is reported (for example, to a patient, doctor, caregiver, or regulator).
  • the summary statistic is displayed, for example on a monitor, which can be part of a computer system.
  • the methods described herein can be implemented by a program executed on a computer system.
  • a computer-implemented method of assessing the performance of a genetic variant screen comprising, at an electronic device having one or more processors and memory, generating, by the one or more processors, a plurality of synthetic copy number variants comprising a synthetic number of copies of a region of interest represented by a synthetic number of sequencing reads from one or more segments within the region of interest; calling, by the one or more processors, a number of copies of the region of interest using a genetic variant caller for each synthetic copy number variant based on the synthetic number of sequencing reads from the one or more segments and a number of real sequencing reads from the one or more segments from one or more real reference samples; and determining, by the one or more processors, a summary statistic for the genetic variant screen based on the differences between the called number of copies and the synthetic number of copies in the synthetic copy number variants, thereby assessing the performance of the genetic variant screen.
  • a computer-implemented method of assessing the performance of a genetic variant screen comprising, at an electronic device having one or more processors and memory, aligning, by one or more processors, sequencing reads from a plurality of test samples with a reference sequence comprising a variant; calling, by the one or more processors, for an inverse of the variant in the test sample using a genetic variant caller; and determining, by the one or more processors, a summary statistic for the genetic variant screen based on a called inverse of the variant in the plurality of test samples and the variant in the reference sequence, thereby assessing the performance of the genetic variant screen.
  • FIG. 1A presents the frequency (percent occurrence) of a copy number deletion or duplication event for several genes among approximately 56,000 samples. As can be seen in FIG. 1A, the frequency of a deletion or duplication event for a given region of interest is low. Furthermore, approximately 25% of the region examined contained no duplications and approximately 45% of the region examined contained no deletions in any of the same approximately 56,000 samples.
  • FIG. 1B shows the fraction of nucleotides with at least one CNV observed in at least one of the same approximately 56,000 samples.
  • the methods described herein allow for assessing the performance of a genetic variant screen without the need for positive controls with known genetic variants from real samples.
  • the performance of the genetic variant screen is assessed using synthetic copy number variants generated from non-variant reference samples.
  • the performance of the genetic variant screen is assessed using a reference sequence comprising a synthetic variant.
  • the synthetic variants do not rely on the existence of genetic variants from real samples, but can be simulated from real samples. A large number of synthetic variants can be readily generated to adequately assess the performance of the genetic variant screen.
  • Reference to "about" a value or parameter herein includes (and describes) variations that are directed to that value or parameter per se. For example, description referring to "about X" includes description of "X".
  • average refers to either a mean or a median, or any value used to approximate the mean or the median.
  • CNV: copy number variant
  • deletion refers to any decrease in the number of copies of a region of interest relative to one or more real reference samples. For example, if the one or more real reference samples have two copies of a region of interest, a deletion can refer to a single copy of the region of interest. If the one or more real reference samples have four copies of a region of interest, a deletion can refer to one, two, or three copies of the region of interest.
  • duplication refers to any increase in the number of copies of a region of interest relative to one or more real reference samples, including three or more, four or more, five or more, etc. copies of the region of interest.
  • a "genetic variant caller" is any method or technique (including software) that can be used to identify one or more genetic features. Genetic features that can be identified by a genetic variant caller include, but are not limited to, the copy number of a region of interest, an insertion, a deletion, a translocation, an inversion, or a single nucleotide variant (SNV).
  • a "number of sequencing reads" as used herein refers to an absolute number of sequencing reads or a normalized number of sequencing reads.
  • a "real sample" refers to a nucleic acid sequence or sequencing reads originating from a nucleic acid sequence that originates from a physical sample subjected to genetic sequencing without the sequence, sequencing reads, or number of sequencing reads being altered.
  • a “real reference sample” refers to a real sample that is compared to a synthetic sample (e.g., a synthetic copy number variant) by the genetic variant caller.
  • a “real test sample” refers to a real sample that is used to generate the synthetic sample.
  • a “real sequencing read” refers to a sequencing read that originates from a real sample without alteration of the sequence.
  • a “number of real sequencing reads” refers to an absolute number of real sequencing reads or a normalized number of sequencing reads, but does not refer to a number of sequencing reads that has been altered to reflect an increase in a number of copies of any segment or region of interest.
  • a “segment” refers to a sub-region in a region of interest that serves as a locus of origin for sequencing reads.
  • the segment can be as short as a single base or can be as long as the region of interest. Multiple segments within a region of interest may be, but need not be, continuous, contiguous, or overlapping.
  • synthetic copy number variant refers to an artificial nucleic acid sequence generated using real sequencing reads from a real sample with an increase or decrease in the number of copies of a region of interest compared to the real sample.
  • the synthetic copy number variant need not be (although, in some embodiments, could be) an aligned or assembled nucleic acid sequence, and can be represented by a synthetic number of sequencing reads.
  • a "synthetic number of copies" refers to the number of copies of a region of interest in the synthetic copy number variant, and can be an increase or decrease in the number of copies relative to the real sample.
  • a "synthetic number of sequencing reads" refers to a number of real sequencing reads that has been altered to reflect an increase or a decrease in the number of copies of a segment within a region of interest.
  • the real sequencing reads originate from the same segment (i.e., originate from a corresponding segment) within the region of interest as the sequencing reads in the synthetic number of sequencing reads.
  • a "synthetic variant” in a reference genome refers to a variant artificially introduced into a nucleic acid sequence in the reference genome, unless context clearly indicates otherwise.
  • the "inverse" of a synthetic variant refers to the opposite consequence of the synthetic variant that would appear in a nucleic acid sequence when compared to the reference sequence comprising the synthetic variant.
  • the methods described herein are used to assess the performance of a copy number variant screen.
  • Synthetic copy number variants are generated, for example in silico.
  • the synthetic copy number variant includes a synthetic number of copies of a region of interest, which is represented by a synthetic number of sequencing reads from one or more segments within the region of interest.
  • the synthetic number of sequencing reads is obtained by adjusting a number of sequencing reads of the one or more segments within the region of interest from a real test sample. The adjustment is made in proportion to the synthetic number of copies.
  • the synthetic number of sequencing reads is obtained by direct manipulation of a database comprising sequencing reads of the one or more segments within the region of interest from a real sample, for example by random deletion or duplication of sequencing reads within the database.
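  • Random deletion or duplication of read records could look like the following sketch, which treats the database as an in-memory list of read identifiers. The function names, the list representation, and the choice to hit the target count exactly (rather than sample it) are assumptions.

```python
import random


def delete_reads(reads, m, x, rng):
    """Randomly keep a fraction m/x of the reads (deletion, m < x)."""
    keep = round(len(reads) * m / x)
    return rng.sample(reads, keep)


def duplicate_reads(reads, m, x, rng):
    """Add randomly chosen copies of existing reads (duplication, m > x)."""
    extra = round(len(reads) * (m - x) / x)
    return list(reads) + rng.choices(reads, k=extra)


if __name__ == "__main__":
    rng = random.Random(42)
    reads = [f"read{i}" for i in range(100)]
    print(len(delete_reads(reads, 1, 2, rng)))     # 50 reads remain
    print(len(duplicate_reads(reads, 3, 2, rng)))  # 150 reads in total
```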
  • the synthetic number of sequencing reads is generated by sampling a distribution (such as a binomial distribution or a negative binomial distribution).
  • a plurality of synthetic copy number variants can be generated, for example based on a plurality of real test samples.
  • a number of copies of the region of interest present in the synthetic copy number variant is called using the copy number variant caller.
  • the caller compares the synthetic number of sequencing reads from the one or more segments in the synthetic copy number variant to the number of sequencing reads from the one or more segments in a real reference sample with a known number of copies of the region of interest.
  • the caller can use, for example, a hidden-Markov model (HMM) to determine the number of copies of the region of interest in the synthetic copy number variant.
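  • To make the HMM idea concrete, the sketch below decodes a copy number per segment with the Viterbi algorithm. The source says only that an HMM can be used; everything else here is an assumption: the input is a normalized depth ratio per segment (about 1.0 at the baseline copy number), emissions are Gaussian around copies/baseline, the start distribution is uniform, and the standard deviation and transition probabilities are arbitrary illustrative values.

```python
import math


def call_copy_numbers(depth_ratios, states=(0, 1, 2, 3, 4), baseline_copies=2,
                      sd=0.15, stay_prob=0.9):
    """Viterbi decoding of the most likely copy-number state per segment."""
    if not depth_ratios:
        return []
    n_states = len(states)
    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (n_states - 1))

    def log_emission(ratio, copies):
        mean = copies / baseline_copies
        return -0.5 * ((ratio - mean) / sd) ** 2  # Gaussian log-density up to a constant

    def log_transition(i, j):
        return log_stay if i == j else log_switch

    # Initialise with a uniform start distribution (constant term omitted).
    scores = [log_emission(depth_ratios[0], s) for s in states]
    backpointers = []
    for ratio in depth_ratios[1:]:
        new_scores, pointers = [], []
        for j, state in enumerate(states):
            best_i = max(range(n_states), key=lambda i: scores[i] + log_transition(i, j))
            new_scores.append(scores[best_i] + log_transition(best_i, j)
                              + log_emission(ratio, state))
            pointers.append(best_i)
        scores = new_scores
        backpointers.append(pointers)

    # Trace back from the best final state.
    idx = max(range(n_states), key=lambda i: scores[i])
    path = [idx]
    for pointers in reversed(backpointers):
        idx = pointers[idx]
        path.append(idx)
    path.reverse()
    return [states[i] for i in path]


if __name__ == "__main__":
    print(call_copy_numbers([1.0, 1.0, 0.5, 0.5, 0.5, 1.0]))
```

  • The transition penalty discourages isolated single-segment calls, so a run of adjacent low-depth segments is needed before a deletion is reported.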
  • the real reference sample is preferably a different real sample than the real sample used as a basis for generating the synthetic copy number variants.
  • a summary statistic for the copy number variant screen can be determined to assess the performance of the copy number variant screen based on the differences between the called number of copies and the synthetic number of copies in the synthetic copy number variants. Since a plurality of synthetic copy number variants are generated and called by the caller, the summary statistic reflects the performance of the screen in the context of the synthetic variants. Thus, a greater diversity of synthetic copy number variants (which can be based on a plurality of real samples) provides a more accurate summary statistic.
  • assessing the performance of a genetic variant screen comprises generating a plurality of synthetic copy number variants comprising a synthetic number of copies of a region of interest represented by a synthetic number of sequencing reads from one or more segments within the region of interest; calling a number of copies of the region of interest using a genetic variant caller for each synthetic copy number variant based on the synthetic number of sequencing reads from the one or more segments and a number of real sequencing reads from the one or more segments from one or more real reference samples; and determining a summary statistic for the genetic variant screen based on the differences between the called number of copies and the synthetic number of copies in the synthetic copy number variants, thereby assessing the performance of the genetic variant screen.
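  • The generate, call, and compare loop can be sketched end to end with a deliberately simple stand-in caller. The caller here (rounding twice the median test-to-reference depth ratio) is a toy assumption used only to show the data flow; it is not the caller described in the source, and all names are hypothetical.

```python
from statistics import median


def make_synthetic(test_counts, synthetic_copies, assumed_copies=2):
    """Scale real per-segment counts to represent the synthetic copy number."""
    return [c * synthetic_copies / assumed_copies for c in test_counts]


def toy_caller(sample_counts, reference_counts, reference_copies=2):
    """TOY caller: round the median depth ratio times the reference copy number."""
    ratios = [s / r for s, r in zip(sample_counts, reference_counts) if r > 0]
    return round(reference_copies * median(ratios))


def concordance(test_samples, reference_counts, synthetic_copy_numbers):
    """Fraction of synthetic variants whose called copy number matches the truth."""
    results = []
    for counts in test_samples:
        for m in synthetic_copy_numbers:
            called = toy_caller(make_synthetic(counts, m), reference_counts)
            results.append(called == m)
    return sum(results) / len(results)


if __name__ == "__main__":
    test_samples = [[100, 110, 90], [95, 105, 100]]
    reference = [100, 100, 100]
    print(concordance(test_samples, reference, synthetic_copy_numbers=(1, 3)))
```

  • Drawing synthetic variants from many real test samples, regions, and copy numbers broadens the conditions under which the statistic is measured.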
  • the summary statistic is sensitivity, specificity, precision, recall, accuracy, positive predictive value, negative predictive value, or any other metric of concordance.
  • the method of assessing the performance of a genetic variant screen comprises generating a plurality of synthetic copy number variants comprising a synthetic number of copies of a region of interest represented by a synthetic number of sequencing reads from one or more segments within the region of interest, wherein the synthetic number of sequencing reads from each of the one or more segments is generated by increasing or decreasing a number of real sequencing reads from the one or more segments from a real test sample in proportion to a number (which may be an integer number or non-integer number) of copies of the region of interest; calling a number of copies of the region of interest using a genetic variant caller for each synthetic copy number variant based on the synthetic number of sequencing reads from the one or more segments and a number of real sequencing reads from the one or more segments from one or more real reference samples; and determining a summary statistic for the genetic variant screen based on the differences between the called number of copies and the synthetic number of copies in the synthetic copy number variants, thereby
  • the method of assessing the performance of a genetic variant screen comprises generating a plurality of synthetic copy number variants comprising a synthetic number of copies of a region of interest represented by a synthetic number of sequencing reads from one or more segments within the region of interest, wherein the synthetic number of sequencing reads from each of the one or more segments is generated by (i) normalizing a number of real sequencing reads from the one or more segments from a real test sample by dividing the number of real sequencing reads from each segment from the real test sample by the average number of real sequencing reads from a corresponding segment from the one or more real reference samples; and (ii) increasing or decreasing the number of normalized real sequencing reads from the one or more segments from the real test sample in proportion to a predetermined number (which may be an integer number or a non-integer number) of copies of the region of interest; calling a number of copies of the region of interest using a genetic variant caller for each synthetic copy number variant based on the
  • the method of assessing the performance of a genetic variant screen comprises generating a plurality of synthetic copy number variants comprising a synthetic number of copies of a region of interest represented by a synthetic number of sequencing reads from one or more segments within the region of interest, wherein the synthetic number of sequencing reads from each of the one or more segments is generated by (i) normalizing a number of real sequencing reads from the one or more segments from a real test sample by dividing the number of real sequencing reads from each segment from the real test sample by the average number of real sequencing reads from a corresponding segment from the one or more real reference samples; (ii) normalizing the number of real sequencing reads from the one or more segments from the real test sample by dividing the number of real sequencing reads from each segment from the real test sample by the average number of real sequencing reads from the one or more segments within the region of interest from the real test sample; and (iii) increasing or decreasing the number of normalized real sequencing reads from the
  • the method of assessing the performance of a genetic variant screen comprises generating real sequencing reads from a real test sample; generating a plurality of synthetic copy number variants comprising a synthetic number of copies of a region of interest represented by a synthetic number of sequencing reads from one or more segments within the region of interest, wherein the synthetic number of sequencing reads from each of the one or more segments is generated by (i) normalizing a number of real sequencing reads from the one or more segments from the real test sample by dividing the number of real sequencing reads from each segment from the real test sample by the average number of real sequencing reads from a corresponding segment from the one or more real reference samples; (ii) normalizing the number of real sequencing reads from the one or more segments from the real test sample by dividing the number of real sequencing reads from each segment from the real test sample by the average number of real sequencing reads from the one or more segments within the region of interest from the real test sample, and (iii) increasing or
  • the summary statistic is sensitivity, specificity, precision, recall, accuracy, positive predictive value, negative predictive value, or any other metric of concordance.
  • the method of assessing the performance of a genetic variant screen comprises generating real sequencing reads from a real test sample; generating real sequencing reads from one or more real reference samples; generating a plurality of synthetic copy number variants comprising a synthetic number of copies of a region of interest represented by a synthetic number of sequencing reads from one or more segments within the region of interest, wherein the synthetic number of sequencing reads from each of the one or more segments is generated by (i) normalizing a number of real sequencing reads from the one or more segments from the real test sample by dividing the number of real sequencing reads from each segment from the real test sample by the average number of real sequencing reads from a corresponding segment from the one or more real reference samples, (ii) normalizing the number of real sequencing reads from the one or more segments from the
  • the one or more real reference samples are compared to the synthetic copy number variant using the genetic variant caller to call the number of copies in the synthetic copy number variant, for example by comparing a number of real sequencing reads from the one or more real reference samples to the synthetic number of sequencing reads representing the synthetic copy number variant.
  • the genetic variant caller uses a hidden Markov model to call the copy number in the synthetic copy number variant. As copy number variants in any given region of interest are relatively rare (see FIG. 1A), it can be assumed that the one or more real reference samples do not have a copy number variant. In some embodiments, the reference samples are verified negative for a copy number variant.
  • the one or more reference samples include two or more, three or more, four or more, five or more, six or more, eight or more, ten or more, twenty or more, thirty or more, forty or more, forty-eight or more, sixty or more, seventy or more, eighty or more, ninety or more, ninety-three, or ninety-six or more reference samples.
  • the one or more real reference samples include the test sample (although the real reference sample is not exclusively the real test sample). In some embodiments, the one or more real reference samples exclude the real test sample. In some embodiments, real sequencing reads from the real test sample and the one or more real reference samples are simultaneously generated. In some embodiments, the region of interest (or segments within the region of interest) from the real test sample and the one or more real reference samples are enriched using the same methods (for example, using PCR
  • the number of real sequencing reads from each of the one or more segments from the real test sample are normalized by dividing the number of real sequencing reads from each segment from the real test sample by the average (such as mean or median) number of real sequencing reads from a corresponding segment from one or more real reference samples. In some embodiments, the number of real sequencing reads from each of the one or more segments from the one or more real reference samples are normalized by dividing the number of real sequencing reads from each segment from each of the one or more real reference samples by the average (such as mean or median) number of real sequencing reads from a corresponding segment from one or more real reference samples.
  • a real test sample can include a first number of real sequencing reads from a first segment and a second number of real sequencing reads from a second segment.
  • a first average number of sequencing reads from the first segment from the one or more reference samples can be determined, and a second average number of sequencing reads from the second segment from the one or more reference samples can be determined.
  • to normalize the number of real sequencing reads from the first segment from the real test sample, the first number of real sequencing reads is divided by the first average number of real sequencing reads, and to normalize the number of real sequencing reads from the second segment from the real test sample, the second number of real sequencing reads is divided by the second average number of real sequencing reads.
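The per-segment reference normalization described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `normalize_by_reference`, the list-of-lists layout for the reference samples, and the example counts are all assumptions made for this sketch, and the mean is used as the average (the text also allows the median).

```python
from statistics import mean

def normalize_by_reference(test_counts, reference_counts):
    """Divide each segment's read count in the test sample by the mean
    read count of the corresponding segment across the reference samples.

    test_counts: read counts per segment for one real test sample.
    reference_counts: one list of per-segment read counts per reference sample.
    (Both layouts are assumptions made for this sketch.)
    """
    normalized = []
    for seg, count in enumerate(test_counts):
        ref_avg = mean(sample[seg] for sample in reference_counts)
        normalized.append(count / ref_avg)
    return normalized

# Hypothetical counts for two segments and three reference samples.
test_sample = [120, 90]
references = [[100, 80], [110, 100], [90, 90]]
print(normalize_by_reference(test_sample, references))
```

A normalized value near 1 then corresponds to the copy number assumed for the reference samples, regardless of how efficiently a given segment is captured or sequenced.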
  • the real sequencing reads from the real test sample are normalized by dividing the number of real sequencing reads from each segment from the real test sample by the average number of real sequencing reads from the one or more segments within the region of interest from the real test sample.
  • a region of interest can include three segments.
  • the real test sample can include a first number of real sequencing reads for the first segment, a second number of real sequencing reads for the second segment, and a third number of real sequencing reads for the third segment.
  • An average number of sequencing reads can be determined for the three segments within the region of interest from the real test sample.
  • the first number of real sequencing reads is divided by the average number of sequencing reads.
  • the second number of real sequencing reads is divided by the average number of sequencing reads.
  • the third number of real sequencing reads is divided by the average number of sequencing reads.
  • the number of real sequencing reads from each of the one or more segments from the real test sample are normalized by dividing the number of real sequencing reads from each segment from the real test sample by the average number of real sequencing reads from a corresponding segment from one or more real reference samples, and the real sequencing reads from the real test sample are normalized by dividing the number of real sequencing reads from each segment from the real test sample by the average number of real sequencing reads from the one or more segments within the region of interest from the real test sample.
  • the number of real sequencing reads can be double-normalized in either order.
  • first the number of real sequencing reads from each of the one or more segments from the real test sample are normalized by dividing the number of real sequencing reads from each segment from the real test sample by the average number of real sequencing reads from a corresponding segment from one or more real reference samples
  • second the real sequencing reads from the real test sample are normalized by dividing the number of real sequencing reads from each segment from the real test sample by the average number of real sequencing reads from the one or more segments within the region of interest from the real test sample.
  • first the real sequencing reads from the real test sample are normalized by dividing the number of real sequencing reads from each segment from the real test sample by the average number of real sequencing reads from the one or more segments within the region of interest from the real test sample, and second the number of real sequencing reads from each of the one or more segments from the real test sample are normalized by dividing the number of real sequencing reads from each segment from the real test sample by the average number of real sequencing reads from a corresponding segment from one or more real reference samples.
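One of the two orderings of the double normalization (reference normalization first, within-sample normalization second) can be sketched as below. The function name `double_normalize` and the data layout are assumptions for illustration, and the mean stands in for the average. In this simplified sketch the two orderings are not guaranteed to produce numerically identical values, because the within-sample average is computed on different inputs in each case.

```python
from statistics import mean

def double_normalize(test_counts, reference_counts):
    """Apply reference normalization, then within-sample normalization.

    test_counts: per-segment read counts for one real test sample.
    reference_counts: one list of per-segment counts per reference sample.
    """
    # Step 1: divide each segment by the reference mean for that segment.
    by_reference = [
        count / mean(sample[seg] for sample in reference_counts)
        for seg, count in enumerate(test_counts)
    ]
    # Step 2: divide by the test sample's own average across the segments
    # within the region of interest.
    sample_avg = mean(by_reference)
    return [value / sample_avg for value in by_reference]

# Hypothetical counts for three segments and three reference samples.
result = double_normalize(
    [120, 90, 100],
    [[100, 80, 100], [110, 100, 100], [90, 90, 100]],
)
print(result)
```

After the second step the values average to 1 across the segments of the test sample, which removes sample-wide depth differences.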
  • the region of interest can be of any length, for example, 1 base to a full length chromosome.
  • the region of interest can be 1 base to about 250 million bases in length (such as about 1 base to about 50 bases in length, about 50 bases to about 100 bases in length, about 100 bases to about 250 bases in length, about 250 bases to about 500 bases in length, about 500 bases to about 1000 bases in length, about 1000 bases to about 2000 bases in length, about 2000 bases to about 4000 bases in length, about 4000 bases to about 8000 bases in length, about 8000 bases to about 16,000 bases in length, about 16,000 bases to about 32,000 bases in length, about 32,000 bases to about 64,000 bases in length, about 64,000 bases to about 125,000 bases in length, about 125,000 bases to about 250,000 bases in length, about 250,000 bases to about 500,000 bases in length, about 500,000 bases to about 1 million bases in length, about 1 million bases to about 2 million bases in length, about 2 million bases to about 4 million bases in length, about 4 million bases to about 8 million bases in length, about 8
  • the region of interest is about 1 base or more (such as about 50 bases or more, about 100 bases or more, about 250 bases or more, about 500 bases or more, about 1000 bases or more, about 2000 bases or more, about 4000 bases or more, about 8000 bases or more, about 16,000 bases or more, about 32,000 bases or more, about 64,000 bases or more, about 125,000 bases or more, about 250,000 bases or more, about 500,000 bases or more, about 1 million bases or more, about 2 million bases or more, about 4 million bases or more, about 8 million bases or more, about 16 million bases or more, about 32 million bases or more, about 64 million bases or more, or about 125 million bases or more).
  • the region of interest comprises one or more genes (such as 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 40, 50, 75, 100, 150, 200, 250 or more genes). In some embodiments, the region of interest comprises one or more exons (such as 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 40, 50, 75, 100, 150, 200, 250 or more exons).
  • the region of interest can be divided into one or more segments, which may or may not be continuous, contiguous, or partially overlapping.
  • the region of interest comprises 1 or more (such as 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 40, 50, 75, 100, 150, 200, 250 or more) segments.
  • the sequencing reads (or a portion of the sequencing reads) correspond to one of the one or more segments (i.e., the sequencing reads can be aligned to segments, for example using a reference sequence) within the region of interest.
  • sequences may not accurately map to a particular segment (for example, a sequencing read may map to more than one segment or may map to no segment); such un-mappable or un-alignable sequencing reads are optionally ignored or discarded.
  • a region of interest from one or more real samples is sequenced to generate real sequencing reads.
  • the real sequencing reads can correspond to segments within the region of interest.
  • Real sequencing reads can be generated from one or more real samples (e.g., one or more sequencing libraries from the one or more real samples) using any known sequencing method, such as massively parallel sequencing (for example using an Illumina HiSeq 2500 system).
  • the sequencing library can be enriched for the region of interest, which can increase the proportion of sequencing reads that correspond to the region of interest.
  • the region of interest can be enriched by PCR (for example, by combining one or more primers that hybridize to portions of segments within the region of interest with genomic DNA from a real sample, and amplifying the segments within the region of interest).
  • the region of interest is enriched by combining capture probes (such as biotinylated DNA, RNA, synthetic oligonucleotides) that hybridize to segments within the region of interest with genomic DNA (which is preferably sheared).
  • the capture probes can then be used to isolate DNA fragments that include segments from the region of interest, and those DNA fragments can be sequenced to generate sequencing reads.
  • the real sequencing reads are normalized.
  • the real sequencing reads are normalized for GC content or mappability.
  • some segments within the region of interest may have a higher GC content than other segments within the region of interest.
  • the higher GC content may increase or decrease the assay efficiency within that segment, inflating or deflating the relative number of sequencing reads for reasons other than copy number.
  • Methods to normalize GC content are known in the art, for example as described in Fan & Quake, PLoS ONE, vol. 5, e10439 (2010).
  • certain segments within the region of interest may be more easily mappable (or alignable to a reference region of interest) than others; in the less mappable segments, a number of sequencing reads may be excluded, thereby deflating the relative number of sequencing reads for reasons other than copy number.
  • Mappability at a given position in the genome can be predetermined for a given read length, k, by segmenting every position within the region of interest into k-mers and aligning the sequences back to the region of interest. k-mers that align to a unique position in the interrogated region are labeled "mappable," and k-mers that do not align to a unique position in the region of interest are labeled "not mappable."
  • a given segment can be normalized for mappability by scaling the number of reads in the segment by the inverse of the fraction of the mappable k-mers in the segment. For example, if 50% of k-mers within a bin are mappable, the number of observed reads from within that segment can be scaled by a factor of 2.
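The mappability labeling and the inverse-fraction scaling described above can be sketched as follows. This is a simplified illustration under stated assumptions: exact string matching stands in for a real aligner, reverse complements are ignored, and the function names `mappable_fraction` and `scale_for_mappability` are hypothetical.

```python
from collections import Counter

def mappable_fraction(region, k):
    """Fraction of k-mers in `region` that occur at exactly one position.

    Exact matching replaces alignment here, which is an assumption of
    this sketch rather than the method described in the text.
    """
    kmers = [region[i:i + k] for i in range(len(region) - k + 1)]
    counts = Counter(kmers)
    unique = sum(1 for kmer in kmers if counts[kmer] == 1)
    return unique / len(kmers)

def scale_for_mappability(observed_reads, fraction_mappable):
    """Scale a read count by the inverse of the mappable fraction."""
    return observed_reads / fraction_mappable

# If 50% of k-mers in a segment are mappable, observed reads are doubled.
print(scale_for_mappability(40, 0.5))
# A toy sequence in which only one of the four 2-mers is unique.
print(mappable_fraction("AAAAC", 2))
```

A production pipeline would typically compute mappability genome-wide with an aligner, since a k-mer that is unique within the region may still match elsewhere in the genome.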
  • Sequencing reads from a population of real test samples assumed to be wild-type for a number of copies of a region of interest form a negative binomial distribution with an average (mean or median) and a variance.
  • the variance of the distribution can arise, for example, from noise during enrichment or sequencing of the region of interest.
  • the distribution of sequencing reads from a population of synthetic copy number variants preferably resembles an expected negative binomial distribution of sequencing reads from a theoretical population of real copy number variants.
  • the average of the expected distribution of the theoretical population of real copy number variants (and thus, the distribution of the synthetic copy number variants) is shifted to reflect the change in number of copies of the region of interest relative to the average of the real test samples.
  • the average number of sequencing reads in a population of samples having one copy of a region of interest is expected to be half the average number of sequencing reads in a population of samples having two copies of the region of interest.
  • the average number of sequencing reads in a population of samples having three copies of a region of interest is expected to have 1.5 times the average number of sequencing reads in a population of samples having two copies of the region of interest.
  • because assessment of the genetic variant caller includes calling a number of copies of the region of interest from a plurality of copy number variants (i.e., a population of variants), it is preferable to assess the caller against a synthetic population with a distribution that mimics a real copy number variant population.
  • a plurality of synthetic copy number variants comprising a synthetic number of copies of a region of interest represented by a synthetic number of sequencing reads from one or more segments within the region of interest is generated.
  • the synthetic number of sequencing reads for each of the one or more segments can be generated by increasing or decreasing a number of real sequencing reads from the one or more segments within a region of interest from a real test sample.
  • a synthetic copy number variant having three copies of the region of interest can be generated by generating a first synthetic number of sequencing reads corresponding to the first segment by increasing the first number of real sequencing reads to reflect three copies of the first segment, and generating a second synthetic number of sequencing reads corresponding to the second segment by increasing the second number of real sequencing reads to reflect three copies of the second segment.
  • the synthetic copy number variant has three copies of the region of interest having the first segment and the second segment.
  • the synthetic number of sequencing reads are generated by multiplying the number of real sequencing reads by a factor (such as 1.5 to increase the copy number from two to three, or 0.5 to decrease the copy number from two to one).
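The multiplicative approach can be sketched in a few lines. The function name `scale_reads` and its parameters are assumptions for illustration; the factor is the ratio of the target copy number to the copy number assumed for the real test sample, which gives 1.5 for two-to-three copies and 0.5 for two-to-one, as in the text.

```python
def scale_reads(real_counts, target_copies, assumed_copies=2):
    """Scale per-segment read counts to represent `target_copies` copies,
    starting from a sample assumed to carry `assumed_copies` copies."""
    factor = target_copies / assumed_copies
    return [count * factor for count in real_counts]

# Hypothetical read counts for two segments of a two-copy sample.
print(scale_reads([100, 80], target_copies=3))  # duplication, factor 1.5
print(scale_reads([100, 80], target_copies=1))  # deletion, factor 0.5
```

Because the same factor is applied to every segment, segment-to-segment noise in the real sample is carried into the synthetic variant, scaled along with the mean.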
  • the synthetic number of sequencing reads are generated by adding (or subtracting) a number of sequencing reads (such as 50% of the average number of real sequencing reads corresponding to all segments within the region of interest) to the number of real sequencing reads.
  • the number of sequencing reads are normalized (for example, as described below) such that a single copy of a region of interest is represented by a normalized number of sequencing reads (e.g., 0.5), and two copies of a region of interest are represented by a normalized number of sequencing reads (e.g., 1).
  • a number of normalized sequencing reads (such as 0.5) are added to the normalized number of sequencing reads to increase the number of copies in the synthetic copy number variant, and a number of normalized sequencing reads (such as 0.5) are subtracted from the normalized number of sequencing reads to decrease the number of copies in the synthetic copy number variant.
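The additive approach on normalized read counts can be sketched as below, assuming the normalization in which two copies are represented by a value of 1 and each copy contributes 0.5. The function name `shift_normalized` and the `per_copy` parameter are hypothetical names introduced for this sketch.

```python
def shift_normalized(normalized_counts, copy_change, per_copy=0.5):
    """Add (copy_change > 0) or subtract (copy_change < 0) a fixed
    normalized amount per copy gained or lost."""
    return [value + copy_change * per_copy for value in normalized_counts]

# Hypothetical normalized counts for a two-copy sample across two segments.
two_copy = [1.0, 0.9]
print(shift_normalized(two_copy, +1))  # simulated three copies
print(shift_normalized(two_copy, -1))  # simulated one copy
```

Unlike the multiplicative approach, the additive shift leaves each segment's deviation from the mean unchanged, so the spread of the synthetic values matches the spread of the real values rather than scaling with copy number.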
  • the number of real sequencing reads are increased or decreased to generate the synthetic number of sequencing reads to represent a synthetic copy number variant with a predetermined number (which may be an integer number or a non-integer number) of copies of the region of interest (such as 1 or more, 2 or more, 3 or more, 4 or more, or 5 or more copies of the region of interest).
  • a synthetic number of sequencing reads is generated by adding or subtracting a number of sequencing reads from a number of sequencing reads from a real test sample to generate a synthetic copy number variant.
  • a synthetic copy number variant comprising a duplication is generated by adding the number of sequencing reads, and a synthetic copy number variant comprising a deletion event is generated by subtracting a number of sequencing reads.
  • the number of sequencing reads added or subtracted from the number of sequencing reads from the real test sample is based, in part, on how many duplication or deletion events are simulated in the synthetic copy number variant.
  • a synthetic number of sequencing reads for a synthetic copy number variant comprising n copies of a region of interest (or segment thereof) more (or less) than an assumed (e.g., wild-type) number of copies x in a real test sample is determined by adding (or subtracting) $\frac{n}{x}$ times an average (e.g., mean or median) number of sequencing reads from a plurality of real test samples for that region of interest (or segment thereof) to (or from) the number of sequencing reads for that region of interest (or segment thereof) from a real test sample.
  • the synthetic number of sequencing reads for the synthetic copy number variant is determined as $d^{\text{syn}}_{ij} = d_{ij} \pm \frac{n}{x}\mu$, wherein $d_{ij}$ refers to the number of sequencing reads at region of interest (or segment) $j$ for real test sample $i$ and $\mu$ refers to an average (mean or median) number of sequencing reads, which can be, for example, an average number of sequencing reads from all segments within the real test sample (i.e., $\mu_i$), an average number of sequencing reads at region of interest (or segment) $j$ across a plurality of real test samples (i.e., $\mu_j$), or a normalized (or double normalized) average number of sequencing reads that is an average number of sequencing reads for a region of interest (or segment) $j$ for a plurality of real test samples, wherein the number of sequencing reads for each real test sample has been normalized across the real test sample
  • for example, the synthetic number of sequencing reads for a synthetic copy number variant having one copy of a region of interest $j$, based on a number of sequencing reads from a real test sample $i$ assumed to have two copies of the region of interest, can be determined as $d^{\text{syn}}_{ij} = d_{ij} - \frac{1}{2}\mu$.
  • for a synthetic copy number variant comprising a duplication (i.e., having n additional copies of a region of interest or segment thereof than an assumed number of copies x in a real test sample), the synthetic number of sequencing reads for the synthetic copy number variant having three copies of a region of interest $j$, based on a number of sequencing reads from a real test sample $i$ assumed to have two copies of the region of interest, can be determined as $d^{\text{syn}}_{ij} = d_{ij} + \frac{1}{2}\mu$.
  • a synthetic number of sequencing reads for a synthetic copy number variant comprising m copies of a region of interest (or segment thereof) can be generated based on a number of sequencing reads of that region of interest (or segment thereof) comprising x copies of the region of interest (or segment thereof) according to $d^{\text{syn}}_{ij} = d_{ij} + \frac{m - x}{x}\mu$, where $d_{ij}$ is the number of sequencing reads from the real test sample and $\mu$ is the average number of sequencing reads.
  • For example, a synthetic number of sequencing reads for a synthetic copy number variant with three copies of a region of interest can be generated based on a number of sequencing reads from a real test sample having two copies of the region of interest (or segment thereof) according to: $d^{\text{syn}}_{ij} = d_{ij} + \frac{3 - 2}{2}\mu = d_{ij} + \frac{1}{2}\mu$.
  • a synthetic number of sequencing reads for a synthetic copy number variant with one copy of a region of interest (or segment thereof) can be generated based on a number of sequencing reads from a real test sample having two copies of the region of interest (or segment thereof) according to: $d^{\text{syn}}_{ij} = d_{ij} + \frac{1 - 2}{2}\mu = d_{ij} - \frac{1}{2}\mu$.
  • the synthetic number of sequencing reads for a synthetic copy number variant comprising m copies of a region of interest is generated from a number of sequencing reads from a real test sample with an assumed (e.g., wild-type) number of copies x by multiplying the number of sequencing reads from the real test sample by $\frac{m}{x}$. That is, a synthetic number of sequencing reads for a synthetic copy number variant can be determined based on a number of sequencing reads according to: $d^{\text{syn}}_{ij} = \frac{m}{x} d_{ij}$, where $d_{ij}$ is the number of sequencing reads from the real test sample.
  • For example, a synthetic number of sequencing reads for a synthetic copy number variant having three copies of a region of interest can be generated based on a number of sequencing reads from a real test sample assumed to have two copies of the region of interest (or segment thereof) according to: $d^{\text{syn}}_{ij} = \frac{3}{2} d_{ij}$.
  • a synthetic number of sequencing reads for a synthetic copy number variant having one copy of a region of interest (or segment thereof) can be generated based on a number of sequencing reads from a real test sample assumed to have two copies of the region of interest (or segment thereof) according to: $d^{\text{syn}}_{ij} = \frac{1}{2} d_{ij}$.
  • a fudge factor is included when determining the synthetic number of sequencing reads.
  • the fudge factor can be derived from the increase or decrease in variance expected for a Poisson distribution when changing the average number of sequencing reads.
  • the synthetic number of sequencing reads for a synthetic copy number variant is determined by sampling a binomial distribution or a negative binomial distribution of real sequencing reads from a real test sample.
  • the synthetic number of sequencing reads can be generated by sampling from a binomial distribution of a real number of sequencing reads from a real test sample having x copies of the region of interest (or segment thereof) with a success probability equal to $\frac{m}{x}$, where m is the number of copies in the synthetic copy number deletion variant, and a number of trials equal to the number of real sequencing reads. That is, for a synthetic copy number deletion variant, $d^{\text{syn}}_{ij} \sim \mathrm{Binomial}\left(d_{ij}, \frac{m}{x}\right)$, where $d_{ij}$ is the number of real sequencing reads from the real test sample.
  • For example, for a synthetic copy number variant having one copy of the region of interest (or segment thereof), the synthetic number of sequencing reads can be generated by sampling from a binomial distribution of a real number of sequencing reads from a real test sample having two copies of the region of interest (or segment thereof) with a success probability equal to 1/2 and a number of trials equal to the number of real sequencing reads. That is, $d^{\text{syn}}_{ij} \sim \mathrm{Binomial}\left(d_{ij}, \frac{1}{2}\right)$.
  • FIG. 2 illustrates binomial sampling of real sequencing
  • each real test sample includes a real number of sequencing reads of 100, although it is understood that a distribution of sequencing reads would be likely.
  • a binomial distribution is sampled for each real test sample with a success probability equal to 1/2. A success represents a first copy of the region of interest and a failure represents the second copy.
  • the number of successful sequencing reads (that is, those representing the first copy) is equal to the synthetic number of sequencing reads for the synthetic copy number variant
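The binomial sampling of a simulated deletion can be sketched with the standard library alone. This is an illustrative sketch: the function name `simulate_deletion`, its parameters, and the use of per-read Bernoulli trials are assumptions, and the success probability is taken as the ratio of target copies to assumed copies (1/2 for a two-copy sample reduced to one copy, as in the example above).

```python
import random

def simulate_deletion(real_reads, target_copies=1, assumed_copies=2, rng=None):
    """Draw a synthetic read count from Binomial(real_reads, p), where
    p = target_copies / assumed_copies. Each real read is kept with
    probability p, i.e. treated as coming from a retained copy."""
    rng = rng or random.Random()
    p = target_copies / assumed_copies
    return sum(1 for _ in range(real_reads) if rng.random() < p)

# Five hypothetical real test samples, each with 100 real sequencing reads.
rng = random.Random(0)
synthetic_counts = [simulate_deletion(100, rng=rng) for _ in range(5)]
print(synthetic_counts)
```

Sampling, rather than simply halving the count, adds the counting noise that a real one-copy sample would show, so the synthetic counts scatter around 50 instead of all being exactly 50.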
  • the synthetic number of sequencing reads for a synthetic copy number duplication variant having m number copies of a region of interest (or segment thereof) is generated by sampling from a negative binomial distribution, wherein a number of successes is equal to the real number of sequencing reads from a real test sample having an assumed x number of copies of the region of interest (or segment thereof), and the probability of success is equal to and adding an expectation of the sampled negative binomial
  • synthetic number of sequencing reads for a synthetic copy number duplication variant having three copies of a region of interest (or segment thereof ) can be generated by sampling from a negative binomial distribution, wherein a number of successes is equal to the real number of sequencing reads from a real test sample having an assumed two number of copies of the region of interest (or segment thereof), and the probability of success is equal to and adding
  • a fudge factor is included when determining the synthetic number of sequencing reads, which can be used to more closely model the variance of a plurality of synthetic numbers of sequencing reads (i.e., a plurality of synthetic copy number variants) to the variance of a plurality of real test samples used as a basis for the plurality of synthetic copy number variants.
  • the fudge factor can be determined empirically.
  • the fudge factor can be determined by comparing the distribution of sequencing reads from an X chromosome in males (which have a single copy of the X chromosome) to the distribution of sequencing reads from the X chromosome in females (which have two copies of the X chromosome) that have a simulated deletion of a single X chromosome (thus having a simulated one copy of the X chromosome).
  • the fudge factor can be adjusted such that the observed one copy males are compared to simulated one copy females.
  • the synthetic number of sequencing reads can be determined according to:
  • the genetic variant caller can call a number of copies of the region of interest for each synthetic copy number variant in the plurality of synthetic copy number variants.
  • the number of copies of the region of interest in each synthetic copy number variant is known, as the number of copies of the region of interest in the synthetic copy number variant is represented by the synthetic number of sequencing reads, which were generated by adjusting the number of real sequencing reads from the real test sample to a desired number of copies of the region of interest.
  • the called number of copies can be compared to the number of copies in each of the synthetic copy number variants in the plurality of synthetic copy number variants to determine a summary statistic for the genetic variant screen.
  • the summary statistic can be, for example, sensitivity, specificity, precision, recall, accuracy, positive predictive value, negative predictive value, or any other metric of concordance.
  • the summary statistic indicates the performance of the genetic variant screen. For example, a high number of true positives and a low number of false negatives for a genetic variant screen is preferable. Thus, the summary statistic can be used to assess the
  • a predetermined threshold for the summary statistic can be selected. In some embodiments, if the summary statistic is below the predetermined threshold, the genetic variant screen can be refined (for example, by altering the method for generating real sequencing reads or altering the genetic variant caller).
  • method of assessing the performance of a genetic variant screen comprising calling, using a genetic variant caller, for an inverse of a synthetic variant in each test sequence in a plurality of test sequences, each test sequence comprising a plurality of sequencing reads aligned with a reference sequence comprising a synthetic variant; and determining a summary statistic for the genetic variant screen based on a called inverse of the variant in the plurality of test sequences and the variant in the reference sequence, thereby assessing the performance of the genetic variant screen.
  • the method further comprises aligning the plurality of sequencing reads with the reference sequence comprising the synthetic variant.
  • the sequencing reads can be generated from one or more real samples (e.g., one or more sequencing libraries from the one or more real samples) using any known sequencing method, such as massively parallel sequencing (for example, using an Illumina HiSeq 2500 system).
  • Genetic variant callers generally function by comparing aligned sequencing reads to a reference sequence.
  • the sequencing reads are aligned to the reference sequence, and the genetic variant caller identifies differences between the aligned sequencing reads and the reference sequence. Solely by way of example, if the reference sequence includes 40 bases that are not present in the aligned sequence, the genetic variant caller is intended to call a deletion of those 40 bases.
  • insertions, deletions, and inversions are relatively rare occurrences, and it is difficult to acquire a sufficiently large number of positive variants to adequately assess the performance of a genetic variant screen.
  • the genetic variant caller can call the inverse of the synthetic variant in a test sequence (which is known to be negative for the synthetic variant) when compared to the reference sequence comprising the synthetic variant.
  • An exemplary variant can include, but is not limited to, one or more of an insertion, a deletion, an inversion, a translocation, a single nucleotide variant (SNV), or a combination thereof.
  • the synthetic variant comprises an insertion and the inverse of the synthetic variant is a deletion.
  • the synthetic variant comprises a deletion and the inverse of the synthetic variant is an insertion.
  • the synthetic variant is an inversion and the inverse of the synthetic variant is an inversion.
  • the synthetic variant can be of any length, for example, 1 base to a full length chromosome.
  • the synthetic variant can be 1 base to about 250 million bases in length (such as about 1 base to about 50 bases in length, about 50 bases to about 100 bases in length, about 100 bases to about 150 bases, about 150 bases to about 250 bases in length, about 250 bases to about 500 bases in length, about 500 base to about 1000 bases in length, about 1000 bases to about 2000 bases in length, about 2000 bases to about 4000 bases in length, about 4000 bases to about 8000 bases in length, about 8000 bases to about 16,000 bases in length, about 16,000 bases to about 32,000 bases in length, about 32,000 bases to about 64,000 bases in length, about 64,000 bases to about 125,000 bases in length, about 125,000 bases to about 250,000 bases in length, about 250,000 bases to about 500,000 bases in length, about 500,000 bases to about 1 million bases in length, about 1 million bases to about 2 million bases in length, about 2 million bases to about 4 million bases in length, about 4 million bases to about 8 million
  • the synthetic variant is about 1 base or more (such as about 50 bases or more, about 100 bases or more, about 250 bases or more, about 500 bases or more, about 1000 bases or more, about 2000 bases or more, about 4000 bases or more, about 8000 bases or more, about 16,000 bases or more, about 32,000 bases or more, about 64,000 bases or more, about 125,000 bases or more, about 250,000 bases or more, about 500,000 bases or more, about 1 million bases or more, about 2 million bases or more, about 4 million bases or more, about 8 million bases or more, about 16 million bases or more, about 32 million bases or more, about 64 million bases or more, or about 125 million bases or more).
  • the genetic variant caller can call the inverse of the synthetic variant for each test sequence in the plurality of test sequences.
  • the inverse of the synthetic variant in each test sequence is known, as the synthetic variant was artificially introduced into the reference sequence.
  • the called inverse of the synthetic variant can be compared to the known inverse of the synthetic variant to determine a summary statistic for the genetic variant screen.
  • the summary statistic can be, for example, sensitivity, specificity, precision, recall, accuracy, positive predictive value, negative predictive value, or any other metric of concordance.
  • the summary statistic indicates the performance of the genetic variant screen. For example, a high number of true positives and a low number of false negatives for a genetic variant screen is preferable. Thus, the summary statistic can be used to assess the
  • a predetermined threshold for the summary statistic can be selected. In some embodiments, if the summary statistic is below the predetermined threshold, the genetic variant screen can be refined (for example, by altering the method for generating real sequencing reads or altering the genetic variant caller).
  • FIG. 3 depicts an exemplary computing system 300 configured to perform any one of the above-described processes, including the various exemplary methods of assessing the performance of a genetic variant screen.
  • the computing system 300 may include, for example, a processor, memory, storage, and input/output devices (e.g., monitor, keyboard, disk drive, Internet connection, etc).
  • the computing system 300 may include circuitry or other specialized hardware for carrying out some or all aspects of the processes.
  • the computing system includes a sequencer (such as a massively parallel sequencer).
  • computing system 300 may be configured as a system that includes one or more units, each of which is configured to carry out some aspects of the processes either in software, hardware, or some combination thereof.
  • FIG. 3 depicts computing system 300 with a number of components that may be used to perform the above-described processes.
  • the main system 302 includes a motherboard 304 having an input/output ("I/O") section 306, one or more central processing units ("CPU") 308, and a memory section 310, which may have a flash memory card 312 related to it.
  • the I/O section 306 is connected to a display 314, a keyboard 316, a disk storage unit 318, and a media drive unit 320.
  • the media drive unit 320 can read/write a computer-readable medium 322, which can contain programs 324 and/or data.
  • a non-transitory computer-readable medium can be used to store (e.g., tangibly embody) one or more computer programs for performing any one of the above-described processes by means of a computer.
  • the computer program may be written, for example, in a general-purpose programming language (e.g., Pascal, C, C++, Java, Python, JSON, R, etc.) or some specialized application-specific language.
  • the summary statistic is reported (for example, to a patient, a doctor, a caregiver, or a regulator). In some embodiments, the summary statistic is displayed, for example on a monitor.
  • Embodiment 1 A method of assessing the performance of a genetic variant screen, comprising:
  • Embodiment 2 The method of embodiment 1, wherein the summary statistic is sensitivity, specificity, precision, recall, accuracy, positive predictive value, or negative predictive value.
  • Embodiment 3 The method of embodiment 1 or 2, wherein the synthetic number of sequencing reads from each of the one or more segments is generated by increasing or decreasing a number of real sequencing reads from the one or more segments from a real test sample in proportion to a predetermined number of copies of the region of interest.
  • Embodiment 4 The method of embodiment 3, wherein the predetermined number of copies is an integer number of copies.
  • Embodiment 5 The method of embodiment 3, wherein the predetermined number of copies is a non-integer number of copies.
  • Embodiment 6. The method of any one of embodiments 1-5, wherein the synthetic number of sequencing reads from the one or more segments within the region of interest is generated by sampling a binomial distribution with a success probability equal to m/x and a number of trials equal to a real number of sequencing reads from a real test sample, wherein m is the synthetic number of copies of the region of interest, and x is an assumed number of copies of the region of interest from the real test sample.
  • Embodiment 7 The method of any one of embodiments 1-5, wherein the synthetic number of sequencing reads from the one or more segments within the region of interest is generated by sampling a number of sequencing reads as a negative binomial distribution with a success probability equal to m/x and a number of successes equal to a real number of sequencing reads from a real test sample, wherein m is the synthetic number of copies of the region of interest, and x is an assumed number of copies of the region of interest from the real test sample, and adding the sampled number of sequencing reads to the real number of sequencing reads from the test sample.
  • Embodiment 8 The method of embodiment 7, wherein the synthetic number of sequencing reads from the one or more segments within the region of interest is generated by sampling a number of sequencing reads as an expectation of the negative binomial distribution.
  • Embodiment 9 The method of any one of embodiments 3-8, wherein the number of real sequencing reads from each of the one or more segments from the real test sample are normalized by dividing the number of real sequencing reads from each segment from the real test sample by the average number of real sequencing reads from a corresponding segment from the one or more real reference samples.
  • Embodiment 10 The method of any one of embodiments 3-9, wherein the number of real sequencing reads from each of the one or more segments from the real test sample are normalized by dividing the number of real sequencing reads from each segment from the real test sample by the average number of real sequencing reads from the one or more segments within the region of interest from the real test sample.
  • Embodiment 11 The method of embodiment 9 or 10, wherein the average number of real sequencing reads from a corresponding segment from the one or more real reference samples is a median number of real sequencing reads from a corresponding segment from the one or more real reference samples.
  • Embodiment 12 The method of any one of embodiments 9 or 10, wherein the average number of real sequencing reads from the one or more segments within the region of interest from the real test sample is the median average number of real sequencing reads from the one or more segments within the region of interest from the real test sample.
  • Embodiment 13 The method of any one of embodiments 3-12, wherein the number of real sequencing reads are normalized for GC content bias or mappability.
  • Embodiment 14 The method of any one of embodiments 1-13, wherein the region of interest is more than about 100 bases in length.
  • Embodiment 15 The method of any one of embodiments 1-14, wherein the region of interest is at least one exon.
  • Embodiment 16 The method of any one of embodiments 1-15, wherein the region of interest is at least one gene.
  • Embodiment 17 The method of any one of embodiments 1-16, wherein the region of interest is at least one chromosome.
  • Embodiment 18 The method of any one of embodiments 3-17, further comprising generating the real sequencing reads from the real test sample.
  • Embodiment 19 The method of any one of embodiments 1-18, further comprising generating the real sequencing reads from the one or more real reference samples.
  • Embodiment 20 A method of assessing the performance of a genetic variant screen, comprising:
  • each test sequence comprising a plurality of sequencing reads aligned with a reference sequence comprising a synthetic variant
  • Embodiment 21 The method of embodiment 20, further comprising aligning the plurality of sequencing reads with the reference sequence comprising the synthetic variant.
  • Embodiment 22 The method of embodiment 20 or 21, wherein the synthetic variant comprises an insertion, a deletion, an inversion, a translocation, a SNV, or a combination thereof.
  • Embodiment 23 The method of any one of embodiments 20-22, wherein the synthetic variant comprises an insertion or a deletion.
  • Embodiment 24 The method of any one of embodiments 20-23, wherein the synthetic variant comprises an insertion and the inverse of the synthetic variant is a deletion.
  • Embodiment 25 The method of any one of embodiments 20-23, wherein the synthetic variant comprises a deletion and the inverse of the synthetic variant is an insertion.
  • Embodiment 26 The method of any one of embodiments 20-22, wherein the synthetic variant is an inversion and the inverse of the synthetic variant is an inversion.
  • Embodiment 27 The method of any one of embodiments 20-26, wherein the synthetic variant is between 1 base in length and about 1 chromosome in length.
  • Embodiment 28 The method of any one of embodiments 20-27, wherein the synthetic variant is between about 20 bases in length and about 1000 bases in length.
  • Embodiment 29 The method of any one of embodiments 20-28, wherein the synthetic variant is the length of a full gene.
  • Embodiment 30 The method of any one of embodiments 20-29, wherein the summary statistic is sensitivity, specificity, precision, recall, accuracy, positive predictive value, or negative predictive value.
  • Embodiment 31 The method of any one of embodiments 20-30, further comprising generating the sequencing reads from the one or more test samples.
  • Embodiment 32 The method of any one of embodiments 1-31, further comprising reporting the summary statistic.
  • Embodiment 33 The method of any one of embodiments 1-32, further comprising displaying the summary statistic on a monitor.
  • Embodiment 34 The method of any one of embodiments 1-33, wherein the method is implemented by a program executed on a computer system.
  • Embodiment 35 The method of any one of embodiments 1-34, further comprising storing the summary statistic in a database.
  • Embodiment 36 A computer readable storage medium comprising instructions for carrying out the method of any one of embodiments 1-35.
  • Embodiment 37 A system comprising
  • the memory comprises computer readable instructions operable to cause the processor to carry out the method of any one of embodiments 1-36.
  • the X chromosome from several female test samples was sequenced using massively parallel sequencing methods.
  • the X chromosome was divided into a plurality of segments, and the number of sequencing reads within each segment was normalized against corresponding segments from a plurality of reference X chromosomes by dividing the number of sequencing reads from each segment by the median number of sequencing reads from the corresponding segment in all reference X chromosomes.
  • the numbers of sequencing reads from each segment from each test X chromosome were further normalized against the number of sequencing reads from each segment within the test X chromosome by dividing the number of sequencing reads in any given segment from any given X chromosome by the median number of sequencing reads from the segments in that X chromosome.
  • Synthetic copy number variants of the X chromosome were generated by subtracting half of the median number of sequencing reads from any given X chromosome from the number of sequencing reads from each segment within that X chromosome.
  • the synthetic copy number variants of the X chromosome can be considered "simulated males", as each synthetic copy number variant includes a single copy of the X chromosome.
  • a theoretical copy number was determined based on the synthetic number of sequencing reads from the simulated male samples (which are based on double-normalized sequencing reads).
  • a theoretical copy number was determined based on the double-normalized number of sequencing reads from real female (XX) samples and real male (XY) samples, with the median number of sequencing reads from the real female samples set to 1 and the real males set to 0.5.
  • the theoretical copy number was determined by rescaling a double-normalized number of sequencing reads of 0.5 to 1 (indicating a single copy of the X chromosome) and a double-normalized number of sequencing reads of 1 to 2 (indicating two copies of the X chromosome).
  • the theoretical copy number was plotted against the sequencing depth (i.e., absolute number of sequencing reads prior to normalization) for each segment within the X chromosome of real female samples, real male samples, and synthetic copy number variants (see FIG. 4).
  • the female samples have approximately two copies of the X chromosome
  • the male samples have approximately one copy of the X chromosome, with greater precision for those segments with a greater absolute number of sequencing reads.
  • the simulated males behave approximately the same as real male samples.
  • Synthetic copy number variants having a synthetic copy number of an exon from various genes were generated.
  • the synthetic copy number included an additional copy of the exon (a "duplication" copy number variant) or removed a copy of the exon (a "deletion" copy number variant).
  • a genetic variant caller was used to call copy numbers of the exon in the synthetic copy number variant, and sensitivity for the screen was determined.
  • the assay was then modified by increasing the number of segments within the region of interest and re-run, and synthetic copy number variants having a synthetic copy number of the exon from various genes were re-generated based on the new assay.
  • the genetic variant caller was used to call copy numbers of the re-generated synthetic copy number variants, and a new sensitivity for the assay was determined.
  • Synthetic copy number variants having a synthetic copy number of one exon, two exons, four exons, or an entire gene from one of 36 different genes were generated.
  • the synthetic copy number variants included either deletion or duplication events,
  • a genetic variant caller was used to call copy numbers of the exons or gene in the synthetic copy number variant, and sensitivity for the screen was determined.
  • the assay was then modified by increasing the number of segments within the region of interest and re-run, and synthetic copy number variants having a synthetic copy number of the exons or genes were re-generated based on the new assay.
  • the genetic variant caller was used to call copy numbers of the re-generated synthetic copy number variants, and a new sensitivity for the assay was determined. The results are presented in FIG.
  • the new genetic variant screen had an overall sensitivity for deletion events of 99.60% (with a 95% confidence interval of 99.50% to 99.72%), and an overall sensitivity for duplication events of 99.00% (with a 95% confidence interval of 98.82% to 99.19%).
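The sensitivities above are reported with 95% confidence intervals. The text does not state how those intervals were computed, so the following is only a minimal sketch that assumes a Wilson score interval for a binomial proportion. The function names `sensitivity` and `wilson_interval` and the counts in the usage lines are hypothetical and are not taken from the reported results.

```python
from math import sqrt


def sensitivity(true_positives, false_negatives):
    """Fraction of known (synthetic) variants that the caller detected."""
    return true_positives / (true_positives + false_negatives)


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%).

    Assumption: the source does not name its interval method; Wilson is
    used here only as one common choice.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical counts: 996 simulated deletions called, 4 missed.
tp, fn = 996, 4
point = sensitivity(tp, fn)
low, high = wilson_interval(tp, tp + fn)
print(f"sensitivity={point:.4f}, 95% CI=({low:.4f}, {high:.4f})")
```

The interval narrows as the number of synthetic variants grows, which is one practical reason to simulate many variants rather than rely on a handful of real positive controls.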

Abstract

Described herein are methods of assessing the performance of a genetic variant screen. A summary statistic can be determined to assess the performance of the genetic variant screen. In some embodiments, the summary statistic is determined using synthetic copy number variants. In some embodiments, the summary statistic is determined using a reference sequence having a synthetic variant.

Description

METHODS FOR ASSESSING GENETIC VARIANT SCREEN PERFORMANCE
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority from U.S. Provisional Patent Application No. 62/418,622, filed November 7, 2016; and U.S. Provisional Patent Application No. 62/544,311, filed August 11, 2017; the contents of each of which are incorporated herein by reference in their entirety.
FIELD OF THE INVENTION
[0002] The present invention relates to methods of assessing the performance of a genetic variant screen.
BACKGROUND
[0003] Generally, the performance of a genetic variant screen is assessed for concordance with known reference samples. Assessment of the screen can be performed using a large number of positive controls with known genetic variants, and a summary statistic (such as sensitivity or specificity) for the screen can be determined. However, when a large number of positive controls are unavailable, such as for controls with rare genetic variant events, the performance of the genetic variant calling algorithm (i.e., a "caller") or assay cannot be accurately assessed. While large numbers of positive controls having single nucleotide variants (SNVs) are commonly available, positive control samples having insertion, deletion, inversion, or copy number variants are less frequent.
[0004] The disclosures of all publications, patents, patent applications and published patent applications referred to herein are hereby incorporated herein by reference in their entirety. To the extent that any reference incorporated by reference conflicts with the instant disclosure, the instant disclosure shall control.
SUMMARY OF THE INVENTION
[0005] Described herein is a method of assessing the performance of a genetic variant screen, comprising generating a plurality of synthetic copy number variants comprising a synthetic number of copies of a region of interest represented by a synthetic number of sequencing reads from one or more segments within the region of interest; calling a number of copies of the region of interest using a genetic variant caller for each synthetic copy number variant based on the synthetic number of sequencing reads from the one or more segments and a number of real sequencing reads from the one or more segments from one or more real reference samples; and determining a summary statistic for the genetic variant screen based on the differences between the called number of copies and the synthetic number of copies in the synthetic copy number variants, thereby assessing the performance of the genetic variant screen.
[0006] In some embodiments, the summary statistic is sensitivity, specificity, precision, recall, accuracy, positive predictive value, or negative predictive value.
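As a rough illustration of how the listed summary statistics could be derived from called and known copy numbers, the sketch below tallies a confusion matrix and computes the ratios. It assumes a baseline of two copies and treats any other copy number as a variant; it checks only presence or absence of a variant, not the exact copy number. The names `Counts`, `tally`, and `summary_statistics` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int = 0  # variant present and a variant was called
    fp: int = 0  # no variant present but a variant was called
    tn: int = 0  # no variant present and none called
    fn: int = 0  # variant present but none called


def tally(called, truth, baseline=2):
    """Compare called copy numbers with the known synthetic copy numbers.

    Assumption: a sample is 'positive' when its copy number differs from
    `baseline` (taken here as the usual two copies).
    """
    counts = Counts()
    for call, known in zip(called, truth):
        is_variant = known != baseline
        called_variant = call != baseline
        if is_variant and called_variant:
            counts.tp += 1
        elif is_variant:
            counts.fn += 1
        elif called_variant:
            counts.fp += 1
        else:
            counts.tn += 1
    return counts


def ratio(numerator, denominator):
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def summary_statistics(c):
    total = c.tp + c.fp + c.tn + c.fn
    return {
        "sensitivity": ratio(c.tp, c.tp + c.fn),  # also called recall
        "specificity": ratio(c.tn, c.tn + c.fp),
        "precision": ratio(c.tp, c.tp + c.fp),    # positive predictive value
        "npv": ratio(c.tn, c.tn + c.fn),          # negative predictive value
        "accuracy": ratio(c.tp + c.tn, total),
    }


# Hypothetical calls versus known synthetic copy numbers.
stats = summary_statistics(tally([1, 2, 2, 3, 2], [1, 1, 2, 2, 2]))
print(stats)
```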
[0007] In some embodiments, the synthetic number of sequencing reads from each of the one or more segments is generated by increasing or decreasing a number of real sequencing reads from the one or more segments from a real test sample in proportion to a number (which may be an integer number or non-integer number) of copies of the region of interest.
[0008] In some embodiments, the predetermined number of copies is an integer number of copies. In some embodiments, the predetermined number of copies is a non-integer number of copies.
[0009] In some embodiments, the synthetic number of sequencing reads from the one or more segments within the region of interest is generated by sampling a binomial distribution with a success probability equal to m/x and a number of trials equal to a real number of sequencing reads from a real test sample, wherein m is the synthetic number of copies of the region of interest, and x is an assumed number of copies of the region of interest from the real test sample. In some embodiments, the synthetic number of sequencing reads from the one or more segments within the region of interest is generated by sampling a number of sequencing reads as a negative binomial distribution with a success probability equal to m/x and a number of successes equal to a real number of sequencing reads from a real test sample, wherein m is the synthetic number of copies of the region of interest, and x is an assumed number of copies of the region of interest from the real test sample, and adding the sampled number of sequencing reads to the real number of sequencing reads from the test sample. In some embodiments, the synthetic number of sequencing reads from the one or more segments within the region of interest is generated by sampling a number of sequencing reads as an expectation of the negative binomial distribution.
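The binomial case can be sketched as follows for a synthetic deletion, where m/x is a valid probability (for example, m = 1 and x = 2). This is an illustrative sketch only: the function name `synthetic_reads_binomial`, the read counts, and the fixed seed are assumptions, and the negative binomial (duplication) case is deliberately not shown.

```python
import random


def synthetic_reads_binomial(real_reads, synthetic_copies, assumed_copies=2, rng=None):
    """Draw a synthetic read count from Binomial(n=real_reads, p=m/x).

    Each real read is kept independently with probability m/x, so the
    expected synthetic count is real_reads * m / x.
    """
    rng = rng or random.Random()
    p = synthetic_copies / assumed_copies
    if not 0.0 <= p <= 1.0:
        # Binomial thinning can only reduce the count (m <= x).
        raise ValueError("m/x must lie in [0, 1] for binomial sampling")
    return sum(1 for _ in range(real_reads) if rng.random() < p)


# Hypothetical real read counts for three segments of a two-copy region.
segment_reads = [400, 380, 420]
rng = random.Random(0)  # fixed seed only to make the sketch repeatable
deletion_variant = [synthetic_reads_binomial(n, 1, 2, rng) for n in segment_reads]
print(deletion_variant)
```

Sampling, rather than simply halving the counts, lets the synthetic variant carry counting noise of its own instead of being an exact rescaling of the real sample.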
[0010] In some embodiments, the number of real sequencing reads from each of the one or more segments from the real test sample are normalized by dividing the number of real sequencing reads from each segment from the real test sample by the average number of real sequencing reads from a corresponding segment from the one or more real reference samples. In some embodiments, the number of real sequencing reads from each of the one or more segments from the real test sample are normalized by dividing the number of real sequencing reads from each segment from the real test sample by the average number of real sequencing reads from the one or more segments within the region of interest from the real test sample. In some embodiments, the average number of real sequencing reads from a corresponding segment from the one or more real reference samples is a median number of real sequencing reads from a corresponding segment from the one or more real reference samples. In some embodiments, the average number of real sequencing reads from the one or more segments within the region of interest from the real test sample is the median average number of real sequencing reads from the one or more segments within the region of interest from the real test sample.
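The two normalization steps described above can be sketched as below, using the median as the average. The helper names and the example counts are hypothetical, and the layout (one list of per-segment counts per sample) is an assumption made for illustration.

```python
from statistics import median


def normalize_by_reference(test_counts, reference_counts):
    """Divide each segment count by the median count of the corresponding
    segment across the reference samples."""
    n_segments = len(test_counts)
    ref_medians = [
        median(sample[i] for sample in reference_counts)
        for i in range(n_segments)
    ]
    return [t / r for t, r in zip(test_counts, ref_medians)]


def normalize_within_sample(values):
    """Divide each segment value by the median over the sample's segments."""
    centre = median(values)
    return [v / centre for v in values]


def double_normalize(test_counts, reference_counts):
    """Apply the reference normalization, then the within-sample one."""
    return normalize_within_sample(
        normalize_by_reference(test_counts, reference_counts)
    )


# Hypothetical counts: three reference samples and one test sample that was
# sequenced at about half the depth of the references.
references = [[100, 200, 300], [120, 220, 320], [80, 180, 280]]
test_sample = [50, 105, 150]
print(double_normalize(test_sample, references))
```

In this sketch the first step removes segment-specific capture or coverage differences, and the second removes the overall depth difference between the test sample and the references.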
[0011] In some embodiments, the number of real sequencing reads are normalized for GC content bias or mappability.
[0012] In some embodiments, the region of interest is more than about 100 bases in length. In some embodiments, the region of interest is at least one exon. In some embodiments, the region of interest is at least one gene. In some embodiments, the region of interest is at least one chromosome.
[0013] In some embodiments, the method further comprises generating the real sequencing reads from the real test sample. In some embodiments, the method further comprises generating the real sequencing reads from the one or more real reference samples.
[0014] Further described herein is a method of assessing the performance of a genetic variant screen, comprising calling, using a genetic variant caller, for an inverse of a synthetic variant in each test sequence in a plurality of test sequences, each test sequence comprising a plurality of sequencing reads aligned with a reference sequence comprising a synthetic variant; and determining a summary statistic for the genetic variant screen based on a called inverse of the synthetic variant in the plurality of test sequences and the synthetic variant in the reference sequence, thereby assessing the performance of the genetic variant screen. In some embodiments, the summary statistic is sensitivity, specificity, precision, recall, accuracy, positive predictive value, or negative predictive value.
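To make the "inverse" idea concrete, the sketch below places a synthetic insertion into a toy reference string and records the call a caller would be expected to make for an unmodified sample aligned to that reference. It is an illustration under simplifying assumptions: sequences are plain strings, positions are 0-based, and the function names and the dictionary layout of the expected call are hypothetical. No real aligner or caller is invoked.

```python
# Mapping from the synthetic variant placed in the reference to the variant
# expected to be called in a sample that lacks it.
INVERSE = {"insertion": "deletion", "deletion": "insertion", "inversion": "inversion"}


def insert_synthetic_variant(reference, position, inserted):
    """Return the reference with `inserted` placed at 0-based `position`."""
    return reference[:position] + inserted + reference[position:]


def remove_segment(sequence, position, length):
    """Return the sequence with `length` bases removed starting at `position`."""
    return sequence[:position] + sequence[position + length:]


def expected_inverse_call(kind, position, length):
    """Describe the call expected for a sample without the synthetic variant."""
    return {"type": INVERSE[kind], "position": position, "length": length}


reference = "ACGTACGTAC"
modified = insert_synthetic_variant(reference, 4, "TTTT")
# Relative to the modified reference, the unmodified sample lacks four bases,
# so the expected call is a deletion of those bases.
truth = expected_inverse_call("insertion", 4, 4)
print(modified, truth)
```

Comparing the caller's output against `truth` for many such references would then feed a summary statistic such as sensitivity.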
[0015] In some embodiments, the method further comprises aligning the plurality of sequencing reads with the reference sequence comprising the synthetic variant.
[0016] In some embodiments, the synthetic variant comprises an insertion, a deletion, an inversion, a translocation, a SNV, or a combination thereof. In some embodiments, the synthetic variant comprises an insertion or a deletion. In some embodiments, the synthetic variant comprises an insertion and the inverse of the synthetic variant is a deletion. In some embodiments, the synthetic variant comprises a deletion and the inverse of the synthetic variant is an insertion. In some embodiments, the synthetic variant is an inversion and the inverse of the synthetic variant is an inversion.
[0017] In some embodiments, the synthetic variant is between 1 base in length and about 1 chromosome in length. In some embodiments, the synthetic variant is between about 20 bases in length and about 1000 bases in length. In some embodiments, the synthetic variant is the length of a full gene.
[0018] In some embodiments, the method further comprises generating the sequencing reads from the one or more test samples.
[0019] In some embodiments of the methods described above, the method further comprises reporting the summary statistic.
[0020] In some embodiments of the methods described above, the method further comprises displaying the summary statistic on a monitor.
[0021] In some embodiments of the methods described above, the method is implemented by a program executed on a computer system.
[0022] In some embodiments of the methods described above, the method further comprises storing the summary statistic in a database.
[0023] Also described herein is a computer readable storage medium comprising instructions for carrying out any one of the methods described above.
[0024] Further described herein is a system comprising a processor, and a memory, wherein the memory comprises computer readable instructions operable to cause the processor to carry out any one of the methods described above.
BRIEF DESCRIPTION OF THE DRAWINGS
[0025] FIG. 1A presents the frequency (percent occurrence) of a copy number deletion or duplication event for several genes among approximately 56,000 samples.
[0026] FIG. 1B shows the fraction of nucleotides exhibiting an insertion or deletion event occurring in at least one of the same approximately 56,000 samples.
[0027] FIG. 2 is a schematic showing binomial sampling of a real number of sequencing reads from a real test sample to obtain a synthetic number of sequencing reads for a synthetic copy number deletion variant.
[0028] FIG. 3 depicts an exemplary computing system configured to perform any one of the processes described herein, including the various exemplary methods for assessing the performance of a genetic variant screen.
[0029] FIG. 4 presents a theoretical copy number of the X chromosome plotted against the sequencing depth (i.e., absolute number of sequencing reads prior to normalization) for each segment within the X chromosome of real female samples, real male samples, and synthetic copy number variants. The theoretical copy number is based on double-normalized sequencing reads from the male and female samples, as well as the double-normalized sequencing reads that have been decreased to reflect a single copy of the X chromosome in the simulated male samples.
[0030] FIG. 5 presents the sensitivity of a genetic variant screen for deletion copy number variants and duplication copy number variants before and after the modification of a genetic variant screen. The sensitivities were determined by constructing synthetic copy number variants having a synthetic copy number of a single exon from various genes, and using a genetic variant caller to call the number of copies of the exon in the synthetic copy number variants. Error bars show the confidence interval of the sensitivity given the number of simulated deletions and the observed false negatives and true positives.
[0031] FIG. 6 presents the sensitivity of a genetic variant screen for deletion copy number variants and duplication copy number variants before and after the modification of a genetic variant screen. The sensitivities were determined by constructing synthetic copy number variants having a synthetic copy number for one exon, two exons, four exons, or whole genes from 36 different genes. A genetic variant caller was used to call the number of copies of exons in the synthetic copy number variants.
DETAILED DESCRIPTION
[0032] Described herein is a method of assessing the performance of a genetic variant screen, comprising generating a plurality of synthetic copy number variants comprising a synthetic number of copies of a region of interest represented by a synthetic number of sequencing reads from one or more segments within the region of interest; calling a number of copies of the region of interest using a genetic variant caller for each synthetic copy number variant based on the synthetic number of sequencing reads from the one or more segments and a number of real sequencing reads from the one or more segments from one or more real reference samples; and determining a summary statistic (i.e., a summary statistic of performance) for the genetic variant screen based on the differences between the called number of copies and the synthetic number of copies in the synthetic copy number variants, thereby assessing the performance of the genetic variant screen.
[0033] The synthetic number of sequencing reads for each of the one or more segments can be generated by increasing or decreasing a number of real sequencing reads from the one or more segments from a real test sample in proportion to a number (which may be an integer number or a non-integer number) of copies of the region of interest. For example, a synthetic copy number variant having three copies of a region of interest can be generated by increasing the number of real sequencing reads for one or more segments within the region of interest taken from a real test sample having two copies of the region of interest by a number of sequencing reads representing one copy of the region of interest (for example, by adding sequencing reads representing one copy of the region of interest to the real number of sequencing reads or by multiplying the number of real sequencing reads by 1.5). In another example, a synthetic copy number variant having one copy of a region of interest can be generated by decreasing the number of real sequencing reads for one or more segments within the region of interest taken from a real test sample having two copies of the region of interest by a number of sequencing reads representing one copy of the region of interest (for example, by removing sequencing reads representing one copy of the region of interest from the real number of sequencing reads or by multiplying the number of real sequencing reads by 0.5). The number of copies may be an integer number or a non-integer number of copies. A non-integer number of copies can be used, for example, to analyze mosaic copy number variants or sub-population somatic copy number variants. In some embodiments, the sequencing reads are normalized against the one or more real reference samples (for example, the number of real sequencing reads from each of the one or more segments from the real test sample can be normalized by dividing the number of real sequencing reads from each segment from the real test sample by the average (such as mean or median) number of real sequencing reads from a corresponding segment from the one or more real reference samples). In some embodiments, the number of real sequencing reads from each of the one or more segments from the real test sample is normalized against the other segments within the region of interest (for example, by dividing the number of real sequencing reads from each segment from the real test sample by the average (such as mean or median) number of real sequencing reads from the one or more segments within the region of interest from the real test sample). In some embodiments, the number of real sequencing reads is normalized using both methods, in either order.
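The proportional adjustment described above can be illustrated with a minimal sketch. The function name `synthetic_read_counts`, its parameters, and the example read counts are hypothetical and chosen only for illustration; the sketch assumes a simple multiplicative scaling of per-segment read counts and is not the only way the adjustment could be implemented.

```python
def synthetic_read_counts(real_counts, real_copies, synthetic_copies):
    """Scale per-segment read counts in proportion to the synthetic copy number.

    real_counts: numbers of real sequencing reads, one per segment (hypothetical data).
    real_copies: copies of the region of interest in the real test sample.
    synthetic_copies: desired copies (integer or non-integer) in the synthetic variant.
    """
    if real_copies <= 0:
        raise ValueError("real_copies must be positive")
    factor = synthetic_copies / real_copies
    return [count * factor for count in real_counts]


real = [200, 180, 220]                            # reads for three segments
duplication = synthetic_read_counts(real, 2, 3)   # factor 1.5
deletion = synthetic_read_counts(real, 2, 1)      # factor 0.5
mosaic = synthetic_read_counts(real, 2, 1.5)      # factor 0.75, non-integer copies
```

Here a two-copy sample scaled to three copies corresponds to multiplying by 1.5, and scaling to one copy corresponds to multiplying by 0.5, matching the examples in the text.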
[0034] Also described herein is a method of assessing the performance of a genetic variant screen, comprising calling, using a genetic variant caller, for an inverse of a synthetic variant in each test sequence in a plurality of test sequences, each test sequence comprising a plurality of sequencing reads aligned with a reference sequence comprising a synthetic variant; and determining a summary statistic (i.e., a summary statistic of performance) for the genetic variant screen based on a called inverse of the synthetic variant in the plurality of test sequences and the synthetic variant in the reference sequence, thereby assessing the performance of the genetic variant screen. In some embodiments, the method further comprises aligning the plurality of sequencing reads with the reference sequence comprising the synthetic variant. The variant can be, for example, an insertion, a deletion, an inversion, a translocation, a SNV, or a combination thereof. For example, in some embodiments, the variant is an inversion with a deletion, an insertion, or a deletion and an insertion. Since the variant is in the reference sequence, the genetic variant caller calls for the inverse variant. For example, in some embodiments, the reference sequence comprises an insertion and the genetic variant caller calls for a deletion. In some embodiments, the reference sequence comprises a deletion and the genetic variant caller calls for an insertion. Since the inverse of an inversion is itself an inversion, in some embodiments, the reference sequence comprises an inversion and the genetic variant caller calls for an inversion (although the called inversion is in the opposite direction as the inversion in the reference sequence).
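The relationship between a synthetic variant in the reference sequence and its inverse can be sketched with plain strings. The sequence, the coordinates, and the helper names `apply_deletion` and `expected_inverse` are hypothetical; no aligner or variant caller is modeled here, only the bookkeeping of which variant type would be expected in the call.

```python
# Mapping from the synthetic variant placed in the reference to the
# variant a caller would be expected to report (per the text above).
INVERSE = {"insertion": "deletion", "deletion": "insertion", "inversion": "inversion"}


def expected_inverse(variant_type):
    """Return the variant type expected in the call for a given synthetic variant."""
    return INVERSE[variant_type]


def apply_deletion(reference, start, length):
    """Return the reference with `length` bases removed at 0-based `start`."""
    return reference[:start] + reference[start + length:]


original = "ACGTACGTTTGACCA"                 # hypothetical unmodified sequence
modified = apply_deletion(original, 4, 3)    # synthetic deletion of "ACG"

# A sample matching the original sequence carries the removed bases, so
# relative to the modified reference it appears to contain an insertion.
inserted_bases = original[4:7]
reconstructed = modified[:4] + inserted_bases + modified[4:]
```

Reinserting the removed bases into the modified reference recovers the original sequence, which is why a deletion in the reference is expected to be called as an insertion in the test sequence.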
[0035] The summary statistic of performance (also referred to herein as a "summary statistic") can be, but is not limited to, sensitivity, specificity, positive predictive value, negative predictive value, precision, recall, accuracy, or any other metric of concordance. The summary statistic of performance is useful for assessing the performance of the genetic variant screen. For example, the summary statistic can be above a predetermined threshold, which indicates that the genetic variant screen performs as desired. If the summary statistic is below the predetermined threshold, for example, the genetic variant screen does not perform as desired. In some embodiments, the summary statistic is reported (for example, to a patient, doctor, caregiver, or regulator). In some embodiments, the summary statistic is displayed, for example on a monitor, which can be part of a computer system.
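As an illustration of how such statistics might be computed for a copy number variant screen, the following sketch compares synthetic copy numbers with called copy numbers. The function name, the assumption that two copies is the non-variant state, the binary treatment of calls (variant versus non-variant), and the example data are all hypothetical simplifications.

```python
def summary_statistics(synthetic_copies, called_copies, normal_copies=2):
    """Compute concordance metrics from synthetic (true) and called copy numbers.

    A sample counts as a variant when its copy number differs from
    `normal_copies` (an assumed non-variant state of two copies).
    """
    tp = fp = tn = fn = 0
    for truth, call in zip(synthetic_copies, called_copies):
        is_variant = truth != normal_copies
        called_variant = call != normal_copies
        if is_variant and called_variant:
            tp += 1
        elif is_variant:
            fn += 1
        elif called_variant:
            fp += 1
        else:
            tn += 1

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "positive_predictive_value": ratio(tp, tp + fp),
        "negative_predictive_value": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


synthetic = [1, 1, 3, 3, 2, 2, 2, 2]   # hypothetical synthetic copy numbers
called = [1, 2, 3, 3, 2, 2, 3, 2]      # hypothetical caller output
stats = summary_statistics(synthetic, called)
passes = stats["sensitivity"] >= 0.7   # hypothetical predetermined threshold
```

The final comparison mirrors the threshold check described above; the threshold value itself is an arbitrary placeholder.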
[0036] In some embodiments, the methods described herein can be implemented by a program executed on a computer system. For example, in one aspect there is a computer-implemented method of assessing the performance of a genetic variant screen, comprising, at an electronic device having one or more processors and memory, generating, by the one or more processors, a plurality of synthetic copy number variants comprising a synthetic number of copies of a region of interest represented by a synthetic number of sequencing reads from one or more segments within the region of interest; calling, by the one or more processors, a number of copies of the region of interest using a genetic variant caller for each synthetic copy number variant based on the synthetic number of sequencing reads from the one or more segments and a number of real sequencing reads from the one or more segments from one or more real reference samples; and determining, by the one or more processors, a summary statistic for the genetic variant screen based on the differences between the called number of copies and the synthetic number of copies in the synthetic copy number variants, thereby assessing the performance of the genetic variant screen. In another aspect, there is a computer-implemented method of assessing the performance of a genetic variant screen, comprising, at an electronic device having one or more processors and memory, aligning, by one or more processors, sequencing reads from a plurality of test samples with a reference sequence comprising a variant; calling, by the one or more processors, for an inverse of the variant in the test sample using a genetic variant caller; and determining, by the one or more processors, a summary statistic for the genetic variant screen based on a called inverse of the variant in the plurality of test samples and the variant in the reference sequence, thereby assessing the performance of the genetic variant screen.
[0037] Real samples having a copy number variant (such as a duplication or deletion), an insertion, or a deletion for any particular region of interest (such as a gene) are relatively rare. FIG. 1A presents the frequency (percent occurrence) of a copy number deletion or duplication event for several genes among approximately 56,000 samples. As can be seen in FIG. 1A, the frequency of a deletion or duplication event for a given region of interest is low. Furthermore, approximately 25% of the region examined contained no duplications and approximately 45% of the region examined contained no deletions in any of the same approximately 56,000 samples. FIG. 1B shows the fraction of nucleotides with at least one CNV observed in at least one of the same approximately 56,000 samples. See Beauchamp et al., Systematic Design and Comparison of Expanded Carrier Screening Panels, bioRxiv 080713 (2016). Thus, it is desirable to assess the performance of a genetic variant screen to ensure that the screen is sensitive and specific to any given low frequency variant event.
[0038] The methods described herein allow for assessing the performance of a genetic variant screen without the need for positive controls with known genetic variants from real samples. In some embodiments, the performance of the genetic variant screen is assessed using synthetic copy number variants generated from non-variant reference samples. In some embodiments, the performance of the genetic variant screen is assessed using a reference sequence comprising a synthetic variant. The synthetic variants do not rely on the existence of genetic variants from real samples, but can be simulated from real samples. A large number of synthetic variants can be readily generated to adequately assess the performance of the genetic variant screen.
[0039] As used herein, the singular forms "a," "an," and "the" include the plural reference unless the context clearly dictates otherwise.
[0040] Reference to "about" a value or parameter herein includes (and describes) variations that are directed to that value or parameter per se. For example, description referring to "about X" includes description of "X".
[0041] The term "average" as used herein refers to either a mean or a median, or any value used to approximate the mean or the median.
[0042] The term "copy number variant" or "CNV" refers to any duplication or deletion of a region of interest.
[0043] The term "deletion" refers to any decrease in the number of copies of a region of interest relative to one or more real reference samples. For example, if the one or more real reference samples have two copies of a region of interest, a deletion can refer to a single copy of the region of interest. If the one or more real reference samples have four copies of a region of interest, a deletion can refer to one, two, or three copies of the region of interest.
[0044] The term "duplication" refers to any increase in the number of copies of a region of interest relative to one or more real reference samples, including three or more, four or more, five or more, etc. copies of the region of interest.
[0045] A "genetic variant caller" is any method or technique (including software) that can be used to identify one or more genetic features. Genetic features that can be identified by a genetic variant caller include, but are not limited to, the copy number of a region of interest, an insertion, a deletion, a translocation, an inversion, or a small nucleotide variant (SNV).
[0046] A "number of sequencing reads" as used herein refers to an absolute number of sequencing reads or a normalized number of sequencing reads.
[0047] A "real sample" refers to a nucleic acid sequence or sequencing reads originating from a nucleic acid sequence that originates from a physical sample subjected to genetic sequencing without the sequence, sequencing reads, or number of sequencing reads being altered. A "real reference sample" refers to a real sample that is compared to a synthetic sample (e.g., a synthetic copy number variant) by the genetic variant caller. A "real test sample" refers to a real sample that is used to generate the synthetic sample.
[0048] A "real sequencing read" refers to a sequencing read that originates from a real sample without alteration of the sequence. A "number of real sequencing reads" refers to an absolute number of real sequencing reads or a normalized number of sequencing reads, but does not refer to a number of sequencing reads that has been altered to reflect an increase in a number of copies of any segment or region of interest.
[0049] A "segment" refers to a sub-region in a region of interest that serves as a locus of origin for sequencing reads. The segment can be as short as a single base or can be as long as the region of interest. Multiple segments within a region of interest may be, but need not be, continuous, contiguous, or overlapping.
[0050] The term "synthetic copy number variant" refers to an artificial nucleic acid sequence generated using real sequencing reads from a real sample with an increase or decrease in the number of copies of a region of interest compared to the real sample. The synthetic copy number variant need not be (although, in some embodiments, could be) an aligned or assembled nucleic acid sequence, and can be represented by a synthetic number of sequencing reads.
[0051] A "synthetic number of copies" refers to the number of copies of a region of interest in the synthetic copy number variant, and can be an increase or decrease in the number of copies relative to the real sample.
[0052] A "synthetic number of sequencing reads" refers to a number of real sequencing reads that has been altered to reflect an increase or a decrease in the number of copies of a segment within a region of interest. The real sequencing reads originate from the same segment (i.e., originate from a corresponding segment) within the region of interest as the sequencing reads in the synthetic number of sequencing reads.
[0053] A "synthetic variant" in a reference genome refers to a variant artificially introduced into a nucleic acid sequence in the reference genome, unless context clearly indicates otherwise. The "inverse" of a synthetic variant refers to the opposite consequence of the synthetic variant that would appear in a nucleic acid sequence when compared to the reference sequence comprising the synthetic variant.
[0054] It is understood that aspects and variations of the invention described herein include "consisting" and/or "consisting essentially of" aspects and variations.
[0055] Where a range of values is provided, it is to be understood that each intervening value between the upper and lower limit of that range, and any other stated or intervening value in that stated range, is encompassed within the scope of the present disclosure. Where the stated range includes upper or lower limits, ranges excluding either of those included limits are also included in the present disclosure.
Performance of Copy Number Variant Screen
[0056] In certain aspects, the methods described herein are used to assess the performance of a copy number variant screen. Synthetic copy number variants are generated, for example in silico. The synthetic copy number variant includes a synthetic number of copies of a region of interest, which is represented by a synthetic number of sequencing reads from one or more segments within the region of interest. In some embodiments, the synthetic number of sequencing reads is obtained by adjusting a number of sequencing reads of the one or more segments within the region of interest from a real test sample. The adjustment is made in proportion to the synthetic number of copies. In some embodiments, the synthetic number of sequencing reads is obtained by direct manipulation of a database comprising sequencing reads of the one or more segments within the region of interest from a real sample, for example by random deletion or duplication of sequencing reads within the database. In some embodiments, the synthetic number of sequencing reads is generated by sampling a distribution (such as a binomial distribution or a negative binomial distribution). A plurality of synthetic copy number variants can be generated, for example based on a plurality of real test samples.
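Sampling a binomial distribution to obtain a synthetic number of sequencing reads for a deletion can be sketched as follows. This is a minimal illustration that assumes each real read is retained independently with a probability equal to the ratio of synthetic to real copies; the function name, the fixed seed, and the example counts are hypothetical.

```python
import random


def binomial_downsample(real_count, keep_probability, rng):
    """Draw a synthetic read count from Binomial(real_count, keep_probability).

    Each real read is retained independently with `keep_probability`,
    which models a deletion when the probability is below 1.
    """
    if not 0.0 <= keep_probability <= 1.0:
        raise ValueError("keep_probability must be between 0 and 1")
    return sum(1 for _ in range(real_count) if rng.random() < keep_probability)


rng = random.Random(0)                  # fixed seed so the sketch is repeatable
real_counts = [200, 180, 220]           # hypothetical reads per segment, two copies
keep = 1 / 2                            # one synthetic copy from two real copies
synthetic_counts = [binomial_downsample(n, keep, rng) for n in real_counts]
```

Unlike deterministic multiplication by 0.5, sampling preserves count-level noise in the synthetic reads, which may better resemble a real single-copy sample.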
[0057] A number of copies of the region of interest present in the synthetic copy number variant is called using the copy number variant caller. In some embodiments, the caller compares the synthetic number of sequencing reads from the one or more segments in the synthetic copy number variant to the number of sequencing reads from the one or more segments in a real reference sample with a known number of copies of the region of interest. The caller can use, for example, a hidden Markov model (HMM) to determine the number of copies of the region of interest in the synthetic copy number variant. The real reference sample is preferably a different real sample than the real sample used as a basis for generating the synthetic copy number variants.
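A toy hidden Markov model caller is sketched below for illustration only. It assumes per-segment depth ratios relative to two-copy reference samples, Gaussian emissions centered at half the copy number, and a single probability of remaining in the same state between adjacent segments. The state set, parameter values, and function names are hypothetical and do not represent any particular caller.

```python
import math

STATES = [1, 2, 3]  # candidate copy numbers (hypothetical state set)


def log_gaussian(x, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * ((x - mean) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))


def viterbi_copy_number(ratios, sd=0.1, stay=0.9):
    """Return the most likely copy number per segment via the Viterbi algorithm.

    ratios: per-segment depth divided by the depth expected for two copies.
    """
    if not ratios:
        return []
    switch = (1.0 - stay) / (len(STATES) - 1)

    def trans(prev_state, state):
        return math.log(stay if prev_state == state else switch)

    def emit(state, x):
        return log_gaussian(x, state / 2.0, sd)

    # Uniform prior over states for the first segment.
    scores = {s: -math.log(len(STATES)) + emit(s, ratios[0]) for s in STATES}
    backpointers = []
    for x in ratios[1:]:
        prev, scores, pointers = scores, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + trans(p, s))
            pointers[s] = best
            scores[s] = prev[best] + trans(best, s) + emit(s, x)
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


calls = viterbi_copy_number([1.02, 0.98, 0.51, 0.47, 1.01])
```

In this example the third and fourth segments have roughly half the expected depth, so the path passes through the one-copy state for those segments.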
[0058] A summary statistic for the copy number variant screen can be determined to assess the performance of the copy number variant screen based on the differences between the called number of copies and the synthetic number of copies in the synthetic copy number variants. Since a plurality of synthetic copy number variants are generated and called by the caller, the summary statistic reflects the performance of the screen in the context of the synthetic variants. Thus, a greater diversity of synthetic copy number variants (which can be based on a plurality of real samples) provides a more accurate summary statistic characterizing the performance of the genetic variant screen.
[0059] In some embodiments, assessing the performance of a genetic variant screen (such as a copy number variant screen) comprises generating a plurality of synthetic copy number variants comprising a synthetic number of copies of a region of interest represented by a synthetic number of sequencing reads from one or more segments within the region of interest; calling a number of copies of the region of interest using a genetic variant caller for each synthetic copy number variant based on the synthetic number of sequencing reads from the one or more segments and a number of real sequencing reads from the one or more segments from one or more real reference samples; and determining a summary statistic for the genetic variant screen based on the differences between the called number of copies and the synthetic number of copies in the synthetic copy number variants, thereby assessing the performance of the genetic variant screen. In some embodiments, the summary statistic is sensitivity, specificity, precision, recall, accuracy, positive predictive value, negative predictive value, or any other metric of concordance.
[0060] In some embodiments, the method of assessing the performance of a genetic variant screen (such as a copy number variant screen), comprises generating a plurality of synthetic copy number variants comprising a synthetic number of copies of a region of interest represented by a synthetic number of sequencing reads from one or more segments within the region of interest, wherein the synthetic number of sequencing reads from each of the one or more segments is generated by increasing or decreasing a number of real sequencing reads from the one or more segments from a real test sample in proportion to a number (which may be an integer number or non-integer number) of copies of the region of interest; calling a number of copies of the region of interest using a genetic variant caller for each synthetic copy number variant based on the synthetic number of sequencing reads from the one or more segments and a number of real sequencing reads from the one or more segments from one or more real reference samples; and determining a summary statistic for the genetic variant screen based on the differences between the called number of copies and the synthetic number of copies in the synthetic copy number variants, thereby assessing the performance of the genetic variant screen. In some embodiments, the summary statistic is sensitivity, specificity, precision, recall, accuracy, positive predictive value, negative predictive value, or any other metric of concordance.
[0061] In some embodiments, the method of assessing the performance of a genetic variant screen (such as a copy number variant screen), comprises generating a plurality of synthetic copy number variants comprising a synthetic number of copies of a region of interest represented by a synthetic number of sequencing reads from one or more segments within the region of interest, wherein the synthetic number of sequencing reads from each of the one or more segments is generated by (i) normalizing a number of real sequencing reads from the one or more segments from a real test sample by dividing the number of real sequencing reads from each segment from the real test sample by the average number of real sequencing reads from a corresponding segment from the one or more real reference samples; and (ii) increasing or decreasing the number of normalized real sequencing reads from the one or more segments from the real test sample in proportion to a predetermined number (which may be an integer number or a non-integer number) of copies of the region of interest; calling a number of copies of the region of interest using a genetic variant caller for each synthetic copy number variant based on the synthetic number of sequencing reads from the one or more segments and a number of real sequencing reads from the one or more segments from one or more real reference samples; and determining a summary statistic for the genetic variant screen based on the differences between the called number of copies and the synthetic number of copies in the synthetic copy number variants, thereby assessing the performance of the genetic variant screen. In some embodiments, the summary statistic is sensitivity, specificity, precision, recall, accuracy, positive predictive value, negative predictive value, or any other metric of concordance.
[0062] In some embodiments, the method of assessing the performance of a genetic variant screen (such as a copy number variant screen), comprises generating a plurality of synthetic copy number variants comprising a synthetic number of copies of a region of interest represented by a synthetic number of sequencing reads from one or more segments within the region of interest, wherein the synthetic number of sequencing reads from each of the one or more segments is generated by (i) normalizing a number of real sequencing reads from the one or more segments from a real test sample by dividing the number of real sequencing reads from each segment from the real test sample by the average number of real sequencing reads from a corresponding segment from the one or more real reference samples; (ii) normalizing the number of real sequencing reads from the one or more segments from the real test sample by dividing the number of real sequencing reads from each segment from the real test sample by the average number of real sequencing reads from the one or more segments within the region of interest from the real test sample; and (iii) increasing or decreasing the number of normalized real sequencing reads from the one or more segments from the real test sample in proportion to a predetermined number (which may be an integer number or a non-integer number) of copies of the region of interest; calling a number of copies of the region of interest using a genetic variant caller for each synthetic copy number variant based on the synthetic number of sequencing reads from the one or more segments and a number of real sequencing reads from the one or more segments from one or more real reference samples; and determining a summary statistic for the genetic variant screen based on the differences between the called number of copies and the synthetic number of copies in the synthetic copy number variants, thereby assessing the performance of the genetic variant screen. 
In some embodiments, the summary statistic is sensitivity, specificity, precision, recall, accuracy, positive predictive value, negative predictive value, or any other metric of concordance.
[0063] In some embodiments, the method of assessing the performance of a genetic variant screen (such as a copy number variant screen), comprises generating real sequencing reads from a real test sample; generating a plurality of synthetic copy number variants comprising a synthetic number of copies of a region of interest represented by a synthetic number of sequencing reads from one or more segments within the region of interest, wherein the synthetic number of sequencing reads from each of the one or more segments is generated by (i) normalizing a number of real sequencing reads from the one or more segments from the real test sample by dividing the number of real sequencing reads from each segment from the real test sample by the average number of real sequencing reads from a corresponding segment from the one or more real reference samples; (ii) normalizing the number of real sequencing reads from the one or more segments from the real test sample by dividing the number of real sequencing reads from each segment from the real test sample by the average number of real sequencing reads from the one or more segments within the region of interest from the real test sample; and (iii) increasing or decreasing the number of normalized real sequencing reads from the one or more segments from the real test sample in proportion to a predetermined number (which may be an integer number or a non-integer number) of copies of the region of interest; calling a number of copies of the region of interest using a genetic variant caller for each synthetic copy number variant based on the synthetic number of sequencing reads from the one or more segments and a number of real sequencing reads from the one or more segments from one or more real reference samples; and determining a summary statistic for the genetic variant screen based on the differences between the called number of copies and the synthetic number of copies in the synthetic copy number variants, thereby assessing the performance of the genetic variant screen. In some embodiments, the summary statistic is sensitivity, specificity, precision, recall, accuracy, positive predictive value, negative predictive value, or any other metric of concordance.
[0064] In some embodiments, the method of assessing the performance of a genetic variant screen (such as a copy number variant screen), comprises generating real sequencing reads from a real test sample; generating real sequencing reads from one or more real reference samples; generating a plurality of synthetic copy number variants comprising a synthetic number of copies of a region of interest represented by a synthetic number of sequencing reads from one or more segments within the region of interest, wherein the synthetic number of sequencing reads from each of the one or more segments is generated by (i) normalizing a number of real sequencing reads from the one or more segments from the real test sample by dividing the number of real sequencing reads from each segment from the real test sample by the average number of real sequencing reads from a corresponding segment from the one or more real reference samples; (ii) normalizing the number of real sequencing reads from the one or more segments from the real test sample by dividing the number of real sequencing reads from each segment from the real test sample by the average number of real sequencing reads from the one or more segments within the region of interest from the real test sample; and (iii) increasing or decreasing the number of normalized real sequencing reads from the one or more segments from the real test sample in proportion to a predetermined number (which may be an integer number or a non-integer number) of copies of the region of interest; calling a number of copies of the region of interest using a genetic variant caller for each synthetic copy number variant based on the synthetic number of sequencing reads from the one or more segments and a number of real sequencing reads from the one or more segments from one or more real reference samples; and determining a summary statistic for the genetic variant screen based on the differences between the called number of copies and the synthetic number of copies in the synthetic copy number variants, thereby assessing the performance of the genetic variant screen. In some embodiments, the summary statistic is sensitivity, specificity, precision, recall, accuracy, positive predictive value, negative predictive value, or any other metric of concordance.
[0065] The one or more real reference samples are compared to the synthetic copy number variant using the genetic variant caller to call the number of copies in the synthetic copy number variant, for example by comparing a number of real sequencing reads from the one or more real reference samples to the synthetic number of sequencing reads representing the synthetic copy number variant. In some embodiments, the genetic variant caller uses a hidden Markov Model to cal l the copy number in the synthetic copy number variant. As copy number variants in any given region of interest are relatively rare (see FIG. 1 A), it can be assumed that the one or more real reference samples do not have a copy number variant. In some embodiments, the reference samples are verified negative for a copy number variant. In some embodiments, the one or more reference samples includes, two or more, three or more, four or more, five or more, six or more, eight or more, ten or more, twenty or more, thirty or more, forty or more, forty-eight or more, sixty or more, seventy or more, eighty or more, ninety or more, ninety-three, or ninety-six or more reference samples.
[0066] In some embodiments, the one or more real reference samples include the test sample (although the real reference sample is not exclusively the real test sample). In some embodiments, the one or more real reference samples exclude the real test sample. In some embodiments, real sequencing reads from the real test sample and the one or more real reference samples are simultaneously generated. In some embodiments, the region of interest (or segments within the region of interest) from the real test sample and the one or more real reference samples are enriched using the same methods (for example, using PCR amplification or capture probes).
[0067] In some embodiments, the number of real sequencing reads from each of the one or more segments from the real test sample are normalized by dividing the number of real sequencing reads from each segment from the real test sample by the average (such as mean or median) number of real sequencing reads from a corresponding segment from one or more real reference samples. In some embodiments, the number of real sequencing reads from each of the one or more segments from the one or more real reference samples are normalized by dividing the number of real sequencing reads from each segment from each of the one or more real reference samples by the average (such as mean or median) number of real sequencing reads from a corresponding segment from one or more real reference samples. For example, a real test sample can include a first number of real sequencing reads from a first segment and a second number of real sequencing reads from a second segment. A first average number of sequencing reads from the first segment from the one or more reference samples can be determined, and a second average number of sequencing reads from the second segment from the one or more reference samples can be determined. To normalize the number of real sequencing reads for the first segment from the real test sample, the first number of real sequencing reads is divided by the first average number of real sequencing reads, and to normalize the number of real sequencing reads from the second segment from the real test sample, the second number of real sequencing reads is divided by the second average number of real sequencing reads.
[0068] In some embodiments, the real sequencing reads from the real test sample are normalized by dividing the number of real sequencing reads from each segment from the real test sample by the average number of real sequencing reads from the one or more segments within the region of interest from the real test sample. Solely by way of example, a region of interest can include three segments. The real test sample can include a first number of real sequencing reads for the first segment, a second number of real sequencing reads for the second segment, and a third number of real sequencing reads for the third segment. An average number of sequencing reads can be determined for the three segments within the region of interest from the real test sample. To normalize the first number of sequencing reads for the first segment, the first number of real sequencing reads is divided by the average number of sequencing reads. To normalize the second number of sequencing reads for the second segment, the second number of real sequencing reads is divided by the average number of sequencing reads. To normalize the third number of sequencing reads for the third segment, the third number of real sequencing reads is divided by the average number of sequencing reads.
[0069] In some embodiments, the number of real sequencing reads from each of the one or more segments from the real test sample are normalized by dividing the number of real sequencing reads from each segment from the real test sample by the average number of real sequencing reads from a corresponding segment from one or more real reference samples, and the real sequencing reads from the real test sample are normalized by dividing the number of real sequencing reads from each segment from the real test sample by the average number of real sequencing reads from the one or more segments within the region of interest from the real test sample. The number of real sequencing reads can be double-normalized in either order. In some embodiments, first the number of real sequencing reads from each of the one or more segments from the real test sample are normalized by dividing the number of real sequencing reads from each segment from the real test sample by the average number of real sequencing reads from a corresponding segment from one or more real reference samples, and second the real sequencing reads from the real test sample are normalized by dividing the number of real sequencing reads from each segment from the real test sample by the average number of real sequencing reads from the one or more segments within the region of interest from the real test sample. 
In some embodiments, first the real sequencing reads from the real test sample are normalized by dividing the number of real sequencing reads from each segment from the real test sample by the average number of real sequencing reads from the one or more segments within the region of interest from the real test sample, and second the number of real sequencing reads from each of the one or more segments from the real test sample are normalized by dividing the number of real sequencing reads from each segment from the real test sample by the average number of real sequencing reads from a corresponding segment from one or more real reference samples.
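The two normalization steps described above can be sketched as follows. This is a minimal illustration and not the method's actual implementation: the function names and the example counts are hypothetical, and the mean is used as the average (the text also permits the median).

```python
from statistics import mean

def normalize_by_reference(test_counts, reference_counts):
    """Divide each test-sample segment count by the mean count of the
    corresponding segment across the reference samples."""
    n_segments = len(test_counts)
    ref_means = [mean(ref[j] for ref in reference_counts) for j in range(n_segments)]
    return [test_counts[j] / ref_means[j] for j in range(n_segments)]

def normalize_within_sample(values):
    """Divide each segment value by the mean over all segments of the sample."""
    sample_mean = mean(values)
    return [v / sample_mean for v in values]

def double_normalize(test_counts, reference_counts):
    """Reference normalization first, then within-sample normalization."""
    return normalize_within_sample(normalize_by_reference(test_counts, reference_counts))

# Hypothetical read counts for three segments in a region of interest.
test_counts = [120, 80, 100]
reference_counts = [[100, 100, 100], [140, 60, 100]]
print(double_normalize(test_counts, reference_counts))
```

In this sketch the test sample tracks the reference averages exactly, so every segment normalizes to 1.0, the value that would represent the assumed two copies.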
[0070] The region of interest can be of any length, for example, 1 base to a full length chromosome. For example, the region of interest can be 1 base to about 250 million bases in length (such as about 1 base to about 50 bases in length, about 50 bases to about 100 bases in length, about 100 bases to about 250 bases in length, about 250 bases to about 500 bases in length, about 500 bases to about 1000 bases in length, about 1000 bases to about 2000 bases in length, about 2000 bases to about 4000 bases in length, about 4000 bases to about 8000 bases in length, about 8000 bases to about 16,000 bases in length, about 16,000 bases to about 32,000 bases in length, about 32,000 bases to about 64,000 bases in length, about 64,000 bases to about 125,000 bases in length, about 125,000 bases to about 250,000 bases in length, about 250,000 bases to about 500,000 bases in length, about 500,000 bases to about 1 million bases in length, about 1 million bases to about 2 million bases in length, about 2 million bases to about 4 million bases in length, about 4 million bases to about 8 million bases in length, about 8 million bases to about 16 million bases in length, about 16 million bases to about 32 million bases in length, about 32 million bases to about 64 million bases in length, about 64 million bases to about 125 million bases in length, or about 125 million bases to about 250 million bases in length). In some embodiments, the region of interest is about 1 base or more (such as about 50 bases or more, about 100 bases or more, about 250 bases or more, about 500 bases or more, about 1000 bases or more, about 2000 bases or more, about 4000 bases or more, about 8000 bases or more, about 16,000 bases or more, about 32,000 bases or more, about 64,000 bases or more, about 125,000 bases or more, about 250,000 bases or more, about 500,000 bases or more, about 1 million bases or more, about 2 million bases or more, about 4 million bases or more, about 8 million bases or more, about 16 million bases or more, about 32 million bases or more, about 64 million bases or more, or about 125 million bases or more). In some embodiments, the region of interest comprises one or more genes (such as 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 40, 50, 75, 100, 150, 200, 250 or more genes). In some embodiments, the region of interest comprises one or more exons (such as 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 40, 50, 75, 100, 150, 200, 250 or more exons).
[0071] The region of interest can be divided into one or more segments, which may or may not be continuous, contiguous, or partially overlapping. In some embodiments, the region of interest comprises 1 or more (such as 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 40, 50, 75, 100, 150, 200, 250 or more) segments. The sequencing reads (or a portion of the sequencing reads) correspond to one of the one or more segments (i.e., the sequencing reads can be aligned to segments, for example using a reference sequence) within the region of interest. It is understood that a portion of the sequences may not accurately map to a particular segment (for example, a sequencing read may map to more than one segment or may map to no segment); such un-mappable or un-alignable sequencing reads are optionally ignored or discarded.
[0072] In some embodiments, a region of interest from one or more real samples is sequenced to generate real sequencing reads. The real sequencing reads can correspond to segments within the region of interest. Real sequencing reads can be generated from one or more real samples (e.g., one or more sequencing libraries from the one or more real samples) using any known sequencing method, such as massively parallel sequencing (for example using an Illumina HiSeq 2500 system). The real sample (or sequencing library) can be enriched for the region of interest, which can increase the proportion of sequencing reads that correspond to the region of interest. For example, the region of interest can be enriched by PCR (for example, by including one or more primers that hybridize to portions of segments within the region of interest with genomic DNA from a real sample, and amplifying the segments within the region of interest). In some embodiments, the region of interest is enriched by combining capture probes (such as biotinylated DNA, RNA, synthetic oligonucleotides) that hybridize to segments within the region of interest with genomic DNA (which is preferably sheared). The capture probes can then be used to isolate DNA fragments that include segments from the region of interest, and those DNA fragments can be sequenced to generate sequencing reads.
[0073] In some embodiments, the real sequencing reads are normalized. For example, in some embodiments, the real sequencing reads are normalized for GC content or mappability. For example, some segments within the region of interest may have a higher GC content than other segments within the region of interest. The higher GC content may increase or decrease the assay efficiency within that segment, inflating or deflating the relative number of sequencing reads for reasons other than copy number. Methods to normalize GC content are known in the art, for example as described in Fan & Quake, PLoS ONE, vol. 5, e10439 (2010). Similarly, certain segments within the region of interest may be more easily mappable (or alignable to a reference region of interest), and a number of sequencing reads may be excluded, thereby deflating the relative number of sequencing reads for reasons other than copy number. Mappability at a given position in the genome can be predetermined for a given read length, k, by segmenting every position within the region of interest into k-mers and aligning the sequences back to the region of interest. k-mers that align to a unique position in the interrogated region are labeled "mappable," and k-mers that do not align to a unique position in the region of interest are labeled "not mappable." A given segment can be normalized for mappability by scaling the number of reads in the segment by the inverse of the fraction of the mappable k-mers in the segment. For example, if 50% of k-mers within a segment are mappable, the number of observed reads from within that segment can be scaled by a factor of 2.
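The mappability scaling described above can be illustrated with a short sketch. The function names are hypothetical, and exact string matching within the region stands in for alignment (an assumption; a real aligner may tolerate mismatches and consider both strands).

```python
from collections import Counter

def kmer_mappability(region, k):
    """Label each k-mer start position True ("mappable") if its k-mer occurs
    exactly once in the region, else False ("not mappable")."""
    kmers = [region[i:i + k] for i in range(len(region) - k + 1)]
    counts = Counter(kmers)
    return [counts[kmer] == 1 for kmer in kmers]

def mappable_fraction(flags):
    """Fraction of k-mers labeled mappable."""
    return sum(flags) / len(flags)

def normalize_for_mappability(observed_reads, fraction_mappable):
    """Scale the read count by the inverse of the mappable fraction."""
    if fraction_mappable <= 0:
        raise ValueError("segment has no mappable k-mers")
    return observed_reads / fraction_mappable

flags = kmer_mappability("ACGTACGTTT", k=4)   # "ACGT" occurs twice
print(flags, mappable_fraction(flags))
print(normalize_for_mappability(100, 0.5))    # 50% mappable: scaled by 2
```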
[0074] Sequencing reads from a population of real test samples assumed to be wild-type for a number of copies of a region of interest form a negative binomial distribution with an average (mean or median) and a variance. The variance of the distribution can arise, for example, from noise during enrichment or sequencing of the region of interest. The distribution of sequencing reads from a population of synthetic copy number variants preferably resembles an expected negative binomial distribution of sequencing reads from a theoretical population of real copy number variants. Although the performance of the caller could be tested against an actual population of copy number variants, an actual population of copy number variants is more challenging to obtain because of the relatively low frequency of occurrences of the copy number variant, but would be expected to follow a negative binomial distribution. The average of the expected distribution of the theoretical population of real copy number variants (and thus, the distribution of the synthetic copy number variants) is shifted to reflect the change in number of copies of the region of interest relative to the average of the real test samples. For example, the average number of sequencing reads in a population of samples having one copy of a region of interest is expected to be half the average number of sequencing reads in a population of samples having two copies of the region of interest. Similarly, the average number of sequencing reads in a population of samples having three copies of a region of interest is expected to have 1.5 times the average number of sequencing reads in a population of samples having two copies of the region of interest. However, it is more challenging to simulate the variance of the expected distribution of sequencing reads from the theoretical population of synthetic sequencing reads in the distribution of sequencing reads from the population of synthetic sequencing reads. 
Because assessment of the genetic variant caller includes calling a number of copies of the region of interest from a plurality of copy number variants (i.e., a population of variants), it is preferable to assess the caller against a synthetic population with a distribution that mimics a real copy number variant population.
[0075] In some embodiments, a plurality of synthetic copy number variants comprising a synthetic number of copies of a region of interest represented by a synthetic number of sequencing reads from one or more segments within the region of interest is generated. The synthetic number of sequencing reads for each of the one or more segments can be generated by increasing or decreasing a number of real sequencing reads from the one or more segments within a region of interest from a real test sample. For example, if a first number of real sequencing reads corresponds to a first segment in a region of interest, and a second number of real sequencing reads corresponds to a second segment in the region of interest, and the real sample has two copies of the region of interest, a synthetic copy number variant having three copies of the region of interest can be generated by generating a first synthetic number of sequencing reads corresponding to the first segment by increasing the first number of real sequencing reads to reflect three copies of the first segment, and generating a second synthetic number of sequencing reads corresponding to the second segment by increasing the second number of real sequencing reads to reflect three copies of the second segment. Since the synthetic number of sequencing reads corresponding to the first segment and the second segment are increased to reflect three copies, the synthetic copy number variant has three copies of the region of interest having the first segment and the second segment. In some embodiments, the synthetic number of sequencing reads are generated by multiplying the number of real sequencing reads by a factor (such as 1.5 to increase the copy number from two to three, or 0.5 to decrease the copy number from two to one). 
In some embodiments, the synthetic number of sequencing reads is generated by adding (or subtracting) a number of sequencing reads (such as 50% of the average number of real sequencing reads corresponding to all segments within the region of interest) to (or from) the number of real sequencing reads. In some embodiments, the number of sequencing reads is normalized (for example, as described below) such that a single copy of a region of interest is represented by a normalized number of sequencing reads (e.g., 0.5), and two copies of a region of interest are represented by a normalized number of sequencing reads (e.g., 1). Thus, in some embodiments, a number of normalized sequencing reads (such as 0.5) is added to the normalized number of sequencing reads to increase the number of copies in the synthetic copy number variant, and a number of normalized sequencing reads (such as 0.5) is subtracted from the normalized number of sequencing reads to decrease the number of copies in the synthetic copy number variant. Preferably, the number of real sequencing reads is increased or decreased to generate the synthetic number of sequencing reads to represent a synthetic copy number variant with a predetermined number (which may be an integer number or a non-integer number) of copies of the region of interest (such as 1 or more, 2 or more, 3 or more, 4 or more, or 5 or more copies of the region of interest).
[0076] In some embodiments, a synthetic number of sequencing reads is generated by adding or subtracting a number of sequencing reads from a number of sequencing reads from a real test sample to generate a synthetic copy number variant. A synthetic copy number variant comprising a duplication is generated by adding the number of sequencing reads, and a synthetic copy number variant comprising a deletion event is generated by deleting a number of sequencing reads.
The number of sequencing reads added to or subtracted from the number of sequencing reads from the real test sample is based, in part, on how many duplication or deletion events are simulated in the synthetic copy number variant. In some embodiments, a synthetic number of sequencing reads for a synthetic copy number variant comprising n copies of a region of interest (or segment thereof) more (or less) than an assumed (e.g., wild-type) number of copies x in a real test sample is determined by adding (or subtracting) $n/x$ times an average (e.g., mean or median) number of sequencing reads from a plurality of real test samples for that region of interest (or segment thereof) to (or from) the number of sequencing reads for that region of interest (or segment thereof) from a real test sample. For example, for a synthetic copy number variant comprising a deletion (i.e., having n fewer copies of a region of interest or segment thereof than an assumed number of copies x in a real test sample), the synthetic number of sequencing reads for the synthetic copy number variant is determined as $S_{ij} = R_{ij} - \frac{n}{x}\mu$, wherein $R_{ij}$ refers to the number of sequencing reads at region of interest (or segment) j for real test sample i, and $\mu$ refers to an average (mean or median) number of sequencing reads, which can be, for example, an average number of sequencing reads from all segments within the real test sample (i.e., $\mu_i$), an average number of sequencing reads at region of interest (or segment) j across a plurality of real test samples (i.e., $\mu_j$), or a normalized (or double normalized) average number of sequencing reads that is an average number of sequencing reads for a region of interest (or segment) j for a plurality of real test samples, wherein the number of sequencing reads for each real test sample has been normalized across the real test sample. By way of example, the synthetic number of sequencing reads for a synthetic copy number variant having one copy of a region of interest j, based on a number of sequencing reads from a real test sample i assumed to have two copies of the region of interest, can be determined as $S_{ij} = R_{ij} - \frac{1}{2}\mu$. In some embodiments, for a synthetic copy number variant comprising a duplication (i.e., having n additional copies of a region of interest or segment thereof than an assumed number of copies x in a real test sample), the synthetic number of sequencing reads for the synthetic copy number variant is determined as $S_{ij} = R_{ij} + \frac{n}{x}\mu$. By way of example, the synthetic number of sequencing reads for a synthetic copy number variant having three copies of a region of interest j, based on a number of sequencing reads from a real test sample i assumed to have two copies of the region of interest, can be determined as $S_{ij} = R_{ij} + \frac{1}{2}\mu$.
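The additive approach of this paragraph can be sketched as below. The multiplier n/x is an assumption inferred from the surrounding text (adding 50% of the average to go from two copies to three); the function name and example numbers are hypothetical.

```python
def additive_synthetic_reads(real_reads, average_reads, n, x=2, duplication=True):
    """Shift a real read count by (n / x) times an average read count.

    n: number of copies gained (duplication) or lost (deletion).
    x: assumed (e.g., wild-type) number of copies in the real test sample.
    The n / x multiplier is an assumption, not a confirmed formula.
    """
    shift = (n / x) * average_reads
    return real_reads + shift if duplication else real_reads - shift

# A real sample with 110 reads in a segment whose average is 100 reads.
print(additive_synthetic_reads(110, 100, n=1, duplication=True))   # three copies
print(additive_synthetic_reads(110, 100, n=1, duplication=False))  # one copy
```

Because a constant is added to or subtracted from every sample, the sample-to-sample offset from the average (10 reads here) carries over unchanged into the synthetic variant.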
[0077] In some embodiments, a synthetic number of sequencing reads for a synthetic copy number variant comprising m copies of a region of interest (or segment thereof) can be generated based on a number of sequencing reads of that region of interest (or segment thereof) comprising x copies of the region of interest (or segment thereof) according to $S_{ij} = R_{ij} + \frac{m - x}{x}\mu$, where $R_{ij}$ is the number of real sequencing reads and $\mu$ is the average number of sequencing reads. For example, a synthetic number of sequencing reads for a synthetic copy number variant with three copies of a region of interest (or segment thereof) can be generated based on a number of sequencing reads from a real test sample having two copies of the region of interest (or segment thereof) according to: $S_{ij} = R_{ij} + \frac{3 - 2}{2}\mu$. In some embodiments, a synthetic copy number variant with one copy of a region of interest (or segment thereof) can be generated based on a number of sequencing reads from a real test sample having two copies of the region of interest (or segment thereof) according to: $S_{ij} = R_{ij} + \frac{1 - 2}{2}\mu$.
[0078] In some embodiments, the synthetic number of sequencing reads for a synthetic copy number variant comprising m copies of a region of interest is generated from a number of sequencing reads from a real test sample with an assumed (e.g., wild-type) number of copies x by multiplying the number of sequencing reads from the real test sample by $m/x$. That is, a synthetic number of sequencing reads for a synthetic copy number variant can be determined based on a number of sequencing reads according to: $S_{ij} = \frac{m}{x} R_{ij}$. For example, a synthetic number of sequencing reads for a synthetic copy number variant having three copies of a region of interest (or segment thereof) can be generated based on a number of sequencing reads from a real test sample assumed to have two copies of the region of interest (or segment thereof) according to: $S_{ij} = \frac{3}{2} R_{ij}$. A synthetic number of sequencing reads for a synthetic copy number variant having one copy of a region of interest (or segment thereof) can be generated based on a number of sequencing reads from a real test sample assumed to have two copies of the region of interest (or segment thereof) according to: $S_{ij} = \frac{1}{2} R_{ij}$. In some embodiments, a fudge factor is included when determining the synthetic number of sequencing reads, which can be used to more closely model the variance of a plurality of synthetic numbers of sequencing reads (i.e., a plurality of synthetic copy number variants) to the variance of a plurality of real test samples used as a basis for the plurality of synthetic copy number variants. The fudge factor can be derived from the increase or decrease in variance expected for a Poisson distribution when changing the average number of sequencing reads.
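A brief sketch of the multiplicative approach, and of why a variance correction may be wanted, follows. The function name and the read counts are hypothetical. Scaling every count by a factor c scales the variance by c squared, whereas for Poisson-distributed counts the variance equals the mean and would scale only by c.

```python
from statistics import mean, pvariance

def scale_reads(real_reads, m, x=2):
    """Multiply a real read count by m / x to represent m copies."""
    return real_reads * m / x

real = [90, 100, 110]                                # samples assumed to have 2 copies
synthetic = [scale_reads(r, m=1) for r in real]      # simulated single copy

print(mean(synthetic))                               # half the real mean
print(pvariance(synthetic) / pvariance(real))        # variance ratio is (1/2)**2
```

Here the mean is halved but the variance is quartered, so the simulated one-copy population is tighter than a Poisson model would predict for the reduced mean.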
[0079] In some embodiments, the synthetic number of sequencing reads for a synthetic copy number variant is determined by sampling a binomial distribution or a negative binomial distribution of real sequencing reads from a real test sample. For example, for a synthetic copy number deletion variant having m copies of a region of interest (or segment thereof), the synthetic number of sequencing reads can be generated by sampling from a binomial distribution of a real number of sequencing reads from a real test sample having x copies of the region of interest (or segment thereof) with a success probability equal to $m/x$ and a number of trials equal to the number of real sequencing reads. That is, for a synthetic copy number deletion variant, $S_{ij} \sim \mathrm{Binomial}(R_{ij}, m/x)$. For example, for a synthetic copy number variant having one copy of a region of interest (or segment thereof), the synthetic number of sequencing reads can be generated by sampling from a binomial distribution of a real number of sequencing reads from a real test sample having two copies of the region of interest (or segment thereof) with a success probability equal to ½ and a number of trials equal to the number of real sequencing reads. That is, $S_{ij} \sim \mathrm{Binomial}(R_{ij}, 1/2)$. FIG. 2 illustrates binomial sampling of real sequencing reads from real test samples having two copies of the region of interest to generate the synthetic copy number variants having one copy of the region of interest. In the illustrated example, five real test samples are used to generate five synthetic copy number variants, although the plurality can comprise any number of real test samples and synthetic copy number variants. In the illustrated example, each real test sample includes a real number of sequencing reads of 100, although it is understood that a distribution of sequencing reads would be likely. A binomial distribution is sampled for each real test sample with a success probability equal to ½. A success represents a first copy of the region of interest and a failure represents the second copy. The number of successful sequencing reads (that is, those representing the first copy) is equal to the synthetic number of sequencing reads for the synthetic copy number variant.
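The binomial sampling illustrated with FIG. 2 can be sketched using only the Python standard library. The function name and the seed are hypothetical; each real read is kept with probability m/x, so the number of kept reads is one binomial draw.

```python
import random

def sample_deletion_reads(real_reads, m, x=2, rng=random):
    """Draw a synthetic read count for a deletion variant with m copies from a
    real sample with x copies: one Bernoulli(m / x) trial per real read."""
    if not 0 <= m <= x:
        raise ValueError("a deletion variant needs 0 <= m <= x")
    p = m / x
    return sum(1 for _ in range(real_reads) if rng.random() < p)

rng = random.Random(0)  # fixed seed only so the example is repeatable
# Five real test samples with 100 reads each, as in the illustrated example.
synthetic = [sample_deletion_reads(100, m=1, x=2, rng=rng) for _ in range(5)]
print(synthetic)  # five counts scattered around 50
```

Unlike multiplying by ½, this sampling adds counting noise of its own, so the five synthetic counts differ even though the five real counts are identical.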
[0080] In some embodiments, the synthetic number of sequencing reads for a synthetic copy number duplication variant having m number of copies of a region of interest (or segment thereof) is generated by sampling from a negative binomial distribution, wherein a number of successes is equal to the real number of sequencing reads from a real test sample having an assumed x number of copies of the region of interest (or segment thereof), and the probability of success is equal to $x/m$, and adding an expectation of the sampled negative binomial distribution to the real number of sequencing reads. That is, for a synthetic copy number duplication variant, $S_{ij} = R_{ij} + \mathrm{NB}(R_{ij}, x/m)$. For example, a synthetic number of sequencing reads for a synthetic copy number duplication variant having three copies of a region of interest (or segment thereof) can be generated by sampling from a negative binomial distribution, wherein a number of successes is equal to the real number of sequencing reads from a real test sample having an assumed two copies of the region of interest (or segment thereof), and the probability of success is equal to ⅔, and adding an expectation of the sampled negative binomial distribution to the real number of sequencing reads. That is, $S_{ij} = R_{ij} + \mathrm{NB}(R_{ij}, 2/3)$. In some embodiments, a fudge factor is included when determining the synthetic number of sequencing reads, which can be used to more closely model the variance of a plurality of synthetic numbers of sequencing reads (i.e., a plurality of synthetic copy number variants) to the variance of a plurality of real test samples used as a basis for the plurality of synthetic copy number variants. The fudge factor can be determined empirically. For example, the fudge factor can be determined by comparing the distribution of sequencing reads from an X chromosome in males (which have a single copy of the X chromosome) to the distribution of sequencing reads from the X chromosome in females (which have two copies of the X chromosome) that have a simulated deletion of a single X chromosome (thus having a simulated one copy of the X chromosome). The fudge factor can be adjusted such that the observed one copy males are compared to simulated one copy females. For example, the synthetic number of sequencing reads can be determined according to an equation that includes the fudge factor (equation images imgf000028_0001 to imgf000028_0003 are not reproduced here).
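A rough sketch of the negative binomial approach for duplications is given below. The success probability is not legible in this text, so the sketch assumes x/m, which makes the expected synthetic count m/x times the real count; that choice, the function name, and the seed are assumptions. The fudge factor is not modeled.

```python
import random
from statistics import mean

def sample_duplication_reads(real_reads, m, x=2, rng=random):
    """Draw a synthetic read count for a duplication variant with m copies.

    Counts the failures observed before `real_reads` successes when each trial
    succeeds with probability x / m (assumed), then adds those failures to the
    real read count.
    """
    if not 0 < x <= m:
        raise ValueError("a duplication variant needs 0 < x <= m")
    p = x / m
    successes = failures = 0
    while successes < real_reads:
        if rng.random() < p:
            successes += 1
        else:
            failures += 1
    return real_reads + failures

rng = random.Random(1)  # fixed seed only so the example is repeatable
draws = [sample_duplication_reads(100, m=3, x=2, rng=rng) for _ in range(2000)]
print(mean(draws))  # close to 150, i.e. 3/2 of the 100 real reads
```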
[0081] The genetic variant caller can call a number of copies of the region of interest for each synthetic copy number variant in the plurality of synthetic copy number variants. The number of copies of the region of interest in each synthetic copy number variant is known, as the number of copies of the region of interest in the synthetic copy number variant is represented by the synthetic number of sequencing reads, which were generated by adjusting the number of real sequencing reads from the real test sample to a desired number of copies of the region of interest. The called number of copies can be compared to the number of copies in each of the synthetic copy number variants in the plurality of synthetic copy number variants to determine a summary statistic for the genetic variant screen. The summary statistic can be, for example, sensitivity, specificity, precision, recall, accuracy, positive predictive value, negative predictive value, or any other metric of concordance.
[0082] The summary statistic indicates the performance of the genetic variant screen. For example, a high number of true positives and a low number of false negatives for a genetic variant screen is preferable. Thus, the summary statistic can be used to assess the performance of the genetic variant screen. In some embodiments, a predetermined threshold for the summary statistic can be selected. In some embodiments, if the summary statistic is below the predetermined threshold, the genetic variant screen can be refined (for example, by altering the method for generating real sequencing reads or altering the genetic variant caller).
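Comparing called copy numbers with the known synthetic copy numbers can be sketched as follows. The function name, the example calls, and the 0.95 threshold are hypothetical, and counting a variant called with the wrong copy number as a false negative is one possible convention among several.

```python
def summarize_calls(called, truth, wild_type=2):
    """Tally concordance between called and known (synthetic) copy numbers."""
    tp = fp = tn = fn = 0
    for called_copies, true_copies in zip(called, truth):
        if true_copies != wild_type:          # a synthetic variant
            if called_copies == true_copies:
                tp += 1
            else:
                fn += 1                       # missed or wrongly sized call
        elif called_copies == wild_type:
            tn += 1
        else:
            fp += 1

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "positive_predictive_value": ratio(tp, tp + fp),
        "negative_predictive_value": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
    }

truth = [1, 1, 3, 3, 2, 2]    # known copies in synthetic variants and controls
called = [1, 2, 3, 3, 2, 3]   # copies reported by a hypothetical caller
stats = summarize_calls(called, truth)
needs_refinement = stats["sensitivity"] < 0.95   # hypothetical threshold
print(stats, needs_refinement)
```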
Performance of a SNP or Indel Variant Caller
[0083] In another aspect, there is provided a method of assessing the performance of a genetic variant screen, comprising calling, using a genetic variant caller, for an inverse of a synthetic variant in each test sequence in a plurality of test sequences, each test sequence comprising a plurality of sequencing reads aligned with a reference sequence comprising a synthetic variant; and determining a summary statistic for the genetic variant screen based on a called inverse of the variant in the plurality of test sequences and the variant in the reference sequence, thereby assessing the performance of the genetic variant screen. In some embodiments, the method further comprises aligning the plurality of sequencing reads with the reference sequence comprising the synthetic variant. In some embodiments, the method comprises generating the sequencing reads from one or more real samples (e.g., one or more sequencing libraries from the one or more real samples) using any known sequencing method, such as massively parallel sequencing (for example using an Illumina HiSeq 2500 system).
[0084] Genetic variant callers generally function by comparing aligned sequencing reads to a reference sequence. The sequencing reads are aligned to the reference sequence, and the genetic variant caller identifies differences between the aligned sequencing reads and the reference sequence. Solely by way of example, if the reference sequence includes 40 bases that are not present in the aligned sequence, the genetic variant caller is intended to call a deletion of those 40 bases. However, insertions, deletions, and inversions are relatively rare occurrences, and it is difficult to acquire a sufficiently large number of positive variants to adequately assess the performance of a genetic variant screen. By introducing a synthetic variant into the reference sequence, the genetic variant caller can call the inverse of the synthetic variant in a test sequence (which is known to be negative for the synthetic variant) when compared to the reference sequence comprising the synthetic variant.
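The idea of placing the synthetic variant in the reference rather than in the sample can be illustrated with a toy example. The sequence, coordinates, and function names are hypothetical, and no alignment or variant calling is performed; the sketch only shows which inverse call would be expected.

```python
def delete_from_reference(reference, start, length):
    """Introduce a synthetic deletion into a reference sequence.
    Returns the modified reference and the deleted bases."""
    deleted = reference[start:start + length]
    modified = reference[:start] + reference[start + length:]
    return modified, deleted

def expected_inverse_call(start, deleted):
    """A sample lacking the synthetic deletion should appear, relative to the
    modified reference, to carry an insertion of the deleted bases."""
    return {"type": "insertion", "position": start, "sequence": deleted}

reference = "ACGTTGCAAGGCTTAC"
modified, deleted = delete_from_reference(reference, start=4, length=3)
call = expected_inverse_call(4, deleted)
print(modified, call)

# Applying the expected insertion to the modified reference restores the
# original sequence, which is what an unmodified sample's reads support.
restored = modified[:call["position"]] + call["sequence"] + modified[call["position"]:]
print(restored == reference)
```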
[0085] An exemplary variant can include, but is not limited to, one or more of an insertion, a deletion, an inversion, a translocation, a single nucleotide variant (SNV), or a combination thereof. In some embodiments, the synthetic variant comprises an insertion and the inverse of the synthetic variant is a deletion. In some embodiments, the synthetic variant comprises a deletion and the inverse of the synthetic variant is an insertion. In some embodiments, the synthetic variant is an inversion and the inverse of the synthetic variant is an inversion.
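The inverse relationship described above can be illustrated with a minimal Python sketch. It assumes sequences are plain strings and positions are 0-based offsets; the function names and the dictionary layout of the expected call are hypothetical and chosen only for illustration. Inserting bases into the reference means that an unmodified sample, compared against the modified reference, should appear to carry a deletion of the same bases.

```python
# Inverse variant types stated in the text: insertion <-> deletion,
# and an inversion is its own inverse.
INVERSE_TYPE = {
    "insertion": "deletion",
    "deletion": "insertion",
    "inversion": "inversion",
}


def add_synthetic_insertion(reference: str, position: int, inserted: str) -> str:
    """Return a copy of the reference with `inserted` placed at `position`."""
    return reference[:position] + inserted + reference[position:]


def expected_inverse_call(position: int, inserted: str) -> dict:
    """Describe the call expected for a sample that lacks the synthetic insertion."""
    return {
        "type": INVERSE_TYPE["insertion"],
        "start": position,
        "length": len(inserted),
    }


def apply_deletion(sequence: str, start: int, length: int) -> str:
    """Remove `length` bases beginning at `start`."""
    return sequence[:start] + sequence[start + length:]


reference = "ACGTACGTAC"
synthetic_reference = add_synthetic_insertion(reference, 4, "TTTT")
truth = expected_inverse_call(4, "TTTT")

# Applying the expected deletion to the modified reference recovers the
# original sequence, which is what the unmodified sample looks like.
recovered = apply_deletion(synthetic_reference, truth["start"], truth["length"])
```

In this sketch the known truth for every test sequence is the same deletion, so any call that differs from `truth` counts against the screen.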
[0086] The synthetic variant can be of any length, for example, 1 base to a full length chromosome. For example, the synthetic variant can be 1 base to about 250 million bases in length (such as about 1 base to about 50 bases in length, about 50 bases to about 100 bases in length, about 100 bases to about 150 bases in length, about 150 bases to about 250 bases in length, about 250 bases to about 500 bases in length, about 500 bases to about 1000 bases in length, about 1000 bases to about 2000 bases in length, about 2000 bases to about 4000 bases in length, about 4000 bases to about 8000 bases in length, about 8000 bases to about 16,000 bases in length, about 16,000 bases to about 32,000 bases in length, about 32,000 bases to about 64,000 bases in length, about 64,000 bases to about 125,000 bases in length, about 125,000 bases to about 250,000 bases in length, about 250,000 bases to about 500,000 bases in length, about 500,000 bases to about 1 million bases in length, about 1 million bases to about 2 million bases in length, about 2 million bases to about 4 million bases in length, about 4 million bases to about 8 million bases in length, about 8 million bases to about 16 million bases in length, about 16 million bases to about 32 million bases in length, about 32 million bases to about 64 million bases in length, about 64 million bases to about 125 million bases in length, or about 125 million bases to about 250 million bases in length).
In some embodiments, the synthetic variant is about 1 base or more (such as about 50 bases or more, about 100 bases or more, about 250 bases or more, about 500 bases or more, about 1000 bases or more, about 2000 bases or more, about 4000 bases or more, about 8000 bases or more, about 16,000 bases or more, about 32,000 bases or more, about 64,000 bases or more, about 125,000 bases or more, about 250,000 bases or more, about 500,000 bases or more, about 1 million bases or more, about 2 million bases or more, about 4 million bases or more, about 8 million bases or more, about 16 million bases or more, about 32 million bases or more, about 64 million bases or more, or about 125 million bases or more).
[0087] The genetic variant caller can call the inverse of the synthetic variant for each test sequence in the plurality of test sequences. The inverse of the synthetic variant in each test sequence is known, as the synthetic variant was artificially introduced into the reference sequence. The called inverse of the synthetic variant can be compared to the known inverse of the synthetic variant to determine a summary statistic for the genetic variant screen. The summary statistic can be, for example, sensitivity, specificity, precision, recall, accuracy, positive predictive value, negative predictive value, or any other metric of concordance.
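The summary statistics listed above follow from the standard confusion-matrix definitions. The sketch below is a minimal Python illustration that assumes the comparison of called and known variants has already been tallied into true positive, false positive, true negative, and false negative counts; the function name and the example counts are hypothetical.

```python
def summary_statistics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute common concordance metrics from confusion-matrix counts."""

    def ratio(numerator: int, denominator: int) -> float:
        # Return NaN rather than raising when a metric is undefined.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # also called recall
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),  # also called positive predictive value
        "negative_predictive_value": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


# Hypothetical tallies, for illustration only.
stats = summary_statistics(tp=98, fp=1, tn=899, fn=2)
```

Recall is identical to sensitivity and positive predictive value is identical to precision, so the dictionary reports each only once.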
[0088] The summary statistic indicates the performance of the genetic variant screen. For example, a high number of true positives and a low number of false negatives for a genetic variant screen is preferable. Thus, the summary statistic can be used to assess the performance of the genetic variant screen. In some embodiments, a predetermined threshold for the summary statistic can be selected. In some embodiments, if the summary statistic is below the predetermined threshold, the genetic variant screen can be refined (for example, by altering the method for generating real sequencing reads or altering the genetic variant caller).
Computer Systems
[0089] In some embodiments, the methods described herein are implemented by a program executed on a computer system. FIG. 3 depicts an exemplary computing system 300 configured to perform any one of the above-described processes, including the various exemplary methods of assessing the performance of a genetic variant screen. The computing system 300 may include, for example, a processor, memory, storage, and input/output devices (e.g., monitor, keyboard, disk drive, Internet connection, etc.). The computing system 300 may include circuitry or other specialized hardware for carrying out some or all aspects of the processes. For example, in some embodiments, the computing system includes a sequencer (such as a massively parallel sequencer). In some operational settings, computing system 300 may be configured as a system that includes one or more units, each of which is configured to carry out some aspects of the processes either in software, hardware, or some combination thereof.
[0090] FIG. 3 depicts computing system 300 with a number of components that may be used to perform the above-described processes. The main system 302 includes a motherboard 304 having an input/output ("I/O") section 306, one or more central processing units ("CPU") 308, and a memory section 310, which may have a flash memory card 312 related to it. The I/O section 306 is connected to a display 314, a keyboard 316, a disk storage unit 318, and a media drive unit 320. The media drive unit 320 can read/write a computer-readable medium 322, which can contain programs 324 and/or data.
[0091] At least some values based on the results of the above-described processes can be saved for subsequent use. Additionally, a non-transitory computer-readable medium can be used to store (e.g., tangibly embody) one or more computer programs for performing any one of the above-described processes by means of a computer. The computer program may be written, for example, in a general-purpose programming language (e.g., Pascal, C, C++, Java, Python, JSON, R, etc.) or some specialized application-specific language.
[0092] In some embodiments, the summary statistic is reported (for example, to a patient, a doctor, a caregiver, or a regulator). In some embodiments, the summary statistic is displayed, for example on a monitor.
[0093] Various exemplary embodiments are described herein. Reference is made to these examples in a non-limiting sense. They are provided to illustrate more broadly applicable aspects of the disclosed technology. Various changes may be made and equivalents may be substituted without departing from the true spirit and scope of the various embodiments. In addition, many modifications may be made to adapt a particular situation, material, composition of matter, process, process act(s) or step(s) to the objective(s), spirit or scope of the various embodiments. Further, as will be appreciated by those with skill in the art, each of the individual variations described and illustrated herein has discrete components and features that may be readily separated from or combined with the features of any of the other several embodiments without departing from the scope or spirit of the various embodiments. All such modifications are intended to be within the scope of claims associated with this disclosure.
[0094] The following non-limiting examples further illustrate the methods of the present invention. Those skilled in the art will recognize that several embodiments are possible within the scope and spirit of this invention. While illustrative of the invention, the following examples should not be construed as in any way limiting its scope.
EXEMPLARY EMBODIMENTS
[0095] Embodiment 1. A method of assessing the performance of a genetic variant screen, comprising:
generating a plurality of synthetic copy number variants comprising a synthetic number of copies of a region of interest represented by a synthetic number of sequencing reads from one or more segments within the region of interest;
calling a number of copies of the region of interest using a genetic variant caller for each synthetic copy number variant based on the synthetic number of sequencing reads from the one or more segments and a number of real sequencing reads from the one or more segments from one or more real reference samples; and
determining a summary statistic for the genetic variant screen based on the differences between the called number of copies and the synthetic number of copies in the synthetic copy number variants, thereby assessing the performance of the genetic variant screen.
[0096] Embodiment 2. The method of embodiment 1, wherein the summary statistic is sensitivity, specificity, precision, recall, accuracy, positive predictive value, or negative predictive value.
[0097] Embodiment 3. The method of embodiment 1 or 2, wherein the synthetic number of sequencing reads from each of the one or more segments is generated by increasing or decreasing a number of real sequencing reads from the one or more segments from a real test sample in proportion to a predetermined number of copies of the region of interest.
[0098] Embodiment 4. The method of embodiment 3, wherein the predetermined number of copies is an integer number of copies.
[0099] Embodiment 5. The method of embodiment 3, wherein the predetermined number of copies is a non-integer number of copies.
[0100] Embodiment 6. The method of any one of embodiments 1-5, wherein the synthetic number of sequencing reads from the one or more segments within the region of interest is generated by sampling a binomial distribution with a success probability equal to m/x and a number of trials equal to a real number of sequencing reads from a real test sample, wherein m is the synthetic number of copies of the region of interest, and x is an assumed number of copies of the region of interest from the real test sample.
[0101] Embodiment 7. The method of any one of embodiments 1-5, wherein the synthetic number of sequencing reads from the one or more segments within the region of interest is generated by sampling a number of sequencing reads as a negative binomial distribution with a success probability equal to m/x and a number of successes equal to a real number of sequencing reads from a real test sample, wherein m is the synthetic number of copies of the region of interest, and x is an assumed number of copies of the region of interest from the real test sample, and adding the sampled number of sequencing reads to the real number of sequencing reads from the test sample.
[0102] Embodiment 8. The method of embodiment 7, wherein the synthetic number of sequencing reads from the one or more segments within the region of interest is generated by sampling a number of sequencing reads as an expectation of the negative binomial distribution.
[0103] Embodiment 9. The method of any one of embodiments 3-8, wherein the number of real sequencing reads from each of the one or more segments from the real test sample are normalized by dividing the number of real sequencing reads from each segment from the real test sample by the average number of real sequencing reads from a corresponding segment from the one or more real reference samples.
[0104] Embodiment 10. The method of any one of embodiments 3-9, wherein the number of real sequencing reads from each of the one or more segments from the real test sample are normalized by dividing the number of real sequencing reads from each segment from the real test sample by the average number of real sequencing reads from the one or more segments within the region of interest from the real test sample.
[0105] Embodiment 11. The method of embodiment 9 or 10, wherein the average number of real sequencing reads from a corresponding segment from the one or more real reference samples is a median number of real sequencing reads from a corresponding segment from the one or more real reference samples.
[0106] Embodiment 12. The method of any one of embodiments 9 or 10, wherein the average number of real sequencing reads from the one or more segments within the region of interest from the real test sample is the median average number of real sequencing reads from the one or more segments within the region of interest from the real test sample.
[0107] Embodiment 13. The method of any one of embodiments 3-12, wherein the number of real sequencing reads are normalized for GC content bias or mappability.
[0108] Embodiment 14. The method of any one of embodiments 1-13, wherein the region of interest is more than about 100 bases in length.
[0109] Embodiment 15. The method of any one of embodiments 1-14, wherein the region of interest is at least one exon.
[0110] Embodiment 16. The method of any one of embodiments 1-15, wherein the region of interest is at least one gene.
[0111] Embodiment 17. The method of any one of embodiments 1-16, wherein the region of interest is at least one chromosome.
[0112] Embodiment 18. The method of any one of embodiments 3-17, further comprising generating the real sequencing reads from the real test sample.
[0113] Embodiment 19. The method of any one of embodiments 1-18, further comprising generating the real sequencing reads from the one or more real reference samples.
[0114] Embodiment 20. A method of assessing the performance of a genetic variant screen, comprising:
calling, using a genetic variant caller, for an inverse of a synthetic variant in each test sequence in a plurality of test sequences, each test sequence comprising a plurality of sequencing reads aligned with a reference sequence comprising a synthetic variant; and
determining a summary statistic for the genetic variant screen based on a called inverse of the synthetic variant in the plurality of test sequences and the synthetic variant in the reference sequence, thereby assessing the performance of the genetic variant screen.
[0115] Embodiment 21. The method of embodiment 20, further comprising aligning the plurality of sequencing reads with the reference sequence comprising the synthetic variant.
[0116] Embodiment 22. The method of embodiment 20 or 21, wherein the synthetic variant comprises an insertion, a deletion, an inversion, a translocation, a SNV, or a combination thereof.
[0117] Embodiment 23. The method of any one of embodiments 20-22, wherein the synthetic variant comprises an insertion or a deletion.
[0118] Embodiment 24. The method of any one of embodiments 20-23, wherein the synthetic variant comprises an insertion and the inverse of the synthetic variant is a deletion.
[0119] Embodiment 25. The method of any one of embodiments 20-23, wherein the synthetic variant comprises a deletion and the inverse of the synthetic variant is an insertion.
[0120] Embodiment 26. The method of any one of embodiments 20-22, wherein the synthetic variant is an inversion and the inverse of the synthetic variant is an inversion.
[0121] Embodiment 27. The method of any one of embodiments 20-26, wherein the synthetic variant is between 1 base in length and about 1 chromosome in length.
[0122] Embodiment 28. The method of any one of embodiments 20-27, wherein the synthetic variant is between about 20 bases in length and about 1000 bases in length.
[0123] Embodiment 29. The method of any one of embodiments 20-28, wherein the synthetic variant is the length of a full gene.
[0124] Embodiment 30. The method of any one of embodiments 20-29, wherein the summary statistic is sensitivity, specificity, precision, recall, accuracy, positive predictive value, or negative predictive value.
[0125] Embodiment 31. The method of any one of embodiments 20-30, further comprising generating the sequencing reads from the one or more test samples.
[0126] Embodiment 32. The method of any one of embodiments 1-31, further comprising reporting the summary statistic.
[0127] Embodiment 33. The method of any one of embodiments 1-32, further comprising displaying the summary statistic on a monitor.
[0128] Embodiment 34. The method of any one of embodiments 1-33, wherein the method is implemented by a program executed on a computer system.
[0129] Embodiment 35. The method of any one of embodiments 1-34, further comprising storing the summary statistic in a database.
[0130] Embodiment 36. A computer readable storage medium comprising instructions for carrying out the method of any one of embodiments 1-35.
[0131] Embodiment 37. A system comprising
a processor, and
a memory, wherein the memory comprises computer readable instructions operable to cause the processor to carry out the method of any one of embodiments 1-36.
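The read-count manipulations recited in embodiments 3 and 6 can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the implementation used for the screen: the function names are hypothetical, the binomial draw is built from Bernoulli trials using only the standard library, and the binomial path is restricted to m/x between 0 and 1 (a reduction in copy number), since a probability above 1 is not defined.

```python
import random


def scale_reads(real_reads: int, m: float, x: float) -> float:
    """Scale a real read count in proportion to m/x (embodiment 3)."""
    if x <= 0 or m < 0:
        raise ValueError("x must be positive and m must be non-negative")
    return real_reads * m / x


def sample_binomial_reads(real_reads: int, m: float, x: float,
                          rng: random.Random) -> int:
    """Draw a synthetic read count from Binomial(real_reads, m/x) (embodiment 6)."""
    if x <= 0:
        raise ValueError("x must be positive")
    p = m / x
    if not 0.0 <= p <= 1.0:
        raise ValueError("m/x must lie in [0, 1] for binomial sampling")
    # Each real read is kept independently with probability p.
    return sum(1 for _ in range(real_reads) if rng.random() < p)


# Hypothetical example: simulate one copy (m=1) from a sample assumed to
# carry two copies (x=2), starting from 10,000 real reads in a segment.
rng = random.Random(0)
synthetic_reads = sample_binomial_reads(10_000, m=1, x=2, rng=rng)
```

The deterministic scaling also accepts non-integer values of m, as in embodiment 5, while the binomial draw preserves sampling noise in the synthetic count.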
EXAMPLES
Example 1
[0132] The X chromosome from several female test samples was sequenced using massively parallel sequencing methods. The X chromosome was divided into a plurality of segments, and the number of sequencing reads within each segment was normalized against corresponding segments from a plurality of reference X chromosomes by dividing the number of sequencing reads from each segment by the median number of sequencing reads from the corresponding segment in all reference X chromosomes. The numbers of sequencing reads from each segment from each test X chromosome were further normalized against the number of sequencing reads from each segment within the test X chromosome by dividing the number of sequencing reads in any given segment from any given X chromosome by the median number of sequencing reads from the segments in that X chromosome. Synthetic copy number variants of the X chromosome were generated by subtracting half of the median number of sequencing reads from any given X chromosome from the number of sequencing reads from each segment within that X chromosome.
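The two normalization steps and the subtraction described above can be sketched in Python. This is a minimal illustration that assumes per-segment read counts are held in plain lists with one entry per segment, and that the subtraction is applied to the double-normalized values; the function names and the example counts are hypothetical.

```python
from statistics import median


def double_normalize(test_counts, reference_counts):
    """Normalize per-segment counts by reference medians, then by the sample median.

    test_counts: read counts per segment for one test sample.
    reference_counts: one list of per-segment counts for each reference sample.
    """
    # Median read count of each segment across all reference samples.
    reference_medians = [median(column) for column in zip(*reference_counts)]
    by_reference = [count / ref for count, ref in zip(test_counts, reference_medians)]
    # Median across the segments of this one sample.
    sample_median = median(by_reference)
    return [value / sample_median for value in by_reference]


def simulate_single_copy(normalized):
    """Subtract half of the sample's median depth from every segment."""
    half_median = median(normalized) / 2
    return [value - half_median for value in normalized]


# Hypothetical counts for three segments, for illustration only.
references = [[100, 200, 300], [110, 190, 310], [90, 210, 290]]
female_sample = [200, 400, 600]

normalized = double_normalize(female_sample, references)
simulated_male = simulate_single_copy(normalized)
```

After double normalization the median depth of a two-copy sample is 1, so the subtraction leaves a median of 0.5, matching the value the example assigns to real male samples.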
[0133] As female samples each contain two copies of the X chromosome and males contain a single copy of the X chromosome, the synthetic copy number variants of the X chromosome can be considered "simulated males", as each synthetic copy number variant includes a single copy of the X chromosome. A theoretical copy number was determined based on the synthetic number of sequencing reads from the simulated male samples (which are based on double-normalized sequencing reads). A theoretical copy number was determined based on the double-normalized number of sequencing reads from real female (XX) samples and real male (XY) samples, with the median number of sequencing reads from the real female samples set to 1 and the real males set to 0.5. The theoretical copy number was determined by rescaling a double-normalized number of sequencing reads of 0.5 to 1 (indicating a single copy of the X chromosome) and a double-normalized number of sequencing reads of 1 to 2 (indicating two copies of the X chromosome). The theoretical copy number was plotted against the sequencing depth (i.e., absolute number of sequencing reads prior to normalization) for each segment within the X chromosome of real female samples, real male samples, and synthetic copy number variants (see FIG. 4). As can be seen in FIG. 4, the female samples have approximately two copies of the X chromosome, and male samples have approximately one copy of the X chromosome, with greater precision for those segments with a greater absolute number of sequencing reads. The simulated males behave approximately the same as real male samples.
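The rescaling described above maps a double-normalized depth of 0.5 to one copy and a depth of 1 to two copies. A minimal Python sketch follows, assuming the rescaling is linear between and beyond those two points; the function name and the `baseline_copies` parameter are hypothetical.

```python
def theoretical_copy_number(normalized_depth: float, baseline_copies: int = 2) -> float:
    """Rescale a double-normalized depth so that a depth of 1 equals the baseline copy count."""
    return normalized_depth * baseline_copies


# Depths of 0.5 and 1.0 correspond to one and two copies of the X chromosome.
male_like = theoretical_copy_number(0.5)
female_like = theoretical_copy_number(1.0)
```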
Example 2
[0134] Synthetic copy number variants having a synthetic copy number of an exon from various genes were generated. The synthetic copy number included an additional copy of the exon (a "duplication" copy number variant) or removed a copy of the exon (a "deletion" copy number variant). A genetic variant caller was used to call copy numbers of the exon in the synthetic copy number variant, and sensitivity for the screen was determined. The assay was then modified by increasing the number of segments within the region of interest and re-run, and synthetic copy number variants having a synthetic copy number of the exon from various genes were re-generated based on the new assay. The genetic variant caller was used to call copy numbers of the re-generated synthetic copy number variants, and a new sensitivity for the assay was determined. These results are presented in FIG. 5. As can be seen in FIG. 5, the original assay (labeled "prototype") had low sensitivity for calling copy number variants for several exons in various genes, particularly in detecting duplication events. Once the assay was modified, however, the performance of the assay was substantially improved.
Example 3
[0135] Synthetic copy number variants having a synthetic copy number of one exon, two exons, four exons, or an entire gene from one of 36 different genes were generated. The synthetic copy number variants included either deletion or duplication events. A genetic variant caller was used to call copy numbers of the exons or gene in the synthetic copy number variant, and sensitivity for the screen was determined. The assay was then modified by increasing the number of segments within the region of interest, and re-run, and synthetic copy number variants having a synthetic copy number of the exons or genes were re-generated based on the new assay. The genetic variant caller was used to call copy numbers of the re-generated synthetic copy number variants, and a new sensitivity for the assay was determined. The results are presented in FIG. 6. The new genetic variant screen had an overall sensitivity for deletion events of 99.60% (with a 95% confidence interval of 99.50% to 99.72%), and an overall sensitivity for duplication events of 99.00% (with a 95% confidence interval of 98.82% to 99.19%).
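A sensitivity with a 95% confidence interval, as reported above, can be computed from the number of synthetic variants detected and the number generated. The example does not state how its intervals were obtained, so the Python sketch below uses the Wilson score interval purely as an assumption; the function name and the counts are hypothetical and are not the counts behind FIG. 6.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.959964):
    """Return the Wilson score interval for a binomial proportion.

    z defaults to the standard normal quantile for a two-sided 95% interval.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denominator = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denominator
    half_width = (
        z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
        / denominator
    )
    return center - half_width, center + half_width


# Hypothetical tally: 996 of 1,000 synthetic deletions were called correctly.
detected, generated = 996, 1000
sensitivity = detected / generated
low, high = wilson_interval(detected, generated)
```

Because synthetic variants can be generated in large numbers, the interval narrows as the number of generated variants grows.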

Claims

CLAIMS What is claimed is:
1. A method of assessing the performance of a genetic variant screen, comprising:
generating a plurality of synthetic copy number variants comprising a synthetic number of copies of a region of interest represented by a synthetic number of sequencing reads from one or more segments within the region of interest;
calling a number of copies of the region of interest using a genetic variant caller for each synthetic copy number variant based on the synthetic number of sequencing reads from the one or more segments and a number of real sequencing reads from the one or more segments from one or more real reference samples; and
determining a summary statistic for the genetic variant screen based on the differences between the called number of copies and the synthetic number of copies in the synthetic copy number variants, thereby assessing the performance of the genetic variant screen.
2. The method of claim 1, wherein the summary statistic is sensitivity, specificity, precision, recall, accuracy, positive predictive value, or negative predictive value.
3. The method of claim 1 or 2, wherein the synthetic number of sequencing reads from each of the one or more segments is generated by increasing or decreasing a number of real sequencing reads from the one or more segments from a real test sample in proportion to a predetermined number of copies of the region of interest.
4. The method of claim 3, wherein the predetermined number of copies is an integer number of copies.
5. The method of claim 3, wherein the predetermined number of copies is a non-integer number of copies.
6. The method of any one of claims 1-5, wherein the synthetic number of sequencing reads from the one or more segments within the region of interest is generated by sampling a binomial distribution with a success probability equal to m/x and a number of trials equal to a real number of sequencing reads from a real test sample, wherein m is the synthetic number of copies of the region of interest and x is an assumed number of copies of the region of interest from the real test sample.
7. The method of any one of claims 1-5, wherein the synthetic number of sequencing reads from the one or more segments within the region of interest is generated by sampling a number of sequencing reads as a negative binomial distribution with a success probability equal to m/x and a number of successes equal to a real number of sequencing reads from a real test sample, wherein m is the synthetic number of copies of the region of interest, and x is an assumed number of copies of the region of interest from the real test sample, and adding the sampled number of sequencing reads to the real number of sequencing reads from the test sample.
8. The method of claim 7, wherein the synthetic number of sequencing reads from the one or more segments within the region of interest is generated by sampling a number of sequencing reads as an expectation of the negative binomial distribution.
9. The method of any one of claims 3-8, wherein the number of real sequencing reads from each of the one or more segments from the real test sample are normalized by dividing the number of real sequencing reads from each segment from the real test sample by the average number of real sequencing reads from a corresponding segment from the one or more real reference samples.
10. The method of any one of claims 3-9, wherein the number of real sequencing reads from each of the one or more segments from the real test sample are normalized by dividing the number of real sequencing reads from each segment from the real test sample by the average number of real sequencing reads from the one or more segments within the region of interest from the real test sample.
11. The method of claim 9 or 10, wherein the average number of real sequencing reads from a corresponding segment from the one or more real reference samples is a median number of real sequencing reads from a corresponding segment from the one or more real reference samples.
12. The method of any one of claims 9 or 10, wherein the average number of real sequencing reads from the one or more segments within the region of interest from the real test sample is the median average number of real sequencing reads from the one or more segments within the region of interest from the real test sample.
13. The method of any one of claims 3-12, wherein the number of real sequencing reads are normalized for GC content bias or mappability.
14. The method of any one of claims 1-13, wherein the region of interest is more than about 100 bases in length.
15. The method of any one of claims 1-14, wherein the region of interest is at least one exon.
16. The method of any one of claims 1-15, wherein the region of interest is at least one gene.
17. The method of any one of claims 1-16, wherein the region of interest is at least one chromosome.
18. The method of any one of claims 3-17, further comprising generating the real sequencing reads from the real test sample.
19. The method of any one of claims 1-18, further comprising generating the real sequencing reads from the one or more real reference samples.
20. A method of assessing the performance of a genetic variant screen, comprising:
calling, using a genetic variant caller, for an inverse of a synthetic variant in each test sequence in a plurality of test sequences, each test sequence comprising a plurality of sequencing reads aligned with a reference sequence comprising a synthetic variant; and
determining a summary statistic for the genetic variant screen based on a called inverse of the synthetic variant in the plurality of test sequences and the synthetic variant in the reference sequence, thereby assessing the performance of the genetic variant screen.
21. The method of claim 20, further comprising aligning the plurality of sequencing reads with the reference sequence comprising the synthetic variant.
22. The method of claim 20 or 21, wherein the synthetic variant comprises an insertion, a deletion, an inversion, a translocation, a SNV, or a combination thereof.
23. The method of any one of claims 20-22, wherein the synthetic variant comprises an insertion or a deletion.
24. The method of any one of claims 20-23, wherein the synthetic variant comprises an insertion and the inverse of the synthetic variant is a deletion.
25. The method of any one of claims 20-23, wherein the synthetic variant comprises a deletion and the inverse of the synthetic variant is an insertion.
26. The method of any one of claims 20-22, wherein the synthetic variant is an inversion and the inverse of the synthetic variant is an inversion.
27. The method of any one of claims 20-26, wherein the synthetic variant is between 1 base in length and about 1 chromosome in length.
28. The method of any one of claims 20-27, wherein the synthetic variant is between about 20 bases in length and about 1000 bases in length.
29. The method of any one of claims 20-28, wherein the synthetic variant is the length of a full gene.
30. The method of any one of claims 20-29, wherein the summary statistic is sensitivity, specificity, precision, recall, accuracy, positive predictive value, or negative predictive value.
31. The method of any one of claims 20-30, further comprising generating the sequencing reads from the one or more test samples.
32. The method of any one of claims 1-31, further comprising reporting the summary statistic.
33. The method of any one of claims 1-32, further comprising displaying the summary statistic on a monitor.
34. The method of any one of claims 1-33, wherein the method is implemented by a program executed on a computer system.
35. The method of any one of claims 1-34, further comprising storing the summary statistic in a database.
36. A computer readable storage medium comprising instructions for carrying out the method of any one of claims 1-35.
37. A system comprising
a processor; and
a memory, wherein the memory comprises computer readable instructions operable to cause the processor to carry out the method of any one of claims 1-36.
PCT/US2017/060222 2016-11-07 2017-11-06 Methods for assessing genetic variant screen performance Ceased WO2018085779A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201662418622P 2016-11-07 2016-11-07
US62/418,622 2016-11-07
US201762544311P 2017-08-11 2017-08-11
US62/544,311 2017-08-11

Publications (1)

Publication Number Publication Date
WO2018085779A1 true WO2018085779A1 (en) 2018-05-11

Family

ID=62076408

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2017/060222 Ceased WO2018085779A1 (en) 2016-11-07 2017-11-06 Methods for assessing genetic variant screen performance

Country Status (1)

Country Link
WO (1) WO2018085779A1 (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160300013A1 (en) * 2015-04-10 2016-10-13 Agilent Technologies, Inc. METHOD FOR SIMULTANEOUS DETECTION OF GENOME-WIDE COPY NUMBER CHANGES, cnLOH, INDELS, AND GENE MUTATIONS

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MAGI ET AL.: "Characterization of MinION nanopore data for resequencing analyses", BRIEFINGS IN BIOINFORMATICS, vol. 18, no. 6, 24 August 2016 (2016-08-24), pages 940 - 953 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019236420A1 (en) * 2018-06-06 2019-12-12 Myriad Women's Health, Inc. Copy number variant caller
JP2021527250A (en) * 2018-06-06 2021-10-11 ミリアド・ウィメンズ・ヘルス・インコーポレーテッド Copy number variant caller
EP3803879A4 (en) * 2018-06-06 2022-10-05 Myriad Women's Health, Inc. Copy number variant caller
JP2024069550A (en) * 2018-06-06 2024-05-21 ミリアド・ウィメンズ・ヘルス・インコーポレーテッド Copy number variant caller
JP7488772B2 (en) 2018-06-06 2024-05-22 ミリアド・ウィメンズ・ヘルス・インコーポレーテッド Copy number variant caller
JP7735457B2 (en) 2018-06-06 2025-09-08 ミリアド・ウィメンズ・ヘルス・インコーポレーテッド Copy number variant caller
CN108920899A (en) * 2018-06-10 2018-11-30 杭州迈迪科生物科技有限公司 Single-exon copy number variation prediction method based on target region sequencing
CN110648721A (en) * 2019-09-19 2020-01-03 北京市儿科研究所 Method and device for detecting copy number variation for exon capture technology
CN110648721B (en) * 2019-09-19 2022-04-12 首都医科大学附属北京儿童医院 Method and device for detecting copy number variation for exon capture technology

Similar Documents

Publication Publication Date Title
US20240194293A1 (en) Noninvasive prenatal screening using dynamic iterative depth optimization
Beichman et al. Using genomic data to infer historic population dynamics of nonmodel organisms
US20240247306A1 (en) Detecting Cross-Contamination in Sequencing Data Using Regression Techniques
Schrider Background selection does not mimic the patterns of genetic diversity produced by selective sweeps
KR102665592B1 (en) Methods and processes for non-invasive assessment of genetic variations
US20140067355A1 (en) Using Haplotypes to Infer Ancestral Origins for Recently Admixed Individuals
AU2016355983A1 (en) Methods for detecting copy-number variations in next-generation sequencing
US20180300450A1 (en) Systems and methods for performing and optimizing performance of dna-based noninvasive prenatal screens
CN114649055A (en) Methods, devices, and media for detecting single nucleotide variations and indels
AU2020296110B2 (en) Systems and methods for determining genome ploidy
WO2018085779A1 (en) Methods for assessing genetic variant screen performance
Bredemeyer et al. Rapid macrosatellite evolution promotes X-linked hybrid male sterility in a feline interspecies cross
Zhu et al. A generalized dSpliceType framework to detect differential splicing and differential expression events using RNA-Seq
Fine et al. A novel expectation-maximization approach to infer general diploid selection from time-series genetic data
Kuznetsov et al. CNV-finder: streamlining copy number variation discovery
TW202300656A (en) Machine detection of a candidate break-point of a copy number variant on a genomic sequence
US11328794B2 (en) Method for determining relatedness of genomic samples using partial sequence information
Brand et al. Soft sweeps predominate recent positive selection in bonobos (Pan paniscus) and chimpanzees (Pan troglodytes)
US20220375544A1 (en) Kit and method of using kit
JP4922646B2 (en) Gene information display method and display device
US20170226588A1 (en) Systems and methods for dna amplification with post-sequencing data filtering and cell isolation
CN111028885A (en) A method and device for detecting yak RNA editing sites
Bharadwaj Characterizing Alterations to Chromatin Accessibility in Crohn’s Disease Patients by Identifying Potential Causal Variants in Regulatory Regions Through a QTL Approach
Natri et al. Genetic architecture of gene regulation in Indonesian populations identifies QTLs associated with local ancestry and archaic introgression
Scheid et al. Permutation filtering: A novel concept for significance analysis of large-scale genomic data

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17867403

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17867403

Country of ref document: EP

Kind code of ref document: A1