Disclosure of Invention
The object of the present invention is to provide a method for detecting whole genome CNV, use of the method for cancer diagnosis, and a device for diagnosing a cancer condition. When the method is used for statistical analysis of whole genome CNV, both the length of each CNV and the degree to which its copy number deviates from the normal copy number are taken into account, so that the degree of genome disorder of a sample is reflected more accurately and comprehensively, the limitation of analyzing CNV only in specific regions is avoided, and the problem of low diagnostic accuracy caused by the high heterogeneity of cancer samples is solved.
Accordingly, in a first aspect, the present invention provides a method for detecting genome-wide Copy Number Variation (CNV), comprising the steps of:
(1) obtaining a sequencing result sequence of whole genome sequencing of the DNA sample;
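(2) comparing the sequencing result sequence with a human reference genome, and calculating the coordinate position of the CNV in the chromosome segment and the copy number CN_detected of the CNV in the segment;
(3) calculating a composite CNV score for the DNA sample by the following formula:
$$\text{CNV score} = \sum_{i} \frac{\text{CNV length}_i}{\text{chromosome length}_i} \times \left| CN_{\text{detected},i} - CN_{\text{reference},i} \right|$$
wherein the sum runs over all detected CNVs, CNV length represents the length of the CNV calculated from the coordinate position of the CNV in the chromosome segment, chromosome length represents the length of the chromosome on which the CNV is located (as used in the worked example of this description), and CN_reference represents the chromosomal copy number in a normal sample, and wherein the CNV score represents genome-wide copy number variation of the DNA sample.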
In one embodiment, for autosomes and female X chromosomes, CN_reference is 2; for male X and Y chromosomes, CN_reference is 1.
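By way of non-limiting illustration, the choice of CN_reference in this embodiment can be sketched as a small lookup. The function name `reference_copy_number`, the `chr1`..`chr22`/`chrX`/`chrY` naming, and the `"male"`/`"female"` labels are assumptions made for the sketch only and are not part of the described method.

```python
# Hypothetical helper: returns CN_reference for a chromosome of a sample.
# Chromosome naming ("chr1".."chr22", "chrX", "chrY") is an assumption.
AUTOSOMES = {f"chr{i}" for i in range(1, 23)}


def reference_copy_number(chrom: str, sex: str) -> int:
    """Return the normal (reference) copy number for `chrom`."""
    if sex not in ("male", "female"):
        raise ValueError(f"unsupported sex label: {sex!r}")
    if chrom in AUTOSOMES:
        return 2  # autosomes: two copies in a normal sample
    if chrom == "chrX":
        return 2 if sex == "female" else 1
    if chrom == "chrY":
        if sex == "female":
            # The embodiment gives no reference value for this case.
            raise ValueError("no reference copy number for chrY in a female sample")
        return 1
    raise ValueError(f"unknown chromosome: {chrom!r}")
```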
In one embodiment, the DNA sample is isolated directly from a tissue, cell, or body fluid of the subject (e.g., blood, serum, plasma, urine, spinal fluid, amniotic fluid, lymph fluid, pleural fluid). In another embodiment, the DNA sample is derived from a tissue sample that has been specially processed (e.g., by fixation such as formalin fixation, paraffin embedding, laser microdissection, etc.).
In one embodiment, whole genome sequencing described herein can be performed by any sequencing technique known to those skilled in the art. For example, sequencing techniques that may be used in the present invention include, but are not limited to: sequencing by synthesis, single molecule sequencing, nanopore sequencing, semiconductor sequencing, sequencing by ligation, sequencing by hybridization, digital gene expression, single molecule sequencing by synthesis, massively parallel signature sequencing, clonal single molecule arrays, pyrosequencing, and any other sequencing technique known in the art. For example, sequencing of the invention can be performed on the following platforms: HiSeq 2000 (Illumina), 454 Genome Sequencer (Roche), SOLiD System (Applied Biosystems), Polonator G007 (Dover/Harvard), HeliScope (Helicos BioSciences), Ion Torrent (Thermo Fisher Scientific), GS FLX System (Roche), and any other high throughput sequencing platform known in the art.
In one embodiment, the sequencing of the present invention can be either single-end or paired-end. In single-end sequencing, the DNA sample is first fragmented into fragments of 200-500 bp, a primer sequence is ligated to one end of each DNA fragment, an adapter is then added at the end, and single-end reads are sequenced on the instrument. In paired-end sequencing, sequencing primer binding sites are added to the adapters at both ends when the DNA library is constructed; after the first round of sequencing is completed, the template strand of the first round is removed, and a paired-end module directs regeneration and in situ amplification of the complementary strand to reach the template amount needed for the second round of sequencing by synthesis. In a preferred embodiment, the present invention uses single-end sequencing, which achieves the desired sensitivity while saving sequencing time and cost. In another preferred embodiment, the sequencing of the invention uses short-read sequencing, such as 36 bp or 75 bp sequencing, which also meets the sensitivity requirements of the invention and, compared with long-read sequencing, saves sequencing time and cost. In general, the read length should not be less than 36 bp; shorter reads make the sequencing result too noisy for sequences that match the reference genome uniquely and completely to be extracted accurately during alignment.
In one embodiment, the sequencing of the invention is performed on a sequencing library prepared from said DNA sample. Methods of preparing sequencing libraries are known to those skilled in the art. Generally, the sequencing library preparation method comprises the following steps: performing end repair on the DNA sample, adding an A to the ends of the repaired DNA, ligating sequencing adapters, and then performing a PCR amplification reaction and purifying the amplification product to obtain the sequencing library. The preparation of the sequencing library can be modified and adapted by the skilled person according to the DNA sample. In addition, the preparation of sequencing libraries can also be performed using library preparation kits known to those skilled in the art. Such library preparation kits include, but are not limited to: ChroSure kit (Berry Genomics), konnaean kit (Berry Genomics), KAPA library construction kit (Illumina), Nextera DNA library preparation kit (Illumina), and the like.
In the present invention, "human reference genome" refers to a standard human reference genome sequence in the NCBI database, which may be, for example, hg18 (NCBI Build 36) or hg19 (NCBI Build 37). The sequence of the human reference genome can be obtained from the genome databases of NCBI, Ensembl, and UCSC.
In one embodiment, alignment of the sequencing result sequence to the human reference genome can be performed using algorithms or software known to those skilled in the art. Examples of such algorithms or software include, but are not limited to: BLAST, BLAT, MAQ, SOAP, Bowtie, BWA, SSAHA, ELAND.
In one embodiment, the calculation of the coordinate position of the CNV in the chromosome segment and the copy number of the CNV in the segment in the step (2) above can be performed by any method known in the art, for example, the methods described in: CNV-seq, a new method to detect copy number variation using high-throughput sequencing (Xie et al., BMC Bioinformatics, 2009, 10(1): 80); CNVnator: An approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing (Abyzov et al., Genome Res, 2011, 21(6): 974); ReadDepth: a parallel R package for detecting copy number alterations from short sequencing reads (Miller et al., PLoS One, 2011, 6(1): e16327); cn.MOPS: mixture of Poissons for discovering copy number variations in next-generation sequencing data with a low false discovery rate (Klambauer et al., Nucleic Acids Res, 2012, 40(9): e69). In a specific embodiment, the step (2) further comprises the steps of:
(2.1) comparing the sequencing result sequence with a human reference genome, and extracting a sequence which is completely matched with the human reference genome in the sequencing result sequence and is aligned to a unique position in the human reference genome;
(2.2) dividing the human reference genome into a plurality of continuous regions (bins), and counting the sequence number Q value in each region obtained by the sequence alignment in the step (2.1);
(2.3) correcting and standardizing the sequence number Q value obtained in the step (2.2) to obtain a standard sequence number P value in each area;
(2.4) dividing the number of standard sequences P obtained in step (2.3) by the median of the number of standard sequences obtained from normal samples to obtain the fd value in each region;
(2.5) calculating the coordinate position of the CNV in the chromosome segment and the copy number CN_detected of the segment by a mathematical model based on the fd values of all the regions.
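By way of non-limiting illustration, steps (2.1) and (2.2) above can be sketched as a filter followed by a per-bin count. The record type `AlignedRead`, its fields, and the function `count_reads_per_bin` are hypothetical names introduced for this sketch; in practice the mismatch count and the number of alignment positions would be taken from the output of the chosen aligner.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Tuple


class AlignedRead(NamedTuple):
    """Hypothetical, simplified alignment record."""
    chrom: str
    start: int       # 0-based leftmost aligned position
    mismatches: int  # mismatches/gaps against the reference
    n_hits: int      # number of reference positions the read aligns to


def count_reads_per_bin(
    reads: Iterable[AlignedRead], bin_size: int = 20_000
) -> "Counter[Tuple[str, int]]":
    """Count perfectly and uniquely aligned reads in consecutive bins.

    Returns a Counter keyed by (chromosome, bin index); the count is the
    Q value of step (2.2).
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    counts: "Counter[Tuple[str, int]]" = Counter()
    for read in reads:
        # Step (2.1): keep only complete, unique matches.
        if read.mismatches == 0 and read.n_hits == 1:
            counts[(read.chrom, read.start // bin_size)] += 1
    return counts
```

The default bin size of 20 kb is only one of the region sizes mentioned in this description and can be changed through `bin_size`.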
In one embodiment, the size of the regions (bins) is uniform or substantially uniform. One skilled in the art can adjust the size of the region based on the resolution and amount of sequencing data required for CNV detection. In the present invention, the size of the region is, for example, about 10kb, 20kb, 30kb, 40kb, 50kb, 60kb, 70kb, 80kb, 90kb or 100 kb.
In the above step (2.3), in order to reduce the variation in the number of sequences and to improve the detection sensitivity, the sequence number Q value needs to be corrected, mainly for GC bias. GC bias leads to uneven representation of GC-rich or GC-poor regions in the sequence reads. Any method of GC correction known in the art can be used in the present invention. Examples of GC correction methods include the loess GC correction method and the region-based GC correction method.
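By way of non-limiting illustration, one simple form of GC correction is sketched below. It is a median scaling per GC stratum, used here as a stand-in; it is not the loess method and is not asserted to be the correction used by the inventors. The function name `gc_correct`, the stratum width, and the assumption that the GC fraction of each region has been precomputed from the reference genome are all hypothetical.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List, Sequence


def gc_correct(
    counts: Sequence[float],
    gc_fractions: Sequence[float],
    stratum_width: float = 0.01,
) -> List[float]:
    """Scale each region's Q value by (global median / median of its GC stratum)."""
    if len(counts) != len(gc_fractions):
        raise ValueError("counts and gc_fractions must have the same length")
    if not counts:
        return []
    # Group Q values by GC stratum (regions with a similar GC fraction).
    strata: Dict[int, List[float]] = defaultdict(list)
    keys = [round(gc / stratum_width) for gc in gc_fractions]
    for key, q in zip(keys, counts):
        strata[key].append(q)
    stratum_median = {key: median(values) for key, values in strata.items()}
    global_median = median(counts)
    corrected = []
    for key, q in zip(keys, counts):
        m = stratum_median[key]
        # A stratum with a zero median carries no usable signal.
        corrected.append(q * global_median / m if m > 0 else 0.0)
    return corrected
```

After this scaling, regions that differ only in GC content receive comparable corrected Q values.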
In the above step (2.3), so that the DNA sample has the same data amount as the normal sample and the two can be compared directly, the sequence number Q value needs to be normalized to obtain the standard sequence number P value in each region. In one embodiment, the corrected Q value is normalized by the following equation:
$$P = \frac{Q_{\text{corrected}}}{\text{total number of sequencing result sequences}} \times \left(5 \times 10^{6}\right)$$
wherein 5×10^6 is the minimum amount of sequencing data required by the present invention.
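By way of non-limiting illustration, the normalization equation above can be written as a short function. The names `normalize_counts` and `TARGET_READS` are hypothetical.

```python
from typing import List, Sequence

# 5 x 10^6, the minimum amount of sequencing data stated in the description.
TARGET_READS = 5_000_000


def normalize_counts(
    corrected_counts: Sequence[float],
    total_reads: int,
    target: int = TARGET_READS,
) -> List[float]:
    """Return P = Q_corrected / total_reads * target for every region."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    # Multiplying before dividing is algebraically the same as the equation.
    return [q * target / total_reads for q in corrected_counts]
```

For a sample with 10,000,000 sequencing result sequences, a region with a corrected Q value of 100 receives a P value of 50.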
In another embodiment, the normalization step may also be performed using one or more of algorithms known in the art, such as hidden Markov, dynamic programming, support vector machines, Bayesian or probabilistic modeling, trellis decoding, Viterbi decoding, expectation maximization, Kalman filtering, or neural networks.
In step (2.4) of the present invention, the fd value in each region is obtained by dividing the standard sequence number P value by the median of the standard sequence numbers obtained from normal samples. The standard sequence number of a normal sample can be obtained according to any method known in the art, for example, the method described in the above steps (2.1) to (2.3). In addition, methods for calculating the median value from the standard sequence numbers of all regions of a normal sample are well known to those skilled in the art.
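By way of non-limiting illustration, step (2.4) can be sketched as follows. The sketch assumes one possible reading of the text, namely that the median is taken region by region across a panel of normal samples; a single median over all regions of a normal sample would be an equally plausible reading and would replace the per-region median with one scalar. The function name `fd_per_region` is hypothetical.

```python
import math
from statistics import median
from typing import List, Sequence


def fd_per_region(
    sample_p: Sequence[float],
    normal_p_panel: Sequence[Sequence[float]],
) -> List[float]:
    """Divide each region's P value by the median P of that region in normal samples.

    sample_p       : P values of the test sample, one per region
    normal_p_panel : one sequence of P values per normal sample (same regions)
    """
    if not normal_p_panel:
        raise ValueError("at least one normal sample is required")
    if any(len(normal) != len(sample_p) for normal in normal_p_panel):
        raise ValueError("all samples must cover the same regions")
    fd = []
    for i, p in enumerate(sample_p):
        reference = median(normal[i] for normal in normal_p_panel)
        # Regions with no reads in normal samples cannot be evaluated.
        fd.append(p / reference if reference > 0 else math.nan)
    return fd
```

Under this reading, an fd value close to 1 indicates a region whose read depth matches that of the normal samples.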
In step (2.5) of the present invention, the coordinate position of the CNV in the chromosome segment and the copy number CN_detected of the segment are calculated by a mathematical model based on the fd values of all regions. Mathematical models known in the art that can calculate the coordinate position of the CNV and its copy number can be used in the present invention. Examples of such mathematical models include, but are not limited to, the Fused Lasso algorithm and the CBS algorithm.
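By way of non-limiting illustration, the kind of output produced by step (2.5) can be shown with a deliberately naive threshold-and-merge procedure. This is a stand-in for illustration only and is neither the Fused Lasso algorithm nor the CBS algorithm. The names `Segment` and `call_segments`, the tolerance `tol`, the minimum run length `min_bins`, and the conversion CN_detected = CN_reference × mean fd are all assumptions of the sketch.

```python
from statistics import mean
from typing import List, NamedTuple, Sequence


class Segment(NamedTuple):
    start: int          # start coordinate on the chromosome (bp)
    end: int            # end coordinate on the chromosome (bp, exclusive)
    copy_number: float  # estimated CN_detected


def call_segments(
    fd: Sequence[float],
    bin_size: int,
    cn_reference: float = 2,
    tol: float = 0.2,
    min_bins: int = 5,
) -> List[Segment]:
    """Merge consecutive bins of one chromosome that deviate in the same direction."""

    def state(value: float) -> int:
        if value > 1 + tol:
            return 1    # gain
        if value < 1 - tol:
            return -1   # loss
        return 0        # neutral

    segments: List[Segment] = []
    run_start = 0
    for i in range(1, len(fd) + 1):
        run_ended = i == len(fd) or state(fd[i]) != state(fd[run_start])
        if run_ended:
            if state(fd[run_start]) != 0 and i - run_start >= min_bins:
                cn = cn_reference * mean(fd[run_start:i])  # assumed conversion
                segments.append(Segment(run_start * bin_size, i * bin_size, cn))
            run_start = i
    return segments
```

With 20 kb bins, the default `min_bins` of 5 keeps only runs of at least 100 kb.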
Based on the coordinate position of the CNV in the chromosome segment, the length of the CNV can be calculated. In the present invention, the comprehensive CNV score of a DNA sample is calculated by the following formula:
$$\text{CNV score} = \sum_{i} \frac{\text{CNV length}_i}{\text{chromosome length}_i} \times \left| CN_{\text{detected},i} - CN_{\text{reference},i} \right|$$
wherein the sum runs over all detected CNVs, CNV length represents the length of the CNV calculated from the coordinate position of the CNV in the chromosome segment, chromosome length represents the length of the chromosome on which the CNV is located, and CN_reference represents the chromosomal copy number in a normal sample,
wherein for autosomes and female X chromosomes, CN_reference is 2; for male X and Y chromosomes, CN_reference is 1,
wherein the CNV score represents genome-wide copy number variation of the DNA sample.
The CNV score calculation method is exemplified as follows. For one particular sample, two CNVs were detected: one CNV is located at position chr9:100000-24300000 of chromosome 9 (length 119040000 bp), with a copy number of 1.1; the other CNV is located at position chr16:15540000-31780000 of chromosome 16 (length 78700000 bp), with a copy number of 2.97; CN_reference is 2. The CNV score for this sample is then: [(24300000-100000)/119040000]×|1.1-2|+[(31780000-15540000)/78700000]×|2.97-2|=0.38.
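By way of non-limiting illustration, the worked example above can be reproduced with a few lines of code. The record type `CNV` and the function `cnv_score` are hypothetical names; the coordinates, chromosome lengths, and copy numbers are those given in the example.

```python
from typing import Iterable, NamedTuple


class CNV(NamedTuple):
    start: int               # start coordinate of the CNV (bp)
    end: int                 # end coordinate of the CNV (bp)
    chrom_length: int        # chromosome length used in the example (bp)
    cn_detected: float       # detected copy number
    cn_reference: float = 2  # copy number in a normal sample


def cnv_score(cnvs: Iterable[CNV]) -> float:
    """Sum of (CNV length / chromosome length) * |CN_detected - CN_reference|."""
    return sum(
        (c.end - c.start) / c.chrom_length * abs(c.cn_detected - c.cn_reference)
        for c in cnvs
    )


example = [
    CNV(100_000, 24_300_000, 119_040_000, 1.1),     # chr9 CNV
    CNV(15_540_000, 31_780_000, 78_700_000, 2.97),  # chr16 CNV
]
print(round(cnv_score(example), 2))  # 0.38
```

The chromosome 9 term contributes about 0.18 and the chromosome 16 term about 0.20, which together round to the stated score of 0.38.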
In one embodiment, the CNV detected by the present invention is 100kb or more in length.
The method for detecting whole genome CNV of the present invention objectively and quantitatively reflects the degree of genome disorder of a sample. One advantage of the present invention is that the CNV score includes in its statistics both the length of each CNV and its degree of variation (i.e., the degree to which the CNV copy number deviates from the normal copy number, expressed in the present invention as |CN_detected - CN_reference|), and so reflects the degree of genome disorder of the sample more comprehensively. In addition, the CNV detection method of the present invention counts all CNVs that deviate from the normal copy number (including mosaic (chimeric) CNVs), so that for cancer samples with highly heterogeneous genetic variation it reflects the degree of genome disorder more fully. Finally, the invention has high sensitivity and specificity. Even for sequencing data obtained after amplification of a sample with a small starting amount (e.g., a microdissected sample), satisfactory sensitivity and specificity can still be achieved.
In a second aspect, the present invention provides an apparatus for diagnosing a cancer condition in a subject, comprising:
a memory configured to store one or more programs;
a processing unit coupled to the memory and configured to execute the one or more programs to cause the apparatus to perform a plurality of actions, the actions comprising:
(1) obtaining a plurality of CNV scores of a plurality of DNA samples by the detection method of the whole genome CNV;
(2) determining a threshold value from the plurality of CNV scores;
(3) obtaining the CNV score of the subject by the detection method of the whole genome CNV, and comparing the CNV score with the threshold value to obtain a diagnosis result.
In a third aspect, the present invention provides a method for diagnosing a cancer condition in a subject, comprising:
(1) obtaining a plurality of CNV scores of a plurality of DNA samples by the detection method of the whole genome CNV;
(2) determining a threshold value from the plurality of CNV scores;
(3) obtaining the CNV score of the subject by the detection method of the whole genome CNV, and comparing the CNV score with the threshold value to obtain a diagnosis result.
In one embodiment, the plurality of DNA samples are from a plurality of individuals. In one embodiment, the plurality of DNA samples comprises a cancer sample and a non-cancer sample. In another embodiment, the plurality of DNA samples comprises a cancer sample of low invasiveness and a cancer sample of high invasiveness. In yet another embodiment, the plurality of DNA samples comprises cancer samples of histological subtype grade I, II, and/or III. One skilled in the art can select the plurality of DNA samples used to determine the threshold from CNV scores according to the desired diagnostic result. In these embodiments, the threshold determined from the CNV scores can distinguish between cancer and non-cancer samples, as well as between degrees of invasiveness and histological subtypes of cancer samples. Accordingly, the diagnosis result is whether the subject has cancer and, if so, the degree of invasiveness of the cancer.
In addition, cancer is clinically classified into grades I, II, and III according to the major histological subtype classification system. For example, in lung cancer, grade I mainly refers to in situ lung cancer, minimally invasive lung cancer, or lung cancer of the lepidic (adherent) type; grade II mainly refers to invasive lung cancer of, for example, the acinar or papillary type; and grade III mainly refers to invasive lung cancer of, for example, the solid or micropapillary type. The prognosis differs significantly between grades, with grade III patients having the worst prognosis. Therefore, the method of the invention can determine the histological subtype grade of a subject in advance, which is of great significance for clinical treatment and prognosis. Thus, the threshold determined from the CNV scores may also distinguish between histological subtype grades, such as grade I/II and grade III. Accordingly, the diagnosis result is the histological subtype grade of the subject.
In one embodiment, the cancer is selected from the following: brain cancer, skin cancer, breast cancer, lung cancer, liver cancer, esophageal cancer, prostate cancer, colorectal cancer, nasopharyngeal cancer, gastric cancer, pancreatic cancer, renal cancer, bladder cancer, lymphoma, cervical cancer, uterine cancer, ovarian cancer, vaginal cancer, and thyroid cancer.
In a preferred embodiment, the cancer is lung cancer. More specifically, the cancer includes: in situ lung cancer (AIS), minimally invasive lung cancer (MIA), invasive lung cancer (IA), malignant pleural effusion (MPE), and metastatic nodules (MN). Among them, in situ lung cancer and minimally invasive lung cancer are of low invasiveness, and invasive lung cancer is of high invasiveness. Malignant pleural effusions and metastatic nodules are generally not evaluated for invasiveness because they have already metastasized from lung cells. In one embodiment, invasive lung cancer may also include the lepidic (adherent), acinar, papillary, and solid subtypes.
In one embodiment, the threshold may be determined by various statistical methods known to those skilled in the art, such as by the receiver operating characteristic (ROC) curve. In the present invention, the threshold is the CNV score at which sensitivity and specificity are jointly at their maximum.
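By way of non-limiting illustration, threshold determination and the subsequent comparison can be sketched as follows. The sketch assumes that "sensitivity and specificity at their maximum" is read as maximizing their sum (Youden's J statistic), which is one common way to pick an operating point on a ROC curve and is not asserted to be the inventors' criterion. The function names `youden_threshold` and `diagnose`, and the rule that a score at or above the threshold is called positive, are hypothetical.

```python
from typing import Sequence


def youden_threshold(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Return the CNV score that maximizes sensitivity + specificity - 1.

    labels: 1 for a cancer sample, 0 for a non-cancer sample.
    A sample is called positive when its score is >= the threshold.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes are required")
    best_j, best_t = -1.0, scores[0]
    for t in sorted(set(scores)):  # every observed score is a candidate
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:  # ties keep the lowest candidate
            best_j, best_t = j, t
    return best_t


def diagnose(score: float, threshold: float) -> str:
    """Compare a subject's CNV score with the threshold."""
    return "positive" if score >= threshold else "negative"
```

As a toy usage example with made-up scores (not results from this description), `youden_threshold([0.01, 0.02, 0.05, 0.30, 0.38, 0.60], [0, 0, 0, 1, 1, 1])` selects 0.30, and a subject with a CNV score of 0.38 would then be reported as positive.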
The use of the above-described whole genome CNV detection method of the present invention in diagnosing a cancer condition in a subject relies on the following facts found by the inventors: the CNV score obtained by the method can accurately distinguish cancer samples from non-cancer samples; and the CNV score obtained by the method correlates positively with the degree of invasiveness and the histological subtype grade of a cancer sample, and classifies samples of different invasiveness and histological subtype grade well. Applying the method in clinical diagnosis can therefore markedly improve the accuracy of cancer diagnosis.
In a fourth aspect, the present invention provides a computer readable storage medium having stored thereon machine executable instructions which, when executed, cause a machine to perform the steps of the method for detection of whole genome CNVs according to the present invention.
In a fifth aspect, the present invention provides a computer readable storage medium having stored thereon machine executable instructions which, when executed, cause a machine to perform the steps of the method of:
(1) obtaining a plurality of CNV scores of a plurality of DNA samples by the detection method of the whole genome CNV;
(2) determining a threshold value from the plurality of CNV scores;
(3) obtaining the CNV score of the subject by the detection method of the whole genome CNV, and comparing the CNV score with the threshold value to obtain a diagnosis result.
In the present invention, a computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium includes, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanically encoded device, such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The machine-executable instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives machine-executable instructions from the network and forwards the machine-executable instructions for storage in a computer-readable storage medium in the respective computing/processing device.
Machine executable instructions for performing the operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The machine-executable instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, by utilizing state information of machine-executable instructions to personalize an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), the electronic circuit can execute the machine-executable instructions to implement aspects of the present disclosure.
These machine-executable instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions of various aspects of the present invention. These machine-executable instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the functions of the various aspects of the present invention.
The machine-executable instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer-implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions of aspects of the present invention.
The invention will be further elucidated with reference to the drawings and examples.