CN111028888A

CN111028888A - A genome-wide copy number variation detection method and its application

Info

Publication number: CN111028888A
Application number: CN201811172904.7A
Authority: CN
Inventors: 刘弼; 李小雨; 宫玉艳; 武旺; 张钰; 王征; 刘东戈
Original assignee: Berry Genomics Co Ltd; Beijing Hospital
Current assignee: Berry Genomics Co Ltd; Beijing Hospital
Priority date: 2018-10-09
Filing date: 2018-10-09
Publication date: 2020-04-17
Anticipated expiration: 2038-10-09
Also published as: CN111028888B

Abstract

The present invention relates to a method for detecting genome-wide copy number variation (CNV), which comprises the following steps: (1) obtaining the sequencing result sequences of whole genome sequencing of a DNA sample; (2) aligning the sequencing result sequences to the human reference genome, and calculating the coordinate position of each CNV in its chromosome segment and the copy number CN_{detected} of the CNV in that segment; (3) calculating a composite CNV score for the DNA sample by the following formula:

CNV score = Σ [(CNV length / chromosome length) × |CN_{detected} - CN_{reference}|]

wherein CNV length denotes the length of the CNV calculated from the coordinate position of the CNV in the chromosome segment, CN_{reference} denotes the chromosome copy number in a normal sample, and the CNV score represents the genome-wide copy number variation of the DNA sample. The invention also relates to the use of the detection method in cancer diagnosis, and to a device for diagnosing a cancer condition.

Description

Detection method of genome-wide copy number variation and application thereof

Technical Field

The invention relates to a method for detecting genome-wide copy number variation. The invention also relates to the use of the detection method in cancer diagnosis, and to a device for diagnosing cancer based on the detection method.

Background

Copy number variation (CNV) refers to complex chromosomal structural variation, relative to a genome reference sequence, that arises from the insertion, deletion, inversion, translocation and/or duplication of DNA fragments of 1 kb or more, or from combinations of these events. CNVs are widely distributed, heritable, relatively stable, and highly heterogeneous.

Studies have shown that CNV is an important factor in tumorigenesis and can induce tumors by affecting the activity of proto-oncogenes and tumor suppressor genes. Several studies have shown that CNV detection has potential as an indicator for tumor diagnosis. For example, in 2005 the U.S. FDA approved four FISH probes from Abbott Molecular for CNV detection, for diagnosing bladder cancer in patients with hematuria and for monitoring bladder cancer patients, with 68.6% sensitivity and 77.7% specificity. The study by Ayal et al found a general decrease in copy number (del) of 11 whole arms or chromosomes in atypical meningiomas, with the total number of these 11 CNVs in a patient sample tending to correlate positively with the clinical stage of the patient (Kadota K et al, Am J Surg Pathol.2014; 38(4): 448-. A whole genome CNV study of circulating tumor cells (CTCs) by Ni et al found that all CTCs from the same patient had similar CNV patterns (regardless of cancer subtype), whereas CTC samples from small cell lung cancer and from adenocarcinoma showed significantly different CNV patterns (Xiaohui Ni et al, PNAS.2013; 110(52): 21083-21088). Bowcock et al used gene chip technology to detect CNVs in 18 lung cancer patients and 47 controls, selected three CNV segments common in lung cancer samples, 3p26-p11.1 (deletion), 3q26.2-29 (gain) and 6q25.3-24.3 (deletion), as tumor markers, and used these markers to assess 24 controls of bronchopulmonary dysplasia and 12 patients of bronchial epithelialization (all diagnosed as tumors at follow-up). The results indicated that these CNV markers classified as low risk all of the controls and 3 of the patients with bronchiolitis metaplasia who were diagnosed with tumors at follow-up, giving a tumor prediction accuracy of 92% and a negative predictive value of 89% for this population (Bowcock AM et al, Thorax.2014; 69(5): 495-.

Currently, the main methods for detecting CNV are fluorescence in situ hybridization (FISH), comparative genomic hybridization (CGH), microarray-based comparative genomic hybridization (Array-CGH), single nucleotide polymorphism microarray (SNP array), and next generation sequencing (NGS). In FISH, a labeled nucleic acid probe binds specifically to a chromosome in the cell, avidin carrying a fluorescent group then binds to the probe, and the state of the chromosome is observed from the fluorescence distribution. In CGH, the test DNA and the control DNA are labeled separately with fluorescent dyes of different colors, mixed in equal amounts, and hybridized to chromosomes; the fluorescence microscope images are then analyzed with dedicated software to detect chromosomal abnormalities. Array-CGH works on a similar principle, except that the equal-amount mixture of the two DNAs is hybridized not to chromosomes but to a microarray carrying cloned genomic fragments, which gives higher resolution than CGH. In SNP array technology, a microarray carrying SNP probes is hybridized with fragmented single-stranded genomic DNA to capture its SNP sites, from which the SNP status of the genome is analyzed.

FISH and CGH are limited in application because they depend on probes, which detect only part of the chromosomes and cannot reflect the entire genome. Although Array-CGH and SNP array have entered preliminary clinical use thanks to their speed, accuracy and high resolution, their high price and the difficulty of analysis caused by the excess of information generated during detection still limit their wider adoption to some extent.

With the continuous development of sequencing technology, NGS has gradually become the prevailing approach to whole genome analysis. The overall CNV status of a genome can be determined by performing high-throughput, massively parallel sequencing on the whole genome amplification product of a sample and statistically analyzing the sequencing results. Unlike FISH and CGH, NGS is not limited by probes and can therefore reflect all chromosomal events. In addition, because NGS can sequence many samples simultaneously, it costs less than Array-CGH and SNP array while remaining rapid, accurate and high in resolution.

However, when NGS-based CNV results are currently used to diagnose tumors, tumor risk is assessed mainly by detecting the presence or absence of certain specific CNV markers. Because tumors are highly heterogeneous, it is in practice difficult to diagnose them effectively using only a few CNVs as markers. In addition, current whole genome CNV analyses consider few parameters, generally only CNV length, so their results cannot comprehensively reflect the degree of genomic disorder of a sample.

Therefore, in order to improve efficiency and accuracy, there is still a need in the art for a universal, simple, economical, and quantifiable method for detecting CNV, which is more advantageous for the application and popularization of CNV in tumor diagnosis.

Disclosure of Invention

The object of the present invention is to provide a method for detecting whole genome CNV, the use of the method for cancer diagnosis, and a device for diagnosing a cancer condition. When the method is used to statistically analyze whole genome CNV, it takes into account both the length of each CNV and the degree to which its copy number deviates from the normal copy number. The degree of genomic disorder of a sample is thereby reflected more accurately and comprehensively, the limitations of analyzing CNVs only in specific regions are avoided, and the low diagnostic accuracy caused by the high heterogeneity of cancer samples is overcome.

Accordingly, in a first aspect, the present invention provides a method for detecting genome-wide Copy Number Variation (CNV), comprising the steps of:

(1) obtaining a sequencing result sequence of whole genome sequencing of the DNA sample;

(2) aligning the sequencing result sequence to a human reference genome, and calculating the coordinate position of the CNV in the chromosome segment and the copy number CN_{detected} of the CNV in the segment;

(3) calculating a composite CNV score for the DNA sample by the following formula:

CNV score = Σ [(CNV length / chromosome length) × |CN_{detected} - CN_{reference}|]

wherein CNV length represents the length of the CNV calculated from the coordinate position of the CNV in the chromosome segment, chromosome length is the length of the chromosome carrying the CNV, the sum runs over all detected CNVs, and CN_{reference} represents the chromosome copy number in a normal sample, wherein the CNV score represents the genome-wide copy number variation of the DNA sample.

In one embodiment, for autosomes and female X chromosomes, CN_{reference} is 2; for male X and Y chromosomes, CN_{reference} is 1.
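The reference copy number rule above can be written as a small lookup. The following is a minimal Python sketch; the function name `reference_copy_number`, the `"male"`/`"female"` labels, and the decision to reject a Y chromosome in a female sample are illustrative assumptions, not part of the described method.

```python
def reference_copy_number(chromosome: str, sex: str) -> int:
    """Return CN_reference for a chromosome: 2 for autosomes and female X, 1 for male X and Y."""
    if sex not in ("male", "female"):
        raise ValueError("sex must be 'male' or 'female'")
    name = chromosome.lower()
    if name.startswith("chr"):
        name = name[3:]  # accept both 'chr9' and '9'
    if name in ("x", "y"):
        if sex == "male":
            return 1
        if name == "y":
            # Assumption: the text gives no reference value for Y in a female sample.
            raise ValueError("no reference copy number defined for Y in a female sample")
        return 2
    return 2  # autosomes


print(reference_copy_number("chr9", "female"), reference_copy_number("chrX", "male"))
```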

In one embodiment, the DNA sample is isolated directly from a tissue, cell, or fluid of the subject (e.g., blood, serum, plasma, urine, spinal fluid, amniotic fluid, lymph fluid, pleural fluid). In another embodiment, the DNA sample is derived from a tissue sample that has been specially processed (e.g., by formalin fixation, paraffin embedding, laser microdissection, etc.).

In one embodiment, the whole genome sequencing described herein can be performed by any sequencing technique known to those skilled in the art. For example, sequencing techniques that may be used in the present invention include, but are not limited to: sequencing by synthesis, single molecule sequencing, nanopore sequencing, semiconductor sequencing, sequencing by ligation, sequencing by hybridization, digital gene expression, single molecule sequencing by synthesis, massively parallel signature sequencing, clonal single molecule arrays, pyrosequencing, and any other sequencing technique known in the art. For example, sequencing of the invention can be performed on the following platforms: HiSeq 2000 (Illumina), 454 Genome Sequencer (Roche), SOLiD System (Applied Biosystems), Polonator G007 (Dover/Harvard), HeliScope (Helicos BioSciences), Ion Torrent (Thermo Life), GS FLX System (Roche), and any other high throughput sequencing platform known in the art.

In one embodiment, the sequencing of the present invention can be either single-end or paired-end. In single-end sequencing, the DNA sample is first fragmented into fragments of 200-500 bp, a primer sequence is ligated to one end of each fragment and an adapter is added to the other end, and single-end reads are then sequenced on the instrument. In paired-end sequencing, sequencing primer binding sites are added to the adapters at both ends when the DNA library is constructed; after the first round of sequencing is completed, the template strand of the first round is removed, and the paired-end module directs regeneration and in situ amplification of the complementary strand to reach the amount of template needed for the second round of sequencing by synthesis. In a preferred embodiment, the present invention uses single-end sequencing, which achieves the desired sensitivity while saving sequencing time and cost. In another preferred embodiment, the sequencing of the invention uses short reads, such as 36 bp or 75 bp reads, which also meet the sensitivity requirements of the invention; compared with long-read sequencing, short-read sequencing saves sequencing time and cost. In general, the read length should not be less than 36 bp; shorter reads make the sequencing result too noisy for sequences that match the reference genome uniquely and perfectly to be extracted accurately during alignment.

In one embodiment, the sequencing of the invention is performed on a sequencing library prepared from said DNA sample. Methods of preparing sequencing libraries are known to those skilled in the art. Generally, sequencing library preparation comprises the following steps: performing end repair on the DNA sample, adding an A to the ends of the repaired DNA, ligating sequencing adapters, and then performing PCR amplification and purifying the amplification product to obtain the sequencing library. The skilled person can modify and adapt the library preparation to suit different DNA samples. In addition, sequencing libraries can be prepared using library preparation kits known to those skilled in the art. Such kits include, but are not limited to: ChroSure kit (Berry Genomics), konnaean kit (Berry Genomics), KAPA library construction kit (Illumina), Nextera DNA library preparation kit (Illumina), and the like.

In the present invention, "human reference genome" refers to a standard human reference genome sequence in the NCBI database, for example hg18 (NCBI Build 36) or hg19 (NCBI Build 37). The sequence of the human reference genome can be obtained from the NCBI, Ensembl and UCSC databases.

In one embodiment, alignment of the sequencing result sequence to the human reference genome can be performed using algorithms or software known to those skilled in the art. Examples of such algorithms or software include, but are not limited to: BLAST, BLAT, MAQ, SOAP, Bowtie, BWA, SSAHA, ELAND.

In one embodiment, the calculation in step (2) above of the coordinate position of the CNV in the chromosome segment and the copy number of the CNV in the segment can be performed by any method known in the art, for example: CNV-seq, a new method to detect copy number variation using high-throughput sequencing (Xie et al, BMC Bioinformatics, 2009, 10(1): 80); CNVnator: An approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing (Abyzov et al, Genome Res, 2011, 21(6): 974); ReadDepth: a parallel R package for detecting copy number alterations from short sequencing reads (Miller et al, PLoS One, 2011, 6(1): e16327); cn.MOPS: mixture of Poissons for discovering copy number variations in next-generation sequencing data with a low false discovery rate (Klambauer et al, Nucleic Acids Res, 2012, 40(9): e69). In a specific embodiment, step (2) further comprises the following steps:

(2.1) comparing the sequencing result sequence with a human reference genome, and extracting a sequence which is completely matched with the human reference genome in the sequencing result sequence and is aligned to a unique position in the human reference genome;

(2.2) dividing the human reference genome into a plurality of continuous regions (bins), and counting the sequence number Q value in each region obtained by the sequence alignment in the step (2.1);

(2.3) correcting and standardizing the sequence number Q value obtained in the step (2.2) to obtain a standard sequence number P value in each area;

(2.4) dividing the number of standard sequences P obtained in step (2.3) by the median of the number of standard sequences obtained from normal samples to obtain the fd value in each region;

(2.5) calculating the coordinate position of the CNV in the chromosome segment and the copy number CN_{detected} of the segment by a mathematical model based on the fd values of all the regions.

In one embodiment, the size of the regions (bins) is uniform or substantially uniform. One skilled in the art can adjust the size of the region based on the resolution and amount of sequencing data required for CNV detection. In the present invention, the size of the region is, for example, about 10kb, 20kb, 30kb, 40kb, 50kb, 60kb, 70kb, 80kb, 90kb or 100 kb.
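Step (2.2), counting the Q value per region, can be sketched as follows. This is a minimal Python illustration that assumes the uniquely and perfectly aligned reads from step (2.1) are already available as (chromosome, 0-based start position) pairs; the function name `count_reads_per_bin` and the 20 kb default are illustrative choices, with 20 kb being one of the region sizes listed above.

```python
from collections import Counter


def count_reads_per_bin(read_positions, bin_size=20_000):
    """Count reads per fixed-size region (bin).

    read_positions: iterable of (chromosome, 0-based start) pairs for reads that
    aligned uniquely and perfectly to the reference genome.
    Returns a Counter keyed by (chromosome, bin index); each value is that bin's Q value.
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    counts = Counter()
    for chromosome, start in read_positions:
        counts[(chromosome, start // bin_size)] += 1
    return counts


# Hypothetical read positions, for illustration only.
reads = [("chr1", 5), ("chr1", 19_999), ("chr1", 20_000), ("chr2", 45_000)]
q_values = count_reads_per_bin(reads)
print(q_values[("chr1", 0)], q_values[("chr1", 1)], q_values[("chr2", 2)])
```

A smaller bin gives finer resolution but fewer reads per bin, which is why the text ties the region size to the amount of sequencing data.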

In the above step (2.3), the sequence number Q value needs to be corrected, mainly for GC bias, in order to reduce variation in the sequence counts and improve detection sensitivity. GC bias causes GC-rich or GC-poor regions to be unevenly represented in the sequence reads. Any GC correction method known in the art can be used in the present invention, for example the loess GC correction method or the area GC correction method.
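To illustrate what a GC correction does, the sketch below uses a simple stratum-median scheme: bins are grouped by GC fraction, and each bin's count is rescaled so that every GC stratum has the same median as the whole sample. This is a deliberately simplified stand-in, not the loess or area method named above, and the function name `gc_correct` and the 1% stratum width are assumptions.

```python
from collections import defaultdict
from statistics import median


def gc_correct(counts, gc_fractions, stratum_width=0.01):
    """Rescale per-bin counts so that every GC stratum shares the overall median.

    counts: Q value per bin; gc_fractions: GC fraction (0..1) of each bin.
    """
    if len(counts) != len(gc_fractions):
        raise ValueError("counts and gc_fractions must have the same length")
    overall = median(counts)
    strata = defaultdict(list)
    for count, gc in zip(counts, gc_fractions):
        strata[round(gc / stratum_width)].append(count)
    stratum_median = {key: median(values) for key, values in strata.items()}
    corrected = []
    for count, gc in zip(counts, gc_fractions):
        m = stratum_median[round(gc / stratum_width)]
        # Bins in a stratum with zero median carry no usable signal.
        corrected.append(count * overall / m if m > 0 else 0.0)
    return corrected


# Hypothetical counts: the GC-rich bins are over-represented by a factor of two.
corrected_q = gc_correct([100, 100, 200, 200], [0.40, 0.40, 0.60, 0.60])
print(corrected_q)
```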

In the above step (2.3), the sequence number Q value also needs to be normalized, so that the DNA sample and the normal sample have the same data volume and can be compared easily; this yields the standard sequence number P value in each region. In one embodiment, the corrected Q value is normalized by the following equation:

P = [Q_{corrected} / (total number of sequencing result sequences)] × (5 × 10⁶)

wherein 5 × 10⁶ is the minimum amount of sequencing data required by the present invention.
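The normalization equation above translates directly into code. The following Python sketch assumes the GC-corrected Q values are held in a list; the names `normalize_counts` and `TARGET_READS` are illustrative.

```python
TARGET_READS = 5_000_000  # the 5 x 10^6 scaling constant from the normalization equation


def normalize_counts(corrected_counts, total_reads, target_reads=TARGET_READS):
    """Compute P = Q_corrected / total_reads * target_reads for every bin."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    scale = target_reads / total_reads
    return [q * scale for q in corrected_counts]


# A hypothetical sample with 10 million reads is scaled down by a factor of two.
p_values = normalize_counts([40.0, 80.0], total_reads=10_000_000)
print(p_values)
```

After this step, samples sequenced to different depths are expressed on the same scale of 5 × 10⁶ reads.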

In another embodiment, the normalization step may also be performed using one or more of algorithms known in the art, such as hidden Markov, dynamic programming, support vector machines, Bayesian or probabilistic modeling, trellis decoding, Viterbi decoding, expectation maximization, Kalman filtering, or neural networks.

In step (2.4) of the present invention, the fd value in each region is obtained by dividing the standard sequence number P by the median of the standard sequence numbers obtained from normal samples. The standard sequence numbers of the normal samples can be obtained by any method known in the art, for example the method described in steps (2.1) to (2.3) above. Methods for calculating the median from the standard sequence numbers of all regions of a normal sample are well known to those skilled in the art.
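Step (2.4) can be sketched as a single division per region. The Python example below follows one reading of the text, in which the median is taken over the standard sequence numbers of all regions of a normal sample; taking a separate median for each region across several normal samples is another plausible reading. The function name `fd_values` and the input values are illustrative.

```python
from statistics import median


def fd_values(sample_p, normal_p):
    """Divide each region's P value by the median P value of a normal sample.

    sample_p: P value per region for the test sample.
    normal_p: P values of all regions of a normal sample.
    """
    reference = median(normal_p)
    if reference <= 0:
        raise ValueError("median of the normal sample must be positive")
    return [p / reference for p in sample_p]


# Hypothetical values: an fd near 1.0 indicates no deviation from the normal sample.
fd = fd_values([50.0, 100.0, 150.0], [90.0, 100.0, 110.0])
print(fd)
```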

In step (2.5) of the present invention, the coordinate position of the CNV in the chromosome segment and the copy number CN_{detected} of the segment are calculated by a mathematical model based on the fd values of all regions. Any mathematical model known in the art that can calculate the coordinate position of a CNV and its copy number can be used in the present invention. Examples of such mathematical models include, but are not limited to, the Fused Lasso algorithm and the CBS (circular binary segmentation) algorithm.

Based on the coordinate position of the CNV in the chromosome segment, the length of the CNV can be calculated. In the present invention, the composite CNV score of a DNA sample is calculated by the following formula:

CNV score = Σ [(CNV length / chromosome length) × |CN_{detected} - CN_{reference}|]

wherein CNV length represents the length of the CNV calculated from the coordinate position of the CNV in the chromosome segment, chromosome length is the length of the chromosome carrying the CNV, the sum runs over all detected CNVs, and CN_{reference} represents the chromosome copy number in a normal sample,

wherein for autosomes and female X chromosomes, CN_{reference} is 2; for male X and Y chromosomes, CN_{reference} is 1,

wherein the CNV score represents the genome-wide copy number variation of the DNA sample.

The CNV score calculation is illustrated as follows. For one particular sample, two CNVs were detected: one is located at chr9:100000-24300000 on chromosome 9 (chromosome length 119040000 bp), with a copy number of 1.1; the other is located at chr16:15540000-31780000 on chromosome 16 (chromosome length 78700000 bp), with a copy number of 2.97; CN_{reference} is 2. The CNV score for this sample is then: [(24300000-100000)/119040000]×|1.1-2| + [(31780000-15540000)/78700000]×|2.97-2| = 0.38.
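The worked example can be reproduced in a few lines. The Python sketch below reads the score off the arithmetic in the example, as the sum over CNVs of (CNV length / chromosome length) × |detected copy number - reference copy number|; the `CNV` class and the `cnv_score` function are illustrative names, and the chromosome lengths are the ones quoted in the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CNV:
    start: int                      # start coordinate of the CNV (bp)
    end: int                        # end coordinate of the CNV (bp)
    chromosome_length: int          # length of the chromosome carrying the CNV (bp)
    copy_number: float              # detected copy number of the segment
    reference_copy_number: int = 2  # 2 for autosomes and female X, 1 for male X and Y


def cnv_score(cnvs):
    """Sum the length-weighted copy number deviation over all detected CNVs."""
    return sum(
        (c.end - c.start) / c.chromosome_length
        * abs(c.copy_number - c.reference_copy_number)
        for c in cnvs
    )


example = [
    CNV(100_000, 24_300_000, 119_040_000, 1.1),     # chr9 segment from the example
    CNV(15_540_000, 31_780_000, 78_700_000, 2.97),  # chr16 segment from the example
]
print(round(cnv_score(example), 2))  # 0.38
```

Because each term is weighted by the fraction of the chromosome affected, a long CNV with a small copy number shift and a short CNV with a large shift can contribute comparably to the score.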

In one embodiment, the CNV detected by the present invention is 100kb or more in length.

The whole genome CNV detection method of the invention objectively and quantitatively reflects the degree of genomic disorder of a sample. One advantage of the invention is that the CNV score takes into account both the length of each CNV and its degree of variation (i.e., the degree to which the CNV copy number deviates from the normal copy number, expressed in the present invention as |CN_{detected} - CN_{reference}|), and therefore reflects the degree of genomic disorder of the sample more comprehensively. In addition, the method counts all CNVs that deviate from the normal copy number (including mosaic (chimeric) CNVs), so that for cancer samples with highly heterogeneous genetic variation it reflects the degree of genomic disorder more fully. Finally, the invention has high sensitivity and specificity: even for sequencing data obtained after amplification of a sample with a small starting amount (e.g., a microdissected sample), satisfactory sensitivity and specificity can still be achieved.

In a second aspect, the present invention provides an apparatus for diagnosing a cancer condition in a subject, comprising:

a memory configured to store one or more programs;

a processing unit coupled to the memory and configured to execute the one or more programs to cause the management system to perform a plurality of actions, the actions comprising:

(1) obtaining a plurality of CNV scores of a plurality of DNA samples by the detection method of the whole genome CNV;

(2) determining a threshold value from the plurality of CNV scores;

(3) obtaining the CNV score of the subject by the detection method of the whole genome CNV, and comparing the CNV score with the threshold value to obtain a diagnosis result.

In a third aspect, the present invention provides a method for diagnosing a cancer condition in a subject, comprising:

(1) obtaining a plurality of CNV scores of a plurality of DNA samples by the detection method of the whole genome CNV;

(2) determining a threshold value from the plurality of CNV scores;

(3) obtaining the CNV score of the subject by the detection method of the whole genome CNV, and comparing the CNV score with the threshold value to obtain a diagnosis result.

In one embodiment, the plurality of DNA samples are from a plurality of individuals. In one embodiment, the plurality of DNA samples comprises a cancer sample and a non-cancer sample. In another embodiment, the plurality of DNA samples comprises a low-invasive cancer sample and a high-invasive cancer sample. In yet another embodiment, the plurality of DNA samples comprises cancer samples of histological subtype grade I, II and/or III. One skilled in the art can select the DNA samples used to determine the threshold from CNV scores according to the desired diagnostic result. In these embodiments, the threshold determined from the CNV scores can distinguish cancer samples from non-cancer samples, and can distinguish the invasiveness and histological subtype grade of cancer samples. Accordingly, the diagnosis result is whether the subject has cancer and, if so, how invasive the cancer is.

In addition, cancer is clinically classified into grades I, II and III according to the major histological subtype classification system. For example, in lung cancer, grade I mainly refers to lung cancer in situ, micro-invasive lung cancer, or the lepidic (adherent) subtype; grade II mainly refers to invasive lung cancer of the acinar or papillary subtype; and grade III mainly refers to invasive lung cancer of the solid or micropapillary subtype. Prognosis differs significantly between grades, with grade III patients having the worst prognosis. The ability of the method of the invention to determine a subject's histological subtype grade in advance is therefore of great significance for clinical treatment and prognosis. Thus, the threshold determined from the CNV scores may also distinguish between histological subtype grades, such as grade I/II and grade III. Accordingly, the diagnosis result is the histological subtype grade of the subject.

In one embodiment, the cancer is selected from the following: brain cancer, skin cancer, breast cancer, lung cancer, liver cancer, esophageal cancer, prostatic cancer, colorectal cancer, nasopharyngeal cancer, gastric cancer, pancreatic cancer, renal cancer, bladder cancer, lymphoma, cervical cancer, uterine cancer, ovarian cancer, vaginal cancer, and thyroid cancer.

In a preferred embodiment, the cancer is lung cancer. More specifically, the cancer includes: lung cancer in situ (AIS), micro-invasive lung cancer (MIA), invasive lung cancer (IA), malignant pleural effusion (MPE), and metastatic nodules (MN). Of these, lung cancer in situ and micro-invasive lung cancer are low-invasive, and invasive lung cancer is high-invasive. Malignant pleural effusions and metastatic nodules are generally not assessed for invasiveness, because they have already metastasized from lung cells. In one embodiment, invasive lung cancer may further include the lepidic, acinar, papillary, and solid subtypes. In one embodiment, the threshold may be determined by various statistical methods known to those skilled in the art, for example from a receiver operating characteristic (ROC) curve. In the present invention, the threshold is the CNV score at which sensitivity and specificity are both at their maximum.
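One common way to pick such a threshold from an ROC analysis is to maximize Youden's J statistic (sensitivity + specificity - 1). The Python sketch below takes that approach as an assumption, since the text does not name a specific criterion; the function name `choose_threshold`, the rule that a sample is called positive when its score is at or above the threshold, and the example scores are all illustrative.

```python
def choose_threshold(scores, labels):
    """Return the CNV score cutoff that maximizes sensitivity + specificity - 1.

    scores: CNV score per sample; labels: True for the positive class (e.g. cancer).
    A sample is called positive when its score is >= the threshold.
    """
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    best_j, best_threshold = None, None
    for t in sorted(set(scores)):
        true_pos = sum(1 for s, y in zip(scores, labels) if y and s >= t)
        true_neg = sum(1 for s, y in zip(scores, labels) if not y and s < t)
        j = true_pos / positives + true_neg / negatives - 1
        if best_j is None or j > best_j:
            best_j, best_threshold = j, t
    return best_threshold


# Hypothetical CNV scores for three non-cancer and three cancer samples.
scores = [0.01, 0.02, 0.05, 0.30, 0.38, 0.60]
labels = [False, False, False, True, True, True]
threshold = choose_threshold(scores, labels)
print(threshold, 0.45 >= threshold)
```

The same function applies to the other comparisons described above (low-invasive versus high-invasive, grade I/II versus grade III) by changing which samples are labeled positive.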

The use of the above-described whole genome CNV detection method of the present invention in diagnosing a cancer condition in a subject relies on the following findings of the inventors: the CNV score obtained by the method accurately distinguishes cancer samples from non-cancer samples; and the CNV score is positively correlated with the invasiveness and the histological subtype grade of cancer samples, and classifies samples of different invasiveness and histological subtype grade well. Applied to clinical diagnosis, the method can therefore markedly improve the accuracy of cancer diagnosis.

In a fourth aspect, the present invention provides a computer readable storage medium having stored thereon machine executable instructions which, when executed, cause a machine to perform the steps of the method for detection of whole genome CNVs according to the present invention.

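In a fifth aspect, the present invention provides a computer readable storage medium having stored thereon machine executable instructions which, when executed, cause a machine to perform the steps of the following method:

(1) obtaining a plurality of CNV scores of a plurality of DNA samples by the detection method of the whole genome CNV;

(2) determining a threshold value from the plurality of CNV scores;

(3) obtaining the CNV score of the subject by the detection method of the whole genome CNV, and comparing the CNV score with the threshold value to obtain a diagnosis result.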

In the present invention, a computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium includes, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.

The machine-executable instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives machine-executable instructions from the network and forwards the machine-executable instructions for storage in a computer-readable storage medium in the respective computing/processing device.

Machine executable instructions for performing the operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The machine-executable instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, by utilizing state information of machine-executable instructions to personalize an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), the electronic circuit can execute the machine-executable instructions to implement aspects of the present disclosure.

These machine-executable instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions of various aspects of the present invention. These machine-executable instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the functions of the various aspects of the present invention.

The machine-executable instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer-implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions of aspects of the present invention.

The invention will be further elucidated with reference to the drawings and examples.

Drawings

FIG. 1: schematic representation of one embodiment of a whole genome CNV detection method according to the present invention.

FIG. 2A: ROC curve used in Example 2 to determine the threshold for distinguishing cancer samples from non-cancer samples.

FIG. 2B: ROC curve used in Example 2 to determine the threshold for distinguishing low-invasive cancer samples from high-invasive cancer samples.

FIG. 2C: ROC curve used in Example 2 to determine the threshold for distinguishing grade I/II cancer samples from grade III cancer samples.

Detailed Description

The whole genome CNV detection method of the present invention and its use for diagnosing cancer are illustrated below, taking lung cancer as an example.

Example 1: Determination of the whole genome CNV score of samples according to the detection method of the invention

1. Sample collection

Tumor tissue samples were obtained by surgery from patients at Beijing Hospital and an affiliated hospital of Zhengzhou University between January 2017 and March 2018. All patients had pathologically diagnosed lung cancer and had not received any treatment for lung cancer. Specifically, the patients had the following types of lung cancer: adenocarcinoma in situ (AIS), minimally invasive adenocarcinoma (MIA), invasive adenocarcinoma (IA), malignant pleural effusion (MPE), and metastatic nodules (MN). Among them, AIS and MIA are low-invasive lung cancers, and IA is a high-invasive lung cancer. Tumor tissue samples were obtained from a total of 71 patients, and paracancerous samples (i.e., non-cancerous samples) were also obtained from 15 of them. The tumor tissue samples and paracancerous samples obtained by surgery were fixed, sectioned, and HE stained. Tumor cell clusters were then precisely collected by laser microdissection into 200 µL EP tubes. Approximately 100 cells were excised from each sample.

2. Construction of sequencing libraries

2 µL of PBS was added to each 200 µL EP tube, and the tubes were centrifuged at 12,000 g for 5 min to collect the tumor and paracancerous cell clusters at the bottom of the tube. Sequencing libraries were then constructed using the Chrosure kit (Berry Genomics, product number R0005) according to the manufacturer's instructions.

3. Sequencing

The constructed sequencing libraries were sequenced on an Illumina HiSeq 2500 sequencer in 36 bp single-end (36SE) mode, with the target data volume set to >5M reads. The average number of sequencing result sequences actually obtained was 7.2 × 10⁶.

4. CNV score determination

4.1 Aligning the sequencing result sequences with the human reference genome hg19, and extracting those sequences that match the human reference genome completely and align to a unique position in it.

4.2 Dividing the human reference genome into 1.5 × 10⁵ contiguous 20 kb regions (bins), and counting the number Q of the sequences obtained in step 4.1 that align within each region. For example, for the 50th region, the Q value is 24.
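By way of non-limiting illustration, the filtering of step 4.1 and the counting of step 4.2 may be sketched as follows. The record layout `(chrom, start, mismatches, is_unique)` and the function name `count_reads_per_bin` are assumptions made for this sketch only; the disclosure does not specify an alignment tool or data format.

```python
from collections import Counter

BIN_SIZE = 20_000  # 20 kb regions (bins), as in step 4.2


def count_reads_per_bin(alignments, bin_size=BIN_SIZE):
    """Count Q per (chromosome, bin index).

    `alignments` is an iterable of hypothetical records
    (chrom, start, mismatches, is_unique), with 0-based start positions.
    Only perfectly matched, uniquely aligned reads are counted (step 4.1).
    """
    counts = Counter()
    for chrom, start, mismatches, is_unique in alignments:
        if mismatches == 0 and is_unique:
            counts[(chrom, start // bin_size)] += 1
    return counts


# Made-up alignment records, for illustration only.
example = [
    ("chr1", 980_500, 0, True),
    ("chr1", 985_000, 0, True),
    ("chr1", 990_000, 1, True),    # one mismatch: excluded
    ("chr1", 995_000, 0, False),   # not uniquely aligned: excluded
    ("chr1", 1_000_100, 0, True),
]
q = count_reads_per_bin(example)
print(q[("chr1", 49)], q[("chr1", 50)])
```

Integer division of the start coordinate by the bin size assigns each read to exactly one region, so a read is never counted twice.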

4.3 Performing GC bias correction on the sequence number Q values obtained in step 4.2 by the loess GC correction method. For example, the Q value of the 50th region was corrected to 25. The corrected values were then normalized by the following formula to obtain the standard sequence number P value within each region: P = [corrected Q value / (total number of sequencing result sequences)] × (5 × 10⁶). Thus, for the 50th region, the P value is: [25 / (7.2 × 10⁶)] × (5 × 10⁶) ≈ 17.
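As a minimal sketch of the normalization in step 4.3, the following reproduces the worked value for the 50th region. It assumes the loess GC correction has already been applied upstream; the function name `normalize` is hypothetical.

```python
TARGET_TOTAL = 5_000_000  # scaling constant 5 × 10⁶ from the formula in step 4.3


def normalize(q_corrected, total_reads, target_total=TARGET_TOTAL):
    """Standard sequence number P = (corrected Q / total reads) × 5 × 10⁶."""
    return q_corrected / total_reads * target_total


# 50th region: corrected Q = 25, total number of sequences = 7.2 × 10⁶.
p_50 = normalize(25, 7.2e6)
print(round(p_50))  # 17
```

Scaling every sample to the same nominal total of 5 × 10⁶ sequences makes P values comparable between samples sequenced to different depths.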

4.4 Dividing the standard sequence number P obtained in step 4.3 by the median of the standard sequence numbers obtained from normal samples to obtain the fd value in each region. By sequencing normal samples and following the above procedure, the median of the standard sequence numbers for normal samples was calculated to be 30. Thus, for the 50th region, the fd value is: 17/30 ≈ 0.57.
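The fd calculation of step 4.4 may be sketched as follows. The list of normal-sample P values is made up so that its median equals the value of 30 given above, and the function name `fold_difference` is an assumption of this sketch.

```python
from statistics import median


def fold_difference(p_values, normal_p_values):
    """Divide each P value by the median P value of the normal samples."""
    reference = median(normal_p_values)
    return [p / reference for p in p_values]


normal_p = [28, 30, 33]             # hypothetical normal-sample P values, median 30
fd = fold_difference([17], normal_p)
print(round(fd[0], 2))  # 0.57
```

An fd value near 1 indicates a region whose sequence count matches the normal reference, while values below or above 1 suggest a loss or a gain, respectively.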

4.5 Based on the fd values of all regions, calculating the coordinate position of each CNV in the chromosome segment and the copy number CN_detection of that segment by the Fused Lasso algorithm. Table 1 below shows the location and copy number information for all CNVs detected in one of the samples (sample No. 6).

Table 1: location and copy number information of all CNVs in sample No. 6.
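To illustrate how per-region fd values can be turned into segment coordinates and copy numbers, a deliberately simplified sketch is given below. It is not the Fused Lasso algorithm of step 4.5; it merely merges runs of consecutive regions whose fd deviates from 1. The tolerance `tol`, the conversion CN = reference copy number × mean fd, and the function name `call_segments` are assumptions of this sketch; the 100 kb minimum length follows claim 8.

```python
from itertools import groupby


def call_segments(fd, bin_size=20_000, ref_cn=2, tol=0.25, min_len=100_000):
    """Return (start, end, copy_number) for runs of gained or lost regions."""

    def state(value):
        # -1: loss, 0: neutral, +1: gain (tolerance is an assumed parameter)
        if value < 1 - tol:
            return -1
        if value > 1 + tol:
            return 1
        return 0

    segments = []
    index = 0
    for s, group in groupby(fd, key=state):
        values = list(group)
        start = index * bin_size
        end = (index + len(values)) * bin_size
        index += len(values)
        if s != 0 and end - start >= min_len:
            copy_number = ref_cn * sum(values) / len(values)
            segments.append((start, end, round(copy_number, 2)))
    return segments


# Made-up fd track: a 120 kb loss and a 40 kb gain (too short to report).
fd_track = [1.0] * 3 + [0.5] * 6 + [1.0] * 2 + [1.5] * 2 + [1.0]
segments = call_segments(fd_track)
print(segments)  # [(60000, 180000, 1.0)]
```

A real segmentation method such as Fused Lasso or CBS additionally smooths noise across regions, which this run-merging sketch does not attempt.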

4.6 calculate the CNV score for each sample by the following formula:

For example, for sample No. 6, the CNV score was calculated as 0.07.

Example 2: Diagnosis of lung cancer, degree of invasion, and cancer grade based on the whole genome CNV detection method of the invention

1. ROC curves were plotted from the CNV scores of the 71 tumor tissue samples and 15 paracancerous samples obtained in Example 1 to determine the thresholds (see FIGS. 2A-2C). As shown in FIG. 2A, when the threshold of the CNV score was set to 0.015, the accuracy of distinguishing cancer samples from normal samples was the highest, and the sensitivity and specificity were 93.3% and 97.2%, respectively. As shown in FIG. 2B, when the threshold of the CNV score was set to 1.940, the accuracy of distinguishing low-invasive cancer samples from high-invasive cancer samples was the highest, and the sensitivity and specificity were 82.6% and 92.9%, respectively. As shown in FIG. 2C, when the threshold of the CNV score was set to 2.550, the accuracy of distinguishing low-grade (grade I/II) cancer samples from high-grade (grade III) cancer samples was the highest, and the sensitivity and specificity were 70% and 90%, respectively.
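One way such a threshold could be selected is sketched below: each observed score is tried as a candidate cut-off and the one giving the highest accuracy is kept. The scores are made up for illustration, the rule that a sample is called positive when its score is at or above the threshold is an assumption, and `best_threshold` is a hypothetical name.

```python
def best_threshold(positive_scores, negative_scores):
    """Return (threshold, accuracy, sensitivity, specificity) with maximal accuracy.

    A sample is called positive when its score is >= threshold (assumed rule).
    Ties are resolved in favor of the lowest candidate threshold.
    """
    total = len(positive_scores) + len(negative_scores)
    best = None
    for t in sorted(set(positive_scores) | set(negative_scores)):
        true_pos = sum(s >= t for s in positive_scores)
        true_neg = sum(s < t for s in negative_scores)
        accuracy = (true_pos + true_neg) / total
        if best is None or accuracy > best[1]:
            best = (t,
                    accuracy,
                    true_pos / len(positive_scores),
                    true_neg / len(negative_scores))
    return best


# Made-up CNV scores, not data from the examples.
cancer = [0.02, 0.07, 0.5, 1.9, 2.6]
normal = [0.001, 0.004, 0.012, 0.03]
threshold, acc, sens, spec = best_threshold(cancer, normal)
print(threshold, sens, spec)  # 0.02 1.0 0.75
```

Each candidate threshold corresponds to one point on the ROC curve (sensitivity against 1 − specificity), so scanning the candidates is equivalent to walking along the curve.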

2. Verification test

In order to verify the diagnostic ability of the threshold determined according to the method of the present invention for lung cancer samples, the inventors performed tests on 20 samples.

CNV scores were obtained for 20 samples (including 15 lung cancer samples and 5 normal samples) according to the method described in Example 1. Each sample was judged to be a cancer or non-cancer sample according to the thresholds determined above, and each sample judged to be a cancer sample was further judged to be of high or low invasiveness. These judgments were then compared with the pathological diagnosis results, as shown in Table 2 below.

As shown in Table 2, in distinguishing cancer from non-cancer samples, the diagnostic results based on the threshold determined by the CNV detection method of the present invention were completely consistent with the pathological diagnosis results, an accuracy of 100%. In diagnosing the degree of invasiveness of the cancer, the results based on the CNV detection method of the present invention were inconsistent with the pathological diagnosis results for only 3 of the 15 cancer samples, an accuracy of 80%. In diagnosing the histological subtype grade of the cancer, the results were inconsistent with the pathological diagnosis results for only 2 of the 15 cancer samples, an accuracy of 86.7%.
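The threshold-based judgment and the accuracy figures of the verification test may be sketched as follows. The thresholds 0.015 and 1.940 are those determined in Example 2; the assumption that higher scores correspond to cancer and to higher invasiveness, and the function names, are made for this sketch only.

```python
CANCER_THRESHOLD = 0.015    # threshold from FIG. 2A
INVASION_THRESHOLD = 1.940  # threshold from FIG. 2B


def classify(score):
    """Judge a sample from its CNV score (assumes higher score = more abnormal)."""
    if score < CANCER_THRESHOLD:
        return "non-cancer"
    if score < INVASION_THRESHOLD:
        return "low-invasive cancer"
    return "high-invasive cancer"


def accuracy(n_total, n_discordant):
    """Fraction of samples whose judgment agrees with the pathological diagnosis."""
    return (n_total - n_discordant) / n_total


print(classify(0.07))                    # low-invasive cancer
print(round(accuracy(15, 3) * 100, 1))   # 80.0
print(round(accuracy(15, 2) * 100, 1))   # 86.7
```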

Therefore, the CNV score of the present invention can not only comprehensively reflect the degree of genomic disorders of a sample, but also be effectively used for diagnosing cancer conditions such as whether cancer is present or not, the degree of invasion of cancer, and the histological subtype classification of cancer.

The embodiments of the present application described above are exemplary only, not exhaustive, and are not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

The foregoing is merely an alternative embodiment of the present application and is not intended to limit the present disclosure, as numerous modifications and variations will readily occur to those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. A detection method for genome-wide copy number variation (CNV), comprising the following steps:

(1) obtaining the sequencing result sequences of the whole genome sequencing of a DNA sample;

(2) aligning the sequencing result sequences with the human reference genome, and calculating the coordinate position of the CNV in the chromosome segment and the copy number CN_detection of the CNV in the segment;

(3) calculating the comprehensive CNV score of the DNA sample by the following formula:

wherein CNV length represents the length of the CNV calculated according to the coordinate position of the CNV in the chromosome segment, and CN_reference represents the chromosome copy number in normal samples,

wherein the CNV score represents the genome-wide copy number variation of the DNA sample.

2. The detection method of claim 1, wherein step (2) further comprises:

(2.1) aligning the sequencing result sequences with the human reference genome, and extracting those sequencing result sequences that completely match the human reference genome and align to a unique position in the human reference genome;

(2.2) dividing the human reference genome into a plurality of contiguous regions, and counting the number Q of the sequences obtained in step (2.1) that align within each region;

(2.3) correcting and normalizing the sequence number Q value obtained in step (2.2) to obtain the standard sequence number P value in each region;

(2.4) dividing the standard sequence number P obtained in step (2.3) by the median of the standard sequence numbers obtained from normal samples to obtain the fd value in each region;

(2.5) based on the fd values of all regions, calculating the coordinate position of the CNV in the chromosome segment and the copy number CN_detection of the segment by a mathematical model.

3. The method of claim 2, wherein the correction is a GC bias correction.

4. The method of claim 3, wherein the GC bias correction is performed by a loess GC correction method or a regional GC correction method.

5. The method of claim 2, wherein the normalization is performed by the following formula:

P = [corrected Q value / (total number of sequencing result sequences)] × (5 × 10⁶).

6. The method of claim 2, wherein the mathematical model is the Fused Lasso algorithm or the CBS algorithm.

7. The method of any one of claims 1-6, wherein the sequencing is a single-end sequencing approach.

8. The method of any one of claims 1-7, wherein the CNV is 100 kb or more in length.

9. A device for diagnosing a cancer condition in a subject, comprising:

a memory configured to store one or more programs;

a processing unit coupled to the memory and configured to execute the one or more programs to cause the device to perform a plurality of actions, the actions including:

(1) obtaining a plurality of CNV scores of a plurality of DNA samples by the method of any one of claims 1-8;

(2) determining a threshold according to the plurality of CNV scores;

(3) obtaining a CNV score of a sample from the subject by the method of any one of claims 1-8, and comparing it with the threshold to obtain a diagnostic result.

10. The device of claim 9, wherein the cancer is selected from the group consisting of brain cancer, skin cancer, breast cancer, lung cancer, liver cancer, esophageal cancer, prostate cancer, colorectal cancer, nasopharyngeal cancer, stomach cancer, pancreatic cancer, kidney cancer, bladder cancer, lymphoma, cervical cancer, uterine cancer, ovarian cancer, vaginal cancer, and thyroid cancer.

11. The device of claim 9 or 10, wherein the threshold is determined by a receiver operating characteristic (ROC) curve.

12. The device of any one of claims 9-11, wherein the diagnostic result indicates whether the subject has cancer.

13. The device of any one of claims 9-11, wherein the diagnostic result indicates whether the cancer the subject has is a low-invasive cancer or a high-invasive cancer.

14. The device of any one of claims 9-11, wherein the diagnostic result indicates the histological subtype grade of the cancer the subject has.

15. A computer-readable storage medium having stored thereon machine-executable instructions that, when executed, cause a machine to perform the steps of the method of any one of claims 1-8.

16. A computer-readable storage medium having stored thereon machine-executable instructions that, when executed, cause a machine to perform the following steps:

(1) obtaining a plurality of CNV scores of a plurality of DNA samples according to the method of any one of claims 1-8;

(2) determining a threshold according to the plurality of CNV scores;

(3) obtaining the CNV score of a subject according to the method of any one of claims 1-8, and comparing it with the threshold to obtain a diagnostic result.

17. A method for diagnosing a cancer condition in a subject, comprising:

(2) determining a threshold according to the multiple CNV scores;