CN106715711B

CN106715711B - Method for determining probe sequence and method for detecting genome structure variation

Info

Publication number: CN106715711B
Application number: CN201480080426.0A
Authority: CN
Inventors: 李剑; 王煜; 李尉; 李金良; 赵霞; 陈仕平; 张现东; 刘赛军
Original assignee: BGI Shenzhen Co Ltd
Current assignee: BGI Shenzhen Co Ltd
Priority date: 2014-07-04
Filing date: 2014-07-04
Publication date: 2021-09-17
Anticipated expiration: 2034-07-04
Also published as: CN106715711A; WO2016000267A1

Abstract

The present invention provides a method for determining a probe sequence based on a reference sequence and a method for detecting genomic structural variation. The method for determining a probe sequence based on a reference sequence comprises: constructing a first candidate probe set based on a plurality of discrete high-frequency SNP sites, wherein the first candidate probe set consists of a plurality of candidate probes and each candidate probe contains at least one discrete high-frequency SNP; aligning the candidate probes in the first candidate probe set with the reference sequence to obtain an alignment result; performing a first screening on the first candidate probe set based on the alignment result to obtain a second candidate probe set; dividing the reference sequence into a plurality of windows of predetermined length and assigning each candidate probe in the second candidate probe set to its matching window, so as to determine the position information of each candidate probe; and performing a second screening on the second candidate probe set, based on the position information and the allele frequencies of the discrete high-frequency SNPs, so as to determine the probe sequence.

Description

Methods for determining probe sequences and methods for detecting genomic structural variation

Priority information

None

Technical field

The invention relates to the technical fields of genomics and bioinformatics, and in particular to a method for determining a probe sequence and a method for detecting genomic structural variation.

Background art

DNA copy number variation (CNV) and loss of heterozygosity (LOH) are different types of genomic variation. CNV is a common genomic structural variation, involving fragments ranging from 1 kb to several Mb, and mainly manifests as submicroscopic deletions and duplications. LOH refers to the loss of a gene on one chromosome of a homologous pair while the gene remains present on the other chromosome of the pair; it manifests as a long stretch of DNA in which only homozygous SNPs appear. When LOH occurs without a change in copy number, that is, when both copies are inherited from a single parent, it is called uniparental disomy (UPD). CNV, LOH and UPD are associated with many common genetic diseases, cancers and other complex diseases. Establishing an accurate, comprehensive, efficient, rapid, simple and economical method for detecting CNV, LOH and UPD is of great value for studying chromosomal variation events, clarifying the etiology of related diseases and choosing appropriate treatment.

Several detection technologies already exist. PCR-based techniques include real-time fluorescent quantitative PCR and multiplex ligation-dependent probe amplification (MLPA): real-time fluorescent PCR analyzes one or a few targets per assay, whereas MLPA can analyze more than 40 sequences at a time with high sensitivity, but its detection range is limited to the chromosomes and regions targeted by the probes. FISH is generally used to examine a few specific chromosomes and cannot detect unknown regions. Chip-based techniques include array-based comparative genomic hybridization (aCGH) and SNP arrays (SNP-array); aCGH can detect CNV across the whole genome, but it cannot detect polyploidy and has a high miss rate for the loss of small fragments. Sequencing-based techniques detect structural variation genome-wide using whole genome sequencing (WGS), or variation within target regions using target region sequencing. There are four main approaches for analyzing CNV from sequencing data: paired-end mapping, read-depth analysis, split-read strategies and sequence assembly comparisons.

With the development of sequencing technology, it is necessary to study means of discovering genomic structural abnormalities based on sequencing results, especially sequencing results of local regions, including means of discovering chromosomal aneuploidy, CNV, insertion-deletion (indel), LOH, UPD and SNP.

Summary of the invention

One aspect of the present invention provides a method for determining a probe sequence based on a reference sequence, comprising the following steps: constructing a first candidate probe set based on a plurality of discrete high-frequency SNP sites, wherein the first candidate probe set consists of a plurality of candidate probes and each candidate probe contains at least one discrete high-frequency SNP; aligning the candidate probes in the first candidate probe set with the reference sequence to obtain an alignment result; performing a first screening on the first candidate probe set based on the alignment result to obtain a second candidate probe set; dividing the reference sequence into a plurality of windows of predetermined length and assigning each candidate probe in the second candidate probe set to its matching window, so as to determine the position information of each candidate probe; and performing a second screening on the second candidate probe set, based on the position information and the allele frequencies of the discrete high-frequency SNPs, so as to determine the probe sequence. Here, a discrete high-frequency SNP site is one whose allele frequency is greater than 10%, preferably not greater than 90%, and whose physical distance on the reference genome from any other discrete high-frequency SNP site is not less than the candidate probe length; the candidate probe length is 50-250 mer.

Probes obtained by the method of the present invention for determining a probe sequence are used to capture the genome by hybridization and thereby obtain a plurality of local genomic regions. The captured local regions can represent the whole genome and reflect genome-wide variation information, and are used to discover the occurrence of structural variation across the whole genome.

Another aspect of the present invention provides a method for detecting genomic structural variation, suitable for detecting chromosomal aneuploidy, copy number variation and indels, comprising the following steps. The genomic nucleic acid of a target sample is sequenced to obtain a genome sequencing result consisting of a plurality of reads; optionally, the sequencing includes screening with probes, the probes being obtained by the method for determining a probe sequence based on a reference sequence provided in one aspect of the present invention. The genome sequencing result can be obtained by extracting genomic DNA, then constructing a library and sequencing it according to the instruction manual of an existing high-throughput platform; it can also be obtained by capturing the genome of the target sample with probes and sequencing the captured material, where the probes can be obtained by the method for determining a probe sequence based on a reference sequence provided in one aspect of the present invention. The reference genome is divided into m regions, and the reads in the genome sequencing result that fall into region i are used to calculate the coverage depth TD_i of region i of the target sample genome, where m and i are natural numbers, 1≤i≤m and m>10. Whether a structural variation has occurred in region i of the target sample is judged from the degree of difference between the coverage depth of region i of the target sample genome and the coverage depths of region i in k reference samples, where k is a natural number and k≥2; the coverage depth of region i of each reference sample can be obtained in the same way as that of the target sample. By merging adjacent regions in which structural variation has occurred, it is further determined whether a large structural variation has occurred in the merged region, in other words whether the structural variation occurring in region i spans several regions.
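The merging of adjacent variant regions mentioned above can be illustrated with a minimal sketch. It assumes that regions are identified by consecutive integer indices and that a per-region call (variant or not) has already been made; the function name `merge_adjacent_regions` is hypothetical and not part of the patent.

```python
def merge_adjacent_regions(flagged):
    """Merge consecutive region indices into (first, last) runs.

    flagged: iterable of integer region indices already called as variant
    (how each region is called is outside this sketch).
    """
    runs = []
    for i in sorted(set(flagged)):
        if runs and i == runs[-1][1] + 1:
            # Region i directly follows the current run, so extend the run.
            runs[-1] = (runs[-1][0], i)
        else:
            # Otherwise start a new run containing only region i.
            runs.append((i, i))
    return runs


print(merge_adjacent_regions([3, 4, 5, 9, 12, 13]))
```

Each returned pair is a candidate larger event spanning several regions, which could then be re-tested as a single merged region.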

Yet another aspect of the present invention provides a method suitable for detecting another kind of genomic structural variation, loss of heterozygosity, comprising the following steps. A genome sequencing result of a target sample is obtained; optionally, the genome sequencing result is obtained by capturing the genome of the target sample with probes and sequencing the captured material, the probes being obtained by the method for determining a probe sequence based on a reference sequence provided in one aspect of the present invention. The genome is divided into m' regions. Based on the reads in the genome sequencing result that fall in region i and on the population data for region i, the set of SNPs shared by region i of the target sample genome and region i of the population is obtained. The heterozygosity of the segment in which each SNP site of the shared SNP set is located is calculated for the target sample and for the population respectively, giving the heterozygosity set U_i of region i of the target sample genome and the heterozygosity set U_0i of region i of the population; U_i of the target sample and U_0i of the population are compared to determine whether loss of heterozygosity has occurred in region i of the target sample. Here, the allele frequency of every SNP in the shared SNP set is greater than 0.1; the segment in which an SNP site of the shared SNP set is located is bounded by the two SNPs immediately upstream and downstream of that SNP; and m' and i are natural numbers with m'≥i≥1 and m'≥6. The number of samples that must be drawn to truly reflect the population can be determined according to the accuracy required for detection, the statistical method, the distribution of the sample data and so on. The population data consist of data from multiple samples of the same species, and can be obtained by whole genome sequencing, by the same method used to obtain the target sample data, or from completed, publicly available databases or websites, such as the 1000 Genomes data.
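As a rough illustration of the shared SNP set and of the segment bounded by neighbouring SNPs, the following sketch may help. It assumes SNPs are given as integer positions within one region and that population allele frequencies are available as a mapping; the function names are hypothetical, and the treatment of the first and last SNP (skipped here, since they lack one neighbour) is an assumption the text does not settle.

```python
def shared_snp_set(sample_snps, population_af, min_af=0.1):
    """Sorted positions seen in the sample and in the population with AF > min_af.

    sample_snps: iterable of SNP positions observed in the target sample.
    population_af: dict mapping SNP position -> population allele frequency.
    """
    return sorted(p for p in set(sample_snps) if population_af.get(p, 0.0) > min_af)


def flanking_segments(positions):
    """Map each interior SNP to its (upstream, downstream) neighbouring SNP positions."""
    return {
        positions[j]: (positions[j - 1], positions[j + 1])
        for j in range(1, len(positions) - 1)
    }


sample = {100, 250, 400, 900}
population = {100: 0.3, 250: 0.05, 400: 0.5, 900: 0.2, 1200: 0.4}
shared = shared_snp_set(sample, population)
print(shared, flanking_segments(shared))
```

How heterozygosity is then computed on each segment is not specified in this passage, so the sketch stops at the segment boundaries.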

Yet another aspect of the present invention provides a computer-readable storage medium for storing a program to be executed by a computer. Those of ordinary skill in the art will understand that, when the program is executed, all or part of the steps of the above methods for detecting genomic structural variation can be completed by instructing the relevant hardware. The storage medium may include read-only memory, random access memory, a magnetic disk, an optical disk and the like.

A final aspect of the present invention provides an apparatus for detecting genomic structural variation, comprising: a data input unit for inputting data; a data output unit for outputting data; a storage unit for storing data, including an executable program; and a processor, in data connection with the data input unit, the data output unit and the storage unit, for executing the executable program stored in the storage unit, where execution of the program includes completing all or part of the steps of the above methods for detecting genomic structural variation.

Using probes obtained by the method of the present invention for determining a probe sequence based on a reference sequence, or solid-phase or liquid-phase chips containing these probes, to capture and sequence target regions makes it possible to detect structural variation across the whole genome at low sequencing cost, including detecting CNV, LOH and UPD across all 23 pairs of human chromosomes. The detection resolution can be adjusted as required by changing the average spacing of the probes, that is, by adding or removing SNP sites. Target region capture sequencing combined with the bioinformatic analysis methods of the present invention achieves genome-wide detection of CNV, LOH and UPD with high resolution, high accuracy, high throughput and low cost. The method of the present invention for detecting genomic structural variation is also suitable for detecting chromosomal aneuploidy, SNPs and indels, and for structural variation analysis based on whole genome sequencing data.

Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from that description or be learned through practice of the invention.

Brief description of the drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of embodiments taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a schematic diagram of the genome-wide characteristics of the SeTR probes in one embodiment of the present invention: (A) length distribution of the SeTR probe sequences; (B) distribution of the physical distance between pairs of SeTR probes.

FIG. 2 shows test results for the SeTR probes in one embodiment of the present invention: (A) distribution of coverage depth over the target regions; (B) distribution of reads supporting the ref base type and the non-ref base type.

FIG. 3 is a schematic diagram of the detection workflow for CNV, LOH and UPD in one embodiment of the present invention.

FIG. 4 is a diagram of the R_i baseline in one embodiment of the present invention.

FIG. 5 is a schematic diagram of the genomic structural variation detected in one sample (GM50275) in one embodiment of the present invention; from the outside in, the rings show I) chromosome information, II) changes in the r_i value (wavy line), III) changes in the P value corresponding to R_het, and IV) changes in the R_het value (dots).

Detailed description of the invention

Embodiments of the present invention are described in detail below. The embodiments described below with reference to the accompanying drawings are exemplary; they serve only to explain the present invention and should not be construed as limiting it.

It should be noted that the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or as implicitly specifying the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include one or more of that feature. Further, in the description of the present invention, unless otherwise specified, "a plurality of" means two or more.

According to one embodiment of the present invention, there is provided a method for determining a probe sequence based on a reference sequence, comprising the following steps:

Step 1: Construct the first candidate probe set

A first candidate probe set is constructed from discrete high-frequency SNP sites distributed across the genome. Each candidate probe in the first candidate probe set contains at least one discrete high-frequency SNP site. A discrete high-frequency SNP site is one whose allele frequency is greater than 10% and whose physical distance on the reference genome from any other discrete high-frequency SNP site is not less than the candidate probe length; the candidate probe length is 50-250 mer.
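One way to read the definition above is sketched below. It is only an interpretation: the text defines "discrete" with reference to other discrete sites, and the sketch resolves this by keeping a high-frequency SNP only if its nearest high-frequency neighbours on both sides are at least one probe length away. The function name, the input layout and the optional upper bound `max_af` (taken from the preferred 90% limit) are assumptions.

```python
def discrete_high_freq_snps(snps, probe_len=100, min_af=0.10, max_af=0.90):
    """Return positions of SNPs treated here as discrete high-frequency SNPs.

    snps: list of (position, allele_frequency) pairs on a single chromosome.
    A SNP is kept when min_af < AF < max_af and no other high-frequency SNP
    lies closer than probe_len (interpretation, see lead-in).
    """
    high = sorted(pos for pos, af in snps if min_af < af < max_af)
    keep = []
    for j, pos in enumerate(high):
        left_ok = j == 0 or pos - high[j - 1] >= probe_len
        right_ok = j == len(high) - 1 or high[j + 1] - pos >= probe_len
        if left_ok and right_ok:
            keep.append(pos)
    return keep


snps = [(1000, 0.3), (1050, 0.4), (2000, 0.5), (3000, 0.05), (3100, 0.95), (4000, 0.2)]
print(discrete_high_freq_snps(snps))
```

In this example the SNPs at 1000 and 1050 exclude each other because they are only 50 bases apart, and the SNPs at 3000 and 3100 fail the frequency bounds.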

In a specific embodiment of the present invention, the discrete high-frequency SNPs are obtained from the 1000 Genomes data; they can also be obtained from other published genome data. Discrete high-frequency SNP sites with an allele frequency of less than 90% are further selected, and the candidate probe length is set to 100 mer.

In a specific embodiment of the present invention, each candidate probe contains one discrete high-frequency SNP site, and the discrete high-frequency SNP site is located in the middle section of the candidate probe. Each candidate probe thus contains only one high-frequency SNP site, and adjacent candidate probes may or may not overlap. The "middle section" is defined relative to the "front section" and the "rear section" and can be understood in the conventional way: for example, the upstream third and the downstream third of a sequence are the "front section" and the "rear section" respectively, and the central third is the "middle section". Further, the discrete high-frequency SNP site is located at the midpoint of the candidate probe. For a sequence of 2n+1 nucleotides, the midpoint is the position of the (n+1)-th nucleotide; for a sequence of 2n nucleotides, the midpoint is the position of the n-th or the (n+1)-th nucleotide. Placing the SNP at the midpoint enhances the efficiency with which the probe captures the target discrete high-frequency SNP site.
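The midpoint rule can be made concrete with a short sketch. It assumes a 0-based SNP index into a reference string and a half-open probe interval; for an even probe length it places the SNP at the n-th nucleotide, which is one of the two positions the text allows. The function name `centered_probe` is hypothetical.

```python
def centered_probe(reference, snp_index, probe_len=100):
    """Return (start, end, sequence) for a probe with the SNP at its midpoint.

    Coordinates are 0-based and half-open, [start, end). Returns None when the
    probe would run off either end of the reference.
    """
    left = (probe_len - 1) // 2  # number of bases upstream of the SNP
    start = snp_index - left
    end = start + probe_len
    if start < 0 or end > len(reference):
        return None
    return start, end, reference[start:end]


ref = "ACGTACGTAC"
print(centered_probe(ref, 4, probe_len=5))  # odd length: SNP is the 3rd base
print(centered_probe(ref, 4, probe_len=4))  # even length: SNP is the 2nd base
```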

In a specific embodiment of the present invention, the first candidate probe set is pre-screened based on the GC content and/or the single-base repeat degree of the candidate probe sequences, retaining the candidate probes in the first candidate probe set that have a GC content of 35%-65% and/or a single-base repeat degree of less than 7. The single-base repeat degree is the number of times one base type occurs consecutively in a sequence; for example, in TGAAAAAAAAGC, A occurs 8 times in a row, so the A repeat degree of the sequence is 8. A GC content that is too high or too low, or high heterozygosity, readily affects the PCR or hybridization capture of a sequence, introducing GC bias and the like and reducing capture specificity. The first candidate probe set retained after this pre-screening will not hybridize to such sequences, which avoids the effect of GC bias or low-specificity capture on the results.
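The pre-screening criteria can be sketched as follows. The thresholds are those stated in the text; the function names are hypothetical, and because the text says "and/or", requiring both criteria at once is an assumption made here for simplicity.

```python
from itertools import groupby


def gc_fraction(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def max_single_base_run(seq):
    """Length of the longest run of one repeated base (single-base repeat degree)."""
    return max(len(list(group)) for _, group in groupby(seq.upper()))


def passes_prescreen(seq, gc_min=0.35, gc_max=0.65, max_run=7):
    """True when GC content is within 35%-65% and the longest run is below max_run."""
    return gc_min <= gc_fraction(seq) <= gc_max and max_single_base_run(seq) < max_run


print(max_single_base_run("TGAAAAAAAAGC"))  # the example from the text
print(passes_prescreen("TGAAAAAAAAGC"), passes_prescreen("ACGTACGTGCAT"))
```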

Step 2: Align the first candidate probe set with the reference sequence to obtain an alignment result

The first candidate probe set is aligned with the reference sequence to obtain an alignment result, which gives the position of each candidate probe on the reference sequence. The reference sequence is a known sequence and can be any previously obtained reference template for the biological category to which the target sample belongs. For example, if the target sample is human, the reference sequence can be HG18 or HG19 as provided by the National Center for Biotechnology Information (NCBI). Further, a resource library containing more reference sequences can be configured in advance; before sequence alignment, a closer reference sequence is selected according to factors such as the sex, ethnicity and geographic origin of the target sample, which helps to obtain more targeted probe sequences.

Step 3: Perform a first screening on the first candidate probe set to obtain a second candidate probe set

In a specific embodiment of the present invention, a candidate probe is retained by the first screening if it satisfies either of the following two conditions: 1) the candidate probe aligns to a unique position on the reference genome; 2) the candidate probe aligns to multiple positions on the reference sequence, and its mismatch ratio is less than 10% at at least two of those positions. For example, for a candidate probe of 100 mer, 10 mismatched bases correspond to a mismatch ratio of 10%. With a low mismatch ratio, the probe pairs almost completely with the target region during hybridization, giving good capture and high specificity.
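The two retention conditions can be expressed as a small predicate. The sketch assumes the aligner output has already been reduced to a list of mismatch counts, one per alignment position of a probe; that input format and the function name are hypothetical.

```python
def passes_first_screen(mismatches_per_hit, probe_len=100, max_ratio=0.10):
    """Apply the two retention conditions to one candidate probe.

    mismatches_per_hit: list with one mismatch count per alignment position.
    Condition 1: exactly one alignment position (unique hit).
    Condition 2: several positions, at least two with mismatch ratio < max_ratio.
    """
    if len(mismatches_per_hit) == 1:
        return True
    low = [m for m in mismatches_per_hit if m / probe_len < max_ratio]
    return len(low) >= 2


print(passes_first_screen([0]))         # unique hit
print(passes_first_screen([0, 3, 25]))  # two positions below 10% mismatch
print(passes_first_screen([0, 15]))     # only one position below 10% mismatch
```

A probe with no alignment at all (an empty list) is rejected by this sketch, which the text does not address explicitly.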

Step 4: Divide the reference sequence into multiple windows and assign the second candidate probe set to the matching windows

The reference sequence is divided into a plurality of windows of predetermined length. Using the alignment, each candidate probe in the second candidate probe set is assigned to the window it matches, giving the position information of each candidate probe within its window.

The windows of predetermined length may or may not be of equal length, and may or may not overlap. In a specific embodiment of the present invention, the reference sequence is a reference genome, which is divided into windows of equal length; the window length is 10 kb, and adjacent windows are contiguous but do not overlap.
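For the embodiment with contiguous, non-overlapping 10 kb windows, window assignment reduces to integer division. The sketch below assumes 0-based coordinates on one chromosome and assigns each probe by its midpoint; both choices, and the function name, are assumptions.

```python
from collections import defaultdict

WINDOW = 10_000  # window length in bases (10 kb, as in the embodiment)


def assign_to_windows(probes, window=WINDOW):
    """Group probes by window index.

    probes: list of (probe_id, midpoint_position) pairs, 0-based coordinates.
    Returns a dict mapping window index -> list of probe ids in input order.
    """
    windows = defaultdict(list)
    for probe_id, mid in probes:
        windows[mid // window].append(probe_id)
    return dict(windows)


print(assign_to_windows([("p1", 500), ("p2", 9999), ("p3", 10000), ("p4", 25000)]))
```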

Step 5: Perform a second screening on the second candidate probe set, based on the position information and the allele frequencies of the discrete high-frequency SNPs, to determine the probe sequence

In a specific embodiment of the present invention, the second screening comprises two steps. (a) If several candidate probes lie in the same window, the candidate probe whose discrete high-frequency SNP has the highest allele frequency is identified. (b) If only one candidate probe has the highest allele frequency, that candidate probe is selected as the probe; if several candidate probes share the highest allele frequency, the one closest to the center of the window is selected as the probe. The distance between a candidate probe and the window center can be taken as the distance between the midpoint of the candidate probe and the window center. Keeping the target position as close as possible to the center of the probe sequence helps to improve capture efficiency.
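The two-step selection rule above can be sketched as a single pass with a composite sort key. The sketch assumes equal, non-overlapping 10 kb windows and candidates described by a midpoint coordinate and the allele frequency of their SNP; the dictionary keys and the function name are hypothetical.

```python
def select_probe_per_window(candidates, window=10_000):
    """Pick one probe per window: highest allele frequency, then nearest to centre.

    candidates: list of dicts with keys 'id', 'mid' (probe midpoint, 0-based)
    and 'af' (allele frequency of the probe's SNP).
    Returns a dict mapping window index -> id of the selected probe.
    """
    best = {}
    for c in candidates:
        w = c["mid"] // window
        center = w * window + window / 2
        # Smaller key wins: higher allele frequency first, then shorter distance.
        key = (-c["af"], abs(c["mid"] - center))
        if w not in best or key < best[w][0]:
            best[w] = (key, c["id"])
    return {w: chosen for w, (_, chosen) in best.items()}


candidates = [
    {"id": "a", "mid": 1000, "af": 0.30},
    {"id": "b", "mid": 4000, "af": 0.45},
    {"id": "c", "mid": 5200, "af": 0.45},
    {"id": "d", "mid": 12000, "af": 0.20},
]
print(select_probe_per_window(candidates))
```

Probes b and c tie on allele frequency in window 0, and c is chosen because its midpoint is 200 bases from the window centre rather than 1000.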

In a specific embodiment of the present invention, after the second screening of the second candidate probe set, when two adjacent candidate probes that fall into two adjacent windows are separated on the reference genome by a distance greater than the length of either of the two windows, a short tandem repeat located between the two adjacent candidate probes on the reference genome, or a part of such a short tandem repeat, may optionally be added to the screened second candidate probe set; together they constitute the probe sequences. When probe sequences designed in this way are used to capture the whole genome, the spacing of the captured regions is relatively uniform, so that the set of captured regions reflects the information of the entire genome more completely.
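Finding the probe pairs that qualify for this optional gap filling can be sketched as below. The sketch assumes equal-length windows, one selected probe per window, and distance measured between probe midpoints; these are simplifying assumptions, and locating a short tandem repeat inside each gap is not shown.

```python
def gaps_needing_fill(selected, window=10_000):
    """List probe pairs in adjacent windows that lie more than one window apart.

    selected: dict mapping window index -> midpoint of the probe chosen there.
    Returns a list of (left_midpoint, right_midpoint) pairs.
    """
    gaps = []
    for w in sorted(selected):
        if w + 1 in selected and selected[w + 1] - selected[w] > window:
            gaps.append((selected[w], selected[w + 1]))
    return gaps


print(gaps_needing_fill({0: 1000, 1: 19000, 2: 21000, 4: 45000}))
```

Windows 2 and 4 are not adjacent, so that pair is not reported even though the probes are far apart.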

According to another embodiment of the present invention, there is provided a method for detecting genomic structural variation, where the genomic structural variation includes at least one of chromosomal aneuploidy, copy number variation and indel, comprising the following steps:

(1) The genomic nucleic acid of the target sample is sequenced to obtain a genome sequencing result consisting of a plurality of reads. The genome sequencing result can be obtained by whole genome sequencing, for example by extracting genomic DNA and then constructing a library and sequencing it according to the instruction manual of an existing high-throughput platform, such as Illumina HiSeq 2000/2500, Roche 454, Life Technologies Ion Torrent, or a single-molecule or nanopore sequencing platform, to obtain reads. Alternatively, it can be obtained by capturing the genome of the target sample with probes and sequencing the captured material; the probes can be designed by the probe determination method provided in one aspect of the present invention and then synthesized or prepared by existing methods.

(2) The reference genome is divided into m regions, and the reads in the sequencing result that fall into region i are used to calculate the coverage depth TD_i of region i of the target sample genome, where m and i are natural numbers, i is the region number, 1≤i≤m and m>10.

In a specific embodiment of the present invention, the coverage depth of region i is calculated as

[formula not reproduced]

or

[formula not reproduced]

where i denotes the region number. The position on the genome to which a read maps can be determined by sequence alignment, using any of various software packages, for example SOAP (Short Oligonucleotide Analysis Package), bwa (Burrows-Wheeler Aligner), samtools, GATK (Genome Analysis Toolkit) and the like.
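The formulas for the coverage depth of region i do not survive in this text, so the following is only a sketch of one common definition of average depth, namely the number of reads in the region multiplied by the read length and divided by the region length. That definition, the equal-length regions, the assignment of reads by start coordinate and the function name are all assumptions, not the patent's formula.

```python
def region_coverage_depth(read_starts, read_len, region_len, m):
    """Average depth per region under an ASSUMED definition (see lead-in).

    read_starts: 0-based start coordinates of aligned reads on one sequence.
    A read is counted in the region that contains its start coordinate.
    Returns a list of m depths: reads_in_region * read_len / region_len.
    """
    counts = [0] * m
    for start in read_starts:
        i = start // region_len
        if 0 <= i < m:
            counts[i] += 1
    return [c * read_len / region_len for c in counts]


depths = region_coverage_depth([0, 10, 999, 1000, 2500], read_len=100, region_len=1000, m=12)
print(depths[:3])
```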

(3) Whether a structural variation has occurred in region i of the target sample is judged from the degree of difference between the coverage depth of region i of the target sample genome and the coverage depths of region i in k reference samples, where k is a natural number and k≥2.

In a specific embodiment of the present invention, the degree of difference between the coverage depth of region i of the target sample genome and the coverage depths of region i of the k reference samples is assessed by comparing the coverage depth coefficients of genomic region i of the target sample and of the reference samples. Determining the coverage depth coefficient R_i of region i of the target sample genome comprises the following steps. (a) A first correction is applied to TD_i to obtain TD_ai. The first correction is carried out by linear regression on the coverage depth values of 2n consecutive regions that include region i, where n is a natural number and 10<n≤m/2. In a specific embodiment of the present invention, the first correction by linear regression gives

[formula not reproduced]

where TD_j denotes the coverage depth of the j-th of the n consecutive regions, j is a natural number and 1≤j≤n. (b) After the first corrected coverage depth TD_ai of region i has been obtained, TD_ai is further normalized to obtain

[formula not reproduced]

from which is obtained

[formula not reproduced]

In a specific embodiment of the present invention, normalizing the first corrected coverage depth TD_ai of region i gives

[formula not reproduced]

In a specific embodiment of the present invention, after R_i of the target sample has been obtained, the method further comprises applying a second correction to R_i to obtain r_i,

[formula not reproduced]

where R_ai is the average of the coverage depth coefficients of genomic region i over the k reference samples,

$$R_{ai} = \frac{1}{k}\sum_{y=1}^{k} R_{i,y}$$

y is a natural number denoting the reference sample number, and R_{i,y} denotes the coverage depth coefficient of genomic region i of reference sample y.

In another specific embodiment of the present invention, after R_i of the target sample is obtained, a second correction is further applied to R_i to obtain r_i,

[formula missing in source]

where R_ai is the average of the coverage depth coefficients of region i of the genomes of the k reference samples and the one target sample,

[formula missing in source]

y is a natural number denoting the reference sample number, and R_{i,y} denotes the coverage depth coefficient of region i of the genome of reference sample y.

In the above process of calculating the coverage depth coefficient R_i of region i of the target sample genome, the correction and normalization of intermediate values reduce errors caused by fluctuations in experimental conditions and by differences between samples, so that the final r_i faithfully reflects R_i, fluctuates around 1 with a smaller amplitude than R_i, and follows a normal distribution across multiple samples. In the above embodiment, applying the first correction to TD_i and then normalizing the first-corrected value amounts to averaging twice: before the mean coverage depth of n consecutive regions containing region i is used to represent the coverage depth of region i, the coverage depth of each of those n regions is itself expressed as the mean coverage depth of the n consecutive regions that begin with that region. This is equivalent to correcting TD_i with the coverage depth values of 2n consecutive regions containing the target region i, which keeps the coverage depth of consecutive regions stable. It should be noted that those skilled in the art may use other correction or averaging procedures to keep the coverage depth values of several adjacent regions stable, for example correcting the coverage depth of the target region with the average coverage depth of several regions located a given number of regions away from it; all of these fall within the concept of the present invention.
The coverage depth coefficient of region i of a reference sample genome can be calculated in the same way as that of the target sample; the reference sample data can be calculated in advance and kept for later use, or calculated at the same time as the target sample.
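Because the correction formulas are not legible in this text, the sketch below only illustrates the "averaging twice" idea described above and one plausible form of the second correction. Everything specific in it is an assumption: the placement of the second window (the n first-pass values ending at region i), the ratio form r_i = R_i / mean_y(R_{i,y}) suggested by the remark that r_i fluctuates around 1, and all function names.

```python
from statistics import mean

def forward_window_means(values, n):
    """First pass: mean of up to n consecutive values starting at each index."""
    return [mean(values[i:i + n]) for i in range(len(values))]

def two_pass_smooth(td, n):
    """Second pass: mean of the n first-pass values ending at index i.

    Together the two passes draw on roughly 2n consecutive regions around i.
    The window placement is an assumption, not the patent's formula.
    """
    first = forward_window_means(td, n)
    return [mean(first[max(0, i - n + 1):i + 1]) for i in range(len(first))]

def second_correction(r_target, r_refs):
    """Assumed form: r_i = R_i / mean over reference samples y of R_{i,y}.

    r_target: list of R_i for the target sample.
    r_refs: list of per-sample lists, r_refs[y][i] = R_{i,y}.
    """
    return [r / mean(ref[i] for ref in r_refs) for i, r in enumerate(r_target)]
```

Windows are simply truncated at the ends of the region list here; a real implementation would need to decide how to treat chromosome boundaries.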

In a specific embodiment of the present invention, the degree of difference between the coverage depth of region i of the target sample genome and the coverage depths of region i of the k reference samples is judged by using a t-test to determine whether the difference between their coverage depth coefficients is significant. In a specific embodiment of the present invention, the t-test statistic of region i of the target sample genome is calculated as

[formula missing in source]

where

[formula missing in source]

denotes the mean of r_{i,y} over the k reference samples, r_{i,y} is the second-corrected coverage depth coefficient of region i of the genome of reference sample y, and

[formula missing in source]

is the standard deviation over the k reference samples.

Based on the t_i value of region i of the target sample genome, a significance level P_i is obtained. When P_i<0.05, structural variation is judged to have occurred in region i; otherwise, region i is judged to have no structural variation. In another specific embodiment of the present invention, based on the t_i value of region i of the target sample genome and a predetermined significance level P_i0, the theoretical value t_i0 of t_i is obtained; when t_i≥t_i0, structural variation is judged to have occurred in region i, and otherwise region i is judged to have no structural variation, where the predetermined P_i0≤0.05. Once P_i0 is set, the corresponding t_i0 can be looked up in the t-value table of the t-test.
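The test described above can be sketched as follows. The exact statistic is not legible in this text, so the form used here, the target value minus the reference mean, divided by the sample standard deviation of the references, is an assumption, as is taking the absolute value so that both gains and losses are flagged. The critical value `t_critical` is supplied by the caller from a t table for the chosen P_i0; the function names are hypothetical.

```python
from statistics import mean, stdev

def t_statistic(r_target, r_refs):
    """Assumed form: (r_i - mean of reference r_{i,y}) / sample std dev of references."""
    return (r_target - mean(r_refs)) / stdev(r_refs)

def is_variant(t_value, t_critical):
    """Critical-value decision: flag the region when |t_i| reaches t_i0."""
    return abs(t_value) >= t_critical

# Example with made-up numbers: three reference coefficients and one target value.
t = t_statistic(4, [1, 2, 3])
flagged = is_variant(t, 1.5)  # 1.5 is an arbitrary stand-in for a table value
```

The sign of the statistic is kept by `t_statistic` because the later merging step relies on whether t is above or below zero.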

In one embodiment of the present invention, to detect larger CNVs or insertions/deletions, after step (3), W consecutive regions in the same direction are merged to obtain a first-level merged region; two first-level merged regions are merged when they are in the same direction and the span between them does not exceed L regions, giving a second-level merged region; and structural variation of the second-level merged region is then detected. Here, regions in the same direction are regions whose coverage depth t statistics are all greater than 0 or all less than 0; W and L are natural numbers, W≥2, and L-W≤1. To detect still larger structural variations, the procedure can be extended by analogy, for example by further merging qualifying second-level merged regions, with a similar condition that two second-level merged regions are in the same direction and the distance between them on the reference genome does not exceed L regions or L second-level merged regions.

In a specific embodiment of the present invention, structural variation of a second-level merged region is detected by judging, from the degree of difference between the coverage depth of the second-level merged region of the target sample genome and the coverage depths of the corresponding regions of multiple reference sample genomes, whether structural variation has occurred in that merged region, or in other words whether the structural variation occurring in region i spans W regions. For obtaining the coverage depth of the corresponding second-level merged region on the reference sample genomes, calculating the t statistic of the coverage depth of the second-level merged region on the target sample genome, and judging structural variation, refer to the calculation and judgment process described above for the relatively small region i.
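The merging procedure can be sketched as below. It reads "W consecutive regions in the same direction" as maximal runs of at least W consecutive regions whose t statistics share a sign, which is an interpretation rather than something the text states outright; the function names and the inclusive `(start, end, sign)` tuples are illustrative.

```python
def first_level_runs(t_values, W):
    """Maximal runs of >= W consecutive regions whose t statistics share a sign.

    Returns (start, end, sign) tuples with inclusive region indices.
    """
    signs = [(t > 0) - (t < 0) for t in t_values]
    runs, start = [], 0
    for i in range(1, len(signs) + 1):
        if i == len(signs) or signs[i] != signs[start]:
            if signs[start] != 0 and i - start >= W:
                runs.append((start, i - 1, signs[start]))
            start = i
    return runs

def second_level_merge(runs, L):
    """Join neighbouring runs with the same sign separated by at most L regions."""
    merged = []
    for run in runs:
        gap_ok = merged and run[0] - merged[-1][1] - 1 <= L
        if gap_ok and merged[-1][2] == run[2]:
            merged[-1] = (merged[-1][0], run[1], run[2])
        else:
            merged.append(run)
    return merged
```

For example, with t values `[1.2, 0.8, -0.3, 2.0, 1.5, 1.1, -1.0, -2.0]`, W=2 and L=1, the two positive runs on either side of the single negative region are joined into one candidate spanning regions 0 to 5. Each merged candidate would then be re-tested against the reference samples as described above.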

According to yet another embodiment of the present invention, a method suitable for detecting loss of heterozygosity (LOH) among genomic structural variations is provided, comprising the following steps:

(1) Sequence the genomic nucleic acid of the target sample to obtain a genome sequencing result consisting of multiple reads. The sequencing result can be obtained by whole-genome sequencing: for example, genomic DNA is extracted, and library construction and sequencing are carried out according to the manuals of existing high-throughput platforms (such as Illumina HiSeq 2000/2500, Roche 454, Life Technologies Ion Torrent, or single-molecule or nanopore sequencing platforms) to obtain reads. Alternatively, it can be obtained by capturing the genome of the target sample with probes and sequencing the captured fragments; the probes can be designed by the probe determination method provided in one aspect of the present invention and then synthesized or prepared by existing methods.

(2) Divide the reference genome into m' regions. Based on the information of the reads in the sequencing result that fall into region i of the reference genome and on the data for population region i, obtain the set of SNPs shared by region i of the target sample genome and population region i. Calculate, separately for the target sample and for the population, the heterozygosity of the fragment containing each SNP site in the shared SNP set, to obtain the heterozygosity set U_i of region i of the target sample genome and the heterozygosity set U_0i of population region i. Compare U_i of the target sample with U_0i of the population to determine whether loss of heterozygosity has occurred in region i of the target sample. The allele frequency of every SNP in the shared SNP set is greater than 0.1; the fragment containing a SNP site in the shared SNP set is bounded by the two SNPs adjacent to that SNP upstream and downstream; m' and i are natural numbers, m'≥i≥1, and m'≥6.

In a specific embodiment of the present invention, the heterozygosity of the fragment containing a SNP site is expressed by the minor allele frequency coefficient of that SNP site, R_het = MAF/(1-MAF), where MAF is the minor allele frequency of the high-frequency SNP.

In a specific embodiment of the present invention, comparing U_i of the target sample with U_0i of the population to determine whether loss of heterozygosity has occurred in region i of the target sample includes using an F-test to judge whether the variance of U_i

[formula missing in source]

and the variance of U_0i

[formula missing in source]

differ significantly. If the variances of U_i and U_0i differ significantly, loss of heterozygosity is judged to be present in region i of the target sample; otherwise, it is judged to be absent.

In a specific embodiment of the present invention, the F-test includes calculating the variances of U_i and U_0i separately, using the variance of U_i of the target sample

[formula missing in source]

and the variance of U_0i of the population

[formula missing in source]

to calculate two mutually reciprocal statistics F_upper and F_under, using these reciprocal statistics to obtain the significance level p_F, and comparing p_F with a predetermined significance level p_F0: p_F≤p_F0 indicates that the two variances differ significantly, and otherwise the difference is not significant. The F-test includes the calculation formulas

[formula missing in source]

p_F = p_upper + (1 - p_under), where v is the index of a SNP in the SNP set shared by region i of the target sample genome and population region i, q is the number of SNPs in that shared set, R_{het,i,v} is the minor allele frequency coefficient of the v-th SNP in the shared SNP set of region i of the target sample genome,

[formula missing in source]

is the mean of the minor allele frequency coefficients of the q SNPs in the shared SNP set of region i of the target sample genome, R_{het,i0,v} is the minor allele frequency coefficient of the v-th SNP in the shared SNP set of region i of the population sample genome,

[formula missing in source]

is the mean of the minor allele frequency coefficients of the q SNPs in the shared SNP set of region i of the population sample genome, p_upper and p_under are obtained from F_upper and F_under respectively, and p_F0≤0.05. p_F0 can take a commonly used value, or be adjusted according to known information, the required detection accuracy, and the like.
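The quantities entering this F-test can be sketched as follows. R_het = MAF/(1-MAF) is taken from the text; which variance goes in the numerator of F_upper is not legible here, so putting the target-sample variance on top is an assumption, as are the function names. The p-values p_upper and p_under would then be read from an F distribution with (q-1, q-1) degrees of freedom using a statistics library, which is left out to keep the sketch dependency-free.

```python
from statistics import variance

def r_het(maf):
    """Minor allele frequency coefficient R_het = MAF / (1 - MAF)."""
    return maf / (1.0 - maf)

def f_statistics(sample_mafs, population_mafs):
    """Return (F_upper, F_under), the two mutually reciprocal variance ratios.

    Inputs are the minor allele frequencies of the q shared SNPs in region i,
    for the target sample and for the population respectively.
    Assumption: F_upper = sample variance / population variance.
    """
    s2_sample = variance([r_het(m) for m in sample_mafs])
    s2_pop = variance([r_het(m) for m in population_mafs])
    if s2_pop == 0:
        raise ValueError("population variance is zero")
    f_upper = s2_sample / s2_pop
    f_under = s2_pop / s2_sample if s2_sample > 0 else float("inf")
    return f_upper, f_under
```

`statistics.variance` is the sample variance with a q-1 denominator, matching the usual form of an F-test on two variances.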

In one embodiment of the present invention, to detect larger LOH, after step (2), W' consecutive regions in which loss of heterozygosity has occurred are merged to obtain a third-level merged region; two third-level merged regions are merged when the span between them does not exceed L' regions, giving a fourth-level merged region; the heterozygosity set of the fourth-level merged region of the target sample and the heterozygosity set of the same region of the population are obtained separately and compared to determine whether loss of heterozygosity has occurred in the fourth-level merged region of the target sample, where W' and L' are natural numbers, W'≥2, and W'/2≥L'. In a specific embodiment of the present invention, W'≥4. To detect LOH over still larger regions, the procedure can be extended by analogy, for example by further merging qualifying fourth-level merged regions, with a similar condition that the distance between two fourth-level merged regions on the reference genome does not exceed L' regions or L' third-level merged regions.

According to yet another embodiment of the present invention, a method for detecting uniparental disomy (UPD) is provided. When loss of heterozygosity is present in a genomic region of a target sample, the copy number of that region is calculated; when the copy number of the region equals the copy number of the same region in a normal genome of the same species, UPD is judged to be present in that genomic region of the target sample. Whether LOH is present in a genomic region can be determined by the LOH detection method disclosed above in one aspect of the present invention.

Those of ordinary skill in the art will understand that all or some of the steps of the methods in the above embodiments can be carried out by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, which may include read-only memory, random access memory, a magnetic disk, an optical disc, and the like.

According to a final embodiment of the present invention, a device for detecting genomic structural variation is also provided, comprising: a data input unit for inputting data; a data output unit for outputting data; a storage unit for storing data, including an executable program; and a processor, in data connection with the data input unit, data output unit, and storage unit, for executing the executable program stored in the storage unit, where execution of the program includes carrying out all or some of the steps of the methods in the above embodiments.

The results of running the specific probe design method and the structural variation detection method according to the present invention are described in detail below with reference to specific target individuals. The name definitions and specific parameter settings used in the following process are:

1. The designed probes are referred to as Selected Target Region Primers (SeTR);

2. "Coverage depth", "sequencing depth", and "depth" are used interchangeably below, as are "region" and "target region";

3. Library construction and sequencing follow the small-fragment library construction instructions and the sequencing instructions provided for the HiSeq 2000 platform. The library size is 300 bp-350 bp, sequencing is paired-end, and the read length is 91 bp (sequencing type PE91+8+91);

4. The reference genome or reference sequence used for alignment is the human reference genome (hg19, Build 37).

Where specific techniques or conditions are not indicated in the examples, they follow the techniques or conditions described in the literature in the field (for example, J. Sambrook et al., Molecular Cloning: A Laboratory Manual, 3rd edition, translated by Huang Peitang et al., Science Press) or the product instructions. Reagents or instruments whose manufacturer is not indicated are conventional, commercially available products, which can be purchased, for example, from Illumina.

Example 1: Chip Design, Preparation, and Testing

In general, high (>60%) or low (<35%) GC content and high heterozygosity tend to adversely affect DNA fragments during PCR or probe capture. To avoid this, we designed special probes, which we call SeTR. The design of SeTR probes follows these principles: a) the probe sequence should be highly unique and stable, with low heterozygosity and moderate GC content (35% to 60%); b) the probe contains discrete high-frequency SNP markers, each with an allele frequency (AF) satisfying 0.9>AF>0.1, so that genome-wide LOH can be detected better; c) the final target regions are distributed relatively evenly.
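As an illustration of principle a), the GC window can be checked with a few lines of code. This is a minimal sketch: the function names are hypothetical, and treating the 35% and 60% bounds as inclusive and ignoring non-ACGT characters are choices made here, not details given in the text.

```python
def gc_fraction(seq):
    """Fraction of G and C among the A/C/G/T bases of seq (case-insensitive)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence has no A/C/G/T bases")
    return (seq.count("G") + seq.count("C")) / acgt

def gc_in_range(seq, low=0.35, high=0.60):
    """True when the GC fraction lies inside the moderate window (bounds inclusive)."""
    return low <= gc_fraction(seq) <= high
```

A candidate probe such as `"ATGCATGC"` (GC fraction 0.5) passes, while `"GGGGCCCCAT"` (0.8) or `"AAAATTTTGC"` (0.2) would be rejected.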

The SeTR probe design procedure, or equivalently the target region selection procedure, is as follows:

1) Based on the 1000 Genomes Project database (ftp://ftp.ncbi.nih.gov/1000genomes/ftp/release), select a candidate SNP set with allele frequency (AF) of 10% to 90%; then, wherever two SNPs in the set are less than 100 bp apart, remove one of them, giving the SNP marker1 set.

2) Taking each SNP in the SNP marker1 set as the midpoint, extract 50 bp of reference genome sequence on each of its upstream and downstream sides to form a set of 100 bp theoretical probe sequences, and align this probe sequence set back to the reference sequence. If the best alignment of a probe sequence has no mismatches, and its second-best alignment has only less than 5% mismatches, the corresponding SNP is retained, giving the SNP marker2 set.

3) From the SNP marker2 set, select SNP markers that are evenly spaced in physical position on the reference genome as the final SNP marker set. In our study, we selected a SNP marker set with a physical spacing of about 10 kbp.

4) If two adjacent SNPs in the final SNP marker set are more than 10 kbp apart, short tandem repeats (STRs) between them are used to fill the gap and even out the distribution.
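Steps 1) and 3) above can be sketched as two small filters. The text does not say which SNP of a close pair is removed or how the evenly spaced subset is chosen, so the greedy left-to-right thinning below is an assumption, as are the function names and the `(position, allele_frequency)` tuples. The alignment check of step 2) and the STR gap filling of step 4) are not covered.

```python
def filter_by_af(snps, low=0.1, high=0.9):
    """Keep SNPs with low < AF < high. snps: list of (position, allele_frequency)."""
    return [(pos, af) for pos, af in snps if low < af < high]

def thin_by_distance(snps, min_gap):
    """Greedy thinning along one chromosome.

    A SNP is kept only if it lies at least min_gap bp after the last kept SNP.
    """
    kept = []
    for pos, af in sorted(snps):
        if not kept or pos - kept[-1][0] >= min_gap:
            kept.append((pos, af))
    return kept

# Made-up candidates on one chromosome.
candidates = [(100, 0.5), (150, 0.4), (300, 0.05),
              (12000, 0.2), (25000, 0.95), (23000, 0.3)]
marker1 = thin_by_distance(filter_by_af(candidates), 100)    # step 1)
spaced = thin_by_distance(marker1, 10_000)                   # step 3), about 10 kbp
```

In a full pipeline the 10 kbp thinning would run on the marker2 set that survives the uniqueness check of step 2), not directly on marker1 as in this toy example.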

After designing the SeTR probes, we commissioned Roche to produce the SeTR liquid-phase chip. The chip contains 278,800 probes with a total size of 41,795,106 bp, covering 1.45% of the effective whole genome (2.89 G). The average SeTR probe length is 149 bp, and the average physical distance between adjacent probes is 10.6 kbp, as shown in Table 1 and Figure 1.

Table 1. Distribution of SeTR probes on each chromosome

Three DNA samples that passed quality control, YH (the Yanhuang sample, genomic DNA of a Chinese individual), HG00537 (a sample from the 1000 Genomes Project), and GM50275 (a human fibroblast sample obtained from the Coriell Institute for Medical Research), were used to test the usability of the SeTR chip and confirm that this probe chip can be used in subsequent detection studies. All three samples were captured with SeTR, built into libraries, and sequenced to obtain reads. We first removed reads contaminated by adapters and low-quality reads (for example, those with an average quality value below 20) and called the remaining reads clean reads. When the clean reads were aligned to the reference sequence hg19, 98.13% to 99.29% of the reads aligned to the reference genome, and 67.43% to 67.87% aligned to the target regions. In addition, 99.73% to 99.95% of the target regions were covered by at least one read, and more than 99% of the regions were covered at least 10 times, as shown in Table 2. These results are better than those of exome capture chips of the same type, such as the exome liquid-phase chip produced by Roche NimbleGen.
Furthermore, the depth distribution of the target regions, shown in Figure 4A, resembles a Poisson distribution. Figure 4B shows that, for the great majority of highly heterozygous sites in the target regions, the number of reads supporting the non-reference allele is almost equal to the number supporting the reference allele; that is, the numbers of positive and negative supporting reads obtained at a highly heterozygous site during alignment are comparable (the positive and negative reads come from the two homologous chromosomes respectively). This shows that the probes have no obvious capture bias toward one haplotype (usually the reference allele type, i.e. the ref type) and capture the target regions with good uniformity.

Table 2. Alignment results of the three samples

Example 2: Target Region Library Construction and Sequencing

1. Test materials, reagents, and instruments

Samples: 15 target gDNA samples (human genomic DNA; sample numbers are listed in Table 3 below, and those beginning with "GM" are human fibroblasts) and 24 reference DNA samples.

Main reagents and instruments: PCR instrument, pipettes, centrifuge, Comfort-type thermomixer, DNA shearing instrument, vortex mixer, magnetic stand, electrophoresis apparatus, HiSeq 2000 sequencer, NanoDrop UV spectrophotometer, and so on. Reagents or instruments whose manufacturer is not indicated are conventional, commercially available products.

Probe design and synthesis: as obtained in Example 1, target regions of about 41 M were selected across the whole human genome, and NimbleGen SeqCap EZ liquid-phase probes were custom-ordered from Roche; this probe set captures the corresponding designed target regions.

2. Library construction

1) Genomic DNA extraction

Using the QIAGEN DNA extraction kit (DNA Mini Kit) according to the kit instructions, about 3-5 μg of genomic DNA was extracted from the target sample for subsequent experiments. An aliquot of the extracted DNA (30-100 ng) was checked by electrophoresis to assess its integrity and degree of degradation.

2) Genomic DNA fragmentation and purification

Genomic DNA was fragmented to 200-250 bp with a Covaris E210 instrument (operated according to the instrument instructions). The fragmented DNA was purified with the QIAquick PCR Purification Kit (250) according to the kit instructions, and electrophoresis was used to check whether the main band size met the requirement of 200-250 bp.

3) End repair, A-tailing, adapter ligation, and pre-amplification

As required for library construction, and following the steps, reagents, and reaction conditions listed in the paired-end index library construction manual, the fragmented and purified DNA was end-repaired and purified; a single A base was added to both ends of the end-repaired, purified DNA fragments, and the A-tailed product was purified; sequencing adapters were ligated to both ends of the A-tailed product, and the adapter-ligated DNA fragments were purified with magnetic beads capable of complementary binding to the sequencing adapters. A PCR reaction was prepared to amplify the adapter-ligated DNA fragments, the PCR product was purified with magnetic beads, and electrophoresis was used to check whether the main band of the amplified product was 300-350 bp. The amount of DNA was measured with a NanoDrop UV spectrophotometer; the total must exceed 1.0 μg.

4) SeTR probe hybridization, elution, and amplification

This step follows the instructions of the commercially available NimbleGen SeqCap EZ hybridization and elution kit; the hybridization and elution reagents are purchased or prepared as described there. Prepare a 1.5 mL centrifuge tube and add Cot-1 DNA, the universal blocking sequence (Block), the index blocking sequence (index N Block), and the DNA sample from step 3). Centrifuge for 1 min, concentrate to dryness under vacuum at 60 °C, add hybridization buffer and the other components, vortex and centrifuge, denature in a metal dry bath at 95 °C for 10 min, then vortex and centrifuge at high speed. Add 4.5 μL of probe to the tube and hybridize on a PCR instrument (47 °C, 64-72 hours). Elute after hybridization is complete. Then perform PCR according to the final amplification step of the library construction manual: prepare the PCR reaction as required, mixing thoroughly the DNA obtained from hybridization and elution, polymerase, substrates, PCR reaction buffer, and Flowcell primers (primers designed against the fixed sequences carried on the sequencer's flowcell). The PCR program is: initial denaturation at 94 °C for 2 min; 15 cycles of denaturation at 94 °C for 15 s, annealing at 58 °C for 30 s, and extension at 72 °C for 30 s; then a final extension at 72 °C for 5 min. After PCR, the product is collected, centrifuged, and purified with magnetic beads to obtain the target region library.
The library concentration is measured with a NanoDrop UV spectrophotometer in preparation for sequencing.

3. HiSeq 2000 high-throughput sequencing

DNA libraries that passed quality control were sequenced according to the HiSeq 2000 operating instructions. Each sample yielded about 4.5 G of data, for an average sequencing depth of 100X. Because the efficiency of the capture chip rarely reaches 100%, analysis showed that the final effective sequencing depth of the target regions was 30X to 45X.

Example 3: Detection of CNV, LOH and UPD

The overall workflow is shown in Figure 3. After sequencing, the raw output is in FASTQ format. The high-quality reads remaining after filtering are aligned to the reference genome (hg19, Build 37) with BWA to produce an alignment file in SAM format. samtools then converts the SAM file into a binary BAM file, and the alignments are deduplicated and sorted. samtools is then used again to convert the BAM file to PILEUP format. For details, see the bioinformatics analysis strategy section.

I. Sequencing data filtering and alignment

The sequencing data produced by the Illumina HiSeq 2000 in the example above first undergo simple filtering: reads that are contaminated by adapters, contain more than 5% N bases, or have an average quality below Q20 are removed. The filtered data are then aligned to the human reference genome (hg19, Build 37) with the bwa aligner, which outputs the alignment in SAM (sequence alignment/map) format (the SAM file). Samtools is then used to convert the SAM file to a binary BAM file, remove PCR duplicates and sort the alignments, and GATK is used to realign and recalibrate the alignment results.
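The three read-level filters named above can be sketched as follows. This is a minimal illustration, not the pipeline used in the example: the function names are hypothetical, Phred+33 quality encoding is assumed, and adapter contamination is approximated by an exact substring match.

```python
def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a FASTQ quality string (Phred+33 encoding assumed)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def keep_read(seq, quality, adapter=None, max_n_fraction=0.05, min_mean_quality=20.0):
    """Return True if a read survives the three filters described in the text."""
    if not seq or len(seq) != len(quality):
        return False  # malformed record
    if adapter and adapter in seq:
        return False  # adapter contamination (exact match only, an assumption)
    if seq.upper().count("N") / len(seq) > max_n_fraction:
        return False  # more than 5% N bases
    return mean_phred(quality) >= min_mean_quality  # average quality of at least Q20
```

A production workflow would normally rely on a dedicated trimming tool; the sketch only makes the thresholds explicit.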

II. Calculating the second-corrected coverage depth coefficient r_i of each target region and the fragment heterozygosity R_het

The r_i value and the fragment heterozygosity R_het of each target region are calculated from the information in the probe-region file obtained after the filtering and alignment above. CNVs are predicted from the r_i values with a t-test, and LOH and UPD are predicted from R_het with an F-test.

III. Analysis for detecting CNV, LOH and UPD

1. CNV detection

1.1 Calculating the depth coefficient (R_i) of each target region

The depth of target region i is calculated and denoted TD_i (Formula 1). To keep TD stable across several consecutive target regions, TD_i is corrected by the method of Formula 2, which uses the depths of the n' regions following region i, giving TD_ai. TD_ai is then normalized with Formulas 3 and 4 to give the depth coefficient R_i of each target region.

Formula 1: TD_i = T_ibase / T_ilen

Formula 2:

Formula 3:

Formula 4:

T_ibase: the number of bases aligned to target region i; T_ilen: the length of target region i.
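Formula 1 can be illustrated with a short sketch. The representation of reads and regions as half-open (start, end) coordinate pairs on one chromosome is an assumption made for the example, as are the function names; only Formula 1 is covered.

```python
def aligned_bases(region, reads):
    """T_ibase: number of read bases that fall inside the target region."""
    region_start, region_end = region
    total = 0
    for read_start, read_end in reads:
        overlap = min(read_end, region_end) - max(read_start, region_start)
        if overlap > 0:
            total += overlap
    return total


def target_depth(region, reads):
    """TD_i = T_ibase / T_ilen (Formula 1)."""
    region_start, region_end = region
    length = region_end - region_start  # T_ilen
    if length <= 0:
        raise ValueError("region must have positive length")
    return aligned_bases(region, reads) / length
```

For a 100 bp region covered by one read overlapping 40 bp and another overlapping 50 bp, `target_depth` returns 0.9.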

1.2 Creating a baseline from multiple reference samples (k = 24) and correcting R_i to obtain r_i

Fluctuations in experimental conditions and differences between samples cause the capture efficiency to vary from run to run, which in turn causes R_i to fluctuate and readily produces false CNV signals. It is therefore very useful to build a single baseline from the fluctuations observed across multiple samples. Figure 4 illustrates the benefit of the baseline for detection: the distribution of the precursor R_i (preR_i) fluctuates strongly, R_i fluctuates less, and r_i, obtained by correcting R_i against the baseline, fluctuates least, which makes the occurrence of a CNV easier to detect with greater sensitivity. In theory, for different samples without a CNV, the R_i values within the same target region follow a Poisson distribution and fluctuate stably around a value characteristic of that region. To stabilize this characteristic value, the R_i values of the same region are surveyed across multiple samples and their average (mean R_i) is used in its place, giving each target region its own robust baseline. On the assumption that the R_i of each target region fluctuates around mean R_i, R_i is divided by mean R_i to give r_i, so that r_i follows a normal distribution fluctuating around 1.
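The baseline correction described above amounts to a per-region division by the reference mean. A minimal sketch, assuming each sample is given as a list of R_i values in the same region order (the function names are hypothetical):

```python
from statistics import mean


def build_baseline(reference_R):
    """Per-region mean R_i over the reference samples (one inner list per sample)."""
    if len({len(sample) for sample in reference_R}) != 1:
        raise ValueError("all reference samples must cover the same regions")
    return [mean(column) for column in zip(*reference_R)]


def second_correction(sample_R, baseline):
    """r_i = R_i / mean R_i for every target region of one sample."""
    if len(sample_R) != len(baseline):
        raise ValueError("sample and baseline must cover the same regions")
    if any(b == 0 for b in baseline):
        raise ValueError("baseline contains a region with zero mean depth")
    return [R / b for R, b in zip(sample_R, baseline)]
```

A region with r_i close to 1 behaves like the reference set; values well below or above 1 are candidates for a loss or gain.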

1.3 Detecting CNVs in the target regions

In theory, the r_i values of the same target region from multiple samples follow a normal distribution. When examining target region i of a given sample, the copy number change of that region can therefore be detected by comparing its r_i value with the r_i values of the same region in multiple samples using a t-test. The t statistic is calculated as follows.

In the formula, subscript 1 denotes the target sample and subscript 2 denotes the multiple reference samples. The first mean term is the average of r_i over the n₁ samples to be tested, and the second mean term is the average of r_i over the n₂ reference samples; μ₁ is the theoretical mean of r_i over all samples to be tested, and μ₂ is the theoretical mean of r_i over all reference samples; S₁ and S₂ are the standard deviations of the samples to be tested and of the reference samples, respectively; df is the degrees of freedom, df = n₁ + n₂ - 2.
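For reference, the definitions above match the textbook pooled-variance two-sample t statistic. The form below is that standard statistic, given as an assumption rather than as a transcription of the formula in the original; the symbols \bar{r}_1 and \bar{r}_2 for the two sample means are notation chosen here.

```latex
t = \frac{(\bar{r}_1 - \bar{r}_2) - (\mu_1 - \mu_2)}
         {\sqrt{\dfrac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}
                \left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}},
\qquad df = n_1 + n_2 - 2
```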

When there is a single sample to be tested, that is, n₁ = 1, and the theoretical means of the sample to be tested and of the reference samples are the same, the above formula simplifies to:

With this simplified formula, each target region has a t value for CNV detection, from which a P value (significance level) is obtained. When P < 0.05 for a region, that region is taken to carry a CNV.
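The single-sample test can be sketched as follows. The sketch assumes the pooled two-sample statistic, which for n₁ = 1 and equal theoretical means reduces to t = (r_i - mean of reference r_i) / (S₂ · sqrt(1 + 1/n₂)) with n₂ - 1 degrees of freedom. Instead of computing a P value, it compares |t| with a critical value supplied by the caller; for k = 24 reference samples (23 degrees of freedom) the two-sided 5% critical value of Student's t is about 2.069. Function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def single_sample_t(r_target, r_refs):
    """t statistic for one target r_i against the reference r_i values of the same region."""
    n2 = len(r_refs)
    if n2 < 2:
        raise ValueError("at least two reference samples are required")
    s2 = stdev(r_refs)  # sample standard deviation of the references
    if s2 == 0:
        raise ValueError("reference values have zero variance")
    return (r_target - mean(r_refs)) / (s2 * sqrt(1 + 1 / n2))


def region_has_cnv(r_target, r_refs, t_critical):
    """Two-sided decision: flag the region when |t| reaches the critical value."""
    return abs(single_sample_t(r_target, r_refs)) >= t_critical
```

The sign of t gives the direction of the deviation (positive for a gain, negative for a loss), which is what the block-joining step for large CNVs relies on.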

1.4 Detecting large CNVs

Based on the p value of the single-region t-test, each region is given a pseudo-signal value indicating whether it will be considered in the subsequent joining of CNV regions. Target regions that may share the same CNV are then joined into blocks along the chromosome, which determines the final size and copy number of the CNV.

The pseudo-signal values are assigned as follows. When the measurements of at least four consecutive target regions deviate from the corresponding regions of the reference samples in the same direction (their t values are all greater than 0 or all less than 0), and three of the regions have P values below a first threshold (for example 0.05, the commonly used significance threshold) while the fourth does not exceed a second threshold (0.2, four times the first threshold), all four regions are marked with the direction of deviation (for example + for higher and - for lower) and merged into one block. The number of consecutive same-direction regions and the first and second thresholds are all adjustable. If the distance between one block and another does not exceed a span of 5 regions, the two blocks are merged into one larger block, and so on, until the final blocks are obtained. Following the method and formula of section 1.3, the r value of a block is the average of the r_i values of all the regions it contains; a t-test is performed on the r values of the block in the sample to be tested and in the reference samples, and the P value of the block is calculated. When P < 0.05 for a block, a CNV has occurred in that block; the boundary and size of the block then give the boundary and size of the large CNV.
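The marking and merging rule can be sketched as below. Several details are assumptions made for the illustration: regions are indexed in chromosomal order, the rule is applied to every sliding window of `width` regions, the gap between two blocks is counted as the number of unmarked regions between them, and only blocks with the same direction are merged (as in claim 25). Function names are hypothetical.

```python
def mark_directions(t_values, p_values, width=4, p_first=0.05, p_second=0.2):
    """Mark each region +1, -1 or 0 according to the pseudo-signal rule."""
    marks = [0] * len(t_values)
    for start in range(len(t_values) - width + 1):
        ts = t_values[start:start + width]
        ps = p_values[start:start + width]
        if all(t > 0 for t in ts):
            direction = 1
        elif all(t < 0 for t in ts):
            direction = -1
        else:
            continue  # mixed directions: window does not qualify
        strong = sum(p < p_first for p in ps)
        if strong >= width - 1 and all(p <= p_second for p in ps):
            for i in range(start, start + width):
                marks[i] = direction
    return marks


def merge_blocks(marks, max_gap=5):
    """Join runs of equal marks, then merge same-direction runs separated by a small gap."""
    runs, i = [], 0
    while i < len(marks):
        if marks[i] == 0:
            i += 1
            continue
        j = i
        while j + 1 < len(marks) and marks[j + 1] == marks[i]:
            j += 1
        runs.append([i, j, marks[i]])
        i = j + 1
    merged = []
    for start, end, direction in runs:
        if merged and merged[-1][2] == direction and start - merged[-1][1] - 1 <= max_gap:
            merged[-1][1] = end  # extend the previous block across the gap
        else:
            merged.append([start, end, direction])
    return [tuple(block) for block in merged]
```

Each returned tuple is (first region index, last region index, direction); the block-level t-test described above would then be run on the mean r_i of each block.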

Analysis of the 15 target samples gave CNV results highly consistent with the known validation results (SNP-array results), with no false positives or false negatives (Table 3). In addition, eight 30X whole-genome data sets were simulated, comprising 5 normal samples and 3 samples containing CNVs. CNV detection was run on these 8 simulated data sets and compared with CONTRA, a previously reported CNV prediction program for exome regions (Li J, Lupat R, et al., CONTRA: copy number analysis for targeted resequencing, Bioinformatics. 2012 May 15; 28(10): 1307-13). The sensitivity and specificity of the present method both reached 100%, the copy number of each CNV was determined accurately, and CNVs were detected at a resolution of 500 kb with precise localization. CONTRA, by contrast, had a sensitivity of 88.9% and a specificity of only 66.7%, and did not report copy numbers (Table 4).

Table 3

Table 4

2. LOH detection

2.1 Detecting the heterozygous state of each region across the whole genome

Within a region of the genome of the sample to be tested, find the SNP sites whose allele frequency (AF) in the 1000 Genomes data is 0.1 to 0.9, and calculate the R_het values of the regions containing these SNP sites, both in the 1000 Genomes data and in the sample to be tested, with the formula below. When region i of the sample to be tested is completely heterozygous, R_het = 1; conversely, when it is completely homozygous, R_het = 0.

R_het = MAF / (1 - MAF), where MAF is the minor allele frequency.
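A short sketch of the R_het formula and of the 0.1 to 0.9 frequency filter. It assumes the input is the observed frequency of one allele at a biallelic SNP, so the minor allele frequency is the smaller of that frequency and its complement; the function names are hypothetical.

```python
def r_het(allele_frequency):
    """R_het = MAF / (1 - MAF), with MAF the minor allele frequency."""
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    maf = min(allele_frequency, 1.0 - allele_frequency)
    return maf / (1.0 - maf)  # denominator is at least 0.5, so never zero


def is_informative(population_af, low=0.1, high=0.9):
    """Keep SNP sites whose population allele frequency is 0.1 to 0.9."""
    return low <= population_af <= high
```

With equal allele frequencies (fully heterozygous) `r_het(0.5)` is 1.0, and with a single allele (fully homozygous) `r_het(0.0)` and `r_het(1.0)` are both 0.0, matching the two limiting cases stated above.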

In the sample to be tested, take any SNP site m in a region as the starting point and take the next n consecutive SNP sites to form the heterozygosity set S_m of that region, that is, S_m = {R_het,sm, R_het,s(m+1), ..., R_het,s(m+n)}. In the same way, take the SNP sites at the same positions in the 1000 Genomes database to form the heterozygosity set P_m, that is, P_m = {R_het,pm, R_het,p(m+1), ..., R_het,p(m+n)}. An F-test is used to test whether the variances of the two heterozygosity sets are equal. Specifically, the variance of the heterozygosity set of the region in the sample to be tested and the variance of the heterozygosity set of the same region in the 1000 Genomes samples are calculated, and from them the p value of the heterozygosity set S_m of the region in the sample to be tested.

H₀: σ_s = σ_p

H_A: σ_s ≠ σ_p

p = p_upper + (1 - p_under)

When p ≤ 0.01, H_A is accepted and the heterozygosity set S_m is judged to have lost the heterozygosity present in the population, that is, LOH has occurred in the region containing S_m.
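The variance comparison behind the F-test can be sketched as follows. The sketch stops at the two mutually reciprocal variance ratios; which ratio the text calls F_upper and which F_under is an assumption here, and converting them into p_upper and p_under requires the F distribution from a statistics library, which is deliberately left out.

```python
from math import inf
from statistics import variance


def variance_ratios(sample_set, population_set):
    """Return (F_upper, F_under) for the heterozygosity sets S_m and P_m.

    F_upper is taken as var(S_m) / var(P_m) and F_under as its reciprocal
    (an assumed assignment). Each set needs at least two values.
    """
    var_s = variance(sample_set)
    var_p = variance(population_set)
    if var_s == 0 and var_p == 0:
        raise ValueError("both sets have zero variance")
    f_upper = inf if var_p == 0 else var_s / var_p
    f_under = inf if var_s == 0 else var_p / var_s
    return f_upper, f_under
```

A sample set whose R_het values are all 0 (complete homozygosity) gives F_upper = 0 and an unbounded F_under, the most extreme departure from equal variances.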

2.2 Detecting large LOH regions

Using the results of 2.1 and the approach of step 1.4 for large CNVs, four consecutive subsets that have lost heterozygosity are recorded as one minimum unit. If two units are separated by a span of no more than 2 subsets, they are merged into a larger unit, and so on, until blocks are formed. An F-test is then performed on the R_het values of the sample to be tested and of the 1000 Genomes reference set to calculate the p value of each block. When p ≤ 0.01, the block is considered to carry LOH; otherwise it is a non-LOH block.

Alternatively, stricter merging conditions can be set for more accurate detection. For example, to avoid false positives caused by random error, only regions larger than 5 Mb are considered candidate true LOH regions. On this basis, with a block fault tolerance of 1 (that is, one subset in the block is allowed to have a p value greater than 0.01), consecutive subsets with p ≤ 0.01 near a subset with p ≤ 0.01 from 2.1 are merged with it. Finally, an F-test is performed again on the R_het values within the merged region; if its p value is less than 0.01, the block is considered a true LOH.
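The first merging scheme of this section (minimum units of four subsets, merged across gaps of at most two subsets) can be sketched as below. Subsets are assumed to be indexed in chromosomal order with one p value each, and the gap is counted as the number of subsets lying between two units; both are assumptions, and the function names are hypothetical.

```python
def loh_units(p_values, p_cut=0.01, min_run=4):
    """Runs of at least min_run consecutive subsets with p <= p_cut, as (first, last)."""
    units, start = [], None
    for i, p in enumerate(p_values):
        if p <= p_cut:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                units.append((start, i - 1))
            start = None
    if start is not None and len(p_values) - start >= min_run:
        units.append((start, len(p_values) - 1))
    return units


def merge_units(units, max_gap=2):
    """Merge neighbouring units separated by at most max_gap subsets."""
    merged = []
    for first, last in units:
        if merged and first - merged[-1][1] - 1 <= max_gap:
            merged[-1] = (merged[-1][0], last)
        else:
            merged.append((first, last))
    return merged
```

Each merged block would then go through the block-level F-test described above before being reported as LOH.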

3. UPD detection

UPD is detected by combining the genome-wide CNV and LOH results above according to Mendel's laws of inheritance.

If a DNA region is heterozygous in the 1000 Genomes data (R_het = 1) but its heterozygosity has disappeared in the actual test (R_het approaches 0), LOH is judged to have occurred in the region. If the CNV analysis of the same region shows two copies (CN = 2), that is, the copy number is unchanged (the samples in this example are diploid, and every region of a normal diploid genome has two copies), uniparental disomy (UPD) is judged to have occurred in the region.
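The decision rule combining the two results is simple enough to state as code. A minimal sketch, assuming the LOH call and the copy number of the region are already available; the function name and the returned labels are hypothetical.

```python
def classify_region(has_loh: bool, copy_number: int, normal_copies: int = 2) -> str:
    """Combine the LOH call and the copy number of one region."""
    if not has_loh:
        return "no LOH"
    if copy_number == normal_copies:
        return "UPD"  # heterozygosity lost but copy number unchanged
    return "LOH with copy number change"
```

For a diploid sample, `classify_region(True, 2)` returns "UPD", while `classify_region(True, 1)` corresponds to LOH caused by a single-copy deletion.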

In 13 of the 15 samples, 10 LOH regions larger than 5 Mb and 4 UPD regions larger than 5 Mb were detected (Table 5). LOH and UPD were detected without paired samples. Paired samples are related samples, typically diseased and normal tissue from the same individual that are compared with each other; in this embodiment, LOH and UPD are instead detected by comparing the target sample with a set of reference samples that are unrelated to it, so the samples are not paired. The LOH results for regions of at least 5 Mb agree with the CNV results for CN = 1, so the CNV results can be used to verify the accuracy of the LOH results. The method of the invention thus detects LOH and UPD with high accuracy, at a resolution of about 5 Mb.

The Circos plot (Figure 5) summarizes the CNV, LOH and UPD results for sample GM50275.

Table 5

Industrial Applicability

The method of the invention for determining probe sequences based on a reference sequence can be used effectively to determine probe sequences. The resulting probes are used for hybridization capture of the genome to obtain multiple local genomic regions; these captured regions are representative of the whole genome, reflect genome-wide variation, and can be used to discover structural variation across the whole genome.

Although specific embodiments of the invention have been described in detail, those skilled in the art will understand that various modifications and substitutions may be made to those details in light of all the teachings disclosed, and that such changes fall within the scope of protection of the invention. The full scope of the invention is given by the appended claims and any equivalents thereof.

In this specification, a description that refers to the terms "one embodiment", "some embodiments", "exemplary embodiment", "example", "specific example" or "some examples" means that a particular feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. Such references do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Claims

1. A method for determining a probe sequence based on a reference sequence, comprising:

(1) constructing a first candidate probe set based on a plurality of discrete high-frequency SNP sites, wherein the first candidate probe set is composed of a plurality of candidate probes, each of the plurality of candidate probes corresponds to at least one discrete high-frequency SNP site, and the allele frequency of each of the plurality of discrete high-frequency SNP sites is at least 10 percent respectively;

(2) aligning the plurality of candidate probes in the first candidate probe set with a reference sequence to obtain an alignment result;

(3) performing a first screening on the first candidate probe set based on the alignment result to obtain a second candidate probe set consisting of a plurality of candidate probes,

wherein the first screening comprises retaining candidate probes that satisfy at least one of the following conditions:

a candidate probe that is uniquely aligned with the reference sequence;

candidate probes aligned to a plurality of positions of the reference sequence, and at least two of the plurality of positions each having a mismatch ratio of less than 10%;

(4) dividing the reference sequence into a plurality of windows with predetermined lengths respectively, and distributing a plurality of candidate probes in the second candidate probe set to the matched windows respectively so as to determine the position information of the candidate probes respectively;

(5) performing a second screening of the second candidate probe set based on the positional information and the allele frequencies of the discrete high frequency SNP sites to determine the probe sequences,

wherein the probe is determined according to the following steps:

(a) if a plurality of candidate probes are positioned in the same window, determining the candidate probe with the highest allele frequency of the corresponding discrete high-frequency SNP locus;

(b) if only one candidate probe in the window has that highest allele frequency of the corresponding discrete high-frequency SNP site, selecting that candidate probe as the probe; and if a plurality of candidate probes in the window share that highest allele frequency, selecting from among them the candidate probe closest to the center of the window as the probe.
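Steps (4) and (5) of claim 1 can be illustrated with a short sketch. The input format is an assumption: each candidate is a tuple of a probe identifier, a 0-based position on the reference (for example the probe midpoint) and the allele frequency of its SNP site. The 10 kb window size follows claim 11, and the function name is hypothetical.

```python
from collections import defaultdict


def select_probes(candidates, window_size=10_000):
    """Pick one probe per window: highest allele frequency, ties broken by distance to the window center."""
    by_window = defaultdict(list)
    for probe_id, position, allele_freq in candidates:
        by_window[position // window_size].append((probe_id, position, allele_freq))

    selected = {}
    for window, probes in by_window.items():
        best_af = max(af for _, _, af in probes)
        tied = [p for p in probes if p[2] == best_af]
        center = window * window_size + window_size / 2
        selected[window] = min(tied, key=lambda p: abs(p[1] - center))[0]
    return selected
```

The result maps each window index to the identifier of the probe kept for that window; windows without any candidate simply do not appear.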

2. The method of claim 1, wherein the allele frequency of each of the plurality of discrete high frequency SNP sites is no more than 90% respectively.

3. The method of claim 1, wherein the physical distance on the reference sequence between any two adjacent discrete high-frequency SNP sites of the plurality of discrete high-frequency SNP sites is not less than the length of the candidate probe.

4. The method of claim 1, wherein the candidate probe is 50-250 mers in length.

5. The method of claim 4, wherein the candidate probe is 100 mers in length.

6. The method of claim 1, wherein the candidate probe corresponds to one of the discrete high frequency SNP sites, and wherein the discrete high frequency SNP site corresponds to a mid-section of the candidate probe.

7. The method of claim 6, wherein the discrete high frequency SNP sites correspond to midpoints of the candidate probes.

8. The method of claim 1, wherein the candidate probe is a segment excerpted from the reference sequence.

9. The method of claim 1, wherein before the alignment is performed, the first candidate probe set is pre-screened based on at least one of the GC content and the number of single-base repeats of the candidate probes;

the pre-screening includes retaining candidate probes that satisfy at least one of:

the GC content is 35 to 65 percent; and

the number of single-base repeats is less than 7.
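The two pre-screening criteria can be sketched as follows. Two points are assumptions made for the illustration: the number of single-base repeats is read as the length of the longest run of one repeated base, and the sketch requires both criteria at once, which is stricter than the "at least one of" wording of the claim. Function names are hypothetical.

```python
from itertools import groupby


def gc_content(seq):
    """Fraction of G and C bases in the probe sequence."""
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)


def longest_single_base_run(seq):
    """Length of the longest stretch of one repeated base."""
    return max((sum(1 for _ in group) for _, group in groupby(seq.upper())), default=0)


def passes_prescreen(seq, gc_min=0.35, gc_max=0.65, max_run=7):
    """True if GC content is 35-65% and the longest single-base run is shorter than 7."""
    return gc_min <= gc_content(seq) <= gc_max and longest_single_base_run(seq) < max_run
```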

10. The method of claim 1, wherein in step (4), the reference sequence is divided into a plurality of windows each having the same predetermined length.

11. The method of claim 10, wherein the reference sequence is partitioned into a plurality of windows of 10Kb in length.

12. The method of claim 1, wherein after the second candidate probe set is subjected to the second screening, when the distance between two adjacent candidate probes in the second candidate probe set respectively falling into two adjacent windows on the reference genome is greater than the length of either of the two adjacent windows, the short tandem repeat sequence or a part of the short tandem repeat sequence located between the two adjacent candidate probes on the reference genome is further added to the second candidate probe set subjected to the second screening to form the probe sequence together.

13. The method of claim 1, wherein the reference sequence is a reference genome or a portion thereof.

14. A method of detecting a genomic structural variation comprising at least one of a chromosomal aneuploidy, a copy number variation, and an indel, for a non-diagnostic purpose, the method comprising,

(1) sequencing the genomic nucleic acid of the target sample to obtain a genomic sequencing result, the genomic sequencing result being comprised of a plurality of reads, wherein the sequencing comprises screening with a probe, wherein the probe is obtained by the method of any one of claims 1 to 13;

(2) dividing the reference genome into m regions and calculating the coverage depth TD_i of region i from the number of reads falling into region i, where m and i are natural numbers, i denotes the region number, 1 ≤ i ≤ m, and m > 10;

(3) determining whether region i has structural variation based on the degree of difference between the coverage depth of region i and the coverage depths of region i in k reference samples, where k is a natural number and k ≥ 2.

15. The method of claim 14, wherein the depth of coverage of the region i is determined using the following equation:

or

where i denotes the number of the region.

16. The method of claim 14, wherein the test for the degree of difference between the depth of coverage of the genomic region i of the target sample and the depth of coverage of the region i of the k reference samples is performed by a t-test.

17. The method according to claim 14, wherein the degree of difference between the coverage depth of region i and the coverage depths of region i of the k reference samples is compared by comparing the coverage depth coefficients of genomic region i of the target sample and of the reference samples, wherein determining the coverage depth coefficient R_i of region i comprises the steps of:

(a) performing a first correction on TD_i to obtain a first-corrected coverage depth TD_ai, the first correction being implemented by linear regression on the coverage depth values of 2n consecutive regions including region i, where n is a natural number and 10 < n ≤ m/2;

(b) normalizing TD_ai to obtain

thereby obtaining

18. The method of claim 17, wherein in step (a), the first-corrected coverage depth TD_ai is determined based on the following formula:

wherein, for TD_j, j is a natural number and 1 ≤ j ≤ n.

19. The method of claim 18, wherein in step (b), TD_ai is normalized based on the following formula to obtain

20. The method of any one of claims 16 to 19, wherein after R_i of the target sample is obtained, the method further comprises performing a second correction on R_i to obtain r_i,

wherein R_ai is the average of the coverage depth coefficients of genomic region i of the k reference samples,

y is a natural number denoting the reference sample number, and R_i,y denotes the coverage depth coefficient of genomic region i of reference sample y.

21. The method of any one of claims 16 to 19, wherein after R_i of the target sample is obtained, the method further comprises performing a second correction on R_i to obtain r_i,

wherein R_ai is the mean of the coverage depth coefficients of genomic region i of the k reference samples and the one target sample,

22. The method of claim 20, wherein the t-test is performed such that the t statistic for genomic region i of the target sample is calculated as

wherein,

the mean term represents the average value of r_i,y over the k reference samples, r_i,y is the second-corrected coverage depth coefficient of genomic region i of reference sample y,

s is the standard deviation of the k reference samples,

23. The method of claim 22, wherein a significance level P_i is obtained based on the t_i value of genomic region i of the target sample; when P_i < 0.05, it is judged that structural variation exists in region i; otherwise, it is judged that structural variation does not exist in region i.

24. The method of claim 22, wherein a theoretical value t_i0 of t_i is obtained based on the t_i value of genomic region i of the target sample and a predetermined significance level P_i0; when t_i ≥ t_i0, it is judged that structural variation exists in region i; otherwise, it is judged that structural variation does not exist in region i; the predetermined P_i0 ≤ 0.05.

25. The method according to any one of claims 14 to 19, wherein after performing step (3), W regions that are co-directional and continuous are merged to obtain a primary merged region, when the two primary merged regions are co-directional and span no more than L regions, the two primary merged regions are merged to obtain a secondary merged region, and structural variation of the secondary merged region is detected based on the degree of difference between the coverage depth of the secondary merged region of the target sample genome and the coverage depth of the corresponding regions on the plurality of reference sample genomes; wherein, the equidirectional region refers to a region in which the t statistics of the region are both greater than 0 or both less than 0, W and L are both natural numbers, W is greater than or equal to 2, and L-W is less than or equal to 1.

26. A method for detecting loss of heterozygosity for non-diagnostic purposes comprising,

(2) dividing a reference genome into m' regions; based on the reads falling in region i in the genome sequencing result and on the data of population region i, obtaining the SNP sites shared by target sample genomic region i and population region i to form a shared SNP set; calculating the heterozygosity of the fragments containing the SNP sites of the shared SNP set in target sample genomic region i and in the population, respectively, to obtain a heterozygosity set U_i for target sample genomic region i and a heterozygosity set U_0i for population region i; and comparing U_i of the target sample with U_0i of the population to determine whether loss of heterozygosity has occurred in target sample region i; wherein the fragment containing an SNP site is the segment whose boundary points are the two SNPs adjacent to that SNP, one upstream and one downstream, m' and i are natural numbers, m' is not less than 1 and not less than 6.

27. The method of claim 26, wherein each SNP in the common set of SNPs has an allele frequency greater than 0.1.

28. The method according to claim 26, wherein the heterozygosity of the fragment containing the SNP site is represented by the minor allele frequency coefficient R_het of the SNP site, R_het = MAF/(1-MAF), where MAF is the minor allele frequency of the SNP.

29. The method of claim 28, wherein comparing U_i of the target sample with U_0i of the population to determine whether loss of heterozygosity has occurred in target sample region i comprises using an F-test to determine whether the variance of U_i and the variance of U_0i differ significantly; if the variances of U_i and U_0i differ significantly, it is determined that loss of heterozygosity has occurred in target sample region i; otherwise, it is determined that loss of heterozygosity has not occurred in target sample region i.

30. The method of claim 29, wherein the F-test comprises separately calculating the variance of U_i and the variance of U_0i, using the obtained variance of the target sample U_i and variance of the population U_0i to calculate two mutually reciprocal statistics F_upper and F_under, obtaining a significance level p_F from said reciprocal statistics, and comparing p_F with a predetermined significance level p_F0, including the calculation formula

p_F = p_upper + (1 - p_under),

wherein v is the index of an SNP in the high-frequency SNP set shared by target sample genomic region i and population region i, q is the number of SNPs in that shared high-frequency SNP set, R_het,i,v is the minor-allele frequency coefficient of the v-th SNP of the shared high-frequency SNP set in target sample genomic region i, the variance of U_i is taken about the average of the minor-allele frequency coefficients of the q SNPs of the shared high-frequency SNP set in target sample genomic region i, R_het,i0,v is the minor-allele frequency coefficient of the v-th SNP of the shared high-frequency SNP set in population sample genomic region i, the variance of U_0i is taken about the average of the minor-allele frequency coefficients of the q SNPs of the shared high-frequency SNP set in population sample genomic region i, p_upper and p_under are obtained from F_upper and F_under respectively, and p_F0 ≤ 0.05.
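
The variance comparison can be sketched in standard-library Python as below. Several points are assumptions rather than statements of the claims: the variances use the q - 1 denominator, F_upper is taken as the larger variance over the smaller, p_upper and p_under are upper-tail probabilities of an F(q - 1, q - 1) distribution, and a region is called when p_F < p_F0. All function names are illustrative.

```python
import math
from statistics import variance


def _betacf(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def _betainc(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


def f_sf(x, d1, d2):
    """Upper-tail probability P(F > x) for an F(d1, d2) distribution."""
    if x <= 0.0:
        return 1.0
    return _betainc(d2 / 2.0, d1 / 2.0, d2 / (d2 + d1 * x))


def loh_f_test(u_sample, u_population, p_f0=0.05):
    """Return (p_F, called) where p_F = p_upper + (1 - p_under)."""
    q = len(u_sample)
    if q != len(u_population) or q < 2:
        raise ValueError("both sets must hold the same q >= 2 coefficients")
    lo, hi = sorted((variance(u_sample), variance(u_population)))
    if hi == 0.0:   # both sets constant: no variance difference to detect
        return 1.0, False
    if lo == 0.0:   # one set constant: the variance ratio is unbounded
        return 0.0, True
    f_upper, f_under = hi / lo, lo / hi   # mutually reciprocal statistics
    p_upper = f_sf(f_upper, q - 1, q - 1)
    p_under = f_sf(f_under, q - 1, q - 1)
    p_f = p_upper + (1.0 - p_under)
    return p_f, p_f < p_f0


# Hypothetical R_het values for eight shared SNPs in one region.
u_i = [0.02, 0.0, 0.03, 0.01, 0.0, 0.02, 0.01, 0.0]
u_0i = [0.25, 0.6, 0.9, 0.4, 0.15, 0.75, 0.5, 0.33]
p_f, called = loh_f_test(u_i, u_0i)
print(p_f, called)
```

Because both sets contain the same q SNPs, the two degrees of freedom are equal, so the order of numerator and denominator does not change p_F under these assumptions.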

31. The method according to any one of claims 26 to 30, wherein after step (2), W' consecutive regions having loss of heterozygosity are merged to obtain a three-level merged region; when the span between two three-level merged regions does not exceed L' regions, the two three-level merged regions are merged to obtain a four-level merged region; a heterozygosity set of the target sample four-level merged region and a heterozygosity set of the same region of the population are obtained respectively, and the two heterozygosity sets are compared to determine whether loss of heterozygosity occurs in the target sample four-level merged region; wherein W' and L' are both natural numbers, W' is not less than 2, and W'/2 is not less than L'.
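
The two merging passes can be sketched as follows. This is an illustration under stated assumptions: a run of at least W' consecutive flagged regions is treated as a three-level merged region (the claim may intend exactly W'), the returned four-level list also carries three-level regions that found no neighbour to merge with, and the function and parameter names are hypothetical. The re-test of each merged region is not shown.

```python
def merge_loh_regions(flags, w, l):
    """Merge per-region loss-of-heterozygosity flags in two passes.

    flags -- sequence of booleans, one per region, in genome order
    w     -- minimum run length for a three-level merged region (W')
    l     -- maximum gap, in regions, bridged by the second pass (L')
    Returns (three_level, four_level) as lists of inclusive (start, end) indices.
    """
    if w < 2 or l > w / 2:
        raise ValueError("requires W' >= 2 and L' <= W'/2")

    # Pass 1: collect runs of at least w consecutive flagged regions.
    three_level, start = [], None
    for idx, flag in enumerate(list(flags) + [False]):  # sentinel closes a final run
        if flag and start is None:
            start = idx
        elif not flag and start is not None:
            if idx - start >= w:
                three_level.append((start, idx - 1))
            start = None

    # Pass 2: join neighbouring runs separated by at most l regions.
    four_level = []
    for run in three_level:
        if four_level and run[0] - four_level[-1][1] - 1 <= l:
            four_level[-1] = (four_level[-1][0], run[1])
        else:
            four_level.append(run)
    return three_level, four_level


flags = [True, True, False, True, True, True, False, False, False, True, True]
print(merge_loh_regions(flags, w=2, l=1))
# ([(0, 1), (3, 5), (9, 10)], [(0, 5), (9, 10)])
```

In the example, the single unflagged region at index 2 is bridged because the gap does not exceed L' = 1, while the three-region gap before index 9 is not.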

32. A method for detecting uniparental disomy, said method being used for non-diagnostic purposes, characterized in that when loss of heterozygosity is detected in a genomic region of a target sample, the copy number of the genomic region is calculated, and when the copy number of the genomic region is the same as that of the same genomic region in a normal genome of the same species, the genomic region of the target sample is determined to be uniparental disomy; wherein the loss of heterozygosity in the genomic region of the target sample is determined by the method of any one of claims 26 to 30.
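
The decision rule of claim 32 reduces to a two-condition check: loss of heterozygosity together with an unchanged copy number. A minimal sketch follows; the function name, the default normal copy number of 2, and the rounding of an estimated copy number to the nearest integer are assumptions.

```python
def is_copy_neutral_loh(has_loh, copy_number, normal_copy_number=2):
    """Return True when a region shows loss of heterozygosity while its
    copy number matches that of the same region in a normal genome.

    copy_number may be a fractional estimate; it is rounded before
    comparison (an assumption, not a step stated in the claim).
    """
    return bool(has_loh) and round(copy_number) == normal_copy_number


print(is_copy_neutral_loh(True, 2.04))   # True: LOH with normal copy number
print(is_copy_neutral_loh(True, 1.02))   # False: LOH with a changed copy number
print(is_copy_neutral_loh(False, 2.0))   # False: no LOH detected
```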