CN108319817B - Method and device for processing circulating tumor DNA repetitive sequence

Info

Publication number: CN108319817B
Application number: CN201810036544.1A
Authority: CN
Inventors: 郭昊; 于佳宁; 韩天澄; 宋雪; 林小静
Original assignee: Wuxi Zhenhe Biotechnology Co ltd
Current assignee: Wuhan Zhenhe Medical Laboratory Co ltd; Zhenyue Biotechnology Jiangsu Co ltd
Priority date: 2018-01-15
Filing date: 2018-01-15
Publication date: 2020-12-25
Anticipated expiration: 2038-01-15
Also published as: CN108319817A

Abstract

The invention discloses a method and device for processing repetitive sequences of circulating tumor DNA. The method includes: acquiring sequencing data of the circulating tumor DNA to be detected and a reference genome sequence, where the sequencing data is obtained by high-throughput sequencing of the circulating tumor DNA to be detected; aligning the sequencing data to the reference genome sequence to obtain a first alignment result, where the first alignment result includes at least the genomic positions, base sequences, and corresponding base quality value sequences of multiple pairs of paired-end sequences; obtaining at least one sequence set based on the first alignment result; performing network clustering on the at least one pair of paired-end sequences in each sequence set to obtain at least one first network; and selecting, in each first network, the paired-end sequence with the largest count to obtain the independent sequence corresponding to that first network. The invention solves the technical problem that prior-art sequencing-data processing methods delete or mark duplicate sequences in sample sequencing data with low accuracy.

Description

Method and device for processing circulating tumor DNA repeats

Technical Field

The present invention relates to the field of genetic engineering and, in particular, to a method and device for processing repetitive sequences of circulating tumor DNA.

Background Art

During division and proliferation, tumor cells undergo apoptosis, death, and necrosis, and also actively release DNA fragments carrying tumor mutations into body fluids. These fragments, known as circulating tumor DNA (ctDNA), are found mostly in body fluids such as blood, synovial fluid, and cerebrospinal fluid, and especially in plasma cell-free DNA (cfDNA). Sequencing ctDNA to detect the genomic regions where the base sequence of tumor-cell DNA has changed (mutated) can effectively reflect a patient's response to treatment. After a drug response has been detected, the tumor may develop resistance to the drug, and ctDNA testing can also track the emergence of resistance mutations, both qualitatively and quantitatively. It can further be used to detect residual tissue after surgery, to assess prognosis, and to screen for early-stage tumors. Unlike conventional genomic DNA, ctDNA fragments are short, usually only 100 to 400 bp, and are scarce in blood, so the amount of ctDNA that can be extracted in practice is very low.

Because ctDNA fragments are short and scarce, when the extracted amount is small, multiple rounds of polymerase chain reaction (PCR) are needed at the library-construction stage to amplify the originally extracted DNA and produce enough DNA molecules for high-throughput sequencing (also called NGS sequencing) and subsequent bioinformatic analysis. PCR amplification copies a single molecule many times and thus produces duplicated reads, and this redundant data can easily introduce artifacts into variant detection. An ideal NGS data analysis pipeline therefore removes as much of the PCR-derived sequencing data as possible, restoring the data to its pre-PCR state.

The prior art provides two deduplication methods for duplicate sequences: samtools rmdup and Picard's MarkDuplicates. samtools rmdup works as follows. Each sequence (read) obtained by NGS sequencing is aligned (mapped) to the human reference genome to obtain its alignment position. If different reads align to the same genomic position, they are regarded as duplicates produced by PCR; only the read with the highest mapping quality is kept, and the remaining duplicates are deleted. For PE reads, if the two reads of a pair align to different chromosomes or lie too far apart (that is, the pair is not properly paired), the pair is not considered for deduplication. Picard's MarkDuplicates follows the same basic idea as samtools rmdup: it compares the mapping positions of the 5' ends of the reads and, among reads sharing the same 5' position, selects the read with the highest sequencing quality as the only read retained after deduplication. It also deduplicates PE reads that are not properly paired.
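
To illustrate why a position-only rule can over-deduplicate, the following toy Python sketch applies the rule described above: one read is kept per alignment position, chosen by mapping quality. It is not the samtools or Picard implementation; the `Read` record, its field names, and the sample reads are hypothetical.

```python
from typing import NamedTuple


class Read(NamedTuple):
    name: str
    chrom: str
    pos: int    # leftmost mapped coordinate
    mapq: int   # mapping quality
    seq: str


def dedup_by_position(reads):
    """Keep one read per (chrom, pos): the one with the highest mapq."""
    best = {}
    for r in reads:
        key = (r.chrom, r.pos)
        if key not in best or r.mapq > best[key].mapq:
            best[key] = r
    return list(best.values())


reads = [
    Read("r1", "chr1", 100, 60, "ACGT"),
    Read("r2", "chr1", 100, 50, "ACGT"),
    Read("r3", "chr1", 100, 40, "ACTT"),  # different sequence, same position
    Read("r4", "chr1", 200, 60, "GGCC"),
]
kept = dedup_by_position(reads)
print([r.name for r in kept])
```

Here `r3` differs in sequence from `r1` and `r2` and may come from a different original molecule, yet it is discarded because only position and quality are examined.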

However, several original molecules may well exist at the same genomic position. These molecules are not PCR duplicates in the usual sense and may carry meaningful mutations (in ctDNA, for example, tumor-associated variants). In this situation, samtools rmdup and Picard's MarkDuplicates wrongly treat them as the same original molecule and keep only one pair of reads, which leads to over-deduplication and wastes part of the meaningful data. In addition, PCR may introduce random errors that are likely to produce false-positive mutation calls, and neither samtools rmdup nor Picard's MarkDuplicates has any means of correcting them. Both tools consider only some quality value of a read and ignore differences in the actual read sequences, which results in over-deduplication or in the wrong reads being retained as unique reads.

No effective solution has yet been proposed for the problem that prior-art sequencing-data processing methods delete or mark duplicate sequences in sample sequencing data with low accuracy.

Summary of the Invention

Embodiments of the present invention provide a method and device for processing repetitive sequences of circulating tumor DNA, so as to at least solve the technical problem that prior-art sequencing-data processing methods delete or mark duplicate sequences in sample sequencing data with low accuracy.

According to one aspect of the embodiments of the present invention, a method for processing repetitive sequences of circulating tumor DNA is provided, including: acquiring sequencing data of the circulating tumor DNA to be detected and a reference genome sequence, where the sequencing data is obtained by high-throughput sequencing of the circulating tumor DNA to be detected and includes multiple pairs of paired-end sequences; aligning the sequencing data to the reference genome sequence to obtain a first alignment result, where the first alignment result includes at least the genomic positions, base sequences, and corresponding base quality value sequences of the multiple pairs of paired-end sequences; obtaining at least one sequence set based on the first alignment result, where each sequence set includes at least one pair of paired-end sequences and the paired-end sequences in the same sequence set have the same genomic position; performing network clustering on the at least one pair of paired-end sequences in each sequence set to obtain at least one first network; and selecting, in each first network, the paired-end sequence with the largest count to obtain the independent sequence corresponding to that first network.

Further, performing network clustering on the at least one pair of paired-end sequences in each sequence set to obtain at least one first network includes: comparing the paired-end sequences in each sequence set to obtain at least one node and the member count of each node, where the paired-end sequences corresponding to the same node have identical base sequences and the member count represents the number of paired-end sequences corresponding to that node; obtaining the edit distance between any two nodes in each sequence set; determining whether the edit distance between the two nodes is less than a preset distance; and, if it is, determining that the two nodes belong to the same network, thereby obtaining at least one first network.
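
The clustering just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: each pair of paired-end sequences is represented as one concatenated string, `preset_distance=2` is a placeholder threshold, and the function names are hypothetical.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def cluster_networks(pair_seqs, preset_distance=2):
    """Collapse identical sequences into nodes, link nodes whose edit
    distance is below preset_distance, and return connected components."""
    counts = Counter(pair_seqs)          # node -> member count
    nodes = list(counts)
    adj = {n: set() for n in nodes}
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if edit_distance(u, v) < preset_distance:
                adj[u].add(v)
                adj[v].add(u)
    networks, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            n = stack.pop()
            if n in component:
                continue
            component.add(n)
            stack.extend(adj[n] - component)
        seen |= component
        networks.append(component)
    return counts, adj, networks


# Ten pairs sharing one genomic position (toy sequences).
seqs = ["ACGTAC"] * 5 + ["ACGTAA"] * 2 + ["TTGTCA"] * 3
counts, adj, networks = cluster_networks(seqs)
print(networks)
```

In this toy input the two sequences that differ by a single base fall into one network, while the clearly different third sequence forms a network of its own and is therefore kept as a separate candidate molecule.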

Further, selecting, in each first network, the paired-end sequence with the largest count to obtain the independent sequence corresponding to that first network includes: Step A, obtaining the first node, that is, the node with the largest member count, in each first network, computing for each pair of paired-end sequences corresponding to the first node the sum of the base quality values of all its bases (the base quality sum), and taking the pair with the largest base quality sum as the independent sequence corresponding to that first network; Step B, deleting the first node and the nodes adjacent to the first node from each first network to obtain at least one second network; Step C, determining whether any second network contains nodes; Step D, if a second network contains nodes, treating that second network as a first network and repeating Step A and Step B.
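
Steps A to D can be sketched as the loop below. This is an illustrative reading of the steps, not the patent's code: `adj` (node adjacency), `members` (node to list of `(pair_name, base_quality_sum)` tuples), and all function names are assumed data structures. Whatever remains after Step B is re-split into connected components, each of which is treated as a new first network.

```python
def components(nodes, adj):
    """Connected components of the sub-graph induced by `nodes`."""
    remaining, out = set(nodes), []
    while remaining:
        stack, comp = [remaining.pop()], set()
        while stack:
            n = stack.pop()
            comp.add(n)
            nxt = adj[n] & remaining   # unvisited neighbours
            remaining -= nxt
            stack.extend(nxt)
        out.append(comp)
    return out


def pick_independent(network, adj, members):
    """Return the names of the pairs kept as independent sequences."""
    chosen, pending = [], [set(network)]
    while pending:                      # Steps C/D: loop while nodes remain
        net = pending.pop()
        # Step A: node with the most members, then its best-quality pair.
        top = max(net, key=lambda n: len(members[n]))
        chosen.append(max(members[top], key=lambda m: m[1])[0])
        # Step B: delete that node and its adjacent nodes.
        rest = net - {top} - adj[top]
        pending.extend(components(rest, adj))
    return chosen


# Toy chain network A - B - C (hypothetical names and quality sums).
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
members = {
    "A": [("a1", 700), ("a2", 720), ("a3", 690)],
    "B": [("b1", 710)],
    "C": [("c1", 705), ("c2", 650)],
}
chosen = pick_independent({"A", "B", "C"}, adj, members)
print(chosen)
```

Node B is absorbed as a likely error copy of A, whereas C, which is not adjacent to A, survives the deletion and contributes its own independent sequence.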

Further, before determining that any two nodes belong to the same network to obtain at least one first network, the method also includes: determining whether the member counts of the two nodes satisfy a preset condition; and, if they do, determining that the two nodes belong to the same network, thereby obtaining at least one first network.

Further, determining whether the member counts of any two nodes satisfy the preset condition includes: multiplying the member count of the first of the two nodes by a preset multiple to obtain a first value; obtaining the difference between the first value and a preset value; determining whether the member count of the second of the two nodes is greater than the difference; and, if it is, determining that the member counts of the two nodes satisfy the preset condition.
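
Read literally, the condition amounts to a single inequality, sketched below. The default values `multiple=2` and `preset_value=1` are placeholders chosen only for illustration; this passage does not state the actual values, and the function name is hypothetical.

```python
def counts_satisfy(first_count, second_count, multiple=2, preset_value=1):
    """True if second_count > multiple * first_count - preset_value."""
    first_value = multiple * first_count       # preset multiple of first node
    difference = first_value - preset_value    # first value minus preset value
    return second_count > difference


print(counts_satisfy(3, 6), counts_satisfy(3, 5))
```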

Further, after selecting, in each first network, the paired-end sequence with the largest count to obtain the independent sequence corresponding to that first network, the method also includes: deleting from the first alignment result the alignment records of all paired-end sequences other than the independent sequence corresponding to each first network, to obtain a second alignment result.

Further, after deleting from the first alignment result the alignment records of all paired-end sequences other than the independent sequence corresponding to each first network to obtain the second alignment result, the method also includes: sorting the second alignment result by the genomic position of each independent sequence to obtain a third alignment result.

Further, after sorting the second alignment result by the genomic position of each independent sequence to obtain the third alignment result, the method also includes: acquiring the capture sequencing interval; and performing single-nucleotide variant detection and insertion/deletion (indel) detection on each independent sequence according to the capture sequencing interval to obtain a detection result.

Further, before obtaining at least one sequence set based on the first alignment result, the method also includes: sorting the first alignment result by the genomic positions of the multiple pairs of paired-end sequences to obtain a fourth alignment result, and building an index for the fourth alignment result; filtering the fourth alignment result to obtain a fifth alignment result; and obtaining the at least one sequence set based on the fifth alignment result.

Further, aligning the sequencing data to the reference genome sequence to obtain the first alignment result includes: obtaining the degree of match between each sequence of the multiple pairs of paired-end sequences and each segment of the reference genome sequence; taking the at least one segment with the highest degree of match as the matching sequence of each sequence; and determining the genomic position of each sequence from its matching sequence.

According to another aspect of the embodiments of the present invention, a device for processing repetitive sequences of circulating tumor DNA is also provided, including: a first acquisition module, configured to acquire sequencing data of the circulating tumor DNA to be detected and a reference genome sequence, where the sequencing data is obtained by high-throughput sequencing of the circulating tumor DNA to be detected and includes multiple pairs of paired-end sequences; an alignment module, configured to align the sequencing data to the reference genome sequence to obtain a first alignment result, where the first alignment result includes at least the genomic positions, base sequences, and corresponding base quality value sequences of the multiple pairs of paired-end sequences; a processing module, configured to obtain at least one sequence set based on the first alignment result, where each sequence set includes at least one pair of paired-end sequences and the paired-end sequences in the same sequence set have the same genomic position; a clustering module, configured to perform network clustering on the at least one pair of paired-end sequences in each sequence set to obtain at least one first network; and a second acquisition module, configured to select, in each first network, the paired-end sequence with the largest count to obtain the independent sequence corresponding to that first network.

According to another aspect of the embodiments of the present invention, a storage medium is also provided. The storage medium includes a stored program which, when run, controls the device on which the storage medium resides to execute the above method for processing repetitive sequences of circulating tumor DNA.

According to another aspect of the embodiments of the present invention, a processor is also provided. The processor is configured to run a program which, when run, executes the above method for processing repetitive sequences of circulating tumor DNA.

In the embodiments of the present invention, sequencing data of the circulating tumor DNA to be detected and a reference genome sequence are acquired; the sequencing data is aligned to the reference genome sequence to obtain a first alignment result; at least one sequence set is obtained based on the first alignment result; network clustering is performed on the at least one pair of paired-end sequences in each sequence set to obtain at least one first network; and the paired-end sequence with the largest count in each first network is selected to obtain the independent sequence corresponding to that network, thereby deduplicating the duplicate sequences. Notably, because paired-end sequences with the same genomic position are network-clustered after alignment and the most numerous paired-end sequence in each network is selected as the independent sequence, the method considers differences in the actual sequences in addition to sequence quality values. This achieves the technical effect of retaining more original molecules, reducing artifacts, and improving processing accuracy, and thereby solves the technical problem that prior-art sequencing-data processing methods delete or mark duplicate sequences in sample sequencing data with low accuracy.

Brief Description of the Drawings

The drawings described here are provided for further understanding of the present invention and form part of this application. The exemplary embodiments of the present invention and their descriptions serve to explain the invention and do not unduly limit it. In the drawings:

FIG. 1 is a flowchart of a method for processing repetitive sequences of circulating tumor DNA according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a first optional method for processing repetitive sequences of circulating tumor DNA according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a second optional method for processing repetitive sequences of circulating tumor DNA according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a third optional method for processing repetitive sequences of circulating tumor DNA according to an embodiment of the present invention;

FIG. 5 is a flowchart of an optional method for processing repetitive sequences of circulating tumor DNA according to an embodiment of the present invention; and

FIG. 6 is a schematic diagram of a device for processing repetitive sequences of circulating tumor DNA according to an embodiment of the present invention.

Detailed Description

To help those skilled in the art better understand the solutions of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by persons of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort shall fall within the scope of protection of the present invention.

It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings of the present invention are used to distinguish similar objects and do not necessarily describe a particular order or sequence. It should be understood that the data so used are interchangeable where appropriate, so that the embodiments of the invention described here can be implemented in orders other than those illustrated or described. Furthermore, the terms "including" and "having" and any variants of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device comprising a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.

The technical terms used in the embodiments of the present invention are first explained as follows:

NGS sequencing: high-throughput sequencing, also called next-generation sequencing (NGS), is named in contrast to traditional Sanger sequencing. Its hallmarks are the ability to sequence hundreds of thousands to millions of DNA molecules in parallel in a single run and generally short read lengths. NGS is now widely used in many fields, including diagnosing genetic diseases, measuring gene expression levels, constructing phylogenetic trees, distinguishing morphologically similar species, and de novo sequencing of non-model organisms to construct their reference genome sequences. NGS sequencing is either single-end (SE) or paired-end (PE), and most current work uses PE sequencing. PE refers to the two ends of the same DNA molecule: one end is sequenced first, and then the other end is sequenced in the opposite direction. The two resulting sequences are called paired-end reads (PE reads) and are often also called "mate pairs".

Capture sequencing: a research strategy in which the genomic regions of interest are made into custom specific probes that are hybridized with genomic DNA on a sequence capture chip (or in solution), the DNA fragments of the target genomic regions are enriched, and the enriched fragments are then sequenced by NGS. Because the captured regions are short, sequencing cost is greatly reduced. The strategy is widely used in many fields, including tumor detection.

ctDNA: circulating tumor DNA, DNA fragments carrying gene mutations that tumor cells actively release into body fluids during division and proliferation.

PCR: polymerase chain reaction, a technique for amplifying specific DNA fragments.

Duplicate sequences (repetitive sequences): the multiple identical copies of a single molecule produced by PCR amplification.

Edit distance: the minimum number of edit operations needed to transform one string into another. The permitted operations are substituting one character for another, inserting a character, and deleting a character. The smaller the edit distance, the more similar the two strings.

reads: sequencing reads, the genome or transcriptome sequence fragments read by the sequencer.

fasta: a text-based format for representing nucleotide or amino acid sequences.

fastq: a common high-throughput sequencing file type in which raw sequencing data is usually stored.

bwa: alignment software used to locate sequencing reads in the human reference genome sequence; it can output result files in bam format.

sam: a sequence alignment format used to store the results of mapping sequencing reads back to the reference genome.

bam: the compressed binary form of the sam format, used to store the results of mapping sequencing reads back to the reference genome.

samtools: a tool for processing bam/sam files.

picard: a tool for processing high-throughput sequencing data that can be used on alignment result files such as sam/bam.

Alignment quality: a measure of the probability that a read has been aligned to the wrong position; the higher the value, the lower that probability.

CIGAR: a compact expression of alignment information that describes the alignment relative to the reference sequence using numbers combined with letters.
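
As an illustration, a CIGAR string can be split into (length, operation) pairs with a regular expression. The operation letters below are those defined by the SAM format; the helper names are hypothetical.

```python
import re

# One CIGAR element is a run length followed by one operation letter.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance the position on the reference sequence.
REF_CONSUMING = set("MDN=X")


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) tuples."""
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in REF_CONSUMING)


ops = parse_cigar("5S70M2D25M")
span = reference_span("5S70M2D25M")
print(ops, span)
```

In this example the read has 5 soft-clipped bases, 70 aligned bases, a 2-base deletion relative to the reference, and 25 more aligned bases, so it covers 97 reference bases.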

Embodiment 1

According to an embodiment of the present invention, an embodiment of a method for processing repetitive sequences of circulating tumor DNA is provided. It should be noted that the steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in a different order.

FIG. 1 is a flowchart of a method for processing repetitive sequences of circulating tumor DNA according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:

Step S100: acquire sequencing data of the circulating tumor DNA to be detected and a reference genome sequence, where the sequencing data is obtained by high-throughput sequencing of the circulating tumor DNA to be detected and includes multiple pairs of paired-end sequences.

Specifically, the circulating tumor DNA to be detected may be gene sequences extracted from a patient's body fluids such as blood, lymph, interstitial fluid, or cerebrospinal fluid; the embodiments of the present invention take ctDNA extracted from blood as an example. The sequencing data may be capture-sequencing fastq data of a ctDNA sample, obtained by NGS sequencing of the ctDNA to be detected. The reference genome sequence may be human reference genome fasta data downloaded from a public database such as NCBI.

Step S102: align the sequencing data to the reference genome sequence to obtain a first alignment result, where the first alignment result includes at least the genomic positions, base sequences, and corresponding base quality value sequences of the multiple pairs of paired-end sequences.

Specifically, the genomic position may be the position in the reference genome sequence to which each pair of PE reads aligns; different PE reads may align to the same position. The base sequence may be the base type at each base position of each pair of paired-end sequences. A DNA sequence contains four types of base, G, C, T, and A; during NGS sequencing, the base type at each position is determined together with a base quality value for that base call. The base quality value may be the sequencing quality obtained from NGS sequencing. It measures the accuracy of the base call at each position, and a larger value indicates a more accurate call.
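
For illustration, base quality values are commonly stored in fastq and sam files as ASCII characters. The sketch below assumes the Phred+33 encoding (an assumption, since the encoding is not stated here) and shows how the quality values of one pair of reads could be decoded and summed. The function names are hypothetical.

```python
def phred_scores(quality_string, offset=33):
    """Decode an ASCII quality string into integer base quality values."""
    return [ord(ch) - offset for ch in quality_string]


def base_quality_sum(qual_read1, qual_read2, offset=33):
    """Sum of the base quality values over both reads of a pair."""
    return (sum(phred_scores(qual_read1, offset))
            + sum(phred_scores(qual_read2, offset)))


scores = phred_scores("I5!")
total = base_quality_sum("II", "5!")
print(scores, total)
```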

In an optional solution, the human reference genome FASTA data and the ctDNA sample capture-sequencing FASTQ data may be obtained, and the genome alignment tool bwa mem may be used for sequence alignment to obtain an alignment result file (.bam), that is, the first alignment result. The alignment result file is in BAM format and contains, for each pair of PE reads, the name, position information, SAM flag, mapping quality information, CIGAR string, mate-pair information, fragment sequence, sequencing quality, and so on.

It should be noted that the base sequences and base quality values of the pairs of paired-end reads in the first alignment result are inherited directly from the NGS sequencing data. Besides the genomic alignment positions and alignment details, the first alignment result also stores the base sequences and base quality values of the paired-end reads, which facilitates subsequent analyses; the FASTQ file is no longer needed.

Step S104: obtain at least one sequence set based on the first alignment result, wherein each sequence set includes at least one pair of paired-end reads, and the paired-end reads in the same sequence set have the same genomic position.

Step S106: perform network clustering on the at least one pair of paired-end reads in each sequence set to obtain at least one first network.

Step S108: obtain the most abundant paired-end reads in each first network to obtain the independent sequence corresponding to each first network.

In an optional solution, the reasoning is as follows. In the absence of PCR and sequencing errors, a single DNA fragment amplified by PCR yields multiple identical fragments, all of which align to the same position in the reference genome and have the same sequence. PCR and sequencing errors are random, low-probability events, so the base sequences they produce make up a small share compared with the correct base sequence. DNA molecules from different fragments may align to the same genomic position, but because they may belong to different DNA molecules, their sequences may differ. (For example, DNA extracted from blood is of two types: ctDNA molecules carrying tumor information, and the body's own cell-free DNA, which is mostly released when body cells or white blood cells rupture, is generally considered harmless, and is soon cleared by the body; the two types carry different information. As another example, the fragments may come from different ctDNA molecules.) Based on these facts, the base sequences of all PE reads aligned to the same genomic position can be network-clustered: pairs with a smaller edit distance are more likely to fall into the same cluster, whereas sequences with a large edit distance more likely come from two different fragments and are therefore assigned to two different clusters. At least one first network is thus obtained. The most abundant pair of sequences in each first network can then be selected as the final representative unique reads, and the rest are regarded as duplicate sequences. This scheme retains more original molecules while also reducing the chance of finally selecting a sequence that carries a PCR or sequencing error, which improves the accuracy of the result.

It should be noted that, both before and after deduplication by the method provided by the present invention, every BAM file contains the following information: the name of each pair of PE reads, position information, SAM flag, mapping quality information, CIGAR string, mate-pair information, fragment sequence, sequencing quality, and so on.

According to the above embodiment of the present invention, the sequencing data of the circulating tumor DNA to be detected and the reference genome sequence are obtained; the sequencing data are aligned against the reference genome sequence to obtain a first alignment result; at least one sequence set is obtained based on the first alignment result; network clustering is performed on the at least one pair of paired-end reads in each sequence set to obtain at least one first network; and the most abundant paired-end reads in each first network are obtained to give the independent sequence corresponding to each first network, thereby deduplicating the duplicate sequences. It is easy to see that, because paired-end reads with the same genomic position are network-clustered after alignment and the most abundant paired-end reads in each network are selected as the independent sequence, the method considers the differences between the actual sequences as well as the sequence quality values. This achieves the technical effects of retaining more original molecules, reducing artificial errors, and improving processing accuracy, and thereby solves the technical problem that prior-art methods for processing sequencing data delete or mark duplicate sequences in sample sequencing with low accuracy.

Optionally, in the above embodiment of the present invention, step S106, performing network clustering on the at least one pair of paired-end reads in each sequence set to obtain at least one first network, includes:

Step S1061: compare the base sequences of the at least one pair of paired-end reads in each sequence set to obtain at least one node and the member count of each node, wherein the paired-end reads corresponding to the same node have the same base sequence, and the member count represents the number of paired-end reads corresponding to that node.

Specifically, PE reads may be used as nodes: PE reads with the same sequence are treated as one node, and the member count of that node is the number of PE reads sharing the sequence. The nodes in each sequence set can then be network-clustered.
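The node construction of step S1061 can be sketched as a simple counting step. This is a minimal illustration, not the implementation used in the embodiment; it assumes each read pair at one genomic position is represented by a single base-sequence string, and the name `collapse_to_nodes` is hypothetical.

```python
from collections import Counter


def collapse_to_nodes(sequences):
    """Collapse the read pairs of one sequence set into nodes.

    sequences: iterable of base-sequence strings, one per PE read pair.
    Returns a Counter mapping each distinct sequence (node) to its
    member count (how many read pairs carry exactly that sequence).
    """
    return Counter(sequences)


reads = ["TGTC", "TGTC", "AGTC", "TGTC", "AGTC", "TGAG"]
nodes = collapse_to_nodes(reads)
print(nodes["TGTC"], nodes["AGTC"], nodes["TGAG"])  # 3 2 1
```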

Step S1062: obtain the edit distance between any two nodes in each sequence set.

Specifically, the edit distance between any two nodes characterizes their similarity. The smaller the edit distance, the higher the similarity and the more likely the two nodes come from the same fragment; the larger the edit distance, the lower the similarity and the more likely the two nodes come from two different fragments.

It should be noted that any existing method for computing edit distance may be used to obtain the edit distance between any two nodes; the present invention does not specifically limit this.
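Since the text leaves the choice of edit-distance method open, the sketch below uses the standard Levenshtein distance (substitutions, insertions, and deletions each cost 1) as one possible choice. This is an assumption for illustration; an implementation restricted to equal-length reads could use the Hamming distance instead.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


print(edit_distance("AGTC", "CGTC"))  # 1
print(edit_distance("TAAC", "TGAG"))  # 2
```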

Step S1063: determine whether the edit distance between any two nodes is less than a preset distance.

Specifically, the preset distance may be a threshold set according to actual needs; for example, the preset distance may be 1.

Step S1064: if the edit distance between any two nodes is less than the preset distance, determine that the two nodes belong to the same network, to obtain at least one first network.

In an optional solution, the embodiment of the present invention provides a first network clustering method, the Cluster algorithm. All sequences that align to the same genomic position and whose inter-sequence edit distance is less than a threshold (1 by default) are connected into one network, with sequences as nodes; the member count of each node is the number of times that sequence occurs at this position. The whole network can be regarded as originating from the same original fragment. Within the network, the sequence with the largest member count is selected, and the read with the highest base-quality sum among the reads of that sequence is taken as the final representative unique read.

For example, Figure 2 shows six kinds of PE reads: AGTC, CGTC, TGTC, TGAC, TAAC, and TGAG. Assuming the six kinds of PE reads have the same genomic position and their edit distances are less than 1, the six are connected into one network. The member counts of the nodes are 2 for AGTC, 2 for CGTC, 365 for TGTC, 62 for TGAC, 80 for TAAC, and 1 for TGAG, so the final representative unique read is the TGTC read with the highest base-quality sum.
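The Cluster algorithm and the Figure 2 example can be sketched as a connected-components search. This is a minimal illustration under stated assumptions: reads are equal-length, so the Hamming distance stands in for the edit distance, and two nodes are linked when they differ by at most one base (the example links sequences that differ by exactly one base, so the comparison is treated as inclusive here). All function names are hypothetical.

```python
from collections import deque


def hamming(a, b):
    """Number of mismatching positions; assumes equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def cluster_representatives(counts, max_dist=1):
    """Return the most abundant sequence of each connected network."""
    seqs = list(counts)
    neighbours = {
        s: [t for t in seqs if t != s and hamming(s, t) <= max_dist]
        for s in seqs
    }
    seen, representatives = set(), []
    for start in seqs:
        if start in seen:
            continue
        # Breadth-first search collects one whole network.
        network, queue = [], deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            network.append(node)
            for other in neighbours[node]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        representatives.append(max(network, key=counts.get))
    return representatives


counts = {"AGTC": 2, "CGTC": 2, "TGTC": 365, "TGAC": 62, "TAAC": 80, "TGAG": 1}
print(cluster_representatives(counts))  # ['TGTC']
```

The sketch stops at the sequence level; choosing the individual read pair with the highest base-quality sum inside the winning node is a separate step.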

Optionally, in the above embodiment of the present invention, step S108, obtaining the most abundant paired-end reads in each first network to obtain the independent sequence corresponding to each first network, includes:

Step A: obtain the first node, that is, the node with the largest member count in each first network; for each pair of paired-end reads corresponding to the first node, compute the sum of the base quality values of all bases it contains to obtain that pair's base-quality sum; and take the pair with the largest base-quality sum as an independent sequence corresponding to the first network.

Step B: delete the first node and the nodes adjacent to the first node from each first network to obtain at least one second network.

Step C: determine whether any second network contains nodes.

Specifically, when a second network contains no nodes, it can be determined that all nodes in the first network have been traversed.

Step D: if any second network contains nodes, take that second network as a first network and execute steps A and B again in a loop.
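The selection inside the first node in step A can be sketched as follows. The record layout (a dictionary with a name and the two per-read quality lists) and the field names are assumptions made for illustration only.

```python
def quality_sum(pair):
    """Sum of the base quality values over both reads of one pair."""
    return sum(pair["qual_r1"]) + sum(pair["qual_r2"])


def pick_representative(pairs):
    """Return the read pair with the largest base-quality sum."""
    return max(pairs, key=quality_sum)


# Three hypothetical read pairs that all belong to the same node.
pairs = [
    {"name": "pair_1", "qual_r1": [30, 32, 35, 36], "qual_r2": [31, 30, 28, 33]},
    {"name": "pair_2", "qual_r1": [38, 38, 37, 36], "qual_r2": [36, 35, 37, 38]},
    {"name": "pair_3", "qual_r1": [20, 22, 25, 30], "qual_r2": [18, 21, 24, 26]},
]
best = pick_representative(pairs)
print(best["name"], quality_sum(best))  # pair_2 295
```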

In an optional solution, the embodiment of the present invention provides a second network clustering method, the Adjacency algorithm. All sequences that align to the same genomic position and whose inter-sequence edit distance is less than a threshold (1 by default) are connected into one network, with sequences as nodes; the member count of each node is the number of times that sequence occurs at this position. The sequences in the network may originate from several original fragments. The unique reads are selected as follows: first, pick the node with the largest member count in the network and select the paired-end read with the highest base-quality sum from that node as the first unique read; then delete from the network all nodes directly connected to that node; next, pick the node with the largest member count among the remaining nodes, select the paired-end read with the highest base-quality sum from it as the second representative read, and again delete all nodes directly connected to it; repeat until all nodes have been traversed, which yields all independent sequences.

For example, Figure 3 shows six kinds of PE reads: AGTC, CGTC, TGTC, TGAC, TAAC, and TGAG. Assuming the six kinds of PE reads have the same genomic position and their edit distances are less than 1, the six are connected into one network. The member counts of the nodes are 2 for AGTC, 2 for CGTC, 365 for TGTC, 62 for TGAC, 80 for TAAC, and 1 for TGAG. The sequence with the largest member count is obtained first, that is, the TGTC read with the highest base-quality sum is taken as an independent sequence. The sequences directly connected to TGTC, namely AGTC, CGTC, and TGAC, are then deleted, leaving TAAC and TGAG. Of these, TAAC has the largest member count, so the TAAC read with the highest base-quality sum is taken as an independent sequence. Because TAAC and TGAG are not directly connected, TGAG remains, and the TGAG read with the highest base-quality sum is also taken as an independent sequence. Three independent sequences are therefore obtained: the reads with the highest base-quality sum for TGTC, TAAC, and TGAG respectively.
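The Adjacency selection loop and the Figure 3 example can be sketched as a greedy procedure over the node counts. As an assumption for illustration, reads are equal-length and two nodes count as directly connected when they differ by at most one base; the function names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions; assumes equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def adjacency_representatives(counts, max_dist=1):
    """Greedily pick the most abundant node, then drop it and its neighbours."""
    remaining = set(counts)
    representatives = []
    while remaining:
        # Ties are broken by sequence so the result is deterministic.
        top = max(remaining, key=lambda s: (counts[s], s))
        representatives.append(top)
        # The distance from top to itself is 0, so top is removed as well.
        remaining -= {s for s in remaining if hamming(top, s) <= max_dist}
    return representatives


counts = {"AGTC": 2, "CGTC": 2, "TGTC": 365, "TGAC": 62, "TAAC": 80, "TGAG": 1}
print(adjacency_representatives(counts))  # ['TGTC', 'TAAC', 'TGAG']
```

TGAC is removed together with TGTC in the first round, which is why TAAC and TGAG, whose only link to the rest of the network ran through TGAC, each end up as their own representative.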

Optionally, in the above embodiment of the present invention, before step S1064, determining that any two nodes belong to the same network to obtain at least one first network, the method further includes:

Step S1065: determine whether the member counts of any two nodes satisfy a preset condition.

Specifically, the preset condition characterizes the condition under which any two nodes can be connected into one network.

Step S1066: if the member counts of any two nodes satisfy the preset condition, determine that the two nodes belong to the same network, to obtain at least one first network.

In an optional solution, the embodiment of the present invention provides a third network clustering method, the Direct algorithm. All sequences that align to the same genomic position, whose inter-sequence edit distance is less than a threshold (1 by default), and whose adjacent nodes a and b satisfy a certain condition are connected into a network, with sequences as nodes; the member count of each node is the number of times that sequence occurs at this position. One or more networks can thus be formed at this position, and the sequences in each network originate from the same original fragment. In each network, the sequence with the largest member count is selected, and the read with the highest base-quality sum among the reads of that sequence is taken as the final representative.

For example, Figure 4 shows six kinds of PE reads: AGTC, CGTC, TGTC, TGAC, TAAC, and TGAG, where AGTC and CGTC are adjacent, CGTC and TGTC are adjacent, TGTC and AGTC are adjacent, TGTC and TGAC are adjacent, TGAC and TAAC are adjacent, and TGAC and TGAG are adjacent. Assuming the six kinds of PE reads have the same genomic position and their edit distances are less than 1, and adjacent nodes among AGTC, CGTC, TGTC, TGAC, and TGAG satisfy the condition, then AGTC, CGTC, TGTC, TGAC, and TGAG can be connected into one network, while TAAC forms a separate network. The member counts of the nodes are 2 for AGTC, 2 for CGTC, 365 for TGTC, 62 for TGAC, 80 for TAAC, and 1 for TGAG. The sequence with the largest member count in each network is obtained, that is, the TGTC read with the highest base-quality sum and the TAAC read with the highest base-quality sum are taken as the independent sequences.

Optionally, in the above embodiment of the present invention, step S1065, determining whether the member counts of any two nodes satisfy the preset condition, includes:

Step S10652: obtain a preset multiple of the member count of the first of the two nodes to obtain a first value.

Step S10654: obtain the difference between the first value and a preset value.

Specifically, the preset value may be set according to actual needs; for example, it may be 1.

Step S10656: determine whether the member count of the second of the two nodes is greater than the difference.

Step S10658: if the member count of the second node is greater than the difference, determine that the member counts of the two nodes satisfy the preset condition.

In an optional solution, all nodes that align to the same genomic position, whose inter-sequence edit distance is less than a certain threshold (1 by default), and for which the member counts of adjacent nodes a and b satisfy "member count of node a ≥ 2 × member count of node b − 1" may be connected into one network.
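The Direct algorithm with the count condition above can be sketched as follows, using the six sequences and member counts of the Figure 4 example. Several points are assumptions for illustration: reads are equal-length (Hamming distance), nodes differing by at most one base are treated as adjacent, an edge is kept when the condition holds with the more abundant node as node a, and kept edges are treated as undirected when networks are formed. The function names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions; assumes equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def linked(a, b, counts, max_dist=1, multiple=2, preset=1):
    """True if a and b are adjacent and count(hi) >= multiple * count(lo) - preset."""
    if a == b or hamming(a, b) > max_dist:
        return False
    hi, lo = max(counts[a], counts[b]), min(counts[a], counts[b])
    return hi >= multiple * lo - preset


def direct_representatives(counts):
    """Return the most abundant sequence of each network."""
    seen, representatives = set(), []
    for start in counts:
        if start in seen:
            continue
        # Depth-first search over the edges that pass the count condition.
        network, stack = [], [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            network.append(node)
            for other in counts:
                if other not in seen and linked(node, other, counts):
                    seen.add(other)
                    stack.append(other)
        representatives.append(max(network, key=counts.get))
    return representatives


counts = {"AGTC": 2, "CGTC": 2, "TGTC": 365, "TGAC": 62, "TAAC": 80, "TGAG": 1}
print(sorted(direct_representatives(counts)))  # ['TAAC', 'TGTC']
```

With these counts, the TGAC and TAAC edge fails the condition (80 < 2 × 62 − 1 = 123), so TAAC forms its own network, while AGTC and CGTC still join the TGTC network through their edges to TGTC.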

Optionally, in the above embodiment of the present invention, after step S108, obtaining the most abundant paired-end reads in each first network to obtain the independent sequence corresponding to each first network, the method further includes:

Step S110: delete from the first alignment result the alignment results of all paired-end reads other than the independent sequence corresponding to each first network, to obtain a second alignment result.

In an optional solution, all duplicate sequences other than the independent sequences may be deleted from the PE reads so that only the independent sequences are retained, which avoids artificial errors caused by duplicate sequences.

Optionally, in the above embodiment of the present invention, after step S110, deleting from the first alignment result the alignment results of all paired-end reads other than the independent sequence corresponding to each first network to obtain the second alignment result, the method further includes:

Step S112: sort the second alignment result by the genomic position of each independent sequence to obtain a third alignment result.

In an optional solution, because network clustering by edit distance changes the order of the sequences in the alignment result file (.bam), Picard's SortSam module may be called to sort the alignment result file (that is, the second alignment result) by alignment position, so that PE reads at the same position are adjacent and subsequent deduplication of PE reads is easier. This yields the third alignment result.

Optionally, in the above embodiment of the present invention, after step S112, sorting the second alignment result by the genomic position of each independent sequence to obtain the third alignment result, the method further includes:

Step S114: obtain the capture sequencing intervals.

Step S116: according to the capture sequencing intervals, perform single-nucleotide variant detection and insertion/deletion detection on each independent sequence to obtain a detection result.

In an optional solution, a BED file of the capture sequencing intervals may be obtained, and the VarScan2 mpileup2snp module may be called to detect single-nucleotide variants (SNVs) and the mpileup2indel module to detect insertions and deletions (INDELs). A single-nucleotide variant is a change of base type at a position of the reference genome; an insertion or deletion means that a short new sequence has been inserted into, or a stretch of sequence has been deleted from, a segment of the reference genome.

Optionally, in the above embodiment of the present invention, before step S104, obtaining at least one sequence set based on the first alignment result, the method further includes:

Step S118: sort the first alignment result by the genomic positions of the pairs of paired-end reads to obtain a fourth alignment result, and build an index for the fourth alignment result.

In an optional solution, Picard's SortSam module may be called to sort the alignment result file (.bam) (that is, the first alignment result) by alignment position and, at the same time, to create the index file (.bai) of the BAM file. Sorting the alignment result file by alignment position places PE reads at the same position next to each other, which facilitates subsequent deduplication of PE reads.

Step S120: filter the fourth alignment result to obtain a fifth alignment result.

In an optional solution, because the same PE read may align to multiple genomic positions, the alignment result file (.bam) must be filtered before deduplication. Specifically, the samtools view module may be called to filter the sorted BAM file to obtain the fifth alignment result.

Step S122: obtain at least one sequence set based on the fifth alignment result.

In an optional solution, the reads in the fifth alignment result that align to the same genomic position may be grouped into one sequence set, which facilitates subsequent network clustering of the base sequences of all PE reads in each sequence set.
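The grouping into sequence sets can be sketched as a dictionary keyed by genomic position. The text does not define exactly which coordinates make two read pairs share a position; the sketch assumes, for illustration only, that the key is the chromosome plus the start and end of the aligned fragment, and that alignments have already been reduced to simple tuples.

```python
from collections import defaultdict


def group_by_position(alignments):
    """Group read-pair sequences into sequence sets by genomic position.

    alignments: iterable of (chrom, start, end, sequence) tuples, one per
    PE read pair (a hypothetical, simplified record layout).
    """
    sequence_sets = defaultdict(list)
    for chrom, start, end, sequence in alignments:
        sequence_sets[(chrom, start, end)].append(sequence)
    return dict(sequence_sets)


alignments = [
    ("chr7", 100, 266, "TGTC"),
    ("chr7", 100, 266, "TGAC"),
    ("chr7", 150, 320, "AGTC"),
    ("chr7", 100, 266, "TGTC"),
]
sets = group_by_position(alignments)
print(sets[("chr7", 100, 266)])  # ['TGTC', 'TGAC', 'TGTC']
```

Each resulting list is one sequence set and can be passed on to the node-building and clustering steps.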

Optionally, in the above embodiment of the present invention, step S102, aligning the sequencing data against the reference genome sequence to obtain the first alignment result, includes:

Step S1022: obtain the matching degree between each read of the pairs of paired-end reads and each segment of the reference genome sequence.

Step S1024: obtain the at least one segment with the highest matching degree to obtain the matching sequence of each read.

Specifically, the preset similarity may be set according to actual detection requirements; the present invention does not specifically limit this.

Step S1026: determine the genomic position of each read according to its matching sequence.

In an optional solution, the matching degree between each read of each pair of PE reads and the human reference genome sequence may be computed, and the matching degree is used to judge whether the read originates from a given segment of the human reference genome sequence. The higher the matching degree, the more likely the read originates from that segment. Each read can be aligned to the segment with the highest matching degree, and the genomic position of the read is then obtained from the position of that segment.

It should be noted that, in the embodiments of the present invention, any alignment algorithm provided in the prior art may be used for the alignment; the present invention does not specifically limit this.

Figure 5 is a flowchart of an optional method for processing circulating tumor DNA duplicate sequences according to an embodiment of the present invention. A preferred embodiment of the present invention is described in detail below with reference to Figure 5. As shown in Figure 5, the method may include the following steps: input the capture-sequencing FASTQ file of the cfDNA sample and the human reference genome FASTA file, and perform genome alignment with the bwa mem software; call the Picard software to sort the reads; call the samtools software to filter the reads; according to the characteristics of the data or the accuracy requirements, select a suitable algorithm from the Cluster, Adjacency, and Direct algorithms provided in the above embodiments for deduplication; call the Picard software to sort the reads, obtaining the BAM file of the cfDNA sample with duplicates removed; input the BED file of the capture sequencing intervals and call samtools mpileup to display, position by position, the alignment status and quality values of all reads in the duplicate-marked BAM file; call the VarScan2 mpileup2snp module to identify SNVs and the mpileup2indel module to identify INDELs.

It should be noted that the cfDNA sample may also be another body-fluid sample containing ctDNA.

The input files of the present invention include: the sequencing data file generated after the sample to be tested has undergone alignment, sorting, filtering, and other steps (BAM format, containing for each sequenced fragment the name, SAM flag, position information, mapping quality information, CIGAR string, mate-pair information, fragment sequence, sequencing quality, and so on), and the human reference genome sequence (FASTA format).

The output files of the present invention include: the alignment result file of the sample to be tested after duplicate marking (BAM format), and the VCF files of the detected SNVs and INDELs.

通过上述方案，通过网络聚类方法不仅可以保留更多的原始分子，而且可以杜绝大部分随机错误的影响，从而在一定程序上减少随机错误在最终变异检测时候的影响，使得变异检测的更准确、可靠。对于DNA分子碎片化严重、覆盖基因组范围小、经过多轮PCR的样本或测序方案，尤其是血浆ctDNA样本的捕获测序数据可以保留更多的原始分子，有效利用碱基序列，提高了原始数据的利用率，和最终变异检测的准确性。Through the above scheme, the network clustering method can not only retain more original molecules, but also eliminate the influence of most random errors, thereby reducing the influence of random errors in the final mutation detection in a certain procedure, making the mutation detection more accurate. ,reliable. For samples or sequencing solutions with severe fragmentation of DNA molecules, small coverage of the genome, and multiple rounds of PCR, especially the captured sequencing data of plasma ctDNA samples, more original molecules can be retained, the base sequence can be effectively used, and the accuracy of the original data can be improved. utilization, and ultimately the accuracy of variant detection.

The above embodiment is verified below by a gradient-dilution cell line test for single nucleotide variants (SNVs).

1. Cell line culture

The cell lines HCT116, KYSE450, NCI-H1573, NCI-H1975, NCI-H441, PC-9, SK-HEP-1, SW48, THP-1 and BEAS-2B were purchased from Nanjing Kebai Biotechnology Co., Ltd. and cultured according to the supplier's instructions, i.e. in RPMI-1640 medium supplemented with 10% fetal bovine serum at 37°C.

2. Cellular DNA extraction

After the cell suspension was collected, it was centrifuged at 300 g for 5 minutes at room temperature and the supernatant was discarded. The cells were resuspended in 200 μL PBS, and genomic DNA was extracted with the QIAamp DNA Mini Kit (Cat. No. 51304; Qiagen, Germany): after lysis, the DNA was purified on a column and finally eluted with low-TE buffer.

3. Determining the theoretical VAF of the mutation sites in the above cell lines by ddPCR

ddPCR experiments were performed with the genomic DNA extracted from the cells as template; the theoretical VAFs of the mutation sites in the above cell lines are shown in Table 1. The ddPCR used Bio-Rad instruments, commercial probes and reaction system. Each reaction contained 10 μL ddPCR Supermix for Probes (no dUTP), 1 μL mutant probe, 1 μL wild-type probe and 20 ng of the DNA to be tested. After the reaction mix was prepared, droplets were generated according to the instrument's instructions and transferred to a 96-well PCR plate, which was heat-sealed with Pierceable Foil Heat Seal. The PCR conditions were: enzyme activation at 95°C for 8 min; 39 cycles of denaturation at 94°C for 30 s and annealing/extension at 55°C for 1 min; enzyme inactivation at 98°C for 10 min; hold at 4°C. After amplification, the Bio-Rad droplet reader counted the fluorescent droplets in each reaction well. Ultrapure water was used as the negative control in each batch, and each DNA sample was run in three replicate wells as technical replicates.

Table 1

4. Preparation of a sample containing 11 mutation sites

The 10 cell lines listed above were mixed according to the mass percentages in Table 2 below to prepare one sample, and the expected VAF of each site was calculated.

Table 2
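The expected VAF of a site in the mixed sample can be sketched as a mass-weighted sum of the per-cell-line VAFs. This is only a minimal illustration: it assumes that every cell line contributes genome copies in proportion to its mass fraction, and the cell line names, fractions and VAFs below are hypothetical placeholders, not the values of Tables 1 and 2.

```python
def expected_vaf(mass_fractions, line_vafs):
    """Expected VAF (%) of one site in a mixture of cell lines.

    mass_fractions: cell line -> mass fraction of the mixture (must sum to 1)
    line_vafs:      cell line -> ddPCR VAF (%) of the site in that cell line;
                    cell lines missing from this dict are treated as wild type
    """
    total = sum(mass_fractions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("mass fractions must sum to 1")
    # Assumption: genome copies scale linearly with DNA mass for every line.
    return sum(frac * line_vafs.get(line, 0.0)
               for line, frac in mass_fractions.items())


# Hypothetical values for illustration only.
fractions = {"line_A": 0.02, "line_B": 0.03, "background": 0.95}
site_vafs = {"line_A": 50.0, "line_B": 100.0}  # percent

vaf = expected_vaf(fractions, site_vafs)
print(f"expected VAF: {vaf:.2f}%")
```

Comparing such expected values with the ddPCR measurements of the mixture is what makes the mixed sample usable as a reference for the sequencing-based calls.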

5. ddPCR results of the sample

The VAF of each of the sites listed above was measured in the sample by ddPCR, as shown in Table 3. Each reaction contained 20 ng of sample DNA, and each sample was run in three replicate wells as technical replicates.

Table 3

Gene      Mutation        ddPCR VAF
KRAS      G13D            0.53
PIK3CA    H1047R          1.06
EGFR      G719S           0.88
NRAS      Q61K            1.80
EGFR      L858R           1.26
EGFR      T790M           1.52
KRAS      G12V            1.43
EGFR      E746_A750del    4.76
BRAF      V600E           0.92
EGFR      S768I           2.42
NRAS      G12D            4.48

6. Library construction, capture and sequencing of the sample

The DNA of the mixed cell line sample was first sheared with a Covaris sonicator into fragments of about 200 bp and quantified by Qubit fluorometry. Libraries were then built from different input amounts of DNA, as shown in Table 4, with volumes below 50 μL made up to 50 μL with nuclease-free water. The KAPA Hyper Prep Kit (Roche) was used for library construction: end repair, addition of A to the 3' ends, ligation of sequencing adapters, unbiased amplification, and purification to obtain the library.

Table 4

Sample      Input DNA (ng)    PCR cycles
Sample 1    20                6
Sample 2    5                 8
Sample 3    5                 8

The details are as follows:

1) End blunting and addition of A to the 3' ends. The reaction mix is shown in Table 5 below:

Table 5

Reagent                              Volume
Fragmented, double-stranded DNA      50 μL
End Repair & A-Tailing Buffer        7 μL
End Repair & A-Tailing Enzyme Mix    3 μL
Total volume                         60 μL

The buffer and enzyme mix should be premixed in an EP tube, vortexed together with the DNA, and then incubated as follows. The reaction steps are shown in Table 6 below:

Table 6

For this step the heated lid of the thermal cycler is set to 85°C rather than 105°C. If the next step follows immediately, the final hold temperature should be set to 20°C rather than 4°C.

2) Adapter ligation: according to the library construction manual, 20 ng of DNA should be used with 7.5 μM adapters. Prepare the reaction mix as shown in Table 7 below:

Table 7

Reagent                                  Volume
End-repair/A-tailing reaction product    60 μL
Adapter                                  5 μL
Ultrapure water                          5 μL
Ligation buffer                          30 μL
DNA ligase                               10 μL
Total volume                             110 μL

The buffer and enzyme should be premixed in an EP tube; the reaction is vortexed, centrifuged briefly, and incubated at 20°C for 15 minutes.

3) Post-ligation purification: add 88 μL of Agencourt AMPure XP purification beads to the reaction from the previous step (110 μL).

Vortex thoroughly and centrifuge briefly. Incubate at room temperature for 5-15 minutes so that the DNA binds fully to the beads. Place the EP tube on a magnetic stand until the liquid is clear, then slowly aspirate and discard the supernatant. Add 200 μL of 80% ethanol to the EP tube, incubate for 30 seconds, then slowly aspirate and discard the ethanol. Repeat the ethanol wash once. Dry the EP tube at room temperature for 3-5 minutes until the ethanol has completely evaporated. Remove the EP tube from the magnetic stand, add 22 μL of ultrapure water, vortex, and centrifuge briefly. Incubate at room temperature for 2 minutes to elute the DNA. Place the EP tube on the magnetic stand until the liquid is clear and transfer the supernatant to a new EP tube. Use 1 μL of the supernatant to measure the DNA concentration and the remainder for amplification.

4) PCR amplification: prepare the PCR mix as shown in Table 8 below.

Table 8

Reagent                                         Volume
KAPA HiFi HotStart ReadyMix (2X)                25 μL
KAPA Library Amplification Primer Mix (10X)*    5 μL
Adapter-ligated library                         20 μL
Total volume                                    50 μL

Mix thoroughly, centrifuge briefly, and run the PCR under the conditions shown in Table 9 below.

Table 9

5) Post-amplification purification: add Agencourt AMPure XP purification beads in a volume equal to that of the PCR reaction (50 μL).

Vortex thoroughly, centrifuge briefly, and incubate at room temperature for 5-15 minutes so that the DNA binds fully to the beads. Place the EP tube on a magnetic stand until the liquid is clear, then slowly aspirate and discard the supernatant. Add 200 μL of 80% ethanol to the EP tube, incubate for 30 seconds, then slowly aspirate and discard the ethanol. Repeat the ethanol wash once. Dry the EP tube at room temperature for 3-5 minutes until the ethanol has completely evaporated. Remove the EP tube from the magnetic stand, add 52 μL of ultrapure water, vortex, and centrifuge briefly. Incubate at room temperature for 2 minutes to elute the DNA. Place the EP tube on the magnetic stand until the liquid is clear, transfer the supernatant to a new EP tube, and use 1 μL of the supernatant to measure the DNA concentration.

6) Before sequencing, the target regions containing the 11 mutation sites were enriched by probe capture with Roche NimbleGen probes and further amplified to obtain the target-region library. The library was quantified by qPCR and then sequenced.

7. Processing the raw fastq data into input files usable by each tool

After sequencing, the raw data were first processed from fastq files into bam files. The software and steps used were as follows:

7.1 Alignment

bwa-0.7.12 mem was called to align each pair of fastq files, as paired-end (PE) reads, to the hg19 human reference genome sequence. Apart from the -M option and the specification of the read group ID, no other options were used. This produced the initial bam file.

7.2 Sorting

The SortSam module of picard-2.1.0 was called to sort the initial bam file by chromosomal position, with the parameter "SORT_ORDER=coordinate".

7.3 Filtering

samtools-1.3 view was called to filter the sorted bam file, with "-F 0x900" as the parameter.
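In the SAM FLAG field, 0x100 marks a secondary alignment and 0x800 a supplementary alignment, so excluding the mask 0x900 keeps only primary alignment records. The following sketch reproduces that bit test on a few example FLAG values; the function name and the example flags are illustrative assumptions, not part of the described pipeline.

```python
SECONDARY = 0x100      # SAM FLAG bit: secondary alignment
SUPPLEMENTARY = 0x800  # SAM FLAG bit: supplementary alignment
EXCLUDE_MASK = 0x900   # the mask used in the filtering step


def passes_filter(flag, exclude_mask=EXCLUDE_MASK):
    """Return True if a record with this FLAG survives an exclude-mask filter.

    A record is dropped as soon as any bit of the mask is set in its FLAG.
    """
    return (flag & exclude_mask) == 0


# Illustrative FLAG values:
#   99   = paired + proper pair + mate reverse + first in pair
#   147  = paired + proper pair + read reverse + second in pair
#   355  = 99 + 0x100 (secondary alignment)
#   2147 = 99 + 0x800 (supplementary alignment)
example_flags = [99, 147, 355, 2147]
kept = [flag for flag in example_flags if passes_filter(flag)]
print(kept)
```

Keeping one primary record per read matters for the later steps, because each pair of reads is then represented exactly once when duplicates are grouped by position.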

7.4 Indexing

The index module of samtools-1.3 was called to index the final bam file, producing the bai file that accompanies the filtered bam file.

8. Marking duplicates

8.1 Duplicates were marked with Picard's MarkDuplicates module; in the subsequent variant detection these duplicate reads are automatically filtered out before analysis.

8.2 Following the method provided in the above embodiments of the present invention, the "Cluster", "Adjacency" and "Direct" methods were each called to remove duplicate sequences from the filtered bam file, producing deduplicated bam files.

8.3 Alignment statistics:

The flagstat module of samtools-1.3 was called to compute statistics on the final bam file, producing an alignment summary for the deduplicated bam file. The summary includes the total number of reads, the number of duplicate reads, the number of reads mapped to the reference genome, the number of paired reads, the number of read1, the number of read2, the number of properly paired reads, the number of reads whose mate is also mapped to the reference, the number of reads with only one read of the pair mapped, the number of reads whose mate maps to a different chromosome, and the number of such reads with a mapping quality of at least 5 (reported as mapQ>=5 in the flagstat output).

8.4 Comparison of results:

The data volumes retained by the algorithms provided in the above embodiments of the present invention and by the Picard method are shown in Table 10 below. As Table 10 shows, every algorithm provided by the present invention retains more data than the Picard method, improving the effective utilization of the data, with Adjacency > Direct > Cluster > Picard.

Table 10

Sample      Picard      Cluster     Adjacency   Direct
Sample 1    24872747    51683842    51930648    51698546
Sample 2    13687626    46207524    46730986    46247784
Sample 3    14290322    44097044    44541914    44129370
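The comparison in Table 10 can be checked with a few lines of arithmetic. The sketch below copies the counts from Table 10, expresses each method as a fold change over Picard, and tests the stated ordering for every sample; the function names are illustrative assumptions.

```python
# Retained data volumes copied from Table 10.
retained = {
    "Sample 1": {"Picard": 24872747, "Cluster": 51683842,
                 "Adjacency": 51930648, "Direct": 51698546},
    "Sample 2": {"Picard": 13687626, "Cluster": 46207524,
                 "Adjacency": 46730986, "Direct": 46247784},
    "Sample 3": {"Picard": 14290322, "Cluster": 44097044,
                 "Adjacency": 44541914, "Direct": 44129370},
}


def fold_over_picard(counts):
    """Retained volume of each method divided by the Picard volume."""
    return {method: value / counts["Picard"]
            for method, value in counts.items() if method != "Picard"}


def ordering_holds(counts):
    """Check Adjacency > Direct > Cluster > Picard for one sample."""
    return (counts["Adjacency"] > counts["Direct"]
            > counts["Cluster"] > counts["Picard"])


for sample, counts in retained.items():
    folds = {m: round(f, 2) for m, f in fold_over_picard(counts).items()}
    print(sample, folds, "ordering holds:", ordering_holds(counts))
```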

9. Variant detection

9.1 Pileup

samtools-1.3 mpileup was called on the duplicate-marked bam file to display, position by position, the alignment status and quality values of all reads, with the parameter "q=1". The mpileup result file (mpileup file) contains the chromosome, the genomic position, the reference base, the sequencing depth at the site, and the alignment status and quality values of all reads covering the site.

Because the number of ddPCR-validated positive sites is limited, mpileup was run only on the following intervals, using the parameter "-l positive.bed". The positive.bed file is shown in Table 11.

Table 11

Chromosome    Start        End          Gene
chr1          115256527    115256530    NRAS
chr1          115258745    115258748    NRAS
chr3          178952083    178952086    PIK3CA
chr12         25398279     25398282     KRAS
chr12         25398282     25398285     KRAS
chr7          140453134    140453137    BRAF
chr7          55241706     55241709     EGFR
chr7          55242414     55242513     EGFR
chr7          55249003     55249006     EGFR
chr7          55249069     55249072     EGFR
chr7          55259513     55259516     EGFR

9.2 Mean sequencing depth over the positive.bed intervals

A simple script or bash command was used to compute, from the mpileup files, the mean sequencing depth over the positive.bed intervals for each deduplication method. The results are shown in Table 12.

Table 12

Sample      Picard      Cluster     Adjacency   Direct
Sample 1    1625.370    4111.270    4130.270    4112.240
Sample 2    533.496     3251.800    3302.600    3258.130
Sample 3    627.380     3084.870    3121.860    3087.800

All of the methods provided by the present invention yield a higher mean depth over the positive.bed intervals than the Picard method, with Adjacency > Direct > Cluster > Picard.
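The "simple script" for the mean depth is not given in the text; one possible form is sketched below. It assumes the standard tab-separated pileup layout (chromosome, 1-based position, reference base, depth, bases, qualities) and 0-based half-open BED intervals, and it averages over the positions actually reported in the pileup. The function names and the toy depth values are hypothetical.

```python
import io


def load_bed(handle):
    """Read BED intervals as (chrom, start, end); BED is 0-based, half-open."""
    intervals = []
    for line in handle:
        if not line.strip():
            continue
        chrom, start, end = line.split()[:3]
        intervals.append((chrom, int(start), int(end)))
    return intervals


def mean_depth(pileup_handle, intervals):
    """Mean of the depth column over pileup positions inside the intervals.

    Pileup positions are 1-based, so position p lies inside a BED interval
    when start < p <= end. Positions absent from the pileup are not counted.
    """
    total = 0
    count = 0
    for line in pileup_handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue
        chrom, pos, depth = fields[0], int(fields[1]), int(fields[3])
        if any(c == chrom and s < pos <= e for c, s, e in intervals):
            total += depth
            count += 1
    return total / count if count else 0.0


# Toy input: one interval from positive.bed and made-up pileup depths.
BED_TEXT = "chr1\t115256527\t115256530\tNRAS\n"
PILEUP_TEXT = (
    "chr1\t115256528\tT\t4\t....\tIIII\n"
    "chr1\t115256529\tT\t6\t......\tIIIIII\n"
    "chr1\t115256531\tG\t9\t.........\tIIIIIIIII\n"  # outside the interval
)

result = mean_depth(io.StringIO(PILEUP_TEXT), load_bed(io.StringIO(BED_TEXT)))
print(result)
```

Running the same function on the pileup produced after each deduplication method gives one mean depth per method and sample, which is the layout of Table 12.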

9.3 Variant calling

The mpileup2snp module of VarScan2 was called to detect single nucleotide variants (SNVs) and the mpileup2indel module to detect insertions and deletions (INDELs), with the parameters "--min-coverage 100 --min-reads2 2 --min-var-freq 0.001 --p-value 0.05 --min-avg-qual 20".
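To make the effect of the three count-based cut-offs concrete, the sketch below applies them to a depth and a number of variant-supporting reads. It is not VarScan2's implementation: the p-value and average base quality cut-offs are left out, and treating each threshold as inclusive is an assumption of this example.

```python
def passes_thresholds(depth, var_reads,
                      min_coverage=100, min_reads2=2, min_var_freq=0.001):
    """Apply the coverage, supporting-read and frequency cut-offs quoted above.

    depth:     number of reads covering the site
    var_reads: number of reads supporting the variant allele
    The p-value and base-quality filters are deliberately not modelled.
    """
    if depth < min_coverage:
        return False
    if var_reads < min_reads2:
        return False
    return var_reads / depth >= min_var_freq


# Hypothetical sites: (depth, variant-supporting reads)
examples = [(4000, 40), (4000, 1), (80, 8), (4000, 3)]
for depth, var_reads in examples:
    print(depth, var_reads, passes_thresholds(depth, var_reads))
```

With a 0.1% frequency floor, a site needs roughly four supporting reads at a depth of 4000, which is why the extra depth retained by the deduplication step matters for low-frequency variants.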

For the sites validated as positive by ddPCR in the three samples above, the variant results obtained after each deduplication method are shown in Tables 13 to 15 below (the values in the tables are mutation frequencies): Table 13 shows the results for sample 1, Table 14 for sample 2, and Table 15 for sample 3.

Table 13

Gene      AA change       Picard    Cluster    Adjacency    Direct
NRAS      p.Q61K          0         1          1.05         1
PIK3CA    p.H1047R        0.96      1.24       1.23         1.24
BRAF      p.V600E         0.83      0.81       0.8          0.81
NRAS      p.G12D          3.87      4.22       4.24         4.21
EGFR      p.G719S         0.88      0.77       0.77         0.77
EGFR      p.L858R         1.64      1.98       1.97         1.98
EGFR      p.S768I         2.15      2.36       2.34         2.36
KRAS      p.G13D          0.6       0.52       0.51         0.52
EGFR      p.745_750del    3.05      2.79       2.78         2.79
KRAS      p.G12V          1.02      1.16       1.15         1.15
EGFR      p.T790M         1.39      1.1        1.09         1.1

Table 14

Gene      AA change       Picard    Cluster    Adjacency    Direct
NRAS      p.Q61K          4.22      4.3        4.28         4.3
PIK3CA    p.H1047R        0         1.68       1.67         1.68
BRAF      p.V600E         0         1.03       1.01         1.02
NRAS      p.G12D          0         1.55       1.52         1.55
EGFR      p.G719S         0.93      1.14       1.15         1.14
EGFR      p.L858R         2.3       2.23       2.23         2.26
EGFR      p.S768I         1.04      0.93       0.91         0.93
KRAS      p.G13D          1.07      0.88       0.87         0.88
EGFR      p.745_750del    2.92      2.05       2.05         2.05
KRAS      p.G12V          1.34      1.85       1.88         1.85
EGFR      p.T790M         0.96      0.95       0.93         0.95

Table 15

Gene      AA change       Picard    Cluster    Adjacency    Direct
NRAS      p.Q61K          0         1.1        1.09         1.1
PIK3CA    p.H1047R        0         0.46       0.46         0.46
BRAF      p.V600E         0.99      0.98       0.97         0.98
NRAS      p.G12D          5.45      5.86       5.8          5.85
EGFR      p.G719S         0         1.26       1.27         1.25
EGFR      p.L858R         0.76      0.94       0.97         0.94
EGFR      p.S768I         1.66      1.55       1.56         1.55
KRAS      p.G13D          0         0.3        0.31         0.3
EGFR      p.745_750del    2.56      2.52       2.49         2.52
KRAS      p.G12V          1.54      2.06       2.04         2.06
EGFR      p.T790M         0         0.95       0.93         0.95

Picard reported a mutation frequency of 0 at several of the positive sites (frequency > 0 is positive, frequency = 0 is negative), whereas the three methods Adjacency, Direct and Cluster detected all 11 sites as positive. In summary, the present invention detects more positive sites than Picard deduplication, and, while still detecting the positive sites, the Direct and Cluster methods remove more duplicate sequences than Adjacency (they retain less data than Adjacency in Table 10).
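The positive/negative rule stated above (frequency > 0 is positive) can be applied directly to the values of Tables 13 to 15. The sketch below copies those values, row by row in table order, and counts the detected sites per sample and method; the variable and function names are illustrative assumptions.

```python
# Mutation frequencies copied from Tables 13-15, in the row order of the tables.
frequencies = {
    "Sample 1": {
        "Picard":    [0, 0.96, 0.83, 3.87, 0.88, 1.64, 2.15, 0.6, 3.05, 1.02, 1.39],
        "Cluster":   [1, 1.24, 0.81, 4.22, 0.77, 1.98, 2.36, 0.52, 2.79, 1.16, 1.1],
        "Adjacency": [1.05, 1.23, 0.8, 4.24, 0.77, 1.97, 2.34, 0.51, 2.78, 1.15, 1.09],
        "Direct":    [1, 1.24, 0.81, 4.21, 0.77, 1.98, 2.36, 0.52, 2.79, 1.15, 1.1],
    },
    "Sample 2": {
        "Picard":    [4.22, 0, 0, 0, 0.93, 2.3, 1.04, 1.07, 2.92, 1.34, 0.96],
        "Cluster":   [4.3, 1.68, 1.03, 1.55, 1.14, 2.23, 0.93, 0.88, 2.05, 1.85, 0.95],
        "Adjacency": [4.28, 1.67, 1.01, 1.52, 1.15, 2.23, 0.91, 0.87, 2.05, 1.88, 0.93],
        "Direct":    [4.3, 1.68, 1.02, 1.55, 1.14, 2.26, 0.93, 0.88, 2.05, 1.85, 0.95],
    },
    "Sample 3": {
        "Picard":    [0, 0, 0.99, 5.45, 0, 0.76, 1.66, 0, 2.56, 1.54, 0],
        "Cluster":   [1.1, 0.46, 0.98, 5.86, 1.26, 0.94, 1.55, 0.3, 2.52, 2.06, 0.95],
        "Adjacency": [1.09, 0.46, 0.97, 5.8, 1.27, 0.97, 1.56, 0.31, 2.49, 2.04, 0.93],
        "Direct":    [1.1, 0.46, 0.98, 5.85, 1.25, 0.94, 1.55, 0.3, 2.52, 2.06, 0.95],
    },
}


def detected_sites(values):
    """Number of sites called positive, i.e. with a frequency above zero."""
    return sum(1 for value in values if value > 0)


summary = {
    sample: {method: detected_sites(values) for method, values in methods.items()}
    for sample, methods in frequencies.items()
}
for sample, counts in summary.items():
    print(sample, counts)
```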

Example 2

According to an embodiment of the present invention, an embodiment of an apparatus for processing circulating tumor DNA repeat sequences is provided.

FIG. 6 is a schematic diagram of an apparatus for processing circulating tumor DNA repeat sequences according to an embodiment of the present invention. As shown in FIG. 6, the apparatus includes:

A first acquisition module 60, configured to acquire sequencing data of the circulating tumor DNA to be detected and a reference genome sequence, wherein the sequencing data are obtained by high-throughput sequencing of the circulating tumor DNA to be detected and include multiple pairs of double-ended sequences (paired-end reads).

Specifically, the circulating tumor DNA to be detected may be gene sequences extracted from a patient's body fluids such as blood, lymph, interstitial fluid or cerebrospinal fluid; in the embodiments of the present invention, ctDNA extracted from blood is taken as an example. The sequencing data may be the capture-sequencing fastq data obtained by NGS sequencing of the ctDNA sample to be detected. The reference genome sequence may be human reference genome fasta data downloaded from public databases such as NCBI; in the embodiments of the present invention, the reference genome sequence extracted from blood is taken as an example.

An alignment module 62, configured to align the sequencing data to the reference genome sequence to obtain a first alignment result, wherein the first alignment result includes at least: the genomic positions, base sequences and corresponding base quality value sequences of the multiple pairs of double-ended sequences.

A processing module 64, configured to obtain at least one sequence set based on the first alignment result, wherein each sequence set includes at least one pair of double-ended sequences, and the double-ended sequences in the same sequence set have the same genomic position.
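The grouping performed by the processing module can be sketched as a dictionary keyed by genomic position. What exactly counts as "the same genomic position" is an assumption here: the sketch uses the chromosome together with the start and end coordinates of the pair, and the record fields ("name", "chrom", "start", "end") are hypothetical.

```python
from collections import defaultdict


def group_by_position(pairs):
    """Group read pairs sharing a genomic position into sequence sets.

    Each pair is a dict with the hypothetical keys 'chrom', 'start' and 'end'.
    Pairs with an identical (chrom, start, end) key form one sequence set.
    """
    sequence_sets = defaultdict(list)
    for pair in pairs:
        key = (pair["chrom"], pair["start"], pair["end"])
        sequence_sets[key].append(pair)
    return dict(sequence_sets)


# Made-up read pairs for illustration.
pairs = [
    {"name": "p1", "chrom": "chr7", "start": 55249000, "end": 55249170},
    {"name": "p2", "chrom": "chr7", "start": 55249000, "end": 55249170},
    {"name": "p3", "chrom": "chr7", "start": 55249005, "end": 55249170},
]

sequence_sets = group_by_position(pairs)
for key, members in sequence_sets.items():
    print(key, [p["name"] for p in members])
```

Only pairs inside the same sequence set are later compared with each other, which keeps the pairwise edit-distance computation small.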

A clustering module 66, configured to perform network clustering on the at least one pair of double-ended sequences in each sequence set to obtain at least one first network.

A second acquisition module 68, configured to acquire the most numerous double-ended sequence in each first network to obtain the independent sequence corresponding to each first network.
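The clustering and selection performed by modules 66 and 68 can be sketched for a single sequence set as follows, following the node, edit-distance and Step A to Step D wording of claim 1. This is a minimal illustration under stated assumptions, not the patented implementation: each pair is represented as one base string plus a list of base qualities, the edit distance is taken to be the Levenshtein distance, the preset distance of 2 is a placeholder, and ties between equally large nodes are broken by sequence order.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def build_networks(nodes, max_dist):
    """Link nodes whose edit distance is below max_dist.

    Returns the adjacency map and the connected components (the networks).
    """
    seqs = list(nodes)
    adjacent = {s: set() for s in seqs}
    for i, a in enumerate(seqs):
        for b in seqs[i + 1:]:
            if edit_distance(a, b) < max_dist:
                adjacent[a].add(b)
                adjacent[b].add(a)
    networks, seen = [], set()
    for s in seqs:
        if s in seen:
            continue
        stack, component = [s], set()
        while stack:
            n = stack.pop()
            if n not in component:
                component.add(n)
                stack.extend(adjacent[n] - component)
        seen |= component
        networks.append(component)
    return adjacent, networks


def independent_sequences(pairs, max_dist=2):
    """Pick independent sequences from one sequence set.

    pairs: list of (base_sequence, base_qualities) tuples.
    """
    nodes = defaultdict(list)          # node = one distinct base sequence
    for seq, quals in pairs:
        nodes[seq].append((seq, quals))
    adjacent, networks = build_networks(nodes, max_dist)
    chosen = []
    for network in networks:
        remaining = set(network)
        while remaining:               # Steps C and D: repeat while nodes remain
            # Step A: node with the most members, then its best-quality pair.
            top = max(remaining, key=lambda s: (len(nodes[s]), s))
            chosen.append(max(nodes[top], key=lambda p: sum(p[1])))
            # Step B: remove that node and the nodes adjacent to it.
            remaining -= {top} | adjacent[top]
    return chosen


# Made-up sequence set: three identical pairs, one near-duplicate, one distinct.
example = [
    ("ACGTACGT", [30] * 8),
    ("ACGTACGT", [35] * 8),
    ("ACGTACGT", [20] * 8),
    ("ACGTACGA", [30] * 8),
    ("TTTTCCCC", [30] * 8),
]
selected = independent_sequences(example)
print([seq for seq, _ in selected])
```

In this toy set the near-duplicate differs from the dominant sequence by one base, so it joins the same network and is discarded, while the unrelated sequence forms its own network and is kept.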

Example 3

According to an embodiment of the present invention, an embodiment of a storage medium is provided. The storage medium includes a stored program, wherein, when the program runs, the device on which the storage medium resides is controlled to execute the above method for processing circulating tumor DNA repeat sequences.

Example 4

According to an embodiment of the present invention, an embodiment of a processor is provided. The processor is configured to run a program, wherein the above method for processing circulating tumor DNA repeat sequences is executed when the program runs.

The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.

In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, reference may be made to the related descriptions of the other embodiments.

In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, units or modules, and may be electrical or take other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or some of the steps of the methods described in the embodiments of the present invention. The storage medium includes a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, and other media that can store program code.

The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present invention, and these improvements and refinements shall also be regarded as falling within the scope of protection of the present invention.

Claims

1. A method for processing circulating tumor DNA repeat sequences, characterized by comprising:

Obtaining sequencing data and a reference genome sequence of the circulating tumor DNA to be detected, wherein the sequencing data is data obtained by high-throughput sequencing of the circulating tumor DNA to be detected, and the sequencing data includes: multiple pairs of double-ended sequences;

Comparing the sequencing data with the reference genome sequence to obtain a first alignment result, wherein the first alignment result at least includes: the genomic positions, base sequences and corresponding base quality value sequences of the multiple pairs of double-ended sequences;

Based on the first alignment result, at least one sequence set is obtained, wherein each sequence set includes: at least one pair of double-ended sequences, and the genomic positions of the double-ended sequences in the same sequence set are the same;

performing network clustering on at least a pair of double-ended sequences in each sequence set to obtain at least one first network;

Obtaining the most numerous double-ended sequence in each first network to obtain an independent sequence corresponding to each first network;

Wherein performing network clustering on at least one pair of double-ended sequences in each sequence set to obtain at least one first network includes: comparing the base sequences of the at least one pair of double-ended sequences in each sequence set to obtain at least one node and the number of members of each node, wherein the double-ended sequences corresponding to the same node have the same base sequence, and the number of members characterizes the number of double-ended sequences corresponding to that node; obtaining the edit distance between any two nodes in each sequence set; determining whether the edit distance between the two nodes is less than a preset distance; and, if the edit distance between the two nodes is less than the preset distance, determining that the two nodes belong to the same network, thereby obtaining the at least one first network;

Obtaining the most numerous double-ended sequence in each first network to obtain the independent sequence corresponding to each first network includes: Step A, obtaining the first node, which has the largest number of members in each first network, calculating the sum of the base quality values of all bases contained in each pair of double-ended sequences corresponding to the first node to obtain the base quality sum of each such pair, and taking the double-ended sequence with the largest base quality sum as the independent sequence corresponding to that first network; Step B, deleting the first node and the nodes adjacent to the first node from each first network to obtain at least one second network; Step C, determining whether any second network contains a node; Step D, if any second network contains a node, taking that second network as a first network and executing Step A and Step B cyclically.

2. The method according to claim 1, wherein before it is determined that the two nodes belong to the same network and the at least one first network is obtained, the method further comprises:

Judging whether the number of members of any two nodes satisfies a preset condition;

If the number of members of the any two nodes satisfies the preset condition, it is determined that the any two nodes belong to the same network, and the at least one first network is obtained.

3. The method according to claim 2, wherein judging whether the number of members of the any two nodes satisfies a preset condition comprises:

Multiplying the number of members of the first of the two nodes by a preset multiple to obtain a first value;

Obtaining the difference between the first value and a preset value;

Judging whether the number of members of the second of the two nodes is greater than the difference;

If the number of members of the second node is greater than the difference, determining that the number of members of the two nodes satisfies the preset condition.

4. The method according to claim 1, wherein after obtaining the double-ended sequence with the largest number in each first network and obtaining the independent sequence corresponding to each first network, the method further comprises:

deleting, from the first alignment result, the alignment results of all double-ended sequences other than the independent sequences corresponding to each first network, to obtain a second alignment result.

5. The method according to claim 4, wherein, after the alignment results of all double-ended sequences other than the independent sequences corresponding to each first network are deleted from the first alignment result to obtain the second alignment result, the method further comprises:

The second alignment result is sorted according to the genomic position of each independent sequence to obtain a third alignment result.
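
By way of non-limiting illustration, the deletion of claim 4 and the sorting of claim 5 may be sketched as follows. The record layout is an assumption made for the example: each alignment record is taken to be a dictionary with the hypothetical keys `name`, `contig` and `pos`, and `contig_order` is a hypothetical list giving the order of the reference contigs.

```python
def deduplicate_and_sort(first_alignment, independent_names, contig_order):
    """Return (second_alignment, third_alignment).

    second_alignment keeps only records of independent sequences;
    third_alignment is the same records sorted by genomic position.
    """
    keep = set(independent_names)
    # Claim 4: drop every record that is not an independent sequence.
    second = [rec for rec in first_alignment if rec["name"] in keep]
    # Claim 5: sort by contig rank, then by coordinate within the contig.
    rank = {contig: i for i, contig in enumerate(contig_order)}
    third = sorted(second, key=lambda rec: (rank[rec["contig"]], rec["pos"]))
    return second, third
```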

6. The method according to claim 5, wherein, after the second alignment result is sorted according to the genomic position of each independent sequence to obtain the third alignment result, the method further comprises:

Obtain the capture sequencing interval;

performing, according to the capture sequencing interval, single-nucleotide variant detection and insertion/deletion (indel) detection on each independent sequence to obtain a detection result.

7. The method according to claim 1, wherein before obtaining at least one sequence set based on the first alignment result, the method further comprises:

According to the genomic positions of the multiple pairs of double-ended sequences, the first alignment result is sorted to obtain a fourth alignment result, and an index is established for the fourth alignment result;

The fourth alignment result is filtered to obtain a fifth alignment result;

Based on the fifth alignment result, the at least one sequence set is obtained.

8. The method according to claim 1, wherein aligning the sequencing data with the reference genome sequence to obtain the first alignment result comprises:

obtaining the matching degree between each sequence in the multiple pairs of double-ended sequences and each sequence in the reference genome sequence;

obtaining the at least one sequence corresponding to the highest matching degree as the matching sequence of each sequence;

The genomic location of each sequence is determined based on the matching sequence of each sequence.
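
By way of non-limiting illustration, the matching step of claim 8 may be sketched with a toy scoring scheme. This is not the alignment algorithm of any particular aligner: the hypothetical function `best_match_position` assumes that the matching degree is simply the number of identical bases between the sequence and an equally long window of the reference, with no gaps, and that the first window reaching the highest score is taken as the matching sequence.

```python
def best_match_position(read, reference):
    """Return (position, matching_degree) of the best ungapped match.

    position is the 0-based offset of the window in `reference`;
    matching_degree is the count of identical bases in that window.
    """
    best_pos, best_score = None, -1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        # Matching degree: number of positions where the bases agree.
        score = sum(a == b for a, b in zip(read, window))
        if score > best_score:
            best_pos, best_score = pos, score
    return best_pos, best_score
```

The returned offset plays the role of the genomic position determined from the matching sequence.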

9. A device for processing circulating tumor DNA repeats, comprising:

a first acquisition module, configured to acquire the sequencing data of the circulating tumor DNA to be detected and the reference genome sequence, wherein the sequencing data is data obtained by high-throughput sequencing of the circulating tumor DNA to be detected, and the sequencing data includes: a plurality of pairs of double-ended sequences;

an alignment module, configured to align the sequencing data with the reference genome sequence to obtain a first alignment result, wherein the first alignment result at least includes: the genomic positions of the multiple pairs of double-ended sequences, the base sequences and the corresponding base quality value sequences;

a processing module, configured to obtain at least one sequence set based on the first alignment result, wherein each sequence set includes: at least one pair of double-ended sequences, and the genomic positions of the double-ended sequences in the same sequence set are the same;

a clustering module, configured to perform network clustering on the at least one pair of double-ended sequences in each sequence set to obtain at least one first network;

a second acquisition module, configured to acquire the double-ended sequence with the largest number in each first network, and obtain an independent sequence corresponding to each first network;

10. A storage medium, characterized in that the storage medium comprises a stored program, wherein, when the program runs, a device in which the storage medium is located is controlled to execute the method for processing circulating tumor DNA repeats according to any one of claims 1 to 8.

11. A processor, wherein the processor is configured to run a program, and wherein, when the program runs, the method for processing circulating tumor DNA repeats according to any one of claims 1 to 8 is executed.