HK1240371B - Method and device for determining v/j gene sequences before rearrangement - Google Patents
- Publication number: HK1240371B (application HK17113845.1A)
- Authority: HK (Hong Kong)
- Prior art keywords
- region
- sequence
- gene
- rearrangement
- seed
Description
Technical Field
The present invention belongs to the field of bioinformatics. Specifically, it relates to a method and device for determining V/J gene sequences before rearrangement.
Background Art
Germline cells carry a cluster of V genes and a cluster of J genes, and in some cases a cluster of D genes lies between them. The genes within a cluster are separated by introns, arranged in tandem on the same chromosome, and highly homologous to one another [Animal Immunology [M]. China Agricultural University Press, 1996.]. A cluster typically contains dozens of genes, and each gene may differ between individuals. For example, in the human antibody heavy chain (IGH) locus, the V gene cluster has 40 genes, the D gene cluster has 25 genes, and the J gene cluster has 6 genes, and the 40 V genes have 425 alleles in total.
During lymphocyte development and maturation, V, J, or D genes undergo intergenic rearrangement [Parkin J, Cohen B. An overview of the immune system [J]. The Lancet, 2001, 357(9270): 1777-1789.], forming the genes that encode T cell receptors (TCRs) and B cell receptors (BCRs) or antibodies (Ig). The collection of B cell receptors/antibodies or T cell receptors that make up the body's immune system forms the immune repertoire.
The constant regions (C regions) of immunoglobulins, TCRs, and BCRs are relatively conserved and relatively easy to sequence, and the C regions of many animals are known. The V, D, and J gene regions, however, are highly diverse [Yu Jiang, Yao Xinsheng. High-throughput sequencing analysis of the characteristics of the T cell receptor β chain CDR3 repertoire in autoimmune diseases [J]. Guizhou Medicine, 2015, 3: 037.]. Moreover, except in humans and mice, the genes in these regions have either not been found or have been only partly characterized, which hinders immunological research to some extent. For example, monkeys are a widely used animal model for vaccine evaluation and antibody studies, yet only a small number of monkey IgH sequences [Link J M, Hellinger M A, Schroeder H W. The Rhesus monkey immunoglobulin IGHD and IGHJ germline repertoire [J]. Immunogenetics, 2002, 54(4): 240-250.] have been discovered, far too few for analysis. Determining the germline sequences of a species is therefore a basic problem in urgent need of a solution.
Several methods have attempted to explore germline sequences. The traditional approach is a PCR cloning strategy in which primers based on human genomic DNA sequences are used to PCR-amplify the germline of the target species. This approach has recovered partial germline sequences of camels [Nguyen V K, Hamers R, Wyns L, et al. Camel heavy-chain antibodies: diverse germline VHH and specific mechanisms enlarge the antigen-binding repertoire [J]. The EMBO Journal, 2000, 19(5): 921-930.] and monkeys [Diaz O L, Daubenberger C A, Rodriguez R, et al. Immunoglobulin kappa light-chain V, J, and C gene sequences of the owl monkey Aotus nancymaae [J]. Immunogenetics, 2000, 51(3): 212-218.]. It is the most direct way to obtain the sequences, but it is applicable only to species homologous to humans, requires the design of multiple paired primers, and is time-consuming.
Recently, applying bioinformatics methods to reference-guided assembly of species genomes has become an important direction. These strategies, however, depend on known genomes and germline sequences. Accurately assembling and correcting the highly repetitive parts of a species' germline region is difficult, which impairs germline inference. In addition, no software or tool has so far been available for inferring germline sequences.
Summary of the Invention
The present invention aims to solve at least one of the above problems or to provide a commercial alternative. To this end, the inventors provide a de novo method for inferring V/J germline sequences, that is, for inferring the V/J gene sequences before rearrangement.
According to one aspect of the present invention, a method for determining V and/or J gene sequences before rearrangement is provided. The method comprises: (1) obtaining sequencing data of an RNA sample to be tested, the sequencing data comprising a plurality of reads from the variable regions of TCR, BCR and/or Ig, the reads having a length L, L ≥ 100 bp; (2) based on the sequencing data and on the arrangement of the V gene segment and the J gene segment relative to the C gene segment in the variable region, determining the portions of the reads that come from the V gene segment and/or the J gene segment, to obtain a plurality of V region portions and/or a plurality of J region portions; (3) taking at least one sequence from each V region portion and/or J region portion as a seed sequence, to obtain a seed sequence set comprising a plurality of seed sequences, the seed sequences having a length K; (4) clustering the V region portions and/or J region portions according to differences in the number of V region portions and/or J region portions supporting each seed sequence in the seed sequence set, to obtain a plurality of V region portion clusters and/or a plurality of J region portion clusters; (5) using each V region portion cluster and/or J region portion cluster to extend the seed sequence it supports, to obtain a plurality of candidate pre-rearrangement V gene sequences and/or a plurality of candidate pre-rearrangement J gene sequences; (6) filtering the candidate pre-rearrangement V gene sequences and/or the candidate pre-rearrangement J gene sequences according to their support by the reads in the sequencing data, to obtain the V and/or J gene sequences before rearrangement.
According to another aspect of the present invention, a computer-readable medium is provided for storing a computer-executable program, execution of which includes carrying out the method for determining V and/or J gene sequences before rearrangement according to the above aspect of the invention. Those skilled in the art will appreciate that, when the computer-executable program is executed, all or some of the steps of the method can be completed by instructing the relevant hardware. The storage medium may include a read-only memory, a random access memory, a magnetic disk, an optical disk, or the like.
According to yet another aspect of the present invention, a device for determining V and/or J gene sequences before rearrangement is provided. The device comprises: a data input unit for inputting data; a data output unit for outputting data; a storage unit for storing data, including a computer-executable program; and a processor connected to the data input unit, the data output unit, and the storage unit, for executing the computer-executable program, execution of which includes carrying out the method for determining V and/or J gene sequences before rearrangement according to the above aspect of the invention.
According to a further aspect of the present invention, a system for determining V and/or J gene sequences before rearrangement is provided. The system comprises: a data acquisition device for obtaining sequencing data of an RNA sample to be tested, the sequencing data comprising a plurality of reads from the variable regions of TCR, BCR and/or Ig, the reads having a length L, L ≥ 100 bp; a V/J region portion determination device for determining, based on the sequencing data and on the arrangement of the V gene segment and the J gene segment relative to the C gene segment in the variable region, the portions of the reads that come from the V gene segment and/or the J gene segment, to obtain a plurality of V region portions and/or a plurality of J region portions; a seed sequence set acquisition device for taking at least one sequence from each V region portion and/or J region portion as a seed sequence, to obtain a seed sequence set comprising a plurality of seed sequences, the seed sequences having a length K; a V/J region portion cluster determination device for clustering the V region portions and/or J region portions according to differences in the number of V region portions and/or J region portions supporting each seed sequence in the seed sequence set, to obtain a plurality of V region portion clusters and/or a plurality of J region portion clusters; a candidate pre-rearrangement V/J gene sequence acquisition device for using each V region portion cluster and/or J region portion cluster to extend the seed sequence it supports, to obtain a plurality of candidate pre-rearrangement V gene sequences and/or a plurality of candidate pre-rearrangement J gene sequences; and a pre-rearrangement V/J gene sequence determination device for filtering the candidate pre-rearrangement V gene sequences and/or the candidate pre-rearrangement J gene sequences according to their support by the reads in the sequencing data, to obtain the V and/or J gene sequences before rearrangement.
The above method, device, and/or system of the present invention, working from sequencing data obtained by high-throughput sequencing of the immune repertoire, can accurately derive V/J germline sequences using information analysis alone. With this method, germline sequences can be determined for many species whose V/J germline has not yet been identified, which facilitates further study of those species' T cell receptors and B cell receptors or antibodies. Compared with traditional and other existing methods, the method of the present invention greatly reduces the difficulty and cuts time and cost.
Brief Description of the Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments with reference to the drawings, in which:
FIG. 1 is a schematic diagram of the steps of a method for determining V and/or J gene sequences before rearrangement in one embodiment of the present invention.
FIG. 2 is a schematic structural diagram of a device for determining V and/or J gene sequences before rearrangement in one embodiment of the present invention.
FIG. 3 is a schematic structural diagram of a system for determining V and/or J gene sequences before rearrangement in one embodiment of the present invention.
FIG. 4 is a flow chart of a method for determining V and/or J gene sequences before rearrangement in one embodiment of the present invention.
FIG. 5 is a schematic diagram of the coverage of the human J germline gene region by the merged TRB-J genes determined from three samples in one embodiment of the present invention.
FIG. 6 is a schematic diagram of the coverage of the human V germline gene region by the merged TRB-V genes determined from three samples in one embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the drawings, in which the same or similar reference numerals denote, throughout, the same or similar elements or elements having the same or similar functions.
The embodiments described below with reference to the drawings are exemplary. They serve only to explain the present invention and are not to be construed as limiting it. Note that the terms "first", "second", "first type", "second type", "first portion", and the like are used herein only for convenience of description. They are not to be understood as indicating or implying relative importance or any sequential order. In the description of the present invention, unless otherwise stated, "a plurality of" means two or more.
Herein, unless otherwise expressly specified and limited, terms such as "connected" and "connection" are to be understood broadly. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; and it may be direct, indirect through an intermediate medium, or an internal communication between two elements.
As shown in FIG. 1, one embodiment of the present invention provides a method for determining V and/or J gene sequences before rearrangement. The method comprises the following steps:
S10: Obtain sequencing data of the RNA sample to be tested.
The sequencing data obtained from the RNA sample to be tested comprises a plurality of reads from the variable regions of TCR, BCR and/or Ig, the reads having a length L, L ≥ 100 bp.
The RNA sample is RNA from cells in which V and/or J gene rearrangement has occurred, or cell-free RNA. It generally comes from specific immune cells, for example T lymphocytes and/or B lymphocytes.
The sequencing data is obtained by preparing a sequencing library from the nucleic acid of the RNA sample to be tested and sequencing it on a sequencer. According to an embodiment of the present invention, obtaining the sequencing data comprises: obtaining nucleic acid from the sample to be tested, preparing a sequencing library of the nucleic acid, and sequencing the library. The library is prepared according to the requirements of the chosen sequencing method, which depends on the sequencing platform. Available platforms include, but are not limited to, the Illumina HiSeq 2000/2500, the Life Technologies Ion Torrent, and single-molecule sequencing platforms. Sequencing may be single-end or paired-end. The resulting raw data consists of the sequenced fragments, called reads.
According to one embodiment of the present invention, the sequencing data is preprocessed sequencing data, the preprocessing comprising at least one of the following: filtering out reads that contain adapter sequence, trimming bases with a quality value below 10 from the ends of reads, and trimming adapter sequence from the ends of reads. The preprocessed data is of higher overall quality, which helps the subsequent analysis infer V/J germline sequences accurately.
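The two simplest preprocessing operations above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, quality values are assumed to be already decoded to integers, and trimming is assumed to apply to the 3' end only, since the text does not say which end is meant.

```python
def trim_low_quality_tail(seq: str, quals: list[int], min_q: int = 10) -> tuple[str, list[int]]:
    """Trim bases from the 3' end while their quality value is below min_q.

    Assumption: only the 3' end is trimmed; the source just says "ends of reads".
    """
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def drop_adapter_reads(reads: list[str], adapter: str) -> list[str]:
    """Discard every read that contains the adapter sequence (exact match only)."""
    return [r for r in reads if adapter not in r]
```

A real pipeline would normally use a dedicated trimming tool that also tolerates mismatches in the adapter; exact matching is used here only to keep the sketch short.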
According to one embodiment of the present invention, the sequencing data is obtained by paired-end sequencing, that is, it comprises multiple read pairs. The two reads of a pair are joined into a single merged sequence using the overlap between them, and the merged sequence replaces the read pair in the following steps. This is equivalent to obtaining longer sequenced fragments, which helps the subsequent analysis infer the pre-rearrangement sequences accurately.
S20: Obtain a plurality of V region portions and/or a plurality of J region portions.
Based on the sequencing data and on the arrangement of the V gene segment and the J gene segment relative to the C gene segment in the variable region, the portions of the reads that come from the V gene segment and/or the J gene segment are determined, giving a plurality of V region portions and/or a plurality of J region portions.
According to one embodiment of the present invention, S20 comprises: determining the portion of each read that comes from the C gene segment, for example by local alignment; cutting that portion off the read to obtain a cut portion; extracting a sequence of no less than 60 bp from the 3' end of the cut portion toward the 5' end to obtain the J region portion; and/or cutting 40 bp off the 3' end of the cut portion toward the 5' end, the remainder being the V region portion. This example makes a preliminary determination of the V region and J region portions of a read from the arrangement of the variable-region V and J gene segments relative to the constant-region C gene segment in Ig or TRB, together with the sizes of the target gene segments.
According to a preferred embodiment of the present invention, S20 further comprises filtering out J region portions shorter than 40 bp and/or V region portions shorter than 40 bp. Based on the size of the target gene segments, this removes fragments that do not come from the target genes as well as short, broken target fragments, which makes subsequent data processing simpler and more accurate.
S30: Obtain a seed sequence set.
At least one sequence is taken from each V region portion and/or J region portion as a seed sequence, giving a seed sequence set comprising a plurality of seed sequences, the seed sequences having a length K.
Since J regions are 40 to 60 bp long, K is set to no more than 40 bp according to one embodiment of the present invention. This makes it possible to convert each V region portion or J region portion into multiple seed sequences.
According to one embodiment of the present invention, S30 comprises: cutting each V region portion and/or J region portion with a sliding window that advances 1 bp at a time, so that each V region portion and/or J region portion is converted into one seed sequence subset, each subset comprising (L-K+1) seed sequences, and the subsets together forming the seed sequence set. Each V region portion or J region portion is thus converted into its own seed sequence subset, that is, into a set of K-mers. First, two K-mers offset by 1 bp overlap by (K-1) bp, and this overlap is known without any alignment, which saves alignment time. Second, each V region portion or J region portion corresponds to a group of K-mers whose linear order is fixed. Both properties support the later seed-based extension and the inference of the pre-rearrangement V/J gene sequences.
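The sliding cut can be sketched in a few lines. For a portion of length n, the window yields n-K+1 seeds, which matches the (L-K+1) count in the text when the portion has length L. The function names and the seed-to-portion index are illustrative choices, not taken from the patent.

```python
from collections import defaultdict


def seeds_of(portion: str, k: int) -> list[str]:
    """All K-mers of a portion, in order, sliding 1 bp at a time."""
    return [portion[i:i + k] for i in range(len(portion) - k + 1)]


def build_seed_index(portions: list[str], k: int) -> dict[str, set[int]]:
    """Map each seed to the indices of the portions that contain (support) it."""
    index: dict[str, set[int]] = defaultdict(set)
    for pid, portion in enumerate(portions):
        for seed in seeds_of(portion, k):
            index[seed].add(pid)
    return dict(index)
```

For example, `seeds_of("ACGTA", 3)` gives `["ACG", "CGT", "GTA"]`, and consecutive seeds share K-1 = 2 bases.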
S40: Obtain a plurality of V region portion clusters and/or a plurality of J region portion clusters.
The V region portions and/or J region portions are clustered according to differences in the number of V region portions and/or J region portions supporting each seed sequence in the seed sequence set, giving a plurality of V region portion clusters and/or a plurality of J region portion clusters.
According to one embodiment of the present invention, S40 comprises repeating the following (i) and (ii) until no seed sequences remain: (i) identifying the seed sequence supported by the largest number of V region portions and/or J region portions, and grouping all V region portions and/or J region portions that support it into one class, giving one V region portion cluster and/or one J region portion cluster; (ii) removing the seed sequence of (i) together with all V region portions and/or J region portions that support it. The cycle continues in this way until the number of remaining seed sequences is 0.
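Steps (i) and (ii) amount to a greedy clustering loop, sketched below. Two details are assumptions not stated in the text: ties between equally supported seeds are broken lexicographically, and a seed that loses all its supporting portions is dropped so that the loop terminates. The function name is hypothetical.

```python
from collections import defaultdict


def cluster_by_seed_support(portions: list[str], k: int) -> list[tuple[str, list[int]]]:
    """Greedy clustering: returns (seed, sorted indices of member portions) per cluster."""
    index: dict[str, set[int]] = defaultdict(set)
    for pid, portion in enumerate(portions):
        for i in range(len(portion) - k + 1):
            index[portion[i:i + k]].add(pid)

    clusters = []
    while index:
        # (i) seed with the most supporting portions; lexicographic tie-break (assumption)
        seed = max(index, key=lambda s: (len(index[s]), s))
        members = set(index[seed])
        clusters.append((seed, sorted(members)))
        # (ii) remove the seed and every portion that supports it
        del index[seed]
        for other in list(index):
            index[other] -= members
            if not index[other]:       # seed left without support: drop it (assumption)
                del index[other]
    return clusters
```

Rescanning all seeds after every cluster is quadratic in the worst case. A real implementation would likely keep a priority queue, but the simple loop keeps the correspondence with (i) and (ii) visible.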
S50: Obtain candidate pre-rearrangement V gene sequences and/or candidate pre-rearrangement J gene sequences.
Each V region portion cluster and/or J region portion cluster is used to extend the seed sequence it supports, giving a plurality of candidate pre-rearrangement V gene sequences and/or a plurality of candidate pre-rearrangement J gene sequences. The extension relies on the overlaps between V region portions or between J region portions. For example, the J region portions of one J region portion cluster are aligned to the seed sequence they support, which positions them, and the extension then proceeds from the overlaps between the positioned J region portions.
Here "aligned" means the same as "matched". The alignment itself can be done with known alignment software such as SOAP, BWA, or TeraMap, and this embodiment places no restriction on the choice. Depending on the alignment parameters, a sequence or a pair of sequences is allowed at most n mismatched bases, with n set to 1 or 2 for example. A sequence or pair with more than n mismatched bases is regarded as not alignable to the reference sequence.
When the match is exact, for example when a site in the aligned sequence is identical to the same site in the reference sequence, the sequence is said to support that site.
Because the V and D genes each have multiple copies and the segments combine at random, rearrangement takes many forms. As a result, the positioned V/J region portions show different bases at the same position, and confidence conditions must be set during extension to decide the base at that position. According to one embodiment of the present invention, S50 comprises using the V region portion clusters and/or J region portion clusters to extend the seed sequences they support, to obtain the plurality of candidate pre-rearrangement V gene sequences and/or the plurality of candidate pre-rearrangement J gene sequences, which includes at least one of the following: (a) for a seed sequence supported by a J region portion cluster, extending the 3' end and/or the 5' end of the seed sequence by one base using that cluster requires both that the J region portions supporting the base make up more than 3% of all J region portions in the cluster, and that the types of J region portion supporting the base make up more than 5% of all types in the cluster; (b) for a seed sequence supported by a V region portion cluster, extending the 3' end of the seed sequence by one base using that cluster requires both that the V region portions supporting the base make up more than 3% of all V region portions in the cluster, and that the types of V region portion supporting the base make up more than 5% of all types of V region portion in the cluster; (c) for a seed sequence supported by a V region portion cluster, extending the 5' end of the seed sequence by one base using that cluster requires both that more than 100 V region portions support the base, and that more than 2 types of V region portion support the base. The types of J region portion supporting a base are J region portions that have the same base at that position but are not identical at the other positions. The types of V region portion supporting a base are defined in the same way for V region portions.
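The relative thresholds of conditions (a) and (b) for a 3' extension can be sketched as follows. Several simplifications are assumed and are not part of the patent text: portions are positioned by the first exact occurrence of the seed rather than by mismatch-tolerant alignment, only portions that agree with the sequence built so far are counted, "types" are taken to be distinct portion sequences, and when several bases pass the thresholds only the most supported one is followed instead of branching into several candidates. The function name is hypothetical.

```python
from collections import Counter


def extend_right(seed: str, portions: list[str],
                 min_count_frac: float = 0.03, min_type_frac: float = 0.05) -> str:
    """Extend a seed toward its 3' end, one base at a time, using one cluster."""
    total = len(portions)
    total_types = len(set(portions))
    # Position each portion by the first exact occurrence of the seed (assumption).
    anchors = [(p, p.find(seed)) for p in portions if seed in p]
    contig = seed
    while True:
        counts: Counter = Counter()
        types: dict[str, set[str]] = {}
        for portion, start in anchors:
            i = start + len(contig)            # index of the next base in this portion
            if i < len(portion) and portion[start:i] == contig:
                base = portion[i]
                counts[base] += 1
                types.setdefault(base, set()).add(portion)
        passing = [b for b in counts
                   if counts[b] / total > min_count_frac
                   and len(types[b]) / total_types > min_type_frac]
        if not passing:
            return contig
        # Follow only the best-supported base (simplification; ties broken by base letter).
        contig += max(passing, key=lambda b: (counts[b], b))
```

With 99 portions reading `ACGTA` and one reading `ACGTC`, the base C is supported by only 1% of the cluster and is rejected, so the seed `ACGT` is extended to `ACGTA`. Condition (c) would replace the two fractions with the absolute thresholds of more than 100 portions and more than 2 types, applied toward the 5' end.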
To obtain candidate pre-rearrangement V gene sequences, according to one embodiment of the present invention, S50 comprises performing (b) and (c) above and joining the sequences obtained from (b) and (c) to give the candidate pre-rearrangement V gene sequence. This embodiment takes into account that, once the V region gene has been fragmented, the fragments differ in length and the situation is more complex than for the J region. The left and right ends are therefore extended separately under different filtering conditions, which helps produce highly accurate candidate V gene sequences.
S60: Filter to obtain the V and/or J gene sequences before rearrangement.
The candidate pre-rearrangement V gene sequences and/or the candidate pre-rearrangement J gene sequences are filtered according to their support by the reads in the sequencing data, giving the V and/or J gene sequences before rearrangement.
According to one embodiment of the present invention, before S60 is performed, candidate pre-rearrangement V gene sequences with a sequence similarity of no less than 95% are merged, and/or candidate pre-rearrangement J gene sequences with a sequence similarity of no less than 95% are merged. This avoids analyzing the same data repeatedly and reduces the amount of computation.
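Merging candidates at 95% similarity can be sketched as below. The text defines neither the similarity measure nor which sequence survives a merge, so both choices here are assumptions: similarity is the ratio reported by Python's `difflib.SequenceMatcher`, and the longest candidate of each similar group is kept as its representative. The function name is hypothetical.

```python
from difflib import SequenceMatcher


def merge_similar(candidates: list[str], min_similarity: float = 0.95) -> list[str]:
    """Keep one representative per group of candidates that are >= 95% similar.

    Assumptions: similarity is difflib's ratio, and longer candidates win.
    """
    representatives: list[str] = []
    for cand in sorted(candidates, key=len, reverse=True):
        if all(SequenceMatcher(None, cand, rep, autojunk=False).ratio() < min_similarity
               for rep in representatives):
            representatives.append(cand)
    return representatives
```

An identity computed from a pairwise alignment would be the more conventional measure for nucleotide sequences; `difflib` is used only because it is in the standard library.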
According to one embodiment of the present invention, S60 includes performing (d) and/or (e) below: (d) starting from the first base at the 3' end of the candidate pre-rearrangement V gene sequence, take a sequence of the seed-sequence length toward the 5' end as the first fragment; starting from the P-th base from the 3' end of the candidate pre-rearrangement V gene sequence, take a sequence of the seed-sequence length toward the 5' end as the second fragment; and filter the candidate pre-rearrangement V gene sequence based on the degree of difference between the read support number of the first fragment and that of the second fragment; (e) starting from the first base at the 5' end of the candidate pre-rearrangement J gene sequence, take a sequence of the seed-sequence length toward the 3' end as the third fragment; starting from the P'-th base from the 5' end of the candidate pre-rearrangement J gene sequence, take a sequence of the seed-sequence length toward the 3' end as the fourth fragment; and filter the candidate pre-rearrangement J gene sequence based on the degree of difference between the read support number of the third fragment and that of the fourth fragment.
According to one embodiment of the present invention, (d) in S60 includes retaining candidate pre-rearrangement V gene sequences that satisfy both of the following conditions: read support number of the second fragment / read support number of the first fragment > 1.5, and read support number of the first fragment / average read support number of the first fragment > 5%; and/or (e) in S60 includes retaining candidate pre-rearrangement J gene sequences that satisfy both of the following conditions: read support number of the fourth fragment / read support number of the third fragment > 1.5, and read support number of the third fragment / average read support number of the third fragment > 5%. These embodiments screen the candidate V/J gene sequences by their read support numbers, so that the sequences finally retained are reliable pre-rearrangement sequences.
The method of the present invention described above can accurately derive V/J germline sequences using information analysis techniques alone. It can determine germline sequences for the many species whose V/J germline sequences have not yet been identified, and can be used for further study of T cell receptors, B cell receptors, or antibodies in any species. Compared with traditional and currently available methods, it greatly reduces difficulty, time, and cost.
Those skilled in the art will appreciate that all or some of the steps of the above method for determining pre-rearrangement V and/or J gene sequences can be written as a program in a machine-readable language and stored in a storage medium. Another embodiment of the present invention provides a computer-readable medium for storing a computer-executable program, where executing the program includes carrying out the method for determining pre-rearrangement V and/or J gene sequences of any of the above embodiments. Those skilled in the art will appreciate that, when the computer-executable program is executed, all or some of the steps of any of the above methods can be carried out by instructing the relevant hardware. The storage medium may include read-only memory, random access memory, a magnetic disk, an optical disk, and the like.
As shown in Figure 2, a further embodiment of the present invention provides a device for determining pre-rearrangement V and/or J gene sequences. The device 100 includes: a data input unit 110 for inputting data; a data output unit 120 for outputting data; a storage unit 130 for storing data, including a computer-executable program; and a processor 140, connected to the data input unit 110, the data output unit 120, and the storage unit 130, for executing the computer-executable program, where executing the program includes carrying out the method for determining pre-rearrangement V and/or J gene sequences of any of the above embodiments.
As shown in Figure 3, yet another embodiment of the present invention provides a system for determining pre-rearrangement V and/or J gene sequences, which can be used to implement the method of any of the above embodiments. The system 1000 includes: a data acquisition device 1010 for acquiring sequencing data of an RNA sample to be tested, the sequencing data including multiple reads from the variable region of a TCR and/or Ig, each read having a length L, with L ≥ 100 bp; a V/J region portion determination device 1020 for determining, based on the sequencing data and on the arrangement of the V and J gene segments relative to the C gene segment in the variable region, the portion of each read that comes from a V gene segment and/or a J gene segment, to obtain multiple V region portions and/or multiple J region portions; a seed sequence set acquisition device 1030 for taking at least one stretch of sequence from each V region portion and/or J region portion as a seed sequence, to obtain a seed sequence set containing multiple seed sequences, each of length K; a V/J region portion cluster determination device 1040 for clustering the V region portions and/or J region portions according to differences in the number of V region portions and/or J region portions supporting each seed sequence in the seed sequence set, to obtain multiple V region portion clusters and/or multiple J region portion clusters; a candidate pre-rearrangement V/J gene sequence acquisition device 1050 for using each V region portion cluster and/or J region portion cluster to extend the seed sequence it supports, to obtain multiple candidate pre-rearrangement V gene sequences and/or multiple candidate pre-rearrangement J gene sequences; and a pre-rearrangement V/J gene sequence determination device 1060 for filtering the candidate pre-rearrangement V gene sequences and/or candidate pre-rearrangement J gene sequences according to how well the reads in the sequencing data support them, to obtain the pre-rearrangement V and/or J gene sequences. The technical features and advantages described above for the method of the present invention apply equally to this system and are not repeated here.
According to embodiments of the present invention, the system may further have at least one of the following additional technical features:
According to one embodiment of the present invention, the sequencing data in the data acquisition device 1010 are preprocessed sequencing data, the preprocessing including at least one of the following: filtering out reads that contain adapter sequence, trimming off the part of a read's end whose quality value is below 10, and trimming off the part of a read's end that contains adapter sequence.
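The preprocessing described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `preprocess_read` and its parameters are hypothetical, qualities are assumed to be Phred+33 encoded, only exact full-length adapter matches are detected, and only the 3' end is quality-trimmed. The 50 bp tail window follows the data-filtering rule given later in Example 1.

```python
def preprocess_read(seq, qual, adapter, tail_window=50, min_q=10, phred_offset=33):
    """Return the cleaned (seq, qual) pair, or None if the read is discarded.

    Hypothetical helper: exact adapter matching and Phred+33 encoding are
    assumptions made for this sketch, not details given in the description.
    """
    idx = seq.find(adapter)
    if idx != -1:
        # Adapter outside the last `tail_window` bases: drop the whole read.
        if idx < len(seq) - tail_window:
            return None
        # Adapter inside the tail: cut the contaminated end off.
        seq, qual = seq[:idx], qual[:idx]
    # Trim trailing bases whose quality is below min_q.
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - phred_offset < min_q:
        end -= 1
    return seq[:end], qual[:end]
```

A production pipeline would normally also handle partial adapter matches at the very end of a read, which this sketch ignores.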
According to one embodiment of the present invention, the V/J region portion determination device is used to: determine the part of the read that comes from the C gene segment; cut that part off the read to obtain a trimmed portion; extract a sequence of not less than 60 bp from the 3' end of the trimmed portion toward its 5' end to obtain the J region portion; and/or cut 40 bp off the 3' end of the trimmed portion toward its 5' end, the remainder being the V region portion. According to one embodiment of the present invention, local alignment is used to determine the part of the read that comes from the C gene segment.
According to one embodiment of the present invention, the V/J region portion determination device is further used to filter out J region portions shorter than 40 bp and/or V region portions shorter than 40 bp.
Taking the length of the target sequence into account, according to one embodiment of the present invention, K is set to no greater than 40 bp.
According to one embodiment of the present invention, the seed sequence set acquisition device is used to slide along each V region portion and/or J region portion in steps of 1 bp, cutting out a seed at each position, so that each V region portion and/or J region portion is converted into a seed sequence subset; one seed sequence subset includes (L-K+1) seed sequences, and the multiple seed sequence subsets together constitute the seed sequence set.
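The sliding cut amounts to enumerating every K-length window of a portion. A minimal sketch follows; the names `seeds_from_part` and `build_seed_index` are hypothetical, and the window count is computed from the length of the portion being cut.

```python
from collections import defaultdict

def seeds_from_part(part, k=40):
    """All k-length windows of `part`, sliding by 1 bp (len(part) - k + 1 seeds)."""
    return [part[i:i + k] for i in range(len(part) - k + 1)]

def build_seed_index(parts, k=40):
    """Map each seed to the set of indices of the portions that contain it."""
    index = defaultdict(set)
    for idx, part in enumerate(parts):
        for seed in seeds_from_part(part, k):
            index[seed].add(idx)
    return index
```

The size of `index[seed]` then plays the role of the seed's support number used in the clustering step.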
According to one embodiment of the present invention, the V/J region portion cluster determination device is used to repeat (i) and (ii) below until no seed sequence remains: (i) determine the seed sequence supported by the largest number of V region portions and/or J region portions, and group all V region portions and/or J region portions supporting that seed sequence into one class, thereby obtaining one V region portion cluster and/or one J region portion cluster; (ii) remove the seed sequence of (i) and all V region portions and/or J region portions supporting it.
According to one embodiment of the present invention, the candidate pre-rearrangement V/J gene sequence acquisition device is used to extend, with each V region portion cluster and/or J region portion cluster, the seed sequence that the cluster supports, to obtain multiple candidate pre-rearrangement V gene sequences and/or multiple candidate pre-rearrangement J gene sequences, which includes performing at least one of the following: (a) for a seed sequence supported by a J region portion cluster, extending the 3' end and/or 5' end of the seed sequence by one base using that cluster requires both of the following conditions to be met: the number of J region portions supporting the base is greater than 3% of the total number of J region portions in the cluster, and the number of types of J region portions supporting the base is greater than 5% of the total number of types in the cluster; (b) for a seed sequence supported by a V region portion cluster, extending the 3' end of the seed sequence by one base using that cluster requires both of the following conditions to be met: the number of V region portions supporting the base is greater than 3% of the total number of V region portions in the cluster, and the number of types of V region portions supporting the base is greater than 5% of the total number of types of V region portions in the cluster; (c) for a seed sequence supported by a V region portion cluster, extending the 5' end of the seed sequence by one base using that cluster requires both of the following conditions to be met: the number of V region portions supporting the base is greater than 100, and the number of types of V region portions supporting the base is greater than 2. According to one embodiment of the present invention, the V region portion cluster and/or the J region portion cluster is used to perform (b) and (c) above, and the sequences obtained from (b) and (c) are spliced together.
According to one embodiment of the present invention, before the pre-rearrangement V/J gene sequence determination device is used to obtain the pre-rearrangement V and/or J gene sequences, candidate pre-rearrangement V gene sequences with a sequence similarity of not less than 95% are merged, and/or candidate pre-rearrangement J gene sequences with a sequence similarity of not less than 95% are merged.
According to one embodiment of the present invention, the pre-rearrangement V/J gene sequence determination device is used to perform (d) and/or (e) below: (d) starting from the first base at the 3' end of the candidate pre-rearrangement V gene sequence, take a sequence of the seed-sequence length toward the 5' end as the first fragment; starting from the P-th base from the 3' end of the candidate pre-rearrangement V gene sequence, take a sequence of the seed-sequence length toward the 5' end as the second fragment; and filter the candidate pre-rearrangement V gene sequence based on the degree of difference between the read support number of the first fragment and that of the second fragment; (e) starting from the first base at the 5' end of the candidate pre-rearrangement J gene sequence, take a sequence of the seed-sequence length toward the 3' end as the third fragment; starting from the P'-th base from the 5' end of the candidate pre-rearrangement J gene sequence, take a sequence of the seed-sequence length toward the 3' end as the fourth fragment; and filter the candidate pre-rearrangement J gene sequence based on the degree of difference between the read support number of the third fragment and that of the fourth fragment.
According to one embodiment of the present invention, performing (d) with the pre-rearrangement V/J gene sequence determination device includes retaining candidate pre-rearrangement V gene sequences that satisfy both of the following conditions: read support number of the second fragment / read support number of the first fragment > 1.5, and read support number of the first fragment / average read support number of the first fragment > 5%; and/or performing (e) with the device includes retaining candidate pre-rearrangement J gene sequences that satisfy both of the following conditions: read support number of the fourth fragment / read support number of the third fragment > 1.5, and read support number of the third fragment / average read support number of the third fragment > 5%.
To make the technical solutions and advantages of the present invention clearer, the method, device, and/or system for determining pre-rearrangement V and/or J gene sequences are described in detail below with reference to specific examples. It should be understood that the examples below serve to explain the present invention and do not limit it.
Unless otherwise stated, reagents, sequences (adapters, tags, and primers), software, and instruments not specifically described in the following examples are conventional commercially available products or open source, for example a sequencing library construction kit purchased from Illumina.
Example 1
The general method includes the following steps:
For RNA samples, a set of universal primers optimized by the inventors can be used to amplify the variable region of a TCR, BCR, or Ig by 5' RACE:
The variable region is formed by rearrangement of the V, D, and J gene segments of a TCR or Ig. During rearrangement, nucleotides are inserted and deleted at the junctions between gene segments, and this region reflects the diversity of the surface receptors of adaptive immune molecules. The C region is the constant region. For RNA, primers can be designed in the C region to amplify the variable region, and 5' RACE is then used to amplify the variable regions produced by rearrangement of V and J regions from different subfamilies.
(2) Library preparation
Step 1: Synthesize first-strand cDNA using a C region reverse-transcription primer and Superscript II, digest the RNA in the cDNA with RNase mix, add a C tail at the 5' end, and finally amplify by PCR with the Abridged Anchor Primer from the 5' RACE kit and a biotin-labeled C region primer.
Step 2: Fragment the amplification product to about 250 bp, enrich the biotinylated DNA with Dynabeads M-270 streptavidin magnetic beads, digest with the restriction endonuclease PacI, and collect the DNA.
Step 3, library construction: The DNA is end-repaired by enzymes such as T4 DNA Polymerase, Klenow Fragment, and T4 Polynucleotide Kinase, with dNTPs as substrate, to form blunt-ended, phosphorylated DNA fragments. If TA sticky-end ligation follows, Klenow Fragment (3'-5' exo-) polymerase and dATP can be used to add an "A" base to the 3' end of the blunt-ended sequence. The fragments are then ligated to adapters by T4 DNA Ligase. So that RNA libraries prepared from different samples can be pooled for sequencing and separated afterwards, a tag sequence can be introduced into the adapter to distinguish the libraries. If adapter-ligated fragments need to be enriched, a PCR step with common primers can be added.
The sequencing library is purified with magnetic beads throughout; the library is checked on an Agilent 2100 and quantified by Q-PCR.
(3) High-throughput sequencing
The library prepared above is sequenced on a high-throughput sequencing platform, which can be at least one of the Illumina Hiseq and Miseq platforms, the Roche 454 platform, and the Life Technologies SOLiD and IonTorrent platforms.
(4) Data analysis
As shown in Figure 4, the analysis mainly includes the following steps:
Step 1: Preliminary data processing
Data filtering: Each sequence is checked for sequencing adapter contamination. If adapter sequence is present and lies at the end (the last 50 bp), the contaminated end is cut off; otherwise the whole sequence is filtered out. Low-quality bases (<Q10) at the sequence end are trimmed. Read merging: For paired-end sequencing, the two reads are joined through their overlapping middle region into a single sequence. Merging requires an overlap longer than 10 bp and a proportion of mismatched bases (mismatch) <= 10%.
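A rough sketch of the paired-end merging rule (overlap > 10 bp, mismatch <= 10%) is given below. Several details are assumptions of this sketch rather than statements from the description: the function names are hypothetical, read 2 is reverse-complemented before comparison, the longest acceptable overlap is preferred, and read 1 bases are kept wherever the two reads disagree.

```python
_COMP = str.maketrans("ACGTN", "TGCAN")

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMP)[::-1]

def merge_pair(read1, read2, min_overlap=11, max_mismatch=0.10):
    """Merge a read pair through its overlap, or return None if no overlap qualifies."""
    r2 = revcomp(read2)
    # Try the longest possible overlap first (a choice made for this sketch).
    for ov in range(min(len(read1), len(r2)), min_overlap - 1, -1):
        a, b = read1[-ov:], r2[:ov]
        mismatches = sum(x != y for x, y in zip(a, b))
        if mismatches <= max_mismatch * ov:
            return read1 + r2[ov:]
    return None
```

Dedicated read mergers typically also use base qualities to resolve mismatches inside the overlap; that refinement is omitted here.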
Step 2: Identify the C region and split the sequence into V and J portions
1) Identify the constant region (C region): The filtered sequences are locally aligned (e.g., with BLAST) against C region reference sequences. The C region is located from the alignment and cut off, and antisense-strand sequences are converted to the sense strand.
2) Extract the V and J portions separately: Because the D region is short and insertions/deletions make the junction between the J and D regions impossible to locate, and because J regions are 40 to 60 bp long, a fixed length (e.g., 70 bp) is extracted from the start of the C region toward the J region as the J region portion. Similarly, 40 bp is cut off from the start of the C region toward the 5' end, and the remaining sequence is taken as the V region portion.
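Assuming the read is already on the sense strand and the alignment has returned the index at which the C region begins, the split can be sketched as below. The name `split_v_j` and the parameter `c_start` are hypothetical; the 70 bp and 40 bp values are the example values from the text, and the 40 bp minimum length follows the filtering embodiment described earlier.

```python
def split_v_j(read, c_start, j_len=70, v_cut=40, min_len=40):
    """Return (v_part, j_part); a part shorter than min_len is reported as None."""
    upstream = read[:c_start]        # everything 5' of the C region
    j_part = upstream[-j_len:]       # up to j_len bases adjacent to the C region
    v_part = upstream[:-v_cut]       # what remains after cutting v_cut bases
    return (
        v_part if len(v_part) >= min_len else None,
        j_part if len(j_part) >= min_len else None,
    )
```

Note that with these example values the two portions overlap by 30 bp, since the J portion is deliberately longer than a typical J region.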
Step 3: Seed-based clustering
The V and J region portions are clustered separately. Sequences of a fixed length (e.g., 40 bp) are taken as seeds; the sequences are read in and the number of sequences supporting each seed is recorded. First, the seed with the largest support number is selected and all sequences supporting it are output as one class. The seeds of the remaining sequences and their support numbers are then recounted, the seed with the largest support is selected, and its supporting sequences are output as another class. The remaining sequences are recounted again and the largest class is output, and so on, until no sequences remain.
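The greedy loop can be sketched as follows. The function name `greedy_seed_clusters` is hypothetical, and two details are assumptions of this sketch: ties between equally supported seeds are broken lexicographically, and leftover sequences shorter than the seed length end the loop.

```python
from collections import defaultdict

def greedy_seed_clusters(parts, k=40):
    """Repeatedly peel off the sequences supporting the best-supported seed.

    Returns a list of (seed, member_sequences) tuples, largest support first.
    """
    remaining = set(range(len(parts)))
    clusters = []
    while remaining:
        # Recount seed support over the sequences that are still unassigned.
        support = defaultdict(set)
        for idx in remaining:
            p = parts[idx]
            for i in range(len(p) - k + 1):
                support[p[i:i + k]].add(idx)
        if not support:  # only sequences shorter than k are left
            break
        seed = max(support, key=lambda s: (len(support[s]), s))
        members = support[seed]
        clusters.append((seed, [parts[i] for i in sorted(members)]))
        remaining -= members
    return clusters
```

Recounting from scratch on every pass is simple but quadratic in the worst case; an incremental index would be the natural optimization for real data volumes.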
Step 4: Seed extension
J region seed extension: For each class of sequences, the seed is extended step by step, one base at a time, to the left and to the right. At each step, extension continues only if both conditions are met: (1) the number of sequences supporting the base is > 3% of the sequences in the class, and (2) the number of types of sequences supporting the base is > 5% of the types of sequences in the class. When a branch occurs during extension (i.e., several bases at one position all meet the conditions), multiple sequences are generated, one per branch. When extension finally stops, the extended sequences are treated as candidate germline sequences.
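One way to realize the branching extension is a column-by-column pileup anchored on the seed, sketched below. The names `extend_side` and `extend_j_seed` are hypothetical. The sketch assumes that a "type" means a distinct member sequence, that each member is anchored at the first occurrence of the seed, that only members consistent with the branch chosen so far vote on the next base, and that left and right branches are combined as a cross product.

```python
def extend_side(tails, n_total, n_types, min_frac=0.03, min_type_frac=0.05):
    """Extend outward from the seed; `tails` holds (tail, member) pairs,
    where tail is the part of the member beyond the seed, read outward."""
    results = []

    def walk(prefix, active, col):
        by_base = {}
        for tail, member in active:
            if len(tail) > col:
                by_base.setdefault(tail[col], []).append((tail, member))
        passing = [
            (base, group) for base, group in sorted(by_base.items())
            if len(group) / n_total > min_frac
            and len({m for _, m in group}) / n_types > min_type_frac
        ]
        if not passing:            # no base qualifies: this branch stops here
            results.append(prefix)
            return
        for base, group in passing:  # one recursive call per qualifying base
            walk(prefix + base, group, col + 1)

    walk("", tails, 0)
    return results

def extend_j_seed(seed, members):
    """Return all candidate sequences grown from `seed` within one cluster."""
    k = len(seed)
    anchored = [(m, m.find(seed)) for m in members if seed in m]
    n_total, n_types = len(members), len(set(members))
    rights = extend_side([(m[o + k:], m) for m, o in anchored], n_total, n_types)
    lefts = extend_side([(m[:o][::-1], m) for m, o in anchored], n_total, n_types)
    return sorted({l[::-1] + seed + r for l in lefts for r in rights})
```

For example, a cluster with 60 copies of one sequence, 40 of a second, and a single noisy read yields two candidates, because the lone read's base falls below the 3% support threshold.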
V region seed extension: This is applied to all seed cluster subsets of the V region. Because fragmentation leaves V region fragments of varying length, the situation is more complex than for the J region, so the left and right ends are extended separately under different filtering conditions. For extension toward the 3' end, the retention conditions are similar to those for the J region. For extension toward the 5' end, the conditions are: (1) the number of supporting sequences is > 100, and (2) the number of types of supporting sequences is > 2. Finally, the two extended parts are spliced together.
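The asymmetric rules can be illustrated with a single extension routine that takes the acceptance test as a parameter. This is a sketch under assumptions: the names `extend` and `extend_v_seed` are hypothetical, a "type" is taken to be a distinct member sequence, and, for brevity, only the best-supported qualifying base is followed at each position rather than every branch.

```python
def extend(tails, accept):
    """Greedy single-path extension; `tails` holds (tail, member) pairs and
    `accept(count, n_types)` decides whether a base may be appended."""
    out, active, col = "", tails, 0
    while True:
        groups = {}
        for tail, member in active:
            if len(tail) > col:
                groups.setdefault(tail[col], []).append((tail, member))
        ok = [(len(g), base) for base, g in groups.items()
              if accept(len(g), len({m for _, m in g}))]
        if not ok:
            return out
        _, best = max(ok)          # follow the best-supported base only
        out += best
        active = groups[best]
        col += 1

def extend_v_seed(seed, members):
    """Extend 3' with relative thresholds, 5' with absolute ones, then splice."""
    k = len(seed)
    anchored = [(m, m.find(seed)) for m in members if seed in m]
    n_total, n_types = len(members), len(set(members))
    right = extend([(m[o + k:], m) for m, o in anchored],
                   lambda n, t: n / n_total > 0.03 and t / n_types > 0.05)
    left = extend([(m[:o][::-1], m) for m, o in anchored],
                  lambda n, t: n > 100 and t > 2)
    return left[::-1] + seed + right
```

The absolute 5' thresholds make the left extension stop as soon as coverage thins out, which is the expected behavior when fragments start at different positions upstream of the seed.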
Step 5: Merge candidate germline sequences
After every seed cluster has been extended, the same germline sequence may have been produced from different subsets; merging removes these duplicate candidates. The candidate germline sequences are compared pairwise, and if the similarity of two sequences is 95% or higher, they are merged into one.
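The description does not say how similarity is computed or which sequence survives a merge. The sketch below therefore makes two assumptions that should be read as placeholders: similarity is approximated with the `ratio()` of Python's `difflib.SequenceMatcher` (a real pipeline would more likely use a pairwise alignment identity), and the longer sequence of a similar pair is the one kept.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Stand-in similarity in [0, 1]; not the measure used in the source."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()

def merge_candidates(candidates, min_similarity=0.95):
    """Greedy de-duplication: keep a candidate only if it is not at least
    min_similarity similar to a candidate that was already kept."""
    kept = []
    for cand in sorted(candidates, key=len, reverse=True):
        if all(similarity(cand, other) < min_similarity for other in kept):
            kept.append(cand)
    return kept
```

Because every candidate is compared with every kept sequence, the cost grows quadratically with the number of candidates, which is usually acceptable at this stage since only a modest number of candidates survive extension.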
Step 6: Filtering
At the 3' end of a candidate V germline sequence, or the 5' end of a candidate J germline sequence, 40 bp are taken inward from the terminal base as fragment 1, and 40 bp are taken inward starting from the 5th base from the end as fragment 2. Fragments 1 and 2 are searched for in the original data set (after preliminary processing), and the number of sequences supporting each is counted. If both conditions are met, (1) support number of fragment 2 / support number of fragment 1 > 1.5 and (2) support number of fragment 1 / average support number of fragment 1 > 5%, the sequence is retained; otherwise it is filtered out.
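The end filter can be sketched as below. The function names are hypothetical, and three points are assumptions of this sketch: a read supports a fragment when it contains the fragment as an exact substring, the "average support number of fragment 1" is taken as the mean fragment 1 support over all candidates being filtered, and every candidate is at least k + p - 1 bases long.

```python
def end_fragments(candidate, gene, k=40, p=5):
    """Fragment 1 starts at the terminal base, fragment 2 at the p-th base
    from that end; both are k bases long and run inward."""
    n = len(candidate)
    if gene == "V":                      # 3' end, running toward 5'
        frag1 = candidate[n - k:]
        end = n - (p - 1)
        frag2 = candidate[end - k:end]
    else:                                # J: 5' end, running toward 3'
        frag1 = candidate[:k]
        frag2 = candidate[p - 1:p - 1 + k]
    return frag1, frag2

def support(fragment, reads):
    """Number of reads containing the fragment exactly."""
    return sum(fragment in r for r in reads)

def filter_candidates(candidates, reads, gene, k=40, p=5,
                      min_ratio=1.5, min_rel=0.05):
    stats = []
    for c in candidates:
        f1, f2 = end_fragments(c, gene, k, p)
        stats.append((c, support(f1, reads), support(f2, reads)))
    mean_s1 = sum(s1 for _, s1, _ in stats) / len(stats) if stats else 0.0
    return [c for c, s1, s2 in stats
            if s1 > 0 and s2 / s1 > min_ratio and s1 / mean_s1 > min_rel]
```

Candidates whose terminal fragment has no support at all are dropped outright, which avoids a division by zero and matches the intent of the second condition.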
Example 2
(I) Experimental procedure
(1) Enrichment of target fragments by 5' RACE
Peripheral blood was drawn from three healthy individuals, peripheral blood mononuclear cells (PBMCs) were isolated, and RNA was extracted, giving three RNA samples designated Sample 1 (HRB), Sample 2 (HXY), and Sample 3 (XHS). The RNA was reverse transcribed into cDNA using a primer specific to the TCR constant region (C). The reaction mixes below are given for one sample.
1.1 First-strand cDNA synthesis
1) Prepare the following reaction mix (1 sample)
TCR C region primer: TTGATGGCTCAAACACAGCGA (SEQ ID NO: 1)
2) Incubate at 70℃ for 10 min, place on ice for 1 min, add the following reaction mix, and incubate at 42℃ for 1 min.
3) Add 1 μL Superscript II; react at 42℃ for 50 min, then at 70℃ for 15 min.
4) Add 1 μL RNase mix and incubate at 37℃ for 30 min.
1.2 Purify the cDNA with 1.5× magnetic beads and resuspend in 18 μL nuclease-free water.
1.3 TdT tailing of cDNA
1) Prepare the following reaction mix
2) Incubate at 94℃ for 2-3 min, then cool on ice for 1 min.
3) Add 1 μL TdT, mix well, and incubate at 37℃ for 10 min, then at 65℃ for 10 min.
1.4 PCR of dC-tailed cDNA
1) Prepare the following reaction mix
2) Place in a PCR instrument and run the following program.
a. 94℃ 2 min
b. 94℃ 15 s
c. 60℃ 30 s
d. 72℃ 30 s
e. Repeat steps b-d 29 times (30 cycles in total)
f. 72℃ 5 min
g. 12℃ hold
3) Purify with 1× magnetic beads and resuspend in 20 μL nuclease-free water.
(2) Sample fragmentation with Covaris
Take 3 μL of the sample for electrophoresis to check the fragmentation result.
(3) Washing and elution of the fragmented sequences
Turn on the water bath in advance, set it to 47℃, and let it equilibrate; it is used to heat the Washing Buffer.
3.1 Prepare the wash solutions
Prepare the Wash buffer reagents in advance, and make up the two Wash buffers (1×Binding and Wash Buffer, 2×Binding and Wash Buffer) in the specified proportions.
3.2 Prepare the streptavidin magnetic beads M-270
3.3 Bind the fragmented DNA to the streptavidin magnetic beads and wash
(4) Restriction enzyme digestion
1) Prepare the reaction mixture as follows.
2) Place in a PCR instrument and run the following program; afterwards, place the tube on a magnetic rack and collect the supernatant, which is the target product.
a. 37°C, 2 h
b. 65°C, 20 min
(5) Introduce sequencing adapters by ligation, and prepare the sequencing library according to the standard library preparation workflow of the chosen sequencing platform.
(6) Library quality control
The insert size and content of the library were measured with the Bioanalyzer analysis system (Agilent, Santa Clara, USA); the library concentration was quantified accurately by Q-PCR.
(7) Sequencing
Libraries that passed quality control were sequenced on the corresponding platform, here a HiSeq 2000 sequencer with paired-end reads of 151 bases.
(II) Data Analysis
1. Data preprocessing
Data filtering: check each sequence for sequencing-adapter contamination. If an adapter sequence is present and lies at the end of the read (within the last 50 bp), trim off the contaminated end; if it lies elsewhere, discard the whole sequence. Low-quality bases (<Q10) at the end of the sequence are also trimmed.
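The filtering rule above can be sketched as follows. This is a minimal illustration, not the inventors' implementation: the function name `filter_read`, exact-match adapter detection, and the use of a plain list of Phred scores are all assumptions made for the example (production trimmers typically tolerate mismatches and partial adapters).

```python
from typing import List, Optional, Tuple


def filter_read(seq: str, quals: List[int], adapter: str,
                tail: int = 50, min_q: int = 10) -> Optional[Tuple[str, List[int]]]:
    """Apply the adapter and quality rules to one read (hypothetical helper).

    Returns the trimmed (sequence, qualities), or None if the read is discarded.
    Assumption: the adapter is detected by exact substring match.
    """
    idx = seq.find(adapter)
    if idx != -1:
        if idx >= len(seq) - tail:
            # Adapter starts within the last `tail` bases: cut the contaminated end.
            seq, quals = seq[:idx], quals[:idx]
        else:
            # Adapter lies elsewhere in the read: discard the whole sequence.
            return None
    # Trim low-quality bases (< min_q) from the 3' end only.
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]
```

Reads for which the function returns `None` are dropped before the merging step.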
Read merging: for paired-end sequencing, the two reads of a pair are merged into a single sequence through their overlapping middle portion (overlap length > 10 bp, mismatch rate <= 10%).
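A minimal sketch of this merging rule is shown below. The names `revcomp` and `merge_pair`, the choice of the longest qualifying overlap, and keeping read 1's bases inside the overlap are assumptions for illustration; the text specifies only the two thresholds.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(read1: str, read2: str, min_overlap: int = 11,
               max_mismatch_rate: float = 0.10) -> Optional[str]:
    """Merge a read pair through the overlap of read1's 3' end with revcomp(read2).

    Overlap length > 10 bp means at least 11 bp. Assumption: the longest
    overlap that satisfies the mismatch threshold is used.
    """
    mate = revcomp(read2)
    for olen in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        mismatches = sum(a != b for a, b in zip(read1[-olen:], mate[:olen]))
        if mismatches <= max_mismatch_rate * olen:
            # Keep read1 in full and append the non-overlapping part of the mate.
            return read1 + mate[olen:]
    return None  # no acceptable overlap: the pair cannot be merged
```

A real pipeline would usually also use base qualities to resolve mismatches inside the overlap; that is omitted here.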
Under these filtering conditions, the results for the three samples were as follows: sample 1 (HRB) yielded 14,695,238 sequences after filtering, a rate of 97.97%; sample 2 (HXY) yielded 17,459,894 sequences, a rate of 98.14%; sample 3 (XHS) yielded 16,515,129 sequences, a rate of 96.01%.
2. Identify the C region and split the sequence into V and J parts
Identify the constant region (C region): align the filtered sequences locally (e.g., with BLAST) against the C-region reference sequence. The alignment locates the C region, which is then cut off, and antisense-strand sequences are converted to the sense strand.
Extract the V and J parts separately: because the D region is short and insertions/deletions make the J-D junction impossible to locate, and because J regions are 40-60 bp long, the 70 bp extending from the start of the C region toward the J region are taken as the J part. Similarly, 40 bp are trimmed from the start of the C region toward the 5' end, and the remaining sequence is taken as the V part. V or J sequences shorter than 40 bp are discarded. Table 1 shows the number and proportion of V-region and J-region sequences successfully extracted from the three samples.
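The extraction rule can be illustrated with the sketch below. It assumes the read is already on the sense strand and that the 0-based start of the C region (`c_start`) has been obtained from the alignment; the function name `split_vj` and its parameters are hypothetical.

```python
from typing import Optional, Tuple


def split_vj(seq: str, c_start: int, j_len: int = 70, v_trim: int = 40,
             min_len: int = 40) -> Tuple[Optional[str], Optional[str]]:
    """Return (V part, J part) of a sense-strand read; None marks a discarded part.

    `c_start` is the 0-based index of the first C-region base (assumed to come
    from a local alignment such as BLAST).
    """
    upstream = seq[:c_start]                              # everything 5' of the C region
    j_part = upstream[-j_len:]                            # 70 bp adjacent to the C region
    v_part = upstream[:max(0, len(upstream) - v_trim)]    # drop 40 bp next to the C region
    v_out = v_part if len(v_part) >= min_len else None
    j_out = j_part if len(j_part) >= min_len else None
    return v_out, j_out
```

Note that with these defaults the V and J parts overlap by up to 30 bp, which follows directly from taking 70 bp for J while trimming only 40 bp for V.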
Table 1
3. Seed-based clustering and extension
Sequence clustering
The V-region and J-region parts are clustered separately. Taking 40 bp as the seed length, read through the sequences and record, for each seed, the number of sequences that support it. First select the seed with the largest support and output all sequences supporting it as one cluster. Then recount the seeds and their support over the remaining sequences, select the seed with the largest support, and output its supporting sequences as the next cluster. Repeat this cycle, each time outputting the largest cluster, until no sequences remain.
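The greedy loop described above might look like the following sketch. Two points are assumptions rather than statements from the text: every 40-bp substring of a read is treated as a seed that the read supports, and ties in support are broken by taking the lexicographically smallest seed. The function name `greedy_seed_clusters` is hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def greedy_seed_clusters(reads: List[str], k: int = 40) -> List[Tuple[str, List[str]]]:
    """Greedy seed clustering: repeatedly peel off the best-supported seed."""
    remaining = list(reads)
    clusters = []
    while remaining:
        support = Counter()
        for read in remaining:
            # Count each distinct k-mer once per read.
            for seed in {read[i:i + k] for i in range(len(read) - k + 1)}:
                support[seed] += 1
        if not support:
            break  # only reads shorter than k are left; they support no seed
        # Highest support first; ties broken lexicographically (assumption).
        best = min(support, key=lambda s: (-support[s], s))
        clusters.append((best, [r for r in remaining if best in r]))
        remaining = [r for r in remaining if best not in r]
    return clusters
```

Recounting from scratch in every round mirrors the description but is quadratic in the worst case; an index from seed to reads would be the natural optimization for real data volumes.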
J-region seed extension
For each cluster, the seed is extended one base at a time to the left and to the right. An extension step is accepted, and extension continues, only when both of the following hold: (1) the number of supporting sequences exceeds 3% of the sequences in the cluster, and (2) the number of distinct supporting sequences exceeds 5% of the distinct sequences in the cluster. If a branch occurs during extension (that is, more than one base satisfies both conditions at the same position), one sequence is generated per branch. When extension finally stops, the extended sequences are taken as candidate germlines.
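The extension with branching can be sketched as below. Several details are assumptions, since the text does not fix them: a read "supports" an extension if it contains the terminal seed-length window of the candidate plus the new base; left and right extensions are computed independently and combined as a cross product; and `max_len` is a safety cap against repeats. All function names are hypothetical.

```python
from typing import List, Set


def _qualifying_bases(context: str, side: str, reads: List[str], types: Set[str],
                      min_read_frac: float, min_type_frac: float) -> List[str]:
    """Bases whose one-step extension of `context` passes both support thresholds."""
    passing = []
    for base in "ACGT":
        probe = context + base if side == "right" else base + context
        n_reads = sum(probe in r for r in reads)   # supporting reads, duplicates counted
        n_types = sum(probe in t for t in types)   # supporting distinct sequences
        if n_reads > min_read_frac * len(reads) and n_types > min_type_frac * len(types):
            passing.append(base)
    return passing


def _extend_one_side(seed: str, reads: List[str], side: str, min_read_frac: float,
                     min_type_frac: float, max_len: int) -> List[str]:
    k, types = len(seed), set(reads)
    finished, stack = [], [seed]
    while stack:
        cand = stack.pop()
        context = cand[-k:] if side == "right" else cand[:k]
        bases = [] if len(cand) >= max_len else _qualifying_bases(
            context, side, reads, types, min_read_frac, min_type_frac)
        if not bases:
            finished.append(cand)                  # extension stops on this branch
        for base in bases:                         # one new candidate per branch
            stack.append(cand + base if side == "right" else base + cand)
    return finished


def extend_seed(seed: str, reads: List[str], min_read_frac: float = 0.03,
                min_type_frac: float = 0.05, max_len: int = 200) -> List[str]:
    """Return all candidate germlines obtained by extending `seed` in both directions."""
    lefts = _extend_one_side(seed, reads, "left", min_read_frac, min_type_frac, max_len)
    rights = _extend_one_side(seed, reads, "right", min_read_frac, min_type_frac, max_len)
    # Each left extension ends with the seed and each right extension starts with it.
    return sorted({left + right[len(seed):] for left in lefts for right in rights})
```

The thresholds are exposed as parameters so that the same routine could be reused with different cut-offs for a given extension direction.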
V-region seed extension
For the seed clusters of the V region, fragment lengths vary after fragmentation, so the situation is more complex than for the J region. The left and right ends are therefore extended separately under different filtering conditions. Extension toward the 3' end uses conditions similar to those for the J region. Extension toward the 5' end requires: (1) number of supporting sequences > 100, and (2) number of distinct supporting sequences > 2. Finally, the two extended parts are joined together.
4. Merge candidate germlines
After every seed cluster has been extended, the same germline may appear in different subsets. Merging removes these duplicates from the candidate germlines: the candidates are compared pairwise, and if two sequences are at least 95% similar they are merged into a single sequence.
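A rough sketch of the pairwise merge follows. The text does not say how similarity is computed or which sequence survives a merge, so two stand-ins are used and should be read as assumptions: `difflib.SequenceMatcher.ratio()` from the Python standard library replaces a proper pairwise alignment, and the longer sequence of a similar pair is kept. The name `merge_candidates` is hypothetical.

```python
from difflib import SequenceMatcher
from typing import List


def merge_candidates(candidates: List[str], min_similarity: float = 0.95) -> List[str]:
    """Collapse candidate germlines that are at least `min_similarity` similar.

    Assumptions: similarity is SequenceMatcher.ratio() (a stand-in for alignment
    identity), and the longest member of each similar group is retained.
    """
    kept: List[str] = []
    # Visit longer candidates first so the longest representative is kept.
    for cand in sorted(set(candidates), key=lambda s: (-len(s), s)):
        similar_to_kept = any(
            SequenceMatcher(None, cand, k, autojunk=False).ratio() >= min_similarity
            for k in kept
        )
        if not similar_to_kept:
            kept.append(cand)
    return kept
```

For real data, the ratio would be better replaced by the identity reported by a global or semi-global aligner, since `ratio()` is not an alignment score.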
5. Filter to obtain the reference germlines
At the 3' end of each candidate V germline, or the 5' end of each candidate J germline, take the 40 bp running inward from the end as fragment 1, and the 40 bp running inward from the 5th base from the end as fragment 2. Search for fragment 1 and fragment 2 in the original data set (after preprocessing) and count the number of sequences supporting each. The candidate is retained if both of the following hold, and is filtered out otherwise: (1) fragment 2 support / fragment 1 support > 1.5; (2) fragment 1 support / mean fragment 1 support > 5%.
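The two-fragment filter can be sketched as follows. Support is counted here as exact substring occurrence in the reads, and "starting from the 5th base from the end" is read as a window shifted inward by 4 bases; both are assumptions (a shift of 5 would also be a defensible reading), as are the function names. Candidates are assumed to be at least `k + shift` bases long.

```python
from typing import List, Tuple


def end_fragments(seq: str, end: str, k: int = 40, shift: int = 4) -> Tuple[str, str]:
    """Fragment 1 (terminal k bases) and fragment 2 (same window moved inward by `shift`).

    `end` is "3p" for V germlines and "5p" for J germlines.
    """
    if end == "3p":
        n = len(seq)
        return seq[n - k:], seq[n - k - shift:n - shift]
    return seq[:k], seq[shift:shift + k]


def filter_candidates(candidates: List[str], reads: List[str], end: str, k: int = 40,
                      shift: int = 4, min_ratio: float = 1.5,
                      min_rel_support: float = 0.05) -> List[str]:
    """Keep candidates whose terminal fragments satisfy both support conditions."""
    counts = []
    for cand in candidates:
        f1, f2 = end_fragments(cand, end, k, shift)
        counts.append((sum(f1 in r for r in reads), sum(f2 in r for r in reads)))
    mean_s1 = sum(s1 for s1, _ in counts) / len(counts) if counts else 0.0
    kept = []
    for cand, (s1, s2) in zip(candidates, counts):
        # s1 > 0 guards the division and implies mean_s1 > 0.
        if s1 > 0 and s2 / s1 > min_ratio and s1 / mean_s1 > min_rel_support:
            kept.append(cand)
    return kept
```

Candidates whose fragment 1 has no support at all are dropped, because the ratio in condition (1) is undefined for them; the text does not address this case.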
After this analysis, 11 candidate TRB-J germlines were obtained for each of the three samples. For TRB-V, the inventors deduced 34 germlines for sample 1, 30 for sample 2, and 36 for sample 3. The accuracy and coverage of these germlines are analyzed below.
6. Verify the reliability of the germlines
6.1 TRB-J germline alignment statistics
Table 2 shows how the predicted germline TRB-J genes of the three samples match the known human TRB-J genes.
Because the locus encoding an immune cell receptor protein carries many V/J genes, and these genes are diverse, the similarity reported in the tables refers to the alignment of one TRB-V or TRB-J gene segment predicted by the method of the present invention against one currently known human V/J gene. A similarity of 100% means the predicted segment is a perfect match to that V/J gene.
Table 2
6.2 Distribution of predicted germline TRB-J (TRB J gene sequences before rearrangement)
Figure 5 shows the coverage of the human germline gene region by the TRB-J genes of the three samples combined. After the statistical analysis above, 11 TRB-J genes were obtained for each of samples 1-3, with an average length of 50 bp, overall similarity >= 90%, base deletions <= 5 bp, base insertions <= 5 bp, and mismatch rate <= 2. The inferred coverage distribution of the individual J genes shows that the entire set of TRB-J genes is completely covered. The method therefore predicts both the number and the sequence of TRB-J genes with high accuracy and can be used to infer J-region genes.
6.3 TRB-V germline alignment statistics
Table 3 below shows how the V germline sequences deduced from the three samples match the known human TRB-V germlines.
Table 3
6.4 Distribution of predicted germline TRB-V
Figure 6 shows the coverage of the human germline gene region by the TRB-V genes of the three samples combined.
The statistics in Table 3 above show that 34, 30, and 36 TRB-V genes were inferred for samples 1-3, respectively, with overall similarity >= 90%, base deletions <= 5 bp, base insertions <= 5 bp, and mismatch rate <= 3. The coverage distribution of the individual inferred V genes in Figure 6 shows that more than 80% of the TRB-V genes are covered; three genes are absent from the inferred germlines. Compared with the TRB-J results, the accuracy is comparable, but the overall coverage is slightly lower.
Throughout this specification, descriptions that refer to terms such as "one embodiment," "some embodiments," "example," "specific example," "implementation," or "some examples" mean that the specific features, structures, materials, or characteristics described in connection with that embodiment or example are included in at least one embodiment or example of the present invention. In this specification, such expressions do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations may be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
Claims (27)
Publications (3)
| Publication Number | Publication Date |
|---|---|
| HK1240371A (en) | 2018-05-18 |
| HK1240371A1 (en) | 2018-05-18 |
| HK1240371B (en) | 2020-12-11 |