
TWI420007B - System and method of assembling dna reads - Google Patents

System and method of assembling DNA reads

Info

Publication number
TWI420007B
Authority
TW
Taiwan
Prior art keywords
sequence
gene
gene sequence
base
sequences
Prior art date
Application number
TW100107438A
Other languages
Chinese (zh)
Other versions
TW201237223A (en
Inventor
Hsueh Ting Chu
Cheng Yan Kao
Li Chen Chen
Original Assignee
Hsueh Ting Chu
Cheng Yan Kao
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hsueh Ting Chu, Cheng Yan Kao
Priority to TW100107438A
Publication of TW201237223A
Application granted
Publication of TWI420007B


Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Description

System and method for assembling DNA sequencing reads

The present invention relates to a system and method for assembling DNA sequencing reads, and in particular to a data analysis method for DNA sequencing. It performs de novo assembly of the short nucleotide strings (reads) produced by DNA sequencing, splicing short reads that may contain errors into a correct long sequence.

Assembling DNA sequencing reads is a jigsaw technique applied to genetic sequences. DNA sequencing produces reads such as the following example (26,421,212 reads):

Sequence 1: TCCTGTATATTCTAAACTTAGAGATTGTTCAT;

Sequence 2: CATAAACATCTTTATAAAATACTAATAGAAAG;

Sequence 3: AAAGGAGAGAACGTCGTCGTTTTCGTCGAAGT;

Sequence 4: ACAACCCTAACTCTTTTTTTTTTGGCTATTGT;

...

Sequence 26421209: TCTTCCGCCGTCGCAACTTTACCCAACGCCGC;

Sequence 26421210: ACCGCAAAAGCAAGATGATTCATTGTGTATCC;

Sequence 26421211: CTGGATCACAGCATCCACACGCACAAATATC;

Sequence 26421212: CCAATGGATTCTTTCTTTACTAACAATATCGA.

The read assembly problem differs from an ordinary jigsaw puzzle mainly in two ways:

(1) Read assembly is one-dimensional string splicing. Each fragment of sequencing data is a string over only four nucleotide bases (A, G, C, T), and fragments are spliced by overlapping different reads where their bases agree.

(2) Read assembly splices an enormous number of fragments. A common picture puzzle has perhaps 300 or 500 pieces, and a 1,000-piece puzzle is already hard to complete. The number of reads to be assembled is far larger, often 1,000,000 to 1,000,000,000, and advances in technology can produce even more.

Splicing so many reads usually requires a very large amount of memory to hold the contigs produced along the way. In addition, the sequencing process itself can introduce errors into the reads, so deciding which bases in a read are erroneous is another important open problem.

Conventional de novo assembly methods for sequencing reads fall into the following three types:

(1) Overlap-Layout-Consensus;

(2) De Bruijn graph; and

(3) Greedy extension algorithm.

The three conventional methods are explained below using the following five reads as a splicing example:

r1 CCCTTCCAAC;

r2 ATTTAATCCC;

r3 TTAATCCCTT;

r4 TTCCAACAGC; and

r5 AACAGCCGCC

(1) Overlap-Layout-Consensus is the most traditional approach. It has three stages:

Stage 1: Overlap the reads pairwise and find the distance between each pair. For example, r2 and r1 align over 3 bases (the suffix CCC of r2 matches the prefix CCC of r1), recorded as d(r2,r1)=-3, as shown in the following table.

Likewise, r2 and r3 align over 8 bases (TTAATCCC), recorded as d(r2,r3)=-8, as shown in the following table.

Stage 2: Using the pairwise distances between reads, build a directed graph of all reads, as shown in Figure 1.

Stage 3: Find a consistent ordering in the directed graph (for example, with a minimum spanning tree), as shown in Figure 2. In Figure 2, the minimum spanning tree method gives the order in which the reads r1 to r5 are overlapped from left to right; for the five reads listed above, the overlaps place them in the order r2, r3, r1, r4, r5. The overlap of these five reads is called a contig, and taking the agreed base at each position of the contig yields the final assembled sequence ATTTAATCCCTTCCAACAGCCGCC, as shown in Figure 3.
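The first and last stages can be illustrated with a minimal Python sketch. The function names `overlap` and `merge_in_order` are illustrative and do not come from the patent, and the layout order is supplied by hand rather than derived from a spanning tree, so this is a simplification rather than a full Overlap-Layout-Consensus implementation.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

reads = {
    "r1": "CCCTTCCAAC",
    "r2": "ATTTAATCCC",
    "r3": "TTAATCCCTT",
    "r4": "TTCCAACAGC",
    "r5": "AACAGCCGCC",
}

# Stage 1: pairwise distances, d(a, b) = -(overlap length), as in the text.
dist = {(a, b): -overlap(reads[a], reads[b])
        for a in reads for b in reads if a != b}

def merge_in_order(order):
    """Stage 3 (simplified): splice reads that are already in layout order."""
    contig = reads[order[0]]
    for prev, nxt in zip(order, order[1:]):
        # Append only the part of the next read that is not already covered.
        contig += reads[nxt][overlap(reads[prev], reads[nxt]):]
    return contig

assembled = merge_in_order(["r2", "r3", "r1", "r4", "r5"])
```

Here `dist[("r2", "r1")]` is -3 and `dist[("r2", "r3")]` is -8, matching the distances quoted above. The dictionary holds one entry per ordered pair of reads, which hints at why this approach becomes memory-hungry for millions of reads.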

(2) De Bruijn graph: each read is cut into nodes of k bases each, as shown in Figure 4. Merging identical nodes that come from different reads gives the result shown in Figure 5. The De Bruijn graph method then merges adjacent nodes in Figure 5 into larger nodes. Because the example yields only one sequence, merging adjacent nodes ends in a single node, as shown in Figure 6. If merging leaves multiple nodes, an Eulerian path (a path that traverses the graph in a single stroke) is sought to obtain the final merged sequence. Patents applying the De Bruijn graph include US Patent Nos. 7,071,324, 7,034,143, 6,865,491, 6,689,563, and 5,683,881.
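A small Python sketch of the idea follows, under stated assumptions: nodes are the distinct k-mers of the reads, two nodes are adjacent when they overlap by k-1 bases, and the graph is assumed to be a single unbranched path (as in the five-read example), so no general Eulerian path search is attempted. The function name `assemble_unbranched` and the choice k=5 are illustrative only.

```python
def assemble_unbranched(reads, k):
    """Walk a k-mer graph that is assumed to be a single unbranched path."""
    # Identical k-mers from different reads collapse into one node.
    kmers = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
    by_prefix = {}
    for kmer in kmers:
        by_prefix.setdefault(kmer[:-1], []).append(kmer)
    # The start node is the k-mer that no other k-mer leads into.
    suffixes = {kmer[1:] for kmer in kmers}
    starts = [kmer for kmer in kmers if kmer[:-1] not in suffixes]
    if len(starts) != 1:
        raise ValueError("graph is not a single path")
    current, seen = starts[0], {starts[0]}
    sequence = current
    while True:
        successors = [s for s in by_prefix.get(current[1:], []) if s not in seen]
        if len(successors) != 1:
            break  # end of path, or a branch this sketch does not handle
        current = successors[0]
        seen.add(current)
        sequence += current[-1]  # each step contributes one new base
    return sequence

reads = ["CCCTTCCAAC", "ATTTAATCCC", "TTAATCCCTT", "TTCCAACAGC", "AACAGCCGCC"]
assembled = assemble_unbranched(reads, k=5)
```

For these reads the walk recovers the same 24-base sequence as the overlap-based method. Real data contains repeats and errors that create branches, which is where the Eulerian path search mentioned above comes in.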

(3) Greedy extension algorithm: pick a read, such as r1: CCCTTCCAAC, and check whether one of its suffixes is the prefix of another read. If so, overlap the two. For example, TTCCAAC is a suffix of r1 and a prefix of r4: TTCCAACAGC, so r1 and r4 are merged into contig (1,4): CCCTTCCAACAGC. The suffix AACAGC of contig (1,4) is a prefix of r5: AACAGCCGCC, so contig (1,4) and r5 are merged into contig (1,4,5): CCCTTCCAACAGCCGCC. No suffix of contig (1,4,5) is the prefix of another read, so extension stops there. Extension then restarts from another read, such as r2: ATTTAATCCC. The suffix TTAATCCC of r2 is a prefix of r3: TTAATCCCTT, so r2 and r3 are merged into contig (2,3): ATTTAATCCCTT. Finally, contig (2,3) and contig (1,4,5) are merged to give ATTTAATCCCTTCCAACAGCCGCC.
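The walk-through above can be reproduced with a short Python sketch. The helper names and the minimum overlap of 3 bases are assumptions made for illustration; the text does not state a minimum overlap.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def greedy_extend(seed, pool, min_overlap=3):
    """Extend `seed` to the right, consuming reads from `pool` (a list)."""
    contig = seed
    while True:
        # Pick the unused read with the longest suffix/prefix overlap.
        best = max(pool, key=lambda r: overlap(contig, r), default=None)
        if best is None or overlap(contig, best) < min_overlap:
            return contig
        contig += best[overlap(contig, best):]
        pool.remove(best)

r1, r2, r3, r4, r5 = ("CCCTTCCAAC", "ATTTAATCCC", "TTAATCCCTT",
                      "TTCCAACAGC", "AACAGCCGCC")
pool = [r2, r3, r4, r5]
contig_145 = greedy_extend(r1, pool)           # uses r4, then r5
contig_23 = greedy_extend(pool.pop(0), pool)   # restarts from r2, uses r3
assembled = greedy_extend(contig_23, [contig_145])
```

Each call keeps a growing contig in memory and merges contigs at the end, which is the merging step the following paragraph identifies as the memory bottleneck.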

All three conventional methods must merge repeatedly, combining contigs of shorter sequences into contigs of longer sequences. Merging needs a large amount of memory to hold the intermediate results of splicing; for large data sets this can reach hundreds of gigabytes. When the read data is large, assembly therefore often cannot be completed within the available memory. Moreover, reads that contain erroneous bases often cannot be assembled at all.

Furthermore, many techniques exist for reassembling or analysing genetic sequences, for example ROC (Taiwan) Patent No. I326431, US Patent No. 7,809,509, and the techniques published in references [1] to [10] of Annex 1. However, none of the prior art seen so far is as advanced as the technique of the present invention.

The object of the present invention is to provide a system and method for assembling sequencing reads that addresses two issues left open by the conventional methods: (1) merging requires a large amount of memory; and (2) assembly should still be possible when reads contain erroneous bases.

To solve problem 1, the invention provides a bidirectional extension method that splices individual reads into a target sequence. The technique extends a sequence on both of its sides at the same time. Because extension proceeds on both sides, any sequence can be chosen as the starting sequence to be extended (either a single read or a contig already assembled from several reads), and candidate sequences are then attached on both of its sides. In the end, the other reads belonging to the same contig are found and assembled into a target sequence (that is, a longer contig assembled from more reads).

To solve problem 2, the invention provides a fault-tolerant sequence screening mechanism to identify the correct reads. Reads produced by sequencing can contain the following two kinds of error:

(1) Mismatch error: a base in the original sequence is read as a different base. For example, if the original sequence is ACATTAAGCCTT and sequencing produces AGATTAAGCCTT, the second base C has been misread as G.

(2) Insertion/deletion error: the read produced by sequencing has an extra base, or lacks a base, compared with the original sequence. For example, the original sequence is ACATTAAGCCTT, whose 4th and 5th bases are consecutive T bases. If sequencing produces ACATAAGCCTT, with one T fewer at that position, this is a deletion error. Conversely, if sequencing produces ACATTTAAGCCTT, with one T more at that position, this is an insertion error.

The fault-tolerant sequence filter of the invention finds, among all candidate sequences, the correct ones that can extend the sequence, while tolerating erroneous bases in them.

To make the objects and other features of the invention clearer, some preferred embodiments are described in detail below with reference to Figures 7 to 16. To keep the explanation of the principle simple, different examples use different read lengths and different index key lengths.

I. System of the invention

As shown in Figure 7, the sequence assembly system 10 of the invention splices a set of reads into a target sequence. One embodiment comprises an input interface 11, an indexer 12, a bidirectional extension combiner 13, a fault-tolerant sequence filter 14, a contig constructor 15, and an output interface 16. Each component is described below.

The input interface 11 reads a plurality of reads (for example, those produced by a DNA sequencing system) from a file 110 stored in a database or in memory, assigns a number to each read, and builds the left index structure and right index structure of the reads. In one embodiment, the input interface 11 reads the reads twice. On the first pass it takes the N bases at the front and the N bases at the back of each read as index key data 111 and stores the index key data 111 in the indexer 12. An index key may be kept as a string or converted to a numeric value. The input interface 11 also holds a read-usage array that records whether each read has already been placed in a contig.
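As a rough illustration of the left and right index structures, the following Python sketch maps the first and last N bases of each read to read numbers. The function names, the use of dictionaries, and the 2-bit numeric encoding (A=0, C=1, G=2, T=3) are assumptions; the text only says that keys may be strings or numeric values.

```python
from collections import defaultdict

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding

def encode_key(key: str) -> int:
    """Convert an index key to a number, 2 bits per base."""
    value = 0
    for base in key:
        value = value * 4 + CODE[base]
    return value

def build_indexes(reads, n):
    """Return (left index, right index, usage array) for numbered reads."""
    left, right = defaultdict(list), defaultdict(list)
    for read_id, seq in enumerate(reads, start=1):  # reads numbered from 1
        left[seq[:n]].append(read_id)    # first N bases
        right[seq[-n:]].append(read_id)  # last N bases
    used = [False] * (len(reads) + 1)    # slot 0 unused; True once placed
    return left, right, used

reads = ["CCCTTCCAAC", "ATTTAATCCC", "TTAATCCCTT", "TTCCAACAGC", "AACAGCCGCC"]
left, right, used = build_indexes(reads, n=4)
```

With this layout, only the keys and read numbers need to stay in memory while extending; the full reads can be fetched by number when a candidate is actually compared.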

The indexer 12 stores the index key data 111 of the reads; the index is used to find candidate sequences that may attach to either side of a sequence to be extended. It may be an index array held in memory, an index file on a hard disk, or a remote database. Its input is the reads indexed through the input interface 11 (that is, the short base sequences 130), and its output is the multiple candidate sequences 122 that match an index key.

The bidirectional extension combiner 13 attaches the selected sequences determined by the fault-tolerant sequence filter to at least one side of the sequence to be extended, producing a longer sequence, and repeats this until the base sequence of the target is determined. In this embodiment, the bidirectional extension combiner 13 takes M bases from each side of the sequence to be extended (or of the contig assembled so far) as extension test windows 21/22; the length of one read is used as the window length. Each of the two extension test windows 21/22 is used to look up, in the indexer, reads that can attach at that window, and these become the candidate sequences. Specifically, the bidirectional extension combiner 13 shifts the extension test window starting from a shift of 1 and divides the bases in the shifted window into a new index key 131 and a fault-tolerant comparison region 132, as shown in Figure 11. The new index key 131 is used to query the indexer 12 for possible candidate sequences, and the fault-tolerant comparison region 132 is passed to the fault-tolerant sequence filter 14 to identify the correct selected sequences that can be attached.

The fault-tolerant sequence filter 14 decides which candidate sequences are selected sequences that can attach to the two sides of the sequence to be extended. Using the fault-tolerant comparison region 132 supplied by the bidirectional extension combiner 13 and the multiple candidate sequences 122, it applies the screening process shown in Figures 13 and 14 and passes the selected sequences 141, each with its position and confirmed as attachable, to the contig constructor 15.

The contig constructor 15 stacks the selected sequences 141 according to their positions to build a contig 151, which is written to a file 161 through the output interface 16.

II. Method of the invention

With reference to Figures 7 to 16, one embodiment of the sequence assembly method of the invention comprises the following steps.

Step S201: The input interface 11 reads in a plurality of reads, assigns a number to each read, builds the left index structure and right index structure of the reads, and stores them in the indexer 12.

Step S211: The input interface 11 picks an unused read from the read-usage array and first compares it with its neighbouring reads. Once every base has been confirmed, the read becomes the initial sequence to be extended 112, from which the bidirectional extension combiner 13 extends in both directions. Because a single read may contain errors, several consecutive neighbouring reads can be used together as the initial fragment. Consecutive neighbouring reads are found with shifted index keys: reads whose bases all agree with one another are overlapped first to form the initial fragment. Neighbouring reads whose bases disagree cannot serve as the two sides of the initial fragment for bidirectional extension.

Steps S221 and S222 are mirror-image procedures for the left and right sides; the rightward case is used for illustration. The bidirectional extension combiner 13 takes M bases from each side of the sequence to be extended 20 as the left extension test window 21 and the right extension test window 22, shifts the test windows 21/22 starting from a shift of 1, and divides the bases in each shifted window into a new index key 131 and a fault-tolerant comparison region 132. The new index key 131 is used to query the indexer 12 for possible candidate sequences 122, and the fault-tolerant comparison region 132 is passed to the fault-tolerant sequence filter 14 to identify the correct selected sequences that can be attached.

As shown in Figure 9, the two sides of the sequence to be extended 20 (or of the fragment assembled so far) serve as the right extension test window and the left extension test window. As shown in Figure 10, the left extension test window 21 and the right extension test window 22 are slid to find candidate sequences 122 that can attach to the left and right sides of the currently known sequence to be extended 20.
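One possible reading of the window-shifting step, for the right side only, is sketched below in Python. The exact slicing is an interpretation of the description, not something the text spells out: the window is taken as the last M bases, a shift of s drops the first s bases, and the remainder is split into an N-base index key and a comparison region. The name `right_windows` is hypothetical.

```python
def right_windows(contig: str, m: int, n: int):
    """Yield (shift, index_key, comparison_region) for the right side."""
    base = contig[-m:]                  # right extension test window
    for shift in range(1, m - n + 1):   # keep at least n bases for the key
        window = base[shift:]           # slide the window by `shift` bases
        yield shift, window[:n], window[n:]

windows = list(right_windows("ATTTAATCCCTT", m=10, n=4))
```

Under this reading, a read whose left index key matches the key at shift s overlaps the sequence by M - s bases and extends it by s bases. In the example, the key CCCT at shift 5 is the left key of the read CCCTTCCAAC, which would extend ATTTAATCCCTT to the right.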

Steps S231 and S232 are mirror-image procedures for the left and right sides. Figure 12 takes the generation of rightward candidate sequences as an example. The comparison reference pattern in the current comparison window 23 is CACAGCAGTAAGTTTCCAATATATGGT. Within it, CACAGCA is the index key and GTAAGTTTCCAATATATGGT is the region used for fault-tolerant comparison. All reads whose left index key is CACAGCA are retrieved from the indexer 12, and each is likewise divided into an index key and a fault-tolerant comparison region. The comparison reference patterns of the window 23 and of each candidate sequence 122 are compared and the number of differing bases is counted; if that number is below a threshold T, the read is selected as a possible extension candidate.
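The threshold test can be sketched in Python as follows. The candidate reads, the threshold value 3, and the function names are invented for illustration; only the reference pattern and its split into key and comparison region come from the text. A real system would fetch candidates through the index rather than scanning a list.

```python
def mismatches(a: str, b: str) -> int:
    """Count differing bases over the positions both strings cover."""
    return sum(x != y for x, y in zip(a, b))

def screen(reference: str, key_len: int, reads, threshold: int):
    """Keep reads that share the index key and differ in fewer than
    `threshold` bases within the fault-tolerant comparison region."""
    key, region = reference[:key_len], reference[key_len:]
    selected = []
    for read in reads:
        if read[:key_len] != key:
            continue  # the index lookup would never return this read
        if mismatches(region, read[key_len:]) < threshold:
            selected.append(read)
    return selected

reference = "CACAGCAGTAAGTTTCCAATATATGGT"
key = "CACAGCA"
candidates = [
    key + "GTAAGTTTCCAATATATGGT" + "AC",  # exact match, extends by AC
    key + "GTAAGTTTCCTATATATGGT" + "AC",  # one mismatch
    key + "GTTTGAAACCAATATATCCT" + "AC",  # seven mismatches
    "TTTTTTT" + "GTAAGTTTCCAATATATGGT",   # different index key
]
selected = screen(reference, 7, candidates, threshold=3)
```

Only the first two hypothetical candidates pass: the third exceeds the threshold and the fourth does not share the index key.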

Steps S241 and S242 are mirror-image procedures for the left and right sides. The possible extension candidates from the previous step must be tested further for sequencing errors. The method stacks all the candidates found at the positions where they may extend and computes, at each position, the proportion taken by each of the bases A, C, G, and T; in other words, it tallies the bases that different candidates place at the same position, in order to decide whether a disagreement is a sequencing error or whether the candidate does not belong at this position. For a single read, a base that differs from the other reads at the same position has two possible explanations: either the read is a correct candidate that contains a mismatch sequencing error, or the read is not a candidate that can attach at this position. Figures 13 and 14 illustrate the two cases. In Figure 13, reads r1, r2, r3, r4, and r6 each have 1 to 2 bases that disagree with the other candidates. When the reads are stacked, however, the erroneous bases at any one position do not exceed a certain proportion, such as 1/5, so they are treated as mismatch sequencing errors. In Figure 14, reads r1, r2, r3, r4, r5, and r6 each have 1 to 2 bases that disagree with the other candidates. When stacked, r1, r2, and r3 share an erroneous base at the same position, exceeding the proportion of, say, 1/5: their base at that position is A, whereas the other reads have T. Reads r1, r2, and r3 are therefore judged not to be candidates that attach at this position. Steps S241 and S242 also detect insertion and deletion sequencing errors, as shown in Figures 15 and 16.
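The column-wise test can be approximated with the Python sketch below. It assumes, for simplicity, that all candidates have the same length and start at the same position, and that the majority base in a column is the correct one; the example reads and the function names are made up. The 1/5 ratio is the example value given in the text.

```python
from collections import Counter

def filter_by_column_vote(aligned, max_ratio=0.2):
    """Drop reads that share a minority base whose share of a column
    exceeds `max_ratio`; isolated disagreements are tolerated."""
    keep = set(range(len(aligned)))
    for col in range(len(aligned[0])):
        counts = Counter(read[col] for read in aligned)
        majority = counts.most_common(1)[0][0]
        for base, count in counts.items():
            if base != majority and count / len(aligned) > max_ratio:
                # Too many reads agree on this "error": they likely belong
                # elsewhere rather than containing a sequencing mistake.
                keep -= {i for i, r in enumerate(aligned) if r[col] == base}
    return [aligned[i] for i in sorted(keep)]

def mutate(seq, pos, base):
    """Return `seq` with the base at `pos` replaced (test helper)."""
    return seq[:pos] + base + seq[pos + 1:]

truth = "ACGTACGTAC"
# Isolated errors at different positions: each is 1/6 of its column.
scattered = [mutate(truth, 2, "T"), mutate(truth, 7, "A")] + [truth] * 4
# Two reads share the same wrong base: 2/6 of the column.
shared = [mutate(truth, 4, "T")] * 2 + [truth] * 4

kept_scattered = filter_by_column_vote(scattered)
kept_shared = filter_by_column_vote(shared)
```

In the first set every read is kept and its odd base is treated as a mismatch error; in the second set the two reads that share the same deviating base are rejected.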

Steps S251 and S252 are mirror-image decision steps. If the previous step produced candidate sequences that can be appended to the right side of the known sequence to be extended, step S221 is repeated; if it produced candidates that can be appended to the left side, step S222 is repeated.

Step S261: When no further reads can be appended on either side of the sequence to be extended, all the selected sequences found are stacked by position into a contig, and the most supported base at each position of the contig is output to form the assembled target sequence.
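A minimal Python sketch of this last step is given below, assuming each selected sequence comes with a zero-based offset into the contig and that "most supported" means a simple per-column majority vote. The offsets and the deliberately corrupted copy of r1 are illustrative.

```python
from collections import Counter

def consensus(placed):
    """placed: list of (offset, read). Returns the majority base per column.
    Assumes every column is covered by at least one read."""
    length = max(offset + len(read) for offset, read in placed)
    columns = [Counter() for _ in range(length)]
    for offset, read in placed:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    return "".join(col.most_common(1)[0][0] for col in columns)

placed = [
    (0, "ATTTAATCCC"),   # r2
    (2, "TTAATCCCTT"),   # r3
    (7, "CCCGTCCAAC"),   # r1 with a mismatch error (T read as G)
    (10, "TTCCAACAGC"),  # r4
    (14, "AACAGCCGCC"),  # r5
]
target = consensus(placed)
```

The erroneous G sits in a column covered by three reads, so the two correct T bases outvote it and the target sequence is unaffected.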

Figures 9 and 10 show an embodiment of sequence assembly by the bidirectional extension combiner. It illustrates the main way the invention finds groups of reads that can be spliced together: a small initial sequence is extended toward both ends, finding the reads that attach at the appropriate positions.

Figure 11 shows a simplified embodiment of the fault-tolerant sequence filter. It illustrates the relationship between the filter and the extension test windows. An extension test window is the comparison sequence on one side of the initial sequence to be extended. The bidirectional extension combiner shifts the window and divides the bases inside it into an index key and a fault-tolerant comparison region.

Figure 12 shows the fault-tolerant comparison method by which the invention finds candidate sequences that can extend a sequence.

Figures 13 and 14 show the screening of candidate sequences to detect mismatch sequencing errors. Figure 13 shows the case where a mismatch sequencing error has occurred, and Figure 14 shows the case where the disagreement is not a mismatch sequencing error.

Figures 15 and 16 show the screening of candidate sequences to detect insertion or deletion sequencing errors. To detect such errors, the patterns being compared are first converted into differential sequence patterns. The comparison reference pattern ref in the extension test window is converted into the differential pattern dref by scanning the sequence and treating each run of identical consecutive bases as a single base. For example, for ref=GTAAGTTTCCAATATATGGT the differential pattern is dref=GTAGTCATATATGT: the two consecutive A bases (AA) in ref appear as a single A in dref, and likewise the three consecutive T bases (TTT) appear as a single T. During screening, candidate r1 has the comparison pattern GTAAAGTTTCCAATATATGGT, whose differential pattern is dr1=GTAGTCATATATGT. Because the two differential patterns (dref, dr1) are identical, the comparison pattern of r1 is replaced by er1=GTAAGTTTCCAATATATGGT, and r1 is treated as a selected sequence that can attach at this position.
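The differential pattern is a run-length collapse, which can be sketched in a few lines of Python. The function names `collapse_runs` and `correct_indel` are illustrative; the sequences are the ones from the text.

```python
from itertools import groupby

def collapse_runs(seq: str) -> str:
    """Differential pattern: each run of identical bases becomes one base."""
    return "".join(base for base, _run in groupby(seq))

def correct_indel(ref: str, read: str):
    """Return `ref` as the corrected pattern if the read differs from it
    only in run lengths; otherwise return None."""
    return ref if collapse_runs(ref) == collapse_runs(read) else None

ref = "GTAAGTTTCCAATATATGGT"
r1 = "GTAAAGTTTCCAATATATGGT"   # one extra A inserted
dref, dr1 = collapse_runs(ref), collapse_runs(r1)
er1 = correct_indel(ref, r1)
```

Note that this comparison only catches insertions and deletions inside a run of identical bases; an inserted base that starts a new run changes the differential pattern and would not be matched by this sketch.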

Although the invention has been disclosed above by way of preferred embodiments, they are not intended to limit it. Anyone skilled in the art may make minor changes and refinements without departing from the spirit and scope of the invention, and the scope of protection is therefore defined by the appended claims.

10... Sequence assembly system

11... Input interface

110, 161... File

111... Index key data

112... Initial sequence to be extended

12... Indexer

122... Candidate sequence

13... Bidirectional extension combiner

130... Short base sequence

131... Index key

132... Fault-tolerant comparison region

14... Fault-tolerant sequence filter

141... Selected sequence

15... Contig constructor

151... Contig

16... Output interface

21... Left extension test window

22... Right extension test window

Figure 1 shows a conventional directed graph;

Figure 2 is a schematic diagram of finding a consistent ordering in a conventional directed graph;

Figure 3 is a schematic diagram of conventional sequence assembly by the overlap-layout-consensus approach;

Figure 4 is a schematic diagram of a conventional De Bruijn graph;

Figure 5 is a schematic diagram of merging adjacent nodes of a conventional De Bruijn graph into one larger node;

Figure 6 is a schematic diagram of the sequence obtained by merging a conventional De Bruijn graph;

Figure 7 is a schematic diagram of one embodiment of the gene sequence combination system of the present invention;

Figure 8 is a flow chart of one embodiment of the gene sequence combination method of the present invention;

Figure 9 is a schematic diagram of the bidirectional extension combiner of the present invention with its left and right extension test windows;

Figure 10 shows a simplified embodiment of sequence combination by the bidirectional extension combiner of the present invention;

Figure 11 shows a simplified embodiment of the fault-tolerant sequence filter of the present invention;

Figure 12 is a schematic diagram of the method of the present invention for finding candidate sequences that can be used to extend a sequence;

Figure 13 is a schematic diagram of screening candidate sequences and detecting base mismatch sequencing errors according to the present invention;

Figure 14 is another schematic diagram of screening candidate sequences and detecting base mismatch sequencing errors according to the present invention;

Figure 15 is a schematic diagram of screening candidate sequences and detecting base insertion or deletion sequencing errors according to the present invention; and

Figure 16 is another schematic diagram of screening candidate sequences and detecting base insertion or deletion sequencing errors according to the present invention.

Annex I: References.

Claims (12)

1. A gene sequence combination system for splicing a plurality of gene sequences produced by a gene sequencing system so as to determine the base sequence of a target gene, the system comprising: an indexer for storing index value data of the plurality of gene sequences, the index values being used to find candidate gene sequences that may be joined to either side of a to-be-connected gene sequence; a fault-tolerant sequence filter for determining which candidate gene sequences are selected gene sequences that can be joined to either side of the to-be-connected gene sequence; and a bidirectional extension combiner for joining the selected gene sequence determined by the fault-tolerant sequence filter to at least one side of the to-be-connected gene sequence, thereby extending it into a longer gene sequence, until the base sequence of the target gene is determined.

2. The gene sequence combination system of claim 1, wherein the bidirectional extension combiner includes two extension test windows corresponding respectively to the two sides of the to-be-connected gene sequence, each extension test window being used to search the indexer for gene sequences that can be attached at that window, which serve as the candidate gene sequences.

3. The gene sequence combination system of claim 1, wherein the fault-tolerant sequence filter detects whether the candidate gene sequences contain sequence errors introduced by gene sequencing.

4. The gene sequence combination system of claim 3, wherein the sequence errors detected by the fault-tolerant sequence filter include base mismatch errors and base insertion or deletion errors.

5. The gene sequence combination system of claim 4, wherein base mismatch errors are detected by arranging a plurality of the candidate gene sequences at their possible extension positions and counting the bases found at each position across the arranged sequences, so as to judge whether a discrepancy is a sequence error produced by sequencing or whether the candidate gene sequence does not belong at this position.

6. The gene sequence combination system of claim 4, wherein base insertion or deletion errors are detected by converting the originally compared sequence patterns into differential sequence patterns and comparing those.

7. The gene sequence combination system of claim 6, wherein the differential sequence pattern represents each run of identical consecutive bases in a gene sequence as a single base.

8. The gene sequence combination system of claim 4, wherein, when the fault-tolerant sequence filter detects base insertion or deletion errors and the differential sequence patterns of the selected gene sequence and the extension test window match, the surplus inserted bases or the missing deleted bases in the selected gene sequence are replaced so as to give the correct number of bases.

9. The gene sequence combination system of claim 1, wherein the indexer uses partial fragments of the gene sequences as index keys to classify and store all of the gene sequences, and during combination the gene sequences are retrieved by their index keys for use.

10. A gene sequence combination method for splicing a plurality of gene sequences produced by a gene sequencing system so as to determine the base sequence of a target gene, comprising: step (A) providing the system of claim 1; step (B) inputting the plurality of gene sequences, building an index, and storing the index value data in the indexer; step (C) producing a to-be-connected gene sequence from the unused sequences; step (D) using the bidirectional extension combiner to take a predetermined number of bases from each of the two sides of the to-be-connected gene sequence as a left extension test window and a right extension test window respectively, shifting the left extension test window to the left and the right extension test window to the right by a predetermined length, dividing the gene sequence in each window after each shift into a new index key and a fault-tolerant comparison region, and querying the indexer with the new index key for possible candidate gene sequences; step (E) using the fault-tolerant sequence filter to determine which candidate gene sequences are selected gene sequences that can be joined to either side of the to-be-connected gene sequence; step (F) using the bidirectional extension combiner to join the selected gene sequence determined by the fault-tolerant sequence filter to at least one side of the to-be-connected gene sequence, thereby extending it into a longer gene sequence, and, when the extension succeeds, repeating steps (D) to (F) until neither side of the to-be-connected gene sequence can be extended further; and overlapping all of the extendable selected gene sequences found into a contig according to their positions, and outputting the most probable base at each position of the contig as the base sequence of the target gene.

11. The gene sequence combination method of claim 10, wherein step (B) inputs the plurality of gene sequences through an input interface, assigns a number to each gene sequence, and builds a left index structure and a right index structure for the gene sequences.

12. The gene sequence combination method of claim 10, wherein step (C) uses a plurality of consecutively adjacent gene sequences as the initial to-be-connected gene sequence fragment, the consecutively adjacent gene sequences being found by using shifted index keys to locate gene sequences whose bases all agree with one another, which are then overlapped to form the initial to-be-connected gene sequence fragment.
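As a rough illustration only, the key-based indexing of claim 9 and the final step of claim 10 (overlapping the selected sequences and outputting the most probable base at each position) might be sketched in Python as below. The names `build_index` and `consensus` are hypothetical. The sketch keys each sequence only by its leading bases, whereas claim 11 recites separate left and right index structures. It also assumes the placed sequences cover every position without gaps and that no column has a tied vote.

```python
from collections import Counter, defaultdict


def build_index(reads, key_len):
    """Group read numbers by the first key_len bases of each read (the index key)."""
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        index[read[:key_len]].append(read_id)
    return index


def consensus(placed_reads):
    """Given (offset, read) pairs, return the most frequent base at each position."""
    columns = defaultdict(Counter)
    for offset, read in placed_reads:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    # Assumes contiguous coverage; uncovered positions would simply be skipped.
    return "".join(columns[pos].most_common(1)[0][0] for pos in sorted(columns))


reads = ["ACGTAC", "GTACGG", "GTTCGG", "ACGGTT"]  # reads[2] carries one mismatch

index = build_index(reads, key_len=2)
print(index["GT"])        # [1, 2]

placed = [(0, reads[0]), (2, reads[1]), (2, reads[2]), (4, reads[3])]
print(consensus(placed))  # ACGTACGGTT
```

In this toy input the mismatching T in the third read is outvoted three to one at its position, which mirrors the column-wise base counting described in claim 5.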
TW100107438A 2011-03-04 2011-03-04 System and method of assembling dna reads TWI420007B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW100107438A TWI420007B (en) 2011-03-04 2011-03-04 System and method of assembling dna reads

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW100107438A TWI420007B (en) 2011-03-04 2011-03-04 System and method of assembling dna reads

Publications (2)

Publication Number Publication Date
TW201237223A TW201237223A (en) 2012-09-16
TWI420007B true TWI420007B (en) 2013-12-21

Family

ID=47223058

Family Applications (1)

Application Number Title Priority Date Filing Date
TW100107438A TWI420007B (en) 2011-03-04 2011-03-04 System and method of assembling dna reads

Country Status (1)

Country Link
TW (1) TWI420007B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115862735B (en) * 2022-12-28 2024-02-27 郑州思昆生物工程有限公司 Nucleic acid sequence detection method, device, computer equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5667970A (en) * 1994-05-10 1997-09-16 The Trustees Of Columbia University In The City Of New York Method of mapping DNA fragments
US6223128B1 (en) * 1998-06-29 2001-04-24 Dnstar, Inc. DNA sequence assembly system
WO2001063543A2 (en) * 2000-02-22 2001-08-30 Pe Corporation (Ny) Method and system for the assembly of a whole genome using a shot-gun data set
TWI326431B (en) * 2007-04-30 2010-06-21 Univ Nat Taiwan Science Tech Method and system of analyzing gene sequence

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Huang and Madan, "CAP3: A DNA Sequence Assembly Program", Genome research, 1999, Vol.9, pages 868-877. *

Also Published As

Publication number Publication date
TW201237223A (en) 2012-09-16

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees