TW201237223A

TW201237223A - System and method of assembling DNA reads

Info

Publication number: TW201237223A
Application number: TW100107438A
Authority: TW
Inventors: Hsueh-Ting Chu; Cheng-Yan Kao; Li-Chen Chen
Original assignee: Hsueh-Ting Chu; Cheng-Yan Kao
Priority date: 2011-03-04
Filing date: 2011-03-04
Publication date: 2012-09-16
Also published as: TWI420007B

Abstract

The present invention discloses a system and method of assembling DNA sequencing reads to determine the target DNA sequences. The system comprises an indexer, a bidirectional stretch assembler and an error-tolerant filter. The method comprises the indexer organizing sequencing reads and locating candidate reads to add to a contig, the bidirectional stretch assembler extending a contig with following reads. An error-tolerant filter decides the candidate following reads to add at both ends of an extending contig.

Description

201237223 六、發明說明：【發明所屬之技術領域】本發明係關於一種基因測序序列的組合系統及方法，尤指— 種基因測序(DNA sequencing)的資料分析方法。從頭開始組合(以 novo assembly)基因測序(DNA sequencing)產生的核苷酸驗基序列的短字串。將可能有錯誤的短核苷酸鹼基序列字串拼接成正確的長基因序列。【先前技術】按基因測序序列的組合技術，是一種應用在基因序列的拼圖技術。基因測序所產生的序列，如以下的例子(26421212個序歹: 序列 1: TCCTGTATATTCTAAACTTAGAGATTGT1TCAT ; 2: CATAAACATCTTTATAAAATACTAATAGAAAG ；序列 3:AAAGGAGAGAACGTCGTCGTTTTCGTCGAAGT ; 序列 4:ACAACCCTAACTCTTTTTTTTTTGGCTATTGT ; 序列 26421209: TCTTCCGCCGTCGCAACTTTACCCAACGCCGC ；序列 26421210: ACCGCAAAAGCAAGATGATTCATTGTGTATCC ；序列 26421211: CTGGATCACAGCATCCACACGCACAAATATC ；序列 26421212: CCAATGGATTCTTTCTTTACTAACAATATCGA。上述的基因測序序列的組合問題和普通拼圖(Jigsaw Puzzle)的[s] 3 201237223 拼接不同之處主要有： ⑴基因測序序列的組合是—維的字串拼接。基因測序資料的碎片是只包含四種(A，Q C，T)㈣酸驗基的字串，而拼接的時候曰把不同的序列碎片依照其一致的部份重疊起來。、疋 (2)基因測序相的組合是巨大數量的碎片拼接。常見圖的碎片數可能是3〇〇片或5⑻片。麵片的拼圖就很難拼接。而基因測序的組合所要拼接的序列數量是巨大的往往有 1，000,000到！，〇〇〇,〇〇〇,〇〇〇，甚至隨著技術進步可以產生更多列碎片。要拼接數目如此魔大的測序序列，往往需要非常大量的記憶體來記住過財所產生的重疊群產物。而且，因為翻測序的過程可能會產生序列的錯誤。因此，如何觸序列中的錯誤也是一項重要的待解決課題。 ' 傳統的基因測序從頭開始組合方法包含以下三種： ⑴重疊-排列一致法(Overlap-Layout-Consensus); (2)De Bruijn 圖（De Brnijn graph);及 ⑶貪婪延伸演算法(Greedy extension algorithm)。以下以5個基因序觸難鮮，分卿以上三種習知方法做進一步說明： rl CCCTTCCAAC ； r2ATTTAATCCC ； r3 TTAATCCCTT ；201237223 VI. Description of the Invention: [Technical Field] The present invention relates to a combined system and method for gene sequencing sequences, and more particularly to a data analysis method for DNA sequencing. Short strings of nucleotide sequence generated by novo assembly (novo assembly) DNA sequencing. Strings of short nucleotide base sequences that may be erroneously spliced into the correct long gene sequence. [Prior Art] A combination technique of gene sequencing sequences is a puzzle technique applied to gene sequences. Sequence of the gene sequence are generated, examples such as the following (26421212 sequences bad: Sequence 1: TCCTGTATATTCTAAACTTAGAGATTGT1TCAT; 2: CATAAACATCTTTATAAAATACTAATAGAAAG; Sequence 3: AAAGGAGAGAACGTCGTCGTTTTCGTCGAAGT; Sequence 4: ACAACCCTAACTCTTTTTTTTTTGGCTATTGT; sequence 26421209: TCTTCCGCCGTCGCAACTTTACCCAACGCCGC; sequence 26421210: ACCGCAAAAGCAAGATGATTCATTGTGTATCC; sequence 26421211: CTGGATCACAGCATCCACACGCACAAATATC Sequence 26421212: CCAATGGATTCTTTCTTTACTAACAATATCGA The combination of the above-mentioned gene sequencing sequences differs from the [J] 3 [J] 3 201237223 splicing of the common jigsaw puzzle: (1) The combination of gene sequencing sequences is - dimensional string splicing. Gene sequencing The fragmentation of the data is a string containing only four (A, QC, T) (four) acid groups, and when splicing, the different sequence fragments are overlapped according to their uniform parts. 疋 (2) Gene sequencing The combination is a huge number of pieces of splicing. The number of fragments in a common figure may be 3 或 or 5 (8). The puzzle of the patch is difficult to splicing. The combination of gene sequencing is spliced. The number of sequences is huge and often has 1,000,000 to!, 〇〇〇, 〇〇〇, 〇〇〇, and even more pieces of debris can be produced as technology advances. To splicing a number of so large sequencing sequences often requires very A large amount of memory to remember the contiguous group products produced by the financial. Moreover, because the process of sequencing may produce sequence errors, how to touch the errors in the sequence is also an important problem to be solved. Gene sequencing from the beginning The combination method includes the following three types: (1) Overlap-Layout-Consensus; (2) De Bruijn graph; and (3) Greedy extension algorithm. The five gene sequences are difficult to read, and the three conventional methods are further explained: rl CCCTTCCAAC; r2ATTTAATCCC; r3 TTAATCCCTT;

r4 TTCCAACAGC ;及 r5 AACAGCCGCC 201237223 (1)重疊制一致法是最傳制方式。包括三個階段：和Γΐ可以對齊巾間3贿基，記成峨rl) = _3，並如下表R4 TTCCAACAGC ; and r5 AACAGCCGCC 201237223 (1) The overlapping method is the most traditional method. Including three stages: and Γΐ can be aligned with the towel 3, recorded as 峨rl) = _3, and as shown below

如Η和r2可以對齊中間8個鹼基，記成处2, r3) = _8，並如下表。階段-：將兩兩的基因測序序列重疊看看，找出其距離。如〇For example, Η and r2 can be aligned in the middle of 8 bases, recorded as 2, r3) = _8, and are shown in the following table. Stage-: Overlapping the two-by-two gene sequencing sequences to find out their distance. Rugao

階段二：將兩兩的基因測序序列重疊看看，找出其距離。以基因序列間的距離來建立所有基因序列的有向圖，如圖丨所示。階段三：在有向圖中找出有一致關係的排列順序(如以最小擴充樹法），如圖2所示。圖2中，如以最小擴充樹法可以得到序列拼接的順序是由左至右重疊^，^，^，^，^。這五個基因序列的重疊稱為一個重疊群(contig)，由這個重疊群中每個一致的鹼基可以得到最後組合序列：ATTTAATCCCTTCCAACAGCCGCC，如圖3 所示。 (2) De Brnijn圖（De Brnijn graph):係把序列以每k個組成一個節點，如圖4所示。將不同的基因序列而具相同的節點予以合併，可以得到如圖5所示之結果。De Brnijn圖是把圖5中相鄰的節點合併成一個更大的節點。因為圖中示例只形成一個序列，所以把相鄰的節點合併最後會形成單一的節點，如圖6所示。如果’合併的結果形成複數的節點，則最後找尋一筆畫的路徑 (Eulerianpath)來做為最後可以合併的序列。應用De Bruijn圖之 201237223 專利技術有美國第7, 071，324號、第7,034,143號、第6,865,491 號、第6, 689, 563號及第5, 683, 881號專利案。 ⑶貪婪延伸演算法(Greedy extension algorithm):係選取一個基因序列如rl: CCCTTCCAAC，看看其字尾(postfix)是不是別的字的字首(prefix)。如是，便將其重疊上去。如TTCCAAC是rl的字尾’是r4:TTCCAACAGC的字首。所以合併rl及r2變成重疊群(1，4): CCCTTCCAACAGC。重疊群(1，4)的字尾 AACAGCC 疋r5: AACAGCCGCC的字首，所以把重疊群(1，4)及r5合併成重疊群(1，4,5): CCCTTCCAACAGCGCC。重疊群(1，4,5)沒有任何字尾是別的序列的字首，所以就停止。換從序列如r2: ATTTAATCCC 開始接。r2的字尾TTAATCCC是r3: TTAATCCCTT的字首，所以合併r2及r3變成重疊群(2,3):ATTTAATCCCTT。最後再把重疊群（2,3)及重疊群（1，4,5)合併，可以得到 ATTTAATCCCTTCCAACAGCCGCC ° 以上一種傳統的方式都需要不斷進行合併(merge)的動作。把比較短的基因序列之重疊群合併成比較長的基因序列之重疊群。合併的過程需要大量的記憶體來存放拼接過程的暫時結果。然而’虽資料量很大時’絲需要減的記麵來存放拼接過程的結果。甚至要乡賴百Giga的記紐才能進行。因此，當基因序列資料大時，錄受限於魄_關紐完成難的動作。而且，當基’财有驗基是錯誤時，絲就無法被組合。 S] 再者’關於基因序列之重組或分析的技術有很多，例如民國第1326431號專利案，美國第7, _，509號專案，以及如件之參考文獻[1]至[10]所發表之技術内容。然而，目前所見的 6 201237223 在先技藝，尚未發現有如本發明之領先技術者。【發明内容】本發明之目的，在於提供一種組合基因測序序列的系統及方法，用以解決傳統方法所產生之二個問題：⑴合併的動作需要大量的記憶體；及(2)容許序列中有錯誤的鹼基也可以進行組合。為解決上述問題1，本發明之技術手段係提供一種雙向延伸組合的方法來拼接各綱基因序取軸—目標基因糊。這個技街疋發展來將各別基园序列同時向一個待接基因序列的二侧延伸接續下去。因為是向待接基因序列二側延伸，因此，我們可以任意選取-個待接之基因相(為-基因序列或由數個基因序列接續組合而成的-重辦)開始進行其二觸雙向延伸接續其他候選基因序列的動作。最後可以找出位在同—個重疊群上的其他基因^ 列’並將它們組合成-目標基因序列（即由更多基因序列接續組合而成的一更長的重疊群）。〃為解決上述問題2，本伽之驗手段雜供—姆錯序列的篩選機制來找狂確的相。&於基關序產生的相可能會有以下二種錯誤： ⑴驗基配對失誤(mismatch)之錯誤：原基因序列中某個鹼基被錯誤配對成其錄基。例如·· ACATTAAGCCTT是原本的基因序列’經過基關序處斯產細相為agattaagcctt。也就是第二個鹼基C被錯誤配對成g。 (2)鹼基插入或刪除(insert/deleti〇n)之錯誤^經過基因測序處理所產生的基因序列比原基因序列中多出額外的驗基，或減少某個鹼基。例如ACATTAAGCCTT是原本的基因序列，第4〜5驗基 7 201237223 是連續的鹼基τ。如經過基因測 =ΓΓΤΤ，比縣轉顺_位少1=匕為驗基刪除的錯誤。反之，如經過基因測序處理所產生為ACATTTAAGCCTT比原本_序列在相同位置夕個T，此情形稱之為鹼基插入的錯誤。鹼基可錯序列筛選器，是在所有的候選基因序列中找出 =接，翻序列的正確基因序列，而容許基因序财有錯誤的妒社二讓本發月之目的及其他特徵能更加清楚，以下兹舉出-些二實施例，並配合所_式圖7 _ 16，作詳細說^在這: 貫施例的說明中，為了簡明解釋肩㈣〜_響顧’所W在不时例使用不同的序列長度。以及索引鍵長度。【實施方式】如圖7所不，本發明之基因序顺合纽⑴伽靖一個基 =的集合拼接成目標的基因序列’其—種具體實施例係包括列餘2介面U、—索引器12、—雙向延伸組合11 13、—容錯序、„ 14、—重疊群建構^15及—輸丨介面μ 兀件詳述如下。激人讀U係從齡在資料庫或記憶體巾之難110讀入複 =因序列(其可以是由基因測序系統所產生的複數個基因序給讀人的基因序列職，並建立基因序_左索引結右索引結構。本發明一種具體實施例中，其輸入介面n會讀土因序列二次。第—次輸人基因序列，取其基因序列前後各N . ,、 t 3 8 201237223 個鹼基做為如鍵值資料m，並將索_值龍⑴置入索引器 12中存放。索引鍵值可以字串或轉換成數值表示。輸入介面n有一個序列使用記錄陣列，來記錄序列是否已被排入重疊群中。索引器12,其用以儲存該複數個基因序列之索引值資料丨^，该索引值供找出可能可哺續在-個待接基目相之二側的候選基因序列。其可以是置於記麵中的—個索僻列，或者是置於硬碟的索⑽’也可以是—個置放在遠端㈣料庫，作用是輸入經輪入介面η錢⑽如序腦錄基相陶，及輸出與索引對應之多重候選基因序列122。 —σ延物合H Π ’肋將軸容錯相騎騎決定之該選定基因序列接續在待接基因序列至少—側而延伸成—個更長的 ^因序列’直至歧該目標基因的驗基序列為止。本發明實施例 ^係以雙向延伸組合器13取出待接基因序列（或目前已組合重 :群序列）二侧個驗基長度分別做為一延伸測試視窗撤2，發明係以—個基因序列的長度做為延伸測試視窗之長度，該二 ^申測試視窗脑分雜自_丨器中找利可明加在該延申測试視窗_基因序_做為該闕基因序.本發明且體實 ί例中，雙向延伸組合器13會從長度為1開始位移延伸測試視 \將位移後的延伸職視針的基因相分成新的索引鍵⑶ :錯_區域132 ’如圖U所示。其中，新的索引鍵⑶用以給容=丄=可:的:_序列。而容錯比對區域132提供因序^ 、，以比對出正確而可供延伸接續的選定基谷錯序㈣選器】4 ’用以決定候選基因相為可接續在待接 201237223 基因序狀-綱較基因序列。其絲_雙向延物合器^ 輸入的容錯比對區域132及多重候選基因序歹4122，請配合參看圖 13、14所示__程，留下帶有位置且正確可供延伸接續之選定基因序列141給重疊群建構器15。重疊群建構器15將選定基因序列⑷依其位置重疊排列，建構出重且群(contig)151。透過輸出介面16將此重⑸ 檔案161中。 •π货咧之方法請配合參看圖7至16所示，本發明之基因序列組合方法的一種具體實施例，係包括有以下所述之步驟。步驟S201 :由輸入介面„輸入複數個基因序列，給予輸入的 =土因序列個編號，並且建立此基因序列的左索引結構及右索引結構’並儲存在索引器12。步驟S211:輸入介面11從序列使用記錄陣列中找出-個未使用的序列，先和其鄰近的相轉，確定每鑛基的正確性後， =列做級枝做合!| 13進行雙向延伸始待接基因序列。因為早-的基因序列可能會有錯誤，因此可以使用數個連續目鄰的複數個基因序列來做為啟始待接翻序㈣段。尋找連續 :的複數個基因序列’是用位移的钟鍵找尋彼此驗基都一致 =因序列來先重疊成啟始待接基因序列片段。其中，如果相鄰 =因序顺此驗基不-致，就不能做為雙向延伸的啟始待接基因序列片段的二侧。步驟咖及如是左右對稱的運算程序，在此以向右的實包例做說明。雙向延伸組合器13取出待接基因相Μ二侧各m [ 201237223 個驗基長度分別做為左延伸測試視窗21及右延伸測試視窗22，並從長度1開始位移測試視窗21/22，將位移後的測試視窗21/22中的基因序列分成新的索引鍵131及容錯比對區域132，以新的索引鍵131向索引器12查詢可能的候選基因序列122，並將容錯比對區域132提供給容錯序列篩選器14 ’用以比對出正確而可供延伸接續的選定基因序列。如圖9所示，本發明以待接基因序列2〇(或為目前完成群組之基因序列片段)的二側，分別做為右延伸測試視窗及左延伸測試視窗。如圖10所示，本發明滑動左延伸測試視窗21及右延伸測試視窗22’用以找出可以接續在目前已知待接基因序列2〇左右二侧的候選基因序列122。步驟S231及S232是左右對稱的運算程序，圖12係以產生向右延伸之候選基因相為例，目前比對視窗Μ中的比對參考序列型式為 CACAGCAGTAAGTTTCCAATATATGGT。此序列中， CACAGCA 做為索引鍵，而 gtaagtttccaatatatggt 是用以進行容錯崎的區域。從索引器12中找出所有左侧索引鍵為 CACAGCA的基因序列。這些基因序列也分成索引鍵及容錯比對的區域三比較延伸視窗23及候選基因賴122的比對參考序列型十算it}其不同驗基的數目，如果不同驗基的數目小於一侧值τ ’職基因序顺縣可能延伸之候選基因序列。步驟S241及S242是左右對稱的運算程序，由前一個步驟產生的可能延伸之候選基因序列，必須進—步測試衫有測序錯誤。本發明之方法是把所有賊出賴選基因序舰其可能延伸的位置重疊侧，計算每條置其ACGT祕所侧比率，即統 201237223 計不同候選基縣觸舰相_置驗基，以觸是否是測序產生的序列錯誤或者該候選基因序列並不是接在此位置的序列。對單-的序咖言，如果其某做置驗基和其他_序列的相同位置之驗基不同，會有二種情形：第—種情形是此基因序列是正確的候選基因賴，但是發生驗基配對失誤_序錯誤·第二種情形是此-基因相並不是可以接在此位置醜縣因序列。圖13及14分別圖示說明此二種情形。在圖13中，序列 r6各有1〜2個絲和其他候縣因相不_致。然而其重疊時’各別位置的錯誤鹼基沒有超過一定百分比，如1/5。此時，錯誤鹼基被視為驗基配對失誤的測序錯誤。此外，在圖14中，序列 1"1，0，13/4，亡，沁，各有1〜2個鹼基和其他候選序列不一致。當其重疊時’ rl，r2, r3有一個相同位置的錯誤鹼基超過一定的比率，如 I/5。該位置的驗基都是A ’相較於其他基因序列在此位置的驗基都是T，· rl，r2，r3隸因序列被判定為枝接在此位置的= 選基因序列。步驟随及S242也偵測是否發生驗基插入或刪除的測序錯誤。在圖15及16顯示驗基插人或刪除賴序錯誤摘測。、步驟S251及S252是左右對稱的判斷程序步驟，如果前一個步驟產生一些的候選基因序列，可以附加到已知待接基因序列之右侧，則重新進行步驟S22卜如果前一個步驟產生一些的候選基因序列’可以附加到已知待接基因序列之左侧，則重新進行步^ S222 » 步驟S261，當待接基因序列二側都無法繼續附加新的基因序列，則把所有找到的可延伸之選定基因序列依其位置重疊成重疊 12 201237223 群(contig)。並輸出重疊群每個位置最判定的鹼基以成為組合的目標序列。如圖9及10所示，係為本發明之雙向延伸組合器進行序列組合的實施例圖。此實施例說明本發明找尋可以拼接在一起的基因序列群的主要方法。由一個小的啟始待接基因序列向二端延伸，找出可以接在適當位置的基因序列。如圖11所不，係為本發明之容錯序列篩選器的簡化實施例圖。此一實施例說明容錯序列篩選器和延伸測試視窗的關係。延伸測試視窗是啟始待接基因序列二側的比對序列。雙向延伸組合器會位移此延伸測試視窗，並將延伸測試視窗内的基因序列分成索引鍵及容錯比對區域。如圖12所示，係本發明找出可以用以延伸基因序列的候選基因序列之容錯比對方法。土如圖13及14所不，係顯示對候選基因序列進行筛選，摘測是否發生鹼基配對失誤的測序錯誤。圖13顯示發生鹼基配對失誤的測序錯誤情形，圖14顯示非鹼基配對失誤的測序錯誤情形。圖15及16顯示對候選基因序列進行篩選’债測是否發生驗基插入或刪除的測序錯誤。鹼基插入或刪除之錯誤的偵測，係將原來比對的序_式轉換減別相型式進行比對。在延伸測試視窗的比對參考型式ref會被轉換成差別序列型式dref，方法是掃描基因序列。連續相同的鹼基被視為單一鹼基。例如比對參^型式 ref = GTAAGTTTCCAAIATATGGT，其差別序列型 drefKHAGTCATATATGT，也就是在ref中的連續二個从鹼基在dref巾只表示成單一個A鹼基。同理，在ref巾的連續二個 13 201237223 TTT鹼基在dref中只表示成單一個τ鹼基。在進行候選基因序列篩選Β寺’候選基因序列ri的比對參考型气 GTAAAGTTTCCAATATATGGT ，其差別序列型 ^ drl=GTAGTCATATATGT。比對二個差別序列型式(dref，把)是一^ 的’因此rl的比對參考型式會被取代成erl；= GTAAGTTTCCAATATATGGT。如此，rl被視為可以接在此位置的可延伸之選定基因序列。雖然本發明已以較佳實施例揭露如上，然其並非用以限定本發明，任何熟悉此項技藝者，在不脫本發明之精神和範圍内，當可做些許更動與潤飾，因此本發明之保護範圍當視後附之申請專利範圍所界定為準。【圖式簡單說明】圖1為習知有向圖；圖2為習知有向圖中找出有一致關係的排列順序之示意圖；圖3為習知以重疊-排列_一致的方式組合序列的示意圖；圖4為習知De Bruijn示意圖；圖5為習知DeBmijn圖中相鄰節點合併一大節點示意圖；圖6為習知De Bruijn圖合併而成的序列示意圖；圖7為本發明之基因序列組合系統的一種實施例示意圖；圖8為本發明之基因序列組合方法的一種實施例流程圖；圖9為本發明雙向延伸組合器具有左右延伸測試視窗之示意圖；圖10為本發明雙向延伸組合器進行序列組合的簡化實施例圖；圖11為本發明容錯序列篩選器的簡化實施例圖；圖12為本發明找出可以用以延伸序列的候選序列方法示意圖； 201237223 圖13為本發明對候選序列進行篩選及偵測是否發生鹼基配對失誤的測序錯誤之一種示意圖；圖14為本發明對候選序列進行篩選及偵測是否發生鹼基配對失誤的測序錯誤之另一種示意圖；圖15為本發明對候選序列進行篩選及偵測是否發生鹼基插入或刪除的測序錯誤之一種示意圖；及圖16為本發明對候選序列進行篩選及偵測是否發生鹼基插入或刪除的測序錯誤之另一種示意圖。附件一：參考文獻。【主要元件符號說明】基因序列的組合系統1〇 > 輸入介面 11 檔案 110，161 索引鍵值資料111 啟始待接基因序列112 索引器12 候選基因序列122 雙向延伸組合器13 短鹼基序列130 索引鍵131 容錯比對區域Π2 容錯序列筛選器14 選定基因序列141 重疊群建構器15 重疊群151 輸出介面 16 右延伸測試視窗22 左延伸測試視窗21Stage 2: Overlapping the two-by-two gene sequencing sequences to find out their distance. A directed graph of all gene sequences is established by the distance between gene sequences, as shown in Figure 。. Stage 3: Find the order in which there is a consistent relationship in the directed graph (eg, with the minimum expansion tree method), as shown in Figure 2. In Fig. 2, the sequence of splicing can be obtained by the minimum expansion tree method by overlapping ^, ^, ^, ^, ^ from left to right. The overlap of these five gene sequences is called a contig, and the final combination sequence is obtained from each identical base in this contig: ATTTAATCCCTTCCAACAGCCGCC, as shown in Figure 3. (2) De Brnijn graph: The sequence is composed of one node per k, as shown in Fig. 4. Combining different gene sequences with the same nodes, the results shown in Figure 5 can be obtained. The De Brnijn graph merges the adjacent nodes in Figure 5 into one larger node. Because the examples in the figure only form one sequence, merging adjacent nodes will eventually form a single node, as shown in Figure 6. If the result of the merge forms a complex node, then the path of the one stroke (Eulerianpath) is finally found as the last sequence that can be merged. Patent application No. 7, 712, 324, No. 7,034, 143, No. 6, 865, 491, No. 6, 689, 563, and No. 5,683, 881. (3) Greedy extension algorithm: Select a gene sequence such as rl: CCCTTCCAAC to see if the postfix is a prefix of another word. If so, it will be overlapped. For example, TTCCAAC is the suffix of rl ' is the prefix of r4: TTCCAACAGC. So merge rl and r2 into a supergroup (1, 4): CCCTTCCAACAGC. The suffix of the contig (1,4) AACAGCC 疋r5: The prefix of AACAGCCGCC, so the contigs (1, 4) and r5 are merged into a contig (1, 4, 5): CCCTTCCAACAGCGCC. The contigs (1, 4, 5) do not have any suffixes that are the beginnings of other sequences, so they stop. Change the sequence from r2: ATTTAATCCC to start. The suffix of r2, TTAACCCC, is the prefix of r3: TTAATCCCTT, so that merging r2 and r3 becomes a contig (2, 3): ATTTAATCCCTT. Finally, the overlapping group (2, 3) and the overlapping group (1, 4, 5) are combined to obtain ATTTAATCCCTTCCAACAGCCGCC °. A conventional method in which a conventional method needs to be merged continuously. The contigs of relatively short gene sequences are combined into a contig of relatively long gene sequences. The merging process requires a large amount of memory to store the temporary results of the splicing process. However, when the amount of data is large, the silk needs to be reduced to store the result of the splicing process. Even the township Lai Bai Giga can be carried out. Therefore, when the gene sequence data is large, the recording is limited to the difficult operation of 魄_关纽. Moreover, when the base's money is wrong, the silk cannot be combined. S] Furthermore, there are many techniques for the recombination or analysis of gene sequences, such as the Patent No. 1326431 of the Republic of China, the No. 7, _, 509 project of the United States, and the references [1] to [10] of the article. Technical content. However, the currently seen 6 201237223 prior art has not been found to be a leader in the art of the present invention. SUMMARY OF THE INVENTION It is an object of the present invention to provide a system and method for combining gene sequencing sequences to solve two problems arising from conventional methods: (1) the combined action requires a large amount of memory; and (2) the permissive sequence The wrong bases can also be combined. In order to solve the above problem 1, the technical means of the present invention provides a bidirectional extension combination method for splicing the various gene sequences to take the axis-target gene paste. This technology has evolved to extend the sequence of the respective bases simultaneously to the two sides of a sequence of genes to be joined. Because it is extended to the two sides of the gene sequence to be ligated, we can arbitrarily select a gene phase to be connected (for the gene sequence or a combination of several gene sequences), and start the two-way two-way The action of extending the sequence of other candidate genes is extended. Finally, other gene sequences located on the same contig can be found and combined into a target gene sequence (i.e., a longer contig that is composed of more gene sequences). In order to solve the above problem 2, the test method of the gamma method is to provide a mad phase to the screening mechanism of the mismatch sequence. There may be two errors in the phase produced by & the base sequence: (1) Mismatch error: A base in the original gene sequence is incorrectly paired into its record base. For example, · ACATTAAGCCTT is the original gene sequence 'after the base order is produced, the fine phase is agattaagcctt. That is, the second base C is incorrectly paired into g. (2) Errors in inserting or deleting (insert/deleti〇n) ^ After gene sequencing, the resulting gene sequence has an extra base, or a certain base, than the original gene sequence. For example, ACATTAAGCCTT is the original gene sequence, and 4th to 5th base 7 201237223 is a continuous base τ. If the genetic test = ΓΓΤΤ, than the county turn s _ bit less 1 = 匕 for the error of the base deletion. Conversely, if the gene is processed by the sequencing process, ACATTTAAGCCTT is at the same position as the original _ sequence, which is called the error of base insertion. The base error-sequence filter is to find the correct gene sequence in all the candidate gene sequences, and to allow the gene order to be wrong, and to make the purpose and other features of this month More clearly, the following two embodiments are given, and the details are shown in conjunction with the Figure 7 _16. In this description: In the description of the embodiment, the shoulder (four) ~ _ ' ' From time to time, different sequence lengths are used. And the length of the index key. [Embodiment] As shown in Fig. 7, the gene sequence of the present invention is spliced into a target gene sequence. The specific embodiment includes a column 2 interface U, an indexer. 12, the two-way extension combination 11 13 - the fault-tolerant sequence, „ 14 , the contig group construction ^ 15 and the transmission interface μ 详述 are detailed as follows. It is difficult to read the U system from the age of the database or the memory towel 110 reading the complex = sequence (which may be a plurality of gene sequences generated by the gene sequencing system for reading the human gene sequence, and establishing a gene sequence_left index node right index structure. In a specific embodiment of the present invention, The input interface n will read the sequence of the soil factor twice. The first time the human gene sequence is taken, and the N. , t 3 8 201237223 bases before and after the gene sequence are used as the key value data m, and the value is The dragon (1) is placed in the indexer 12. The index key value can be string or converted into a numerical representation. The input interface n has a sequence using a record array to record whether the sequence has been queued into the contig. To store index data of the plurality of gene sequences丨^, the index value is used to find a candidate gene sequence that may be fed on the two sides of the target phase. It may be placed in a note--a string, or placed on a hard disk. The cable (10)' can also be placed in the remote (four) library, and the input is the input of the input interface η money (10), such as the sequence of the brain, and the output of the index corresponding to the multiple candidate gene sequence 122. The H Π 'rib rib determines the selected gene sequence of the axis tolerant phase riding to extend at least to the side of the waiting gene sequence into a longer sequence of the target gene until the sequence of the target gene is determined. In the embodiment of the present invention, the bidirectional extension combiner 13 takes out the length of the two adjacent test sequences (or the combined weight: the group sequence) as an extension test window, and the invention is a genetic sequence. The length of the extended test window is used as the length of the extended test window, and the second test method is used to find the licensin in the extended test window _ gene sequence _ as the 阙 gene sequence. In the case of the body, the bidirectional extension combiner 13 will start from a length of 1 The extension test view \ divides the gene of the extended extension needle into a new index key (3): the error _ area 132 ' is shown in Figure U. Among them, the new index key (3) is used to give the volume = 丄 = can: The :_ sequence, and the fault-tolerant comparison region 132 provides the order ^, to compare the correct and select the selected base valley misorder (four) selector] 4 ' to determine the candidate gene phase can be connected in the continuation 201237223 Gene sequence-classical gene sequence. Its silk_bidirectional extension device^ input fault-tolerant alignment region 132 and multiple candidate gene sequence 歹4122, please refer to the __process shown in Figure 13, 14 The selected gene sequence 141 is positioned and correctly available for extension to the contig constructor 15. The contig constructor 15 arranges the selected gene sequences (4) in overlapping positions, constructing a contig 151. This weight (5) file 161 is output through the output interface 16. • Method of π cargo 请 Referring to Figures 7 to 16, a specific embodiment of the gene sequence combining method of the present invention includes the steps described below. Step S201: inputting a plurality of gene sequences from the input interface, giving the input = soil sequence number, and establishing a left index structure and a right index structure of the gene sequence and storing in the indexer 12. Step S211: input interface 11 Find out the unused sequence from the sequence using the record array, first and then the adjacent phase to determine the correctness of each mine base, = column to do the combination of branches! | 13 for bidirectional extension of the initial waiting gene sequence Because the early-gene sequence may be wrong, a number of consecutive gene sequences can be used as the starting sequence (4). Finding consecutive: multiple gene sequences 'is displaced The key to find each other's test base is the same = the sequence is first overlapped to start the sequence of the waiting gene sequence. Among them, if the adjacent = because the order is not correct, it can not be started as a two-way extension. The two sides of the gene sequence fragment. The steps are as follows: the left-right symmetric operation program, here is illustrated by the right-hand side of the package. The bidirectional extension combiner 13 takes out the gene to be connected to the two sides of each side [ 201237223 base lengths Do not use the left extension test window 21 and the right extension test window 22, and shift the test window 21/22 from the length 1, and divide the gene sequence in the shifted test window 21/22 into a new index key 131 and a fault tolerance comparison. The region 132 queries the indexer 12 for possible candidate gene sequences 122 with the new index key 131 and provides the fault tolerant alignment region 132 to the fault tolerant sequence filter 14' for comparing the selected genes that are correct for extension. As shown in Fig. 9, the present invention uses the two sides of the waiting gene sequence 2〇 (or the gene sequence fragment of the currently completed group) as the right extension test window and the left extension test window, respectively. The sliding left extension test window 21 and the right extension test window 22' of the present invention are used to find a candidate gene sequence 122 that can be continued on both sides of the currently known to-be-connected gene sequence. Steps S231 and S232 are bilaterally symmetric. The operation program, Figure 12, is an example of generating a candidate gene phase extending to the right. The alignment reference sequence in the current comparison window is CACAGCAGTAAGTTTCCAATATATGGT. In this sequence, CACAGCA is used as a cable. The index, and gtaagttccaatatatggt is used to carry out the fault-tolerant region. All the left-side index keys are CAAAGCA gene sequences are found from the indexer 12. These gene sequences are also divided into index keys and fault-tolerant alignments. And the number of different test bases of the reference gene sequence of the candidate gene Lai 122 is the same as the number of different test bases, if the number of different test bases is smaller than the one-side value τ 'the candidate gene sequence that may be extended by the county gene order. Steps S241 and S242 It is a bilaterally symmetric operation program. The candidate gene sequence that may be extended by the previous step must be tested in a step-by-step manner. The method of the present invention is to calculate the ratio of each of the thieves to the position where the possible extension of the selected gene sequence ship is possible, and calculate the ratio of each of the ACGT secret side units, that is, the 201237223 different candidate base county Whether the touch is a sequence error resulting from sequencing or the candidate gene sequence is not a sequence attached to this position. For the single-order, if there is a difference between the base of the test and the same position of the other _ sequence, there will be two cases: the first case is that the gene sequence is the correct candidate gene, but it occurs Test base pairing error _ order error · The second situation is this - the gene phase is not available in this position ugly county sequence. Figures 13 and 14 illustrate these two scenarios, respectively. In Fig. 13, the sequence r6 has 1 to 2 filaments each and other waiting periods are not related. However, when they overlap, the wrong base at each position does not exceed a certain percentage, such as 1/5. At this point, the wrong base is considered a sequencing error for the base pairing error. Further, in Fig. 14, the sequences 1 "1, 0, 13/4, 死, 沁, each having 1 to 2 bases are inconsistent with other candidate sequences. When it overlaps, 'rl, r2, r3 have an incorrect base at the same position exceeding a certain ratio, such as I/5. The test sites at this position are all A's compared to other gene sequences. The bases at this position are T, · rl, r2, and the r3 gene sequence is determined to be the selected gene sequence at this position. The step followed by S242 also detects whether a sequencing error has occurred in the insertion or deletion of the test. In Figures 15 and 16, the test base insertion or deletion error correction is shown. Steps S251 and S252 are left and right symmetrical determination procedure steps. If the previous step generates some candidate gene sequences, which can be added to the right side of the known to-be-connected gene sequence, step S22 is performed again. If the previous step produces some The candidate gene sequence 'can be attached to the left side of the known to-be-connected gene sequence, and then step S S222 » Step S261, when the two sequences of the waiting gene sequence cannot continue to add a new gene sequence, then all the found extensions can be extended. The selected gene sequences overlap by their positions into overlapping 12 201237223 groups (contig). The bases most determined at each position of the contig are output to be the combined target sequence. As shown in Figures 9 and 10, an embodiment of a sequence combination of the bidirectional extension combiner of the present invention is shown. This example illustrates the primary method by which the present invention seeks a population of genetic sequences that can be spliced together. A small start-to-end gene sequence is extended to the ends to find the sequence of genes that can be placed in place. Figure 11 is a simplified embodiment of a fault tolerant sequence filter of the present invention. This embodiment illustrates the relationship between a fault tolerant sequence filter and an extended test window. The extended test window is the aligned sequence on the two sides of the start-up gene sequence. The bidirectional extension combiner shifts the extended test window and divides the gene sequence in the extended test window into index keys and fault tolerant alignment regions. As shown in Figure 12, the present invention finds a method of fault tolerant alignment of candidate gene sequences that can be used to extend a gene sequence. Soils As shown in Figures 13 and 14, the candidate gene sequences were screened to determine whether sequencing errors occurred in base pairing errors. Figure 13 shows the sequencing error scenario in which base pairing errors occurred, and Figure 14 shows the sequencing error scenario for non-base pairing errors. Figures 15 and 16 show the sequencing errors in the screening of candidate gene sequences for the determination of whether a base insertion or deletion occurred. The detection of the error of base insertion or deletion is to compare the sequence of the original comparison with the subtraction phase. The alignment reference pattern ref in the extended test window is converted to the differential sequence type dref by scanning the gene sequence. Successive identical bases are considered to be single bases. For example, the alignment ref = GTAAGTTTCCAAIATATGGT, the differential sequence type drefKHAGTCATATATGT, that is, the two consecutive slave bases in the ref are only represented as a single A base in the dref towel. Similarly, the two consecutive 13 201237223 TTT bases in the ref towel are only represented as a single t base in dref. The candidate gene sequence was screened for the reference reference gas GTAAAGTTTCCAATATATGGT of the Β ’ 'candidate gene sequence ri, and its differential sequence type ^ drl=GTAGTCATATATGT. The alignment of the two differential sequence patterns (dref, put) is a ^' so the alignment reference pattern of rl will be replaced by erl; = GTAAGTTTCCAATATATGGT. Thus, rl is considered to be an extensible selected gene sequence that can be ligated at this position. Although the present invention has been disclosed in the above preferred embodiments, it is not intended to limit the invention, and the present invention may be modified and modified without departing from the spirit and scope of the invention. The scope of protection is subject to the definition of the scope of the patent application. BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is a conventional directed graph; FIG. 2 is a schematic diagram showing a sorting order in which a uniform relationship is found in a conventional directed graph; FIG. 3 is a conventional example of combining sequences in an overlapping-arranged_consistent manner. Figure 4 is a schematic diagram of a conventional De Bruijn; Figure 5 is a schematic diagram of a large node in which a neighboring node merges in a conventional DeBmijn diagram; Figure 6 is a sequence diagram of a conventional De Bruijn diagram merged; Figure 7 is a schematic diagram of the present invention; A schematic diagram of an embodiment of a gene sequence combination system; FIG. 8 is a flow chart of an embodiment of a gene sequence combination method of the present invention; FIG. 9 is a schematic diagram of a bidirectional extension combiner having a left and right extension test window according to the present invention; FIG. 11 is a simplified embodiment of a fault tolerant sequence filter of the present invention; FIG. 12 is a schematic diagram of a method for finding a candidate sequence that can be used to extend a sequence according to the present invention; 201237223 FIG. A schematic diagram of sequencing errors in screening candidate sequences and detecting whether base pairing errors occur; FIG. 14 is a screening of candidate sequences according to the present invention; Another schematic diagram for detecting whether a base pairing error has occurred; FIG. 15 is a schematic diagram of sequencing errors of a candidate sequence and detecting whether a base insertion or deletion has occurred; and FIG. 16 is a candidate for the present invention. Another schematic of sequence sequencing and detection of sequencing errors in base insertion or deletion. Annex I: References. [Explanation of main component symbols] Combination of gene sequences 1〇> Input interface 11 File 110,161 Index key data 111 Start-up gene sequence 112 Indexer 12 Candidate gene sequence 122 Bidirectional extension combiner 13 Short base sequence 130 Index Key 131 Fault Tolerant Alignment Area Π 2 Fault Tolerant Sequence Filter 14 Selected Gene Sequence 141 Aligned Group Constructor 15 contig 151 Output Interface 16 Right Extension Test Window 22 Left Extension Test Window 21

Claims

201237223 VII. Patent application scope: 1. A gene sequence combination system for splicing a plurality of genesis generated by a gene sequencing system to determine a target gene hiding base phase, the system comprising: an indexer And an index value data for storing the plurality of gene sequences, wherein the index value is used to find a sequence of candidate gene sequences that may be contiguous in a waiting gene; . . . - a fault-tolerant sequence _le, (4) determining the candidate base a selected gene sequence contiguous on two sides of the sequence of the to-be-connected gene; and a bidirectional extension combiner' for ligating the selected gene sequence determined by the fault-tolerant sequence filter to at least the sequence of the candidate gene - extending sideways into a longer gene sequence until the sequence of the target gene is determined. 2. The gene sequence as described in claim 1 is compliant with Wei, wherein the bidirectional extension combiner comprises a branching window of the transgenic gene paste, the ===::=;r- Deleted error.贱错纽纽插入或或或或或或或或 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 = = = = = = = = = = = = = = = = . 6. The gene sequence compliant system of claim 4, wherein the wrong side of the insertion or deletion of the test base converts the originally aligned sequence pattern into a differential sequence for comparison. 7. The gene sequence combination system according to claim 6, wherein the differential sequence type is to express the gene sequence + the same contiguous base as a single base. " 8. The gene sequence splicing system according to claim 4, wherein the fault-tolerant sequence filter performs the error of the test-input or the falsification, #成魏The selected gene sequence and the extension test The phase format of g, the excess test group inserted in the selected gene sequence or the deleted test base will be replaced by the correct number of bases. 9. The gene sequence combination system according to claim 1, wherein the indexer uses a partial fragment of the gene sequence as an index key to classify all the gene sequences into a combination, and then According to its index key value, it is used to retrieve the gene sequence ship. a gene sequence splicing method for splicing a plurality of gene sequences generated by a genomic sequence system to determine a target base sequence comprising: step (A) provided as claim 1 The system; step (B) inputting the plurality of gene sequences and indexing, storing the index value data in the indexer; and step (C) generating the sequence of the gene to be used from the unused sequence; D) taking the length of each pre-tested base on the two sides of the sequence of the to-be-connected gene by the two-way extension combiner as a left extension test window and a right extension test window m骇 length respectively Right shifting the right extension test field, dividing the gene sequence in the left extension test window and the right extension test window after each displacement into a new index key and a fault tolerance comparison area to the new 17 201237223 index The key to the H query may query the ugly gene sequence; in the 卞料 ( 四四四四四四四四四四衫衫衫衫衫衫衫选定选定选定选定选定选定 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 The filter is determined by The sequence is extended to a longer gene phase at least on the side of the gene to be ligated, and when it is extended to Jane, steps (8) to (8) are repeated until the two sides of the sequence of the to-be-connected gene cannot continue to continue; The base sequence overlaps all of the contiguous stretchable determinate genes into a retrophy group, and outputs the most likely test basis for each contig of the contig, and the gene sequence as described in the term 10 of the target gene A sequent method, wherein the step (8) is to input the plurality of gene sequences via the input interface, give each of the gene sequences a number ' and establish a left index structure and a right index structure of the gene sequence. 12. If the request item 1〇 The gene sequence combination method, wherein the step (6) is to use the continuous phase (4), the planting phase, the phase of the phase, and the phase of the phase, the singular phase, the singular phase, the singularity And overlapping into the starting sequence of the waiting gene sequence.