JPH07155169A

JPH07155169A - Method and device for displaying and extracting locally similar sequence of biopolymer

Info

Publication number: JPH07155169A
Application number: JP5310121A
Authority: JP
Inventors: Keiichi Nagai; 啓一永井; Tetsuo Nishikawa; 哲夫西川; Hisamitsu Kawaguchi; 川口　　久光; Susumu Hiraoka; 進平岡; Naoko Kasahara; 直子笠原; Hideki Kanbara; 秀記神原; Toshiji Okayama; 利次岡山
Original assignee: Hitachi Software Engineering Co Ltd; Hitachi Ltd
Current assignee: Hitachi Software Engineering Co Ltd; Hitachi Ltd
Priority date: 1993-12-10
Filing date: 1993-12-10
Publication date: 1995-06-20

Abstract

(57) [Summary]

[Object] To provide a method and a device for displaying and extracting regions of high similarity from an alignment result selected as a locally similar sequence of biopolymers, each composed of a sequence of constituent elements.

[Constitution] An alignment of sequences that share locally similar sequences is computed by a technique such as dynamic programming. The alignment result is obtained as a graph whose first axis is the element number indicating the order of the elements (bases or amino acids) of one of the aligned sequences, and whose second axis is the cumulative score up to that element number. The graph is then output and displayed. The gradient of the graph is large in regions of high similarity.

[Effect] Low-similarity, non-similar, and high-similarity regions are easily identified in a protein alignment result and can further be extracted automatically.

Description

Detailed Description of the Invention

【０００１】[0001]

[Field of Industrial Application] The present invention relates to techniques for comparing the sequences of constituent elements of biopolymers such as nucleic acids and proteins, and in particular to a technique for extracting locally similar sequences that are conserved among the constituent-element sequences of a plurality of biopolymers.

【０００２】[0002]

[Prior Art] In recent years, sequence data for bases and amino acids, the constituent elements of the biopolymers nucleic acids and proteins, have accumulated rapidly, and databases of these sequence data have been built. To extract biologically meaningful information from these databases, it is effective to compare constituent-element sequences with one another and group them by sequence similarity, or to search for local subsequences that are conserved across the sequences of multiple biopolymers and that relate to particularly important physiological or chemical functions.

[0003] Conventionally, the technique most widely used to compare nucleic acid base sequences or protein amino acid sequences with each other and search for local similarity is dynamic programming. The method was developed by Needleman and Wunsch to evaluate the similarity of entire sequences (J. Mol. Biol. 48, 444-453 (1970)) and was later improved by Smith and Waterman into a local similarity search method (J. Mol. Biol. 147, 195-197 (1981)). In this method, the elements of the two biopolymer sequences being compared are aligned, and the score total over the locally similar region is calculated from scores that represent matches and mismatches between aligned elements and gaps introduced by the insertion or deletion of elements. The optimal alignment is derived from this total. However, because the computation time is proportional to the product of the lengths of the sequences being compared, the method takes a very long time. Therefore, when newly determined nucleic acid or amino acid sequence data are compared with the sequence data in a database, faster methods are often used instead: FASTA (Proc. Natl. Acad. Sci. U.S.A. 85, 2444-2448 (1988)) or BLAST (J. Mol. Biol. 215, 403-410 (1990)).
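The local dynamic programming search described above can be sketched as follows. This is a minimal illustration, not the implementation used in the cited papers: the function name, the match/mismatch values, and the linear gap penalty are all assumptions chosen for brevity (real searches use a score matrix such as PAM120 and often separate gap-opening and gap-extension penalties). The nested loops make the cost proportional to the product of the two sequence lengths visible.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local-alignment score between strings a and b.

    Hypothetical scoring: +2 for a match, -1 for a mismatch, -2 per gap
    position. These values are illustrative and are not PAM values.
    """
    prev = [0] * (len(b) + 1)   # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):          # len(a) iterations ...
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):      # ... times len(b) iterations
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                    # restart: local alignment never goes below 0
                prev[j - 1] + s,      # align a[i-1] with b[j-1]
                prev[j] + gap,        # gap in b
                curr[j - 1] + gap,    # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTT", "GGACGGG"))  # shared segment "ACG" -> 6
```

The clamp at 0 is what makes the search local: the score total of a reported segment starts at 0 and rises to its maximum.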

[0004] However, all of these methods align the two sequences being compared over a region in which the score total rises from 0 to its maximum value. FIG. 1 shows part of the result of a BLAST comparison of the amino acid sequences of two highly similar proteins, the Drosophila notch protein and the Xenopus xotch protein: the alignment of the amino acid sequence segment with the highest score total. Hereinafter, the sequence used as the reference in a comparison is called the key sequence, and the sequence compared against it is called the target sequence. The notch/xotch alignment covers amino acids 271 to 689 of the notch protein, which was used as the key sequence. In the BLAST result of FIG. 1, the row labeled Query shows the one-letter amino acid sequence of the key sequence (notch), and the row labeled Sbjct shows the one-letter amino acid sequence of the target sequence (xotch). Between the Query and Sbjct rows, a position where the key and target sequences match is marked with the one-letter code of the matching amino acid. Where they do not match, + is shown if the score is positive and nothing is shown if the score is negative.

BLAST score calculation uses the score matrix called PAM120, shown in FIG. 3. For each pair of amino acid residues taken from the two sequences being compared, the matrix expresses how likely the pair is relative to its arising from unrelated amino acid sequences. Specifically, each element is the logarithm of the probability that the residue pair arises by mutation from a common ancestral molecule, divided by the probability that it arises by chance from unrelated amino acid sequences. A positive element therefore means that the pair is more likely to derive from a common ancestral molecule than to occur by chance, that is, similarity is high; a negative element means the opposite. Accordingly, in FIG. 1, the more matching one-letter codes and + symbols a region has between the Query and Sbjct rows, the higher its score total and the higher the sequence similarity. PAM120 (FIG. 3) and PAM250 (FIG. 4) are the widely used score matrices. FIGS. 3 and 4 give the values calculated for the cases in which 120 and 250 residues per 100 residues, respectively, have mutated.
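The log-odds definition of a score matrix element can be written compactly as below. The symbols are assumptions introduced here for illustration and do not appear in the original: q_ab is the probability that residues a and b are paired through mutation from a common ancestral molecule, e_ab is the probability that the pair arises by chance from unrelated sequences, and S is the score total over the aligned pairs (a_i, b_i) of a gap-free alignment.

```latex
s(a,b) = \log \frac{q_{ab}}{e_{ab}}, \qquad
s(a,b) > 0 \iff q_{ab} > e_{ab}, \qquad
S = \sum_{i} s(a_i, b_i)
```

Because the elements are logarithms, adding scores along an alignment corresponds to multiplying the underlying probability ratios, which is why a running sum of scores is a natural measure of accumulated similarity.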

[0005] Viewed in terms of the similarity of the two amino acid sequences, FIG. 1 divides broadly into three regions: a first region from position 271 to position 335 of the Query row, a second region from position 335 to position 456, and a third region from around position 456 to position 689. The first and third regions are highly similar, whereas the second region appears to have no similarity. This is because the BLAST algorithm, when aligning the sequences of two biopolymers, does not permit gaps arising from the insertion or deletion of elements.

[0006] FIG. 2 shows the alignment obtained when the same pair of sequences as in FIG. 1, the Drosophila notch protein and the Xenopus xotch protein, is compared by the FASTA method. As in FIG. 1, the upper row in FIG. 2 is the one-letter amino acid sequence of the notch protein (key sequence) and the lower row is the one-letter amino acid sequence of the xotch protein (target sequence). Like dynamic programming, FASTA is an algorithm that permits gaps arising from the insertion or deletion of elements. To obtain the result in FIG. 2, PAM250 was used as the score matrix, and penalty scores of 12 and 4 were used for gap introduction and gap extension, respectively. The result in FIG. 2 shows deletions, and hence gaps, in the target sequence at the position corresponding to key position 342 and in the key sequence at position 457. In other words, in the BLAST result of FIG. 1, the key sequence from position 342 to position 457 is shifted by one amino acid residue relative to the alignment it should have, and is therefore judged to have no similarity at all. A BLAST alignment can thus include regions with no similarity whatsoever. This follows from the principle that BLAST aligns over a region in which the score total goes from 0 to its maximum, and within that region the score total may take any value greater than 0 and less than the maximum.
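The effect of a single deletion on a gap-free alignment can be illustrated with a short sketch. The sequences below are hypothetical toy strings, not the notch or xotch sequences, and the helper name is an assumption. Without a gap, every residue after the deletion is compared against the wrong partner, so the shifted segment looks unrelated; inserting one gap restores the matches.

```python
def match_line(key, target):
    """Mark each aligned position: '|' for an identical pair, ' ' otherwise.

    A '-' in either string stands for a gap and never counts as a match.
    """
    return "".join(
        "|" if k == t and k != "-" else " "
        for k, t in zip(key, target)
    )


key = "MKTAYIAKQR"            # hypothetical key sequence
target = "MKTYIAKQR"          # same sequence with the 4th residue deleted
target_gapped = "MKT-YIAKQR"  # the same target with one gap inserted

print(repr(match_line(key, target)))         # '|||      '  (shifted by one residue)
print(repr(match_line(key, target_gapped)))  # '||| ||||||' (gap restores the matches)
```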

【０００７】[0007]

[Problem to Be Solved by the Invention] It is not easy to distinguish regions of high and low similarity in the constituent-element sequences of biopolymers merely by comparing the aligned elements, or to extract the highly similar local sequences. A first object of the present invention is to provide a method and a device for easily identifying high-similarity regions in an alignment result that was selected as a locally similar sequence but that also contains low-similarity regions and non-similar regions with no particular similarity. A second object of the present invention is to provide a method and a device for automatically extracting high-similarity regions from the alignment result.

【０００８】[0008]

[Means for Solving the Problem] To achieve the first object, the present invention compares the constituent-element sequences of biopolymers such as nucleic acids and proteins and obtains the total of the scores that express the similarity between the elements of the two biopolymers (element matches, mismatches, gap introductions, and so on). The resulting alignment of the locally similar sequence segment is then displayed as a multidimensional graph. The first axis is the element number indicating the order of the elements (for example, bases or amino acids) of one of the aligned biopolymers, and the second axis is a predetermined parameter derived from the scores between the aligned elements. The graph is smoothed. The predetermined parameter is either the cumulative score from a given element number on the first axis up to each element number, or the score value of each element number itself. The locally similar sequence segment may be, for example, a result obtained by dynamic programming, FASTA, or BLAST.

[0009] To achieve the second object, the present invention takes the graph whose predetermined parameter is the cumulative score from a given element number on the first axis up to each element number, identifies and extracts the regions in which the gradient of the graph stays at or above a predetermined value continuously along the first axis, and outputs and displays the cumulative score value of each such region together with the aligned sequences. Alternatively, it takes the graph whose predetermined parameter is the score value of each element number itself, identifies and extracts the regions in which the score value on the second axis stays at or above a predetermined value continuously along the first axis, and outputs and displays the cumulative score value of each such region together with the aligned sequences.

【００１０】[0010]

[Operation] As described above, when the predetermined parameter is the cumulative score from a given element number on the first axis up to each element number, the gradient of the graph is large in high-similarity regions, because the scores between the aligned elements often take large positive values there. In low-similarity regions the gradient is small, and in non-similar regions, where there is no particular similarity, the scores between the elements often take negative values and the gradient becomes negative. A glance at the graph is therefore enough to distinguish the low-similarity, non-similar, and high-similarity regions of the alignment result.
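The cumulative-score curve described above can be sketched in a few lines. This is an illustration under stated assumptions: the toy scoring function (+2 for a match, -1 for a mismatch) stands in for a real score matrix such as PAM120, the sequences are hypothetical, and gaps are not handled.

```python
from itertools import accumulate


def toy_score(k, t):
    # Hypothetical stand-in for a score matrix lookup (not PAM values).
    return 2 if k == t else -1


def cumulative_scores(key, target, score=toy_score):
    """Running total of per-position scores along a gap-free alignment.

    Index i of the result is the second-axis value for element number i + 1.
    """
    return list(accumulate(score(k, t) for k, t in zip(key, target)))


curve = cumulative_scores("ACDEFGHIK", "ACDWYWHIK")
print(curve)  # [2, 4, 6, 5, 4, 3, 5, 7, 9]
```

In this toy curve the first three and last three positions rise (high similarity) while the middle three fall (no similarity), which is the pattern the graph is meant to make visible at a glance.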

[0011] When the predetermined parameter is the score value of each element number itself, the second-axis value of the graph is large in high-similarity regions, because the scores between the aligned elements often take large positive values there. In low-similarity regions the second-axis value is small, and in non-similar regions the scores between the elements often take negative values, so the second-axis value becomes negative. Here too, a glance at the graph is enough to distinguish the low-similarity, non-similar, and high-similarity regions of the alignment result. Because the graph is also smoothed, the method is effective for distinguishing subtle differences in similarity within regions whose overall similarity is low.

[0012] To extract high-similarity regions automatically, a threshold is set. When the predetermined parameter is the cumulative score from a given element number on the first axis up to each element number, the regions where the gradient of the graph is large are extracted; when the predetermined parameter is the score value of each element number itself, the regions where the second-axis value is large are extracted. The threshold may be set in advance as a fixed gradient or score value, or it may be set automatically by calculation from the average gradient or average score value of the whole graph obtained from the alignment result. For efficient automatic extraction of high-similarity regions, smoothing the graph is effective because it reduces the influence of the element-by-element variation in score values.
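A threshold-based extraction of this kind might look like the sketch below. It is one possible reading, not the patented implementation: the function name, the 1-based inclusive region convention, the `min_length` parameter, and the choice of the mean slope as the automatic threshold are assumptions. The slope at each element number is taken as the difference between consecutive cumulative values.

```python
def extract_regions(cumulative, threshold=None, min_length=1):
    """Return (start, end) element-number pairs, 1-based and inclusive,
    where the slope of the cumulative-score curve stays >= threshold.

    If threshold is None it is set to the mean slope of the whole curve
    (an assumed way of deriving it "from the average of the graph").
    """
    if not cumulative:
        return []
    # Slope at element i is the score contributed by element i.
    slopes = [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]
    if threshold is None:
        threshold = sum(slopes) / len(slopes)

    regions, start = [], None
    for i, s in enumerate(slopes, start=1):
        if s >= threshold:
            if start is None:
                start = i                      # a run begins here
        elif start is not None:
            if i - start >= min_length:
                regions.append((start, i - 1))  # run ended at previous element
            start = None
    if start is not None and len(slopes) - start + 1 >= min_length:
        regions.append((start, len(slopes)))    # run reaches the end
    return regions


curve = [2, 4, 6, 5, 4, 3, 5, 7, 9]        # hypothetical cumulative scores
print(extract_regions(curve, threshold=0))  # [(1, 3), (7, 9)]
```

On real data the curve would normally be smoothed first, so that an isolated negative pair inside a similar region does not split the run.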

【００１３】[0013]

[Embodiments] One embodiment of the present invention is described with reference to FIG. 5. In this embodiment, the alignment of the amino acid sequences of the Drosophila notch protein and the Xenopus xotch protein obtained by BLAST, shown in FIG. 1, is displayed as a graph. The horizontal axis is the number indicating the order of the amino acids. The vertical axis is the cumulative value obtained by adding, for each amino acid pair from the start of the alignment, the score value from the PAM120 score matrix shown in FIG. 3. On the horizontal axis, position 271 of the key sequence (the notch protein), where the alignment starts, is taken as 1. As a glance at FIG. 5 shows, the graph has a positive gradient and similarity is high in the ranges 1 to 64 and 187 to 415 on the horizontal axis, whereas from 65 to 186 the gradient is negative and similarity is low. In this embodiment the graph divides broadly into regions of positive gradient and regions of negative gradient, so highly similar local sequences can be extracted by identifying the portions with a positive gradient. However, because some amino acid pairs have negative values even within highly similar regions, it is desirable to smooth over several points along the horizontal axis when extracting locally similar sequences automatically. Information on highly similar subsequences can then be obtained as follows: the cumulative score of each automatically extracted locally similar sequence is calculated from the increase over the positive-gradient portion of the graph, the identity is determined as the ratio of the number of matching amino acid pairs to the total number of amino acid pairs in the aligned portion, and both are output and displayed together with the aligned portion of the amino acid sequences.
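The two summary values reported for an extracted region, the score gained over the region and the identity ratio, can be computed as in this sketch. The function name, the 1-based inclusive region convention, and the toy data are assumptions, and the alignment is assumed to be gap-free so that string positions and element numbers coincide.

```python
def region_summary(key, target, cumulative, start, end):
    """Summarize the aligned region [start, end] (1-based, inclusive).

    Returns (gain, identity):
      gain     - rise of the cumulative-score curve across the region
      identity - matching pairs divided by all pairs in the region
    """
    before = cumulative[start - 2] if start > 1 else 0  # curve value just before the region
    gain = cumulative[end - 1] - before
    pairs = list(zip(key[start - 1:end], target[start - 1:end]))
    matches = sum(1 for k, t in pairs if k == t)
    return gain, matches / len(pairs)


key, target = "ACDEFGHIK", "ACDWYWHIK"  # hypothetical aligned sequences
curve = [2, 4, 6, 5, 4, 3, 5, 7, 9]     # toy cumulative scores (+2 match, -1 mismatch)

print(region_summary(key, target, curve, 7, 9))  # (6, 1.0)
print(region_summary(key, target, curve, 1, 6))  # (3, 0.5)
```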

[0014] Another embodiment of the present invention is described with reference to FIGS. 6 and 7. FIG. 6 plots the alignment result of FIG. 1 with the number indicating the order of the amino acids on the horizontal axis and, on the vertical axis, the PAM120 score value of the amino acid pair at each aligned position. On the horizontal axis, amino acid residue 271 of the key sequence (the notch protein), where the alignment starts, is taken as 1. As a glance at FIG. 6 shows, the values are mostly positive and similarity is high in the ranges 1 to 64 and 187 to 415 on the horizontal axis, whereas in the range 65 to 186 the values are negative and similarity is low. FIG. 6 divides broadly into regions whose score values are positive on average and regions whose score values are negative, so highly similar local sequences can be extracted by identifying the portions with positive score values. Here too, however, some amino acid pairs have negative values, so it is desirable to smooth over several points along the horizontal axis when extracting locally similar sequences automatically. FIG. 7 shows the result of averaging FIG. 6 over 11 points along the horizontal axis. In FIG. 7 the pair-to-pair variation in score is suppressed by the averaging, so regions of high and low similarity are easily distinguished. Furthermore, high-similarity regions can be extracted automatically according to the sign of the score.
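The 11-point averaging used to obtain FIG. 7 from FIG. 6 can be sketched as a centered moving average. The source does not say how the ends of the sequence are treated, so the shrinking window at the edges is an assumption, as are the function name and the toy data.

```python
def moving_average(values, window=11):
    """Centered moving average over `window` points.

    Near the ends the window is truncated to the available points
    (an assumed edge treatment; the source does not specify one).
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


# Hypothetical per-position scores: a similar region with one stray negative pair.
scores = [2, 2, 2, -1, 2, 2, 2]
print(moving_average(scores, window=3))  # [2.0, 2.0, 1.0, 1.0, 1.0, 2.0, 2.0]
```

After smoothing, the stray negative value no longer drives the curve below zero, so a sign-based extraction treats the whole stretch as one similar region.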

[0015] Another embodiment of the present invention is described with reference to FIG. 8. FIG. 8 graphs the result of a Smith-Waterman comparison of the amino acid sequences of human calmodulin, which belongs to the calmodulin superfamily (a superfamily being a grouping of proteins by sequence similarity), and the E. coli protein called dnaK. The score calculation uses the PAM250 matrix shown in FIG. 4, with penalty scores of 12 and 4 for gap introduction and gap extension, respectively. The amino acid sequence of E. coli dnaK is known to give the highest score total among proteins outside the calmodulin superfamily when compared by the Smith-Waterman or FASTA method. With human calmodulin, which consists of 149 amino acid residues, as the reference sequence, the amino acid sequence can be aligned with the E. coli dnaK protein over almost its full length. The alignment spans 120 amino acid residues, including insertions and deletions. However, as FIG. 8 shows, only a region of 30 amino acid residues contributes to the score of the optimal alignment. This region corresponds to one of the four motifs of human calmodulin called EF hands, which are involved in binding calcium ions. This embodiment also shows that truly highly similar regions can be easily identified within a result selected as a locally similar sequence.

[0016] Another embodiment of the present invention is described with reference to FIG. 9. In this embodiment, K1HUAG, a protein belonging to the immunoglobulin V region superfamily, is used as the key sequence and is compared with other proteins of the immunoglobulin V region, K2DGGM, KVRBB1, and S09230 (all entry names in the PIR protein sequence database), by dynamic programming based on the Smith-Waterman algorithm. The results are graphed in FIGS. 9(a), 9(b), and 9(c), respectively. The score calculation used the PAM250 score matrix shown in FIG. 4, with penalty scores of 12 and 4 for gap introduction and gap extension, respectively. In the optimal alignments obtained, the amino acid identity between the two proteins was 51% to 53%. In FIG. 9, the horizontal axis is the amino acid residue number of the key sequence K1HUAG, and the vertical axis is the cumulative score for each amino acid pair from the start of the alignment. The V region of the immunoglobulin light chain consists of four framework regions (FW), composed mainly of beta-sheet structure, and three loop regions (CDR). The loop regions (CDR) are involved in antigen binding and show large amino acid variation, as required for diversity of antigen recognition. The framework regions (FW), by contrast, determine the structure of the immunoglobulin protein, and their amino acid sequences are relatively well conserved. The positions of these regions are also marked in FIG. 9. As FIG. 9 shows, the gradient is relatively large in the FW regions and small in the CDR regions. The graph thus readily yields both the above knowledge about sequence conservation and the division into regions.

[0017] An embodiment of the present invention will be described with reference to FIGS. 9 and 10. In FIG. 9, there are three regions with a locally large gradient, indicated by elliptical frames 8, 9, and 10. Many proteins belong to the immunoglobulin V region superfamily, and their sequences are highly diverse. When the Smith-Waterman, FASTA, or BLAST method is used to compare the amino acid sequence similarity of different proteins in a search of a protein amino acid sequence database, the performance of the algorithm is measured by its sensitivity and selectivity, that is, by how many of the proteins belonging to the superfamily it selects and how few non-members it selects. Whichever algorithm is adopted, however, it is impossible to decide from the optimal alignment score alone whether a selected sequence belongs to the same superfamily as the reference sequence. It is therefore conceivable to add characteristic local sites, such as those indicated by elliptical frames 8, 9, and 10 in FIG. 9, to the criteria in addition to the score. FIGS. 10(a) to 10(e) show, as graphs, several alignment results obtained by the Smith-Waterman method with K1HUAG as the key sequence, as in FIG. 9; these are the lowest-scoring results among the sequences that belong to the same superfamily. The amino acid sequence identity ranged from 27% down to 19%. These graphs also show the portions outside the optimally aligned region, so the score values on the vertical axis are relative values. Although almost no similarity is seen overall, regions 8', 9', and 10', which correspond to the elliptical regions shown in FIG. 9, are conserved in many of the examples. This suggests that combining the characteristic graph pattern shown here with the optimal alignment score may allow superfamily members to be identified with high accuracy.
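One way such a combined criterion might look is sketched below in Python. The text only suggests the possibility and gives no concrete rule, so everything here is hypothetical: the function names, the overlap test, the thresholds, and the region coordinates used in the usage note are assumptions for illustration and are not taken from FIG. 9 or FIG. 10.

```python
from typing import List, Tuple

# Inclusive (start, end) residue numbers in the key sequence.
Region = Tuple[int, int]


def overlaps(a: Region, b: Region, min_overlap: int = 1) -> bool:
    """True if regions a and b share at least `min_overlap` residues."""
    shared = min(a[1], b[1]) - max(a[0], b[0]) + 1
    return shared >= min_overlap


def conserved_region_count(reference: List[Region],
                           candidate: List[Region]) -> int:
    """Count reference steep regions matched by at least one candidate steep region."""
    return sum(any(overlaps(r, c) for c in candidate) for r in reference)


def is_probable_member(score: float,
                       reference: List[Region],
                       candidate: List[Region],
                       min_score: float,
                       min_regions: int) -> bool:
    """Hypothetical rule: require both a minimum optimal alignment score
    and a minimum number of conserved steep regions."""
    return (score >= min_score
            and conserved_region_count(reference, candidate) >= min_regions)
```

With made-up reference regions `[(20, 30), (60, 70), (85, 95)]` and candidate steep regions `[(22, 28), (88, 99)]`, two of the three reference regions are matched, so a candidate with a sufficient score passes when `min_regions` is 2 and fails when it is 3.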

[0018] An embodiment of an apparatus for carrying out the method of the present invention described above will be described with reference to FIG. 11. This embodiment comprises input means 1 for the biopolymer sequences whose optimal alignment is to be obtained, alignment calculation means 2 for obtaining the optimal alignment, graphing calculation means 3 for obtaining a graph such as that illustrated in FIG. 5 from the optimal alignment result, and output means 4 for outputting and displaying the graph. Specifically, it consists of hardware such as a computer main unit, an external storage device, a CRT, a printer, and a keyboard, and software such as a calculation program for obtaining the optimal alignment and a program for graphing the optimal alignment result.

[0019] The input means comprises means for entering data, such as keywords identifying the protein whose optimal alignment is to be obtained, from the computer keyboard and retrieving the amino acid sequence data of the protein from amino acid sequence database 5 held in the external storage device of the computer, and means for copying the retrieved sequence data to the internal storage device of the alignment calculation means. If the amino acid sequence is already known, the amino acid sequence data may be entered directly from the keyboard.

[0020] The alignment calculation means comprises means for obtaining the optimal alignment between the amino acid sequence data copied to the internal storage device by the input means and the sequence data copied one after another from the amino acid sequence database in the external storage device to the internal storage device of the alignment calculation means, based on scores for matches, mismatches, gap introduction, and the like between amino acids, and selection means for picking out significant sequence pairs according to the total score of the optimal alignment obtained. The optimal alignment results are transferred to the internal storage device of the graphing calculation means. If the pair of sequences to be compared is known in advance, the sequence data of that pair need only be copied to the internal storage device of the alignment calculation means by the input means. In this case, the result of the optimal alignment calculation is transferred directly to the internal storage device of the graphing calculation means without passing through the selection means.
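The two parts of the alignment calculation means, scoring each database sequence against the key sequence and then keeping only the significant pairs, can be sketched as follows. This is a simplified illustration, not the program of the embodiment: it computes only the Smith-Waterman local alignment score, uses a single linear gap penalty instead of separate introduction and extension penalties, and replaces the score matrix with fixed match and mismatch values. The function names, parameter values, and the dictionary standing in for the sequence database are assumptions.

```python
from typing import Dict, List, Tuple


def smith_waterman_score(a: str, b: str,
                         match: int = 5, mismatch: int = -2,
                         gap: int = -8) -> int:
    """Best local alignment score of a and b (linear gap penalty, assumed values)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0,                  # start a new local alignment
                          prev[j - 1] + s,    # align a[i-1] with b[j-1]
                          prev[j] + gap,      # gap in b
                          curr[j - 1] + gap)  # gap in a
            best = max(best, curr[j])
        prev = curr
    return best


def select_significant(key: str, database: Dict[str, str],
                       threshold: int) -> List[Tuple[str, int]]:
    """Score every database entry against the key and keep those at or above
    the threshold, highest score first."""
    hits = [(name, smith_waterman_score(key, seq))
            for name, seq in database.items()]
    return sorted((h for h in hits if h[1] >= threshold),
                  key=lambda h: h[1], reverse=True)
```

Only the pairs returned by `select_significant` would be passed on for graphing; when the pair to compare is already known, the selection step is simply skipped, as the paragraph above describes.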

[0021] The graphing calculation means obtains a graph, by the graphing methods described in the embodiments above, from the optimal alignment results that were calculated and selected by the alignment calculation means and transferred to its internal storage device. The resulting graph data is transferred to the output means, which outputs and displays it. In this embodiment, the output means consists of a CRT and a printer.

[0022] An embodiment of an apparatus for carrying out the method of the present invention will be described with reference to FIG. 12. In addition to the means constituting the apparatus of the preceding embodiment, this embodiment comprises selection calculation means 6 for selecting regions of particularly high similarity from the obtained graph, output value calculation means 7 for obtaining the cumulative score and the degree of identity of each selected region, and output means 4 for outputting the cumulative score, the degree of identity, and the alignment result. The selection calculation means consists of calculation means that, based on the graph data transferred to its internal storage device, picks out regions in which the gradient of the graph is continuously at or above a fixed value, or regions in which the score value of the graph is continuously at or above a fixed value, and determines the boundaries of those regions. The output value calculation means calculates, for each region picked out by the selection calculation means, the cumulative score value and the degree of identity, which expresses the proportion of matching amino acids. The cumulative score value, the degree of identity, and the alignment result of the selected region are transferred to the output means, which outputs and displays them. In this embodiment as well, the output means consists of a CRT and a printer. For one key sequence, a plurality of target sequences can be displayed superimposed on two-dimensional coordinates or displayed on multi-dimensional coordinates. The data for each target sequence can then be displayed in a different color to make identification easier. Furthermore, the positive and negative portions of the gradient in the cumulative score graph, or of the score value in the score graph, can be displayed in different colors to make identification easier.
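The selection calculation and the output value calculation can be illustrated with a short Python sketch. It relies on the fact that, for a cumulative-score graph sampled once per alignment column, the gradient at a column is that column's own score. The function names, the `min_length` parameter, and the example values are assumptions made for illustration; the text does not specify how the threshold or the region boundaries are chosen.

```python
from typing import Dict, List, Tuple


def extract_steep_regions(column_scores: List[int], min_slope: int,
                          min_length: int = 1) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) column ranges in which the per-column
    score, i.e. the slope of the cumulative-score graph, stays >= min_slope."""
    regions: List[Tuple[int, int]] = []
    start = None
    for i, s in enumerate(column_scores):
        if s >= min_slope:
            if start is None:
                start = i                      # region begins here
        elif start is not None:
            if i - start >= min_length:
                regions.append((start, i - 1))  # region ended at previous column
            start = None
    if start is not None and len(column_scores) - start >= min_length:
        regions.append((start, len(column_scores) - 1))
    return regions


def region_summary(key: str, target: str, column_scores: List[int],
                   region: Tuple[int, int]) -> Dict[str, float]:
    """Cumulative score and degree of identity (fraction of matching columns)
    inside one extracted region of the alignment."""
    start, end = region
    pairs = list(zip(key[start:end + 1], target[start:end + 1]))
    matches = sum(1 for a, b in pairs if a == b and a != "-")
    return {"score": sum(column_scores[start:end + 1]),
            "identity": matches / len(pairs)}
```

For the alignment `ACD-EFGH` / `ACWKEFGY` with column scores `[5, 5, -2, -12, 5, 5, 5, -2]`, a slope threshold of 1 and a minimum length of 2 yield the regions (0, 1) and (4, 6); the second has a cumulative score of 15 and an identity of 1.0.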

[0023] The embodiments above have focused on results for the amino acid sequences of proteins, but the present invention is clearly also effective for comparing nucleic acid sequences. In particular, when genomic DNA consisting of exons and introns is compared with a cDNA sequence, conventional methods for extracting locally similar sequences, such as the dynamic programming, FASTA, and BLAST methods, cannot identify the boundaries between exons and introns. They may extract as a similar sequence not only the perfectly matching sequences of the exon portions but also sequences of the intron portions that have no similarity at all, which is a problem when the identity of genomic DNA and cDNA sequences is evaluated from the degree of sequence identity. With the present invention, however, the perfectly matching portions and the non-matching portions can be clearly distinguished from, for example, the gradient of the graph. By automatically extracting the perfectly matching portions and outputting and displaying their score values, degrees of identity, and alignment results, the identity can be evaluated without error.
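The genomic DNA versus cDNA case can be illustrated with a windowed gradient. The sketch below marks a stretch of the alignment as perfectly matching only if the cumulative score rises at the maximum possible rate over a whole window, which suppresses isolated chance matches inside an intron. The scoring values (+1 for a match, -1 otherwise), the window length, the function names, and the example sequences are all assumptions made for illustration.

```python
from itertools import accumulate
from typing import List, Tuple


def column_scores(genomic: str, cdna: str,
                  match: int = 1, mismatch: int = -1) -> List[int]:
    """Per-column score of an aligned genomic/cDNA pair ('-' marks a gap)."""
    return [match if g == c and g != "-" else mismatch
            for g, c in zip(genomic, cdna)]


def perfect_match_regions(genomic: str, cdna: str, window: int,
                          match: int = 1) -> List[Tuple[int, int]]:
    """Inclusive column ranges covered by windows in which the cumulative
    score climbs at the maximum slope, i.e. every column matches."""
    scores = column_scores(genomic, cdna, match=match)
    cumulative = [0] + list(accumulate(scores))
    covered = [False] * len(scores)
    for i in range(len(scores) - window + 1):
        if cumulative[i + window] - cumulative[i] == window * match:
            for j in range(i, i + window):
                covered[j] = True
    regions: List[Tuple[int, int]] = []
    start = None
    for i, flag in enumerate(covered + [False]):  # sentinel closes the last run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            regions.append((start, i - 1))
            start = None
    return regions
```

For the toy alignment `ATGGCCGTAAGTTTCAGGATTAC` / `ATGGCC---A-------GATTAC` and a window of 4, the function returns (0, 5) and (17, 22), the two exon-like blocks, while the single chance match inside the gapped intron stretch is ignored.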

【００２４】[0024]

[Effects of the Invention] According to the present invention, regions of high similarity can be easily identified in, and automatically extracted from, the optimal alignment results of similar nucleic acids or proteins that contain regions of low similarity or no similarity. In addition, the position and boundaries of each domain of a protein composed of a plurality of domains, or the positions and boundaries of the introns and exons of a nucleic acid, can be easily identified.

[Brief description of drawings]

FIG. 1 is an example of an output result according to a prior art related to the present invention.

FIG. 2 is an example of an output result according to a prior art related to the present invention.

FIG. 3 is an example of a score matrix used in the present invention.

FIG. 4 is an example of a score matrix used in the present invention.

FIG. 5 is a graph showing an example of an output result according to the present invention.

FIG. 6 is a graph showing an example of an output result according to the present invention.

FIG. 7 is a graph showing an example of an output result according to the present invention.

FIG. 8 is a graph showing an example of an output result according to the present invention.

FIG. 9 is a graph showing an example of an output result according to the present invention.

FIG. 10 is a graph showing an example of an output result according to the present invention.

FIG. 11 is a block diagram of an embodiment of an apparatus in which the method of the present invention is implemented.

FIG. 12 is a block diagram of an embodiment of an apparatus in which the method of the present invention is implemented.

[Explanation of symbols]

1 ... input means, 2 ... alignment calculation means, 3 ... graphing calculation means, 4 ... output means, 5 ... sequence database, 6 ... selection calculation means, 7 ... output value calculation means, 8, 9, 10 ... regions with a locally large gradient, 8' ... region corresponding to region 8 with a locally large gradient, 9' ... region corresponding to region 9 with a locally large gradient, 10' ... region corresponding to region 10 with a locally large gradient.

Continuation of the front page
(72) Inventor: Tetsuo Nishikawa, c/o Central Research Laboratory, Hitachi, Ltd., 1-280 Higashi-Koigakubo, Kokubunji, Tokyo
(72) Inventor: Hisamitsu Kawaguchi, c/o Systems Development Laboratory, Hitachi, Ltd., 1099 Ozenji, Asao-ku, Kawasaki, Kanagawa
(72) Inventor: Susumu Hiraoka, c/o Systems Development Laboratory, Hitachi, Ltd., 1099 Ozenji, Asao-ku, Kawasaki, Kanagawa
(72) Inventor: Naoko Kasahara, c/o Central Research Laboratory, Hitachi, Ltd., 1-280 Higashi-Koigakubo, Kokubunji, Tokyo
(72) Inventor: Hideki Kanbara, c/o Central Research Laboratory, Hitachi, Ltd., 1-280 Higashi-Koigakubo, Kokubunji, Tokyo
(72) Inventor: Toshiji Okayama, c/o Hitachi Software Engineering Co., Ltd., 6-81 Onoe-cho, Naka-ku, Yokohama, Kanagawa

Claims

1. A method for displaying locally similar sequences of biopolymers each consisting of a sequence of a plurality of constituent elements, the method having at least a step of comparing the sequences of the constituent elements of at least two of the biopolymers and obtaining a score that evaluates information on sequence similarity, including matches, mismatches, gap introduction, and gap deletion between the constituent elements, and a step of aligning and comparing the sequences of the constituent elements of the plurality of biopolymers based on the score, the method being characterized by comprising: a step of converting the alignment result into graph data in which a first axis is the element number indicating the position of each constituent element in the sequence of one of the aligned biopolymers and a second axis is a predetermined parameter obtained from the scores between the aligned constituent elements; and a step of displaying the graph data multidimensionally as a graph.

2. The method for displaying locally similar sequences of biopolymers according to claim 1, further comprising a step of smoothing the graph data.

3. The method for displaying locally similar sequences of biopolymers according to claim 1 or 2, wherein the predetermined parameter is the cumulative value of the scores from a predetermined element number up to each element number.

4. The method for displaying locally similar sequences of biopolymers according to claim 1 or 2, wherein the predetermined parameter is the value of the score at each element number.

5. The method for displaying locally similar sequences of biopolymers according to claim 3, further comprising a step of extracting a region in which the gradient of the graph data, that is, the change in the second-axis value with respect to the first axis, is continuously equal to or greater than a predetermined constant value, wherein at least one of the cumulative value of the score, the degree of identity of the sequences of the constituent elements, and the alignment result in the region is output and displayed.

6. The method for displaying locally similar sequences of biopolymers according to claim 4, further comprising a step of extracting a region of the graph data in which the value of the score is continuously equal to or greater than a predetermined constant value with respect to the first axis, wherein at least one of the cumulative value of the score, the degree of identity of the sequences of the constituent elements, and the alignment result in the region is output and displayed.

7. A method for extracting locally similar sequences of biopolymers each consisting of a sequence of a plurality of constituent elements, the method having at least a step of comparing the sequences of the constituent elements of at least two of the biopolymers and obtaining a score that evaluates information on sequence similarity, including matches, mismatches, gap introduction, and gap deletion between the constituent elements, and a step of aligning and comparing the sequences of the constituent elements of the plurality of biopolymers based on the score, the method being characterized by comprising: a step of converting the alignment result into graph data in which a first axis is the element number indicating the position of each constituent element in the sequence of one of the aligned biopolymers and a second axis is a predetermined parameter obtained from the scores between the aligned constituent elements; a step of displaying the graph data multidimensionally as a graph; and a step of extracting a portion having a predetermined similarity from the alignment result.

8. The method for extracting locally similar sequences of biopolymers according to claim 7, further comprising a step of smoothing the graph data.

9. The method for extracting locally similar sequences of biopolymers according to claim 7 or 8, wherein the predetermined parameter is the cumulative value of the scores from a predetermined element number up to each element number.

10. The method for extracting locally similar sequences of biopolymers according to claim 7 or 8, wherein the predetermined parameter is the value of the score at each element number.

11. The method for extracting locally similar sequences of biopolymers according to claim 9, further comprising a step of extracting a region in which the gradient of the graph data, that is, the change in the second-axis value with respect to the first axis, is continuously equal to or greater than a predetermined constant value.

12. The method for extracting locally similar sequences of biopolymers according to claim 10, further comprising a step of extracting a region of the graph data in which the value of the score is continuously equal to or greater than a predetermined constant value with respect to the first axis.

13. A device for displaying locally similar sequences of biopolymers each consisting of a sequence of a plurality of constituent elements, the device having at least means for comparing the sequences of the constituent elements of at least two of the biopolymers and obtaining a score that evaluates information on sequence similarity, including matches, mismatches, gap introduction, and gap deletion between the constituent elements, and means for aligning and comparing the sequences of the constituent elements of the plurality of biopolymers based on the score, the device being characterized by comprising: means for obtaining a predetermined parameter from the scores between the aligned constituent elements; means for converting the alignment result into graph data in which a first axis is the element number indicating the position of each constituent element in the sequence of one of the aligned biopolymers and a second axis is the predetermined parameter; and display means for displaying the graph data multidimensionally as a graph.

14. The device for displaying locally similar sequences of biopolymers according to claim 13, further comprising means for smoothing the graph data.

15. The device for displaying locally similar sequences of biopolymers according to claim 13 or 14, wherein the predetermined parameter is the cumulative value of the scores from a predetermined element number up to each element number.

16. The device for displaying locally similar sequences of biopolymers according to claim 13 or 14, wherein the predetermined parameter is the value of the score at each element number.

17. The device for displaying locally similar sequences of biopolymers according to claim 15, further comprising means for extracting a region in which the gradient of the graph data, that is, the change in the second-axis value with respect to the first axis, is continuously equal to or greater than a predetermined constant value, wherein at least one of the cumulative value of the score, the degree of identity of the sequences of the constituent elements, and the alignment result in the region is output and displayed.

18. The device for displaying locally similar sequences of biopolymers according to claim 16, further comprising means for extracting a region of the graph data in which the value of the score is continuously equal to or greater than a predetermined constant value with respect to the first axis, wherein at least one of the cumulative value of the score, the degree of identity of the sequences of the constituent elements, and the alignment result in the region is output and displayed.

19. A device for extracting locally similar sequences of biopolymers each consisting of a sequence of a plurality of constituent elements, the device having at least means for comparing the sequences of the constituent elements of at least two of the biopolymers and obtaining a score that evaluates information on sequence similarity, including matches, mismatches, gap introduction, and gap deletion between the constituent elements, and means for aligning and comparing the sequences of the constituent elements of the plurality of biopolymers based on the score, the device being characterized by comprising: means for obtaining a predetermined parameter from the scores between the aligned constituent elements; means for converting the alignment result into graph data in which a first axis is the element number indicating the position of each constituent element in the sequence of one of the aligned biopolymers and a second axis is the predetermined parameter; display means for displaying the graph data multidimensionally as a graph; and means for extracting a portion having a predetermined similarity from the alignment result.

20. The device for extracting locally similar sequences of biopolymers according to claim 19, further comprising means for smoothing the graph data.

21. The device for extracting locally similar sequences of biopolymers according to claim 19 or 20, wherein the predetermined parameter is the cumulative value of the scores from a predetermined element number up to each element number.

22. The device for extracting locally similar sequences of biopolymers according to claim 19 or 20, wherein the predetermined parameter is the value of the score at each element number.

23. The device for extracting locally similar sequences of biopolymers according to claim 21, further comprising means for extracting a region in which the gradient of the graph data, that is, the change in the second-axis value with respect to the first axis, is continuously equal to or greater than a predetermined constant value.

24. The device for extracting locally similar sequences of biopolymers according to claim 21, further comprising: means for extracting a region in which the gradient of the graph data, that is, the change in the second-axis value with respect to the first axis, is continuously equal to or greater than a predetermined constant value; means for obtaining the cumulative value of the score in the region; means for obtaining the degree of identity of the sequences of the constituent elements in the region; and display means for displaying at least one of the region extracted from the alignment result, the cumulative value of the score, and the degree of identity.

25. The device for extracting locally similar sequences of biopolymers according to claim 22, further comprising means for extracting a region of the graph data in which the value of the score is continuously equal to or greater than a predetermined constant value with respect to the first axis.

26. The device for extracting locally similar sequences of biopolymers according to claim 22, further comprising: means for extracting a region of the graph data in which the value of the score is continuously equal to or greater than a predetermined constant value with respect to the first axis; means for obtaining the cumulative value of the score in the region; means for obtaining the degree of identity of the sequences of the constituent elements in the region; and display means for displaying at least one of the region extracted from the alignment result, the cumulative value of the score, and the degree of identity.