KR100431620B1

KR100431620B1 - A system for analyzing dna-chips using gene ontology, and a method thereof

Info

Publication number: KR100431620B1
Application number: KR10-2002-0010826A
Authority: KR
Inventors: 김양석; 허정욱; 이성근
Original assignee: 주식회사 이즈텍
Priority date: 2002-02-28
Filing date: 2002-02-28
Publication date: 2004-05-17
Anticipated expiration: 2022-02-28
Also published as: WO2003072701A1; AU2003212669A1; KR20030071225A

Abstract

The present invention relates to a system and method for biologically analyzing the gene expression pattern of a DNA chip or microarray experiment by modeling the hierarchical structure of the Gene Ontology (GO). The GO-based DNA chip analysis system according to the invention comprises: means for receiving the statistical clustering result of a DNA chip experiment and assigning a Gene Ontology (GO) identifier to each gene belonging to each cluster; means for converting each GO identifier assigned to a gene in the cluster into a GO code using a GO code file; means for obtaining the average pseudo-distance and the maximum pseudo-distance of the genes using the GO codes of the genes in the cluster; means for extracting the optimally matching GO term, according to either a basic process or an N-step selection process, using the average and maximum pseudo-distances between the genes in the cluster and the GO nodes of the GO tree structure; and means for extracting the biological meaning of the cluster using the optimally matching GO term.

Description

System and method for analyzing DNA chips using Gene Ontology {A SYSTEM FOR ANALYZING DNA-CHIPS USING GENE ONTOLOGY, AND A METHOD THEREOF}

The present invention relates to a system and method for analyzing DNA chips using the Gene Ontology, and more specifically to a system and analysis method for biologically analyzing the gene expression pattern of a DNA chip or microarray experiment by modeling the hierarchical structure of the Gene Ontology (hereinafter "GO").

Since Watson and Crick elucidated the double-helix structure of DNA in 1953, advances such as the discovery of restriction enzymes, hybridization techniques, and the polymerase chain reaction (PCR) have contributed greatly to understanding life phenomena at the molecular level. However, as the need arose for experiments that allow a holistic rather than fragmentary understanding of life phenomena with complex regulatory functions, as in the Human Genome Project (HGP), the DNA chip was developed in the course of efforts to understand the function of nucleotide sequences. Research in bioinformatics and functional genomics is also being actively pursued to make efficient use of the results of the HGP and of DNA chips.

Biochips are broadly divided into microarrays and microfluidics chips. A microarray is a chip on which thousands to tens of thousands or more of DNA molecules, proteins, or the like are arrayed and attached at regular intervals, so that their binding pattern can be analyzed after treatment with the substance to be analyzed. Examples include the DNA chip mentioned above and the protein chip, and the DNA chip is the most widely used biochip to date. A microfluidics chip, by contrast, analyzes how a trace amount of analyte reacts with the biomolecules or sensors integrated on the chip as the analyte flows through it.

A DNA chip is made by attaching target DNA, cDNA, or oligonucleotides to a glass plate, a nitrocellulose membrane, or silicon. In other words, a DNA chip is a small solid surface on which cDNA or oligonucleotide probes of known sequence are micro-arrayed at predetermined positions.

Such a DNA chip is hybridized with probes labeled with fluorescent substances or radioisotopes and can be used for measuring gene expression levels, identifying mutations, detecting single nucleotide polymorphisms (SNPs), diagnosing disease, high-throughput screening (HTS), and so on. When the sample DNA fragments to be analyzed are applied to the chip, they hybridize with the probes attached to the chip according to the degree of complementarity between their base sequences. By observing and interpreting this hybridization through optical or radiochemical methods, the base sequence of the sample DNA can be determined. DNA chips make it possible to obtain expression information for a large number of genes simply and quickly, and they are currently being developed and used for drug discovery and medical diagnosis.

Statistical and biological methods are used together to analyze DNA chip results. The expression levels of the genes obtained through image analysis are grouped by statistical clustering so that genes showing a common expression pattern fall into the same cluster. The known functions of the individual genes are then used to give each cluster a general meaning and, at the same time, to confirm the reliability of the cluster biologically.

The conventional biological confirmation process extracts and compares gene functions from papers or existing biological information databases. The databases used include basic DNA information from NCBI (National Center for Biotechnology Information), functional category information from MIPS (Munich Information Center for Protein Sequences) or CGAP (Cancer Genome Anatomy Project), and protein information from Swiss-Prot. Until now, however, this work has largely been done manually by researchers, and the diversity of biological terminology has made systematic, automated analysis difficult.

Among the existing biological databases, Swiss-Prot, which is widely used as a source of protein information, classifies protein functions well using keywords, but no formal correlation or hierarchy exists among those keywords, which hinders automation of the biological analysis of DNA chips. Group information from specialized fields, such as CGAP (Cancer Genome Anatomy Project), is applicable only within the field concerned, and because the groups themselves cover functions in too broad a sense, they are of limited use for detailed functional analysis.

An alternative is to use the GO terms provided by the Gene Ontology Consortium (GO Consortium). Briefly, an ontology here is a system for classifying biological terms or vocabulary. The Gene Ontology Consortium was established to unify biological terminology; it provides unified terms that can be used in common to describe gene function in all species, and the vocabulary currently consists of about 10,000 terms. In short, GO treats genes, or the keywords associated with genes, as entities and studies the relationships among them, and it is applied in bioinformatics.

A distinctive feature of GO terms is that they are related to one another in a hierarchical tree structure and that all terms are divided into three large categories. That is, about 10,000 terms in three large categories are organized hierarchically, like a tree. The invention uses this structure to find biological meaning when analyzing DNA chips. GO divides gene function into the categories i) molecular function, ii) biological process, and iii) cellular component, and establishes a hierarchical controlled vocabulary for each category. These categories are not mutually exclusive; they partition the features used to describe a single gene.

To solve the problems described above, an object of the present invention is to provide a system and analysis method for analyzing DNA chips using the Gene Ontology, so that the gene expression pattern of a DNA chip experiment can be analyzed biologically in a systematic way by modeling the GO hierarchy.

Another object of the present invention is to provide a method that uses GO terms and the tree structure to extract the most common and ideal gene function of the genes belonging to a cluster generated by statistical clustering of DNA chip experimental results.

FIG. 1 is a block diagram of a DNA chip analysis system using the Gene Ontology according to the present invention.

FIG. 2 shows an example of a GO tree structure according to the present invention.

FIG. 3 illustrates an example of a modified text-format GO tree according to the present invention.

FIG. 4 shows an example of the conversion of extracted GO codes according to the present invention.

FIG. 5 schematically illustrates the principle of obtaining the pseudo-distance using GO according to the present invention.

FIG. 6 shows an example of finding the optimal branch when there are several nodes.

FIG. 7 is a flowchart of the method of analyzing a DNA chip using GO according to the present invention.

To achieve the above objects, a system for analyzing a DNA chip using the Gene Ontology according to the present invention comprises: means for receiving the statistical clustering result of the DNA chip experiment and assigning a Gene Ontology (GO) identifier to each gene belonging to each cluster; means for converting each GO identifier assigned to a gene in the cluster into a GO code using a GO code file; means for obtaining the average pseudo-distance and the maximum pseudo-distance of the genes using the GO codes of the genes in the cluster; means for extracting the optimally matching GO term, according to either a basic process or an N-step selection process, using the average and maximum pseudo-distances between the genes in the cluster and the GO nodes of the GO tree structure; and means for extracting the biological meaning of the cluster using the optimally matching GO term. The system may further comprise visualization means for displaying the extracted optimally matching GO term together with its GO code and biological meaning.

The visualization means displays summary information on the optimally matching GO term, its GO code, and its biological meaning in table form, or displays a graphical result in the form of a tree structure.

The pseudo-distance pd(v1, v2), where v1 and v2 are nodes, is the weight of the level of the code of the optimal branch formed by the two nodes v1 and v2; pd is defined as 0 when v1 and v2 are identical.

When the set of codes of a given cluster is G, each optimal branch is obtained using the maximum pseudo-distance (max_pd) and the average pseudo-distance (aver_pd). For G = {v1, v2, v3, v4, ..., vn}, max_pd and aver_pd are defined as

$$\mathrm{max\_pd}(G) = \max \{\, \mathrm{pd}(v_i, v_j) \,\}, \quad 1 \le i \le j \le n$$

$$\mathrm{aver\_pd}(G) = \frac{\sum_{1 \le i < j \le n} \mathrm{pd}(v_i, v_j)}{{}_{n}C_{2}} = \frac{2 \sum_{1 \le i < j \le n} \mathrm{pd}(v_i, v_j)}{n(n-1)}$$

and, among the possible combinations of codes, the max_pd and aver_pd with the lowest scores are finally selected.

The maximum pseudo-distance (max_pd) is used to evaluate a cluster roughly: the higher the level at which the optimal branch lies, the more likely it is that the cluster contains an inappropriate (bad) cluster that undermines the general commonality of its member genes.

The average pseudo-distance (aver_pd) indicates how tightly the GO codes are grouped within a given cluster and how frequently similar codes are observed.

The basic process computes the maximum pseudo-distance (max_pd) and the average pseudo-distance (aver_pd) over all nodes of the GO tree structure, and its result shows the approximate biological meaning of a given cluster. The N-step selection process computes the maximum and average pseudo-distances for the GO nodes at a specific level selected in the GO tree structure.

The analysis method for analyzing a DNA chip using the Gene Ontology according to the present invention comprises the steps of: a) assigning GO identifiers, cluster by cluster, to the statistical clustering result of the DNA chip experiment; b) converting each GO identifier assigned to a gene in the cluster into a GO code using a GO code file; c) obtaining the average pseudo-distance and the maximum pseudo-distance of the genes using the GO codes of the genes in the cluster; d) extracting the optimally matching GO term, according to either a basic process or an N-step selection process, using the average and maximum pseudo-distances between the genes in the cluster and the GO nodes of the GO tree structure; and e) extracting the biological meaning of the cluster using the optimally matching GO term. The method may further comprise the step of displaying the extracted optimally matching GO term together with its GO code and biological meaning.

Preferred embodiments of the system and method for analyzing a DNA chip using GO according to the present invention are described below with reference to the accompanying drawings.

FIG. 1 is a block diagram of the DNA chip analysis system using GO according to the present invention. The system comprises: an input unit (110) for inputting the statistical clustering result of the DNA chip experiment; a GO identifier assignment unit (130) that uses a GO identifier index file (120) to assign a GO identifier to each gene belonging to each cluster of the input clustering result; a GO identifier/GO code conversion unit (140) that converts each GO identifier assigned to a gene into a GO code using a GO code file; an optimal branch extraction unit (220) that selects a given process according to the pseudo-distance algorithm (210), sets the required parameters, and extracts the optimal branch for the GO codes; and a biological meaning extraction unit (230) that extracts the biological meaning of each extracted optimal branch. The system may further comprise a display (310) for displaying the optimal branch extracted for each gene, the GO code, and the biological meaning.

The present invention uses GO terms and the tree structure to extract the most common and ideal gene function of the genes belonging to a cluster generated by statistical clustering of DNA chip experimental results.

To this end, an accurate GO term is assigned to each gene, the tree-shaped GO hierarchy is used efficiently to extract the optimal branch, and the result of the optimal branch extraction is displayed effectively.

FIG. 2 shows an example of a GO structure according to the present invention. The top level is the GO level; the second level corresponds to the molecular function, biological process, and cellular component categories described above; and the tree branches out into the lower levels 3, 4, and 5.

FIG. 3 illustrates an example of a modified text-format GO tree according to the present invention. Strictly speaking, GO is not a tree but a mathematical graph known as a directed acyclic graph (acyclic digraph), and the pseudo-distance algorithm used in the present invention converts the GO structure into a GO tree structure. FIG. 3 shows a slightly modified example of such a text-format GO tree.

FIG. 4 shows an example of GO code conversion according to the present invention, illustrating the output produced by the GO identifier/GO code conversion unit (140). The numbers in the "Output" column of FIG. 4 are GO codes, and as shown there, each GO code consists of 15 digits. Because GO terms are text, it is not possible to judge directly how close one GO term is to another in the GO tree structure. The present invention therefore converts each GO term into a GO code, a predefined combination of digits. A GO code has 15 digits because the GO hierarchy has 15 levels: the first digit is the value at level 1, the second digit is the value at level 2, and so on.

An example of converting GO nodes into GO codes is described with reference to FIG. 2. The GO node with reference numeral 400 belongs to level 1 and is the first node of level 1. It is converted into the GO code "100000000000000": because it is the first GO node of level 1, its first digit is 1 and its second through fifteenth digits are 0.

The GO node with reference numeral 402 is at level 2 and is a child of node 400. It is converted into the GO code "110000000000000". Because node 402 belongs to level 2, its third through fifteenth digits are 0. Because it is a child of node 400, its first digit takes the value of its parent node unchanged. And because it is the first of the level-2 children of node 400, its second digit is 1. By the same principle, the GO node with reference numeral 404 is converted into the GO code "120000000000000".

The GO node with reference numeral 410 is at level 3, is a child of node 402, and is the second of the children of node 402. It is therefore converted into the GO code "112000000000000". By the same principle, the GO node with reference numeral 412 is converted into the GO code "121000000000000".

Because GO nodes are converted into GO codes in this way, a GO code carries information about the level to which the GO node belongs and about the parent of the GO node.
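The encoding described above can be sketched in Python. This is only an illustration under stated assumptions, not the implementation of the patent: the nested-dictionary tree, the name `assign_go_codes`, and the limit of nine siblings per parent (a consequence of using one decimal digit per level, which the description does not address) are hypothetical. The node labels are the reference numerals of FIG. 2.

```python
CODE_LENGTH = 15  # the description states that a GO code has 15 digits, one per level


def assign_go_codes(tree, prefix=""):
    """Map every node of a nested-dict tree to a zero-padded positional code.

    `tree` maps a node label to a dict of its children (hypothetical format).
    Each node inherits its parent's digits and appends its own 1-based
    position among its siblings; unused lower levels are filled with 0.
    """
    codes = {}
    for position, (node, children) in enumerate(tree.items(), start=1):
        if position > 9:
            # Assumption: one decimal digit per level, so at most 9 siblings.
            raise ValueError("more than 9 siblings cannot be encoded in one digit")
        path = prefix + str(position)
        if len(path) > CODE_LENGTH:
            raise ValueError("tree is deeper than the code length allows")
        codes[node] = path.ljust(CODE_LENGTH, "0")
        codes.update(assign_go_codes(children, path))
    return codes


# Part of the FIG. 2 tree, using the reference numerals as node labels.
fig2_tree = {"400": {"402": {"408": {}, "410": {}}, "404": {"412": {}}}}
fig2_codes = assign_go_codes(fig2_tree)

for node, code in fig2_codes.items():
    print(node, code)
```

Under these assumptions the sketch reproduces the codes given in the text for nodes 400, 402, 404, 410, and 412.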

In the present invention, the optimal branch is the lowest node among the nodes of the tree structure that contain the largest number of genes beneath them; it is a broad term that can represent each of the functions of all the genes beneath it. The present invention obtains the pseudo-distance between two nodes in the GO tree structure, and finding the optimal branch is the preliminary step for calculating that pseudo-distance.

An example of finding the optimal branch of two nodes is described with reference to FIG. 2. The ancestor nodes that contain both node 408 and node 410 are node 402 and node 400. As stated above, the lowest of the ancestor nodes that contain both nodes is taken as the optimal branch. Node 402 is the lower of the two, so the optimal branch of node 408 and node 410 is node 402.

With GO codes, the optimal branch can be found relatively easily. In FIG. 2, the GO code of node 408 is "111000000000000" and the GO code of node 410 is "112000000000000". The two GO codes agree up to the second digit, so the optimal branch lies at level 2: it is the first child (second digit 1) of the first node of level 1 (first digit 1). FIG. 6 shows an example of finding the optimal branch of several nodes. The method of obtaining the pseudo-distance from the optimal branch is described in detail later.
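Because a GO code records the path from the root digit by digit, the optimal branch of two codes is their shared leading digits padded with zeros. The following Python sketch illustrates this reading; the function names and the use of `functools.reduce` for more than two nodes are assumptions made for the example, not details taken from the patent.

```python
from functools import reduce


def optimal_branch(code_a: str, code_b: str) -> str:
    """Return the code of the lowest node that lies on the paths of both codes."""
    shared = 0
    for digit_a, digit_b in zip(code_a, code_b):
        # Stop at the first differing digit, or at the zero padding.
        if digit_a != digit_b or digit_a == "0":
            break
        shared += 1
    return code_a[:shared].ljust(len(code_a), "0")


def optimal_branch_of(codes):
    """Fold the pairwise rule over several codes (requires at least one code)."""
    return reduce(optimal_branch, codes)


node_408 = "111" + "0" * 12
node_410 = "112" + "0" * 12
node_412 = "121" + "0" * 12

print(optimal_branch(node_408, node_410))  # code of node 402 (level 2)
print(optimal_branch_of([node_408, node_410, node_412]))  # code of node 400 (level 1)
```

A result consisting only of zeros would mean that the codes share no ancestor, which cannot happen when all codes descend from a single root.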

Accurate GO terms are assigned to each gene by text mining several biological databases. DNA-level or protein-level information from UniGene, LocusLink, Swiss-Prot, MGI, and the like is used through a combination of direct identifier (ID) comparison and sequence similarity search, and the gene identifier (ID) conversion files that the GO Consortium provides for each database are used to assign GO terms to each gene.

Here, UniGene provides gene information at the DNA level from NCBI (National Center for Biotechnology Information); LocusLink, a result of NCBI's Reference Sequence Project, provides the function of each gene and representative sequence information; Swiss-Prot provides protein-level information from the Swiss Institute of Bioinformatics; and MGI provides mouse genome information.

The present invention makes efficient use of the GO tree structure to find the optimal branch, uses the optimal branch to obtain the pseudo-distance, and then uses the pseudo-distance to find the GO term that can represent a given cluster. That is, the GO term that can represent the cluster is found from the pseudo-distances between the GO identifiers assigned to the genes of the cluster and the nodes of the GO tree structure. To this end, each node of the GO tree structure is first encoded. As described above, these codes consist of 15 digits, and each digit gives the position at one level along the path up to the root. Because each node is given a unique code, identical terms that occur at several places in the tree structure are still distinguished from one another.

FIG. 5 schematically illustrates the principle of obtaining the pseudo-distance using GO according to the present invention.

As shown in FIG. 5, an appropriate weight is assigned to each level of the GO tree structure so that the similarity distance can be obtained from these GO codes.

pd(v1, v2) is the weight of the level at which the code of the optimal intersection point (optimal branch) formed by two nodes v1 and v2 lies; when v1 and v2 are identical, pd is defined as 0. That is,

pd(v1, v2) = weight of the optimal intersection code between v1 and v2 (if v1 ≠ v2)

pd(v1, v2) = 0 (if v1 = v2)

In FIG. 5, the optimal intersection point of the node with reference numeral 500 and the node with reference numeral 502 is the node with reference numeral 504. Node 504 lies at level 2, and the weight assigned to level 2 is 140. Accordingly, the similarity distance between node 500 and node 502 is 140.
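The FIG. 5 example can be reproduced with a minimal sketch. It assumes that a GO code is a root-to-node path padded with zeros to 15 positions and that the level of the optimal branch equals the length of the shared path prefix. The helper names (`make_code`, `branch_level`), the node paths, and every weight other than the level-2 value of 140 are hypothetical and are not taken from the patent.

```python
from typing import Tuple

Code = Tuple[int, ...]

# Hypothetical weight per tree level (index = level of the optimal branch).
# Only the level-2 value of 140 comes from the FIG. 5 example; every other
# value is a placeholder chosen so that deeper branches give smaller distances.
LEVEL_WEIGHTS = [200, 170, 140, 110, 80, 60, 45, 35, 25, 20, 15, 10, 8, 6, 4, 2]

CODE_LENGTH = 15  # each GO code is a combination of 15 numbers


def make_code(*path: int) -> Code:
    """Pad a root-to-node path with zeros to 15 positions (padding is an assumption)."""
    return tuple(path) + (0,) * (CODE_LENGTH - len(path))


def branch_level(v1: Code, v2: Code) -> int:
    """Level of the optimal branch: length of the shared, non-padding prefix."""
    level = 0
    for a, b in zip(v1, v2):
        if a != b or a == 0:
            break
        level += 1
    return level


def pd(v1: Code, v2: Code) -> int:
    """Similarity distance: 0 for identical nodes, else the weight of the branch level."""
    if v1 == v2:
        return 0
    return LEVEL_WEIGHTS[branch_level(v1, v2)]


# Hypothetical paths standing in for nodes 500 and 502, which share a level-2 ancestor.
node_500 = make_code(1, 2, 1)
node_502 = make_code(1, 2, 3)
print(pd(node_500, node_502))  # 140
```

Because only the shared prefix is inspected, the distance depends solely on how deep the common ancestor lies, not on how far below it the two nodes sit.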

Next, letting G denote the set of codes of a given cluster, the maximum similarity distance max_pd and the average similarity distance aver_pd are obtained.

For G = {v1, v2, v3, v4, ..., vn}, max_pd and aver_pd are defined as follows, and among the possible combinations of codes, the combination with the lowest max_pd and aver_pd scores is finally selected.

$$\mathrm{max\_pd}(G) = \max_{1 \le i \le j \le n} pd(v_i, v_j)$$

$$\mathrm{aver\_pd}(G) = \frac{\sum_{1 \le i < j \le n} pd(v_i, v_j)}{{}_{n}C_{2}} = \frac{2 \sum_{1 \le i < j \le n} pd(v_i, v_j)}{n(n-1)}$$
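The two definitions, together with the selection of the lowest-scoring code combination, can be sketched as follows. This is an illustration under assumptions: codes are written as unpadded root-to-node tuples, all weights except the level-2 value of 140 are placeholders, and "lowest score" is interpreted as comparing max_pd first and aver_pd second, which the text does not specify. `best_combination` is a hypothetical helper name.

```python
from itertools import combinations, product

# Placeholder level weights; only index 2 (140) comes from the FIG. 5 example.
WEIGHTS = [200, 170, 140, 110, 80, 60, 45, 35, 25, 20, 15, 10, 8, 6, 4, 2]


def pd(v1, v2):
    """Weight of the level of the deepest shared prefix; 0 for identical codes."""
    if v1 == v2:
        return 0
    level = 0
    for a, b in zip(v1, v2):
        if a != b:
            break
        level += 1
    return WEIGHTS[level]


def max_pd(codes):
    """Largest pairwise distance in the set (0 when fewer than two codes)."""
    return max((pd(a, b) for a, b in combinations(codes, 2)), default=0)


def aver_pd(codes):
    """Sum of pairwise distances divided by nC2 = n(n-1)/2."""
    n = len(codes)
    if n < 2:
        return 0.0
    total = sum(pd(a, b) for a, b in combinations(codes, 2))
    return 2 * total / (n * (n - 1))


def best_combination(codes_per_gene):
    """Pick one code per gene so that (max_pd, aver_pd) is lowest (assumed ordering)."""
    return min(product(*codes_per_gene), key=lambda g: (max_pd(g), aver_pd(g)))


G = [(1, 2, 1), (1, 2, 3), (1, 4)]
print(max_pd(G), aver_pd(G))  # 170 160.0
```

With the placeholder weights, the pairwise distances in `G` are 140, 170, and 170, so the maximum is 170 and the average is 480 / 3 = 160.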

Here, max_pd is a measure that can be used to evaluate a cluster roughly. The higher the level at which the optimal intersection point lies, the more likely the cluster is to contain inappropriate (bad) groupings that undermine the general commonality of its member genes.

aver_pd indicates how tightly the GO codes are grouped within a given cluster and how frequently similar codes are observed.

Meanwhile, the process of finding a GO term that can represent a cluster can be broadly divided into a basic process and an N-step selection process (N-level selective process).

The basic process computes max_pd and aver_pd against the genes of the cluster for every node in the GO tree structure. Its result identifies the GO term that best matches the cluster, which in turn reveals the approximate biological meaning of the given cluster.

The N-step selection process, by contrast, does not compute max_pd and aver_pd against the genes of the cluster for every node of the GO tree structure; instead, the user can specify a restriction. Specifically, the level of the GO tree structure at which max_pd and aver_pd are computed is specified in advance, so the best-matching GO term at a particular level is easy to observe, and biological meaning at lower levels, which is hard to see in the basic process, can be readily inferred. In particular, the N-step selection process can report the next-ranked groups in addition to the best code combination. This makes it possible to capture the diversity that arises when one gene is involved in two or more functions.
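The difference between the two processes can be illustrated with a rough sketch. The text does not spell out how a candidate node is scored against the cluster, so the following is an assumption: each candidate node is scored by the maximum and the mean of its distances to the cluster's codes, the node level is taken as the length of its unpadded code, and ties are broken by code order. The tree, the term names, the function name `rank_nodes`, and all weights except the level-2 value of 140 are hypothetical.

```python
# Placeholder level weights; only index 2 (140) comes from the FIG. 5 example.
WEIGHTS = [200, 170, 140, 110, 80, 60, 45, 35, 25, 20, 15, 10, 8, 6, 4, 2]


def pd(v1, v2):
    """Weight of the level of the deepest shared prefix; 0 for identical codes."""
    if v1 == v2:
        return 0
    level = 0
    for a, b in zip(v1, v2):
        if a != b:
            break
        level += 1
    return WEIGHTS[level]


def rank_nodes(tree, cluster_codes, level=None, top_k=3):
    """Rank candidate GO nodes by (max distance, mean distance) to the cluster.

    level=None scores every node (basic process); an integer restricts the
    candidates to that tree level (N-step selection process). Returning top_k
    entries also exposes the next-ranked candidates.
    """
    scored = []
    for code, term in tree.items():
        if level is not None and len(code) != level:
            continue
        dists = [pd(code, g) for g in cluster_codes]
        scored.append((max(dists), sum(dists) / len(dists), code, term))
    scored.sort(key=lambda s: s[:3])  # lowest scores first, ties broken by code
    return scored[:top_k]


# Hypothetical miniature tree: code -> term name.
tree = {
    (1,): "term root-1",
    (1, 2): "term A",
    (1, 2, 1): "term A1",
    (1, 2, 3): "term A3",
    (1, 4): "term B",
}
cluster = [(1, 2, 1), (1, 2, 3)]

basic = rank_nodes(tree, cluster)             # all levels
n_step = rank_nodes(tree, cluster, level=2)   # level 2 only
print(n_step[0][3])  # term A
```

Restricting the search to level 2 surfaces "term A", the common parent of both cluster codes, with "term B" as the next-ranked candidate, whereas the unrestricted ranking in this toy tree is led by the cluster's own leaf terms.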

FIG. 7 is a flowchart of a method for analyzing a DNA chip using GO according to the present invention. The method comprises: receiving a statistical clustering result of the DNA chip experiment results (S10) and assigning a Gene Ontology (GO) identifier to each gene belonging to each cluster (S20); converting the GO identifier assigned to each gene into a GO code using a GO code file (S30); selecting, for the GO codes, one of the basic process (S41) and the N-step selection process (S42) of the similarity distance algorithm (S40), specifying the required variables, and extracting the optimally matching GO term (S50); extracting the biological meaning using the optimally matching GO term (S60); and displaying the optimally matching GO term and its GO code (S70).

The overall method for biological analysis of the gene expression patterns of a DNA chip using the GO structure according to the present invention is described below with reference to FIG. 7.

First, from the result of statistical clustering of the gene expression patterns, a GO identifier and a GO code are assigned to each gene belonging to each cluster.

Specifically, when a clustering result is input (S10), GO identifiers are assigned to the genes in each cluster (S20) using a file in which GO IDs have been assigned to each gene in advance by mining several databases. Next, the GO IDs assigned to the genes of each cluster are converted into GO codes (S30) using a GO code file in which the entire GO tree structure has been encoded.
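Steps S20 and S30 amount to two table lookups, which can be sketched as below. The in-memory dictionaries stand in for the pre-built GO ID file and the GO code file, whose actual formats are not described here. The gene names, the GO IDs, the codes, and the function names `assign_go_ids` and `to_go_codes` are all made up for illustration.

```python
def assign_go_ids(clusters, gene_to_go_ids):
    """S20: attach the pre-assigned GO IDs to every gene of every cluster."""
    return {
        cluster_id: {gene: list(gene_to_go_ids.get(gene, [])) for gene in genes}
        for cluster_id, genes in clusters.items()
    }


def to_go_codes(assigned, go_code_table):
    """S30: replace each GO ID by its code(s) from the GO code table.

    One GO ID may map to several codes because the same term can appear at
    several places in the tree.
    """
    result = {}
    for cluster_id, genes in assigned.items():
        result[cluster_id] = {
            gene: [code for go_id in go_ids for code in go_code_table.get(go_id, [])]
            for gene, go_ids in genes.items()
        }
    return result


# Made-up inputs standing in for the clustering result and the two lookup files.
clusters = {"cluster_1": ["gene_a", "gene_b", "gene_c"]}
gene_to_go_ids = {"gene_a": ["GO:9000001"], "gene_b": ["GO:9000002"]}
go_code_table = {
    "GO:9000001": [(1, 2, 1)],
    "GO:9000002": [(1, 2, 3), (3, 1)],  # same term located in two places
}

assigned = assign_go_ids(clusters, gene_to_go_ids)
codes = to_go_codes(assigned, go_code_table)
print(codes["cluster_1"]["gene_b"])  # [(1, 2, 3), (3, 1)]
```

Genes without an annotation, such as `gene_c` here, simply receive an empty list, so later steps can skip them.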

Next, using the similarity algorithm, an appropriate process is selected (S40) from the basic process (S41) and the N-step selection process (S42), and the required variables are specified. The optimally matching GO term is then extracted using the pd values of the selected process (S50), and the corresponding biological meaning is extracted.

Next, the optimally matching GO term and GO code extracted for each cluster are displayed. The display may show the GO code of each gene, the optimally matching GO term, and summary information on the biological meaning in tabular form, or it may show a graphical result in the form of a tree structure.

Meanwhile, the similarity algorithm can equally be applied to protein chips, another kind of biochip. In that case a protein chip is analyzed instead of the DNA chip of FIGS. 1 and 7, and the similarity distance algorithm is applied in the same manner to proteins instead of genes.

Although the present invention has been described in detail with reference to the above embodiment, it is not limited thereto, and modifications and improvements are possible within the ordinary knowledge of a person skilled in the art.

According to the present invention, modeling of the GO hierarchy enables systematic, automated biological analysis of the gene expression patterns of DNA chip experiments. In addition, by using GO terms and the tree structure, the most common and ideal gene function of the genes belonging to a cluster, generated by statistical clustering of the DNA chip experiment results, can be extracted.

Claims

A DNA chip analysis system using Gene Ontology (GO), the system comprising:

means for receiving a statistical clustering result of DNA chip experiment results and assigning a Gene Ontology (GO) identifier to each gene belonging to each cluster;

means for converting each GO identifier assigned to a gene belonging to the cluster into a GO code using a GO code file;

means for obtaining an average similarity distance and a maximum similarity distance of the genes using the GO codes of the genes belonging to the cluster;

means for extracting an optimally matching GO term, according to one of a basic process and an N-step selection process, using the average similarity distance and the maximum similarity distance between the genes included in the cluster and the GO nodes of the GO tree structure; and

means for extracting the biological meaning of the cluster using the optimally matching GO term.

The system of claim 1,

further comprising visualization means for displaying the extracted optimally matching GO term, its GO code, and its biological meaning.

The system of claim 2,

wherein the visualization means displays the optimally matching GO term, its GO code, and summary information on the biological meaning in tabular form, or displays a graphical result in the form of a tree structure.

(Deleted)

The system of claim 1,

wherein the similarity distance pd(v1, v2) between two nodes v1 and v2 is the weight of the level of the code of the optimal intersection point (optimal branch) formed by the two nodes v1 and v2, and is defined as 0 when v1 and v2 are identical;

wherein, when G denotes the set of codes of a given cluster, each optimal intersection point is obtained using the maximum similarity distance max_pd and the average similarity distance aver_pd, which for G = {v1, v2, v3, v4, ..., vn} are

$$\mathrm{max\_pd}(G) = \max_{1 \le i \le j \le n} pd(v_i, v_j)$$

$$\mathrm{aver\_pd}(G) = \frac{\sum_{1 \le i < j \le n} pd(v_i, v_j)}{{}_{n}C_{2}} = \frac{2 \sum_{1 \le i < j \le n} pd(v_i, v_j)}{n(n-1)}$$

and wherein the combination with the lowest max_pd and aver_pd scores among the possible code combinations is finally selected.

The system of claim 5,

wherein the maximum similarity distance max_pd is used to evaluate the cluster roughly, and the higher the level at which the optimal intersection point lies, the more likely the cluster is to contain inappropriate (bad) groupings that undermine the general commonality of its member genes.

The system of claim 5,

wherein the average similarity distance aver_pd indicates how tightly the GO codes are grouped within a given cluster and how frequently similar codes are observed.

(Deleted)

The system of claim 1,

wherein the basic process performs the calculation using the maximum similarity distance max_pd and the average similarity distance aver_pd for all nodes of the GO tree structure, and the result of the basic process reveals the approximate biological meaning of a given cluster.

The system of claim 1,

wherein the N-step selection process performs the calculation using the maximum similarity distance and the average similarity distance for the GO nodes of a specific, selected level of the GO tree structure.

(Deleted)

A DNA chip analysis method using Gene Ontology (GO), the method comprising:

a) assigning a GO identifier to each gene belonging to each cluster of a statistical clustering result of DNA chip experiment results;

b) converting the GO identifiers assigned to the genes belonging to the cluster into GO codes using a GO code file;

c) obtaining an average similarity distance and a maximum similarity distance of the genes using the GO codes of the genes belonging to the cluster;

d) extracting an optimally matching GO term, according to one of a basic process and an N-step selection process, using the average similarity distance and the maximum similarity distance between the genes included in the cluster and the GO nodes of the GO tree structure; and

e) extracting the biological meaning of the cluster using the optimally matching GO term.

The method of claim 13,

further comprising displaying the extracted optimally matching GO term, its GO code, and its biological meaning.

The method of claim 13,

wherein the similarity distance is given by

$$\mathrm{max\_pd}(G) = \max_{1 \le i \le j \le n} pd(v_i, v_j)$$

$$\mathrm{aver\_pd}(G) = \frac{\sum_{1 \le i < j \le n} pd(v_i, v_j)}{{}_{n}C_{2}} = \frac{2 \sum_{1 \le i < j \le n} pd(v_i, v_j)}{n(n-1)}$$

and wherein the combination with the lowest max_pd and aver_pd scores among the possible code combinations is finally selected.

The method of claim 15,

wherein the maximum similarity distance max_pd is used to evaluate the cluster roughly, and the higher the level at which the optimal intersection point lies, the more likely the cluster is to contain inappropriate (bad) groupings that undermine the general commonality of its member genes.

The method of claim 15,

wherein the average similarity distance aver_pd indicates how tightly the GO codes are grouped within a given cluster and how frequently similar codes are observed.

(Deleted)