JP2007037421A

JP2007037421A - Gene set for predicting the presence or absence of lymph node metastasis from colorectal cancer

Info

Publication number: JP2007037421A
Application number: JP2005222995A
Authority: JP
Inventors: Ichiro Takemasa; Takenobu Tazaki; Hikari Sonoda; Kenichi Matsubara; Hirofumi Higuchi
Original assignee: Chemo Sero Therapeutic Research Institute Kaketsuken; Osaka University NUC; Shionogi and Co Ltd; DNA Chip Research Inc
Current assignee: Chemo Sero Therapeutic Research Institute Kaketsuken; Shionogi and Co Ltd; DNA Chip Research Inc; University of Osaka NUC
Priority date: 2005-08-01
Filing date: 2005-08-01
Publication date: 2007-02-15
Also published as: WO2007015459A1

Abstract

[Problem] To provide a method for predicting the presence or absence of lymph node metastasis of colorectal cancer, and a gene set usable in that method.
[Solution] A method of selecting a gene set, comprising the following steps (1) to (4): (1) analyzing gene expression information from primary colorectal cancer tissue of patients whose lymph node metastasis status has been established by histopathological determination, and selecting, for each analysis method, a gene group that can classify the presence or absence of lymph node metastasis with a correct classification rate of 75% or higher; (2) selecting, from the gene groups selected by the analysis methods used in (1), the common genes selected by every analysis method; (3) analyzing the gene expression information to select, from among arbitrary combinations of two or more genes, gene combinations that indicate the classification of lymph node metastasis status and show an interaction; and (4) performing variable selection in a logistic regression model with the common genes and the gene combinations as explanatory variables and the presence or absence of lymph node metastasis as the response.
[Selected drawing] None

Description

The present invention relates to a group of genes useful for predicting the presence or absence of lymph node metastasis of colorectal cancer, and to a method of using the expression information of those genes.

Colorectal cancer has a particularly high incidence in developed countries, is increasing year by year in Japan, and is one of the leading causes of cancer-related death. According to a statistical report (see, for example, Non-Patent Document 1), 37% of colorectal cancer patients have cancer confined to the primary lesion without metastasis, another 37% have regional disease with metastasis only to the regional lymph nodes, and the remainder have, for example, distant metastases.

The Dukes classification, currently the most widely used clinical staging system for colorectal cancer, is based on pathological findings such as the depth of tumor invasion into the colonic wall and the extent of metastasis to regional lymph nodes, and its correlation with prognosis is beyond doubt. However, the determination of lymph node metastasis, which strongly influences the classification result, still relies on the classical histopathological technique of examining under a microscope specimens prepared from only a portion of the many resected lymph nodes. As shown by a report that metastasis is later found in 20 to 40% of patients judged negative for lymph node metastasis by this technique (see, for example, Non-Patent Document 2), the conventional method of determining lymph node metastasis has not always been sufficiently accurate.

Colorectal cancer is one of the cancers for which molecular biological research, such as work on the mechanism of multistage carcinogenesis, is most advanced, and there are many reports on individual genes such as APC, K-ras, p53, and DCC. However, focusing on any one of these genes is not sufficient to capture the individual character of a colorectal cancer. In recent years, therefore, attempts have begun to obtain useful new findings by acquiring expression information for a very large number of genes at once, for example with DNA microarrays.

Alizadeh et al. measured, by DNA microarray, B lymphocytes isolated from the peripheral blood of patients with diffuse large B-cell lymphoma and performed hierarchical clustering of the resulting gene expression data. They found that the peripheral blood B lymphocytes of these patients fall into two types: those with a gene expression pattern resembling B cells in the germinal centers of lymphoid tissue, and those with a pattern resembling B cells activated in vitro (Non-Patent Document 3). Kaplan-Meier analysis of survival showed that patients whose B cells had the latter expression pattern had a worse prognosis than patients whose B cells had the former pattern. Moreover, the result of the authors' clustering of gene expression information correlated better with prognosis than prognosis prediction based on conventional pathological diagnosis. The work of Alizadeh et al. is significant in that it derived a clinically usable regularity from gene expression information. However, whether that regularity also applies to entirely new clinical cases was not verified, and the possibility that the result holds only within the scope of that paper cannot be ruled out.

Khan et al. reported that four types of cancer belonging to the small round blue cell tumors, which are difficult to distinguish histologically, can be accurately distinguished by analyzing gene expression information with an artificial neural network (Non-Patent Document 4). In that report, an artificial neural network model derived from a randomly drawn subset of the data was verified to give accurate results when data from test samples were input. This suggests that the model is not limited to the data in that paper but is generally applicable to distinguishing the four types of small round blue cell tumor. However, results obtained with an artificial neural network model tend not to be readily accepted, because their mathematical basis cannot be clearly explained.

A recent example of a DNA microarray study aimed at identifying molecular targets involved in liver metastasis of colorectal cancer is the report by Yanagawa et al. (Non-Patent Document 5). The authors used oligo DNAs designed from the nucleotide sequences of human cDNAs registered in public gene databases as primers, performed PCR with human cDNA as the template, and obtained 9,121 amplified cDNA fragments. Using a DNA microarray on which these cDNA fragments were printed as probes, they examined the gene expression profiles of primary colorectal cancer lesions and colorectal cancer liver metastases isolated from 10 colorectal cancer patients. They found 40 genes whose expression was increased and 7 genes whose expression was decreased in liver metastases relative to primary lesions, and thereby identified a set of candidate genes that may be involved in liver metastasis of colorectal cancer.

As for gene sets involved in liver metastasis of colorectal cancer, the following are known (Patent Document 1): a method of identifying a gene set effective for predicting liver metastasis of colorectal cancer by statistically analyzing, with a discriminant analysis technique, the expression information of genes specifically expressed in primary colorectal cancer tissue as measured by the DNA microarray method; the gene set identified by that method; and a method of predicting liver metastasis of colorectal cancer using the expression information of that gene set in primary colorectal cancer tissue. That gene set and method provide information useful for predicting metachronous liver metastasis of colorectal cancer and are a good starting point for identifying important genes specifically expressed in colorectal cancer. However, because lymph node metastasis of colorectal cancer is pathologically an entirely different condition from liver metastasis, the gene set and method for liver metastasis cannot simply be applied as they are to lymph node metastasis.

It has also been shown that candidate genes thought to be related to the growth and progression of colorectal cancer can be identified by constructing an original DNA microarray from probes selected from cDNA libraries prepared from primary colorectal cancer tissue, colorectal cancer liver metastasis tissue, and normal colorectal mucosa, and using it to analyze gene expression in colorectal cancer tissue (Non-Patent Document 6).

With regard to lymph node metastasis of colorectal cancer, on the other hand, the determination still relies, as described above, on the classical histopathological technique of examining under a microscope specimens prepared from a portion of the many resected lymph nodes, and the accuracy of this method is not necessarily sufficient. Postoperative adjuvant therapy given after resection of the primary colorectal cancer is known to improve the prognosis of patients with lymph node metastasis, but it can be accompanied by side effects such as loss of appetite, upper abdominal discomfort, and nausea. From the standpoint of quality of life (QOL) and medical cost, whether it is needed must be judged in light of each patient's condition and disease status. A more accurate method of determining lymph node metastasis could therefore serve as a useful indicator for decisions about postoperative adjuvant therapy and would ultimately benefit patients by enabling them to receive appropriate treatment.

Patent Document 1: JP 2004-33082 A
Non-Patent Document 1: Troisi R.J. et al., 1999, Cancer, vol. 85, p. 1670-1676
Non-Patent Document 2: Cohen A.M. et al., 1997, Curr Probl Surg., vol. 34, p. 601-676
Non-Patent Document 3: Alizadeh et al., 2000, Nature, vol. 403, p. 503-511
Non-Patent Document 4: Khan et al., 2001, Nature Medicine, vol. 7, p. 673-679
Non-Patent Document 5: Yanagawa et al., 2001, Neoplasia, vol. 3, No. 5, p. 395-401
Non-Patent Document 6: Takemasa et al., 2001, Biochem. Biophys. Res. Commun., vol. 285, p. 1244-1249

As described above, the conventional method of determining the presence or absence of lymph node metastasis of colorectal cancer, in which multiple lymph nodes around the cancer are resected and examined under a microscope, has had a problem with the accuracy of its results.

To improve on this point, an object of the present invention is to provide a method of predicting the presence or absence of lymph node metastasis of colorectal cancer by examining the gene expression profile of primary colorectal cancer tissue. A further object of the present invention is to provide a gene set usable for determining lymph node metastasis of colorectal cancer, and a discriminant usable for determining the presence or absence of lymph node metastasis from the expression information of that gene set, so that metastasis of cancer cells to lymph nodes can be predicted.

As a result of intensive studies to achieve the above objects, the present inventors constructed an original DNA microarray using probes selected from cDNA libraries prepared from primary colorectal cancer tissue, colorectal cancer liver metastasis tissue, and normal colorectal mucosa. Through statistical analysis of gene expression data obtained from primary colorectal cancer lesions with this DNA microarray, they succeeded in finding a gene set usable for predicting the presence or absence of lymph node metastasis, and a discriminant that predicts it from the expression levels of those genes, and thereby completed the present invention.

That is, the present invention provides the following methods of selecting a gene set for predicting the presence or absence of lymph node metastasis of colorectal cancer.
1. A method of selecting a gene set for predicting the presence or absence of lymph node metastasis of colorectal cancer, comprising the following steps (1) to (4):
(1) analyzing gene expression information from primary colorectal cancer tissue of patients whose lymph node metastasis status has been established by histopathological determination, using four or more analysis methods including at least one supervised learning method, and selecting, for each analysis method, a gene group that can classify the presence or absence of lymph node metastasis with a correct classification rate of 75% or higher;
(2) selecting, from the gene groups selected by the respective analysis methods used in (1), the common genes selected by every analysis method;
(3) analyzing the gene expression information to select, from among arbitrary combinations of two or more genes, gene combinations that indicate the classification of lymph node metastasis status and show an interaction; and
(4) performing variable selection in a logistic regression model with the common genes and the gene combinations as explanatory variables and the presence or absence of lymph node metastasis as the response.
2. The method according to 1 above, wherein the analysis methods of (1) include at least one selected from the group consisting of (a) Support Vector Machine, (b) an extension of Principal Component Analysis Artificial Neural Network, (c) a combination of Hierarchical Cluster Analysis and Stepwise Logistic Discrimination, and (d) a combination of Classification And Regression Tree and Logistic Discrimination.
3. The method according to 1 or 2 above, wherein the analysis method of (3) is (d) a combination of Classification And Regression Tree and Logistic Discrimination.
4. The method according to any one of 1 to 3 above, wherein the variable selection method of (4) is a stepwise variable selection method.
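Step (2) of the selection method amounts to a set intersection over the gene groups chosen by each analysis method. The following is a minimal sketch of that single step, not the inventors' implementation: the function name, the method labels, and the gene IDs `g1` to `g5` are hypothetical, and the 75% check simply restates the threshold from step (1).

```python
def select_common_genes(groups_by_method, accuracy_by_method, min_accuracy=0.75):
    """Return the genes selected by every analysis method (step 2).

    groups_by_method:   dict mapping method name -> iterable of gene IDs
    accuracy_by_method: dict mapping method name -> correct classification rate
    """
    # Step (1) requires each method's gene group to reach the threshold.
    for method, accuracy in accuracy_by_method.items():
        if accuracy < min_accuracy:
            raise ValueError(f"{method} is below the {min_accuracy:.0%} threshold")
    gene_sets = [set(genes) for genes in groups_by_method.values()]
    return set.intersection(*gene_sets) if gene_sets else set()


# Hypothetical gene groups from four analysis methods.
groups = {
    "svm": ["g1", "g2", "g3"],
    "pca_ann": ["g2", "g3", "g4"],
    "hca_sld": ["g2", "g3"],
    "cart_ld": ["g3", "g2", "g5"],
}
accuracy = {"svm": 0.80, "pca_ann": 0.78, "hca_sld": 0.76, "cart_ld": 0.81}
common = select_common_genes(groups, accuracy)
```

The common genes then enter step (4) as main-effect candidates, alongside the interacting gene combinations from step (3).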

The present invention also provides the following gene sets for predicting the presence or absence of lymph node metastasis of colorectal cancer.
5. A gene set for predicting the presence or absence of lymph node metastasis of colorectal cancer, selected by the method according to any one of 1 to 4 above.
6. The gene set according to 5 above, comprising at least the genes represented by the database accession numbers (serial numbers) NM_003404 (G1592), NM_002128 (G2645), NM_052868 (G3031), NM_005034 (G3177), NM_001540 (G3753), NM_005722 (G3826), and NM_015315 (G4370).

The present invention further provides the following methods of predicting the presence or absence of lymph node metastasis of colorectal cancer using the gene set selected above.
7. A method for predicting the presence or absence of lymph node metastasis of colorectal cancer, characterized by using the gene set according to 5 or 6 above.
8. The method according to 7 above, characterized by using the following discriminant:
D = 0.2307 - 2.7132 × expression level of NM_003404 (G1592)
  + 8.9509 × expression level of NM_052868 (G3031)
  + 8.7975 × expression level of NM_005722 (G3826)
  - 2.3098 × expression level of NM_015315 (G4370)
  + 3.5126 × expression level of NM_002128 (G2645) × expression level of NM_005034 (G3177)
  - 8.8226 × expression level of NM_001540 (G3753) × expression level of NM_005722 (G3826)
(Lymph node metastasis is judged present when D > 0 and absent when D ≤ 0.)
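The discriminant in item 8 can be evaluated directly once the seven expression levels are known. The sketch below is illustrative only: the function names and the dictionary layout keyed by serial number are assumptions, and the intercept 0.2307 follows the Japanese text of the claim. Expression levels are assumed to be the standardized log(Cy3/Cy5) values described later in the specification.

```python
INTERCEPT = 0.2307  # constant term as given in the Japanese original


def discriminant(expr):
    """Compute D from a dict mapping serial number (e.g. "G1592") to expression level."""
    return (
        INTERCEPT
        - 2.7132 * expr["G1592"]                  # NM_003404
        + 8.9509 * expr["G3031"]                  # NM_052868
        + 8.7975 * expr["G3826"]                  # NM_005722
        - 2.3098 * expr["G4370"]                  # NM_015315
        + 3.5126 * expr["G2645"] * expr["G3177"]  # NM_002128 x NM_005034 interaction
        - 8.8226 * expr["G3753"] * expr["G3826"]  # NM_001540 x NM_005722 interaction
    )


def predict_metastasis(expr):
    """Return True (metastasis present) when D > 0, False when D <= 0."""
    return discriminant(expr) > 0
```

Four genes enter as main effects and two gene pairs enter as products, which is how the interacting combinations from step (3) appear in the final logistic model.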

Methods of analyzing gene groups usable in the present invention include the following:
1. Hierarchical Cluster Analysis
2. Logistic Discrimination (including variable selection methods)
3. Classification And Regression Tree (CART)
4. Principal Component Analysis Artificial Neural Network (including extended methods)
5. Projection Pursuit for supervised classification
6. Support Vector Machine (SVM)
7. Self Organizing Map (SOM)
8. AdaBoost
9. A gene selection process combining two or more of these methods.

An example of a method for analyzing the gene expression information used in the present invention is logistic analysis (Logistic Discrimination).

Variable selection methods usable in the present invention include the following:
1. Stepwise selection (sequential addition and removal; stepwise)
2. Forward selection (forward)
3. Backward elimination (backward)

The present invention provides a gene set useful for determining whether colorectal cancer cells are likely to have metastasized to nearby lymph nodes at the time a colorectal cancer patient undergoes resection of the primary lesion, together with a discriminant for predicting the presence or absence of lymph node metastasis from the expression information of those genes. According to the method of the present invention, good results in determining lymph node metastasis can be obtained by analyzing the expression information of the gene set in primary colorectal cancer tissue with a logistic regression equation. It thus becomes possible to predict, at the time of resection of the primary colorectal cancer, whether cancer cells have metastasized to lymph nodes.

The method of the present invention is characterized by a gene set effective for predicting the presence or absence of lymph node metastasis at the time of resection of the primary colorectal cancer, and by a discriminant that predicts it from the expression levels of that gene set.

A gene set useful for predicting the presence or absence of lymph node metastasis is obtained by comprehensively examining gene expression in many samples of primary colorectal cancer tissue and selecting from the results a set of genes usable for the determination. Various methods are available for such comprehensive gene expression analysis, including microarrays, Northern analysis, real-time PCR methods such as ATAC-PCR (Kato et al., Nuc. Acids Res., vol. 25, p. 4694-4696, 1997) and TaqMan PCR (Applied Biosystems), and SAGE (Velculescu et al., Science, vol. 270, p. 484-487, 1995).

In a preferred embodiment of the present invention, the analysis is carried out with a DNA microarray according to the method described in JP 2004-33082 A. More specifically, gene expression data were obtained with this DNA microarray for a total of 150 tissues collected with informed consent: 63 primary colorectal cancer tissues from patients in whom histopathological examination at the time of primary lesion resection showed lymph node metastasis, and 87 primary tissues from patients in whom no lymph node metastasis was found. As a reference control, gene expression data obtained from normal colorectal mucosa surrounding the primary colorectal cancer tissue in 40 cases were used.

The gene expression data are obtained by hybridizing fluorescently labeled cDNA, prepared from total RNA extracted from cancer tissue, to the DNA microarray, and detecting and quantifying with a dedicated scanner the fluorescent signal emitted by the labeled cDNA hybridized to the probes on the array. A more specific procedure is described below.

Total RNA can be extracted from colorectal cancer tissue or normal colorectal mucosa with a reagent such as TRIzol reagent (GIBCO BRL) or ISOGEN (Nippon Gene), following the method described in the package insert of the reagent. The total RNA thus prepared can be used directly for preparing the labeled cDNA described below. Alternatively, polyadenylated RNA (hereinafter also referred to as "mRNA") can be purified from the total RNA with a commercially available kit such as the mRNA Purification Kit (Amersham BioSciences), following the supplied protocol, and used for preparing the labeled cDNA.

Cy3-labeled cDNA derived from primary colorectal cancer tissue (hereinafter also referred to as "Cy3 cDNA") is prepared by adding reverse transcriptase to a mixture containing the above total RNA or mRNA, oligo dT primer, dNTPs, and Cy3-labeled dUTP, and incubating at 37 to 45°C for 1 to 3 hours, preferably at 42°C for 1 hour. Cy5-labeled cDNA derived from normal colorectal mucosa (hereinafter also referred to as "Cy5 cDNA"), used as the reference control, is prepared in the same way from total RNA of normal colorectal mucosa. The Cy3 cDNA and Cy5 cDNA thus obtained are each heat-treated in a denaturing solution at 65 to 70°C for 10 to 20 minutes, preferably at 70°C for 10 minutes, neutralized, and then mixed in equal amounts (hereinafter this mixture is also referred to as "Cy5/Cy3 cDNA"). As the denaturing solution, 0.5 N NaOH or 1 N NaOH containing 50 mM EDTA, or the like, can be used; 0.5 N NaOH containing 50 mM EDTA is preferred. The Cy5/Cy3 cDNA is purified with a commercially available kit such as Microcon-30 (Amicon), following the supplied protocol.

Hybridization of the Cy5/Cy3 cDNA to the probes printed on the DNA microarray is carried out as follows. First, the DNA microarray is heated to denature the probes. A hybridization solution containing Cy5/Cy3 cDNA that has been heated at 100°C for 2 minutes is then dropped onto the array, a cover glass is placed over it, and the array is put in a sealed container and hybridized. When the hybridization solution contains formamide, hybridization is carried out at 42°C for 12 hours or longer; when it does not, hybridization is carried out at about 68°C for 12 hours or longer. After hybridization, the Cy3 and Cy5 fluorescence is scanned with an instrument such as the Scan Array 4000 (GSI Lumonics) to obtain the fluorescence pattern as image data. These image data are then analyzed with dedicated microarray analysis software such as Quantarray software (GSI Lumonics) to obtain the Cy3 and Cy5 fluorescence intensities of all probes as numerical data in text format.

Although the preferred embodiment of the present invention uses the above DNA microarray, similar results can be obtained with synthetic DNA of a chain length effective for hybridization. For example, based on the gene names or sequence information disclosed in the present invention, synthetic DNA about 20 nucleotides or longer consisting of a partial sequence of a gene can be immobilized as a probe on a glass substrate or the like and used instead.

In general, data with low fluorescence intensity are strongly affected by background, so the data of probes with low fluorescence intensity are rejected and treated as missing values, for example by keeping only the 3,000 data points with the highest fluorescence intensity. Next, the data are standardized to correct for differences in the detection sensitivity settings for Cy3 and Cy5 that can arise during scanning. Specifically, the ratio Cy3/Cy5 of the Cy3 and Cy5 fluorescence intensities is calculated for each probe and converted to its base-2 logarithm (hereinafter "log(Cy3/Cy5)"), and the median of all log(Cy3/Cy5) values is subtracted from the log(Cy3/Cy5) value of each probe to give the standardized log(Cy3/Cy5) value. The standardized log(Cy3/Cy5) value can be used as the expression level of each gene.
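The standardization just described (base-2 log ratio, then subtraction of the array-wide median) can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the function name is hypothetical, `None` is used here to mark a probe already rejected as a missing value, and intensities are assumed to be positive.

```python
import math
from statistics import median


def normalize_log_ratios(cy3, cy5):
    """Return median-centered log2(Cy3/Cy5) values for one array.

    cy3, cy5: equal-length lists of fluorescence intensities; None marks
    a probe rejected as a missing value.
    """
    logs = [
        math.log2(a / b) if a is not None and b is not None else None
        for a, b in zip(cy3, cy5)
    ]
    # The median is taken over the observed probes only.
    center = median(v for v in logs if v is not None)
    return [v - center if v is not None else None for v in logs]
```

Centering on the median makes the typical probe on each array read as zero, so arrays scanned with different Cy3 and Cy5 sensitivity settings become comparable.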

The standardized numerical data obtained in this way for all cases (hereinafter sometimes "standardized numerical data") are first pooled, and the following selection is applied to exclude probes with many missing values from further analysis. Only probes for which data were obtained in at least 128 cases, that is, 85% or more of the 150 cases analyzed by microarray, are retained, so that every retained probe has no more than 15% missing values. A second selection is then applied to exclude the influence of individual genetic background. For each probe, the variance across the 150 primary colorectal cancer samples and the variance across the 12 normal colorectal mucosa samples are calculated, and only probes for which the former exceeds 1.1 times the latter are retained.
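The two probe filters can be expressed compactly. The sketch below is illustrative only: the function name and the `None` convention for missing values are assumptions, and the specification does not say whether the sample or the population variance was used (the sample variance is assumed here).

```python
from statistics import variance

def keep_probe(tumor_values, normal_values, min_present=0.85, min_ratio=1.1):
    """Apply both selection criteria to one probe.

    tumor_values / normal_values: standardized log ratios, None = missing value.
    """
    present = [v for v in tumor_values if v is not None]
    # Filter 1: data must be available in at least 85% of the tumor cases.
    if len(present) < min_present * len(tumor_values):
        return False
    normal_present = [v for v in normal_values if v is not None]
    if len(present) < 2 or len(normal_present) < 2:
        return False  # a variance cannot be computed
    # Filter 2: tumor variance must exceed 1.1 times the normal-mucosa variance.
    return variance(present) > min_ratio * variance(normal_present)
```

With 150 tumor cases, the first test requires at least 128 observed values, because 0.85 × 150 = 127.5.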

These selection steps leave standardized numerical data for 2,121 probes for further analysis. Because missing values remaining in these data would interfere with the subsequent statistical analysis, they must be imputed by some method.

Various imputation methods can be applied. In one method, a missing value is replaced by the mean of all data for the case containing the missing value, plus the mean over all cases of the gene containing the missing value, minus the mean of all genes over all cases. Alternatively, Troyanskaya et al. (Bioinformatics, vol. 17, p. 520-525, 2001) describe three imputation methods: the K-Nearest Neighbors (KNN) method, the Singular Value Decomposition (SVD) based method, and the row average method. All missing values can be imputed by applying any one of these methods.
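The first imputation rule (case mean plus gene mean minus overall mean) can be written out as below. This is a sketch under stated assumptions: the matrix layout (`matrix[g][c]` for gene g and case c), the helper names, and the `None` marker are hypothetical, and every gene and every case is assumed to have at least one observed value.

```python
def _mean(values):
    """Mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    return sum(observed) / len(observed)

def impute_additive(matrix):
    """Fill each missing value with case mean + gene mean - overall mean."""
    overall = _mean(v for row in matrix for v in row)
    gene_means = [_mean(row) for row in matrix]          # one per row (gene)
    case_means = [_mean(col) for col in zip(*matrix)]    # one per column (case)
    return [
        [v if v is not None else case_means[c] + gene_means[g] - overall
         for c, v in enumerate(row)]
        for g, row in enumerate(matrix)
    ]

# Two genes x two cases, with one missing value.
filled = impute_additive([[1.0, 2.0], [3.0, None]])
```

In this toy matrix the overall mean is 2.0, the mean of the second gene is 3.0, and the mean of the second case is 2.0, so the imputed value is 2.0 + 3.0 - 2.0 = 3.0.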

The standardized numerical data with missing values imputed (hereinafter also "standardized gene expression data") are free of background effects, free of errors caused by differences in Cy3 and Cy5 detection sensitivity, and free of missing values. They contain expression information only for genes whose variation in expression in primary colorectal cancer, relative to normal colorectal mucosa, exceeds the variation attributable to individual differences. The reliability of the subsequent statistical analysis can thereby be ensured.

There is no established general method for statistically processing the large volume of gene expression data obtained from DNA microarray measurements to derive a gene set suited to a given purpose, and in practice considerable ingenuity is required of the researcher. In the present invention, the data are first analyzed by four different approaches, and a group of genes usable for predicting the presence or absence of lymph node metastasis is identified by each.

The four approaches used in the present invention are:
(a) Support Vector Machine (SVM) (Hastie et al., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2001),
(b) an extension of Principal Component Analysis/artificial Neural Network (PCA/aNN) (Khan et al., Nature Medicine, vol. 7, p. 673-679, 2001),
(c) Hierarchical Cluster Analysis (HCA) + Stepwise Logistic Discrimination, and
(d) Classification And Regression Tree (CART) (Breiman et al., Classification and Regression Trees, Wadsworth, 1983) + Logistic Discrimination.

To ensure statistical reliability when identifying the gene groups, all data are divided into two groups, one for identifying predictive genes and one for evaluation. Specifically, the data from the 150 cases are divided into a group of 99 cases (42 with and 57 without lymph node metastasis) and a group of 51 cases (21 with and 30 without lymph node metastasis). The data from the 99 cases are used to identify genes for predicting the presence or absence of lymph node metastasis and to establish a discriminant, and the discriminant is evaluated by using it to classify the data from the 51 cases. Hereinafter, the 99-case data used to identify predictive genes and establish the discriminant are referred to as "training data", and the 51-case data used to evaluate the discriminant are referred to as "test data".
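A split with these class proportions can be produced as sketched below. The specification does not describe how the random split was implemented, so the function name, the seeding, and the label encoding (1 = metastasis present, 0 = absent) are assumptions for illustration.

```python
import random

def stratified_split(labels, n_train_pos=42, n_train_neg=57, seed=0):
    """Split case indices into training and test sets with fixed class counts."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)
    train = sorted(pos[:n_train_pos] + neg[:n_train_neg])
    test = sorted(pos[n_train_pos:] + neg[n_train_neg:])
    return train, test

# 63 metastasis-positive and 87 metastasis-negative cases, as in the specification.
labels = [1] * 63 + [0] * 87
train, test = stratified_split(labels)
```

Calling the function with different `seed` values yields different random splits that all keep the 42/57 and 21/30 composition.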

For approaches (a), (c), and (d), the two-way split described above is repeated 100 times at random to account for sampling variation in the split. The analysis thus identifies 100 gene groups, and the genes identified most frequently are adopted.

Approach (b), in view of its heavy computational load, is carried out only for the first split. The 99-case training data are, however, randomly divided in a 2:1 ratio 1,250 times, and principal component analysis and neural network training are repeated on these divisions. After training, the genes are ranked by their sensitivity in discriminating the presence or absence of lymph node metastasis and then narrowed down. Training starts from 2,121 genes and proceeds through successively reduced sets of 1,536, 768, 384, 192, 96, 48, and 24 genes.

This analysis yields, for each approach, the number of genes in the gene set and the mean correct classification rate on the test data using the established discriminant, where the correct classification rate is (number of cases in which the discriminant result agrees with the histopathological result / number of test cases) × 100 (%). For (a), the gene set contains 144 genes and the correct classification rate is 80.2% (standard deviation 5.6%); for (b), 192 genes and 90.2%; for (c), 133 genes and 78.6% (standard deviation 6.2%); and for (d), 138 genes and 86.3% (standard deviation 4.5%). Sixteen genes are common to the gene sets selected by all four approaches.

Next, to reduce the number of genes used for prediction without lowering the correct classification rate, the analysis is restricted to the 16 common genes, and a statistical analysis is performed that considers, in addition to the contribution of each of the 16 genes (hereinafter "main effect"), the interactions between pairs of genes. Discrimination rules are thus searched over a wider space that includes interactions between genes as well as the main effects of individual genes, which is expected to maintain high discrimination performance.

To search for interactions, CART analysis with the presence or absence of lymph node metastasis as the response is performed again on each of the 100 training data sets used in the analysis above. The rpart package of the free software R is used for this CART analysis, with all parameters left at their default values. In each analysis, three to five genes appear as variables that direct the splitting of the data.

When, for example, three genes appear, three gene pairs are possible, and all of them are treated as interactions. Likewise, four genes give six pairs and five genes give ten pairs. To cover as many candidates as possible, the 18 gene pairs that appear in 12 or more of the 100 analyses are selected as interaction candidates.
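The pair-counting step amounts to enumerating all two-gene combinations per tree and tallying them across runs. The following sketch shows that bookkeeping only; the function name and the gene labels "A" to "D" are hypothetical placeholders, and the input (the list of splitting genes per CART run) is assumed to have been extracted beforehand.

```python
from collections import Counter
from itertools import combinations

def candidate_pairs(genes_per_run, min_count=12):
    """Count gene pairs over all runs and keep those seen at least min_count times."""
    counts = Counter()
    for genes in genes_per_run:
        # Every unordered pair of genes appearing in one tree counts as an interaction.
        counts.update(combinations(sorted(set(genes)), 2))
    return {pair: n for pair, n in counts.items() if n >= min_count}

# Hypothetical runs: 12 trees split on A, B, C and 5 trees split on A, D.
runs = [["A", "B", "C"]] * 12 + [["A", "D"]] * 5
pairs = candidate_pairs(runs)
```

Here the three pairs formed from A, B, and C reach the threshold of 12, while the pair (A, D), seen only 5 times, is dropped.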

Next, to establish the discriminant, stepwise variable selection is performed on the data from all 150 cases in a logistic regression model with the main effects of the 16 genes and the 18 interactions as explanatory variables and the presence or absence of lymph node metastasis as the response. The p-value of the significance test of each regression coefficient is used as the criterion for variable entry (less than 0.05) and removal (greater than 0.05). Six variables are selected: G1592, G3031, G3826, G4370, the interaction between G2645 and G3177, and the interaction between G3753 and G3826. The discriminant for predicting the presence or absence of lymph node metastasis is estimated as
D = 0.2307 - 2.7132 × "expression level of G1592"
+ 8.9509 × "expression level of G3031"
+ 8.7975 × "expression level of G3826"
- 2.3098 × "expression level of G4370"
+ 3.5126 × "expression level of G2645" × "expression level of G3177"
- 8.8226 × "expression level of G3753" × "expression level of G3826"
which gives the discrimination rule that lymph node metastasis is present when D > 0 and absent when D ≤ 0. The seven genes appearing in this discriminant, namely G1592, G2645, G3031, G3177, G3753, G3826, and G4370, are selected as the gene set that contributes to discriminating the presence or absence of lymph node metastasis. Their gene names are shown in Table 1.
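Applying the discrimination rule to a sample is a direct evaluation of D. In the sketch below the coefficients are those of the discriminant as stated in the original Japanese specification; the function names, the dictionary input format, and the two expression profiles are assumptions made for illustration and are not measured data.

```python
GENES = ("G1592", "G2645", "G3031", "G3177", "G3753", "G3826", "G4370")

def discriminant(x):
    """x maps each gene ID to its standardized log(Cy3/Cy5) expression level."""
    return (0.2307
            - 2.7132 * x["G1592"]
            + 8.9509 * x["G3031"]
            + 8.7975 * x["G3826"]
            - 2.3098 * x["G4370"]
            + 3.5126 * x["G2645"] * x["G3177"]   # interaction term
            - 8.8226 * x["G3753"] * x["G3826"])  # interaction term

def predict_metastasis(x):
    """True = lymph node metastasis predicted (D > 0), False = not predicted (D <= 0)."""
    return discriminant(x) > 0

# Hypothetical profiles: all genes at the median level, then G1592 raised by one log2 unit.
baseline = {g: 0.0 for g in GENES}
raised_g1592 = dict(baseline, G1592=1.0)
```

For the all-zero profile D equals the intercept 0.2307, so metastasis is predicted; raising G1592 to 1.0 gives D = 0.2307 - 2.7132 = -2.4825, so it is not.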

Finally, the performance of the selected gene set in discriminating lymph node metastasis is evaluated by the leave-one-out (LOO) method. One sample is set aside, a logistic discriminant containing the six variables above is estimated from the data of the remaining 149 samples, and the sample set aside is classified with it; this operation is carried out for each of the 150 samples. As shown in Table 2, the correct classification rate of the selected gene set is estimated to be 88.7% (sensitivity 77.8%, specificity 96.6%). As described above, the present invention identifies a gene set necessary for predicting the presence or absence of lymph node metastasis of colorectal cancer with high accuracy.
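The three reported rates are related through the confusion-matrix counts, which can be checked with a few lines. Table 2 is not reproduced in this section, so the counts used below (49 of 63 positive cases and 84 of 87 negative cases classified correctly) are back-calculated from the reported percentages and should be read as an assumption, not as values taken from the table.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, correct classification rate) as fractions."""
    sensitivity = tp / (tp + fn)          # correctly classified metastasis-positive cases
    specificity = tn / (tn + fp)          # correctly classified metastasis-negative cases
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

# Assumed counts, back-calculated from 77.8% of 63 and 96.6% of 87.
sens, spec, acc = classification_metrics(tp=49, fn=14, tn=84, fp=3)
```

Under these assumed counts the overall rate is (49 + 84) / 150, which rounds to the reported 88.7%.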

The present invention is described in detail below by way of examples, but it is not limited in any way by these examples. Unless otherwise specified, the reagents used in the examples were purchased from Nacalai Tesque, Inc.

(1) Preparation of total RNA from colorectal cancer tissue samples
Primary colorectal cancer tissue from 150 cases, resected during colorectal cancer surgery and collected with informed consent, was used as the sample material for gene expression analysis with DNA microarrays. Of these, 63 cases were from patients in whom lymph node metastasis was found on histopathological examination at the time of primary tumor resection (hereinafter "lymph node metastasis-positive cases"), and 87 cases were from patients in whom no lymph node metastasis was found (hereinafter "lymph node metastasis-negative cases"). Total RNA was extracted from these tissue samples with TRIzol reagent (purchased from GIBCO BRL), essentially following the manual supplied with the reagent. In addition, total RNA was extracted from normal colorectal mucosa of 40 cases and pooled to give the standard normal colorectal mucosa total RNA used throughout all experiments. The concentrations of these RNA samples were calculated from the absorbance at 260 nm, measured with a spectrophotometer by the standard method.

(2) Preparation of fluorescently labeled targets
The fluorescently labeled targets to be hybridized to the DNA microarray were prepared as follows. First, 25 μg of total RNA from a colorectal cancer sample (hereinafter "colorectal cancer RNA") and 25 μg of standard normal colorectal mucosa total RNA (hereinafter "standard colorectal mucosa RNA") were placed in separate tubes. To each tube, 2 μg of an 18-nucleotide oligo(dT) primer was added, the volume was adjusted to 14 μL with sterile distilled water, and the tube was heated at 70°C for 10 minutes and then immediately chilled on ice. Then 6 μL of 5× First Strand Buffer, 3 μL of 0.1 M DTT, 1.5 μL of 20× dNTP mix (a mixture of 10 mM dATP, dCTP, and dGTP and 6 mM dTTP), and 0.5 μL of RNAguard were added to each tube.

Next, 3 μL of dUTP labeled with the fluorescent dye Cy3 (hereinafter "Cy3-dUTP"; 1 mM) was added to the tube containing the colorectal cancer RNA, and 3 μL of dUTP labeled with Cy5 (hereinafter "Cy5-dUTP"; 1 mM) was added to the tube containing the standard colorectal mucosa RNA, and the tubes were incubated at 42°C for 2 minutes. Then 2 μL of the reverse transcriptase SuperScript II was added to each tube, and the labeling reaction was carried out by incubating at 42°C for a further 1 hour. As cDNA is synthesized from the colorectal cancer RNA and standard colorectal mucosa RNA templates, Cy3-dUTP and Cy5-dUTP, respectively, are incorporated, producing a Cy3-labeled colorectal cancer target and a Cy5-labeled standard colorectal mucosa target.
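The volumes listed for the labeling reaction can be checked arithmetically. The sketch below simply sums the stated volumes and derives the final concentrations; the dictionary keys and the helper name are hypothetical, and the calculation assumes the volumes are additive.

```python
# Stated component volumes of one labeling reaction, in microlitres.
volumes_ul = {
    "RNA + oligo(dT) primer in water": 14.0,
    "5x First Strand Buffer": 6.0,
    "0.1 M DTT": 3.0,
    "20x dNTP mix": 1.5,
    "RNAguard": 0.5,
    "1 mM Cy3-dUTP or Cy5-dUTP": 3.0,
    "reverse transcriptase": 2.0,
}
total_ul = sum(volumes_ul.values())

def final_concentration(stock, volume_ul, total=total_ul):
    """Concentration after dilution into the full reaction (same unit as stock)."""
    return stock * volume_ul / total

buffer_x = final_concentration(5.0, 6.0)      # buffer strength, in "x"
dtt_mM = final_concentration(100.0, 3.0)      # 0.1 M stock expressed in mM
dntp_x = final_concentration(20.0, 1.5)       # dNTP mix strength, in "x"
cy_dutp_mM = final_concentration(1.0, 3.0)    # labeled dUTP, in mM
```

The total comes to 30 μL, at which the 5× buffer and the 20× dNTP mix are both diluted to 1×, consistent with their stated stock strengths.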

The 5× First Strand Buffer, 0.1 M DTT, and SuperScript II used in this reaction were purchased from GIBCO BRL; dATP, dCTP, dGTP, dTTP, Cy5-dUTP, Cy3-dUTP, and RNAguard were purchased from Amersham Biosciences. After the reaction, 5 μL of denaturing solution (0.5 N NaOH, 50 mM EDTA) was added to each tube, the tubes were heated at 70°C for 10 minutes, and the mixture was neutralized by adding 7.5 μL of 1 M Tris-HCl (pH 7.5). The labeled colorectal cancer target and the labeled standard colorectal mucosa target were then combined, and 10 μg of human COT-1 DNA (purchased from GIBCO BRL) was added. The mixture was brought to 500 μL with TE buffer and purified and concentrated with a Microcon-30 (purchased from Amicon), following the manual supplied with the device, to remove unreacted Cy5-dUTP, Cy3-dUTP, and the like. The solution was finally concentrated to a total volume of 5 μL and used as the labeled target for hybridization to the DNA microarray.

(3) Pretreatment of the DNA microarray
The DNA microarray was masked by immersion for 5 minutes in masking solution (a mixture of 3 g of succinic anhydride, 190 mL of N-methyl-2-pyrrolidone, and 21 mL of 0.2 M sodium borate) and then immersed in distilled water at 95°C for 3 minutes to heat-denature the cDNA printed on the array. It was then immediately immersed in ethanol of 95% or higher for 1 minute to dehydrate it, and air-dried.

(4) Hybridization of the labeled target to the DNA microarray
To 5 μL of the labeled target solution prepared as described above were added 2.5 μL of 10 mg/mL polyadenine (purchased from Roche), 0.5 μL of 10% SDS solution, 3 μL of 20× PM solution (a mixture of 0.4% BSA and 1% SDS), 15 μL of formamide, 3 μL of 20× SSC (3 M sodium chloride, 0.3 M sodium citrate, pH 7.0), and 1 μL of sterile distilled water. The mixture was heated at 100°C for 2 minutes and then left at room temperature in the dark for about 30 minutes. It was then dropped onto the cDNA-printed area of a DNA microarray pretreated as described in the preceding section and covered with a 24 × 40 mm cover glass (purchased from Matsunami Glass Industry). The microarray was placed in a sealed container, and the container was kept in a 42°C incubator for about 16 hours to hybridize the labeled target to the cDNA on the microarray. After hybridization, the microarray was washed for 10 minutes in 2× SSC containing 0.1% SDS and then for 10 minutes in 0.1× SSC containing 0.1% SDS. It was then washed twice for 5 minutes each in 0.1× SSC, drained, and air-dried in the dark.
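As a quick consistency check on the hybridization mixture, the stated volumes can be summed and the formamide and SSC content derived. This is plain arithmetic under the assumption that volumes are additive; the variable names are hypothetical.

```python
# Stated component volumes of the hybridization mixture, in microlitres.
hyb_volumes_ul = {
    "labeled target": 5.0,
    "10 mg/mL polyadenine": 2.5,
    "10% SDS": 0.5,
    "20x PM solution": 3.0,
    "formamide": 15.0,
    "20x SSC": 3.0,
    "sterile distilled water": 1.0,
}
hyb_total_ul = sum(hyb_volumes_ul.values())

# Formamide as a volume percentage, and SSC strength after dilution.
formamide_percent = 100.0 * hyb_volumes_ul["formamide"] / hyb_total_ul
ssc_x = 20.0 * hyb_volumes_ul["20x SSC"] / hyb_total_ul
```

The mixture totals 30 μL and contains 50% (v/v) formamide in 2× SSC, which is in line with the 42°C hybridization temperature given for formamide-containing solutions.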

(5) Microarray scanning and data analysis
After washing and air-drying, the microarray was scanned separately for Cy3 and Cy5 fluorescence with a ScanArray 4000 (GSI Lumonics), a confocal laser scanner for microarrays. The Cy3 and Cy5 fluorescence patterns arising from the colorectal cancer target and the standard colorectal mucosa target hybridized to each probe on the microarray were thereby obtained as 16-bit TIFF scan images. These images were analyzed with QuantArray software (GSI Lumonics), a dedicated microarray analysis package, to obtain the Cy3 and Cy5 fluorescence intensities of all probes as numerical data in text format. For background correction, the fluorescence intensity of areas where no cDNA was printed was subtracted from the fluorescence intensity of each probe. Because low fluorescence intensities are strongly affected by experimental error, only the approximately 3,000 data points with the highest fluorescence intensity were retained and the rest were discarded. The ratio of the Cy3 and Cy5 fluorescence intensities (Cy3/Cy5) was calculated for each probe and converted to its base-2 logarithm (hereinafter "log(Cy3/Cy5)"). To correct for differences in the detection sensitivity settings of Cy3 and Cy5 that can arise during scanning, the median of all log(Cy3/Cy5) values was subtracted from the log(Cy3/Cy5) value of each probe, giving the standardized log(Cy3/Cy5) value.

These operations yielded log-transformed, standardized numerical data representing the relative expression levels, with the standard colorectal mucosa RNA as reference, in the primary colorectal cancer tissue of the 63 cases with and the 87 cases without lymph node metastasis. Numerical data for 12 normal colorectal mucosa samples, likewise relative to the standard colorectal mucosa RNA, were obtained in the same way. Only the data for the 2,121 probes that met both of the following conditions were used in the subsequent statistical analysis: data were obtained for at least 128 cases, that is, 85% or more of the 150 primary colorectal cancer cases analyzed; and the variance across the 150 primary colorectal cancer cases exceeded 1.1 times the variance across the 12 normal colorectal mucosa samples.

The data for these 2,121 probes include values that were rejected in some cases because expression was low and fell below the cutoff; these are missing values. There were 8,816 such missing values in total. Since the full data set contains 150 × 2,121 = 318,150 values, missing values account for about 2.8%. Because missing values would interfere with the subsequent statistical analysis, they were imputed by the k-Nearest Neighbor method. Specifically, in the data matrix of 150 cases × 2,121 probes, the distances between all pairs of samples were calculated. For each missing value, the mean expression level of the gene in question was computed over the eight samples closest to the sample containing the missing value, and this mean was used as the imputed value. The number of nearest samples was determined by increasing it stepwise and taking the number at which the root mean square (RMS) value was smallest. The complete numerical data obtained by imputing the missing values in this way are hereinafter referred to as the standardized gene expression data.
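The sample-based nearest-neighbor imputation can be sketched as follows. The specification does not state the distance measure, so a root-mean-square difference over the genes observed in both samples is assumed here; the function names, the `samples[s][g]` layout, and the `None` marker are likewise hypothetical. The sketch also assumes that each gene with a missing value was measured in at least one other sample, and it takes k as given instead of tuning it.

```python
import math

def sample_distance(a, b):
    """Root-mean-square difference over genes observed in both samples (assumed metric)."""
    diffs = [(x - y) ** 2 for x, y in zip(a, b) if x is not None and y is not None]
    return math.sqrt(sum(diffs) / len(diffs)) if diffs else math.inf

def knn_impute(samples, k=8):
    """samples[s][g]: expression of gene g in sample s, None = missing. Returns a filled copy."""
    filled = [list(row) for row in samples]
    for s, row in enumerate(samples):
        for g, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours: other samples in which gene g was measured,
            # ordered by their distance to the sample with the missing value.
            neighbours = sorted(
                (sample_distance(row, other), t)
                for t, other in enumerate(samples)
                if t != s and other[g] is not None
            )[:k]
            filled[s][g] = sum(samples[t][g] for _, t in neighbours) / len(neighbours)
    return filled

# Toy data: 4 samples x 3 genes; the first sample lacks the third gene.
toy = [[1.0, 1.0, None], [1.0, 1.1, 5.0], [1.0, 0.9, 7.0], [9.0, 9.0, 100.0]]
toy_filled = knn_impute(toy, k=2)
```

In the toy data the second and third samples are closest to the first, so the missing value is filled with the mean of 5.0 and 7.0; the distant fourth sample does not contribute.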

Determination of a highly discriminating gene set for predicting lymph node metastasis by statistical analysis of the standardized gene expression data
In this example, a probe printed on the DNA microarray is sometimes referred to as a gene.
As a first step, the following four approaches were applied to identify gene sets for determination: (a) Support Vector Machine (SVM) (Hastie et al., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2001), (b) an extension of Principal Component Analysis/artificial Neural Network (PCA/aNN) (Khan et al., Nature Medicine, vol. 7, p. 673-679, 2001), (c) Hierarchical Cluster Analysis (HCA) + Stepwise Logistic Discrimination, and (d) Classification And Regression Tree (CART) (Breiman et al., Classification and Regression Trees, Wadsworth, 1983) + Logistic Discrimination.

To ensure statistical reliability when identifying the gene groups, all data were divided into two groups, one for identifying genes for determination and one for evaluation. Specifically, the data from the 150 cases were divided into a group of 99 cases (42 with and 57 without lymph node metastasis) and a group of 51 cases (21 with and 30 without lymph node metastasis). The 99-case data were used to identify genes for determining the presence or absence of lymph node metastasis and to establish a discriminant, and the discriminant was evaluated by using it to classify the 51-case data. Hereinafter, data used like the 99-case data to identify genes for determination and establish a discriminant are referred to as "training data", and data used like the 51-case data to evaluate a discriminant are referred to as "test data".

For approaches (a), (c), and (d), the two-way split described above was repeated 100 times at random, 100 gene groups for determination were identified, and the genes identified most frequently were adopted. Approach (b) was carried out only for the first split. The 99-case training data were, however, randomly divided in a 2:1 ratio 1,250 times, and principal component analysis and neural network training were repeated on these divisions. After training, the genes were ranked by their sensitivity in discriminating the presence or absence of lymph node metastasis and then narrowed down. Training started from 2,121 genes and proceeded through successively reduced sets of 1,536, 768, 384, 192, 96, 48, and 24 genes.

As a result of the above study, a discriminating gene set and a discriminant were established for each approach. The number of genes in each gene set and the mean correct classification rate on the test data using the established discriminant (number of cases in which the result of the discriminant agreed with the result of the histopathological examination / number of test data × 100 (%)) were as follows: for (a), 144 genes and a correct classification rate of 80.2% (standard deviation 5.6%); for (b), 192 genes and a correct classification rate of 90.2% (no standard deviation is given, approach (b) having been run on the initial division only); for (c), 133 genes and a correct classification rate of 78.6% (standard deviation 6.2%); and for (d), 138 genes and a correct classification rate of 86.3% (standard deviation 4.5%). In addition, 16 genes were included in common in the gene sets selected by all of the approaches.

As described above, each of the four different approaches achieved a high correct classification rate of roughly 80% or more (78.6% to 90.2%). However, the number of genes used for discrimination exceeded 100 in every approach, which was considered too many for practical use as an assay.

It was therefore decided to establish a new discrimination rule using fewer genes without lowering the correct classification rate. To this end, the genes under consideration were first restricted to the 16 genes included in common in the gene sets selected by the approaches above. Next, in addition to the contribution of each of these 16 genes (hereinafter referred to as the "main effect"), interactions between two genes were also taken into account. By searching for discrimination rules over a wider range that includes interactions between genes as well as the main effects of individual genes, it was expected that high discrimination performance could be maintained.

To search for interactions, CART analysis with the presence or absence of lymph node metastasis as the response was performed again on each of the 100 sets of training data used in the preceding analysis. As a result, the number of genes that appeared as variables directing a division of the data was three to five per analysis. When, for example, three genes appeared, there are three possible gene pairs, and all of those pairs were regarded as interactions. Likewise, six pairs were regarded as interactions when four genes appeared, and ten pairs when five genes appeared. Of the pairs obtained from the 100 analyses, the 18 gene pairs that appeared 12 or more times were selected as interaction candidates.
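The pair-counting step described above can be illustrated with a short sketch. The function name `interaction_candidates`, the gene labels and the toy runs are hypothetical; only the counting logic (every unordered pair of genes appearing in one tree counts once for that run, and pairs reaching a minimum count are kept) follows the description.

```python
from collections import Counter
from itertools import combinations

def interaction_candidates(split_genes_per_run, min_count):
    """Count unordered gene pairs across runs and keep the frequent ones.

    split_genes_per_run: one list per CART run, holding the genes that
    appeared as splitting variables in that run.
    """
    counts = Counter()
    for genes in split_genes_per_run:
        # Every unordered pair of splitting genes counts once per run.
        counts.update(combinations(sorted(set(genes)), 2))
    return {pair: n for pair, n in counts.items() if n >= min_count}

# Toy input with hypothetical gene labels; not data from the study.
runs = [
    ["gA", "gB", "gC"],
    ["gA", "gB", "gC", "gD"],
    ["gA", "gB", "gD", "gE", "gF"],
]
# 3, 4 and 5 genes give 3, 6 and 10 pairs respectively.
print([len(list(combinations(r, 2))) for r in runs])  # [3, 6, 10]
print(interaction_candidates(runs, min_count=3))  # {('gA', 'gB'): 3}
```

With 100 runs and `min_count=12`, the same function would apply the threshold stated in the text.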

Next, to establish the discriminant, stepwise variable selection was performed on the data from the 150 cases in a logistic regression model with the main effects of the 16 genes and the 18 interactions as explanatory variables and the presence or absence of lymph node metastasis as the response. The p-value of the significance test for each regression coefficient was used as the criterion for entering a variable (less than 0.05) and for removing a variable (greater than 0.05). As a result, six variables were selected: G1592, G3031, G3826, G4370, the interaction between G2645 and G3177, and the interaction between G3753 and G3826. G1592, G3031, G3826, G4370, G2645, G3177 and G3753 are the serial numbers assigned to the respective probes (genes) on the ColonoChip used in the present invention. The discriminant for determining the presence or absence of lymph node metastasis was estimated as
D = 0.2307 − 2.7132 × (expression level of G1592)
 + 8.9509 × (expression level of G3031)
 + 8.7975 × (expression level of G3826)
 − 2.3098 × (expression level of G4370)
 + 3.5126 × (expression level of G2645) × (expression level of G3177)
 − 8.8226 × (expression level of G3753) × (expression level of G3826)
which led to the discrimination rule that lymph node metastasis is judged to be present when D > 0 and absent when D ≤ 0.
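For illustration, the discriminant and the decision rule can be written as a small function. The coefficients are those of the discriminant above; the function names and the input format (a dictionary of expression levels keyed by serial number) are assumptions, as are the example values, which are arbitrary and do not come from the study. How the expression levels are scaled or normalized is not stated in this passage and is left to the caller.

```python
# Coefficients taken from the discriminant in the description.
INTERCEPT = 0.2307
MAIN_EFFECTS = {
    "G1592": -2.7132,
    "G3031": 8.9509,
    "G3826": 8.7975,
    "G4370": -2.3098,
}
INTERACTIONS = {
    ("G2645", "G3177"): 3.5126,
    ("G3753", "G3826"): -8.8226,
}

def discriminant(expr):
    """Return D for a dict mapping serial numbers to expression levels."""
    d = INTERCEPT
    for gene, coef in MAIN_EFFECTS.items():
        d += coef * expr[gene]
    for (g1, g2), coef in INTERACTIONS.items():
        d += coef * expr[g1] * expr[g2]
    return d

def predict_metastasis(expr):
    """Apply the rule: metastasis present if D > 0, absent if D <= 0."""
    return discriminant(expr) > 0

# Arbitrary example values (hypothetical, for demonstration only).
genes = ("G1592", "G2645", "G3031", "G3177", "G3753", "G3826", "G4370")
zero = {g: 0.0 for g in genes}
print(predict_metastasis(zero))                     # True: D equals the intercept
print(predict_metastasis(dict(zero, G1592=1.0)))    # False: D = 0.2307 - 2.7132
```

Note that G3826 enters both as a main effect and through its interaction with G3753, so its net contribution depends on the expression level of G3753.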

The seven genes appearing in this discriminant, namely G1592, G2645, G3031, G3177, G3753, G3826 and G4370, were selected as the set of genes contributing to discrimination of the presence or absence of lymph node metastasis. Their gene names are given in Table 1, together with the accession number of each gene in the RefSeq database. The RefSeq database can be accessed from the website of the National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov/RefSeq/index.html).

Finally, the performance of the selected gene set in determining lymph node metastasis was evaluated by the leave-one-out (LOO) method. That is, a logistic discriminant containing the six variables above was estimated using the data from the 149 samples remaining after one sample was removed, the removed sample was classified with that discriminant, and this operation was carried out for each of the 150 samples. The results are shown in Table 2. From Table 2, the correct classification rate of the selected gene set was estimated to be 88.7% (sensitivity: 77.8%, specificity: 96.6%).
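As a consistency check on the reported rates, the following sketch searches for whole-number case counts that reproduce them. It assumes that the 150 samples comprise 63 cases with lymph node metastasis (42 + 21) and 87 without (57 + 30), as stated for the training and test groups, and that each rate is rounded to one decimal place. The counts it finds are inferred from the percentages and are not taken from Table 2.

```python
# Class totals from the 99-case and 51-case groups described earlier.
N_POS = 42 + 21  # cases with lymph node metastasis
N_NEG = 57 + 30  # cases without lymph node metastasis

def one_decimal(numerator, denominator):
    """Percentage formatted to one decimal place, as in the reported rates."""
    return format(100.0 * numerator / denominator, ".1f")

# Correctly classified counts compatible with the reported sensitivity
# (77.8%) and specificity (96.6%).
tp_options = [tp for tp in range(N_POS + 1) if one_decimal(tp, N_POS) == "77.8"]
tn_options = [tn for tn in range(N_NEG + 1) if one_decimal(tn, N_NEG) == "96.6"]
print(tp_options, tn_options)

# Overall correct classification rate implied by each compatible pair.
for tp in tp_options:
    for tn in tn_options:
        print(tp, tn, one_decimal(tp + tn, N_POS + N_NEG))
```

Under these assumptions the only compatible counts are 49 of 63 and 84 of 87, which together give 133 of 150, or 88.7%, in agreement with the reported correct classification rate.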

The determination of lymph node metastasis made possible by the present invention allows a better treatment strategy to be selected for each case, and a benefit in terms of medical economics can also be expected. For example, an improved prognosis can be expected from aggressive treatment of cases with a high likelihood of lymph node metastasis, whereas for cases with a low likelihood of lymph node metastasis, postoperative adjuvant therapy can be kept mild, reducing the physical and financial burden on the patient.

Furthermore, since the individual genes included in the gene set disclosed in the present invention may themselves function as causes of lymph node metastasis, it can also be expected that drugs targeting these genes and their expression products will be developed, making it possible to suppress lymph node metastasis directly.

Claims

1. A method for selecting a gene set for predicting the presence or absence of lymph node metastasis of colorectal cancer, comprising the following steps (1) to (4):
(1) a step of analyzing, by at least one supervised learning analysis method, gene expression information in primary colorectal cancer tissue of patients in whom the presence or absence of lymph node metastasis has been established by histopathological determination, and selecting, for each analysis method, a gene group capable of classifying the presence or absence of lymph node metastasis with a correct classification rate of 75% or more;
(2) a step of selecting, from the gene groups selected by the respective analysis methods used in (1), common genes selected in common by all of the analysis methods;
(3) a step of analyzing the gene expression information and selecting, from among arbitrary combinations of two or more genes, combinations of genes that direct the classification of the presence or absence of lymph node metastasis and exhibit an interaction; and
(4) a step of performing variable selection in a logistic regression model in which the common genes and the combinations of genes are the explanatory variables and the presence or absence of lymph node metastasis is the response.

2. The method according to claim 1, wherein the analysis method of (1) comprises at least one selected from the group consisting of (a) Support Vector Machine, (b) an Artificial Neural Network as an extension of Principal Component Analysis, (c) a combination of Hierarchical Cluster Analysis and Stepwise Logistic Discrimination, and (d) a combination of Classification And Regression Tree and Logistic Discrimination.

3. The method according to claim 1 or 2, wherein the analysis method of (3) is (d) the combination of Classification And Regression Tree and Logistic Discrimination.

4. The method according to claim 1, wherein the variable selection method of (4) is a stepwise variable selection method.

5. A gene set for predicting the presence or absence of lymph node metastasis of colorectal cancer, selected by the method according to claim 1.

6. The gene set according to claim 5, comprising at least the genes represented by the database accession numbers (serial numbers) NM_003404 (G1592), NM_002128 (G2645), NM_052868 (G3031), NM_005034 (G3177), NM_001540 (G3753), NM_005722 (G3826) and NM_015315 (G4370).

7. A method for predicting the presence or absence of lymph node metastasis of colorectal cancer, comprising using the gene set according to claim 5.

8. The method according to claim 7, wherein the following discriminant is used:
D = 0.2307 − 2.7132 × (expression level of NM_003404 (G1592))
 + 8.9509 × (expression level of NM_052868 (G3031))
 + 8.7975 × (expression level of NM_005722 (G3826))
 − 2.3098 × (expression level of NM_015315 (G4370))
 + 3.5126 × (expression level of NM_002128 (G2645)) × (expression level of NM_005034 (G3177))
 − 8.8226 × (expression level of NM_001540 (G3753)) × (expression level of NM_005722 (G3826))
(lymph node metastasis is determined to be present when D > 0 and absent when D ≤ 0).