AU773329B2

AU773329B2 - Materials and methods relating to disease diagnosis

Info

Publication number: AU773329B2
Application number: AU49208/00A
Authority: AU
Inventors: Alaiya Ayodele; Bo Franzen; Anders Hagman
Original assignee: PROTEOMEDICA AB
Current assignee: PROTEOMEDICA AB
Priority date: 1999-05-14
Filing date: 2000-05-11
Publication date: 2004-05-20
Anticipated expiration: 2020-05-11
Also published as: AU4920800A; WO2000070340A2; WO2000070340A3; EP1179175A2

Description

WO 00/70340 PCT/EP00/04265 Materials and Methods Relating to Disease Diagnosis Field of the Invention The present invention concerns materials and methods relating to disease diagnosis. Particularly, but not exclusively, the invention relates to methods of diagnosing tumours, by comparing specific patterns of gene expression at a nucleic acid or protein level using expressed nucleic acid, e.g. mRNA or cellular proteins associated with the tumour.

BackQround of the Invention The major characteristics that differentiate malignant tumours from benign ones are their properties of invasiveness and spread. Malignant tumours do not remain localised and encapsulated: they invade surrounding tissues, get into the body's circulatory system, and set up areas of proliferation away from the site of their original appearance. When tumour cells spread and engender secondary areas of growth, the process is call metastasis; malignant cells having the ability to metastasize.

The earliest stages of malignant tumours are hard to identify and pathologists are rarely sure how or where a malignancy began. The cells of malignant tumours have a tendency to lose differentiated traits and therefore it can be difficult to determine the primary origin of the Ir (I WO 00/70340 PCT/EP00/04265 -2cells following metastasis.

A concern with the histopathologic assessment of neoplasias (tumour growth) is that tumour classification is based on subjective evaluation Immunostaining can be used to determine the expression of various diagnostic markers and may increase reproducibility.

Ovarian cancer is an example of a disease where the diagnostic difficulties are considerable Epithelial neoplasias of ovarian cancers are classified into benign, borderline and malignant tumours. Borderline tumours are often difficult to diagnose, and it is not known if most of these tumours represent intermediate steps in tumour progression or whether these tumours should be considered as a separate group Relative survival decreases with increasing tumour stage or grade. Five-year survival is considerably lower for women with carcinoma than for women with borderline carcinoma Summary of the Invention The present inventors have appreciated that carrying out routine tumour diagnosis in an accurate and objective manner is very difficult. The process is preoperatively dependent on an experienced cytologist and/or postoperatively dependent on an experienced pathologist, and is at present based on morphological judgements.

Further, the primary tumour source can be difficult to determine which may lead to miss-diagnosis and i f t WO 00/70340 PCTIEP00/04265 3inappropriate treatment regime. Therefore, the present inventors have realised that there is a need for a diagnostic tool that can perform preoperative diagnosis objectively. Such a tool should help to reduce the number of patients undergoing unnecessary and expensive therapy.

Multivariate analysis of the expression of a series of diagnostic markers is one approach to diagnostic problems. If a sufficiently large data set is collected, it may be possible to recognize patterns of expression in different histological groups. Goldschmidt et al. showed that multivariate analysis of 47 histological variables generated by computer-assisted microscope analysis facilitated classification of adipose tumours.

Similarly, multivariate analysis of RNA expression data has been used to discriminate between fibroblast subtypes (6) One approach to obtain a large data set is to use high resolution two-dimensional polyacrylamide gel electrophoresis This technique is able to resolve more than one thousand polypeptides on a single gel. The pattern can be analysed by computer software such as PDQUEST and MELANIE II This approach has been previously used for the classification of lung tumour cell lines An alternative approach to obtaining a large data set is to use micro-array technology. Nucleic acid WO 00/70340 PCT/EP00/04265 4 sequence characteristic of nucleic acid sequences expressed in certain cell types, e.g. MRNA or cDNA, can be analysed in this way. There is an increasing tendency towards miniaturisation of assays which use binding members (such as antibodies or nucleic acid sequences).

For example, the binding members may be immoblised in small discrete locations (microspots) and/or as arrays (micro-array technology) on solid supports or on diagnostic chips. These approaches can be particularly valuable as they can provide great sensitivity (particularly through the use of fluorescent labelled reagents), require only very small amounts if biological sample from individuals being tested and allow a variety of separate assays to be carried out simultaneously.

Examples of techniques enabling miniaturised technology are provided in W084/01031, W088/1058, W089/01157, W093/8472, W095/18376, W095/18377, W095/24649 AND EP-A- 0373203.

Early research by Fedor et al established that silicon could serve as a substrate onto which organic molecules such as DNA could be synthesized. The process now commercialised by Affymetrix Inc. Santa Clara, California, involves the use of serial photolithographic steps to build oligonucleotides in situ at a specific addressable position on the chip.

The strategy of addressing specific nucleic acid sequences synthesized off chip, then hybridized to a particular location on a chip by electrical attraction to a charged microelectrode has been developed by Nanogen Inc. Variation on the theme of microaddressable arrays has recently led to the evaluation of chips for sequence analysis of uncharaterised genetic material, mutational analysis of a known gene locus, and for the evaluation of a particular cell or tissue's profile of gene expression for the whole complement of the human DNA sequence. These methodologies typically rely on the use of laser activated fluorescence of addressable signals on a microchip.

Thus, at its most general, the present invention provides materials and methods for, firstly obtaining a number of protein or nucleic acid expression profiles characteristic for disease states of different origins or different stages of development or malignancy; secondly, analysing said expression profiles in order to determine specific diagnostic markers; and thirdly, diagnosing the presence of a disease, e.g. tumour, the type of disease or the stage of development of said disease e.g. tumour 20 malignancy by comparison of its protein or nucleic acid expression profile with those previously obtained to determine using the specified diagnostic markers.

Thus, the present invention primarily relates, in a first aspect, to a method of obtaining gene expression profiles in order to determine diagnostic markers characteristic of a selected disease type or stage of development of a disease comprising obtaining cells from a sample of said disease :tissue; disrupting cells to expose the cellular products characteristic of gene expression; separating said cellular products according to their characteristic properties on a substrate; collecting image data from said substrate; and carrying out image analysis of the substrate to quantify the cellular products distributed on the substate and partial least squares discriminant analysis (PLS-DA) to determine a set of gene expresion profiles characteristic of said disease.

Depending on the cell type, different genes are expressed or are expressed at different levels or frequency. These differences in gene expression may be used to characterise the type of cell. The cellular products that reflect the differences in gene expression are those products produced downstream of the nucleic acid transcription and translation process, e.g. mRNA or the expressed protein itself. These cellular products may then be separated according to their own characteristic properties, e.g. size, charge or sequence.

In a preferred embodiment of the invention, the 20 cellular products are expressed proteins which may be separated according to their size on a electrophoresis 0° gel, preferably a two dimensional electrophoresis gel.

0*00 ~Alternatively, the cellular products may be separated according to their characteristic properties using a substrate comprising specific binding members, for 0000 .oo example, antibodies or oligonucleotides. As mentioned 0000 above, this is conveniently done by using a micro-array.

In such a situation, it is preferable to label the cellular products, e.g. radioactively or fluorescently or 30 enzymatically.

In order to carry out the analysis as outlined in step quantitative and qualitative data from the two dimensional electrophoresis gel is firstly obtained. Thus, step may require carrying out multivariate analysis of the quantitative and qualitative data from the two-dimensional gel to characterise the protein expression profile and identify specific diagnostic markers characteristic of said disease.

The cellular product may be expressed nucleic acid, preferably mRNA which may be obtained from the cells by standard molecular techniques known to the skilled person, for example see Sambrook, Fritsch and Maniatis, "Molecular Cloning, A Laboratory Manual", Cold Spring Harbor Laboratory Press, 1989, and Ausubel et al, Short Protocols in Molecular Biology, John Wiley and Sons, 1992) Alternatively, cDNA (the next page is page 9) EDITORIAL NOTE Application No.49208/00 This application does not have page 8 WO 00/70340 PCT/EP00/04265 9 may be created from the expressed mRNA by reverse transcription before separation and analysing on the micro-array. Micro-array technologies use oligonucleotides (representing thousands of different genes) bound to given positions on various substrate.

Total mRNA is purified from a cell/tissue sample and cDNA is produced by reverse transcriptase. Various steps (e.g.

in vitro transcription using biotinylated nucleotides) may then be added before hybridisation and visualisation depending on the specific type of micro-array technology used Affymetrix chips, Clontech membranes). The final read-out is a signal that is proportional to the quantity of a given expressed gene.

The present inventors have discovered that proteins are differently expressed or differentially regulated between various malignant tumours and benign tumours.

Therefore, the inventors believe that the present invention will have particular utility in relation to the diagnosis of tumours. Although the following description of the invention concentrates on the diagnosis of tumours in general, it will be appreciated by the skilled person that the present invention may equally and advantageously be applied to the diagnosis of other disease states characterised by gene expression profiles, e.g.

hypo/hyperthyroidism, diabetes, or organ rejection.

Further, the invention may be used to test plasma samples for leukaemia or other hematopoetic disorders.

In previous studies carried out by the present inventors, a large degree of heterogeneity in protein expression was observed, particularly in malignant tumours (17, 18). Both qualitative and quantitative differences were found within each tumour group. However, the large quantitative variability indicated that identification based on pattern recognition would be difficult. However, the present inventors show herein that it is possible to select a subset of variables which show a characteristic pattern within the group, and thus are useful for prediction of the presence of malignant cells and their initial origin.

The disease type is preferably cancer, wherein a plurality of samples may be collected from tumours of a particular cancer, e.g. ovarian, breast, skin etc, and its gene expression profile characterised by the present invention.

g It is important that scanning of the electrophoresis gel or the micro-array easily identifies the separated proteins or nucleic acids respectively. Therefore, the method may further comprise the step of labelling the obtained proteins or expressed nucleic acids. Nucleic acid sequences may be labelled by standard techniques known to the skilled person such as fluorescent, enzyme or 25 radio-active labelling. As an alternative to labelling obtained proteins, the gels may be stained with, for example silver nitrate, and scanned using a laser densitometer. Alternatively, the gels may be analysed using computer-assisted microscope to facilitate 30 classification. The data obtained and statistical comparison may be performed. In particular, this is preferably a multivariate characterisation of one or more -11numerical parameters associated with the proteins. In other words, multivariate analysis of a plurality of variables generated by, for example, computer-assisted image analysis may be easily classified. The statistical comparison may, for example, identify a sub-set of proteins, from among all of the proteins on the 2-DE, having a statistically significant degree of expression and/or correlation when compared to other samples from similar tumour cells. This sub-set of proteins may then be used as diagnostic markers for the particular tumour or stage of malignancy. Preferably, a plurality of 2-DE gels are analysed and the distribution pattern of the proteins are determined. A model may then be set up with a specified number of variables between the tumour cells being analysed. For example, a comparison may be made between benign/borderline/malignant. Preferably the number of variables separating the groups whether proteins or expressed nucleic acid sequences, will range between and 500, more preferably 50 and 300, even more preferably 100 and 200. In general, it is preferably that the number of variables is at least 20, more preferably at least and even more preferably at least 70, 100 or 150 variables. In the present case, the inventors used 170 variables.

Quantification and multivariate characterisation of the expression profiles of selected protein or nucleic acid groups may be performed on image analytical data obtained from analysis of the 2-DE or the micro-array respectively and used for objective classification of the tumour cells in a given sample. Generally, the partial least squares discriminant analysis (PLS-DA) allows (i) the construction and characterisation of a protein or -12nucleic acid expression profile database and data extraction of a plurality of sets of proteins or nucleic acids which contribute significantly to the diagnosis/classification of a disease state; (ii) add samples/protein or nucleic acid expression profiles to the database and further improve the future accuracy of the diagnosis/classification; and (iii) query the database via the expert system using new tumour samples/protein or nucleic acid expression patterns aiming at a prediction of diagnosis.

Also provided is a protein expression profile database comprising image data which has been analysed in order to determine a plurality of variables for use as diagnostic markers; said data being obtained from analysis of two-dimensional electrophoresis gels showing characteristic protein distribution associated with a disease type or state of development of said disease for use in disease diagnosis.

Also provided is a nucleic acid (mRNA or cDNA) 20 expression profile database comprising image data which has been analysed in order to determine a plurality of S variables for use as diagnostic markers; said data being obtained from analysis of a micro-array showing characteristic expressed nucleic acid sequence distribution associated with a disease type or stage of development of said disease, for use in disease diagnosis.

In a further aspect, the present invention provides a method of determining the presence, type or stage of a disease in a patient comprising the steps of extracting a sample of cells from the patient; disrupting the cells so as to expose the cellular proteins contained therein; -13separating said cellular proteins on a twodimensional electrophoresis gel; and analysing said gel by computer assisted image evaluation so as to create an expression profile for said cells; and comparing the expression profile of the cells with a database of gene expression profiles determined according to the first aspect of the invention, said expression profiles being characteristic of a plurality of disease types or stages of disease development to determine presence, type or risk of said disease in said patient.

The present invention also provides a method of determining the presence, type or stage of a disease in a patient comprising the steps of extracting a sample of candidate disease cells from a patient; S(2) disrupting the cells so as to obtain the expressed nucleic acid sequences contained therein; separating said expressed nucleic acid sequences on a micro-array according to their individual nucleotide sequence; analysing said gel by computer assisted image evaluation so as to create an expression proflie for the 25 cells; and comparing said expression profile with a database of expression profiles determined according to the first aspect of the invention, said expression profile being characteristic of a plurality of disease types or stages of disease development to determine presence, type or risk of said disease in said (the next page is page 16) EDITORIAL NOTE Application No.49208/00 This application does not have pages 14 and WO 00/70340 PCT/EP00/04265 16 patient.

Preferably, the disease type is cancer and the disease cells are tumour cells.

Sample preparation may be carried out using standard techniques. One typical sample may contain approximately one million cells. Samples may be collected using fine needles aspiration biopsy (FNA) a routine technique used for cytological diagnosis. The major advantage of using FNA combined with the expert system is early diagnosis if possible, a prerequisite for making early decisions on therapy (ii) effects of hormone or chemotherapy can be followed at protein expression level, providing early information on e.g. resistance against treatment; and (iii) the analysis is based on an average expression profile of the cell population.

Samples may also be collected after surgery for analysis in order to guide pathological examination and selection of post-operation therapeutic strategy.

As mentioned above, the earliest stages of malignant tumours are hard to identify and pathologists are rarely sure how or where a malignancy began. The present invention therefore has further utility in being able to more accurately determine the primary origin of tumour cells as the primary tumour and its corresponding metastasis express very similar 2-DE protein profiles (Franzen et al, Int. J. Cancer 1996, 69, 408-414). Such analysis will therefore assist a clinician in determining WO 00/70340 PCT/EP00/04265 17 the location of the primary tumour.

The above disclosure concentrates on the analysis and diagnosis of tumours. However, as mentioned above, the present invention may also be usefully applied to the diagnosis of any disease state that can be characterised by a statistically significant protein expression profile which allows the identification of specific diagnostic markers.

By way of example only, a brief outline/workflow on how the computer analysis may be set up in practice is provided below: 1. A new tumour sample is prepared, analyzed by 2-DE and the expression pattern is scanned.

2. All protein spots in this expression pattern is quantified and matched against a reference pattern using any established software for basic 2-DE analysis PDQuest, Melanie, Biolmage).

3. The data is first organized in a Excel-spreadsheetlike format table with all protein spot reference numbers in the first column and individual normalized protein quantities for every analyzed sample in the following columns. A new case/pattern is added as a new column. This corresponds to the "data table X".

WO 00/70340 PCT/EP00/04265 18 4. The process of "data mining" to find those proteins/variables which contribute most to the separation of tumour classes and build the learning set (the core of the database), is based on the PLS-DA analysis. Here, an additional "data table Y" is included, as described under materials and methods, data preprocessing (please see also references 14 15). Graphically and numerically it is possible to make a first selection of variables (those that are far from origo (compare fig. 4) in the same and opposite direction from the corresponding position of tumour classes, compare fig. 3).

5. In an interactive sub-routine or process, this first set of variables is crossvalidated by excluding cases one by one in sequences, rebuild the model and make a prediction of each of the excluded cases.

Then, a second set of variables are selected (according to step and so on until the predictive value reach an optimum. In the present case, a set of 170 variables was selected in this way (step 4 and 5) and is therefore not a random choice.

6. Next, the true predictive value is determined using a new set of cases (the test set).

WO 00/70340 PCT/EP00/04265 19- 7. This process, step 3-6, can then be repeated with an increased number of cases in order to further improve the predictive accuracy.

8. A new case (an unknown tumour sample) is then analyzed by 2-DE/basic image analysis, the pattern is compared with respect to the defined group of variables in the database model and classified using, for example, PLS-DA prediction in order to obtain a diagnosis. Each new case may also be added to the database for future improvements of the predictive value of the model.

One part of the expert system/computer software is to integrate steps 3 to 7 and make the process userfriendly in order to guide the investigator towards the construction of a model within the data base which provide high predictive accuracy. The other part of the expert system/computer software is to facilitate the query of the model using a new case in order to obtain a diagnosis (step 8 above). In addition to these "calculation parts" of the expert system, information may be included on sample preparation and on sample characteristics, 5-year survival data etc.

Thus, in the further aspect of the present invention, there is a provided a diagnostic kit for diagnosing the presence, type or stage of a disease, e.g.

WO 00/70340 PCT/EP00/04265 20 a tumour or malignancy of a tumour, said kit comprising a database capable of quantifying an protein or nucleic acid expression pattern and comparing it against reference patterns held within the database. The kit may also optionally include, instructions for carrying out any of the methods described above; apparatus for carrying out a 2-DE; micro-array technology or a laser densitometer or other image scanning device.

Aspects and embodiments of the present invention will now be illustrated, by way of example, with reference to the accompanying figures. Further aspects and embodiments will be apparent to those skilled in the art. All documents mentioned in this text are incorporated herein by reference.

Brief Description of the Drawings Fig. 1 The two first principal components scores (t 2 against of the 2-DE training data-set (22 gels and 1553 spots). A benign ovary tumour sample (open circles), B borderline ovary tumour sample (mixed circles), and C malignant ovary tumour sample (filled circles).

Fig. 2 The two first principal components scores (t 2 against ti) of the most informative part of the 2-DE training data-set (22 gels and 170 spots). For descriptions, see Fig i.

Fig. 3 The two first PLS-DA scores (tPS 2 against WO 00/70340 PCT/EP00/04265 21 tPSi) of the entire 2-DE data (40 gels and 170 spots). The samples in the learning-set are indicated using circles A benign ovary tumour sample (open circles), B borderline ovary tumour sample (mixed circles), and C malignant ovary tumour sample (filled circles). The samples in the test-set are indicated using filled/mixed and open squares in analogy with the learning-set.

Fig. 4 The corresponding loading plot to Fig. 3 (wcz against wcl). Indicated are the loading scores for the most significant spots for separation of the three tumour classes.

Fig. 5 The two first principal components scores (t 2 against ti) of breast tumour samples (33 gels and 170 spots). Cases classified as carcinoma are labelled "C" and have filled symbols; cases classified as fibroadenoma are marked with FA and have open symbols.

Detailed Description 1) MATERIALS AND METHODS Tumour tissue samples All samples were obtained within 40 min after resection and tumour cells were enriched as previously described Histopathological characterization was carried out using hematoxylin-eosin stained sections of formalin fixed and paraffin embedded specimens. Tumours -22were classified using the WHO system.

Electrophoresis, scanning and image analysis 2-DE was performed as previously described (11).

Resolyte pH 4 8, BDH) were used for isoelectric focussing, 10 13% linear gradient SDS-polyacrylamide gels were used in the second dimension. Gels were swtained with silver nitrate as described by Rabilloud et al. (12) and scanned at 100 micrometre resolution using a Molecular Dynamics laser densitometer. Data was analysed using

PDQUEST

TM software obtained from Pharmacia Biotech (Uppsala, Sweden).

Data preprocessing The data from the matchset was exported from PDQUEST gel analysis package in the form of tables, with rows representing gels and columns prepresenting spots (data table X see references 14 and 15). Before the analysis, the data was standardized by dividing each variable (table 20 column) by its standard deviation, thereby giving each variable the same influence in the analysis. Thereafter the data is centred by substracting from each column its average.

25 Data analysis se* The preprocessed data table (data table X) was analysed by two data analysis methods. The first one, *o• go* WO 00/70340 PCT/EP00/04265 23 Principal Component Analysis (PCA), extracts the information in the data, in form of eigenvectors or principal components. Visually, one can see this as a cloud of points (the individuals cases/gels) in a multidimensional space (each axis's representing each spot). PCA first centers the data. Secondly, it rotates the data in such a way that the greatest amount of linear variation is described by the first component axis, the residual variation is described by the second component axis, and so on. Most of the information is often compressed into two or three components. A more detailed description of PCA may be found elsewhere (13).

The second data analysis method, Partial Least Squarest (PLS) Discriminant analysis, was used to classify the cases into the three tumour-classes (benign, borderline or malignant). An additional data table (data table Y) with the classificationof the tumours is included into the analysis. Table Y consists of the same number of columns as the number of tumour classes and the number of rows is equal to the number of cases. The table is then filled with suitable dummy variables 1 belongs to a specific tumour class or 0 does not belong).

The PLS-analysis is similar to PCA in that it projects the data table X into a vector. It differs, however, in that the direction of the vector is determined both by the variation of data table X (as in the case of PCA) as well as the variation of data table Y. For further descriptions WO 00/70340 PCT/EP00/04265 24 of PLS, see (14, 15). The significance of the PLS-model is checked by cross-validation. Data from a small number of samples is kept out of the calculation, the PLS model is computed from the remaining data, and the y-values of the deleted are thereafter predicted from the model. The differences in square between predicted and actual y-values for deleted samples are summed to form PRESS (Predictive Error of Sum Squares). This sequence is repeated until each sample has been deleted once.

The data-table used for training the PLS-model consists of 22 cases and 170 spots (Table To test the model a table (18 cases and 170 spots) with unknown tumour class was used (Table X).

The data analysis were carried out on CODEXTM software obtained from Sumit System AB (Stockholm, Sweden) and SIMCATM software obtained from Umetri AB (Umea, Sweden).

2) RESULTS Creation of a Learning Set Cells were extracted from fresh ovarian tumour tissue and single cell suspensions free of erythrocytes were prepared Cytological smears were prepared from all preparations and samples usually contained 90% tumour cells (histopathological characteristics are presented in Table 2-DE polypeptide patterns obtained from these cells were analysed by the PDQUESTTM software The WO 00/70340 PCT/EP00/04265 25 patterns of polypeptide expression in 22 ovarian tumours were examined, 5 benign 6 borderline and 11 malignant cases (objects). These patterns were matched together and a reference 2-DE map was constructed containing 1553 spots (variables) As an initial step, principal component analysis was applied to entire material (22 gels and 1553 spots) to provide an overview over the data structure, to identify outliers and possible clusters. Normalized quantities (expressed as ppm) for all spots were used for the PCA.

Fig. 1 shows the scores for the first two components. A coarse separation into two major groups, A B and C was observed, indicating that latent structures with predictive value are present in this set of data. However, the corresponding loading plots showed very scattered data (data not shown).

Of the original data (1553 variables, Fig. 170 variables had a substantial influence on the model (PLS loadings 0.02). Approx. 100 variables were active in separating the groups A B (benign/borderline) and C (malignant), and approximately 70 variables in separating between A (benign) and B (borderline) An improved separation of the clusters representing each of the three classes was observed using these 170 variables (Fig. 2).

Four significant PLS-DA vectors were found, by using crossvalidation (Q2=0.84), describing 98.4 of the variance in Y and 40.7 in X. This data set was then closed and WO 00/70340 PCT/EP00/04265 26 called "learning set" Testing the model with unknown tumours Eighteen new cases were analysed by 2-DE and added to the existing matchset. Expression levels of the 170 markers for all cases were analysed blindly using PCA, enabling the distribution of new objects. Figure 3 show the predictions of unknown cases in a PLS score plot (and the corresponding loadings in Fig. 4).

After breaking the code, 6 of 8 malignant cases were correctly classified. Case 84 and 89 were classified as borderline. Furthermore, 3 of 4 borderline cases were correctly classified, whereas borderline case 96 was classified as benign. Benign cases 90 and 95 were correctly were correctly classified. Of the remaining 4 cases, 3 were classified as borderline and one (case 29) as borderline/malignant.

Testing a ovary model with breast tumours The possibility that an ovarian cancer model could be used for classification intraductal breast tumours was exploited. The present inventors matched the ovary tumour matchset standard 2-DE map with a corresponding breast tumour standard map in the database Seventy-five of the 170 markers were present in the breast standard map.

Fig. 5 shows the PCA distribution of 33 cases of breast cancer (26 carcinomas, 6 fibroadenomas and 1 normal breast WO 00/70340 PCT/EP00/04265 27 epithelium). Only a tendency of clustering of benign cases was observed which indicate that some but not all of the markers show predictive value.

3) DISCUSSION The present inventors present here a first attempt to apply artificial learning strategies using quantitative 2dimensional electrophoresis data for tumour diagnosis. A subset of the information in the 2-DE pattern, based on 170 spots, was selected. Using these variables, a learning set was constructed where an acceptable separation of the groups benign/borderline/malignant tumours into three clusters was obtained. Whether other combinations of spots will result in an improved separation is unknown and difficult to test, since each learning set has to be tested by a new panel of unknown samples. We tested the learning set using 18 cases, and observed a correct classification of the majority of these (11/18).

It is well known among pathologists that the routinely used limited number of diagnostic sections may not be representative for a certain lesion. In this context it is important to note that the sampling technique employed for 2-DE analysis is more likely to meet the requirements for lesion representivity.

In previous studies by the present inventors, a large degree of heterogeneity in polypeptide expression was observed, particularly in malignant tumors (17, 18). Both -28qualitative and quantitative differences were found within each tumour group. Particularly, the large quantitative variability indicated that identification based on pattern recognition would be difficult. The present data suggests that it is possible to select a subset of variables which show limited variability within the group, and useful for prediction.

Neural networks and artificial learning has been used to predict cancer prognosis and for grading tumors 19- 22). The parameters used have been various TNM-scoring systems, nuclear grading, tumour markers and histopathological scoring. For prostate cancer, the sensitivity of the network was between 81 to 100% and the specificity 72 to 75% to predict various outcomes such as seminal vesicle and lymph node involvement (22) Similarly, neural network analysis has been performed on breast cancer, using parameters such as hormone receptor status, DNA index, tumour size, number of axillary lymph :nodes involved with tumour as input information These studies have indicated that artificial learning is a powerful method to increase the diagnostic accuracy on individual tumours.

The present inventors have noted that many of the ooe alterations observed in 2-DE pattern are similar between 25 tumours of epithelial origin. Thus similar changes in the 0o*o expression of some cytoskeletal and stress proteins are observed in breast, ovarian and prostate tumors (10; 23) With this background, it was interesting to examine whether a selected set of ovarian markers could be used 0. 30 for classification of intraductal breast tumors into benign and malignant. Some clustering of benign cases was observed, whereas malignant cases showed extensive -29scattering. It seems reasonable to suggest that it will be difficult to construct a universal model for epithelial tumors, and that learning sets have to be created for each tumour type.

In conclusion, the present study suggests that artificial learning strategies can be used for tumour diagnosis.

*o WO 00/70340 WO 0070340PCTIEPOO/04265 30

REFERENCES

1. Dalton, L. Page, D. and Dupont, W. Cancer.

73: 2765-2770, 1994.

2. Kronqvist, Montironi, Kuopio, and Collan, Y. U. Anal. Quant. Cytology Histology. 19: 423-429, 1997.

3. Friedlander, M. Seminars in Oncology. 25: 305-314, 1998.

4. Link, C. J. Reed, Sarosy, and Kohn, E. C., Am. J. Medicine. 101: 217-225, 1996.

Coldschmidt, Decaestecker, Berthe, J. V., Gordower, Renunelink, Danguy, Pasteels, J. L., Salmon, and Kiss, Lab. Invest. 75: 295-306, 1996.

6. Spanakis, E. and Brouty-Boye, D. Int. J. Cancer. 71: 402-409, 1997.

7. Garrels, J. J. Biol. Chem. 264: 5269-5282, 1989.

8. Wilkins, M. Hochstrasser, D. Sanchez, J. C., Bairoch, and Appel, R. Trends in Biochem. Sciences.

21: 496-497, 1996.

9. Schmid, H. Schmitter, Blum, Miller, and Vonderschmitt, Electrophoresis. 16: 1961-1968, 1995.

Alaiya, A. Franz~n, Fujioka, Moberger, B., Schedvins, Silversv~rd, Linder, and Auer, G., Int. J. Cancer. 73: 678-683, 1997.

WO 00/70340 PCT/EP00/04265 31 11. Franz6n, Okuzawa, Linder, Kato, and Auer, Electrophoresis. 14: 382-390, 1993.

12. Rabilloud, Vuillard, Gilly, and Lawrence, A general overview., Cell Mol. Biol. 40: 57-75, 1994.

13. Jocliffe, I. T. New York: Springer Verlag, 1986.

14. Jellum, Harboe, Bjune Wold S. J. Pharm. Biomedical Analysis 9: 663-669, 1991.

Hagberg, G. A review of pattern recognition methods.

NMR in Biomedicine. 11: 148-56, 1998.

16. Franz6n, Auer, Alaiya, A. Eriksson, E., Uryu, Hirano, Okuzawa, and Linder, S. downregulation of cytokeratins, Br. J. Cancer. 73: 1632-1638, 1996.

17. Franz6n, Auer, Alaiya, A. Eriksson, E., Uryu, Hirano, Okuzawa, Kato, and Linder, S.

Int. J. Cancer. 69: 408-414, 1996.

18. Alaiya, A. Franz6n, Linder, and Auer, G., Electrophoresis. in press.

19. Dawson, A. Austin, R. E. and Weinberg, D. S., J. Clinical Pathology. 95: 29-37, 1991.

Ravdin, P. Clark, G. Hilsenbeck, S. Owens, M. Vendely, Pandian, M. and McGuire, W. L., Breast Cancer Res. Treat. 21: 47-53, 1992.

21. Erler, B. Hsu, Truong, H. Petrovic, L. M., Kim, S. Huh, M. Ferrell, L. Thung, S. N., -32- Geller, and Marchevsky, A. Lab. Invest. 71: 446- 451, 1994.

22. Tewari, A. and Narayan, P. J. of Urology, 160: 430- 436, 1998.

23. Alaiya Franzen Ragman Silfversward C., Moberger Linder and Auer International Journal of Cancer, 86(5): 731-6, 2000.

Claims

1. A method of obtaining a plurality of gene expression profiles in order to determine profiles characteristic of a selected disease type or stage of development of a disease comprising obtaining cells from a sample of said disease tissue; disrupting cells to expose the cellular products characteristic of gene expression; separating said cellular products according to their characteristic properties on a substrate; and collecting image data from said substrate; and carrying out image analysis to quantify the cellular products distributed on the substrate and partial least squares discriminant analysis (PLS-DA) to determine a set of gene expression profiles characteristic of said disease. ::20 2. A method according to claim 1 wherein the cellular products characteristic of gene expression are proteins.

3. A method according to claim 1 or claim 2 wherein the :substrate is an electrophoresis gel which allows separation of the cellular products characteristic of gene expression according to their size. o*oo

4. A method according to claim 3 wherein said gel is 2D-electrophoresis gel. .oS S. A method according to claim 1 wherein the cellular products characteristic of gene expression are nucleic acid sequences.

6. A method according to claim 5 wherein the nucleic acid sequences are mRNA.

7. A method according to claim 1, claim 5 or claim 6 wherein the substrate comprises a plurality of binding members capable of binding said cellular products characteristic of gene expression.

8. A method according to claim 7 wherein said binding members are oligonucleotides capable of binding said cellular products characteristic of gene expression according to their nucleotide sequence.

9. A method according to claim 1 or claim 2 wherein the substrate comprises a plurality of antibodies. A method according to any one of claims 7 to 9 wherein said substrate is a micro-array. oo .11 ii. A method according to any one of the preceding claims wherein said image analysis is computer-assisted. oioo

12. A method according to any one of the preceding claims **wherein said cellular products characteristic of gene expression are labelled. o oo* *o o•

13. A method according to any one of the preceding claims wherein the disease is cancer and the cells are tumour cells or normal reference cells within a given tumour.

14. A method of determining the presence, type or stage of a disease in a patient comprising the steps of extracting a sample of cells from the patient; disrupting the cells so as to expose the cellular proteins contained therein; separating the cellular proteins on a two- dimensional electrophoresis gel; analysing said gel by computer assisted image evaluation so as create an expression profile for said cells; and comparing the expression profile of the cells with a database of gene expression profiles determined according to claim i, said expression profiles being characteristic of a plurality of disease types or stages of development of said disease to determine presence, type 20 or risk of said disease in said patient. o• oooo

15. A method of determining the presence, type or stage of a disease in a patient comprising the steps of extracting a sample of cells from the patient; disrupting the cells so as to obtain the expressed nucleic acid sequences contained therein; oo[ separating the expressed nucleic acid sequences on a micro-array according to their individual nucleotide sequence; and 30 analysing said micro-array by computer assisted image evaluation so as create an expression profile for cells; and comparing said expression profile with a database of expression profiles determined according to claim 1, said expression profiles being characteristic of a plurality of disease types or stages of development of said disease to determine presence, type or risk of said disease in said patient.

16. A method according to any one of the preceding claims wherein the number of markers characteristic of said disease type is in the range of 20 to 500.

17. A method according to claim 16 wherein the number of markers characteristic of said disease type is in the range of 50 to 300.

18. A method according to any one of claims 14 to 17 wherein the disease type is selected from the group cancer, hypo/hyperthyroidism, diabetes, organ rejection, and samples for leukaemia or other hematopoetic disorders. S2019. A method according to claim 18 wherein said disease state is cancer and said disease tissue is a tumour. o• Dated this Twenty-second day of March 2004 Proteomedica AB oo Patent Attorneys for the oooo oo ~Applicant: F B RICE CO e•* go*