AU5926299A

AU5926299A - Data decomposition/reduction method for visualizing data clusters/sub-clusters

Info

Publication number: AU5926299A
Application number: AU59262/99A
Authority: AU
Inventors: Joseph Y. Wang
Original assignee: Catholic University of America
Current assignee: Catholic University of America
Priority date: 1998-09-17
Filing date: 1999-09-17
Publication date: 2000-04-03
Also published as: EP1032918A1; WO2000016250A1; JP2002525719A; CA2310333A1

Description

WO 00/16250 PCT/US99/21363

DATA DECOMPOSITION/REDUCTION METHOD FOR VISUALIZING DATA CLUSTERS/SUB-CLUSTERS

Background Art

The present invention relates generically to the field of data analysis and data presentation and, more particularly, to the analysis of data sets having higher dimensionality data points in order to optimally present the data in a lower dimensional order context, i.e., in a hierarchy of two- or three-dimensional visual contexts to reveal data structures within the data set.

The visualization of data sets having a large number of data points with multiple variables or attributes associated with each data point represents a complex problem. In general, there is no way, a priori, to easily identify groups or sub-groups of data points that have relational attributes such that structures and sub-structures existing within the data set can be visualized. Various techniques have been developed for processing the data sets to reveal internal structures as an aid to understanding the data. In general, a large data set will oftentimes have data points that are multi-variant, that is, a single data point can have a multitude of attributes, including attributes that are completely independent from one another or have some degree of inter-attribute relationship or dependency.

Any elementary visualization process involving the projection of the data set onto a two-dimensional visualization space using straightforward projection algorithms becomes progressively less adequate as the order of the data points increases. Thus, a single projection of a higher order data set onto a visualization space may not be able to present all of the structures and sub-structures within the data set of interest in such a way that the structures or sub-structures can be visually distinguished or discriminated.
One form of presentation schema involves hierarchical visualization by which the data set is viewed at a highest-level, whole data set viewpoint. Thereafter, features within the highest-level projection are identified in accordance with an algorithm(s) or other identification criteria and those next highest level features are further processed to reveal their respective internal structure in another projection(s). This hierarchical process can be repeated for successive levels to present successively finer and more detailed views of the data set. Thus, in a hierarchical visualization scheme, an image tree is provided with the successively lower images of the tree revealing more detail.

One such hierarchical data visualization scheme is disclosed by C. M. Bishop and M. E. Tipping in an article entitled "A Hierarchical Latent Variable Model for Data Visualization," IEEE Trans. Pattern Anal. Machine Intell., Vol. 20, No. 3, pp. 282-293, March 1998. Bishop and Tipping present a hierarchical visualization algorithm based on a two-dimensional hierarchical mixture of latent variable models, the parameters of which are estimated using the expectation-maximization (EM) algorithm. The construction of the hierarchical tree proceeds top-down so that structure decomposition is driven interactively by the user and optimal projection is determined by the maximum likelihood principle. A hierarchy of multiple two-dimensional visualization spaces is provided with the top-level projection displaying the entire data set and successive lower-level projections displaying clusters within the data set displayed at the top level. Further lower-level projections display sub-clusters and related internal structures within the data set.
Initially, the data set is subjected by Bishop and Tipping to a form of linear latent variable modelling to find a representation of the multi-dimensional data set in terms of two latent, or "hidden," variables that is determined indirectly from the data set. The modelling is similar to principal component analysis, but defines a probability density in the data space. In applying the Bishop and Tipping protocol, a single top-level latent variable model is generated with the posterior mean of each data point plotted in the latent space. Any cluster centers identified in this initial plot are used as the basis for initiating the next-lower level analysis leading to a mixture of the latent variable models.

There are two potential limitations associated with the Bishop and Tipping approach. First, although a probability density is defined in the data space through a latent variable model, the prior and order of the mixture model are heuristically selected and an isotropic Gaussian conditional distribution is undesirably restricted, which may misrepresent the true data structures and put the optimality of the formulation in doubt. Secondly, the parameters, including the optimal projections, are determined by maximum likelihood; this criterion need not always lead to the most interesting or interpretable visualization plots.

Disclosure of Invention

The present invention provides a data decomposition/reduction method for visualizing large sets of multi-variant data including the processing of the multi-variant data down to two- or three-dimensional space in order to optimally reveal otherwise hidden structures within the data set including the principal data cluster or clusters at a first or top level of processing and additional sub-clusters within the principal data clusters in successive lower level visualizations.
The identification of the morphology of clusters and subclusters and inter-cluster separation and relative positioning within a large data set allows investigation of the underlying drive that created the data set morphology and the intra-data-set features.

The data set, constituted by a multitude of data points each having a plurality of attributes, is initially processed as a whole using multiple finite normal mixture models and hierarchical visualization spaces to develop the multi-level data visualization and interpretation. The top-level model and its projection explain the entire data set, revealing the presence of clusters and cluster relationships, while lower-level models and projections display internal structure within individual clusters, such as the presence of subclusters, which might not be apparent in the higher-level models and projections. With many complementary mixture models and visualization projections, each level is relatively simple while the complete hierarchy maintains overall flexibility and still conveys considerable structural information. The arrangement combines (a) minimax entropy modeling by which the models are determined and various parameters estimated and (b) principal component analysis to optimize structure decomposition and dimensionality reduction.

The present invention advantageously performs a probabilistic principal component analysis to project the softly partitioned data space down to a desired two-dimensional visualization space to lead to an optimal dimensionality reduction allowing the best extraction and visualization of local clusters. The minimax entropy principle is used to select the model structures and estimate their parameter values, where the soft partitioning of the data set results in a standard finite normal mixture model with minimum conditional bias and variance.
By performing the principal component analysis and minimax entropy modeling alternately, a complete hierarchy of complementary projections and refined models can be generated automatically, corresponding to a statistical description best fitted to the data.

The present invention treats structure decomposition and dimensionality reduction as two separate but complementary operations, where the criterion used to optimize dimensionality reduction is the separation of clusters rather than the maximum likelihood approach of Bishop and Tipping. The resulting projections, in turn, enhance the performance of structure decomposition at the next lower level.

Thereafter, a model selection procedure is applied to determine the number of subclusters inside each cluster at each level using information theoretic criteria based upon the minimum of alternate calculations of the Akaike Information Criterion (AIC) and the minimum description length (MDL) criterion. This determination allows the process of the present invention to automatically determine whether a further split of a subspace should be implemented or whether to terminate the further processing.

A probabilistic adaptive principal component extraction (PAPEX) algorithm is also applied to estimate the desired number of principal axes. When the dimensionality of the raw data is high, this PAPEX approach is computationally very efficient. Lastly, the present invention defines a probability distribution in data space which naturally induces a corresponding distribution in projection space through a Radon transform. This defined probability distribution permits an independent procedure in determining values for the intrinsic model parameters without concurrent estimation of the projection mapping matrix.
In many data sets in which the data points are multi-variant, the data points often form clusters because more than one variable may be a function of the same underlying "drive" that gives rise to the data points.

In accordance with the present invention and as an initial step in processing the raw data set, the data set (designated herein as the t-space) is projected onto a single x-space (i.e., two-dimensional space), in which a descriptor $W$ is determined from the sample covariance matrix $C_t$ by fitting a single Gaussian model to the data set over t-space.

Thereafter, a value $f(x)$ is determined for clusters $K = 1, 2, \ldots, K_{max}$, in which the values of $\pi_k$ and $\theta_{xk}$ are initialized by the user and estimated by maximizing the likelihood over x-space.

After $f(x)$ is determined, values of the Akaike Information Criterion (AIC) and the minimum description length (MDL) for the various clusters $K = 1, 2, \ldots, K_{max}$ are calculated and a model is selected with a $K_0$ that corresponds to the minimum of the calculated values of the AIC and MDL criteria.
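The selection of the number of clusters by AIC and MDL can be sketched as follows. This is a minimal illustration under stated assumptions rather than the patented implementation: it assumes the commonly used forms AIC(K) = -2 log L + 2 K_a and MDL(K) = -log L + 0.5 K_a log N, where L is the maximized likelihood, K_a the number of free parameters, and N the number of data points. The function name `select_order` and its arguments are hypothetical; the log-likelihoods and parameter counts are supplied by the caller.

```python
import math


def select_order(loglik_by_k, n_params_by_k, n_samples):
    """Pick the candidate K that minimizes AIC and the one that minimizes MDL.

    loglik_by_k:   dict mapping K -> maximized log-likelihood of the K-cluster model
    n_params_by_k: dict mapping K -> number of free parameters K_a of that model
    n_samples:     number of data points N
    Assumed forms (not taken from the source text):
        AIC(K) = -2 log L + 2 K_a
        MDL(K) = -log L + 0.5 K_a log N
    """
    aic = {}
    mdl = {}
    for k, loglik in loglik_by_k.items():
        k_a = n_params_by_k[k]
        aic[k] = -2.0 * loglik + 2.0 * k_a
        mdl[k] = -loglik + 0.5 * k_a * math.log(n_samples)
    k_aic = min(aic, key=aic.get)  # K with the smallest AIC value
    k_mdl = min(mdl, key=mdl.get)  # K with the smallest MDL value
    return k_aic, k_mdl, aic, mdl
```

When the two criteria agree, the common value serves as the selected number of clusters; how a disagreement is resolved is a design choice that this sketch leaves to the caller.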

A value $f(t)$ is then determined for $K_0$, in which the values of $\pi_k$, $z_{ik}$, $\mu_{tk}$, and $C_{tk}$ are further refined by maximizing the likelihood over t-space. $W_k$ is determined by directly evaluating the covariance matrix $C_{tk}$ or learning from $t_{ik}$ for $k = 1, 2, \ldots, K_0$.

Thereafter, $x_{ik} = z_{ik}[W_k^T(t_i - \mu_{tk})]$ for $k = 1, 2, \ldots, K_0$ is plotted by projecting the data set onto multiple x-subspaces at the second level for visual evaluation by the user.

Then $G_k(t)$ is determined by repeating the above process steps to thus construct multiple x-subspaces at the third level; the hierarchy is completed under the information theoretic criteria using the AIC and the MDL and all x-space subspaces are plotted for visual evaluation.

The present invention advantageously provides a data decomposition/reduction method for visualizing data clusters/sub-clusters within a large data space that is optimally effective and computationally efficient.

Other objects and further scope of applicability of the present invention will become apparent from the detailed description to follow, taken in conjunction with the accompanying drawings, in which like parts are designated by like reference characters.

Brief Description of the Drawings

The present invention is described below, by way of example, with reference to the accompanying drawings, wherein:

FIG. 1 is a schematic block diagram of a system for processing a raw multi-variant data set in accordance with the present invention;

FIG. 2 is a flow diagram of the process flow of the present invention;

FIG. 2A is an alternative visualization of the process flow of the present invention;

FIG. 3 is an example of the projection of a data set onto a 2-dimensional visualization space after determination of the principal axis;

FIG. 4A is a 2-dimensional visualization space of one of the clusters of FIG. 3;

FIG. 4B is a 2-dimensional visualization space of another of the clusters of FIG. 3;

FIG. 5 is an example of the projection of a data set onto a 2-dimensional visualization space after determination of the principal axis;

FIG. 6A is a 2-dimensional visualization space of one of the clusters of FIG. 5;

FIG. 6B is a 2-dimensional visualization space of a second of the clusters of FIG. 5; and

FIG. 6C is a 2-dimensional visualization space of a third of the clusters of FIG. 5.

Best Mode for Carrying Out the Invention

A processing system for implementing the dimensionality reduction using probabilistic principal component analysis and structure decomposition using adaptive expectation maximization methods for visualizing data in accordance with the present invention is shown in FIG. 1 and designated generally therein by the reference character 10. As shown, the system 10 includes a working memory 12 that accepts the raw multi-variant data set, indicated at 14, and which bi-directionally interfaces with a processor 16. The processor 16 processes the raw t-space data set 14 as explained in more detail below and presents that data to a graphical user interface (GUI) 18 which presents a two- or three-dimensional visual presentation to the user as also explained below. If desired, a plotter or printer 20 (or other hard copy output device) can be provided to generate a printed record of the display output of the graphical user interface (GUI). The processor 16 may take the form of a software or firmware programmed CPU, ALU, ASIC, or microprocessor or a combination thereof.

As the initial step in processing the raw data and as presented in FIG. 2, the data set is subject to a global principal component analysis to thereafter effect a top-most projection. This step is initiated by determining the value of a variable $W$ for the top-most projection in the hierarchy of projections. For relatively low dimensional data sets, $W$ is directly found by evaluating the covariance matrix $C_t$.
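The direct evaluation of the covariance matrix for the top-most projection can be illustrated with a short NumPy sketch. It is only an illustration under stated assumptions: the function name `top_level_projection` is hypothetical, the data are assumed to be held as an N x d array with one data point per row, and the projection matrix is taken to be the two eigenvectors of the sample covariance matrix with the largest eigenvalues.

```python
import numpy as np


def top_level_projection(t, n_axes=2):
    """Project t-space data onto its leading principal axes.

    t:      array of shape (N, d), one data point per row (assumed layout)
    n_axes: number of principal axes to keep (2 for a planar display)
    Returns (w, mu, x): the d x n_axes projection matrix, the sample mean,
    and the projected points x_i = W^T (t_i - mu).
    """
    t = np.asarray(t, dtype=float)
    mu = t.mean(axis=0)
    c_t = np.cov(t, rowvar=False)             # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(c_t)    # eigh: symmetric input, ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_axes]
    w = eigvecs[:, order]                     # columns are the top eigenvectors
    x = (t - mu) @ w                          # coordinates in the visualization space
    return w, mu, x
```

The rows of `x` can be passed to any scatter-plot routine to obtain the top-level display.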
For higher dimensional data sets, only the top two eigenvectors of the covariance matrix of the data points are of interest; depending upon the dimensionality of the raw data, it may be computationally more efficient to apply the adaptive principal components extraction (APEX) algorithm described in Y. Wang, S. H. Lin, H. Li, and S. Y. Kung, "Data mapping by probabilistic modular networks and information theoretic criteria," IEEE Trans. Signal Processing, Vol. 46, No. 12, pp. 3378-3397, December 1998, to find $W$ directly from the raw data points $t_i$.

After the data set is projected and displayed by its principal component axis and, on the basis of this single x-space and given a fixed $K_0$, the user then selects or identifies those points $\mu_{xk}$ on the plot corresponding to the centers of apparent clusters. The two-step expectation maximization (EM) algorithm can be applied to allow a standard finite normal mixture (SFNM) model to be fitted, i.e.,

$$f(x) = \sum_{k=1}^{K_0} \pi_k \, g(x \mid \theta_{xk}) \qquad \text{(EQ. 1)}$$

where

$$g(x \mid \theta_{xk}) = \int g(t \mid \theta_{tk}) \, \delta(x - W^T t + W^T \mu_t) \, dt \qquad \text{(EQ. 2)}$$

and where the log-likelihood of projecting the data under the Radon transform is

$$L = \sum_{i=1}^{N} \log f(x_i) \qquad \text{(EQ. 3)}$$

The standard finite normal mixture (SFNM) modeling solution addresses the estimation of the regional parameters $(\pi_k, \theta_{tk})$ and the detection of the structural parameter $K_0$ in the relationship

$$p(t) = \sum_{k=1}^{K_0} \pi_k \, g(t \mid \theta_{tk}) \qquad \text{(EQ. 4)}$$

based on the observations $t$. It has been shown that when $K_0$ is given, the maximum likelihood (ML) estimate of the regional parameters can be obtained using the expectation maximization (EM) algorithm.

There are two observations with the described approach. First, when the dimension of the data space is high, the implementation of the expectation maximization (EM) algorithm is very complex and time consuming. Additionally, the initialization of the expectation maximization (EM) algorithm is heuristically chosen, and this heuristic selection often leads to only a locally optimal solution. Therefore, it is reasonable to consider the model parameter values being estimated, first, in the projected x-space and then further adjusted or fine tuned in the data t-space.
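The mixture density of EQ. 1 and the log-likelihood of EQ. 3 can be evaluated numerically as in the sketch below. It is an illustration only: the function names are hypothetical, and each component $g(x \mid \theta_{xk})$ is assumed to be a Gaussian with a full covariance matrix, parameterized by a mean vector and a covariance matrix.

```python
import numpy as np


def gaussian_pdf(x, mean, cov):
    """Multivariate normal density evaluated at each row of x (shape (N, d))."""
    d = x.shape[1]
    diff = x - mean
    maha = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(cov), diff)
    return np.exp(-0.5 * maha) / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))


def sfnm_density(x, pis, means, covs):
    """f(x) = sum_k pi_k g(x | theta_xk), cf. EQ. 1."""
    return sum(p * gaussian_pdf(x, m, c) for p, m, c in zip(pis, means, covs))


def log_likelihood(x, pis, means, covs):
    """Sum over data points of log f(x_i), cf. EQ. 3."""
    return float(np.sum(np.log(sfnm_density(x, pis, means, covs))))
```

The same functions apply unchanged in t-space, since only the dimension of the points and of the component parameters differs.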
One natural criterion used for determining the optimal parameter values is to minimize the distance between the standard finite normal mixture (SFNM) distribution $f(x)$ and the data histogram $f_x$. Relative entropy (Kullback-Leibler distance), as suggested by information theory, is a suitable measure of the reconstruction error, given by:

$$D(f_x \| f) = \sum_x f_x(x) \log \frac{f_x(x)}{f(x)} \qquad \text{(EQ. 5)}$$

When relative entropy is used as a distance measure, distance minimization is equivalent to maximum likelihood estimation, summarized by

$$\mathcal{L} = \exp\left(-N\left[H(f_x) + D(f_x \| f)\right]\right) \qquad \text{(EQ. 6)}$$

where $H$ is the entropy calculator described by Y. Wang, S. H. Lin, H. Li, and S. Y. Kung in "Data mapping by probabilistic modular networks and information theoretic criteria," IEEE Trans. Signal Processing, Vol. 46, No. 12, pp. 3378-3397, December 1998.

The EM algorithm is implemented as a two-step process, i.e., the E-step and the M-step, as follows:

E-step:

$$z_{ik}^{(n)} = \frac{\pi_k^{(n)} \, g(x_i \mid \theta_{xk}^{(n)})}{f(x_i)} \qquad \text{(EQ. 7)}$$

and the M-step:

$$\pi_k^{(n+1)} = \frac{1}{N} \sum_{i=1}^{N} z_{ik}^{(n)} \qquad \text{(EQ. 8A)}$$

$$\mu_{xk}^{(n+1)} = \frac{1}{N \pi_k^{(n+1)}} \sum_{i=1}^{N} z_{ik}^{(n)} x_i \qquad \text{(EQ. 8B)}$$

$$C_{xk}^{(n+1)} = \frac{1}{N \pi_k^{(n+1)}} \sum_{i=1}^{N} z_{ik}^{(n)} \left(x_i - \mu_{xk}^{(n+1)}\right)\left(x_i - \mu_{xk}^{(n+1)}\right)^T \qquad \text{(EQ. 8C)}$$

In each complete processing cycle, the previous set of parameter values is used to determine the posterior probabilities $z_{ik}^{(n)}$ using the E-step equation. These posterior probabilities are then used to obtain the new set of values $\pi_k^{(n+1)}$, $\mu_{xk}^{(n+1)}$, and $C_{xk}^{(n+1)}$ using the appropriate M-step equations. The processing is continued until a minimum in the value of the relative entropy $D(f_x \| f)$ is identified. This model selection criterion will determine the optimal number $K_0$ unless it is already at a local minimum.
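One E-step/M-step cycle of the kind described above can be sketched in NumPy as follows. This is a hedged illustration, not the patented procedure: the function names are hypothetical, the components are assumed to be full-covariance Gaussians, and the loop simply runs a fixed number of cycles (an assumption made for brevity) instead of monitoring the relative entropy for a minimum.

```python
import numpy as np


def gaussian_pdf(x, mean, cov):
    """Multivariate normal density evaluated at each row of x (shape (N, d))."""
    d = x.shape[1]
    diff = x - mean
    maha = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(cov), diff)
    return np.exp(-0.5 * maha) / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))


def em_step(x, pis, means, covs):
    """One EM cycle for a finite normal mixture; returns updated parameters."""
    n = x.shape[0]
    # E-step: posterior probability z_ik that point i belongs to component k.
    weighted = np.stack(
        [p * gaussian_pdf(x, m, c) for p, m, c in zip(pis, means, covs)], axis=1
    )
    f_x = weighted.sum(axis=1)                 # mixture density f(x_i)
    z = weighted / f_x[:, None]
    # M-step: re-estimate mixing proportions, means, and covariances.
    n_k = z.sum(axis=0)                        # equals N * pi_k
    new_pis = n_k / n
    new_means = (z.T @ x) / n_k[:, None]
    new_covs = []
    for k in range(len(n_k)):
        diff = x - new_means[k]
        new_covs.append((z[:, k, None] * diff).T @ diff / n_k[k])
    # Log-likelihood of the parameters that entered this cycle.
    loglik = float(np.sum(np.log(f_x)))
    return new_pis, new_means, np.array(new_covs), loglik


def fit_sfnm(x, pis, means, covs, n_iter=50):
    """Run a fixed number of EM cycles (stopping rule is an assumption)."""
    loglik = None
    for _ in range(n_iter):
        pis, means, covs, loglik = em_step(x, pis, means, covs)
    return pis, means, covs, loglik
```

In an interactive setting, the initial `means` would be the cluster centers picked by the user on the top-level plot, with the mixing proportions and covariances started from simple defaults.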
The model selection procedure will then determine the optimal number $K_0$ of models to fit at the next level down in the hierarchy using the two information theoretic criteria, where $K_a = 7K_0 - 1$ (i.e., the values of Akaike's Information Criterion (AIC) and the Minimum Description Length (MDL) are calculated for $K$, with selection of a model in which $K$ corresponds to the minimum of the AIC and the MDL). The resulting points $\mu_{tk}^{(0)}$ in data space, obtained by EQ. 9, are then used as the initial means of the respective submodels. Since the mixing proportions $\pi_k$ are projection-invariant, a 2 x 2 unit matrix is assigned to the remaining parameters of the covariance matrix $C_{tk}$. The expectation-maximization (EM) algorithm can be again applied to allow a standard finite normal mixture (SFNM) with $K_0$ submodels to be fitted to the data over t-space. The corresponding EM algorithm can be derived by replacing all $x$ in the E-step and the M-step equations, above, by $t$.

With a soft partitioning of the data set to generate possible models for the next level projection using the EM algorithm, data points will now effectively belong to more than one cluster at any given level. Thus, the effective input values are

$$t_{ik} = z_{ik}(t_i - \mu_{tk})$$

for an independent visualization subspace $k$ in the hierarchy. $C_{tk}$ can be directly evaluated to obtain $W_k$ as described above. However, when the determination of $W_k$ is based on a neural network learning scheme, an algorithm termed the probabilistic adaptive principal component extraction (PAPEX) is applied as follows.

The feedforward weight vector $w_{mk}$ and the feedback weight vector $a_k$ are initialized to small random values at time $i = 1$ and a small positive value is assigned to the learning rate parameter $\eta$. For $m = 1$ and for $i = 1, 2, \ldots$, the value of

$$y_{1k}(i) = w_{1k}^T(i) \, t_{ik} \qquad \text{(EQ. 10)}$$

is computed, where

$$w_{1k}(i+1) = w_{1k}(i) + \eta\left[y_{1k}(i) \, t_{ik} - y_{1k}^2(i) \, w_{1k}(i)\right] \qquad \text{(EQ. 11)}$$

For large values of $i$, $w_{1k}(i) \to w_{1k}$, where $w_{1k}$ is the eigenvector associated with the largest eigenvalue of the covariance matrix $C_{tk}$. Thereafter $m$ is set equal to 2 and for $i = 1, 2, \ldots$, the following values are computed:

$$y_{2k}(i) = w_{2k}^T(i) \, t_{ik} + a_k(i) \, y_{1k}(i) \qquad \text{(EQ. 12)}$$

$$w_{2k}(i+1) = w_{2k}(i) + \eta\left[y_{2k}(i) \, t_{ik} - y_{2k}^2(i) \, w_{2k}(i)\right] \qquad \text{(EQ. 13)}$$

and

$$a_k(i+1) = a_k(i) - \eta\left[y_{2k}(i) \, y_{1k}(i) + y_{2k}^2(i) \, a_k(i)\right] \qquad \text{(EQ. 14)}$$

For large values of $i$, $w_{2k}(i) \to w_{2k}$, where $w_{2k}$ is the eigenvector associated with the second largest eigenvalue of the covariance matrix $C_{tk}$.

Having determined principal axes $W_k$ of the mixture model at the second level, the visualization subspaces are then constructed by plotting each data point $t_i$ at the corresponding $x_{ik}$ for $k = 1, 2, \ldots, K_0$. Thus if one particular point takes most of the contribution for a particular component, then that point will effectively be visible only on the corresponding subspace.

The determination of the parameters of the models at the third level can again be viewed as a two-step estimation problem, in which a further split of the models at the second level is determined within each of the subspaces over x-space, and then the parameters of the selected models are fine tuned over t-space. Based on the plot of $x_{ik}$, the learning of $g_k(x)$ can again be performed using the expectation-maximization (EM) algorithm and the model selection procedures described above. The third level EM algorithm has the same form as the EM algorithm at the second level, except that in the E-step, the posterior probability that a data point $x_i$ belongs to submodel $j$ is given by

$$z_{i(k,j)} = z_{ik} \, z_{ij|k} = z_{ik} \frac{\pi_{j|k} \, g(x_i \mid \theta_{x(k,j)})}{g_k(x_i)} \qquad \text{(EQ. 15)}$$

where $z_{ik}$ are constants estimated from the second level of the hierarchy. The corresponding M-step in the expectation maximization algorithm is then given by

$$\pi_{j|k} = \frac{\sum_{i=1}^{N} z_{i(k,j)}}{\sum_{i=1}^{N} z_{ik}} \qquad \text{(EQ. 16)}$$

$$\mu_{x(k,j)} = \frac{\sum_{i=1}^{N} z_{i(k,j)} \, x_i}{\sum_{i=1}^{N} z_{i(k,j)}} \qquad \text{(EQ. 17)}$$

$$C_{x(k,j)} = \frac{\sum_{i=1}^{N} z_{i(k,j)} \left(x_i - \mu_{x(k,j)}\right)\left(x_i - \mu_{x(k,j)}\right)^T}{\sum_{i=1}^{N} z_{i(k,j)}} \qquad \text{(EQ. 18)}$$

Similarly, the resulting points in data space

$$\mu_{t(k,j)} = W_k \, \mu_{x(k,j)} + \mu_{tk} \qquad \text{(EQ. 19)}$$

are then used to initialize the means of the respective submodels, and the expectation maximization (EM) algorithm can be applied to allow a standard finite normal mixture (SFNM) distribution with $K_0$ submodels to be fitted to the data over t-space. The formulation can be derived by simply replacing all $x$ in the second level M-step by $t$. With the resulting $z_{i(k,j)}$ in t-space, the PAPEX algorithm can be applied to estimate $W_{(k,j)}$, in which the effective input values are expressed by

$$t_{i(k,j)} = z_{i(k,j)}\left(t_i - \mu_{t(k,j)}\right) \qquad \text{(EQ. 20)}$$

The next level visualization subspace is generated by plotting each data point $t_i$ at the corresponding

$$x_{i(k,j)} = z_{i(k,j)} \, W_{(k,j)}^T \left(t_i - \mu_{t(k,j)}\right)$$

values in the $(k,j)$ subspace.

The construction of the entire tree structure hierarchy is automatically completed with the flow diagram of FIG. 2 ending when no further data split is recommended by the information theoretic criteria (AIC and MDL) in all of the parent subspaces.

A first exemplary two-level implementation of the present invention is shown in FIGS. 3, 4A, and 4B in which the entire data set is present in the top level projection and two local clusters within that top level projection are each individually presented in FIGS. 4A and 4B. As shown in FIG. 3, the entire data set is subject to principal component analysis as described above to obtain the principal axis or axes (axis Ax being representative) for the top level display. Additionally, the axis (unnumbered) for each of the apparent clusters is displayed. Thereafter, the apparent centers of the two clusters are identified and the data subject to the aforementioned processing to further reveal the local cluster of FIG. 4A and the local cluster of FIG. 4B.

A second exemplary two-level implementation of the present invention is shown in FIGS. 5, 6A, 6B, and 6C in which the entire data set is present in the top level projection and three local clusters within that top level projection are each individually presented in FIGS. 6A, 6B, and 6C. As shown in FIG. 5, the entire data set is subject to principal component analysis as described above to obtain the principal axis (Ax) and the axis (unnumbered) for each of the apparent clusters as displayed. The t-space raw data set arises from a mixture of three Gaussians consisting of 300 data points as presented in FIG. 5. As shown, two cloud-like clusters are well separated while a third cluster appears spaced in between the two well separated cloud-like clusters. By performing the same operations as described above, the second level visual space is generated with a mixture of two local principal component axis subspaces where the line Ax indicates the global principal axis. When the two information theoretic criteria (AIC and MDL) are applied to examine these two cluster plots, the plot on the "right" of FIG. 5 shows evidence of a further split. At the third level of data modeling, a hierarchical model is adopted, which illustrates that there are indeed a total of three clusters within the data set, as shown in FIGS. 6A, 6B, and 6C.

An alternate visualization of the process flow of the present invention is shown in FIG. 2A in which the data is input and structured and the high level data set is then subject to algorithmic processing to iteratively effect the data structure decomposition, dimensionality reduction, and multiple model selection using the AIC/MDL criteria and effect a best fit for the next subsequent projection. Thereafter, extraction by the above described probabilistic adaptive principal component processing and the Radon transform is effected to thereafter generate the data cluster visualizations.
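The first stage of the adaptive principal component extraction (the updates of EQ. 10 and EQ. 11) can be sketched as a simple online learning loop. This is an illustration under stated assumptions, not the full PAPEX procedure: the function name `first_principal_axis` is hypothetical, the rows of `t_eff` are assumed to be the already weighted and centered effective inputs $t_{ik}$, and the learning rate, number of passes, and initialization scale are arbitrary choices. The second-axis stage with the feedback weight is omitted.

```python
import numpy as np


def first_principal_axis(t_eff, eta=0.01, n_epochs=5, seed=0):
    """Estimate the leading principal axis with the update of EQ. 10 and EQ. 11.

    t_eff:    array of shape (N, d) holding the effective inputs t_ik
    eta:      small positive learning rate (assumed value)
    n_epochs: number of passes over the data (assumed value)
    """
    rng = np.random.default_rng(seed)
    w = 0.01 * rng.standard_normal(t_eff.shape[1])   # small random initial weights
    for _ in range(n_epochs):
        for t in t_eff:
            y = w @ t                                # EQ. 10: y = w^T t
            w = w + eta * (y * t - y ** 2 * w)       # EQ. 11: weight update
    return w
```

For well-scaled inputs the returned vector tends toward the eigenvector of the input covariance matrix with the largest eigenvalue, up to sign, which avoids forming and decomposing the full covariance matrix when the dimensionality is high.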

Industrial Applicability

The present invention has use in all applications requiring the analysis of data, particularly multi-dimensional data, for the purpose of optimally visualizing various underlying structures and distributions present within the universe of data. Applications include the detection of data clusters and sub-clusters and their relative relationships in areas of medical, industrial, geophysical imaging, and digital library processing, for example.

As will be apparent to those skilled in the art, various changes and modifications may be made to the illustrated data decomposition/reduction method for visualizing data clusters/sub-clusters of the present invention without departing from the spirit and scope of the invention as determined in the appended claims and their legal equivalent.

CROSS REFERENCE TO UNITED STATES PROVISIONAL PATENT APPLICATION

This application claims the benefit of the filing date of co-pending U.S. Provisional Patent Application No. 60/100,622 filed on September 17, 1998 by the same inventor herein and entitled "Hierarchical Minimax Entropy Modeling and Visualization for Data Representation and Analysis/Interpretation," the disclosure of which is incorporated herein by reference.

Claims

1. A method of processing a data set of a multitude of data points each having a dimensionality greater than at least three to provide a hierarchy of visualizations of the data set into an at least two-dimensional space including a top-level visualization and at least one second level visualization presenting at least one cluster K of the top-level visualized therein, comprising the steps of:

providing, as the top-level visualization, a reduced dimension projection of the entire data set along at least a principal axis into an at least two-dimensional visualization space in which the dimensionality of the projected data set is reduced by principal component analysis of the data set to obtain a principal component projection axis;

selecting at least one point on said first-mentioned visualization space corresponding to centers of apparent clusters;

developing an optimum number of possible models for a second level projection;

determining the optimum number of local clusters K for the second level projection by alternately calculating the Akaike information criterion and the minimum description length and using the minimum of the Akaike information criterion and minimum description length to determine the optimum number of local clusters K;

determining the principal axes for visualization subspaces for the so-determined local clusters and projecting the data for at least one of the so-determined local clusters in a visualization space different from the first-mentioned visualization space.

2. The method of claim 1, wherein the principal component for the top-level projection is determined by directly evaluating the covariance matrix.

3. The method of claim 1, wherein the principal component for the top-level projection is determined by adaptive principal components extraction.

4. The method of claim 1, wherein the plurality of possible models for the next-to-top level projection are developed by successive cycles of the E-step and the M-step of the expectation maximization algorithm until a minimum relative entropy is attained.

5. A method of optimally processing a large data set of high dimensional (>3) data points to provide, by dimensional reduction, cluster analysis, and two-dimensional surface projection of a hierarchy of visual displays for the purpose of discerning data information relationships therein, comprising the steps of:

a. providing, through principal component analysis, a top level visualization as a projection of the entire data set onto a two-dimensional visualization space defined by its principal component projection axis;

b. selection by algorithm of an initial best estimate of data points on said first-mentioned visualization space corresponding to centers of apparent clusters;

c. developing an optimal number of possible models for a second level projection;

d. determining the optimal number of local clusters for the second level projections by calculating the minimum of the AIC or the MDL to determine the optimum number of second level clusters;

e. determining the principal component axis of each second level cluster for projection onto respective two-dimensional subspaces for display visualization; and

f. repeating steps c, d, and e until no further data point clusters are algorithmically detectable.

6. A method of optimally processing a large data set of high dimensional (>3) data points to provide, by dimensional reduction, cluster analysis and two-dimensional surface projection of a hierarchy of visual displays for the purpose of discerning data information relationships therein, comprising the steps of:

a. providing, through principal component analysis, a top level visualization as a projection of the entire data set onto a two-dimensional visualization space defined by its principal component projection axis;

b. heuristically selecting, from multiple competing choices, the initial best estimates of data points on said first-mentioned visualization space corresponding to centers of apparent clusters;

c. developing an optimal number of possible models for a second level projection;

d. determining the optimal number of local clusters for the second level projections by calculating the minimum of the AIC or the MDL to determine the optimum number of second level clusters;

e. determining the principal component axis of each second level cluster for projection onto respective two-dimensional subspaces for display visualization; and

f. repeating steps c, d, and e until no further data point clusters are heuristically detectable.

7. A system for processing a data set of a multitude of data points each having a dimensionality greater than at least three to provide a hierarchy of visualizations of the data set into an at least two-dimensional space including a top-level visualization and at least one second level visualization presenting at least one cluster K of the top-level visualized therein, characterized by:
a processor having a cooperating memory containing a data set of a multitude of data points each having a dimensionality greater than at least three;
a display for presenting one or more visualizations of the data set as processed by the processor;
the processor providing, as the top-level visualization on the display, a reduced dimension projection of the entire data set along at least a principal axis into an at least two-dimensional visualization space in which the dimensionality of the projected data set is reduced by principal component analysis of the data set to obtain a principal component projection axis;
the processor selecting at least one point on said first-mentioned visualization space corresponding to centers of apparent clusters;
the processor thereafter developing an optimum number of possible models for a second level projection;
the processor determining the optimum number of local clusters K for the second level projection by alternately calculating the Akaike information criteria and the minimum description length and using the minimum of the Akaike information criterion and minimum description length to determine the optimum number of local clusters K;
the processor determining the principal axes for visualization subspaces for the so-determined local clusters and projecting the data for at least one of the so-determined local clusters in a visualization space on the display different from the first-mentioned visualization space.
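By way of illustration only, the selection of the number of local clusters K recited in claim 7 can be sketched as follows. The sketch assumes the commonly used forms AIC = -2 log L + 2p and MDL = -log L + (p/2) log N, and assumes each candidate model is a Gaussian mixture with full covariance matrices when counting the free parameters p; neither assumption is stated in the claim. The function names and the log-likelihood values in the usage note are hypothetical.

```python
import math


def n_free_params(k, dim):
    """Free parameters of a k-component Gaussian mixture with full
    covariances (an assumed model family, not taken from the claims)."""
    return (k - 1) + k * dim + k * dim * (dim + 1) // 2


def aic(log_lik, n_params):
    """Akaike information criterion in its common textbook form."""
    return -2.0 * log_lik + 2.0 * n_params


def mdl(log_lik, n_params, n_points):
    """Minimum description length in its common textbook form."""
    return -log_lik + 0.5 * n_params * math.log(n_points)


def select_k(log_liks, dim, n_points):
    """Return the K minimizing AIC, the K minimizing MDL, and all scores.

    log_liks maps each candidate K to the maximized log-likelihood of the
    fitted K-cluster model (hypothetical input supplied by the caller).
    """
    scores = {}
    for k, log_lik in log_liks.items():
        p = n_free_params(k, dim)
        scores[k] = (aic(log_lik, p), mdl(log_lik, p, n_points))
    k_aic = min(scores, key=lambda k: scores[k][0])
    k_mdl = min(scores, key=lambda k: scores[k][1])
    return k_aic, k_mdl, scores
```

For example, calling `select_k({1: -1200.0, 2: -1000.0, 3: -995.0, 4: -993.0}, dim=2, n_points=500)` with these made-up log-likelihoods selects K = 2 under both criteria, because the small likelihood gains beyond two clusters do not offset the added parameter penalty.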

8. The system of claim 7, wherein the principal component for the top-level projection is determined by directly evaluating the covariance matrix.
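By way of illustration only, determining the principal components by directly evaluating the covariance matrix, as in claim 8, might look like the sketch below. It uses power iteration with deflation to obtain the two leading eigenvectors, which is one of several possible eigen-solvers and is an assumption of this sketch rather than a technique named in the claims. All function names are hypothetical.

```python
import math


def covariance(data):
    """Return the mean, the centered rows, and the sample covariance matrix."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in data]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return mean, centered, cov


def leading_eigvec(mat, iters=500):
    """Power iteration for the dominant eigenpair of a symmetric matrix.

    The start vector is arbitrary; the iteration fails if it happens to be
    orthogonal to the dominant eigenvector.
    """
    d = len(mat)
    v = [float(i + 1) for i in range(d)]
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    mv = [sum(mat[i][j] * v[j] for j in range(d)) for i in range(d)]
    lam = sum(a * b for a, b in zip(v, mv))  # Rayleigh quotient (v is unit length)
    return lam, v


def project_2d(data):
    """Project each data point onto the two leading principal axes."""
    _, centered, cov = covariance(data)
    lam1, v1 = leading_eigvec(cov)
    d = len(cov)
    # Deflation: remove the first component so the second becomes dominant.
    deflated = [[cov[i][j] - lam1 * v1[i] * v1[j] for j in range(d)]
                for i in range(d)]
    _, v2 = leading_eigvec(deflated)
    return [(sum(a * b for a, b in zip(r, v1)),
             sum(a * b for a, b in zip(r, v2))) for r in centered]
```

Each returned pair gives the coordinates of one data point in the two-dimensional visualization space. The sign of each axis is arbitrary, so mirrored projections are equally valid.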

9. The system of claim 8, wherein the principal component for the top-level projection is determined by adaptive principal components extraction.

10. The method of claim 7, wherein the plurality of possible models for the next-to-top level projection are developed by successive cycles of the E-step and the M-step of expectation-maximization algorithm until a minimum relative entropy is attained.
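By way of illustration only, the alternating E-step and M-step of claim 10 can be sketched for a one-dimensional Gaussian mixture. The sketch stops when the log-likelihood no longer improves; this is used here as a stand-in for the minimum relative entropy condition, on the assumption that minimizing the relative entropy between the empirical data distribution and the model is equivalent to maximizing the likelihood. The function names, the variance floor, and the tolerance are hypothetical choices.

```python
import math


def gauss(x, mu, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_1d(xs, init_mus, init_vars, init_weights, tol=1e-8, max_iter=200):
    """Fit a 1-D Gaussian mixture by expectation maximization.

    Returns the means, variances, mixing weights, and the log-likelihood
    computed in the last E-step. Assumes no component loses all of its
    responsibility (which would cause a division by zero).
    """
    mus, variances, weights = list(init_mus), list(init_vars), list(init_weights)
    n, k = len(xs), len(mus)
    prev_ll = -math.inf
    ll = prev_ll
    for _ in range(max_iter):
        # E-step: posterior probability of each cluster for each point.
        resp, ll = [], 0.0
        for x in xs:
            joint = [weights[j] * gauss(x, mus[j], variances[j]) for j in range(k)]
            total = sum(joint)
            ll += math.log(total)
            resp.append([p / total for p in joint])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)  # floor avoids a degenerate cluster
        # Stop once the log-likelihood has effectively stopped increasing.
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return mus, variances, weights, ll
```

In the hierarchical setting of the claims, the initial means would correspond to the cluster centers selected on the top-level visualization, and the procedure would be repeated for each candidate number of clusters before the AIC or MDL comparison.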

11. A system for optimally processing a large data set of high dimensional (>3) data points to provide, by dimensional reduction, cluster analysis, and two-dimensional surface projection of a hierarchy of visual displays for the purpose of discerning data information relationships therein, characterized by:
a processor having a cooperating memory containing a data set of a multitude of data points each having a dimensionality greater than at least three and a display for presenting one or more visualizations of the data set as processed by the processor;
the processor, through principal component analysis, providing a top level visualization as a projection of the entire data set onto a two-dimensional visualization space in the display and defined by its principal component projection axis;
the processor selecting by algorithm an initial best estimate of data points on said first-mentioned visualization space corresponding to centers of apparent clusters;
the processor developing an optimal number of possible models for a second level projection;
the processor determining the optimal number of local clusters for the second level projections by calculating the minimum of the AIC or the MDL to determine the optimum number of second level clusters;
the processor determining the principal component axis of each second level cluster for projection onto respective two-dimensional subspaces for visualization on the display.

12. A system for optimally processing a large data set of high dimensional (>3) data points to provide, by dimensional reduction, cluster analysis and two-dimensional surface projection of a hierarchy of visual displays for the purpose of discerning data information relationships therein, characterized by:
a processor having a cooperating memory containing a data set of a multitude of data points each having a dimensionality greater than at least three and a display for presenting one or more visualizations of the data set as processed by the processor;
the processor effecting a principal component analysis of the data to provide a top level visualization as a projection of the entire data set onto a two-dimensional visualization space of the display and defined by its principal component projection axis;
the processor, in response to a heuristic selection entered by a user, providing an initial best estimate of data points on said first-mentioned visualization space corresponding to centers of apparent clusters;
the processor thereafter developing an optimal number of possible models for a second level projection;
the processor determining the optimal number of local clusters for the second level projections by calculating the minimum of the AIC or the MDL to determine the optimum number of second level clusters; and
the processor determining the principal component axis of each second level cluster for projection onto respective two-dimensional subspaces for display visualization by the display.

13. A computer automated process for generating a hierarchy of minimax entropy models and optimum visualization projections for high dimensional space data to improve data representation and interpretation, characterized by:
structurally decomposing a high dimensional space data utilizing minimax entropy principles to develop a statistical framework for model identification to an optimal number and kernel shape of local clusters from said data;
dimensionally reducing said high dimensional space data by combining minimax entropy principles and principal component analysis to optimize data structure decomposition;
iteratively and separately performing principal component analysis and minimax entropy model identification to generate a hierarchy of complementary projections and models to develop an intrinsic model to best-fit the high dimensional space data; and
creating a substantially reduced dimensional visualization space to facilitate better data representation and interpretation of said data.