University of Washington - Department of Statistics
In recent years, the growth of data mining has driven increased interest in cluster analysis, which seeks cohesive groups in data without prior hypotheses. It has become clear that cluster analysis, long a collection of ad hoc techniques, can be profitably recast in terms of a statistical model: a finite mixture of probability distributions. This makes it possible to understand when specific methods are likely to work well, and to devise new ones that are near-optimal for particular situations. A finite mixture of multivariate normal distributions works surprisingly well for a wide variety of problems. Banfield and Raftery (1993, Biometrics) proposed refining this model by (a) decomposing each cluster's covariance matrix into factors representing its volume, shape, and orientation, and imposing across-cluster equality constraints on some or all of these, and (b) adding a Poisson process component to the model to represent outliers, that is, data points that do not belong to any cluster. The noise component also allows one to model spatial point processes consisting of features embedded in noise. This work has led to the MCLUST software, which has been widely used in applications.
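The volume/shape/orientation decomposition described above can be sketched numerically via the eigendecomposition of a covariance matrix. The following is an illustrative sketch (not MCLUST itself); the function name and example matrix are invented for demonstration.

```python
import numpy as np

def decompose_covariance(sigma):
    """Split a covariance matrix Sigma into Sigma = volume * D @ A @ D.T,
    where `volume` is a scalar, A is diagonal with determinant 1 (shape),
    and D is orthogonal (orientation). Illustrative sketch only."""
    eigvals, eigvecs = np.linalg.eigh(sigma)
    # eigh returns eigenvalues in ascending order; put the largest axis first
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    d = len(eigvals)
    volume = np.prod(eigvals) ** (1.0 / d)   # lambda = |Sigma|^(1/d)
    shape = np.diag(eigvals / volume)        # diagonal, det(shape) = 1
    orientation = eigvecs                    # orthogonal eigenvector matrix
    return volume, shape, orientation

# Example: a 2-D covariance matrix (values chosen arbitrarily)
sigma = np.array([[4.0, 1.0], [1.0, 2.0]])
vol, shp, orient = decompose_covariance(sigma)
reconstructed = vol * orient @ shp @ orient.T   # recovers sigma
```

Constraining some of these three factors to be equal across clusters while letting others vary is what produces the family of models in the Banfield and Raftery framework.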
I will review recent work on this approach, including methods for determining the number of clusters and the best clustering model, as well as applications to feature detection in images, noise removal, and robust covariance estimation. I will also discuss the scalability of the algorithms and their application to data mining. Applications include medical diagnosis, classifying gamma-ray bursts in astronomy, minefield detection, and fault detection in textile manufacturing.
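Determining the number of clusters in this framework is typically done by fitting mixture models with varying numbers of components and comparing them with a model selection criterion such as BIC. The following is a minimal sketch of that idea using scikit-learn's GaussianMixture (a stand-in, not the MCLUST software the abstract refers to); the synthetic data and seed are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data: two well-separated Gaussian clusters (illustrative only)
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(200, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.5, size=(200, 2)),
])

# Fit mixtures with 1..5 components and record the BIC of each
bics = []
for k in range(1, 6):
    gm = GaussianMixture(n_components=k, random_state=0).fit(data)
    bics.append(gm.bic(data))   # lower BIC indicates a better model

best_k = int(np.argmin(bics)) + 1   # chosen number of clusters
```

BIC trades off fit against model complexity, so it penalizes adding components that do not substantially improve the likelihood; on well-separated data like this it recovers the true number of clusters.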