Marina Meila (PI) and Zaid Harchaoui (co-PI) have been awarded a grant from the National Science Foundation’s Statistics Program for their project on “Cluster Validation Without Model Assumptions”.
The project aims to put in the hands of practitioners a suite of methods to validate the results of clustering. The methods to be developed build on optimization and statistics. They work by recognizing the cases when the data is well clustered. In these cases, they provide a "certificate of stability". This research will develop such methods for several widely used clustering paradigms (such as K-means, Spectral Clustering, finite Mixture Models) in realistic data scenarios.
The framework is based on the notions of good and stable clustering. ``Good'' means that clustering C fits the data well, according to the current clustering paradigm. ``Stable'' means that the only clusterings that fit well are small perturbations of C. This can only happen when C captures structure present in the data. While goodness can be easily checked in practice, stability is not a property that can be verified directly. The core of this proposal is to find conditions on the data and C that guarantee stability and are practically verifiable. Such results are known as stability results.
The proposal outlines two novel approaches to this goal. The first is based on using convex relaxations to the original clustering problem. The project will show how to determine the stability of a clustering by studying a convex relaxation around its global minimum. The second approach is based on ``recycling'' existing theoretical results proved under assumptions about the data generating model. Often, the proof of such a result contains the elements of a model free stability proof. Thus, this project will be using existing statistical theory in a novel way.
Among convex relaxations for clustering, the relaxations to a Semi-Definite Program (SDP) are especially promising. However, solving an SDP is computationally expensive. The project will explore ways to accelerate the validation algorithms by exploiting the special structure of the SDPs. This work will impact all users of clustering, by providing them with provably correct tools to validate the outputs of clustering algorithms.
In particular, the methods will be applied to the validation of clustering in molecular dynamics simulations, in collaboration with researchers in Chemical Engineering and Materials Science at the University of Washington. The work will also align with the broader effort to build theoretical foundations of data science through the NSF TRIPODS Institute for Algorithmic Foundations of Data Science.
These tools will be integrated and disseminated into the big data unsupervised learning platform megaman, distributed and maintained by the group of PI Meila.