University of Washington - Department of Statistics
In the context of variable selection for model-based clustering, the problem of comparing two nested subsets of variables is recast as a model comparison problem and addressed using approximate Bayes factors. A greedy search algorithm is proposed for finding a local optimum in model space. The resulting method simultaneously selects the variables (or features), the number of clusters, and the clustering model, for either continuous or discrete data. We present the results of applying the method to several datasets. Removing irrelevant variables often improves clustering performance. Compared with methods based on all the variables, the variable selection method consistently yields more accurate estimates of the number of groups and lower classification error rates, as well as more parsimonious clustering models and easier visualization of results.
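
To make the idea of a greedy search over variable subsets concrete, the following is a minimal, generic sketch and not the procedure as specified in the paper. It assumes a caller-supplied function score that assigns higher values to better subsets; in the setting described above, that role would be played by the approximate Bayes factor evidence for a subset, which is not computed here. The names greedy_select and toy_score, the variable labels, and the ordering of inclusion and removal steps are all illustrative assumptions.

```python
from typing import Callable, FrozenSet, Iterable


def greedy_select(variables: Iterable[str],
                  score: Callable[[FrozenSet[str]], float]) -> FrozenSet[str]:
    """Alternate inclusion and removal steps until neither improves the score.

    The result is a local optimum of `score` over subsets of `variables`;
    it is not guaranteed to be the global optimum.
    """
    all_vars = frozenset(variables)
    selected: FrozenSet[str] = frozenset()
    best = score(selected)
    improved = True
    while improved:
        improved = False
        # Inclusion step: evaluate adding each unselected variable.
        candidates = [selected | {v} for v in sorted(all_vars - selected)]
        if candidates:
            top = max(candidates, key=score)
            if score(top) > best:
                selected, best, improved = top, score(top), True
        # Removal step: evaluate dropping each currently selected variable.
        candidates = [selected - {v} for v in sorted(selected)]
        if candidates:
            top = max(candidates, key=score)
            if score(top) > best:
                selected, best, improved = top, score(top), True
    return selected


# Hypothetical stand-in for the model comparison score: rewards two
# "informative" variables and penalizes every other variable.
INFORMATIVE = frozenset({"x1", "x2"})


def toy_score(subset: FrozenSet[str]) -> float:
    return 10.0 * len(subset & INFORMATIVE) - 3.0 * len(subset - INFORMATIVE)


result = greedy_select(["x1", "x2", "noise1", "noise2"], toy_score)
```

With this toy score, the search adds x1 and x2 and then stops, because adding either noise variable or removing either selected variable would lower the score. Each accepted step strictly increases the score, so the loop always terminates on a finite set of variables.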