A Bayesian Criterion for Cluster Stability
H. Koepke & B. Clarke. A Bayesian Criterion for Cluster Stability. Statistical Analysis and Data Mining. Accepted.
We present a technique for evaluating and comparing how clusterings reveal structure inherent in the data set. Our technique is based on a criterion evaluating how much point-to-cluster distances may be perturbed without affecting the membership of the points. Although similar to some existing perturbation methods, our approach distinguishes itself in five ways. First, the strength of the perturbations is indexed by a prior distribution controlling how close to boundary regions a point may be before it is considered unstable. Second, our approach is exact in that we integrate over all the perturbations; in practice, this can be done efficiently for well-chosen prior distributions. Third, we provide a rigorous theoretical treatment of the approach, showing that it is consistent for estimating the correct number of clusters. Fourth, it yields a detailed picture of the behavior and structure of the clustering. Finally, it is computationally tractable and easy to use, requiring only a point-to-cluster distance matrix as input. In a simulation study, we show that it outperforms several existing methods in terms of recovering the correct number of clusters. We also illustrate the technique on three real data sets.
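The core idea can be illustrated with a minimal Monte Carlo sketch. The paper integrates over the perturbation prior exactly; here the prior is sampled instead, and the choice of multiplicative log-normal noise, its scale, and the name `stability_scores` are illustrative assumptions, not the paper's method.

```python
import numpy as np

def stability_scores(D, n_draws=2000, scale=0.5, seed=0):
    """Fraction of random perturbations under which each point keeps
    its nearest cluster.  D is an (n_points, n_clusters) matrix of
    point-to-cluster distances; perturbations are multiplicative
    log-normal noise with spread `scale` (a Monte Carlo stand-in for
    exact integration over a perturbation prior)."""
    rng = np.random.default_rng(seed)
    n, k = D.shape
    base = D.argmin(axis=1)                       # current assignments
    keep = np.zeros(n)
    for _ in range(n_draws):
        noise = rng.lognormal(sigma=scale, size=(n, k))
        keep += (D * noise).argmin(axis=1) == base
    return keep / n_draws

# Two clearly assigned points and one near a cluster boundary:
D = np.array([[0.1, 5.0],
              [4.8, 0.2],
              [1.0, 1.1]])
print(stability_scores(D))   # the boundary point scores far below 1
```

Points deep inside a cluster keep their assignment under almost every draw; the near-boundary point flips roughly half the time, which is exactly the instability the criterion is designed to expose.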
Fast Prediction of New Feature Utility
We study the new feature utility prediction problem: statistically testing whether adding a feature to the data representation can improve the accuracy of a current predictor. In many applications, identifying new features is the main pathway for improving performance. However, evaluating every potential feature by re-training the predictor can be costly. The paper describes an efficient, learner-independent technique for estimating new feature utility without re-training, based only on the current predictor's outputs. The method is obtained by deriving a connection between loss reduction potential and the new feature's correlation with the loss gradient of the current predictor. This leads to a simple yet powerful hypothesis testing procedure, for which we prove consistency. Our theoretical analysis is accompanied by empirical evaluation on standard benchmarks and a large-scale industrial dataset.
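For squared loss the idea reduces to something very concrete: the negative loss gradient at the current predictions is just the residual, so a candidate feature's utility can be screened by its correlation with the residuals. The sketch below uses a simple permutation test as a stand-in; the paper's actual test statistic and consistency proof are not reproduced, and `feature_utility_score` is an illustrative name.

```python
import numpy as np

def feature_utility_score(y, pred, candidate, n_perm=1000, seed=0):
    """Score a candidate feature by its absolute correlation with the
    negative loss gradient of the current predictor.  For squared
    loss that gradient is the residual y - pred, so no re-training is
    needed.  A permutation test yields an approximate p-value."""
    rng = np.random.default_rng(seed)
    resid = y - pred
    stat = abs(np.corrcoef(candidate, resid)[0, 1])
    null = np.array([abs(np.corrcoef(rng.permutation(candidate), resid)[0, 1])
                     for _ in range(n_perm)])
    return stat, (1 + (null >= stat).sum()) / (1 + n_perm)

# A predictor that uses x1 only; x2 is a genuinely useful new feature.
rng = np.random.default_rng(1)
x1, x2 = rng.normal(size=300), rng.normal(size=300)
y = x1 + x2 + 0.1 * rng.normal(size=300)
pred = x1
stat_useful, p_useful = feature_utility_score(y, pred, x2)
stat_noise, p_noise = feature_utility_score(y, pred, rng.normal(size=300))
```

The useful feature gets a near-perfect correlation with the residuals and a tiny p-value, while the pure-noise candidate does not; no model is re-trained at any point.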
Efficient Identification of Equivalences in Dynamic Graphs and Pedigree Structures
We propose a new framework for designing test and query functions for complex structures that vary across a given parameter such as genetic marker position. The operations we are interested in include equality testing, set operations, isolating unique states, duplication counting, and finding equivalence classes under identifiability constraints. A motivating application is locating equivalence classes in identity-by-descent (IBD) graphs, graph structures in pedigree analysis that change over genetic marker location. The nodes of these graphs are unlabeled and identified only by their connecting edges, a constraint easily handled by our approach. The general framework introduced is powerful enough to build a range of testing functions for IBD graphs, dynamic populations, and other structures using a minimal set of operations. The theoretical and algorithmic properties of our approach are analyzed and proven. Computational results on several simulations demonstrate the effectiveness of our approach.
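To make the unlabeled-node constraint concrete: equality testing must succeed even when the same graph arrives with different node names. A standard heuristic for this is Weisfeiler-Lehman-style color refinement, sketched below; this is a generic stand-in to illustrate the problem, not the paper's actual framework or algorithm.

```python
from collections import Counter

def canonical_signature(edges, rounds=3):
    """Signature for equality testing of graphs whose nodes are
    identified only by their edges.  Iteratively refines node
    'colors' by neighborhood structure (Weisfeiler-Lehman style);
    isomorphic graphs always get equal signatures, and most
    non-isomorphic graphs in practice get different ones."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    color = {n: 0 for n in adj}               # start with uniform colors
    for _ in range(rounds):
        new = {n: (color[n], tuple(sorted(color[m] for m in adj[n])))
               for n in adj}
        relabel = {sig: i for i, sig in enumerate(sorted(set(new.values())))}
        color = {n: relabel[new[n]] for n in adj}
    return tuple(sorted(Counter(color.values()).items()))

g1 = [(1, 2), (2, 3), (3, 1), (3, 4)]                     # triangle + pendant
g2 = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]     # same graph, renamed
g3 = [(1, 2), (2, 3), (3, 4), (4, 1)]                     # 4-cycle: different
```

Here `g1` and `g2` produce identical signatures despite having disjoint node names, while the 4-cycle `g3` is distinguished, which is the kind of label-free equality test the framework needs across marker positions.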
Why Python Rocks for Research
This article advocates the use of Python as a general-purpose language for scientific programming.
On The Limits of Clustering in High Dimensions via Cost Functions
H. Koepke & B. Clarke. On The Limits of Clustering in High Dimensions via Cost Functions. Statistical Analysis and Data Mining (online).
This paper establishes a negative result for clustering: above a certain ratio of random noise to nonrandom information, it is impossible for a large class of cost functions to distinguish between two partitions of a data set. In particular, it is shown that as the dimension increases, the ability to distinguish an accurate partitioning from an inaccurate one is lost unless the informative components are both sufficiently numerous and sufficiently informative. We examine squared error cost functions in detail. More generally, it is seen that the VC-dimension is an essential hypothesis for the class of cost functions to satisfy for an impossibility proof to be feasible. Separately, we provide bounds on the probabilistic behavior of cost functions that show how rapidly the ability to distinguish two clusterings decays. In two examples, one simulated and one with genomic data, bounds on the ability of squared-error and other cost functions to distinguish between two partitions are computed. Thus, one should not rely on clustering results alone for high-dimensional, low-sample-size data; feature selection should be performed first.
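The phenomenon is easy to reproduce in a toy simulation with the squared-error cost: as noise dimensions are added around a single informative dimension, the cost advantage of the true partition over a random one collapses. The parameters below (one informative dimension, noise scale, dimension counts) are illustrative choices, not the paper's bounds.

```python
import numpy as np

def within_cluster_sse(X, labels):
    """Squared-error clustering cost: summed squared distance of each
    point to its cluster mean."""
    return sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
               for c in np.unique(labels))

rng = np.random.default_rng(0)
n = 100
true = np.repeat([0, 1], n // 2)
wrong = rng.permutation(true)             # an uninformative partition

rel_gap = {}
for d_noise in (0, 500):
    # One informative dimension (cluster means at -1 and +1) plus
    # d_noise dimensions of pure standard-normal noise.
    signal = np.where(true[:, None] == 0, -1.0, 1.0)
    signal = signal + 0.3 * rng.normal(size=(n, 1))
    X = np.hstack([signal, rng.normal(size=(n, d_noise))])
    rel_gap[d_noise] = ((within_cluster_sse(X, wrong)
                         - within_cluster_sse(X, true))
                        / within_cluster_sse(X, true))
print(rel_gap)   # relative advantage of the true partition shrinks
```

With no noise dimensions the true partition's cost is several times lower than the random one's; with 500 noise dimensions the relative gap is a fraction of a percent, so the cost function can no longer tell the partitions apart.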