Featured Publications
=====================

.. _bayesian-cluster-stability:

A Bayesian Criterion for Cluster Stability.
-------------------------------------------

H. Koepke & B. Clarke. *A Bayesian Criterion for Cluster Stability.* Statistical Analysis and Data Mining. Accepted.

**Abstract**:
We present a technique for evaluating and comparing how clusterings
reveal structure inherent in the data set. Our technique is based on
a criterion evaluating how much point-to-cluster distances may be
perturbed without affecting the membership of the points. Although
similar to some existing perturbation methods, our approach
distinguishes itself in five ways. First, the strength of the
perturbations is indexed by a prior distribution controlling how close
to boundary regions a point may be before it is considered
unstable. Second, our approach is exact in that we integrate over all
the perturbations; in practice, this can be done efficiently for
well-chosen prior distributions. Third, we provide a rigorous
theoretical treatment of the approach, showing that it is consistent
for estimating the correct number of clusters. Fourth, it yields a
detailed picture of the behavior and structure of the
clustering. Finally, it is computationally tractable and easy to use,
requiring only a point-to-cluster distance matrix as input. In a
simulation study, we show that it outperforms several existing methods
in terms of recovering the correct number of clusters. We also
illustrate the technique in three real data sets.
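
The criterion itself integrates over a prior distribution on the perturbations; as a rough hypothetical illustration of the inputs and outputs involved (a Monte Carlo sketch, not the paper's exact computation), a perturbation-based stability score over a point-to-cluster distance matrix might look like this:

.. code-block:: python

    import numpy as np

    def perturbation_stability(dist, scale=0.1, n_samples=200, seed=0):
        """Fraction of points whose nearest-cluster assignment survives
        random perturbation of their point-to-cluster distances.

        dist : (n_points, n_clusters) array of point-to-cluster distances.
        scale : perturbation strength (a crude stand-in for the paper's
                prior distribution; here just a Gaussian std deviation).
        """
        rng = np.random.default_rng(seed)
        base = dist.argmin(axis=1)              # unperturbed assignments
        stable = np.zeros(dist.shape[0])
        for _ in range(n_samples):
            noisy = dist + rng.normal(0.0, scale, size=dist.shape)
            stable += (noisy.argmin(axis=1) == base)
        return stable / n_samples               # per-point stability in [0, 1]

    # Two clear points and one boundary point between two clusters.
    d = np.array([[0.1, 2.0],   # clearly cluster 0
                  [1.0, 1.05],  # near the boundary -> unstable
                  [2.0, 0.1]])  # clearly cluster 1
    scores = perturbation_stability(d, scale=0.2)

Here ``scale`` crudely plays the role of the prior controlling how close to a boundary region a point may sit before it is considered unstable.
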

Fast Prediction of New Feature Utility.
---------------------------------------

H. Koepke & M. Bilenko. *Fast Prediction of New Feature Utility.* International Conference on Machine Learning, 2012.
[`pdf `_ | `bibtex <_static/papers/new-feature-utility.txt>`_ ]

**Abstract**:
We study the new feature utility prediction problem: statistically
testing whether adding a feature to the data representation can
improve the accuracy of a current predictor. In many applications,
identifying new features is the main pathway for improving
performance. However, evaluating every potential feature by
re-training the predictor can be costly. The paper describes an
efficient, learner-independent technique for estimating new feature
utility without re-training based on the current predictor's
outputs. The method is obtained by deriving a connection between loss
reduction potential and the new feature's correlation with the loss
gradient of the current predictor. This leads to a simple yet powerful
hypothesis testing procedure, for which we prove consistency. Our
theoretical analysis is accompanied by empirical evaluation on
standard benchmarks and a large-scale industrial dataset.
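
For squared loss, the gradient of the loss with respect to the current predictions is simply the residual, so the core idea can be sketched as a correlation test between a candidate feature and the residuals. This hypothetical sketch uses standard Pearson/Fisher machinery, not the paper's exact procedure:

.. code-block:: python

    import numpy as np
    from statistics import NormalDist

    def new_feature_utility_test(feature, residuals):
        """Test whether a candidate feature correlates with the loss
        gradient of the current predictor (for squared loss, the gradient
        w.r.t. the predictions is just the residual). Returns the Pearson
        correlation and a two-sided p-value via Fisher's z approximation.
        """
        n = len(feature)
        r = np.corrcoef(feature, residuals)[0, 1]
        z = np.arctanh(r) * np.sqrt(n - 3)      # Fisher transform
        p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
        return r, p

    rng = np.random.default_rng(1)
    y = rng.normal(size=500)
    pred = np.zeros(500)                        # current predictor: constant zero
    residuals = pred - y                        # squared-loss gradient in pred
    useful = y + 0.5 * rng.normal(size=500)     # tracks the gradient
    useless = rng.normal(size=500)              # independent noise
    r_useful, p_useful = new_feature_utility_test(useful, residuals)
    r_useless, p_useless = new_feature_utility_test(useless, residuals)

A small p-value for ``useful`` signals that retraining with that feature could reduce the loss, without ever retraining the predictor.
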

Efficient Identification of Equivalences in Dynamic Graphs and Pedigree Structures
----------------------------------------------------------------------------------

H. Koepke & E. Thompson. *Efficient Identification of Equivalences in Dynamic Graphs and Pedigree Structures.* arXiv preprint arXiv:1301.3946, 2013.
[`pdf `_ | `bibtex <_static/papers/hashreduce.txt>`_ ]

**Abstract**:
We propose a new framework for designing test and query functions for
complex structures that vary across a given parameter such as genetic
marker position. The operations we are interested in include equality
testing, set operations, isolating unique states, duplication
counting, or finding equivalence classes under identifiability
constraints. A motivating application is locating equivalence classes
in identity-by-descent (IBD) graphs, graph structures in pedigree
analysis that change over genetic marker location. The nodes of these
graphs are unlabeled and identified only by their connecting edges, a
constraint easily handled by our approach. The general framework
introduced is powerful enough to build a range of testing functions
for IBD graphs, dynamic populations, and other structures using a
minimal set of operations. The theoretical and algorithmic properties
of our approach are analyzed and proved. Computational results on
several simulations demonstrate the effectiveness of our approach.
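
As a hypothetical sketch of hash-based equivalence testing on graphs with unlabeled nodes (a Weisfeiler-Lehman-style refinement, not the paper's algorithm), one can fingerprint each graph order-independently and bucket equal fingerprints into candidate equivalence classes:

.. code-block:: python

    from hashlib import blake2b

    def graph_fingerprint(edges, rounds=3):
        """Order-independent fingerprint of an unlabeled undirected graph,
        built by iterated hashing of each node's sorted multiset of
        neighbor hashes. Equal fingerprints group graphs into candidate
        equivalence classes regardless of how the nodes are named.
        """
        nodes = {u for e in edges for u in e}
        adj = {u: [] for u in nodes}
        for u, v in edges:
            adj[u].append(v)
            adj[v].append(u)
        h = {u: b"\x00" for u in nodes}     # identical start: nodes unlabeled
        for _ in range(rounds):
            h = {u: blake2b(b"".join(sorted(h[v] for v in adj[u])),
                            digest_size=8).digest()
                 for u in nodes}
        # Combine the per-node hashes order-independently.
        return blake2b(b"".join(sorted(h.values())), digest_size=8).hexdigest()

    # Two relabelings of the same path graph fingerprint identically;
    # a triangle does not.
    g1 = [(0, 1), (1, 2)]
    g2 = [("a", "b"), ("b", "c")]
    g3 = [(0, 1), (1, 2), (2, 0)]

Rare hash collisions within a bucket can be resolved by a final exact comparison.
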

Why Python Rocks for Research
-----------------------------

H. Koepke. *Why Python Rocks for Research.* Hacker Monthly, Issue 8, January 2011.
[`pdf <_static/papers/why-python.pdf>`_ | `online version `_ ]

*Erratum: the Gamma function is wrong in the pdf version... Whoops!*

This article advocates the use of Python as a general-purpose language for scientific programming.

.. _paper-clustering-limits:

On The Limits of Clustering in High Dimensions via Cost Functions.
------------------------------------------------------------------

H. Koepke & B. Clarke. *On The Limits of Clustering in High Dimensions via Cost Functions.* Statistical Analysis and Data Mining (`online `_).

**Abstract**:
This paper establishes a negative result for clustering: above a
certain ratio of random noise to nonrandom information, it is
impossible for a large class of cost functions to distinguish between
two partitions of a data set. In particular, it is shown that as the
dimension increases, the ability to distinguish an accurate
partitioning from an inaccurate one is lost unless the informative
components are both sufficiently numerous and sufficiently
informative. We examine squared error cost functions in detail. More
generally, it is seen that the VC-dimension is an essential hypothesis
for the class of cost functions to satisfy for an impossibility proof
to be feasible. Separately, we provide bounds on the probabilistic
behavior of cost functions that show how rapidly the ability to
distinguish two clusterings decays. In two examples, one simulated and
one with genomic data, bounds on the ability of squared-error and
other cost functions to distinguish between two partitions are
computed. Thus, for high-dimensional, low-sample-size data, one
should not rely on clustering results alone but should also perform
feature selection.
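
The mechanism is easy to reproduce numerically. In this hypothetical sketch (not the paper's experiments), one informative coordinate is padded with an increasing number of pure-noise dimensions, and the relative squared-error cost gap between the accurate partition and an arbitrary one shrinks toward zero:

.. code-block:: python

    import numpy as np

    def within_cluster_sse(X, labels):
        """Total within-cluster sum of squared errors (squared-error cost)."""
        return sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
                   for k in np.unique(labels))

    rng = np.random.default_rng(0)
    n = 200
    true_part = np.repeat([0, 1], n // 2)       # the accurate partition
    random_part = rng.permutation(true_part)    # an arbitrary partition

    def relative_gap(n_noise_dims):
        """Relative cost gap between the arbitrary and accurate partitions
        when one informative coordinate is padded with pure noise."""
        signal = np.where(true_part == 0, -1.0, 1.0)[:, None]
        noise = rng.normal(size=(n, n_noise_dims))
        X = np.hstack([signal, noise])
        good = within_cluster_sse(X, true_part)
        bad = within_cluster_sse(X, random_part)
        return (bad - good) / bad

    gap_low_dim = relative_gap(5)       # noise comparable to the signal
    gap_high_dim = relative_gap(2000)   # noise swamps the signal

The informative coordinate contributes a fixed cost difference while every noise dimension inflates both costs equally, so the relative gap decays as dimension grows.
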

Older Research
==============

H. Koepke & `E. Thompson `_. *Efficient Testing Operations on Dynamic Graph Structures using Strong Hash Functions.* Tech. Report #567, University of Washington, Department of Statistics.
[`pdf `_ | `bibtex <_static/papers/hash-functions.txt>`_ ]

H. Koepke. *Bayesian Cluster Evaluation.* Master's Thesis, University of British Columbia, Department of Computer Science, 2008.
[`pdf <_static/papers/mastersthesis.pdf>`_ | `bibtex <_static/papers/mastersthesis.txt>`_ ]

H. Koepke & B. Clarke. *A Bayesian Approach to Cluster Validation.* International Symposium on Artificial Intelligence and Mathematics, 2008.
[`pdf <_static/papers/clstab-isaim.pdf>`_ | `bibtex <_static/papers/clstab-isaim.txt>`_ | `slides <_static/papers/clstab-isaim-pres.pdf>`_ ]