Dec 12, 10:00 am

## Algorithms for Estimating the Cluster Tree of a Density

### Rebecca Nugent

Final Exam

University of Washington - Department of Statistics

The goal of clustering is to identify distinct groups in a data set and assign a group label to each observation. To cast clustering as a statistical problem, we regard the data as an iid sample from some unknown probability density p. We adopt the premise that groups correspond to modes of the density. Our goal then is to find the modes and assign each observation to the "domain of attraction" of a mode. We do this by estimating the cluster tree of the density, a representation of the hierarchical structure of its level sets. Leaves of the cluster tree correspond to modes of the density. An obvious way of estimating the cluster tree of the underlying density p from a sample is to first compute a density estimate \hat{p} and then use the cluster tree of \hat{p} as an estimate for the cluster tree of p.
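The level-set idea above can be illustrated with a small sketch (hypothetical toy example, not the algorithm from the talk): for a density evaluated on a 1-D grid, the connected components of the level set {x : p(x) ≥ λ} merge as λ decreases, and the leaves of the cluster tree correspond to the modes.

```python
import numpy as np

def level_set_components(density, level):
    """Connected components (as index ranges) of {x : density(x) >= level}
    for a density evaluated on an evenly spaced 1-D grid."""
    above = density >= level
    components = []
    start = None
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            components.append((start, i - 1))
            start = None
    if start is not None:
        components.append((start, len(above) - 1))
    return components

# Toy bimodal density (equal mixture of two Gaussians); its cluster
# tree has two leaves, one per mode.
x = np.linspace(-4, 4, 401)
p = 0.5 * np.exp(-0.5 * (x + 2) ** 2) + 0.5 * np.exp(-0.5 * (x - 2) ** 2)
p /= p.sum() * (x[1] - x[0])  # normalize to integrate to ~1

print(len(level_set_components(p, 0.04)))  # low level: one component
print(len(level_set_components(p, 0.10)))  # higher level: two components
```

Scanning across all levels and recording where components split yields the tree itself; the sketch only shows a single-level slice.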

We first present an algorithm that computes the cluster tree exactly for piecewise constant density estimates. Exact computation of the cluster tree for general density estimates seems infeasible. Our second contribution is a graph-based approach that approximates the cluster tree of any density estimate. A complication of this approach is that the required edge weights are not available in closed form; each is the solution of an optimization problem, since the weight of the edge connecting two observations is defined as the minimum of the estimated density along the line segment joining them. Our third contribution is an algorithm that implements the graph-based approach exactly for smooth density estimates.
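A minimal sketch of the graph-based idea, under simplifying assumptions of my own (a Gaussian kernel density estimate and a grid-search approximation of the minimum along each edge, where the talk's third contribution computes it exactly): build a complete graph on the observations, weight each edge by the minimum of the density estimate along the connecting segment, and read off the clusters at level λ as the connected components of the subgraph whose edges have weight at least λ.

```python
import numpy as np
from itertools import combinations

def kde(data, x, h=0.5):
    """Gaussian kernel density estimate of 2-D `data` evaluated at point x."""
    sq = np.sum((data - x) ** 2, axis=1) / h ** 2
    return np.mean(np.exp(-0.5 * sq)) / (2 * np.pi * h ** 2)

def edge_weight(data, a, b, h=0.5, n_grid=20):
    """Approximate the minimum of the density estimate along the segment
    [a, b] by evaluating it on a discretized grid (the exact weight is the
    solution of a 1-D minimization problem)."""
    ts = np.linspace(0.0, 1.0, n_grid)
    return min(kde(data, a + t * (b - a), h) for t in ts)

def components_at_level(n, weights, level):
    """Number of connected components of the graph on n vertices, keeping
    only edges whose weight is at least `level` (simple union-find)."""
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for (i, j), w in weights.items():
        if w >= level:
            parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})

# Hypothetical toy data: two well-separated 2-D blobs.
data = np.array([[-3.0, 0.0], [-3.2, 0.1], [-2.8, -0.1],
                 [ 3.0, 0.0], [ 3.2, 0.1], [ 2.8, -0.1]])
weights = {(i, j): edge_weight(data, data[i], data[j])
           for i, j in combinations(range(len(data)), 2)}

print(components_at_level(len(data), weights, 0.05))   # cross-blob edges dropped
print(components_at_level(len(data), weights, 1e-9))   # all edges kept
```

Varying the level λ and tracking when components merge recovers an approximation to the full cluster tree; the grid search here is exactly the step that the exact algorithm for smooth density estimates replaces.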