Stanford University - Department of Mathematics
Clustering is one of the most basic data analysis techniques, yet, not much is known about the theoretical properties of most clustering schemes. Jon Kleinberg has proved that there exists no clustering scheme that simultaneously satisfies three very natural axioms that he identifies. We prove that in the more relaxed context of hierarchical clustering methods, a set of axioms very similar to those of Kleinberg's yield existence and uniqueness instead of non-existence.
Other properties of this unique scheme can be established, namely stability and convergence. We represent dendrograms as ultrametric spaces and use tools from metric geometry to quantify the degree to which perturbations in the input metric space affect the result of hierarchical methods.
The unique scheme that we characterize turns out to be single linkage hierarchical clustering which is commonly criticized for exhibiting the so called "chaining effect." As a refinement of a classical observation by Hartigan, we prove that in the case when the input to our unique scheme comes from sampling an underlying measure metric space in an i.i.d. fashion, our convergence results imply that one asymptotically recovers a certain multiscale decomposition of the support of the underlying probability measure.
I will describe our exploration of a certain variant of single linkage hierarchical clustering that can be regarded as a two parameter hierarchical clustering method. By definition, this type of methods encode both a metric parameter and a density parameter and are therefore sensitive to the distribution of mass in the support of the underlying probability measure. This construction is related to the methods of persistent topology, to the Cluster Tree construction, and to the Reeb graph.