Massachusetts Institute of Technology - Laboratory for Information and Decision Systems
Increasingly, we are confronted with very high-dimensional data sets in areas such as computational biology and medical imaging. As a result, methods for avoiding the curse of dimensionality have come to the forefront of machine learning research. One approach, which exploits the geometry of the data, has evolved into a subfield called manifold learning.
The underlying hypothesis of this field is that data tend to lie near a low-dimensional submanifold, due to constraints that limit the degrees of freedom of the process generating them. This has been observed empirically, for example, in speech and video data. Although many widely used algorithms assume this hypothesis, the basic question of how to test it is poorly understood.
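As a minimal illustration of this hypothesis (not the algorithm discussed in the talk), the sketch below generates points on a one-dimensional circle, embeds them in 50 ambient dimensions with a random linear map plus small noise, and checks that the data still have only two significant singular values. All names and parameters here are illustrative choices, not part of the original work.

```python
import numpy as np

rng = np.random.default_rng(0)

n, ambient_dim = 500, 50
t = rng.uniform(0, 2 * np.pi, n)
circle = np.stack([np.cos(t), np.sin(t)], axis=1)  # intrinsic dimension 1

# Embed the circle in 50 dimensions via a random linear map, plus noise.
embedding = rng.standard_normal((2, ambient_dim))
X = circle @ embedding + 0.01 * rng.standard_normal((n, ambient_dim))

# Singular values of the centered data: only the first two are large;
# the rest come from the noise, reflecting the low intrinsic dimension.
s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
print(s[:4])
```

Despite living in 50 coordinates, the data are essentially two-dimensional, which is the kind of structure the manifold hypothesis posits.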
I will describe my research on developing a provably correct, efficient algorithm to test this hypothesis, as well as an improvement in the sample complexity of k-means that stems from this work. The tools we use include random projections and ideas from the theory of empirical processes.
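The random-projection tool mentioned above relies on the Johnson-Lindenstrauss property: a scaled random linear map into much lower dimension approximately preserves pairwise distances. The sketch below illustrates that property only; it is a hedged example with made-up parameters, not the test statistic or algorithm from the talk.

```python
import numpy as np

rng = np.random.default_rng(1)

n, d, k = 200, 1000, 100  # n points in d dimensions, projected to k
X = rng.standard_normal((n, d))

# Scaled Gaussian projection: E[||xP||^2] = ||x||^2 with this scaling.
P = rng.standard_normal((d, k)) / np.sqrt(k)
Y = X @ P

# The distance between a pair of points is approximately preserved.
i, j = 0, 1
orig = np.linalg.norm(X[i] - X[j])
proj = np.linalg.norm(Y[i] - Y[j])
ratio = proj / orig
print(ratio)
```

The ratio concentrates near 1 with relative error on the order of 1/sqrt(k), which is why projecting to modest dimension can speed up distance-based computations without distorting the geometry much.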