The human microbiome is the collection of microorganisms that live in and on the human body; recent studies have established that the microbiome is central to human health. These studies raise a number of questions: Given a collection of samples from a single body location, which samples indicate a healthy phenotype and which an unhealthy one? Do these samples fall into natural "types" from which one can generalize? What are the "units" of microbial communities, and what are the significant synergistic and antagonistic interactions between microbes?
High-throughput sequencing technologies have opened the door to addressing these questions by sequencing genetic material extracted in bulk from a collection of microorganisms. Statistical methods can assign the resulting fragmentary sequences to locations on a "reference" phylogenetic tree using information about the genomes of previously identified species; each sample thus yields a cloud of points on the reference tree. One can approach the above questions by developing statistical methods for comparing such clouds. From a probabilistic perspective, an appropriate comparative tool is the classical Kantorovich-Rubinstein metric (a.k.a. "earth-mover's distance"), which we have shown to be a generalization of the "UniFrac" metric popularized in 2005 by microbial ecologists. One can also define related clustering and ordination techniques that operate directly on the underlying clouds of points, rather than solely on a matrix of distances derived from the clouds. I will describe this theoretical work, as well as our software implementation, in the context of our project researching the microbiome of the human vagina.
This is joint work with Steve Evans (UC Berkeley), Robin Kodner (UW), Ginger Armbrust (UW), Noah Hoffman (UW), David Fredricks (FHCRC), Sujatha Srinivasan (FHCRC), and Martin Morgan (FHCRC).
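As a rough illustration of the earth-mover's comparison mentioned above, the following Python sketch computes the distance between two clouds of points on a rooted tree. It is not the software implementation referred to in this abstract. The function name `kr_distance`, the `parent`/`length` dictionaries, and the restriction of point masses to tree nodes (rather than arbitrary locations along branches) are all simplifying assumptions made for this example. On a tree, the cost of moving one distribution onto the other can be accumulated branch by branch: each branch contributes its length times the absolute difference in mass lying below it.

```python
def kr_distance(parent, length, mass_p, mass_q):
    """Earth-mover's distance between two mass distributions on a rooted tree.

    Hypothetical inputs (assumptions for this sketch):
      parent:  dict mapping each non-root node to its parent node
      length:  dict mapping each non-root node to the length of the
               branch joining it to its parent
      mass_p, mass_q: dicts mapping nodes to mass; each should sum to 1
    """
    # Build child lists and locate the root (assumes exactly one root).
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)
    root = next(n for n in children if n not in parent)

    # Preorder traversal; reversing it visits every node before its parent.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))

    # Net mass (P minus Q) at each node, accumulated into subtree totals.
    subtree_diff = {n: mass_p.get(n, 0.0) - mass_q.get(n, 0.0) for n in order}
    total = 0.0
    for node in reversed(order):
        if node == root:
            continue
        # The branch above `node` must carry the net imbalance below it.
        total += length[node] * abs(subtree_diff[node])
        subtree_diff[parent[node]] += subtree_diff[node]
    return total


# Toy tree: root -> X, X -> A, X -> B, all branches of length 1.
parent = {"X": "root", "A": "X", "B": "X"}
length = {"X": 1.0, "A": 1.0, "B": 1.0}
sample_1 = {"A": 0.5, "B": 0.5}
sample_2 = {"A": 1.0}
d = kr_distance(parent, length, sample_1, sample_2)
```

In the toy example, half of the mass must travel from B to A along a path of length 2, so the sketch returns 1.0. A single bottom-up pass suffices because every branch of a tree separates the points into exactly two groups, so no general optimal-transport solver is needed.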