Advisors: Raphael Gottardo & Mathias Drton
Bulk gene expression experiments using large populations of cells can accurately determine the average population state, but studying cell-to-cell variation requires repeatedly sampling single cells. Advances in microfluidics now allow the expression profile of thousands of genes to be measured in individual cells.
A characteristic of single cell expression is zero-inflation of otherwise continuous measurements, in which expression is either well-separated from zero, or undetectable. I apply a two-part (Hurdle) model in which a continuous distribution is inflated with discrete zeros. The Hurdle model, with an empirical Bayes regularization of nuisance parameters, is apt for testing differential expression in individual genes, and in pre-defined gene sets.
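As a rough illustration of the two-part idea, the sketch below tests one gene for a difference between two groups of cells. It combines a likelihood ratio statistic for the detection rate (the discrete part) with one for the mean of the detected values (the continuous part, assumed Gaussian with a shared variance). The function names, the two-group design, and the chi-square reference distribution with 2 degrees of freedom are assumptions made for this example; the empirical Bayes regularization described above is omitted.

```python
import math


def bernoulli_loglik(k, n):
    """Maximized Bernoulli log-likelihood for k detections among n cells."""
    ll = 0.0
    if k > 0:
        ll += k * math.log(k / n)
    if k < n:
        ll += (n - k) * math.log(1 - k / n)
    return ll


def rss(values):
    """Residual sum of squares around the sample mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def hurdle_lrt(group_a, group_b):
    """Two-part likelihood ratio test for one gene (hypothetical helper).

    Zeros are treated as 'undetected'; positive values are modeled as
    Gaussian. Assumes each group has at least two distinct positive values,
    so that the within-group residual sum of squares is nonzero.
    """
    pos_a = [x for x in group_a if x > 0]
    pos_b = [x for x in group_b if x > 0]

    # Discrete part: does the detection rate differ between groups?
    discrete = 2 * (
        bernoulli_loglik(len(pos_a), len(group_a))
        + bernoulli_loglik(len(pos_b), len(group_b))
        - bernoulli_loglik(len(pos_a) + len(pos_b), len(group_a) + len(group_b))
    )

    # Continuous part: does the mean of detected values differ?
    # With a shared variance, the Gaussian LRT reduces to n * log(RSS0 / RSS1).
    n_pos = len(pos_a) + len(pos_b)
    continuous = n_pos * math.log(rss(pos_a + pos_b) / (rss(pos_a) + rss(pos_b)))

    stat = max(0.0, discrete + continuous)
    # Survival function of a chi-square with 2 degrees of freedom is exp(-x/2).
    p_value = math.exp(-stat / 2)
    return stat, p_value
```

The two statistics are simply added here, which treats the discrete and continuous components as contributing separate information about the gene.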
Single cell gene expression experiments seek to elucidate gene co-regulation at the cellular level, but current methods to learn co-expression networks are not suited to zero-inflated data. To study this problem, I propose a multivariate generalization of the Hurdle model, involving a class of singular conditional Gaussian distributions. Supposing that most genes do not appreciably interact, I employ regularized maximum likelihood estimation with a grouped $\ell_1$ penalty to learn the interaction neighborhood of each gene. A non-isometric penalty that accounts for differences in the Fisher Information available in the two components in the model improves finite sample performance. I interpret large-scale networks by summarizing the connections they produce between and within gene ontology categories through a formal test for gene ontology enrichment.
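To make the effect of a grouped $\ell_1$ penalty concrete, the sketch below implements block soft-thresholding, the standard proximal operator of the unweighted group penalty $\lambda \lVert \theta_g \rVert_2$. It shows how all parameters linking a pair of genes are either shrunk together or set to zero together, which is what removes an edge from the estimated neighborhood. The function name and the choice of a proximal step are assumptions for illustration; the optimization procedure is not specified above, and the non-isometric weighting is not included.

```python
import math


def group_soft_threshold(block, lam):
    """Proximal operator of lam * ||block||_2 (hypothetical helper).

    `block` holds all interaction parameters for one gene pair. If the
    block's Euclidean norm is at most `lam`, the whole block is zeroed;
    otherwise every entry is shrunk by the same factor.
    """
    norm = math.sqrt(sum(v * v for v in block))
    if norm <= lam:
        return [0.0] * len(block)
    scale = 1.0 - lam / norm
    return [scale * v for v in block]


# A strong block survives with reduced magnitude; a weak block is removed.
strong = group_soft_threshold([3.0, 4.0], lam=1.0)
weak = group_soft_threshold([0.3, 0.4], lam=1.0)
```

Because the threshold acts on the norm of the whole block, sparsity is induced at the level of gene pairs rather than individual parameters.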