University of Washington - Departments of Computer Science and Engineering & Genome Sciences
Humans differ in many "phenotypes" such as weight, hair color and, more importantly, disease susceptibility. These phenotypes are largely determined by each individual's specific genotype, stored in the 3.2 billion bases of his or her genome sequence. Deciphering the sequence by finding which sequence variations cause a certain phenotype would have a great impact. The recent advent of high-throughput genotyping methods has enabled retrieval of an individual's sequence information on a genome-wide scale. Classical approaches have focused on identifying which sequence variations are associated with a particular phenotype. However, the complexity of the cellular mechanisms through which sequence variations cause a particular phenotype makes it difficult to infer such causal relationships directly. In this talk, I will present machine learning approaches that address these challenges by explicitly modeling the cellular mechanisms induced by sequence variations. Our approach takes as input genome-wide expression measurements and aims to generate a finer-grained hypothesis such as "sequence variations S induce cellular processes M, which lead to changes in the phenotype P." Furthermore, we have developed the "meta-prior algorithm," which can learn the regulatory potential of each sequence variation based on its intrinsic characteristics. This improvement helps to identify a true causal sequence variation among a number of polymorphisms in the same chromosomal region. Our approaches have led to novel insights into sequence variations, and some of the hypotheses have been validated through biological experiments. Many of these machine learning techniques apply to a wide range of other problems; as an example, I will present the meta-prior algorithm in the context of movie rating prediction tasks using the Netflix data set.