Seminar Details

Seminar Details


Tuesday

Feb 4

10:00 am

Phylogenetic Stochastic Mapping without Matrix Exponentiation

Jan Irvahn

Faculty host Vladimir Minin
General Exam

Phylogenetic stochastic mapping is a method for reconstructing the history of trait changes on a phylogenetic tree relating species/organisms carrying the trait. State-of-the-art methods assume that the trait evolves according to a continuous-time Markov chain (CTMC) and work well for small state spaces such as the space of four DNA nucleotides. The computations slow down considerably for larger state spaces, hindering investigations involving evolution of microsatellites (repeats of 2-6 DNA nucleotides), amino acids, codons (triples of DNA nucleotides), gene family sizes, and analysis of models with latent variables. Current methods slow down on large state spaces because they rely on exponentiating CTMC infinitesimal rate matrices -- an operation whose computational complexity grows as the size of the CTMC state space cubed. In this work, we introduce a new approach, based on a CTMC technique called uniformization, that does not use matrix exponentiation for phylogenetic stochastic mapping. Our method is based on a new Markov chain Monte Carlo (MCMC) algorithm that targets the distribution of trait histories conditional on the trait data observed at the tips of the tree. Similarly to recent developments in the hidden Markov models literature, we demonstrate that computational complexity of our MCMC method is growing as the size of the CTMC state space squared. Moreover, in contrast to competing matrix exponentiation methods, if the rate matrix is sparse, we can leverage this sparsity and decrease the complexity of our algorithm further. Using simulated data, we illustrate advantages of our MCMC algorithm and investigate how large the state space needs to be in order for our method to outperform matrix exponentiation approaches. We show that even on moderately large state spaces of amino acids and codons our MCMC method can be faster than currently used matrix exponentiation methods. This finding is important, because these state spaces are ubiquitous in molecular evolution studies, but when dealing with large datasets, state-of-the-art statistical methods often grind to a halt.