Seminar Details


Aug 9

10:00 am

Statistical Approaches to Analyze Mass Spectrometry Data

Soyoung Ryu

Final Exam

University of Washington - Department of Statistics

Advisors: Vladimir Minin & David Goodlett

Proteomics attempts to understand the biological functions of an organism through the lens of its expressed proteins, the basic building blocks of all living cells. In shotgun proteomics, mass spectrometry generates mass spectra that are in turn used to identify and quantify the proteins in a given sample.

In the first part of this dissertation, we identify peptides (substrings of proteins) in biological samples using clustered tandem mass spectra. Tandem mass spectrometry experiments generate thousands to millions of spectra, which can be used to identify the proteins present in a biological sample. We propose a new method to identify peptides based on clustered tandem mass spectrometry data. In contrast to previously proposed approaches, which identify one representative spectrum for each cluster using traditional database searching algorithms, our method uses all available information: it scores all the spectra in a cluster against candidate peptides using Bayesian model selection. We illustrate the performance of our method by applying it to seven-standard-protein mixture data as well as to more complex mixture data from Francisella novicida and Saccharomyces cerevisiae.
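The idea of scoring every spectrum in a cluster against each candidate peptide can be illustrated with a minimal sketch. It is not the dissertation's model: it assumes that the spectra in a cluster are conditionally independent given the peptide, that a per-spectrum log-likelihood is already available for each candidate, and that the candidates have equal prior probability unless stated otherwise. The function name and the peptide sequences are hypothetical.

```python
import math

def posterior_over_candidates(log_liks, log_priors=None):
    """Posterior probability of each candidate peptide given a cluster.

    log_liks maps a candidate peptide to a list of log-likelihoods,
    one per spectrum in the cluster (assumed independent given the peptide).
    """
    names = list(log_liks)
    if log_priors is None:
        # Assumption: equal prior probability for every candidate.
        log_priors = {c: 0.0 for c in names}
    # Pool the evidence from all spectra in the cluster by summing.
    joint = {c: log_priors[c] + sum(log_liks[c]) for c in names}
    # Normalize with the log-sum-exp trick for numerical stability.
    m = max(joint.values())
    log_norm = m + math.log(sum(math.exp(v - m) for v in joint.values()))
    return {c: math.exp(v - log_norm) for c, v in joint.items()}

# Hypothetical cluster of three spectra and two candidate peptides.
cluster = {
    "PEPTIDEK": [-10.0, -12.0, -11.0],
    "PEPTLDEK": [-11.0, -13.0, -12.5],
}
posterior = posterior_over_candidates(cluster)
best = max(posterior, key=posterior.get)
```

Summing over all spectra, rather than scoring one representative spectrum, lets small per-spectrum differences accumulate into a clearer preference for one candidate.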

The second part of this dissertation proposes a hierarchical neural network Poisson regression for spectral count data. Because multiple peptides often come from the same protein, a hierarchical model is well suited to spectral count data. We fit the model using a local scoring algorithm coupled with a backfitting algorithm, and we stabilize the neural network within the backfitting algorithm. Furthermore, we employ a Bayesian random searching algorithm (BARS) to choose the covariates as well as the complexity of the neural network. We illustrate the performance of the proposed model and of the model selection approach using simulation studies. We then carry out a preliminary study measuring the relative abundances of proteins in a sample and the tryptic-, ion-, and fragment-efficiency of peptides. We train our model on one MS/MS data set from Saccharomyces cerevisiae and test its performance using the standard protein mixture as well as protein abundances measured by non-mass-spectrometry-based experiments.
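As background for the fitting procedure, the sketch below fits an ordinary Poisson regression with a log link by iteratively reweighted least squares (IRLS). This is only the basic building block: local scoring replaces the weighted least squares step with backfitting over smooth terms, and the hierarchical neural network model described above is not reproduced here. The function name, the single covariate, and the example counts are assumptions made for illustration.

```python
import math

def fit_poisson_irls(x, y, n_iter=25):
    """Fit log(mu_i) = b0 + b1 * x_i to counts y by IRLS."""
    # Start from an intercept-only model.
    b0, b1 = math.log(max(sum(y) / len(y), 1e-8)), 0.0
    for _ in range(n_iter):
        eta = [b0 + b1 * xi for xi in x]
        mu = [math.exp(e) for e in eta]
        # Working response and weights for the log link.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        w = mu
        # Weighted least squares of z on (1, x): solve the 2x2 normal equations.
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swz = sum(wi * zi for wi, zi in zip(w, z))
        swxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = sw * swxx - swx * swx
        b0 = (swxx * swz - swx * swxz) / det
        b1 = (sw * swxz - swx * swz) / det
    return b0, b1

# Hypothetical spectral counts for five peptides against one covariate.
covariate = [0.0, 1.0, 2.0, 3.0, 4.0]
counts = [2, 2, 3, 5, 6]
intercept, slope = fit_poisson_irls(covariate, counts)
```

Each iteration is a weighted linear regression on a working response, which is why the same loop extends to additive models by swapping in a smoother for the regression step.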