University of Washington - Department of Statistics
Protein identification using mass spectrometry is a high-throughput way to identify proteins in biological samples. The identification procedure consists of two steps: the first identifies peptides from mass spectra, and the second determines whether the proteins assembled from the putative peptide identifications are present in the samples. In this talk, I will present a statistical method for each of these two steps.
The main goal of the first step is to select, from the candidate sequences in a protein database, the peptide sequence that is most likely to have generated the observed spectrum, according to the similarity between the observed spectrum and the theoretical spectra predicted from the candidate sequences. For this part, we developed a likelihood-based scoring algorithm built on a generative model, which measures the likelihood that the observed spectrum arises from the theoretical spectrum of each candidate sequence. Our probabilistic model accounts for multiple sources of noise in the data, e.g., variable peak intensities and errors in peak locations. Our likelihood-based approach also provides two measures for assessing the uncertainty of each identification.
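To make the idea of likelihood-based scoring concrete, here is a minimal sketch in Python. It is not the model presented in the talk: it assumes a toy generative model in which each theoretical peak is either observed with a Gaussian location error or replaced by a uniformly placed noise peak, and it ignores peak intensities entirely. The function names, the parameters (`sigma`, `match_prob`, `mz_range`), and the peak lists are all hypothetical, chosen only for illustration.

```python
import math


def log_likelihood_score(observed_mz, theoretical_mz,
                         sigma=0.5, match_prob=0.7, mz_range=2000.0):
    """Toy log-likelihood of an observed spectrum given a theoretical one.

    Assumed model (illustrative only): each theoretical peak is matched to
    its nearest observed peak. With probability match_prob that peak is a
    true signal with Gaussian location error (std. dev. sigma); otherwise it
    is a noise peak placed uniformly over an m/z range of width mz_range.
    """
    score = 0.0
    for t in theoretical_mz:
        nearest = min(observed_mz, key=lambda o: abs(o - t))
        err = nearest - t
        gauss = math.exp(-0.5 * (err / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
        # Two-part mixture: signal near t, or uniform background noise.
        score += math.log(match_prob * gauss + (1.0 - match_prob) / mz_range)
    return score


def best_candidate(observed_mz, candidates):
    """Score every candidate and return the top-scoring name plus all scores."""
    scores = {name: log_likelihood_score(observed_mz, peaks)
              for name, peaks in candidates.items()}
    return max(scores, key=scores.get), scores


# Hypothetical peak locations, not computed from real peptide sequences.
observed = [100.1, 250.0, 399.8]
candidates = {
    "candidate_A": [100.0, 250.0, 400.0],
    "candidate_B": [120.0, 300.0, 450.0],
}
best, all_scores = best_candidate(observed, candidates)
print(best, all_scores)
```

The uniform noise term keeps the score finite when a theoretical peak has no nearby observed peak, so a single missing peak penalizes a candidate without ruling it out.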
The main goal of the second step is to assess the evidence for the presence of proteins constructed from putative peptide identifications, many of which are incorrect. For this part, we developed an unsupervised protein identification approach based on a nested mixture model, which incorporates the evidence feedback between proteins and their constituent peptides in a coherent framework. Our method is essentially a model-based clustering method, which simultaneously identifies which proteins are present and which peptides are correctly identified. It provides properly calibrated probabilities for each peptide being correctly identified. In simulation studies, it shows improved accuracy over a widely used program for protein inference; in both simulation studies and the real data we studied, it shows substantially increased accuracy over a widely used program for peptide inference.
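The model-based clustering idea can be illustrated with a much simpler stand-in than the nested mixture model described above. The sketch below fits a flat two-component Gaussian mixture to peptide scores by EM and returns, for each peptide, the posterior probability of belonging to the higher-scoring ("correct") component. It does not model the coupling between proteins and their constituent peptides, which is the central feature of the nested model. The function names, initialization, and scores are hypothetical.

```python
import math


def normal_pdf(x, mu, sd):
    """Density of a normal distribution with mean mu and std. dev. sd."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def fit_two_component_mixture(scores, n_iter=100):
    """EM for a two-component Gaussian mixture over peptide scores.

    Component 0 stands for incorrect identifications, component 1 for
    correct ones. Returns (posteriors, pi_correct), where posteriors[i] is
    the estimated probability that peptide i is correctly identified.
    """
    lo, hi = min(scores), max(scores)
    mu = [lo, hi]                            # crude initialization at the extremes
    sd = [max((hi - lo) / 4.0, 1e-3)] * 2
    pi_correct = 0.5
    n = len(scores)
    post = [0.5] * n
    for _ in range(n_iter):
        # E-step: posterior probability of the "correct" component.
        post = []
        for x in scores:
            a = (1.0 - pi_correct) * normal_pdf(x, mu[0], sd[0])
            b = pi_correct * normal_pdf(x, mu[1], sd[1])
            post.append(b / (a + b) if a + b > 0.0 else 0.5)
        # M-step: update the mixing weight, means, and standard deviations.
        w1 = max(sum(post), 1e-12)
        w0 = max(n - sum(post), 1e-12)
        pi_correct = sum(post) / n
        mu[1] = sum(p * x for p, x in zip(post, scores)) / w1
        mu[0] = sum((1.0 - p) * x for p, x in zip(post, scores)) / w0
        sd[1] = max(math.sqrt(sum(p * (x - mu[1]) ** 2
                                  for p, x in zip(post, scores)) / w1), 1e-3)
        sd[0] = max(math.sqrt(sum((1.0 - p) * (x - mu[0]) ** 2
                                  for p, x in zip(post, scores)) / w0), 1e-3)
    return post, pi_correct


# Hypothetical peptide scores: five low-scoring and four high-scoring.
peptide_scores = [0.1, 0.2, 0.0, -0.1, 0.15, 3.0, 3.2, 2.9, 3.1]
posteriors, pi_hat = fit_two_component_mixture(peptide_scores)
print(posteriors, pi_hat)
```

In a nested version, the mixing weight for a peptide would itself depend on whether its parent protein is inferred to be present, which is how evidence would flow between the two levels.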