University of Washington - Department of Statistics
The detection and genotyping of sequence variations, particularly Single Nucleotide Polymorphisms (SNPs), is at the core of all genetic analysis. The principal approach for detecting variants in a specific gene is to sequence that gene in a sample of (diploid) individuals. (The term “diploid” refers to the fact that each individual has two copies of their genome, one inherited from each parent.) Identification of SNPs from this kind of sequence data has been greatly aided by the use of computational and statistical methods. However, existing algorithms are not sufficiently accurate to be used without potentially costly confirmation, usually by a human manually checking each call.
This talk will describe the problem, and our work on a new and more accurate statistical method to detect and genotype SNPs. The new algorithm improves on existing approaches in two key ways. First, it takes more detailed account of systematic variation in peak heights due to read-specific and sequence-context effects. If unaccounted for, these systematic effects obscure the signal we are aiming to detect. Second, it computes a formal statistical measure of the evidence for potential genotypes at each position in each sequence. This enables the application of standard statistical methods to efficiently combine evidence across multiple reads for an individual, which results in exceptional accuracy for data with “double-coverage”, where individuals are sequenced on both the forward and reverse strands. It also provides a quantitative assessment of the confidence in each SNP identified, and in each genotype called. This is particularly useful in identifying a subset of highly accurate SNP and genotype calls, which may be accepted without manual confirmation.
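To illustrate what combining evidence across reads can look like, here is a minimal sketch. It is not the algorithm presented in the talk. It assumes that each read yields a likelihood for every candidate genotype at a position, that reads are independent given the true genotype, and that a prior over genotypes is available. The function name `combine_reads`, the genotype labels, and all numbers are hypothetical and chosen only for illustration.

```python
import math


def combine_reads(read_likelihoods, prior):
    """Combine per-read genotype likelihoods into posterior probabilities.

    read_likelihoods: list of dicts mapping genotype -> P(read data | genotype).
    prior: dict mapping genotype -> prior probability.
    Assumes reads are conditionally independent given the genotype and that
    all likelihoods and priors are strictly positive (hypothetical setup).
    """
    # Work in log space so that products over many reads do not underflow.
    log_post = {
        g: math.log(prior[g]) + sum(math.log(r[g]) for r in read_likelihoods)
        for g in prior
    }
    # Normalise with the log-sum-exp trick.
    m = max(log_post.values())
    log_norm = m + math.log(sum(math.exp(v - m) for v in log_post.values()))
    return {g: math.exp(v - log_norm) for g, v in log_post.items()}


# Made-up likelihoods for one position, from a forward and a reverse read.
forward = {"AA": 0.2, "AG": 0.7, "GG": 0.1}
reverse = {"AA": 0.1, "AG": 0.8, "GG": 0.1}
prior = {"AA": 1 / 3, "AG": 1 / 3, "GG": 1 / 3}

posterior = combine_reads([forward, reverse], prior)
best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 3))  # prints: AG 0.949
```

In this toy case each read alone favours the heterozygous genotype only moderately, while the two reads together give it a posterior of about 0.95. A posterior of this kind could then be thresholded to pick out calls that might be accepted without manual checking.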
Background reading for this seminar can be found at: