Data and algorithms are ubiquitous in scientific, industrial, and personal domains. Data now come in many forms (text, images, video, web, sensor streams, etc.), are massive, and call for processing far more complex than mere indexing or the computation of simple statistics, such as recognizing objects in images or translating text.
Prediction problems typically assume the training data are independent samples, but in many modern applications samples come from individuals connected by a network. For example, in adolescent health studies of risk-taking behaviors, information on the subjects’ social networks is often available and plays an important role through network cohesion, the empirically observed phenomenon of friends behaving similarly. Taking cohesion into account should allow us to improve prediction.
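To make concrete how cohesion can enter a prediction rule, here is a minimal sketch in the spirit of Laplacian-penalized regression: each individual gets an effect alpha_i, and connected individuals are penalized for having dissimilar effects. The function name network_cohesion_fit, the penalty weight lam, and the small ridge term are illustrative assumptions, not the method presented in the talk.

```python
import numpy as np

def network_cohesion_fit(X, y, A, lam=1.0, ridge=1e-6):
    """Fit y ~ X @ beta + alpha with cohesion penalty lam * alpha' L alpha,
    where L is the Laplacian of the friendship adjacency matrix A.
    Connected individuals are encouraged to have similar alpha_i.
    (Illustrative sketch, not the speakers' method.)"""
    n, p = X.shape
    L = np.diag(A.sum(axis=1)) - A                 # graph Laplacian
    Z = np.hstack([np.eye(n), X])                  # coefficients: (alpha, beta)
    P = np.zeros((n + p, n + p))
    P[:n, :n] = lam * L                            # penalize rough alpha
    P[n:, n:] = ridge * np.eye(p)                  # tiny ridge keeps system PD
    theta = np.linalg.solve(Z.T @ Z + P, Z.T @ y)
    return theta[:n], theta[n:]                    # per-node effects, covariates
```

The penalty alpha' L alpha equals the sum of (alpha_i - alpha_j)^2 over network edges, so friends are pulled toward similar individual effects, which is precisely the cohesion assumption.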
Random matrices now play a role in many areas of theoretical, applied, and computational mathematics. Therefore, it is desirable to have tools for studying random matrices that are flexible, easy to use, and powerful. Over the last fifteen years, researchers have developed a remarkable family of results, called matrix concentration inequalities, that balance these criteria.
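One representative member of this family, stated here only for orientation, is the matrix Bernstein inequality (in the form popularized by Tropp). Let $S_1, \dots, S_n$ be independent, zero-mean random matrices of dimension $d_1 \times d_2$ with $\|S_k\| \le L$ almost surely, and set
\[
v = \max\left\{ \left\| \sum_k \mathbb{E}\, S_k S_k^{*} \right\|,\ \left\| \sum_k \mathbb{E}\, S_k^{*} S_k \right\| \right\}.
\]
Then, for all $t \ge 0$,
\[
\mathbb{P}\left\{ \left\| \sum_k S_k \right\| \ge t \right\} \le (d_1 + d_2)\, \exp\!\left( \frac{-t^2/2}{v + Lt/3} \right).
\]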
Hosts: Daniela Witten, Tyler McCormick
Faculty Host: Carlos Guestrin
Stat Liaison: Emily Fox
Abstract: Society is witnessing remarkable technological and scientific advances as numerous disciplines adopt more sophisticated statistical and computational methodologies. With this progress comes a growing need for scalable algorithms with solid theoretical foundations; the hope is that algorithms that are efficient in both the statistical and the computational sense can further facilitate breakthroughs.
In classical quantitative genetics, the correlation between the phenotypes of individuals with unknown genotypes and a known pedigree relationship is expressed in terms of probabilities of identity-by-descent (IBD) states. In existing models of the inverse problem, where genotypes are observed but pedigree relationships are not, probabilities and correlations have either a Bayesian or a hybrid interpretation. We introduce a generative evolutionary model of the inverse problem, IBF (Identity by Function), based on the classic infinite-alleles mutation process.
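IBF itself is the speaker's construction and is not reproduced here, but the infinite-alleles process it builds on is easy to simulate: every mutation creates a brand-new allele, so two individuals carry the same allele only through shared descent. A toy Wright-Fisher sketch (n, mu, and generations are arbitrary illustrative parameters):

```python
import random
from itertools import count

def infinite_alleles(n=100, mu=0.01, generations=500, seed=0):
    """Wright-Fisher population of n haploid individuals under the classic
    infinite-alleles model: each mutation yields a never-before-seen allele."""
    rng = random.Random(seed)
    fresh = count()                                  # labels for novel alleles
    pop = [next(fresh) for _ in range(n)]
    for _ in range(generations):
        parents = [rng.choice(pop) for _ in range(n)]            # resampling
        pop = [next(fresh) if rng.random() < mu else a for a in parents]
    return pop

pop = infinite_alleles()
pairs = [(a, b) for i, a in enumerate(pop) for b in pop[i + 1:]]
# Allele sharing here reflects common descent, the quantity IBD-style
# arguments reason about.
print("fraction of identical pairs:", sum(a == b for a, b in pairs) / len(pairs))
```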
In the genomic sciences, the amount of data has grown faster than the statistical methodology needed to analyze it. Furthermore, the complex underlying structure of these data means that simple, unstructured statistical models do not perform well. We consider the problem of identifying multiple, functionally independent, co-localized genetic regulators of gene transcription. Sparse regression techniques have been critical to multi-SNP association mapping because of their computational tractability in large-data settings.
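As a baseline illustration of sparse regression in this setting, here is a generic lasso fit of gene expression on genotype dosages using scikit-learn; the simulated sizes, the signal strength, and alpha=0.1 are made up, and plain lasso's difficulty with correlated, co-localized SNPs is exactly what motivates the talk's more structured approach.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, k = 200, 1000, 3                        # samples, SNPs, true regulators
G = rng.binomial(2, 0.3, size=(n, p)).astype(float)   # genotype dosages 0/1/2
beta = np.zeros(p)
beta[rng.choice(p, k, replace=False)] = 1.0   # sparse true effects
expr = G @ beta + rng.normal(scale=0.5, size=n)       # gene expression

fit = Lasso(alpha=0.1).fit(G, expr)           # L1 penalty induces sparsity
print("selected SNPs:", np.flatnonzero(fit.coef_))
```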
Clustering involves placing entities into mutually exclusive categories. We wish to relax the requirement of mutual exclusivity, allowing objects to belong simultaneously to multiple classes, a formulation we refer to as "feature allocation." The first step is a theoretical one: in the case of clustering, the class of probability distributions over exchangeable partitions of a dataset has been characterized (via exchangeable partition probability functions and the Kingman paintbox).
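A standard concrete example of a feature allocation, playing roughly the role the Chinese restaurant process plays for clustering, is the Indian buffet process. The sketch below samples from the textbook IBP generative story (not necessarily the characterization developed in the talk); note each row can have several ones, unlike a partition.

```python
import numpy as np

def indian_buffet(n_customers=10, alpha=2.0, seed=0):
    """Sample a binary feature-allocation matrix from the Indian buffet
    process: rows are objects, columns are features; an object may hold
    several features simultaneously."""
    rng = np.random.default_rng(seed)
    counts = []                           # counts[k] = # objects with feature k
    rows = []
    for i in range(1, n_customers + 1):
        row = [rng.random() < m / i for m in counts]   # take popular features
        for k, taken in enumerate(row):
            counts[k] += taken
        n_new = rng.poisson(alpha / i)                  # sample new features
        counts.extend([1] * n_new)
        rows.append(list(map(int, row)) + [1] * n_new)
    Z = np.zeros((n_customers, len(counts)), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

print(indian_buffet())
```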
Probabilistic topic models provide a suite of tools for analyzing large document collections. Topic modeling algorithms discover the latent themes that underlie the documents and identify how each document exhibits those themes. Topic modeling can be used to help explore, summarize, and form predictions about documents.
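A minimal sketch of topic modeling in practice, using scikit-learn's LatentDirichletAllocation on a made-up four-document corpus (the documents and n_components=2 are purely illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "genome gene expression regulation",
    "gene transcription snp association",
    "court law ruling appeal",
    "appeal court judge law",
]
vec = CountVectorizer().fit(docs)
X = vec.transform(docs)                          # document-term count matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):      # top words per latent theme
    top = topic.argsort()[::-1][:4]
    print(f"topic {k}:", [terms[i] for i in top])
print(lda.transform(X))                          # per-document topic proportions
```

The two outputs correspond to the two uses named above: the topics themselves (latent themes) and the per-document topic proportions (how each document exhibits them).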
Adapting group sequential methods to observational drug and vaccine safety surveillance studies using large electronic healthcare data
Gaps in medical product safety evidence have spurred the development of new national post-licensure systems that prospectively monitor large observational cohorts of health plan enrollees. These multi-site systems, which include CDC's Vaccine Safety Datalink (VSD) and FDA's Mini-Sentinel (MS) Pilot Program for the Sentinel Initiative, attempt to leverage the vast amount of administrative and clinical information that is captured during the course of routine medical care and contained within computerized health plan databases.
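To make "group sequential" concrete, here is a toy sketch of repeated looks at accumulating observed-versus-expected adverse-event counts, each compared against a flat Pocock-style boundary. The counts, the Poisson normal approximation, and the critical value 2.413 (the commonly tabulated two-sided Pocock constant for 5 looks at alpha = 0.05) are illustrative assumptions, not the VSD or Mini-Sentinel procedure.

```python
import math

def pocock_monitor(observed, expected, crit=2.413):
    """Toy group sequential surveillance: at each planned look, compare the
    standardized excess of cumulative observed over expected events to a
    flat Pocock-style boundary. Purely illustrative."""
    for look, (o, e) in enumerate(zip(observed, expected), start=1):
        z = (o - e) / math.sqrt(e)        # Poisson normal approximation
        print(f"look {look}: z = {z:.2f}")
        if z > crit:
            return f"signal at look {look}"
    return "no signal"

# Hypothetical cumulative counts across five scheduled looks
print(pocock_monitor(observed=[12, 25, 41, 60, 85],
                     expected=[10, 21, 33, 47, 62]))
```

The point of the boundary is to control the overall false-positive rate despite testing repeatedly as data accumulate, which is the core challenge in adapting these methods to ongoing observational surveillance.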