Seminar Details

Seminar Details


Mar 10

3:30 pm

Making Big Data Small with Sampling, Sketching, and Statistics

Daniel Ting

Faculty host Thomas Richardson


Statistics is a field where the goal is extract the most information out of the least amount of data. Traditionally, the data is fixed and small, and the goals are centered around efficient estimation and inference from the full dataset. Computation and storage issues are often an afterthought. With the rise of big data ranging from terabytes to petabytes, a new set of storage and computational issues arise as simply reading the data can take hours to days of cpu time. This leads to our goal: how can statistics help reduce data sizes and computational complexity while answering a wide range of questions of interest within some error tolerance? The gains can be stunning as a problem of counting distinct items in a data set can be reduced from one which takes 8 GB of memory down to one taking just 4 KB if an error of 1% is allowed.

The problem is then how to construct an efficient sketch of the data where the notion of efficiency combines size, computation, and estimation error. We present two topics in this area and show how statistics contributes to advancing the state of the art. First, we examine the problem of approximate distinct counting and present our work on sketching and optimal estimation. Statistical notions of efficiency show that the CS notion of rate-optimality yields an algorithm which is inefficient in practice while providing practical improvements using provably efficient estimators. Second, we examine the problem of computationally efficient sampling algorithms and estimation. We present a new method, generalized priority sampling, which is also relevant to traditional sampling design. Using what we call the distribution substitution trick, it circumvents the traditional problems in computing inclusion probabilities in without replacement designs while providing an extremely flexible framework for generating sampling designs. We give applications to demonstrate its usefulness including stratified sampling with multiple objectives and online sampling with arbitrary time-decay.