Data science is at a crossroads. Each year, thousands of new data scientists are entering science and technology, after a broad training in a variety of fields. Modern data science is often exploratory in nature, with datasets being collected and dissected in an interactive manner. Classical guarantees that accompany many statistical methods are often invalidated by their non-standard interactive use, resulting in an underestimated risk of falsely discovering correlations or patterns. It is a pressing challenge to upgrade existing tools, or create new ones, that are robust to involving a human-in-the-loop.
In this talk, I will describe two new advances that enable some amount of interactivity while testing multiple hypotheses, and control the resulting selection bias. I will first introduce a new framework, STAR, that uses partial masking to divide the available information into two parts, one for selecting a set of potential discoveries, and the other for inference on the selected set. I will then show that it is possible to flip the traditional roles of the algorithm and the scientist, allowing the scientist to make post-hoc decisions after seeing the realization of an algorithm on the data. The theoretical basis for both advances is founded in the theory of martingales : in the first, the user defines the martingale and associated filtration interactively, and in the second, we move from optional stopping to optional spotting by proving uniform concentration bounds on relevant martingales.
This talk will feature joint work with Will Fithian, Lihua Lei and Eugene Katsevich, but I will also briefly mention work with Rina Barber, Jianbo Chen, Kevin Jamieson, Michael Jordan, Max Rabinovich, Martin Wainwright, Fanny Yang and Tijana Zrnic.