Feb 12

3:30 pm

## Statistical Machine Learning and Big-p, Big-n, Complex Data

### Pradeep Ravikumar

Seminar

Drawing upon fields as disparate as economics, psychology, operations research, and statistics, the subfield of statistical machine learning has provided practically successful tools ranging from search engines to medical diagnosis, image processing, speech recognition, and a wide array of problems in science and engineering. However, over the past decade, faced with modern data settings, off-the-shelf statistical machine learning methods have frequently proved insufficient. These modern settings pose three key challenges, which largely come under the rubric of "Big Data": (a) the data might have a large number of features, in what we will call "Big-p" data, to denote the fact that the dimension "p" of the data is large; (b) the data might have a large number of data instances, in what we will call "Big-n" data, to denote the fact that the number of samples "n" is large; or (c) the data types could be complex, such as permutations, strings, or graphs, which typically lie in some large discrete space. A key approach to addressing such "Big Data" settings has involved leveraging systems-related approaches such as parallel and distributed algorithms, as well as architectures and algorithms for efficient, possibly distributed, data access and storage. In this talk, we will discuss the complementary approach of statistical modeling, tuned specifically to each of these three aspects of modern statistical machine learning: Big-p data, Big-n data, and complex data types.

Statistical machine learning for Big-p data, with more variables than samples, has been the focus of considerable research over the last decade. It is now well understood that estimation with strong statistical guarantees is still possible in such high-dimensional settings, provided we impose suitable constraints on the model space. Accordingly, we will discuss a unified framework for learning general structurally constrained high-dimensional models (such as models that are sparse, low-rank, and so on).

For Big-n data, a sub-field of growing importance is that of non-parametric models, where the model components potentially lie in infinite-dimensional spaces. A key obstacle to the widespread use of these models has been that they require more observations than parametric methods, but this is much less of a problem in Big-n settings. Accordingly, we will discuss a unified framework of structurally constrained semi-parametric models (such as sparse additive models and so on).

For complex-typed data, even standard machine learning questions, such as devising suitable loss functions and devising suitable statistical models that respect interesting structure, remain open. We will address some of these questions for the specific complex data type of permutations.