University of Pennsylvania - Wharton School
Among statisticians, variable selection is a common and very dangerous activity. This talk will survey the dangers and then propose two forms of insurance against the damage this activity can cause.
Conventional statistical inference requires that a specific model of how the data were generated be specified before the data are examined and analyzed. Yet it is common in applications for a variety of variable selection procedures to be applied to the data to determine a preferred model. These are then followed by statistical tests and confidence intervals computed for this “final” model. Such practices are typically misguided. The parameters being estimated depend on this final model, and post-model-selection sampling distributions may have unexpected properties that are very different from what is conventionally assumed. Confidence intervals and statistical tests do not perform as they should.
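The loss of coverage described above can be illustrated with a small simulation. The sketch below is only an illustration under assumptions that are not part of the talk: the response is pure noise, the p candidate predictors are independent standard normals, "selection" means keeping the single predictor with the largest absolute t-statistic, and the interval is the usual normal-approximation interval from a no-intercept simple regression. The function names `slope_and_se` and `naive_coverage` are hypothetical.

```python
import math
import random
from statistics import NormalDist


def slope_and_se(x, y):
    """No-intercept simple regression of y on x: return (slope, standard error)."""
    sxx = sum(v * v for v in x)
    sxy = sum(a * b for a, b in zip(x, y))
    slope = sxy / sxx
    rss = sum((b - slope * a) ** 2 for a, b in zip(x, y))
    sigma2 = rss / (len(y) - 1)  # one fitted parameter
    return slope, math.sqrt(sigma2 / sxx)


def naive_coverage(n=100, p=10, reps=500, level=0.95, seed=0):
    """Fraction of naive intervals that cover the true slope (which is 0 here)."""
    rng = random.Random(seed)
    # Normal approximation to the t quantile (an assumption made for simplicity).
    z = NormalDist().inv_cdf(0.5 + level / 2)
    hits = 0
    for _ in range(reps):
        # Y is independent of every X_j, so every submodel slope is truly 0.
        y = [rng.gauss(0, 1) for _ in range(n)]
        best = None
        for _ in range(p):
            x = [rng.gauss(0, 1) for _ in range(n)]
            slope, se = slope_and_se(x, y)
            # Data-driven selection: keep the predictor with the largest |t|.
            if best is None or abs(slope / se) > abs(best[0] / best[1]):
                best = (slope, se)
        slope, se = best
        # Naive interval that ignores the selection step.
        if slope - z * se <= 0.0 <= slope + z * se:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    print("naive post-selection coverage:", naive_coverage())
```

In this setup the empirical coverage should fall well short of the nominal 95%. As a rough guide, if the ten t-statistics were exactly independent the coverage would be about 0.95^10 ≈ 0.60.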
We address this dilemma within a standard linear-model framework. There is a numerical response of interest (Y) and a suite of possible explanatory variables, X1,…,Xp, to be used in a multiple linear regression. The data are gathered, a multiple linear regression model is constructed using a selected subset of the potential X variables, and inference (estimates, confidence intervals, tests) is performed for the selected slope parameters.
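One common way to write this framework down is sketched below. The notation is an assumption introduced here for illustration, not taken from the abstract: X is a fixed n-by-p design matrix, M is the selected subset of column indices, X_M is the corresponding submatrix, and μ is the mean vector of Y.

```latex
% Assumed notation: fixed design X (n x p), selected index set M, mean vector mu.
\[
  Y = \mu + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_n),
\]
\[
  \hat\beta_M = (X_M^\top X_M)^{-1} X_M^\top Y,
  \qquad
  \beta_M = (X_M^\top X_M)^{-1} X_M^\top \mu .
\]
```

Under this formalization the target β_M changes with the selected set M, which is one way to read the statement that the parameters being estimated depend on the final model.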
We propose two types of insurance to guarantee against the deleterious effects of this type of variable selection. The first provides valid confidence intervals and tests based on the design matrix of the observed variables. It does not require adherence to a pre-specified variable selection algorithm. This insurance may involve overly conservative procedures; but on the other hand, no less-conservative procedure of this type will provide the desired insurance. The second type of insurance is purchased through use of a properly specified split-sample bootstrap. These intervals may be less conservative, but are not always so, and part of their price lies in the split-sample scheme that effectively sacrifices a portion of the data.
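The second form of insurance can be sketched in simplified form as follows. This is not the procedure presented in the talk, whose details the abstract does not give. It is a generic stand-in under stated assumptions: the data are split in half, the predictor with the largest absolute t-statistic is selected on the first half, and a pairs-bootstrap percentile interval for its slope is computed on the second half only. The data-generating setup (pure-noise response, independent normal predictors, no-intercept regression) and the names `abs_t` and `split_sample_coverage` are hypothetical.

```python
import math
import random


def abs_t(x, y):
    """|t|-statistic for the slope in a no-intercept simple regression of y on x."""
    sxx = sum(v * v for v in x)
    slope = sum(a * b for a, b in zip(x, y)) / sxx
    rss = sum((b - slope * a) ** 2 for a, b in zip(x, y))
    se = math.sqrt(rss / (len(y) - 1) / sxx)
    return abs(slope / se)


def split_sample_coverage(n=200, p=10, reps=200, n_boot=200, level=0.95, seed=0):
    """Coverage of a held-out bootstrap interval for the selected slope (truly 0)."""
    rng = random.Random(seed)
    half = n // 2
    # Positions of the approximate percentile endpoints in the sorted bootstrap slopes.
    lo_idx = round((1 - level) / 2 * n_boot)
    hi_idx = round((1 + level) / 2 * n_boot) - 1
    hits = 0
    for _ in range(reps):
        y = [rng.gauss(0, 1) for _ in range(n)]
        xs = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
        # Selection uses the first half of the data only.
        chosen = max(range(p), key=lambda j: abs_t(xs[j][:half], y[:half]))
        # Inference uses the second half only, which selection never saw.
        x_inf, y_inf = xs[chosen][half:], y[half:]
        xy = [a * b for a, b in zip(x_inf, y_inf)]
        xx = [a * a for a in x_inf]
        m = len(y_inf)
        slopes = []
        for _ in range(n_boot):
            idx = rng.choices(range(m), k=m)  # resample (x, y) pairs with replacement
            slopes.append(sum(xy[i] for i in idx) / sum(xx[i] for i in idx))
        slopes.sort()
        if slopes[lo_idx] <= 0.0 <= slopes[hi_idx]:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    print("split-sample bootstrap coverage:", split_sample_coverage())
```

Because the inference half plays no role in choosing the predictor, the interval's coverage should stay near the nominal level in this setup. The price is visible in the code: only half of the observations contribute to the interval.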
This is joint work with R. Berk, A. Buja, E. George, E. Pitkin, M. Traskin, K. Zhang and L. Zhao.