Carnegie Mellon University - Department of Statisitcs
Consider a linear regression model
Y = XÎ² + z; z ~ N(0, In); X = Xn,p;
where both p and n are large but p > n. The vector Î² is unknown but is sparse in the sense that only a small proportion of its coordinates is nonzero, and we are interested in identifying these nonzero ones. We model the coordinates of Î² as samples from a two-component mixture (1-Ïµ)Ï…0 + ÏµÏ€, and the rows of X as samples from N(0, 1/n Î©), where Ï…0 is the point mass at 0, Ï€ is a distribution, and Î© is a p by p correlation matrix which is unknown but is presumably sparse.
We propose a two-stage variable selection procedure which we call the UPS. This is a Screen and Clean procedure, in which we screen with the Univariate thresholding, and clean with the Penalized MLE. In many situations, the UPS possesses two important properties: Sure Screening and Separable After Screening (SAS). These properties enable us to reduce the original regression problem to many small-size regression problems that can be fitted separately. As a result, the UPS is effective both in theory and in computation.
We measure the performance of variable selection procedure by the Hamming distance, and use an asymptotic framework where p â†’ âˆž and (Ïµ, Ï€, n, Î©) depend on p. We find that in many situations, the UPS achieves the optimal rate of convergence. We also find that in the (Ïµp; Ï€p) space, there is a three-phase diagram shared by many choices of Î©. In the first phase, it is possible to recover all signals. In the second phase, exact recovery is impossible, but it is possible to recover most of the signals. In the third phase, successful variable selection is impossible. The UPS partitions the phase space in the same way that the optimal procedures do, and recovers most of the signals as long as successful variable selection is possible.
The lasso and the subset selection (also known as the L1- and L0-penalization methods, respectively) are well-known approaches to variable selection. However, somewhat surprisingly, there are regions in the phase space where neither the lasso nor the subset selection is rate optimal, even for very simple Î©. The lasso is non-optimal because it is too loose in filtering out fake signals (i.e. noise that is highly correlated with a signal), and the subset selection is non-optimal because it tends to kill one or more signals in correlated pairs, triplets, etc.