In the first of this series of posts on regression techniques we introduced the work-horse of predictive modeling, ordinary least squares regression (OLS), but concluded with the notion that while a common and useful technique, OLS has some notable weaknesses. In the last post we discussed its sensitivity to outliers and how to deal with that using multiple “robust regression” techniques. Now we will discuss regression modeling in high dimensional data. Two of the techniques are based on idea called “feature extraction” and the last three are based on “penalized regression”

## Dimensionality, feature extraction, selection, and penalized regression

An issue that every modeler who has had to deal with high-dimensional data, is feature selection. By high dimensionality I mean that \(p\), the number of predictors, is large. High dimensional data can cause all sorts of issues — some of them psychological — for instance, it can be very hard to wrap your mind around so many predictors at one time; it can be difficult to figure out where to start, what to analyze, what to explore, and even visualization can be a beast! I tried to find some pictures to represent high-dimensional data for this blog post, but of course, by definition 2D or 3D representations of high-D data is hard to come by. This parallel plot at least add some color to the post and represents at least some of the complexity!

When \(p\) is large with respect to the number of observations \(n\), then the probability of overfitting is high. Now, let me quickly interject that \(p\) may be much greater than the number of raw input variables in the data. That is, a modeler may have decided on constructing several new features from among the existing data (e.g., ratios between two or more variables, non-linear transformations, and multi-way interactions). If there is a strong theoretical reason for including a set of variables, then great, build the model and evaluate the results. However, oftentimes, a modeler is looking for a good fit and doesn’t know which of the features should be created, and then which combinations of such features should be included in the model. Feature construction deals with the first question (and we will look at that later on), feature selection deals with the second question (which we deal with now).

Actually, first I will digress slightly — in the case when \(p\) is equal to or greater than \(n\), OLS will fail miserably. In this situation, feature selection is not only a good idea, it is necessary! One example problem type and data when \(p>n\) occurs in microarray data. If you are not familiar with what a microarray is see the inset quote from the Chapter “Introduction to microarray data analysis” by M. Madan Babu in Computational Genomics. Essentially microarray data is collected for the simultaneous analysis of thousands of genes to help discovery of gene functions, examine the gene regulatory network, identify of drug targets, and provide insight to understanding of diseases. The data usually has \(n\) on the orders of 10 to 100, whereas \(p\) might be on the order of 100’s to 1000’s!

A microarray is typically a glass slide on to which DNA molecules are fixed in an orderly manner at specific locations called spots. A microarray may contain thousands of spots and each spot may contain a few million copies of identical DNA molecules that uniquely correspond to a gene. The DNA in a spot may either be genomic DNA or short stretch of oligo-nucleotide strands that correspond to a gene. Microarrays may be used to measure gene expression in many ways, but one of the most popular applications is to compare expression of a set of genes from a cell maintained in a particular condition (condition A) to the same set of genes from a reference cell maintained under normal conditions (condition B). – M. Madan Babu

### PCR and PLS

There are two closely related approaches that can handle such scenarios with ease. These are Principal Component Regression (PCR) and Partial Least Squares regression (PLS). Both of these techniques allow you deal with data when \(p>n\). Both PCR and PLS essentially represent the data in lower dimensions (either in what is known as “principal components” for the former or just “components” in the latter). Either way these components are each formed from linear combinations of all of the predictors. Since PCR and PLS are essentially automatically creating new features from the given data, this is a form of what is known as “feature extraction”. That is, the algorithm extracts new predictors from the data for use in modeling.

If you choose \(k\) components (or extracted features) to be less than \(p\), then you have reduced the effective representation of your data. With this of course also comes information loss. You can’t just get rid of dimensions without losing information. However, both PCR and PLS both try to shove as much “information” into the first component as possible; subsequent components will contain less and less information. If you chop off the only a few of the last components, then you will not experience much information loss. If your data contains highly correlated variables or subsets of variables, then you can possible reduce many dimensions with very little loss of information.

Face database for facial recognition algorithms

Image data for instance is another example of high-dimensional data which can usually be reduced using something Principal Component Analysis (PCA) to a much lower level of dimensions. Facial recognition algorithms leverage this fact extensively! Looking at the image at right, it is easy to see many commonalities among the images of faces — the *information* to discern one face from the other is not within the similarities, but of course the differences. Imagine if you can removing the “similar elements” of all the faces — this would remove a considerable amount of the data dimensionality and the remaining “differences” is where the true discriminatory information is found.

PCR and PLS essentially do this — allow you to throw away the non-informative dimensions of data and perform regression modeling on only the informative bits.

I’ll leave the details about the difference in PCR and PLS alone for now, except to say that PCR is based on an unsupervised technique (PCA), whereas PLS is a inherently a supervised learning technique through-and-through. Both define components (linear combinations of the original data) and then allow you to perform OLS on a subset of components instead of the original data.

### Feature Selection

One possibility for sub-selecting predictors for an OLS model is to use* stepwise regression*. Stepwise regression (which can either be forward, backward, or bi-directional) is a greedy technique in which potential predictors are added (or removed) one-at-time from a candidate OLS model by evaluating the impact of adding (or removing) each variable individually to which improves performance the most. Maybe “performance” is the wrong word here, but let me use it for now. One traditional technique is to add (or remove) variables based on their associated statistical \(p\)-values, e.g., remove a variable if its \(p \ge 0.5\). I should note that while this is commonly employed (e.g., it is the default stepwise method used in SAS), there is some reasonable controversy with this approach. It is sometimes called \(p\)-fishing — as in, “there might not be anything of value in the data, but I am going to fish around until I find something anyway.” This is not a flattering term. If you do choose to perform stepwise regression, a less controversial approach would be to use either AIC or BIC scores as the model performance metric at each step.

### Lasso: a penalized regression approach

However, there are variations of OLS (which again are based on modifications to the objective function in Equation (2) (from the first post in the series) that result in an automatic selection of a subset of the candidate predictors.

Here is Equation (2) again:

$$\text{(Equation 2 again)} \ \ \ \text{minimize } \sum_{i=1}^n (y_i – \hat{y}_i)^2 $$

The first technique that we will discuss is called the *least absolute shrinkage and selection operator (*lasso). Lasso adds a penalty function to Equation 2 based on the magnitude of the regression coefficients as shown in Equation 3.

$$\text{(Equation 3)} \ \ \ \text{minimize } \sum_{i=1}^n (y_i – \hat{y}_i)^2 + \lambda \sum_{j=0}^{p} |\beta_j |$$

The penalty is the sum of the absolute values of all the regression coefficients scaled by some value \(\lambda > 0\). The larger the value of \(\lambda\), the larger the potential penalty. As \(\lambda\) increases, it makes sense with respect to Equation (3) to set more and more regression coefficients to 0. This effectively removes them from the regression model. Since this removal is based on balancing the sum of residuals with the penalty, the predictors which are not as important to the first part of the lasso objective are the ones that are eliminated. Voilà, feature selection!

You might ask, why place a penalty on the magnitude of the beta values — doesn’t that artificially impact the true fit and interpretation of your model? Well, this is a good question — the lasso definitely “shrinks” the regression coefficients, however, this does not necessarily mean that the shrinkage is a departure from the true model. If the predictors in the regression model are correlated (i.e., some form of multi-collinearity exists), then the magnitudes of the regression coefficients will be artificially inflated (and the “meaning” of the beta values may be totally lost). The shrinkage operator in lasso (and other techniques, e.g. ridge regression) tackle this directly. It is *possible* that lasso puts too much downward pressure on the coefficient magnitudes, but it is not necessarily true.

### Ridge regression: another penalized regression approach

I need to confess that ridge regression is *not* a method with automatic feature selection. However, since it is so closely related to lasso, I decided to throw it in really quickly so I don’t have to right another blog post just for this little guy. Here’s the equation, see if you can spot the difference from Equation (3)!

$$\text{(Equation 4)} \ \ \ \text{minimize } \sum_{i=1}^n (y_i – \hat{y}_i)^2 + \lambda \sum_{j=0}^{p} \beta_j^2 $$

That’s right — the only difference between Equation (3) and Equation (4) is that the penalty is based on absolute values in one and on the squares in the other. The idea is the same, except one very interesting difference — the regression coefficients in ridge regression are never forced to 0. They get smaller and smaller, but unlike lasso, no features are eliminated. Ridge regression however does turn out to produce better predictions that OLS. For both lasso and ridge regression the value of \(\lambda\) is determined by using cross-validation methods to tune the parameter to the best value for predictions. If \(\lambda = 0\), then both of these methods give the exact result as OLS. If \(\lambda > 0\) (as determined by the so-called hyper-parameter tuning, then the penalized regression techniques are shown to be better than OLS.

### Elastic net regularization: yet, again, a penalized regression approach

The other reason that I wanted to introduce ridge regression is that it is a great segue into my favorite of the penalized techniques, *elastic net regularization *or just elastic net for short. The elastic net approach combines both penalties from lasso and ridge regression in an attempt to get at the best of both worlds: the feature selection element of lasso and the predictive performance of ridge regression.

$$\text{(Equation 5)} \ \ \ \text{minimize } \sum_{i=1}^n (y_i – \hat{y}_i)^2 + \lambda_1 \sum_{j=0}^{p} |\beta_j | + \lambda_2 \sum_{j=0}^{p} \beta_j^2$$

Oftentimes the relationship between \(\lambda_1 >0 \) and \(\lambda_2 >0\) is one such that \(\lambda_1 + \lambda_2 = 1\). In this case if \(\lambda_1 = 1\), then the elastic net gives you the same result as lasso, and if \(\lambda_2 = 1\), then the result is equivalent to ridge regression. However, many times the result from hyper-parameter tuning is that \(\lambda_1 < 1\) and \(\lambda_2 < 1\), implying that yes! some hybridization of the lasso and ridge regression approaches produces the best cross-validated results.

In the case when we require \(\lambda_1 + \lambda_2 = 1\), Equation (5) can be rewritten as follows to simplify to only one parameter:

$$\text{(Modified equation 5)} \ \ \ \text{minimize } \sum_{i=1}^n (y_i – \hat{y}_i)^2 + \lambda \sum_{j=0}^{p} |\beta_j | + (1-\lambda) \sum_{j=0}^{p} \beta_j^2$$

I have mentioned “hyper-parameter tuning” a couple of times already. Without going into the details of the cross-validation, let me simply just say that all hyper-parameter tuning means is that you try out a whole bunch of values for your parameter (e.g., \(\lambda\)) until you find the values that work best. Take a look at the lasso and elastic net paths figure. In this figure the values of the coefficients are on the y-axis (each color of a line represents represents a different predictor) and the value of the penalty is represented on the x-axis (actually here the log of the penalty is represented). As the penalty value decreases (moving right along the x-axis), the values of the coefficients increase for both lasso (solid lines) and elastic net (dashed lines). So you can see as you “tune” the value of \(\lambda\), infinite number of models are possible! When the value becomes large enough (starting at right and then moving to the left along the x-axis), some of the coefficients are forced to 0 by lasso and by elastic net. The lasso seems to do this quicker than elastic net as demonstrated in the solid blue and dashed blue line — a very small increase in \(\lambda\) (at about \(\log \lambda \sim 90\) and its value is set to 0 by lasso; whereas, the penalty has to increase such that \(\log \lambda \sim 55\) for elastic net before the same variable’s regression coefficient is set to 0.

Now there are other regression techniques that also include automatic feature selection, e.g., multivariate adaptive regression splines (MARS) essentially use a forward step-wise procedure to add terms to the regression model and regression trees choose a features one at a time to add to a model to produce prediction estimates. (These two seemingly different techniques actually have quite a lot in common!) However, I will introduce the first of these as a technique for both feature selection *and* feature construction. Our next post deals more generally with how our regression approach can deal with non-linearities in our model assumption.