We ended the last post on regression techniques by noting that Ordinary Least Squares (OLS) regression has some known issues. My goal in this series of posts is to introduce several variations on OLS that address one or more of these drawbacks. The first issue is sensitivity to outliers. Making regression more resistant to them is accomplished through what is known, straightforwardly, as robust regression.

## Outliers and robust regression

OLS regression is sensitive to outliers. To see this, look at the objective function again (listed as Equation 2 in the first post of the series):

$$\text{OLS objective function:} \ \ \ \text{minimize } \sum_{i=1}^n (y_i - \hat{y}_i)^2 $$

The procedure is highly incentivized to avoid large residuals: because each residual is squared, a badly wrong prediction incurs a very large penalty. This sounds fine, but imagine there exists a single point in your data set that simply does not fit the true model well, i.e., the value for \(y\) does not follow the linear assumption in Equation 1 (from the previous post) at all. The OLS fit will be greatly affected by such an observation.

Let’s put some numbers to this to make a quick example. Assume that without this outlying point, the residuals associated with your \(n=100\) data points are all relatively small, for instance, \(-0.1 \le y_i - \hat{y}_i \le 0.1\). The sum of the squared residuals is then at most \(100 \times 0.1^2 = 1\). Now, pick one of the 100 points and let’s say that the associated residual is 10. That single point contributes \(10^2 = 100\), so the sum of squared errors would jump to as much as 100.99! To avoid such a large penalty, the OLS fit would not recover the true model, but would instead adjust the fit so as to keep the sum of the squared residuals to something more reasonable. It decreases the residual of this one outlying point by allowing larger residuals for the remaining 99 data points. So now, instead of a good fit for 99% of the data with only one point that doesn’t predict well, we have a model that is a poor fit for 100% of the data!
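To see this trade-off on actual numbers, here is a minimal sketch in plain Python. The data set (ten points on the line \(y = 1 + 2x\), with the last one pushed 50 units off the line) and the helper name `ols_fit` are made up for illustration and do not come from the earlier posts.

```python
def ols_fit(x, y):
    """Closed-form OLS for a single predictor: returns (intercept, slope)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope

# Hypothetical data: ten points lying exactly on y = 1 + 2x.
x = [float(i) for i in range(10)]
y_clean = [1.0 + 2.0 * xi for xi in x]

# Same data, but the last observation is pushed far off the line.
y_outlier = list(y_clean)
y_outlier[-1] += 50.0

b0_clean, b1_clean = ols_fit(x, y_clean)
b0_out, b1_out = ols_fit(x, y_outlier)

# Residuals of the nine "good" points under each fit.
res_clean = [yi - (b0_clean + b1_clean * xi)
             for xi, yi in zip(x[:-1], y_clean[:-1])]
res_out = [yi - (b0_out + b1_out * xi)
           for xi, yi in zip(x[:-1], y_outlier[:-1])]

print("clean fit:       ", b0_clean, b1_clean)
print("fit with outlier:", b0_out, b1_out)
print("largest residual among the good points:", max(abs(r) for r in res_out))
```

With the outlier present, the slope moves well away from 2, and the nine points that sat exactly on the line now carry sizeable residuals of their own.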

To mitigate this sensitivity to outliers, we can use *robust regression*. The key is to change the objective function so that the penalty for large residuals is not so dramatic. Let \(\rho(r_i)\) represent the penalty function (as a function of the residual \(r_i = y_i - \hat{y}_i\) for observation \(i\)), and let \(c > 0\) be a tuning constant that marks the boundary between small and large residuals. Three common penalty functions are:

\begin{align}
\text{Least absolute value (LAV)} \ \ \rho_\text{lav}(r_i) &= |r_i| \\
\text{Huber} \ \ \rho_\text{huber}(r_i) &= \begin{cases} \frac{1}{2} r_i^2 \ & \text{ if } |r_i| \le c \\ c|r_i| - \frac{1}{2}c^2 \ & \text{ if } |r_i| > c\end{cases}\\
\text{Bisquare} \ \ \rho_\text{bisquare}(r_i) &= \begin{cases} \frac{c^2}{6} \left( 1 - \left( 1 - \left(\frac{r_i}{c}\right)^2 \right)^3 \right) \ & \text{ if } |r_i| \le c \\ \frac{c^2}{6} \ & \text{ if } |r_i| > c\end{cases}
\end{align}
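To get a feel for how differently these penalties treat a large residual, here is a small sketch in plain Python. It uses the standard forms of the three penalties, with the thresholds applied to \(|r_i|\) and the Huber linear branch written as \(c|r_i| - \frac{1}{2}c^2\). The function names are mine, and the default tuning constants (1.345 for Huber, 4.685 for bisquare) are commonly used values for standardized residuals that I am assuming here; they are not taken from this series.

```python
def rho_ols(r):
    """Squared-error penalty used by OLS."""
    return r ** 2

def rho_lav(r):
    """Least absolute value penalty."""
    return abs(r)

def rho_huber(r, c=1.345):
    """Huber penalty: quadratic near zero, linear beyond c (c is an assumed default)."""
    if abs(r) <= c:
        return 0.5 * r ** 2
    return c * abs(r) - 0.5 * c ** 2

def rho_bisquare(r, c=4.685):
    """Tukey bisquare penalty: levels off at c^2 / 6 beyond c (c is an assumed default)."""
    if abs(r) <= c:
        return (c ** 2 / 6.0) * (1.0 - (1.0 - (r / c) ** 2) ** 3)
    return c ** 2 / 6.0

# Compare the penalties for a small, a moderate, and a very large residual.
for r in (0.1, 1.0, 10.0):
    print(r, rho_ols(r), rho_lav(r), rho_huber(r), rho_bisquare(r))
```

For the residual of 10, the squared penalty is 100, while the Huber penalty grows only linearly and the bisquare penalty has already flattened out at \(c^2/6\).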

While I won’t discuss the solution techniques for solving the associated minimization problems here in detail, I will mention that these changes affect the computational cost of the regression, since none of these objectives has a closed-form solution the way OLS does. The LAV problem can be solved using linear programming techniques. The other two approaches (Huber and Bisquare) are actually part of a family of such objective function modifications known as *M-estimation*.
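As a brief sketch of why linear programming applies to the LAV problem: introducing one auxiliary variable \(t_i\) per observation (my notation, not something defined in the earlier posts) turns the absolute values into linear constraints:

$$\text{minimize } \sum_{i=1}^n t_i \ \ \ \text{subject to } -t_i \le y_i - \hat{y}_i \le t_i, \ \ i = 1, \dots, n$$

Because \(\hat{y}_i\) is linear in the beta values, both the objective and the constraints are linear in the unknowns, and at the optimum each \(t_i\) equals \(|y_i - \hat{y}_i|\).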

**Pvt. Joe Bowers**: What are these electrolytes? Do you even know?

**Secretary of State**: They’re… what they use to make Brawndo!

**Pvt. Joe Bowers**: But why do they use them to make Brawndo?

**Secretary of Defense**: Because Brawndo’s got electrolytes.

-Circular reasoning from the movie *Idiocracy*

M-estimation techniques are a class of robust regression methods that change the objective function to be less sensitive to outliers (there are other objectives besides Huber and Bisquare). M-estimation problems usually determine the residual penalty based on the fit, which is based on the beta values, which in turn are based on the residual values… but wait, we don’t know the residuals until we have the beta values, which are based on the residuals, which are based on the beta values! Oh no, it’s circular reasoning!

This paradox is resolved by performing *iteratively reweighted least squares* (IRLS) regression. It turns out that the effect of a residual-based penalty is equivalent to allowing different weights for each observation (based on the residual value of that observation). To address the circular logic, IRLS solves the regression problem multiple times, weighting the observations differently each time. The weights for all observations start equal to 1 (the same as regular old OLS). Observations that do not fit very well are then assigned lower weights, reducing their impact on the regression fit, while observations that fit well keep higher weights. After the weights are assigned, the fit is repeated and the weights are adjusted again, and again, and again, until convergence. This ultimately reduces the impact of outlying observations.
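The loop described above can be sketched in a few lines of plain Python for the single-predictor case. This is an illustration under stated assumptions, not a production implementation: the bisquare weight \(w(u) = (1 - (u/c)^2)^2\) for \(|u| \le c\) (zero otherwise), the tuning constant 4.685, the rescaling of residuals by the median absolute deviation divided by 0.6745, the synthetic data, and all function names are my own choices and do not come from this series.

```python
import math
import statistics

def weighted_fit(x, y, w):
    """Weighted least squares for one predictor: returns (intercept, slope)."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def bisquare_weight(u, c=4.685):
    """Bisquare weight for a scaled residual u; zero beyond c (assumed default)."""
    return (1.0 - (u / c) ** 2) ** 2 if abs(u) <= c else 0.0

def irls(x, y, max_iter=50, tol=1e-8):
    """Iteratively reweighted least squares with bisquare weights."""
    w = [1.0] * len(x)                  # all weights start at 1: plain OLS
    b0, b1 = weighted_fit(x, y, w)
    for _ in range(max_iter):
        r = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
        med = statistics.median(r)
        # Robust scale estimate (assumption): MAD / 0.6745.
        scale = statistics.median([abs(ri - med) for ri in r]) / 0.6745
        if scale == 0.0:                # perfect fit, nothing left to reweight
            break
        w = [bisquare_weight(ri / scale) for ri in r]
        new_b0, new_b1 = weighted_fit(x, y, w)
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1, w

# Hypothetical data: y = 1 + 2x plus a small deterministic wiggle,
# with two outliers placed in the upper left corner.
x = [float(i) for i in range(30)]
y = [1.0 + 2.0 * xi + 0.5 * math.sin(xi) for xi in x]
y[0] += 40.0
y[1] += 40.0

b0_ols, b1_ols = weighted_fit(x, y, [1.0] * len(x))
b0_rob, b1_rob, weights = irls(x, y)

print("OLS fit: ", b0_ols, b1_ols)
print("IRLS fit:", b0_rob, b1_rob)
print("weights of the two outliers:", weights[0], weights[1])
```

On this made-up data the OLS slope is dragged well below 2, whereas the reweighted fit lands close to the underlying line because the two outliers end up with zero weight.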

The images here represent a simple data set in which we have a few outliers (in the upper left corner). The OLS model fit produces the red line, while the M-estimation procedure using Tukey’s bisquare penalty produces the blue regression line. As you can see, the slope of the red line is impacted by the outlying points, but the blue line is not: its fit is in fact based on all points except the outliers.

In a later post in this series we will look at another regression-like approach that is robust to outliers (Support Vector Machine regression). However, since this technique is also excellent at dealing with non-linearity in the data, I’ll postpone that topic for a bit. Next we will discuss a common issue in predictive modeling: dealing with high dimensionality.