Thoughts on regression techniques (part 1)

December 7, 2016

I just completed this semester’s series of lectures on regression methods in ISE/DSA 5103 Intelligent Data Analytics and I wanted to take a moment to call out a few key points.

First, let me list the primary set of techniques that we covered, along with links to the associated methods and packages in R:

While I do not intend to rehash everything we covered in class (e.g., residual diagnostics, leverage, hat-values, performance evaluation, multicollinearity, interpretation, variance inflation, derivations, algorithms, etc.), I wanted to point out a few key things.

Ordinary Least Squares Regression

OLS multiple linear regression is the workhorse of predictive modeling for continuous response variables.  Not only is it a powerful technique that is commonly used across many fields and industries, but it is also easy to build and test, its results are easy to interpret, it is available in virtually every statistical software package, and it is computationally efficient.  Furthermore, it serves as the foundation for several other techniques.  Therefore, learning OLS for multivariate situations is a fundamental first step in predictive modeling.
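To make that concrete, here is a minimal sketch of fitting an OLS model in R with the built-in lm() function; the data set (mtcars, which ships with base R) and the choice of predictors are mine, purely for illustration:

```r
# Fit an OLS model of fuel efficiency (mpg) on weight (wt) and horsepower (hp)
# using the mtcars data set that ships with base R
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Estimated coefficients, standard errors, t-tests, R-squared, etc.
summary(fit)

# Fitted values (y-hat) and residuals (y - y-hat) on the training data
head(predict(fit))
head(residuals(fit))
```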

In order to make any of this make sense, let me introduce some very brief notation.  We will assume that we have n observations of data, each consisting of a single real-valued outcome (a.k.a. response or dependent variable or target), which I will denote as y, and p predictors (a.k.a. independent variables or features or inputs) denoted as x_1, x_2,\ldots, x_p.  These predictors can be continuous or binary.  (Note: nominal variables are perfectly fine; however, to be used in OLS they need to be transformed into one or more “dummy variables,” each of which is binary.)  For the y-intercept and for each predictor there is a regression coefficient: \beta_0, \beta_1, \ldots, \beta_p.  The assumption in OLS is that the true underlying relationship between the response and the input variables is:

$$\text{(Equation 1)} \ \ \ y = \beta_0 + \sum_{j=1}^p \beta_j x_j + \epsilon $$

where \epsilon represents a normally distributed error term with mean equal to 0. When the OLS model is fit, the values of \beta_0, \beta_1, \ldots, \beta_p are estimated. Let \hat{y} denote the estimate of y after the model is fit, and let \hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p denote the estimates of the regression coefficients. The objective of OLS is to minimize the sum of the squares of the residuals, where the residuals are defined as y_i - \hat{y}_i for all i = 1, \ldots, n. That is,

$$\text{(Equation 2)} \ \ \ \text{minimize } \sum_{i=1}^n (y_i - \hat{y}_i)^2 $$

Let me just take a moment here to say that most of the OLS variants that I’ll summarize in this series of posts are motivated by simple modifications to the objective function in Equation (2)!
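As a small sanity check on Equation (2) — my own illustrative sketch, not part of the course material — the residual sum of squares can be computed directly from an lm() fit, and the lm() coefficients can be compared with the textbook closed-form least-squares solution:

```r
# Refit the illustrative model from above
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Residual sum of squares -- the quantity that Equation (2) minimizes
rss <- sum(residuals(fit)^2)
rss

# The same coefficient estimates via the textbook closed-form solution
# beta-hat = (X'X)^{-1} X'y; solve() is used here only for illustration,
# lm() itself relies on a more numerically stable QR decomposition
X <- model.matrix(~ wt + hp, data = mtcars)
y <- mtcars$mpg
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
cbind(lm = coef(fit), closed_form = drop(beta_hat))
```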

OLS is a linear technique, but feature engineering allows a modeler to introduce non-linear effects. That is, if you believe the relationship between the response and the predictor is: $$y = f(x) = x + x^2 + \sin(x) + \epsilon$$ then simply create new variables with these transformations (this is called feature construction). If you let x_1 = x, x_2 = x^2, \text{ and } x_3 = \sin(x), your estimated OLS model is linear with respect to the new variables. That is,

\begin{align}
\hat{y} & = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3  \\
& = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 x^2 + \hat{\beta}_3 \sin(x)
\end{align}
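In R, this kind of feature construction can be done either by adding the transformed columns to the data explicitly or directly in the model formula using I(); here is a sketch with simulated data (the noise level and sample size are my own choices):

```r
set.seed(1)                                      # reproducibility
x <- seq(0, 10, length.out = 200)
y <- x + x^2 + sin(x) + rnorm(length(x), sd = 2) # simulated response

# Option 1: explicit feature construction as new columns
dat  <- data.frame(y = y, x1 = x, x2 = x^2, x3 = sin(x))
fit1 <- lm(y ~ x1 + x2 + x3, data = dat)

# Option 2: the same model expressed directly in the formula
fit2 <- lm(y ~ x + I(x^2) + sin(x))

coef(fit1)   # both fits give identical coefficient estimates
coef(fit2)
```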

For example, if you simply fit an OLS model without any transformations, y ~ x, then you get the following predictions: the blue dots represent the output of the model, whereas the black dots represent the actual data.

[Figure: predictions from the untransformed linear fit]

However, if you transform your variables, then you can get a very good fit.
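In case the original figures do not render here, a rough base-R sketch of how such plots could be produced (re-creating the simulated data from the block above so it runs on its own):

```r
set.seed(1)                                      # same simulated data as above
x <- seq(0, 10, length.out = 200)
y <- x + x^2 + sin(x) + rnorm(length(x), sd = 2)

fit_lin  <- lm(y ~ x)                    # no transformations
fit_poly <- lm(y ~ x + I(x^2) + sin(x))  # with the constructed features

# Black points: actual data; blue points: model predictions
par(mfrow = c(1, 2))
plot(x, y, pch = 16, main = "y ~ x")
points(x, predict(fit_lin), col = "blue", pch = 16)
plot(x, y, pch = 16, main = "y ~ x + I(x^2) + sin(x)")
points(x, predict(fit_poly), col = "blue", pch = 16)
par(mfrow = c(1, 1))
```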

While OLS has various assumptions that ideally should be met in order to proceed with modeling, the predictive performance is insensitive to violations of many of these. For instance, ideally the input variables should be independent; however, even if there are highly collinear predictors, the predictive ability of OLS is not impacted (the interpretation of the coefficients, however, is greatly affected!).
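To illustrate that point, here is a small simulated sketch (my own example; the vif() call assumes the car package is installed):

```r
set.seed(2)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # x2 is nearly a copy of x1 (highly collinear)
y  <- 1 + 2 * x1 + rnorm(n)

fit_collinear <- lm(y ~ x1 + x2)

# The individual coefficient estimates are unstable and hard to interpret...
coef(fit_collinear)

# ...and the variance inflation factors are very large...
car::vif(fit_collinear)

# ...but the predictions barely differ from a model that drops the duplicate
cor(predict(fit_collinear), predict(lm(y ~ x1)))
```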

However, there are some notable difficulties and problems with OLS.  We will discuss some of these in the next few posts!