# Monthly Archives: December 2016

## Happy Winter Break 2016!

Here at the Analytics Lab @ OU we would like to wish you all a happy holiday season, a wonderful winter break, and a feliz año nuevo!

Also, we have had some nice developments over the last few days and weeks and I would like to mention these briefly.

# Congratulations!

Congratulations to Param Tripathi for successfully defending his Master’s thesis and completing his MS in ISE!

In his thesis entitled “ANALYSIS OF RESILIENCE IN US STOCK MARKETS DURING NATURAL DISASTERS” Param analyzes how major hurricane events impact the stock market in the United States. In particular, he looks at two fundamental elements of resilience, vulnerability and recoverability, as they apply to the Property and Casualty Insurance sector by evaluating the price volatility with respect to the overall New York Stock Exchange.  He applies breakout detection to study patterns in the time series data and uses this to quantify key resilience metrics.

Congratulations to Alexandra Amidon for successfully completing her Industry practicum on anomaly detection using a compression algorithm.  Well done!

# Good luck!

Next, Weili Zhang, a PhD Candidate in ISE, is actively interviewing with a vast number of top companies for data science positions. He has passed first and second round interviews for a number of companies and has been invited for face-to-face interviews with Facebook, Google, eBay, Verizon, Disney, Lyft, Amazon, Target, and several more.  Getting past round one with these companies is already a major accomplishment: rounds two and three get progressively more intense!  Good job and good luck Weili!

# Yay us!

The Analytics Lab team members have had a few papers published or accepted for publication recently:

• Almoghathawi, Y., K. Barker, C.M. Rocco, and C. Nicholson. 2017. A multi-criteria decision analysis approach for importance ranking of network components. Reliability Engineering and System Safety, 158: 142-151.
• Nicholson, C., L. Goodwin, and C. Clark. 2016. Variable neighborhood search for reverse engineering of gene regulatory networks.  Journal of Biomedical Informatics, 65: 120-131.
• Zhang, W., N. Wang, and C. Nicholson. 2016. Resilience-based post-disaster recovery strategies for community road-bridge networks. Accepted in Structure and Infrastructure Engineering.

I’ll write up a post about these exciting new research articles soon!

# Data Science Interviews

Weili, along with a few other current and former students, has been describing the data science interview processes and questions to me.  My plan is to write up a blog post summarizing some key things for those of you seeking DSA positions to keep in mind and prepare for.  Fortunately, many of the questions/topics asked about are covered in the ISE 5103 Intelligent Data Analytics class offered in the Fall semester.  If you are starting to look for data science jobs now or in the near future, make sure you check out that post!

# Life in Industry

Additionally, I have asked several current/former students (Pete Pelter, Leslie Goodwin, Cyril Beyney, Olivia Peret) who are employed in data science positions to provide some descriptions and information about life in industry.  Look for that series of posts soon!

# ISE 5113 Advanced Analytics and Metaheuristics

I am currently preparing for the online launch of the ISE 5113 Advanced Analytics and Metaheuristics course to be offered in Spring 2017.  The course will be offered both online and on-campus.  This is one of my favorite courses to teach.  The introductory video should be posted soon!  The course fills up very fast, so if you are a DSA or ISE student make sure to register ASAP if you haven’t already!

# Soccer

I have had an excellent response from so many people this semester that it looks like “yes!” we will have enough people to start a soccer team in the Spring.  I’ll be providing more information next month.  The season starts in March and the official signups are in February (\$80 each).  We need a team name and team colors, so I’ll be open for ideas!  In the meantime, get out there and practice!

# Finally…

I hope everyone has an excellent winter break.  Enjoy your family, enjoy good food, stay warm, practice some soccer, catch up on sleep, maybe study a little bit… and I’ll see you next year!

## Thoughts on regression techniques (part iii)

In the first of this series of posts on regression techniques we introduced the work-horse of predictive modeling, ordinary least squares regression (OLS), but concluded with the notion that while a common and useful technique, OLS has some notable weaknesses.  In the last post we discussed its sensitivity to outliers and how to deal with that using multiple “robust regression” techniques.  Now we will discuss regression modeling in high dimensional data.  Two of the techniques are based on an idea called “feature extraction” and the last three are based on “penalized regression”.

## Dimensionality, feature extraction, selection, and penalized regression

An issue that every modeler who works with high-dimensional data has to deal with is feature selection.  By high dimensionality I mean that $$p$$, the number of predictors, is large. High dimensional data can cause all sorts of issues, some of them psychological: it can be very hard to wrap your mind around so many predictors at one time; it can be difficult to figure out where to start, what to analyze, what to explore, and even visualization can be a beast!  I tried to find some pictures to represent high-dimensional data for this blog post, but of course, by definition 2D or 3D representations of high-D data are hard to come by.  This parallel plot at least adds some color to the post and represents at least some of the complexity!

When $$p$$ is large with respect to the number of observations $$n$$, then the probability of overfitting is high. Now, let me quickly interject that $$p$$ may be much greater than the number of raw input variables in the data.  That is, a modeler may have decided on constructing several new features from among the existing data (e.g., ratios between two or more variables, non-linear transformations, and multi-way interactions). If there is a strong theoretical reason for including a set of variables, then great, build the model and evaluate the results. However, oftentimes, a modeler is looking for a good fit and doesn’t know which of the features should be created, and then which combinations of such features should be included in the model. Feature construction deals with the first question (and we will look at that later on), feature selection deals with the second question (which we deal with now).

Actually, first I will digress slightly: in the case when $$p$$ is equal to or greater than $$n$$, OLS will fail miserably, because there are more coefficients to estimate than observations and the least squares solution is no longer unique.  In this situation, feature selection is not only a good idea, it is necessary!  One example of data in which $$p>n$$ is microarray data. If you are not familiar with what a microarray is, see the inset quote from the chapter “Introduction to microarray data analysis” by M. Madan Babu in Computational Genomics.  Essentially, microarray data is collected for the simultaneous analysis of thousands of genes to help discover gene functions, examine the gene regulatory network, identify drug targets, and provide insight into diseases.  The data usually has $$n$$ on the order of 10 to 100, whereas $$p$$ might be on the order of 100’s to 1000’s!

A microarray is typically a glass slide on to which DNA molecules are fixed in an orderly manner at specific locations called spots. A microarray may contain thousands of spots and each spot may contain a few million copies of identical DNA molecules that uniquely correspond to a gene. The DNA in a spot may either be genomic DNA or short stretch of oligo-nucleotide strands that correspond to a gene. Microarrays may be used to measure gene expression in many ways, but one of the most popular applications is to compare expression of a set of genes from a cell maintained in a particular condition (condition A) to the same set of genes from a reference cell maintained under normal conditions (condition B). – M. Madan Babu

### PCR and PLS

There are two closely related approaches that can handle such scenarios with ease.  These are Principal Component Regression (PCR) and Partial Least Squares regression (PLS).  Both of these techniques allow you to deal with data when $$p>n$$.  Both PCR and PLS essentially represent the data in lower dimensions (either in what is known as “principal components” for the former or just “components” in the latter).  Either way these components are each formed from linear combinations of all of the predictors.  Since PCR and PLS are essentially automatically creating new features from the given data, this is a form of what is known as “feature extraction”.  That is, the algorithm extracts new predictors from the data for use in modeling.

If you choose $$k$$, the number of components (or extracted features), to be less than $$p$$, then you have reduced the effective representation of your data.  With this of course also comes information loss.  You can’t just get rid of dimensions without losing information.  However, both PCR and PLS try to shove as much “information” into the first component as possible; subsequent components will contain less and less information.  If you chop off only a few of the last components, then you will not experience much information loss.  If your data contains highly correlated variables or subsets of variables, then you can possibly remove many dimensions with very little loss of information.

Face database for facial recognition algorithms

Image data, for instance, is another example of high-dimensional data which can usually be reduced to a much lower number of dimensions using something like Principal Component Analysis (PCA).  Facial recognition algorithms leverage this fact extensively!  Looking at the image at right, it is easy to see many commonalities among the images of faces: the information to discern one face from another is not within the similarities, but of course in the differences.  Imagine, if you can, removing the “similar elements” of all the faces. This would remove a considerable amount of the data dimensionality, and the remaining “differences” are where the true discriminatory information is found.

PCR and PLS essentially do this — allow you to throw away the non-informative dimensions of data and perform regression modeling on only the informative bits.

I’ll leave the details about the difference between PCR and PLS alone for now, except to say that PCR is based on an unsupervised technique (PCA), whereas PLS is inherently a supervised learning technique through-and-through.  Both define components (linear combinations of the original predictors) and then allow you to perform OLS on a subset of components instead of the original data.
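To make the PCR idea a little more concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the toy data, the helper names (`center`, `first_component`), and the use of power iteration to find the first principal component are all choices made for this example, not something prescribed by PCR itself. Two highly correlated predictors are collapsed into a single component, and the response is then regressed on that one component.

```python
import math

def center(values):
    """Subtract the mean so that the values sum to (approximately) zero."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def first_component(X, iters=100):
    """First principal direction of centered rows X via power iteration.

    Returns the unit direction and the share of total variance it explains.
    Assumes the starting vector of ones is not orthogonal to that direction.
    """
    p = len(X[0])
    cov = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    v = [1.0] * p
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    top = sum(v[i] * cov[i][j] * v[j] for i in range(p) for j in range(p))
    share = top / sum(cov[i][i] for i in range(p))
    return v, share

# Hypothetical toy data: x2 is almost a copy of x1, and y = x1 + x2.
x1 = center([-2.0, -1.0, 0.0, 1.0, 2.0])
x2 = center([-1.9, -1.1, 0.1, 0.9, 2.0])
y = center([a + b for a, b in zip(x1, x2)])
X = [[a, b] for a, b in zip(x1, x2)]

direction, share = first_component(X)

# Regress y on the single component score (simple OLS through the origin).
scores = [sum(d * xi for d, xi in zip(direction, row)) for row in X]
gamma = sum(t * yi for t, yi in zip(scores, y)) / sum(t * t for t in scores)

# Translate back to coefficients on the original predictors.
beta = [gamma * d for d in direction]

sse = sum((yi - gamma * t) ** 2 for t, yi in zip(scores, y))
sst = sum(yi ** 2 for yi in y)
r2 = 1.0 - sse / sst

print("variance explained by component 1:", round(share, 4))
print("implied coefficients on x1, x2:", [round(b, 3) for b in beta])
print("R^2 using one component:", round(r2, 6))
```

In this toy case one component carries nearly all of the variance, so dropping the second dimension costs almost nothing in fit. With real data you would keep several components and choose how many by cross-validation.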

### Feature Selection

One possibility for sub-selecting predictors for an OLS model is to use stepwise regression.  Stepwise regression (which can be forward, backward, or bi-directional) is a greedy technique in which potential predictors are added to (or removed from) a candidate OLS model one at a time: at each step, the impact of adding (or removing) each variable individually is evaluated, and the change that improves performance the most is made. Maybe “performance” is the wrong word here, but let me use it for now. One traditional technique is to add (or remove) variables based on their associated statistical $$p$$-values (not to be confused with $$p$$, the number of predictors), e.g., remove a variable if its $$p$$-value is $$\ge 0.05$$.  I should note that while this is commonly employed (e.g., it is the default stepwise method used in SAS), there is some reasonable controversy with this approach.  It is sometimes called $$p$$-fishing, as in, “there might not be anything of value in the data, but I am going to fish around until I find something anyway.”  This is not a flattering term.  If you do choose to perform stepwise regression, a less controversial approach would be to use either AIC or BIC scores as the model performance metric at each step.
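For reference, when comparing OLS models fit to the same data, these two scores are commonly written as follows (up to an additive constant that does not affect the comparison, and assuming normally distributed errors). The symbols $$\text{SSE}$$ (the sum of squared residuals) and $$k$$ (the number of estimated coefficients) are introduced here only for illustration:

$$\text{AIC} = n \ln\left(\frac{\text{SSE}}{n}\right) + 2k, \qquad \text{BIC} = n \ln\left(\frac{\text{SSE}}{n}\right) + k \ln(n)$$

Lower is better for both. Adding a predictor always lowers the first term, so the second term is what discourages overfitting. BIC charges more per coefficient than AIC whenever $$\ln(n) > 2$$, i.e., for $$n \ge 8$$, so it tends to select smaller models.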

### Lasso: a penalized regression approach

However, there are variations of OLS (which again are based on modifications to the objective function in Equation (2) from the first post in the series) that result in an automatic selection of a subset of the candidate predictors.

Here is Equation (2) again:

$$\text{(Equation 2 again)} \ \ \ \text{minimize } \sum_{i=1}^n (y_i - \hat{y}_i)^2$$

The first technique that we will discuss is called the least absolute shrinkage and selection operator (lasso).  Lasso adds a penalty function to Equation 2 based on the magnitude of the regression coefficients as shown in Equation 3.

$$\text{(Equation 3)} \ \ \ \text{minimize } \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \lambda \sum_{j=0}^{p} |\beta_j |$$

The penalty is the sum of the absolute values of all the regression coefficients scaled by some value $$\lambda > 0$$.  The larger the value of $$\lambda$$, the larger the potential penalty.  As $$\lambda$$ increases, it makes sense with respect to Equation (3) to set more and more regression coefficients to 0.  This effectively removes them from the regression model.  Since this removal is based on balancing the sum of squared residuals with the penalty, the predictors which are not as important to the first part of the lasso objective are the ones that are eliminated.  Voilà, feature selection!
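Here is a small sketch of that elimination in action, written in plain Python. It uses coordinate descent with soft-thresholding, which is one common way to solve the lasso problem (an assumption of this example, not necessarily what any particular package does). The toy data and function names are hypothetical. One more assumption: the intercept is handled by centering and is left unpenalized, which is the usual convention and differs slightly from the $$j=0$$ lower limit written in Equation (3).

```python
from itertools import product

def soft_threshold(rho, t):
    """Shrink rho toward zero by t, returning exactly 0.0 when |rho| <= t."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0

def lasso_coordinate_descent(X, y, lam, sweeps=100):
    """Minimize sum((y - b0 - X b)^2) + lam * sum(|b_j|) over the slopes b.

    Assumes the columns of X are already centered; the intercept b0 is then
    the mean of y and is not penalized.
    """
    n, p = len(X), len(X[0])
    b0 = sum(y) / n
    yc = [yi - b0 for yi in y]
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Inner product of predictor j with the residual that ignores predictor j.
            rho = 0.0
            for i in range(n):
                partial = yc[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * partial
            z = sum(X[i][j] ** 2 for i in range(n))
            beta[j] = soft_threshold(rho, lam / 2.0) / z
    return b0, beta

# Toy data: a 2^3 factorial design, so the three predictors are centered and
# mutually orthogonal. The response depends strongly on x1, weakly on x2,
# and not at all on x3.
X = [list(row) for row in product((-1.0, 1.0), repeat=3)]
y = [1.0 + 3.0 * x1 + 0.5 * x2 for x1, x2, x3 in X]

paths = {lam: lasso_coordinate_descent(X, y, lam)[1] for lam in (0.0, 4.0, 16.0)}
for lam, beta in paths.items():
    print(lam, beta)
# 0.0 [3.0, 0.5, 0.0]
# 4.0 [2.75, 0.25, 0.0]
# 16.0 [2.0, 0.0, 0.0]
```

With no penalty the OLS coefficients are recovered. As $$\lambda$$ grows every coefficient shrinks, and the weak predictor x2 is dropped entirely at $$\lambda = 16$$ while the strong predictor x1 survives.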

You might ask, why place a penalty on the magnitude of the beta values? Doesn’t that artificially impact the true fit and interpretation of your model?  Well, this is a good question. The lasso definitely “shrinks” the regression coefficients; however, this does not necessarily mean that the shrinkage is a departure from the true model.  If the predictors in the regression model are correlated (i.e., some form of multi-collinearity exists), then the magnitudes of the regression coefficients will be artificially inflated (and the “meaning” of the beta values may be totally lost).  The shrinkage operator in lasso (and other techniques, e.g., ridge regression) tackles this directly.  It is possible that lasso puts too much downward pressure on the coefficient magnitudes, but this is not necessarily the case.

### Ridge regression: another penalized regression approach

I need to confess that ridge regression is not a method with automatic feature selection.  However, since it is so closely related to lasso, I decided to throw it in really quickly so I don’t have to write another blog post just for this little guy.  Here’s the equation, see if you can spot the difference from Equation (3)!

$$\text{(Equation 4)} \ \ \ \text{minimize } \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \lambda \sum_{j=0}^{p} \beta_j^2$$

That’s right: the only difference between Equation (3) and Equation (4) is that the penalty is based on absolute values in one and on squares in the other. The idea is the same, except for one very interesting difference: the regression coefficients in ridge regression are never forced to 0.  They get smaller and smaller, but unlike lasso, no features are eliminated.  Ridge regression does, however, often turn out to produce better predictions than OLS.  For both lasso and ridge regression the value of $$\lambda$$ is determined by using cross-validation methods to tune the parameter to the best value for predictions.  If $$\lambda = 0$$, then both of these methods give exactly the same result as OLS.  If the tuned value is $$\lambda > 0$$ (as determined by the so-called hyper-parameter tuning), then the penalized regression technique has out-performed OLS in cross-validation.
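To see why the squared penalty shrinks but never eliminates, consider a minimal worked case. This is an illustration added under simplifying assumptions: a single centered predictor $$x$$, no intercept, and the shorthand $$\rho = \sum_{i=1}^n x_i y_i$$ and $$z = \sum_{i=1}^n x_i^2$$, neither of which appears in the equations above. Minimizing each objective over the single coefficient gives

$$\hat{\beta}_\text{OLS} = \frac{\rho}{z}, \qquad \hat{\beta}_\text{ridge} = \frac{\rho}{z + \lambda}, \qquad \hat{\beta}_\text{lasso} = \frac{\text{sign}(\rho)\,\max\left(|\rho| - \lambda/2,\ 0\right)}{z}$$

The ridge estimate divides by an ever larger number, so for $$\rho \ne 0$$ it only reaches 0 in the limit $$\lambda \to \infty$$. The lasso estimate subtracts a fixed amount from $$|\rho|$$, so it is exactly 0 as soon as $$\lambda \ge 2|\rho|$$.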

### Elastic net regularization: yet, again, a penalized regression approach

The other reason that I wanted to introduce ridge regression is that it is a great segue into my favorite of the penalized techniques, elastic net regularization or just elastic net for short.  The elastic net approach combines both penalties from lasso and ridge regression in an attempt to get at the best of both worlds: the feature selection element of lasso and the predictive performance of ridge regression.

$$\text{(Equation 5)} \ \ \ \text{minimize } \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \lambda_1 \sum_{j=0}^{p} |\beta_j | + \lambda_2 \sum_{j=0}^{p} \beta_j^2$$

Oftentimes the relationship between $$\lambda_1 \ge 0$$ and $$\lambda_2 \ge 0$$ is such that $$\lambda_1 + \lambda_2 = 1$$.  In this case if $$\lambda_1 = 1$$ (so $$\lambda_2 = 0$$), then the elastic net gives you the same result as lasso, and if $$\lambda_2 = 1$$ (so $$\lambda_1 = 0$$), then the result is equivalent to ridge regression.  However, many times the result from hyper-parameter tuning is that $$0 < \lambda_1 < 1$$ and $$0 < \lambda_2 < 1$$, implying that yes! some hybridization of the lasso and ridge regression approaches produces the best cross-validated results.

In the case when we require $$\lambda_1 + \lambda_2 = 1$$, Equation (5) can be rewritten as follows to simplify to only one parameter:

$$\text{(Modified equation 5)} \ \ \ \text{minimize } \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \lambda \sum_{j=0}^{p} |\beta_j | + (1-\lambda) \sum_{j=0}^{p} \beta_j^2$$

I have mentioned “hyper-parameter tuning” a couple of times already.  Without going into the details of the cross-validation, let me simply say that all hyper-parameter tuning means is that you try out a whole bunch of values for your parameter (e.g., $$\lambda$$) until you find the values that work best.  Take a look at the lasso and elastic net paths figure.  In this figure the values of the coefficients are on the y-axis (each line color represents a different predictor) and the value of the penalty is represented on the x-axis (actually here the log of the penalty is represented).  As the penalty value decreases (moving right along the x-axis), the values of the coefficients increase for both lasso (solid lines) and elastic net (dashed lines).  So you can see that as you “tune” the value of $$\lambda$$, an infinite number of models are possible!  When the value becomes large enough (starting at right and then moving to the left along the x-axis), some of the coefficients are forced to 0 by lasso and by elastic net.  The lasso seems to do this quicker than elastic net, as demonstrated by the solid blue and dashed blue lines: with a very small increase in $$\lambda$$ (at about $$\log \lambda \sim 90$$) the coefficient is set to 0 by lasso, whereas the penalty has to increase such that $$\log \lambda \sim 55$$ for elastic net before the same variable’s regression coefficient is set to 0.
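The “try out a whole bunch of values” idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a recipe from the class: it uses a single-predictor ridge model with no intercept (so the fit has a one-line closed form), a hypothetical toy data set, a hand-picked grid of penalty values, and 5-fold cross-validation with folds assigned by index.

```python
def ridge_slope(x, y, lam):
    """Closed-form ridge slope for one predictor and no intercept."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)

def cv_error(x, y, lam, k=5):
    """Mean squared prediction error over k held-out folds."""
    n = len(x)
    total = 0.0
    for fold in range(k):
        held_out = set(range(fold, n, k))
        x_train = [x[i] for i in range(n) if i not in held_out]
        y_train = [y[i] for i in range(n) if i not in held_out]
        slope = ridge_slope(x_train, y_train, lam)
        total += sum((y[i] - slope * x[i]) ** 2 for i in held_out)
    return total / n

# Hypothetical toy data: y is roughly 2x plus a small deterministic wobble.
x = [0.25 * i - 2.5 for i in range(20)]
noise = [0.1 * ((3 * i) % 7 - 3) for i in range(20)]
y = [2.0 * xi + e for xi, e in zip(x, noise)]

# Try every candidate penalty and keep the one with the lowest CV error.
grid = [0.0, 0.1, 1.0, 10.0, 100.0]
errors = {lam: cv_error(x, y, lam) for lam in grid}
best_lam = min(errors, key=errors.get)

for lam in grid:
    print(f"lambda = {lam:6.1f}   cross-validated MSE = {errors[lam]:.4f}")
print("selected lambda:", best_lam)
```

The important design point is that each candidate is scored on observations that were held out of the fit, so a heavy penalty that shrinks the slope too far is punished by poor predictions rather than rewarded.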

Now there are other regression techniques that also include automatic feature selection, e.g., multivariate adaptive regression splines (MARS) essentially use a forward step-wise procedure to add terms to the regression model and regression trees choose features one at a time to add to a model to produce prediction estimates. (These two seemingly different techniques actually have quite a lot in common!)  However, I will introduce the first of these as a technique for both feature selection and feature construction.  Our next post deals more generally with how our regression approach can deal with non-linearities in our model assumptions.

## Thoughts on regression techniques (part deux)

We left off discussion in the last post on regression techniques with the statement that there are some known issues relating to Ordinary Least Squares (OLS) regression techniques.  My goal in this series of posts is to introduce several variations on OLS that address one or more of these drawbacks.  The first issue I will mention is related to making regression more robust in the presence of outliers.  This is accomplished through what is known, straightforwardly, as robust regression.

## Outliers and robust regression

OLS regression is sensitive to outliers. To see this look at the objective function again (listed as Equation 2 in the first post of the series):

$$\text{OLS objective function:} \ \ \ \text{minimize } \sum_{i=1}^n (y_i - \hat{y}_i)^2$$

The procedure is highly incentivized to minimize the residuals. That is, the square term implies a very large penalty if the predictions are wrong. This sounds fine, but imagine there exists a single point in your data set that simply does not fit the true model well, i.e., the value for $$y$$ does not follow the linear assumption in Equation 1 (from the previous post) at all. The OLS fit will be greatly affected by such an observation.

Let’s put some numbers to this to make a quick example. Assume that without this outlying point, the residuals associated with your $$n=100$$ data points are all relatively small, for instance, $$-0.1 \le y_i - \hat{y}_i \le 0.1$$. The sum of the squared residuals is then at most $$100 \times 0.1^2 = 1$$. Now, pick one of the 100 points and let’s say that the associated residual is 10. The sum of squared errors would jump to as much as $$99 \times 0.1^2 + 10^2 = 100.99$$, roughly a hundredfold increase caused by a single point! To mitigate this, the OLS fit would not recover the true model, but would adjust the fit so as to keep the sum of the squared residuals to something more reasonable. It decreases the residual of this one outlying point by allowing larger residuals for the remaining 99 data points. So now, instead of a good fit for 99% of the data with only one point that doesn’t predict well, we have a model that is a poor fit for 100% of the data!
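A quick check of that arithmetic in plain Python, under the worst-case assumption that all 99 ordinary residuals sit at exactly 0.1. For comparison it also computes the same sum using absolute values instead of squares, which previews the kind of objective discussed next.

```python
# Toy residuals for the scenario above: 99 points at the assumed
# worst case of 0.1, plus a single outlier with residual 10.
residuals = [0.1] * 99 + [10.0]

sse = sum(r ** 2 for r in residuals)    # squared-error loss (OLS)
sae = sum(abs(r) for r in residuals)    # absolute-error loss

share_squared = 10.0 ** 2 / sse         # outlier's share of the squared loss
share_absolute = 10.0 / sae             # outlier's share of the absolute loss

print(f"sum of squared residuals:  {sse:.2f} (outlier share {share_squared:.1%})")
print(f"sum of absolute residuals: {sae:.2f} (outlier share {share_absolute:.1%})")
# sum of squared residuals:  100.99 (outlier share 99.0%)
# sum of absolute residuals: 19.90 (outlier share 50.3%)
```

One point out of a hundred accounts for about 99% of the squared-error objective, but only about half of the absolute-error objective. That is the sense in which squaring gives outliers an outsized vote.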

To mitigate this sensitivity to outliers, we can use robust regression. The key is to change the objective function so that the penalty for large residuals is not so dramatic. Let $$\rho(r_i)$$ represent the penalty function (as a function of the residual for observation $$i$$).  Three such common penalty functions are:

\begin{align}
\text{Least absolute value (LAV)} \ \  \rho_\text{lav}(r_i) &= |r_i| \\
\text{Huber} \ \  \rho_\text{huber}(r_i) &= \begin{cases} \frac{1}{2} r_i^2 \ & \text{ if } |r_i| \le c \\ c|r_i| -\frac{1}{2}c^2 \ & \text{ if } |r_i|> c\end{cases}\\
\text{Bisquare} \ \    \rho_\text{bisquare}(r_i) &= \begin{cases} \frac{c^2}{6} \left( 1 - \left( 1 - \left(\frac{r_i}{c}\right)^2 \right)^3 \right) \ & \text{ if } |r_i| \le c \\ \frac{c^2}{6} \ & \text{ if } |r_i|> c\end{cases}
\end{align}

In each case the objective is to minimize $$\sum_{i=1}^n \rho(r_i)$$, and $$c > 0$$ is a tuning constant that marks where a residual starts to be treated as large.

While I won’t discuss the solution techniques for solving the associated minimization problems here in detail, I will mention that these changes directly impact the computational efficiency of the regression. The LAV problem can be solved using linear programming techniques. The other two approaches (Huber and Bisquare) are actually part of a family of such objective function modifications known as M-estimation.
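The three penalty functions are easy to compare numerically. The sketch below writes them in their standard textbook forms in plain Python. The default values of `c` (1.345 for Huber and 4.685 for bisquare) are commonly used tuning constants for residuals measured in units of their scale; they are an assumption of this example and are not specified in the post.

```python
def rho_squared(r):
    """Ordinary least squares penalty, for comparison."""
    return r * r

def rho_lav(r):
    """Least absolute value penalty."""
    return abs(r)

def rho_huber(r, c=1.345):
    """Quadratic near zero, linear beyond c (continuous at |r| = c)."""
    if abs(r) <= c:
        return 0.5 * r * r
    return c * abs(r) - 0.5 * c * c

def rho_bisquare(r, c=4.685):
    """Tukey's bisquare: levels off at c^2 / 6 once |r| exceeds c."""
    if abs(r) <= c:
        return (c * c / 6.0) * (1.0 - (1.0 - (r / c) ** 2) ** 3)
    return c * c / 6.0

print(f"{'residual':>8} {'squared':>9} {'LAV':>7} {'Huber':>8} {'bisquare':>9}")
for r in (0.5, 2.0, 10.0, 100.0):
    print(f"{r:8.1f} {rho_squared(r):9.2f} {rho_lav(r):7.2f} "
          f"{rho_huber(r):8.3f} {rho_bisquare(r):9.3f}")
```

Reading down the columns shows the design choice: the squared penalty grows without restraint, LAV and Huber grow only linearly for large residuals, and bisquare stops growing altogether, so a sufficiently extreme point adds a constant cost no matter how far out it is.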

Pvt. Joe Bowers: What are these electrolytes? Do you even know?

Secretary of State: They’re… what they use to make Brawndo!

Pvt. Joe Bowers: But why do they use them to make Brawndo?

Secretary of Defense: Because Brawndo’s got electrolytes.

-Circular reasoning from the movie Idiocracy

M-estimation techniques are a class of robust regression methods which change the objective function to be less sensitive to outliers (there are other objectives besides Huber and Bisquare).  M-estimation problems usually determine the residual penalty value based on the fit, which is based on the beta values, which in turn are based on the residual values… but wait, we don’t know the residuals until we have the beta values, which are based on the residuals, which are based on the beta values!  Oh no, it’s circular reasoning!

This paradox is resolved by performing iteratively reweighted least squares regression (IRLS).  It turns out that the effect of a residual-based penalty is equivalent to allowing different weights for each observation (based on the residual value of that observation).  To address the circular logic, IRLS solves the regression problem multiple times, with the observations weighted differently each time.  The weights for all observations start equal to 1 (the same as regular old OLS), and then the observations which do not fit very well are assigned lower weights, reducing their impact on the regression fit.  The observations that fit well keep higher weights.  After the weights are assigned, the process is repeated, and the weights are adjusted again, and again, and again, until convergence.  This ultimately reduces the impact of outlying observations.
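The loop described above can be sketched in plain Python for a single predictor. Several details here are assumptions made for the illustration rather than anything stated in the post: the Huber weight function (weight 1 for small scaled residuals, $$c/|u|$$ beyond the cutoff), the tuning constant 1.345, a scale estimate based on the median absolute residual divided by 0.6745, and the toy data with one planted outlier.

```python
from statistics import median

def weighted_line_fit(x, y, w):
    """Weighted least squares fit of y = b0 + b1 * x."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

def irls_huber(x, y, c=1.345, max_iter=100, tol=1e-10):
    """Iteratively reweighted least squares with Huber weights."""
    w = [1.0] * len(x)                      # start from plain OLS
    b0, b1 = weighted_line_fit(x, y, w)
    for _ in range(max_iter):
        resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
        # Robust scale estimate so the cutoff adapts to the data.
        scale = max(median(abs(r) for r in resid) / 0.6745, 1e-12)
        cutoff = c * scale
        # Poorly fitting observations get weights below 1.
        w = [1.0 if abs(r) <= cutoff else cutoff / abs(r) for r in resid]
        new_b0, new_b1 = weighted_line_fit(x, y, w)
        change = abs(new_b0 - b0) + abs(new_b1 - b1)
        b0, b1 = new_b0, new_b1
        if change < tol:
            break
    return b0, b1, w

# Hypothetical toy data: y = 2 + 3x plus small wobbles, with one outlier
# planted at the upper left (x = 0).
x = [float(i) for i in range(10)]
wobble = [0.1, -0.2, 0.15, -0.1, 0.05, -0.15, 0.2, -0.05, 0.1, -0.1]
y = [2.0 + 3.0 * xi + e for xi, e in zip(x, wobble)]
y[0] = 40.0

ols_b0, ols_b1 = weighted_line_fit(x, y, [1.0] * len(x))
rob_b0, rob_b1, weights = irls_huber(x, y)

print(f"OLS fit:    intercept {ols_b0:.2f}, slope {ols_b1:.2f}")
print(f"Robust fit: intercept {rob_b0:.2f}, slope {rob_b1:.2f}")
print("final weight on the outlier:", round(weights[0], 4))
```

On this toy data the OLS slope is dragged far below the true value of 3 by the single outlier, while the reweighted fit ends up close to it because the outlier's weight has been driven near zero.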

The images here represent a simple data set in which we have a few outliers (in the upper left corner).  The OLS model fit produces the red line, the M-estimation procedure using Tukey’s bisquare penalty produces the blue regression line.   As you can see, the slope of the red line is impacted by the outlying points, but the blue line is not — its fit is in fact based on all points except the outliers.

OLS on the left; Robust Regression on the right

In a later post in this series we will look at another regression-like approach that is robust to outliers (Support Vector Machine regression), however since this technique is also excellent at dealing with non-linearity in the data, I’ll postpone that topic for a bit.  Next we will discuss a common issue in predictive modeling — dealing with high dimensionality.

## Thoughts on regression techniques (part 1)

I just completed this semester’s series of lectures on regression methods in ISE/DSA 5103 Intelligent Data Analytics and I wanted to take a moment to call out a few key points.

First, let me list the primary set of techniques that we covered along with links to the associated methods and package  in R:

While I do not intend to rehash everything we covered in class (e.g., residual diagnostics, leverage, hat-values, performance evaluation, multicollinearity, interpretation, variance inflation, derivations, algorithms, etc.), I wanted to point out a few key things.

## Ordinary Least Squares Regression

OLS multiple linear regression is the workhorse of predictive modeling for continuous response variables.  Not only is it a powerful technique that is commonly used across multiple fields and industries, it is very easy to build and test, the results are easy to interpret, the ability to perform OLS is ubiquitous in statistical software, and it is a computationally efficient procedure.  Furthermore, it serves as the foundation for several other techniques.  Therefore, learning OLS for multivariate situations is a fundamental element in starting predictive modeling.

In order to make any of this make sense, let me introduce some very brief notation.  We will assume that we have $$n$$ observations of data which are comprised of a single real-valued outcome (a.k.a. response or dependent variable or target), which I will denote as $$y$$, and $$p$$ predictors (a.k.a. independent variables or features or inputs) denoted as $$x_1, x_2,\ldots, x_p$$.  These predictors can be continuous or binary.  (Note: nominal variables are perfectly fine, however, to be used in OLS they need to be transformed into one or more “dummy variables” which are each binary.)  For the $$y$$-intercept and for each predictor there is a regression coefficient: $$\beta_0, \beta_1, \ldots, \beta_p$$.  The assumption in OLS is that the true underlying relationship between the response and the input variables is:

$$\text{(Equation 1)} \ \ \ y = \beta_0 + \sum_{j=1}^p \beta_j x_j + \epsilon$$

where $$\epsilon$$ represents a normally distributed error term with mean equal to 0. When the OLS model is fit, the values for $$\beta_0, \beta_1, \ldots, \beta_p$$ are estimated. Let $$\hat{y}$$ denote the estimates for $$y$$ after the model is fit and the values $$\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p$$ denote the estimates for regression coefficients. The objective of OLS is to minimize the sum of the squares of the residuals, where the residuals are defined as $$y_i - \hat{y}_i$$ for all $$i = 1, \ldots, n$$. That is,

$$\text{(Equation 2)} \ \ \ \text{minimize } \sum_{i=1}^n (y_i - \hat{y}_i)^2$$

Let me just take a moment here to say that most of the OLS variants that I’ll summarize in this series of posts are motivated by simple modifications to the objective function in Equation (2)!
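As a brief aside, Equation (2) has a well-known closed-form solution. The matrix notation here is introduced only for this sketch and is not used elsewhere in the post: let $$X$$ be the $$n \times (p+1)$$ matrix whose first column is all ones and whose remaining columns hold the predictors, and let $$\mathbf{y}$$ be the vector of the $$n$$ responses. Then

$$\hat{\boldsymbol{\beta}} = \left(X^\top X\right)^{-1} X^\top \mathbf{y}$$

This requires $$X^\top X$$ to be invertible, which fails when predictors are perfectly collinear or when there are more coefficients than observations. It also explains part of why OLS is computationally efficient: fitting the model amounts to solving one linear system.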

OLS is a linear technique, but feature engineering allows a modeler to introduce non-linear effects. That is, if you believe the relationship between the response and the predictor is $$y = f(x) = x + x^2 + \sin(x) + \epsilon$$, then simply create two new variables (this is called feature construction) with these transformations. Letting $$x_1 = x, x_2 = x^2, \text{ and } x_3 = \sin(x)$$, your estimated OLS model is linear with respect to the new variables. That is,

\begin{align}
\hat{y} & = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3  \\
& = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 x^2 + \hat{\beta}_3 \sin(x)
\end{align}

For example, if you simply fit an OLS model without transformations, i.e., y ~ x, then you get the following predictions: the blue dots represent the output of the model, whereas the black dots represent the actual data.

However, if you transform your variables then you can get a very good fit:
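The same comparison can be reproduced with a small plain-Python sketch. Everything in it is an assumption made for illustration: the data are generated without noise from $$y = x + x^2 + \sin(x)$$ on a grid from 0 to 6, and OLS is fit by solving the normal equations with a hand-written Gaussian elimination (the helper names are hypothetical, and in practice you would use a statistics package instead).

```python
import math

def solve_linear_system(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(M[i][k] * z[k] for k in range(i + 1, n))
        z[i] = (M[i][n] - known) / M[i][i]
    return z

def ols_fit(X, y):
    """OLS coefficients from the normal equations (X'X) beta = X'y."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)

def sum_squared_error(X, y, beta):
    preds = [sum(b * v for b, v in zip(beta, row)) for row in X]
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds))

# Noise-free toy data from the assumed relationship y = x + x^2 + sin(x).
xs = [0.1 * i for i in range(61)]
ys = [x + x ** 2 + math.sin(x) for x in xs]

# Model 1: y ~ x (intercept plus the raw predictor).
X_raw = [[1.0, x] for x in xs]
# Model 2: y ~ x + x^2 + sin(x) (intercept plus constructed features).
X_new = [[1.0, x, x ** 2, math.sin(x)] for x in xs]

beta_raw = ols_fit(X_raw, ys)
beta_new = ols_fit(X_new, ys)
sse_raw = sum_squared_error(X_raw, ys, beta_raw)
sse_new = sum_squared_error(X_new, ys, beta_new)

print("y ~ x                  SSE:", round(sse_raw, 3))
print("y ~ x + x^2 + sin(x)   SSE:", round(sse_new, 10))
print("coefficients with constructed features:", [round(b, 6) for b in beta_new])
```

The model is still linear in its coefficients. All of the non-linearity lives in the columns that were constructed before fitting, which is why ordinary OLS machinery can recover the curved relationship.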

While OLS has various assumptions that ideally should be met in order to proceed with modeling, the predictive performance is insensitive to many of these. For instance, ideally, the input variables should be independent; however, even if there are relatively highly collinear predictors, the predictive ability of OLS is not impacted (the interpretation of the coefficients however is greatly affected!).

However, there are some notable difficulties and problems with OLS.  We will discuss some of these in the next few posts!