Author Archives: Charles Nicholson

Centerville Virtual Community Testbed

Centerville Special Edition in Sustainable and Resilient Infrastructure

Enhancing community resilience in the future will require new interdisciplinary systems-based approaches that depend on many disciplines, including engineering, social and economic, and information sciences. The National Institute of Standards and Technology awarded the Center for Risk-Based Community Resilience Planning to Colorado State University and nine other universities in 2015 (including the University of Oklahoma!), with the overarching goal of establishing the measurement science for community resilience assessment. To this end, several of the researchers within the Center for Risk-Based Community Resilience Planning have come together to develop the Centerville virtual community.

The Centerville Virtual Community Testbed is aimed at enabling fundamental resilience assessment algorithms to be initiated, developed, and coded in a preliminary form, and tested before the refined measurement methods and supporting data classifications and databases necessary for a more complete assessment have fully matured.  Sustainable and Resilient Infrastructure has published a Special Issue introducing the Centerville Testbed, defining the physical infrastructure within the community, natural hazards to which it is exposed, and the population demographics necessary to assess potential post-disaster impacts on the population, local economy, and public services in detail.

The community has multiple residential and commercial zones with several types of buildings at different code levels.  The population of about 50,000 is diverse with respect to employment and income.  There are multiple public schools and government buildings located throughout the city as well as emergency facilities.  There are a few main roads, a simple highway system, some smaller local roads, and a few important bridges within the transportation system.  The Analytics Lab @ OU currently has a research paper in development relating to the study of transportation systems, and we use the Centerville testbed as one of our cases.

In addition to the buildings and transportation system, Centerville also has a simplified electric power network (EPN) with multiple substation types (transmission, main grid, distribution, sub-distribution), a small power plant, and single-pole transmission lines. The community also includes a basic potable water system with pumps, tanks, reservoirs, a water treatment plant, and an underground piping system.  The maps associated with these infrastructure systems can be found in the Special Issue.

By creating such a detailed virtual community, researchers can have a simplified but somewhat realistic platform for experimentation.  The papers included in the Special Issue cover topics such as multi-objective optimization for retrofit strategies, building portfolio fragility functions, performance assessment of EPN to tornadoes, and computable general equilibrium (CGE) assessment of the community with respect to disasters.

All papers included in the Special Issue are listed below.

Bruce R. Ellingwood, John W. van de Lindt & Therese P. McAllister
Bruce R. Ellingwood, Harvey Cutler, Paolo Gardoni, Walter Gillis Peacock, John W. van de Lindt & Naiyu Wang
Roberto Guidotti, Hana Chmielewski, Vipin Unnikrishnan, Paolo Gardoni, Therese McAllister & John van de Lindt

Happy Winter Break 2016!

Winter Break 2016

Here at the Analytics Lab @ OU we would like to wish you all a happy holiday season, a wonderful winter break, and a feliz año nuevo!

Also, we have had some nice developments over the last few days and weeks and I would like to mention these briefly.


Congratulations to Param Tripathi for successfully defending his Master's thesis and completing his MS in ISE!

In his thesis entitled “ANALYSIS OF RESILIENCE IN US STOCK MARKETS DURING NATURAL DISASTERS” Param analyzes how major hurricane events impact the stock market in the United States. In particular, he looks at two fundamental elements of resilience, vulnerability and recoverability, as they apply to the Property and Casualty Insurance sector by evaluating the price volatility with respect to the overall New York Stock Exchange.  He applies breakout detection to study patterns in the time series data and uses this to quantify key resilience metrics.

Congratulations to Alexandra Amidon for successfully completing her Industry practicum on anomaly detection using a compression algorithm.  Well done!

Good luck!

Next, Weili Zhang, a PhD Candidate in ISE, is actively interviewing with a vast number of top companies for data science positions. He has passed first and second round interviews for a number of companies and has been invited for face-to-face interviews with Facebook, Google, Ebay, Verizon, Disney, Lyft, Amazon, Target, and several more.  Getting past round one with these companies is already a major accomplishment, and rounds two and three get progressively more intense!  Good job and good luck Weili!

Yay us!

The Analytics Lab team members have had a few papers published or accepted for publication recently:

  • Almoghathawi, Y., K. Barker, C.M. Rocco, and C. Nicholson. 2017. A multi-criteria decision analysis approach for importance ranking of network components. Reliability Engineering and System Safety, 158: 142-151 LINK [bibTex]
  • Nicholson, C., L. Goodwin, and C. Clark. 2016. Variable neighborhood search for reverse engineering of gene regulatory networks.  Journal of Biomedical Informatics, 65:120-131 LINK [bibTex]
  • Zhang, W., N. Wang, and C. Nicholson. 2016. Resilience-based post-disaster recovery strategies for community road-bridge networks. Accepted in Structure and Infrastructure Engineering.

I’ll write up a post about these exciting new research articles soon!

Data Science Interviews

Weili, along with a few other current and former students, has been describing the data science interview processes and questions to me.  My plan is to write up a blog post summarizing some key things for those of you seeking DSA positions to keep in mind and prepare for.  Fortunately, many of the questions/topics asked about are covered in the ISE 5103 Intelligent Data Analytics class offered in the Fall semester.  If you are starting to look for data science jobs now or in the near future, make sure you check out that post!

Life in Industry

Additionally, I have asked several current/former students (Pete Pelter, Leslie Goodwin, Cyril Beyney, Olivia Peret) who are employed in data science positions to provide some descriptions and information about life in industry.  Look for that series of posts soon!

ISE 5113 Advanced Analytics and Metaheuristics

I am currently preparing for the online launch of the ISE 5113 Advanced Analytics and Metaheuristics course to be offered in Spring 2017.  The course will be offered both online and on-campus.  This is one of my favorite courses to teach.  The introductory video should be posted soon!  The course fills up very fast, so if you are a DSA or ISE student make sure to register ASAP if you haven’t already!


I have had excellent response from so many people this semester that it looks like “yes!” we will have enough people to start a soccer team in the Spring.  I’ll be providing more information next month.  The season starts in March and the official signups are in February ($80 each).  We need a team name and team colors — so I’ll be open for ideas!  In the meantime, get out there and practice!



I hope everyone has an excellent winter break.  Enjoy your family, enjoy good food, stay warm, practice some soccer, catch up on sleep, maybe study a little bit… and I’ll see you next year!


Thoughts on regression techniques (part iii)

In the first of this series of posts on regression techniques we introduced the work-horse of predictive modeling, ordinary least squares regression (OLS), but concluded with the notion that while a common and useful technique, OLS has some notable weaknesses.  In the last post we discussed its sensitivity to outliers and how to deal with that using multiple “robust regression” techniques.  Now we will discuss regression modeling in high dimensional data.  Two of the techniques are based on an idea called “feature extraction” and the last three are based on “penalized regression”.

Dimensionality, feature extraction, selection, and penalized regression

An issue that every modeler who deals with high-dimensional data must face is feature selection.  By high dimensionality I mean that p, the number of predictors, is large. High dimensional data can cause all sorts of issues, some of them psychological. For instance, it can be very hard to wrap your mind around so many predictors at one time; it can be difficult to figure out where to start, what to analyze, what to explore, and even visualization can be a beast!  I tried to find some pictures to represent high-dimensional data for this blog post, but of course, by definition 2D or 3D representations of high-D data are hard to come by.  This parallel plot at least adds some color to the post and represents at least some of the complexity!

When p is large with respect to the number of observations n, then the probability of overfitting is high. Now, let me quickly interject that p may be much greater than the number of raw input variables in the data.  That is, a modeler may have decided on constructing several new features from among the existing data (e.g., ratios between two or more variables, non-linear transformations, and multi-way interactions). If there is a strong theoretical reason for including a set of variables, then great, build the model and evaluate the results. However, oftentimes, a modeler is looking for a good fit and doesn’t know which of the features should be created, and then which combinations of such features should be included in the model. Feature construction deals with the first question (and we will look at that later on), feature selection deals with the second question (which we deal with now).

Actually, first I will digress slightly: in the case when p is equal to or greater than n, OLS will fail miserably, because the least squares problem no longer has a unique solution (infinitely many coefficient vectors fit the training data perfectly).  In this situation, feature selection is not only a good idea, it is necessary!  One example of data in which p>n occurs is microarray data. If you are not familiar with what a microarray is, see the inset quote from the chapter “Introduction to microarray data analysis” by M. Madan Babu in Computational Genomics.  Essentially, microarray data is collected for the simultaneous analysis of thousands of genes to help discover gene functions, examine the gene regulatory network, identify drug targets, and provide insight into diseases.  The data usually has n on the order of 10 to 100, whereas p might be on the order of 100’s to 1000’s!

A microarray is typically a glass slide on to which DNA molecules are fixed in an orderly manner at specific locations called spots. A microarray may contain thousands of spots and each spot may contain a few million copies of identical DNA molecules that uniquely correspond to a gene. The DNA in a spot may either be genomic DNA or short stretch of oligo-nucleotide strands that correspond to a gene. Microarrays may be used to measure gene expression in many ways, but one of the most popular applications is to compare expression of a set of genes from a cell maintained in a particular condition (condition A) to the same set of genes from a reference cell maintained under normal conditions (condition B). – M. Madan Babu


There are two closely related approaches that can handle such scenarios with ease.  These are Principal Component Regression (PCR) and Partial Least Squares regression (PLS).  Both of these techniques allow you to deal with data when p>n.  Both PCR and PLS essentially represent the data in lower dimensions (either in what is known as “principal components” for the former or just “components” in the latter).  Either way these components are each formed from linear combinations of all of the predictors.  Since PCR and PLS are essentially automatically creating new features from the given data, this is a form of what is known as “feature extraction”.  That is, the algorithm extracts new predictors from the data for use in modeling.

If you choose k, the number of components (or extracted features), to be less than p, then you have reduced the effective representation of your data.  With this of course also comes information loss.  You can’t just get rid of dimensions without losing information.  However, PCR and PLS both try to shove as much “information” into the first component as possible; subsequent components will contain less and less information.  If you chop off only a few of the last components, then you will not experience much information loss.  If your data contains highly correlated variables or subsets of variables, then you can possibly remove many dimensions with very little loss of information.

Face database for facial recognition algorithms


Image data, for instance, is another example of high-dimensional data which can usually be reduced to a much lower number of dimensions using something like Principal Component Analysis (PCA).  Facial recognition algorithms leverage this fact extensively!  Looking at the image at right, it is easy to see many commonalities among the images of faces. The information to discern one face from another is not within the similarities, but of course in the differences.  Imagine, if you can, removing the “similar elements” of all the faces. This would remove a considerable amount of the data dimensionality, and the remaining “differences” are where the true discriminatory information is found.

PCR and PLS essentially do this — allow you to throw away the non-informative dimensions of data and perform regression modeling on only the informative bits.

I’ll leave the details about the difference between PCR and PLS alone for now, except to say that PCR is based on an unsupervised technique (PCA), whereas PLS is inherently a supervised learning technique through-and-through.  Both define components (linear combinations of the original data) and then allow you to perform OLS on a subset of components instead of the original data.
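To make the PCR idea concrete, here is a minimal NumPy sketch. It is only an illustration under my own assumptions: the data is synthetic, the function name `pcr_fit` is hypothetical, and the components are taken from a singular value decomposition of the centered predictors. It shows that a regression can be fit even when p is much larger than n, as long as only k components are used.

```python
import numpy as np

def pcr_fit(X, y, k):
    """Principal component regression: OLS on the first k principal components."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean                              # center the predictors
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V_k = Vt[:k].T                               # p x k loadings of the first k components
    Z = Xc @ V_k                                 # n x k component scores (extracted features)
    gamma, *_ = np.linalg.lstsq(Z, y - y_mean, rcond=None)  # OLS on the components
    beta = V_k @ gamma                           # map back to the original p predictors
    intercept = y_mean - x_mean @ beta
    return intercept, beta

# Synthetic example with p > n: 200 predictors driven by 3 hidden factors.
rng = np.random.default_rng(2)
n, p, k = 30, 200, 3
latent = rng.normal(size=(n, k))
X = latent @ rng.normal(size=(k, p)) + 0.01 * rng.normal(size=(n, p))
y = latent @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.normal(size=n)

intercept, beta = pcr_fit(X, y, k)
y_hat = intercept + X @ beta
r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"in-sample R^2 with {k} components: {r2:.3f}")
```

Because the 200 predictors here are highly correlated by construction, three components carry almost all of the information, which is the situation described above.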

Feature Selection

One possibility for sub-selecting predictors for an OLS model is to use stepwise regression.  Stepwise regression (which can be forward, backward, or bi-directional) is a greedy technique in which potential predictors are added to (or removed from) a candidate OLS model one at a time: at each step, the impact of adding (or removing) each variable individually is evaluated, and the change that improves performance the most is kept. Maybe “performance” is the wrong word here, but let me use it for now. One traditional technique is to add (or remove) variables based on their associated statistical p-values, e.g., remove a variable if its p \ge 0.5.  I should note that while this is commonly employed (e.g., it is the default stepwise method used in SAS), there is some reasonable controversy with this approach.  It is sometimes called p-fishing, as in, “there might not be anything of value in the data, but I am going to fish around until I find something anyway.”  This is not a flattering term.  If you do choose to perform stepwise regression, a less controversial approach would be to use either AIC or BIC scores as the model performance metric at each step.
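As a rough illustration of the forward variant with AIC as the metric, here is a small NumPy sketch. The function names and the synthetic data are my own assumptions, and the AIC is the usual Gaussian form up to an additive constant, n log(SSE/n) + 2k, where k counts the fitted coefficients.

```python
import numpy as np

def aic_ols(A, y):
    """Gaussian AIC (up to a constant) of an OLS fit with design matrix A."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    sse = float(np.sum((y - A @ beta) ** 2))
    return n * np.log(sse / n) + 2 * A.shape[1]

def forward_stepwise(X, y):
    """Greedily add the predictor that lowers AIC the most; stop when none does."""
    n, p = X.shape
    ones = np.ones((n, 1))
    selected, remaining = [], list(range(p))
    best_aic = aic_ols(ones, y)                  # start from the intercept-only model
    while remaining:
        scores = [(aic_ols(np.hstack([ones, X[:, selected + [j]]]), y), j)
                  for j in remaining]
        aic, j = min(scores)                     # best single addition at this step
        if aic >= best_aic:                      # no candidate improves the model
            break
        best_aic = aic
        selected.append(j)
        remaining.remove(j)
    return selected, best_aic

# Synthetic data: only columns 0, 3, and 7 actually matter.
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 1.5 * X[:, 7] + rng.normal(size=200)

selected, best_aic = forward_stepwise(X, y)
print("selected columns:", selected)
```

Being greedy, the procedure may also pick up an irrelevant column now and then; AIC only makes that less likely, it does not rule it out.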

Lasso: a penalized regression approach

However, there are variations of OLS (which again are based on modifications to the objective function in Equation (2) from the first post in the series) that result in an automatic selection of a subset of the candidate predictors.

Here is Equation (2) again:

$$\text{(Equation 2 again)} \ \ \ \text{minimize } \sum_{i=1}^n (y_i - \hat{y}_i)^2 $$

The first technique that we will discuss is called the least absolute shrinkage and selection operator (lasso).  Lasso adds a penalty function to Equation 2 based on the magnitude of the regression coefficients as shown in Equation 3.

$$\text{(Equation 3)} \ \ \ \text{minimize } \sum_{i=1}^n (y_i - \hat{y}_i)^2  + \lambda \sum_{j=0}^{p} |\beta_j |$$

The penalty is the sum of the absolute values of all the regression coefficients scaled by some value \lambda > 0.  The larger the value of \lambda, the larger the potential penalty.  As \lambda increases, it makes sense with respect to Equation (3) to set more and more regression coefficients to 0.  This effectively removes them from the regression model.  Since this removal is based on balancing the sum of squared residuals with the penalty, the predictors which are not as important to the first part of the lasso objective are the ones that are eliminated.  Voilà, feature selection!
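To see coefficients actually being set to 0, here is a minimal coordinate descent sketch for the lasso objective in NumPy. Several things are my own assumptions: the data is synthetic, the function names are hypothetical, and the intercept is handled by centering and left unpenalized, which is the common practice. For the objective written as the sum of squared residuals plus \lambda times the sum of absolute coefficients, each coordinate update is a soft-threshold at \lambda/2.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward 0 by t, and set it to exactly 0 if |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=500):
    """Coordinate descent for: sum of squared residuals + lam * sum |beta_j|."""
    Xc = X - X.mean(axis=0)            # centering removes the (unpenalized) intercept
    yc = y - y.mean()
    p = Xc.shape[1]
    beta = np.zeros(p)
    col_sq = np.sum(Xc ** 2, axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # residual with predictor j's current contribution added back in
            partial = yc - Xc @ beta + Xc[:, j] * beta[j]
            beta[j] = soft_threshold(Xc[:, j] @ partial, lam / 2.0) / col_sq[j]
    return beta

# Synthetic data: only predictors 0 and 3 have non-zero true coefficients.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 8))
y = 1.0 + 3.0 * X[:, 0] + 1.5 * X[:, 3] + rng.normal(size=100)

beta_none = lasso_cd(X, y, 0.0)        # lam = 0 reduces to OLS
beta_mid = lasso_cd(X, y, 50.0)        # moderate penalty
beta_huge = lasso_cd(X, y, 1e4)        # very large penalty
for lam, b in [(0.0, beta_none), (50.0, beta_mid), (1e4, beta_huge)]:
    print(f"lambda = {lam:>8}: {np.count_nonzero(b)} non-zero coefficients")
```

The count of non-zero coefficients drops as \lambda grows, and the predictors that survive longest are the ones that matter most to the squared-residual part of the objective.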

You might ask, why place a penalty on the magnitude of the beta values? Doesn’t that artificially impact the true fit and interpretation of your model?  Well, this is a good question. The lasso definitely “shrinks” the regression coefficients; however, this does not necessarily mean that the shrinkage is a departure from the true model.  If the predictors in the regression model are correlated (i.e., some form of multi-collinearity exists), then the magnitudes of the regression coefficients will be artificially inflated (and the “meaning” of the beta values may be totally lost).  The shrinkage operator in lasso (and other techniques, e.g. ridge regression) tackles this directly.  It is possible that lasso puts too much downward pressure on the coefficient magnitudes, but it is not necessarily true.

Ridge regression: another penalized regression approach

I need to confess that ridge regression is not a method with automatic feature selection.  However, since it is so closely related to lasso, I decided to throw it in really quickly so I don’t have to write another blog post just for this little guy.  Here’s the equation, see if you can spot the difference from Equation (3)!

$$\text{(Equation 4)} \ \ \ \text{minimize } \sum_{i=1}^n (y_i - \hat{y}_i)^2  + \lambda \sum_{j=0}^{p} \beta_j^2 $$

That’s right: the only difference between Equation (3) and Equation (4) is that the penalty is based on absolute values in one and on the squares in the other. The idea is the same, except for one very interesting difference: the regression coefficients in ridge regression are never forced to 0.  They get smaller and smaller, but unlike lasso, no features are eliminated.  Ridge regression does, however, often produce better predictions than OLS.  For both lasso and ridge regression the value of \lambda is determined by using cross-validation methods to tune the parameter to the best value for predictions.  If \lambda = 0, then both of these methods give exactly the same result as OLS.  If \lambda > 0 (as determined by the so-called hyper-parameter tuning), then the penalized regression technique has been found to predict better than OLS on the held-out data.
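Ridge regression has a closed-form solution, so the shrinking-but-never-zero behavior is easy to check numerically. The sketch below is an illustration under my own assumptions: synthetic data with two nearly collinear predictors, a hypothetical `ridge_fit` helper, and an intercept that is removed by centering rather than penalized.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge: minimize sum of squared residuals + lam * sum beta_j^2."""
    Xc = X - X.mean(axis=0)            # centering removes the (unpenalized) intercept
    yc = y - y.mean()
    p = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

# Synthetic data: x2 is almost a copy of x1, so OLS coefficients can be inflated.
rng = np.random.default_rng(6)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 1.0 + 2.0 * x1 + 1.0 * x3 + 0.5 * rng.normal(size=n)

beta_0 = ridge_fit(X, y, 0.0)          # lam = 0 is ordinary least squares
beta_1 = ridge_fit(X, y, 1.0)
beta_100 = ridge_fit(X, y, 100.0)

# OLS via lstsq on the same centered data, for comparison with lam = 0
beta_ols, *_ = np.linalg.lstsq(X - X.mean(axis=0), y - y.mean(), rcond=None)

for lam, b in [(0.0, beta_0), (1.0, beta_1), (100.0, beta_100)]:
    print(f"lambda = {lam:>5}: coefficient norm = {np.linalg.norm(b):.3f}")
```

The coefficient norm falls as \lambda grows, yet every coefficient stays non-zero, in contrast to the lasso.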

Elastic net regularization: yet, again, a penalized regression approach

The other reason that I wanted to introduce ridge regression is that it is a great segue into my favorite of the penalized techniques, elastic net regularization or just elastic net for short.  The elastic net approach combines both penalties from lasso and ridge regression in an attempt to get at the best of both worlds: the feature selection element of lasso and the predictive performance of ridge regression.

$$\text{(Equation 5)} \ \ \ \text{minimize } \sum_{i=1}^n (y_i - \hat{y}_i)^2  + \lambda_1 \sum_{j=0}^{p} |\beta_j | + \lambda_2 \sum_{j=0}^{p} \beta_j^2$$

Oftentimes the relationship between \lambda_1 \ge 0 and \lambda_2 \ge 0 is one such that \lambda_1 + \lambda_2 = 1.  In this case if \lambda_1 = 1, then the elastic net gives you the same result as lasso, and if \lambda_2 = 1, then the result is equivalent to ridge regression.  However, many times the result from hyper-parameter tuning is that \lambda_1 < 1 and \lambda_2 < 1, implying that yes! some hybridization of the lasso and ridge regression approaches produces the best cross-validated results.

In the case when we require \lambda_1 + \lambda_2 = 1, Equation (5) can be rewritten as follows to simplify to only one parameter:

$$\text{(Modified equation 5)} \ \ \ \text{minimize } \sum_{i=1}^n (y_i - \hat{y}_i)^2  + \lambda \sum_{j=0}^{p} |\beta_j | + (1-\lambda) \sum_{j=0}^{p} \beta_j^2$$
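The elastic net objective in Equation (5) can be minimized with the same kind of coordinate-wise update used for the lasso. The sketch below is a minimal illustration, assuming synthetic data, a hypothetical function name, and an unpenalized intercept handled by centering. With the lasso weight set to 0 it should reproduce the ridge closed-form solution, which gives a convenient sanity check.

```python
import numpy as np

def elastic_net_cd(X, y, lam1, lam2, n_iter=300):
    """Coordinate descent for: SSE + lam1 * sum |beta_j| + lam2 * sum beta_j^2."""
    Xc = X - X.mean(axis=0)            # centering removes the (unpenalized) intercept
    yc = y - y.mean()
    p = Xc.shape[1]
    beta = np.zeros(p)
    col_sq = np.sum(Xc ** 2, axis=0)
    for _ in range(n_iter):
        for j in range(p):
            partial = yc - Xc @ beta + Xc[:, j] * beta[j]
            z = Xc[:, j] @ partial
            # lasso part: soft-threshold at lam1 / 2; ridge part: extra lam2 in the denominator
            shrunk = np.sign(z) * max(abs(z) - lam1 / 2.0, 0.0)
            beta[j] = shrunk / (col_sq[j] + lam2)
    return beta

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 6))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(size=100)

# Sanity check: lam1 = 0 should agree with the closed-form ridge solution.
Xc, yc = X - X.mean(axis=0), y - y.mean()
lam2 = 5.0
ridge_closed_form = np.linalg.solve(Xc.T @ Xc + lam2 * np.eye(6), Xc.T @ yc)
beta_ridge_like = elastic_net_cd(X, y, 0.0, lam2)

beta_mixed = elastic_net_cd(X, y, 20.0, 5.0)     # some of each penalty
beta_all_zero = elastic_net_cd(X, y, 1e5, 5.0)   # lasso part dominates
print("mixed penalties:", np.round(beta_mixed, 3))
```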

I have mentioned “hyper-parameter tuning” a couple of times already.  Without going into the details of the cross-validation, let me simply say that all hyper-parameter tuning means is that you try out a whole bunch of values for your parameter (e.g., \lambda) until you find the values that work best.  Take a look at the lasso and elastic net paths figure.  In this figure the values of the coefficients are on the y-axis (each color of a line represents a different predictor) and the value of the penalty is represented on the x-axis (actually here the log of the penalty is represented).  As the penalty value decreases (moving right along the x-axis), the magnitudes of the coefficients increase for both lasso (solid lines) and elastic net (dashed lines).  So you can see that as you “tune” the value of \lambda, an infinite number of models are possible!  When the value becomes large enough (starting at the right and then moving to the left along the x-axis), some of the coefficients are forced to 0 by lasso and by elastic net.  The lasso seems to do this quicker than elastic net, as demonstrated by the solid blue and dashed blue lines: with a very small increase in \lambda (at about \log \lambda \sim 90) the coefficient is set to 0 by lasso, whereas the penalty has to increase such that \log \lambda \sim 55 for elastic net before the same variable’s regression coefficient is set to 0.
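Here is what “try out a whole bunch of values” can look like in practice: a small k-fold cross-validation loop over a grid of penalty values. To keep it short I use ridge regression, since it has a closed-form fit. The grid, the fold count, the helper names, and the synthetic data are all my own assumptions for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge fit; returns (intercept, coefficients)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return y_mean - x_mean @ beta, beta

def cv_mse(X, y, lam, k=5, seed=0):
    """Average held-out mean squared error over k folds for one penalty value."""
    idx = np.random.default_rng(seed).permutation(len(y))
    errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)          # everything not in the held-out fold
        b0, b = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((y[fold] - (b0 + X[fold] @ b)) ** 2))
    return float(np.mean(errors))

# Synthetic data with p close to n, where an unpenalized fit tends to overfit.
rng = np.random.default_rng(8)
n, p = 40, 30
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = [2.0, -1.5, 1.0, 0.5, -0.5]
y = X @ beta_true + rng.normal(size=n)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
cv_scores = {lam: cv_mse(X, y, lam) for lam in lambdas}
best_lam = min(cv_scores, key=cv_scores.get)
for lam in lambdas:
    print(f"lambda = {lam:>7}: CV MSE = {cv_scores[lam]:.3f}")
print("best lambda on this grid:", best_lam)
```

The same loop applies to lasso or elastic net by swapping the fitting function; for elastic net the grid simply becomes two-dimensional.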


Now there are other regression techniques that also include automatic feature selection. For example, multivariate adaptive regression splines (MARS) essentially use a forward step-wise procedure to add terms to the regression model, and regression trees choose features one at a time to add to a model to produce prediction estimates. (These two seemingly different techniques actually have quite a lot in common!)  However, I will introduce the first of these as a technique for both feature selection and feature construction.  Our next post deals more generally with how our regression approach can deal with non-linearities in our model assumption.


Thoughts on regression techniques (part deux)

We left off discussion in the last post on regression techniques with the statement that there are some known issues relating to Ordinary Least Squares (OLS) regression techniques.  My goal in this series of posts is to introduce several variations on OLS that address one or more of these drawbacks.  The first issue I will mention is related to making regression more robust in the presence of outliers.  This is accomplished through what is known, straightforwardly, as robust regression.

Outliers and robust regression

OLS regression is sensitive to outliers. To see this look at the objective function again (listed as Equation 2 in the first post of the series):

$$\text{OLS objective function:} \ \ \ \text{minimize } \sum_{i=1}^n (y_i - \hat{y}_i)^2 $$

The procedure is highly incentivized to minimize the residuals. That is, the square term implies a very large penalty if the predictions are wrong. This sounds fine, but imagine there exists a single point in your data set that simply does not fit the true model well, i.e., the value for y does not follow the linear assumption in Equation 1 (from the previous post) at all. The OLS fit will be greatly affected by such an observation.

Let’s put some numbers to this to make a quick example. Assume that without this outlying point, the residuals associated with your n=100 data points are all relatively small, for instance, -0.1 <= y_i - \hat{y}_i <= 0.1. The sum of the squared residuals is at most equal to 1. Now, pick one of the 100 points and let’s say that the associated residual is 10. That single point contributes 10^2 = 100 to the objective, so the sum of squared errors would jump to as much as 100.99, roughly a hundredfold increase! To mitigate this, the OLS fit would not recover the true model, but would adjust the fit so as to keep the sum of the squared residuals to something more reasonable. It decreases the residual of this one outlying point by allowing larger residuals for the remaining 99 data points. So now, instead of a good fit for 99% of the data with only one point that doesn’t predict well, we have a model that is a poor fit for 100% of the data!
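The arithmetic in this example is quick to verify. The snippet below simply takes the worst case, where every one of the 100 residuals sits at the 0.1 bound, and then replaces one of them with a residual of 10.

```python
import numpy as np

# Worst case without an outlier: all 100 residuals at the 0.1 bound.
residuals = np.full(100, 0.1)
sse_clean = float(np.sum(residuals ** 2))            # 100 * 0.01, about 1

# Replace a single residual with 10 and recompute.
residuals_outlier = residuals.copy()
residuals_outlier[0] = 10.0
sse_outlier = float(np.sum(residuals_outlier ** 2))  # 10^2 + 99 * 0.01, about 100.99

# Share of the objective contributed by the single outlying point.
outlier_share = 10.0 ** 2 / sse_outlier
print(f"SSE without outlier: {sse_clean:.2f}")
print(f"SSE with outlier:    {sse_outlier:.2f}")
print(f"share due to the one outlier: {outlier_share:.1%}")
```

One point out of a hundred ends up accounting for about 99% of the objective, which is why the fit bends toward it.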

To mitigate this sensitivity to outliers, we can use robust regression. The key is to change the objective function so that the penalty for large residuals is not so dramatic. Let \rho(r_i) represent the penalty function (as a function of the residual for observation i).  Three such common penalty functions are:

$$\begin{aligned}
\text{Least absolute value (LAV):} \ \ \rho_\text{lav}(r_i) &= |r_i| \\
\text{Huber:} \ \ \rho_\text{huber}(r_i) &= \begin{cases} \frac{1}{2} r_i^2 & \text{ if } |r_i| \le c \\ c|r_i| - \frac{1}{2}c^2 & \text{ if } |r_i| > c \end{cases} \\
\text{Bisquare:} \ \ \rho_\text{bisquare}(r_i) &= \begin{cases} \frac{c^2}{6} \left( 1 - \left( 1 - \left(\frac{r_i}{c}\right)^2 \right)^3 \right) & \text{ if } |r_i| \le c \\ \frac{c^2}{6} & \text{ if } |r_i| > c \end{cases}
\end{aligned}$$

Here c > 0 is a tuning constant that sets where the penalty changes form, and in each case the objective is to minimize \sum_{i=1}^n \rho(r_i).
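For a feel of how differently these penalties treat a large residual, here is a small NumPy sketch using the standard forms of the three functions. The default tuning constants (1.345 for Huber and 4.685 for bisquare) are commonly used values for standardized residuals and are my own choice here, not something fixed by the discussion above.

```python
import numpy as np

def rho_lav(r):
    """Least absolute value penalty: grows linearly in |r|."""
    return np.abs(r)

def rho_huber(r, c=1.345):
    """Huber penalty: quadratic for |r| <= c, linear beyond c."""
    a = np.abs(r)
    return np.where(a <= c, 0.5 * a ** 2, c * a - 0.5 * c ** 2)

def rho_bisquare(r, c=4.685):
    """Tukey bisquare penalty: levels off at c^2 / 6 once |r| > c."""
    a = np.abs(r)
    inside = (c ** 2 / 6.0) * (1.0 - (1.0 - (a / c) ** 2) ** 3)
    return np.where(a <= c, inside, c ** 2 / 6.0)

r = np.array([0.1, 1.0, 10.0])
print("residual  squared   LAV   Huber  bisquare")
for ri, sq, lav, hub, bis in zip(r, r ** 2, rho_lav(r), rho_huber(r), rho_bisquare(r)):
    print(f"{ri:8.1f} {sq:8.2f} {lav:6.2f} {hub:7.3f} {bis:9.3f}")
```

A residual of 10 costs 100 under the squared penalty, but far less under any of the three robust alternatives, and the bisquare penalty stops growing altogether.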

While I won’t discuss the solution techniques for solving the associated minimization problems here in detail, I will mention that these changes directly impact the computational efficiency of the regression. The LAV problem can be solved using linear programming techniques. The other two approaches (Huber and Bisquare) are actually part of a family of such objective function modifications known as M-estimation.

Pvt. Joe Bowers: What are these electrolytes? Do you even know?

Secretary of State: They’re… what they use to make Brawndo!

Pvt. Joe Bowers: But why do they use them to make Brawndo?

Secretary of Defense: Because Brawndo’s got electrolytes.

-Circular reasoning from the movie Idiocracy

M-estimation techniques are a class of robust regression methods which change the objective function to be less sensitive to outliers (there are other objectives besides Huber and Bisquare).  M-estimation problems usually determine the residual penalty value based on the fit, which is based on the beta values, which in turn are based on the residual values… but wait, we don’t know the residuals until we have the beta values, which are based on the residuals, which are based on the beta values!  Oh no, it’s circular reasoning!

This paradox is resolved by performing iteratively reweighted least squares (IRLS) regression.  It turns out that the effect of a residual-based penalty is equivalent to allowing different weights for each observation (based on the residual value of that observation).  To address the circular logic, IRLS solves the regression problem multiple times, with the observations weighted differently each time.  The weights for all observations start equal to 1 (the same as regular old OLS), and then the observations which do not fit very well are assigned lower weights, reducing their impact on the regression fit.  The observations that fit well will have higher weights.  After the weights are assigned, the process is repeated, and the weights are adjusted again, and again, and again, until convergence.  This ultimately reduces the impact of outlying observations.
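The loop just described fits in a few lines of NumPy. This is only a sketch under my own assumptions: it uses Huber weights (1 for small standardized residuals, c/|r| for large ones), rescales the residuals with the median absolute deviation at every pass, and runs on synthetic data with three outliers in the upper left. The function names are hypothetical.

```python
import numpy as np

def huber_weights(r, c=1.345):
    """Weight 1 for |r| <= c, and c/|r| (less than 1) for larger residuals."""
    a = np.abs(r)
    return np.where(a <= c, 1.0, c / np.maximum(a, 1e-12))

def irls(X, y, c=1.345, max_iter=50, tol=1e-8):
    """Iteratively reweighted least squares with Huber weights."""
    A = np.column_stack([np.ones(len(X)), X])
    w = np.ones(len(y))                       # all weights start at 1: plain OLS
    beta = np.zeros(A.shape[1])
    for _ in range(max_iter):
        sw = np.sqrt(w)
        # weighted least squares: scale rows by sqrt(weight), then solve as usual
        beta_new, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
        r = y - A @ beta_new
        scale = np.median(np.abs(r - np.median(r))) / 0.6745   # robust (MAD) scale
        w = huber_weights(r / max(scale, 1e-12), c)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta, w

# Synthetic line y = 2 + 0.5 x with three outliers in the upper left corner.
rng = np.random.default_rng(4)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=0.2, size=x.size)
y[:3] += 15.0

beta_ols, *_ = np.linalg.lstsq(np.column_stack([np.ones(x.size), x]), y, rcond=None)
beta_irls, w = irls(x[:, None], y)
print(f"OLS slope:  {beta_ols[1]:.3f}")
print(f"IRLS slope: {beta_irls[1]:.3f} (true slope 0.5)")
print("weights of the three outliers:", np.round(w[:3], 3))
```

Swapping in bisquare weights changes only the weight function; the loop itself stays the same.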

The images here represent a simple data set in which we have a few outliers (in the upper left corner).  The OLS model fit produces the red line, the M-estimation procedure using Tukey’s bisquare penalty produces the blue regression line.   As you can see, the slope of the red line is impacted by the outlying points, but the blue line is not — its fit is in fact based on all points except the outliers.


OLS on the left; Robust Regression on the right

In a later post in this series we will look at another regression-like approach that is robust to outliers (Support Vector Machine regression), however since this technique is also excellent at dealing with non-linearity in the data, I’ll postpone that topic for a bit.  Next we will discuss a common issue in predictive modeling — dealing with high dimensionality.

Thoughts on regression techniques (part 1)

I just completed this semester’s series of lectures on regression methods in ISE/DSA 5103 Intelligent Data Analytics and I wanted to take a moment to call out a few key points.

First, let me list the primary set of techniques that we covered along with links to the associated methods and packages in R:

While I do not intend to rehash everything we covered in class (e.g., residual diagnostics, leverage, hat-values, performance evaluation, multicollinearity, interpretation, variance inflation, derivations, algorithms, etc.), I wanted to point out a few key things.

Ordinary Least Squares Regression

OLS multiple linear regression is the workhorse of predictive modeling for continuous response variables.  Not only is it a powerful technique that is commonly used across multiple fields and industries, it is very easy to build and test, the results are easy to interpret, the ability to perform OLS is ubiquitous in statistical software, and it is a computationally efficient procedure.  Furthermore, it serves as the foundation for several other techniques.  Therefore, learning OLS for multivariate situations is a fundamental element in starting predictive modeling.

In order to make any of this make sense, let me introduce some very brief notation.  We will assume that we have n observations of data which are comprised of a single real-valued outcome (a.k.a. response or dependent variable or target), which I will denote as y, and p predictors (a.k.a. independent variables or features or inputs) denoted as x_1, x_2,\ldots, x_p.  These predictors can be continuous or binary.  (Note: nominal variables are perfectly fine, however, to be used in OLS they need to be transformed into one or more “dummy variables” which are each binary.)  For the y-intercept and for each predictor there is a regression coefficient: \beta_0, \beta_1, \ldots, \beta_p .  The assumption in OLS is that the true underlying relationship between the response and the input variables is:

$$\text{(Equation 1)} \ \ \ y = \beta_0 + \sum_{j=1}^p \beta_j x_j + \epsilon $$

where \epsilon represents a normally distributed error term with mean equal to 0. When the OLS model is fit, the values for \beta_0, \beta_1, \ldots, \beta_p are estimated. Let \hat{y} denote the estimates for y after the model is fit and the values \hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p denote the estimates for regression coefficients. The objective of OLS is to minimize the sum of the squares of the residuals, where the residuals are defined as y_i - \hat{y}_i for all i = 1 \ldots n. That is,

$$\text{(Equation 2)} \ \ \ \text{minimize } \sum_{i=1}^n (y_i - \hat{y}_i)^2 $$

Let me just take a moment here to say that most of the OLS variants that I’ll summarize in this series of posts are motivated by simple modifications to the objective function in Equation (2)!
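For readers who like to see the objective in code, here is a minimal NumPy sketch of fitting OLS by least squares and evaluating the sum of squared residuals. The data is synthetic and the coefficient values are arbitrary choices of mine for illustration.

```python
import numpy as np

def fit_ols(X, y):
    """Return [beta_0, beta_1, ..., beta_p] minimizing the sum of squared residuals."""
    A = np.column_stack([np.ones(len(X)), X])    # column of ones for the intercept
    beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta_hat

# Synthetic data generated from the linear model with normal errors.
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 2.0, -1.0, 0.5])      # intercept first (illustrative values)
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.1, size=n)

beta_hat = fit_ols(X, y)
residuals = y - (beta_hat[0] + X @ beta_hat[1:])
sse = float(np.sum(residuals ** 2))              # the quantity OLS minimizes
print("estimated coefficients:", np.round(beta_hat, 3))
print(f"sum of squared residuals: {sse:.3f}")
```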

OLS is a linear technique, but feature engineering allows a modeler to introduce non-linear effects. That is, if you believe the relationship between the response and the predictor is: $$y = f(x) + \epsilon = x + x^2 + \sin(x) + \epsilon$$ then simply create two new variables (this is called feature construction) with these transformations. Let x_1 = x, x_2 = x^2, \text{ and } x_3 = \sin(x). Your estimated OLS model is then linear with respect to the new variables. That is,

$$\begin{aligned}
\hat{y} & = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3  \\
& = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 x^2 + \hat{\beta}_3 \sin(x)
\end{aligned}$$
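A quick numerical check of the feature construction idea, as a sketch on synthetic data of my own making: the response is generated as x + x^2 + sin(x) plus noise, and OLS is fit once on x alone and once on the constructed features.

```python
import numpy as np

def sse_of_fit(A, y):
    """Fit OLS with design matrix A; return (sum of squared residuals, coefficients)."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ beta) ** 2)), beta

rng = np.random.default_rng(1)
x = rng.uniform(-3.0, 3.0, size=300)
y = x + x ** 2 + np.sin(x) + rng.normal(scale=0.2, size=x.size)

ones = np.ones_like(x)
# Model 1: y ~ x, no transformations
sse_linear, _ = sse_of_fit(np.column_stack([ones, x]), y)
# Model 2: constructed features x_1 = x, x_2 = x^2, x_3 = sin(x)
sse_constructed, beta_c = sse_of_fit(np.column_stack([ones, x, x ** 2, np.sin(x)]), y)

print(f"SSE using x only:               {sse_linear:.1f}")
print(f"SSE using constructed features: {sse_constructed:.1f}")
print("coefficients on x, x^2, sin(x):", np.round(beta_c[1:], 2))
```

The model is still linear in its coefficients, so nothing about the fitting procedure changes; only the columns of the design matrix do.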

For example, if you simply fit an OLS model without transformations, that is y ~ x, then you get the following predictions, where the blue dots represent the output of the model and the black dots represent the actual data.

However, if you transform your variables then you can get a very good fit:
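
The contrast between the two fits can be sketched numerically. The code below is an illustration under stated assumptions, not the code behind the original plots: the data are simulated from y = x + x^2 + sin(x) plus Gaussian noise with a fixed seed, and `solve`, `fit_ols`, and `rss` are hypothetical helpers that solve the normal equations in plain Python. It fits the untransformed model and the model with the constructed features, then compares their residual sums of squares.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - s) / m[r][r]
    return z


def fit_ols(rows, y):
    """OLS via the normal equations; each row already contains a leading 1."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


def rss(rows, y, beta):
    """Sum of squared residuals for fitted coefficients beta."""
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))


# Simulated data (an assumption for illustration): x on a grid from -3 to 3.
random.seed(0)
xs = [k / 10 for k in range(-30, 31)]
ys = [x + x ** 2 + math.sin(x) + random.gauss(0, 0.1) for x in xs]

# Model 1: y ~ x (no transformations).
linear_rows = [[1.0, x] for x in xs]
# Model 2: y ~ x + x^2 + sin(x) (constructed features).
full_rows = [[1.0, x, x ** 2, math.sin(x)] for x in xs]

beta_linear = fit_ols(linear_rows, ys)
beta_full = fit_ols(full_rows, ys)
rss_linear = rss(linear_rows, ys, beta_linear)
rss_full = rss(full_rows, ys, beta_full)

print("RSS, y ~ x:", rss_linear)
print("RSS, with constructed features:", rss_full)
print("coefficients with constructed features:", beta_full)
```

Both models are linear in their coefficients, so the same fitting routine handles them; only the columns handed to it change.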

While OLS has various assumptions that ideally should be met in order to proceed with modeling, the predictive performance is insensitive to many of these. For instance, ideally, the input variables should be independent; however, even if there are relatively highly collinear predictors, the predictive ability of OLS is not impacted (the interpretation of the coefficients however is greatly affected!).

However, there are some notable difficulties and problems with OLS.  We will discuss some of these in the next few posts!

New pubs in Sustainable and Resilient Infrastructure

The Analytics Lab has two new publications recently accepted in Sustainable and Resilient Infrastructure.

Barker, K., J. Lambert, C. Zobel, A. Tapia, J. Ramirez-Marquez, L. McLay, C. Caragea, C. Nicholson. 2016. Defining Resilience Analytics. Accepted for publication in Sustainable and Resilient Infrastructure on September 1, 2016.

Zhang, W. and C. Nicholson. 2016. A multi-objective optimization model for retrofit strategies to mitigate direct economic loss and population dislocation. Accepted for publication in  Sustainable and Resilient Infrastructure on September 19, 2016.

Defining Resilience Analytics

Dr. Nicholson along with Dr. Kash Barker (OU),  Dr. Cornelia Caragea (UNT), Dr. James Lambert (UVA), Dr. Laura McLay (Univ of Wisconsin), Dr. Chris Zobel (Virginia Tech), Dr. Andrea Tapia (Penn State), and Dr. Jose Ramirez-Marquez (Stevens Institute) have collaborated on this perspective article funded by their NSF award.

Abstract: Theory, methodology, and applications of risk analysis contribute to the quantification and management of resilience. For risk analysis, numerous complementary frameworks, guidelines, case studies, etc., are available in the literature. For resilience, the documented applications are sparse relative to numerous untested definitions and concepts. This essay on resilience analytics motivates the methodology, tools, and processes that will achieve resilience of real systems. The paper describes how risk analysts will lead in the modeling, quantification, and management of resilience for a variety of systems subject to future conditions including technologies, economics, environment, health, developing regions, regulations, etc. The paper identifies key gaps where methods innovations are needed, presenting resilience of interdependent infrastructure networks as an example. Descriptive, predictive, and prescriptive analytics are differentiated. A key outcome will be the recognition, adoption, and advancement of resilience analytics by scholars and practitioners of risk analysis.

A multi-objective optimization model for retrofit strategies to mitigate direct economic loss and population dislocation

This work is part of the NIST-funded Center of Excellence in Community Resilience and will be published in a special edition of Sustainable and Resilient Infrastructure focused on some of the initial analysis conducted by the Center.

Abstract: One strategy to mitigate social and economic vulnerabilities of communities to natural disasters is to enhance the current infrastructure underlying the community. Decisions regarding allocation of limited resources to improve infrastructure components are complex and involve various trade-offs. In this study, an efficient multi-objective optimization model is proposed to support decisions regarding building retrofits within a community.
In particular, given a limited budget and a heterogeneous commercial and residential building stock, solutions to the proposed model allow a detailed analysis of the trade-offs between direct economic loss and the competing objective of minimizing immediate population dislocation. The developed mathematical model is informed by earthquake simulation modeling as well as population dislocation modeling from the field of social science. The model is applied to the well-developed virtual city, Centerville, designed collaboratively by a team of engineering experts, economists, and social scientists. Multiple Pareto optimal solutions are computed in the case study and a detailed analysis regarding the various decision strategies is provided.

Sustainable and Resilient Infrastructure is an interdisciplinary journal that focuses on the sustainable development of resilient communities.

Sustainability is defined in relation to the ability of infrastructure to address the needs of the present without sacrificing the ability of future generations to meet their needs.  Resilience is considered in relation to both natural hazards (like earthquakes, tsunami, hurricanes, cyclones, tornado, flooding and drought) and anthropogenic hazards (like human errors and malevolent attacks.)  Resilience is taken to depend both on the performance of the built and modified natural environment and on the contextual characteristics of social, economic and political institutions. Sustainability and resilience are considered both for physical and non-physical infrastructure.

Contributions address pressing societal issues while exploring needed solutions.  Investigating sustainability and resilience from an interdisciplinary perspective, the journal includes original articles, reviews, short communications and case studies in all areas relevant to sustainability and resilience.


Open Faculty Position: Cyber-Physical-Social Systems

Open Faculty Position in ISE

The School of Industrial and Systems Engineering at the University of Oklahoma is recruiting to fill an open tenure-track faculty position to begin in August 2017.  The position should help further our existing core research efforts in Cyber-Physical-Social systems, in particular as they relate to the broad field of resilience.  The Analytics Lab @ OU is actively engaged in research on both community resilience and critical resilient interdependent infrastructure systems and processes.

This position will also help support the Data Science and Analytics graduate program in the College of Engineering.   The full position announcement is available as a PDF in the link below.  Here is an excerpt of the position description.  Applicants are encouraged to apply by November 1, 2016.

The full position announcement can be found here: ISE Faculty Open Position

The University of Oklahoma is a Carnegie-R1 comprehensive public research university known for excellence in teaching, research, and community engagement, serving the educational, cultural, economic and health-care needs of the state, region, and nation from three campuses: the main campus in Norman, the Health Sciences Center in Oklahoma City, and the Schusterman Center in Tulsa.

OU enrolls over 30,000 students and has more than 2,700 full-time faculty members. Norman is a culturally rich and vibrant town located in the Oklahoma City metro area. With outstanding schools, amenities, and a low cost of living, Norman is a perennial contender on the “Best Places to Live” rankings.

The University of Oklahoma, in compliance with all applicable federal and state laws and regulations, does not discriminate on the basis of race, color, national origin, sex, sexual orientation, genetic information, gender identity, gender expression, age, religion, disability, political beliefs, or status as a veteran in any of its policies, practices, or procedures. The University of Oklahoma, recognizing its obligation to guarantee equal opportunity to all persons in all segments of University life, reaffirms its commitment to the continuation and expansion of positive programs which reinforce and strengthen its affirmative action policies. This commitment stems not only from compliance with federal and state equal opportunity laws but from a desire to ensure social justice and promote campus diversity. Our commitment to the concept of affirmative action requires sincere and cooperative efforts throughout all levels of our employment structure. We will continue to strive to reach the goals of fair and equal employment opportunities for all.

Big data: what is it?

What is Big Data?

Big data is a term that you often hear when people talk about data science and analytics.

So, the question is, “what is big data?”

Doug Laney from Gartner, a leading information technology research company, defined 3 dimensions of “big data”: volume, velocity, and variety.

  • Volume denotes the size and scale of the data.   There is a lot of data out there – and it is growing. It is estimated that 40 zettabytes of data will be created by the year 2020.   What is a zettabyte?  One zettabyte is equal to 1 trillion gigabytes!
  • Velocity is the speed at which data is created as well as the increasing speed at which it is processed.  The speed at which data is created is almost unimaginable. And it is accelerating. I’ll give some examples, but by the time you read this they will be out of date: Google is processing about 3.5 billion search queries every day; every minute we are uploading 300 hours of video onto YouTube; and 3.4 million emails are sent every second.  Check out this site for more up-to-date information:
  • Variety of the data refers to the fact that data comes from many sources and in many forms.  Whether it is Facebook posts, video uploads, satellite images, GIS data, reviews on products, sensor data from self-driving cars, or data from wearable devices and wireless health monitors – data is coming at us from all directions and in many formats.
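
As a rough back-of-the-envelope check on the volume and velocity figures above, here is a short Python sketch. It assumes the decimal (SI) definitions of gigabyte and zettabyte and simply rescales the quoted rates to other time units; the variable names are mine, and the rates themselves are the (already dated) ones quoted in the list.

```python
# Decimal (SI) byte units, assumed for this sketch.
BYTES_PER_GIGABYTE = 10 ** 9
BYTES_PER_ZETTABYTE = 10 ** 21

gigabytes_per_zettabyte = BYTES_PER_ZETTABYTE // BYTES_PER_GIGABYTE
print(gigabytes_per_zettabyte)          # 1000000000000, i.e. 1 trillion

# 40 zettabytes expressed in gigabytes.
print(40 * gigabytes_per_zettabyte)     # 40000000000000

SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_DAY = 24 * 60

# Rates quoted in the list above, rescaled.
searches_per_second = 3_500_000_000 / SECONDS_PER_DAY
video_hours_per_day = 300 * MINUTES_PER_DAY
emails_per_day = 3_400_000 * SECONDS_PER_DAY

print(round(searches_per_second))       # about 40509 searches per second
print(video_hours_per_day)              # 432000 hours of video per day
print(emails_per_day)                   # 293760000000 emails per day
```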

People love alliteration…

Everyone seems to want to add more “V’s” to the definition of big data, so now we have the 4 V’s of big data, the 5 V’s, the 6 V’s, and even the 7 V’s of big data…

Batman says: Only 3 V's of big data!

Let’s look at these next four V’s: Veracity, Variability, Visualization, and Value.   I’d like to add, however, that these next dimensions are not unique to “big” data, but represent challenges to data of basically any size.  Now, I should mention that Doug Laney did not necessarily like the addition of the new V’s to his working description of “big data”.

  • The first one, added by IBM, is “veracity” – that is, the accuracy, truthfulness, or trustworthiness of the data.  IBM found that 1 in 3 business leaders didn’t trust the information that they use to make decisions, and additionally that “poor data quality costs the US economy an estimated 3.1 trillion dollars a year.”

  • Variability implies that the meaning of the data is changing.  A number, variable, or rule might have had a certain definition last month; but now it has changed.  This also might relate, for example, to how words have different meanings in different contexts.  One especially difficult challenge in the field of natural language processing is how to detect and interpret sarcasm.  The same word used in one phrase may have the exact opposite meaning when used in a different phrase.


  • Visualization is associated with the challenge of understanding what is really in your data – this includes visualizing and communicating the interesting facets of the data and turning all of this into something comprehensible. This is not easy.

  • Finally, the last V – value.  Data by itself has no real value.   Having lots of it, without meaning, doesn’t do anyone any good. Individual observations, transactions, records, and entities in the data mean very little on their own.  It is only through aggregation and analysis that we can find anything worthwhile.   But because there is so much of it, there is enormous potential!  As a shameless plug, turning big data or small data or anything in between into value – well, that’s the purpose of the ISE/DSA 5103 Intelligent Data Analytics course that I teach.

Now what?

I like how Joel Gurin, author of Open Data Now, defines big data: “Big data describes datasets that are so large, complex, or rapidly changing that they push the very limits of our analytical capability.  It’s a subjective term: What seems “big” today may seem modest in a few years when our analytic capacity has improved.”

What was big data yesterday may not be big data now, and what is “big” now may not be considered “big” tomorrow.  However, what is consistent in this field is the need for us to expand our analytical talents and technology.  This (again, shameless plug) is what the MS Data Science and Analytics program at OU is all about!  Joel Gurin goes on to say that what’s really important is not so much the size of the data, but the “big impact” that it can have on society, health, economy, and research.

Fall 2016 Classes

Why are open source statistical programming languages the best?
Because they R.

It is August and Fall 2016 classes begin in just a couple of days.  I am currently prepping for two large classes: I am happy to see the incredible interest in my graduate course with over 50 students enrolled in ISE/DSA 5103 Intelligent Data Analytics! I will also be taking over Dr. Suleyman Karabuk’s ISE 4113 Decision Support Systems undergraduate course with nearly 80 students already enrolled!

To this end I am collecting as many new jokes and one-liners as possible; gotta keep the material fresh.  That said, to those of you who have yet to take any of my courses: my jokes are really not that funny; however, I do expect all students to laugh regardless.  This is a price that must be paid.  If you have any jokes, puns, etc. that are short, clean, related to statistics or data science, and optionally funny, please send them my way: cnicholson @ ou (dot) edu.

To support these two courses I have tricked two unassuming graduate students into becoming TAs for me.  Sai Krishna Theja Bhavaraju has enthusiastically accepted the role of TA for ISE 4113 and Alex Rodriguez will be the TA for ISE 5103.  Both of these TAs are bright, friendly, and very helpful.  If you are taking either of these two classes, please feel free to ask them for help.  If you are not taking these classes, but you stumble across either of these two gentlemen, please buy them a beer; they have their work cut out for them!

Intelligent Data Analytics is not an easy course.  The homeworks and projects are notoriously challenging.  In the class we address real-world data intensive problems by integrating human intuition with data analysis tools to draw out and communicate meaningful insights. Topics include problem approach and framing, data cleansing, exploratory analysis and visualization, dimension reduction, linear and logistic regression, decision trees, and clustering.  Students will be introduced to a powerful open source statistical programming language (R) and work on hands-on, applied data analysis projects.  I have heard from several former students that this has been a hard but useful course: at least six students that I know of who have taken this course have obtained jobs in analytics and data science fields at companies including Deloitte Consulting, Visual BI, GE Global Research, Nerd Kingdom, OKC Thunder, and Standard & Poor’s.  Hopefully the skills you are introduced to in the class can be helpful to you in the future.

ISE 4113 is a Decision Support Systems course that exploits advanced features of MS Excel 2013 to model and build decision support applications.  The course will start with the basics and quickly move into mathematical modeling, simulation, VBA, and GUI design.  While this is the first time for me to teach this course, I have heard from students that the material they learn in this class has made a significant impact in their academic and professional lives.  I hope to continue the track record of success with this course.



Summer 2016 Hangout

Very happy to see all the students and friends that came out to the Summer 2016 hangout at McNellie’s The Abner Ale House in Norman.  I am privileged to work with a wide variety of students in ISE, DSA, and CEES who are applying research in a broad array of application areas (from Community Resilience to Streaming Clustering in online Gaming to Predictive Modeling for TV Ratings to Optimizing Ship Routing) and who represent many different cultures, languages, and backgrounds.  Our group includes members from China, India, Iran, Peru, Brazil, as well as Oklahomans and Texans.  My beautiful wife, hailing from Mexico, also came to hang out.

I am glad that this gave you a chance to meet some new colleagues and reconnect with others outside the lab.

Hopefully, all of the MS DSA students (Alex B., Alex R., Alexandra, Emily, Silvia, and Stephen) can support each other through this academically intense Fall semester about to begin!   Silvia and Emily are completing their industry practicums this week as well — so congratulations to them (assuming all goes well!)

We are also happy to welcome Vera Bosco to the group. Vera is an ISE PhD student who is applying methods of stochastic optimization and dynamic programming to ship routing under weather uncertainty.  She is a new addition to the group and hails from Brazil.  Her bio is now posted on the team page.

And as always, I am glad to hang out with the CEES group who are a part of the CORE lab – Peihui, Mohammad, Yingjun, and Jia.

I hope this opportunity (and more like it to come) will help you connect with your colleagues and co-conspirators in the Analytics Lab. Several students are out-of-town during the Summer, but when everyone is back from their internships and travels we will plan a get-together for the Fall.