Author Archives: Charles Nicholson

Congratulations to three new Masters!

Congratulations Alexandra, Emily, and Megan — new Masters of Science!

Alexandra Amidon (left), Emily Grimes (center), and Megan Snelling (right) have all successfully defended their Master’s theses this Spring 2017 and are the three newest Masters from the Analytics Lab @ OU.

Alexandra and Emily completed their Masters of Science in Data Science and Analytics from the Gallogly College of Engineering.  The DSA program is a joint effort between the School of Industrial & Systems Engineering and the School of Computer Science. Megan completed her Masters of Science from the School of Industrial & Systems Engineering.

I’ll start with Megan since she is the lone ISE in this group of three.

Megan’s work is entitled “MODEL FOR MITIGATING ECONOMIC AND SOCIAL DISASTER DAMAGE THROUGH STRUCTURAL REINFORCEMENT” and is a continuation of previous work completed as a part of the NIST-funded Center for Risk-Based Community Resilience Planning and the CORE Lab @ OU.

Abstract: Natural disasters have both severe negative short-term consequences on community structures and inhabitants and long-term impacts on economic growth. In response to the rising costs and magnitude of such disasters to communities, a characteristic of modern community development is the aspiration towards resilience. An effective and well-studied mitigation measure, structural interventions reduce the value lost in buildings in earthquake scenarios. Both structural loss and socioeconomic characteristics are indicators for whether a household will dislocate from their residence. Therefore, this social vulnerability can be mitigated by structural interventions and should be minimized, as it is also an indicator of indirect economic loss. This research presents a model for mitigating direct economic loss and population dislocation through decisions regarding the selection of community structures to retrofit to higher code levels. In particular, the model allows for detailed analysis of the tradeoffs between budget, direct economic loss, population dislocation, and the disparity of dislocation across socioeconomic classes given a heterogeneous residential and commercial structure set. The mathematical model is informed by extensive earthquake simulation as well as recent dislocation modeling from the field of social science. The non-dominated sorting genetic algorithm II (NSGA-II) is adapted to solve the model, as the dislocation model component is non-linear. Use of the mitigation model is demonstrated through a case study using Centerville, a test bed community designed by a multidisciplinary team of experts. Details of the retrofit strategies are interpreted from the estimated Pareto front.

We should also offer congratulations to Megan on another account: she is getting married soon and plans to spend her Summer hiking through Europe!

Alexandra and Emily both worked on projects related to T.U.G. (The Untitled Game), which was partially funded by Nerd Kingdom.

Alexandra’s work is entitled “A NEW APPROACH TO ADAPTING NEURAL NETWORK CLASSIFIERS TO SUDDEN CHANGES IN NON-STATIONARY ENVIRONMENTS”

Abstract: Predictive algorithms applied to streaming data sources are often trained sequentially by updating the model weights after each new data point arrives. When disruptions or changes in the data generating process occur (“concept drifts”), the online learning process allows the algorithm to slowly learn the changes; however, there may be a period of time after concept drift during which the predictive algorithm underperforms. This thesis introduces a method that makes online neural network classifiers more resilient to these concept drifts by utilizing data about concept drift to update neural network parameters.

Alexandra has accepted a position with MSCI, a leading provider of investment decision support tools worldwide, as a Reference Data Production Analyst.  She will be using her skills in machine learning to continue developing new tools for anomaly detection.

Emily’s work is entitled “THE EFFECT OF GATHERING ON SANDBOX PLAYER ENGAGEMENT AS DEFINED USING ANALYTIC METHODS”

Abstract: Player engagement is a concept that is both vital to the online gaming industry and difficult to define. Typically, engagement is defined using social science methodologies such as observing, surveying, and interviewing players. With the vast amount of data being collected from video games as well as user bases increasing in size, it is worthwhile to investigate whether or not user engagement can be defined and interpolated from data alone. This study develops a methodology for defining engagement using analytic methods in order to approach the question of whether gathering (as a proxy for social interaction) in sandbox games has an effect on player engagement.

Emily is following up on leads for a full-time position now, but in the meantime she has a road trip planned to the Grand Canyon, Sequoia National Park, and the Big Sur in California.  She is also in discussions with KGOU and NPR about starting a new radio program!

Congratulations to all three excellent students!  We wish you great success!

Emily Grimes, MS DSA, May 2017

Megan Snelling, MS ISE, May 2017

OU Industrial & Systems Engineering and Data Science & Analytics

Public Webinar Announcement: Center for Risk-Based Community Resilience Planning

Public Webinar Announcement — Community Resilience: Modeling, Field Studies and Implementation

Learn more about the NIST-funded Center for Risk-Based Community Resilience Planning and how the Center is developing a computational environment to help define the attributes that make communities resilient.

WEBINAR: Thursday, April 27, 10:00 a.m. – 12:00 p.m. (CDT)

https://www.youtube.com/watch?v=eyjzCDxcdSA&feature=youtu.be  

The webinar is open to anyone and is immediately followed by a Q&A “chat” period.

A Resilient Community is one that is prepared for and can adapt to changing conditions and can withstand and recover rapidly from disruptions to its physical and social infrastructure. Modeling community resilience comprehensively requires a concerted effort by experts in engineering, social sciences, and information sciences to explain how physical, economic, and social infrastructure systems within a real community interact and affect recovery efforts.

Join this informational WEBINAR to learn more about the Center’s recent activities.

A Center overview will be followed by a session on the Center’s recent Special Issue of Sustainable and Resilient Infrastructure, which features six papers on the virtual community Centerville. The modeling and analysis theory behind each paper will be explained, followed by a demonstration of IN-CORE, the Interdependent Connected Modeling Environment for Community Resilience. Presentations on the first validation study (the Joplin Hindcast) and the Center’s first field study (the 2016 Lumberton floods in NC) will also be highlights of the webinar.

No registration is required this time, just click, watch, and chat.

Both Dr. Nicholson and Dr. Wang will be giving presentations during the webinar.

Flier for distribution: Webinar Flier 27-April-2017

Postdoctoral Research Fellow Position in Community Resilience

Prof. Charles Nicholson is currently accepting applications for a postdoctoral research fellow position in Community Resilience within the School of Industrial and Systems Engineering at the University of Oklahoma.

The primary area of research concerns the following broad objective:

Enhance community resilience to natural and man-made disasters through modeling, optimization, and risk-informed decision making with respect to vital, large-scale, interdependent civil infrastructure and socio-economic systems.

Researchers with backgrounds and interests in one or more of the following areas are encouraged to apply:

  • Optimization: network flow optimization, multi-objective optimization, stochastic optimization; stochastic programming
  • Data science and analytics: including machine learning for predictive and classification modeling as well as unsupervised and semi-supervised learning
  • Decision modeling for community and regional resilience planning

The postdoctoral research fellow will embark on an exciting and innovative research program within a well-established and active multidisciplinary research group with collaboration opportunities across the United States.  In this role, you will also supervise one or more PhD students.  Experience with tools such as Python or R is highly preferred.  Familiarity with Civil Infrastructure systems and/or economic modeling is a plus.  The position will be supported by funded research projects with multi-year durations.

Interested applicants should send a one-page statement of research interests and a CV to cnicholson @ OU (dot) edu.

Total Chaos: Soccer, ISE, and Old People


Total Chaos

In Fall 2017  I decided that it was time to start working on my bucket-list, item #117: play an actual game of soccer.   There are other items on my bucket-list too, but I figured I better try this one soon since I am not getting any younger.  This is the impetus for my new team: Total Chaos.

I’ve coached soccer for 3 years (my daughter’s team) for the Norman Youth Soccer Association (NYSA). When I started, I had very little understanding of the game. I knew that most of the players were not supposed to use their hands, but any rules other than that were a bit vague…

Anyway, I’ve wanted to play soccer for years, but starting out as a complete newbie in such a demanding and skilled sport as futbol over the age of 40, well, it was somewhat of a daunting thing to do. The options were: (1) try to join an existing team and then ultimately disappoint all of the other players with my complete lack of skill, or (2) start my own team from scratch with the understanding that (a) everyone is welcome, even newbies and old people, and (b) we will not likely win. That is, set expectations low: so low in fact that no one has a right to be disappointed with any outcome! I opted for the latter. NYSA has an adult league, and thus I started recruiting for my new team…

To make a long story short, the response to my invitation, “do you want to play soccer in a league even if we have no chance of winning any games?”, was a resounding yes. Soccer moms and dads, friends, OU faculty, and both grad and undergrad students in ISE for some reason found the idea appealing. My wife, who, like me, has never played the sport in her life, even joined up. Thankfully, not everyone that answered the call was a complete newbie, because several of us needed teachers!

The student becomes the teacher…

In this case, literally “the students become the teachers”: Jack, Austin, Leslie, and Brad are all undergrads who took my ISE 4113 course in Fall 2017, and now they have their work cut out for them trying to teach me what to do on the field. Joining them we also have Darin, Nicole, Andrew, and Yasser, all PhD or MS students in either ISE or DSA.

Now, while our defense is not this bad:

without Jack Appleyard leading the defense, it could be much worse. (I’m on defense, you see, which does not give Jack much to work with!) He is almost a one-man team in the backfield, saving our collective butts more than once and keeping it from being the total chaos it would’ve been otherwise!

Brad “the slide tackle ninja” Osborn, Austin “the king of awesome” Shaw, and Leslie “the beast” Barnes head up the midfield and offense and simply rock the pitch…

Darin Chambers, who happens to also be a political candidate running for State Representative District 46, is a fellow soccer dad and a great teammate and leader. Yasser, a PhD student in ISE, has both published research with me and taught me how to defend and pass. Nicole and Andrew, both new to the game, are simply fearless. Pravin, who is going up for tenure at OU at the same time as me, has stepped up to play keeper after our first keeper was injured. Finally, Everton, Omar, Nery, Marco, Justin, Alicia, and Greg are all new friends.

In summary, we have a great team: a great mix of ages, genders, languages, skills, and backgrounds. Thanks for helping me mark off an item on my bucket-list that I’ve dreamed about for years. The team pic below is missing a few players, so I’ll update it later, but here we are: Total Chaos.

Finally, despite my hand-balls and/or fouls in the box and/or missed passes and/or bad throw-ins (sorry about all that…), so far we’ve played two games and won both.

Total Chaos team picture

Left to Right — Back: Everton, Omar, Marco, Justin, Jack, Brad, Austin, Pravin, Greg, Charles, Andrew, Nicole; Front: Yasser, Alicia, Zorelly

 

Two new resilience publications 2017!


Well, here at the Analytics Lab @ OU, 2017 started off nicely with two new articles published in the area of community resilience. We are also very excited about finally being able to share the virtual community we created, named “Centerville”, as a part of the Center for Risk-Based Community Resilience Planning; the special issue on Centerville is finally published in Sustainable and Resilient Infrastructure. Please check out the post on Centerville!

The first of these resilience publications is entitled Resilience-based post-disaster recovery strategies for road-bridge networks, which appears in Structure and Infrastructure Engineering, an international journal which aims to present research and developments on the most advanced technologies for analyzing, predicting and optimizing infrastructure performance.

This paper by Weili Zhang, Naiyu Wang, and myself presents a novel resilience-based framework to optimise the scheduling of the post-disaster recovery actions for road-bridge transportation networks.  This work was supported, in part, by the Center for Risk-Based Community Resilience Planning, National Institute of Standards and Technology (NIST) [Federal Award No. 70NANB15H044].

The methodology systematically incorporates network topology, redundancy, traffic flow, damage level and available resources into the stochastic processes of network post-hazard recovery strategy optimisation. Two metrics are proposed for measuring rapidity and efficiency of the network recovery: total recovery time (TRT) and the skew of the recovery trajectory (SRT).  The SRT is a novel metric designed to capture the characteristics of the recovery trajectory which relate to the efficiency of the restoration strategies.  This is depicted in the figure below.


Depiction of new skew metric for network recovery

Based on this two-dimensional metric, a restoration scheduling method is proposed for optimal post-disaster recovery planning for bridge-road transportation networks. To illustrate the proposed methodology, a genetic algorithm is used to solve the restoration schedule optimisation problem for a hypothetical bridge network with 30 nodes and 37 bridges subjected to a scenario seismic event. A sensitivity study using this network illustrates the impact of the resourcefulness of a community and its time-dependent commitment of resources on the network recovery time and trajectory.

  • Zhang, W., N. Wang, C. Nicholson. 2017. Resilience-based post-disaster recovery strategies for road-bridge networks.  Structure and Infrastructure Engineering, Accepted.  LINK

The next of the resilience publications is a paper appearing in Reliability Engineering & System Safety entitled A multi-criteria decision analysis approach for importance ranking of network components. This is a joint effort with Yasser Almoghathawi, Kash Barker, and Claudio Rocco.

Reliability Engineering and System Safety is an international journal devoted to the development and application of methods for the enhancement of the safety and reliability of complex technological systems. The journal normally publishes only articles that involve the analysis of substantive problems related to the reliability of complex systems or present techniques and/or theoretical results that have a discernable relationship to the solution of such problems. An important aim is to achieve a balance between academic material and practical applications.

In the study, we propose a new approach to identify the most important network components based on multiple importance measures using a multi-criteria decision making method, namely the Technique for Order Preference by Similarity to Ideal Solution (TOPSIS), which is able to take into account the preferences of decision-makers. We consider multiple edge-specific flow-based importance measures as the criteria, where the alternatives are the edges of the network.


Component importance measures may rank elements within a network differently. TOPSIS provides one approach to consider such cases.

Accordingly, TOPSIS is used to rank the edges of the network based on their importance considering multiple different importance measures. The proposed approach is illustrated through different networks with different densities along with the effects of weights.
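For the curious, here is a rough Python sketch of the general TOPSIS mechanics, not the implementation or data from the paper; the edge scores and criteria weights below are made up purely for illustration.

```python
import numpy as np

def topsis_rank(scores, weights):
    """Rank alternatives (rows) scored on several benefit criteria (columns)
    using TOPSIS: normalize, weight, and measure closeness to the ideal point."""
    norm = scores / np.linalg.norm(scores, axis=0)      # vector-normalize each criterion
    weighted = norm * weights
    ideal, anti_ideal = weighted.max(axis=0), weighted.min(axis=0)
    d_best = np.linalg.norm(weighted - ideal, axis=1)
    d_worst = np.linalg.norm(weighted - anti_ideal, axis=1)
    closeness = d_worst / (d_best + d_worst)
    return np.argsort(-closeness), closeness            # most important first

# toy example: 5 edges scored under 3 hypothetical flow-based importance measures
edge_scores = np.array([[0.9, 0.2, 0.5],
                        [0.4, 0.8, 0.6],
                        [0.7, 0.7, 0.9],
                        [0.1, 0.3, 0.2],
                        [0.5, 0.5, 0.4]])
order, score = topsis_rank(edge_scores, weights=np.array([0.4, 0.3, 0.3]))
print("edge ranking (most important first):", order)
```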

  • Almoghathawi, Y., K. Barker, C.M. Rocco, and C. Nicholson. 2017. A multi-criteria decision analysis approach for importance ranking of network components. Reliability Engineering and System Safety, 158: 142-151 LINK [bibTex]

 

Centerville Virtual Community Testbed

Centerville Special Issue in Sustainable and Resilient Infrastructure

Enhancing community resilience in the future will require new interdisciplinary systems-based approaches that depend on many disciplines, including engineering, social and economic, and information sciences. The National Institute of Standards and Technology awarded the Center for Risk-Based Community Resilience Planning to Colorado State University and nine other universities in 2015 (including the University of Oklahoma!), with the overarching goal of establishing the measurement science for community resilience assessment. To this end, several of the researchers within the Center for Risk-Based Community Resilience Planning have come together to develop the Centerville virtual community.

The Centerville Virtual Community Testbed is aimed at enabling fundamental resilience assessment algorithms to be initiated, developed, and coded in a preliminary form, and tested before the refined measurement methods and supporting data classifications and databases necessary for a more complete assessment have fully matured.  Sustainable and Resilient Infrastructure has published a Special Issue introducing the Centerville Testbed, defining the physical infrastructure within the community, natural hazards to which it is exposed, and the population demographics necessary to assess potential post-disaster impacts on the population, local economy, and public services in detail.

The community has multiple residential and commercial zones with several types of buildings at different code levels. The population of about 50,000 is diverse with respect to employment and income. There are multiple public schools and government buildings located throughout the city as well as emergency facilities. There are a few main roads, a simple highway system, some smaller local roads, and a few important bridges within the transportation system. The Analytics Lab @ OU currently has a research paper in development relating to the study of transportation systems, and we use the Centerville testbed as one of our cases.

In addition to the buildings and transportation system, Centerville also has a simplified electric power network (EPN) with multiple substation types (transmission, main grid, distribution, sub-distribution), a small power plant, and single-pole transmission lines. The community also includes a basic potable water system with pumps, tanks, reservoirs, a water treatment plant, and an underground piping system. The maps associated with these infrastructure systems can be found in the Special Issue.

By creating such a detailed virtual community, researchers can have a simplified but somewhat realistic platform for experimentation. The papers included in the Special Issue cover topics such as multi-objective optimization for retrofit strategies, building portfolio fragility functions, performance assessment of the EPN under tornadoes, and computable general equilibrium (CGE) assessment of the community with respect to disasters.

All papers included in the Special Issue are listed below.

  • Bruce R. Ellingwood, John W. van de Lindt & Therese P. McAllister
  • Bruce R. Ellingwood, Harvey Cutler, Paolo Gardoni, Walter Gillis Peacock, John W. van de Lindt & Naiyu Wang
  • Roberto Guidotti, Hana Chmielewski, Vipin Unnikrishnan, Paolo Gardoni, Therese McAllister & John van de Lindt

Happy Winter Break 2016!

Winter Break 2016

Here at the Analytics Lab @ OU we would like to wish you all a happy holiday season, a wonderful winter break, and a feliz año nuevo!

Also, we have had some nice developments over the last few days and weeks and I would like to mention these briefly.


Congratulations!

Congratulations to Param Tripathi for successfully defending his Masters thesis and completing his MS in ISE!

In his thesis, entitled “ANALYSIS OF RESILIENCE IN US STOCK MARKETS DURING NATURAL DISASTERS”, Param analyzes how major hurricane events impact the stock market in the United States. In particular, he looks at two fundamental elements of resilience, vulnerability and recoverability, as they apply to the Property and Casualty Insurance sector by evaluating price volatility with respect to the overall New York Stock Exchange. He applies breakout detection to study patterns in the time series data and uses this to quantify key resilience metrics.

Congratulations to Alexandra Amidon for successfully completing her Industry practicum on anomaly detection using a compression algorithm.  Well done!

Good luck!

Next, Weili Zhang, a PhD Candidate in ISE, is actively interviewing with a number of top companies for data science positions. He has passed first and second round interviews at several companies and has been invited for face-to-face interviews with Facebook, Google, eBay, Verizon, Disney, Lyft, Amazon, Target, and several more. Getting past round one with these companies is already a major accomplishment; rounds two and three get progressively more intense! Good job and good luck Weili!

Yay us!

The Analytics Lab team members have had a few papers published or accepted for publication recently:

  • Almoghathawi, Y., K. Barker, C.M. Rocco, and C. Nicholson. 2017. A multi-criteria decision analysis approach for importance ranking of network components. Reliability Engineering and System Safety, 158: 142-151 LINK [bibTex]
  • Nicholson, C., L. Goodwin, and C. Clark. 2016. Variable neighborhood search for reverse engineering of gene regulatory networks.  Journal of Biomedical Informatics, 65:120-131 LINK [bibTex]
  • Zhang, W., N. Wang, and C. Nicholson. 2016. Resilience-based post-disaster recovery strategies for community road-bridge networks. Accepted in Structure and Infrastructure Engineering.

I’ll write up a post about these exciting new research articles soon!


Data Science Interviews

Weili, along with a few other current and former students, has been describing the data science interview processes and questions to me. My plan is to write up a blog post summarizing some key things for those of you seeking DSA positions to keep in mind and prepare for. Fortunately, many of the questions/topics asked about are covered in the ISE 5103 Intelligent Data Analytics class offered in the Fall semester. If you are starting to look for data science jobs now or in the near future, make sure you check out that post!

Life in Industry

Additionally, I have asked several current/former students (Pete Pelter, Leslie Goodwin, Cyril Beyney, Olivia Peret) who are employed in data science positions to provide some descriptions and information about life in industry.  Look for that series of posts soon!

ISE 5113 Advanced Analytics and Metaheuristics

I am currently preparing for the online launch of the ISE 5113 Advanced Analytics and Metaheuristics course to be offered in Spring 2017.  The course will be offered both online and on-campus.  This is one of my favorite courses to teach.  The introductory video should be posted soon!  The course fills up very fast, so if you are a DSA or ISE student make sure to register ASAP if you haven’t already!

Soccer

I have had an excellent response from so many people this semester that it looks like “yes!” we will have enough people to start a soccer team in the Spring. I’ll be providing more information next month. The season starts in March and the official signups are in February ($80 each). We need a team name and team colors, so I’m open to ideas! In the meantime, get out there and practice!


Finally…

I hope everyone has an excellent winter break.  Enjoy your family, enjoy good food, stay warm, practice some soccer, catch up on sleep, maybe study a little bit… and I’ll see you next year!

 

Thoughts on regression techniques (part iii)

In the first of this series of posts on regression techniques we introduced the workhorse of predictive modeling, ordinary least squares regression (OLS), but concluded with the notion that, while a common and useful technique, OLS has some notable weaknesses. In the last post we discussed its sensitivity to outliers and how to deal with that using several “robust regression” techniques. Now we will discuss regression modeling for high-dimensional data. Two of the techniques below are based on an idea called “feature extraction” and the last three are based on “penalized regression”.

Dimensionality, feature extraction, selection, and penalized regression

An issue that every modeler who has had to deal with high-dimensional data faces is feature selection. By high dimensionality I mean that \(p\), the number of predictors, is large. High-dimensional data can cause all sorts of issues, some of them psychological: it can be very hard to wrap your mind around so many predictors at one time; it can be difficult to figure out where to start, what to analyze, and what to explore; and even visualization can be a beast! I tried to find some pictures to represent high-dimensional data for this blog post, but of course, by definition, 2D or 3D representations of high-D data are hard to come by. This parallel plot at least adds some color to the post and represents at least some of the complexity!

When \(p\) is large with respect to the number of observations \(n\), then the probability of overfitting is high. Now, let me quickly interject that \(p\) may be much greater than the number of raw input variables in the data.  That is, a modeler may have decided on constructing several new features from among the existing data (e.g., ratios between two or more variables, non-linear transformations, and multi-way interactions). If there is a strong theoretical reason for including a set of variables, then great, build the model and evaluate the results. However, oftentimes, a modeler is looking for a good fit and doesn’t know which of the features should be created, and then which combinations of such features should be included in the model. Feature construction deals with the first question (and we will look at that later on), feature selection deals with the second question (which we deal with now).

Actually, first I will digress slightly: in the case when \(p\) is equal to or greater than \(n\), OLS will fail miserably. In this situation, feature selection is not only a good idea, it is necessary! One example problem where \(p>n\) occurs is microarray data. If you are not familiar with what a microarray is, see the inset quote from the chapter “Introduction to microarray data analysis” by M. Madan Babu in Computational Genomics. Essentially, microarray data is collected for the simultaneous analysis of thousands of genes to help discover gene functions, examine gene regulatory networks, identify drug targets, and provide insight into the understanding of diseases. The data usually has \(n\) on the order of 10 to 100, whereas \(p\) might be on the order of 100s to 1000s!

A microarray is typically a glass slide on to which DNA molecules are fixed in an orderly manner at specific locations called spots. A microarray may contain thousands of spots and each spot may contain a few million copies of identical DNA molecules that uniquely correspond to a gene. The DNA in a spot may either be genomic DNA or short stretch of oligo-nucleotide strands that correspond to a gene. Microarrays may be used to measure gene expression in many ways, but one of the most popular applications is to compare expression of a set of genes from a cell maintained in a particular condition (condition A) to the same set of genes from a reference cell maintained under normal conditions (condition B). – M. Madan Babu

PCR and PLS

There are two closely related approaches that can handle such scenarios with ease: Principal Component Regression (PCR) and Partial Least Squares regression (PLS). Both of these techniques allow you to deal with data when \(p>n\). Both PCR and PLS essentially represent the data in lower dimensions (in what are known as “principal components” for the former or just “components” for the latter). Either way, these components are each formed from linear combinations of all of the predictors. Since PCR and PLS essentially create new features from the given data automatically, this is a form of what is known as “feature extraction”. That is, the algorithm extracts new predictors from the data for use in modeling.

If you choose \(k\) components (or extracted features) such that \(k\) is less than \(p\), then you have reduced the effective representation of your data. With this, of course, also comes information loss; you can’t just get rid of dimensions without losing information. However, PCR and PLS both try to shove as much “information” into the first component as possible; subsequent components contain less and less information. If you chop off only a few of the last components, then you will not experience much information loss. If your data contains highly correlated variables or subsets of variables, then you can possibly reduce many dimensions with very little loss of information.


Face database for facial recognition algorithms

Image data, for instance, is another example of high-dimensional data which can usually be reduced using something like Principal Component Analysis (PCA) to a much lower number of dimensions. Facial recognition algorithms leverage this fact extensively! Looking at the image at right, it is easy to see many commonalities among the images of faces; the information needed to discern one face from another is not within the similarities, but of course the differences. Imagine removing the “similar elements” of all the faces: this would remove a considerable amount of the data dimensionality, and the remaining “differences” are where the true discriminatory information is found.

PCR and PLS essentially do this — allow you to throw away the non-informative dimensions of data and perform regression modeling on only the informative bits.

I’ll leave the details about the differences between PCR and PLS alone for now, except to say that PCR is based on an unsupervised technique (PCA), whereas PLS is inherently a supervised learning technique through-and-through. Both define components (linear combinations of the original data) and then allow you to perform OLS on a subset of components instead of the original data.
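To make this concrete, here is a rough scikit-learn sketch of both approaches on synthetic data with \(p > n\). This is not course code; the choice of \(k=5\) components is arbitrary and would normally be tuned by cross-validation.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
n, p, k = 60, 200, 5                      # p > n, so plain OLS would fail
X = rng.normal(size=(n, p))
y = X[:, :3].sum(axis=1) + rng.normal(scale=0.1, size=n)

# PCR: unsupervised dimension reduction (PCA) followed by OLS on k components
pcr = make_pipeline(StandardScaler(), PCA(n_components=k), LinearRegression())
pcr.fit(X, y)

# PLS: components chosen with the response in mind (supervised)
pls = PLSRegression(n_components=k)
pls.fit(X, y)

print("PCR R^2:", round(pcr.score(X, y), 3))
print("PLS R^2:", round(pls.score(X, y), 3))
```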

Feature Selection

One possibility for sub-selecting predictors for an OLS model is to use stepwise regression. Stepwise regression (which can be forward, backward, or bi-directional) is a greedy technique in which potential predictors are added (or removed) one at a time from a candidate OLS model by evaluating the impact of adding (or removing) each variable individually and choosing the one that improves performance the most. Maybe “performance” is the wrong word here, but let me use it for now. One traditional technique is to add (or remove) variables based on their associated statistical \(p\)-values, e.g., remove a variable if its \(p \ge 0.5\). I should note that while this is commonly employed (e.g., it is the default stepwise method used in SAS), there is some reasonable controversy with this approach. It is sometimes called \(p\)-fishing, as in, “there might not be anything of value in the data, but I am going to fish around until I find something anyway.” This is not a flattering term. If you do choose to perform stepwise regression, a less controversial approach would be to use either AIC or BIC scores as the model performance metric at each step.
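For illustration only, here is a toy forward-stepwise loop driven by AIC. It is my own quick sketch using statsmodels, not a canned routine, and the data are made up.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_stepwise_aic(X, y):
    """Greedy forward selection: at each step add the predictor that lowers AIC most."""
    selected, remaining = [], list(X.columns)
    best_aic = sm.OLS(y, np.ones(len(y))).fit().aic   # intercept-only model
    improved = True
    while improved and remaining:
        improved = False
        aics = {c: sm.OLS(y, sm.add_constant(X[selected + [c]])).fit().aic
                for c in remaining}
        best_col = min(aics, key=aics.get)
        if aics[best_col] < best_aic:
            best_aic = aics[best_col]
            selected.append(best_col)
            remaining.remove(best_col)
            improved = True
    return selected, best_aic

# toy data: only x1 and x2 actually matter
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(100, 6)), columns=[f"x{i}" for i in range(1, 7)])
y = 2 * X["x1"] - 3 * X["x2"] + rng.normal(scale=0.5, size=100)
print(forward_stepwise_aic(X, y))
```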

Lasso: a penalized regression approach

However, there are variations of OLS (which, again, are based on modifications to the objective function in Equation (2) from the first post in the series) that result in an automatic selection of a subset of the candidate predictors.

Here is Equation (2) again:

$$\text{(Equation 2 again)} \ \ \ \text{minimize } \sum_{i=1}^n (y_i - \hat{y}_i)^2 $$

The first technique that we will discuss is called the least absolute shrinkage and selection operator (lasso).  Lasso adds a penalty function to Equation 2 based on the magnitude of the regression coefficients as shown in Equation 3.

$$\text{(Equation 3)} \ \ \ \text{minimize } \sum_{i=1}^n (y_i - \hat{y}_i)^2  + \lambda \sum_{j=0}^{p} |\beta_j |$$

The penalty is the sum of the absolute values of all the regression coefficients scaled by some value \(\lambda > 0\).  The larger the value of \(\lambda\), the larger the potential penalty.  As \(\lambda\) increases, it makes sense with respect to Equation (3) to set more and more regression coefficients to 0.  This effectively removes them from the regression model.  Since this removal is based on balancing the sum of residuals with the penalty, the predictors which are not as important to the first part of the lasso objective are the ones that are eliminated.  Voilà, feature selection!

You might ask, why place a penalty on the magnitude of the beta values; doesn’t that artificially impact the true fit and interpretation of your model? Well, this is a good question. The lasso definitely “shrinks” the regression coefficients; however, this does not necessarily mean that the shrinkage is a departure from the true model. If the predictors in the regression model are correlated (i.e., some form of multicollinearity exists), then the magnitudes of the regression coefficients will be artificially inflated (and the “meaning” of the beta values may be totally lost). The shrinkage operator in lasso (and other techniques, e.g., ridge regression) tackles this directly. It is possible that lasso puts too much downward pressure on the coefficient magnitudes, but that is not necessarily the case.
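Here is a quick scikit-learn sketch of the lasso on synthetic data; note that scikit-learn calls the penalty weight alpha rather than \(\lambda\), and the data are made up just to show the automatic feature selection.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n, p = 100, 50
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=n)  # only 2 real signals

# LassoCV tunes the penalty (alpha) by cross-validation
model = LassoCV(cv=5).fit(StandardScaler().fit_transform(X), y)
print("chosen alpha:", round(model.alpha_, 4))
print("nonzero coefficients:", np.sum(model.coef_ != 0), "out of", p)
```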

Ridge regression: another penalized regression approach

I need to confess that ridge regression is not a method with automatic feature selection. However, since it is so closely related to lasso, I decided to throw it in really quickly so I don’t have to write another blog post just for this little guy. Here’s the equation; see if you can spot the difference from Equation (3)!

$$\text{(Equation 4)} \ \ \ \text{minimize } \sum_{i=1}^n (y_i - \hat{y}_i)^2  + \lambda \sum_{j=0}^{p} \beta_j^2 $$

That’s right: the only difference between Equation (3) and Equation (4) is that the penalty is based on absolute values in one and on squares in the other. The idea is the same, except for one very interesting difference: the regression coefficients in ridge regression are never forced to 0. They get smaller and smaller, but unlike lasso, no features are eliminated. Ridge regression, however, often turns out to produce better predictions than OLS. For both lasso and ridge regression, the value of \(\lambda\) is determined by using cross-validation methods to tune the parameter to the best value for predictions. If \(\lambda = 0\), then both of these methods give the exact same result as OLS. If \(\lambda > 0\) (as determined by the so-called hyper-parameter tuning), then the penalized regression techniques often outperform OLS.
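And a matching ridge sketch, with the same caveats: synthetic data, and the penalty grid is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 20))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

# RidgeCV picks the penalty from a grid via cross-validation; coefficients shrink
# toward zero but are never set exactly to zero (no feature selection)
ridge = RidgeCV(alphas=np.logspace(-3, 3, 25)).fit(StandardScaler().fit_transform(X), y)
print("chosen alpha:", ridge.alpha_)
print("smallest |coefficient|:", np.abs(ridge.coef_).min())   # small, but not 0
```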

Elastic net regularization: yet, again, a penalized regression approach

The other reason that I wanted to introduce ridge regression is that it is a great segue into my favorite of the penalized techniques, elastic net regularization or just elastic net for short.  The elastic net approach combines both penalties from lasso and ridge regression in an attempt to get at the best of both worlds: the feature selection element of lasso and the predictive performance of ridge regression.

$$\text{(Equation 5)} \ \ \ \text{minimize } \sum_{i=1}^n (y_i - \hat{y}_i)^2  + \lambda_1 \sum_{j=0}^{p} |\beta_j | + \lambda_2 \sum_{j=0}^{p} \beta_j^2$$

Oftentimes the relationship between \(\lambda_1 >0 \) and \(\lambda_2 >0\) is such that \(\lambda_1 + \lambda_2 = 1\). In this case, if \(\lambda_1 = 1\), then the elastic net gives you the same result as lasso, and if \(\lambda_2 = 1\), then the result is equivalent to ridge regression. However, many times the result from hyper-parameter tuning is that \(\lambda_1 < 1\) and \(\lambda_2 < 1\), implying that, yes, some hybridization of the lasso and ridge regression approaches produces the best cross-validated results.

In the case when we require \(\lambda_1 + \lambda_2 = 1\), Equation (5) can be rewritten as follows to simplify to only one parameter:

$$\text{(Modified equation 5)} \ \ \ \text{minimize } \sum_{i=1}^n (y_i - \hat{y}_i)^2  + \lambda \sum_{j=0}^{p} |\beta_j | + (1-\lambda) \sum_{j=0}^{p} \beta_j^2$$
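In scikit-learn the elastic net is parameterized with an overall penalty strength (alpha) and a mixing weight (l1_ratio) that plays roughly the role of \(\lambda\) in the modified Equation (5). A rough sketch with synthetic data:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
n, p = 120, 40
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] + 2 * X[:, 1] - X[:, 2] + rng.normal(scale=0.5, size=n)

# l1_ratio mixes the lasso and ridge penalties (1.0 = pure lasso, 0.0 = pure ridge);
# ElasticNetCV tunes both the mixing weight and the overall penalty strength
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1.0], cv=5)
enet.fit(StandardScaler().fit_transform(X), y)
print("chosen l1_ratio:", enet.l1_ratio_, " chosen alpha:", round(enet.alpha_, 4))
print("nonzero coefficients:", np.sum(enet.coef_ != 0), "out of", p)
```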

I have mentioned “hyper-parameter tuning” a couple of times already. Without going into the details of the cross-validation, let me simply say that all hyper-parameter tuning means is that you try out a whole bunch of values for your parameter (e.g., \(\lambda\)) until you find the values that work best. Take a look at the lasso and elastic net paths figure. In this figure the values of the coefficients are on the y-axis (each color of line represents a different predictor) and the value of the penalty is represented on the x-axis (actually, the log of the penalty is represented here). As the penalty value decreases (moving right along the x-axis), the values of the coefficients increase for both lasso (solid lines) and elastic net (dashed lines). So you can see that as you “tune” the value of \(\lambda\), an infinite number of models are possible! When the value becomes large enough (starting at the right and then moving to the left along the x-axis), some of the coefficients are forced to 0 by lasso and by elastic net. The lasso seems to do this quicker than elastic net, as demonstrated by the solid blue and dashed blue lines: after a very small increase in \(\lambda\) (at about \(\log \lambda \sim 90\)) the variable’s coefficient is set to 0 by lasso, whereas the penalty has to increase such that \(\log \lambda \sim 55\) before elastic net sets the same variable’s regression coefficient to 0.

Lasso (solid) and elastic net (dashed) coefficient paths
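If you want to produce a figure like this for your own data, scikit-learn exposes the full coefficient paths directly; here is a rough sketch (synthetic data again, so the paths will not match the figure above).

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path, enet_path

rng = np.random.default_rng(5)
X = rng.normal(size=(80, 10))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.5, size=80)

# coefficient values for a whole sequence of penalty strengths
alphas_l, coefs_l, _ = lasso_path(X, y)
alphas_e, coefs_e, _ = enet_path(X, y, l1_ratio=0.5)

plt.plot(-np.log10(alphas_l), coefs_l.T, linestyle="-")    # lasso: solid
plt.plot(-np.log10(alphas_e), coefs_e.T, linestyle="--")   # elastic net: dashed
plt.xlabel("-log10(penalty)")
plt.ylabel("coefficient value")
plt.show()
```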

Now, there are other regression techniques that also include automatic feature selection; e.g., multivariate adaptive regression splines (MARS) essentially use a forward step-wise procedure to add terms to the regression model, and regression trees choose features one at a time to add to a model to produce prediction estimates. (These two seemingly different techniques actually have quite a lot in common!) However, I will introduce the first of these later as a technique for both feature selection and feature construction. Our next post deals more generally with how our regression approach can deal with non-linearities in our model assumption.

 

Thoughts on regression techniques (part deux)

We left off discussion in the last post on regression techniques with the statement that there are some known issues relating to Ordinary Least Squares (OLS) regression techniques.  My goal in this series of posts is to introduce several variations on OLS that address one or more of these drawbacks.  The first issue I will mention is related to making regression more robust in the presence of outliers.  This is accomplished through what is known, straightforwardly, as robust regression.

Outliers and robust regression

OLS regression is sensitive to outliers. To see this look at the objective function again (listed as Equation 2 in the first post of the series):

$$\text{OLS objective function:} \ \ \ \text{minimize } \sum_{i=1}^n (y_i - \hat{y}_i)^2 $$

The procedure is highly incentivized to minimize the residuals. That is, the square term implies a very large penalty if the predictions are wrong. This sounds fine, but imagine there exists a single point in your data set that simply does not fit the true model well, i.e., the value for \(y\) does not follow the linear assumption in Equation 1 (from the previous post) at all. The OLS fit will be greatly affected by such an observation.

Let’s put some numbers to this to make a quick example. Assume that without this outlying point, the residuals associated with your \(n=100\) data points are all relatively small, for instance, \(-0.1 \le y_i - \hat{y}_i \le 0.1\). The sum of the squared residuals is then at most equal to 1. Now, pick one of the 100 points and say that its residual is 10. The sum of squared errors would jump to about 101, roughly a hundredfold increase! To mitigate this possibility, the OLS fit would not recover the true model, but would adjust the fit so as to keep the sum of the squared residuals to something more reasonable. It decreases the residual of this one outlying point by allowing larger residuals for the remaining 99 data points. So now, instead of a good fit for 99% of the data with only one point that doesn’t predict well, we have a model that is a poor fit for 100% of the data!
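A few lines of NumPy make that arithmetic concrete (toy numbers chosen to mirror the example above):

```python
import numpy as np

residuals = np.full(100, 0.1)          # 100 well-fit points, |residual| = 0.1
print(np.sum(residuals**2))            # 1.0  -- the sum of squared residuals

residuals[0] = 10.0                    # one point becomes an outlier
print(np.sum(residuals**2))            # 100.99 -- dominated by a single point
```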

To mitigate this sensitivity to outliers, we can use robust regression. The key is to change the objective function so that the penalty for large residuals is not so dramatic. Let \(\rho(r_i)\) represent the penalty function (as a function of the residual for observation \(i\)).  Three such common penalty functions are:

\begin{align}
\text{Least absolute value (LAV)} \ \  \rho_\text{lav}(r_i) &= |r_i| \\
\text{Huber} \ \  \rho_\text{huber}(r_i) &= \begin{cases} \frac{1}{2} r_i^2 \ & \text{ if } |r_i| \le c \\ c|r_i| -\frac{1}{2}c^2 \ & \text{ if } |r_i|> c\end{cases}\\
\text{Bisquare} \ \    \rho_\text{bisquare}(r_i) &= \begin{cases} \frac{c^2}{6} \left( 1 - \left( 1 - \left(\frac{r_i}{c}\right)^2 \right)^3 \right) \ & \text{ if } |r_i| \le c \\ \frac{c^2}{6} \ & \text{ if } |r_i|> c\end{cases}
\end{align}

While I won’t discuss the solution techniques for solving the associated minimization problems here in detail, I will mention that these changes directly impact the computational efficiency of the regression. The LAV problem can be solved using linear programming techniques. The other two approaches (Huber and Bisquare) are actually part of a family of such objective function modifications known as M-estimation.
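For concreteness, here are the three penalty functions transcribed into Python; the tuning constant \(c\) is left as a parameter, and the defaults shown are commonly cited values rather than anything specific to this post.

```python
import numpy as np

def rho_lav(r):
    """Least absolute value penalty."""
    return np.abs(r)

def rho_huber(r, c=1.345):
    """Huber penalty: quadratic for small residuals, linear beyond c."""
    return np.where(np.abs(r) <= c, 0.5 * r**2, c * np.abs(r) - 0.5 * c**2)

def rho_bisquare(r, c=4.685):
    """Tukey bisquare penalty: bounded, so huge residuals add no extra penalty."""
    return np.where(np.abs(r) <= c,
                    (c**2 / 6) * (1 - (1 - (r / c)**2)**3),
                    c**2 / 6)

r = np.linspace(-10, 10, 5)
print(rho_lav(r), rho_huber(r), rho_bisquare(r), sep="\n")
```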

Pvt. Joe Bowers: What are these electrolytes? Do you even know?

Secretary of State: They’re… what they use to make Brawndo!

Pvt. Joe Bowers: But why do they use them to make Brawndo?

Secretary of Defense: Because Brawndo’s got electrolytes.

-Circular reasoning from the movie Idiocracy

M-estimation techniques are a class of robust regression methods which change the objective function to be less sensitive to outliers (there are other objectives besides Huber and Bisquare). M-estimation problems usually determine the residual penalty value based on the fit, which is based on the beta values, which in turn are based on the residual values… but wait, we don’t know the residuals until we have the beta values, which are based on the residuals, which are based on the beta values! Oh no, it’s circular reasoning!

This paradox is resolved by performing iteratively reweighted least squares (IRLS) regression. It turns out that the effect of a residual-based penalty is equivalent to allowing different weights for each observation (based on the residual value of that observation). To address the circular logic, IRLS solves the regression problem multiple times, with the observations weighted differently each time. The weights for all observations start equal to 1 (the same as regular old OLS), and then the observations which do not fit very well are assigned lower weights, reducing their impact on the regression fit. The observations that fit well will have higher weights. After the weights are assigned, the process is repeated, and the weights are adjusted again, and again, and again, until convergence. This ultimately reduces the impact of outlying observations.
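In practice you rarely code IRLS yourself. As a rough illustration, statsmodels’ RLM runs the iteration for you; the data below are made up, with a few planted outliers, and are not the data from the figures in this post.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 100)
y = 2 + 0.5 * x + rng.normal(scale=0.3, size=100)
y[:5] = 12.0                                  # plant a few outliers

X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()                  # pulled toward the outliers
rlm_fit = sm.RLM(y, X, M=sm.robust.norms.TukeyBiweight()).fit()  # IRLS under the hood

print("OLS slope:   ", round(ols_fit.params[1], 3))
print("Robust slope:", round(rlm_fit.params[1], 3))
```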

The images here represent a simple data set in which we have a few outliers (in the upper left corner). The OLS model fit produces the red line, while the M-estimation procedure using Tukey’s bisquare penalty produces the blue regression line. As you can see, the slope of the red line is impacted by the outlying points, but the blue line is not; its fit is in fact based on all points except the outliers.


OLS on the left; Robust Regression on the right

In a later post in this series we will look at another regression-like approach that is robust to outliers (Support Vector Machine regression), however since this technique is also excellent at dealing with non-linearity in the data, I’ll postpone that topic for a bit.  Next we will discuss a common issue in predictive modeling — dealing with high dimensionality.

Thoughts on regression techniques (part 1)

I just completed this semester’s series of lectures on regression methods in ISE/DSA 5103 Intelligent Data Analytics and I wanted to take a moment to call out a few key points.

First, let me list the primary set of techniques that we covered along with links to the associated methods and packages in R:

While I do not intend to rehash everything we covered in class (e.g., residual diagnostics, leverage, hat-values, performance evaluation, multicollinearity, interpretation, variance inflation, derivations, algorithms, etc.), I wanted to point out a few key things.

Ordinary Least Squares Regression

OLS multiple linear regression is the workhorse of predictive modeling for continuous response variables.  Not only is it a powerful technique that is commonly used across multiple fields and industries, it is very easy to build and test, the results are easy to interpret, the ability to perform OLS is ubiquitous in statistical software, and it is a computationally efficient procedure.  Furthermore, it serves as the foundation for several other techniques.  Therefore, learning OLS for multivariate situations is a fundamental element in starting predictive modeling.

In order to make any of this make sense, let me introduce some very brief notation.  We will assume that we have \(n\) observations of data which are comprised of a single real-valued outcome (a.k.a. response or dependent variable or target), which I will denote as \(y\), and \(p\) predictors (a.k.a. independent variables or features or inputs) denoted as \(x_1, x_2,\ldots, x_p\).  These predictors can be continuous or binary.  (Note: nominal variables are perfectly fine, however, to be used in OLS they need to be transformed into one or more “dummy variables” which are each binary.)  For the \(y\)-intercept and for each predictor there is a regression coefficient: \(\beta_0, \beta_1, \ldots, \beta_p \).  The assumption in OLS is that the true underlying relationship between the response and the input variables is:

$$\text{(Equation 1)} \ \ \ y = \beta_0 + \sum_{j=1}^p \beta_j x_j + \epsilon $$

where \(\epsilon\) represents a normally distributed error term with mean equal to 0. When the OLS model is fit, the values for \(\beta_0, \beta_1, \ldots, \beta_p \) are estimated. Let \(\hat{y}\) denote the estimates for \(y\) after the model is fit and the values \(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p \) denote the estimates for the regression coefficients. The objective of OLS is to minimize the sum of the squares of the residuals, where the residuals are defined as \(y_i - \hat{y}_i\) for all \(i = 1 \ldots n\). That is,

$$\text{(Equation 2)} \ \ \ \text{minimize } \sum_{i=1}^n (y_i - \hat{y}_i)^2 $$

Let me just take a moment here to say that most of the OLS variants that I’ll summarize in this series of posts are motivated by simple modifications to the objective function in Equation (2)!
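To make the notation concrete, here is a minimal OLS fit in Python using statsmodels; the data and true coefficients are made up purely for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -2.0, 0.5])
y = 4.0 + X @ beta_true + rng.normal(scale=1.0, size=n)   # Equation (1)

fit = sm.OLS(y, sm.add_constant(X)).fit()   # minimizes the sum of squared residuals
print(fit.params)          # estimates of beta_0, beta_1, ..., beta_p
print(fit.rsquared)
```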

OLS is a linear technique, but feature engineering allows a modeler to introduce non-linear effects. That is, if you believe the relationship between the response and the predictor is: $$y = f(x) = x + x^2 + \sin(x) + \epsilon$$ then simply create two new variables with these transformations (this is called feature construction). Let \(x_1 = x, x_2 = x^2, \text{ and } x_3 = \sin(x)\); your estimated OLS model is then linear with respect to the new variables. That is,

\begin{align}
\hat{y} & = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3  \\
& = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 x^2 + \hat{\beta}_3 \sin(x)
\end{align}

For example, if you simply fit an OLS model without transformations, simply y ~ x, then you get the following predictions: the blue dots represent the output of the model, whereas the black dots represent the actual data.

However, if you transform your variables then you can get a very good fit:
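Here is a rough sketch of that difference on synthetic data with scikit-learn: the first model regresses \(y\) on \(x\) alone, the second on the constructed features \(x\), \(x^2\), and \(\sin(x)\).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(8)
x = rng.uniform(-5, 5, 300)
y = x + x**2 + np.sin(x) + rng.normal(scale=0.3, size=300)

# y ~ x only: a straight line, a poor fit to the curved data
plain = LinearRegression().fit(x.reshape(-1, 1), y)

# y ~ x + x^2 + sin(x): still linear in the constructed features
X_feat = np.column_stack([x, x**2, np.sin(x)])
constructed = LinearRegression().fit(X_feat, y)

print("R^2 with x only:         ", round(plain.score(x.reshape(-1, 1), y), 3))
print("R^2 with constructed set:", round(constructed.score(X_feat, y), 3))
```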

While OLS has various assumptions that ideally should be met in order to proceed with modeling, the predictive performance is insensitive to many of these. For instance, ideally the input variables should be independent; however, even if there are relatively highly collinear predictors, the predictive ability of OLS is not greatly impacted (the interpretation of the coefficients, however, is greatly affected!).

However, there are some notable difficulties and problems with OLS.  We will discuss some of these in the next few posts!