Author Archives: Charles Nicholson

Sr. Data Analyst Position – Open in Plano, TX

JCPenney is hiring Sr. Data Analyst

A friend of mine who works at the JCPenney HQ in Plano, TX just sent me a new job posting — she would love to hire an OU DSA student!  See below for the job description and let me know if you are interested!  Please note that JCPenney is not offering visa sponsorship for this position.

Job posting

JCPenney is one of the nation's largest apparel and home furnishing retailers with more than 1,000 stores. We are a diverse community of people, all working together to bring sensational style, sensible prices and the best service possible to our customers. We're looking for talented individuals who want to work in an energetic, respectful, collaborative environment. With a wide array of jobs, internships, training and more, there are countless opportunities for you to grow your career with us.

JCPenney is looking for an experienced data analyst who is eager to learn, to add value, and to do interesting work as a valued member of the Customer Strategy team. This position is data-intensive and will involve use of SQL and SAS software tools to pull data for analysis and reporting purposes. Insights produced by this team inform business decisions in Marketing and beyond, including those by senior executive leaders.

Primary Responsibilities:

  • Facilitate the definition of analysis needs and work product requirements of internal clients
  • Translate client needs and requirements into specific data, logic and reporting requirements and realistic work plans
  • Understand and have a working knowledge of customer/transactional level data
  • Strive to structure analysis to provide conclusive insights that directly align to decision-making
  • Prioritize and balance multiple activities in parallel and communicate status proactively to manage stakeholder expectations
  • Understand data sources to determine the correct source(s) and logic to ensure accurate, efficient and timely deliverables
  • Build, run and automate data queries, analysis and reports
  • Speak out when business strategies do not align with data insights and when insights suggest new marketing tactics
  • Identify and log data issues and work with department, IT and vendor teammates to understand and resolve them
  • Proactively seek help from and offer help to JCP teammates to accelerate skill development, business understanding and overall goal achievement
  • Anticipate future insight needs/opportunities and deliver self-initiated value to JCP

Core Competencies & Accomplishments:

  • College graduate with 3+ years of experience
  • At least 2 years' experience using databases, SQL (structured query language), and SAS
  • Ability to combine, cleanse and harmonize data for descriptive and predictive analytics
  • Strong math, computer and problem-solving skills, including MS Excel
  • Structured thinker and high attention to detail
  • Strong teamwork, communication and interpersonal skills
  • Desire to consistently meet and exceed stakeholder expectations
  • Desire to acquire new technical skills (e.g., R, Hive, Tableau, Datameer) and business knowledge

Welcome to Fall 2017!

Farewell to Summer

I hope everyone had a great summer and is enjoying the beginning of the Fall 2017 classes.  I've been here most of the summer, and wow, it is great to have the students back — the peace and quiet are nice for a while, but the campus really comes alive in the Fall.

My summer included a trip to Disney with the family, a solo climb of two 14,000+ foot mountains in Colorado (Blanca Peak and Ellingwood Point), and a trip to Austria for the ICOSSAR 2017 conference.

Welcome to Fall 2017 courses

My classes begin on Tuesday, 8/22 — both DSA/ISE 5103 and ISE 4113.

The DSA/ISE 5103 Intelligent Data Analytics graduate course is one which I think is core to data science.  In it we will study and practice how to deal with real-world, data-intensive problems.  The topics include lots of data work and some great modeling techniques/applications such as dimension reduction, facial recognition, linear and logistic regression, LASSO, elastic net, support vector machines, MARS, decision trees, random forests, boosted trees, neural networks, and clustering.  You will use the powerful open-source statistical programming language R and work on hands-on, applied data analysis projects.  No previous R experience is required.  That said, I will expect you to work hard to learn the tool!  This course is being offered both online and on-campus.

In the ISE 4113 undergraduate course, we will be delving into the nitty-gritty of MS Excel to build spreadsheet-based decision support systems.  Excel is essentially ubiquitous in industry, and mastery of it is critical!  We will go way beyond simple formulas and the basic usage of the tool and delve into optimization modeling, simulation, and even Visual Basic for Applications (VBA) programming.  This is a big class, but fortunately we have an excellent TA supporting it.

DSA club?

One piece of great news is that it looks like there is some interest in starting a Data Science and Analytics club.  I will have more news about this later in the semester, but if you are interested in joining such a club, please feel free to email me!

I  look forward to meeting and getting to know you all this semester!

Charles Nicholson


Congratulations to three new Masters!

Congratulations Alexandra, Emily, and Megan — new Masters of Science!


Alexandra Amidon (left), Emily Grimes (center), and Megan Snelling (right) have all successfully defended their Master's theses this Spring 2017 and are the three newest Masters from the Analytics Lab @ OU.

Alexandra and Emily completed their Master of Science degrees in Data Science and Analytics from the Gallogly College of Engineering.  The DSA program is a joint effort between the School of Industrial & Systems Engineering and the School of Computer Science. Megan completed her Master of Science from the School of Industrial & Systems Engineering.

I’ll start with Megan since she is the lone ISE in this group of three.

Megan's work is entitled "MODEL FOR MITIGATING ECONOMIC AND SOCIAL DISASTER DAMAGE THROUGH STRUCTURAL REINFORCEMENT" and is a continuation of previous work completed as a part of the NIST-funded Center of Excellence on Risk-Based Community Resilience Planning and the CORE Lab @ OU.

Abstract: Natural disasters have severe negative short-term consequences for community structures and inhabitants, as well as long-term impacts on economic growth. In response to the rising costs and magnitude of such disasters to communities, a characteristic of modern community development is the aspiration towards resilience. An effective and well-studied mitigation measure, structural interventions reduce the value lost in buildings in earthquake scenarios. Both structural loss and socioeconomic characteristics are indicators for whether a household will dislocate from their residence. Therefore, this social vulnerability can be mitigated by structural interventions and should be minimized, as it is also an indicator of indirect economic loss. This research presents a model for mitigating direct economic loss and population dislocation through decisions regarding the selection of community structures to retrofit to higher code levels. In particular, the model allows for detailed analysis of the tradeoffs between budget, direct economic loss, population dislocation, and the disparity of dislocation across socioeconomic classes given a heterogeneous residential and commercial structure set. The mathematical model is informed by extensive earthquake simulation as well as recent dislocation modeling from the field of social science. The non-dominated sorting genetic algorithm II (NSGA-II) is adapted to solve the model, as the dislocation model component is non-linear. Use of the mitigation model is demonstrated through a case study using Centerville, a test bed community designed by a multidisciplinary team of experts.  Details of the retrofit strategies are interpreted from the estimated Pareto front.

We should also offer congratulations to Megan on another account: she is getting married soon and plans to spend her summer hiking through Europe!

Alexandra and Emily both worked on projects related to T.U.G. (The Untitled Game), which were partially funded by Nerd Kingdom.


Abstract: Predictive algorithms applied to streaming data sources are often trained sequentially by updating the model weights after each new data point arrives. When disruptions or changes in the data generating process occur (“concept drifts”), the online learning process allows the algorithm to slowly learn the changes; however, there may be a period of time after concept drift during which the predictive algorithm underperforms. This thesis introduces a method that makes online neural network classifiers more resilient to these concept drifts by utilizing data about concept drift to update neural network parameters.

Alexandra has accepted a position with MSCI, a leading provider of investment decision support tools worldwide, as a Reference Data Production Analyst.  She will be using her skills in machine learning to continue developing new tools for anomaly detection.


Abstract: Player engagement is a concept that is both vital to the online gaming industry and difficult to define. Typically, engagement is defined using social science methodologies such as observing, surveying, and interviewing players. With the vast amount of data being collected from video games as well as user bases increasing in size, it is worthwhile to investigate whether or not user engagement can be defined and interpolated from data alone. This study develops a methodology for defining engagement using analytic methods in order to approach the question of whether gathering (as a proxy for social interaction) in sandbox games has an effect on player engagement.

Emily is following up on leads for a full-time position now, but in the meantime she has a road trip planned to the Grand Canyon, Sequoia National Park, and Big Sur in California.  She is also in discussions with KGOU and NPR about starting a new radio program!

Congratulations to all three excellent students!  We wish you great success!

Emily Grimes, MS DSA, May 2017

Megan Snelling, MS ISE, May 2017

OU Industrial & Systems Engineering and Data Science & Analytics

Public Webinar Announcement: Center for Risk-Based Community Resilience Planning

Public Webinar Announcement — Community Resilience: Modeling, Field Studies and Implementation

Learn more about NIST-funded Center for Risk-Based Community Resilience Planning and how the Center is developing a computational environment to help define the attributes that make communities resilient.

WEBINAR: Thursday, April 27, 10:00 a.m. – 12:00 p.m. (CDT)  

The webinar is open to anyone and will be immediately followed by a Q&A "chat" period.

A Resilient Community is one that is prepared for and can adapt to changing conditions and can withstand and recover rapidly from disruptions to its physical and social infrastructure.  Modeling community resilience comprehensively requires a concerted effort by experts in engineering, social sciences, and information sciences to explain how physical, economic and social infrastructure systems within a real community interact and affect recovery efforts.

Join this informational WEBINAR to learn more about the Center's recent activities.

A Center overview will be followed by a session on the Center's recent Special Issue of Sustainable and Resilient Infrastructure, which features six papers on the virtual community Centerville.  The modeling and analysis theory behind each paper will be explained, followed by a demonstration of IN-CORE, the Interdependent Connected Modeling Environment for Community Resilience.  Presentations on the first validation study, the Joplin Hindcast, and the Center's first field study, the 2016 Lumberton, NC floods, will also be a highlight of the webinar.

No registration is required this time, just click, watch, and chat.

Both Dr. Nicholson and Dr. Wang will be giving presentations during the webinar.

Flier for distribution: Webinar Flier 27-April-2017

Postdoctoral Research Fellow Position in Community Resilience

Prof. Charles Nicholson is currently accepting applications for a postdoctoral research fellow position in Community Resilience within the School of Industrial and Systems Engineering at the University of Oklahoma.

The primary area of research is with respect to the following broad objective:

Enhance community resilience to natural and man-made disasters through modeling, optimization, and risk-informed decision making with respect to vital, large-scale, interdependent civil infrastructure and socio-economic systems.

Researchers with backgrounds and interests in one or more of the following areas are encouraged to apply:

  • Optimization: network flow optimization, multi-objective optimization, stochastic optimization, and stochastic programming
  • Data science and analytics: including machine learning for predictive and classification modeling as well as unsupervised and semi-supervised learning
  • Decision modeling for community and regional resilience planning

The postdoctoral research fellow will embark on an exciting and innovative research program within a well-established and active multidisciplinary research group with collaboration opportunities across the United States.  In this role, you will also supervise one or more PhD students.  Experience with tools such as Python or R is highly preferred.  Familiarity with Civil Infrastructure systems and/or economic modeling is a plus.  The position will be supported by funded research projects with multi-year durations.

Interested applicants: please send a one-page statement of research interests and a CV to cnicholson @ OU (dot) edu.

Total Chaos: Soccer, ISE, and Old People


Total Chaos

In Fall 2017 I decided that it was time to start working on my bucket list, item #117: play an actual game of soccer.  There are other items on my bucket list too, but I figured I had better try this one soon since I am not getting any younger.  This is the impetus for my new team: Total Chaos.

I've coached soccer for 3 years (my daughter's team) with the Norman Youth Soccer Association (NYSA).  When I started, I had very little understanding of the game.  I knew that most of the players were not supposed to use their hands, but any rules other than that were a bit vague…

Anyway, I've wanted to play soccer for years, but starting out as a complete newbie in such a demanding and skilled sport as fútbol over the age of 40, well, it was somewhat of a daunting thing to do.  The options were: (1) try to join an existing team and then ultimately disappoint all of the other players with my complete lack of skill or… (2) start my own team from scratch with the understanding that (a) everyone is welcome — even newbies and old people — and (b) we will not likely win.  That is, set expectations low: so low, in fact, that no one has a right to be disappointed with any outcome!  I opted for the latter.  NYSA has an adult league, and thus I started recruiting for my new team…

To make a long story short, the response to my invitation — "do you want to play soccer in a league even if we have no chance of winning any games?" — was a resounding yes.  Soccer moms and dads, friends, OU faculty, and both grad and undergrad students in ISE for some reason found the idea appealing.  My wife, who, like me, has never played the sport in her life, even joined up.  Thankfully, not everyone who answered the call was a complete newbie, because several of us needed teachers!

The student becomes the teacher…

In this case, literally "the students become the teachers" — Jack, Austin, Leslie, and Brad are all undergrads who took my ISE 4113 course in Fall 2017, and now they have their work cut out for them trying to teach me what to do on the field. Joining them we also have Darin, Nicole, Andrew, and Yasser — all PhD or MS students in either ISE or DSA.

Now, while our defense is not this bad:

Without Jack Appleyard leading the defense, it could be much worse (I'm on defense, you see — which does not give Jack much to work with!), so he is almost a one-man team in the backfield — saving our collective butts more than once and keeping it from being the total chaos it would've been otherwise!

Brad "the slide tackle ninja" Osborn, Austin "the king of awesome" Shaw, and Leslie "the beast" Barnes head up the midfield and offense and simply rock the pitch…

Darin Chambers — who happens to also be a political candidate running for State Representative District 46 — is a fellow soccer dad and a great teammate and leader.  Yasser, a PhD student in ISE, has both published research with me and taught me how to defend and pass.  Nicole and Andrew, both new to the game, are simply fearless.  Pravin, who is going up for tenure at OU at the same time as me, has stepped up to play keeper after our first keeper was injured.  Finally, Everton, Omar, Nery, Marco, Justin, Alicia, and Greg are all new friends.

In summary — we have a great team: a great mix of ages, genders, languages, skills, and backgrounds.  Thanks for helping me mark off an item on my bucket list that I've dreamed about for years.  The team pic below is missing a few players, so I'll update it later, but here we are: Total Chaos.

Finally, despite my hand-balls and/or fouls in the box and/or missed passes and/or bad throw-ins (sorry about all that…) — so far we’ve played two games and won both.

Total Chaos team picture

Left to Right — Back: Everton, Omar, Marco, Justin, Jack, Brad, Austin, Pravin, Greg, Charles, Andrew, Nicole; Front: Yasser, Alicia, Zorelly


Two new resilience publications 2017!

Two new resilience publications!

Well, here at the Analytics Lab @ OU, 2017 started off nicely with two new articles published in the area of community resilience. We are also very excited about finally being able to share the virtual community we created, named "Centerville," as a part of the Center for Risk-Based Community Resilience Planning — the special issue on Centerville has finally been published in Sustainable and Resilient Infrastructure.  Please check out the post on Centerville!

The first of these resilience publications is entitled Resilience-based post-disaster recovery strategies for road-bridge networks, which appears in Structure and Infrastructure Engineering, an international journal which aims to present research and developments on the most advanced technologies for analyzing, predicting and optimizing infrastructure performance.

This paper by Weili Zhang, Naiyu Wang, and myself presents a novel resilience-based framework to optimise the scheduling of the post-disaster recovery actions for road-bridge transportation networks.  This work was supported, in part, by the Center for Risk-Based Community Resilience Planning, National Institute of Standards and Technology (NIST) [Federal Award No. 70NANB15H044].

The methodology systematically incorporates network topology, redundancy, traffic flow, damage level and available resources into the stochastic processes of network post-hazard recovery strategy optimisation. Two metrics are proposed for measuring rapidity and efficiency of the network recovery: total recovery time (TRT) and the skew of the recovery trajectory (SRT).  The SRT is a novel metric designed to capture the characteristics of the recovery trajectory which relate to the efficiency of the restoration strategies.  This is depicted in the figure below.


Depiction of new skew metric for network recovery

Based on this two-dimensional metric, a restoration scheduling method is proposed for optimal post-disaster recovery planning for bridge-road transportation networks. To illustrate the proposed methodology, a genetic algorithm is used to solve the restoration schedule optimisation problem for a hypothetical bridge network with 30 nodes and 37 bridges subjected to a scenario seismic event. A sensitivity study using this network illustrates the impact of the resourcefulness of a community and its time-dependent commitment of resources on the network recovery time and trajectory.

  • Zhang, W., N. Wang, C. Nicholson. 2017. Resilience-based post-disaster recovery strategies for road-bridge networks.  Structure and Infrastructure Engineering, Accepted.  LINK

The next of the resilience publications is a paper appearing in Reliability Engineering & System Safety entitled A multi-criteria decision analysis approach for importance ranking of network components.  This is a joint effort with Yasser Almoghathawi, Kash Barker, and Claudio Rocco.

Reliability Engineering and System Safety is an international journal devoted to the development and application of methods for the enhancement of the safety and reliability of complex technological systems. The journal normally publishes only articles that involve the analysis of substantive problems related to the reliability of complex systems or present techniques and/or theoretical results that have a discernable relationship to the solution of such problems. An important aim is to achieve a balance between academic material and practical applications.

In the study, we propose a new approach to identify the most important network components based on multiple importance measures using a multi-criteria decision-making method, namely the technique for order preference by similarity to ideal solution (TOPSIS), which is able to take into account the preferences of decision-makers. We consider multiple edge-specific flow-based importance measures, which serve as the multiple criteria, while the alternatives are the edges of the network.


Component importance measures may rank elements within a network differently. TOPSIS provides one approach to consider such cases.

Accordingly, TOPSIS is used to rank the edges of the network based on their importance, considering multiple different importance measures. The proposed approach is illustrated on networks of different densities, along with the effects of the weights.
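For intuition only, here is a minimal, generic TOPSIS sketch in R (not the code from the paper): the topsis_rank function, the toy edge scores, the weights, and the assumption that every criterion is benefit-type are all made up for illustration.

# Generic TOPSIS sketch: rank network edges (alternatives) by several
# importance measures (criteria), all assumed benefit-type here.
topsis_rank <- function(X, w) {
  R <- sweep(X, 2, sqrt(colSums(X^2)), "/")   # vector-normalize each criterion
  V <- sweep(R, 2, w, "*")                    # apply decision-maker weights
  ideal <- apply(V, 2, max)                   # positive ideal solution
  anti  <- apply(V, 2, min)                   # negative ideal solution
  d_pos <- sqrt(rowSums(sweep(V, 2, ideal)^2))
  d_neg <- sqrt(rowSums(sweep(V, 2, anti)^2))
  closeness <- d_neg / (d_pos + d_neg)        # higher = closer to the ideal
  rank(-closeness)                            # 1 = most important edge
}

# Toy example: 5 edges scored by 3 hypothetical flow-based importance measures
set.seed(1)
scores <- matrix(runif(15), nrow = 5,
                 dimnames = list(paste0("edge", 1:5), c("m1", "m2", "m3")))
topsis_rank(scores, w = c(0.5, 0.3, 0.2))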

  • Almoghathawi, Y., K. Barker, C.M. Rocco, and C. Nicholson. 2017. A multi-criteria decision analysis approach for importance ranking of network components. Reliability Engineering and System Safety, 158: 142-151 LINK [bibTex]


Centerville Virtual Community Testbed

Centerville Special Issue in Sustainable and Resilient Infrastructure

Enhancing community resilience in the future will require new interdisciplinary, systems-based approaches that depend on many disciplines, including engineering, the social and economic sciences, and information sciences. The National Institute of Standards and Technology awarded the Center for Risk-Based Community Resilience Planning to Colorado State University and nine other universities in 2015 (including the University of Oklahoma!), with the overarching goal of establishing the measurement science for community resilience assessment. To this end, several of the researchers within the Center for Risk-Based Community Resilience Planning have come together to develop the Centerville virtual community.

The Centerville Virtual Community Testbed is aimed at enabling fundamental resilience assessment algorithms to be initiated, developed, and coded in a preliminary form, and tested before the refined measurement methods and supporting data classifications and databases necessary for a more complete assessment have fully matured.  Sustainable and Resilient Infrastructure has published a Special Issue introducing the Centerville Testbed, defining the physical infrastructure within the community, natural hazards to which it is exposed, and the population demographics necessary to assess potential post-disaster impacts on the population, local economy, and public services in detail.

The community has multiple residential and commercial zones with several types of buildings at different code levels.  The population of about 50,000 is diverse with respect to employment and income.  There are multiple public schools and government buildings located throughout the city as well as emergency facilities.  There are a few main roads, a simple highway system, some smaller local roads, and a few important bridges within the transportation system.  The Analytics Lab @ OU currently has a research paper in development relating to the study of transportation systems, and we use the Centerville testbed as one of our cases.

In addition to the buildings and transportation system, Centerville also has a simplified electric power network (EPN) with multiple substation types (transmission, main grid, distribution, sub-distribution), a small power plant, and single-pole transmission lines. The community also includes a basic potable water system with pumps, tanks, reservoirs, a water treatment plant, and an underground piping system.  The maps associated with these infrastructure systems can be found in the Special Issue.

By creating such a detailed virtual community, researchers can have a simplified but somewhat realistic platform for experimentation.  The papers included in the Special Issue cover topics such as multi-objective optimization for retrofit strategies, building portfolio fragility functions, performance assessment of the EPN with respect to tornadoes, and computable general equilibrium (CGE) assessment of the community with respect to disasters.

All papers included in the Special Issue are listed below.

Bruce R. Ellingwood, John W. van de Lindt & Therese P. McAllister
Bruce R. Ellingwood, Harvey Cutler, Paolo Gardoni, Walter Gillis Peacock, John W. van de Lindt & Naiyu Wang
Roberto Guidotti, Hana Chmielewski, Vipin Unnikrishnan, Paolo Gardoni, Therese McAllister & John van de Lindt

Happy Winter Break 2016!

Winter Break 2016

Here at the Analytics Lab @ OU we would like to wish you all a happy holiday season, a wonderful winter break, and a feliz año nuevo!

Also, we have had some nice developments over the last few days and weeks and I would like to mention these briefly.


Congratulations to Param Tripathi for successfully defending his Masters thesis and completing his MS in ISE!

In his thesis entitled "ANALYSIS OF RESILIENCE IN US STOCK MARKETS DURING NATURAL DISASTERS," Param analyzes how major hurricane events impact the stock market in the United States. In particular, he looks at two fundamental elements of resilience, vulnerability and recoverability, as they apply to the Property and Casualty Insurance sector by evaluating price volatility with respect to the overall New York Stock Exchange.  He applies breakout detection to study patterns in the time series data and uses this to quantify key resilience metrics.

Congratulations to Alexandra Amidon for successfully completing her Industry practicum on anomaly detection using a compression algorithm.  Well done!

Good luck!

Next, Weili Zhang, a PhD Candidate in ISE, is actively interviewing with a number of top companies for data science positions. He has passed first- and second-round interviews at several companies and has been invited for face-to-face interviews with Facebook, Google, Ebay, Verizon, Disney, Lyft, Amazon, Target, and several more.  Getting past round one with these companies is already a major accomplishment — rounds two and three get progressively more intense!  Good job and good luck, Weili!

Yay us!

The Analytics Lab team members have had a few papers published or accepted for publication recently:

  • Almoghathawi, Y., K. Barker, C.M. Rocco, and C. Nicholson. 2017. A multi-criteria decision analysis approach for importance ranking of network components. Reliability Engineering and System Safety, 158: 142-151 LINK [bibTex]
  • Nicholson, C., L. Goodwin, and C. Clark. 2016. Variable neighborhood search for reverse engineering of gene regulatory networks.  Journal of Biomedical Informatics, 65:120-131 LINK [bibTex]
  • Zhang, W., N. Wang, and C. Nicholson. 2016. Resilience-based post-disaster recovery strategies for community road-bridge networks. Accepted in Structure and Infrastructure Engineering.

I’ll write up a post about these exciting new research articles soon!

Data Science Interviews

Weili, along with a few other current and former students, has been describing the data science interview processes and questions to me.  My plan is to write up a blog post summarizing some key things for those of you seeking DSA positions to keep in mind and prepare for.  Fortunately, many of the questions/topics asked about are covered in the ISE 5103 Intelligent Data Analytics class offered in the Fall semester.  If you are starting to look for data science jobs now or in the near future, make sure you check out that post!

Life in Industry

Additionally, I have asked several current/former students (Pete Pelter, Leslie Goodwin, Cyril Beyney, Olivia Peret) who are employed in data science positions to provide some descriptions and information about life in industry.  Look for that series of posts soon!

ISE 5113 Advanced Analytics and Metaheuristics

I am currently preparing for the online launch of the ISE 5113 Advanced Analytics and Metaheuristics course to be offered in Spring 2017.  The course will be offered both online and on-campus.  This is one of my favorite courses to teach.  The introductory video should be posted soon!  The course fills up very fast, so if you are a DSA or ISE student make sure to register ASAP if you haven’t already!


I have had an excellent response from so many people this semester that it looks like "yes!" — we will have enough people to start a soccer team in the Spring.  I'll be providing more information next month.  The season starts in March and the official signups are in February ($80 each).  We need a team name and team colors — so I'm open to ideas!  In the meantime, get out there and practice!



I hope everyone has an excellent winter break.  Enjoy your family, enjoy good food, stay warm, practice some soccer, catch up on sleep, maybe study a little bit… and I’ll see you next year!


Thoughts on regression techniques (part iii)

In the first of this series of posts on regression techniques we introduced the work-horse of predictive modeling, ordinary least squares regression (OLS), but concluded with the notion that while it is a common and useful technique, OLS has some notable weaknesses.  In the last post we discussed its sensitivity to outliers and how to deal with that using several "robust regression" techniques.  Now we will discuss regression modeling for high-dimensional data.  Two of the techniques below are based on an idea called "feature extraction" and the last three are based on "penalized regression."

Dimensionality, feature extraction, selection, and penalized regression

An issue that every modeler who has had to deal with high-dimensional data knows well is feature selection.  By high dimensionality I mean that p, the number of predictors, is large. High-dimensional data can cause all sorts of issues — some of them psychological — for instance, it can be very hard to wrap your mind around so many predictors at one time; it can be difficult to figure out where to start, what to analyze, and what to explore; and even visualization can be a beast!  I tried to find some pictures to represent high-dimensional data for this blog post, but of course, by definition, 2D or 3D representations of high-D data are hard to come by.  The parallel plot at least adds some color to the post and represents at least some of the complexity!

When p is large with respect to the number of observations n, then the probability of overfitting is high. Now, let me quickly interject that p may be much greater than the number of raw input variables in the data.  That is, a modeler may have decided to construct several new features from the existing data (e.g., ratios between two or more variables, non-linear transformations, and multi-way interactions). If there is a strong theoretical reason for including a set of variables, then great: build the model and evaluate the results. However, oftentimes a modeler is looking for a good fit and doesn't know which features should be created, and then which combinations of such features should be included in the model. Feature construction deals with the first question (and we will look at that later on); feature selection deals with the second question (which we deal with now).

Actually, first I will digress slightly — in the case when p is equal to or greater than n, OLS will fail miserably.  In this situation, feature selection is not only a good idea, it is necessary!  One example problem type where p > n occurs is microarray data. If you are not familiar with what a microarray is, see the inset quote from the chapter "Introduction to microarray data analysis" by M. Madan Babu in Computational Genomics.  Essentially, microarray data is collected for the simultaneous analysis of thousands of genes to help discover gene functions, examine the gene regulatory network, identify drug targets, and provide insight into diseases.  The data usually has n on the order of 10 to 100, whereas p might be on the order of hundreds to thousands!

A microarray is typically a glass slide on to which DNA molecules are fixed in an orderly manner at specific locations called spots. A microarray may contain thousands of spots and each spot may contain a few million copies of identical DNA molecules that uniquely correspond to a gene. The DNA in a spot may either be genomic DNA or short stretch of oligo-nucleotide strands that correspond to a gene. Microarrays may be used to measure gene expression in many ways, but one of the most popular applications is to compare expression of a set of genes from a cell maintained in a particular condition (condition A) to the same set of genes from a reference cell maintained under normal conditions (condition B). – M. Madan Babu
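As a quick illustration of the p > n failure (a toy simulation in R, not real microarray data), ordinary least squares simply cannot estimate all of the coefficients when there are more predictors than observations:

# toy data with p = 50 predictors but only n = 20 observations
set.seed(42)
n <- 20; p <- 50
X <- matrix(rnorm(n * p), nrow = n)
y <- rnorm(n)

fit <- lm(y ~ X)        # OLS on a rank-deficient design
sum(is.na(coef(fit)))   # many coefficients come back NA: they cannot be estimated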


There are two closely related approaches that can handle such scenarios with ease.  These are Principal Component Regression (PCR) and Partial Least Squares regression (PLS).  Both of these techniques allow you to deal with data when p > n.  Both PCR and PLS essentially represent the data in lower dimensions (in what is known as "principal components" for the former or just "components" in the latter).  Either way, these components are each formed from linear combinations of all of the predictors.  Since PCR and PLS are essentially automatically creating new features from the given data, this is a form of what is known as "feature extraction".  That is, the algorithm extracts new predictors from the data for use in modeling.

If you choose the number of components (or extracted features), k, to be less than p, then you have reduced the effective representation of your data.  With this, of course, also comes information loss.  You can't just get rid of dimensions without losing information.  However, both PCR and PLS try to shove as much "information" into the first component as possible; subsequent components will contain less and less information.  If you chop off only a few of the last components, then you will not experience much information loss.  If your data contains highly correlated variables or subsets of variables, then you can possibly reduce many dimensions with very little loss of information.
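To see the "information concentrates in the first components" idea concretely, here is a tiny sketch using base R's prcomp on the built-in mtcars data (chosen only for convenience):

# principal components of the (scaled) mtcars variables
pca <- prcomp(mtcars, scale. = TRUE)

# cumulative proportion of variance captured by the first few components
summary(pca)$importance["Cumulative Proportion", 1:4]
# the first handful of components carry most of the variance, so dropping
# the trailing components loses relatively little information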

Face database for facial recognition algorithms

Image data, for instance, is another example of high-dimensional data which can usually be reduced using something like Principal Component Analysis (PCA) to a much lower number of dimensions.  Facial recognition algorithms leverage this fact extensively!  Looking at the image at right, it is easy to see many commonalities among the images of faces — the information needed to discern one face from another is not within the similarities, but of course the differences.  Imagine removing the "similar elements" of all the faces — this would remove a considerable amount of the data's dimensionality, and the remaining "differences" are where the true discriminatory information is found.

PCR and PLS essentially do this — allow you to throw away the non-informative dimensions of data and perform regression modeling on only the informative bits.

I'll leave the details about the differences between PCR and PLS alone for now, except to say that PCR is based on an unsupervised technique (PCA), whereas PLS is inherently a supervised learning technique through and through.  Both define components (linear combinations of the original data) and then allow you to perform OLS on a subset of components instead of the original data.
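A minimal sketch of both methods using the R package pls (simulated data; the choice of 5 components here is arbitrary and would normally be tuned by cross-validation):

library(pls)

# simulated p > n data: 100 predictors, 30 observations
set.seed(42)
n <- 30; p <- 100
X <- matrix(rnorm(n * p), nrow = n)
y <- X[, 1] - 2 * X[, 2] + rnorm(n)
dat <- data.frame(y = y, X = I(X))

pcr_fit <- pcr(y ~ X, data = dat, ncomp = 5, validation = "CV")   # components from PCA (unsupervised)
pls_fit <- plsr(y ~ X, data = dat, ncomp = 5, validation = "CV")  # components chosen using y (supervised)

validationplot(pcr_fit)   # cross-validated RMSEP versus number of components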

Feature Selection

One possibility for sub-selecting predictors for an OLS model is to use stepwise regression.  Stepwise regression (which can be forward, backward, or bi-directional) is a greedy technique in which potential predictors are added (or removed) one at a time from a candidate OLS model by evaluating the impact of adding (or removing) each variable individually and choosing whichever improves performance the most. Maybe "performance" is the wrong word here, but let me use it for now. One traditional technique is to add (or remove) variables based on their associated statistical p-values, e.g., remove a variable if its p >= 0.05.  I should note that while this is commonly employed (e.g., it is the default stepwise method used in SAS), there is some reasonable controversy with this approach.  It is sometimes called p-fishing — as in, "there might not be anything of value in the data, but I am going to fish around until I find something anyway."  This is not a flattering term. If you do choose to perform stepwise regression, a less controversial approach would be to use either AIC or BIC scores as the model performance metric at each step.
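For example, here is a minimal sketch of AIC-based stepwise selection using base R's step() (mtcars again stands in for real data):

full <- lm(mpg ~ ., data = mtcars)    # all candidate predictors
null <- lm(mpg ~ 1, data = mtcars)    # intercept-only model

# bi-directional search: add or drop one variable at a time, scored by AIC
both <- step(null, scope = formula(full), direction = "both", trace = FALSE)
summary(both)

# to score each step by BIC instead, pass k = log(nrow(mtcars)) to step()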

Lasso: a penalized regression approach

However, there are variations of OLS (which, again, are based on modifications to the objective function in Equation (2) from the first post in the series) that result in automatic selection of a subset of the candidate predictors.

Here is Equation (2) again:

\text{(Equation 2 again)} \ \ \ \text{minimize } \sum_{i=1}^n (y_i - \hat{y}_i)^2

The first technique that we will discuss is called the least absolute shrinkage and selection operator (lasso).  Lasso adds a penalty function to Equation 2 based on the magnitude of the regression coefficients as shown in Equation 3.

\text{(Equation 3)} \ \ \ \text{minimize } \sum_{i=1}^n (y_i -\hat{y}_i)^2 + \lambda \sum_{j=0}^p |\beta_j|

The penalty is the sum of the absolute values of all the regression coefficients, scaled by some value λ > 0.  The larger the value of λ, the larger the potential penalty.  As λ increases, it makes sense with respect to Equation (3) to set more and more regression coefficients to 0.  This effectively removes them from the regression model.  Since this removal is based on balancing the sum of squared residuals with the penalty, the predictors which are not as important to the first part of the lasso objective are the ones that are eliminated.  Voilà, feature selection!
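A minimal lasso sketch with the R package glmnet (simulated data; in glmnet, alpha = 1 corresponds to the lasso penalty):

library(glmnet)

# simulated data: only the first two of 20 predictors actually matter
set.seed(42)
X <- matrix(rnorm(100 * 20), nrow = 100)
y <- X[, 1] + 0.5 * X[, 2] + rnorm(100)

lasso_fit <- glmnet(X, y, alpha = 1)   # fits the model over a whole grid of lambda values
coef(lasso_fit, s = 0.1)               # coefficients at lambda = 0.1: most are exactly zero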

You might ask, why place a penalty on the magnitude of the beta values — doesn't that artificially impact the true fit and interpretation of your model?  Well, this is a good question — the lasso definitely "shrinks" the regression coefficients; however, this does not necessarily mean that the shrinkage is a departure from the true model.  If the predictors in the regression model are correlated (i.e., some form of multi-collinearity exists), then the magnitudes of the regression coefficients will be artificially inflated (and the "meaning" of the beta values may be totally lost).  The shrinkage operator in lasso (and other techniques, e.g., ridge regression) tackles this directly.  It is possible that lasso puts too much downward pressure on the coefficient magnitudes, but that is not necessarily the case.

Ridge regression: another penalized regression approach

I need to confess that ridge regression is not a method with automatic feature selection.  However, since it is so closely related to lasso, I decided to throw it in really quickly so I don't have to write another blog post just for this little guy.  Here's the equation; see if you can spot the difference from Equation (3)!

\text{(Equation 4)} \ \ \ \text{minimize } \sum_{i=1}^n (y_i -\hat{y}_i)^2 + \lambda \sum_{j=0}^p \beta_j^2

That's right — the only difference between Equation (3) and Equation (4) is that the penalty is based on absolute values in one and on squares in the other. The idea is the same, except for one very interesting difference — the regression coefficients in ridge regression are never forced to 0.  They get smaller and smaller, but unlike lasso, no features are eliminated.  Ridge regression, however, often turns out to produce better predictions than OLS.  For both lasso and ridge regression, the value of λ is determined by using cross-validation methods to tune the parameter to the best value for predictions.  If λ = 0, then both of these methods give the exact same result as OLS.  If λ > 0 (as determined by so-called hyper-parameter tuning), then the penalized regression techniques can outperform OLS.
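A matching ridge sketch (alpha = 0 in glmnet), with lambda chosen by cross-validation; the simulated data is generated the same way as in the lasso sketch above:

library(glmnet)

set.seed(42)
X <- matrix(rnorm(100 * 20), nrow = 100)
y <- X[, 1] + 0.5 * X[, 2] + rnorm(100)

cv_ridge <- cv.glmnet(X, y, alpha = 0)   # 10-fold CV over an automatic lambda grid
cv_ridge$lambda.min                      # lambda with the lowest cross-validated error
coef(cv_ridge, s = "lambda.min")         # coefficients shrink toward zero but none are exactly zero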

Elastic net regularization: yet again, a penalized regression approach

The other reason that I wanted to introduce ridge regression is that it is a great segue into my favorite of the penalized techniques, elastic net regularization or just elastic net for short.  The elastic net approach combines both penalties from lasso and ridge regression in an attempt to get at the best of both worlds: the feature selection element of lasso and the predictive performance of ridge regression.

\text{(Equation 5)} \ \ \ \text{minimize } \sum_{i=1}^n (y_i - \hat{y}_i)^2  + \lambda_1 \sum_{j=0}^{p} |\beta_j | + \lambda_2 \sum_{j=0}^{p} \beta_j^2

Oftentimes the relationship between λ1 > 0 and λ2 > 0 is one such that λ1 + λ2 = 1.  In this case, if λ1 = 1, then the elastic net gives you the same result as lasso, and if λ2 = 1, then the result is equivalent to ridge regression.  However, many times the result from hyper-parameter tuning is that 0 < λ1 < 1 and 0 < λ2 < 1, implying that, yes! some hybridization of the lasso and ridge regression approaches produces the best cross-validated results.

In the case when we require λ1 + λ2 = 1, Equation (5) can be rewritten as follows to simplify to only one parameter:

\text{(Modified equation 5)} \ \ \ \text{minimize } \sum_{i=1}^n (y_i - \hat{y}_i)^2  + \lambda \sum_{j=0}^{p} |\beta_j | + (1-\lambda) \sum_{j=0}^{p} \beta_j^2

I have mentioned "hyper-parameter tuning" a couple of times already.  Without going into the details of the cross-validation, let me simply say that all hyper-parameter tuning means is that you try out a whole bunch of values for your parameter (e.g., λ) until you find the values that work best.  Take a look at the lasso and elastic net paths figure.  In this figure the values of the coefficients are on the y-axis (each colored line represents a different predictor) and the value of the penalty is represented on the x-axis (actually, the log of the penalty is represented).  As the penalty value decreases (moving right along the x-axis), the values of the coefficients increase for both lasso (solid lines) and elastic net (dashed lines).  So you can see that as you "tune" the value of λ, an infinite number of models is possible!  When the value becomes large enough (starting at the right and then moving to the left along the x-axis), some of the coefficients are forced to 0 by lasso and by elastic net.  The lasso seems to do this quicker than elastic net, as demonstrated by the solid blue and dashed blue lines — a very small increase in λ (at about log λ ≈ -90) and its value is set to 0 by lasso; whereas the penalty has to increase such that log λ ≈ -55 for elastic net before the same variable's regression coefficient is set to 0.
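Finally, a hedged sketch of elastic net tuning and coefficient paths with glmnet. Note that glmnet parameterizes the mixture with a single mixing parameter alpha in [0, 1] (alpha = 1 is the lasso, alpha = 0 is ridge) together with an overall penalty strength lambda, which plays a role similar to the one-parameter form of Equation (5) above; the crude alpha grid below is only for illustration:

library(glmnet)

set.seed(42)
X <- matrix(rnorm(100 * 20), nrow = 100)
y <- X[, 1] + 0.5 * X[, 2] + rnorm(100)

# grid search over the mixing parameter; reuse the same CV folds across
# alpha values (via foldid) so the cross-validated errors are comparable
foldid <- sample(rep(1:10, length.out = nrow(X)))
alphas <- seq(0, 1, by = 0.1)
cv_err <- sapply(alphas, function(a) min(cv.glmnet(X, y, alpha = a, foldid = foldid)$cvm))
best_alpha <- alphas[which.min(cv_err)]

enet_fit <- glmnet(X, y, alpha = best_alpha)
plot(enet_fit, xvar = "lambda")   # coefficient paths: one curve per predictor, as in the figure above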


Now, there are other regression techniques that also include automatic feature selection: e.g., multivariate adaptive regression splines (MARS) essentially uses a forward step-wise procedure to add terms to the regression model, and regression trees choose features one at a time to add to a model to produce prediction estimates. (These two seemingly different techniques actually have quite a lot in common!)  However, I will introduce the first of these as a technique for both feature selection and feature construction.  Our next post deals more generally with how our regression approach can deal with non-linearities in our model assumption.