Monthly Archives: August 2016

Big data: what is it?

What is Big Data?

Big data is a term that you often hear when people talk about data science and analytics.

So, the question is, “what is big data?”

Doug Laney from Gartner, a leading information technology research company, defined 3 dimensions of “big data”: volume, velocity, and variety.

  • Volume denotes the size and scale of the data.   There is a lot of data out there – and it is growing. It is estimated that 40 zettabytes of data will be created by the year 2020.   What is a zettabyte?  One zettabyte is equal to 1 trillion gigabytes!
  • Velocity is the speed at which data is created as well as the increasing speed at which it is processed.  The speed at which data is created is almost unimaginable. And it is accelerating. I’ll give some examples, but by the time you read this they will be out of date: Google is processing about 3.5 billion search queries every day; every minute we are uploading 300 hours of video onto YouTube; and 3.4 million emails are sent every second.  Check out this site for more up-to-date information: http://www.internetlivestats.com
  • Variety of the data refers to the fact that data comes from many sources and in many forms.  Whether it is Facebook posts, video uploads, satellite images, GIS data, reviews on products from Amazon.com, sensor data from self-driving cars, or data from wearable devices and wireless health monitors, data is coming at us from all directions and in many formats.
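
To put the volume and velocity numbers in perspective, here is a quick back-of-the-envelope sketch in Python. It assumes decimal (SI) units, so a gigabyte is 10^9 bytes and a zettabyte is 10^21 bytes, and it simply takes the figures quoted above at face value. The function and variable names are my own and are purely illustrative.

```python
# Back-of-the-envelope arithmetic for the figures quoted in the list above.
# Assumption: decimal (SI) units, not binary (GiB/ZiB) units.
BYTES_PER_GIGABYTE = 10**9
BYTES_PER_ZETTABYTE = 10**21
SECONDS_PER_DAY = 24 * 60 * 60


def zettabytes_to_gigabytes(zettabytes):
    """Convert a whole number of zettabytes to gigabytes (SI units)."""
    return zettabytes * BYTES_PER_ZETTABYTE // BYTES_PER_GIGABYTE


# Volume: the 40 zettabyte estimate, expressed in gigabytes.
estimated_gigabytes = zettabytes_to_gigabytes(40)

# Velocity: rescale the quoted rates to other time windows.
searches_per_day = 3_500_000_000
searches_per_second = searches_per_day / SECONDS_PER_DAY

video_hours_uploaded_per_minute = 300
video_hours_uploaded_per_hour = video_hours_uploaded_per_minute * 60

emails_per_second = 3_400_000
emails_per_day = emails_per_second * SECONDS_PER_DAY

print(f"1 ZB  = {zettabytes_to_gigabytes(1):,} GB")
print(f"40 ZB = {estimated_gigabytes:,} GB")
print(f"Searches per second: about {searches_per_second:,.0f}")
print(f"Hours of video uploaded per hour: {video_hours_uploaded_per_hour:,}")
print(f"Emails per day: {emails_per_day:,}")
```

Swap in fresher numbers from the site linked above and the same arithmetic applies.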

People love alliteration…

Everyone seems to want to add more “V’s” to the definition of big data, so now we have the 4 V’s of big data, the 5 V’s, the 6 V’s, and even the 7 V’s of big data…

Batman says: Only 3 V's of big data!

Let’s look at the next four V’s: Veracity, Variability, Visualization, and Value.  I’d like to add, however, that these dimensions are not unique to “big” data; they represent challenges for data of basically any size.  I should also mention that Doug Laney did not necessarily like the addition of the new V’s to his working description of “big data.”

  • The first one, added by IBM, is “veracity”: the accuracy, truthfulness, or trustworthiness of the data.  IBM found that 1 in 3 business leaders didn’t trust the information that they use to make decisions, and additionally that “poor data quality costs the US economy an estimated 3.1 trillion dollars a year.”

  • Variability implies that the meaning of the data is changing.  A number, variable, or rule might have had a certain definition last month, but now it has changed.  This also might relate, for example, to how words have different meanings in different contexts.  One especially difficult challenge in the field of natural language processing is how to detect and interpret sarcasm.  The same word used in one phrase may have the exact opposite meaning when used in a different phrase.

 

  • Visualization is associated with the challenge of understanding what is really in your data.  This includes visualizing and communicating the interesting facets of the data and turning all of this into something comprehensible.  This is not easy.

  • Finally, the last V: value.  Data by itself has no real value.  Having lots of it, without meaning, doesn’t do anyone any good. Individual observations, transactions, records, and entities in the data mean very little on their own.  It is only through aggregation and analysis that we can find anything worthwhile.  But because there is so much of it, there is enormous potential!  As a shameless plug, turning big data or small data or anything in between into value is the purpose of the ISE/DSA 5103 Intelligent Data Analytics course that I teach.

Now what?

I like how Joel Gurin, author of Open Data Now, defines big data: “Big data describes datasets that are so large, complex, or rapidly changing that they push the very limits of our analytical capability.  It’s a subjective term: What seems ‘big’ today may seem modest in a few years when our analytic capacity has improved.”

What was big data yesterday may not be big data now, and what is “big” now may not be considered “big” tomorrow.  However, what stays constant in this field is the need for us to expand our analytical talents and technology.  This (again, shameless plug) is what the MS Data Science and Analytics program at OU is all about!  Joel Gurin goes on to say that what’s really important is not so much the size of the data, but the “big impact” that it can have on society, health, the economy, and research.

Fall 2016 Classes

Why are open source statistical programming languages the best?
Because they R.

It is August and Fall 2016 classes begin in just a couple of days.  I am currently prepping for two large classes: I am happy to see the incredible interest in my graduate course, with over 50 students enrolled in ISE/DSA 5103 Intelligent Data Analytics! I will also be taking over Dr. Suleyman Karabuk’s ISE 4113 Decision Support Systems undergraduate course, with nearly 80 students already enrolled!

To this end I am collecting as many new jokes and one-liners as possible; gotta keep the material fresh.  That said, for those of you who have yet to take any of my courses: my jokes are really not that funny, but I do expect all students to laugh regardless.  This is a price that must be paid.  If you have any jokes, puns, etc. that are short, clean, related to statistics or data science, and optionally funny, please send them my way: cnicholson @ ou (dot) edu.

To support these two courses I have tricked two unassuming graduate students into becoming TAs for me.  Sai Krishna Theja Bhavaraju has enthusiastically accepted the role of TA for ISE 4113 and Alex Rodriguez will be the TA for ISE 5103.  Both of these TAs are bright, friendly, and very helpful.  If you are taking either of these two classes, please feel free to ask them for help.  If you are not taking these classes, but you stumble across either of these two gentlemen, please buy them a beer; they have their work cut out for them!

Intelligent Data Analytics is not an easy course.  The homework and projects are notoriously challenging.  In the class we address real-world, data-intensive problems by integrating human intuition with data analysis tools to draw out and communicate meaningful insights. Topics include problem approach and framing, data cleansing, exploratory analysis and visualization, dimension reduction, linear and logistic regression, decision trees, and clustering.  Students will be introduced to a powerful open source statistical programming language (R) and work on hands-on, applied data analysis projects.  I have heard from several former students that this has been a hard but useful course.  At least six students that I know of who have taken this course have obtained jobs in analytics and data science fields at companies including Deloitte Consulting, Visual BI, GE Global Research, Nerd Kingdom, OKC Thunder, and Standard & Poor’s.  Hopefully the skills you are introduced to in the class can be helpful to you in the future.

ISE 4113 is a Decision Support Systems course that exploits advanced features of MS Excel 2013 to model and build decision support applications.  The course will start with the basics and quickly move into mathematical modeling, simulation, VBA, and GUI design.  While this is the first time for me to teach this course, I have heard from students that the material they learn in this class has made a significant impact in their academic and professional lives.  I hope to continue the track record of success with this course.

 

 

Summer 2016 Hangout

Very happy to see all the students and friends that came out to the Summer 2016 hangout at McNellie’s The Abner Ale House in Norman.  I am privileged to work with a wide variety of students in ISE, DSA, and CEES who are applying research in a broad array of application areas (from Community Resilience to Streaming Clustering in online Gaming to Predictive Modeling for TV Ratings to Optimizing Ship Routing) and who represent many different cultures, languages, and backgrounds.  Our group includes members from China, India, Iran, Peru, and Brazil, as well as Oklahomans and Texans.  My beautiful wife, hailing from Mexico, also came to hang out.

I am glad that this gave you a chance to meet some new colleagues and reconnect with others outside the lab.

Hopefully, all of the MS DSA students (Alex B., Alex R., Alexandra, Emily, Silvia, and Stephen) can support each other through this academically intense Fall semester about to begin!   Silvia and Emily are completing their industry practicums this week as well — so congratulations to them (assuming all goes well!)

We are also happy to welcome Vera Bosco to the group.  Vera is an ISE PhD student who is applying methods of stochastic optimization and dynamic programming to ship routing under weather uncertainty.  She is a new addition to the group and hails from Brazil.  Her bio is now posted on the team page.

And as always, I am glad to hang out with the CEES group who are a part of the CORE lab: Peihui, Mohammad, Yingjun, and Jia.

I hope this opportunity (and more like it to come) will help you connect with your colleagues and co-conspirators in the Analytics Lab. Several students are out of town during the Summer, but when everyone is back from their internships and travels we will plan a get-together for the Fall.


Two new publications in CAIE

Summer publications!

We are happy to see two new papers accepted for publication in Computers & Industrial Engineering this Summer!  These publications form a logical pair: one introduces a new perspective that uses statistical learning to help study the Fixed-Charge Network Flow (FCNF) problem, and the other develops a solution technique that hybridizes the new approach with classical techniques to improve solution efficiency.

Zhang, W. and C.D. Nicholson. 2016. Prediction-based relaxation solution approach for the fixed charge network flow problem. Computers & Industrial Engineering, 99:106-111 http://dx.doi.org/10.1016/j.cie.2016.07.014.
Keywords: Network optimization; Fixed charge network flow; Heuristics

Abstract: A new heuristic procedure for the fixed charge network flow problem is proposed. The new method leverages a probabilistic model to create an informed reformulation and relaxation of the FCNF problem. The technique relies on probability estimates that an edge in a graph should be included in an optimal flow solution. These probability estimates, derived from a statistical learning technique, are used to reformulate the problem as a linear program which can be solved efficiently. This method can be used as an independent heuristic for the fixed charge network flow problem or as a primal heuristic. In rigorous testing, the solution quality of the new technique is evaluated and compared to results obtained from a commercial solver software. Testing demonstrates that the novel prediction-based relaxation outperforms linear programming relaxation in solution quality and that as a primal heuristic the method significantly improves the solutions found for large problem instances within a given time limit.

Nicholson, C.D. and W. Zhang. 2016. Optimal Network Flow: A Predictive Analytics Perspective on the Fixed-Charge Network Flow Problem. Computers & Industrial Engineering, 99:260-268 http://dx.doi.org/10.1016/j.cie.2016.07.030.
Keywords: Network analysis; Fixed charge network flow; Predictive modeling; Critical components

Abstract: The fixed charge network flow (FCNF) problem is a classical NP-hard combinatorial problem with wide spread applications. To the best of our knowledge, this is the first paper that employs a statistical learning technique to analyze and quantify the effect of various network characteristics relating to the optimal solution of the FCNF problem. In particular, we create a probabilistic classifier based on 18 network related variables to produce a quantitative measure that an arc in the network will have a non-zero flow in an optimal solution. The predictive model achieves 85% cross-validated accuracy. An application employing the predictive model is presented from the perspective of identifying critical network components based on the likelihood of an arc being used in an optimal solution.

We have also just had a very good first round review from Sustainable and Resilient Infrastructure on a paper entitled “Defining Resilience Analytics for Interdependent Cyber-Physical-Social Networks” and expect a quick second round of reviews soon.

We have finally received the first round of reviews from the Journal of Biomedical Informatics on a paper written in 2015 by Leslie Goodwin (MS ISE @ OU), Charles Nicholson (OU), and Corey Clark (SMU) entitled “Variable neighborhood search for reverse engineering of gene regulatory networks”. The first round review is very promising, and we are going to work hard to see this paper published in such a high quality journal!

Hopefully this fall we will submit 7 more papers that are close to wrapping up.  These include two papers on data mining, one paper on network heuristics, and four papers relating to advancing the science of resilience.