Big data is a term that you often hear when people talk about data science and analytics.
So, the question is, “what is big data?”
Doug Laney from Gartner, a leading information technology research company, defined 3 dimensions of “big data”: volume, velocity, and variety.
- Volume denotes the size and scale of the data. There is a lot of data out there – and it is growing. It is estimated that 40 zettabytes of data will be created by the year 2020. What is a zettabyte? One zettabyte is equal to 1 trillion gigabytes!
- Velocity is the speed at which data is created as well as the increasing speed at which it is processed. The speed at which data is created is almost unimaginable, and it is accelerating. I’ll give some examples, but by the time you read this they will be out of date: Google is processing about 3.5 billion search queries every day; every minute we are uploading 300 hours of video onto YouTube; and 3.4 million emails are sent every second. Check out this site for more up-to-date information: http://www.internetlivestats.com
- Variety of the data refers to the fact that data comes from many sources and in many forms. Whether it is Facebook posts, video uploads, satellite images, GIS data, reviews on products from Amazon.com, sensor data from self-driving cars, or data from wearable devices and wireless health monitors, data is coming at us from all directions and in many formats.
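To get a feel for the volume and velocity figures quoted above, here is a small sketch of the arithmetic. It assumes decimal (SI) prefixes, so a gigabyte is 10^9 bytes and a zettabyte is 10^21 bytes; the function names are just illustrative, and the input numbers are the ones quoted in the text, which will go out of date.

```python
# Unit sizes in bytes, assuming decimal (SI) prefixes.
GIGABYTE = 10**9
ZETTABYTE = 10**21

SECONDS_PER_MINUTE = 60
SECONDS_PER_DAY = 24 * 60 * 60


def zettabytes_to_gigabytes(zettabytes):
    """Convert a whole number of zettabytes to gigabytes."""
    return zettabytes * ZETTABYTE // GIGABYTE


def per_second(count, period_seconds):
    """Convert a count observed over a period into a per-second rate."""
    return count / period_seconds


# Volume: the 40 zettabyte estimate expressed in gigabytes.
estimated_gigabytes = zettabytes_to_gigabytes(40)

# Velocity: put the quoted figures on comparable time scales.
searches_per_second = per_second(3.5e9, SECONDS_PER_DAY)
video_hours_per_second = per_second(300, SECONDS_PER_MINUTE)
emails_per_day = 3.4e6 * SECONDS_PER_DAY

print(f"40 ZB = {estimated_gigabytes:,} GB")
print(f"Searches per second: {searches_per_second:,.0f}")
print(f"Hours of video uploaded per second: {video_hours_per_second}")
print(f"Emails per day: {emails_per_day:,.0f}")
```

Putting everything on a per-second or per-day scale makes the quoted rates easier to compare with one another.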
People love alliteration…
Let’s look at the next four V’s: Veracity, Variability, Visualization, and Value. I’d like to add, however, that these dimensions are not unique to “big” data; they represent challenges for data of basically any size. I should also mention that Doug Laney did not necessarily like the addition of the new V’s to his working description of “big data.”
- The first one, added by IBM, is “veracity”: the accuracy, truthfulness, or trustworthiness of the data. IBM found that 1 in 3 business leaders didn’t trust the information that they use to make decisions, and that “poor data quality costs the US economy an estimated 3.1 trillion dollars a year.”
- Variability implies that the meaning of the data is changing. A number, variable, or rule might have had a certain definition last month, but now it has changed. This also might relate, for example, to how words have different meanings in different contexts. One especially difficult challenge in the field of natural language processing is how to detect and interpret sarcasm. The same word used in one phrase may have the exact opposite meaning when used in a different phrase.
- Visualization is associated with the challenge of understanding what is really in your data. This includes visualizing and communicating the interesting facets of the data and turning all of this into something comprehensible, which is not easy.
- Finally, the last V: value. Data by itself has no real value. Having lots of it, without meaning, doesn’t do anyone any good. Individual observations, transactions, records, and entities in the data mean very little on their own. It is only through aggregation and analysis that we can find anything worthwhile. But because there is so much of it, there is enormous potential! As a shameless plug, turning big data or small data or anything in between into value is the purpose of the ISE/DSA 5103 Intelligent Data Analytics course that I teach.
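As a tiny illustration of the point about value, the sketch below aggregates individual records into a per-group summary. The records, the store names, and the `summarize` helper are all hypothetical and made up for this example; real analysis would involve far more data and more careful methods.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical transaction records (store, amount); made-up values for illustration only.
transactions = [
    ("north", 12.0),
    ("south", 30.0),
    ("north", 18.0),
    ("south", 10.0),
    ("north", 15.0),
]


def summarize(records):
    """Group amounts by store and compute count, total, and mean for each store."""
    grouped = defaultdict(list)
    for store, amount in records:
        grouped[store].append(amount)
    return {
        store: {"count": len(amounts), "total": sum(amounts), "mean": mean(amounts)}
        for store, amounts in grouped.items()
    }


summary = summarize(transactions)
for store, stats in summary.items():
    print(store, stats)
```

A single row says almost nothing, while the grouped summary starts to support a comparison between stores.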
I like how Joel Gurin, author of Open Data Now, defines big data: “Big data describes datasets that are so large, complex, or rapidly changing that they push the very limits of our analytical capability. It’s a subjective term: What seems “big” today may seem modest in a few years when our analytic capacity has improved.”
What was big data yesterday may not be big data now, and what is “big” now may not be considered “big” tomorrow. What stays consistent in this field, however, is the need for us to expand our analytical talents and technology. This (again, shameless plug) is what the MS Data Science and Analytics program at OU is all about! Joel Gurin goes on to say that what’s really important is not so much the size of the data, but the “big impact” that it can have on society, health, economy, and research.