What is Big Data?

This is a topic we have explored several times.

In Why the 3V’s Are Not Sufficient to Describe Big Data, Mark van Rijmenam describes what has become the ‘classic’ definition of “Big Data” as put forth by Doug Laney in 2001. Laney defined big data as being three-dimensional, i.e. increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources). These three dimensions (Volume, Velocity and Variety) have since taken on the name of “the 3V’s”.

After establishing the bona fides of the 3V’s, van Rijmenam goes on to say, “I do think that big data can be better explained by adding a few more V’s”. The other V’s he proposes are: Veracity, Variability, Visualization and Value.

In an InformationWeek article by Seth Grimes titled Big Data: Avoid Wanna-V Confusion, a contrary argument is made. Grimes counters van Rijmenam by saying that the original 3V’s, while not perfect, work well. He adds that, in his opinion, the drive behind ‘wanna-V’ is rooted in causing division and spin.


A few months ago, I wrote in an article titled In Search of a Definition for Big Data that the implied demand for an arbitrary size specification when creating a definition of “Big Data” makes the definition obsolete the moment it is formulated. That article went on to review an academic paper published by Jonathan Stuart Ward and Adam Barker of the University of St Andrews in Scotland. Their paper surveyed the definitions of “Big Data” from a number of significant industry organizations such as Gartner (3V’s), Intel, Microsoft and Oracle, leading Ward and Barker to offer their own composite definition of “Big Data”:

“Big data is a term describing the storage and analysis of large and or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce and machine learning.”

I ended that article by sharing my personal definition of “Big Data”:

Big Data is that data, which because of its complexity, or its size, or the speed it is being generated exceeds the capability of current conventional methods and systems to extract practical use and value.

This is the definition we’ll go with from here on, at least until a better one comes to mind.

In Search of a Definition for Big Data

The concept of “Big”, with its implication of significance, complexity and challenge, presents us with difficulty when trying to nail down a definition because of an inherent invitation for quantification. When one goes to define “Big”, it stands to reason that a size, some number, is all but demanded. In the era of Moore’s law, however, the almost contradictory notion that a data set defined to be large today will certainly seem small in the not-too-distant future makes assigning a quantity to “Big” seem arbitrary. This appears to imply that “Big Data”, at any given point in time, will always be more than current conventional techniques can handle.
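The arithmetic behind that intuition can be sketched in a few lines of Python. This is only an illustration: the two-year doubling period is a common paraphrase of Moore’s law that I am assuming here, the 300 TB starting figure is an arbitrary example, and the function name `relative_size` is hypothetical.

```python
def relative_size(years_elapsed: float, doubling_period_years: float = 2.0) -> float:
    """Return how large a fixed-size data set looks relative to conventional
    capacity after `years_elapsed`, assuming (for illustration only) that
    capacity doubles every `doubling_period_years`."""
    return 1.0 / (2.0 ** (years_elapsed / doubling_period_years))


if __name__ == "__main__":
    fixed_threshold_tb = 300.0  # arbitrary example threshold, not a recommendation
    for years in (0, 2, 4, 10):
        equivalent_tb = fixed_threshold_tb * relative_size(years)
        # "equivalent" = the size that felt the same at year 0
        print(f"after {years:>2} years, {fixed_threshold_tb:.0f} TB "
              f"is as demanding as {equivalent_tb:.1f} TB was at the start")
```

Under these assumptions a threshold fixed today shrinks to about 3% of its original weight within a decade, which is why any hard-coded number for “Big” ages so quickly.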

There are those who assign the complexity of a data set greater import than its size when deciding whether it is “Big”. Taking things further, there are definitions which include requisite solution components, tying “Big Data” to technologies such as Apache Hadoop and NoSQL stores such as Amazon Dynamo, Cassandra, CouchDB and MongoDB.

In an academic paper published last month (September 30, 2013), Jonathan Stuart Ward and Adam Barker at the University of St Andrews in Scotland have made a valiant effort to clarify the definition of “Big Data”. The abstract of that paper is as follows:

Undefined By Data: A Survey of Big Data Definitions [1]

The term big data has become ubiquitous. Owing to a shared origin between academia, industry and the media there is no single unified definition, and various stakeholders provide diverse and often contradictory definitions. The lack of a consistent definition introduces ambiguity and hampers discourse relating to big data. This short paper attempts to collate the various definitions which have gained some degree of traction and to furnish a clear and concise definition of an otherwise ambiguous term.

Source: arXiv:1309.5821v1 [cs.DB] http://arxiv.org/pdf/1309.5821v1.pdf


This paper calls out six of the better-known and oft-quoted definitions of “Big Data”. I’ve summarized them below:

  1. Gartner: In 2001, a META Group (now Gartner) report noted the increasing size of data, the increasing rate at which it is produced and the increasing range of formats and representations employed. This report predated the term “Big Data” but proposed a three-fold definition encompassing the “three Vs”: Volume, Velocity and Variety. [2]
  2. Oracle: In a 2012 white paper entitled “Oracle: Big data for the enterprise” [3], the author contends that big data is the derivation of value from traditional relational database driven business decision making, augmented with new sources of unstructured data. This definition does not make clear exactly when and why the term big data is applicable; rather, it provides a means by which someone with the requisite experience and background can “know it when you see it”.
  3. Intel: Intel takes a stand and quantifies where “Big” begins. According to Intel, an organization is playing in the realm of “Big Data” when it is “generating a median of 300 terabytes (TB) of data weekly” [4]. Additionally, Intel asserts that the most common data involved in analytics are business transactions stored in relational databases, followed by unstructured data in the form of documents, email, sensor data, blogs and social media.
  4. Microsoft: The definition of “Big Data” by Microsoft is clear and straightforward: “Big Data is the term increasingly used to describe the process of applying serious computing power – the latest in machine learning and artificial intelligence – to seriously massive and often highly complex sets of information” [5]. This definition clearly states that “Big Data” requires the application of significant compute power. In addition, two technologies, machine learning and artificial intelligence, are introduced. While a volume quantification is lacking, the concept of there being related technologies involved is added.
  5. MIKE2.0, The Method for an Integrated Knowledge Environment project: The MIKE project makes the argument that “Big Data can be very small and not all large datasets are big” [6]. Their argument, that “Big Data” is a function not of the size of a data set but of its complexity, alters the fundamental semantic of “Big” to the point that we may need or want a term other than “Big Data”.
  6. NIST, The National Institute of Standards and Technology: The US Government’s NIST has defined “Big Data” in terms somewhat similar to MIKE. Their definition supports the notion that “Big” is relative and that “Big Data” is data that challenges current paradigms and practices; specifically, it is data which “exceed(s) the capacity or capability of current or conventional methods and systems” [7].

The authors of the paper, Ward and Barker, in an attempt to discern a populist definition of “Big Data”, analyzed the Google search phrases most commonly associated with “Big Data” [8] and came up with the following list:

  • data analytics
  • Hadoop
  • NoSQL
  • Google
  • IBM
  • Oracle

Ward and Barker note that all of the definitions referenced make at least one of the following assertions:

  • Size: the volume of the datasets is a critical factor.
  • Complexity: the structure, behaviour and permutations of the datasets are a critical factor.
  • Technologies: the tools and techniques which are used to process a sizable or complex dataset are a critical factor.

Taking these points into account, and considering the sum as well as all of the parts of the aforementioned definitions, they extrapolated the following definition of their own:

Big data is a term describing the storage and analysis of large and or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce and machine learning.

I have, over the past several years, been asked for my definition of “Big Data”. My personal bias is more in line with the definition offered by NIST than with any of the others, although I am sympathetic to the MIKE2.0 definition as well. I would posit:

Big Data is that data, which because of its complexity, or its size, or the speed it is being generated exceeds the capability of current conventional methods and systems to extract practical use and value.
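Read as a test, this definition is relative: the same data set can be “Big” against one set of conventional limits and ordinary against another. A minimal Python sketch of that reading follows. Everything in it is an assumption for illustration: the names `Workload`, `ConventionalLimits` and `is_big_data` are hypothetical, the numeric limits are made up, and counting distinct formats is only a crude stand-in for complexity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    size_tb: float            # size of the data set
    ingest_tb_per_day: float  # speed at which data is being generated
    distinct_formats: int     # crude, assumed proxy for complexity


@dataclass(frozen=True)
class ConventionalLimits:
    """What 'current conventional methods and systems' can handle.
    The values are supplied by the caller; none come from the article."""
    max_size_tb: float
    max_ingest_tb_per_day: float
    max_distinct_formats: int


def is_big_data(workload: Workload, limits: ConventionalLimits) -> bool:
    """True if size, speed, or complexity exceeds the conventional limits.
    Any one dimension is enough."""
    return (
        workload.size_tb > limits.max_size_tb
        or workload.ingest_tb_per_day > limits.max_ingest_tb_per_day
        or workload.distinct_formats > limits.max_distinct_formats
    )


if __name__ == "__main__":
    today = ConventionalLimits(max_size_tb=100, max_ingest_tb_per_day=5, max_distinct_formats=10)
    later = ConventionalLimits(max_size_tb=5000, max_ingest_tb_per_day=200, max_distinct_formats=10)

    warehouse = Workload(size_tb=800, ingest_tb_per_day=20, distinct_formats=4)
    small_but_messy = Workload(size_tb=0.5, ingest_tb_per_day=0.01, distinct_formats=40)

    print(is_big_data(warehouse, today))        # exceeds today's size and speed limits
    print(is_big_data(warehouse, later))        # same data, larger limits
    print(is_big_data(small_but_messy, today))  # small, but too complex
```

The second argument is the point of the sketch: no threshold is baked into the function, so the answer changes as conventional capability changes, and a small but complex data set can still qualify.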

  1. J. S. Ward, A. Barker. Undefined By Data: A Survey of Big Data Definitions. arXiv:1309.5821v1 [cs.DB], 2013. http://arxiv.org/pdf/1309.5821v1.pdf
  2. D. Laney. 3D Data Management: Controlling Data Volume, Velocity and Variety. META Group (now Gartner), 2001.
  3. J. P. Dijcks. Oracle: Big data for the enterprise. Oracle White Paper, 2012.
  4. Intel Peer Research on Big Data Analysis. http://www.intel.com/content/www/us/en/big-data/data-insights-peer-research-report.html.
  5. The Big Bang: How the Big Data Explosion Is Changing the World – Microsoft UK Enterprise Insights Blog – Site Home – MSDN Blogs. http://blogs.msdn.com/b/microsoftenterpriseinsight/archive/2013/04/15/the-big-bang-how-the-big-data-explosion-is-changing-the-world.aspx.
  6. Big Data Definition – MIKE2.0, the open source methodology for Information Development. http://mike2.openmethodology.org/wiki/Big%20Data%20Definition.
  7. NIST Big Data Working Group (NBD-WG). http://bigdatawg.nist.gov/home.php.
  8. Google. Google Trends for Big Data, 2013.


A More Thoughtful but No More Convincing View of Big Data

I have a problem with Big Data. As someone who makes his living working with data and helping others do the same as effectively as possible, my objection doesn’t stem from a problem with data itself, but instead from the misleading claims that people often make about data when they refer to it as Big Data. I have frequently described Big Data as nothing more than a marketing campaign cooked up by companies that sell information technologies either directly (software and hardware vendors) or indirectly (analyst groups such as Gartner and Forrester).

There is a lot here that I agree with…
