In Search of a Definition for Big Data

The concept of “Big”, with its implication of significance, complexity and challenge, presents us with difficulty when trying to nail down a definition because it inherently invites quantification. When one goes to define “Big”, it stands to reason that a size, some number, is all but demanded. In the era of Moore’s law, however, a data set considered large today will certainly seem small in the not-too-distant future, which makes assigning a quantity to “Big” seem arbitrary. This appears to imply that “Big Data”, at any given point in time, will always be more than current conventional techniques can handle.

Some assign the complexity of a data set greater import than its size when deciding whether it is “Big”. Taking things further, some definitions include requisite solution components, tying “Big Data” to technologies such as Apache Hadoop and to NoSQL stores such as Amazon Dynamo, Cassandra, CouchDB and MongoDB.

In an academic paper published last month (September 30, 2013), Jonathan Stuart Ward and Adam Barker of the University of St Andrews in Scotland made a valiant effort to clarify the definition of “Big Data”. The abstract of that paper is as follows:

Undefined By Data: A Survey of Big Data Definitions [1]

The term big data has become ubiquitous. Owing to a shared origin between academia, industry and the media there is no single unified definition, and various stakeholders provide diverse and often contradictory definitions. The lack of a consistent definition introduces ambiguity and hampers discourse relating to big data. This short paper attempts to collate the various definitions which have gained some degree of traction and to furnish a clear and concise definition of an otherwise ambiguous term.

Source: arXiv:1309.5821v1 [cs.DB]


The paper calls out six of the better-known and most often quoted definitions of “Big Data”. I’ve summarized them below:

  1. Gartner: In 2001, a report from META Group (now part of Gartner) noted the increasing size of data, the increasing rate at which it is produced and the increasing range of formats and representations employed. The report predated the term “Big Data” but proposed a three-fold definition encompassing the “three Vs”: Volume, Velocity and Variety [2].
  2. Oracle: In a 2012 white paper entitled “Oracle: Big data for the enterprise” [3], the author contends that big data is the derivation of value from traditional relational-database-driven business decision making, augmented with new sources of unstructured data. This definition does not make clear exactly when and why the term big data is applicable; rather, it provides a means by which one who has the requisite experience and background can “know it when you see it”.
  3. Intel: Intel takes a stand and quantifies where “Big” begins. According to Intel, an organization is playing in the realm of “Big Data” when it is “generating a median of 300 terabytes (TB) of data weekly” [4]. Intel also asserts that the most common data involved in analytics are business transactions stored in relational databases, followed by unstructured data in the form of documents, email, sensor data, blogs and social media.
  4. Microsoft: Microsoft’s definition of “Big Data” is clear and straightforward: “Big Data is the term increasingly used to describe the process of applying serious computing power – the latest in machine learning and artificial intelligence – to seriously massive and often highly complex sets of information” [5]. This definition states plainly that “Big Data” requires the application of significant compute power, and it introduces two technologies, machine learning and artificial intelligence. While a volume quantification is lacking, the idea that specific technologies are involved is added.
  5. MIKE2.0, The Method for an Integrated Knowledge Environment project: The MIKE project makes the argument that “Big Data can be very small and not all large datasets are big” [6]. Their argument, that “Big Data” is a function not of the size of a data set but of its complexity, alters the fundamental semantics of “Big” to the point that we may need or want a term other than “Big Data”.
  6. NIST, The National Institute of Standards and Technology: The US Government’s NIST has defined “Big Data” in terms somewhat similar to MIKE. Their definition supports the notion that “Big” is relative and that “Big Data” is data that challenges current paradigms and practices. Specifically, it is data which “exceed(s) the capacity or capability of current or conventional methods and systems” [7].
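
To put Intel’s threshold from item 3 in perspective, the weekly figure can be converted into a sustained data rate. The sketch below is illustrative only: it assumes decimal terabytes (10^12 bytes) and a perfectly even flow of data across the week, neither of which is stated in the Intel report, and the function name is my own.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60   # 604,800 seconds
BYTES_PER_TB = 10**12                 # assumption: decimal terabytes


def sustained_rate_mb_per_s(tb_per_week: float) -> float:
    """Average rate in MB/s needed to accumulate tb_per_week in one week."""
    bytes_per_second = tb_per_week * BYTES_PER_TB / SECONDS_PER_WEEK
    return bytes_per_second / 10**6


rate = sustained_rate_mb_per_s(300)
print(f"{rate:.0f} MB/s, or about {rate * 8 / 1000:.2f} Gbit/s")
```

Under those assumptions, 300 TB per week works out to roughly 496 MB/s of continuous data, which also shows how quickly a fixed number like this can date as hardware improves.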

In an attempt to discern a populist definition of “Big Data”, Ward and Barker analyzed the Google search phrases most commonly associated with the term [8] and came up with the following list:

  • data analytics
  • Hadoop
  • NoSQL
  • Google
  • IBM
  • Oracle

Ward and Barker note that all of the definitions referenced make at least one of the following assertions:

  • Size: the volume of the datasets is a critical factor.
  • Complexity: the structure, behaviour and permutations of the datasets is a critical factor.
  • Technologies: the tools and techniques which are used to process a sizable or complex dataset is a critical factor.

Taking these points into account, and considering the sum as well as all of the parts of the aforementioned definitions, they extrapolated the following definition of their own:

Big data is a term describing the storage and analysis of large and or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce and machine learning.

Over the past several years, I have been asked for my definition of “Big Data”. My personal bias is more in line with the definition offered by NIST than with any of the others, although I am sympathetic to the MIKE2.0 definition as well. I would posit:

Big Data is data which, because of its complexity, its size, or the speed at which it is generated, exceeds the capability of current conventional methods and systems to extract practical use and value.

  1. J. S. Ward and A. Barker. Undefined By Data: A Survey of Big Data Definitions. arXiv:1309.5821v1 [cs.DB], 2013.
  2. D. Laney. 3D Data Management: Controlling Data Volume, Velocity and Variety. META Group Research Note, February 2001.
  3. J. P. Dijcks. Oracle: Big data for the enterprise. Oracle White Paper, 2012.
  4. Intel Peer Research on Big Data Analysis.
  5. The Big Bang: How the Big Data Explosion Is Changing the World. Microsoft UK Enterprise Insights Blog, MSDN Blogs.
  6. Big Data Definition – MIKE2.0, the open source methodology for Information Development.
  7. NIST Big Data Working Group (NBD-WG).
  8. Google. Google Trends for Big Data, 2013.


How to choose a No SQL Database

This article, while targeting the question, “How to choose the right NoSQL Database?”, contains a straightforward and concise taxonomy of NoSQL databases:

NoSQL Databases can be categorized into four major groups.

  1. Key-value databases
  2. Wide-column databases (column-family stores)
  3. Document databases
  4. Graph databases
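
To make the four categories concrete, here is a rough sketch of how the same small record might be shaped under each model, using only plain Python structures rather than any real database client. The user record, the key names and the `followed_by` helper are all hypothetical and chosen purely for illustration.

```python
import json

# 1. Key-value: the store sees only a key and an opaque value.
key_value = {"user:42": json.dumps({"name": "Ada", "city": "London"})}

# 2. Wide-column: row key -> column family -> column -> value.
wide_column = {
    "user:42": {
        "profile": {"name": "Ada", "city": "London"},
        "activity": {"last_login": "2013-10-01"},
    }
}

# 3. Document: a nested, self-describing record that can be queried by field.
document = {"_id": 42, "name": "Ada", "address": {"city": "London"}}

# 4. Graph: nodes plus explicit, typed relationships between them.
nodes = {42: {"name": "Ada"}, 7: {"name": "Grace"}}
edges = [(42, "FOLLOWS", 7)]


def followed_by(user_id: int) -> list[str]:
    """Names of the users that user_id follows, found by walking the edges."""
    return [
        nodes[dst]["name"]
        for src, rel, dst in edges
        if src == user_id and rel == "FOLLOWS"
    ]


print(json.loads(key_value["user:42"])["city"])   # value must be decoded first
print(wide_column["user:42"]["profile"]["city"])  # addressed by family and column
print(document["address"]["city"])                # addressed by nested field
print(followed_by(42))                            # answered by traversing edges
```

The main difference this highlights is where the structure lives: the key-value store knows nothing about the value, the wide-column and document models expose progressively more of it, and the graph model makes the relationships themselves first-class.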
