Big Data Technologies

Now that we have selected a working definition of Big Data, we can look at the technologies that have emerged, and are emerging, to meet this new challenge. To review, our definition of “Big Data” is:

Big Data is that data which, because of its complexity, its size, or the speed at which it is being generated, exceeds the capability of current conventional methods and systems to extract practical use and value.

Three technologies, or technology groups, have emerged targeting the management and processing of “Big Data”:

  • Hadoop
    Hadoop is an open source framework for the storage and processing of massive amounts of data. Originally developed at Yahoo!, Hadoop is based on work published by Google and fundamentally relies on a distributed, redundant (for fault tolerance) file system, the Hadoop Distributed File System (HDFS), and a mechanism for processing the distributed data in parallel called MapReduce (a minimal sketch of the paradigm appears after this list).
  • NoSQL
    NoSQL refers to a group of data management technologies geared toward the management of large data sets in the context of discrete transactions or individual records, as opposed to the batch orientation of Hadoop. A common theme of NoSQL technologies is to trade ACID (atomicity, consistency, isolation, durability) compliance for performance; this model of consistency has been called “eventually consistent”1 (a toy illustration of the key-value model and eventual consistency appears after this list). NoSQL databases are often broken into categories based on their underlying data model. The most commonly referenced categories and representative examples are as follows:

    • Key-Value Pair Databases
      • E.g. Dynamo, Riak, Redis, MemcacheDB, Project Voldemort
    • Document Databases
      • E.g. MongoDB, CouchDB
    • Graph Databases
      • E.g. Neo4J, AllegroGraph, Virtuoso
    • Columnar Databases
      • E.g. HBase, Accumulo, Cassandra, SAP Sybase IQ
  • Massively Parallel Analytic Databases
    As the name implies, massively parallel analytic databases employ massive parallel processing, or MPP, to allow for the ingest, processing and querying of data (typically structured) across multiple machines simultaneously. This architecture makes for significantly faster performance than that of a traditional database running on a single, large box.

    It is common for massively parallel analytic databases to employ a shared-nothing architecture, which ensures there is no single point of failure: each node operates independently of the others, so if one machine fails, the others keep running. Additionally, it is not uncommon for the nodes to be made up of commodity, off-the-shelf hardware so they can be scaled out in a (relatively) cost-effective manner (a toy scatter-gather sketch appears below).
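To make the MapReduce paradigm concrete, here is a minimal, single-process sketch of the classic word-count job in Python. It illustrates only the map, shuffle and reduce phases; a real Hadoop job distributes these same steps across HDFS blocks and many machines, and the function names below are my own, not Hadoop's API.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework does
    between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: collapse the list of counts for one word into a total."""
    return key, sum(values)

documents = ["big data is big", "data exceeds conventional systems"]
pairs = (pair for doc in documents for pair in map_phase(doc))
counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'exceeds': 1, ...}
```

The value of the model is that the map and reduce functions are side-effect free, so the framework is free to run thousands of them in parallel, each near the data it reads.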
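Likewise, here is a toy Python sketch of the key-value model and the “eventually consistent” trade-off described above: a write lands on one replica and only later propagates to another, so a read from the second replica can briefly return stale data. The class and timing here are contrived for illustration; real stores such as Riak or Dynamo use mechanisms like quorums and vector clocks rather than this simple delayed copy.

```python
import threading
import time

class Replica:
    """A toy key-value replica: nothing more than a dict with get/put."""
    def __init__(self):
        self.store = {}

    def put(self, key, value):
        self.store[key] = value

    def get(self, key):
        return self.store.get(key)

def replicate_later(src, dst, key, delay=0.1):
    """Copy one key to another replica in the background, after a delay."""
    def task():
        time.sleep(delay)
        dst.put(key, src.get(key))
    threading.Thread(target=task).start()

a, b = Replica(), Replica()
a.put("user:42", "alice")          # the write lands on replica A only
replicate_later(a, b, "user:42")   # propagation happens asynchronously
print(b.get("user:42"))            # None: replica B has not converged yet
time.sleep(0.2)
print(b.get("user:42"))            # 'alice': the replicas now agree
```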
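Finally, a toy scatter-gather sketch of the shared-nothing idea: each “node” owns its own partition of the rows, computes a partial aggregate over only its local data, and a coordinator combines the partial results. Threads stand in for independent machines here, and the data and function names are invented for illustration; a real MPP database distributes both the data and the query plan across physical nodes.

```python
from concurrent.futures import ThreadPoolExecutor

# Shared-nothing layout: each "node" owns one partition of the table
# and never reads another node's rows.
partitions = [
    [{"region": "east", "sales": 120}, {"region": "west", "sales": 80}],
    [{"region": "east", "sales": 50},  {"region": "west", "sales": 200}],
    [{"region": "east", "sales": 75},  {"region": "west", "sales": 40}],
]

def node_scan(rows, region):
    """Each node computes a partial sum over its local rows only."""
    return sum(r["sales"] for r in rows if r["region"] == region)

def mpp_query(region):
    """Scatter the query to every node in parallel, then gather and combine."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda rows: node_scan(rows, region), partitions)
    return sum(partials)

print(mpp_query("east"))  # 245: 120 + 50 + 75
```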

In the coming articles, I'll address each of these technologies in detail. In the next article, I'll explore Hadoop, focusing first on the Hadoop Distributed File System and then diving into the MapReduce paradigm.


Why Big Data 101?

While the following events did not occur verbatim, they did happen in spirit.


At a recent family event I had the opportunity to talk with my younger brother. He graduated from the University of Wisconsin-Milwaukee a couple of years ago with a degree in computer science. Since graduating he found a full-time paid internship and was able to convert it into a full-time engineering position, giving him roughly three years of practical work experience.

During the course of conversation, I kept having this nagging feeling of being misunderstood. It felt as though what I was saying was missing a common frame of reference for the topics of the conversation; that I was, in essence, speaking gibberish. I understand that my brother is young, much, much younger than I. I also understand we come from two very different ways of having been taught our craft, but still, we're both technical and should have common ground; something was very off.

I stopped the conversation and bluntly asked him, “Do you know…”

  • what is meant by the term “Big Data”?
  • what Hadoop is?
  • what the MapReduce paradigm is?
  • why a NoSQL database would be used vs. a traditional RDBMS?

To each of these questions, the answers were similar. He had heard all of the terms I asked about, but he had no actual experience with any of them. He went on to tell me that his way of learning things was to wait until he was either told by a manager he would need a new skill or knowledge-set for an upcoming project or he came to discover it himself. In other words, he would learn something new only if it was required; there was no exploring or being proactive.

After I quickly reorganized my thoughts following the shock and awe that was just dropped before me, I saw an opportunity to reach out and be a mentor. I explained to him the importance of taking control over his own destiny and to actively manage his learning and career direction. In the world today, waiting for someone to tell you what to do just doesn’t cut it.


Further, he had been telling me how he was being put more and more into a leadership and project management role, and while he found he was good at it, and had been recognized as such by his managers, it wasn't what he wanted to be doing. Of course, I made the foolish mistake of asking him what it is that he wants to do, only to get back the answer I deserved for such a silly question: “Gee, I'm not sure, I just know I don't want to do that.”1

I told him that I would write up a list of links and come up with a self-education task plan for him around the big data space if he were interested. And that brings us to where we are now. That is why, over the next several weeks, I'll be going back to basics and creating a primer series of articles and tutorials on “Big Data”, Hadoop and NoSQL.


  1. This is something that I've come to call the ‘Law of Hate'. It typically gets applied to customers, and specifically in user interface design situations. In essence, it is impossible for a customer to tell you, the consultant, what they want. They can only tell you what they hate after you have shown them the two dozen mock-ups you have prepared.

In Search of a Definition for Big Data

The concept of “Big”, with its implication of significance, complexity and challenge, presents us with difficulty when trying to nail down a definition because of an inherent invitation for quantification. When one goes to define “Big”, it stands to reason that a size, some number, is all but demanded. Living, however, in the era of Moore's law, the almost contradictory notion that a data set defined to be large today will certainly seem small in the not-too-distant future makes assigning a quantity to “Big” seem arbitrary. This appears to imply that “Big Data”, at any given point in time, will always be more than current conventional techniques can handle.

There are those who assign the complexity of a data set greater import than its size when deciding whether it is “Big”. Taking things further, there are definitions which include requisite solution components, tying “Big Data” to technologies such as Apache Hadoop and NoSQL stores such as Amazon Dynamo, Cassandra, CouchDB and MongoDB.

In an academic paper published last month (September 30, 2013), Jonathan Stuart Ward and Adam Barker at the University of St Andrews in Scotland have made a valiant effort to clarify the definition of “Big Data”. The abstract of that paper is as follows:

Undefined By Data: A Survey of Big Data Definitions1

The term big data has become ubiquitous. Owing to a shared origin between academia, industry and the media there is no single unified definition, and various stakeholders provide diverse and often contradictory definitions. The lack of a consistent definition introduces ambiguity and hampers discourse relating to big data. This short paper attempts to collate the various definitions which have gained some degree of traction and to furnish a clear and concise definition of an otherwise ambiguous term.

Source: arXiv:1309.5821v1 [cs.DB] http://arxiv.org/pdf/1309.5821v1.pdf


This paper calls out six of the better known and oft-quoted definitions of “Big Data”. I've summarized these six below:

  1. Gartner: In 2001, a META Group (now Gartner) report noted the increasing size of data, the increasing rate at which it is produced and the increasing range of formats and representations employed. This report predated the term “Big Data” but proposed a three-fold definition encompassing the “three Vs”: Volume, Velocity and Variety.2
  2. Oracle: In a 2012 white paper entitled “Oracle: Big data for the enterprise”,3 the author contends that big data is the derivation of value from traditional relational database driven business decision making, augmented with new sources of unstructured data. This definition does not make it clear exactly when and why the term big data is applicable, but rather suggests that one who has the requisite experience and background will “know it when you see it”.
  3. Intel: Intel takes a stand and quantifies where “Big” begins. According to Intel, an organization is playing in the realm of “Big Data” when it is “generating a median of 300 terabytes (TB) of data weekly”4. Additionally, Intel asserts that the most common data involved in analytics are business transactions stored in relational databases, with unstructured data in the form of documents, email, sensor data, blogs and social media following.
  4. Microsoft: The definition of “Big Data” by Microsoft is clear and straightforward: “Big Data is the term increasingly used to describe the process of applying serious computing power – the latest in machine learning and artificial intelligence – to seriously massive and often highly complex sets of information”5. This definition clearly states that “Big Data” requires the application of significant compute power. In addition, two technologies, machine learning and artificial intelligence, are introduced. While a volume quantification is lacking, the concept of there being related technologies involved is added.
  5. MIKE2.0, The Method for an Integrated Knowledge Environment project: The MIKE project makes the argument that “Big Data can be very small and not all large datasets are big”6. Their argument, that “Big Data” is not a function of the size of a data set but of its complexity, alters the fundamental semantic of “Big” to the point that we may need or want a term other than “Big Data”.
  6. NIST, The National Institute of Standards and Technology: The US Government's NIST has defined “Big Data” in terms somewhat similar to MIKE2.0's. Their definition supports the notion that “Big” is relative and that “Big Data” is data that challenges current paradigms and practices; specifically, it is data which “exceed(s) the capacity or capability of current or conventional methods and systems”.7

The authors of the paper, Ward and Barker, in an attempt to discern a populist definition of “Big Data”, performed an analysis of the Google search phrases most commonly associated with “Big Data”8 and came up with the following list:

  • data analytics
  • Hadoop
  • NoSQL
  • Google
  • IBM
  • Oracle

Ward and Barker note that all of the definitions referenced make at least one of the following assertions:

  • Size: the volume of the datasets is a critical factor.
  • Complexity: the structure, behaviour and permutations of the datasets is a critical factor.
  • Technologies: the tools and techniques which are used to process a sizable or complex dataset is a critical factor.

Taking these points into account, and considering the sum as well as the parts of the aforementioned definitions, they extrapolated and came up with the following definition of their own:

Big data is a term describing the storage and analysis of large and/or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce and machine learning.


I have, over the past several years, been asked for my definition of “Big Data”. My personal bias is more in line with the definition offered by NIST than with any of the others, although I am sympathetic to the MIKE2.0 definition as well. I would posit:

Big Data is that data which, because of its complexity, its size, or the speed at which it is being generated, exceeds the capability of current conventional methods and systems to extract practical use and value.


  1. J. S. Ward and A. Barker. Undefined By Data: A Survey of Big Data Definitions, 2013. arXiv:1309.5821v1 [cs.DB]. http://arxiv.org/pdf/1309.5821v1.pdf
  2. D. Laney. 3D Data Management: Controlling Data Volume, Velocity and Variety. META Group (now Gartner), 2001.
  3. J. P. Dijcks. Oracle: Big data for the enterprise. Oracle White Paper, 2012.
  4. Intel Peer Research on Big Data Analysis. http://www.intel.com/content/www/us/en/big-data/data-insights-peer-research-report.html.
  5. The Big Bang: How the Big Data Explosion Is Changing the World – Microsoft UK Enterprise Insights Blog – Site Home – MSDN Blogs. http://blogs.msdn.com/b/microsoftenterpriseinsight/archive/2013/04/15/the-big-bang-how-the-big-data-explosion-is-changing-the-world.aspx.
  6. Big Data Definition – MIKE2.0, the open source methodology for Information Development. http://mike2.openmethodology.org/wiki/Big%20Data%20Definition.
  7. NIST Big Data Working Group (NBD-WG). http://bigdatawg.nist.gov/home.php.
  8. Google. Google Trends for Big Data, 2013.