Big Data Technologies

Now that we have selected a working definition of Big Data, we can look at the technologies that have emerged, and are emerging, to meet this challenge. To review, our definition of “Big Data” is:

Big Data is data that, because of its complexity, its size, or the speed at which it is being generated, exceeds the capability of current conventional methods and systems to extract practical use and value.

Three technologies, or technology groups, have emerged to target the management and processing of “Big Data”:

  • Hadoop
    Hadoop is an open-source framework for the storage and processing of massive amounts of data. Originally developed at Yahoo!, Hadoop is based on work published by Google. It fundamentally relies on a distributed, redundant (for fault tolerance) file system, the Hadoop Distributed File System (HDFS), and a mechanism for processing the distributed data in parallel called MapReduce.
  • NoSQL
    NoSQL refers to a group of data management technologies geared toward managing large data sets in the context of discrete transactions or individual records, as opposed to the batch orientation of Hadoop. A common theme among NoSQL technologies is to trade ACID (atomicity, consistency, isolation, durability) compliance for performance; this relaxed model of consistency has been called ‘eventually consistent’. NoSQL databases are often broken into categories based on their underlying data model. The most commonly referenced categories, with representative examples, are as follows (a minimal sketch of the key-value model follows this list):

    • Key-Value Pair Databases
      • E.g. Dynamo, Riak, Redis, MemcacheDB, Project Voldemort
    • Document Databases
      • E.g. MongoDB, CouchDB
    • Graph Databases
      • E.g. Neo4J, Allegro, Virtuoso
    • Columnar Databases
      • E.g. HBase, Accumulo, Cassandra, SAP Sybase IQ
  • Massively Parallel Analytic Databases
    As the name implies, massively parallel analytic databases employ massive parallel processing (MPP) to allow for the ingest, processing, and querying of data (typically structured) across multiple machines simultaneously. This architecture makes for significantly faster performance than a traditional database running on a single, large box.

    It is common for massively parallel analytic databases to employ a shared-nothing architecture, in which each node has its own processor, memory, and disk and operates independently of the others. If one machine fails, the rest keep running, so there is no single point of failure. Additionally, it is not uncommon for the nodes to be built from commodity, off-the-shelf hardware, so the system can be scaled out in a (relatively) cost-effective manner. A toy sketch of this scatter-gather query pattern follows below.
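
To make the key-value model concrete, here is a toy, in-memory sketch in Java. It is purely illustrative (the class and key names are my own, not from any particular product); real key-value stores such as Redis or Riak add persistence, replication, and distribution across nodes, but the client-facing model is essentially this: an opaque key mapped to an opaque value.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy in-memory key-value store illustrating the data model behind
// systems like Redis or Riak: the store knows nothing about the value's
// structure; it only maps an opaque key to an opaque blob of bytes.
public class ToyKeyValueStore {
  private final Map<String, byte[]> data = new ConcurrentHashMap<>();

  public void put(String key, byte[] value) { data.put(key, value); }
  public byte[] get(String key)             { return data.get(key); }
  public void delete(String key)            { data.remove(key); }

  public static void main(String[] args) {
    ToyKeyValueStore store = new ToyKeyValueStore();
    store.put("user:42:name", "Ada Lovelace".getBytes());
    System.out.println(new String(store.get("user:42:name")));
  }
}
```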
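
To illustrate the shared-nothing, scatter-gather pattern described above, here is a toy Java sketch that uses threads to stand in for nodes. This is a conceptual illustration only, not how any particular MPP product is implemented; all class and variable names are my own.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.LongStream;

// Toy shared-nothing "cluster": each worker owns a private partition of
// the data and computes a partial aggregate independently (scatter);
// a coordinator merges the partials (gather). Real MPP databases do this
// across machines rather than threads.
public class ScatterGatherSum {
  public static void main(String[] args) throws Exception {
    final int nodes = 4;
    final long partSize = 250_000;

    // Shared-nothing: node n owns rows (n*partSize, (n+1)*partSize].
    List<long[]> partitions = IntStream.range(0, nodes)
        .mapToObj(n -> LongStream
            .rangeClosed(n * partSize + 1, (n + 1) * partSize).toArray())
        .collect(Collectors.toList());

    ExecutorService cluster = Executors.newFixedThreadPool(nodes);
    List<Future<Long>> partials = partitions.stream()
        .map(part -> cluster.submit(() -> {
          long sum = 0;                  // local aggregation on "this node"
          for (long v : part) sum += v;
          return sum;
        }))
        .collect(Collectors.toList());

    long total = 0;                      // coordinator gathers the partials
    for (Future<Long> f : partials) total += f.get();
    cluster.shutdown();
    System.out.println("SELECT SUM(v) = " + total); // 500000500000
  }
}
```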

In the coming articles, I’ll address each of these technologies in detail. In the next article, I’ll dive into and explore Hadoop, focusing first on the Hadoop Distributed File System and then on the MapReduce paradigm.
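
As a preview of that deep dive, below is a minimal sketch of the canonical word-count job, closely following the standard example from the Hadoop documentation and using the org.apache.hadoop.mapreduce Java API. The input and output HDFS paths are supplied on the command line; this assumes a Hadoop installation on the classpath.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: for each line of input, emit a (word, 1) pair per token.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each distinct word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The key idea is that the framework moves the computation to the data: each mapper runs on the HDFS node holding its block of input, and the framework shuffles and sorts the intermediate (word, count) pairs so that all values for a given key arrive at the same reducer.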


What is Big Data?

This is a topic we have explored several times.

In Why the 3V’s Are Not Sufficient to Describe Big Data, Mark van Rijmenam describes what has become the ‘classic’ definition of “Big Data”, as put forth by Doug Laney in 2001. Laney defined big data as three-dimensional: increasing volume (the amount of data), velocity (the speed of data in and out), and variety (the range of data types and sources). These three dimensions (Volume, Velocity, and Variety) have since taken on the name “the 3V’s”.

After establishing the bona fides of the 3V’s, van Rijmenam goes on to say, “I do think that big data can be better explained by adding a few more V’s”. The other V’s he proposes are: Veracity, Variability, Visualization and Value.

In an InformationWeek article titled Big Data: Avoid Wanna-V Confusion, Seth Grimes makes the contrary argument. Grimes counters van Rijmenam by saying that the original 3V’s, while not perfect, work well, and that in his opinion the drive behind the ‘wanna-V’s’ is rooted in spin and sows needless division.


A few months ago, in an article titled In Search of a Definition for Big Data, I wrote that the implied demand for an arbitrary size specification when defining “Big Data” makes any such definition obsolete the moment it is formulated. That article went on to review an academic paper by Jonathan Stuart Ward and Adam Barker of the University of St Andrews in Scotland. Their paper surveyed the definitions of “Big Data” from a number of industry-relevant and significant organizations, such as Gartner (the 3V’s), Intel, Microsoft, and Oracle, leading Ward and Barker to offer their own composite definition of “Big Data”:

“Big data is a term describing the storage and analysis of large and or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce and machine learning.”

I ended that article by sharing my personal definition of “Big Data”:

Big Data is data that, because of its complexity, its size, or the speed at which it is being generated, exceeds the capability of current conventional methods and systems to extract practical use and value.

This is the definition we’ll go with from here forward, at least until a better one comes to mind.

Why Big Data 101?

While the following events did not occur verbatim, they did happen in spirit.


At a recent family event I had the opportunity to talk with my younger brother. He graduated from the University of Wisconsin-Milwaukee a couple of years ago with a degree in computer science. Since graduating, he found a full-time paid internship and converted it into a position as a full-time engineer, leaving him with roughly three years of practical work experience.

During the course of the conversation, I kept having a nagging feeling of being misunderstood. It felt as though we lacked a common frame of reference for the topics of the conversation; that I was, in essence, speaking gibberish. I understand that my brother is young, much younger than I am. I also understand we were taught our craft in very different ways, but still, we’re both technical and should have common ground; something was very off.

I stopped the conversation and bluntly asked him, “Do you know…”

  • what is meant by the term “Big Data”?
  • what Hadoop is?
  • what the MapReduce paradigm is?
  • why a NoSQL database would be used vs. a traditional RDBMS?

To each of these questions, the answer was similar: he had heard all of the terms I asked about, but he had no actual experience with any of them. He went on to tell me that his way of learning was to wait until a manager told him he would need a new skill or knowledge set for an upcoming project, or until he stumbled onto it himself. In other words, he would learn something new only if it was required; there was no exploration, no being proactive.

After I quickly reorganized my thoughts following the shock and awe that had just been dropped before me, I saw an opportunity to reach out and be a mentor. I explained to him the importance of taking control of his own destiny and actively managing his learning and career direction. In the world today, waiting for someone to tell you what to do just doesn’t cut it.


Further, he had been telling me how he was being put more and more into a leadership and project-management role, and while he found he was good at it (and had been recognized as such by his managers), it wasn’t what he wanted to be doing. Of course, I made the foolish mistake of asking him what it is he does want to do, only to get back the answer I deserved for such a silly question: “Gee, I’m not sure, I just know I don’t want to do that.”1

I told him that, if he were interested, I would write up a list of links and put together a self-education task plan for him around the big data space. And that brings us to where we are now: over the next several weeks, I’ll be going back to basics and creating a baseline primer series of articles and tutorials on “Big Data”, Hadoop, and NoSQL.


  1. This is something I’ve come to call the ‘Law of Hate’. It typically gets applied to customers, specifically in user-interface design situations. In essence, it is impossible for a customer to tell you, the consultant, what they want; they can only tell you what they hate after you have shown them the two dozen mock-ups you have prepared.