Big Data Technologies

Now that we have selected a working definition of Big Data, we can look at the technologies that have emerged, and are still emerging, to meet this new challenge. To review, our definition of “Big Data” is:

Big Data is data that, because of its complexity, its size, or the speed at which it is being generated, exceeds the capability of current conventional methods and systems to extract practical use and value.

Three technologies, or technology groups, have emerged to target the management and processing of “Big Data”:

  • Hadoop
    Hadoop is an open source framework for the storage and processing of massive amounts of data. Originally developed at Yahoo!, Hadoop is based on work published by Google and fundamentally relies on a distributed, redundant (for fault tolerance) file system (the Hadoop Distributed File System, or HDFS) and a mechanism for processing the distributed data in parallel called MapReduce. (A minimal sketch of the MapReduce flow appears after this list.)
  • NoSQL
    NoSQL refers to a group of data management technologies geared toward the management of large data sets in the context of discrete transactions or individual records, as opposed to the batch orientation of Hadoop. A common theme of NoSQL technologies is to trade ACID (atomicity, consistency, isolation, durability) compliance for performance. This model of consistency has been called ‘eventually consistent’ [1]; a short sketch of the idea also follows this list. NoSQL databases are often broken into categories based on their underlying data model. The most commonly referenced categories and representative examples are as follows:

    • Key-Value Pair Databases
      • E.g. Dynamo, Riak, Redis, MemcacheDB, Project Voldemort
    • Document Databases
      • E.g. MongoDB, CouchDB
    • Graph Databases
      • E.g. Neo4j, AllegroGraph, Virtuoso
    • Columnar Databases
      • E.g. HBase, Accumulo, Cassandra, SAP Sybase IQ
  • Massively Parallel Analytic Databases
    As the name implies, massively parallel analytic databases employ massively parallel processing, or MPP, to allow for the ingest, processing, and querying of data (typically structured) across multiple machines simultaneously. This architecture makes for significantly faster performance than a traditional database that runs on a single, large box.

    It is common for massively parallel analytic databases to employ a shared-nothing architecture, which ensures there is no single point of failure: each node operates independently of the others, so if one machine fails, the rest keep running. Additionally, it is not uncommon for the nodes to be built from commodity, off-the-shelf hardware, so they can be scaled out in a relatively cost-effective manner. (A scatter-gather sketch of this architecture appears below.)
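To make the MapReduce model mentioned above concrete, here is a minimal, pure-Python sketch of the map → shuffle → reduce flow for the classic word-count example. It simulates in a single process what Hadoop distributes across a cluster; the function names are illustrative, not part of the Hadoop API.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in the input split."""
    for word in document.lower().split():
        yield (word, 1)

def shuffle(mapped_pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Sum the counts emitted for a single word."""
    return (key, sum(values))

documents = ["big data is big", "data exceeds conventional systems"]

# Map every document, shuffle the intermediate pairs, then reduce per key.
mapped = (pair for doc in documents for pair in map_phase(doc))
grouped = shuffle(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())

print(counts)  # {'big': 2, 'data': 2, 'is': 1, ...}
```

In a real Hadoop job, the map and reduce functions run on the nodes where HDFS stores the data, and the framework handles the shuffle across the network.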
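The ‘eventually consistent’ trade-off described under NoSQL can also be sketched in a few lines of Python. The Replica class and write path below are hypothetical, used only to show why a read served by one replica may briefly disagree with a write accepted by another until replication catches up.

```python
class Replica:
    """A hypothetical key-value replica in an eventually consistent store."""
    def __init__(self, name):
        self.name = name
        self.data = {}

    def get(self, key):
        return self.data.get(key)

replicas = [Replica("node-a"), Replica("node-b"), Replica("node-c")]
pending = []  # replication log: writes not yet applied everywhere

def write(key, value):
    # Acknowledge after one replica accepts the write (fast, but not ACID).
    replicas[0].data[key] = value
    pending.append((key, value))

def anti_entropy():
    # Background replication: propagate pending writes to every replica.
    while pending:
        key, value = pending.pop(0)
        for replica in replicas:
            replica.data[key] = value

write("user:42", "active")
print(replicas[2].get("user:42"))  # None -- stale read before replication
anti_entropy()
print(replicas[2].get("user:42"))  # 'active' -- eventually consistent
```

Real systems such as Dynamo, Riak, and Cassandra use mechanisms like read repair and tunable quorums to bound this window; the sketch only shows the basic trade of immediate consistency for write availability.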
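Finally, the shared-nothing, scatter-gather pattern behind massively parallel analytic databases can be illustrated with Python’s standard multiprocessing module. Each worker process owns its own data partition and computes a partial aggregate independently; a coordinator merges the partials. The partitioning scheme and query here are invented for illustration.

```python
from multiprocessing import Pool

def partial_aggregate(partition):
    """Run the per-node piece of a query: SELECT SUM(amount), COUNT(*)."""
    total = sum(row["amount"] for row in partition)
    return total, len(partition)

if __name__ == "__main__":
    # Each partition lives on its own node in a real MPP system;
    # no memory or disk is shared between them (shared-nothing).
    partitions = [
        [{"amount": 10.0}, {"amount": 5.5}],
        [{"amount": 7.25}],
        [{"amount": 3.0}, {"amount": 9.0}, {"amount": 1.5}],
    ]

    with Pool(processes=len(partitions)) as pool:
        partials = pool.map(partial_aggregate, partitions)

    # The coordinator gathers and merges the partial results.
    total = sum(t for t, _ in partials)
    count = sum(c for _, c in partials)
    print(f"SUM={total}, COUNT={count}, AVG={total / count:.2f}")
```

Because aggregates like SUM and COUNT decompose cleanly into per-partition partials, adding nodes increases throughput almost linearly, which is the appeal of scaling out on commodity hardware.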

In the coming articles, I’ll address each of these technologies in detail. In the next article, I’ll explore Hadoop, focusing first on the Hadoop Distributed File System and then on the MapReduce paradigm.
