To Model Or Not To Model: Is That The Quesion?

The basic gist of this article is that the exercise of data modeling is just as important when using the big data and NoSQL technologies as it is when using the more traditional relational algebra based technologies.

This conclusion came after a series of experiments were performed pitting Cloudera’s Hadoop distribution against an unidentified ‘major relational database’. A suite of 5 business questions were distilled into either SQL for the relational database, or HQL for execution against Hadoop stacked with Hive. For each of the queries, for each data store, 5 experimental scenarios were explored:

  1. Flat Schema vs. Star Schema
  2. Using compressed data vs. uncompressed in Hadoop
  3. Indexing appropriate columns
  4. Partitioning the data by date
  5. Compare Hive/HQL/MapReduce to Cloudera Impala

Details of the experiment and intermediate results can be found in the article, but at a macro level, the results were mixed with the exception of it being clear that a flat un-modeled schema was not a scenario one should use and expect performance. As the article points out, the question is not whether one should model or not, but rather how and when.

The Hadoop experiment: To model or not to model – The Data Roundtable

Captured by

Tamara Dull, Director of Emerging Technologies, SAS Best Practices wrote the following at SAS’ “The Data Roundtable” blog:

It was refreshing to see that the RDBMS skills some of us have developed over the years still apply with these new big data technologies. And while discussions of late binding (i.e., applying structure to the data at query time, not load time) work their way down our corporate hallways, we are reminded once again that “it depends” is a far more honest and accurate answer than it’s been given credit for in the past.

To model or not to model is no longer the question. How and when to model is.

Big Data Technologies

Now that we have selected a working definition of Big Data we can look at the technologies that have emerged and are emerging to take advantage of this new challenge. To review, our definition of “Big Data” is:

Big Data is that data, which because of its complexity, or its size, or the speed it is being generated exceeds the capability of current conventional methods and systems to extract practical use and value.

There have emerged three technologies or technology groups targeting the management and processing of “Big Data”:

  • HadoopApache Hadoop ElephantHadoop is an open source framework for the storage and processing of massive amounts of data. Originally developed at Yahoo!, Hadoop is based on work published by Google and fundamentally relies on a distributed, redundant (for fault tolerance) file system (the Hadoop Distributed File System, or HDFS) and a mechanism for processing the distributed data in parallel called MapReduce.
  • NoSQL
    Intro to NoSQL and Cassandra, Ran TavoriNoSQL refers to a group of data management technologies geared toward the management of large data sets in the context of discrete transactions or individual records as opposed to the batch orientation of Hadoop. A common theme of NoSQL technologies is to trade ACID (atomicity, consistency, isolation, durability) compliance for performance. This model of consistency been called ‘eventually consistent’1.NoSQL databases are often broken into categories based on their underlying data model. The most commonly referenced categories and representative examples are as follows:

    • Key-Value Pair Databases
      • E.g. Dynamo, Riak, Redis, MemcacheDB, Project Voldemort
    • Document Databases
    • Graph Databases
      • E.g. Neo4J, Allegro, Virtuoso
    • Columnar Databases
      • E.g. HBase, Accumulo, Cassandra, SAP Sybase IQ
  • Massively Parallel Analytic DatabasesSystem diagram of the Goodyear Aerospace Massi...As the name Implies, massively parallel analytic databases employ massive parallel processing, or MPP, to allow for the ingest, processing and querying of data (typically structured) across multiple machines simultaneously. This architecture makes for significantly faster performance than a traditional database that runs on a single, large box.

    It is common for Massively Parallel Analytic Databases to employ a shared-nothing architecture. This ensures there is no single point of failure. Each node operates independently of the others so if one machine fails, the others keep running. Additionally, it is not uncommon for the nodes to be made up of commodity, off-the-shelf, hardware so they can be scaled-out in a cost effective (relatively) manner.

In the coming articles, I’ll address in detail each of these technologies. In the next article, I’ll dive into and explore Hadoop; first focusing on the Hadoop Distributed File System and then diving into the MapReduce paradigm.

Enhanced by Zemanta