To Model Or Not To Model: Is That The Question?

The basic gist of this article is that the exercise of data modeling is just as important when using big data and NoSQL technologies as it is when using the more traditional, relational-algebra-based technologies.

This conclusion came after a series of experiments pitting Cloudera’s Hadoop distribution against an unidentified ‘major relational database’. A suite of 5 business questions was distilled into SQL for the relational database and into HQL for execution against Hadoop running Hive. For each query, on each data store, 5 experimental scenarios were explored:

  1. Flat schema vs. star schema
  2. Compressed vs. uncompressed data in Hadoop
  3. Indexing appropriate columns
  4. Partitioning the data by date
  5. Hive/HQL/MapReduce vs. Cloudera Impala
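To make the first scenario concrete: the same business question can be answered from a flat, denormalized table or from a star schema (a narrow fact table joined to dimension tables). A minimal sketch using Python’s built-in sqlite3 — the table and column names here are invented for illustration; the original experiment used Hadoop/Hive and an unnamed RDBMS:

```python
import sqlite3

# Hypothetical data: the experiment's real schemas are not given in the article.
conn = sqlite3.connect(":memory:")

# Flat (denormalized) schema: every row repeats the dimension attributes.
conn.execute("CREATE TABLE sales_flat (region TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales_flat VALUES (?, ?, ?)",
                 [("EMEA", "widget", 10.0), ("EMEA", "gadget", 5.0),
                  ("APAC", "widget", 7.5)])

# Star schema: a narrow fact table plus a dimension table.
conn.execute("CREATE TABLE dim_region (region_id INTEGER PRIMARY KEY, region TEXT)")
conn.execute("CREATE TABLE fact_sales (region_id INTEGER, product TEXT, amount REAL)")
conn.executemany("INSERT INTO dim_region VALUES (?, ?)", [(1, "EMEA"), (2, "APAC")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [(1, "widget", 10.0), (1, "gadget", 5.0), (2, "widget", 7.5)])

# The same business question phrased against each schema.
flat = conn.execute(
    "SELECT region, SUM(amount) FROM sales_flat GROUP BY region ORDER BY region"
).fetchall()
star = conn.execute(
    "SELECT r.region, SUM(f.amount) FROM fact_sales f "
    "JOIN dim_region r ON f.region_id = r.region_id "
    "GROUP BY r.region ORDER BY r.region"
).fetchall()

print(flat == star)  # True: both schemas answer the question identically
```

The answers match; what the experiment measured is how differently the two shapes *perform* under each engine.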

Details of the experiment and intermediate results can be found in the article, but at a macro level the results were mixed, with one clear exception: a flat, un-modeled schema is not a scenario one should choose and still expect good performance. As the article points out, the question is not whether one should model, but rather how and when.

The Hadoop experiment: To model or not to model – The Data Roundtable

Captured by Tamara Dull, Director of Emerging Technologies, SAS Best Practices, who wrote the following at SAS’ “The Data Roundtable” blog:

It was refreshing to see that the RDBMS skills some of us have developed over the years still apply with these new big data technologies. And while discussions of late binding (i.e., applying structure to the data at query time, not load time) work their way down our corporate hallways, we are reminded once again that “it depends” is a far more honest and accurate answer than it’s been given credit for in the past.
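The “late binding” idea Dull mentions — applying structure at query time rather than load time — can be sketched in plain Python: the raw records are stored untouched, and a schema (column names and types) is imposed only when a question is asked. The record format here is invented for illustration:

```python
import csv
import io

# Raw data is stored as-is -- no schema is imposed at load time.
raw = "2014-01-05,EMEA,10.0\n2014-01-06,APAC,7.5\n2014-01-06,EMEA,5.0\n"

def query_total_by_region(raw_text):
    """Late binding: column meanings and types are applied only at query time."""
    totals = {}
    for date, region, amount in csv.reader(io.StringIO(raw_text)):
        totals[region] = totals.get(region, 0.0) + float(amount)
    return totals

print(query_total_by_region(raw))  # {'EMEA': 15.0, 'APAC': 7.5}
```

This is the schema-on-read pattern Hive popularized; a relational database instead binds the schema once, at load time.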

To model or not to model is no longer the question. How and when to model is.


Data Modelling in MarkLogic, and how my mate Norm can help you!…

Data modelling in an aggregate/document NoSQL database can be unfamiliar to most. In this post I mention a couple of techniques and how they can help…

In relational database systems, and in NoSQL key-value and columnar stores alike, you have to take a logical application ‘object’ and shred it into one or more flat structures. These are typically table based, with some NoSQL options supporting one or more values per column, or some sort of built-in hash mapping function.

Shredding poses a few problems for application developers, mainly centred around the work they have to do just to store and retrieve their information. As a lazy bunch (I used to be one, so I can get away with that comment!) they don’t like spending time doing this; they prefer to spend time on the interesting and unique stuff. This matters because organisations want these storage layers to be fast and reliable, so they (quite rightly) force their developers to spend time on this layer too.
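The shredding work described above can be made concrete: a single logical ‘order’ object is split across two flat, table-like structures on the way in and must be reassembled on the way out, whereas an aggregate/document store keeps the whole object intact. A minimal sketch — the object shape is invented for illustration:

```python
# A logical application object: one order with nested line items.
order = {"id": 1, "customer": "Acme",
         "items": [{"sku": "A1", "qty": 2}, {"sku": "B2", "qty": 1}]}

# Shredding: flatten the aggregate into two flat, table-like structures.
orders_table = [(order["id"], order["customer"])]
items_table = [(order["id"], i["sku"], i["qty"]) for i in order["items"]]

# Retrieval: the developer must join both tables to rebuild the object.
def load_order(order_id):
    oid, customer = next(r for r in orders_table if r[0] == order_id)
    items = [{"sku": s, "qty": q} for (o, s, q) in items_table if o == order_id]
    return {"id": oid, "customer": customer, "items": items}

print(load_order(1) == order)  # True: reassembly round-trips the aggregate
```

In a document database the shred/reassemble steps disappear: the aggregate is stored and fetched whole, which is exactly the boilerplate developers would rather not write.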

Data Modelling in MarkLogic, and how my mate Norm can help you!…