The gist of this article is that data modeling is just as important with big data and NoSQL technologies as it is with the more traditional technologies based on relational algebra.
This conclusion came from a series of experiments pitting Cloudera’s Hadoop distribution against an unidentified ‘major relational database’. A suite of 5 business questions was distilled into either SQL for the relational database or HQL for execution against Hadoop stacked with Hive. For each query and each data store, 5 experimental scenarios were explored:
- Flat Schema vs. Star Schema
- Compressed vs. uncompressed data in Hadoop
- Indexing appropriate columns
- Partitioning the data by date
- Hive/HQL/MapReduce vs. Cloudera Impala
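
To make the first scenario concrete, here is a minimal sketch of the same business question asked against a flat layout and a star layout. It is only an illustration: the table names, columns, and sample rows are hypothetical, not the article's data or queries, and it uses Python's built-in `sqlite3` as a stand-in rather than Hive or the relational database from the experiments. It shows the shape of the two designs and says nothing about their relative performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Flat layout (hypothetical): descriptive attributes repeat on every sales row.
cur.execute(
    "CREATE TABLE sales_flat ("
    "sale_date TEXT, product_name TEXT, category TEXT, amount REAL)"
)

# Star layout (hypothetical): a narrow fact table keyed to a product dimension.
cur.execute(
    "CREATE TABLE dim_product ("
    "product_id INTEGER PRIMARY KEY, product_name TEXT, category TEXT)"
)
cur.execute(
    "CREATE TABLE fact_sales ("
    "sale_date TEXT, product_id INTEGER REFERENCES dim_product(product_id), "
    "amount REAL)"
)

# Made-up sample rows, used only to show that both layouts answer the question.
rows = [
    ("2014-01-05", "widget", "hardware", 10.0),
    ("2014-01-06", "gadget", "hardware", 25.0),
    ("2014-02-01", "manual", "books", 5.0),
    ("2014-02-03", "widget", "hardware", 10.0),
]
cur.executemany("INSERT INTO sales_flat VALUES (?, ?, ?, ?)", rows)

# Load the star schema: each product is stored once, facts point at it by key.
product_ids = {}
for sale_date, name, category, amount in rows:
    if name not in product_ids:
        cur.execute(
            "INSERT INTO dim_product (product_name, category) VALUES (?, ?)",
            (name, category),
        )
        product_ids[name] = cur.lastrowid
    cur.execute(
        "INSERT INTO fact_sales VALUES (?, ?, ?)",
        (sale_date, product_ids[name], amount),
    )

# The same question ("total sales per category") in both layouts.
flat_totals = cur.execute(
    "SELECT category, SUM(amount) FROM sales_flat "
    "GROUP BY category ORDER BY category"
).fetchall()
star_totals = cur.execute(
    "SELECT p.category, SUM(f.amount) "
    "FROM fact_sales f JOIN dim_product p ON p.product_id = f.product_id "
    "GROUP BY p.category ORDER BY p.category"
).fetchall()

print(flat_totals)
print(star_totals)
```

Both queries return the same totals. The difference is where the descriptive attributes live: repeated on every row in the flat table, or stored once in the dimension and reached through a join.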
Details of the experiment and intermediate results can be found in the article. At a macro level the results were mixed, with one clear exception: a flat, un-modeled schema is not something one should use and still expect good performance. As the article points out, the question is not whether to model, but how and when.
Tamara Dull, Director of Emerging Technologies, SAS Best Practices, wrote the following at SAS’ “The Data Roundtable” blog:
It was refreshing to see that the RDBMS skills some of us have developed over the years still apply with these new big data technologies. And while discussions of late binding (i.e., applying structure to the data at query time, not load time) work their way down our corporate hallways, we are reminded once again that “it depends” is a far more honest and accurate answer than it’s been given credit for in the past.
To model or not to model is no longer the question. How and when to model is.