By way of an article written by Phil Simon at SAS' Data Roundtable blog, I was led to this most excellent article by Alexis C. Madrigal in The Atlantic. In How Netflix Reverse Engineered Hollywood, Madrigal describes how he went about some reverse engineering of his own to uncover the more than 75,000 micro-genres Netflix uses to meticulously tag the video content it distributes.
Having tagged its content with this detailed metadata, Netflix is able to suggest, some might say eerily well, what we might like to watch next once we disclose our relative preferences for the content we've already seen.
I highly recommend taking the 15 minutes or so it will take to read Madrigal's article. It will be well worth your while, not only for the insights into what Netflix is doing and how, but also for the insights into the true potential of data when married with the appropriate external data and metadata to give it new and different context.
Phil Simon writes at SAS’ The Data Roundtable blog:
I’ve written before on this site about the Netflix data advantage. The company isn’t exactly forthcoming about its data, a certainly tenable position these days. After all, data is a major source of its competitive advantage.
Now, thanks to an astonishing article by Alexis Madrigal in The Atlantic, laypeople possess a much greater understanding of the data that Netflix uses and generates.
There is no shortage of valuable lessons on contemporary data management to be gleaned from Netflix. I highly recommend reading the entire article. Suffice it to say that everything Netflix can do is based upon some form of data. For now and the foreseeable future, data contains an important human element. Organizations that ignore this side of the equation do so at their own peril.
Stephen Wolfram, founder of Wolfram Research and creator of Mathematica, not too long ago announced the new Wolfram Programming Language. I've called attention to this before, am doing so again now, and will do it again in the future. This new knowledge-based language is, in my opinion, going to be a game changer in big data, data science, and computer science in general.
In the video below, Wolfram introduces the Wolfram Language, demonstrating the concepts of symbolic and functional programming, the querying of large databases with powerful visualization support, interactivity, and much more.
The basic gist of this article is that the exercise of data modeling is just as important when using big data and NoSQL technologies as it is when using more traditional relational-algebra-based technologies.
This conclusion came after a series of experiments pitting Cloudera's Hadoop distribution against an unidentified 'major relational database'. A suite of five business questions was distilled into either SQL for the relational database or HQL for execution against Hadoop running Hive. For each query, against each data store, five experimental scenarios were explored:
- Flat schema vs. star schema
- Compressed vs. uncompressed data in Hadoop
- Indexing appropriate columns
- Partitioning the data by date
- Comparing Hive/HQL/MapReduce to Cloudera Impala
Details of the experiment and intermediate results can be found in the article, but at a macro level the results were mixed, with one clear exception: a flat, unmodeled schema is not a scenario one should use and expect good performance from. As the article points out, the question is not whether one should model, but rather how and when.
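To make the flat-vs-star distinction concrete, here is a minimal sketch of a star schema using Python's built-in sqlite3 module. The table and column names are invented for illustration; the article's actual schemas and queries are not reproduced here. The idea is simply that a narrow fact table references a dimension table by key, and queries join and aggregate rather than scanning one wide, denormalized table.

```python
import sqlite3

# Hypothetical star schema in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension table: one row per date, carrying the attributes queries filter on.
cur.execute(
    "CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER)"
)
# Fact table: narrow rows that reference the dimension by key.
cur.execute("CREATE TABLE fact_sales (date_key INTEGER, amount REAL)")

cur.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                [(1, 2013, 12), (2, 2014, 1)])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                [(1, 10.0), (1, 5.0), (2, 7.5)])

# A typical star-schema query: join the fact table to the dimension
# and aggregate, instead of scanning one flat, unmodeled table.
cur.execute("""
    SELECT d.year, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_date d ON f.date_key = d.date_key
    GROUP BY d.year
    ORDER BY d.year
""")
print(cur.fetchall())  # [(2013, 15.0), (2014, 7.5)]
```

The flat-schema alternative would fold `year` and `month` into every row of `fact_sales`; the experiments described above suggest that skipping this kind of modeling step is the one scenario that clearly hurts.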
Tamara Dull, Director of Emerging Technologies, SAS Best Practices wrote the following at SAS’ “The Data Roundtable” blog:
It was refreshing to see that the RDBMS skills some of us have developed over the years still apply with these new big data technologies. And while discussions of late binding (i.e., applying structure to the data at query time, not load time) work their way down our corporate hallways, we are reminded once again that “it depends” is a far more honest and accurate answer than it’s been given credit for in the past.
To model or not to model is no longer the question. How and when to model is.
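The "late binding" idea Dull mentions, applying structure to the data at query time rather than load time, can be sketched in a few lines of Python. Everything here is hypothetical for illustration: the delimiter and the field names (date, genre, format) are invented, not drawn from any real dataset.

```python
# Schema-on-read sketch: raw records are stored as unparsed text,
# and a schema is imposed only when a query runs.
raw_records = [
    "2014-01-06|comedy|movie",
    "2014-01-07|drama|series",
    "2014-01-07|comedy|movie",
]

def query_genre_counts(records):
    """Bind a schema (date|genre|format) at query time, then aggregate."""
    counts = {}
    for line in records:
        date, genre, fmt = line.split("|")  # structure applied here, not at load
        counts[genre] = counts.get(genre, 0) + 1
    return counts

print(query_genre_counts(raw_records))  # {'comedy': 2, 'drama': 1}
```

A schema-on-write system would instead parse and validate each record when it is loaded; the trade-off, as the quote suggests, "depends" on how often the structure changes and which queries actually get asked.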