Level Set and Perspective
Before I take us down the technical path, I thought it would make sense to level set ourselves with a check on the history of Hadoop. I believe you will see that knowing a bit of the background history of Hadoop gives perspective to the why of things. It also offers a glimpse into the evolution of many project aspects.
Developed by Doug Cutting / Yahoo!
In 2002, Doug Cutting, the acknowledged progenitor of the project / product called Hadoop1, began work on a project called Nutch. The goal of Nutch was to create an index of the Internet.2 In 2006, Yahoo! hired Doug Cutting to lead development of what has become Hadoop.3
Google Concepts and Documents
While original implementations of Nutch showed promise, it was nowhere near web scale. Running on 4 nodes and indexing 100M web pages was operationally onerous. In 2003, Google published papers on a distributed file system called the Google File System (GFS) and a parallel processing model called MapReduce. Cutting recognized GFS and MapReduce’s applicability to the scale issues of Nutch.
Google Use Case
The desire to index the Internet and facilitate it’s rapid search was the basis of the use case presented by Google. It has the added trifecta of an understood and predetermined input data set, output data set and query set.
The input data set, as defined by the Google use case, is the huge and unstructured data set that is all Internet web pages. The queries to be issued would be sets of keywords; any matched keywords would return a ranked set of associated web pages. The obvious output data set is an index made up of keywords providing the correlation between keyword and web page.
The concept presented by Google to address this use case is, as earlier disclosed, a distributed file system geared to the storage and streaming retrieval of very large unstructured data sets (such as the input data set of all Internet web pages).
Layered on top of this distributed file system is a parallel processing engine implemented with a programming paradigm called MapReduce.4 Among the key concepts behind MapReduce are that the it is far less expensive to move processing power to the data upon which it is to act than move the data to the processing; thus, data is processed where it is stored vs. being moved across a network. Additionally, because of the distributed nature of the file system and the parallel nature of the processing system, the concept called for a scale-out architecture. A scale-out architecture is one which allows servers to be added as processing power needs increase and allows for servers to be added as space needs increased.
Yahoo! Use Case and Implementation
As serendipity would have it, Yahoo! had a use case with parallel motivations to Google. Not just motivation, but also the trifecta of an understood and predetermined input data set, output data set and query set is shared by Yahoo! and Google. Imagine Yahoo! having a desire to index the internet so that pages can be found quickly by keyword- it boggles the mind 😉
As mentioned earlier, Cutting recognized the applicability of GFS and MapReduce to a number of challenges being faced by Nutch and it’s development team. A refactoring of Nutch to use the concepts presented by Google vis-a-vis GFS and MapReduce took around two weeks time. The result was greater operational stability, a significantly more simple programming model and greater performance. While this two week effort did not result in rainbows and unicorns, it did present Cutting with a new and better architectural direction. This is what has now become known as Hadoop.
The Hadoop project was eventually spun out of Yahoo! and today stands as a top level Apache Software Foundation 5 project.
In the next post, I’ll take us through the overall architecture of Hadoop v1 (spoiler alert! Hadoop is up to v2) and dive into v1 of the Hadoop File System.