MapReduce: What is it?
As a way to explain what MapReduce is and how it introduces parallelism into what was once a single-threaded task, I’m going to use the very simple and admittedly scoff-ably inane example of counting how many times each word in a book occurs in that book (or collection of books). While this is an impractical use of the technology, the point here is not to show we can count, but rather to try and demonstrate how we can take something that is single threaded and turn it into a parallelized process with a few optimizations tossed in.
Before we apply any technology to the matter, let’s first approach this as a thought experiment. At its very basic level, the map part of MapReduce takes the data that is to be processed and breaks it into tuples; key – value pairs. The reduce part takes the output from map as it’s input and combines the mapped tuples into a smaller set of tuples in some way (thus reducing the output). Were I to take the book we’re going to process and tear it into two roughly equal parts, give you half and keep half, and ask you to count the words in your half while I count the words in my half, that would be ‘map’. When we’re done with our respective counting and we take our results to consolidate them into a single list of words and counts, that’s ‘reduce’.
I estimate that an average person could count the number of occurrences of each word of one page of a decent sized book (let’s imagine it’s something like ‘War and Peace’) in somewhere between 15 and 20 minutes. We’ll go down the middle and say 18 minutes. If the book in question is 1,000 pages long that puts it at 18,000 minutes or 300 hours, for one person to count the number of time each word occurs. As we discussed, if we split the book in two, we can then, in theory, get our job done in half the time- but wait -that’s only for the ‘map’ part. When we split the book into two parts, we also created the need to put the two result sets together at the end. We’ve gained the efficiency of parallelizing our task, but with the cost of adding a ‘reduce’ task to the end of our work – still, it’s got to be faster than doing the job alone.
Let’s go a step further and now define how each of us is going to count the words on a given page. As discussed, ‘map’ takes the input and breaks it into tuples to pass on to ‘reduce’. So, in this case, let’s define map to be taking each word as it occurs on a page and make that the key to a tuple with the value being simply a ‘tick’, or ‘1’, to represent this word has occurred once (or one more time). I’ll guess this changes how quickly an individual can ‘map’ a page. There’s no adding to be done, no lookup to find where the tally of a given word might have previously been recorded, or initializing a tally if found this is the first time we’ve come across a particular word. It’s simply, record the word and a ‘1’. Knowing that we’re going to have to apply some ‘reduce’ task to our output, we can apply a bit of forethought and record our ‘map’ results onto index cards vs. recording the data in a notebook.
Now, at the end of each of our respective ‘map’ tasks, we each have a (very large!) stack of index cards. How shall we consolidate and ‘reduce’ all of this data? I propose that I take our stacks, put them together and do a sort by the key of the tuple creating a series of sorted stacks. When I’m through with the sort, I can then give the first half of the cards to you, now sorted and stacked, so that you can continue to help and do half of the ‘reduce’ work to be done, while I do the other half.
In this case, ‘reduce’ is pretty easy. Each of the presorted stack of index cards simply needs to be counted and a tally tuple recorded onto a new index card. When we are each done with our ‘reduce’ task, all that needs to be done is to collect the tally tuples, in the already sorted order, first from you, then from me and voila, we have a sorted list of each word that occurred in the book to be processed along with the count of how many time that word occurred.
To recap, we have distilled our job of counting the number of time each word occurs in a book down into the following tasks:
- Split the input. In this task, the book to be processed is split into ‘n’ number of equal pieces, one for each friend who has agreed to help us ‘map’.
Execute the ‘map’ in parallel. Each friend helping to ‘map’ is given their split of the book. They each independently go off to create a stack of index cards containing a word and a ‘1’ for each word as it occurs each time it occurs in their split of the book.
Sort and Divvy. When everyone is done with their ‘map’, I collect all of the index cards and sort them into stacks by word. When the sort is complete, I divvy up the sorted stacks for each friend staying to help ‘reduce’. Note that the ‘map’ and ‘reduce’ are independent. There could be 5 friends helping to ‘map’ but only 2 left to ‘reduce’ or vice-versa, 3 could help ‘map’ and 4 more come over to give us 7 working on ‘reduce’.
Execute the ‘reduce in parallel. Each friend staying to help ‘reduce’ is given their divvy of the mapped and sorted cards. Each presorted stack is counted and a tally tuple is recorded.
Collect the tally tuples. The tally tuples are collected in the already sorted order.
In the next post in this series, I’ll cover Hadoop Streaming and how to implement what we’ve covered as a thought experiment in code.