Hands-On Hadoop 2.2.0 – Part 1 – Compile

Hadoop YARN

Getting Hands-On with Hadoop YARN

For a while now, I’ve wanted to write a tutorial about rolling up the sleeves and installing Hadoop from scratch: first as a single-node, stand-alone system, then perhaps expanding to a multi-node (probably three-node) configuration, with the end goal of showing how to write MapReduce programs in Python using the Hadoop Streaming mechanism. As I was going through the steps of installation and configuration, I ran into a hitch. It appears that the native libraries in the pre-compiled installation tarball for Hadoop v2.2.0 (the latest stable release) on the Apache website are built only for 32-bit Linux. On a 64-bit system Hadoop still runs, but it throws enough errors and warnings to make things look and act wonky. A bit of Google research led me to the conclusion that the ‘right’ way to install Hadoop on a 64-bit system is to get deeper into the dirt and compile Hadoop from source code to build a proper 64-bit installation set.
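A quick way to see this kind of mismatch for yourself is to inspect the word size of a native library. The sketch below reads the ELF header directly: the first four bytes identify an ELF file, and the byte at offset 4 is 1 for 32-bit objects and 2 for 64-bit objects. The function name `elf_class` and the library path in the usage comment are my own assumptions for illustration, not something that ships with Hadoop.

```shell
# Print "32-bit" or "64-bit" for an ELF file, based on its header bytes.
elf_class() {
    # The first four bytes of every ELF file are 0x7f 'E' 'L' 'F'.
    magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    if [ "$magic" != "7f454c46" ]; then
        echo "not-elf"
        return 1
    fi
    # Byte at offset 4 (EI_CLASS): 1 = 32-bit, 2 = 64-bit.
    class_byte=$(od -An -tu1 -j4 -N1 "$1" | tr -d ' \n')
    case "$class_byte" in
        1) echo "32-bit" ;;
        2) echo "64-bit" ;;
        *) echo "unknown" ;;
    esac
}

# Example (hypothetical path; adjust to wherever the tarball was unpacked):
#   elf_class ~/hadoop-2.2.0/lib/native/libhadoop.so.1.0.0
#   uname -m    # x86_64 means the kernel itself is 64-bit
```

If the library reports 32-bit while `uname -m` reports `x86_64`, the two do not match and the native code cannot be loaded by a 64-bit JVM.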

Ubuntu 12.04.4 LTS 64-bit Server to Compile Hadoop v2.2.0

So, that’s what this post is going to be: a tutorial on setting up a 64-bit Ubuntu Linux system that will be used to compile the source for Hadoop v2.2.0 (the current stable release). My primary workstation runs Microsoft Windows Server 2008 R2. This being the case, I’m going to use Hyper-V to create a virtual machine, load 64-bit Ubuntu 12.04.4 LTS Server into it, and use it for the compile and then, later, for the single-node, stand-alone Hadoop installation.

New Hyper-V VM for Ubuntu

Create a New Virtual Machine in Hyper-V

The first step is to open the Hyper-V Manager and create a new virtual machine: On the Hyper-V Manager Actions menu for your machine, select New and then select Virtual Machine as pictured here.

Before You Begin… Click Next to Move On

Click “Next” to bypass the ‘Before You Begin’ window.

Specify Name and Location, Next to Continue

On the next window, give the new virtual machine a name; I’m going to name it ‘compiler’. If you need to specify a location other than the default for the virtual machine files, this is the place to make that change. When ready, click “Next”.

Assign Memory, Click Next to Move Forward

The next window is where you enter the amount of memory to assign to the virtual machine. I’m going to go with 2GB (2048MB). This is pretty much the lowest I would go; you may be able to get away with 1GB (1024MB), but I would not recommend it. If you have more memory in your machine and are able to allocate more to the virtual machine, by all means, give it more! When ready, click “Next”.
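Once the guest operating system is installed, the memory allocation can be confirmed from inside the virtual machine. The following is a minimal sketch that reads the `MemTotal` line of `/proc/meminfo`; the function name `mem_total_mb` is hypothetical, and the reported figure is normally a little below the assigned amount because the kernel reserves some memory for itself.

```shell
# Print total RAM in whole megabytes, read from a meminfo-style file.
# Defaults to /proc/meminfo, which Linux exposes on a running system.
mem_total_mb() {
    awk '/^MemTotal:/ { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

# Example: warn if the guest sees noticeably less than the 2048MB assigned.
#   [ "$(mem_total_mb)" -ge 1900 ] || echo "less memory than expected"
```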

Configure Networking, Click Next to Continue

‘Configure Networking’ is the next window to be presented. I’m going to select a previously configured network connection which is bridged to my home office network. You will need connectivity to the internet; it is not optional. Configure what is appropriate for your network. When ready, click ‘Next’ to continue.

Create and Connect a New 12GB Virtual Hard Disk

On the ‘Connect Virtual Hard Disk’ window, I’m keeping the already selected option of ‘Create a virtual hard disk’ as well as the default Name and Location. I am changing the size to 12GB as that’s more than enough room for what needs to be done. When ready, click ‘Next’ to continue.

Select Install From DVD, .iso and then Browse for the File

The window that follows is the ‘Installation Options’ window. This is where you specify the ISO image from which the new virtual machine is going to boot. If you don’t already have it, you are going to need the ISO image for the 64-bit version of Ubuntu 12.04.4 LTS Server. The download page for 12.04 Ubuntu can be found HERE. A direct link to download the 12.04.4 64-bit Server ISO file can be found HERE.

Once you have the ISO file, on the ‘Installation Options’ Window, select the ‘Install an operating system from a boot CD/DVD-ROM’ option.  Then select the ‘Image file (.iso)’ option and browse to the ISO file you’ve downloaded and select it. When you are ready, click ‘Next’.

Click ‘Finish’ when shown the ‘Completing the New Virtual Machine Wizard’ Window.

Press Finish to Complete the New Virtual Machine Wizard

Select the newly configured virtual machine (remember, I named mine ‘compiler’) in the ‘Virtual Machines’ pane of the Hyper-V Manager Window and then select ‘Connect…’ in the lower right pane of the window.

Select the new Virtual Machine and Click ‘Connect’

This will bring up a console window for the ready-to-be-powered-on virtual machine.

Power-Up the New Virtual Machine

Click the green power icon in the toolbar and the virtual machine will be powered on and begin loading the operating system.

Virtual Machine Console – Ready to Start

The first screen shown by Ubuntu upon loading is a language selection page.  I’m going to select English and acknowledge by clicking the mouse in the window and pressing Enter.

Language Selection – Select English, Press Enter

On the next screen, ensure ‘Install Ubuntu Server’ is highlighted and, again, press Enter.

Select ‘Install Ubuntu Server’ and press Enter

This is followed by another language selection screen, this one for the language used during the installation process. Select ‘English’ and press Enter.

Language Selection – Select ‘English’, press Enter

Select your location (in my case, ‘United States’) and press Enter.

Location Selection – Select ‘United States’, press Enter

Do NOT auto-configure your keyboard; rather, we’ll explicitly tell the system what keyboard is in use. Select ‘No’ and press Enter to continue.

Keyboard Auto-Configuration – Select ‘No’, press Enter

Select the Keyboard’s language / country combo (in my case, US English) and press Enter to continue.

Keyboard Country Layout – Select ‘English (US)’, press Enter

Select the layout of the keyboard (QWERTY, DVORAK, etc.) and press Enter to continue.

Select Keyboard Layout – Select ‘English (US)’, press Enter

Specify the hostname you wish to use– I’ve chosen to call this host ‘compiler’ –and press Enter to continue.

Enter system hostname – Enter ‘compiler’, press Enter

Specify the full name of the primary system user– for simplicity, I’ve chosen to name this user ‘hadoop’. Press Enter when ready to continue.

Enter Full Name of system User – Enter ‘hadoop’, press Enter

Specify the login name for the user previously specified– again, for simplicity, I’ve chosen ‘hadoop’. Press Enter to continue when ready.

Enter system user username – Enter ‘hadoop’, press Enter

Again, for simplicity, I’ve chosen ‘hadoop’ as the password as well. Press Enter to continue. If asked about a weak password, tell the system to accept it.

Enter system user password – Enter ‘hadoop’, press Enter

Do NOT encrypt the home directory. Press Enter to continue.

Home Directory Encryption – Select ‘No’, press Enter

Confirm your TimeZone– in my case, America/Denver –and press Enter to continue.

Clock Configuration – Confirm Timezone as ‘America/Denver’, Select ‘Yes’, press Enter

Select ‘Guided – use entire disk and set up LVM’ as the Partitioning Method. Press Enter to continue.

Choose Partitioning Method – Select ‘Guided – Use Entire Disk and Set Up LVM’, press Enter

Verify that the disk to be partitioned is correct. Confirm by pressing Enter to continue.

Confirm Disk OK to Partition – Press Enter if OK

Confirm the partitioning table for the specified device– Select ‘Yes’. Press Enter to continue.

Confirm OK to Proceed with Partitioning – Select ‘Yes’, press Enter

Allocate the entirety of the volume group to use for guided partitioning. Press Enter to continue.

Enter Amount of Volume Group to Allocate – Select the Entirety, press Enter

Confirm the Partitioning Layout– Select ‘Yes’. Press Enter to continue.

Confirm the Partition Layout – Select ‘Yes’, press Enter

Specify a proxy if you have or need one. Leave blank if none. Press Enter to continue.

Enter HTTP Proxy if Required – Leave Blank for None

Specify ‘No Automatic Updates’. Press Enter to continue.


Configure Automatic Upgrades – Select ‘No Automatic Updates’, press Enter

Do not select any additional software to install at this point. Any software to be installed will be installed after the system is up and running. Press Enter to continue.

Software Selection – Select NO additional Software, Tab to ‘Continue’, press Enter

Answer ‘Yes’ to installing the GRUB Boot Loader. Press Enter to continue.

Install the GRUB Boot Loader – Select ‘Yes’, press Enter

Press Enter to complete the installation and continue.

Finish the Installation – Press Enter to Continue

Login Using hadoop user

Login as hadoop User

The base system is now loaded and we’re ready to log in, install the necessary development tools, download the source code, and compile Hadoop.

Let’s get started.

The first thing to do upon login is to apply any available updates to the system.

Install Updates

Issue the following commands to download and install any updates that are available:

sudo apt-get update
sudo apt-get dist-upgrade

Install Java JDK 7

The Oracle JDK is the official JDK, but it is no longer packaged in Ubuntu’s default repositories. It can still be installed with apt-get by adding a third-party PPA. To install version 7, execute the following commands:

sudo apt-get install python-software-properties
sudo apt-add-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java7-installer
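After the installer finishes, it is worth confirming that JDK 7 is the Java actually found on the PATH. The sketch below parses the legacy-style version banner that `java -version` prints; the function name `java_major` is hypothetical, and the version string in the comment is only an example of the format, not necessarily what your system will report.

```shell
# Extract the major Java version from the first line of a `java -version` banner.
# Legacy banners look like: java version "1.7.0_51"  -> major version 7
java_major() {
    printf '%s\n' "$1" | sed -n '1s/.*version "1\.\([0-9][0-9]*\)\..*/\1/p'
}

# Example (java prints its banner on stderr, hence the redirect):
#   banner=$(java -version 2>&1)
#   [ "$(java_major "$banner")" = "7" ] && echo "JDK 7 is active"
```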

Install Development Tools

Several development tools need to be installed so that Hadoop can be compiled and packaged. The commands to install the needed software are as follows:

sudo apt-get install -y libssl-dev cmake
sudo apt-get install -y build-essential
sudo apt-get install -y maven
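Before moving on, a quick check that the tools landed on the PATH can save a failed build later. This is a small sketch, not part of any package; the function name `missing_tools` is my own, and the list of command names in the usage comment is an assumption about what the packages above provide.

```shell
# Print the name of every given command that is NOT found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example: the tools this tutorial relies on for the build.
#   missing_tools gcc g++ make cmake mvn java
```

An empty result means every listed command was found.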

Download and Build Protobuf v2.5.0

Protobuf is required to build and run Hadoop; however, the version included in Ubuntu’s repository is older than what is required. Protobuf will have to be downloaded and built from source so that we are using an appropriate version. The commands to do this are as follows:

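# Download and unpack the protobuf 2.5.0 source
curl -# -O https://protobuf.googlecode.com/files/protobuf-2.5.0.tar.gz
gunzip protobuf-2.5.0.tar.gz
tar -xvf protobuf-2.5.0.tar
cd protobuf-2.5.0
# Configure, compile as a regular user, then install as root
./configure --prefix=/usr
make
sudo make install
# Confirm which version of the compiler is now installed
protoc --version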

Assuming the output of the last command (protoc --version) confirms version 2.5.0, the Java bindings of protobuf for Hadoop can be built and packaged. This is done with the following commands:
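If you would rather script that version check than eyeball it, something like the following works. It is a sketch under the assumption that the banner has the form `libprotoc 2.5.0`; the function name `protoc_version_ok` is hypothetical.

```shell
# Succeed only if a `protoc --version` banner reports exactly the wanted version.
# The banner is expected to have the form: libprotoc 2.5.0
protoc_version_ok() {
    banner=$1
    wanted=$2
    [ "${banner#libprotoc }" = "$wanted" ]
}

# Example:
#   protoc_version_ok "$(protoc --version)" 2.5.0 || echo "wrong protoc on PATH"
```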

cd java
mvn install
mvn package

Download the Hadoop Source, Apply Patches and Build

The following commands will download the source code for Hadoop v2.2.0, apply a necessary patch, compile the source code and package the result into a gzipped tar archive.

Download and Extract the Source Code

wget http://apache.fastbull.org/hadoop/common/hadoop-2.2.0/hadoop-2.2.0-src.tar.gz
tar xzvf hadoop-2.2.0-src.tar.gz
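Mirror downloads occasionally arrive truncated, or as an HTML error page saved under the archive's name. A cautious variation is to test the gzip stream before extracting. The helper name `is_valid_gzip` below is my own; `gzip -t` is the standard integrity test.

```shell
# Succeed only if the file is a complete, readable gzip stream.
is_valid_gzip() {
    gzip -t "$1" 2>/dev/null
}

# Example: extract only when the download passes the integrity test.
#   is_valid_gzip hadoop-2.2.0-src.tar.gz && tar xzvf hadoop-2.2.0-src.tar.gz
```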

Download and Apply the Required Patch

cd hadoop-2.2.0-src
wget https://issues.apache.org/jira/secure/attachment/12614482/HADOOP-10110.patch
patch -p0 < HADOOP-10110.patch

Compile the Source and Package the Result

mvn package -Pdist,native -DskipTests -Dtar

At this point, Hadoop v2.2.0 has been compiled for 64-bit Ubuntu Linux 12.04.4 LTS Server. The gzipped tar archive is found in:
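One way to locate the archive without assuming a particular directory layout is to search the source tree for it. The sketch below lists gzipped Hadoop tarballs under a directory while skipping the source download itself; the function name `find_dist_tarball` is hypothetical. The search may return more than one archive if individual modules also produce tarballs, in which case the one named for the whole release is the full distribution.

```shell
# List gzipped Hadoop tarballs under a directory, skipping the source archive.
find_dist_tarball() {
    find "$1" -type f -name 'hadoop-*.tar.gz' ! -name '*-src*'
}

# Example, run from inside the hadoop-2.2.0-src directory after the build:
#   find_dist_tarball .
```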


Compile Complete; Part 2 – Install

This concludes Part 1 of this tutorial, as the compile is complete. It should be a relatively straightforward task to adapt what has been laid out here for other Linux distributions. Additionally, if you are not working with Hyper-V, the spirit of what I have detailed in this article will apply to VirtualBox and VMware.

In Part 2, I’ll use the gzipped tar archive we’ve created in this tutorial to install and configure Hadoop v2.2.0 and get us to the point of running the word count demo.

That, in turn, will have us set up for Part 3, where we’ll explore the Hadoop Streaming mechanism and write our own MapReduce programs, not in Java, but rather in Python.

List of Meta-Resources for Big Data

A “Cliff Notes” for Big Data

Following is a list of meta-resources identified by Dr. Kirk Borne in a blog post he wrote at Data Science Central. Each entry links through to the resource it identifies. I have also included a link to Dr. Borne’s blog post:

Dr. Kirk Borne writes at Data Science Central:

The flood of articles, webinars, and conferences related to Big Data is generating its own “infoglut”. Consequently, it is really helpful when you find resources that summarize many of the latest developments in one place – a sort of “Cliff Notes” for Big Data.  Here are six meta-resources that I have found useful, plus one additional collection that I authored:

Big Data Meta-Resources

Dr. Borne’s original blog post: Big Data – Seven Meta-Resources for Best Practices, Lessons Learned, Data Stories, Opportunities, and Insights – Data Science Central

Graphic for Big Data Buzzword Overload

This is the perfect warning to post when cliché has overcome content and it seems that the third word spoken by everyone results in a chit being played on your ‘Buzzword Bingo’ card.

Samuel L Jackson in Pulp Fiction, overcome by the buzzword, saying, "Say Big Data! Say Big Data one more time! I dare you! I DOUBLE dare you!"

Say Big Data! Say Big Data one more time! I dare you! I DOUBLE dare you!
