Getting Hands-On with Hadoop YARN
For a while now, I’ve wanted to write a tutorial about rolling up the sleeves and installing Hadoop from scratch: first as a single-node, stand-alone system, then perhaps expanding to a multi-node (probably three-node) configuration, with the end goal of showing how to write MapReduce programs in Python using the Hadoop Streaming mechanism. As I was going through the steps of installation and configuration, I ran into a hitch. It appears that the pre-compiled installation tarball for Hadoop v2.2.0 (the latest stable release) on the Apache website is compiled only for 32-bit Linux (specifically, its bundled native libraries appear to be 32-bit builds). On a 64-bit system it runs, but it throws enough errors to make things look and act wonky. A bit of Google research led me to the conclusion that the ‘right’ way to install Hadoop on a 64-bit system is to get deeper into the dirt and compile Hadoop from source code to build a proper 64-bit installation set.
Ubuntu 12.04.4 LTS 64-bit Server to Compile Hadoop v2.2.0
So, that’s what this post is going to be: a tutorial on setting up a 64-bit Ubuntu Linux system that will be used to compile the source for Hadoop v2.2.0 (the current stable release). My primary workstation runs Windows Server 2008 R2, so I’m going to use Hyper-V to create a virtual machine, load 64-bit Ubuntu 12.04.4 LTS Server into it, and use it for the compile and, later, for the single-node, stand-alone Hadoop installation.
New Hyper-V VM for Ubuntu
The first step is to open the Hyper-V Manager and create a new virtual machine: On the Hyper-V Manager Actions menu for your machine, select New and then select Virtual Machine as pictured here.
Click “Next” to bypass the ‘Before You Begin’ window.
On the next window, give the new virtual machine a name; I’m going to name it ‘compiler’. If you need to specify a location other than the default for the virtual machine files, this is the place to make that change. When ready, click “Next”.
The next window is where you enter the amount of memory to assign to the virtual machine. I’m going to go with 2GB (2048MB). This is pretty much the lowest I would go; you may be able to get away with 1GB (1024MB), but I would not recommend it. If you have more memory in your machine and are able to allocate more to the virtual machine, by all means, give it more! When ready, click “Next”.
‘Configure Networking’ is the next window to be presented. I’m going to select a previously configured network connection which is bridged to my home office network. You will need connectivity to the internet– it is not optional. Configure what is appropriate for your network. When ready, click ‘Next’ to continue.
On the ‘Connect Virtual Hard Disk’ window, I’m keeping the already selected option of ‘Create a virtual hard disk’ as well as the default Name and Location. I am changing the size to 12GB as that’s more than enough room for what needs to be done. When ready, click ‘Next’ to continue.
The window that follows is the ‘Installation Options’ window. This is where you specify the ISO image of the DVD from which the new virtual machine will boot. If you don’t already have it, you are going to need the ISO image for the 64-bit version of Ubuntu 12.04.4 LTS Server. The download page for 12.04 Ubuntu can be found HERE. A direct link to download the 12.04.4 64-bit Server ISO file can be found HERE.
Once you have the ISO file, on the ‘Installation Options’ Window, select the ‘Install an operating system from a boot CD/DVD-ROM’ option. Then select the ‘Image file (.iso)’ option and browse to the ISO file you’ve downloaded and select it. When you are ready, click ‘Next’.
Click ‘Finish’ when shown the ‘Completing the New Virtual Machine Wizard’ Window.
Select the newly configured virtual machine (remember, I named mine ‘compiler’) in the ‘Virtual Machines’ pane of the Hyper-V Manager Window and then select ‘Connect…’ in the lower right pane of the window.
This will bring up a console window for the ready-to-be-powered-on virtual machine.
Power-Up the New Virtual Machine
Click the green power icon in the toolbar and the virtual machine will be powered on and begin loading the operating system.
The first screen shown by Ubuntu upon loading is a language selection page. I’m going to select English and acknowledge by clicking the mouse in the window and pressing Enter.
On the next screen, ensure ‘Install Ubuntu Server’ is highlighted and, again, press Enter.
This is followed by another language selection screen, this one for the language used for the installation process. Select ‘English’ and press Enter.
Select your location (in my case, ‘United States’) and press Enter.
Do NOT auto-configure your keyboard; instead, we’ll explicitly tell the system what keyboard is in use. Press Enter to continue.
Select the Keyboard’s language / country combo (in my case, US English) and press Enter to continue.
Select the layout of the keyboard (QWERTY, DVORAK, etc.) and press Enter to continue.
Specify the hostname you wish to use– I’ve chosen to call this host ‘compiler’ –and press Enter to continue.
Specify the full name of the primary system user– for simplicity, I’ve chosen to name this user ‘hadoop’. Press Enter when ready to continue.
Specify the login name for the user previously specified– again, for simplicity, I’ve chosen ‘hadoop’. Press Enter to continue when ready.
Again, for simplicity, I’ve chosen ‘hadoop’ as the password as well. Press Enter to continue. If asked about a weak password, tell the system to accept it.
Do NOT encrypt the home directory. Press Enter to continue.
Confirm your TimeZone– in my case, America/Denver –and press Enter to continue.
Select ‘Guided – use entire disk and set up LVM’ as the Partitioning Method. Press Enter to continue.
Verify that the disk to be partitioned is correct. Confirm by pressing Enter to continue.
Confirm the partitioning table for the specified device– Select ‘Yes’. Press Enter to continue.
Allocate the entirety of the volume group to use for guided partitioning. Press Enter to continue.
Confirm the Partitioning Layout– Select ‘Yes’. Press Enter to continue.
Specify a proxy if you have or need one. Leave blank if none. Press Enter to continue.
Specify ‘No Automatic Updates’. Press Enter to continue.
Do not select any additional software to install at this point. Any software to be installed will be installed after the system is up and running. Press Enter to continue.
Answer ‘Yes’ to installing the Grub Boot Loader. Press Enter to continue.
Press Enter to complete the installation and continue.
Log In as the hadoop User
The base system is now loaded and we’re ready to log in, install the necessary development tools, download the source code and compile Hadoop.
Let’s get started.
The first thing to do upon login is to apply any available updates to the system.
Issue the following commands to download and install any updates that are available:
sudo apt-get update
sudo apt-get dist-upgrade
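Since the whole point of this exercise is a 64-bit build, it may be worth confirming that the VM really is running a 64-bit kernel before installing anything else. The following is only a small sketch; the function name is_64bit_x86 is my own invention for illustration, and it assumes an x86 machine like the one used in this tutorial.

```shell
# Succeed if the given machine name (as printed by `uname -m`)
# denotes 64-bit x86. The function name is hypothetical.
is_64bit_x86() {
    case "$1" in
        x86_64|amd64) return 0 ;;
        *) return 1 ;;
    esac
}

# Report on the machine we are actually running on.
if is_64bit_x86 "$(uname -m)"; then
    echo "64-bit x86 kernel detected"
else
    echo "not a 64-bit x86 kernel: $(uname -m)"
fi
```

If the second message appears, the 32-bit ISO was probably used by mistake and the rest of the steps would not produce a 64-bit build.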
Install Java JDK 7
The Oracle JDK is the official JDK; however, it is no longer provided by Oracle as a default installation for Ubuntu. It can, however, still be installed with apt-get from a third-party PPA. To install version 7, execute the following commands:
sudo apt-get install python-software-properties
sudo apt-add-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java7-installer
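After the installer finishes, a quick check that the JDK on the PATH really is version 7 can save some head-scratching later. This is a minimal sketch: java_major_version is a hypothetical helper name, and it assumes the older "1.x" version string format (for example, java version "1.7.0_55") that Java 7 prints.

```shell
# Extract the major version from the first line of `java -version`
# output, e.g. 'java version "1.7.0_55"' gives 7. Only the "1.x"
# numbering scheme is handled; anything else prints nothing.
java_major_version() {
    printf '%s\n' "$1" | sed -n 's/.*version "1\.\([0-9][0-9]*\).*/\1/p'
}

if command -v java >/dev/null 2>&1; then
    # `java -version` writes to stderr, so redirect it before parsing.
    first_line=$(java -version 2>&1 | head -n 1)
    echo "Java major version: $(java_major_version "$first_line")"
else
    echo "java is not on the PATH"
fi
```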
Install Development Tools
Several development tools need to be installed so that Hadoop can be compiled and packaged. The commands to install the needed software are as follows:
sudo apt-get install -y libssl-dev cmake
sudo apt-get install -y build-essential
sudo apt-get install -y maven
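Before moving on, it can help to confirm that the tools these packages provide are actually on the PATH. The sketch below is only illustrative: missing_tools is a made-up helper name, and the list of commands (gcc, g++, make, cmake, mvn) is my assumption about what the packages above install.

```shell
# Print the name of every listed command that is not on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Assumed command names for the packages installed above.
missing=$(missing_tools gcc g++ make cmake mvn)
if [ -n "$missing" ]; then
    # Left unquoted on purpose so the names print on one line.
    echo "Still missing:" $missing
else
    echo "All build tools found"
fi
```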
Download and Build Protobuf v2.5.0
Protobuf is required to build and run Hadoop; however, the version that is included in Ubuntu’s repository is older than what is required. Protobuf will have to be downloaded and built from source code so that we are using an appropriate version. The commands to do this are as follows:
curl -# -O https://protobuf.googlecode.com/files/protobuf-2.5.0.tar.gz
gunzip protobuf-2.5.0.tar.gz
tar -xvf protobuf-2.5.0.tar
cd protobuf-2.5.0
./configure --prefix=/usr
make
sudo make install
protoc --version
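To make the version check less of an eyeball exercise, the output of protoc --version can be compared against the expected string. This is a sketch only; protoc_is_2_5_0 is a hypothetical helper name, and it relies on protoc printing a single line of the form "libprotoc 2.5.0".

```shell
# Succeed only if the given `protoc --version` output reports 2.5.0.
protoc_is_2_5_0() {
    [ "$1" = "libprotoc 2.5.0" ]
}

if command -v protoc >/dev/null 2>&1; then
    if protoc_is_2_5_0 "$(protoc --version)"; then
        echo "protoc 2.5.0 found"
    else
        echo "unexpected protoc version: $(protoc --version)"
    fi
else
    echo "protoc is not on the PATH"
fi
```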
Assuming the output of the last command (protoc --version) is appropriate and correct (that it is version 2.5.0), the Java bindings of protobuf for Hadoop can be built and packaged. This is done with the following commands:
cd java
mvn install
mvn package
Download the Hadoop Source, Apply Patches and Build
The following commands will download the source code for Hadoop v2.2.0, apply a necessary patch, compile the source code and package the result into a gzipped tar archive.
Download and Extract the Source Code
cd
wget http://apache.fastbull.org/hadoop/common/hadoop-2.2.0/hadoop-2.2.0-src.tar.gz
tar xzvf hadoop-2.2.0-src.tar.gz
Download and Apply the Required Patch
cd hadoop-2.2.0-src
wget https://issues.apache.org/jira/secure/attachment/12614482/HADOOP-10110.patch
patch -p0 < HADOOP-10110.patch
Compile the Source and Package the Result
mvn package -Pdist,native -DskipTests -Dtar
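Once the build finishes, one way to sanity-check that a native library really came out 64-bit is to look at the class byte in its ELF header. The sketch below is an illustration under assumptions: elf_bits is a name I made up, and the path in the example comment is a placeholder, not the actual location of the library in the build output.

```shell
# Print 32 or 64 according to the ELF class byte of a file, or
# "unknown" if the file does not look like an ELF object.
elf_bits() {
    # Read the first five bytes as unsigned decimal values.
    bytes=$(od -A n -t u1 -N 5 "$1" 2>/dev/null) || bytes=""
    # Word-split the values into $1..$5 (unquoted on purpose).
    set -- $bytes
    # Bytes 0-3 are the magic number 0x7f 'E' 'L' 'F'; byte 4
    # (EI_CLASS) is 1 for 32-bit objects and 2 for 64-bit objects.
    if [ "$#" -eq 5 ] && [ "$1 $2 $3 $4" = "127 69 76 70" ]; then
        case "$5" in
            1) echo 32 ;;
            2) echo 64 ;;
            *) echo unknown ;;
        esac
    else
        echo unknown
    fi
}

# Example (placeholder path): elf_bits /path/to/libhadoop.so
```

Running it against the library from the pre-compiled Apache tarball and against the freshly built one should make the difference between the two obvious.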
At this point, Hadoop v2.2.0 has been compiled for 64-bit Ubuntu Linux 12.04.4 LTS Server. The gzipped tar archive is found in:
Compile Complete; Part 2 – Install
This concludes Part 1 of this tutorial as the compile is complete. It should be a relatively straightforward task to adapt what has been laid out here for other Linux distributions. Additionally, if you are not working with Hyper-V, the spirit of what I have detailed in this article will apply to VirtualBox and VMware.
In Part 2, I’ll use the gzipped tar archive we’ve created in this tutorial to install and configure Hadoop v2.2.0 and get us to the point of running the word count demo.
That, in turn, will have us set up for Part 3, where we’ll explore the Hadoop Streaming mechanism and write our own MapReduce programs, not in Java, but rather in Python.