CSIF Hadoop Setup Instructions

by Ken Gribble

This document may be found at: 

http://goo.gl/eCqOq

Introduction and Requirements

Set your Hadoop related environmental variables.

Add Hadoop environmental variables

Test that Hadoop is in your path

Run the CSIF hadoop-config.pl program

Choose a directory name for your hadoop install

Choose a good password for use with your new SSH keys

Run hadoop-config.pl in the directory where you want the configuration directory to be installed (usually your home directory)

Prepare SSH and ssh-agent and Test SSH keys

Start ssh-agent by typing these two commands.

Test your SSH key in the CSIF

Starting up Hadoop

First, format a new Distributed Filesystem (DFS)

Start all Hadoop daemons with the "start-dfs.sh" command.

Check that the web interfaces for Hadoop are up

Click on the NameNode link.

Test with an Example

Copy the input files needed into the DFS

Test Hadoop with an example jar

Look at your output files:

Show the files locally then view

Look at the Web Interface

Stopping Hadoop

Running Hadoop as a Fully Distributed Cluster

Change localhost to PC hostname

Find your host name with this command

Find IP of host

Edit conf files

Add some slave nodes

Rebuild your DFS

Remove your old DFS.

Reformat your DFS

You are ready to run

Troubleshooting

NameNode DFS errors

Rebuilding

More Information

SSH

Hadoop single node setup and example

Introduction and Requirements

Follow this step-by-step tutorial to set up Hadoop for running experiments and assignments in the Computer Science Instructional Facility (CSIF). Much of the software mentioned here is exclusive to the CSIF, so to mimic this setup in another location, please use the links at the bottom of the page to find further information.

NOTE: The CSIF machines reboot every night. Any Hadoop daemons you leave running will die during reboot. Any DFS data, which is stored in /tmp, will be deleted during reboot.

Set your Hadoop related environmental variables.

Hadoop uses a few environmental variables, and to make it easier to run, hadoop should be in your PATH.

Add Hadoop environmental variables

Edit your ~/.cshrc file and add or change these lines

setenv HADOOP_PREFIX /home/software/hadoop

setenv HADOOP_CONF_DIR ~/hadoop/etc/hadoop

setenv JAVA_HOME /usr/java/latest

set path=( $JAVA_HOME $path )

set path=( /home/software/hadoop/bin /home/software/hadoop/sbin $path )
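
If your login shell is bash rather than tcsh, the equivalent additions to ~/.bashrc would look like the lines below. This bash variant is an assumption for users not on the tcsh setup shown above; the paths are the same ones used in this document.

# bash equivalent of the tcsh lines above (assumes your login shell is bash)

export HADOOP_PREFIX=/home/software/hadoop

export HADOOP_CONF_DIR=~/hadoop/etc/hadoop

export JAVA_HOME=/usr/java/latest

export PATH=$JAVA_HOME:/home/software/hadoop/bin:/home/software/hadoop/sbin:$PATH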

Remember that you will need to exec your shell ($ exec tcsh) or log in again for the environmental variables to be added to the shell. After logging back in, type "env" to see your environmental variables. There will be a lot of output, but it should show your new PATH directories and new variables:

$ env

...

PATH=/usr/java/latest:/home/software/hadoop/bin:/home/software/hadoop/sbin:...

...

HADOOP_PREFIX=/home/software/hadoop

HADOOP_CONF_DIR=/home/gribble/hadoop/etc/hadoop

JAVA_HOME=/usr/java/latest

Test that Hadoop is in your path

Run this command

$ hadoop

If hadoop is in your path, it will output the usage documentation for hadoop. If it says "hadoop: Command not found", hadoop isn't in your path yet.
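
A quick way to see where the shell found hadoop (or to confirm that it did not) is the standard which command:

$ which hadoop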

Run the CSIF hadoop-config.pl program

After setting up your environmental variables, you can set up Hadoop with the CSIF hadoop-config.pl program. Follow these steps:

Choose a directory name for your hadoop install

The suggested name for your hadoop directory is "hadoop".

Choose a good password for use with your new SSH keys

DO NOT use a blank password. Use a good passphrase.

Run hadoop-config.pl in the directory where you want the configuration directory to be installed (usually your home directory)

$ cd

$ /home/software/bin/hadoop-config.pl hadoop

When it says this, type in your good passphrase.

creating an SSH key for you, PLEASE USE A GOOD PASSWORD

Generating public/private dsa key pair.

Enter passphrase (empty for no passphrase):

After entering your ssh key passphrase you should see the key is generated, and messages will show you the key's location, fingerprint, randomart and so on.

Prepare SSH and ssh-agent and Test SSH keys

These next steps will prepare your SSH configuration for use with distributed Hadoop.

Start ssh-agent by typing these two commands.

Each time you log in to run hadoop, you will need to start ssh-agent to manage your SSH key passphrase.

$ ssh-agent $SHELL

$ ssh-add ~/.ssh/id_dsa_hadoop

Enter the passphrase you created above when it says this.

Enter passphrase for /home/youraccount/.ssh/id_dsa_hadoop:

If that goes well, your passphrase for that new SSH key should be cached and you won't have to enter the passphrase to use SSH to localhost or other CSIF machines.
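
To confirm that the key really is loaded into the agent, you can list the cached keys (ssh-add -l is a standard OpenSSH option); your new id_dsa_hadoop key should appear in the output:

$ ssh-add -l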

Test your SSH key in the CSIF

First, try using ssh to execute a command on the machine you are logged into, via localhost, with this command.

$ ssh localhost echo it works

The first time you SSH to any machine, you might see something like this. Type yes and hit enter to authorize the new host key for localhost:

The authenticity of host 'localhost (127.0.0.1)' can't be established.

RSA key fingerprint is ...

Are you sure you want to continue connecting (yes/no)?

NOTE: If you try to ssh into "localhost" and you get an error message telling you WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED, you probably need to delete the "localhost" line in your ~/.ssh/known_hosts file. This can be avoided by using the same CSIF machine each time you run hadoop.
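
If you do hit that warning, one way to remove the stale entry without editing ~/.ssh/known_hosts by hand is the standard ssh-keygen -R option (shown here as an optional alternative):

$ ssh-keygen -R localhost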

NOTE: If you plan on running Hadoop in a fully distributed way, make sure to SSH into all the CSIF machines you plan to make into Hadoop slave nodes. This way the authenticity of those hosts is recorded and the "Are you sure you want to continue connecting (yes/no)?" question won't pop up while Hadoop is running.
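
For example, if you were planning to use pc53 and pc55 as slave nodes (hypothetical choices; pick machines that are actually up), you could record their host keys ahead of time like this:

$ ssh pc53 echo ok

$ ssh pc55 echo ok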

Starting up Hadoop

To start hadoop and test your installation, run these commands.

First, format a new Distributed Filesystem (DFS)

$ hdfs namenode -format

Start all Hadoop daemons with the "start-dfs.sh" command.

$ start-dfs.sh

You can check your Hadoop logs in the "logs" directory in your hadoop configuration files directory (~/hadoop/logs, for this set of examples).
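
For example, to watch the NameNode log while it starts up, you could tail it. The exact file name includes your account name and the machine's hostname, following Hadoop's usual hadoop-<account>-namenode-<host>.log naming; adjust the pattern if your logs are named differently.

$ tail -f ~/hadoop/logs/hadoop-*-namenode-*.log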

Check that the web interfaces for Hadoop are up

In your hadoop configuration files directory there is a README.html file. Open it with firefox or another browser.

$ firefox ~/hadoop/README.html

There are some help links, such as one that comes back to this page. There are also two links to the Hadoop web interfaces, each using your unique port numbers.

Click on the NameNode link.

Before you proceed with the next steps, the number of Live Nodes must be 1 or higher. It may take a few minutes for this to happen. Reload the web page until you see the Live Nodes come up. You should be able to browse the filesystem (see the link on the NameNode page) without getting errors.
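
If you prefer the command line, the same information is available from the standard HDFS admin report, which lists the live DataNodes:

$ hdfs dfsadmin -report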

Test with an Example

Try this example to ensure hadoop is configured and running correctly.

Copy the input files needed into the DFS

NOTE: The NameNode web interface should show at least 1 "Live Nodes" before this step.

First change directory into your hadoop configuration files directory, then run the hdfs dfs commands below to copy the input files from the etc/hadoop directory. Use your CSIF account name where it says accountname.

cd ~/hadoop

hdfs dfs -mkdir -p /user/accountname/input

hdfs dfs -put etc/hadoop/* /user/accountname/input
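
To double-check that the files landed in the DFS, you can list the input directory (again using your own account name):

hdfs dfs -ls /user/accountname/input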

Test Hadoop with an example jar

Run this command to test Hadoop. You can check the JobTracker web interface to watch it run, get data on how it ran, check error messages in the logs, and more (see your README.html file for the link to JobTracker).

hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar grep input output 'dfs[a-z.]+'

Look at your output files:

Show the files locally then view

One way to view the output files stored in the DFS is to use hdfs dfs -cat

$ hdfs dfs -cat output/part-r-00000
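
If you would rather copy the whole output directory out of the DFS into your home directory first and then view it with normal tools, hdfs dfs -get does that (the ~/hadoop-output destination below is just an example name):

$ hdfs dfs -get output ~/hadoop-output

$ cat ~/hadoop-output/part-r-00000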

Look at the Web Interface

You can also look at the files via your NameNode web interface

See the README.html file in firefox.

Stopping Hadoop

When you are done using Hadoop in the CSIF you should stop all of its processes. To stop hadoop, issue the "stop-dfs.sh" command.

$ stop-dfs.sh

Running Hadoop as a Fully Distributed Cluster

To run Hadoop more like a cluster, follow these steps. Make sure hadoop is stopped with the "stop-dfs.sh" command first!

Change localhost to PC hostname

Change localhost to the hostname of the machine you are logged into in various hadoop conf files.

Find your host name with this command

$ hostname

Find IP of host

$ host hostname

(Replace hostname with what you got from the above command)

Edit conf files

In the etc/hadoop directory of your hadoop configuration files directory, change these files:

In etc/hadoop/core-site.xml, change the hostname "localhost" to the PC's IP address. So if hostname told you "PC50" and host told you "128.120.211.101", the new line should be similar to this (your port number will be different)

<value>hdfs://128.120.211.101:10000</value>

In etc/hadoop/mapred-site.xml, do the same thing as in core-site.xml. The changed line should be similar to this (your port number will be different)

<value>128.120.211.101:20000</value>

Do the same for the etc/hadoop/masters file; localhost becomes a single line

128.120.211.101

Add some slave nodes

In the etc/hadoop/slaves file, add a list of host names of a handful of machines that are running the same OS (32 bit or 64 bit, see above). Make sure you can SSH from your master machine to your slave machines using ssh-agent, without needing a password (see above). If you can't SSH to a few of the machines, choose different ones, as those may be down for maintenance.

In this example we tried to use pc51 through pc55, but some systems were down, so we didn’t list them and added some more. The file would look like this (note that localhost has been removed!)

pc53

pc55

pc56

pc57

pc58

If you put your server (the master machine) on this list, make sure you use its IP address and not its hostname; all the other slaves work with just their hostnames.

Rebuild your DFS

Reformat your DFS after removing it.

Remove your old DFS.

NOTE: THIS WILL DESTROY ANY DATA ON YOUR DFS, SO BACK IT UP IF YOU WANT YOUR DATA SAVED

Run these commands on any masters or slaves you have already used to run hadoop

rm -rf /tmp/hadoop-accountname

rm -rf /tmp/hsperfdata_accountname

rm -rf /tmp/Jetty*

Reformat your DFS

hdfs namenode -format

You are ready to run

start-dfs.sh

Now you can run the example above to test it. Don't forget to run stop-dfs.sh when you are done!

Troubleshooting

NameNode DFS errors

If a DFS starts producing errors, you might need to rebuild it. It is suggested that you issue a "stop-dfs.sh" to stop the Hadoop daemons, then remove all the files associated with Hadoop in /tmp. This command has worked (use your own account name instead of accountname). Execute the rm command on all slave nodes if you are running distributed.

rm -rf /tmp/hadoop-accountname*

Format the new DFS with the "hdfs namenode -format" command. Then run "start-dfs.sh" to start all the daemons again. Don't forget to "hdfs dfs -put" your data back!

Rebuilding

Sometimes things go horribly wrong and you need to rebuild your Hadoop installation. Issue a "stop-dfs.sh", back up your work and data if needed, then remove or move your hadoop configuration files directory ("hadoop" in this example). Then follow these instructions again from the step above called: Run the CSIF hadoop-config.pl program.

More Information

Where to get more information

SSH

http://www.openssh.org/manual.html

Hadoop single node setup and example

http://hadoop.apache.org/common/docs/current/single_node_setup.html

This document may be found at: 

http://goo.gl/eCqOq