CSIF Hadoop Setup Instructions
by Ken Gribble
Follow this step-by-step tutorial to set up Hadoop for running experiments and assignments in the Computer Science Instructional Facility (CSIF). Much of the software mentioned here is exclusive to the CSIF, so to mimic this setup in another location, use the links at the bottom of the page to find further information.
NOTE: The CSIF machines reboot every night. Any Hadoop daemons you leave running will die during reboot. Any DFS data, which is stored in /tmp, will be deleted during reboot.
Hadoop uses a few environment variables, and to make it easier to run, hadoop should be in your PATH.
Edit your ~/.cshrc file and add or change these lines:
setenv HADOOP_PREFIX /home/software/hadoop
setenv HADOOP_CONF_DIR ~/hadoop/etc/hadoop
setenv JAVA_HOME /usr/java/latest
set path=( $JAVA_HOME/bin $path )
set path=( /home/software/hadoop/bin /home/software/hadoop/sbin $path )
Remember that you need to restart your shell ($ exec tcsh) or log in again before the new environment variables take effect. Afterwards, type "env" to see your environment variables. There will be a lot of output, but it should include your new PATH directories and the new variables.
Run the hadoop command with no arguments:
$ hadoop
It will output the usage documentation for hadoop if hadoop is in your path. If it says "hadoop: Command not found" then hadoop isn't yet in your path.
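The PATH check described above can also be scripted together with a check of the variables. The following is a minimal sketch in POSIX sh rather than tcsh, so it is meant to be saved to a file and run with sh. The function names on_path, var_is_set and check_hadoop_env are hypothetical and are not part of the CSIF tools.

```shell
#!/bin/sh
# Sketch: verify the environment described in this tutorial.
# on_path, var_is_set and check_hadoop_env are hypothetical names.

# Succeeds when the named command can be found through PATH.
on_path() {
    command -v "$1" >/dev/null 2>&1
}

# Succeeds when the named variable is set and non-empty.
var_is_set() {
    eval "[ -n \"\${$1:-}\" ]"
}

# Prints one line per problem and returns 1 if anything is missing.
check_hadoop_env() {
    rc=0
    on_path hadoop || { echo "hadoop is not in your PATH"; rc=1; }
    for name in HADOOP_PREFIX HADOOP_CONF_DIR JAVA_HOME; do
        var_is_set "$name" || { echo "$name is not set"; rc=1; }
    done
    return "$rc"
}
```

To use it, add a line containing check_hadoop_env at the end of the file. No output means the command and all three variables were found.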
After setting up your environmental variables, you can set up Hadoop. The hadoop-config program does these things:
The suggested name for your hadoop directory is "hadoop".
DO NOT use a blank password. Use a good passphrase.
$ /home/software/bin/hadoop-config.pl hadoop
When it says this, type in your good passphrase.
creating an SSH key for you, PLEASE USE A GOOD PASSWORD
Generating public/private dsa key pair.
Enter passphrase (empty for no passphrase):
After entering your ssh key passphrase you should see the key is generated, and messages will show you the key's location, fingerprint, randomart and so on.
These next steps will prepare your SSH configuration for use with distributing Hadoop.
Each time you log in to run hadoop, you will need to start ssh-agent to manage your passphrase.
$ ssh-agent $SHELL
$ ssh-add ~/.ssh/id_dsa_hadoop
Enter the passphrase you created above when it says this.
Enter passphrase for /home/youraccount/.ssh/id_dsa_hadoop:
If that goes well, your passphrase for that new SSH key should be cached and you won't have to enter the passphrase to use SSH to localhost or other CSIF machines.
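If you are unsure whether the key is cached, "ssh-add -l" lists the keys the agent holds, and its exit status tells you what went wrong. The sketch below, in POSIX sh rather than tcsh, turns that status into a message. The function names are hypothetical, and the meaning of the three status values follows the ssh-add manual page.

```shell
#!/bin/sh
# Sketch: report whether ssh-agent is running and holds a key.
# describe_agent_status and check_agent are hypothetical names.

# Translate the exit status of "ssh-add -l" into a message.
describe_agent_status() {
    case "$1" in
        0) echo "agent is running and has at least one key loaded" ;;
        1) echo "agent is running but has no keys, run ssh-add" ;;
        2) echo "no agent reachable, start ssh-agent first" ;;
        *) echo "unexpected ssh-add exit status: $1" ;;
    esac
}

# List the cached keys quietly and explain the result.
check_agent() {
    rc=0
    ssh-add -l >/dev/null 2>&1 || rc=$?
    describe_agent_status "$rc"
}
```

Calling check_agent from the shell started by ssh-agent should report that a key is loaded once the ssh-add step has succeeded.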
First, use ssh to execute a command on the machine you are on by connecting to localhost:
$ ssh localhost echo it works
The first time you SSH to any machine, you might see something like the following. Type yes and press Enter to accept the host key for localhost:
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is ...
Are you sure you want to continue connecting (yes/no)?
NOTE: If you try to ssh into "localhost" and get an error message that says WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED, you probably need to delete the "localhost" line in your ~/.ssh/known_hosts file. You can avoid this by using the same CSIF machine each time you run hadoop.
NOTE: If you plan on running Hadoop in a fully distributed way, make sure to SSH into every CSIF machine you plan to use as a Hadoop slave node first. That way the authenticity of those hosts is recorded and the "Are you sure you want to continue connecting (yes/no)?" question will not appear while Hadoop is running.
To start hadoop and test your installation, format the DFS and then start the daemons with these commands:
$ hdfs namenode -format
$ start-all.sh
You can check your Hadoop logs in the "logs" directory in your hadoop configuration files directory (~/hadoop/logs, for this set of examples).
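Rather than opening each log by hand, you can list the log files that contain error entries. This is a minimal sketch in POSIX sh rather than tcsh. It assumes the daemon logs end in .log and mark their severity with the words ERROR or FATAL surrounded by spaces. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch: list log files that contain ERROR or FATAL entries.
# logs_with_errors is a hypothetical name.

# $1 = log directory (defaults to ~/hadoop/logs, as in this tutorial).
logs_with_errors() {
    dir=${1:-$HOME/hadoop/logs}
    # -l prints only the names of files with at least one match.
    grep -lE ' (ERROR|FATAL) ' "$dir"/*.log 2>/dev/null || true
}
```

Running logs_with_errors with no argument checks ~/hadoop/logs. An empty result means no matching entries were found, not that the daemons are necessarily healthy.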
In your hadoop configuration files directory there is a README.html file. Open it with firefox or another browser.
$ firefox ~/hadoop/README.html
It contains some help links, such as one that comes back to this page, and two links to the Hadoop web interfaces, which use your unique ports.
Before you proceed with the next steps, the Live Nodes count on the NameNode page must be 1 or higher. It may take a few minutes for this to happen. Reload the web page until you see the Live Nodes come up. You should be able to browse the filesystem (see the link on the NameNode page) without getting errors.
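If you prefer the command line to reloading the web page, the same wait can be sketched as a polling loop around "hdfs dfsadmin -report". This is a sketch in POSIX sh rather than tcsh, and it rests on an assumption: that the report contains a line of the form "Live datanodes (N):". If your Hadoop version prints something else, the pattern needs adjusting. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch: wait until the NameNode reports at least one live node.
# live_nodes_from_report and wait_for_live_node are hypothetical names.

# Read a dfsadmin report on standard input and print the number from
# the first "Live datanodes (N):" line (assumed format), if any.
live_nodes_from_report() {
    sed -n 's/^Live datanodes (\([0-9][0-9]*\)):.*/\1/p' | head -n 1
}

# Poll every 10 seconds, up to 30 times (about 5 minutes).
wait_for_live_node() {
    tries=0
    while [ "$tries" -lt 30 ]; do
        count=$(hdfs dfsadmin -report 2>/dev/null | live_nodes_from_report)
        if [ "${count:-0}" -ge 1 ]; then
            echo "Live nodes: $count"
            return 0
        fi
        tries=$((tries + 1))
        sleep 10
    done
    echo "no live nodes found, check the logs"
    return 1
}
```

The web interface remains the reference. This loop only saves you from reloading the page by hand.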
Try this example to ensure hadoop is configured and running correctly.
NOTE: The NameNode web interface should show at least 1 "Live Nodes" before this step.
First change directory into your hadoop configuration files directory, then run the hdfs dfs commands below to copy the input files from the etc/hadoop directory into the DFS. Use your CSIF account name where it says accountname.
$ hdfs dfs -mkdir -p /user/accountname/input
$ hdfs dfs -put etc/hadoop/* /user/accountname/input
Run this command to test Hadoop. You can check the JobTracker web interface to watch it run, get data on how it ran, check error messages in the logs, and more (see your README.html file for the link to JobTracker).
$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar grep input output 'dfs[a-z.]+'
One way to view the output files stored in the DFS is to use hdfs dfs -cat:
$ hdfs dfs -cat output/part-r-00000
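To sanity-check that output, you can approximate what the grep example computes on the local copies of the same files: it counts how often each string matching the regular expression occurs. The sketch below does this in POSIX sh rather than tcsh. It is a local approximation, not the MapReduce job itself, and the order of entries with equal counts may differ from Hadoop's. The function name is hypothetical, and grep -o is a common extension rather than strict POSIX.

```shell
#!/bin/sh
# Sketch: a local approximation of the grep example's result.
# count_matches is a hypothetical name.

# Read text on standard input, extract every match of the extended
# regular expression in $1, and print "count<TAB>match" lines with
# the most frequent match first. Assumes matches contain no
# whitespace, which holds for 'dfs[a-z.]+'.
count_matches() {
    grep -oE "$1" | sort | uniq -c | sort -k1,1nr -k2 |
        awk '{ printf "%s\t%s\n", $1, $2 }'
}
```

From your hadoop configuration files directory, cat etc/hadoop/* | count_matches 'dfs[a-z.]+' should give counts comparable to the contents of output/part-r-00000.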
You can also look at the files via your NameNode web interface. The link is in your README.html file, which you can open in firefox.
When you are done using Hadoop in the CSIF you should stop all of its processes. To stop hadoop, issue the "stop-all.sh" command.
To run Hadoop more like a cluster, follow these steps. Make sure hadoop is stopped with the "stop-dfs.sh" command first!
Change localhost to the hostname of the machine you are logged into in the various hadoop conf files. First find the hostname, then look up its IP address:
$ hostname
$ host hostname
(Replace hostname with what you got from the first command)
In the etc/hadoop directory of your hadoop configuration files directory, change these files:
In etc/hadoop/core-site.xml, change the hostname "localhost" to the PC's IP address. For example, if hostname told you "PC50" and the host lookup told you "18.104.22.168", the changed line should contain that address in place of localhost (your port number will be different).
In etc/hadoop/mapred-site.xml, do the same thing as in core-site.xml. The changed line should again contain the address in place of localhost (your port number will be different).
Do the same for the etc/hadoop/masters file, where the localhost line becomes a single line with the new address.
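The replacement of localhost can be done in an editor, or scripted. The following is a minimal sketch in POSIX sh rather than tcsh. The function name is hypothetical, and the address used in the usage note is a documentation placeholder, not a real CSIF address. The sketch replaces every occurrence of localhost in the file and keeps a backup, so review the result before starting Hadoop.

```shell
#!/bin/sh
# Sketch: replace "localhost" with an address in one config file.
# replace_localhost is a hypothetical name.

# $1 = file to edit, $2 = replacement, for example your machine's IP.
# The original is kept next to the file with a .bak suffix.
# The replacement must not contain "/" or "&", which sed treats
# specially.
replace_localhost() {
    cp "$1" "$1.bak" || return 1
    sed "s/localhost/$2/g" "$1.bak" > "$1"
}
```

For example, replace_localhost "$HADOOP_CONF_DIR/core-site.xml" 192.0.2.10 edits core-site.xml and leaves the original in core-site.xml.bak. Repeat the call for mapred-site.xml and masters with your own address.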
In the etc/hadoop/slaves file, add a list of host names of a handful of machines that are running the same OS (32 bit or 64 bit, see above). Make sure you can SSH to your slave machines from your master machine using ssh-agent, without needing a password (see above). If you can't SSH to a few machines, choose different machines, as those may be down for maintenance.
In this example we tried to use pc51 through pc55, but some systems were down, so we left them out and added some others instead. The resulting file lists one host per line (note that localhost has been removed!)
If you put your master machine on this list, make sure you use its IP address and not its hostname. All the other slaves work with just their hostname.
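Before starting Hadoop, it can help to test every host in the slaves file in one pass. The sketch below, in POSIX sh rather than tcsh, prints the hosts that cannot run a command over SSH without prompting. That covers machines that are down, a key that is not cached in ssh-agent, and hosts whose authenticity question has not been answered yet. The names SSH_CMD and unreachable_hosts are hypothetical. BatchMode and ConnectTimeout are standard OpenSSH options.

```shell
#!/bin/sh
# Sketch: find hosts in a slaves file that still need attention.
# SSH_CMD and unreachable_hosts are hypothetical names.

# Command used to reach a host. BatchMode makes ssh fail instead of
# prompting. Override SSH_CMD to test the function without a network.
SSH_CMD="${SSH_CMD:-ssh -o BatchMode=yes -o ConnectTimeout=5}"

# Print every host listed in file $1 that cannot run a command over
# SSH without prompting. Blank lines are skipped.
unreachable_hosts() {
    while read -r host; do
        [ -n "$host" ] || continue
        # </dev/null keeps ssh from consuming the rest of the list.
        $SSH_CMD "$host" true </dev/null >/dev/null 2>&1 || echo "$host"
    done < "$1"
}
```

For example, unreachable_hosts "$HADOOP_CONF_DIR/slaves" prints nothing when every listed host is ready. For any host it prints, log in once by hand or replace it with another machine.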
Reformat your DFS after removing it.
NOTE: THIS WILL DESTROY ANY DATA ON YOUR DFS, SO BACK IT UP IF YOU WANT YOUR DATA SAVED
Run these commands on any masters or slaves you have already used to run hadoop (use your own account name instead of accountname):
$ rm -rf /tmp/hadoop-accountname
$ rm -rf /tmp/hsperfdata_accountname
$ rm -rf /tmp/Jetty*
$ hdfs namenode -format
Now you can run the example above to test it. Don't forget to run stop-all.sh when you are done!
If a DFS starts producing errors, you might need to rebuild it. First issue a "stop-all.sh" to stop the Hadoop daemons, then remove all the files associated with Hadoop in /tmp. This command has worked (use your own account name instead of accountname). Execute the rm command on all slave nodes if you are running distributed.
$ rm -rf /tmp/hadoop-accountname*
Format the new DFS with the "hdfs namenode -format" command. Then run "start-all.sh" to start all daemons again. Don't forget to "hdfs dfs -put" your data back!
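Because rm -rf cannot be undone, you may want to list what the rm commands in this tutorial would match before running them. This is a minimal sketch in POSIX sh rather than tcsh. The function name is hypothetical, and the patterns are the ones used in the rm commands above. It only prints paths and deletes nothing.

```shell
#!/bin/sh
# Sketch: list the Hadoop leftovers in /tmp for one account.
# list_hadoop_tmp is a hypothetical name.

# $1 = account name, $2 = directory to search (defaults to /tmp).
# Prints the existing paths matched by the patterns. Deletes nothing.
list_hadoop_tmp() {
    dir=${2:-/tmp}
    for path in "$dir"/hadoop-"$1"* "$dir"/hsperfdata_"$1" "$dir"/Jetty*; do
        # An unmatched pattern stays as literal text, so test for it.
        if [ -e "$path" ]; then
            echo "$path"
        fi
    done
}
```

For example, list_hadoop_tmp accountname shows the paths for that account. Note that the Jetty pattern is not tied to an account name, so on a shared machine it may also list files that belong to other users.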
Sometimes things go horribly wrong and you need to rebuild your Hadoop installation. Issue a "stop-all.sh", back up your work and data, if needed, then remove or move your hadoop configuration files directory (hadoop in this example). Then follow these instructions, from the step above called: Run the CSIF hadoop-config program.
Where to get more information