Installing a Hadoop multi-node cluster
I will install Hadoop on a single node and then clone it to multiple nodes, adjusting each clone's system configuration (hostname and IP address), to create a multi-node cluster.
System Setup
Virtual machine platform | Oracle VirtualBox ver 4.3.8 r92456
Operating system | Ubuntu Server 12.04 LTS
Hostnames used | hd-master, hd-slave1, hd-slave2, hd-slave3
Username:password | uadmin:admin, hduser:hadoop
Groups created | hadoop
Java version | Oracle Java 7
Hadoop version | 1.2.1
Memory | 512 MB RAM per machine
Processors | 1 per machine
HDD | 8 GB VDI, dynamically allocated
Network Setup
Hostname | IP address
hd-master | 192.168.1.30/26
hd-slave1 | 192.168.1.31/26
hd-slave2 | 192.168.1.32/26
hd-slave3 | 192.168.1.33/26
DNS will use Google's public DNS servers, 8.8.8.8 and 8.8.4.4.
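For the hostnames above to resolve, each machine also needs matching name entries; a minimal sketch, assuming /etc/hosts is used on every node (this step is not shown in the original walkthrough):

192.168.1.30    hd-master
192.168.1.31    hd-slave1
192.168.1.32    hd-slave2
192.168.1.33    hd-slave3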
Setup network
uadmin@ubuntu:~$ sudo vi /etc/network/interfaces

Change the line "iface eth0 inet dhcp" to a static configuration:
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).

# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
auto eth0
iface eth0 inet static
    address 192.168.1.30
    netmask 255.255.255.0
    gateway 192.168.1.1
    broadcast 192.168.1.255
    dns-nameservers 8.8.8.8 8.8.4.4
Shut the machine down (the network changes take effect on the next boot):

uadmin@ubuntu:~$ sudo shutdown -h 0
Check status of ssh
uadmin@ubuntu:~$ sudo service ssh status
ssh start/running, process 610
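If the ssh service is missing or not running, it can be added with the standard package (an assumption; Ubuntu Server usually ships with it when OpenSSH is selected at install time):

uadmin@ubuntu:~$ sudo apt-get install openssh-server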
Install Oracle Java 7
Install python-software-properties
uadmin@ubuntu:~$ sudo apt-get install python-software-properties
Add the PPA from WebUpd8 Team for the Java 7 installer
uadmin@ubuntu:~$ sudo apt-add-repository ppa:webupd8team/java
uadmin@ubuntu:~$ sudo apt-get update
uadmin@ubuntu:~$ sudo apt-get install oracle-java7-installer
Check Java version
uadmin@ubuntu:~$ java -version
java version "1.7.0_55"
Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
create the hadoop user account and group
Create group
uadmin@ubuntu:~$ sudo addgroup hadoop
Adding group `hadoop' (GID 1001) ...
Done.
create the user in the hadoop group
uadmin@ubuntu:~$ sudo adduser --ingroup hadoop hduser
Adding user `hduser' ...
Adding new user `hduser' (1001) with group `hadoop' ...
Creating home directory `/home/hduser' ...
Copying files from `/etc/skel' ...
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
Changing the user information for hduser
Enter the new value, or press ENTER for the default
    Full Name []:
    Room Number []:
    Work Phone []:
    Home Phone []:
    Other []:
Is the information correct? [Y/n] y
Configure SSH for Public and Private key on the newly created hadoop account
login as hduser
uadmin@ubuntu:~$ su - hduser
run ssh-keygen to generate the public/private RSA key pair
hduser@ubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
24:66:25:86:9a:22:57:10:00:c1:94:f3:f9:8d:a6:a8 hduser@ubuntu
The key's randomart image is:
+--[ RSA 2048]----+
|B++o .o .        |
| + o.  o         |
|  o+. + .        |
|o +o o o         |
|.o  . o S        |
|  +    .         |
|   .    o        |
|    .  .         |
|E                |
+-----------------+
enable SSH access to the local machine by appending the new public key to authorized_keys
hduser@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
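If SSH later still prompts for a password, a common culprit is loose permissions on the .ssh directory or the authorized_keys file; a quick fix, assuming the defaults were not already strict enough:

hduser@ubuntu:~$ chmod 700 $HOME/.ssh
hduser@ubuntu:~$ chmod 600 $HOME/.ssh/authorized_keys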
tested ssh on the local host
hduser@ubuntu:~$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is f0:10:81:ed:d3:a1:4f:4c:1d:01:a3:9f:b8:54:55:ad.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
hduser@localhost's password:
Welcome to Ubuntu 12.04.4 LTS (GNU/Linux 3.11.0-15-generic x86_64)

 * Documentation:  https://help.ubuntu.com/

  System information as of Fri Apr 18 23:30:09 EDT 2014

  System load:  0.08              Processes:           77
  Usage of /:   19.5% of 6.99GB   Users logged in:     1
  Memory usage: 16%               IP address for eth0: 192.168.1.30
  Swap usage:   0%

  Graph this data and manage this system at:
    https://landscape.canonical.com/

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.
Disable IPv6
edit the sysctl.conf file
uadmin@ubuntu:~$ sudo vi /etc/sysctl.conf
Add the following lines to the end of the file and reboot
# disable IPv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
check to make sure that IPv6 is disabled
uadmin@ubuntu:~$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
1
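As a side note, the new sysctl settings can also be applied without a full reboot (this step is not part of the original walkthrough):

uadmin@ubuntu:~$ sudo sysctl -p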
Downloading and Installing Hadoop 1.2.1
Navigate to /usr/local and download the release from an Apache mirror
uadmin@ubuntu:~$ cd /usr/local/
uadmin@ubuntu:/usr/local$ sudo wget http://mirrors.koehn.com/apache/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz
[sudo] password for uadmin:
--2014-04-19 00:01:27--  http://mirrors.koehn.com/apache/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz
Resolving mirrors.koehn.com (mirrors.koehn.com)... 107.150.35.50, 2604:4300:a:36:982:df24:f8bc:ad08
Connecting to mirrors.koehn.com (mirrors.koehn.com)|107.150.35.50|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 63851630 (61M) [application/x-gzip]
Saving to: `hadoop-1.2.1.tar.gz'

100%[================================>] 63,851,630   318K/s   in 3m 6s

2014-04-19 00:04:33 (336 KB/s) - `hadoop-1.2.1.tar.gz' saved [63851630/63851630]
Untar the file into the current directory
uadmin@ubuntu:/usr/local$ sudo tar xzf hadoop-1.2.1.tar.gz
uadmin@ubuntu:/usr/local$ ls
bin  games         hadoop-1.2.1.tar.gz  lib  sbin   src
etc  hadoop-1.2.1  include              man  share
uadmin@ubuntu:/usr/local$ ls hadoop-1.2.1
bin          hadoop-ant-1.2.1.jar          ivy          sbin
build.xml    hadoop-client-1.2.1.jar       ivy.xml      share
c++          hadoop-core-1.2.1.jar         lib          src
CHANGES.txt  hadoop-examples-1.2.1.jar     libexec      webapps
conf         hadoop-minicluster-1.2.1.jar  LICENSE.txt
contrib      hadoop-test-1.2.1.jar         NOTICE.txt
docs         hadoop-tools-1.2.1.jar        README.txt
Set ownership of the hadoop folder to hduser and hadoop group
uadmin@ubuntu:/usr/local$ sudo chown -R hduser:hadoop hadoop-1.2.1
uadmin@ubuntu:/usr/local$ ls -l
total 62396
drwxr-xr-x  2 root   root       4096 Apr 18 14:36 bin
drwxr-xr-x  2 root   root       4096 Apr 18 14:36 etc
drwxr-xr-x  2 root   root       4096 Apr 18 14:36 games
drwxr-xr-x 15 hduser hadoop     4096 Jul 22  2013 hadoop-1.2.1
-rw-r--r--  1 root   root   63851630 Jul 22  2013 hadoop-1.2.1.tar.gz
drwxr-xr-x  2 root   root       4096 Apr 18 14:36 include
drwxr-xr-x  3 root   root       4096 Apr 18 14:37 lib
lrwxrwxrwx  1 root   root          9 Apr 18 14:36 man -> share/man
drwxr-xr-x  2 root   root       4096 Apr 18 14:36 sbin
drwxr-xr-x  6 root   root       4096 Apr 18 20:04 share
drwxr-xr-x  2 root   root       4096 Apr 18 14:36 src
Update .bashrc for hduser
back up the .bashrc file and edit it
hduser@ubuntu:~$ cp ~/.bashrc ~/.bashrc.bak
hduser@ubuntu:~$ vi .bashrc
adding the following
# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop-1.2.1

# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-7-oracle

# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
    hadoop fs -cat $1 | lzop -dc | head -1000 | less
}

# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
reload .bashrc
hduser@ubuntu:~$ source .bashrc
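As a quick sanity check that the PATH change took effect (not shown in the original session), the hadoop command should now resolve and report version 1.2.1:

hduser@ubuntu:~$ hadoop version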
Configuration
modify the following line in conf/hadoop-env.sh:
# The java implementation to use.  Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
changed to
# The java implementation to use.  Required.
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
conf/*-site.xml configuration and temp directory for the application
create the app directory (I am using the one from the tutorial, but this can be anywhere you choose) and modify the permissions for hduser
hduser@ubuntu:/$ su - uadmin
Password:
uadmin@ubuntu:~$ pwd
/home/uadmin
uadmin@ubuntu:~$ cd /
uadmin@ubuntu:/$ sudo mkdir -p /app/hadoop/tmp
[sudo] password for uadmin:
uadmin@ubuntu:/$ sudo chown hduser:hadoop /app/hadoop/tmp/
uadmin@ubuntu:/$ ls -l /app/hadoop/
total 4
drwxr-xr-x 2 hduser hadoop 4096 Apr 19 10:57 tmp
uadmin@ubuntu:/$ sudo chmod 750 /app/hadoop/tmp/
uadmin@ubuntu:/$ ls -l /app/hadoop/
total 4
drwxr-x--- 2 hduser hadoop 4096 Apr 19 10:57 tmp
add configuration to core-site.xml (in between <configuration>...</configuration>)
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation. The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class. The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>
add configuration to mapred-site.xml (in between <configuration>...</configuration>)
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>
add configuration to hdfs-site.xml (in between <configuration>...</configuration>)
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>
Formatting hadoop filesystem
Run the following command to format the filesystem
(if prompted to re-format, answer with an uppercase "Y")
hduser@ubuntu:/usr/local/hadoop-1.2.1$ bin/hadoop namenode -format
Warning: $HADOOP_HOME is deprecated.

14/04/19 11:54:05 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ubuntu/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.2.1
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r 1503152; compiled by 'mattf' on Mon Jul 22 15:23:09 PDT 2013
STARTUP_MSG:   java = 1.7.0_55
************************************************************/
14/04/19 11:54:06 INFO util.GSet: Computing capacity for map BlocksMap
14/04/19 11:54:06 INFO util.GSet: VM type = 64-bit
14/04/19 11:54:06 INFO util.GSet: 2.0% max memory = 1013645312
14/04/19 11:54:06 INFO util.GSet: capacity = 2^21 = 2097152 entries
14/04/19 11:54:06 INFO util.GSet: recommended=2097152, actual=2097152
14/04/19 11:54:07 INFO namenode.FSNamesystem: fsOwner=hduser
14/04/19 11:54:07 INFO namenode.FSNamesystem: supergroup=supergroup
14/04/19 11:54:07 INFO namenode.FSNamesystem: isPermissionEnabled=true
14/04/19 11:54:07 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
14/04/19 11:54:07 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
14/04/19 11:54:07 INFO namenode.FSEditLog: dfs.namenode.edits.toleration.length = 0
14/04/19 11:54:07 INFO namenode.NameNode: Caching file names occuring more than 10 times
14/04/19 11:54:08 INFO common.Storage: Image file /app/hadoop/tmp/dfs/name/current/fsimage of size 112 bytes saved in 0 seconds.
14/04/19 11:54:09 INFO namenode.FSEditLog: closing edit log: position=4, editlog=/app/hadoop/tmp/dfs/name/current/edits
14/04/19 11:54:09 INFO namenode.FSEditLog: close success: truncate to 4, editlog=/app/hadoop/tmp/dfs/name/current/edits
14/04/19 11:54:09 INFO common.Storage: Storage directory /app/hadoop/tmp/dfs/name has been successfully formatted.
14/04/19 11:54:09 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/
Starting the hadoop single node cluster and testing
run the start-all.sh script located in the bin directory of the hadoop folder
hduser@ubuntu:/usr/local/hadoop-1.2.1$ bin/start-all.sh
Warning: $HADOOP_HOME is deprecated.

starting namenode, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-namenode-ubuntu.out
hduser@localhost's password:
localhost: starting datanode, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-datanode-ubuntu.out
hduser@localhost's password:
localhost: starting secondarynamenode, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-secondarynamenode-ubuntu.out
starting jobtracker, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-jobtracker-ubuntu.out
hduser@localhost's password:
localhost: starting tasktracker, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-tasktracker-ubuntu.out
checking to see if the processes are running
hduser@ubuntu:/usr/local/hadoop-1.2.1$ jps
4019 Jps
3663 JobTracker
3331 DataNode
3904 TaskTracker
3045 NameNode
3586 SecondaryNameNode
check to see if hadoop is listening with netstat
hduser@ubuntu:~$ netstat -plten | grep java
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp   0   0 0.0.0.0:55651     0.0.0.0:*   LISTEN   1001   13311   3663/java
tcp   0   0 0.0.0.0:50020     0.0.0.0:*   LISTEN   1001   13560   3331/java
tcp   0   0 0.0.0.0:57766     0.0.0.0:*   LISTEN   1001   13235   3586/java
tcp   0   0 127.0.0.1:54310   0.0.0.0:*   LISTEN   1001   12430   3045/java
tcp   0   0 127.0.0.1:54311   0.0.0.0:*   LISTEN   1001   13578   3663/java
tcp   0   0 0.0.0.0:50090     0.0.0.0:*   LISTEN   1001   13577   3586/java
tcp   0   0 0.0.0.0:50060     0.0.0.0:*   LISTEN   1001   13853   3904/java
tcp   0   0 0.0.0.0:50030     0.0.0.0:*   LISTEN   1001   13584   3663/java
tcp   0   0 127.0.0.1:41135   0.0.0.0:*   LISTEN   1001   13586   3904/java
tcp   0   0 0.0.0.0:33429     0.0.0.0:*   LISTEN   1001   12367   3045/java
tcp   0   0 0.0.0.0:50070     0.0.0.0:*   LISTEN   1001   12449   3045/java
tcp   0   0 0.0.0.0:43159     0.0.0.0:*   LISTEN   1001   12749   3331/java
tcp   0   0 0.0.0.0:50010     0.0.0.0:*   LISTEN   1001   13189   3331/java
tcp   0   0 0.0.0.0:50075     0.0.0.0:*   LISTEN   1001   13236   3331/java
Running a couple of test jobs
estimated value of pi example
hduser@ubuntu:/usr/local/hadoop-1.2.1$ bin/hadoop jar hadoop-examples-1.2.1.jar pi 10 100
Warning: $HADOOP_HOME is deprecated.
Number of Maps = 10 Samples per Map = 100 Wrote input for Map #0 Wrote input for Map #1 Wrote input for Map #2 Wrote input for Map #3 Wrote input for Map #4 Wrote input for Map #5 Wrote input for Map #6 Wrote input for Map #7 Wrote input for Map #8 Wrote input for Map #9 Starting Job 14/04/20 10:48:16 INFO mapred.FileInputFormat: Total input paths to process : 10 14/04/20 10:48:17 INFO mapred.JobClient: Running job: job_201404200112_0003 14/04/20 10:48:18 INFO mapred.JobClient: map 0% reduce 0% 14/04/20 10:48:46 INFO mapred.JobClient: map 20% reduce 0% Starting JobStarting Job14/04/20 10:49:14 INFO mapred.JobClient: map 40% reduce 0% 14/04/20 10:49:21 INFO mapred.JobClient: map 40% reduce 13% 14/04/20 10:49:37 INFO mapred.JobClient: map 60% reduce 13% 14/04/20 10:49:47 INFO mapred.JobClient: map 60% reduce 20% 14/04/20 10:49:57 INFO mapred.JobClient: map 80% reduce 20% 14/04/20 10:50:03 INFO mapred.JobClient: map 80% reduce 26% 14/04/20 10:50:15 INFO mapred.JobClient: map 100% reduce 26% 14/04/20 10:50:18 INFO mapred.JobClient: map 100% reduce 33% 14/04/20 10:50:21 INFO mapred.JobClient: map 100% reduce 66% 14/04/20 10:50:25 INFO mapred.JobClient: map 100% reduce 100% 14/04/20 10:50:34 INFO mapred.JobClient: Job complete: job_201404200112_0003 14/04/20 10:50:36 INFO mapred.JobClient: Counters: 30 14/04/20 10:50:36 INFO mapred.JobClient: Job Counters 14/04/20 10:50:36 INFO mapred.JobClient: Launched reduce tasks=1 14/04/20 10:50:36 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=219822 14/04/20 10:50:36 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 14/04/20 10:50:36 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 14/04/20 10:50:36 INFO mapred.JobClient: Launched map tasks=10 14/04/20 10:50:36 INFO mapred.JobClient: Data-local map tasks=10 14/04/20 10:50:36 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=98877 14/04/20 10:50:36 INFO mapred.JobClient: File Input Format Counters 14/04/20 10:50:36 INFO mapred.JobClient: Bytes Read=1180 14/04/20 10:50:36 INFO mapred.JobClient: File Output Format Counters 14/04/20 10:50:36 INFO mapred.JobClient: Bytes Written=97 14/04/20 10:50:36 INFO mapred.JobClient: FileSystemCounters 14/04/20 10:50:36 INFO mapred.JobClient: FILE_BYTES_READ=226 14/04/20 10:50:36 INFO mapred.JobClient: HDFS_BYTES_READ=2420 14/04/20 10:50:36 INFO mapred.JobClient: FILE_BYTES_WRITTEN=604395 14/04/20 10:50:36 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=215 14/04/20 10:50:36 INFO mapred.JobClient: Map-Reduce Framework 14/04/20 10:50:36 INFO mapred.JobClient: Map output materialized bytes=280 14/04/20 10:50:36 INFO mapred.JobClient: Map input records=10 14/04/20 10:50:36 INFO mapred.JobClient: Reduce shuffle bytes=280 14/04/20 10:50:36 INFO mapred.JobClient: Spilled Records=40 14/04/20 10:50:36 INFO mapred.JobClient: Map output bytes=180 14/04/20 10:50:36 INFO mapred.JobClient: Total committed heap usage (bytes)=2037886976 14/04/20 10:50:36 INFO mapred.JobClient: CPU time spent (ms)=13510 14/04/20 10:50:36 INFO mapred.JobClient: Map input bytes=240 14/04/20 10:50:36 INFO mapred.JobClient: SPLIT_RAW_BYTES=1240 14/04/20 10:50:36 INFO mapred.JobClient: Combine input records=0 14/04/20 10:50:36 INFO mapred.JobClient: Reduce input records=20 14/04/20 10:50:36 INFO mapred.JobClient: Reduce input groups=20 14/04/20 10:50:36 INFO mapred.JobClient: Combine output records=0 14/04/20 10:50:36 INFO mapred.JobClient: Physical memory (bytes) snapshot=1615872000 14/04/20 10:50:36 INFO mapred.JobClient: Reduce output 
records=0 14/04/20 10:50:36 INFO mapred.JobClient: Virtual memory (bytes) snapshot=10645270528 14/04/20 10:50:36 INFO mapred.JobClient: Map output records=20 Job Finished in 141.28 seconds Estimated value of Pi is 3.14800000000000000000
running word count example
copying all of the XML files from the conf folder in hadoop to the hdfs
hduser@ubuntu:/usr/local/hadoop-1.2.1$ bin/hadoop dfs -mkdir /user/hduser/input
Warning: $HADOOP_HOME is deprecated.
hduser@ubuntu:/usr/local/hadoop-1.2.1$ bin/hadoop dfs -copyFromLocal conf/*.xml /user/hduser/input
Warning: $HADOOP_HOME is deprecated.
hduser@ubuntu:/usr/local/hadoop-1.2.1$ bin/hadoop dfs -ls /user/hduser/input
Warning: $HADOOP_HOME is deprecated.
Found 7 items
-rw-r--r--   1 hduser supergroup       7457 2014-04-20 11:08 /user/hduser/input/capacity-scheduler.xml
-rw-r--r--   1 hduser supergroup        767 2014-04-20 11:08 /user/hduser/input/core-site.xml
-rw-r--r--   1 hduser supergroup        327 2014-04-20 11:08 /user/hduser/input/fair-scheduler.xml
-rw-r--r--   1 hduser supergroup       4644 2014-04-20 11:08 /user/hduser/input/hadoop-policy.xml
-rw-r--r--   1 hduser supergroup        458 2014-04-20 11:08 /user/hduser/input/hdfs-site.xml
-rw-r--r--   1 hduser supergroup       2033 2014-04-20 11:08 /user/hduser/input/mapred-queue-acls.xml
-rw-r--r--   1 hduser supergroup        436 2014-04-20 11:08 /user/hduser/input/mapred-site.xml
run the wordcount example to output the results to an output directory within the hdfs
hduser@ubuntu:/usr/local/hadoop-1.2.1$ bin/hadoop jar hadoop-examples-1.2.1.jar wordcount /user/hduser/input /user/hduser/wc-output
Warning: $HADOOP_HOME is deprecated.
14/04/20 11:13:59 INFO input.FileInputFormat: Total input paths to process : 7 14/04/20 11:13:59 INFO util.NativeCodeLoader: Loaded the native-hadoop library 14/04/20 11:13:59 WARN snappy.LoadSnappy: Snappy native library not loaded 14/04/20 11:14:01 INFO mapred.JobClient: Running job: job_201404200112_0004 14/04/20 11:14:02 INFO mapred.JobClient: map 0% reduce 0% 14/04/20 11:14:32 INFO mapred.JobClient: map 28% reduce 0% 14/04/20 11:15:06 INFO mapred.JobClient: map 57% reduce 0% 14/04/20 11:15:10 INFO mapred.JobClient: map 57% reduce 9% 14/04/20 11:15:16 INFO mapred.JobClient: map 57% reduce 19% 14/04/20 11:15:26 INFO mapred.JobClient: map 85% reduce 19% 14/04/20 11:15:34 INFO mapred.JobClient: map 85% reduce 28% 14/04/20 11:15:37 INFO mapred.JobClient: map 100% reduce 28% 14/04/20 11:15:43 INFO mapred.JobClient: map 100% reduce 33% 14/04/20 11:15:47 INFO mapred.JobClient: map 100% reduce 100% 14/04/20 11:15:54 INFO mapred.JobClient: Job complete: job_201404200112_0004 14/04/20 11:15:57 INFO mapred.JobClient: Counters: 29 14/04/20 11:15:57 INFO mapred.JobClient: Job Counters 14/04/20 11:15:57 INFO mapred.JobClient: Launched reduce tasks=1 14/04/20 11:15:57 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=158499 14/04/20 11:15:57 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 14/04/20 11:15:57 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 14/04/20 11:15:57 INFO mapred.JobClient: Launched map tasks=7 14/04/20 11:15:57 INFO mapred.JobClient: Data-local map tasks=7 14/04/20 11:15:57 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=78409 14/04/20 11:15:57 INFO mapred.JobClient: File Output Format Counters 14/04/20 11:15:57 INFO mapred.JobClient: Bytes Written=7001 14/04/20 11:15:57 INFO mapred.JobClient: FileSystemCounters 14/04/20 11:15:57 INFO mapred.JobClient: FILE_BYTES_READ=11713 14/04/20 11:15:57 INFO mapred.JobClient: HDFS_BYTES_READ=16983 14/04/20 11:15:57 INFO mapred.JobClient: FILE_BYTES_WRITTEN=463056 14/04/20 11:15:57 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=7001 14/04/20 11:15:57 INFO mapred.JobClient: File Input Format Counters 14/04/20 11:15:57 INFO mapred.JobClient: Bytes Read=16122 14/04/20 11:15:57 INFO mapred.JobClient: Map-Reduce Framework 14/04/20 11:15:57 INFO mapred.JobClient: Map output materialized bytes=11749 14/04/20 11:15:57 INFO mapred.JobClient: Map input records=397 14/04/20 11:15:57 INFO mapred.JobClient: Reduce shuffle bytes=11749 14/04/20 11:15:57 INFO mapred.JobClient: Spilled Records=1366 14/04/20 11:15:57 INFO mapred.JobClient: Map output bytes=22382 14/04/20 11:15:57 INFO mapred.JobClient: Total committed heap usage (bytes)=1428643840 14/04/20 11:15:57 INFO mapred.JobClient: CPU time spent (ms)=9760 14/04/20 11:15:57 INFO mapred.JobClient: Combine input records=1862 14/04/20 11:15:57 INFO mapred.JobClient: SPLIT_RAW_BYTES=861 14/04/20 11:15:57 INFO mapred.JobClient: Reduce input records=683 14/04/20 11:15:57 INFO mapred.JobClient: Reduce input groups=468 14/04/20 11:15:57 INFO mapred.JobClient: Combine output records=683 14/04/20 11:15:57 INFO mapred.JobClient: Physical memory (bytes) snapshot=1108615168 14/04/20 11:15:57 INFO mapred.JobClient: Reduce output records=468 14/04/20 11:15:57 INFO mapred.JobClient: Virtual memory (bytes) snapshot=7745290240 14/04/20 11:15:57 INFO mapred.JobClient: Map output records=1862 hduser@ubuntu:/usr/local/hadoop-1.2.1$ bin/hadoop dfs -ls /user/hduser Warning: $HADOOP_HOME is deprecated.
Found 2 items
drwxr-xr-x   - hduser supergroup          0 2014-04-20 11:08 /user/hduser/input
drwxr-xr-x   - hduser supergroup          0 2014-04-20 11:15 /user/hduser/wc-output
Viewing the results
reading straight from the hdfs
hduser@ubuntu:/usr/local/hadoop-1.2.1$ bin/hadoop dfs -cat /user/hduser/wc-output/part-r-00000
(output omitted here; see the head of the merged output on the local machine below)
using hadoop's getmerge you can also merge the output files from HDFS onto the local machine
hduser@ubuntu:/usr/local/hadoop-1.2.1$ bin/hadoop dfs -getmerge /user/hduser/wc-output ~/hdfs-output/
Warning: $HADOOP_HOME is deprecated.
14/04/20 11:23:34 INFO util.NativeCodeLoader: Loaded the native-hadoop library
hduser@ubuntu:/usr/local/hadoop-1.2.1$ head ~/hdfs-output/wc-output
"*"                   10
"alice,bob            10
"local",              1
'                     2
'(i.e.                2
'*',                  2
'default'             2
(fs.SCHEME.impl)      1
(maximum-system-jobs  2
*                     2
Setting up for multi-node
for the sake of simplicity, I took the already working single node, cloned it three times, and modified the IP address and hostname of all four machines to match the information provided earlier.
Setting up passwordless SSH for the master to reach the slave nodes
From the master, run the following commands to set up passwordless SSH. The hduser login has to match on all slaves; I also included hd-master so that the master is not prompted for a password when starting its own daemons.
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hd-master
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hd-slave1
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hd-slave2
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hd-slave3
each command prompts for an SSH login and then copies the RSA public key to that machine, so future logins are passwordless
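A quick way to confirm that key-based login works on every node is to loop over the hostnames and run a trivial remote command; a minimal sketch, run as hduser on hd-master:

for host in hd-master hd-slave1 hd-slave2 hd-slave3; do
    ssh $host hostname   # should print each hostname without asking for a password
done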
Modify the master and slave node configuration files
Modify the masters and slaves files in the hadoop conf/ folder to designate master and slave nodes
on hd-master:
hduser@hd-master:/usr/local/hadoop-1.2.1$ vi conf/masters

change the file contents from "localhost" to "hd-master"
on hd-master and hd-slaves(1-3)
hduser@hd-slave1:/usr/local/hadoop-1.2.1$ vi conf/slaves

change the file contents from "localhost" to:
hd-master   (on the master only)
hd-slave1
hd-slave2
hd-slave3
Modify the XML files in conf/
edit the XML files in conf/ on all nodes
edit conf/core-site.xml
change the line
  <value>hdfs://localhost:54310</value>
to:
  <value>hdfs://hd-master:54310</value>
edit conf/mapred-site.xml
change the line
  <value>localhost:54311</value>
to:
  <value>hd-master:54311</value>
add the following to this file as well:
<property>
  <name>mapred.local.dir</name>
  <value>${hadoop.tmp.dir}/mapred/local</value>
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>20</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>2</value>
</property>
edit conf/hdfs-site.xml
change the line
  <value>1</value>
to:
  <value>4</value>
Formatting the Hadoop File System
with no hadoop processes running, run the following command from the hadoop directory on hd-master:
bin/hadoop namenode -format
Starting the multi-node cluster
run bin/start-dfs.sh from the hadoop directory on the master:
(if the datanodes shut down shortly after starting, check the logs; I had to delete the data folder from /app/hadoop/tmp because the namespace IDs did not match, see the cleanup sketch below)
you should get the following output:
starting namenode, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-namenode-hd-master.out
hd-slave3: starting datanode, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-datanode-hd-slave3.out
hd-master: starting datanode, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-datanode-hd-master.out
hd-slave2: starting datanode, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-datanode-hd-slave2.out
hd-slave1: starting datanode, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-datanode-hd-slave1.out
hd-master: starting secondarynamenode, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-secondarynamenode-hd-master.out
run jps to see the following processes running in java
jps
8723 NameNode
9213 SecondaryNameNode
9630 Jps
the slaves should only report the DataNode entry (plus Jps)
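For reference, the cleanup I used for the namespace ID mismatch mentioned above was roughly the following, run on each affected node with the daemons stopped (assumption: the DataNode keeps its data under the hadoop.tmp.dir configured earlier; this deletes that node's HDFS blocks):

rm -rf /app/hadoop/tmp/dfs/data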
run bin/start-mapred.sh on the master to start the MapReduce daemons
you should get the following output:
starting jobtracker, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-jobtracker-hd-master.out
hd-slave2: starting tasktracker, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-tasktracker-hd-slave2.out
hd-slave1: starting tasktracker, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-tasktracker-hd-slave1.out
hd-slave3: starting tasktracker, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-tasktracker-hd-slave3.out
hd-master: starting tasktracker, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-tasktracker-hd-master.out
check jps
jps
8723 NameNode
9213 SecondaryNameNode
9334 JobTracker
9630 Jps
9584 TaskTracker
the slaves should now also show a TaskTracker process alongside the DataNode
Run a MapReduce job to test the new cluster setup
From the master node, I downloaded the example etexts provided for the wordcount example onto the local machine under /tmp/clustertest. From the hadoop directory, run:
bin/hadoop dfs -copyFromLocal /tmp/clustertest /user/hduser/
Then run the wordcount example against the uploaded data; it will give a rather long output:
hduser@hd-master:/usr/local/hadoop-1.2.1$ bin/hadoop jar hadoop-examples-1.2.1.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
Warning: $HADOOP_HOME is deprecated.
14/04/20 16:43:46 INFO input.FileInputFormat: Total input paths to process : 7 14/04/20 16:43:46 INFO util.NativeCodeLoader: Loaded the native-hadoop library 14/04/20 16:43:46 WARN snappy.LoadSnappy: Snappy native library not loaded 14/04/20 16:43:48 INFO mapred.JobClient: Running job: job_201404201608_0001 14/04/20 16:43:49 INFO mapred.JobClient: map 0% reduce 0% 14/04/20 16:44:12 INFO mapred.JobClient: map 14% reduce 0% 14/04/20 16:44:40 INFO mapred.JobClient: map 42% reduce 1% 14/04/20 16:45:05 INFO mapred.JobClient: map 85% reduce 1% 14/04/20 16:45:13 INFO mapred.JobClient: map 100% reduce 1% 14/04/20 16:45:18 INFO mapred.JobClient: map 100% reduce 3% 14/04/20 16:45:20 INFO mapred.JobClient: map 100% reduce 5% 14/04/20 16:45:29 INFO mapred.JobClient: map 100% reduce 16% 14/04/20 16:45:32 INFO mapred.JobClient: map 100% reduce 17% 14/04/20 16:45:33 INFO mapred.JobClient: map 100% reduce 35% 14/04/20 16:45:36 INFO mapred.JobClient: map 100% reduce 50% 14/04/20 16:45:41 INFO mapred.JobClient: map 100% reduce 67% 14/04/20 16:45:42 INFO mapred.JobClient: map 100% reduce 75% 14/04/20 16:45:46 INFO mapred.JobClient: map 100% reduce 92% 14/04/20 16:45:49 INFO mapred.JobClient: map 100% reduce 100% 14/04/20 16:45:54 INFO mapred.JobClient: Job complete: job_201404201608_0001 14/04/20 16:45:54 INFO mapred.JobClient: Counters: 30 14/04/20 16:45:54 INFO mapred.JobClient: Job Counters 14/04/20 16:45:54 INFO mapred.JobClient: Launched reduce tasks=6 14/04/20 16:45:54 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=481856 14/04/20 16:45:54 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 14/04/20 16:45:54 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 14/04/20 16:45:54 INFO mapred.JobClient: Rack-local map tasks=2 14/04/20 16:45:54 INFO mapred.JobClient: Launched map tasks=11 14/04/20 16:45:54 INFO mapred.JobClient: Data-local map tasks=9 14/04/20 16:45:54 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=359838 14/04/20 16:45:54 INFO mapred.JobClient: File Output Format Counters 14/04/20 16:45:54 INFO mapred.JobClient: Bytes Written=1412513 14/04/20 16:45:54 INFO mapred.JobClient: FileSystemCounters 14/04/20 16:45:54 INFO mapred.JobClient: FILE_BYTES_READ=4481581 14/04/20 16:45:54 INFO mapred.JobClient: HDFS_BYTES_READ=6950874 14/04/20 16:45:54 INFO mapred.JobClient: FILE_BYTES_WRITTEN=7982746 14/04/20 16:45:54 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1412513 14/04/20 16:45:54 INFO mapred.JobClient: File Input Format Counters 14/04/20 16:45:54 INFO mapred.JobClient: Bytes Read=6950006 14/04/20 16:45:54 INFO mapred.JobClient: Map-Reduce Framework 14/04/20 16:45:54 INFO mapred.JobClient: Map output materialized bytes=2915221 14/04/20 16:45:54 INFO mapred.JobClient: Map input records=137146 14/04/20 16:45:54 INFO mapred.JobClient: Reduce shuffle bytes=2915221 14/04/20 16:45:54 INFO mapred.JobClient: Spilled Records=507862 14/04/20 16:45:54 INFO mapred.JobClient: Map output bytes=11435858 14/04/20 16:45:54 INFO mapred.JobClient: Total committed heap usage (bytes)=1466208256 14/04/20 16:45:54 INFO mapred.JobClient: CPU time spent (ms)=59870 14/04/20 16:45:54 INFO mapred.JobClient: Combine input records=1174992 14/04/20 16:45:54 INFO mapred.JobClient: SPLIT_RAW_BYTES=868 14/04/20 16:45:54 INFO mapred.JobClient: Reduce input records=201012 14/04/20 16:45:54 INFO mapred.JobClient: Reduce input groups=128514 14/04/20 16:45:54 INFO mapred.JobClient: Combine output records=201012 14/04/20 16:45:54 INFO mapred.JobClient: Physical memory 
(bytes) snapshot=1541390336 14/04/20 16:45:54 INFO mapred.JobClient: Reduce output records=128514 14/04/20 16:45:54 INFO mapred.JobClient: Virtual memory (bytes) snapshot=10689261568 14/04/20 16:45:54 INFO mapred.JobClient: Map output records=1174992
now we can run a getmerge to bring the output to the local file system
bin/hadoop dfs -getmerge /user/hduser/gutenberg-output /tmp/
14/04/20 16:50:55 INFO util.NativeCodeLoader: Loaded the native-hadoop library
and view a sample of the output:
hduser@hd-master:/usr/local/hadoop-1.2.1$ head /tmp/gutenberg-output
"'Ample.'       1
"'As            1
"'Because       1
"'Certainly,'   1
"'DEAR          1
"'Dear          2
"'Fritz!        1
"'From          1
"'Is            3
"'Ku            1
all of the namenode data and job tracking status can be viewed from the following web interfaces:
http://hd-master:50070
http://hd-master:50030
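The same information is also available from the command line; for example, the following should list all four datanodes once they have joined the cluster (this check is not part of the original walkthrough):

bin/hadoop dfsadmin -report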
and that's it. To stop the cluster, you simply run the following shell scripts on the master:
bin/stop-dfs.sh
bin/stop-mapred.sh
References used
http://www.michael-noll.com
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster
http://www.javacodegeeks.com
http://www.javacodegeeks.com/2013/06/setting-up-apache-hadoop-multi-node-cluster.html