Installing a Hadoop multi-node cluster
I will install Hadoop on a single node and then clone it to multiple nodes, adjusting each clone's system configuration (hostname and IP address), to create a multi-node cluster.
System Setup
Virtual machine platform | Oracle VirtualBox ver 4.3.8 r92456
Operating system | Ubuntu Server 12.04 LTS
Hostnames used | hd-master, hd-slave1, hd-slave2, hd-slave3
Username:password | uadmin:admin, hduser:hadoop
Groups created | hadoop
Java version | Oracle Java 7
Hadoop version | 1.2.1
Memory | 512 MB RAM per machine
Processors | 1 per machine
HDD | 8 GB VDI, dynamically allocated
Network Setup
Hostname | IP address
hd-master | 192.168.1.30/26
hd-slave1 | 192.168.1.31/26
hd-slave2 | 192.168.1.32/26
hd-slave3 | 192.168.1.33/26
DNS will use Google's public DNS servers, 8.8.8.8 and 8.8.4.4.
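For the hostnames above to resolve, each machine also needs matching name entries; a minimal sketch, assuming /etc/hosts is used on every node (this step is not shown in the original walkthrough):

192.168.1.30    hd-master
192.168.1.31    hd-slave1
192.168.1.32    hd-slave2
192.168.1.33    hd-slave3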
Setup network
uadmin@ubuntu:~$ sudo vi /etc/network/interfaces

Change the line "iface eth0 inet dhcp" to a static configuration:
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).

# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
auto eth0
iface eth0 inet static
    address 192.168.1.30
    netmask 255.255.255.0
    gateway 192.168.1.1
    broadcast 192.168.1.255
    dns-nameservers 8.8.8.8 8.8.4.4
Shut the machine down (the network changes take effect on the next boot):

uadmin@ubuntu:~$ sudo shutdown -h 0
Check status of ssh
uadmin@ubuntu:~$ sudo service ssh status
ssh start/running, process 610
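If the ssh service is missing or not running, it can be added with the standard package (an assumption; Ubuntu Server usually ships with it when OpenSSH is selected at install time):

uadmin@ubuntu:~$ sudo apt-get install openssh-server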
Install Oracle Java 7
Install python-software-properties
uadmin@ubuntu:~$ sudo apt-get install python-software-properties
Add the PPA from WebUpd8 Team for the Java 7 installer
uadmin@ubuntu:~$ sudo apt-add-repository ppa:webupd8team/java
uadmin@ubuntu:~$ sudo apt-get update
uadmin@ubuntu:~$ sudo apt-get install oracle-java7-installer
Check Java version
uadmin@ubuntu:~$ java -version
java version "1.7.0_55"
Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
create the hadoop user account and group
Create group
uadmin@ubuntu:~$ sudo addgroup hadoop
Adding group `hadoop' (GID 1001) ...
Done.
create the user in the hadoop group
uadmin@ubuntu:~$ sudo adduser --ingroup hadoop hduser
Adding user `hduser' ...
Adding new user `hduser' (1001) with group `hadoop' ...
Creating home directory `/home/hduser' ...
Copying files from `/etc/skel' ...
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
Changing the user information for hduser
Enter the new value, or press ENTER for the default
    Full Name []:
    Room Number []:
    Work Phone []:
    Home Phone []:
    Other []:
Is the information correct? [Y/n] y
Configure SSH for Public and Private key on the newly created hadoop account
login as hduser
uadmin@ubuntu:~$ su - hduser
run ssh-keygen to generate the public/private RSA key pair
hduser@ubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
24:66:25:86:9a:22:57:10:00:c1:94:f3:f9:8d:a6:a8 hduser@ubuntu
The key's randomart image is:
+--[ RSA 2048]----+
|B++o .o .        |
| + o.  o         |
|  o+. + .        |
|o +o o o         |
|.o  . o S        |
|  +    .         |
|   .    o        |
|    .  .         |
|E                |
+-----------------+
enable SSH access to the local machine by appending the new public key to authorized_keys
hduser@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
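If SSH later still prompts for a password, a common culprit is loose permissions on the .ssh directory or the authorized_keys file; a quick fix, assuming the defaults were not already strict enough:

hduser@ubuntu:~$ chmod 700 $HOME/.ssh
hduser@ubuntu:~$ chmod 600 $HOME/.ssh/authorized_keys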
tested ssh on the local host
hduser@ubuntu:~$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is f0:10:81:ed:d3:a1:4f:4c:1d:01:a3:9f:b8:54:55:ad.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
hduser@localhost's password:
Welcome to Ubuntu 12.04.4 LTS (GNU/Linux 3.11.0-15-generic x86_64)

 * Documentation:  https://help.ubuntu.com/

  System information as of Fri Apr 18 23:30:09 EDT 2014

  System load:  0.08              Processes:           77
  Usage of /:   19.5% of 6.99GB   Users logged in:     1
  Memory usage: 16%               IP address for eth0: 192.168.1.30
  Swap usage:   0%

  Graph this data and manage this system at:
    https://landscape.canonical.com/

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.
Disable IPv6
edit the sysctl.conf file
uadmin@ubuntu:~$ sudo vi /etc/sysctl.conf
Add the following lines to the end of the file and reboot
# disable IPv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
check to make sure that IPv6 is disabled
uadmin@ubuntu:~$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
1
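As a side note, the new sysctl settings can also be applied without a full reboot (this step is not part of the original walkthrough):

uadmin@ubuntu:~$ sudo sysctl -p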
Downloading and Installing Hadoop 1.2.1
Navigate to /usr/local and download the release from an Apache mirror
uadmin@ubuntu:~$ cd /usr/local/
uadmin@ubuntu:/usr/local$ sudo wget http://mirrors.koehn.com/apache/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz
[sudo] password for uadmin:
--2014-04-19 00:01:27--  http://mirrors.koehn.com/apache/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz
Resolving mirrors.koehn.com (mirrors.koehn.com)... 107.150.35.50, 2604:4300:a:36:982:df24:f8bc:ad08
Connecting to mirrors.koehn.com (mirrors.koehn.com)|107.150.35.50|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 63851630 (61M) [application/x-gzip]
Saving to: `hadoop-1.2.1.tar.gz'

100%[================================>] 63,851,630   318K/s   in 3m 6s

2014-04-19 00:04:33 (336 KB/s) - `hadoop-1.2.1.tar.gz' saved [63851630/63851630]
Untar the file into the current directory
uadmin@ubuntu:/usr/local$ sudo tar xzf hadoop-1.2.1.tar.gz
uadmin@ubuntu:/usr/local$ ls
bin  games         hadoop-1.2.1.tar.gz  lib  sbin   src
etc  hadoop-1.2.1  include              man  share
uadmin@ubuntu:/usr/local$ ls hadoop-1.2.1
bin          hadoop-ant-1.2.1.jar          ivy          sbin
build.xml    hadoop-client-1.2.1.jar       ivy.xml      share
c++          hadoop-core-1.2.1.jar         lib          src
CHANGES.txt  hadoop-examples-1.2.1.jar     libexec      webapps
conf         hadoop-minicluster-1.2.1.jar  LICENSE.txt
contrib      hadoop-test-1.2.1.jar         NOTICE.txt
docs         hadoop-tools-1.2.1.jar        README.txt
Set ownership of the hadoop folder to hduser and hadoop group
uadmin@ubuntu:/usr/local$ sudo chown -R hduser:hadoop hadoop-1.2.1
uadmin@ubuntu:/usr/local$ ls -l
total 62396
drwxr-xr-x  2 root   root       4096 Apr 18 14:36 bin
drwxr-xr-x  2 root   root       4096 Apr 18 14:36 etc
drwxr-xr-x  2 root   root       4096 Apr 18 14:36 games
drwxr-xr-x 15 hduser hadoop     4096 Jul 22  2013 hadoop-1.2.1
-rw-r--r--  1 root   root   63851630 Jul 22  2013 hadoop-1.2.1.tar.gz
drwxr-xr-x  2 root   root       4096 Apr 18 14:36 include
drwxr-xr-x  3 root   root       4096 Apr 18 14:37 lib
lrwxrwxrwx  1 root   root          9 Apr 18 14:36 man -> share/man
drwxr-xr-x  2 root   root       4096 Apr 18 14:36 sbin
drwxr-xr-x  6 root   root       4096 Apr 18 20:04 share
drwxr-xr-x  2 root   root       4096 Apr 18 14:36 src
Update .bashrc for hduser
back up the .bashrc file and edit it
hduser@ubuntu:~$ cp ~/.bashrc ~/.bashrc.bak
hduser@ubuntu:~$ vi .bashrc
adding the following
# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop-1.2.1

# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-7-oracle

# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
    hadoop fs -cat $1 | lzop -dc | head -1000 | less
}

# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
reload .bashrc
hduser@ubuntu:~$ source .bashrc
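As a quick sanity check that the PATH change took effect (not shown in the original session), the hadoop command should now resolve and report version 1.2.1:

hduser@ubuntu:~$ hadoop version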
Configuration
modify the following line in conf/hadoop-env.sh:
# The java implementation to use.  Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
changed to
# The java implementation to use.  Required.
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
conf/*-site.xml configuration and temp directory for the application
create the app directory (I am using the one from the tutorial, but this can be anywhere you choose) and modify the permissions for hduser
hduser@ubuntu:/$ su - uadmin
Password:
uadmin@ubuntu:~$ pwd
/home/uadmin
uadmin@ubuntu:~$ cd /
uadmin@ubuntu:/$ sudo mkdir -p /app/hadoop/tmp
[sudo] password for uadmin:
uadmin@ubuntu:/$ sudo chown hduser:hadoop /app/hadoop/tmp/
uadmin@ubuntu:/$ ls -l /app/hadoop/
total 4
drwxr-xr-x 2 hduser hadoop 4096 Apr 19 10:57 tmp
uadmin@ubuntu:/$ sudo chmod 750 /app/hadoop/tmp/
uadmin@ubuntu:/$ ls -l /app/hadoop/
total 4
drwxr-x--- 2 hduser hadoop 4096 Apr 19 10:57 tmp
add configuration to core-site.xml (in between <configuration>...</configuration>)
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation. The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class. The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>
add configuration to mapred-site.xml (in between <configuration>...</configuration>)
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>
add configuration to hdfs-site.xml (in between <configuration>...</configuration>)
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>
Formatting hadoop filesystem
Run the following command to format the filesystem
(if prompted to re-format, answer with an uppercase "Y")
hduser@ubuntu:/usr/local/hadoop-1.2.1$ bin/hadoop namenode -format
Warning: $HADOOP_HOME is deprecated.

14/04/19 11:54:05 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ubuntu/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.2.1
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r 1503152; compiled by 'mattf' on Mon Jul 22 15:23:09 PDT 2013
STARTUP_MSG:   java = 1.7.0_55
************************************************************/
14/04/19 11:54:06 INFO util.GSet: Computing capacity for map BlocksMap
14/04/19 11:54:06 INFO util.GSet: VM type = 64-bit
14/04/19 11:54:06 INFO util.GSet: 2.0% max memory = 1013645312
14/04/19 11:54:06 INFO util.GSet: capacity = 2^21 = 2097152 entries
14/04/19 11:54:06 INFO util.GSet: recommended=2097152, actual=2097152
14/04/19 11:54:07 INFO namenode.FSNamesystem: fsOwner=hduser
14/04/19 11:54:07 INFO namenode.FSNamesystem: supergroup=supergroup
14/04/19 11:54:07 INFO namenode.FSNamesystem: isPermissionEnabled=true
14/04/19 11:54:07 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
14/04/19 11:54:07 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
14/04/19 11:54:07 INFO namenode.FSEditLog: dfs.namenode.edits.toleration.length = 0
14/04/19 11:54:07 INFO namenode.NameNode: Caching file names occuring more than 10 times
14/04/19 11:54:08 INFO common.Storage: Image file /app/hadoop/tmp/dfs/name/current/fsimage of size 112 bytes saved in 0 seconds.
14/04/19 11:54:09 INFO namenode.FSEditLog: closing edit log: position=4, editlog=/app/hadoop/tmp/dfs/name/current/edits
14/04/19 11:54:09 INFO namenode.FSEditLog: close success: truncate to 4, editlog=/app/hadoop/tmp/dfs/name/current/edits
14/04/19 11:54:09 INFO common.Storage: Storage directory /app/hadoop/tmp/dfs/name has been successfully formatted.
14/04/19 11:54:09 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/
Starting the hadoop single node cluster and testing
run the start-all.sh script located in the bin directory of the hadoop folder
hduser@ubuntu:/usr/local/hadoop-1.2.1$ bin/start-all.sh
Warning: $HADOOP_HOME is deprecated.

starting namenode, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-namenode-ubuntu.out
hduser@localhost's password:
localhost: starting datanode, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-datanode-ubuntu.out
hduser@localhost's password:
localhost: starting secondarynamenode, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-secondarynamenode-ubuntu.out
starting jobtracker, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-jobtracker-ubuntu.out
hduser@localhost's password:
localhost: starting tasktracker, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-tasktracker-ubuntu.out
checking to see if the processes are running
hduser@ubuntu:/usr/local/hadoop-1.2.1$ jps
4019 Jps
3663 JobTracker
3331 DataNode
3904 TaskTracker
3045 NameNode
3586 SecondaryNameNode
check to see if hadoop is listening with netstat
hduser@ubuntu:~$ netstat -plten | grep java
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp   0   0 0.0.0.0:55651     0.0.0.0:*   LISTEN   1001   13311   3663/java
tcp   0   0 0.0.0.0:50020     0.0.0.0:*   LISTEN   1001   13560   3331/java
tcp   0   0 0.0.0.0:57766     0.0.0.0:*   LISTEN   1001   13235   3586/java
tcp   0   0 127.0.0.1:54310   0.0.0.0:*   LISTEN   1001   12430   3045/java
tcp   0   0 127.0.0.1:54311   0.0.0.0:*   LISTEN   1001   13578   3663/java
tcp   0   0 0.0.0.0:50090     0.0.0.0:*   LISTEN   1001   13577   3586/java
tcp   0   0 0.0.0.0:50060     0.0.0.0:*   LISTEN   1001   13853   3904/java
tcp   0   0 0.0.0.0:50030     0.0.0.0:*   LISTEN   1001   13584   3663/java
tcp   0   0 127.0.0.1:41135   0.0.0.0:*   LISTEN   1001   13586   3904/java
tcp   0   0 0.0.0.0:33429     0.0.0.0:*   LISTEN   1001   12367   3045/java
tcp   0   0 0.0.0.0:50070     0.0.0.0:*   LISTEN   1001   12449   3045/java
tcp   0   0 0.0.0.0:43159     0.0.0.0:*   LISTEN   1001   12749   3331/java
tcp   0   0 0.0.0.0:50010     0.0.0.0:*   LISTEN   1001   13189   3331/java
tcp   0   0 0.0.0.0:50075     0.0.0.0:*   LISTEN   1001   13236   3331/java
Running a couple of test jobs
estimated value of pi example
hduser@ubuntu:/usr/local/hadoop-1.2.1$ bin/hadoop jar hadoop-examples-1.2.1.jar pi 10 100
Warning: $HADOOP_HOME is deprecated.
Number of Maps = 10 Samples per Map = 100 Wrote input for Map #0 Wrote input for Map #1 Wrote input for Map #2 Wrote input for Map #3 Wrote input for Map #4 Wrote input for Map #5 Wrote input for Map #6 Wrote input for Map #7 Wrote input for Map #8 Wrote input for Map #9 Starting Job 14/04/20 10:48:16 INFO mapred.FileInputFormat: Total input paths to process : 10 14/04/20 10:48:17 INFO mapred.JobClient: Running job: job_201404200112_0003 14/04/20 10:48:18 INFO mapred.JobClient: map 0% reduce 0% 14/04/20 10:48:46 INFO mapred.JobClient: map 20% reduce 0% Starting JobStarting Job14/04/20 10:49:14 INFO mapred.JobClient: map 40% reduce 0% 14/04/20 10:49:21 INFO mapred.JobClient: map 40% reduce 13% 14/04/20 10:49:37 INFO mapred.JobClient: map 60% reduce 13% 14/04/20 10:49:47 INFO mapred.JobClient: map 60% reduce 20% 14/04/20 10:49:57 INFO mapred.JobClient: map 80% reduce 20% 14/04/20 10:50:03 INFO mapred.JobClient: map 80% reduce 26% 14/04/20 10:50:15 INFO mapred.JobClient: map 100% reduce 26% 14/04/20 10:50:18 INFO mapred.JobClient: map 100% reduce 33% 14/04/20 10:50:21 INFO mapred.JobClient: map 100% reduce 66% 14/04/20 10:50:25 INFO mapred.JobClient: map 100% reduce 100% 14/04/20 10:50:34 INFO mapred.JobClient: Job complete: job_201404200112_0003 14/04/20 10:50:36 INFO mapred.JobClient: Counters: 30 14/04/20 10:50:36 INFO mapred.JobClient: Job Counters 14/04/20 10:50:36 INFO mapred.JobClient: Launched reduce tasks=1 14/04/20 10:50:36 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=219822 14/04/20 10:50:36 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 14/04/20 10:50:36 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 14/04/20 10:50:36 INFO mapred.JobClient: Launched map tasks=10 14/04/20 10:50:36 INFO mapred.JobClient: Data-local map tasks=10 14/04/20 10:50:36 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=98877 14/04/20 10:50:36 INFO mapred.JobClient: File Input Format Counters 14/04/20 10:50:36 INFO mapred.JobClient: Bytes Read=1180 14/04/20 10:50:36 INFO mapred.JobClient: File Output Format Counters 14/04/20 10:50:36 INFO mapred.JobClient: Bytes Written=97 14/04/20 10:50:36 INFO mapred.JobClient: FileSystemCounters 14/04/20 10:50:36 INFO mapred.JobClient: FILE_BYTES_READ=226 14/04/20 10:50:36 INFO mapred.JobClient: HDFS_BYTES_READ=2420 14/04/20 10:50:36 INFO mapred.JobClient: FILE_BYTES_WRITTEN=604395 14/04/20 10:50:36 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=215 14/04/20 10:50:36 INFO mapred.JobClient: Map-Reduce Framework 14/04/20 10:50:36 INFO mapred.JobClient: Map output materialized bytes=280 14/04/20 10:50:36 INFO mapred.JobClient: Map input records=10 14/04/20 10:50:36 INFO mapred.JobClient: Reduce shuffle bytes=280 14/04/20 10:50:36 INFO mapred.JobClient: Spilled Records=40 14/04/20 10:50:36 INFO mapred.JobClient: Map output bytes=180 14/04/20 10:50:36 INFO mapred.JobClient: Total committed heap usage (bytes)=2037886976 14/04/20 10:50:36 INFO mapred.JobClient: CPU time spent (ms)=13510 14/04/20 10:50:36 INFO mapred.JobClient: Map input bytes=240 14/04/20 10:50:36 INFO mapred.JobClient: SPLIT_RAW_BYTES=1240 14/04/20 10:50:36 INFO mapred.JobClient: Combine input records=0 14/04/20 10:50:36 INFO mapred.JobClient: Reduce input records=20 14/04/20 10:50:36 INFO mapred.JobClient: Reduce input groups=20 14/04/20 10:50:36 INFO mapred.JobClient: Combine output records=0 14/04/20 10:50:36 INFO mapred.JobClient: Physical memory (bytes) snapshot=1615872000 14/04/20 10:50:36 INFO mapred.JobClient: Reduce output 
records=0 14/04/20 10:50:36 INFO mapred.JobClient: Virtual memory (bytes) snapshot=10645270528 14/04/20 10:50:36 INFO mapred.JobClient: Map output records=20 Job Finished in 141.28 seconds Estimated value of Pi is 3.14800000000000000000
running word count example
copying all of the XML files from the conf folder in hadoop to the hdfs
hduser@ubuntu:/usr/local/hadoop-1.2.1$ bin/hadoop dfs -mkdir /user/hduser/input
Warning: $HADOOP_HOME is deprecated.
hduser@ubuntu:/usr/local/hadoop-1.2.1$ bin/hadoop dfs -copyFromLocal conf/*.xml /user/hduser/input
Warning: $HADOOP_HOME is deprecated.
hduser@ubuntu:/usr/local/hadoop-1.2.1$ bin/hadoop dfs -ls /user/hduser/input
Warning: $HADOOP_HOME is deprecated.
Found 7 items
-rw-r--r--   1 hduser supergroup       7457 2014-04-20 11:08 /user/hduser/input/capacity-scheduler.xml
-rw-r--r--   1 hduser supergroup        767 2014-04-20 11:08 /user/hduser/input/core-site.xml
-rw-r--r--   1 hduser supergroup        327 2014-04-20 11:08 /user/hduser/input/fair-scheduler.xml
-rw-r--r--   1 hduser supergroup       4644 2014-04-20 11:08 /user/hduser/input/hadoop-policy.xml
-rw-r--r--   1 hduser supergroup        458 2014-04-20 11:08 /user/hduser/input/hdfs-site.xml
-rw-r--r--   1 hduser supergroup       2033 2014-04-20 11:08 /user/hduser/input/mapred-queue-acls.xml
-rw-r--r--   1 hduser supergroup        436 2014-04-20 11:08 /user/hduser/input/mapred-site.xml
run the wordcount example to output the results to an output directory within the hdfs
hduser@ubuntu:/usr/local/hadoop-1.2.1$ bin/hadoop jar hadoop-examples-1.2.1.jar wordcount /user/hduser/input /user/hduser/wc-output
Warning: $HADOOP_HOME is deprecated.
14/04/20 11:13:59 INFO input.FileInputFormat: Total input paths to process : 7 14/04/20 11:13:59 INFO util.NativeCodeLoader: Loaded the native-hadoop library 14/04/20 11:13:59 WARN snappy.LoadSnappy: Snappy native library not loaded 14/04/20 11:14:01 INFO mapred.JobClient: Running job: job_201404200112_0004 14/04/20 11:14:02 INFO mapred.JobClient: map 0% reduce 0% 14/04/20 11:14:32 INFO mapred.JobClient: map 28% reduce 0% 14/04/20 11:15:06 INFO mapred.JobClient: map 57% reduce 0% 14/04/20 11:15:10 INFO mapred.JobClient: map 57% reduce 9% 14/04/20 11:15:16 INFO mapred.JobClient: map 57% reduce 19% 14/04/20 11:15:26 INFO mapred.JobClient: map 85% reduce 19% 14/04/20 11:15:34 INFO mapred.JobClient: map 85% reduce 28% 14/04/20 11:15:37 INFO mapred.JobClient: map 100% reduce 28% 14/04/20 11:15:43 INFO mapred.JobClient: map 100% reduce 33% 14/04/20 11:15:47 INFO mapred.JobClient: map 100% reduce 100% 14/04/20 11:15:54 INFO mapred.JobClient: Job complete: job_201404200112_0004 14/04/20 11:15:57 INFO mapred.JobClient: Counters: 29 14/04/20 11:15:57 INFO mapred.JobClient: Job Counters 14/04/20 11:15:57 INFO mapred.JobClient: Launched reduce tasks=1 14/04/20 11:15:57 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=158499 14/04/20 11:15:57 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 14/04/20 11:15:57 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 14/04/20 11:15:57 INFO mapred.JobClient: Launched map tasks=7 14/04/20 11:15:57 INFO mapred.JobClient: Data-local map tasks=7 14/04/20 11:15:57 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=78409 14/04/20 11:15:57 INFO mapred.JobClient: File Output Format Counters 14/04/20 11:15:57 INFO mapred.JobClient: Bytes Written=7001 14/04/20 11:15:57 INFO mapred.JobClient: FileSystemCounters 14/04/20 11:15:57 INFO mapred.JobClient: FILE_BYTES_READ=11713 14/04/20 11:15:57 INFO mapred.JobClient: HDFS_BYTES_READ=16983 14/04/20 11:15:57 INFO mapred.JobClient: FILE_BYTES_WRITTEN=463056 14/04/20 11:15:57 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=7001 14/04/20 11:15:57 INFO mapred.JobClient: File Input Format Counters 14/04/20 11:15:57 INFO mapred.JobClient: Bytes Read=16122 14/04/20 11:15:57 INFO mapred.JobClient: Map-Reduce Framework 14/04/20 11:15:57 INFO mapred.JobClient: Map output materialized bytes=11749 14/04/20 11:15:57 INFO mapred.JobClient: Map input records=397 14/04/20 11:15:57 INFO mapred.JobClient: Reduce shuffle bytes=11749 14/04/20 11:15:57 INFO mapred.JobClient: Spilled Records=1366 14/04/20 11:15:57 INFO mapred.JobClient: Map output bytes=22382 14/04/20 11:15:57 INFO mapred.JobClient: Total committed heap usage (bytes)=1428643840 14/04/20 11:15:57 INFO mapred.JobClient: CPU time spent (ms)=9760 14/04/20 11:15:57 INFO mapred.JobClient: Combine input records=1862 14/04/20 11:15:57 INFO mapred.JobClient: SPLIT_RAW_BYTES=861 14/04/20 11:15:57 INFO mapred.JobClient: Reduce input records=683 14/04/20 11:15:57 INFO mapred.JobClient: Reduce input groups=468 14/04/20 11:15:57 INFO mapred.JobClient: Combine output records=683 14/04/20 11:15:57 INFO mapred.JobClient: Physical memory (bytes) snapshot=1108615168 14/04/20 11:15:57 INFO mapred.JobClient: Reduce output records=468 14/04/20 11:15:57 INFO mapred.JobClient: Virtual memory (bytes) snapshot=7745290240 14/04/20 11:15:57 INFO mapred.JobClient: Map output records=1862 hduser@ubuntu:/usr/local/hadoop-1.2.1$ bin/hadoop dfs -ls /user/hduser Warning: $HADOOP_HOME is deprecated.
Found 2 items
drwxr-xr-x   - hduser supergroup          0 2014-04-20 11:08 /user/hduser/input
drwxr-xr-x   - hduser supergroup          0 2014-04-20 11:15 /user/hduser/wc-output
Viewing the results
reading straight from the hdfs
hduser@ubuntu:/usr/local/hadoop-1.2.1$ bin/hadoop dfs -cat /user/hduser/wc-output/part-r-00000
(output omitted here; see the head of the merged output on the local machine below)
using hadoop's getmerge you can also merge the output files from HDFS onto the local machine
hduser@ubuntu:/usr/local/hadoop-1.2.1$ bin/hadoop dfs -getmerge /user/hduser/wc-output ~/hdfs-output/
Warning: $HADOOP_HOME is deprecated.
14/04/20 11:23:34 INFO util.NativeCodeLoader: Loaded the native-hadoop library
hduser@ubuntu:/usr/local/hadoop-1.2.1$ head ~/hdfs-output/wc-output
"*"                   10
"alice,bob            10
"local",              1
'                     2
'(i.e.                2
'*',                  2
'default'             2
(fs.SCHEME.impl)      1
(maximum-system-jobs  2
*                     2
Setting up for multi-node
for the sake of simplicity, I took the already working single node, cloned it three times, and modified the IP address and hostname of all four machines to match the information provided earlier.
Setting up passwordless SSH for the master to reach the slave nodes
From the master, run the following commands to set up passwordless SSH. The hduser login has to match on all slaves; I also included hd-master so that the master is not prompted for a password when starting its own daemons.
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hd-master
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hd-slave1
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hd-slave2
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hd-slave3
each command prompts for an SSH login and then copies the RSA public key to that machine, so future logins are passwordless
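A quick way to confirm that key-based login works on every node is to loop over the hostnames and run a trivial remote command; a minimal sketch, run as hduser on hd-master:

for host in hd-master hd-slave1 hd-slave2 hd-slave3; do
    ssh $host hostname   # should print each hostname without asking for a password
done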
Modify the master and slave node configuration files
Modify the masters and slaves files in the hadoop conf/ folder to designate master and slave nodes
on hd-master:
hduser@hd-master:/usr/local/hadoop-1.2.1$ vi conf/masters

change the file contents from "localhost" to "hd-master"
on hd-master and hd-slaves(1-3)
hduser@hd-slave1:/usr/local/hadoop-1.2.1$ vi conf/slaves

change the file contents from "localhost" to:
hd-master   (on the master only)
hd-slave1
hd-slave2
hd-slave3
Modify the XML files in conf/
edit the XML files in conf/ on all nodes
edit conf/core-site.xml
change the line
  <value>hdfs://localhost:54310</value>
to:
  <value>hdfs://hd-master:54310</value>
edit conf/mapred-site.xml
change the line
  <value>localhost:54311</value>
to:
  <value>hd-master:54311</value>
add the following to this file as well:
<property>
  <name>mapred.local.dir</name>
  <value>${hadoop.tmp.dir}/mapred/local</value>
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>20</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>2</value>
</property>
edit conf/hdfs-site.xml
change the line
  <value>1</value>
to:
  <value>4</value>
Formatting the Hadoop File System
with no hadoop processes running, run the following command from the hadoop directory on hd-master:
bin/hadoop namenode -format
Starting the multi-node cluster
run bin/start-dfs.sh from the hadoop directory on the master:
(if the datanodes shut down shortly after starting, check the logs; I had to delete the data folder from /app/hadoop/tmp because the namespace IDs did not match, see the cleanup sketch below)
you should get the following output:
starting namenode, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-namenode-hd-master.out
hd-slave3: starting datanode, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-datanode-hd-slave3.out
hd-master: starting datanode, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-datanode-hd-master.out
hd-slave2: starting datanode, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-datanode-hd-slave2.out
hd-slave1: starting datanode, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-datanode-hd-slave1.out
hd-master: starting secondarynamenode, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-secondarynamenode-hd-master.out
run jps to see the following processes running in java
jps
8723 NameNode
9213 SecondaryNameNode
9630 Jps
the slaves should only report the DataNode entry (plus Jps)
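For reference, the cleanup I used for the namespace ID mismatch mentioned above was roughly the following, run on each affected node with the daemons stopped (assumption: the DataNode keeps its data under the hadoop.tmp.dir configured earlier; this deletes that node's HDFS blocks):

rm -rf /app/hadoop/tmp/dfs/data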
run bin/start-mapred.sh on the master to start the MapReduce daemons
you should get the following output:
starting jobtracker, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-jobtracker-hd-master.out
hd-slave2: starting tasktracker, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-tasktracker-hd-slave2.out
hd-slave1: starting tasktracker, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-tasktracker-hd-slave1.out
hd-slave3: starting tasktracker, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-tasktracker-hd-slave3.out
hd-master: starting tasktracker, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-tasktracker-hd-master.out
check jps
jps
8723 NameNode
9213 SecondaryNameNode
9334 JobTracker
9630 Jps
9584 TaskTracker
the slaves should now also show a TaskTracker process alongside the DataNode
Run a MapReduce job to test the new cluster setup
From the master node, I downloaded the example etexts provided for the wordcount example onto the local machine under /tmp/clustertest. From the hadoop directory, run:
bin/hadoop dfs -copyFromLocal /tmp/clustertest /user/hduser/
Then run the wordcount example against the uploaded data; it will give a rather long output:
hduser@hd-master:/usr/local/hadoop-1.2.1$ bin/hadoop jar hadoop-examples-1.2.1.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
Warning: $HADOOP_HOME is deprecated.
14/04/20 16:43:46 INFO input.FileInputFormat: Total input paths to process : 7 14/04/20 16:43:46 INFO util.NativeCodeLoader: Loaded the native-hadoop library 14/04/20 16:43:46 WARN snappy.LoadSnappy: Snappy native library not loaded 14/04/20 16:43:48 INFO mapred.JobClient: Running job: job_201404201608_0001 14/04/20 16:43:49 INFO mapred.JobClient: map 0% reduce 0% 14/04/20 16:44:12 INFO mapred.JobClient: map 14% reduce 0% 14/04/20 16:44:40 INFO mapred.JobClient: map 42% reduce 1% 14/04/20 16:45:05 INFO mapred.JobClient: map 85% reduce 1% 14/04/20 16:45:13 INFO mapred.JobClient: map 100% reduce 1% 14/04/20 16:45:18 INFO mapred.JobClient: map 100% reduce 3% 14/04/20 16:45:20 INFO mapred.JobClient: map 100% reduce 5% 14/04/20 16:45:29 INFO mapred.JobClient: map 100% reduce 16% 14/04/20 16:45:32 INFO mapred.JobClient: map 100% reduce 17% 14/04/20 16:45:33 INFO mapred.JobClient: map 100% reduce 35% 14/04/20 16:45:36 INFO mapred.JobClient: map 100% reduce 50% 14/04/20 16:45:41 INFO mapred.JobClient: map 100% reduce 67% 14/04/20 16:45:42 INFO mapred.JobClient: map 100% reduce 75% 14/04/20 16:45:46 INFO mapred.JobClient: map 100% reduce 92% 14/04/20 16:45:49 INFO mapred.JobClient: map 100% reduce 100% 14/04/20 16:45:54 INFO mapred.JobClient: Job complete: job_201404201608_0001 14/04/20 16:45:54 INFO mapred.JobClient: Counters: 30 14/04/20 16:45:54 INFO mapred.JobClient: Job Counters 14/04/20 16:45:54 INFO mapred.JobClient: Launched reduce tasks=6 14/04/20 16:45:54 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=481856 14/04/20 16:45:54 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 14/04/20 16:45:54 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 14/04/20 16:45:54 INFO mapred.JobClient: Rack-local map tasks=2 14/04/20 16:45:54 INFO mapred.JobClient: Launched map tasks=11 14/04/20 16:45:54 INFO mapred.JobClient: Data-local map tasks=9 14/04/20 16:45:54 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=359838 14/04/20 16:45:54 INFO mapred.JobClient: File Output Format Counters 14/04/20 16:45:54 INFO mapred.JobClient: Bytes Written=1412513 14/04/20 16:45:54 INFO mapred.JobClient: FileSystemCounters 14/04/20 16:45:54 INFO mapred.JobClient: FILE_BYTES_READ=4481581 14/04/20 16:45:54 INFO mapred.JobClient: HDFS_BYTES_READ=6950874 14/04/20 16:45:54 INFO mapred.JobClient: FILE_BYTES_WRITTEN=7982746 14/04/20 16:45:54 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1412513 14/04/20 16:45:54 INFO mapred.JobClient: File Input Format Counters 14/04/20 16:45:54 INFO mapred.JobClient: Bytes Read=6950006 14/04/20 16:45:54 INFO mapred.JobClient: Map-Reduce Framework 14/04/20 16:45:54 INFO mapred.JobClient: Map output materialized bytes=2915221 14/04/20 16:45:54 INFO mapred.JobClient: Map input records=137146 14/04/20 16:45:54 INFO mapred.JobClient: Reduce shuffle bytes=2915221 14/04/20 16:45:54 INFO mapred.JobClient: Spilled Records=507862 14/04/20 16:45:54 INFO mapred.JobClient: Map output bytes=11435858 14/04/20 16:45:54 INFO mapred.JobClient: Total committed heap usage (bytes)=1466208256 14/04/20 16:45:54 INFO mapred.JobClient: CPU time spent (ms)=59870 14/04/20 16:45:54 INFO mapred.JobClient: Combine input records=1174992 14/04/20 16:45:54 INFO mapred.JobClient: SPLIT_RAW_BYTES=868 14/04/20 16:45:54 INFO mapred.JobClient: Reduce input records=201012 14/04/20 16:45:54 INFO mapred.JobClient: Reduce input groups=128514 14/04/20 16:45:54 INFO mapred.JobClient: Combine output records=201012 14/04/20 16:45:54 INFO mapred.JobClient: Physical memory 
(bytes) snapshot=1541390336 14/04/20 16:45:54 INFO mapred.JobClient: Reduce output records=128514 14/04/20 16:45:54 INFO mapred.JobClient: Virtual memory (bytes) snapshot=10689261568 14/04/20 16:45:54 INFO mapred.JobClient: Map output records=1174992
now we can run a getmerge to bring the output to the local file system
bin/hadoop dfs -getmerge /user/hduser/gutenberg-output /tmp/
14/04/20 16:50:55 INFO util.NativeCodeLoader: Loaded the native-hadoop library
and view a sample of the output:
hduser@hd-master:/usr/local/hadoop-1.2.1$ head /tmp/gutenberg-output
"'Ample.'       1
"'As            1
"'Because       1
"'Certainly,'   1
"'DEAR          1
"'Dear          2
"'Fritz!        1
"'From          1
"'Is            3
"'Ku            1
all of the namenode data and job tracking status can be viewed from the following web interfaces:
http://hd-master:50070
http://hd-master:50030
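The same information is also available from the command line; for example, the following should list all four datanodes once they have joined the cluster (this check is not part of the original walkthrough):

bin/hadoop dfsadmin -report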
and that's it. To stop the cluster, you simply run the following shell scripts on the master:
bin/stop-dfs.sh
bin/stop-mapred.sh
References used
http://www.michael-noll.com
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster
http://www.javacodegeeks.com
http://www.javacodegeeks.com/2013/06/setting-up-apache-hadoop-multi-node-cluster.html