CCAH Exam Notes

Disclaimer: This is a formatted version of notes at Link

Hadoop Distributed Filesystem (HDFS)

Goals for HDFS

HDFS Design

Daemons that make up HDFS

NameNode

Secondary NameNode

Datanode

Process of reading a file

Process of writing a file

NameNode High Availability

NameNode Federation

Clients

Commands

MapReduce

MapReduce v1

High Level overview of a MapReduce v1

MapReduce v2 (YARN)

Planning a cluster

Distribution and software version

Hardware

Master nodes (JobTracker, NameNode, SecondaryNameNode)

NameNode considerations

Worker/datanodes

Cluster sizing

OS selection, prep, layout

Hostnames

Users, Groups, Privs

Tuning and conf changes

Network usage and design

Traditional Tree Topology

Spine Fabric

Install and Config

Install

Config

core-site.xml important properties

hdfs-site.xml important properties

mapred-site.xml important properties

Logging

Monitoring cluster

Quick HTTP Ports Reference

Quotas and Job Schedulers

Quota

Schedulers

FIFO (First In First Out) Scheduler

Fair Scheduler

Capacity Scheduler (CS)

Monitoring and Managing

Ecosystem

Apache Hive

Pig

Zookeeper

Sqoop

Other Important Projects

Hadoop Distributed Filesystem (HDFS)

Built to support high throughput, streaming reads and writes of extremely large files.

NAS and SAN systems offer centralized, low-latency access to either a block device or a filesystem on the order of terabytes in size. They do not scale to meet the need of thousands of machines pulling hundreds of gigabytes of content at the same time.

Goals for HDFS

HDFS Design

Daemons that make up HDFS

There are three daemons that make up a standard HDFS cluster.

NameNode

# Check whether the NameNode is in safe mode

hadoop dfsadmin -safemode get

Secondary NameNode

Datanode

Process of reading a file

Reading a file in HDFS called /foo/bar.txt (a minimal client sketch follows the steps below).

  1. The client uses a Hadoop client program to make the request.
  2. The client program reads the cluster config file on the local machine, which tells it where the NameNode is located. This has to be configured ahead of time.
  3. The client contacts the NameNode and requests the file it would like to read.
  4. Client identity is checked, either by username or by a strong authentication mechanism such as Kerberos.
  5. Once the client is validated, the request is checked against the owner and permissions of the file.
  6. If the file exists and the user has access to it, the NameNode responds with the first block id and provides a list of datanodes on which a copy of the block can be found, sorted by their distance to the client (reader).
  7. The client now contacts the most appropriate datanode directly and reads the block data it needs. This process repeats until all blocks in the file have been read or the client closes the file stream.
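
A minimal read sketch in Java using the org.apache.hadoop.fs.FileSystem API (the class name is illustrative, and it assumes a core-site.xml on the classpath that points at the NameNode). The client code never touches blocks or datanodes directly; the library performs the NameNode lookup and datanode reads described above.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFoo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // loads core-site.xml (fs.default.name)
        FileSystem fs = FileSystem.get(conf);              // asks the NameNode for metadata
        try (FSDataInputStream in = fs.open(new Path("/foo/bar.txt"))) {
            BufferedReader reader = new BufferedReader(new InputStreamReader(in));
            String line;
            while ((line = reader.readLine()) != null) {   // block-by-block datanode reads happen under the hood
                System.out.println(line);
            }
        }
    }
}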

Process of writing a file

Writing a new file to HDFS called /foo/babar.txt (a minimal client sketch follows the steps below).

  1. The client uses a Hadoop client program to make the request.
  2. The client program reads the cluster config file on the local machine, which tells it where the NameNode is located. This has to be configured ahead of time.
  3. A request is sent to the NameNode to create the file metadata.
  4. Client identity is checked, either by username or by an authentication mechanism such as Kerberos.
  5. If the user has the necessary permissions, the metadata entry for the new file is made. However, it initially has no associated blocks.
  6. The NameNode responds to the client, indicating the open request was successful and that it may now begin writing data.
  7. The client starts breaking the file up into pieces (packets, not TCP packets), queues them in memory and starts a data stream from this queue.
  8. The client contacts the NameNode requesting a set of datanodes to which replicas of the next block should be written.
  9. The NameNode responds, and the client's data packets are then streamed to the first datanode, which writes the data to disk and forwards it to the next datanode, which writes to its disk, and so on. This is called a replication pipeline. Each datanode in the replication pipeline acknowledges each packet as it is successfully written.
  10. The client application maintains a list of packets for which acknowledgments have not yet been received; when it receives a response, it knows the data has been written to all nodes in the pipeline. This process of writing packets to the pipeline continues until the block size is reached, at which point the client goes back to the NameNode for the next set of datanodes to write to.
  11. Eventually, the client indicates it is finished sending data by closing the stream, which flushes any remaining packets out to disk and updates the NameNode to indicate the file is now complete.
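
A minimal write sketch in Java using the same FileSystem API (again illustrative, assuming a configured client). fs.create() corresponds to the "create file metadata" step at the NameNode; the output stream handles the packet queue and replication pipeline; close() flushes the last packets and finalizes the file with the NameNode.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteBabar {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(new Path("/foo/babar.txt"))) {
            out.write("hello hdfs\n".getBytes("UTF-8"));   // queued as packets and streamed to the pipeline
        }   // close() flushes remaining packets and tells the NameNode the file is complete
    }
}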

NameNode High Availability

NameNode Federation

Clients

Commands

# Display basic usage information

hadoop fs  

# List files in a dir. Uses the fs.default.name value in the core-site.xml file if the full URL syntax is not used.

hadoop fs -ls /user/dude

or

hadoop fs -ls hdfs://NameNode.blah.com:8020/home/dude

# Upload a file with -put or -copyFromLocal, which copies the file from the local filesystem

hadoop fs -put /etc/resolv.conf /user/dude/

# Download a file from HDFS using -get or -copyToLocal.

hadoop fs -get /user/dude/resolv.conf ./

# Set a replication factor for a file, or for a directory of files with the -R flag

hadoop fs -setrep 5 -R /user/dude/rep5/

# Run fsck on the files we set the replication factor on and check that it looks correct

hadoop fsck /user/dude/rep5 -files -blocks -locations

MapReduce

MapReduce v1

High Level overview of a MapReduce v1

  1. User creates a job and submits it to the jobtracker process.
  2. A map task is created that runs the user-supplied map function on each record. The map function takes a key-value pair as input and produces zero or more intermediate key-value pairs. Each intermediate key is assigned to a partition (one partition per reducer), for example:

partition 1, dog => poodle  

partition 2, cat => bengal

partition 3, bird => hawk

partition 1, dog => pug

partition 3, bird => duck  

  3. The next step is the shuffle and sort. This is performed by the reducers (reduce tasks). Each reducer is assigned one of the partitions on which it should work. This is a flurry of network copies between each reducer in the cluster so it can get the partition (intermediate key-value) data it was assigned to work on (see the partitioner sketch after this list).
  4. After the partition data has been copied, the reducer can start performing a merge sort of the data. A merge sort takes a number of sorted items and merges them together to form a fully sorted list. Each reducer produces a separate output file, usually in HDFS. Each reducer output file is usually named part-nnnnn, where nnnnn is the number of the reduce task within the job. The output format of the file is specified by the author of the MapReduce job.
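
Which partition an intermediate key lands in is decided by the job's partitioner; the stock behaviour is essentially a hash of the key modulo the number of reduce tasks. A sketch of that idea is below (an illustration, not the shipped Hadoop source). Because the partition depends only on the key, every record for "dog" in the example above ends up at the same reducer.

public class SimpleHashPartitioner<K, V> {
    // Returns the partition (0 .. numReduceTasks-1) for an intermediate key-value pair.
    public int getPartition(K key, V value, int numReduceTasks) {
        // mask the sign bit so the result is never negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}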

MapReduce v2 (YARN)

Planning a cluster

Distribution and software version

Hardware

Master nodes (JobTracker, NameNode, SecondaryNameNode)

NameNode considerations

Worker/datanodes

Cluster sizing

Sample growth plan

Item                          Value      Notes
----------------------------  ---------  ----------------------------------------------------
Average daily ingest rate     2 TB
Replication factor            3
Daily raw consumption         6 TB       Ingest x replication
Node raw storage              24 TB      12 x 2 TB SATA II HDD
MapReduce temp space reserve  25%        For intermediate MapReduce data
Node-usable raw storage       18 TB      Node raw storage - MapReduce reserve
1 year (flat growth)          122 nodes  Ingest x replication x 365 / node-usable raw storage
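
The arithmetic behind the table, as a throwaway Java sketch (the numbers are the assumptions from the rows above; the class name is illustrative):

public class ClusterSizing {
    public static void main(String[] args) {
        double dailyIngestTB = 2.0;              // average daily ingest rate
        int    replication   = 3;
        double nodeRawTB     = 12 * 2.0;         // 12 x 2 TB SATA II HDD = 24 TB
        double mapredReserve = 0.25;             // 25% reserved for intermediate MapReduce data

        double dailyRawTB   = dailyIngestTB * replication;       // 6 TB/day
        double nodeUsableTB = nodeRawTB * (1 - mapredReserve);   // 18 TB
        long   nodesYearOne = (long) Math.ceil(dailyRawTB * 365 / nodeUsableTB);
        System.out.println(nodesYearOne + " nodes after 1 year of flat growth");   // 122
    }
}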

OS selection, prep, layout

Hostnames

Users, Groups, Privs

Tuning and conf changes

# /etc/security/limits.conf - raise the open file limit for the Hadoop service accounts

hdfs - nofile 32768

mapred - nofile 32768

hbase - nofile 32768

# sysctl settings (e.g. /etc/sysctl.conf) - kernel VM tuning for Hadoop workers

vm.swappiness=0

vm.overcommit_memory=1

Network usage and design

Traditional Tree Topology

Spine Fabric

Install and Config

Install

Config

core-site.xml important properties

hdfs-site.xml important properties

mapred-site.xml important properties

Logging

Monitoring cluster

- URL of the NameNode status page http://NameNode.blah.com:50070/dfshealth.jsp

  - Html page looks like:

  NameNode localhost.localdomain:8020 (active)

  Started: Sat Sep 13 4:4:44 EST 2013

  Version: 2.0.0-cdh4.3.0, 234929d9s9839s98sd7f9d7s9g

  Compiled: Mon May 3 3:3:33 by blah from Unknown

  Upgrades: There are no upgrades in progress

  Cluster ID: cluster11

  Blockpool ID: BP-23434343333-127.0.0.1-34322343ll3434

  [Browse the filesystem]

  [NameNode Logs]

  --------------------------------------------------------------

  Cluster Summary

  Security is OFF

  124 files and directories, 91 blocks = 215 total.

  Heap memory used 41.31 MB is 50% of Committed Heap Memory 81.06 MB. Max Heap Memory is 171.94 MB.

  Non Heap Memory used 47.55 MB is 70% of Committed Non Heap Memory 67.05 MB. Max Non Heap Memory is 130MB

  Configured Capacity : 11.07 GB

  DFS Used            : 58.96 MB

  Non DFS Used        : 2.95 GB

  DFS Remaining       : 8.06 GB

  DFS Used%           : 0.52%

  DFS Remaining%      : 72.82%

  Block Pool Used     : 58.96 MB

  Block Pool Used%    : 0.52%

  DataNodes Usages    : Min %    Median %    Max %    stdev %
                        0.52%    0.52%       0.52%    0.00%

  [Live Nodes]                      : 1 (Decommissioned: 0)

  [Dead Nodes]                      : 0 (Decommissioned: 0)

  [Decommissioning Nodes]           : 0

  Number of under-replicated blocks : 91

  NameNode Journal Status:

  Current transaction ID: 894

  ---------------------------------   --------------------------------------------

  Journal Manager                     State

  --------------------------------    ---------------------------------------------

  FileJournalManager(root=/dfs/nn)       EditLogFileOutputStream(/dfs/nn/current/edits_inprogress_00000000844)

  NameNode Storage:

  -------------------  -----            -----------

  Storage Directory    Type             State

  ------------------   ----             -------------

  /dfs/nn              IMAGE_AND_EDITS  Active

- URL of the aggregate cluster view that shows all NameNodes that make up the cluster.

  This is the federated NameNode cluster view.

  http://NameNode.blah.com:50070/dfsclusterhealth.jsp

  - Html page looks like:

    Cluster 'cluster11'

   

    Cluster Summary

    Total files and directories 124

    Configured Capacity           : 11.07GB

    DFS Used                      : 58.96 MB

    Non DFS Used                  : 2.95GB

    DFS Remaining                 : 8.06GB

    DFS Used%                     : 0.52%

    DFS Remaining%                : 72.82%

   -----------------------------------------

   NameNodes

   Number of NameNodes : 1

   NameNode               Blockpool  Files and    Blocks  Missing  Live Datanode     Dead Datanode
                          Used       Directories          Blocks   (Decommissioned)  (Decommissioned)
   ---------------------  ---------  -----------  ------  -------  ----------------  ----------------
   localhost.localdomain  58.96 MB   124          91      0        1 (0)             0 (0)

- URL of the jobtracker view http://jobtracker.blah.com:50030/jobtracker.jsp

  - Html page looks like:

  localhost Hadoop Map/Reduce Administration

  State: RUNNING

  Started: Sat Sep 14 12:22:33 PDT 2013

  Version: 2.0.0-mr1-cdh4.3.0, Unknown

  Compiled: Mon May 14 12:22:33 by blah from Unknown

  Identifier: 202023003020

  Cluster Summary (Heap Size is 81.06MB/171.94 MB)

 

  Running  Running  Total        Nodes  Occupied  Occupied  Reserved  Map       Reduce    Avg.        Blacklisted  Excluded
  Map      Reduce   Submissions         Map       Reduce    Reduce    Task      Task      Tasks/Node  Nodes        Nodes
  Tasks    Tasks                        Slots     Slots     Slots     Capacity  Capacity
  -------  -------  -----------  -----  --------  --------  --------  --------  --------  ----------  -----------  --------
  0        0        0            1      0         0         0         0         0         6.00        0            0

Scheduling Information

Queue Name   State    Scheduling Information

-----------  ------   ----------------------

default      running    N/A

Filter (Jobid, Priority, User, Name) ______________

Running Jobs

------------

None

Retired Jobs

-------------

None

Local Logs

[Log] directory, [Job Tracker History]

- sudo -u hdfs hadoop dfsadmin -report

Quick HTTP Ports Reference

HDFS                    Port     Config Parameter
----------------------  -------  --------------------------------
NameNode                50070    dfs.http.address
Datanodes               50075    dfs.datanode.http.address
Secondary NameNode      50090    dfs.secondary.http.address
Backup/Checkpoint node  50105    dfs.backup.http.address

MapReduce               Port     Config Parameter
----------------------  -------  --------------------------------
JobTracker              50030    mapred.job.tracker.http.address
TaskTrackers            50060    mapred.task.tracker.http.address

Quotas and Job Schedulers

Quota

# Set a space quota (limit on disk space consumed) on a directory

hadoop dfsadmin -setSpaceQuota size path

# Show the quotas and current usage for a directory

hadoop fs -count -q /user/blah

# Set a name quota (limit on the number of file and directory names) on a directory

hadoop dfsadmin -setQuota number path

Schedulers

FIFO (First In First Out) Scheduler

Fair Scheduler

Capacity Scheduler (CS)

Monitoring and Managing

Adding a datanode to HDFS:

  1. Add the host's IP address to the dfs.hosts file if using the host include functionality.
  2. Run hdfs dfsadmin -refreshNodes as the HDFS superuser.
  3. Update any rack awareness files.
  4. Start the datanode process on the datanode.
  5. Check hadoop dfsadmin -report or the web UI for the new node.
  6. Run the balancer to redistribute the blocks.

Adding a tasktracker (MapReduce worker):

  1. Follow the procedure for adding a datanode to HDFS.
  2. Run the balancer utility to distribute existing block data to the new datanode.
  3. Start the tasktracker process.
  4. Confirm that the jobtracker can communicate with the new tasktracker by checking the number of available tasktrackers in its web user interface.

Decommissioning a datanode:

  1. Add the IP of the node to the dfs.hosts.exclude file.
  2. Run hadoop dfsadmin -refreshNodes as the HDFS superuser.
  3. Monitor the NameNode web UI to make sure it starts the decommission process.
  4. Wait (this can take a very long time), watching the NameNode web UI until it says "decommissioned" next to the node's name.
  5. Stop the datanode process.
  6. Remove it from the HDFS include and exclude files as well as any rack topology database.
  7. Run hadoop dfsadmin -refreshNodes so the NameNode sees the removal.

# Check the overall health of HDFS

sudo -u hdfs hadoop fsck /

To decide whether a disk has failed, the datanode checks each of its configured storage directories (dfs.data.dir) to confirm that:

  1. The specified path is a directory.
  2. The directory exists.
  3. The directory is readable.
  4. The directory is writable.

Dealing with a failed disk on a worker node:

  1. Stop any Hadoop-related processes (optionally following the decommissioning process for the datanode).
  2. Replace any failed disks.
  3. Follow the process for adding the node back into the cluster.
  4. Run the Hadoop fsck utility to validate the health of HDFS. Over-replicated blocks are normal immediately after a node is reintroduced to the cluster; they are automatically corrected (excess replicas are deleted) over time.

Ecosystem

Apache Hive

export HIVE_INSTALL=/home/blah/hive-x.y.z-dev

export PATH=$PATH:$HIVE_INSTALL/bin

hive

hive>

Pig

tar xzf pig-x.y.z.tar.gz

export PIG_INSTALL=/home/tom/pig-x.y.z

export PATH=$PATH:$PIG_INSTALL/bin

pig -x mapreduce

Zookeeper

tar xzf zookeeper-x.y.z.tar.gz

export ZOOKEEPER_INSTALL=/home/blah/zookeeper-x.y.z

export PATH=$PATH:$ZOOKEEPER_INSTALL/bin

# Minimal conf/zoo.cfg for a standalone ZooKeeper server
tickTime=2000
dataDir=/Users/tom/zookeeper
clientPort=2181

zkServer.sh start

Sqoop

tar xvzf sqoop-x.y.z.tar.gz

# Set the SQOOP_HOME env var

export SQOOP_HOME=/home/blah/sqoop

$SQOOP_HOME/bin/sqoop

Then run it. The standard location is /usr/bin/.

Other Important Projects