Distributed Computing with MapReduce
Lecture 2 of NoSQL Databases (PA195)
David Novak, FI, Masaryk University, Brno
Agenda
Big Data Processing
=> the model of “moving the computation to the data”
Big Data Processing II
Computing cluster architecture:
[diagram: racks of compute nodes interconnected by switches]
source: J. Leskovec, A. Rajaraman, and J. D. Ullman, Mining of Massive Datasets. 2014.
Agenda
MapReduce: Origins
Google Solution
Google File System (GFS)
GFS: Schema
source: http://dl.acm.org/citation.cfm?id=945450
MapReduce (1)
Map
[diagram: input data → map function → output data; color indicates key]
Grouping Phase
[diagram: intermediate output grouped by key in the shuffle (grouping) phase; color indicates key]
Reduce Phase
[diagram: full pipeline — input data → map function → intermediate output → shuffle (grouping) phase → reduce function → output data; color indicates key]
Example: Word Count
Task: Calculate word frequency in a set of documents
map(String key, Text value):
    // key: document name (ignored)
    // value: content of document (words)
    foreach word w in value:
        emitIntermediate(w, 1);

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    foreach v in values:
        result += v;
    emit(key, result);
Example: Word Count (2)
source: http://www.cs.uml.edu/~jlu1/doc/source/report/MapReduce.html
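For illustration (the figure referenced above is not reproduced here), a possible trace on a toy input; the document names and contents are hypothetical:

Input: ("doc1", "the quick fox"), ("doc2", "the fox")
After Map: ("the", 1), ("quick", 1), ("fox", 1), ("the", 1), ("fox", 1)
After shuffle: ("fox", [1, 1]), ("quick", [1]), ("the", [1, 1])
After Reduce: ("fox", 2), ("quick", 1), ("the", 2)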
MapReduce: Combiner
Example: Word Count, Combiner
Task: Calculate word frequency in a set of documents
combine(String key, Iterator values):
    // key: a word
    // values: a list of local counts
    int result = 0;
    foreach v in values:
        result += v;
    emit(key, result);
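The combiner runs on each mapper's local output before the shuffle, so the function it computes must be associative and commutative; summing counts is both. In Hadoop's Java API, a combiner is registered on the Job object; a minimal, hypothetical driver fragment (Reduce is the reducer class shown later in this lecture):

// Reuse the reducer as a combiner: valid here only because
// summing is associative and commutative.
job.setCombinerClass(Reduce.class);
job.setReducerClass(Reduce.class);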
Example: Word Count with Combiner
source: http://www.admin-magazine.com/HPC/Articles/MapReduce-and-Hadoop
MapReduce Framework
MapReduce Framework (2)
source: Dean, J. & Ghemawat, S. (2004). MapReduce: Simplified Data Processing on Large Clusters
MapReduce Framework: Details
MapReduce Framework: Details (2)
MapReduce: Example II
Task: Calculate graph of web links
map(String url, Text html):
    // url: web page URL
    // html: HTML text of the page (linearized HTML tags)
    foreach tag t in html:
        if t is <a> then:
            emitIntermediate(t.href, url);

reduce(String key, Iterator values):
    // key: a target URL
    // values: a list of source URLs
    emit(key, values);
Example II: Result
Input: (page_URL, HTML_code)
("http://cnn.com", "<html>...<a href="http://cnn.com">link</a>...</html>")
("http://ihned.cz", "<html>...<a href="http://cnn.com">link</a>...</html>")
("http://idnes.cz",
"<html>...<a href="http://cnn.com">x</a>... � <a href="http://ihned.cz">y</a>...<a href="http://idnes.cz">z</a>� </html>")
Intermediate output after Map phase:
("http://cnn.com", "http://cnn.com")
("http://cnn.com", "http://ihned.cz")
("http://cnn.com", "http://idnes.cz")
("http://ihned.cz", "http://idnes.cz")
("http://idnes.cz", "http://idnes.cz")
Intermediate result after the shuffle phase (identical to the final output, since reduce simply emits its input):
("http://cnn.com", ["http://cnn.com", "http://ihned.cz", "http://idnes.cz"] )
("http://ihned.cz", [ "http://idnes.cz" ])
("http://idnes.cz", [ "http://idnes.cz" ])
MapReduce: Example III
Task: Count how many words of each length occur in the input text
map(String key, Text value):
    // key: document name (ignored)
    // value: content of document (words)
    foreach word w in value:
        emitIntermediate(length(w), 1);

reduce(Integer key, Iterator values):
    // key: a word length
    // values: a list of counts
    int result = 0;
    foreach v in values:
        result += v;
    emit(key, result);
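A possible trace on a hypothetical one-document input:

Input: ("doc1", "hello world hi")
After Map: (5, 1), (5, 1), (2, 1)
After shuffle: (2, [1]), (5, [1, 1])
After Reduce: (2, 1), (5, 2)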
MapReduce: Features
Applicability of MapReduce
Agenda
Apache Hadoop
web: http://hadoop.apache.org/
Hadoop: Modules
source: https://goo.gl/NPuuJr
HDFS (Hadoop Distributed File System)
HDFS: Data Characteristics
HDFS: Basic Components
HDFS: Schema
HDFS: NameNode
HDFS: DataNode
HDFS: Blocks & Replication
HDFS: Block Replication
HDFS: Reliability
Hadoop: MapReduce
Hadoop HDFS + MapReduce
source: http://bigdata.black/architecture/hadoop/what-is-hadoop/
Hadoop MapReduce: Schema
Hadoop MR: WordCount Example (1)
public class Map
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String string = value.toString();  // one line of the input split
        StringTokenizer tokenizer = new StringTokenizer(string);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);      // emit (word, 1)
        }
    }
}
Hadoop MR: WordCount Example (2)
public class Reduce
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();              // sum all partial counts for this word
        }
        context.write(key, new IntWritable(sum));
    }
}
source: http://www.dineshonjava.com/2014/11/hadoop-architecture.html#.WLU6aBLyso8
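To make the example complete, a minimal driver that wires the Map and Reduce classes together; the WordCount class name and the command-line argument handling are illustrative, not part of the original slides:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(Map.class);       // the Mapper from example (1)
        job.setCombinerClass(Reduce.class);  // optional local pre-aggregation
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir (must not exist yet)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}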
Hadoop: Related Projects
Agenda
Apache Spark
homepage: http://spark.apache.org/
Apache Spark: Example
val textFile = sc.textFile("hdfs://...")               // RDD of input lines
val counts = textFile.flatMap(line => line.split(" ")) // split each line into words
                     .map(word => (word, 1))           // pair each word with count 1
                     .reduceByKey(_ + _)               // sum the counts per word
counts.saveAsTextFile("hdfs://...")  // action: triggers the lazy pipeline and writes the result
References