MapReduce and Hadoop
January 17, 2018
Tristan Glatard
FACULTY OF ENGINEERING
AND COMPUTER SCIENCE
Department of Computer Science and Software Engineering
Objectives
[Figure: labels "Slower" and "Larger"]
Introduction
Google’s data analytics
Before MapReduce (< 2004): complex implementations mixing application logic with code for parallelization, data distribution, fault tolerance, and load balancing
MapReduce is: a programming model and an associated implementation for processing and generating large data sets on clusters of commodity machines (Dean & Ghemawat, 2004)
Let’s talk!
About one of the most highly cited papers in the history of computer science
MapReduce: summary
MapReduce Workflows
Multi-step MapReduce programs
Loops: a driver program runs MapReduce jobs (1, 2, ...) repeatedly until a termination condition is met
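A minimal Python sketch of such a driver loop (an in-memory simulation with a made-up job and termination test, not the Hadoop API):

from itertools import groupby
from operator import itemgetter

def run_job(records, map_fn, reduce_fn):
    """Simulate one MapReduce job on a list of (key, value) pairs."""
    intermediate = []
    for k, v in records:
        intermediate.extend(map_fn(k, v))          # map phase
    intermediate.sort(key=itemgetter(0))           # shuffle/sort by key
    output = []
    for k, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reduce_fn(k, [v for _, v in group]))  # reduce phase
    return output

# Illustrative iterative computation: halve every value until all are below 1.
def map_halve(k, v):
    yield (k, v / 2.0)

def reduce_identity(k, values):
    for v in values:
        yield (k, v)

data = [("a", 10.0), ("b", 3.0)]
while True:                                # job 1, job 2, ... until termination
    data = run_job(data, map_halve, reduce_identity)
    if all(v < 1.0 for _, v in data):      # termination condition
        break
print(data)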
Combiners
More WordCount: initial implementation
See Jimmy Lin page 47, algorithms 3.2 and 3.3
Data-Intensive Text Processing with MapReduce. Jimmy Lin and Chris Dyer. Morgan & Claypool Publishers, 2010. http://lintool.github.io/MapReduceAlgorithms/index.html
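A minimal Python sketch of the basic WordCount, with an in-memory driver standing in for the shuffle (illustrative, not Hadoop code):

from collections import defaultdict

def map_wordcount(doc_id, text):
    # Emit (word, 1) for every word occurrence.
    for word in text.split():
        yield (word, 1)

def reduce_wordcount(word, counts):
    # Sum the partial counts for one word.
    yield (word, sum(counts))

# Tiny driver simulating the shuffle (illustrative assumption).
docs = {"doc1": "to be or not to be", "doc2": "to do"}
groups = defaultdict(list)
for doc_id, text in docs.items():
    for k, v in map_wordcount(doc_id, text):
        groups[k].append(v)

result = {}
for word, counts in groups.items():
    for k, v in reduce_wordcount(word, counts):
        result[k] = v
print(result)   # {'to': 3, 'be': 2, 'or': 1, 'not': 1, 'do': 1}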
More WordCount: local aggregation
See Jimmy Lin page 47, algorithms 3.2 and 3.3
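A sketch of the same WordCount with local aggregation: counts are combined inside a single map() call before being emitted, so far fewer pairs reach the shuffle (illustrative, not the book's exact algorithm):

from collections import Counter, defaultdict

def map_local_agg(doc_id, text):
    counts = Counter(text.split())      # local, per-call aggregation
    for word, count in counts.items():
        yield (word, count)             # one pair per distinct word in the document

def reduce_sum(word, counts):
    yield (word, sum(counts))

# Illustrative in-memory driver (assumption, not Hadoop code).
docs = {"doc1": "to be or not to be", "doc2": "to do"}
groups = defaultdict(list)
for doc_id, text in docs.items():
    for k, v in map_local_agg(doc_id, text):
        groups[k].append(v)
print({w: sum(c) for w, c in groups.items()})  # same result, fewer emitted pairs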
More WordCount: “in-mapper combining”
See Jimmy Lin page 47, algorithms 3.2 and 3.3
A given map task may process several key-value pairs, each in a separate call to the map function.
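A sketch of in-mapper combining: the mapper object keeps a dictionary of counts across map() calls for its whole input split and emits it once at the end of the task (class and method names are illustrative, not the Hadoop Mapper API):

from collections import defaultdict

class InMapperCombiningWordCount:
    def setup(self):
        self.counts = defaultdict(int)      # state preserved across map() calls

    def map(self, doc_id, text):
        for word in text.split():
            self.counts[word] += 1          # nothing emitted yet

    def close(self):
        for word, count in self.counts.items():
            yield (word, count)             # one pair per distinct word per task

# One map task processing several key-value pairs:
task = InMapperCombiningWordCount()
task.setup()
task.map("doc1", "to be or not to be")
task.map("doc2", "to do")
print(dict(task.close()))   # {'to': 3, 'be': 2, 'or': 1, 'not': 1, 'do': 1}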
In-mapper combining
Advantages: the programmer controls when and how aggregation happens, and it is guaranteed to run (unlike a combiner); far fewer intermediate key-value pairs are created and shuffled.
Drawbacks: it breaks the functional abstraction of map by preserving state across calls, and the in-mapper state must fit in memory (it may have to be flushed periodically).
More on combiners
Note: in general, the reducer can't be used as a combiner! It only works when the reduce operation is commutative and associative (e.g. a sum), not, for instance, when computing a mean.
The combiner may never be executed! Hadoop may run it zero, one, or several times, so the program must remain correct without it.
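A small illustration of both warnings: a sum reducer can double as a combiner because addition is commutative and associative, but a mean reducer cannot, and the result must stay correct even if no combiner runs at all (numbers are made up):

values = [1, 2, 3, 4, 5, 6]
split_a, split_b = values[:2], values[2:]   # two map tasks

# Sum is commutative and associative: combining partial sums is safe.
assert sum([sum(split_a), sum(split_b)]) == sum(values)

# Mean is not: the "mean of means" differs from the true mean.
mean = lambda xs: sum(xs) / len(xs)
print(mean([mean(split_a), mean(split_b)]))  # 3.0
print(mean(values))                          # 3.5 -> reusing the mean reducer
                                             # as a combiner would be incorrect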
Backup tasks to cope with stragglers
Bringing executions to completion
Solution 1: pilot jobs
Pilot jobs
[Figure: application tasks pulled by pilot jobs, compared with regular job submission]
S. Camarasu-Pop, T. Glatard, J. T. Moscicki, H. Benoit-Cattin, and D. Sarrut, "Dynamic partitioning of GATE Monte-Carlo simulations on EGEE," Journal of Grid Computing, vol. 8, no. 2, pp. 241-259, March 2010.
Solution 2: dynamic load-balancing (Monte-Carlo simulations)
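A toy Python simulation of the idea (made-up speeds and chunk size, not the algorithm of the cited paper): workers pull small chunks of events until the total is reached, instead of receiving a fixed static share:

import heapq

total_events, chunk = 1200, 50
speeds = {"fast": 10.0, "medium": 5.0, "slow": 1.0}   # events per second (made up)

# Static split: every worker gets the same share, so the slow worker dominates.
static_makespan = max((total_events / len(speeds)) / s for s in speeds.values())

# Dynamic: workers pull a small chunk as soon as they are free.
free_at = [(0.0, name) for name in speeds]   # (time the worker becomes free, worker)
heapq.heapify(free_at)
remaining, makespan = total_events, 0.0
while remaining > 0:
    t, name = heapq.heappop(free_at)
    done = min(chunk, remaining)
    remaining -= done
    t_end = t + done / speeds[name]
    makespan = max(makespan, t_end)
    heapq.heappush(free_at, (t_end, name))

print("static split:", static_makespan, "s   dynamic:", round(makespan, 1), "s")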
Dynamic load-balancing: result
[Figure: results for static load balancing + pilot jobs vs. dynamic load balancing + pilot jobs]
Solution 3: Task replication (backup tasks)
[Plot: completed tasks vs. time, without replication and with replication]
R. Ferreira da Silva, T. Glatard, and F. Desprez, "Self-healing of operational workflow incidents on distributed computing infrastructures," 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012), Ottawa, Canada, pp. 318-325, May 2012.
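A toy illustration of why backup tasks help (made-up durations): the job ends with its slowest task, and running a second copy of each task and keeping the earliest one tames the straggler:

# Each entry is the duration of one task; the last task is a straggler.
durations = [10, 11, 9, 12, 95]
backup    = [12, 10, 11, 10, 13]           # duration of a second (backup) copy

print("without replication:", max(durations))                                     # 95
print("with replication:   ", max(min(a, b) for a, b in zip(durations, backup)))  # 13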
MapReduce exercises
Inverted index (record level)
map(String key, String value):
  // key: document name
  // value: document content
  for each word w in value:
    emit(w, key)

reduce(String key, Iterator values):
  // key: a word
  // values: a list of document names
  documents = removeDuplicates(values)
  emit(key, documents)
Example:
  Input documents:
    doc1: "This is a very interesting document describing how..."
    doc2: "This is a boring document explaining that..."
  Resulting inverted index:
    this: doc1, doc2
    is: doc1, doc2
    a: doc1, doc2
    boring: doc2
    interesting: doc1
    ...
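A runnable in-memory sketch of this inverted index (illustrative, not Hadoop code):

from collections import defaultdict

def map_index(doc_name, content):
    for word in content.lower().split():
        yield (word, doc_name)

def reduce_index(word, doc_names):
    yield (word, sorted(set(doc_names)))   # removeDuplicates(values)

docs = {
    "doc1": "This is a very interesting document describing how",
    "doc2": "This is a boring document explaining that",
}
groups = defaultdict(list)
for name, content in docs.items():
    for word, doc in map_index(name, content):
        groups[word].append(doc)

index = dict(pair for w, ds in groups.items() for pair in reduce_index(w, ds))
print(index["this"])     # ['doc1', 'doc2']
print(index["boring"])   # ['doc2']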
Relational-algebra operations
Data representation:
Table: users
Input of map tasks: key = table name (e.g. "users"), value = one record, e.g.:
  1975,08,18,John,Doe
  1989,03,01,Jane,Doe
  ...
Selection
SQL query: SELECT * from <table> WHERE <condition>
map(key, value):
  // key: table name (e.g. "users")
  // value: record (e.g. "1975,08,18,John,Doe")
  if value satisfies <condition>:
    emit(value, value)
No reducer needed
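A Python sketch of this selection, using an illustrative condition (birth year < 1980) on the users records above:

# In-memory simulation; the condition is an illustrative assumption.
def map_select(table, record):
    year = int(record.split(",")[0])
    if year < 1980:              # <condition>
        yield (record, record)

records = [("users", "1975,08,18,John,Doe"), ("users", "1989,03,01,Jane,Doe")]
selected = [k for table, rec in records for k, _ in map_select(table, rec)]
print(selected)   # ['1975,08,18,John,Doe']  (no reducer needed)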
Projection
SELECT DISTINCT <attributes> from <table>
map(key, value):
  proj = select <attributes> in value
  emit(proj, proj)
reduce(key, values): // eliminates duplicates
  emit(key, key)
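A Python sketch of this projection, using the last name as the illustrative projected attribute:

from collections import defaultdict

def map_project(table, record):
    last_name = record.split(",")[4]   # <attributes> = last name (illustrative)
    yield (last_name, last_name)

def reduce_distinct(key, values):
    yield (key, key)                   # one output per distinct projection

records = [("users", "1975,08,18,John,Doe"), ("users", "1989,03,01,Jane,Doe")]
groups = defaultdict(list)
for table, rec in records:
    for k, v in map_project(table, rec):
        groups[k].append(v)
print([k for key, vals in groups.items() for k, _ in reduce_distinct(key, vals)])
# ['Doe']  -- duplicates eliminated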
Union
SELECT * from <table1> UNION SELECT * from <table2>
map(key, value):
  emit(value, value)
reduce(key, values): // eliminates duplicates
  emit(key, key)
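A Python sketch of this union on two small illustrative tables:

from collections import defaultdict

def map_union(table, record):
    yield (record, record)

def reduce_once(key, values):
    yield (key, key)     # identical records share a key, emitted once

records = [("table1", "1975,08,18,John,Doe"),
           ("table2", "1975,08,18,John,Doe"),    # duplicate across tables
           ("table2", "1989,03,01,Jane,Doe")]
groups = defaultdict(list)
for table, rec in records:
    for k, v in map_union(table, rec):
        groups[k].append(v)
print([k for key, vals in groups.items() for k, _ in reduce_once(key, vals)])
# ['1975,08,18,John,Doe', '1989,03,01,Jane,Doe']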
Intersection
SELECT <attribute> from <table1>
WHERE <attribute> IN ( SELECT <attribute> from <table2> )
map(key, value):
  attribute = select <attribute> in value
  emit(attribute, key)    // key: table name
reduce(key, values):
  if size(removeDuplicates(values)) == 2:   // attribute present in both tables
    emit(key, key)
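A Python sketch of this intersection, with the last name as the illustrative attribute:

from collections import defaultdict

def map_attr(table, record):
    last_name = record.split(",")[4]       # <attribute> (illustrative choice)
    yield (last_name, table)

def reduce_intersect(attr, tables):
    if len(set(tables)) == 2:              # appears in table1 and table2
        yield (attr, attr)

records = [("table1", "1975,08,18,John,Doe"),
           ("table1", "1980,01,01,Ann,Smith"),
           ("table2", "1989,03,01,Jane,Doe")]
groups = defaultdict(list)
for table, rec in records:
    for k, v in map_attr(table, rec):
        groups[k].append(v)
print([a for attr, ts in groups.items() for a, _ in reduce_intersect(attr, ts)])
# ['Doe']  -- 'Smith' only appears in table1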
Difference
SELECT <attribute> from <table1>
WHERE <attribute> NOT IN ( SELECT <attribute> from <table2> )
map(key, value):
  attribute = select <attribute> in value
  emit(attribute, key)    // key: table name
reduce(key, values):
  if size(removeDuplicates(values)) == 1 && values[0] == "table1":
    emit(key, key)
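A Python sketch of this difference, with the same illustrative attribute and tables:

from collections import defaultdict

def map_attr(table, record):
    yield (record.split(",")[4], table)    # (<attribute>, table name)

def reduce_difference(attr, tables):
    if set(tables) == {"table1"}:          # in table1 and not in table2
        yield (attr, attr)

records = [("table1", "1975,08,18,John,Doe"),
           ("table1", "1980,01,01,Ann,Smith"),
           ("table2", "1989,03,01,Jane,Doe")]
groups = defaultdict(list)
for table, rec in records:
    for k, v in map_attr(table, rec):
        groups[k].append(v)
print([a for attr, ts in groups.items() for a, _ in reduce_difference(attr, ts)])
# ['Smith']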
Matrix-Vector Multiplication
Assumption: M is big.
M (3,3) × v (3,1) = x (3,1):

  [ m11 m12 m13 ]   [ v1 ]   [ x1 ]
  [ m21 m22 m23 ] × [ v2 ] = [ x2 ]
  [ m31 m32 m33 ]   [ v3 ]   [ x3 ]
Data (matrix) representation as key-value pairs:
  Non-null elements: ("1,1", m11), ("1,2", m12), ("1,3", m13), ("2,1", m21), ("2,2", m22), …
  Null elements: omitted
Matrix-Vector Multiplication
The map task
map(key, value):
  // key: "i,j", value: m_ij
  i = get_first_element(key)
  j = get_second_element(key)
  result = value * vj    // vj: j-th element of v, read by the mapper
  emit(i, result)

The reduce task
reduce(key, values):
  // key: row index i; values: partial products m_ij * vj
  result = 0
  for each value in values:
    result += value
  emit(key, result)
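A runnable in-memory sketch of these two tasks on a small sparse M (data layout and values are illustrative); v is assumed to be small enough to be available to every map task:

from collections import defaultdict

v = {1: 1.0, 2: 2.0, 3: 3.0}                                        # v (3,1)
M = {"1,1": 1.0, "1,3": 2.0, "2,2": 5.0, "3,1": 4.0, "3,3": 1.0}    # non-null m_ij

def map_mv(key, value):
    i, j = (int(s) for s in key.split(","))
    yield (i, value * v[j])                        # partial product m_ij * v_j

def reduce_row_sum(i, partials):
    yield (i, sum(partials))                       # x_i = sum_j m_ij * v_j

groups = defaultdict(list)
for key, m_ij in M.items():
    for i, p in map_mv(key, m_ij):
        groups[i].append(p)
x = dict(pair for i, ps in groups.items() for pair in reduce_row_sum(i, ps))
print(x)   # {1: 7.0, 2: 10.0, 3: 7.0}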
New assumption: v is big
Problem: each map task is assigned only a chunk of M, but still has to read the whole of v, which no longer fits in memory.
New assumption: v is big
Solution: split M into vertical stripes and v into matching horizontal stripes; each map task is assigned one stripe of M and reads only the corresponding stripe of v, which fits in memory.
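A short sketch of the striping idea (stripe width and data layout are illustrative assumptions):

stripe_width = 2

def stripe_of(j):
    return (j - 1) // stripe_width          # stripe index for column j

# Each map task receives the M entries of one stripe plus that stripe of v,
# and still emits partial products (i, m_ij * v_j); the reducer sums per row.
M = {(1, 1): 1.0, (1, 3): 2.0, (2, 2): 5.0, (3, 1): 4.0, (3, 3): 1.0}
v = {1: 1.0, 2: 2.0, 3: 3.0}

stripes_M = {}
stripes_v = {}
for (i, j), m_ij in M.items():
    stripes_M.setdefault(stripe_of(j), {})[(i, j)] = m_ij
for j, v_j in v.items():
    stripes_v.setdefault(stripe_of(j), {})[j] = v_j

for s in sorted(stripes_M):
    print("stripe", s, "needs only", stripes_v[s])   # the part of v kept in memory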
Conclusion
The overhead of disk I/O (intermediate results are written to disk between map and reduce, and between chained jobs)
Google used MapReduce widely
Beyond Google: The Hadoop ecosystem
The Hadoop Distributed File System (HDFS)
Hadoop Distributed File System (HDFS)
HDFS Blocks
Hadoop: The Definitive Guide, Tom White, 4th edition, pp. 45-46.
Some HDFS properties
HDFS: Reading data
Hadoop: The Definitive Guide, Tom White, 4th edition.
HDFS: Writing data
HDFS: network distance