CSE 444: Database Internals
Section 9:
Parallel Processing and MapReduce
1
Review in this section
2
Parallel DBMS
R(a,b) is horizontally partitioned across N = 3 machines.
Each machine locally stores approximately 1/N of the tuples in R.
The tuples are randomly organized across machines (i.e., R is block partitioned across machines).
Show a RA plan for this query and how it will be executed across the N = 3 machines.
Pick an efficient plan that leverages the parallelism as much as possible.
SELECT a, max(b) as topb
FROM R
WHERE a > 0
GROUP BY a
3
SELECT a, max(b) as topb �FROM R�WHERE a > 0�GROUP BY a
4
1/3 of R
1/3 of R
1/3 of R
Machine 1
Machine 2
Machine 3
R(a, b)
SELECT a, max(b) as topb �FROM R�WHERE a > 0�GROUP BY a
5
1/3 of R
1/3 of R
1/3 of R
Machine 1
Machine 2
Machine 3
R(a, b)
scan
scan
scan
If more than one relation on a machine, then “scan S”, “scan R” etc
SELECT a, max(b) as topb �FROM R�WHERE a > 0�GROUP BY a
6
1/3 of R
1/3 of R
1/3 of R
Machine 1
Machine 2
Machine 3
R(a, b)
scan
scan
scan
σa>0
σa>0
σa>0
SELECT a, max(b) as topb �FROM R�WHERE a > 0�GROUP BY a
7
1/3 of R
1/3 of R
1/3 of R
Machine 1
Machine 2
Machine 3
R(a, b)
scan
scan
scan
σa>0
σa>0
σa>0
γa, max(b)-> b
γa, max(b)-> b
γa, max(b)-> b
SELECT a, max(b) as topb �FROM R�WHERE a > 0�GROUP BY a
8
1/3 of R
1/3 of R
1/3 of R
Machine 1
Machine 2
Machine 3
R(a, b)
scan
scan
scan
σa>0
σa>0
σa>0
γa, max(b)-> b
γa, max(b)-> b
γa, max(b)-> b
Hash on a
Hash on a
Hash on a
SELECT a, max(b) as topb FROM R�WHERE a > 0 GROUP BY a
9
1/3 of R
1/3 of R
1/3 of R
Machine 1
Machine 2
Machine 3
R(a, b)
scan
scan
scan
σa>0
σa>0
σa>0
γa, max(b)-> b
γa, max(b)-> b
γa, max(b)-> b
Hash on a
Hash on a
Hash on a
SELECT a, max(b) as topb FROM R�WHERE a > 0 GROUP BY a
10
1/3 of R
1/3 of R
1/3 of R
Machine 1
Machine 2
Machine 3
R(a, b)
scan
scan
scan
σa>0
σa>0
σa>0
γa, max(b)-> b
γa, max(b)-> b
γa, max(b)-> b
Hash on a
Hash on a
Hash on a
γa, max(b)->topb
γa, max(b)->topb
γa, max(b)->topb
Benefit of hash-partitioning
11
SELECT a, max(b) as topb �FROM R�WHERE a > 0�GROUP BY a
SELECT a, max(b) as topb FROM R�WHERE a > 0 GROUP BY a
12
1/3 of R
1/3 of R
1/3 of R
Machine 1
Machine 2
Machine 3
Hash-partition on a for R(a, b)
scan
scan
scan
σa>0
σa>0
σa>0
γa, max(b)->topb
γa, max(b)->topb
γa, max(b)->topb
Problem 1)
13
Consider relations R(a,b), S(c,d), and T(e,f). All three are horizontally partitioned across N = 3 machines.
The tuples are randomly organized across machines.
Show a relational algebra plan for the following query and how it will be executed across the N = 3 machines:
SELECT *
FROM R, S, T
WHERE R.b = S.c
AND S.d = T.e
AND (R.a - T.f) > 100
Problem 1)
14
SELECT *
FROM R, S, T
WHERE R.b = S.c
AND S.d = T.e
AND (R.a - T.f) > 100
R(a,b)
S(c,d)
T(e,f)
Problem 1)
15
SELECT *
FROM R, S, T
WHERE R.b = S.c
AND S.d = T.e
AND (R.a - T.f) > 100
R(a,b)
S(c,d)
T(e,f)
Problem 1)
16
SELECT *
FROM R, S, T
WHERE R.b = S.c
AND S.d = T.e
AND (R.a - T.f) > 100
R(a,b)
S(c,d)
T(e,f)
Problem 1)
17
SELECT *
FROM R, S, T
WHERE R.b = S.c
AND S.d = T.e
AND (R.a - T.f) > 100
R(a,b)
S(c,d)
T(e,f)
Problem 1)
18
SELECT *
FROM R, S, T
WHERE R.b = S.c
AND S.d = T.e
AND (R.a - T.f) > 100
R(a,b)
S(c,d)
T(e,f)
Problem 1)
19
SELECT *
FROM R, S, T
WHERE R.b = S.c
AND S.d = T.e
AND (R.a - T.f) > 100
R(a,b)
S(c,d)
T(e,f)
Problem 1)
20
SELECT *
FROM R, S, T
WHERE R.b = S.c
AND S.d = T.e
AND (R.a - T.f) > 100
R(a,b)
S(c,d)
T(e,f)
Map Reduce
Explain how the query will be executed in MapReduce
SELECT a, max(b) as topb
FROM R
WHERE a > 0
GROUP BY a
Specify the computation performed in the map and the reduce functions
21
Map
22
SELECT a, max(b) as topb �FROM R�WHERE a > 0�GROUP BY a
Note: When each map task scans multiple relations, it needs to output something like
key = a and value = (‘R’, b) which has the relation name ‘R’
R(a, b)
Shuffle
23
SELECT a, max(b) as topb �FROM R�WHERE a > 0�GROUP BY a
Note: the programmer has to write only the map and reduce functions, the shuffle phase is done by the MapReduce engine (although the programmer can rewrite the partition function), but you should still mention this in HW6 answers.
R(a, b)
Reduce
24
SELECT a, max(b) as topb �FROM R�WHERE a > 0�GROUP BY a
Note: A local combiner can be used to compute local max before data gets reshuffled (in the map tasks)
R(a, b)
25
1/3 R
1/3 R
1/3 R
File system: HDFS
Could use a Combiner
(compute local max)
key = a, value =max(b)
SELECT a, max(b) as topb �FROM R�WHERE a > 0�GROUP BY a
Map:
Reduce:
R(a, b)
Benefit of hash-partitioning
26
SELECT a, max(b) as topb �FROM R�WHERE a > 0�GROUP BY a
Problem 2)
27
Consider two relations R(a, b) and S(b, c).
SELECT R.b, max(S.c) as cmax
FROM R, S
WHERE R.b = S.b
AND R.a <= 100
GROUP BY R.b
For the Map function, what are the computations performed, and what will be its outputs? Assume that the Map function reads a block of R or S relation as input.
For the Reduce function, what will be its inputs, what are the computations performed, and what will be its outputs?
Problem 2)
28
SELECT R.b, max(S.c) as cmax
FROM R, S
WHERE R.b = S.b
AND R.a <= 100
GROUP BY R.b
Map Function:
Reduce Function:
R(a, b)
S(b, c)
Note: In some cases, you may need to perform more than one MapReduce job to get the final result
Comparing between Parallel DBMSs and MapReduce Systems
29
Parallel DBMS:
MapReduce: