Big Data Programming
Dr. Shih-wei Liao
Android + Big Data
Big Data
Dr. Shih-wei Liao Big Data lectures
McKinsey’s Definition
Sometimes you see the 4 V's: Variety, Volume, Velocity, Veracity:
Structured Data: data with meta information and labels.
Unstructured Data: data in free format, such as text data on the web.
Distributed: data is not stored on a single machine but distributed across different machines.
Noisy: data contains irrelevant, redundant, or even erroneous information.
Volume:
Volume in terms of Real Users:
Google+
Velocity: A Minute Internet Life
350,000 tweets
208,000 photos uploaded
100 hours of video uploaded
120 new accounts
$118,000 in revenue
3.5 million search queries
Just Dive In: Learning-by-Doing
I'll show how we use MapReduce and Spark to:
"MapReduce: Simplified Data Processing on Large Clusters" (OSDI'04).
“Pregel: a system for large-scale graph processing” (SIGMOD’10)
Big Data Programming
First, in memoriam: Robert Floyd
Robert Floyd: recipient of the 1978 Turing Award
“To the designer of programming languages, I say: unless you can support the paradigms I use when I program, or at least support my extending your language into one that does support my programming methods, I don't need your shiny new languages. [...] To persuade me of the merit of your language, you must show me how to construct programs in it.”
MapReduce Programming Model
Programmer specifies two functions:
Main data structure:
Mappers and Reducers
Mapper/Reducer Template:
Mapper: map(in_key, in_value) -> list(out_key, intermediate_value)
Reducer: reduce(out_key, list(intermediate_value)) -> list(out_value)
MapReduce Programming Model
Main data structure: (Key, Value) pairs.
(list_key, list_value) = map(list, function_key, function_value):
    for each element in list:
        list_key's element <= function_key(element)
        list_value's element <= function_value(element)
Group by key.
(key, result) = reduce(key, list_for_key, function_reduce):
    result <= 0
    for each element in list_for_key:
        result <= function_reduce(result, element)
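The pseudocode above can be simulated by a small in-memory Python driver. This is a sketch only: the function name `map_reduce` is mine, and the group-by-key happens in-process rather than across machines.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    # Minimal in-memory sketch of the MapReduce model.
    # Map phase: each input record yields zero or more (key, value) pairs.
    intermediate = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            intermediate[key].append(value)
    # Group-by-key is done by the dict above; the reduce phase then
    # folds each key's value list into one result.
    return {key: reducer(key, values) for key, values in intermediate.items()}
```

For example, `map_reduce(pairs, lambda kv: [kv], lambda k, vs: sum(vs))` sums the values per key.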
Mappers and Reducers: Observations
Example: Count Word Occurrences
Input: A set of words.
Output: Number of occurrences of each word.
Mapper/Reducer implementation:
Mapper/Reducer template:
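A minimal Python sketch of the word-count Mapper/Reducer; the in-process driver that simulates the shuffle (group-by-key) step is my addition for illustration.

```python
from collections import defaultdict

def mapper(document):
    # Emit (word, 1) for every word occurrence in the document.
    for word in document.split():
        yield (word, 1)

def reducer(word, counts):
    # Sum the partial counts for one word.
    return (word, sum(counts))

def word_count(documents):
    # Tiny driver simulating the shuffle (group-by-key) step.
    grouped = defaultdict(list)
    for doc in documents:
        for word, one in mapper(doc):
            grouped[word].append(one)
    return dict(reducer(w, c) for w, c in grouped.items())
```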
Supporting Parallelized and Distributed Processing
Mapper/Reducer Template:
Mapper: map(in_key, in_value) -> list(out_key, intermediate_value)
Reducer: reduce(out_key, list(intermediate_value)) -> list(out_value)
Parallel, Distributed Execution
[Diagram: input splits processed by many Mappers in parallel, whose outputs are routed to many Reducers]
Parallel, Distributed Execution
[Diagram: Mappers feed local Combiners; a Shuffler routes the intermediate pairs by key to Reducers]
Key components of MapReduce
Mapper: transforms each input record into intermediate (key, value) pairs.
Combiner (optional, on the same machine as a mapper): locally pre-aggregates a mapper's output to cut network traffic.
Shuffler: groups intermediate pairs by key and routes each key's values to one reducer.
Reducer: aggregates all values for a key into the final output.
Examples of MapReduce-based Algorithms
Web search: inverted index.
Data processing: sort, per-key aggregation (e.g., averages), matrix multiplication.
Machine learning: K-means clustering.
Graph mining: connected components.
Inverted Index
Definition:
Input: A collection of documents.
Output: A mapping from a word to a list of documents.
Word | Documents |
apple | doc1, doc3, ... |
banana | doc1, doc2, doc3, ... |
cherry | doc2, doc3, ... |

Document | Words |
doc1 | apple, banana, date, ... |
doc2 | banana, cherry, pear, ... |
doc3 | apple, banana, cherry, ... |
Inverted Index - using MapReduce
Mapper: for each word w in document d, emit (w, d).
Shuffler: groups the pairs by word.
Reducer: for each word, emit the list of documents containing it.
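The steps above can be sketched in Python as follows; the driver merging mapper outputs in-process (standing in for the shuffler) is my addition.

```python
from collections import defaultdict

def mapper(doc_id, text):
    # Emit (word, doc_id) once per distinct word in the document.
    for word in set(text.split()):
        yield (word, doc_id)

def inverted_index(docs):
    # docs: dict mapping doc_id -> text. The dict below plays the
    # shuffler's role, grouping doc ids per word.
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for word, d in mapper(doc_id, text):
            index[word].append(d)
    # Reducer: sort each posting list for deterministic output.
    return {w: sorted(ids) for w, ids in index.items()}
```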
Sort
Input: A set of English words.
Output: The sorted list of the input English words.
Example:
Sort - using MapReduce
Mapper: for each word w, emit (w, w) — the word itself is the key.
Shuffler: groups the pairs and sorts them by key.
Reducer: emits the keys in sorted order.
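A Python sketch of sorting this way; the `sort_words` driver, which stands in for the framework's sort-by-key shuffle, is mine.

```python
def mapper(word):
    # The word itself is the key; the framework's shuffle sorts by key.
    yield (word, None)

def sort_words(words):
    pairs = [kv for w in words for kv in mapper(w)]
    # Simulate the shuffler's sort-by-key; the reducer then simply
    # emits each key in order.
    return [k for k, _ in sorted(pairs, key=lambda kv: kv[0])]
```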
Compute Average Value per Key.
Given a (StudentId, Grade) table, compute:
SELECT StudentId, avg(Grade)
FROM student_grades
GROUP BY StudentId
StudentId | Grade |
Amy | 95 |
Beth | 98 |
Cathy | 91 |
Amy | 88 |
?
Compute Average Value per Key.
Given a (StudentId, Grade) table, compute:
SELECT StudentId, avg(Grade)
FROM student_grades
GROUP BY StudentId
StudentId | Grade |
Amy | 95 |
Beth | 98 |
Cathy | 91 |
Amy | 88 |
StudentId | Avg(Grade) |
Amy | 91.5 |
Beth | 98 |
Cathy | 91 |
Compute Average Value per Key (w/ MapReduce)
Mapper: for each row, emit (StudentId, Grade).
Reducer: for each StudentId, output the average of its grades.
(Q) Can we reduce the network traffic between mappers & reducers?
Compute Average Value per Key (w/ MapReduce)
Using a "combiner" to improve performance.
Mapper: for each row, emit (StudentId, (Grade, 1)).
Combiner (at each mapper): merge the local partial (sum, count) pairs per StudentId.
Reducer: for each StudentId, output total_sum / total_count.
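A Python sketch of the combiner version; emitting (sum, count) pairs lets the combiner pre-aggregate without losing the information the average needs. The `average_per_key` driver running everything in-process is my addition.

```python
from collections import defaultdict

def mapper(student, grade):
    # Emit a partial (sum, count) so a combiner can pre-aggregate.
    yield (student, (grade, 1))

def combiner(student, partials):
    # Runs on the mapper's machine: merge the local partial pairs.
    yield (student, (sum(p[0] for p in partials),
                     sum(p[1] for p in partials)))

def reducer(student, partials):
    # Final average = total sum / total count.
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return (student, total / count)

def average_per_key(rows):
    # Driver: map, combine, then reduce, all in-process.
    grouped = defaultdict(list)
    for student, grade in rows:
        for k, v in mapper(student, grade):
            grouped[k].append(v)
    combined = defaultdict(list)
    for k, vs in grouped.items():
        for ck, cv in combiner(k, vs):
            combined[ck].append(cv)
    return dict(reducer(k, vs) for k, vs in combined.items())
```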
K-means Clustering
From: http://stanford.edu/~cpiech/cs221/img/kmeansViz.png
K-means Clustering
Input: A set of data points (x, y), and the number of clusters, K.
Output: The centers of the K clusters.
Algorithm:
K-means Clustering using MapReduce
"centers": Initialize the K cluster centers.
Mapper: assign each point to its nearest center; emit (center_id, point).
Reducer: for each center_id, compute the new center as the mean of its assigned points.
Repeat the above Mapper/Reducer steps, until convergence.
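The iteration above can be sketched sequentially in Python (2-D points only; the `kmeans` and `closest` names are mine, and a fixed iteration count stands in for a convergence test).

```python
import math

def closest(point, centers):
    # Index of the nearest center (Euclidean distance).
    return min(range(len(centers)),
               key=lambda i: math.dist(point, centers[i]))

def kmeans(points, centers, iterations=10):
    for _ in range(iterations):
        # Mapper: emit (nearest-center index, point).
        buckets = {i: [] for i in range(len(centers))}
        for p in points:
            buckets[closest(p, centers)].append(p)
        # Reducer: each new center is the mean of its assigned points;
        # a center with no points keeps its old position.
        centers = [
            (sum(x for x, _ in pts) / len(pts),
             sum(y for _, y in pts) / len(pts)) if pts else centers[i]
            for i, pts in buckets.items()
        ]
    return centers
```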
Matrix Multiplication
Definition:
Algorithm:
[Figure: A x B = C, where C is m-by-n; cell (m, n) is the dot product of row m of A and column n of B]
Matrix Multiplication using MapReduce
Mapper:
( (m,n), sum(A[m,:] * B[:,n]) )
Reducer:
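The basic scheme (one mapper task per output cell, each computing a full dot product) can be sketched as follows; the `matmul` driver looping over cells is my stand-in for the framework.

```python
def mapper(m, n, row, col):
    # One mapper computes the full dot product for output cell (m, n).
    yield ((m, n), sum(a * b for a, b in zip(row, col)))

def matmul(A, B):
    # Driver: one mapper per cell; no reducer work is needed here
    # because each cell's value arrives complete.
    n_cols = len(B[0])
    cols = [[row[j] for row in B] for j in range(n_cols)]
    C = [[0] * n_cols for _ in A]
    for m, row in enumerate(A):
        for n, col in enumerate(cols):
            for (i, j), v in mapper(m, n, row, col):
                C[i][j] = v
    return C
```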
(Q) What if the row or the column is too big?
Matrix Multiplication
If the row or the column is too big, then ...
[Figure: row m of A and column n of B split into matching segments; each segment pair contributes a partial sum to C[m, n]]
Matrix Multiplication using MapReduce
The "improved" version:
Mapper:
( (m,n), sum(row segment * column segment) )
Reducer:
( (m,n), sum(partial_sum_i) )
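A Python sketch of the segmented version: each mapper sees only one row-segment/column-segment pair and emits a partial dot product keyed by (m, n); the reducer sums the partials. The driver and the `seg` chunk size are my additions.

```python
from collections import defaultdict

def mapper(m, n, row_segment, col_segment):
    # Partial dot product for cell (m, n) from one segment pair.
    yield ((m, n), sum(a * b for a, b in zip(row_segment, col_segment)))

def reducer(cell, partials):
    # Sum the partial products into the final C[m, n].
    return (cell, sum(partials))

def matmul_segmented(A, B, seg=2):
    # Driver: split the inner dimension into chunks of length `seg`.
    k = len(B)
    cols = [[B[i][j] for i in range(k)] for j in range(len(B[0]))]
    grouped = defaultdict(list)
    for m, row in enumerate(A):
        for n, col in enumerate(cols):
            for s in range(0, k, seg):
                for cell, p in mapper(m, n, row[s:s+seg], col[s:s+seg]):
                    grouped[cell].append(p)
    return dict(reducer(c, ps) for c, ps in grouped.items())
```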
Connected Components
Definition: Given a graph, find the "components" where each is a set of nodes connected to one another through a path.
Three connected components:
{A, B, D, E, F},
{C},
{G, H, I}.
Connected Components
Input: The set of graph edges, {(x, y), where x, y are nodes}.
Output: The collection of components, i.e.,
{ ( component 1, [node1, node2, ...] ), ... }
Algorithm Ideas:
Connected Components using MapReduce
Initialize the "leader map": each node is its own leader.
Node | Leader |
a | a |
b | b |
c | c |
d | d |
e | e |
f | f |
Connected Components using MapReduce
Initialize the "leader map": each node is its own leader.
Mapper: for each edge (x, y), emit (y, leader(x)).
Reducer: for each node, keep the smallest proposed leader:
( node, min(leader1, leader2, ...) )
Update the "leader map" with reducer's output.
Repeat the Mapper/Reducer step, until convergence.
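The iterative min-leader propagation above can be sketched sequentially in Python (the `connected_components` name is mine; proposals from both edge endpoints stand in for the mapper's emissions, and the dict update plays the reducer's role).

```python
def connected_components(edges, nodes):
    # Initialize the "leader map": every node leads itself.
    leader = {v: v for v in nodes}
    while True:
        # Mapper: for each edge (x, y), propose the smaller of the two
        # endpoints' leaders to both ends.
        proposals = {}
        for x, y in edges:
            m = min(leader[x], leader[y])
            for v in (x, y):
                proposals[v] = min(proposals.get(v, leader[v]), m)
        # Reducer: each node keeps the minimum proposed leader.
        new_leader = {v: min(leader[v], proposals.get(v, leader[v]))
                      for v in nodes}
        if new_leader == leader:  # convergence: no leader changed
            return leader
        leader = new_leader
```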
Connected Components w/ MapReduce (Illustration: Iter 1)
Edge | (Node, Leader) update |
(a, b) | (b, leader(a)) = (b, a) |
(a, d) | (d, leader(a)) = (d, a) |
(a, e) | (e, leader(a)) = (e, a) |
(d, e) | (e, leader(d)) = (e, d) |
(d, f) | (f, leader(d)) = (f, d) |
(e, f) | (f, leader(e)) = (f, e) |
Node | Leader |
a | a |
b | b |
c | c |
d | d |
e | e |
f | f |
Mapper
Reducer
Input | Output |
(b, [a]) | (b, a) |
(d, [a]) | (d, a) |
(e, [a, d]) | (e, a) |
(f, [e, d]) | (f, d) |
Connected Components w/ MapReduce (Illustration: Iter 1)
Edge | (Node, Leader) update |
(a, b) | (b, leader(a)) = (b, a) |
(a, d) | (d, leader(a)) = (d, a) |
(a, e) | (e, leader(a)) = (e, a) |
(d, e) | (e, leader(d)) = (e, d) |
(d, f) | (f, leader(d)) = (f, d) |
(e, f) | (f, leader(e)) = (f, e) |
Node | Leader |
a | a |
b | a |
c | c |
d | a |
e | a |
f | d |
Mapper
Reducer
Input | Output |
(b, [a]) | (b, a) |
(d, [a]) | (d, a) |
(e, [a, d]) | (e, a) |
(f, [e, d]) | (f, d) |
Connected Components w/ MapReduce (Illustration: Iter 2)
Edge | (Node, Leader) update |
(a, b) | (b, leader(a)) = (b, a) |
(a, d) | (d, leader(a)) = (d, a) |
(a, e) | (e, leader(a)) = (e, a) |
(d, e) | (e, leader(d)) = (e, a) |
(d, f) | (f, leader(d)) = (f, a) |
(e, f) | (f, leader(e)) = (f, a) |
Node | Leader |
a | a |
b | a |
c | c |
d | a |
e | a |
f | d |
Mapper
Reducer
Input | Output |
(b, [a]) | (b, a) |
(d, [a]) | (d, a) |
(e, [a]) | (e, a) |
(f, [a]) | (f, a) |
Connected Components w/ MapReduce (Illustration: Iter 2)
Edge | (Node, Leader) update |
(a, b) | (b, leader(a)) = (b, a) |
(a, d) | (d, leader(a)) = (d, a) |
(a, e) | (e, leader(a)) = (e, a) |
(d, e) | (e, leader(d)) = (e, a) |
(d, f) | (f, leader(d)) = (f, a) |
(e, f) | (f, leader(e)) = (f, a) |
Node | Leader |
a | a |
b | a |
c | c |
d | a |
e | a |
f | a |
Mapper
Reducer
Input | Output |
(b, [a]) | (b, a) |
(d, [a]) | (d, a) |
(e, [a]) | (e, a) |
(f, [a]) | (f, a) |
(Recap) Examples of MapReduce-based Algorithms
Web search: inverted index.
Data processing: sort, per-key aggregation (e.g., averages), matrix multiplication.
Machine learning: K-means clustering.
Graph mining: connected components.