
Big Data Programming

Dr. Shih-wei Liao



Big Data

  • According to the McKinsey Global Institute, Big Data means datasets (usually unstructured, distributed, and noisy) that are beyond the ability of typical database tools to capture, store, manage, and analyze, in terms of their volume, variety, and velocity of arrival.
  • These are the so-called “3 Vs of Big Data.”
  • E.g., IoT data satisfies the definition above.



McKinsey’s Definition

Sometimes you see 4 Vs: Variety, Volume, Velocity, Veracity:

  • Variety = unstructured
  • Volume, Velocity = distributed
  • Veracity = noisy

Unstructured

Structured data: data with meta-information and labels.

Unstructured data: data in free format, such as text on the web.

Distributed

Data are not stored on one single machine, but distributed across different machines.

Noisy

Data contain irrelevant, redundant, or even erroneous information.



Volume:

  • Google+: 1.15B users, but the key is active users:
    • Monthly active users: 359M
    • A year ago, Google+ had 223M active users among 435M total users
  • In contrast, Facebook also has >1B users, but:
    • Daily active users: 802M. FB clearly wins.
  • Interestingly, WeChat’s monthly active users were also about 355M as of Q1 2014, just like Google+.
  • Twitter: 200M monthly active users (out of >500M users)
  • LinkedIn: 66M monthly active users (out of >200M users)


Volume in terms of Real Users:

[Figure: Google+ and the other companies above, ranked by real (active) users]


In the last slide:

  • Top 3 companies: Tier 1.
  • Bottom 2 companies: Tier 2.

  • Remember: the US and China have a total of 6 Big Data internet companies with over US$50B market cap:
    • US: Google, Facebook, Amazon
    • China: BAT
      • BAT: Baidu, Alibaba, Tencent

  • 3 tiers of companies:
    • Tier 1: > US$50B in terms of market cap
    • Tier 2: US$10B-50B
    • Tier 3: US$1B-10B



Velocity: A Minute of Internet Life

  • 350,000 tweets
  • 208,000 photos uploaded
  • 100 hours of video uploaded
  • 120 new accounts
  • $118,000 in revenue
  • 3.5 million search queries



Tackle It Head-On (頂著幹): Learning-by-Doing

I'll show how we use MapReduce and Spark to:

  • take care of the low-level tasks for the programmers. At runtime:
    • Automatic parallelization and distribution
    • Fault tolerance
    • I/O scheduling
    • Status and monitoring

"MapReduce: Simplified Data Processing on Large Clusters" (OSDI '04)

"Pregel: A System for Large-Scale Graph Processing" (SIGMOD '10)

  • Both are a programming model & an execution runtime.
  • Are they both a simple programming model that supports common applications?


Big Data Programming

First, in memoriam of Robert Floyd

  • Honoring the anniversary of Don Knuth’s "Robert W Floyd, In Memoriam" (2003, ACM SIGACT News), we’ll recognize Robert’s influence on us today:
    • "To persuade me of the merit of your language, you must show me how to construct programs in it." -- Robert Floyd


Robert Floyd: Turing Award recipient of 1978

“To the designer of programming languages, I say: unless you can support the paradigms I use when I program, or at least support my extending your language into one that does support my programming methods, I don't need your shiny new languages. [...] To persuade me of the merit of your language, you must show me how to construct programs in it.”


MapReduce Programming Model

Programmer specifies two functions:

  • Map: Procedure for processing each data record independently.
    • map (input_data) -> list(out_key, intermediate_value)
  • Reduce: Procedure for summarizing values with the same key.
    • reduce (out_key, list(intermediate_value)) -> list(out_value)

Main data structure:

  • (Key, Value) pairs.
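
A minimal single-machine sketch of this model in Python (illustrative only; run_mapreduce is our own name for these slides, not part of the MapReduce API):

    from collections import defaultdict

    def run_mapreduce(records, mapper, reducer):
        # Map phase: process each record independently, emitting (key, value) pairs.
        intermediate = defaultdict(list)
        for record in records:
            for key, value in mapper(record):
                intermediate[key].append(value)   # group values by key
        # Reduce phase: summarize the values collected under each key.
        return [pair for key, values in intermediate.items()
                     for pair in reducer(key, values)]

Later slides reuse this run_mapreduce sketch to illustrate each algorithm.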


Mappers and Reducers

Mapper/Reducer Template:

  • map (input_data) -> list(out_key, intermediate_value)
  • reduce (out_key, list(intermediate_value)) -> list(out_value)

[Figure: Mapper and Reducer stages]


MapReduce Programming Model

  • map (input_data) -> list(out_key, intermediate_value)
  • reduce (out_key, list(intermediate_value)) -> list(out_value)

Main data structure: (Key, Value) pairs.

Execution sketch (pseudocode):

    (list_key, list_value) = map(list, function_key, function_value):
        for each element in list:
            list_key's element   <= function_key(element)
            list_value's element <= function_value(element)

    group by key

    (key, result) = reduce(key, list_for_key, function_reduce):
        result <= 0
        for each element in list_for_key:
            result <= function_reduce(result, element)


Mappers and Reducers: Observations

  • Memoryless: Work independently and process one record at a time.
    • Easier to achieve parallelization and fault-tolerance.
  • Limited functionality, but surprisingly sufficient for most data processing applications.


Example: Count Word Occurrences

Input: A set of words.

Output: Number of occurrences of each word.

Mapper/Reducer implementation:

  • map(word) -> (word, 1)
  • reduce(word, list(1, 1, ..., 1)) -> [(word, total_count)]

Mapper/Reducer template:

  • map (input_data) -> list(out_key, intermediate_value)
  • reduce (out_key, list(intermediate_value)) -> list(out_value)
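
For instance, with the run_mapreduce sketch from earlier (a toy run, not the real distributed runtime):

    def wc_map(word):
        yield (word, 1)                   # map(word) -> (word, 1)

    def wc_reduce(word, counts):
        yield (word, sum(counts))         # reduce(word, [1, 1, ...]) -> (word, total_count)

    print(run_mapreduce(["apple", "banana", "apple"], wc_map, wc_reduce))
    # [('apple', 2), ('banana', 1)]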


Supporting Parallelized and Distributed Processing

Mapper/Reducer Template:

  • map (input_data) -> list(out_key, intermediate_value)
  • reduce (out_key, list(intermediate_value)) -> list(out_value)

[Figure: Mappers and Reducers distributed across machines]


Parallel, Distributed Execution

[Figure: parallel, distributed execution across Mapper and Reducer stages]


Parallel, Distributed Execution

[Figure: parallel, distributed execution with Combiner and Shuffler stages between the Mappers and Reducers]


Key components of MapReduce

Mapper:

  • Process an input record and emit one or more (key, value) pairs.

Combiner (optional, on the same machine as a mapper):

  • Combine output data on the local machine, before emitting to the Shuffler.

Shuffler:

  • Send values with the same key to a particular Reducer machine.
  • Group (and sort) the values with the same key.

Reducer:

  • Process a (key, value_list) tuple and output one or more (key, value) pairs.
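
As a rough sketch, the run_mapreduce loop from earlier can be extended with an optional Combiner stage (again our own illustration, not the real runtime; "partitions" stands in for the input splits held by different machines):

    from collections import defaultdict

    def run_mapreduce_with_combiner(partitions, mapper, combiner, reducer):
        intermediate = defaultdict(list)
        for partition in partitions:
            local = defaultdict(list)             # mapper output, kept on the local machine
            for record in partition:
                for key, value in mapper(record):
                    local[key].append(value)
            for key, values in local.items():     # combiner shrinks data before the shuffle
                for out_key, out_value in combiner(key, values):
                    intermediate[out_key].append(out_value)
        # The shuffle has grouped values by key; now run the reducers.
        return [pair for key, values in intermediate.items()
                     for pair in reducer(key, values)]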


Examples of MapReduce-based Algorithms

Web search:

  • Inverted index

Data processing:

  • Sort
  • Average value per key

Machine learning:

  • K-means clustering
  • Matrix multiplication

Graph mining:

  • Connected component


Inverted Index

Definition:

  • Given a word, output documents (doc_id) containing the word.

Input: A collection of documents.

Output: A mapping from a word to a list of documents.

Example input (doc_id -> words):

  doc1 | apple, banana, date, ...
  doc2 | banana, cherry, pear, ...
  doc3 | apple, banana, cherry, ...

Example output (word -> doc_ids):

  apple  | doc1, doc3, ...
  banana | doc1, doc2, doc3, ...
  cherry | doc2, doc3, ...


Inverted Index - using MapReduce

Mapper:

  • map( (doc_id, {w1, w2,...}) ) -> list( (w1, doc_id), (w2, doc_id), ...)

Shuffler:

  • Shuffle the (word, doc_id) tuples, and group tuples by the word.

Reducer:

  • reducer(word, [doc1, doc2, ...]) -> ( word, [doc1, doc2, ...] )
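
A toy version with the run_mapreduce sketch, assuming each document arrives as a (doc_id, word list) pair:

    def ii_map(doc):
        doc_id, words = doc
        for word in set(words):        # set(): emit each (word, doc_id) once per document
            yield (word, doc_id)

    def ii_reduce(word, doc_ids):
        yield (word, sorted(doc_ids))

    docs = [("doc1", ["apple", "banana", "date"]),
            ("doc2", ["banana", "cherry", "pear"]),
            ("doc3", ["apple", "banana", "cherry"])]
    print(sorted(run_mapreduce(docs, ii_map, ii_reduce)))
    # [('apple', ['doc1', 'doc3']), ('banana', ['doc1', 'doc2', 'doc3']),
    #  ('cherry', ['doc2', 'doc3']), ('date', ['doc1']), ('pear', ['doc2'])]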


Sort

Input: A set of English words.

Output: The sorted list of the input English words.

Example:

  • Input: {apple, cherry, banana}.
  • Output: [apple, banana, cherry].


Sort - using MapReduce.

Mapper:

  • map(string) -> (prefix_key=GetPrefix(string), string)

Shuffler:

  • Each prefix is shuffled to a machine, according to lexicographical ordering.
    • "aa" to machine 1, "ab" to machine 2, etc.

Reducer:

  • reducer(prefix, [s1, s2, ...]) -> (prefix, sorted([s1, s2, ...]))
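
A sketch with run_mapreduce, under two assumptions: GetPrefix returns the first character, and the per-prefix outputs are read back in lexicographic key order (as the shuffler's machine assignment guarantees):

    def sort_map(s):
        yield (s[:1], s)                       # GetPrefix(s): here, the first character

    def sort_reduce(prefix, strings):
        yield (prefix, sorted(strings))

    out = run_mapreduce(["cherry", "apple", "banana", "avocado"], sort_map, sort_reduce)
    # Concatenate the per-prefix sorted lists in prefix order:
    print([s for _, group in sorted(out) for s in group])
    # ['apple', 'avocado', 'banana', 'cherry']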


Compute Average Value per Key.

Given a (StudentId, Grade) table, compute:

SELECT StudentId, avg(Grade)

FROM student_grades

GROUP BY StudentId

  StudentId | Grade
  ----------+------
  Amy       | 95
  Beth      | 98
  Cathy     | 91
  Amy       | 88

  -> ?


Compute Average Value per Key.

Given a (StudentId, Grade) table, compute:

SELECT StudentId, avg(Grade)

FROM student_grades

GROUP BY StudentId

  StudentId | Grade
  ----------+------
  Amy       | 95
  Beth      | 98
  Cathy     | 91
  Amy       | 88

  StudentId | Avg(Grade)
  ----------+-----------
  Amy       | 91.5
  Beth      | 98
  Cathy     | 91


Compute Average Value per Key (w/ MapReduce)

Mapper:

  • map( (id, grade) ) -> (id, grade)

Reducer:

  • reduce(id, [grade1, grade2, ...]) -> (id, avg([grade1, grade2, ...]))

(Q) Can we reduce the network traffic between mappers & reducers?

  • ... if there are a lot of (id, grade) pairs to emit.


Compute Average Value per Key (w/ MapReduce)

Using "combiner" to improve performance.

Mapper:

  • map( (id, grade) ) -> (id, grade)

Combiner (at each mapper):

  • (id, [grade1, grade2, ..., grade_n]) -> (id, (sum(grade1, ..., grade_n), n))

Reducer:

  • reducer(id, [(sum1, n1), (sum2, n2), ...]) -> (id, sum(sum_i) / sum(n_i))
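
Using the run_mapreduce_with_combiner sketch from the earlier Combiner slide (the two partitions below are illustrative input splits):

    def avg_map(row):
        student_id, grade = row
        yield (student_id, grade)

    def avg_combine(student_id, grades):
        yield (student_id, (sum(grades), len(grades)))     # local partial sums

    def avg_reduce(student_id, partials):
        total = sum(s for s, _ in partials)
        count = sum(n for _, n in partials)
        yield (student_id, total / count)

    partitions = [[("Amy", 95), ("Beth", 98)], [("Cathy", 91), ("Amy", 88)]]
    print(run_mapreduce_with_combiner(partitions, avg_map, avg_combine, avg_reduce))
    # [('Amy', 91.5), ('Beth', 98.0), ('Cathy', 91.0)]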


K-means Clustering

[Figure: K-means clustering example. From: http://stanford.edu/~cpiech/cs221/img/kmeansViz.png]


K-means Clustering

Input: A set of data points (x, y), and the number of clusters, K.

Output: The centers of the K clusters.

Algorithm:

  • Initialize the cluster centers.
  • Do until convergence:
    • For each data point, find the closest cluster center, and become the member of that cluster.
    • For each cluster, update the center as the mean of its members.
    • Convergence is reached when no cluster center changes.


K-means Clustering using MapReduce

"centers": Initialize the K cluster centers.

Mapper:

  • Load the current values of "centers".
  • map(data_point) -> (NearestCenterId(data_point), data_point)

Reducer:

  • reducer(center_id, [p1, p2, ...]) -> ( center_id, avg([p1, p2, ...]) )

Repeat the above Mapper/Reducer steps, until convergence.
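
A single-iteration toy version with run_mapreduce; here "centers" is just a global list (the real system would distribute the current centers to all mappers):

    import math

    centers = [(0.0, 0.0), (10.0, 10.0)]       # current values of "centers"

    def nearest_center_id(point):
        return min(range(len(centers)),
                   key=lambda i: math.dist(point, centers[i]))

    def km_map(point):
        yield (nearest_center_id(point), point)

    def km_reduce(center_id, points):
        n = len(points)
        yield (center_id, (sum(x for x, _ in points) / n,
                           sum(y for _, y in points) / n))

    points = [(0, 1), (1, 0), (9, 10), (10, 9)]
    print(run_mapreduce(points, km_map, km_reduce))
    # [(0, (0.5, 0.5)), (1, (9.5, 9.5))] -> the updated centers for the next iteration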


Matrix Multiplication

Definition:

  • Given two matrices A and B, compute the matrix C = A * B.

Algorithm:

  • C(m, n) = sum( A(m, i) * B(i, n) ), over all i values.

[Figure: A (row m highlighted) x B (column n highlighted) = C (entry (m, n) highlighted)]


Matrix Multiplication using MapReduce

Mapper:

  • map( (row A[m,:], column B[:,n]) ) -> ( (m,n), sum(A[m,:] * B[:,n]) )

Reducer:

  • reducer( (m,n), [sum(A[m,:] * B[:,n])] ) -> ( (m,n), sum(A[m,:] * B[:,n]) )   (identity)
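
A toy version with run_mapreduce; each input record pairs one full row of A with one full column of B (this pairing is assumed to be prepared upstream):

    def mm_map(task):
        (m, row), (n, col) = task
        yield ((m, n), sum(a * b for a, b in zip(row, col)))

    def mm_reduce(mn, sums):
        yield (mn, sums[0])        # identity: the mapper already computed the full sum

    A = [[1, 2], [3, 4]]
    B_cols = [[5, 7], [6, 8]]      # columns of B = [[5, 6], [7, 8]]
    tasks = [((m, A[m]), (n, B_cols[n])) for m in range(2) for n in range(2)]
    print(sorted(run_mapreduce(tasks, mm_map, mm_reduce)))
    # [((0, 0), 19), ((0, 1), 22), ((1, 0), 43), ((1, 1), 50)]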

(Q) What if the row or the column is too big?


Matrix Multiplication

If the row or the column is too big, then ...

[Figure: row m of A and column n of B, each split into segments]


Matrix Multiplication using MapReduce

The "improved" version:

  • More parallelization and more scalable.

Mapper:

  • map( ((m, row segment), (n, column segment)) ) -> ( (m,n), sum(row segment * column segment) )

Reducer:

  • reduce( (m,n), [partial_sum_1, partial_sum_2, ...] ) -> ( (m,n), sum(partial_sum_i) )

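
The same toy example, now splitting each dot product into segments so more machines can work in parallel:

    def mm2_map(task):
        (m, row_segment), (n, col_segment) = task
        partial = sum(a * b for a, b in zip(row_segment, col_segment))
        yield ((m, n), partial)

    def mm2_reduce(mn, partial_sums):
        yield (mn, sum(partial_sums))

    # C[0][0] for A = [[1, 2]] and B = [[3], [4]], with the dot product split in two:
    tasks = [((0, [1]), (0, [3])),     # first segment of row 0 and column 0
             ((0, [2]), (0, [4]))]     # second segment
    print(run_mapreduce(tasks, mm2_map, mm2_reduce))
    # [((0, 0), 11)]   since 1*3 + 2*4 = 11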


Connected Component

Definition: Given a graph, find the "components" where each is a set of nodes connected to one another through a path.

Three connected components: {A, B, D, E, F}, {C}, {G, H, I}.


Connected Components

Input: The set of graph edges, {(x, y), where x, y are nodes}.

Output: The collection of components, i.e.,

{ ( component 1, [node1, node2, ...] ), ... }

Algorithm Ideas:

  • Each component has a leader.
    • The node with the "smallest" id.
  • Update the "leader" mapping until convergence.


Connected Components using MapReduce

Initialize the "leader map": each node is its own leader.

  Node | Leader
  -----+-------
  a    | a
  b    | b
  c    | c
  d    | d
  e    | e
  f    | f


Connected Components using MapReduce

Initialize the "leader map": each node is its own leader.

Mapper:

  • map( (u, v) ) -> ( max(u, v), Leader(min(u, v)) )
    • Intuitively, follow the leader's leader.

Reducer:

  • reduce( node, [leader1, leader2, ...] ) -> ( node, min(leader1, leader2, ...) )

Update the "leader map" with reducer's output.

Repeat the Mapper/Reducer step, until convergence.
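
A runnable sketch of the whole loop with run_mapreduce; the leader map is a plain dict here, standing in for the distributed table:

    edges = [("a", "b"), ("a", "d"), ("a", "e"), ("d", "e"), ("d", "f"), ("e", "f")]
    leader = {n: n for edge in edges for n in edge}    # each node is its own leader

    def cc_map(edge):
        u, v = min(edge), max(edge)
        yield (v, leader[u])                           # follow the leader's leader

    def cc_reduce(node, leaders):
        yield (node, min(leaders))

    while True:
        updates = dict(run_mapreduce(edges, cc_map, cc_reduce))
        new_leader = {n: min(l, updates.get(n, l)) for n, l in leader.items()}
        if new_leader == leader:                       # convergence: no changes
            break
        leader = new_leader

    print(leader)
    # {'a': 'a', 'b': 'a', 'd': 'a', 'e': 'a', 'f': 'a'}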


Connected Components w/ MapReduce (Illustration: Iter 1)

Mapper: for each edge (u, v), emit ( max(u, v), leader(min(u, v)) ):

  Edge   | (Node, Leader) update
  -------+------------------------
  (a, b) | (b, leader(a)) = (b, a)
  (a, d) | (d, leader(a)) = (d, a)
  (a, e) | (e, leader(a)) = (e, a)
  (d, e) | (e, leader(d)) = (e, d)
  (d, f) | (f, leader(d)) = (f, d)
  (e, f) | (f, leader(e)) = (f, e)

Leader map (at the start of iteration 1):

  Node | Leader
  -----+-------
  a    | a
  b    | b
  c    | c
  d    | d
  e    | e
  f    | f

Reducer:

  Input       | Output
  ------------+-------
  (b, [a])    | (b, a)
  (d, [a])    | (d, a)
  (e, [a, d]) | (e, a)
  (f, [e, d]) | (f, d)


Connected Components w/ MapReduce (Illustration: Iter 1)

Mapper: for each edge (u, v), emit ( max(u, v), leader(min(u, v)) ):

  Edge   | (Node, Leader) update
  -------+------------------------
  (a, b) | (b, leader(a)) = (b, a)
  (a, d) | (d, leader(a)) = (d, a)
  (a, e) | (e, leader(a)) = (e, a)
  (d, e) | (e, leader(d)) = (e, d)
  (d, f) | (f, leader(d)) = (f, d)
  (e, f) | (f, leader(e)) = (f, e)

Leader map (updated after iteration 1):

  Node | Leader
  -----+-------
  a    | a
  b    | a
  c    | c
  d    | a
  e    | a
  f    | d

Reducer:

  Input       | Output
  ------------+-------
  (b, [a])    | (b, a)
  (d, [a])    | (d, a)
  (e, [a, d]) | (e, a)
  (f, [e, d]) | (f, d)


Connected Components w/ MapReduce (Illustration: Iter 2)

Mapper: for each edge (u, v), emit ( max(u, v), leader(min(u, v)) ):

  Edge   | (Node, Leader) update
  -------+------------------------
  (a, b) | (b, leader(a)) = (b, a)
  (a, d) | (d, leader(a)) = (d, a)
  (a, e) | (e, leader(a)) = (e, a)
  (d, e) | (e, leader(d)) = (e, a)
  (d, f) | (f, leader(d)) = (f, a)
  (e, f) | (f, leader(e)) = (f, a)

Leader map (at the start of iteration 2):

  Node | Leader
  -----+-------
  a    | a
  b    | a
  c    | c
  d    | a
  e    | a
  f    | d

Reducer:

  Input    | Output
  ---------+-------
  (b, [a]) | (b, a)
  (d, [a]) | (d, a)
  (e, [a]) | (e, a)
  (f, [a]) | (f, a)


Connected Components w/ MapReduce (Illustration: Iter 2)

Mapper: for each edge (u, v), emit ( max(u, v), leader(min(u, v)) ):

  Edge   | (Node, Leader) update
  -------+------------------------
  (a, b) | (b, leader(a)) = (b, a)
  (a, d) | (d, leader(a)) = (d, a)
  (a, e) | (e, leader(a)) = (e, a)
  (d, e) | (e, leader(d)) = (e, a)
  (d, f) | (f, leader(d)) = (f, a)
  (e, f) | (f, leader(e)) = (f, a)

Leader map (updated after iteration 2; no further changes, so we have converged):

  Node | Leader
  -----+-------
  a    | a
  b    | a
  c    | c
  d    | a
  e    | a
  f    | a

Reducer:

  Input    | Output
  ---------+-------
  (b, [a]) | (b, a)
  (d, [a]) | (d, a)
  (e, [a]) | (e, a)
  (f, [a]) | (f, a)


(Recap) Examples of MapReduce-based Algorithms

Web search:

  • Inverted index

Data processing:

  • Sort
  • Average value per key

Machine learning:

  • K-means clustering
  • Matrix multiplication

Graph mining:

  • Connected component