Distributed and Cloud Computing
K. Hwang, G. Fox and J. Dongarra

Chapter 6: Cloud Programming and Software Environments
Part 1

Adapted from Kai Hwang, University of Southern California
with additions from Matei Zaharia, EECS, UC Berkeley

November 25, 2012
1
Copyright © 2012, Elsevier Inc. All rights reserved.
1 - 1
FEATURES OF CLOUD AND GRID PLATFORMS
2
Cloud Capabilities and Platform Features
3
Cloud Capabilities and Platform Features
4
Cloud Capabilities and Platform Features
5
Cloud Capabilities and Platform Features
6
Traditional Features Common to Grids and Clouds
7
Security, Privacy, and Availability
8
Data Features and Databases
9
10
Programming and Runtime Support
11
Parallel Computing and Programming Environments
12
Parallel Computing and Programming Environments
13
What is MapReduce?
14
What is MapReduce used for?
15
Motivation: Large Scale Data Processing
16
What is MapReduce used for?
17
Distributed Grep
Diagram: very big data is divided into four splits; grep runs on each split in parallel, producing per-split matches; cat concatenates them into all matches.
grep is a command-line utility for searching plain-text data sets for lines matching a regular expression.
cat is a standard Unix utility that concatenates and lists files.
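The dataflow in this diagram can be sketched in a few lines of Python; the splitting, per-split grep, and final concatenation would normally run on separate machines, so this single-process version is only an illustration:

```python
import re

def distributed_grep(lines, pattern, n_splits=4):
    """Mimic the slide's dataflow: split the data, run grep on each
    split independently (the map phase), then concatenate all the
    per-split matches (the cat phase)."""
    regex = re.compile(pattern)
    # Split data: carve the input into n_splits roughly equal pieces.
    size = max(1, -(-len(lines) // n_splits))  # ceiling division
    splits = [lines[i:i + size] for i in range(0, len(lines), size)]
    # grep: each "worker" filters its split for matching lines.
    matches_per_split = [[l for l in split if regex.search(l)]
                         for split in splits]
    # cat: concatenate per-split matches into the final result.
    return [m for matches in matches_per_split for m in matches]

data = ["error: disk full", "ok", "error: timeout", "done"]
print(distributed_grep(data, r"^error"))
```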
18
Distributed Word Count
Diagram: very big data is divided into four splits; count runs on each split; the partial counts are combined in stages and merged into a single merged count.
19
Map+Reduce
Diagram: very big data flows through the MAP phase; a partitioning function routes the intermediate output to the REDUCE phase, which produces the result.
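The partitioning function in the diagram decides which of the R reduce workers receives each intermediate key; the default described in the MapReduce paper is hash(key) mod R. A minimal sketch (the stable string hash is our own, since Python's built-in hash() is salted per process):

```python
def partition(key: str, R: int) -> int:
    """Assign an intermediate key to one of R reduce partitions.
    The same key always lands in the same partition, so all values
    for a key meet at a single reduce worker."""
    # Simple deterministic string hash for reproducibility.
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    return h % R

R = 4
assert partition("apple", R) == partition("apple", R)  # deterministic
assert 0 <= partition("banana", R) < R                 # in range
```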
20
Architecture overview
Diagram: the user submits a job to the job tracker on the master node; task trackers on slave nodes 1 through N each coordinate the workers on their node.
25
GFS: underlying storage system
26
GFS architecture
Diagram: a client asks the GFS master for chunk locations, then reads data directly from the chunkservers; chunks C0-C5 are stored with replication across chunkservers 1 through N.
27
Functions in the Model
28
Programming Concept
29
30
Fig. 6.5 Dataflow Implementation of MapReduce
31
A Simple Example
map(String key, String value)
    // key: document name
    // value: document contents
    for each word w in value
        EmitIntermediate(w, "1");

reduce(String key, Iterator values)
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values
        result += ParseInt(v);
    Emit(AsString(result));
The map function emits each word w plus an associated count of occurrences (just a "1" is recorded in this pseudo-code).
The reduce function sums together all counts emitted for a particular word
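The pseudo-code above can be exercised as a tiny, single-process Python sketch; the explicit grouping dict stands in for the framework's shuffle phase (a minimal illustration, not Google's implementation):

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    # Emit (word, "1") for every word, as in the slide's map().
    return [(w, "1") for w in contents.split()]

def reduce_fn(word, counts):
    # Sum all counts emitted for one word, as in the slide's reduce().
    return str(sum(int(c) for c in counts))

docs = {"d1": "the cat sat", "d2": "the cat ran"}
# Shuffle: group intermediate pairs by key (done by the framework).
groups = defaultdict(list)
for name, text in docs.items():
    for k, v in map_fn(name, text):
        groups[k].append(v)
result = {w: reduce_fn(w, vs) for w, vs in groups.items()}
print(result)  # {'the': '2', 'cat': '2', 'sat': '1', 'ran': '1'}
```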
32
A Word Counting Example on <Key, Count> Distribution
33
How Does It Work?
34
MapReduce : Operation Steps
When the user program calls the MapReduce function, the following sequence of actions occurs :
1) The MapReduce library in the user program first splits the input files into M pieces, typically 16 to 64 megabytes (MB) per piece. It then starts up many copies of the program on a cluster of machines.
2) One of the copies of the program is the master. The rest are workers that are assigned work by the master.
35
MapReduce : Operation Steps
3) A worker who is assigned a map task reads the contents of the corresponding input split, parses key/value pairs out of the input data, and passes each pair to the user-defined Map function.
The intermediate key/value pairs produced by the Map function are buffered in memory.
4) The buffered pairs are written to local disk, partitioned into R regions by the partitioning function.
The locations of these buffered pairs on the local disk are passed back to the master, which forwards them to the reduce workers.
36
MapReduce : Operation Steps
5) When a reduce worker is notified by the master about these locations, it reads the buffered data from the local disks of the map workers.
When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together.
6) The reduce worker iterates over the sorted intermediate data and for each unique intermediate key, it passes the key and the corresponding set of intermediate values to the user’s Reduce function.
The output of the Reduce function is appended to a final output file.
37
MapReduce : Operation Steps
7) When all map tasks and reduce tasks have been completed, the master wakes up the user program.
At this point, the MapReduce call in the user program returns to the user code.
After successful completion, the output of the MapReduce execution is available in the R output files.
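The seven steps can be walked through in a single-process sketch; partitions, sorting, and the R output files are modeled with plain Python data structures (an illustration of the control flow, not a faithful implementation):

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn, M=4, R=2):
    """Toy single-process walk-through of the steps above: split the
    input into M pieces, map each piece, partition intermediate pairs
    into R regions, then sort and reduce each region."""
    # Step 1: split the input into M pieces.
    size = max(1, -(-len(records) // M))  # ceiling division
    pieces = [records[i:i + size] for i in range(0, len(records), size)]
    # Steps 3-4: map workers write intermediate pairs into R partitions.
    partitions = [defaultdict(list) for _ in range(R)]
    for piece in pieces:
        for rec in piece:
            for k, v in map_fn(rec):
                partitions[hash(k) % R][k].append(v)
    # Steps 5-6: each reduce worker sorts its region by key and reduces.
    outputs = []
    for region in partitions:
        outputs.append({k: reduce_fn(k, region[k]) for k in sorted(region)})
    # Step 7: the R "output files" are returned to the caller.
    return outputs

words = "to be or not to be".split()
outs = run_mapreduce(words, lambda w: [(w, 1)], lambda k, vs: sum(vs))
merged = {k: v for out in outs for k, v in out.items()}
print(merged)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1} in some order
```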
38
Logical Data Flow in 5 Processing Steps in MapReduce Process
(Key, Value) pairs are generated by the Map function over multiple available Map workers (VM instances). These pairs are then sorted and grouped based on key ordering. Different key groups are then processed by multiple Reduce workers in parallel.
39
Locality issue
40
Fault Tolerance
41
Fault Tolerance
42
Fault Tolerance
43
Status monitor
44
Points to Be Emphasized
45
Other Examples
46
MapReduce Implementations
Cluster: (1) Google MapReduce; (2) Apache Hadoop
Multicore CPU: Phoenix @ Stanford
GPU: Mars @ HKUST
47
Hadoop: a software platform originally developed by Yahoo! that enables users to write and run applications over vast amounts of distributed data.
Attractive Features in Hadoop :
48
Typical Hadoop Cluster
Aggregation switch
Rack switch
Image from http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/YahooHadoopIntro-apachecon-us-2008.pdf
49
Typical Hadoop Cluster
Image from http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/aw-apachecon-eu-2009.pdf
50
Challenges
Cheap nodes fail, especially if you have many
Commodity network = low bandwidth
Programming distributed systems is hard
51
Hadoop Components
52
Hadoop Distributed File System
Diagram: File1 is split into blocks 1-4. The Namenode maps the file to its blocks; each block is stored with 3-way replication across four Datanodes (e.g., blocks 1, 2, 4 on one node; 2, 1, 3 on another; 1, 4, 3 and 3, 2, 4 on the others).
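Replica placement like the diagram's (each block stored on three Datanodes) can be sketched as a toy allocator; real HDFS placement is rack-aware and handled by the Namenode, so this round-robin scheme is only an assumption for illustration:

```python
from itertools import cycle

def place_blocks(n_blocks, datanodes, replication=3):
    """Toy replica placement: each block gets `replication` copies on
    distinct datanodes, spread round-robin so no node holds every block.
    (Real HDFS placement is rack-aware; this is a simplification.)"""
    ring = cycle(range(len(datanodes)))
    placement = {}
    for b in range(1, n_blocks + 1):
        start = next(ring)
        placement[b] = [datanodes[(start + i) % len(datanodes)]
                        for i in range(replication)]
    return placement

print(place_blocks(4, ["dn1", "dn2", "dn3", "dn4"]))
```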
53
54
Secure Query Processing with Hadoop/MapReduce
55
Higher-level Languages over Hadoop: Pig and Hive
56
Motivation
57
Pig
58
An Example Problem
Suppose you have user data in one file, page view data in another, and you need to find the top 5 most-visited pages by users aged 18-25.
Load Users
Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
59
In MapReduce
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
60
In Pig Latin
Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, count(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;

store Top5 into 'top5sites';
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
61
Ease of Translation
Notice how naturally the components of the job translate into Pig Latin.
Load Users
Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load …
Filtered = filter …
Pages = load …
Joined = join …
Grouped = group …
Summed = … count() …
Sorted = order …
Top5 = limit …
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
62
Ease of Translation
Notice how naturally the components of the job translate into Pig Latin.
Load Users
Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load …
Filtered = filter …
Pages = load …
Joined = join …
Grouped = group …
Summed = … count() …
Sorted = order …
Top5 = limit …
Job 1
Job 2
Job 3
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
63
Hive
64
Sample Hive Queries
SELECT p.url, COUNT(1) as clicks
FROM users u JOIN page_views p ON (u.name = p.user)
WHERE u.age >= 18 AND u.age <= 25
GROUP BY p.url
ORDER BY clicks
LIMIT 5;
SELECT TRANSFORM(p.user, p.date)
USING 'map_script.py'
AS dt, uid CLUSTER BY dt
FROM page_views p;
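The second query streams rows through 'map_script.py'. Hive's TRANSFORM passes one tab-separated row per line on stdin and reads tab-separated rows back from stdout; a hypothetical sketch of such a script (the actual transformation it performs is an assumption) might look like:

```python
import sys

def transform(line: str) -> str:
    """Swap a tab-separated (user, date) row into (dt, uid) order,
    matching the AS dt, uid clause in the query above. The real
    work map_script.py does is an assumption."""
    user, date = line.rstrip("\n").split("\t")
    return f"{date}\t{user}"

def main(stdin=sys.stdin):
    # Under Hive: one input row per stdin line, one output row per print.
    for line in stdin:
        print(transform(line))

# Local demonstration instead of reading stdin:
print(transform("alice\t2012-11-25"))
```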
65
Amazon Elastic MapReduce
66
Elastic MapReduce Workflow
67
Elastic MapReduce Workflow
68
Elastic MapReduce Workflow
69
Elastic MapReduce Workflow
70
Conclusions
71
Resources
72