
Sri Raghavendra Educational Institutions Society (R)

(Approved by AICTE, Accredited by NAAC, Affiliated to VTU, Karnataka)

Sri Krishna Institute of Technology

www.skit.org.in

Prepared by:

Latha

Course: Cloud Computing

Department: Computer Science & Engineering


Module-5: FEATURES OF CLOUD AND GRID PLATFORMS


CO Addressed: CO5


Cloud Capabilities and Platform Features

  • Important Cloud Platform Capabilities
    • Physical or virtual computing platform
    • Massive data storage service, distributed file system
    • Massive database storage service
    • Massive data processing method and programming model
    • Programming interface and service deployment
    • …….


Traditional Features Common to Grids and Clouds

  • Workflow
    • e.g., Pipeline Pilot, AVS (dated), the LIMS environments, and Trident, which runs on Azure
  • Data Transport
    • High-bandwidth links between clouds and the TeraGrid.
  • Security, Privacy, and Availability
    • Use virtual clustering for dynamic resource provisioning with minimum overhead cost
    • Use stable and persistent data storage with fast queries for information retrieval.
    • Use special APIs for authenticating users and access cloud resources with security protocols such as HTTPS and SSL.
    • Fine-grained access control to protect data integrity and deter intruders or hackers.
    • Shared data sets are protected from malicious alteration, deletion, or copyright violations.
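The fine-grained access control mentioned above can be sketched as a simple ACL lookup. This is a toy illustration only; the names and structure are hypothetical and do not reflect any cloud provider's actual API:

```python
# Hypothetical fine-grained ACL: (principal, resource) -> allowed operations.
ACL = {
    ("alice", "dataset/genome"): {"read", "write"},
    ("bob",   "dataset/genome"): {"read"},
}

def is_allowed(principal, resource, operation):
    """Deny by default: return True only if the ACL explicitly
    grants this operation on this resource to this principal."""
    return operation in ACL.get((principal, resource), set())
```

Deny-by-default lookups like this are what protect shared data sets from malicious alteration or deletion by unauthorized principals.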


Data Features and Databases

  • Program Library
  • Blobs and Drives
    • Blobs for Azure and S3 for Amazon, both organized with the help of containers (called buckets in S3)
  • DPFS
    • Support for distributed file systems such as the Google File System (used by MapReduce), HDFS (Hadoop), and Cosmos (Dryad), with compute-data affinity optimized for data processing.
  • SQL and Relational Databases
    • In Azure and Amazon, the database is installed on a separate VM, independent of your job (worker roles in Azure). This implements “SQL as a Service.”


  • Table and NOSQL Nonrelational Databases
    • “NOSQL” emphasizes distribution and scalability.
    • Table services in the three major clouds: BigTable in Google, SimpleDB in Amazon, and Azure Table for Azure.
    • All these tables are schema-free (each record can have different properties), with BigTable having schema for column (property) families.
  • Queuing Services
    • The messages are short (less than 8 KB) and have a Representational State Transfer (REST) service interface with “deliver at least once” semantics.
    • Delivery is controlled by timeouts that bound the length of time a client is allowed to process a message before it is redelivered.
    • e.g., publish-subscribe systems such as ActiveMQ
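The queuing semantics above can be condensed into a toy in-memory model (the class and its methods are hypothetical, not a real cloud SDK): messages must be shorter than 8 KB, and a visibility timeout makes an unacknowledged message reappear, which is exactly the “deliver at least once” behavior:

```python
import time

class SimpleQueue:
    """Toy cloud queue sketch: short messages, at-least-once delivery
    governed by a per-message visibility timeout."""

    def __init__(self, visibility_timeout=30.0):
        self.visibility_timeout = visibility_timeout
        self._messages = []          # each entry: [body, invisible_until]

    def put(self, body):
        if len(body.encode()) >= 8 * 1024:
            raise ValueError("message must be shorter than 8 KB")
        self._messages.append([body, 0.0])

    def get(self, now=None):
        """Return the first visible message and hide it for the timeout.
        If the client never deletes it, it reappears (at-least-once)."""
        now = time.time() if now is None else now
        for msg in self._messages:
            if msg[1] <= now:
                msg[1] = now + self.visibility_timeout
                return msg[0]
        return None

    def delete(self, body):
        """Acknowledge a processed message so it is never redelivered."""
        self._messages = [m for m in self._messages if m[0] != body]
```

A client that crashes before calling `delete` simply lets the timeout expire, and the message becomes visible to the next worker.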


Programming and Runtime Support

  • Worker and Web Roles
  • MapReduce
  • Cloud Programming Models
  • SaaS


PARALLEL AND DISTRIBUTED PROGRAMMING PARADIGMS

Parallel Computing and Programming Paradigms

  • Partitioning
    • Computation partitioning
    • Data partitioning
  • Mapping
  • Synchronization
  • Communication
  • Scheduling
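The first two steps, data partitioning and mapping, can be illustrated with a small sketch. Worker execution is only simulated here; in a real system each chunk would be dispatched to a separate node:

```python
def partition_data(items, n_workers):
    """Data partitioning: split the input into n_workers contiguous
    chunks whose sizes differ by at most one element."""
    size, rem = divmod(len(items), n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        end = start + size + (1 if i < rem else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def run_partitioned(items, task, n_workers):
    """Mapping: assign each chunk to a (simulated) worker running
    `task`, then collect the per-worker results."""
    results = []
    for chunk in partition_data(items, n_workers):
        results.append(task(chunk))   # a real runtime executes this remotely
    return results
```

Synchronization, communication, and scheduling are what the runtime adds on top of this skeleton when the workers are real machines.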


MapReduce, Twister, and Iterative MapReduce


Formal Definition of MapReduce

MapReduce Logical Data Flow


Formal Notation of MapReduce Data Flow

MapReduce Actual Data and Control Flow

  1. Data partitioning
  2. Computation partitioning
  3. Determining the master and workers
  4. Reading the input data (data distribution)
  5. Map function
  6. Combiner function
  7. Partitioning function
  8. Synchronization
  9. Communication
  10. Sorting and Grouping
  11. Reduce function
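Steps 5 through 11 above can be condensed into a single-process word-count sketch. Function names are illustrative, and the “shuffle” is just an in-memory structure standing in for network communication:

```python
from collections import defaultdict

def map_fn(text):
    """Map (step 5): emit (word, 1) for every word in one input split."""
    return [(w, 1) for w in text.split()]

def combine(pairs):
    """Combiner (step 6): pre-sum counts locally before the shuffle."""
    local = defaultdict(int)
    for k, v in pairs:
        local[k] += v
    return list(local.items())

def partition_fn(key, n_reducers):
    """Partitioning function (step 7): route each key to one reducer."""
    return hash(key) % n_reducers

def reduce_fn(key, values):
    """Reduce (step 11): sum all counts for one key."""
    return key, sum(values)

def mapreduce(splits, n_reducers=2):
    # map + combine on each split, then "communicate" to reducer buckets
    shuffled = [defaultdict(list) for _ in range(n_reducers)]
    for split in splits:
        for k, v in combine(map_fn(split)):
            shuffled[partition_fn(k, n_reducers)][k].append(v)
    # sort and group (step 10), then reduce inside each partition
    out = {}
    for part in shuffled:
        for k in sorted(part):
            key, total = reduce_fn(k, part[k])
            out[key] = total
    return out
```

Note that the final result is independent of how `partition_fn` distributes keys; partitioning only balances load across reducers.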


Hadoop Library from Apache

  • Hadoop is an open source implementation of MapReduce coded and released in Java by Apache.
  • The Hadoop core is divided into two fundamental layers:
    • the MapReduce engine, its computation engine
    • HDFS, its data storage manager
  1. HDFS Architecture: HDFS has a master/slave architecture containing a single NameNode as the master and a number of DataNodes as workers (slaves).
  2. HDFS splits each file into fixed-size blocks and stores the blocks on workers (DataNodes); the mapping of blocks to DataNodes is determined by the NameNode.
  3. HDFS Fault Tolerance: block replication, replica placement, and Heartbeat and Blockreport messages
  4. HDFS High-Throughput Access to Large Data Sets (Files)
  5. HDFS Operation
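A toy NameNode illustrates points 1 and 2 above: the master holds only metadata, splitting a file into fixed-size blocks and recording which DataNodes hold each replica. The round-robin placement here is a simplification; real HDFS placement is rack-aware:

```python
BLOCK_SIZE = 64 * 1024 * 1024   # fixed block size; configurable in real HDFS

class NameNode:
    """Toy NameNode sketch: keeps only metadata, i.e. the mapping
    from (filename, block index) to the DataNodes holding replicas."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication
        self.block_map = {}          # (filename, block_index) -> [datanodes]

    def add_file(self, filename, size, block_size=BLOCK_SIZE):
        """Split the file into fixed-size blocks and place replicas.
        Returns the number of blocks created."""
        n_blocks = -(-size // block_size)        # ceiling division
        for i in range(n_blocks):
            # simplistic round-robin placement (real HDFS is rack-aware)
            replicas = [self.datanodes[(i + r) % len(self.datanodes)]
                        for r in range(self.replication)]
            self.block_map[(filename, i)] = replicas
        return n_blocks
```

The actual block contents never pass through the NameNode; clients read and write blocks directly against the DataNodes the metadata points to.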


Architecture of MapReduce in Hadoop


  • The topmost layer of Hadoop is the MapReduce engine that manages the data flow and control flow of MapReduce jobs over distributed computing systems.
  • MapReduce engine has a master/slave architecture consisting of a single JobTracker as the master and a number of TaskTrackers as the slaves (workers).
  • The JobTracker manages the MapReduce job over a cluster and is responsible for monitoring jobs and assigning tasks to TaskTrackers.
  • The TaskTracker manages the execution of the map and/or reduce tasks on a single computation node in the cluster.
  • Each TaskTracker node has a number of simultaneous execution slots, each executing either a map or a reduce task. Slots are defined as the number of simultaneous threads supported by CPUs of the TaskTracker node.
  • There is a one-to-one correspondence between map tasks in a TaskTracker and data blocks in the respective DataNode.
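The JobTracker's preference for data-local map tasks can be sketched as follows. This is a hypothetical assignment routine, not Hadoop's actual scheduler code:

```python
def assign_task(tasktracker_node, map_tasks, block_locations):
    """JobTracker-style assignment sketch: when a TaskTracker has a
    free slot, prefer a map task whose input block is stored on that
    node's own DataNode, so the map reads its block locally."""
    for task in map_tasks:
        if tasktracker_node in block_locations[task]:
            map_tasks.remove(task)
            return task, True            # data-local: block read from local disk
    # no local work left: hand out any remaining task (remote read)
    return (map_tasks.pop(0), False) if map_tasks else (None, False)
```

Preferring local tasks is what turns the one-to-one mapping between map tasks and data blocks into an actual bandwidth saving: most map input never crosses the network.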


Running a Job in Hadoop

Three components contribute to running a job in this system: a user node, a JobTracker, and several TaskTrackers.

  1. Job Submission
  2. Task assignment
  3. Task execution
  4. Task running check


Dryad and DryadLINQ from Microsoft


LINQ-expression execution in DryadLINQ.


Sawzall and Pig Latin High-Level Languages


Programming the Google App Engine


Google File System (GFS)


  • The size of the web data crawled and saved is very large.
  • Google chose its file data block (chunk) size to be 64 MB.
  • Files are typically written once; most write operations append data blocks to the end of files.
  • Reliability is achieved by replication: each chunk or data block of a file is replicated across three or more chunk servers.
  • A single master coordinates access as well as keeping the metadata.
  • There is no data cache in GFS.
  • Chunk servers store the data, while the single master stores the metadata.
  • The file system namespace and locking facilities are managed by the master.
  • The master periodically communicates with the chunk servers to collect management information and to give the chunk servers instructions for work such as load balancing or failure recovery.


The data mutation takes the following steps:

1. The client asks the master which chunk server holds the current lease for the chunk and the locations of the other replicas. If no one has a lease, the master grants one to a replica it chooses (not shown).

2. The master replies with the identity of the primary and the locations of the other (secondary) replicas. The client caches this data for future mutations. It needs to contact the master again only when the primary becomes unreachable or replies that it no longer holds a lease.

3. The client pushes the data to all the replicas. Each chunk server will store the data in an internal LRU buffer cache until the data is used or aged out.

4. Once all the replicas have acknowledged receiving the data, the client sends a write request to the primary. The request identifies the data pushed earlier to all the replicas. The primary assigns consecutive serial numbers to all the mutations it receives.

5. The primary forwards the write request to all secondary replicas. Each secondary replica applies mutations in the same serial number order assigned by the primary.

6. The secondaries all reply to the primary indicating that they have completed the operation.

7. The primary replies to the client. Any errors encountered at any of the replicas are reported to the client. In case of errors, the write may have succeeded at the primary and an arbitrary subset of the secondary replicas; the client request is considered to have failed, and the modified region is left in an inconsistent state.
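Steps 4 and 5, in which the primary assigns consecutive serial numbers and every replica applies mutations in exactly that order, can be modeled with a small simulation. The class names are illustrative, not GFS's implementation:

```python
class Replica:
    """Toy GFS replica: applies mutations strictly in serial-number
    order, buffering any mutation that arrives out of order."""

    def __init__(self):
        self.data = []           # the applied mutation log
        self.next_serial = 0     # next serial number we may apply
        self.pending = {}        # out-of-order mutations awaiting a gap fill

    def apply(self, serial, mutation):
        self.pending[serial] = mutation
        while self.next_serial in self.pending:
            self.data.append(self.pending.pop(self.next_serial))
            self.next_serial += 1

def primary_write(mutations, secondaries):
    """Primary assigns consecutive serial numbers (step 4) and
    forwards each mutation to all secondary replicas (step 5)."""
    primary = Replica()
    for serial, m in enumerate(mutations):
        primary.apply(serial, m)
        for s in secondaries:
            s.apply(serial, m)
    return primary
```

Because every replica sorts by the primary's serial numbers, all replicas that complete the write end up with identical mutation order, even if network delivery reorders messages.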


BigTable, Google’s NOSQL System

  • BigTable provides service for storing and retrieving structured and semistructured data.
  • BigTable applications include storage of web pages, per-user data, and geographic locations.
  • It is viewed as a distributed multilevel map: a scalable, fault-tolerant, and persistent database offered as a storage service.
  • BigTable is a self-managing system.
  • The BigTable system is built on top of an existing Google cloud infrastructure. BigTable uses the following building blocks:

1. GFS: stores persistent state

2. Scheduler: schedules jobs involved in BigTable serving

3. Lock service: master election, location bootstrapping

4. MapReduce: often used to read/write BigTable data
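The “distributed multilevel map” view of BigTable can be sketched as a nested dictionary keyed by row, column, and timestamp. This is a toy single-machine model, not Google's implementation:

```python
class ToyBigTable:
    """Sketch of a BigTable-style multilevel map:
    (row key, "family:qualifier" column, timestamp) -> value.
    Schema-free: each row may carry a different set of columns."""

    def __init__(self):
        self.cells = {}   # row -> column -> {timestamp: value}

    def put(self, row, column, timestamp, value):
        """Store one versioned cell; old versions are kept."""
        self.cells.setdefault(row, {}).setdefault(column, {})[timestamp] = value

    def get(self, row, column):
        """Return the most recent version of the cell, or None."""
        versions = self.cells.get(row, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]
```

Storing multiple timestamped versions per cell is what lets a web-page table, for example, keep several crawls of the same URL under one row key.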


Tablet Location Hierarchy


The first level is a file stored in Chubby that contains the location of the root tablet, which contains the location of all tablets in a special METADATA table.

Each METADATA tablet contains the location of a set of user tablets.

The root tablet is just the first tablet in the METADATA table, which is never split to ensure that the tablet location hierarchy has no more than 3 levels.

The METADATA table stores the location of a tablet under a row key that is an encoding of the tablet’s table identifier and its end row.

BigTable includes many optimizations and fault-tolerant features. Chubby can guarantee the availability of the file for finding the root tablet. The BigTable master can quickly scan the tablet servers to determine the status of all nodes.
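The three-level lookup above can be sketched as follows. Tablet lists are modeled as sorted (end row, location) pairs, and all names here are hypothetical:

```python
import bisect

def find_tablet(tablets, row_key):
    """Tablets are (end_row, location) pairs sorted by end_row; a row
    belongs to the first tablet whose end row is >= the row key.
    Assumes row_key is no greater than the last tablet's end row."""
    ends = [end for end, _ in tablets]
    return tablets[bisect.bisect_left(ends, row_key)][1]

def locate(chubby_root, metadata_tablets, user_tablets, row_key):
    """Three-level hierarchy: a Chubby file names the root tablet,
    the root tablet indexes METADATA tablets, and a METADATA tablet
    indexes the user tablet holding the row."""
    meta_loc = find_tablet(metadata_tablets[chubby_root], row_key)  # level 2
    return find_tablet(user_tablets[meta_loc], row_key)             # level 3
```

Because the root tablet is never split, this hierarchy is always exactly three levels deep, and a client that caches tablet locations rarely needs to walk all three.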


Chubby, Google’s Distributed Lock Service
