1 of 50

Sri Raghavendra Educational Institutions Society (R)

(Approved by AICTE, Accredited by NAAC, Affiliated to VTU, Karnataka)

Sri Krishna Institute of Technology

www.skit.org.in

Prepared by:

Latha

Course: Cloud Computing

Department: Computer Science & Engineering

2 of 50

Module-5: FEATURES OF CLOUD AND GRID PLATFORMS

CO Addressed: CO5

3 of 50

Cloud Capabilities and Platform Features

  • Important Cloud Platform Capabilities
    • Physical or virtual computing platform
    • Massive data storage service, distributed file system
    • Massive database storage service
    • Massive data processing method and programming model
    • Programming interface and service deployment
    • …….

4 of 50

Traditional Features Common to Grids and Clouds

  • Workflow
    • e.g., Pipeline Pilot, AVS (dated), and the LIMS environments; Trident runs on Azure
  • Data Transport
    • High-bandwidth links between clouds and TeraGrid.
  • Security, Privacy, and Availability
    • Use virtual clustering for dynamic resource provisioning with minimum overhead cost
    • Use stable and persistent data storage with fast queries for information retrieval.
    • Use special APIs for authenticating users
    • Access cloud resources with security protocols such as HTTPS and SSL.
    • Fine-grained access control to protect data integrity and deter intruders or hackers.
    • Shared data sets are protected from malicious alteration, deletion, or copyright violations.

5 of 50

Data Features and Databases

  • Program Library
  • Blobs and Drives
    • The basic storage concept is blobs in Azure and S3 in Amazon; stored items are organized with the help of containers
  • DPFS
    • Supports file systems such as Google File System (MapReduce), HDFS (Hadoop), and Cosmos (Dryad), with compute-data affinity optimized for data processing.
  • SQL and Relational Databases
    • On Azure and Amazon, the database is installed on a separate VM, independent of your job (worker roles in Azure). This implements “SQL as a Service.”

6 of 50

  • Table and NOSQL Nonrelational Databases
    • “NOSQL” emphasizes distribution and scalability.
    • Offered by the three major clouds: BigTable in Google, SimpleDB in Amazon, and Azure Table in Azure.
    • All these tables are schema-free (each record can have different properties), with BigTable having schema for column (property) families.
  • Queuing Services
    • Messages are short (less than 8 KB) and are accessed through a Representational State Transfer (REST) service interface with “deliver at least once” semantics.
    • Messages are controlled by timeouts that set the length of time a client is allowed to process them.
    • A similar approach can be built on publish-subscribe systems such as ActiveMQ
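
The "deliver at least once" and timeout bullets above can be illustrated with a small in-memory sketch. It is not any cloud provider's API: the class name, method names, and the use of an explicit `now` argument in place of a real clock are all assumptions made for illustration. A message that is received but not deleted before its timeout expires becomes visible again and is redelivered.

```python
import itertools


class AtLeastOnceQueue:
    """Toy in-memory queue with a processing timeout (hypothetical, not a cloud API)."""

    MAX_BYTES = 8 * 1024  # "less than 8 KB" limit quoted on the slide

    def __init__(self, timeout):
        self.timeout = timeout
        self._ids = itertools.count(1)
        self._messages = {}         # message id -> body
        self._invisible_until = {}  # message id -> time at which it is visible again

    def put(self, body):
        if len(body.encode("utf-8")) >= self.MAX_BYTES:
            raise ValueError("message must be smaller than 8 KB")
        msg_id = next(self._ids)
        self._messages[msg_id] = body
        self._invisible_until[msg_id] = 0.0
        return msg_id

    def get(self, now):
        """Return (id, body) of the oldest visible message and hide it, or None."""
        for msg_id, body in self._messages.items():
            if self._invisible_until[msg_id] <= now:
                self._invisible_until[msg_id] = now + self.timeout
                return msg_id, body
        return None

    def delete(self, msg_id):
        """Called by the client once processing has succeeded."""
        self._messages.pop(msg_id, None)
        self._invisible_until.pop(msg_id, None)


queue = AtLeastOnceQueue(timeout=30)
queue.put("resize image 17")
first = queue.get(now=0)     # a worker receives the message
hidden = queue.get(now=10)   # still inside the timeout, so nothing is visible
again = queue.get(now=31)    # never deleted, so the message is redelivered
queue.delete(again[0])
empty = queue.get(now=100)   # deleted messages do not come back
```

Because a message can be delivered more than once, the processing done by the client should be safe to repeat.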

7 of 50

Programming and Runtime Support

Worker and Web Roles

MapReduce

Cloud Programming Models

SaaS

8 of 50

PARALLEL AND DISTRIBUTED PROGRAMMING PARADIGMS

Parallel Computing and Programming Paradigms

  • Partitioning
    • Computation partitioning
    • Data partitioning
  • Mapping
  • Synchronization
  • Communication
  • Scheduling

9 of 50

MapReduce, Twister, and Iterative MapReduce

10 of 50

Formal Definition of MapReduce

MapReduce Logical Data Flow

11 of 50

Formal Notation of MapReduce Data Flow

MapReduce Actual Data and Control Flow

  1. Data partitioning
  2. Computation partitioning
  3. Determining the master and workers
  4. Reading the input data (data distribution)
  5. Map function
  6. Combiner function
  7. Partitioning function
  8. Synchronization
  9. Communication
  10. Sorting and Grouping
  11. Reduce function
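
The steps listed above can be walked through with a single-process word-count sketch. It is only an illustration: there is no real master, network, or cluster, and all function names and the byte-sum partitioning rule are assumptions chosen for the example rather than any framework's API.

```python
from collections import defaultdict
from itertools import groupby


def map_fn(_offset, line):
    """Step 5: emit (word, 1) for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def combine(pairs):
    """Step 6: pre-aggregate the output of one map task locally."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def partition(key, num_reducers):
    """Step 7: choose a reducer; a byte sum keeps the result deterministic."""
    return sum(key.encode("utf-8")) % num_reducers


def reduce_fn(key, values):
    """Step 11: sum all counts for one key."""
    return key, sum(values)


def run_mapreduce(lines, num_splits=2, num_reducers=2):
    # Steps 1-4: split the input; each split stands in for one map worker.
    splits = [lines[i::num_splits] for i in range(num_splits)]
    buckets = [[] for _ in range(num_reducers)]
    for split in splits:
        mapped = [p for off, line in enumerate(split) for p in map_fn(off, line)]
        for key, value in combine(mapped):
            # Step 9: "communication" is just appending to the reducer's bucket.
            buckets[partition(key, num_reducers)].append((key, value))
    # Step 8: reducers start only after every map task has finished.
    result = {}
    for bucket in buckets:
        bucket.sort(key=lambda kv: kv[0])                      # Step 10: sort
        for key, group in groupby(bucket, key=lambda kv: kv[0]):  # Step 10: group
            word, total = reduce_fn(key, [v for _, v in group])
            result[word] = total
    return result


counts = run_mapreduce(["the cat sat", "the dog sat", "the end"])
```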

12 of 50

13 of 50

14 of 50

Hadoop Library from Apache

  • Hadoop is an open source implementation of MapReduce coded and released in Java by Apache.
  • The Hadoop core is divided into two fundamental layers:
    • MapReduce engine: the computation engine
    • HDFS: the data storage manager
  1. HDFS Architecture: HDFS has a master/slave architecture containing a single NameNode as the master and a number of DataNodes as workers (slaves).
  2. HDFS splits a file into fixed-size blocks and stores them on the workers (DataNodes); the mapping of blocks to DataNodes is determined by the NameNode.
  3. HDFS Fault Tolerance: block replication, replica placement, and Heartbeat and Blockreport messages
  4. HDFS High-Throughput Access to Large Data Sets (Files)
  5. HDFS Operation
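
The block splitting and replication described above can be sketched as follows. The 64 MB block size, the replication factor of 3, the DataNode names, and the round-robin placement are assumptions for illustration only; the slide does not specify them, and the sketch ignores rack awareness entirely.

```python
import math


def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the size of each fixed-size block; only the last one may be shorter."""
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    return [
        min(block_size_mb, file_size_mb - i * block_size_mb)
        for i in range(num_blocks)
    ]


def place_replicas(num_blocks, datanodes, replication=3):
    """Toy NameNode mapping: block index -> DataNodes holding a replica."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


blocks = split_into_blocks(200)   # a 200 MB file
placement = place_replicas(len(blocks), ["dn0", "dn1", "dn2", "dn3"])
```

Losing any single DataNode in this layout still leaves two replicas of every block, which is the point of block replication.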

15 of 50

Architecture of MapReduce in Hadoop

16 of 50

  • The topmost layer of Hadoop is the MapReduce engine that manages the data flow and control flow of MapReduce jobs over distributed computing systems.
  • The MapReduce engine has a master/slave architecture consisting of a single JobTracker as the master and a number of TaskTrackers as the slaves (workers).
  • The JobTracker manages the MapReduce job over a cluster and is responsible for monitoring jobs and assigning tasks to TaskTrackers.
  • The TaskTracker manages the execution of the map and/or reduce tasks on a single computation node in the cluster.
  • Each TaskTracker node has a number of simultaneous execution slots, each executing either a map or a reduce task. Slots are defined as the number of simultaneous threads supported by CPUs of the TaskTracker node.
  • There is a one-to-one correspondence between map tasks in a TaskTracker and data blocks in the respective DataNode.
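
A minimal sketch of the slot idea above: one map task per data block, placed on the node that stores the block while that node still has a free slot. The function name, node names, and the "leave the task pending" policy are assumptions for illustration, not Hadoop's actual scheduler.

```python
def assign_map_tasks(block_locations, free_slots):
    """Assign one map task per block to the node holding that block.

    block_locations: block id -> node storing the block
    free_slots:      node -> number of free execution slots
    Returns (assignment, pending); pending blocks wait for a slot to free up.
    """
    remaining = dict(free_slots)
    assignment, pending = {}, []
    for block, node in block_locations.items():
        if remaining.get(node, 0) > 0:
            remaining[node] -= 1
            assignment[block] = node
        else:
            pending.append(block)
    return assignment, pending


assignment, pending = assign_map_tasks(
    {"b0": "node1", "b1": "node1", "b2": "node2", "b3": "node1"},
    {"node1": 2, "node2": 2},
)
```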

17 of 50

Running a Job in Hadoop

Three components contribute to running a job in this system: a user node, a JobTracker, and several TaskTrackers.

  1. Job Submission
  2. Task assignment
  3. Task execution
  4. Task running check

18 of 50

19 of 50

Dryad and DryadLINQ from Microsoft

20 of 50

21 of 50

LINQ-expression execution in DryadLINQ.

22 of 50

Sawzall and Pig Latin High-Level Languages

23 of 50

Programming the Google App Engine

24 of 50

Google File System (GFS)

25 of 50

  • The web data that is crawled and saved is very large
  • Google chose a file data block (chunk) size of 64 MB
  • Files are typically written once, and multiple write operations are usually appends of data blocks to the end of files.
  • Reliability is achieved through replication
  • Each chunk or data block of a file is replicated across multiple chunk servers (three replicas by default)
  • A single master coordinates access as well as keeps the metadata.
  • There is no data cache in GFS
  • Chunk servers store the data, while the single master stores the metadata.
  • The file system namespace and locking facilities are managed by the master.
  • The master periodically communicates with the chunk servers to collect management information as well as give instructions to the chunk servers to do work such as load balancing or fail recovery.

26 of 50

The data mutation takes the following steps:

1. The client asks the master which chunk server holds the current lease for the chunk and the locations of the other replicas. If no one has a lease, the master grants one to a replica it chooses (not shown).

2. The master replies with the identity of the primary and the locations of the other (secondary) replicas. The client caches this data for future mutations. It needs to contact the master again only when the primary becomes unreachable or replies that it no longer holds a lease.

3. The client pushes the data to all the replicas. Each chunk server will store the data in an internal LRU buffer cache until the data is used or aged out.

4. Once all the replicas have acknowledged receiving the data, the client sends a write request to the primary. The request identifies the data pushed earlier to all the replicas. The primary assigns consecutive serial numbers to all the mutations it receives.

5. The primary forwards the write request to all secondary replicas. Each secondary replica applies mutations in the same serial number order assigned by the primary.

6. The secondaries all reply to the primary indicating that they have completed the operation.

7. The primary replies to the client. Any errors encountered at any of the replicas are reported to the client. In case of errors, the write may have succeeded at the primary and an arbitrary subset of the secondary replicas. The client request is considered to have failed, and the modified region is left in an inconsistent state.
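
Steps 3 to 6 above separate the data push from the ordered write request. The sketch below models only that ordering idea in one process: class and method names are assumptions, and leases, the master, errors, and the LRU aging of buffered data are left out.

```python
class Replica:
    """Toy chunk replica: buffers pushed data, then applies it in serial order."""

    def __init__(self, name):
        self.name = name
        self.buffer = {}  # data id -> bytes pushed by the client (step 3)
        self.chunk = []   # applied mutations as (serial number, data)

    def push(self, data_id, data):
        self.buffer[data_id] = data

    def apply(self, serial, data_id):
        self.chunk.append((serial, self.buffer.pop(data_id)))


class Primary(Replica):
    """The lease holder: picks the serial order and forwards it (steps 4-6)."""

    def __init__(self, name, secondaries):
        super().__init__(name)
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, data_id):
        serial = self.next_serial
        self.next_serial += 1
        self.apply(serial, data_id)
        for secondary in self.secondaries:
            secondary.apply(serial, data_id)  # same serial number everywhere
        return serial


secondaries = [Replica("s1"), Replica("s2")]
primary = Primary("p", secondaries)
for replica in [primary] + secondaries:   # step 3: push data to all replicas
    replica.push("a", b"first pushed")
    replica.push("b", b"second pushed")
primary.write("b")                        # write requests arrive in this order
primary.write("a")
```

The order in which data was pushed does not matter; every replica ends up with the order the primary assigned.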

27 of 50

28 of 50

BigTable, Google’s NOSQL System

  • BigTable provides service for storing and retrieving structured and semistructured data.
  • BigTable applications include storage of web pages, per-user data, and geographic locations.
  • It is viewed as a distributed multilevel map. It provides a scalable, fault-tolerant, and persistent database, as in a storage service.
  • BigTable is a self-managing system.
  • The BigTable system is built on top of an existing Google cloud infrastructure. BigTable uses the following building blocks:

1. GFS: stores persistent state

2. Scheduler: schedules jobs involved in BigTable serving

3. Lock service: master election, location bootstrapping

4. MapReduce: often used to read/write BigTable data
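
The "distributed multilevel map" view mentioned on this slide can be pictured with a small single-machine sketch. Indexing values by row, column, and timestamp follows the published BigTable data model rather than anything stated on the slide, and the class name, method names, and sample keys are assumptions for illustration.

```python
from collections import defaultdict


class TinyTable:
    """Toy multilevel map: row -> column -> timestamp -> value."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        self._rows[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value, or the newest one at or before `timestamp`."""
        versions = self._rows.get(row, {}).get(column, {})
        eligible = [t for t in versions if timestamp is None or t <= timestamp]
        return versions[max(eligible)] if eligible else None


table = TinyTable()
table.put("com.example.www", "contents:", 1, "<html>v1</html>")
table.put("com.example.www", "contents:", 5, "<html>v2</html>")
latest = table.get("com.example.www", "contents:")
older = table.get("com.example.www", "contents:", timestamp=3)
missing = table.get("com.example.www", "anchor:home")
```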

29 of 50

BigTable is an innovative Google cloud technology for managing large-scale structured and semi-structured data.

Designed for storage, retrieval, and scalable data processing. Built to handle massive datasets beyond traditional databases.

Related Google Technologies

  • MapReduce – Large-scale distributed data processing.
  • Sawzall – Data analysis language.
  • GFS (Google File System) – Distributed storage system.
  • Chubby – Distributed lock service.
  • BigTable – Scalable distributed database.

Applications of BigTable

Web Pages Storage - URLs, page content, metadata, links, page rank.

Per-user Data - User preferences, search history, emails.

Geographic Data - Roads, satellite images, locations, user annotations (e.g., used in Google Maps and Google Earth).

Need for BigTable

    • Better scalability
    • High performance
    • Low-cost reusable data management system
    • Optimized storage handling

30 of 50

Design Goals of BigTable

  • Continuous asynchronous data updates.
  • Access to most recent data at all times.
  • Very high read/write speed.
  • Support millions of operations per second.
  • Efficient data scanning and querying.

BigTable Architecture

  • Distributed multilevel map structure.
  • Fault-tolerant persistent database.
  • Works as scalable storage service.

Self-Managing System

  • Dynamic addition/removal of servers.
  • Automatic load balancing.
  • High availability and fault recovery.

BigTable Building Blocks

  1. GFS – Stores persistent data
  2. Scheduler – Manages BigTable jobs
  3. Lock Service (Chubby) – Master election and coordination
  4. MapReduce – Reads/writes BigTable data

31 of 50

Tablet Location Hierarchy

32 of 50

The first level is a file stored in Chubby that contains the location of the root tablet, which contains the location of all tablets in a special METADATA table.

Each METADATA tablet contains the location of a set of user tablets.

The root tablet is just the first tablet in the METADATA table, which is never split to ensure that the tablet location hierarchy has no more than 3 levels.

The METADATA table stores the location of a tablet under a row key that is an encoding of the tablet’s table identifier and its end row.

BigTable includes many optimizations and fault-tolerant features. Chubby can guarantee the availability of the file for finding the root tablet. The BigTable master can quickly scan the tablet servers to determine the status of all nodes.
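
The three-level lookup described above (Chubby file, root tablet, METADATA tablet, user tablet) can be sketched with sorted lists and a binary search. Row keys are modeled as (table identifier, end row) pairs as on the slide; the tablet names, server names, and the in-memory dictionary are assumptions for illustration.

```python
import bisect


def find_in_tablet(entries, key):
    """entries: sorted (end_key, location) pairs; pick the first end_key >= key."""
    end_keys = [end_key for end_key, _ in entries]
    index = bisect.bisect_left(end_keys, key)
    if index == len(entries):
        raise KeyError(key)
    return entries[index][1]


LAST = "\uffff"  # sentinel end row that sorts after the rows used here

chubby_file = "root"  # level 1: the Chubby file names the root tablet
tablets = {
    # level 2: the root tablet points at METADATA tablets
    "root": [(("users", "m"), "meta1"), (("users", LAST), "meta2")],
    # level 3: METADATA tablets point at the servers of user tablets
    "meta1": [(("users", "f"), "server-a"), (("users", "m"), "server-b")],
    "meta2": [(("users", "t"), "server-c"), (("users", LAST), "server-d")],
}


def locate(table, row):
    key = (table, row)
    metadata_tablet = find_in_tablet(tablets[chubby_file], key)
    return find_in_tablet(tablets[metadata_tablet], key)


alice_server = locate("users", "alice")
zoe_server = locate("users", "zoe")
```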

33 of 50

Chubby, Google’s Distributed Lock Service

34 of 50

  • Chubby is a distributed coarse-grained locking service developed by Google.
  • Provides a simple file-system-like namespace for storing small files.
  • Stores metadata and lock information, unlike large files in GFS.
  • Continues functioning even if some member nodes fail.
  • Each Chubby cell contains 5 servers with the same file system namespace.
  • Clients communicate using the Chubby library to access servers.
  • Supports file operations such as create, read, write, and lock management.
  • Used as Google’s primary internal name service.
  • GFS and BigTable use Chubby to elect a primary server from replicas.
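
How a lock on a small file can be used to elect a primary is sketched below. This is a single-process toy, not the Chubby client library: the class, its methods, and the file path are hypothetical, and sessions, leases, and the five-server cell are not modeled.

```python
class TinyLockService:
    """Toy lock service: one exclusive lock and one small content per path."""

    def __init__(self):
        self._lock_holder = {}
        self._contents = {}

    def try_acquire(self, path, client):
        """Grant the lock if it is free; report whether `client` holds it."""
        holder = self._lock_holder.setdefault(path, client)
        return holder == client

    def release(self, path, client):
        if self._lock_holder.get(path) == client:
            del self._lock_holder[path]

    def write(self, path, client, data):
        if self._lock_holder.get(path) != client:
            raise PermissionError("only the lock holder may write")
        self._contents[path] = data

    def read(self, path):
        return self._contents.get(path)


def elect_primary(service, path, replicas):
    """Each replica tries the lock; the winner records itself in the file."""
    for replica in replicas:
        if service.try_acquire(path, replica):
            service.write(path, replica, replica)
            return replica
    return service.read(path)  # someone else already holds the lock


service = TinyLockService()
path = "/ls/demo-cell/demo-service/primary"  # hypothetical file name
first = elect_primary(service, path, ["replica-1", "replica-2", "replica-3"])
blocked = service.try_acquire(path, "replica-2")
service.release(path, "replica-1")           # e.g., the old primary steps down
second = elect_primary(service, path, ["replica-2", "replica-3"])
```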

35 of 50

PROGRAMMING ON AMAZON AWS AND MICROSOFT AZURE

  • AWS provides a powerful cloud platform for application development and deployment.
  • Key services include EC2, S3, SimpleDB, and RDS.
  • EC2 (Elastic Compute Cloud) offers scalable virtual servers.
  • S3 (Simple Storage Service) provides reliable object storage.
  • RDS (Relational Database Service) supports managed relational databases.
  • SimpleDB offers NoSQL database support.
  • Elastic MapReduce (EMR) provides Hadoop-based big data processing on EC2.
  • AWS does not directly provide BigTable support.

Messaging Services

  • SQS (Simple Queue Service) for message queuing.
  • SNS (Simple Notification Service) for notifications and pub/sub messaging.

Scalability Services

  • Auto Scaling automatically increases/decreases EC2 instances based on demand.
  • Helps maintain performance and reduce costs.

Load Balancing

  • Elastic Load Balancing (ELB) distributes traffic across multiple EC2 instances.
  • Detects failed nodes and balances workload efficiently.

Monitoring Service

  • CloudWatch monitors AWS resources and applications.
  • Tracks CPU usage, disk I/O, network traffic, and performance metrics.
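
The Auto Scaling and load balancing behavior described on this slide can be approximated by the sketch below. It does not call any AWS API: the CPU thresholds, instance limits, instance names, and the round-robin policy are assumptions for illustration.

```python
def desired_instances(current, avg_cpu, low=30.0, high=70.0, minimum=1, maximum=10):
    """Scale out by one above `high` % CPU, scale in by one below `low` % CPU."""
    if avg_cpu > high:
        target = current + 1
    elif avg_cpu < low:
        target = current - 1
    else:
        target = current
    return max(minimum, min(maximum, target))


class RoundRobinBalancer:
    """Toy load balancer that skips instances marked as failed."""

    def __init__(self, instances):
        self.instances = list(instances)
        self.healthy = set(self.instances)
        self._next = 0

    def mark_failed(self, instance):
        self.healthy.discard(instance)

    def route(self):
        for _ in range(len(self.instances)):
            candidate = self.instances[self._next % len(self.instances)]
            self._next += 1
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy instances")


balancer = RoundRobinBalancer(["i-a", "i-b", "i-c"])  # hypothetical instance names
balancer.mark_failed("i-b")
routed = [balancer.route() for _ in range(4)]
scaled_out = desired_instances(current=2, avg_cpu=85.0)
scaled_in = desired_instances(current=2, avg_cpu=10.0)
```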

36 of 50

Programming on Amazon EC2

Amazon EC2 and Amazon Machine Images (AMIs)

  • Amazon was the first company to introduce Virtual Machines (VMs) for application hosting.
  • Customers can rent VMs instead of physical servers to run applications.
  • Users can install and run any software of their choice on rented VMs.
  • Provides elastic computing, allowing users to:
    • Create instances when needed
    • Launch applications quickly
    • Terminate instances anytime
  • Customers pay only for active server usage (pay-per-use model).
  • Amazon offers several preinstalled VM types called Amazon Machine Images (AMIs).
  • AMIs are templates used to create EC2 instances.
  • Preconfigured with:
    • Linux or Windows operating systems
    • Additional software packages
  • Enables fast, flexible, and scalable cloud deployment.

37 of 50

Amazon VM Hosting

  • Amazon introduced virtual machines (VMs) for application hosting.
  • Customers can rent VMs instead of physical servers to run applications.
  • Users can install any software of their choice inside the VM.
  • The service is elastic — instances can be created, launched, and terminated on demand.
  • Billing follows a pay-per-use (hourly) model for active servers.
  • Amazon Machine Image (AMI) is a preconfigured VM template.
  • AMIs come with Linux or Windows operating systems and additional software.
  • AMIs are templates, while instances are running virtual machines.
  • Supported AMI types include public, private, and paid AMIs.
  • VM creation workflow:
    • Create AMI → Create Key Pair → Configure Firewall → Launch

38 of 50

Amazon EC2 execution environment.

39 of 50

40 of 50

41 of 50

42 of 50

Amazon Simple Storage Service (S3)

43 of 50

Microsoft Azure Programming Support

44 of 50

Microsoft Azure Programming Model

  • Microsoft Azure provides a rich cloud programming environment
  • Built on Azure Fabric: virtualized hardware + resource management & fault tolerance
  • Supports dynamic resource allocation and automated service management (XML-based templates)
  • Includes monitoring & logging: event logs, performance counters, IIS logs, crash dumps
  • Debugging is done using trace data (no direct runtime debugging)

Core Components

  • Compute Services: Web Role & Worker Role
  • Storage Services: Integrated with Azure storage system
  • SQLAzure: Managed cloud database service

45 of 50

Role Lifecycle Methods

  • OnStart() → Initialization
  • Run() → Main execution logic
  • OnStop() → Graceful shutdown
  • Azure applications are Internet-connected via compute VMs (roles)
  • Concept of roles enables scalable and distributed cloud applications
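
The OnStart(), Run(), OnStop() sequence above can be mimicked in a few lines. This is a Python analogy, not the Azure SDK: the class, the `host` function, and the snake_case method names are assumptions standing in for the lifecycle methods named on the slide.

```python
class WorkerRole:
    """Toy role with the three lifecycle hooks named on the slide."""

    def __init__(self, jobs):
        self.jobs = list(jobs)
        self.log = []
        self.stopping = False

    def on_start(self):
        """Initialization; returning False would keep the role from running."""
        self.log.append("start")
        return True

    def run(self):
        """Main execution logic: drain the job list unless asked to stop."""
        while self.jobs and not self.stopping:
            self.log.append("processed " + self.jobs.pop(0))

    def on_stop(self):
        """Graceful shutdown."""
        self.stopping = True
        self.log.append("stop")


def host(role):
    """Stand-in for the platform: call the hooks in lifecycle order."""
    if role.on_start():
        try:
            role.run()
        finally:
            role.on_stop()  # shutdown hook runs even if run() raises
    return role.log


lifecycle = host(WorkerRole(["job-a", "job-b"]))
```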

Roles in Azure

  • Web Role: Handles HTTP/HTTPS requests (web hosting VM)
  • Worker Role: Executes background processing tasks
  • Roles support HTTP(S) and TCP communication
  • Load balancing across multiple role instances

46 of 50

SQLAzure

  • Microsoft Azure provides rich and scalable storage services
  • SQL Azure offers SQL Server as a cloud service
  • Most storage services are accessed via REST APIs (URL-based access)
  • Data replication (3 copies) ensures fault tolerance and consistency

Storage Types

  • Blob Storage: Core storage system (similar to Amazon S3)
  • Azure Drives: File system interface (similar to Amazon EBS) using NTFS volumes

Blob Storage Hierarchy

  • Account → Container → Blob (Block/Page)
  • Container acts like a directory, Account acts as root

Blob Types

  • Block Blobs
    • Used for streaming data
    • Divided into blocks (up to 4 MB each)
    • Max size: 200 GB
  • Page Blobs
    • Used for random read/write operations
    • Structured as pages
    • Max size: 1 TB

Additional Features

  • Metadata stored as <name, value> pairs (up to 8 KB per blob)
  • Designed for high availability, scalability, and durability
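
Using the block blob limits quoted above (blocks of up to 4 MB, 200 GB maximum), the way an upload would be divided into blocks can be worked out as follows. The function name is an assumption, and nothing here touches the actual storage service.

```python
BLOCK_SIZE = 4 * 1024 * 1024          # 4 MB per block, as quoted on the slide
MAX_BLOCK_BLOB = 200 * 1024 ** 3      # 200 GB maximum block blob size


def plan_blocks(blob_size):
    """Return the byte size of each block needed for a blob of `blob_size` bytes."""
    if blob_size > MAX_BLOCK_BLOB:
        raise ValueError("blob exceeds the block blob size limit")
    full_blocks, remainder = divmod(blob_size, BLOCK_SIZE)
    return [BLOCK_SIZE] * full_blocks + ([remainder] if remainder else [])


ten_mb_plan = plan_blocks(10 * 1024 * 1024)
max_block_count = MAX_BLOCK_BLOB // BLOCK_SIZE
```

A 10 MB blob therefore needs two full blocks and one 2 MB block, and a blob at the size limit needs 51,200 blocks.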

47 of 50

Azure Tables

  • Microsoft Azure supports Table and Queue storage for small-scale data

Queue Storage

  • Used for message passing between Web Role and Worker Role
  • Ensures reliable message delivery (processed at least once)
  • Supports operations: PUT, GET, DELETE messages
  • Queue operations: CREATE and DELETE queues
  • Unlimited messages, each up to 8 KB size

Table Storage

  • NoSQL-style storage with flexible schema
  • Data stored as Entities (rows) and Properties (columns)
  • Each entity can have up to 255 properties (<name, type, value>)
  • No limit on number of entities → highly scalable

48 of 50

Key Features

  • PartitionKey: Groups related entities → improves performance
  • RowKey: Unique identifier for each entity
  • Max entity size: 1 MB
  • Large data handled via links to Blob Storage

Additional Support

  • Query support using ADO.NET and LINQ
  • Designed for distributed storage and high scalability
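
The PartitionKey and RowKey roles above can be illustrated with a small in-memory store. It is not the Azure Table API: the class and method names are assumptions, RowKey is treated as unique within its partition, and the 255-property limit is applied to user properties only as a simplification.

```python
class TinyTableStore:
    """Toy schema-free table: PartitionKey -> RowKey -> properties."""

    MAX_PROPERTIES = 255  # per-entity limit quoted on the slides

    def __init__(self):
        self._partitions = {}

    def insert(self, partition_key, row_key, **properties):
        if len(properties) > self.MAX_PROPERTIES:
            raise ValueError("too many properties for one entity")
        partition = self._partitions.setdefault(partition_key, {})
        if row_key in partition:
            raise KeyError("entity already exists")
        partition[row_key] = dict(properties)

    def get(self, partition_key, row_key):
        return self._partitions[partition_key][row_key]

    def query_partition(self, partition_key):
        """Entities sharing a PartitionKey are kept together and scanned together."""
        return sorted(self._partitions.get(partition_key, {}).items())


store = TinyTableStore()
store.insert("customers-eu", "0002", name="Lin", phone="555-0100")
store.insert("customers-eu", "0001", name="Ada", city="Berlin")  # different properties
store.insert("customers-us", "0001", name="Sam")
eu_entities = store.query_partition("customers-eu")
```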

49 of 50

EMERGING CLOUD SOFTWARE ENVIRONMENTS

Eucalyptus Cloud

Eucalyptus was developed by Eucalyptus Systems (originating from research at the University of California, Santa Barbara)

  • Designed to bring cloud computing to clusters & academic supercomputers
  • Provides AWS-compatible interface (EC2-based services)
  • Supports cloud management through web services and user interface

Architecture Highlights

  • Open-source cloud software environment
  • Supports both compute cloud and storage cloud
  • Focus on VM image management and virtual clustering

VM Image Management

  • Stores VM images in Walrus storage (similar to Amazon S3)
  • Users can create, bundle, upload, and register custom VM images
  • Images linked with kernel and RAM disk
  • Stored in user-defined buckets and accessible across zones

50 of 50

Eucalyptus
