1 of 63

Cloud Computing: Overview


Mohamed Hefeeda

2 of 63

Cloud Computing: Vision


  • Goal … achieve the old¹ dream for computing

Make computing a utility

  • similar to electricity & water

  • ¹ Parkhill, The Challenge of the Computer Utility, Addison-Wesley, 1966.

3 of 63

Electricity as Utility


[Figure: one utility delivers multiple services — lighting, high-voltage supply, 3-phase 380 V]

4 of 63

Computing as Utility


[Figure: one computing utility delivers multiple services at large scale — analytics, apps, dev tools]

5 of 63

Why Cloud Computing?

  • Better computing services
    • Managed by experts (see example next slide)
    • High availability
    • Numerous software tools and systems: better productivity
  • Cost effective
    • Economy of scale 🡺 computing resources cost much less
    • Pay on-demand 🡺 lower risk and lower barrier to entry
  • Fast and elastic deployment
    • Pre-existing infrastructure
    • Virtualized resources and management tools 🡺 deployment in minutes or hours compared to weeks and months
    • Illusion of “infinite” resources 🡺 allows for scaling


6 of 63

Security

  • Perception … which is safer?


  • Cloud could be more secure than local infrastructure!
    • Cloud employs security experts that most companies cannot afford
    • 🡺 but there is still a need to identify the risks of moving to the cloud

7 of 63

Cloud Computing: Risks and Challenges


  • Lock-in with a specific provider
    • Mitigation: use open-source and standard systems (not always possible)
  • Security and Privacy
    • Still an issue, as clouds are shared platforms
    • Data breaches, malicious insiders, compromised credentials, …
  • Dependence on Internet connectivity (latency and bandwidth)
    • Latency to reach the cloud: can affect interactive apps
    • Bandwidth to the cloud: can impact data-intensive apps

8 of 63

Cloud Computing: NIST Definition


“Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources

(e.g., networks, servers, storage, applications, and services)

that can be rapidly provisioned and released with minimal management effort or service provider interaction.”

9 of 63

Cloud Computing: Service Models


  • IaaS (Infrastructure as a Service)
    • Basic computing resources (CPU, storage, network, …)
    • Amazon EC2

  • PaaS (Platform as a Service)
    • Platform to develop apps using programming languages, libraries, services, and tools supported by the cloud provider
    • Google App Engine, Amazon EMR (Elastic MapReduce)

  • SaaS (Software as a Service)
    • Software apps provided by the cloud provider
    • Office 365
    • SalesForce.com (e.g., payroll, customer relationship management, …)

10 of 63

Cloud Computing: Why Now?


  • Better Internet & mega datacenters
  • Internet:
    • faster, prevalent, and more reliable
  • Mega datacenters:
    • economy of scale (5–7x cheaper hardware than other companies)
    • already deployed (Amazon AWS, Google, …) 🡺 additional revenue streams
    • already developed software for in-house use (e.g., Google File System, MapReduce)

11 of 63

Cloud Computing: Why Now? (2)


  • New applications (enabled by cloud) 🡺
    • Numerous mobile interactive apps
    • IoT (Internet of Things): sensors, cameras, cars, smart homes, smart…everything
    • Business analytics and intelligence systems

  • New technology trends and business models
    • Shifting from high-touch (& high cost) service model to low-touch (& much lower cost) service
      • E.g., content distribution using Akamai vs. using Amazon CloudFront

12 of 63

Simple Model for Cloud Computing


Layers (design concern 🡺 what it covers):

  • System Design 🡺 data center hardware
  • Programming Models & Resource Management 🡺 virtualization, allocation, programming
  • Cloud Services 🡺 libraries & services
  • Cloud Applications 🡺 large-scale applications

13 of 63

Cloud Applications

  • Cloud Apps … from simple to web-scale

To Cloudify or not to Cloudify

14 of 63

Migrating Apps to Cloud


  • Candidate apps for migration have the following characteristics
    • Demand for resources varies with time
      • provisioning private data centers for the peak wastes resources
    • Demand is not known in advance
      • Cannot optimally provision private data centers; either too much waste (overprovisioning) or lost opportunities (underprovisioning)
    • Can leverage “cost associativity”
      • Using one machine for 100 hrs costs the same as using 100 machines for 1 hr on the cloud 🡺 but we get the results faster in the latter case
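Cost associativity can be sketched in a few lines (the $0.10/machine-hour price is a hypothetical number, not a quoted cloud rate):

```python
# Pay-per-use clouds charge by machine-hours: N machines for T hours
# costs the same as 1 machine for N*T hours, but finishes N times sooner.
def job_cost_and_time(machines, hours_per_machine, price_per_machine_hour):
    cost = machines * hours_per_machine * price_per_machine_hour
    wall_clock_hours = hours_per_machine  # machines run in parallel
    return cost, wall_clock_hours

# A 100 machine-hour job at a hypothetical $0.10/machine-hour:
serial = job_cost_and_time(1, 100, 0.10)    # one machine, 100 hours
parallel = job_cost_and_time(100, 1, 0.10)  # 100 machines, 1 hour
assert serial[0] == parallel[0]             # same cost, 100x faster
```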


15 of 63

Migrating Apps to Cloud


  • Cloud migration transfers the risk of miscalculating demand from user to cloud provider
    • Major advantage, especially for start-ups
  • Cloud providers mitigate the risk by statistical multiplexing across multiple users
    • Statistical multiplexing means:
      • requests from different users may vary significantly
      • but with many users, the average may not fluctuate too much
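The averaging effect behind statistical multiplexing can be shown with a tiny simulation (the uniform per-user demand model is an arbitrary assumption for illustration):

```python
import random
import statistics

random.seed(7)

def demand():
    # One user's fluctuating demand, in arbitrary units
    return random.uniform(0, 10)

trials = 2000
# Demand seen when serving a single user
single = [demand() for _ in range(trials)]
# Average demand per user when serving 100 independent users
aggregate = [statistics.mean(demand() for _ in range(100)) for _ in range(trials)]

# Relative fluctuation (std/mean) shrinks roughly as 1/sqrt(#users)
cv_single = statistics.pstdev(single) / statistics.mean(single)
cv_agg = statistics.pstdev(aggregate) / statistics.mean(aggregate)
assert cv_agg < cv_single / 5  # far smoother with 100 users
```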

16 of 63

Data Center Design


17 of 63

Data Centers


Useful info & virtual tours at:

https://www.google.com/about/datacenters/

18 of 63

Racks of Servers

  • Modular design
    • Preconfigured racks
    • Power, network, cabling


[Figure: a rack with its Top of the Rack (ToR) switch and servers (commodity, customized)]

19 of 63

Servers and Virtualization

  • Multiple virtual machines on one physical machine
  • Applications run unmodified as on real machine
  • VM can migrate from one computer to another


[Figure, bottom to top: shared hardware 🡺 host OS 🡺 Virtual Machine Manager (VMM) 🡺 VM1, VM2, VM3]

20 of 63

Datacenter Network

Server racks

  • 20–40 servers
  • Each runs multiple VMs

Top of Rack (ToR) switch

  • one per rack
  • 40–100 Gbps

Tier-2 switches

  • connecting to ~16 ToRs below

Tier-1 switches

  • connecting to ~16 T-2s below

Border routers

  • connections outside datacenter

Most common: tree structure

21 of 63

Why Tree-structured Datacenter Network?


  • Cost effective: it allows a hierarchical design, where
      • many low-end switches are used in racks, and
      • a few high-end switches are used across racks

  • Note: Cost of switches increases substantially with increasing throughput and #ports

22 of 63

[Figure: tree-structured network over racks 1–16, with two disjoint paths highlighted between racks 1 and 11]

Datacenter Network: Oversubscription

  • ToR switch uses multiple uplinks to higher-level switches
    • Why would we do this?
    • Create multiple paths for servers across racks 🡺 higher throughput and more reliability

23 of 63

Datacenter Network: Oversubscription


  • Example: 48-port switch
      • 40 ports are used for servers in the rack
      • 8 ports are used to connect to higher-level switch 🡺 up to 8 servers can concurrently connect to others in different racks
    • 🡺 oversubscription factor in this case = 40/8 = 5
      • Also means that 5 servers are sharing one uplink, i.e., not all of them can achieve their full bandwidth at the same time
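The arithmetic in this example can be sketched directly; the 10 Gbps server NIC speed below is a hypothetical value, not from the slide:

```python
def oversubscription_factor(server_ports, uplink_ports):
    # Ratio of downlink (server-facing) to uplink capacity on a ToR switch
    return server_ports / uplink_ports

# The slide's 48-port switch: 40 server ports, 8 uplink ports
factor = oversubscription_factor(40, 8)

# Worst case (all servers sending across racks at once), each server
# gets only 1/factor of its NIC bandwidth
nic_gbps = 10  # hypothetical server NIC speed
worst_case_gbps = nic_gbps / factor
assert factor == 5.0 and worst_case_gbps == 2.0
```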

24 of 63

[Figure: load balancer sits between the Internet and the datacenter network]

load balancer: application-layer routing

  • receives external client requests
  • directs workload within data center
  • returns results to external client (hiding data center internals from client)

Datacenter Network: Load Balancing

Can we implement Load Balancing in switches?

25 of 63

P4: Load Balancer


26 of 63

Facebook F16 Datacenter Network

Each ToR connects to 16 Fabric Switches with 100 Gbps links 🡺 1.6 Tbps uplink/downlink capacity

Datacenter in 1 building

Similarly, each Fabric Switch connects to 16 Spine Switches

27 of 63

Facebook F16 Datacenter Network

6 datacenters (buildings) in one region interconnected together

Interconnection network

28 of 63

Alternative Networking Fabrics


  • Fat trees: built from commodity Ethernet switches
    • Connect end-hosts together using a “fat-tree” topology
    • All hosts can transmit at line speed, if packets are distributed along the different paths. What is the oversubscription factor here?

29 of 63

Fat-trees

  • A fat tree built from k-port switches can support k³/4 hosts
    • Common deployment k=48 🡪 27,648 hosts
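The host count follows from the pod structure of a fat tree, as a quick check shows:

```python
def fat_tree_hosts(k):
    # A fat tree of k-port switches has k pods; each pod has k/2 edge
    # switches, each with k/2 host-facing ports:
    # (k/2) * (k/2) hosts per pod * k pods = k**3 / 4
    return (k // 2) * (k // 2) * k

assert fat_tree_hosts(48) == 27648  # the slide's k=48 deployment
```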


30 of 63

Fat-tree Challenges

  • Plain layer-3 routing will use only one of the existing equal-cost paths

  • Packet re-ordering occurs if layer 3 blindly takes advantage of the path diversity
    • e.g., with Equal-Cost Multi-Path (ECMP) routing


31 of 63

Other Topologies

  • Jellyfish


32 of 63

Alternative Network Fabrics

  • InfiniBand:
    • Interconnection network with much higher bandwidth, but more costly
    • Common in HPC (high-performance computing)


33 of 63

Datacenter: Storage

  • Storage in datacenters can be …
    • Distributed (disks connected to individual servers)
    • Centralized (Network Attached Storage, NAS)


34 of 63

Datacenter: Distributed Storage


Distributed File System (e.g., GFS)

35 of 63

Datacenter: Distributed Storage

  • Distributed storage
    • Inexpensive
    • Needs a distributed file system (e.g., Google File System)
      • Manages and replicates data across machines
    • May provide higher read bandwidth (can read from machines in parallel) at the expense of higher write overheads
    • Allows software to exploit data locality (data on local disk)


36 of 63

Datacenter: Centralized Storage


NAS

37 of 63

Datacenter: Centralized Storage

  • NAS (Network Attached Storage):
    • Usually more expensive
    • Directly connected to cluster-level switching fabric
    • NAS is responsible for data management and integrity (error correction, fault tolerance, replication, etc.) 🡺 simplifies deployment


38 of 63

Datacenter: Storage Hierarchy


  • Notice the differences in latency and bandwidth

[Figure: storage hierarchy, annotated with size, latency, and bandwidth at each level]

39 of 63

Datacenter Storage: Latency, BW, Capacity


  • Very important for programmers to appreciate

[Figure: latency, bandwidth, and capacity across the storage hierarchy, plotted on a log scale]

40 of 63

Datacenter Buildings


41 of 63

Datacenter: Energy Consumption


  • A big portion of the cost of constructing data centers goes into power distribution and cooling

  • Usually, data centers are referred to by the amount of power they use, e.g., 10 MW data center

  • Rough estimates for the construction cost of large data centers: $10-20/Watt

42 of 63

Data Center: Power Distribution


[Figure: power path from grid to servers — 10–20 kV from the grid 🡺 400–600 V 🡺 200–480 V 🡺 110–220 V to computing equipment; power also feeds cooling]

43 of 63

Data Center: Power Distribution

  • Primary Switch Gear:
    • Has breakers to protect against electrical faults
    • Scales voltage down from medium (10–20 kV) to low (400–600 V)
  • Uninterruptable Power Supply (UPS)
    • Gets one feed from the switchgear and another from diesel generators
    • Has batteries (DC current)
    • Senses and decides the active power line (utility power or diesel)
    • After a power failure, starts the diesel generators (10–15 s)
    • Performs AC-DC-AC double conversion:
      • AC-DC: converts AC to DC to store in batteries
      • DC-AC: during power failures, converts DC from batteries to AC to feed datacenter equipment
      • Conversion also helps in power conditioning (removes spikes, etc.)


44 of 63

Data Center: Power Distribution

  • Power Distribution Units (PDUs)
    • Take a feed from the UPS (200–480 V)
    • Break it into many 110–220 V circuits for the actual servers
    • Circuits are individually protected
    • Sometimes UPS units are duplicated for added reliability
      • PDUs take two lines and can switch between them quickly


45 of 63

Data Centers: Cooling


[Figure: alternating hot and cold aisles between racks, fed by a chiller or cooling tower]

  • Cold airflow from tiles should match horizontal airflow through servers
    • Otherwise, lower servers absorb all cold air and higher ones suck in warm air from above the rack
    • This puts a physical limit on #servers in each rack

46 of 63

Cooling: Managing Airflow

  • If airflow is not managed carefully 🡺 Creates hot and cold regions

  • Newer data centers separate hot aisles from cold aisles
    • Improves efficiency

  • 🡺 Watch Video from Google about managing airflow


47 of 63

Cooling: Free Cooling


  • Use ambient temperature to reduce reliance on chillers

  • 🡺 Watch Video from Google Datacenters in Finland and Belgium

48 of 63

Container-Based Data Centers

  • Put server racks into containers
    • Integrate heat exchange and power distribution inside the container
    • 🡺 higher server (power) densities than raised-floor datacenters
    • 🡺 higher energy efficiency
      • e.g., Microsoft Datacenter in Chicago


49 of 63

Energy Efficiency


  • The whole ICT industry contributes ~2% of greenhouse gas emissions

  • Datacenters alone account for 15% of this 2%

50 of 63

PUE: Power Usage Effectiveness


  • PUE = Total building power / power in IT equipment
    • reflects quality of the datacenter building
    • Ideally close to 1.0

  • Old data centers had PUE from 2.0 to 3.0

  • Newer ones have PUE < 2.0
    • Average is around 1.7
  • Google reported PUE ~ 1.1 in recent data centers

  • Where are the power overheads in datacenters?

51 of 63

Power Overheads in Data Centers


  • Losses from AC-DC-AC conversion
    • Google's new design: per-server UPS with a battery on each server 🡺 only one AC-DC conversion
  • Use free cooling from ambient air
  • Better airflow control and isolation of aisles

52 of 63

PUE and Server PUE (SPUE)


  • PUE captures overheads in datacenters
  • But it does NOT account for inefficiencies in IT equipment
    • Power can be lost in server’s power supply, voltage regulator modules (VRMs), cooling fans, …
      • Power supplies: ~80% efficient
      • VRMs could lose ~30% of power

  • Server PUE (SPUE) = Total Server Input Power / Power consumed by electronic components involved in computation (CPUs, DRAM, …)

  • 🡺 “True” PUE of datacenter = PUE x SPUE
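The two metrics compose as a simple product; the wattages below are hypothetical, chosen only to illustrate the arithmetic:

```python
def pue(total_building_w, it_equipment_w):
    # Power Usage Effectiveness: building overhead vs. IT equipment
    return total_building_w / it_equipment_w

def spue(it_equipment_w, compute_w):
    # Server PUE: server input power vs. power reaching CPUs, DRAM, ...
    return it_equipment_w / compute_w

# Hypothetical 10 MW facility: 8 MW reaches IT gear, of which 6.5 MW
# reaches the electronics actually doing computation
total_w, it_w, compute_w = 10e6, 8e6, 6.5e6
true_pue = pue(total_w, it_w) * spue(it_w, compute_w)
# The product telescopes: true PUE = total building power / compute power
assert abs(true_pue - total_w / compute_w) < 1e-9
```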

53 of 63

Server Energy Proportionality


  • Energy proportionality means energy consumption decreases linearly with load 🡺 NOT the case in reality
  • The proportionality is even worse for elements other than the CPU

54 of 63

Server Energy Proportionality


  • On-going research to improve energy proportionality
    • E.g., for disks, reduce spinning speed by adding more heads
  • Notes:
    • Idle here means active idle, i.e., incurs only a short latency to wake up
      • Example: CPU dynamic voltage scaling (DVS)
        • DVS 🡺 reduces CPU frequency 🡺 less energy
    • This is unlike inactive idle, which takes a long time to wake up but saves more energy
      • Example: disk sleep and wake up

  • Can we put servers more often in inactive modes?
  • Not really. Let us see why

55 of 63

Profile of Some Servers at Google


  • Servers are rarely completely idle. Why is that?
    • Load balancing (each server processes a few transactions)
    • Other operations, e.g., Distributed File System replication
  • 🡺 need to rely more on active idle modes (or consolidate workloads on fewer machines)

56 of 63

Active Idle Modes for CPU: DVS


  • Dynamic Voltage Scaling (DVS): reduces frequency of CPU (and hence energy consumed) when load is low
  • DVS saves 10–20% of energy (OK, not impressive)

57 of 63

Datacenter: Power Provisioning


  • Two costs for power:
    • Construction
      • power provisioning of the facility
      • ~$10–22 per IT watt
      • with 10-year depreciation, yearly cost is $1.0–2.2/watt
    • Operation
      • ~$1.2 per IT watt per year (average in the US)

  • That is, saving in operation power is quite significant
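The comparison can be made concrete with the slide's own numbers:

```python
def yearly_cost_per_watt(construction_per_watt, depreciation_years,
                         operation_per_watt_year):
    # Amortized construction cost plus yearly operating cost, per IT watt
    return construction_per_watt / depreciation_years + operation_per_watt_year

# Slide's figures: $10-22/W construction, 10-year depreciation, $1.2/W/yr operation
low = yearly_cost_per_watt(10, 10, 1.2)   # cheapest build
high = yearly_cost_per_watt(22, 10, 1.2)  # most expensive build
# Operation ($1.2/W/yr) is comparable to amortized construction
# ($1.0-2.2/W/yr), so operational power savings matter a great deal
```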

58 of 63

Power Saving: Issues


  • Rated maximum power (on the nameplate) is conservative
    • Rarely reached during operation
  • Power consumed by a server depends on the load
    • 🡺 need to measure power consumption in real time

  • Workload consolidation can help put more servers in inactive idle modes
    • But it might complicate distributed applications

  • Power Oversubscription:
    • Not all servers run at their peak all the time

59 of 63

Power Measurement Study from Google


  • Measure power at Rack (80 servers), PDU (800 servers), Cluster (5,000 servers) over 6 months

60 of 63

Power Measurement Study

  • Cluster never ran above 72% of its peak power
  • 🡺 28% of the provisioned power is never used
  • We could add more machines to the cluster (~40% more) at the same power level

  • Need to take some precautions:
    • Mix workloads (include less critical ones that can be terminated or delayed)
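The ~40% headroom figure follows directly from the 72% observation:

```python
# If the cluster never exceeds 72% of its provisioned power, we can
# oversubscribe: pack in machines until expected peak fills the budget.
peak_fraction_observed = 0.72

# Fraction of extra machines that fit under the same power budget
extra_machines = 1 / peak_fraction_observed - 1
# 1/0.72 ≈ 1.39, i.e., roughly 40% more machines
assert 0.38 < extra_machines < 0.40
```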


61 of 63

Data Centers—Tiers


  • Tier I:
    • single path for power/cooling distribution, no redundant components
  • Tier II:
    • adds redundant components (N + 1), improving availability
  • Tier III:
    • multiple power/cooling distribution paths, but only one active path
    • provides redundancy even during maintenance, usually N + 2
  • Tier IV:
    • two active power/cooling distribution paths, redundant components
  • Most commercial DCs are Tier III and IV
    • Availability for II, III, IV: 99.75%, 99.98%, 99.995%
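Those availability percentages translate into yearly downtime as follows:

```python
HOURS_PER_YEAR = 365 * 24  # 8760

def downtime_hours_per_year(availability_pct):
    # Hours per year the datacenter is expected to be unavailable
    return (1 - availability_pct / 100) * HOURS_PER_YEAR

downtime = {tier: downtime_hours_per_year(a)
            for tier, a in [("II", 99.75), ("III", 99.98), ("IV", 99.995)]}
# Tier II ≈ 21.9 h/yr, Tier III ≈ 1.75 h/yr, Tier IV ≈ 0.44 h/yr
```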

62 of 63

Summary


  • Cloud Computing … making computing a utility
    • Cost effective, elastic, less risk for customers
  • Datacenter design
    • Datacenter network: Tree structure, Facebook F16
    • Datacenter storage: distributed vs centralized
    • Storage hierarchy: latency, BW, and Size
  • Datacenter power
    • Datacenters are characterized by IT power level (e.g., 100 MW)
    • Power distribution and cooling
    • Measuring efficiency: PUE, SPUE
    • Energy non-proportionality of servers
      • CPU, and even worse for disks and DRAM

63 of 63

References


  • Abts and Felderman, “A Guided Tour of Data-Center Networking,” Communications of the ACM, June 2012
    • Required reading