1 of 62


Adaptive Meta-scheduling for Computational Workloads

Abdulrahman Azab

azab@uio.no

2 of 62

Computational Grid

[Diagram: a single grid, with a broker mediating between clients (C) and workers (W)]

3 of 62

Inter-Grid Structure

  • Hierarchical

(UNICORE, EGI, BOINC,…)

[Diagram: a meta-scheduler sitting on top of the hierarchical grid brokers]

4 of 62

Inter-Grid Structure

  • Hierarchical [Limitation]

(UNICORE, EGI, BOINC,…)


5 of 62

Inter-Grid Structure

  • Client initiated [Limitation]

(Condor flocking)


6 of 62

Inter-Grid Structure

  • Client initiated [Limitation]

(Condor flocking)

How many connections?

For how long?

How many user authentications?

7 of 62

Inter-Grid Structure

  • Broker Overlay

(P2P Condor flocking, Grid Federation)

[Diagram: brokers B-1 through B-5 connected in an overlay]

8 of 62

Inter-Grid Architecture

  • Broker Overlay [Limitation]

(P2P Condor flocking, Grid Federation)


9 of 62

Inter-Grid Architecture

  • Consider:

  • Grid as a bus
  • Grid node as a bus seat
  • Grid job as a passenger

10 of 62

Inter-Grid Architecture

  • Problem


11 of 62

Inter-Grid Architecture

  • Solution: Load balancing

12 of 62

Inter-Grid Architecture

  • Load balancing

[Diagram: a gateway distributing load across many Stroll-S instances]

13 of 62

Slick Inter-Grid Architecture

[Diagram: the gateway scheduler routes jobs among Stroll-S instances, targeting grids with vacant matching workers]

14 of 62

usegalaxy.eu - Under the Hood

Being built from common components

  • Continuous integration
  • all open source
  • few dedicated servers/HW
  • highly scalable
  • federated

[Architecture diagram: the portal and master node feed jobs from the Galaxy queue to remote Pulsar endpoints (Pulsar-1 Italy, Pulsar-2 Belgium, Pulsar-3 France, ..., Pulsar-12 UK), each a small cluster of compute nodes (c001-c005). Local compute runs on BWCloud (Intel + AMD architectures, > 8000 CPU cores, > 42 TiB RAM, > 200 TiB storage, > 20 T4 GPUs) with node pools (node0-node15) and > 3 PB of NFS storage. Supporting services: DB server, message queue, stratum 1 server, TIG monitoring stack, Celery, DIRAC, and the Galaxy pilot factory.]

15 of 62

Slick Design

[Diagram: Slick design. A SLICK gateway pairs a gateway scheduler (its own queue q0-q4 plus a starvation/far-data detector) with the local scheduler and its queue, local clients (C), and local workers (W). The gateway's broker (B) links the site into the neighborhood broker overlay. An information service (IS) and a DSM share resource information (GIS, RIS1-RISn) over a communication channel; jobs are received and submitted between the gateway queue and the local queue over the job bus, and resource data flows over the information bus.]

16 of 62

Inter-grid Job Submission

[Diagram: four grids, each with its own broker, clients (C), and workers (W); a job submitted in one grid is forwarded through the brokers to workers in another grid]

17 of 62

Slick Scheduling Model

universe = java

……

requirements = OpSys == "LINUX" && Arch == "INTEL"

Image_size = 2800 Meg

…….

queue 60

[Diagram: the job description enters the ClassAd matchmaker, which screens the initial domain list (D 1, D 2, D 3, D 4, ..., D 50); the fuzzy matchmaker turns this into a match list (D 15, D 22, D 20, D 1, D 50). A data-locality step then orders the match list by the hop count of each domain's route (the 1-hop routes r #1, r #3, r #4 ahead of the 2-hop route r #7 and the 4-hop route r #9), producing the ordered match list.]
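As a sketch of the data-locality step (a minimal Python illustration; the domain-to-hop mapping below is made up to mirror the example, not taken from Slick's implementation):

# Fuzzy matchmaker output: candidate domains, best fuzzy match first.
match_list = ["D15", "D22", "D20", "D1", "D50"]

# Routing table: hops from this broker to each candidate domain (illustrative).
hops = {"D15": 1, "D22": 1, "D1": 1, "D20": 2, "D50": 4}

# A stable sort by hop count keeps the fuzzy ranking inside each
# hop-count tier, yielding the ordered match list.
ordered = sorted(match_list, key=lambda d: hops[d])
print(ordered)   # ['D15', 'D22', 'D1', 'D20', 'D50']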

18 of 62

Fuzzy Matchmaker

  1. Fuzzification of Inputs.
  2. Applying Fuzzy Operator.
  3. Applying Implication Method.
  4. Defuzzification.


19 of 62

1. Fuzzification of Inputs

  • CPU input membership function

[Plot: membership functions for domains D1, D2, and D3 over the number of CPU slots (0-3000), degree of membership 0-1 on the y-axis. For the example job, the degrees read off are 0.9 (D1), 0.65 (D2), and 0.24 (D3).]

20 of 62

1. Fuzzification of Inputs

  • Memory input membership function

[Plot: membership functions for domains D1, D2, and D3 over Image_size (in units of 100 GB, 0-6), degree of membership 0-1 on the y-axis. For the example job, the degrees read off are 1.0 (D1), 0.75 (D2), and 0.36 (D3).]

21 of 62

[Plot: the D1, D2, and D3 membership functions over # CPU slots with a threshold T, degree of membership on the y-axis; example degree 0.13]

22 of 62

2. Applying Fuzzy Operator

A separate fuzzy rule is created for each domain, Di, as a function of job j, as follows:

IF CPUSlotsOf (j) IS AllocatableInCPUSlots(Di)

AND MemoryImageOf (j) IS AllocatableInVirtualMemory(Di)

THEN j IS AllocatableIn(Di)

** Fuzzy AND is represented as Minimum()
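The rule weight is then specified per domain; since fuzzy AND is Minimum(), a plausible form (a reconstruction, not the verbatim thesis formula) is:

weight[Di] = MIN( μ_CPUSlots(Di)(j), μ_VirtualMemory(Di)(j) )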

23 of 62

3. Applying Implication Method

  • The consequent of a rule is an output fuzzy set represented by an output membership function.
  • The output membership function associated with each fuzzy set is a unique identifier of the associated domain.

[Plot: output fuzzy sets for D1, D2, and D3 with implied weights 0.52, 0.9, and 0.36]

24 of 62

4. Defuzzification

Output = MAX (weight [D1], weight [D2], weight [D3])
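A minimal Python sketch of all four steps, assuming trapezoidal membership functions with illustrative breakpoints standing in for the plotted curves on the earlier slides (domain names and the MIN/MAX operators follow the slides; everything else is assumed):

def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: rises over a..b, flat over b..c, falls over c..d."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

# Step 1 inputs: per-domain membership functions (illustrative breakpoints).
cpu_mf = {"D1": (0, 0, 1000, 2000),        # x = # CPU slots
          "D2": (500, 1500, 2500, 3000),
          "D3": (1500, 2500, 3000, 3000)}
mem_mf = {"D1": (0, 0, 200, 400),          # x = Image_size in GB
          "D2": (100, 300, 500, 600),
          "D3": (300, 500, 600, 600)}

def fuzzy_match(cpu_slots, image_gb):
    weights = {}
    for d in cpu_mf:
        mu_cpu = trapezoid(cpu_slots, *cpu_mf[d])   # 1. fuzzification
        mu_mem = trapezoid(image_gb, *mem_mf[d])    # 1. fuzzification
        weights[d] = min(mu_cpu, mu_mem)            # 2.+3. fuzzy AND = Minimum()
    return max(weights, key=weights.get), weights   # 4. defuzzification = MAX

best, weights = fuzzy_match(1200, 250)
print(best, weights)   # D1 {'D1': 0.75, 'D2': 0.7, 'D3': 0.0}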


25 of 62

Binary Matchmaking

[Diagram: a broker collecting metadata from its workers]

Name = slot1@52016.uis.no

VirtualMemory = 34241

TotalDisk = 3704876

Disk = 34951

CondorLoadAvg = 0.000000

LoadAvg = 0.590000

KeyboardIdle = 6

ConsoleIdle = 6

Memory = 77

Cpus = 1

Arch = "INTEL"

OpSys = "WINNT51"

……………………..

Workers Metadata

26 of 62

Matchmaking


HasJava = TRUE

VirtualMemory = 34241

Memory = 77

Arch = "INTEL"

OpSys = "LINUX"

……………………..

universe = java

……

requirements = OpSys == "LINUX" && Arch == "INTEL"

Image_size = 28 Meg

…….

The job attributes evaluate to true against the worker attributes → MATCH
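A toy version of the predicate this comparison evaluates (plain Python standing in for ClassAd evaluation; attribute values follow the slide):

# Worker attributes (from the worker's ClassAd).
worker = {"HasJava": True, "VirtualMemory": 34241,   # KB
          "Memory": 77, "Arch": "INTEL", "OpSys": "LINUX"}

# Job: universe = java, Image_size = 28 Meg,
# requirements = OpSys == "LINUX" && Arch == "INTEL"
def requirements(w):
    return w["OpSys"] == "LINUX" and w["Arch"] == "INTEL"

image_size_mb = 28
match = (requirements(worker)
         and worker["HasJava"]                                  # java universe needs a JVM
         and image_size_mb * 1024 <= worker["VirtualMemory"])   # both sides in KB
print("MATCH" if match else "NO MATCH")                         # MATCH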

27 of 62

Sharing resource information

[Diagram: every broker gossiping its full worker metadata to every other broker in the overlay]

Lots of data: > 1 GB for 10,000 workers

28 of 62

Binary (bitwise) operations are far faster than string-based matching


29 of 62

Binary representation of Job and Worker Attributes

  • Boolean attributes
  • Multi-valued attributes
  • Range-valued attributes

29

30 of 62

Boolean Attributes

  • HasJava, HasMPI (true/false)
  • Represented in a 16-bit operand B so that:

Attribute i is TRUE → bi = 0

  • The Boolean attribute operands of the worker and the job, Bw and Bj, are represented in the same way.

31 of 62

Multi-valued Attributes

  • OpSys, State
  • Each value is represented in 4 bits.
  • Represented in a 64-bit operand M → holds up to 16 attributes.
  • The multi-valued attribute operands of the worker and the job, Mw and Mj, are represented in the same way.

Arch       Code   |  OperSys     Code   |  State       Code
INTEL      0001   |  LINUX       0001   |  Unclaimed   0001
IA64       0010   |  WINNT51     0010   |  Claimed     0010
ALPHA      0011   |  WINNT52     0011   |  Owner       0011
SGI        0100   |  SOLARIS25   0100   |  Matched     0100

32 of 62

Range-valued Attributes

  • LoadAvg, TotalCPUs, Memory
  • Each is represented in 16 bits with an increment unit I.
  • Represented in a 128-bit operand R → holds up to 8 attributes.
  • Bit values of R are set so that a bitwise subset test reflects numeric comparison; one possible encoding is sketched below.
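One bit-setting rule consistent with the subset test used in the matchmaking slides is a thermometer code; the Python sketch below is an assumed reconstruction, not necessarily the thesis's exact encoding:

BITS_PER_ATTR = 16

def encode_range(value, increment):
    """Set bit k iff value < (k+1)*increment, i.e. mark the capacity
    the value does NOT reach. A worker's bits are then a subset of a
    job's bits exactly when worker_value >= job_value."""
    word = 0
    for k in range(BITS_PER_ATTR):
        if value < (k + 1) * increment:
            word |= 1 << k
    return word

# Memory attribute with increment I = 256 MB:
rw = encode_range(4096, 256)    # worker offers 4096 MB -> no bits set
rj = encode_range(1024, 256)    # job asks for 1024 MB  -> bits 4..15 set
print(rw & rj == rw)            # True: the worker satisfies the job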


33 of 62

Range-valued Attributes

  • Rw → range-valued attribute operand of the worker.
  • Rj → range-valued attribute operand of the job.

34 of 62

Attribute Records

  • WAR → worker attribute record
  • JAR → job attribute record

Both records share one layout:

         Boolean attrs   Range-valued attrs   Multi-valued attrs
WAR:     Bw (16 bits)    Rw (128 bits)        Mw (64 bits)
JAR:     Bj (16 bits)    Rj (128 bits)        Mj (64 bits)
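For scale (a back-of-the-envelope figure, not from the thesis): a WAR occupies 16 + 128 + 64 = 208 bits, i.e. 26 bytes, so the 10,000 workers whose textual ClassAds cost over 1 GB (slide 27) need only about 260 KB of binary attribute records, excluding worker ids and bookkeeping.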

35 of 62

Matchmaking

  • BwRw Ʌ BjRj = BwRw

AND

  • Mw Ʌ MSK = Mj

A worker matches a job when both conditions hold, which can be tested with XOR:

((BwRw Ʌ BjRj) ⊕ BwRw = 0)  &  ((Mw Ʌ MSK) ⊕ Mj = 0)  →  Match

where BwRw and BjRj are the concatenated 16-bit Boolean and 128-bit range-valued fields of the WAR and JAR, and Mw, Mj are their 64-bit multi-valued fields.
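Putting the operands together, a Python sketch of the complete test (the Mj/MSK example values follow the next slide; field offsets and helper names are illustrative):

def binary_match(bw, rw, mw, bj, rj, mj, msk):
    """Match iff the worker's concatenated B+R bits are a subset of
    the job's, and the masked multi-valued fields are equal."""
    bwrw = (bw << 128) | rw        # concatenate B (16 bits) and R (128 bits)
    bjrj = (bj << 128) | rj
    return (bwrw & bjrj) == bwrw and (mw & msk) == mj

# Job constrains field 0 (Arch = IA64 -> 0010) and field 3 (OpSys = LINUX -> 0001).
mj  = (0b0001 << 12) | 0b0010
msk = (0b1111 << 12) | 0b1111                   # 1111 in exactly the constrained fields
mw  = (0b0001 << 12) | (0b0010 << 4) | 0b0010   # worker: LINUX, Claimed, IA64

print(binary_match(0, 0, mw, 0, 0, mj, msk))    # True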

36 of 62

Matchmaking

  • MSK → multi-valued attribute mask for the job attribute record.

Example (64-bit operands; each attribute occupies one 4-bit field):

Mj   = 0010 0000 ... 0000 0001
MSKj = 1111 0000 ... 0000 1111

Mj[0] holds Arch = IA64 (0010) and Mj[3] holds OpSys = LINUX (0001); MSKj is 1111 in exactly the fields the job constrains and 0000 elsewhere, so Mw Ʌ MSK keeps only those fields for comparison with Mj.

37 of 62

Binary Information set database BIS-db

[Table: a BIS-db holds one row per worker, w1, w2, ..., wn, each with the worker id and its WAR (B, R, and M operands). The database also carries the broker id, the number of unclaimed workers, and a time-stamp.]

38 of 62

Broker Architecture

[Diagram: broker architecture. The gateway broker combines the gateway scheduler (queues q0-q4, a starvation detector) with the local scheduler, local clients (C), and local workers (W). Received and submitted jobs pass through the job queue over the job bus; resource data flows over the information bus between the information service (IS, backed by IS-db) and the binary information service (BIS, backed by BIS-db), which answers BIS-db requests with BIS-db responses and labels each job with a job profile. A communication controller connects the local broker (B) to the neighborhood broker overlay.]

39 of 62

Speedup


40 of 62

InterGridSim

[Diagram: InterGridSim protocol stacks. Every regular node runs an allocation protocol, an idle protocol, and a grid CD protocol; a broker node additionally runs the broker protocol with a service allocator, and brokers connect to one another through the broker overlay.]

41 of 62

InterGridSim

  • Broker Overlay

[Diagram: brokers B-1 through B-5 in the simulated overlay]

42 of 62

InterGridSim

  • Broker Overlay Topologies

[Diagram: each broker's neighbor list (1, 2, ..., K) under four overlay topologies: Ring, Hyper-Cube, Fully connected, and Wire-k-out]
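For example, in a hypercube overlay of 2^dim brokers, a broker's neighbors are the indexes that differ from its own in exactly one bit, so any broker is reachable in at most dim hops (a sketch, not InterGridSim's code):

def hypercube_neighbors(i, dim):
    """Neighbors of broker i in a dim-dimensional hypercube overlay:
    flip each of the dim index bits in turn."""
    return [i ^ (1 << b) for b in range(dim)]

print(hypercube_neighbors(5, 3))   # broker 5 of 8 -> [4, 7, 1]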

43 of 62

InterGridSim: Output

=======================Cycle : 1991================ Cycle time (ms): 139
# Node types        Type 1 [OS LINUX, CPU 4, M 8]: 20001    Type 2 [OS LINUX, CPU 2, M 4]: 29999
# Workload types    Type 1 [OS LINUX, CPU 2, M 4]: 30000    Type 2 [OS LINUX, CPU 2, M 4]: 40000
Total generated workloads for allocation: 70000
Succeeded allocations: 44909    Completed workloads: 0    Deployments/cycle: 7
Waiting workloads: 25091
No. of workload exchanges: 62908    Connections between brokers/cycle: 700
(Deployed at) broker indexes: 15 27 49 6 44 36 34 23 45 26 9 5 8 41 17 30 24 40 32 47
Broker indexes:            (0)  (1)  (2)  (3)  (4)  (5)  (6)  (7) ... (50)
Broker queue size:        2835    0    0    0    0  182  182    0
No. of free CPUs:            0 3970 3970 3970 3970 1566 1566 3970
Broker queued workloads:     1   15   15   15 1400 1400   15
Broker local allocations:    2   15   15   15   15 1218 1218   15

44 of 62

Evaluation

  • We test the efficiency of five inter-grid scheduling techniques:

  1. UNICORE service orchestrator
  2. Condor flocking
  3. P2P Condor flocking
  4. Slick first-match
  5. Slick best-match


45 of 62

Evaluation

  • Benchmarks:

  1. Validity of the stored resource information
  2. Job Allocation Throughput
  3. Load Balancing


46 of 62

Validity of the stored resource information

  • The deviation of the read time-stamps of the RIDBs stored in a broker's resource information data set from the broker's current cycle, tracked across the simulation cycles.

  • The deviation value for cycle (c) is defined as shown below:
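A plausible reading of the deviation value (a reconstruction from the description above, not the verbatim thesis formula) is the mean age of the stored records at cycle c:

dev(c) = (1/n) × Σ_{i=1..n} (c − t_i)

where t_i is the read time-stamp of the i-th RIDB in the data set and n is the number of stored RIDBs.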

47 of 62

Evaluation

  • Validity of the stored resource information.
  • Impact of broker failure on resource information updating.

N → total grid size, M → number of VOs

48 of 62

Validity of the stored resource information

[Plots: N = 100, M = 20; N = 500, M = 100 (log scale)]

49 of 62

Impact of Broker Failures on Resource Information Updating (N = 500, M = 100)

[Plot: ring topology]

50 of 62

Evaluation: Workload allocation

Simulation setup:

  • Total Network size = 50,000 nodes
  • Number of Domains = 512
  • Total number of jobs = 80,000
  • Number of Job sequences = 100
  • Network topology of Slick broker overlay is HyperCube.


51 of 62

Results

  1. Job Allocation Throughput


52 of 62

Results

2. Load Balancing


53 of 62

Results

2. Load Balancing


54 of 62

Thank You! Questions?


55 of 62

What is Grid computing?


Instead of running programs on one computer, they can be distributed among many computers on the Internet.

56 of 62

What is Grid computing?


"Grid computing is concerned with coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations."

Ian Foster & Carl Kesselman, 2001.

[Diagram: three virtual organizations VO1, VO2, VO3, each with a broker B1, B2, B3]

57 of 62

What is Cloud?


“A large-scale distributed computing paradigm that is driven by economies of scale, in which a pool of abstracted, virtualized, dynamically-scalable, managed computing power, storage, platforms, and services are delivered on demand to external customers over the Internet”

Ian Foster, Yong Zhao, Ioan Raicu, and Shiyong Lu, 2008.

58 of 62

Grid vs Cloud

  • Grid

User (Ali): "I need a Scientific Linux with 2 GB RAM!"

Resource (Hero): "I have Scientific Linux with 3 GB RAM."

Manager(s), to Ali: "Take Hero."

59 of 62

Grid vs Cloud

  • Cloud

User: "I need 3 high-CPU Windows machines for 2 weeks."

Cloud provider: "Available for $1000."

60 of 62

Grid vs Cloud

  • The Cloud is NOT an upgrade to the Grid
  • The Grid is a computing paradigm
  • The Cloud is a computing platform


61 of 62

What comes out of Grid computing?

  • Proven statistics: Grid


278,832 volunteer computers

Speed: 769 teraFLOPS = 769 * 10^12 arithmetic operations per second

Cost: FREE

62 of 62

What comes out of Grid computing?

  • Proven statistics: Supercomputer


Model: IBM Blue Gene/P

Speed: 478 teraFLOPS

Cost: about $1.3 million