1 of 109

Firmament

Fast, centralized cluster scheduling at scale

1

Ionel Gog @ICGog

joint work with

Malte Schwarzkopf, Adam Gleave, Robert N. M. Watson & Steven Hand

2 of 109

Sesame Inc

2

Meet Sesame, Inc.

  • Sesame’s employees run:
    • Interactive data analytics that must complete in seconds
    • Batch processing jobs that increase resource utilization
    • Long-running services that must provide high performance

3 of 109

Ideal scheduler

3

The cluster scheduler must achieve:

1. Resources are used efficiently

    • leverage heterogeneous hardware
    • high utilization without interference

2. Low task scheduling latency

    • support interactive tasks
    • no idle resources

4 of 109

Centralized vs. distributed

4

State of the art:

  • Centralized schedulers: sophisticated algorithms, good task placements [Borg, Quincy, Quasar]
  • Distributed schedulers: simple heuristics, low scheduling latency [Sparrow, Tarcil, Yaq-d]
  • Hybrid schedulers: split the workload and provide either one or the other [Mercury, Hawk, Eagle]

Can’t get both good placements and low latency for the entire workload!

5 of 109

Centralized vs. distributed

5

Firmament provides a solution!

  • Centralized architecture
  • Good task placements
  • Low task scheduling latency
  • Scales to 10,000+ machines

6 of 109

Task-by-task intro

6

Task-by-task scheduler

[Figure: interactive, batch, and service tasks wait in a queue at a task-by-task scheduler, which places them one at a time onto machines in Rack 1 and Rack 2; one queued task has a preference for the first rack]



9 of 109

Task-by-task: first task

9

Task-by-task scheduler

[Figure: the first task is placed on free resources outside its preferred rack 1]

Option 1: place task on available resources

10 of 109

Task-by-task: first task

10

Task-by-task scheduler

[Figure: an already-running task is preempted and migrated to make room ("Why migrate me?")]

Option 2: preempt/migrate task to meet requirements

11 of 109

Task-by-task: first task

11

Task-by-task scheduler

1. Decisions restrict the choice for queued tasks

2. Do not reschedule all already running tasks

12 of 109

Flow-based intro

12

  • Finds optimal task placements
  • Min-cost flow-based centralized scheduler

Quincy: Fair Scheduling for Distributed Computing Clusters [SOSP 2009]


13 of 109

Flow-based schedules all tasks

13

Min-cost flow scheduler

[Figure: the same queue of interactive, batch, and service tasks and the two racks, now handled by a min-cost flow scheduler; one task still prefers the first rack]

14 of 109

Flow-based schedules all tasks

14

Min-cost flow scheduler

[Figure: all waiting and running tasks across both racks are considered together]

Schedules all tasks at the same time

Running tasks are reconsidered

15 of 109

Flow-based schedules all tasks

15

Min-cost flow scheduler

[Figure: a running task is migrated]

Considers tasks for migration or preemption

16 of 109

Flow-based schedules all tasks

16

Min-cost flow scheduler

[Figure: the resulting placement across both racks]

Globally optimal placement!

17 of 109

Flow scheduling: tasks & machines

17

Introduction to min-cost flow scheduling

[Figure: tasks T0-T5 on the left, machines M0-M4 on the right]

18 of 109

Flow scheduling: tasks to machines

18

Introduction to min-cost flow scheduling

[Figure: arcs connect each task T0-T5 to the machines M0-M4 it may run on]

19 of 109

Flow scheduling: zoom in

19

Introduction to min-cost flow scheduling

[Figure: zooming in on one task: each task-to-machine arc carries a cost, e.g. cost 3 to one machine and cost 5 to another]

20 of 109

Flow scheduling: Min costs for all

20

Introduction to min-cost flow scheduling

[Figure: every task-to-machine arc is annotated with a cost (values such as 2, 3, 5, 6, and 9)]

Min-cost flow places tasks with minimum overall cost
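To make the objective concrete, below is a minimal, self-contained C++ sketch that finds the cheapest assignment for a toy instance by enumerating all permutations. The cost values are made up for illustration; a real scheduler such as Quincy or Firmament reaches the minimum-cost placement by solving a min-cost flow problem over the graph above rather than by enumeration.

#include <algorithm>
#include <array>
#include <climits>
#include <iostream>

int main() {
  // Hypothetical per-arc costs: cost[t][m] is the cost of running task t on machine m.
  const int kTasks = 3, kMachines = 3;
  const int cost[kTasks][kMachines] = {{3, 5, 9}, {5, 3, 6}, {2, 6, 4}};

  // Try every one-task-per-machine assignment and keep the cheapest one.
  std::array<int, kMachines> machine = {0, 1, 2};  // machine[t] = machine chosen for task t
  std::array<int, kMachines> best = machine;
  int best_total = INT_MAX;
  do {
    int total = 0;
    for (int t = 0; t < kTasks; ++t) total += cost[t][machine[t]];
    if (total < best_total) { best_total = total; best = machine; }
  } while (std::next_permutation(machine.begin(), machine.end()));

  for (int t = 0; t < kTasks; ++t) std::cout << "T" << t << " -> M" << best[t] << "\n";
  std::cout << "minimum overall cost: " << best_total << "\n";
}

Enumeration is exponential in the number of tasks; the flow formulation on the next slides is what makes the same objective tractable at cluster scale.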

21 of 109

Flow scheduling: pushing flow for n tasks

21

Introduction to min-cost flow scheduling

[Figure: each task node carries a flow supply of 1; the machines connect to a sink node S, which demands 6 units of flow in total, and flow is pushed from the tasks through the machines to the sink]

22 of 109

Flow scheduling: pushing flow for n tasks

22

Introduction to min-cost flow scheduling

[Figure: after pushing flow, five tasks have drained their supply; only T5 still has a supply of 1, leaving a flow demand of 1 unmet]

Assignments:

  • T0 to M0
  • T1 to M3
  • T2 to M1
  • T3 to M4
  • T4 to M2

23 of 109

Flow scheduling: pushing flow for n tasks

23

Introduction to min-cost flow scheduling

[Figure: the same graph; T5 still holds a supply of 1 and a flow demand of 1 remains unmet]

T5’s supply is not routed

24 of 109

Flow scheduling: pushing flow for n tasks

24

S

Introduction to min-cost flow scheduling

T0

M0

M1

M2

M3

M4

T1

T2

T3

T4

T5

Flow supply

0

0

0

0

0

1

Flow demand: 1

U

Unscheduled node

25 of 109

Flow scheduling: pushing flow for n tasks

25

Introduction to min-cost flow scheduling

[Figure: T5's supply drains through the unscheduled node U; all supplies and the remaining flow demand drop to 0]
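The construction above maps directly onto a graph data structure. The C++ sketch below encodes the toy instance from these slides, six tasks with a unit of supply each, five machines with one free slot each, an unscheduled node U, and a sink S demanding six units; the per-arc costs are placeholders for whatever the scheduling policy computes, and an off-the-shelf min-cost max-flow solver would then route the supply to the sink.

#include <iostream>
#include <string>
#include <vector>

// One arc of the scheduling flow network (the layout follows the slides; costs are illustrative).
struct Arc {
  std::string src, dst;
  int capacity;  // how many tasks the arc can carry
  int cost;      // cost per unit of flow, set by the scheduling policy
};

int main() {
  const int kTasks = 6, kMachines = 5;
  std::vector<Arc> arcs;

  for (int t = 0; t < kTasks; ++t) {
    const std::string task = "T" + std::to_string(t);
    // Task-to-machine arcs; the placeholder costs stand in for policy-defined ones.
    for (int m = 0; m < kMachines; ++m)
      arcs.push_back({task, "M" + std::to_string(m), 1, 1 + (t + m) % 5});
    // Fallback arc to the unscheduled node, with a high cost for remaining unscheduled.
    arcs.push_back({task, "U", 1, 100});
  }
  for (int m = 0; m < kMachines; ++m)
    arcs.push_back({"M" + std::to_string(m), "S", 1, 0});  // one free slot per machine
  arcs.push_back({"U", "S", kTasks, 0});  // unscheduled tasks also drain to the sink

  // Each task node carries a supply of 1; the sink S demands kTasks units in total.
  std::cout << kTasks << " task nodes, " << kMachines << " machine nodes, U, S; "
            << arcs.size() << " arcs\n";
}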

26 of 109

Quincy doesn’t scale: intro

26

How well does the Quincy approach scale?

27 of 109

Quincy doesn’t scale: empty figure

27

Simulated Quincy using Google trace, 50% utilization

[Figure: empty plot axes, extending to the size of the Google cluster; lower scheduling latency is better]

28 of 109

Quincy doesn’t scale

28

Simulated Quincy using Google trace, 50% utilization

[Figure: lower is better; scheduling latency is 66 sec on average]

Too slow! 30% of tasks wait to be scheduled for over 33% of their runtime and waste resources

29 of 109

Quincy doesn’t scale

29

Simulated Quincy using Google trace, 50% utilization

Goal: sub-second scheduling latency in common case


30 of 109

Firmament contributions

30

Contributions

  • Low task scheduling latency
    • Uses best suited min-cost flow algorithm
    • Incrementally recomputes the solution
  • Good task placement
    • Same optimal placements as Quincy
    • Customizable scheduling policies

31 of 109

Firmament scheduler: intro

31

[Figure: Firmament architecture: the cluster master runs the scheduler, which combines the task table, task statistics, and cluster topology with a scheduling policy and a flow graph; an agent runs on each machine]

32 of 109

Firmament policy specification

32

class QuincyPolicy {
  Cost_t TaskToResourceNodeCost(TaskID_t task_id) {
    return task_unscheduled_time * quincy_wait_time_factor;
  }
  ...
};

Specifying scheduling policies

Defines flow graph

N.B: More details in the paper.
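Other policies plug in by supplying their own cost callbacks in the same style. The fragment below is a purely illustrative sketch of what a network-aware cost term might look like; the type aliases, the helper function, and the member names are stand-ins, not Firmament's actual API.

// Hypothetical sketch, not Firmament's real interface; the typedefs and helper
// below are stand-ins so the fragment is self-contained.
typedef long long Cost_t;
typedef unsigned long long TaskID_t, MachineID_t;
double MachineNetworkUtilization(MachineID_t machine_id);  // e.g. from machine statistics

class NetworkAwarePolicy {
 public:
  Cost_t TaskToMachineCost(TaskID_t task_id, MachineID_t machine_id) {
    // Prefer machines with lightly loaded NICs: the cost grows with the fraction
    // of link bandwidth already in use on the candidate machine.
    const double utilization = MachineNetworkUtilization(machine_id);  // 0.0 to 1.0
    return static_cast<Cost_t>(utilization * kMaxNetworkCost);
  }
 private:
  static const Cost_t kMaxNetworkCost = 1000;
};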

33 of 109

Firmament scheduler: intro

33

[Figure: the same architecture; the scheduling policy defines the flow graph]

34 of 109

Firmament scheduler: submit to solver

34

[Figure: the scheduling policy defines the flow graph, and the scheduler submits the graph to a min-cost max-flow solver]

35 of 109

Firmament scheduler: slow min-cost flow solver

35

[Figure: the same pipeline]

Most of the scheduling time is spent in the min-cost max-flow solver

36 of 109

Algorithms complexity: Successive shortest path

36

Algorithm      | Worst-case complexity
Cost scaling   | O(V²E log(VC))   (used by Quincy)

E: number of arcs; V: number of nodes; U: largest arc capacity; C: largest cost value. E > V > CU.

37 of 109

Algorithms results: Cost scaling

37

Cost scaling is too slow beyond 1,000 machines

[Figure: algorithm runtime vs. cluster size; subsampled Google trace, 50% slot utilization, Quincy policy; lower is better; the sub-second goal is marked]

38 of 109

Algorithms complexity: Cost scaling

38

Algorithm                  | Worst-case complexity
Cost scaling               | O(V²E log(VC))
Successive shortest path   | O(V²U log(V))   (lower worst-case complexity)

E: number of arcs; V: number of nodes; U: largest arc capacity; C: largest cost value. E > V > CU.

39 of 109

Algorithms results: successive shortest path

39

Successive shortest path only scales to ~100 machines

[Figure: algorithm runtime vs. cluster size; subsampled Google trace, 50% slot utilization, Quincy policy; lower is better; well above the sub-second goal]

40 of 109

Algorithms complexity: Relaxation

40

Algorithm                  | Worst-case complexity
Cost scaling               | O(V²E log(VC))
Successive shortest path   | O(V²U log(V))
Relaxation                 | O(E³CU²)   (highest worst-case complexity)

E: number of arcs; V: number of nodes; U: largest arc capacity; C: largest cost value. E > V > CU.

41 of 109

Algorithms results: Relaxation

41

Relaxation meets our sub-second latency goal

[Figure: algorithm runtime vs. cluster size; subsampled Google trace, 50% slot utilization, Quincy policy; lower is better; Relaxation stays below the goal]

42 of 109

Relaxation single-ish pass

42

Why is Relaxation fast?

[Figure: flow graph with tasks T0-T4, rack aggregators R0 and R1, machines M0-M3, an aggregator X, and the sink S; Relaxation pushes flow in a single-ish pass]

Relaxation is well-suited to the graph structure

43 of 109

Slow Relaxation: tasks -> machine

43

Relaxation suffers in pathological edge cases

[Figure: tasks push flow towards machines at medium and high utilization; the highlighted arc has capacity for 1 task]

44 of 109

Slow Relaxation: machine -> sink

44

Relaxation suffers in pathological edge cases

[Figure: the flow continues from the machine to the sink over an arc with capacity for 1 task]

45 of 109

Slow Relaxation: machine -> tasks

45

Relaxation suffers in pathological edge cases

[Figure: the arc is now full (capacity 0 tasks left), so flow has to be pushed back from the machine towards the tasks]

Relaxation cannot push flow in a single pass any more

46 of 109

High utilization introduction

46

How bad does Relaxation’s edge case get?

Experimental setup:

  • Simulated 12,500 machine cluster
  • Used the Quincy scheduling policy
  • Utilization from over 90% up to an oversubscribed cluster

47 of 109

High utilization empty figure

47

Quincy policy, 12,500-machine cluster, job of increasing size

[Figure: empty axes; algorithm runtime as utilization increases; lower is better]

48 of 109

High utilization Relaxation

48

Relaxation’s runtime increases with utilization

[Figure: Quincy policy, 12,500-machine cluster, job of increasing size; lower is better]

49 of 109

High utilization Relaxation & Cost scaling

49

Cost scaling is unaffected by high utilization

[Figure: Quincy policy, 12,500-machine cluster, job of increasing size; lower is better; at the highest utilizations cost scaling is faster than Relaxation]

50 of 109

Solver runs both algorithms

50

[Figure: the scheduler hands the input flow graph to the min-cost max-flow solver, which runs both cost scaling and Relaxation and returns the output flow]
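One way to realize "run both algorithms" is to race the two solvers on the same graph and take whichever result is ready first, keeping the common case (Relaxation is fast) without losing the high-utilization case (cost scaling stays fast). The C++ sketch below shows only that control flow; the graph type and the solver bodies are placeholders, not Firmament's implementation.

#include <chrono>
#include <functional>
#include <future>
#include <iostream>
#include <string>

struct FlowGraph { /* nodes, arcs, capacities, costs */ };
struct Solution { std::string algorithm; /* and the resulting task placements */ };

// Placeholder solver bodies; the real ones are the min-cost max-flow algorithms.
Solution RunCostScaling(const FlowGraph&) { return {"cost scaling"}; }
Solution RunRelaxation(const FlowGraph&) { return {"relaxation"}; }

Solution SolveWithBothAlgorithms(const FlowGraph& graph) {
  auto relax = std::async(std::launch::async, RunRelaxation, std::cref(graph));
  auto scale = std::async(std::launch::async, RunCostScaling, std::cref(graph));
  // Poll both futures and return the first result that becomes ready. A real
  // system would also tell the losing solver to stop; here its std::async
  // future simply finishes in the background (and joins on destruction).
  while (true) {
    if (relax.wait_for(std::chrono::milliseconds(1)) == std::future_status::ready)
      return relax.get();
    if (scale.wait_for(std::chrono::milliseconds(1)) == std::future_status::ready)
      return scale.get();
  }
}

int main() {
  FlowGraph graph;
  std::cout << "used: " << SolveWithBothAlgorithms(graph).algorithm << "\n";
}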

51 of 109

High utilization push-down runtime

51

Quincy policy, 12,500-machine cluster, job of increasing size

[Figure: lower is better]

Algorithm runtime is still high at utilization > 94%

52 of 109

Incremental cost scaling introduction

52

[Figure: today, the min-cost max-flow solver discards its internal graph state after every run; each run of cost scaling starts from scratch on the full input graph]

53 of 109

Incremental cost scaling

53

[Figure: with incremental cost scaling, the solver keeps the previous graph state and applies only the graph changes before re-optimizing]

54 of 109

54

Incremental min-cost max-flow

[Figure: the previously computed flow through tasks, rack aggregators, machines, and the unscheduled node]

55 of 109

55

Incremental min-cost max-flow

[Figure: a new task T4 arrives; the existing flow is kept, the change is applied, and the solution is re-optimized]

Maintain flow, apply changes, re-optimize
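A rough skeleton of how "maintain flow, apply changes, re-optimize" slots into the scheduling loop is sketched below. The solver internals are deliberately elided, and GraphChange and the member names are illustrative rather than Firmament's real API.

#include <vector>

// Illustrative change record: a task arrived or completed, an arc cost changed, etc.
struct GraphChange { int node_id; };

class IncrementalMinCostFlow {
 public:
  void SolveFromScratch() {
    // Build the full flow network and run cost scaling; keep the graph and flow state.
    have_state_ = true;
  }
  void SolveIncremental(const std::vector<GraphChange>& changes) {
    if (!have_state_) { SolveFromScratch(); return; }
    // 1. Maintain flow: start from the previously computed graph state and flow.
    // 2. Apply changes: patch in new and finished tasks and updated arc costs.
    // 3. Re-optimize: e.g. run price refine, then resume cost scaling from there.
  }
 private:
  bool have_state_ = false;
};

int main() {
  IncrementalMinCostFlow solver;
  solver.SolveFromScratch();                 // first scheduling round
  std::vector<GraphChange> deltas = {{42}};  // e.g. one new task arrived
  solver.SolveIncremental(deltas);           // later rounds reuse the kept state
}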

56 of 109

Incremental Cost scaling results

56

Quincy policy, 12,500-machine cluster, job of increasing size

[Figure: lower is better]

Incremental cost scaling is ~2x faster

57 of 109

57

Techniques to reduce scheduling latency:

1. Run both algorithms
2. Incremental min-cost max-flow
3. Efficiently adjust graph state with price refine
4. Problem-specific min-cost flow optimizations

58 of 109

Ideal scheduler

58

Demo

59 of 109

Centralized vs. distributed

59

Note: many additional experiments in the paper.

Evaluation

  • Centralized: sophisticated algorithms, good task placements (e.g., Borg, Quincy, Quasar)
  • Distributed: simple heuristics, low scheduling latency (e.g., Sparrow, Tarcil)

Does Firmament choose good placements with low latency?

60 of 109

60

How well does Firmament work at scale?

Experimental setup:

  • Simulated 12,500 machine Google cluster
  • Used the Quincy scheduling policy
  • 90% cluster slot utilization

61 of 109

61

Google trace, 12,500 machines, 90% slot utilization

[Figure: task placement latency; lower is better]

Task timeline: submitted -> start scheduling -> placed -> running -> completed, with waiting, scheduling, starting, and running phases in between. Task placement latency spans from submission until the task is placed.
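For reference, the latency metrics used in the evaluation can be read off this timeline. The small C++ sketch below assumes, as the timeline suggests, that placement latency runs from submission until the task is placed and that response time runs from submission until the task completes; the struct and the timestamps are illustrative.

#include <iostream>

struct TaskTimeline {
  // Event timestamps in milliseconds, in the order they appear on the timeline.
  long long submitted, start_scheduling, placed, running, completed;

  long long PlacementLatencyMs() const { return placed - submitted; }
  long long ResponseTimeMs() const { return completed - submitted; }
  // The scheduling phase (start of scheduling until placement) bounds the
  // solver's algorithm runtime from above.
  long long SchedulingPhaseMs() const { return placed - start_scheduling; }
};

int main() {
  TaskTimeline t{0, 40, 520, 900, 5900};  // hypothetical timestamps for one task
  std::cout << "placement latency: " << t.PlacementLatencyMs() << " ms\n"
            << "response time: " << t.ResponseTimeMs() << " ms\n"
            << "scheduling phase: " << t.SchedulingPhaseMs() << " ms\n";
}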

62 of 109

62

Google trace, 12,500 machines, 90% slot utilization

Placement latency is too high!

63 of 109

63

Google trace, 12,500 machines, 90% slot utilization

Firmament places tasks with low-latency at Google scale

Sub-second latency for 98% of tasks

64 of 109

Google acceleration introduction

64

How well does Firmament handle challenging workloads?

Experimental setup:

  • Simulated 12,500 machine Google cluster
  • Used the centralized Quincy scheduling policy
  • Utilization varies between 75% and 95%

65 of 109

Google acceleration empty figure

65

Firmament handles challenging workloads at low latency

Google trace, 12,500 machines, utilization between 75% and 90%

Interactive workloads are simulated by scaling down task runtimes: the median task runtime drops from 420s to 1.7s

[Figure: lower is better]

66 of 109

Google acceleration Cost scaling

66

Google trace, 12,500 machines, utilization between 75% and 90%

[Figure: lower is better]

45 seconds average latency (tuned, down from the Quincy setup’s 66s): average latency is too high even without many short tasks

Firmament handles challenging workloads at low latency

67 of 109

Google acceleration Relaxation

67

Google trace, 12,500 machines, utilization between 75% and 90%

Latency with a 250x acceleration: 75th percentile 2 sec, maximum 57 sec

[Figure: lower is better]

Firmament handles challenging workloads at low latency

68 of 109

Google acceleration Firmament

68

Google trace, 12,500 machines, utilization between 75% and 90%

Firmament’s common-case latency is sub-second even at 250x acceleration

[Figure: lower is better]

Firmament handles challenging workloads at low latency

69 of 109

Workload-mix intro

69

How do Firmament’s placements compare to other schedulers?

Experimental setup:

  • Homogeneous 40-machine cluster, 10G network
  • Mixed batch/service/interactive workload
  • Network-aware scheduling policy

70 of 109

Workload mix-service task figure

70

[Figure: two racks of five machines each (M1-M10) connected through a core switch; interactive and service tasks are shown alongside links with low, medium, and high network utilization]

71 of 109

Workload mix-service task figure

71

[Figure: the same topology; some interactive and service tasks are already running on the machines]

72 of 109

Workload mix-service task figure

72

[Figure: the scheduler now has to place further interactive and service tasks onto this partially loaded, network-constrained cluster]

73 of 109

Workload-mix Sparrow

73

[Figure: task response time; lower is better]

5 seconds task response time on an idle cluster

Firmament chooses good placements

Task timeline: submitted -> start scheduling -> placed -> running -> completed; task response time spans from submission until the task completes.

74 of 109

Workload-mix Sparrow

74

Sparrow is unaware of resource utilization


20% of tasks experience poor placement

Firmament chooses good placements

75 of 109

Workload-mix Kubernetes

75

Kubernetes only accounts for memory and CPU


20% of tasks experience poor placement

Firmament chooses good placements

76 of 109

Workload-mix Docker

76

Centralized Kubernetes and Docker still suffer


Firmament chooses good placements

20% of tasks experience poor placement

77 of 109

Workload-mix Firmament

77


Avoided co-location interference

Firmament chooses good placements

Firmament outperforms centralized and distributed schedulers

78 of 109

Firmament policy specification

78

class QuincyPolicy {
  Cost_t TaskToResourceNodeCost(TaskID_t task_id) {
    return task_unscheduled_time * quincy_wait_time_factor;
  }
  ...
};

Firmament supports multiple policies

Defines flow graph

79 of 109

Quincy policy graph

79

Firmament supports multiple policies

Quincy policy

[Figure: the Quincy flow graph: tasks connect to machines and racks they have a locality preference for, to the cluster aggregator X, and to the unscheduled node U, all of which drain to the sink S]

80 of 109

Co-location policy graph

80

Firmament supports multiple policies

Co-location-aware policy

[Figure: the co-location-aware flow graph models each machine's topology (L2/L3 caches, cores C0 and C1) as additional nodes in the graph]

81 of 109

Network-aware policy graph

81

Firmament supports multiple policies

Network-aware policy

[Figure: the network-aware flow graph: arcs to machines reflect each machine's current network utilization (high vs. low)]

82 of 109

Energy-aware policy graph

82

Firmament supports multiple policies

Energy-aware policy

[Figure: the energy-aware flow graph routes tasks to per-machine cores (C0, C1) via aggregator nodes P and B]

83 of 109

Ideal scheduler

83

Current and future work:

1. Gang scheduling
  • min flow arc requirement and max arc capacity
2. Complex constraints (e.g., task affinity)
  • generalized min-cost flow algorithms
3. Modeling costs for scheduling policies
  • Bayesian optimization

84 of 109

Conclusions

84

Conclusions

  • Low task scheduling latency
    • Uses best algorithm at all times
    • Incrementally recomputes solution
  • Good task placement
    • Same optimal placements as Quincy
    • Customizable scheduling policies

Firmament cluster manager & scheduler:

Firmament Kubernetes integration:

85 of 109

Ideal scheduler

85

The big picture:

  • Musketeer – workflow manager [EuroSys’15]
  • QJump – bounded latency data center networking [NSDI’15, best paper]
  • Broom – memory management for big data [HotOS’15]
  • Firmament – scalable, high-quality cluster scheduling [OSDI’16]
  • Falkirk – rollback recovery for dataflow systems [under submission]

86 of 109

86

Backup slides

87 of 109

Ideal scheduler

87

88 of 109

Ideal scheduler

88

[Figure: task timeline: submitted -> start scheduling -> placed -> running -> completed, with waiting, scheduling, starting, and running phases; the figure marks task placement latency, task response time, and algorithm runtime]

89 of 109

Centralized vs. distributed

89

What do we have today?

Centralized scheduler:
  • Sees entire cluster
  • Sophisticated algorithms
  • Good task placements
  • BUT: high scheduling latency

Distributed schedulers:
  • Partial, stale state
  • Simple algorithms
  • Low scheduling latency
  • BUT: poor task placements

90 of 109

90

How well does Firmament work if the cluster is oversubscribed?

Experimental setup:

  • Simulated 12,500 machine Google cluster
  • Used the Quincy centralized scheduling policy
  • Utilization ~97-100%, with periods of oversubscription

91 of 109

91

Google trace, 12,500 machines, periods of oversubscription

[Figure: algorithm runtime over time; lower is better; the oversubscription periods are marked]

92 of 109

92

Relaxation is slow when utilization is very high

Google trace, 12,500 machines, periods of oversubscription

[Figure: lower is better; a marked region shows the backlog being cleared]

93 of 109

93


Cost scaling’s runtime is not affected by utilization

Google trace, 12,500 machines, periods of oversubscription

94 of 109

94

Google trace, 12,500 machines, periods of oversubscription

[Figure: lower is better; the annotation "Recovers earlier" marks the faster recovery after the oversubscription period]

95 of 109

95

What is Firmament’s scalability limit?

96 of 109

96

Workload mix of a typical cluster

[Figure: share of tasks (%) over time: long-running service tasks plus batch tasks that run for ~100s of seconds]

97 of 109

97

Short batch-only tasks workload

Short batch tasks only (~100s of milliseconds)

[Figure: share of tasks (%) over time]

  • Experiment modeled after Sparrow
    • Figure 12 [Ousterhout et al., SOSP’13]

98 of 109

98

Breaking point experiment setup

  • Non-interfering tasks
  • Hold utilization constant at 80%
  • Decrease task duration (from 1 second down through 500 ms, 250 ms, ...)

99 of 109

99

100 machines, 80% utilization, 10 tasks per job

100 of 109

100

Spark’s scheduler cannot handle tasks <= 1355 ms

100 machines, 80% utilization, 10 tasks per job

101 of 109

101

Firmament handles tasks > 10 ms

100 machines, 80% utilization, 10 tasks per job

102 of 109

102

Firmament handles tasks >= 350 ms

1000 machines, 80% utilization, 10 tasks per job

103 of 109

103

How well does Firmament place tasks with different resource requirements?

Experimental setup:

  • Homogeneous 28-machine cluster, 10G network
  • 80% utilization
  • Scheduler places a 15,000-task workload mix:
    • 30% CPU bound tasks
    • 60% Memory bound tasks
    • 10% Disk bound tasks

104 of 109

104

Homogeneous 28-machine cluster, 80% utilization, 15,000 tasks

[Figure: task runtime relative to the runtime on an otherwise idle cluster (ideal); lower is better]

105 of 109

105

Homogeneous 28-machine cluster, 80% utilization, 15,000 tasks

[Figure: runtime relative to ideal (otherwise idle cluster); lower is better]

2-4x increase in runtime over ideal

106 of 109

106

Homogeneous 28-machine cluster, 80% utilization, 15,000 tasks

[Figure: runtime relative to ideal; lower is better]

Mesos reduces runtime, but it is still 2-3x of ideal

107 of 109

107

Homogeneous 28-machine cluster, 80% utilization, 15,000 tasks

[Figure: runtime relative to ideal; lower is better]

Firmament comes close to the ideal task runtime

108 of 109

Quincy vs Firmament locality

108

Quincy slows down when threshold is decreased

Firmament improves data locality without a significant increase in scheduling latency

Google trace, 12,500 machines, varying locality threshold

[Figure: annotated with "2x" and "18%"]

109 of 109

Approximative solver

109


12,500 machines, runs on a graph from a highly utilized cluster