1 of 109

Firmament

Fast, centralized cluster scheduling at scale

1

Ionel Gog @ICGog

joint work with

Malte Schwarzkopf, Adam Gleave, Robert N. M. Watson & Steven Hand

2 of 109

Sesame Inc

2

Meet Sesame, Inc.

  • Sesame’s employees run:
    • Interactive data analytics that must complete in seconds
    • Batch processing jobs that increase resource utilization
    • Long-running services that must provide high performance

3 of 109

Ideal scheduler

3

The cluster scheduler must achieve:

1. Resources are used efficiently

    • leverage heterogeneous hardware
    • high utilization without interference

2. Low task scheduling latency

    • support interactive tasks
    • no idle resources

4 of 109

Centralized vs. distributed

4

State of the art:

  • Centralized schedulers: sophisticated algorithms, good task placements [Borg, Quincy, Quasar]
  • Distributed schedulers: simple heuristics, low scheduling latency [Sparrow, Tarcil, Yaq-d]
  • Hybrid schedulers: split the workload and provide either one or the other [Mercury, Hawk, Eagle]

Can’t get both good placements and low latency for the entire workload!

5 of 109

Centralized vs. distributed

5

Firmament provides a solution!

  • Centralized architecture
  • Good task placements
  • Low task scheduling latency
  • Scales to 10,000+ machines

6 of 109

Task-by-task intro

6

Task-by-task scheduler

[Figure: interactive, batch, and service tasks wait in a queue at a task-by-task scheduler, which places them one at a time onto machines in Rack 1 and Rack 2; one queued task has a preference for the first rack]



9 of 109

Task-by-task: first task

9

Task-by-task scheduler

[Figure: the first task is placed on free resources outside its preferred rack 1]

Option 1: place task on available resources

10 of 109

Task-by-task: first task

10

Task-by-task scheduler

[Figure: an already-running task is preempted and migrated to make room ("Why migrate me?")]

Option 2: preempt/migrate task to meet requirements

11 of 109

Task-by-task: first task

11

Task-by-task scheduler

1. Decisions restrict the choice for queued tasks

2. Do not reschedule all already running tasks

12 of 109

Flow-based intro

12

  • Finds optimal task placements
  • Min-cost flow-based centralized scheduler

Quincy: Fair Scheduling for Distributed Computing Clusters [SOSP 2009]


13 of 109

Flow-based schedules all tasks

13

Min-cost flow scheduler

[Figure: the same queue of interactive, batch, and service tasks and the two racks, now handled by a min-cost flow scheduler; one task still prefers the first rack]

14 of 109

Flow-based schedules all tasks

14

Min-cost flow scheduler

[Figure: all waiting and running tasks across both racks are considered together]

Schedules all tasks at the same time

Running tasks are reconsidered

15 of 109

Flow-based schedules all tasks

15

Min-cost flow scheduler

[Figure: a running task is migrated]

Considers tasks for migration or preemption

16 of 109

Flow-based schedules all tasks

16

Min-cost flow scheduler

[Figure: the resulting placement across both racks]

Globally optimal placement!

17 of 109

Flow scheduling: tasks & machines

17

Introduction to min-cost flow scheduling

[Figure: tasks T0-T5 on the left, machines M0-M4 on the right]

18 of 109

Flow scheduling: tasks to machines

18

Introduction to min-cost flow scheduling

[Figure: arcs connect each task T0-T5 to the machines M0-M4 it may run on]

19 of 109

Flow scheduling: zoom in

19

Introduction to min-cost flow scheduling

[Figure: zooming in on one task: each task-to-machine arc carries a cost, e.g. cost 3 to one machine and cost 5 to another]

20 of 109

Flow scheduling: Min costs for all

20

Introduction to min-cost flow scheduling

[Figure: every task-to-machine arc is annotated with a cost (values such as 2, 3, 5, 6, and 9)]

Min-cost flow places tasks with minimum overall cost
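To make the objective concrete, below is a minimal, self-contained C++ sketch that finds the cheapest assignment for a toy instance by enumerating all permutations. The cost values are made up for illustration; a real scheduler such as Quincy or Firmament reaches the minimum-cost placement by solving a min-cost flow problem over the graph above rather than by enumeration.

#include <algorithm>
#include <array>
#include <climits>
#include <iostream>

int main() {
  // Hypothetical per-arc costs: cost[t][m] is the cost of running task t on machine m.
  const int kTasks = 3, kMachines = 3;
  const int cost[kTasks][kMachines] = {{3, 5, 9}, {5, 3, 6}, {2, 6, 4}};

  // Try every one-task-per-machine assignment and keep the cheapest one.
  std::array<int, kMachines> machine = {0, 1, 2};  // machine[t] = machine chosen for task t
  std::array<int, kMachines> best = machine;
  int best_total = INT_MAX;
  do {
    int total = 0;
    for (int t = 0; t < kTasks; ++t) total += cost[t][machine[t]];
    if (total < best_total) { best_total = total; best = machine; }
  } while (std::next_permutation(machine.begin(), machine.end()));

  for (int t = 0; t < kTasks; ++t) std::cout << "T" << t << " -> M" << best[t] << "\n";
  std::cout << "minimum overall cost: " << best_total << "\n";
}

Enumeration is exponential in the number of tasks; the flow formulation on the next slides is what makes the same objective tractable at cluster scale.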

21 of 109

Flow scheduling: pushing flow for n tasks

21

Introduction to min-cost flow scheduling

[Figure: each task node carries a flow supply of 1; the machines connect to a sink node S, which demands 6 units of flow in total, and flow is pushed from the tasks through the machines to the sink]

22 of 109

Flow scheduling: pushing flow for n tasks

22

Introduction to min-cost flow scheduling

[Figure: after pushing flow, five tasks have drained their supply; only T5 still has a supply of 1, leaving a flow demand of 1 unmet]

Assignments:

  • T0 to M0
  • T1 to M3
  • T2 to M1
  • T3 to M4
  • T4 to M2

23 of 109

Flow scheduling: pushing flow for n tasks

23

Introduction to min-cost flow scheduling

[Figure: the same graph; T5 still holds a supply of 1 and a flow demand of 1 remains unmet]

T5’s supply is not routed

24 of 109

Flow scheduling: pushing flow for n tasks

24

S

Introduction to min-cost flow scheduling

T0

M0

M1

M2

M3

M4

T1

T2

T3

T4

T5

Flow supply

0

0

0

0

0

1

Flow demand: 1

U

Unscheduled node

25 of 109

Flow scheduling: pushing flow for n tasks

25

Introduction to min-cost flow scheduling

[Figure: T5's supply drains through the unscheduled node U; all supplies and the remaining flow demand drop to 0]
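The construction above maps directly onto a graph data structure. The C++ sketch below encodes the toy instance from these slides, six tasks with a unit of supply each, five machines with one free slot each, an unscheduled node U, and a sink S demanding six units; the per-arc costs are placeholders for whatever the scheduling policy computes, and an off-the-shelf min-cost max-flow solver would then route the supply to the sink.

#include <iostream>
#include <string>
#include <vector>

// One arc of the scheduling flow network (the layout follows the slides; costs are illustrative).
struct Arc {
  std::string src, dst;
  int capacity;  // how many tasks the arc can carry
  int cost;      // cost per unit of flow, set by the scheduling policy
};

int main() {
  const int kTasks = 6, kMachines = 5;
  std::vector<Arc> arcs;

  for (int t = 0; t < kTasks; ++t) {
    const std::string task = "T" + std::to_string(t);
    // Task-to-machine arcs; the placeholder costs stand in for policy-defined ones.
    for (int m = 0; m < kMachines; ++m)
      arcs.push_back({task, "M" + std::to_string(m), 1, 1 + (t + m) % 5});
    // Fallback arc to the unscheduled node, with a high cost for remaining unscheduled.
    arcs.push_back({task, "U", 1, 100});
  }
  for (int m = 0; m < kMachines; ++m)
    arcs.push_back({"M" + std::to_string(m), "S", 1, 0});  // one free slot per machine
  arcs.push_back({"U", "S", kTasks, 0});  // unscheduled tasks also drain to the sink

  // Each task node carries a supply of 1; the sink S demands kTasks units in total.
  std::cout << kTasks << " task nodes, " << kMachines << " machine nodes, U, S; "
            << arcs.size() << " arcs\n";
}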

26 of 109

Quincy doesn’t scale: intro

26

How well does the Quincy approach scale?

27 of 109

Quincy doesn’t scale: empty figure

27

Simulated Quincy using Google trace, 50% utilization

[Figure: empty plot axes, extending to the size of the Google cluster; lower scheduling latency is better]

28 of 109

Quincy doesn’t scale

28

Simulated Quincy using Google trace, 50% utilization

[Figure: lower is better; scheduling latency is 66 sec on average]

Too slow! 30% of tasks wait to be scheduled for over 33% of their runtime and waste resources

29 of 109

Quincy doesn’t scale

29

Simulated Quincy using Google trace, 50% utilization

Goal: sub-second scheduling latency in common case


30 of 109

Firmament contributions

30

Contributions

  • Low task scheduling latency
    • Uses best suited min-cost flow algorithm
    • Incrementally recomputes the solution
  • Good task placement
    • Same optimal placements as Quincy
    • Customizable scheduling policies

31 of 109

Firmament scheduler: intro

31

[Figure: Firmament architecture: the cluster master runs the scheduler, which combines the task table, task statistics, and cluster topology with a scheduling policy and a flow graph; an agent runs on each machine]

32 of 109

Firmament policy specification

32

class QuincyPolicy {
  Cost_t TaskToResourceNodeCost(TaskID_t task_id) {
    return task_unscheduled_time * quincy_wait_time_factor;
  }
  ...
};

Specifying scheduling policies

Defines flow graph

N.B: More details in the paper.
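Other policies plug in by supplying their own cost callbacks in the same style. The fragment below is a purely illustrative sketch of what a network-aware cost term might look like; the type aliases, the helper function, and the member names are stand-ins, not Firmament's actual API.

// Hypothetical sketch, not Firmament's real interface; the typedefs and helper
// below are stand-ins so the fragment is self-contained.
typedef long long Cost_t;
typedef unsigned long long TaskID_t, MachineID_t;
double MachineNetworkUtilization(MachineID_t machine_id);  // e.g. from machine statistics

class NetworkAwarePolicy {
 public:
  Cost_t TaskToMachineCost(TaskID_t task_id, MachineID_t machine_id) {
    // Prefer machines with lightly loaded NICs: the cost grows with the fraction
    // of link bandwidth already in use on the candidate machine.
    const double utilization = MachineNetworkUtilization(machine_id);  // 0.0 to 1.0
    return static_cast<Cost_t>(utilization * kMaxNetworkCost);
  }
 private:
  static const Cost_t kMaxNetworkCost = 1000;
};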

33 of 109

Firmament scheduler: intro

33

[Figure: the same architecture; the scheduling policy defines the flow graph]

34 of 109

Firmament scheduler: submit to solver

34

[Figure: the scheduling policy defines the flow graph, and the scheduler submits the graph to a min-cost max-flow solver]

35 of 109

Firmament scheduler: slow min-cost flow solver

35

[Figure: the same pipeline]

Most of the scheduling time is spent in the min-cost max-flow solver

36 of 109

Algorithms complexity: Successive shortest path

36

Algorithm      | Worst-case complexity
Cost scaling   | O(V²E log(VC))   (used by Quincy)

E: number of arcs; V: number of nodes; U: largest arc capacity; C: largest cost value. E > V > CU.

37 of 109

Algorithms results: Cost scaling

37

Cost scaling is too slow beyond 1,000 machines

[Figure: algorithm runtime vs. cluster size; subsampled Google trace, 50% slot utilization, Quincy policy; lower is better; the sub-second goal is marked]

38 of 109

Algorithms complexity: Cost scaling

38

Algorithm                  | Worst-case complexity
Cost scaling               | O(V²E log(VC))
Successive shortest path   | O(V²U log(V))   (lower worst-case complexity)

E: number of arcs; V: number of nodes; U: largest arc capacity; C: largest cost value. E > V > CU.

39 of 109

Algorithms results: successive shortest path

39

Successive shortest path only scales to ~100 machines

[Figure: algorithm runtime vs. cluster size; subsampled Google trace, 50% slot utilization, Quincy policy; lower is better; well above the sub-second goal]

40 of 109

Algorithms complexity: Relaxation

40

Algorithm                  | Worst-case complexity
Cost scaling               | O(V²E log(VC))
Successive shortest path   | O(V²U log(V))
Relaxation                 | O(E³CU²)   (highest worst-case complexity)

E: number of arcs; V: number of nodes; U: largest arc capacity; C: largest cost value. E > V > CU.

41 of 109

Algorithms results: Relaxation

41

Relaxation meets our sub-second latency goal

[Figure: algorithm runtime vs. cluster size; subsampled Google trace, 50% slot utilization, Quincy policy; lower is better; Relaxation stays below the goal]

42 of 109

Relaxation single-ish pass

42

Why is Relaxation fast?

[Figure: flow graph with tasks T0-T4, rack aggregators R0 and R1, machines M0-M3, an aggregator X, and the sink S; Relaxation pushes flow in a single-ish pass]

Relaxation is well-suited to the graph structure

43 of 109

Slow Relaxation: tasks -> machine

43

Relaxation suffers in pathological edge cases

[Figure: tasks push flow towards machines at medium and high utilization; the highlighted arc has capacity for 1 task]

44 of 109

Slow Relaxation: machine -> sink

44

Relaxation suffers in pathological edge cases

[Figure: the flow continues from the machine to the sink over an arc with capacity for 1 task]

45 of 109

Slow Relaxation: machine -> tasks

45

Relaxation suffers in pathological edge cases

[Figure: the arc is now full (capacity 0 tasks left), so flow has to be pushed back from the machine towards the tasks]

Relaxation cannot push flow in a single pass any more

46 of 109

High utilization introduction

46

How bad does Relaxation’s edge case get?

Experimental setup:

  • Simulated 12,500 machine cluster
  • Used the Quincy scheduling policy
  • Utilization from over 90% up to an oversubscribed cluster

47 of 109

High utilization empty figure

47

Quincy policy, 12,500-machine cluster, job of increasing size

[Figure: empty axes; algorithm runtime as utilization increases; lower is better]

48 of 109

High utilization Relaxation

48

Relaxation’s runtime increases with utilization

[Figure: Quincy policy, 12,500-machine cluster, job of increasing size; lower is better]

49 of 109

High utilization Relaxation & Cost scaling

49

Cost scaling is unaffected by high utilization

[Figure: Quincy policy, 12,500-machine cluster, job of increasing size; lower is better; at the highest utilizations cost scaling is faster than Relaxation]

50 of 109

Solver runs both algorithms

50

[Figure: the scheduler hands the input flow graph to the min-cost max-flow solver, which runs both cost scaling and Relaxation and returns the output flow]
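One way to realize "run both algorithms" is to race the two solvers on the same graph and take whichever result is ready first, keeping the common case (Relaxation is fast) without losing the high-utilization case (cost scaling stays fast). The C++ sketch below shows only that control flow; the graph type and the solver bodies are placeholders, not Firmament's implementation.

#include <chrono>
#include <functional>
#include <future>
#include <iostream>
#include <string>

struct FlowGraph { /* nodes, arcs, capacities, costs */ };
struct Solution { std::string algorithm; /* and the resulting task placements */ };

// Placeholder solver bodies; the real ones are the min-cost max-flow algorithms.
Solution RunCostScaling(const FlowGraph&) { return {"cost scaling"}; }
Solution RunRelaxation(const FlowGraph&) { return {"relaxation"}; }

Solution SolveWithBothAlgorithms(const FlowGraph& graph) {
  auto relax = std::async(std::launch::async, RunRelaxation, std::cref(graph));
  auto scale = std::async(std::launch::async, RunCostScaling, std::cref(graph));
  // Poll both futures and return the first result that becomes ready. A real
  // system would also tell the losing solver to stop; here its std::async
  // future simply finishes in the background (and joins on destruction).
  while (true) {
    if (relax.wait_for(std::chrono::milliseconds(1)) == std::future_status::ready)
      return relax.get();
    if (scale.wait_for(std::chrono::milliseconds(1)) == std::future_status::ready)
      return scale.get();
  }
}

int main() {
  FlowGraph graph;
  std::cout << "used: " << SolveWithBothAlgorithms(graph).algorithm << "\n";
}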

51 of 109

High utilization push-down runtime

51

Quincy policy, 12,500-machine cluster, job of increasing size

[Figure: lower is better]

Algorithm runtime is still high at utilization > 94%

52 of 109

Incremental cost scaling introduction

52

[Figure: today, the min-cost max-flow solver discards its internal graph state after every run; each run of cost scaling starts from scratch on the full input graph]

53 of 109

Incremental cost scaling

53

[Figure: with incremental cost scaling, the solver keeps the previous graph state and applies only the graph changes before re-optimizing]

54 of 109

54

Incremental min-cost max-flow

[Figure: the previously computed flow through tasks, rack aggregators, machines, and the unscheduled node]

55 of 109

55

Incremental min-cost max-flow

[Figure: a new task T4 arrives; the existing flow is kept, the change is applied, and the solution is re-optimized]

Maintain flow, apply changes, re-optimize
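A rough skeleton of how "maintain flow, apply changes, re-optimize" slots into the scheduling loop is sketched below. The solver internals are deliberately elided, and GraphChange and the member names are illustrative rather than Firmament's real API.

#include <vector>

// Illustrative change record: a task arrived or completed, an arc cost changed, etc.
struct GraphChange { int node_id; };

class IncrementalMinCostFlow {
 public:
  void SolveFromScratch() {
    // Build the full flow network and run cost scaling; keep the graph and flow state.
    have_state_ = true;
  }
  void SolveIncremental(const std::vector<GraphChange>& changes) {
    if (!have_state_) { SolveFromScratch(); return; }
    // 1. Maintain flow: start from the previously computed graph state and flow.
    // 2. Apply changes: patch in new and finished tasks and updated arc costs.
    // 3. Re-optimize: e.g. run price refine, then resume cost scaling from there.
  }
 private:
  bool have_state_ = false;
};

int main() {
  IncrementalMinCostFlow solver;
  solver.SolveFromScratch();                 // first scheduling round
  std::vector<GraphChange> deltas = {{42}};  // e.g. one new task arrived
  solver.SolveIncremental(deltas);           // later rounds reuse the kept state
}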

56 of 109

Incremental Cost scaling results

56

Quincy policy, 12,500-machine cluster, job of increasing size

[Figure: lower is better]

Incremental cost scaling is ~2x faster

57 of 109

57

Techniques to reduce scheduling latency:

1. Run both algorithms
2. Incremental min-cost max-flow
3. Efficiently adjust graph state with price refine
4. Problem-specific min-cost flow optimizations

58 of 109

Ideal scheduler

58

Demo

59 of 109

Centralized vs. distributed

59

Note: many additional experiments in the paper.

Evaluation

  • Centralized: sophisticated algorithms, good task placements (e.g., Borg, Quincy, Quasar)
  • Distributed: simple heuristics, low scheduling latency (e.g., Sparrow, Tarcil)

Does Firmament choose good placements with low latency?

60 of 109

60

How well does Firmament work at scale?

Experimental setup:

  • Simulated 12,500 machine Google cluster
  • Used the Quincy scheduling policy
  • 90% cluster slot utilization

61 of 109

61

Google trace, 12,500 machines, 90% slot utilization

[Figure: task placement latency; lower is better]

Task timeline: submitted -> start scheduling -> placed -> running -> completed, with waiting, scheduling, starting, and running phases in between. Task placement latency spans from submission until the task is placed.
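For reference, the latency metrics used in the evaluation can be read off this timeline. The small C++ sketch below assumes, as the timeline suggests, that placement latency runs from submission until the task is placed and that response time runs from submission until the task completes; the struct and the timestamps are illustrative.

#include <iostream>

struct TaskTimeline {
  // Event timestamps in milliseconds, in the order they appear on the timeline.
  long long submitted, start_scheduling, placed, running, completed;

  long long PlacementLatencyMs() const { return placed - submitted; }
  long long ResponseTimeMs() const { return completed - submitted; }
  // The scheduling phase (start of scheduling until placement) bounds the
  // solver's algorithm runtime from above.
  long long SchedulingPhaseMs() const { return placed - start_scheduling; }
};

int main() {
  TaskTimeline t{0, 40, 520, 900, 5900};  // hypothetical timestamps for one task
  std::cout << "placement latency: " << t.PlacementLatencyMs() << " ms\n"
            << "response time: " << t.ResponseTimeMs() << " ms\n"
            << "scheduling phase: " << t.SchedulingPhaseMs() << " ms\n";
}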

62 of 109

62

Google trace, 12,500 machines, 90% slot utilization

Placement latency is too high!

63 of 109

63

Google trace, 12,500 machines, 90% slot utilization

Firmament places tasks with low-latency at Google scale

Sub-second latency for 98% of tasks

64 of 109

Google acceleration introduction

64

How well does Firmament handle challenging workloads?

Experimental setup:

  • Simulated 12,500 machine Google cluster
  • Used the centralized Quincy scheduling policy
  • Utilization varies between 75% and 95%

65 of 109

Google acceleration empty figure

65

Firmament handles challenging workloads at low latency

Google trace, 12,500 machines, utilization between 75% and 90%

Interactive workloads are simulated by scaling down task runtimes: the median task runtime drops from 420s to 1.7s

[Figure: lower is better]

66 of 109

Google acceleration Cost scaling

66

Google trace, 12,500 machines, utilization between 75% and 90%

[Figure: lower is better]

45 seconds average latency (tuned, down from the Quincy setup’s 66s): average latency is too high even without many short tasks

Firmament handles challenging workloads at low latency

67 of 109

Google acceleration Relaxation

67

Google trace, 12,500 machines, utilization between 75% and 90%

Latency with a 250x acceleration: 75th percentile 2 sec, maximum 57 sec

[Figure: lower is better]

Firmament handles challenging workloads at low latency

68 of 109

Google acceleration Firmament

68

Google trace, 12,500 machines, utilization between 75% and 90%

Firmament’s common-case latency is sub-second even at 250x acceleration

[Figure: lower is better]

Firmament handles challenging workloads at low latency

69 of 109

Workload-mix intro

69

How do Firmament’s placements compare to other schedulers?

Experimental setup:

  • Homogeneous 40-machine cluster, 10G network
  • Mixed batch/service/interactive workload
  • Network-aware scheduling policy

70 of 109

Workload mix-service task figure

70

[Figure: two racks of five machines each (M1-M10) connected through a core switch; interactive and service tasks are shown alongside links with low, medium, and high network utilization]

71 of 109

Workload mix-service task figure

71

[Figure: the same topology; some interactive and service tasks are already running on the machines]

72 of 109

Workload mix-service task figure

72

[Figure: the scheduler now has to place further interactive and service tasks onto this partially loaded, network-constrained cluster]

73 of 109

Workload-mix Sparrow

73

[Figure: task response time; lower is better]

5 seconds task response time on an idle cluster

Firmament chooses good placements

Task timeline: submitted -> start scheduling -> placed -> running -> completed; task response time spans from submission until the task completes.

74 of 109

Workload-mix Sparrow

74

Sparrow is unaware of resource utilization


20% of tasks experience poor placement

Firmament chooses good placements

75 of 109

Workload-mix Kubernetes

75

Kubernetes only accounts for memory and CPU


20% of tasks experience poor placement

Firmament chooses good placements

76 of 109

Workload-mix Docker

76

Centralized Kubernetes and Docker still suffer


Firmament chooses good placements

20% of tasks experience poor placement

77 of 109

Workload-mix Firmament

77


Avoided co-location interference

Firmament chooses good placements

Firmament outperforms centralized and distributed schedulers

78 of 109

Firmament policy specification

78

class QuincyPolicy {
  Cost_t TaskToResourceNodeCost(TaskID_t task_id) {
    return task_unscheduled_time * quincy_wait_time_factor;
  }
  ...
};

Firmament supports multiple policies

Defines flow graph

79 of 109

Quincy policy graph

79

Firmament supports multiple policies

Quincy policy

[Figure: the Quincy flow graph: tasks connect to machines and racks they have a locality preference for, to the cluster aggregator X, and to the unscheduled node U, all of which drain to the sink S]

80 of 109

Co-location policy graph

80

Firmament supports multiple policies

Co-location-aware policy

[Figure: the co-location-aware flow graph models each machine's topology (L2/L3 caches, cores C0 and C1) as additional nodes in the graph]

81 of 109

Network-aware policy graph

81

Firmament supports multiple policies

Network-aware policy

[Figure: the network-aware flow graph: arcs to machines reflect each machine's current network utilization (high vs. low)]

82 of 109

Energy-aware policy graph

82

Firmament supports multiple policies

Energy-aware policy

[Figure: the energy-aware flow graph routes tasks to per-machine cores (C0, C1) via aggregator nodes P and B]

83 of 109

Ideal scheduler

83

Current and future work:

1. Gang scheduling
  • min flow arc requirement and max arc capacity
2. Complex constraints (e.g., task affinity)
  • generalized min-cost flow algorithms
3. Modeling costs for scheduling policies
  • Bayesian optimization

84 of 109

Conclusions

84

Conclusions

  • Low task scheduling latency
    • Uses best algorithm at all times
    • Incrementally recomputes solution
  • Good task placement
    • Same optimal placements as Quincy
    • Customizable scheduling policies

Firmament cluster manager & scheduler:

Firmament Kubernetes integration:

85 of 109

Ideal scheduler

85

The big picture:

  • Musketeer – workflow manager [EuroSys’15]
  • QJump – bounded latency data center networking [NSDI’15, best paper]
  • Broom – memory management for big data [HotOS’15]
  • Firmament – scalable, high-quality cluster scheduling [OSDI’16]
  • Falkirk – rollback recovery for dataflow systems [under submission]

86 of 109

86

Backup slides

87 of 109

Ideal scheduler

87

88 of 109

Ideal scheduler

88

[Figure: task timeline: submitted -> start scheduling -> placed -> running -> completed, with waiting, scheduling, starting, and running phases; the figure marks task placement latency, task response time, and algorithm runtime]

89 of 109

Centralized vs. distributed

89

What do we have today?

Centralized scheduler:
  • Sees entire cluster
  • Sophisticated algorithms
  • Good task placements
  • BUT: high scheduling latency

Distributed schedulers:
  • Partial, stale state
  • Simple algorithms
  • Low scheduling latency
  • BUT: poor task placements

90 of 109

90

How well does Firmament work if the cluster is oversubscribed?

Experimental setup:

  • Simulated 12,500 machine Google cluster
  • Used the Quincy centralized scheduling policy
  • Utilization ~97-100%, with periods of oversubscription

91 of 109

91

Google trace, 12,500 machines, periods of oversubscription

[Figure: algorithm runtime over time; lower is better; the oversubscription periods are marked]

92 of 109

92

Relaxation is slow when utilization is very high

Google trace, 12,500 machines, periods of oversubscription

[Figure: lower is better; a marked region shows the backlog being cleared]

93 of 109

93


Cost scaling’s runtime is not affected by utilization

Google trace, 12,500 machines, periods of oversubscription

94 of 109

94

Google trace, 12,500 machines, periods of oversubscription

[Figure: lower is better; the annotation "Recovers earlier" marks the faster recovery after the oversubscription period]

95 of 109

95

What is Firmament’s scalability limit?

96 of 109

96

Workload mix of a typical cluster

[Figure: share of tasks (%) over time: long-running service tasks plus batch tasks that run for ~100s of seconds]

97 of 109

97

Short batch-only tasks workload

Short batch tasks only (~100s of milliseconds)

[Figure: share of tasks (%) over time]

  • Experiment modeled after Sparrow
    • Figure 12 [Ousterhout et al., SOSP’13]

98 of 109

98

Breaking point experiment setup

  • Non-interfering tasks
  • Hold utilization constant at 80%
  • Decrease task duration (from 1 second down through 500 ms, 250 ms, ...)

99 of 109

99

100 machines, 80% utilization, 10 tasks per job

100 of 109

100

Spark’s scheduler cannot handle tasks <= 1355 ms

100 machines, 80% utilization, 10 tasks per job

101 of 109

101

Firmament handles tasks > 10 ms

100 machines, 80% utilization, 10 tasks per job

102 of 109

102

Firmament handles tasks >= 350 ms

1000 machines, 80% utilization, 10 tasks per job

103 of 109

103

How well does Firmament place tasks with different resource requirements?

Experimental setup:

  • Homogeneous 28-machine cluster, 10G network
  • 80% utilization
  • Scheduler places a 15,000-task workload mix:
    • 30% CPU bound tasks
    • 60% Memory bound tasks
    • 10% Disk bound tasks

104 of 109

104

Homogeneous 28-machine cluster, 80% utilization, 15,000 tasks

[Figure: task runtime relative to the runtime on an otherwise idle cluster (ideal); lower is better]

105 of 109

105

Homogeneous 28-machine cluster, 80% utilization, 15,000 tasks

[Figure: runtime relative to ideal (otherwise idle cluster); lower is better]

2-4x increase in runtime over ideal

106 of 109

106

Homogeneous 28-machine cluster, 80% utilization, 15,000 tasks

[Figure: runtime relative to ideal; lower is better]

Mesos reduces runtime, but it is still 2-3x of ideal

107 of 109

107

Homogeneous 28-machine cluster, 80% utilization, 15,000 tasks

[Figure: runtime relative to ideal; lower is better]

Firmament comes close to the ideal task runtime

108 of 109

Quincy vs Firmament locality

108

Quincy slows down when threshold is decreased

Firmament improves data locality without a significant increase in scheduling latency

Google trace, 12,500 machines, varying locality threshold

[Figure: annotated with "2x" and "18%"]

109 of 109

Approximative solver

109


12,500 machines, runs on a graph from a highly utilized cluster