Firmament
Fast, centralized cluster scheduling at scale
1
Ionel Gog @ICGog
joint work with
Malte Schwarzkopf, Adam Gleave, Robert N. M. Watson & Steven Hand
Sesame Inc
2
Meet Sesame, Inc.
Ideal scheduler
3
The cluster scheduler must achieve:
1. Resources are used efficiently
2. Low task scheduling latency
Centralized vs. distributed
4
State of the art
Good task placements
Low scheduling latency
Centralized: sophisticated algorithms
[Borg, Quincy, Quasar]
Distributed: simple heuristics
[Sparrow, Tarcil, Yaq-d]
Can’t get both good placements and low latency for the entire workload!
Hybrid: split workload, provide either
[Mercury, Hawk, Eagle]
Centralized vs. distributed
5
Firmament provides a solution!
Task-by-task intro
6
Task-by-task scheduler
Queue
Interactive
Batch
Service
Rack 1
Rack 2
Me too!
Preference for first rack
Task-by-task intro
7
Task-by-task scheduler
Queue
Interactive
Batch
Service
Rack 1
Rack 2
Me too!
Task-by-task: first task
8
Task-by-task scheduler
Queue
Interactive
Batch
Service
Rack 1
Rack 2
Me too!
Task-by-task: first task
9
Task-by-task scheduler
Queue
Interactive
Batch
Service
Rack 1
Rack 2
I prefer rack 1 :(
Option 1: place task on available resources
Task-by-task: first task
10
Task-by-task scheduler
Queue
Interactive
Batch
Service
Rack 1
Rack 2
Why migrate me?
Option 2: preempt/migrate task to meet requirements
Task-by-task: first task
11
Task-by-task scheduler
Queue
Interactive
Batch
Service
Rack 1
Rack 2
1
Decisions restrict the choice for queued tasks
2
Do not reschedule all already running tasks
Flow-based intro
12
[SOSP 2009]
Quincy: Fair Scheduling for Distributed Computing Clusters
Flow-based schedules all tasks
13
Min-cost flow scheduler
Rack 1
Rack 2
Interactive
Batch
Service
Me too!
Preference for first rack
Flow-based schedules all tasks
14
Min-cost flow scheduler
Rack 1
Rack 2
Interactive
Batch
Service
Schedules all tasks at the same time
Running tasks are reconsidered
Flow-based schedules all tasks
15
Min-cost flow scheduler
Rack 1
Rack 2
Interactive
Batch
Service
Migrate
Considers tasks for migration or preemption
Flow-based schedules all tasks
16
Min-cost flow scheduler
Rack 1
Rack 2
Interactive
Batch
Service
Globally optimal placement!
Flow scheduling: tasks & machines
17
T0
Tasks
Machines
Introduction to min-cost flow scheduling
T1
T2
T3
T4
T5
M0
M1
M2
M3
M4
Flow scheduling: tasks to machines
18
Introduction to min-cost flow scheduling
T0
M0
M1
M2
M3
M4
T1
T2
T3
T4
T5
Flow scheduling: zoom in
19
Introduction to min-cost flow scheduling
T0
M0
M1
M2
M3
M4
T1
T2
T3
T4
T5
Cost: 3
Cost: 5
Flow scheduling: Min costs for all
20
Introduction to min-cost flow scheduling
T0
M0
M1
M2
M3
M4
T1
T2
T3
T4
T5
Cost: 3
Cost: 5
Cost: 5
Cost: 3
Cost: 9
Cost: 6
Cost: 2
Cost: 6
Min-cost flow places tasks with minimum overall cost
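The objective the solver optimizes can be illustrated with a tiny brute-force sketch. The costs below are invented for the example (the slide's arc costs are per-figure annotations); a real min-cost max-flow solver finds the same optimum in polynomial time rather than by enumeration:

```python
from itertools import permutations

# Hypothetical task -> machine arc costs; a missing pair means no arc.
costs = {
    ("T0", "M0"): 3, ("T0", "M1"): 5,
    ("T1", "M1"): 5, ("T1", "M2"): 3,
    ("T2", "M2"): 9, ("T2", "M3"): 6,
    ("T3", "M3"): 2, ("T3", "M4"): 6,
}

def min_cost_assignment(costs):
    """Pick one machine per task so the summed cost is minimal.

    Brute force over all machine permutations, only to illustrate
    *what* the flow solver optimizes, not *how* it does it.
    """
    tasks = sorted({t for t, _ in costs})
    machines = sorted({m for _, m in costs})
    best, best_cost = None, float("inf")
    for perm in permutations(machines, len(tasks)):
        # Feasible only if every (task, machine) pair has an arc.
        if all((t, m) in costs for t, m in zip(tasks, perm)):
            total = sum(costs[t, m] for t, m in zip(tasks, perm))
            if total < best_cost:
                best, best_cost = dict(zip(tasks, perm)), total
    return best, best_cost

assignment, total = min_cost_assignment(costs)
```

Note that the cheapest *joint* assignment is not the greedy per-task one: T1 gives up M1 so that T0 can keep its cheap arc to M0.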
Flow scheduling: pushing flow for n tasks
21
Sink
S
Flow demand: 6
Introduction to min-cost flow scheduling
T0
M0
M1
M2
M3
M4
T1
T2
T3
T4
T5
Cost: 3
Cost: 5
Cost: 5
Cost: 3
Cost: 9
Cost: 6
Cost: 2
Cost: 6
Flow supply
1
1
1
1
1
1
FLOW
Flow scheduling: pushing flow for n tasks
22
S
Introduction to min-cost flow scheduling
T0
M0
M1
M2
M3
M4
T1
T2
T3
T4
T5
Flow supply
0
0
0
0
0
1
Flow demand: 1
Assignments:
Flow scheduling: pushing flow for n tasks
23
S
Introduction to min-cost flow scheduling
T0
M0
M1
M2
M3
M4
T1
T2
T3
T4
T5
Flow supply
0
0
0
0
0
1
Flow demand: 1
T5’s supply is not routed
Flow scheduling: pushing flow for n tasks
24
S
Introduction to min-cost flow scheduling
T0
M0
M1
M2
M3
M4
T1
T2
T3
T4
T5
Flow supply
0
0
0
0
0
1
Flow demand: 1
U
Unscheduled node
Flow scheduling: pushing flow for n tasks
25
S
Introduction to min-cost flow scheduling
T0
M0
M1
M2
M3
M4
T1
T2
T3
T4
T5
Flow supply
0
0
0
0
0
0
Flow demand: 0
U
Unscheduled node
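The unscheduled node's role can be sketched as a fallback option: every task always has *somewhere* to route its flow, so the problem stays feasible even when machines are full. This greedy per-task sketch (penalty value and helper names invented) stands in for the joint flow optimization:

```python
# Hypothetical cost of leaving a task queued until the next round.
UNSCHEDULED_PENALTY = 10

def place(tasks, machine_costs, free_slots):
    """Greedy sketch: each task takes its cheapest machine with a free
    slot, or routes to the unscheduled node U when no slot remains or
    waiting is cheaper. (The real scheduler decides this jointly.)"""
    placement = {}
    slots = dict(free_slots)
    for t in tasks:
        options = [(c, m) for m, c in machine_costs[t].items()
                   if slots.get(m, 0) > 0]
        if options and min(options)[0] < UNSCHEDULED_PENALTY:
            _, m = min(options)
            slots[m] -= 1
            placement[t] = m
        else:
            placement[t] = "U"   # stays queued; retried next run
    return placement

# Two tasks compete for a single slot on M4: one must go unscheduled.
result = place(["T4", "T5"],
               {"T4": {"M4": 2}, "T5": {"M4": 6}},
               {"M4": 1})
```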
Quincy doesn’t scale: intro
26
How well does the Quincy approach scale?
Quincy doesn’t scale: empty figure
27
Simulated Quincy using Google trace, 50% utilization
better
Google cluster
Quincy doesn’t scale
28
Simulated Quincy using Google trace, 50% utilization
better
66 sec on average
Too slow! 30% of tasks wait over 33% of their runtime to be scheduled, wasting resources
Quincy doesn’t scale
29
Simulated Quincy using Google trace, 50% utilization
Goal: sub-second scheduling latency in common case
better
Goal
Firmament contributions
30
Contributions
Firmament scheduler: intro
31
Scheduling policy
scheduler
master
Task table
Task statistics
Agent
Agent
Cluster topology
machine
machine
Scheduler
Scheduling policy
Flow graph
Scheduling policy
Firmament policy specification
32
class QuincyPolicy {
  Cost_t TaskToResourceNodeCost(TaskID_t task_id) {
    return task_unscheduled_time * quincy_wait_time_factor;
  }
  ...
};
Specifying scheduling policies
Defines flow graph
N.B: More details in the paper.
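A hypothetical Python mirror of the policy interface shows how the scheduler uses it: it asks the policy for the cost of each candidate arc and builds the flow graph from the answers. All names and the weighting constant here are illustrative, not Firmament's real API:

```python
# Assumed weighting constant, analogous to quincy_wait_time_factor.
QUINCY_WAIT_TIME_FACTOR = 0.5

class QuincyPolicy:
    def __init__(self, task_unscheduled_time):
        # Seconds each task has spent waiting in the queue.
        self.task_unscheduled_time = task_unscheduled_time

    def task_to_resource_node_cost(self, task_id):
        # Waiting longer raises the cost, mirroring the slide's snippet:
        # long-waiting tasks become more urgent to place.
        return self.task_unscheduled_time[task_id] * QUINCY_WAIT_TIME_FACTOR

def build_arcs(policy, task_ids, resource):
    """Sketch of the scheduler side: query the policy per arc."""
    return {(t, resource): policy.task_to_resource_node_cost(t)
            for t in task_ids}

arcs = build_arcs(QuincyPolicy({"T0": 4.0, "T1": 10.0}),
                  ["T0", "T1"], "R0")
```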
Firmament scheduler: intro
33
Scheduling policy
scheduler
master
Task table
Task statistics
Agent
Agent
Cluster topology
machine
machine
Scheduler
Flow graph
Scheduling policy
Defines graph
Flow graph
Firmament scheduler: submit to solver
34
Scheduling policy
scheduler
master
Task table
Task statistics
Agent
Agent
Cluster topology
machine
machine
Scheduler
Scheduling policy
Flow graph
Min-cost max-flow solver
Submits graph
Defines graph
Firmament scheduler: slow min-cost flow solver
35
Scheduling policy
scheduler
master
Task table
Task statistics
Agent
Agent
Cluster topology
machine
machine
Scheduler
Scheduling policy
Flow graph
Min-cost max-flow solver
Defines graph
Submits graph
Most time�spent here
Algorithms complexity: Successive shortest path
36
Algorithm | Worst-case complexity |
Cost scaling | O(V²E log(VC)) |
E: number of arcs
V: number of nodes
U: largest arc capacity
C: largest cost value
E > V > C ≅ U
Used by Quincy
Algorithms results: Cost scaling
37
better
Cost scaling is too slow beyond 1,000 machines
Subsampled Google trace, 50% slot utilization [Quincy policy]
Too slow!
Goal
Algorithms complexity: Cost scaling
38
Algorithm | Worst-case complexity |
Cost scaling | O(V²E log(VC)) |
Successive shortest path | O(V²U log(V)) |
E: number of arcs
V: number of nodes
U: largest arc capacity
C: largest cost value
E > V > C ≅ U
Lower worst-case complexity
Algorithms results: successive shortest path
39
Successive shortest path only scales to ~100 machines
better
Too slow!
Goal
Subsampled Google trace, 50% slot utilization [Quincy policy]
Algorithms complexity: Relaxation
40
Algorithm | Worst-case complexity |
Cost scaling | O(V²E log(VC)) |
Successive shortest path | O(V²U log(V)) |
Relaxation | O(E³CU²) |
E: number of arcs
V: number of nodes
U: largest arc capacity
C: largest cost value
E > V > C ≅ U
Highest complexity
Algorithms results: Relaxation
41
better
Relaxation meets our sub-second latency goal
Goal
Perfect!
Subsampled Google trace, 50% slot utilization [Quincy policy]
Relaxation single-ish pass
42
M0
M1
R0
M2
M3
R1
S
T2
X
T0
T1
T3
T4
Single-ish pass flow push
Relaxation is well-suited to the graph structure
Why is Relaxation fast?
Slow Relaxation: tasks -> machine
43
M0
M1
M2
M3
T2
T0
T1
T3
T4
S
Capacity: 1 task
Relaxation suffers in pathological edge cases
Machine utilization
high
medium
Slow Relaxation: machine -> sink
44
M0
M1
M2
M3
T2
T0
T1
T3
T4
S
Relaxation suffers in pathological edge cases
Capacity: 1 task
Machine utilization
high
medium
Slow Relaxation: machine -> tasks
45
Machine utilization
high
medium
M0
M1
M2
M3
T2
T0
T1
T3
T4
S
Relaxation suffers in pathological edge cases
Relaxation cannot push flow in a single pass any more
Capacity: 0 tasks
High utilization introduction
46
How bad does Relaxation’s edge case get?
Experimental setup:
High utilization empty figure
47
Quincy, 12,500-machine cluster, job of increasing size
better
High utilization Relaxation
48
Relaxation’s runtime increases with utilization
Quincy, 12,500-machine cluster, job of increasing size
better
High utilization Relaxation & Cost scaling
49
Cost scaling is unaffected by high utilization
better
Quincy, 12,500-machine cluster, job of increasing size
Cost scaling is faster
Best
Solver runs both algorithms
50
Scheduling policy
scheduler
Scheduling policy
Flow graph
Min-cost max-flow solver
Input graph
Min-cost max-flow solver
Cost scaling
Relaxation
Input graph
Output flow
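This "run both algorithms" technique can be sketched as a race: submit the same graph to both solvers and keep whichever finishes first. The `solve_*` functions below are stand-ins, not Firmament's real solver entry points:

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def solve_relaxation(graph):
    # Placeholder: fast on typical graphs, slow at very high utilization.
    return ("relaxation", sorted(graph))

def solve_cost_scaling(graph):
    # Placeholder: slower on average, but robust at high utilization.
    return ("cost_scaling", sorted(graph))

def solve_first(graph):
    """Run both algorithms on the same graph; return the first result."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(f, graph)
                   for f in (solve_relaxation, solve_cost_scaling)]
        done, pending = wait(futures, return_when=FIRST_COMPLETED)
        for f in pending:
            f.cancel()   # abandon the slower algorithm's run
        return next(iter(done)).result()

algo, flow = solve_first(["T0->M0", "T1->M2"])
```

Because both solvers reach a min-cost solution, the first finisher's answer is as good as the other's; only the latency differs.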
High utilization push-down runtime
51
better
Quincy, 12,500-machine cluster, job of increasing size
Algorithm runtime is still high at utilization > 94%
Incremental cost scaling introduction
52
Scheduling policy
scheduler
Scheduling policy
Flow graph
Min-cost max-flow solver
Min-cost max-flow solver
Cost scaling
Graph state
Input graph
State discarded
Output flow
Incremental cost scaling
53
Scheduling policy
scheduler
Scheduling policy
Flow graph
Min-cost max-flow solver
Cost scaling
Min-cost max-flow solver
Graph state
Output flow
Input graph
Previous graph state
Graph changes
54
M0
M1
R0
M2
M3
R1
S
T2
X
U
T0
T1
T3
Incremental min-cost max-flow
55
M0
M1
R0
M2
M3
R1
S
T2
X
U
T0
T1
T3
T4
Maintain flow, apply changes, re-optimize
Incremental min-cost max-flow
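The "maintain flow, apply changes, re-optimize" loop can be sketched as solver state carried across scheduling runs. The warm-started solve itself (price refine in incremental cost scaling) is elided; only the carried state is shown, with invented names:

```python
class IncrementalSolver:
    """Sketch: keeps the previous run's flow so the next run starts from
    a near-optimal state instead of from scratch. A real incremental
    cost-scaling solver also carries node prices (dual values)."""

    def __init__(self):
        self.flow = {}   # arc -> units of flow from the previous run

    def apply_changes(self, changes):
        # A change is a delta on an arc: new tasks add flow, finished
        # tasks remove theirs, cost updates perturb existing arcs.
        for arc, delta in changes.items():
            new = self.flow.get(arc, 0) + delta
            if new:
                self.flow[arc] = new
            else:
                self.flow.pop(arc, None)

    def reoptimize(self):
        # Placeholder for the warm-started solve: it would repair
        # optimality around the changed arcs rather than resolve all.
        return dict(self.flow)

solver = IncrementalSolver()
solver.apply_changes({("T0", "M0"): 1, ("T1", "M2"): 1})
solver.apply_changes({("T0", "M0"): -1, ("T4", "M3"): 1})  # T0 done, T4 new
state = solver.reoptimize()
```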
Incremental Cost scaling results
56
Quincy, 12,500-machine cluster, job of increasing size
better
Incremental cost scaling is ~2x faster
57
1
Run both algorithms
2
Incremental min-cost max-flow
3
Efficiently adjust graph state with price refine
Techniques to reduce scheduling latency
4
Problem specific min-cost flow optimizations
Ideal scheduler
58
Demo
Centralized vs. distributed
59
Note: many additional experiments in the paper.
Evaluation
Good task placements
Low scheduling latency
Centralized: sophisticated algorithms
e.g., Borg, Quincy, Quasar
Distributed: simple heuristics
e.g., Sparrow, Tarcil
Does Firmament choose good placements with low latency?
60
How well does Firmament work at scale?
Experimental setup:
61
Google trace, 12,500 machines, 90% slot utilization
better
Task�submitted
Start�scheduling
Task�placed
Task�running
Task�completed
scheduling
waiting
starting
running
time
task placement latency
62
Google trace, 12,500 machines, 90% slot utilization
Placement latency is too high!
63
Google trace, 12,500 machines, 90% slot utilization
Firmament places tasks with low latency at Google scale
Sub-second latency for 98% of tasks
Google acceleration introduction
64
How well does�Firmament handle challenging workloads?
Experimental setup:
Google acceleration empty figure
65
Firmament handles challenging workloads at low latency
Google trace, 12,500 machines, utilization between 75% and 90%
Simulate interactive workloads by scaling down task runtimes
better
Median task runtime: 420s
Median task runtime: 1.7s
Google acceleration Cost scaling
66
Google trace, 12,500 machines, utilization between 75% and 90%
better
Average latency is too high even without many short tasks
45 seconds average latency (tuned over Quincy setup’s 66s)
Firmament handles challenging workloads at low latency
Google acceleration Relaxation
67
Google trace, 12,500 machines, utilization between 75% and 90%
Latency with a 250x acceleration:
75th percentile: 2 sec
maximum: 57 sec
better
Firmament handles challenging workloads at low latency
Google acceleration Firmament
68
Google trace, 12,500 machines, utilization between 75% and 90%
Firmament’s common-case latency is sub-second even at 250x acceleration
better
Firmament handles challenging workloads at low latency
Workload-mix intro
69
How do Firmament’s placements compare to other schedulers?
Experimental setup:
Workload mix-service task figure
70
Interactive
Service
M1
M2
M3
M4
M8
M7
M6
M5
M9
M10
Rack1
Rack2
Core
Network utilization:
low
medium
high
Workload mix-service task figure
71
Interactive
Service
M1
M2
M3
M4
M8
M7
M6
M5
M9
M10
Rack1
Rack2
Core
Network utilization:
low
medium
high
Already running
Workload mix-service task figure
72
Interactive
Service
M1
M2
M3
M4
M8
M7
M6
M5
M9
M10
Rack1
Rack2
Core
Network utilization:
low
medium
high
SCHEDULER
Workload-mix Sparrow
73
better
5 seconds task response time on idle cluster
Firmament chooses good placements
Task�submitted
Start�scheduling
Task�placed
Task�running
Task�completed
scheduling
waiting
starting
running
time
task response time
Workload-mix Sparrow
74
Sparrow is unaware of resource utilization
better
20% of tasks experience poor placement
Firmament chooses good placements
Workload-mix Kubernetes
75
Kubernetes only accounts for memory and CPU
better
20% of tasks experience poor placement
Firmament chooses good placements
Workload-mix Docker
76
Centralized Kubernetes and Docker still suffer
better
Firmament chooses good placements
20% of tasks experience poor placement
Workload-mix Firmament
77
better
Avoided co-location interference
Firmament chooses good placements
Firmament outperforms centralized and distributed schedulers
Firmament policy specification
78
class QuincyPolicy {
  Cost_t TaskToResourceNodeCost(TaskID_t task_id) {
    return task_unscheduled_time * quincy_wait_time_factor;
  }
  ...
};
Firmament supports multiple policies
Defines flow graph
Quincy policy graph
79
Firmament supports multiple policies
T0,0
Quincy
Quincy policy
M0
M1
R0
M2
M3
R1
S
T2
X
U
T0
T1
T3
T4
Locality preference
T0,0
Locality preference
T0
T3
T3
T1
T4
Co-location policy graph
80
T0,0
Firmament supports multiple policies
Quincy
Co-location-aware
M0
M1
R0
M2
M3
R1
X
S
T0
T1
T2
T3
T4
U1
L3
C0
C1
L2
L3
C0
C1
L2
Co-location-aware policy
Network-aware policy graph
81
Firmament supports multiple policies
Co-location-aware
T0,0
Quincy
Network-aware
M0
M1
M2
M3
S
T2
U
T0
T1
T3
Network utilization
high
low
Network-aware policy
Energy-aware policy graph
82
Firmament supports multiple policies
Co-location-aware
T0,0
Quincy
Network-aware
Energy-aware policy
Energy-aware
M0
M1
M2
P
B
S
T0
T1
T2
T3
T4
U
C0
C1
C0
C1
C0
C1
Energy-aware policy
Ideal scheduler
83
Current and future work
1
Gang scheduling
2
Complex constraints (e.g., task affinity)
3
Modeling costs for scheduling policies
Conclusions
84
Conclusions
Firmament cluster manager & scheduler:
Firmament Kubernetes integration:
Ideal scheduler
85
Musketeer – workflow manager [EuroSys’15]
QJump – bounded latency data center networking [NSDI’15, best paper]
Broom – memory management for big data [HotOS’15]
Firmament – scalable, high-quality cluster scheduling [OSDI’16]
Falkirk – rollback recovery for dataflow systems [under submission]
The big picture:
86
Backup slides
Ideal scheduler
87
Ideal scheduler
88
Task�submitted
Start�scheduling
Task�placed
Task�running
Task�completed
scheduling
waiting
starting
running
time
task placement latency
task response time
algorithm runtime
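The timeline above defines three metrics; a small helper makes their relationship concrete (the timestamps are hypothetical event times in seconds):

```python
def task_metrics(submitted, start_scheduling, placed, completed):
    """Compute the three latency metrics from the task's timeline."""
    return {
        # Submission until the scheduler emits a placement.
        "placement_latency": placed - submitted,
        # Time the scheduling algorithm itself spent on this task.
        "algorithm_runtime": placed - start_scheduling,
        # Submission until the task finishes.
        "response_time": completed - submitted,
    }

m = task_metrics(submitted=0.0, start_scheduling=0.2,
                 placed=0.9, completed=11.4)
```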
Centralized vs. distributed
89
Sees entire cluster
Sophisticated algorithms
Good task placements
BUT: high scheduling latency
Partial, stale state
Simple algorithms
Low scheduling latency
BUT: poor task placements
What do we have today?
centralized
SCHEDULER
Task
Task
distributed
S
S
S
S
S
S
S
S
S
Task
Task
90
How well does Firmament work if the cluster is oversubscribed?
Experimental setup:
91
better
Google trace, 12,500 machines, periods of oversubscription
Oversubscription
92
better
Relaxation is slow when utilization is very high
Google trace, 12,500 machines, periods of oversubscription
Clearing backlog
93
better
Cost scaling’s runtime is not affected by utilization
Google trace, 12,500 machines, periods of oversubscription
94
better
Google trace, 12,500 machines, periods of oversubscription
Recovers earlier
95
What is Firmament’s scalability limit?
96
Batch tasks: ~100s of seconds
Workload mix of a typical cluster
Time
0
50
75
100
25
Tasks [%]
Long running service tasks
97
Short batch tasks only (~100s of milliseconds)
Short batch-only tasks workload
Time
0
50
75
100
25
Tasks [%]
98
Task duration [ms]
500
250
1000
1 second
Breaking point experiment setup
99
100 machines, 80% utilization, 10 tasks per job
100
Spark’s scheduler cannot handle tasks <= 1355 ms
100 machines, 80% utilization, 10 tasks per job
101
Firmament handles tasks > 10 ms
100 machines, 80% utilization, 10 tasks per job
102
Firmament handles tasks >= 350 ms
1000 machines, 80% utilization, 10 tasks per job
103
How well does Firmament place tasks with different resource requirements?
Experimental setup:
104
runtime on otherwise idle cluster: ideal
Homogeneous 28-machine cluster, 80% utilization, 15,000 tasks
better
105
runtime on otherwise idle cluster: ideal
2-4x increase in runtime over ideal
Homogeneous 28-machine cluster, 80% utilization, 15,000 tasks
better
106
runtime on otherwise idle cluster: ideal
Mesos reduces runtime, but still 2-3x of ideal
Homogeneous 28-machine cluster, 80% utilization, 15,000 tasks
better
107
runtime on otherwise idle cluster: ideal
Firmament comes close to the ideal task runtime
Homogeneous 28-machine cluster, 80% utilization, 15,000 tasks
better
Quincy vs Firmament locality
108
Quincy slows down when threshold is decreased
Firmament improves data locality without a significant increase in scheduling latency
Google trace, 12,500 machines, varying locality threshold
2x
18%
Approximate solver
109
better
12,500 machines, runs on a graph from a highly utilized cluster