RobustScaler: QoS-Aware Autoscaling for Complex Workloads
Huajie Qian, Qingsong Wen, Liang Sun, Jing Gu, Qiulin Niu, Zhimin Tang
Alibaba Group
Overview
Background: Autoscaling
[Figure, animated: a load balancer receives QPS and distributes it across instances; the number of instances goes from two to four and back to two.]
Background: Scaling-Per-Query
[Figure, animated: queries 1 to 4 shown on Instances 1 and 2, and on Instances 1 to 4 with one query each. The instance that served query 4 is then terminated; query 5 arrives while a new instance is starting, and it is waiting (cold start) until it is assigned to Instance 5.]
Background: Scaling-Per-Query Dynamics
[Figure: timeline of instance startup and query processing. Annotations: "not hit, idle time 0, query wait > 0"; "not hit, idle time 0, query wait longest".]
QoS metrics:
RT = end of processing – time of arrival
Cost metrics:
Idle time = time to start processing a query – time of finishing startup
Trade-off: start earlier vs. start later
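The two metric definitions on this slide can be made concrete with a small sketch. The function and field names below (`per_query_outcome`, `launch`, `startup`, `service`) are illustrative assumptions and do not come from the slides; the sketch also assumes one dedicated instance per query and a deterministic startup time.

```python
from dataclasses import dataclass


@dataclass
class QueryOutcome:
    hit: bool             # True if the instance was ready when the query arrived
    wait: float           # time the query waits for the instance
    idle_time: float      # cost metric: start of processing - end of startup
    response_time: float  # QoS metric (RT): end of processing - arrival


def per_query_outcome(arrival, launch, startup, service):
    """Outcome for one query served by one dedicated instance (hypothetical helper)."""
    ready = launch + startup     # time of finishing startup
    start = max(arrival, ready)  # time to start processing the query
    end = start + service        # end of processing
    return QueryOutcome(
        hit=ready <= arrival,
        wait=start - arrival,
        idle_time=start - ready,
        response_time=end - arrival,
    )


# Same query, two launch times: starting earlier buys a hit at the price of idle time.
early = per_query_outcome(arrival=10.0, launch=4.0, startup=3.0, service=2.0)
late = per_query_outcome(arrival=10.0, launch=9.0, startup=3.0, service=2.0)
print(early)
print(late)
```

In this toy case the early launch is a hit with 3 units of idle time and an RT of 2, while the late launch is not a hit, has no idle time, and the RT grows to 4.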
Related Work & Challenges
Nonhomogeneous Poisson Process (NHPP) + periodicity regularization
Stochastically constrained optimization
RobustScaler Framework
Input: historical QPS data
Modules: periodicity detection, historical query arrival modeling, query arrival prediction, scaling plan
Periodicity detection: detect periodic patterns of different scales (RobustPeriod, Wen et al., SIGMOD ’21)
Scaling plan: solve stochastically constrained optimization (SCO) for optimal scaling decisions at a given service/cost level
Output: decisions
Arrival Modeling as NHPP
Objective terms: log likelihood, smoothness penalty, periodicity penalty
[Figure: effect of periodicity penalty]
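The equation on this slide did not survive extraction. One plausible shape for an objective with these three terms is sketched below. Everything in it is an assumption for illustration: a piecewise-constant intensity $\lambda_t = e^{r_t}$ over $T$ bins of width $\Delta$ with observed counts $N_t$, a second-order difference penalty for smoothness, a lag-$L$ difference penalty for a detected period $L$, $\ell_1$ norms for both penalties, and weights $\beta_1, \beta_2$. The slides may use different norms or parameterization.

```latex
\min_{r \in \mathbb{R}^T}\;
\underbrace{\sum_{t=1}^{T} \bigl(\Delta\, e^{r_t} - N_t\, r_t\bigr)}_{\text{negative log likelihood (up to a constant)}}
\;+\; \beta_1 \underbrace{\sum_{t=2}^{T-1} \bigl| r_{t-1} - 2 r_t + r_{t+1} \bigr|}_{\text{smoothness penalty}}
\;+\; \beta_2 \underbrace{\sum_{t=1}^{T-L} \bigl| r_{t+L} - r_t \bigr|}_{\text{periodicity penalty}}
```

The first term follows from the Poisson log likelihood of a count $N_t$ with mean $\Delta e^{r_t}$, which is $N_t r_t - \Delta e^{r_t}$ plus terms that do not depend on $r$.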
Arrival Modeling as NHPP: ADMM Solution
Augmented Lagrangian:
update intensity (no closed-form solution; use quadratic approximation)
update smoothness penalty
update periodicity penalty
update dual variables
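The four update steps can be sketched as a short scaled-form ADMM loop. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes binned counts `counts`, a log intensity `r` per bin, a Poisson negative log likelihood, and l1 penalties on second-order differences (smoothness) and lag-`period` differences (periodicity). The names `fit_log_intensity`, `beta1`, `beta2`, and `rho` are hypothetical. The intensity update uses one Newton step, i.e. a quadratic approximation of the likelihood term.

```python
import numpy as np


def soft_threshold(x, k):
    """Proximal operator of k * |x| (elementwise)."""
    return np.sign(x) * np.maximum(np.abs(x) - k, 0.0)


def fit_log_intensity(counts, period, delta=1.0, beta1=1.0, beta2=5.0,
                      rho=1.0, n_iter=100):
    counts = np.asarray(counts, dtype=float)
    T = counts.size
    eye = np.eye(T)
    D2 = np.diff(eye, n=2, axis=0)       # second-order differences, shape (T-2, T)
    DL = eye[period:] - eye[:-period]    # lag-`period` differences, shape (T-period, T)
    gram = rho * (D2.T @ D2 + DL.T @ DL)

    r = np.log(np.maximum(counts, 0.5) / delta)  # start near the raw estimate
    z1, u1 = np.zeros(T - 2), np.zeros(T - 2)
    z2, u2 = np.zeros(T - period), np.zeros(T - period)

    for _ in range(n_iter):
        # Update intensity: no closed form, so take one Newton step on the
        # quadratic approximation of the Poisson term at the current r.
        mu = delta * np.exp(r)           # diagonal of the Hessian
        grad = mu - counts
        rhs = mu * r - grad + rho * (D2.T @ (z1 - u1) + DL.T @ (z2 - u2))
        r = np.linalg.solve(np.diag(mu) + gram, rhs)
        # Update smoothness and periodicity auxiliary variables.
        z1 = soft_threshold(D2 @ r + u1, beta1 / rho)
        z2 = soft_threshold(DL @ r + u2, beta2 / rho)
        # Update (scaled) dual variables.
        u1 += D2 @ r - z1
        u2 += DL @ r - z2
    return r


# Synthetic example: a daily pattern over four "days" of 24 bins each.
rng = np.random.default_rng(0)
t = np.arange(96)
truth = 20.0 + 15.0 * np.sin(2 * np.pi * t / 24)
counts = rng.poisson(truth)
lam_hat = np.exp(fit_log_intensity(counts, period=24))
```

Dense matrices keep the sketch short; for long traces a sparse or banded solver would be the natural replacement for `np.linalg.solve`.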
Experiment: Benefit of Periodicity Penalty
Synthetic data:
Estimate vs. truth:
Blue: ground truth
Orange: w/o periodicity penalty (wavy, especially at the valleys)
Red: w/ periodicity penalty (accurate)
The Constrained Optimization Model
separable
monotonicity
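The model's equations are not recoverable from the slide text, so the following is only an illustration of how a monotone constraint can make a per-instance decision simple. It assumes the decision is a single launch time, the constraint is a target hitting probability, and the next arrival time is available as Monte Carlo samples (for example, drawn from a fitted arrival model). Under those assumptions the hitting probability can only decrease as the launch time moves later, so the latest feasible launch time is an empirical quantile. The names `latest_launch_time` and `empirical_hit_rate` are hypothetical.

```python
import numpy as np


def latest_launch_time(arrival_samples, startup_time, target_hit_prob):
    """Latest launch time whose empirical hitting probability meets the target."""
    # An instance launched at time x is ready at x + startup_time, so it is a
    # hit for a sample a exactly when x <= a - startup_time (the slack).
    slack = np.sort(np.asarray(arrival_samples, dtype=float) - startup_time)
    n = slack.size
    j = int(np.floor((n - 1) * (1.0 - target_hit_prob)))
    return slack[j]


def empirical_hit_rate(arrival_samples, startup_time, launch):
    slack = np.asarray(arrival_samples, dtype=float) - startup_time
    return float(np.mean(slack >= launch))


rng = np.random.default_rng(1)
arrivals = rng.exponential(scale=60.0, size=10_000)  # assumed arrival-time samples
launch = latest_launch_time(arrivals, startup_time=5.0, target_hit_prob=0.9)
rate = empirical_hit_rate(arrivals, startup_time=5.0, launch=launch)
```

A negative `launch` simply means the instance would already need to be running at time zero to meet the target.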
Experiments
Trace | #queries | duration
Alibaba container registry (CRS) trace | 21,059 | 28 days
Alibaba cluster 2018 trace (https://github.com/alibaba/clusterdata) | 503,850 | 5 days
Google cluster 2019 trace (https://github.com/google/cluster-data) | 20,254 | 24 hours
QoS-cost Pareto Plots on Alibaba trace
Under the same relative cost, higher hitting rate or lower response time is better!
RobustScaler > AdapBP > BP
QoS-cost Pareto Plots on Google trace
Under the same relative cost, higher hitting rate or lower response time is better!
RobustScaler > BP > AdapBP
QoS-cost Pareto Plots on CRS trace
QoS-cost Pareto Plots under Perturbation
Perturbation: queries within certain time intervals are deleted, or multiplied c times for c = 1, 2, 4, 6
RobustScaler-HP surpasses AdapBP as c increases! RobustScaler is more robust against data contamination.
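A perturbation of this kind could be generated as in the sketch below. It assumes the trace is an array of arrival timestamps and that "multiplied c times" means each query in the interval is replicated c times at the same timestamp; the slides do not specify the exact mechanism, and `perturb_arrivals` is a hypothetical helper.

```python
import numpy as np


def perturb_arrivals(arrivals, start, end, c):
    """Replicate queries in [start, end) c times; c = 0 deletes them."""
    arrivals = np.asarray(arrivals, dtype=float)
    inside = (arrivals >= start) & (arrivals < end)
    repeats = np.where(inside, c, 1)  # queries outside the interval are kept once
    return np.repeat(arrivals, repeats)


trace = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
deleted = perturb_arrivals(trace, start=2.0, end=4.0, c=0)
doubled = perturb_arrivals(trace, start=2.0, end=4.0, c=2)
```

With c = 1 the trace is returned unchanged, which gives the unperturbed baseline.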
Robustness: Missing Data and Anomalies
Scalability
[Figure: RobustScaler framework diagram (historical QPS data, periodicity detection, historical query arrival modeling, query arrival prediction, scaling plan, decisions)]
Accuracy of Metric Control
RobustScaler-RT
RobustScaler-cost
RobustScaler-HP
Conclusion