1 of 49

Scale Your Deployments, Not Your Wallet

Financial and Technical Efficiency

Danielle Zephyr Malament

Site Reliability Engineer

zephyr@spotify.com

link to slides at the end

2 of 49

Prerequisites

  • Familiarity with basic Kubernetes (k8s) concepts and terminology such as:
    • Containers
    • Pods
    • Deployments
    • Nodes
    • Clusters
    • Namespaces

3 of 49

Disclaimer

  • The recommendations in this talk aren't absolute; they are good guidelines for our use cases, but YMMV

4 of 49

Tunables

  • On the deployment resource, for each container:
    • CPU request and limit
    • Memory request and limit
  • On the HPA (Horizontal Pod Autoscaler) resource:
    • HPA target(s)
    • HPA min and max
  • On the GKE namespace:
    • CPU and memory quotas (requests and limits)
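
As a quick orientation, the first group lives on each container in the deployment's pod template; a fragment with placeholder values (the HPA and namespace quota resources are sketched on later slides):

    # Deployment → spec.template.spec.containers[]
    containers:
    - name: my-service                  # placeholder name
      resources:
        requests:
          cpu: 1                        # CPU request
          memory: 2Gi                   # memory request
        limits:
          cpu: 1500m                    # CPU limit
          memory: 2Gi                   # memory limit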

5 of 49

Service Features

  • Continuous service vs cron job
  • Burstiness
  • Latency sensitivity
  • Batch (async, time-fungible) workloads

6 of 49

Requests and Limits

7 of 49

Approach

  • Start per-pod, then modify based on HPA requirements
  • Within a pod, each container has its own CPU and memory requests and limits

8 of 49

Definitions

  • A request is a minimum reserved value
    • Used for bin-packing
  • A limit is a maximum allowed value

9 of 49

CPU Limits

  • Compressible
    • If CPU usage is above the limit, or CPU is unavailable, the container will be throttled (slowed down)
  • Expressed in floats (1, 0.25) or in millicpu (1000m, 250m)
    • Absolute scale (not hardware-dependent)

10 of 49

Memory Limits

  • Incompressible
    • If memory usage is above the limit, the container will be killed
    • If memory is unavailable on the node, the container will be moved if possible (or else killed)
  • Expressed in bytes, or in larger units with suffixes of K, M, G, T, P, or E
    • Or rather, Ki, Mi, Gi, Ti, Pi, or Ei
    • For example, 1048576 bytes = 1Mi

11 of 49

Information to Gather

  • Minimum CPU/memory used
  • Average CPU/memory used
  • Maximum CPU/memory used (steady-state)
  • Maximum CPU/memory used (spikes)

12 of 49

CPU Tuning

  • Request
    • Near the average CPU used (generally)
    • Must be above the minimum CPU
  • Limit
    • Somewhat above the steady-state maximum for services that can tolerate latency (e.g. batch handlers)
    • Above the spike maximum for services that are latency-sensitive
  • This approach balances keeping resources available with efficient utilization (and therefore lower financial cost)
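
For example (hypothetical numbers): a latency-tolerant container that averages 400m, peaks around 600m in steady state, and spikes to 900m might be tuned roughly like this:

    resources:
      requests:
        cpu: 400m    # near the average, and above the minimum used
      limits:
        cpu: 750m    # somewhat above the 600m steady-state max (latency-tolerant)
        # a latency-sensitive service would instead set the limit above the
        # 900m spike maximum, e.g. 1 (1000m) or more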

13 of 49

CPU Tuning Caveat

  • Use CPU requests that allow efficient bin-packing
    • Remember that Kubernetes has per-node overhead; you may also be using sidecar containers
    • For example, if your nodes have 32 cores, you should probably keep your CPU requests <=12, or else around 25

14 of 49

Memory Tuning

  • Request
    • Near the average memory used
    • Must be above the minimum memory
  • Limit
    • Well above the spike maximum, unless you can tolerate a lot of container churn (= latency)

15 of 49

Memory Tuning Caveats (1/2)

  • If your service isn’t memory-bound:
    • Make request = limit
    • Keep it to something sane to protect the rest of the cluster
  • This protects against container moves
    • If the container's usage is above the request and is moved, service discovery may take time to catch up
    • If the container's usage is above the limit, it will be restarted, probably on the same node
    • You should already be as resilient as possible against service discovery staleness/failure
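
A minimal sketch of the request = limit pattern, with a placeholder value:

    resources:
      requests:
        memory: 2Gi    # placeholder; comfortably above observed usage
      limits:
        memory: 2Gi    # equal to the request, and kept sane to protect the cluster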

16 of 49

Memory Tuning Caveats (2/2)

  • Java has its own memory management
    • Allocates memory, keeps it, and never returns it to the OS
    • Based on the JVM max heap setting (-Xmx) and the limit
  • Non-Java services may not be aware of the limit
    • May eventually pass it and be terminated
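
One common way to handle this for a JVM service (a sketch; the name, image, and values are placeholders) is to cap the heap explicitly with -Xmx, leaving room under the container's memory limit for non-heap usage:

    containers:
    - name: my-java-service                                  # placeholder
      image: registry.example.com/my-java-service:1.0        # placeholder
      command: ["java", "-Xmx6g", "-jar", "/service.jar"]    # heap capped at 6 GiB
      resources:
        requests:
          memory: 8Gi
        limits:
          memory: 8Gi    # ~2 GiB of headroom above the heap for non-heap memory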

17 of 49

HPA Settings:

Targets

18 of 49

Approach

  • Start with per-pod settings (target and limit)
  • Then work on number of pods (min and max)

19 of 49

Definitions (1/3)

  • A Horizontal Pod Autoscaler (HPA) increases or decreases the number of pods in a deployment based on some condition or set of conditions
  • An HPA target is a threshold: when usage is above it, the HPA scales the deployment up; when usage is below it, it scales down
    • Takes time
    • Has cooldown periods to prevent changing too rapidly

20 of 49

Definitions (2/3)

  • Standard HPA targets are percentages of the CPU or memory requests
    • When actual usage exceeds the target percentage of the CPU or memory request, more pods will be added (if possible)
    • When actual usage drops below the target percentage, pods will be stopped
    • Usage is the average across all pods in a deployment
    • Percentages >= 100 are allowed
    • Config key is targetAverageUtilization
  • Custom targets are also possible by writing plugins (e.g. QPS)
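
A minimal HPA sketch using the standard CPU target; this assumes the autoscaling/v2beta1 API, where the key is targetAverageUtilization as above (in the newer autoscaling/v2 API it becomes metrics[].resource.target.averageUtilization), and the names are placeholders:

    apiVersion: autoscaling/v2beta1
    kind: HorizontalPodAutoscaler
    metadata:
      name: my-service            # placeholder
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: my-service          # placeholder
      minReplicas: 3
      maxReplicas: 20
      metrics:
      - type: Resource
        resource:
          name: cpu
          targetAverageUtilization: 80    # percent of the CPU request (the talk recommends 80)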

21 of 49

Definitions (3/3)

  • Memory targets aren't necessarily useful
    • Processes don't always release memory back to the OS (e.g. Java)
  • Absolute HPA targets are request values multiplied by the HPA target percentages
  • A pod's CPU headroom is how much more CPU a pod can use than its HPA target
    • The amount the pod can burst while waiting for the HPA to kick in
  • Services with high startup CPU costs may have a lot of churn when they start; consider using grace periods

22 of 49

HPA Target Tuning

  • Resilience vs. efficient utilization
    • Lower target → more pods with less utilization in each
      • Better resilience due to headroom and averaging
      • But more expensive
    • Higher target → greater utilization
      • Cheaper
      • But too high → no scaling until heavily loaded
        • Scaling also takes time
  • Recommendation: CPU target = 80%

23 of 49

HPA Target Tuning Caveat

  • If the absolute HPA target is below the minimum CPU actually used, the HPA will keep adding pods, because per-pod usage bottoms out at that minimum and never drops below the target
    • E.g. with very flat usage
    • Raise the absolute target indirectly, by raising the request

24 of 49

CPU Limit Implications (1/2)

  • The HPA uses the CPU request, but not the limit
  • But the limit sets how far your service can burst (CPU headroom)
    • Lets you tune performance under increasing load, while waiting for HPA scaling
    • If the limit isn't high enough, the service might be throttled

25 of 49

CPU Limit Implications (2/2)

  • The CPU headroom is the space between the absolute HPA target and the CPU limit:
    • headroom = limit - (target * request)
  • The limit is the lever here:
    • If the service needs more headroom, raise the limit, over and above the steady-state or spike maximum usage
  • Suggested starting point: CPU limit = 1.5 * CPU request
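
Plugging hypothetical numbers into the formula above:

    resources:
      requests:
        cpu: 1000m       # request
      limits:
        cpu: 1500m       # limit = 1.5 * request (suggested starting point)
    # with an 80% HPA target:
    #   absolute target = 0.80 * 1000m = 800m
    #   headroom        = 1500m - 800m = 700m of burst per pod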

26 of 49

HPA Settings:

Minimums and Maximums

27 of 49

Approach

  • Some changes in load (especially rapid ones) will show up as changes in the CPU usage on the pods, and others will show up as changes in the number of pods
  • The balance between these is affected by the HPA min and max

28 of 49

Definitions and Caveat

  • A deployment’s HPA minimum means at least this many pods will always be deployed (if possible)
    • Config key is minReplicas
  • A deployment’s HPA maximum means no more than this many pods will be deployed, even if the target conditions are exceeded
    • Config key is maxReplicas
  • Don’t use the replicas field with deployments that have an HPA
    • The deployment will start with that number of pods, and then the HPA will start scaling
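
A sketch of what this looks like on the deployment side (placeholders throughout): the replicas field is simply omitted, and the pod count is owned by the HPA's minReplicas/maxReplicas:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-service                  # placeholder
    spec:
      # note: no "replicas" field; the HPA controls the number of pods
      selector:
        matchLabels:
          app: my-service
      template:
        metadata:
          labels:
            app: my-service
        spec:
          containers:
          - name: my-service            # placeholder; resources as on earlier slides
            image: registry.example.com/my-service:1.0    # placeholder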

29 of 49

HPA Minimum Tuning: Fundamentals (1/3)

  • HPA minimums have the same balance as requests: availability and resilience vs. efficient utilization (= high minimum vs. low minimum)
  • Don’t go too high, or you’ll be wasting resources
  • Don’t go too low:
    • At least 2 or 3 for redundancy, e.g. against node failure
    • But really, as many as needed to handle the minimum actual usage, to be robust against shortages and outages

30 of 49

HPA Minimum Tuning: Fundamentals (2/3)

  • In practice:
    • For small services, you can start small and tweak
    • For migrating larger services to Kubernetes, you’ll need to handle realistic requirements from the beginning
  • Figure out what average QPS you can expect for the service, and what QPS your starting pod size can handle, then:
    • min = total average QPS / pod QPS
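
With hypothetical numbers: if the service averages 2000 QPS and one pod at your starting size handles about 250 QPS:

    # min = total average QPS / pod QPS = 2000 / 250 = 8
    minReplicas: 8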

31 of 49

HPA Minimum Tuning: Fundamentals (3/3)

  • If the service is hitting its HPA minimum for more than about 6 hours per day, lower it, or lower the CPU request and limit
    • Which one is better depends on many factors of your circumstances; changing these values will move your bottleneck around
  • Alternatively, start with min = total minimum QPS / pod QPS and raise it as necessary

32 of 49

HPA Minimum Tuning: Service Implications (1/2)

  • The min and max help tune the tradeoff between more, less-loaded pods and fewer, more-loaded pods (= resiliency vs. efficiency)
    • “Width” vs. “height”
  • Many pods → surges will be spread out
    • CPU will be fairly flat on all pods
    • But the utilization will be low
  • Few pods → surges taken up by pod headroom
    • CPU will vary a fair bit on each pod
    • But utilization will be high

33 of 49

HPA Minimum Tuning: Service Implications (2/2)

  • HPA scaling is automatic and faster than tuning requests and limits
  • So, aim for flat and high CPU, in that order, and let the HPA handle fluctuations
  • The HPA scaling pattern should follow the traffic load as closely as possible, while the CPU should be within +/- 20% of the average
    • Otherwise, lower the CPU request
  • Note: HPA scaling is based on average load, not max
    • Individual pod overloads → an increase (or spikes) in latency

34 of 49

HPA Maximum Tuning (1/3)

  • A deployment’s HPA maximum determines how wide it can get
  • This time, the tradeoff is lopsided:
    • High enough to handle current maximum needs plus some future growth
    • Reasonable, for the safety of the cluster
  • Unlike HPA minimums, HPA maximums don’t translate directly into financial cost, so it's ok to err on the high side

35 of 49

HPA Maximum Tuning (2/3)

  • Start by figuring out the highest number of pods actually created on a regular basis
    • Observe HPA size graphs, and/or:
    • Figure out what maximum QPS you can expect for the service, and what QPS your pod size can handle, then double the result: make max = at least 2 * (total maximum QPS / pod QPS)
  • Doubling allows for cross-region failovers and growth
    • The actual factor to use depends on your region distribution and failover requirements
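
Continuing the hypothetical numbers: if the expected maximum is 4000 QPS and one pod handles about 250 QPS:

    # max = at least 2 * (total maximum QPS / pod QPS) = 2 * (4000 / 250) = 32
    maxReplicas: 32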

36 of 49

HPA Maximum Tuning (3/3)

  • Typically, go above this, especially if you expect the service to grow quickly (i.e. faster than you plan to revisit the settings)
  • If the HPA size is frequently above 50% of the HPA max, raise the HPA max and/or CPU request and limit
  • For very small services, start with HPA maximum = 20

37 of 49

Namespace Quotas

38 of 49

Namespace Quotas

  • One more layer: k8s namespaces can have CPU and memory quotas
    • No decisions this time, though
  • If you’re using them, check them when adding/adjusting deployments to avoid throttling or scheduling issues
  • Quotas are CPU and memory requests and limits
    • They apply to the totals of the requests and limits, respectively, of all of your deployments at their HPA maximums
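
A sketch of a namespace quota; the keys under spec.hard are the standard ResourceQuota compute keys, and the name and values are placeholders to be sized against the sum of your deployments at their HPA maximums:

    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: compute-quota             # placeholder
      namespace: my-namespace         # placeholder
    spec:
      hard:
        requests.cpu: "200"           # total CPU requests allowed in the namespace
        requests.memory: 400Gi        # total memory requests
        limits.cpu: "300"             # total CPU limits
        limits.memory: 600Gi          # total memory limits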

39 of 49

Iteration:

Staying Stable and Efficient

40 of 49

Staying Stable and Efficient (1/3)

  • To start:
    • Set initial values
    • Observe usage and scaling
    • Tune until stable and reasonable
  • Over time:
    • Revisit the settings periodically (and regularly)
    • Account for growth
    • I.e., do capacity planning

41 of 49

Staying Stable and Efficient (2/3)

  • Planning importance
    • Small service → lower stakes w.r.t. efficiency
    • Bigger → more financial impact
    • Faster growth → greater importance of regular capacity planning
  • Planning aspects
    • Stability
    • Efficiency
    • Growth without hitting limits

42 of 49

Staying Stable and Efficient (3/3)

  • The HPA should be taking care of most of your service's size fluctuations
  • Growth → raise the HPA minimum, HPA maximum, and namespace quotas

43 of 49

(Over)simplification / TL;DR

44 of 49

Quick (and Dirty) Start Guide (1/3)

  • Set initial configuration
    • CPU request: 4 cores
    • CPU limit: 6 cores
    • Memory request and limit: 8Gi
    • HPA CPU target: 80%
    • HPA Min: 3
    • HPA Max: 20
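
Put together as a single starting manifest (a sketch: the name, image, and labels are placeholders, and the HPA uses the autoscaling/v2beta1 API as in the earlier sketch):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-service
    spec:
      selector:
        matchLabels:
          app: my-service
      template:
        metadata:
          labels:
            app: my-service
        spec:
          containers:
          - name: my-service
            image: registry.example.com/my-service:1.0
            resources:
              requests:
                cpu: 4
                memory: 8Gi
              limits:
                cpu: 6
                memory: 8Gi
    ---
    apiVersion: autoscaling/v2beta1
    kind: HorizontalPodAutoscaler
    metadata:
      name: my-service
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: my-service
      minReplicas: 3
      maxReplicas: 20
      metrics:
      - type: Resource
        resource:
          name: cpu
          targetAverageUtilization: 80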

45 of 49

Quick (and Dirty) Start Guide (2/3)

  • Deploy your application
    • OOM → increase the memory request and limit
    • Watch the metrics for the HPA (daily rhythm)
  • If the HPA size is frequently near the HPA min, lower the HPA min and/or the CPU request and limit
    • Keep the HPA min at least 3 for better availability
  • If the HPA size is frequently >50% of the HPA max, raise the HPA max and/or the CPU request and limit
    • Use CPU requests that allow efficient bin-packing

46 of 49

Quick (and Dirty) Start Guide (3/3)

  • If the CPU varies beyond +/- 20% of the average, lower the CPU request
  • Iterate until stable

47 of 49

Quick (and Dirty) Start Guide Flowchart

48 of 49

References

49 of 49

https://tinyurl.com/kubernetes-scaling

Danielle Zephyr Malament

Site Reliability Engineer • Automator • Gender Stuff

Pronouns: they/them

zephyr@spotify.com

danielle.malament@gmail.com