1 of 49

Scale Your Deployments, Not Your Wallet

Financial and Technical Efficiency

Danielle Zephyr Malament

Site Reliability Engineer

zephyr@spotify.com

link to slides at the end

2 of 49

Prerequisites

  • Familiarity with basic Kubernetes (k8s) concepts and terminology such as:
    • Containers
    • Pods
    • Deployments
    • Nodes
    • Clusters
    • Namespaces

3 of 49

Disclaimer

  • The recommendations in this talk aren't absolute; they are good guidelines for our use cases, but YMMV

4 of 49

Tunables

  • On the deployment resource, for each container:
    • CPU request and limit
    • Memory request and limit
  • On the HPA (Horizontal Pod Autoscaler) resource:
    • HPA target(s)
    • HPA min and max
  • On the GKE namespace:
    • CPU and memory quotas (requests and limits)
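
As a quick orientation, the first group lives on each container in the deployment's pod template; a fragment with placeholder values (the HPA and namespace quota resources are sketched on later slides):

    # Deployment → spec.template.spec.containers[]
    containers:
    - name: my-service                  # placeholder name
      resources:
        requests:
          cpu: 1                        # CPU request
          memory: 2Gi                   # memory request
        limits:
          cpu: 1500m                    # CPU limit
          memory: 2Gi                   # memory limit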

5 of 49

Service Features

  • Continuous service vs cron job
  • Burstiness
  • Latency sensitivity
  • Batch (async, time-fungible) workloads

6 of 49

Requests and Limits

7 of 49

Approach

  • Start per-pod, then modify based on HPA requirements
  • Within a pod, each container has its own CPU and memory requests and limits

8 of 49

Definitions

  • A request is a minimum reserved value
    • Used for bin-packing
  • A limit is a maximum allowed value

9 of 49

CPU Limits

  • Compressible
    • If CPU usage is above the limit, or CPU is unavailable, the container will be throttled (slowed down)
  • Expressed in floats (1, 0.25) or in millicpu (1000m, 250m)
    • Absolute scale (not hardware-dependent)

10 of 49

Memory Limits

  • Incompressible
    • If memory usage is above the limit, the container will be killed
    • If memory is unavailable on the node, the container will be moved if possible (or else killed)
  • Expressed in bytes, or in larger units with suffixes of K, M, G, T, P, or E
    • Or rather, Ki, Mi, Gi, Ti, Pi, or Ei
    • For example, 1048576 bytes = 1Mi

11 of 49

Information to Gather

  • Minimum CPU/memory used
  • Average CPU/memory used
  • Maximum CPU/memory used (steady-state)
  • Maximum CPU/memory used (spikes)

12 of 49

CPU Tuning

  • Request
    • Near the average CPU used (generally)
    • Must be above the minimum CPU
  • Limit
    • Somewhat above the steady-state maximum for services that can tolerate latency (e.g. batch handlers)
    • Above the spike maximum for services that are latency-sensitive
  • This approach balances keeping resources available with efficient utilization (and therefore lower financial cost)
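
For example (hypothetical numbers): a latency-tolerant container that averages 400m, peaks around 600m in steady state, and spikes to 900m might be tuned roughly like this:

    resources:
      requests:
        cpu: 400m    # near the average, and above the minimum used
      limits:
        cpu: 750m    # somewhat above the 600m steady-state max (latency-tolerant)
        # a latency-sensitive service would instead set the limit above the
        # 900m spike maximum, e.g. 1 (1000m) or more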

13 of 49

CPU Tuning Caveat

  • Use CPU requests that allow efficient bin-packing
    • Remember that Kubernetes has per-node overhead; you may also be using sidecar containers
    • For example, if your nodes have 32 cores, you should probably keep your CPU requests <=12, or else around 25

14 of 49

Memory Tuning

  • Request
    • Near the average memory used
    • Must be above the minimum memory
  • Limit
    • Well above the spike maximum, unless you can tolerate a lot of container churn (= latency)

15 of 49

Memory Tuning Caveats (1/2)

  • If your service isn’t memory-bound:
    • Make request = limit
    • Keep it to something sane to protect the rest of the cluster
  • This protects against container moves
    • If the container's usage is above the request and is moved, service discovery may take time to catch up
    • If the container's usage is above the limit, it will be restarted, probably on the same node
    • You should already be as resilient as possible against service discovery staleness/failure
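
A minimal sketch of the request = limit pattern, with a placeholder value:

    resources:
      requests:
        memory: 2Gi    # placeholder; comfortably above observed usage
      limits:
        memory: 2Gi    # equal to the request, and kept sane to protect the cluster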

16 of 49

Memory Tuning Caveats (2/2)

  • Java has its own memory management
    • Allocates memory, keeps it, and never returns it to the OS
    • Based on the JVM max heap setting (-Xmx) and the limit
  • Non-Java services may not be aware of the limit
    • May eventually pass it and be terminated
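
One common way to handle this for a JVM service (a sketch; the name, image, and values are placeholders) is to cap the heap explicitly with -Xmx, leaving room under the container's memory limit for non-heap usage:

    containers:
    - name: my-java-service                                  # placeholder
      image: registry.example.com/my-java-service:1.0        # placeholder
      command: ["java", "-Xmx6g", "-jar", "/service.jar"]    # heap capped at 6 GiB
      resources:
        requests:
          memory: 8Gi
        limits:
          memory: 8Gi    # ~2 GiB of headroom above the heap for non-heap memory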

17 of 49

HPA Settings:

Targets

18 of 49

Approach

  • Start with per-pod settings (target and limit)
  • Then work on number of pods (min and max)

19 of 49

Definitions (1/3)

  • A Horizontal Pod Autoscaler (HPA) increases or decreases the number of pods in a deployment based on some condition or set of conditions
  • An HPA target is a threshold: when usage is above it, the HPA scales the deployment up; when usage is below it, it scales down
    • Takes time
    • Has cooldown periods to prevent changing too rapidly

20 of 49

Definitions (2/3)

  • Standard HPA targets are percentages of the CPU or memory requests
    • When actual usage exceeds the target percentage of the CPU or memory request, more pods will be added (if possible)
    • When actual usage drops below the target percentage, pods will be stopped
    • Usage is the average across all pods in a deployment
    • Percentages >= 100 are allowed
    • Config key is targetAverageUtilization
  • Custom targets are also possible by writing plugins (e.g. QPS)
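
A minimal HPA sketch using the standard CPU target; this assumes the autoscaling/v2beta1 API, where the key is targetAverageUtilization as above (in the newer autoscaling/v2 API it becomes metrics[].resource.target.averageUtilization), and the names are placeholders:

    apiVersion: autoscaling/v2beta1
    kind: HorizontalPodAutoscaler
    metadata:
      name: my-service            # placeholder
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: my-service          # placeholder
      minReplicas: 3
      maxReplicas: 20
      metrics:
      - type: Resource
        resource:
          name: cpu
          targetAverageUtilization: 80    # percent of the CPU request (the talk recommends 80)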

21 of 49

Definitions (3/3)

  • Memory targets aren't necessarily useful
    • Processes don't always release memory back to the OS (e.g. Java)
  • Absolute HPA targets are request values multiplied by the HPA target percentages
  • A pod's CPU headroom is how much more CPU a pod can use than its HPA target
    • The amount the pod can burst while waiting for the HPA to kick in
  • Services with high startup CPU costs may have a lot of churn when they start; consider using grace periods

22 of 49

HPA Target Tuning

  • Resilience vs. efficient utilization
    • Lower target → more pods with less utilization in each
      • Better resilience due to headroom and averaging
      • But more expensive
    • Higher target → greater utilization
      • Cheaper
      • But too high → no scaling until heavily loaded
        • Scaling also takes time
  • Recommendation: CPU target = 80%

23 of 49

HPA Target Tuning Caveat

  • If the absolute HPA target is below the minimum CPU actually used, the HPA will keep adding pods, because per-pod usage bottoms out at that minimum and never drops below the target
    • E.g. with very flat usage
    • Raise the absolute target indirectly, by raising the request

24 of 49

CPU Limit Implications (1/2)

  • The HPA uses the CPU request, but not the limit
  • But the limit sets how far your service can burst (CPU headroom)
    • Lets you tune performance under increasing load, while waiting for HPA scaling
    • If the limit isn't high enough, the service might be throttled

25 of 49

CPU Limit Implications (2/2)

  • The CPU headroom is the space between the absolute HPA target and the CPU limit:
    • headroom = limit - (target * request)
  • The limit is the lever here:
    • If the service needs more headroom, raise the limit, over and above the steady-state or spike maximum usage
  • Suggested starting point: CPU limit = 1.5 * CPU request
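
Plugging hypothetical numbers into the formula above:

    resources:
      requests:
        cpu: 1000m       # request
      limits:
        cpu: 1500m       # limit = 1.5 * request (suggested starting point)
    # with an 80% HPA target:
    #   absolute target = 0.80 * 1000m = 800m
    #   headroom        = 1500m - 800m = 700m of burst per pod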

26 of 49

HPA Settings:

Minimums and Maximums

27 of 49

Approach

  • Some changes in load (especially rapid ones) will show up as changes in the CPU usage on the pods, and others will show up as changes in the number of pods
  • The balance between these is affected by the HPA min and max

28 of 49

Definitions and Caveat

  • A deployment’s HPA minimum means at least this many pods will always be deployed (if possible)
    • Config key is minReplicas
  • A deployment’s HPA maximum means no more than this many pods will be deployed, even if the target conditions are exceeded
    • Config key is maxReplicas
  • Don’t use the replicas field with deployments that have an HPA
    • The deployment will start with that number of pods, and then the HPA will start scaling
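
A sketch of what this looks like on the deployment side (placeholders throughout): the replicas field is simply omitted, and the pod count is owned by the HPA's minReplicas/maxReplicas:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-service                  # placeholder
    spec:
      # note: no "replicas" field; the HPA controls the number of pods
      selector:
        matchLabels:
          app: my-service
      template:
        metadata:
          labels:
            app: my-service
        spec:
          containers:
          - name: my-service            # placeholder; resources as on earlier slides
            image: registry.example.com/my-service:1.0    # placeholder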

29 of 49

HPA Minimum Tuning: Fundamentals (1/3)

  • HPA minimums have the same balance as requests: availability and resilience vs. efficient utilization (= high minimum vs. low minimum)
  • Don’t go too high, or you’ll be wasting resources
  • Don’t go too low:
    • At least 2 or 3 for redundancy, e.g. against node failure
    • But really, as many as needed to handle the minimum actual usage, to be robust against shortages and outages

30 of 49

HPA Minimum Tuning: Fundamentals (2/3)

  • In practice:
    • For small services, you can start small and tweak
    • For migrating larger services to Kubernetes, you’ll need to handle realistic requirements from the beginning
  • Figure out what average QPS you can expect for the service, and what QPS your starting pod size can handle, then:
    • min = total average QPS / pod QPS
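
With hypothetical numbers: if the service averages 2000 QPS and one pod at your starting size handles about 250 QPS:

    # min = total average QPS / pod QPS = 2000 / 250 = 8
    minReplicas: 8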

31 of 49

HPA Minimum Tuning: Fundamentals (3/3)

  • If the service is hitting its HPA minimum for more than about 6 hours per day, lower it, or lower the CPU request and limit
    • Which one is better depends on many factors of your circumstances; changing these values will move your bottleneck around
  • Alternatively, start with min = total minimum QPS / pod QPS and raise it as necessary

32 of 49

HPA Minimum Tuning: Service Implications (1/2)

  • The min and max help tune the tradeoff between more, less-loaded pods and fewer, more-loaded pods (= resiliency vs. efficiency)
    • “Width” vs. “height”
  • Many pods → surges will be spread out
    • CPU will be fairly flat on all pods
    • But the utilization will be low
  • Few pods → surges taken up by pod headroom
    • CPU will vary a fair bit on each pod
    • But utilization will be high

33 of 49

HPA Minimum Tuning: Service Implications (2/2)

  • HPA scaling is automatic and faster than tuning requests and limits
  • So, aim for flat and high CPU, in that order, and let the HPA handle fluctuations
  • The HPA scaling pattern should follow the traffic load as closely as possible, while the CPU should be within +/- 20% of the average
    • Otherwise, lower the CPU request
  • Note: HPA scaling is based on average load, not max
    • Individual pod overloads → an increase (or spikes) in latency

34 of 49

HPA Maximum Tuning (1/3)

  • A deployment’s HPA maximum determines how wide it can get
  • This time, the tradeoff is lopsided:
    • High enough to handle current maximum needs plus some future growth
    • Reasonable, for the safety of the cluster
  • Unlike HPA minimums, HPA maximums don’t translate directly into financial cost, so it's ok to err on the high side

35 of 49

HPA Maximum Tuning (2/3)

  • Start by figuring out the highest number of pods actually created on a regular basis
    • Observe HPA size graphs, and/or:
    • Figure out what maximum QPS you can expect for the service, and what QPS your pod size can handle, then double the result: make max = at least 2 * (total maximum QPS / pod QPS)
  • Doubling allows for cross-region failovers and growth
    • The actual factor to use depends on your region distribution and failover requirements
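
Continuing the hypothetical numbers: if the expected maximum is 4000 QPS and one pod handles about 250 QPS:

    # max = at least 2 * (total maximum QPS / pod QPS) = 2 * (4000 / 250) = 32
    maxReplicas: 32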

36 of 49

HPA Maximum Tuning (3/3)

  • Typically, go above this, especially if you expect the service to grow quickly (i.e. faster than you plan to revisit the settings)
  • If the HPA size is frequently above 50% of the HPA max, raise the HPA max and/or CPU request and limit
  • For very small services, start with HPA maximum = 20

37 of 49

Namespace Quotas

38 of 49

Namespace Quotas

  • One more layer: k8s namespaces can have CPU and memory quotas
    • No decisions this time, though
  • If you’re using them, check them when adding/adjusting deployments to avoid throttling or scheduling issues
  • Quotas are CPU and memory requests and limits
    • They apply to the totals of the requests and limits, respectively, of all of your deployments at their HPA maximums
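
A sketch of a namespace quota; the keys under spec.hard are the standard ResourceQuota compute keys, and the name and values are placeholders to be sized against the sum of your deployments at their HPA maximums:

    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: compute-quota             # placeholder
      namespace: my-namespace         # placeholder
    spec:
      hard:
        requests.cpu: "200"           # total CPU requests allowed in the namespace
        requests.memory: 400Gi        # total memory requests
        limits.cpu: "300"             # total CPU limits
        limits.memory: 600Gi          # total memory limits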

39 of 49

Iteration:

Staying Stable and Efficient

40 of 49

Staying Stable and Efficient (1/3)

  • To start:
    • Set initial values
    • Observe usage and scaling
    • Tune until stable and reasonable
  • Over time:
    • Revisit the settings periodically (and regularly)
    • Account for growth
    • I.e., do capacity planning

41 of 49

Staying Stable and Efficient (2/3)

  • Planning importance
    • Small service → lower stakes w.r.t. efficiency
    • Bigger → more financial impact
    • Faster growth → greater importance of regular capacity planning
  • Planning aspects
    • Stability
    • Efficiency
    • Growth without hitting limits

42 of 49

Staying Stable and Efficient (3/3)

  • The HPA should be taking care of most of your service's size fluctuations
  • Growth → raise the HPA minimum, HPA maximum, and namespace quotas

43 of 49

(Over)simplification / TL;DR

44 of 49

Quick (and Dirty) Start Guide (1/3)

  • Set initial configuration
    • CPU request: 4 cores
    • CPU limit: 6 cores
    • Memory request and limit: 8Gi
    • HPA CPU target: 80%
    • HPA Min: 3
    • HPA Max: 20
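
Put together as a single starting manifest (a sketch: the name, image, and labels are placeholders, and the HPA uses the autoscaling/v2beta1 API as in the earlier sketch):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-service
    spec:
      selector:
        matchLabels:
          app: my-service
      template:
        metadata:
          labels:
            app: my-service
        spec:
          containers:
          - name: my-service
            image: registry.example.com/my-service:1.0
            resources:
              requests:
                cpu: 4
                memory: 8Gi
              limits:
                cpu: 6
                memory: 8Gi
    ---
    apiVersion: autoscaling/v2beta1
    kind: HorizontalPodAutoscaler
    metadata:
      name: my-service
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: my-service
      minReplicas: 3
      maxReplicas: 20
      metrics:
      - type: Resource
        resource:
          name: cpu
          targetAverageUtilization: 80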

45 of 49

Quick (and Dirty) Start Guide (2/3)

  • Deploy your application
    • OOM → increase the memory request and limit
    • Watch the metrics for the HPA (daily rhythm)
  • If the HPA size is frequently near the HPA min, lower the HPA min and/or the CPU request and limit
    • Keep the HPA min at least 3 for better availability
  • If the HPA size is frequently >50% of the HPA max, raise the HPA max and/or the CPU request and limit
    • Use CPU requests that allow efficient bin-packing

46 of 49

Quick (and Dirty) Start Guide (3/3)

  • If the CPU varies beyond +/- 20% of the average, lower the CPU request
  • Iterate until stable

47 of 49

Quick (and Dirty) Start Guide Flowchart

48 of 49

References

49 of 49

https://tinyurl.com/kubernetes-scaling

Danielle Zephyr Malament

Site Reliability Engineer • Automator • Gender Stuff

Pronouns: they/them

zephyr@spotify.com

danielle.malament@gmail.com