Requests, Limits, and Autoscalers
How they (sometimes don't) work together
Anatomy of a Worker Node
DOKS worker nodes have some reserved memory and CPU for management processes like kubelet, kube-proxy, docker, cilium, cilium-operator, coredns, do-node-agent, kubelet-rubber-stamp, and the Operating System itself.
But what about the unreserved resources? We allocate those to pods based on the pod's request and limit values. These values tell Kubernetes how many resources a pod needs at minimum (request) and may use at maximum (limit).
[Diagram: a worker node's resources split into Reserved CPU/RAM, used by the system processes listed above, and Available CPU/RAM, where your pods run.]
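As a minimal sketch (names and values here are illustrative, not from the talk), a pod declares these values per container under resources:

apiVersion: v1
kind: Pod
metadata:
  name: your-pod            # hypothetical name
spec:
  containers:
  - name: app
    image: nginx:1.25       # placeholder image
    resources:
      requests:             # what the scheduler reserves for this pod
        cpu: 250m
        memory: 512Mi
      limits:               # the most the pod is allowed to use
        cpu: 500m
        memory: 512Mi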
Anatomy of a Worker Node
This is how the Kubernetes Scheduler sees a 4vCPU worker node (with regard to CPU).
But each of those CPUs is actually broken up into millicores at a rate of 1,000 millicores per 1vCPU (I just didn't want to make 4,000 squares).
The Kubernetes Scheduler only looks at the "requests" value in the app spec when planning pod placement, so here it sees 4,000 millicores available (minus system process reservations).
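You can inspect this allocatable pool yourself with kubectl; the output below is a rough sketch with illustrative numbers, not real DOKS values:

kubectl describe node <node-name>
...
Capacity:
  cpu:     4
  memory:  8152996Ki
Allocatable:
  cpu:     3900m
  memory:  6657892Ki
...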
What's in a Core?
In short, 1,000 millicores! But how does Kubernetes actually deal with a millicore?
Time multiplexing.
What's in a Core?
The "Completely Fair Scheduler" converts those millicores into milliseconds at a rate of 1ms per 10mc.
Then it doles out CPU usage in 100ms cycles.
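A few worked conversions at that rate, assuming the default 100ms period:

250m → 25ms of CPU time per 100ms cycle
500m → 50ms per cycle
1000m → the full 100ms of one core per cycle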
What's in a Core?
The CFS and requests combine to make the CPU schedule look something like this:
The CPU is broken into 100ms cycles, and in each cycle every pod is given a CPU allocation equal to its request.
Pods are allowed to use extra millicores up to their limit setting as long as there are extra cycles available.
Setting the Right Requests & Limits: CPU
Requests
Request size (in millicores) >= max execution time (in ms) * 10
That means if the longest-running process on the pod typically executes instructions on the CPU for 50ms, you should set requests to AT LEAST 500m.
Limits
Because CPU is a compressible resource, you can safely set CPU limits higher than your requests. If your worker node gets overloaded on CPU, Kubernetes throttles pods back down toward their request amounts, starting with the pods exceeding their requests by the most.
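A sketch of what that looks like in a container spec (numbers are illustrative):

resources:
  requests:
    cpu: 500m    # guaranteed 50ms of every 100ms cycle
  limits:
    cpu: "1"     # may burst up to a full core when spare cycles exist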
Why should I care about execution time?
[Diagram: a 500m request guarantees 50ms of each 100ms cycle, so a 40ms execution finishes within a single cycle.]
Why should I care about execution time?
[Diagram: a 250m request guarantees only 25ms of each 100ms cycle, so the same 40ms execution can no longer finish within one cycle's allocation.]
Why should I care about execution time?
[Diagram: the 40ms execution is split across two cycles' 25ms allocations, stretching to roughly 115ms of wall-clock time before it completes.]
Setting the Right Requests & Limits: RAM
RAM differs significantly from CPU in that it's an incompressible resource. That means we can't just throttle your RAM usage; RAM is state!
If a pod requests 1GB but has a limit of 2GB, the scheduler plans around the 1GB request while the pod can actually consume up to 2GB, so we can end up with Kubernetes trying to schedule new 1GB pods on a worker node that has no RAM actually free.
For that reason, there's one big rule for setting RAM requests & limits:
Limits = Requests
That ensures Kubernetes never allocates more RAM to a pod than has been planned for, reducing the likelihood of the OOM killer being triggered.
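In a container spec, that rule looks like this (size is illustrative):

resources:
  requests:
    memory: 512Mi
  limits:
    memory: 512Mi   # equal to the request, so usage never exceeds the plan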
The Three Musketeers
The Horizontal Pod Autoscaler: This service watches pod metrics to determine when more pods are needed in a deployment (see the example manifest below).
The Kubernetes Scheduler: This tries to schedule pods based on worker node utilization and requests.
The Cluster Autoscaler: If pods cannot be scheduled, this service creates additional worker nodes to allow for more scaling.
Three services that try to balance and run your cluster
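As a sketch of the HPA mentioned above (all names and numbers are illustrative), targeting 70% average CPU utilization of the pods' requests:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: your-app               # hypothetical deployment name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: your-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization      # percentage of the pods' CPU requests
        averageUtilization: 70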
The Basic Idea
If the Kubernetes Scheduler sees a worker node has "free capacity" (unrequested resources), it will attempt to schedule pods there.
If that free capacity doesn't exist (because the pods are using more than they requested), Kubernetes will temporarily throttle CPU down or trigger the OOM killer on a Droplet to destroy non-system pods.
The HPA only triggers when the pods in a deployment exceed the HPA's target metric on average.
The Cluster Autoscaler only triggers when Kubernetes says it can't schedule pods because there are insufficient unclaimed resources.
Why did my cluster fail to scale?
When a pod isn't sending metrics or isn't yet ready, the HPA doesn't ignore it. Instead, the HPA calculates that pod as using 0% of its available capacity.
Because the HPA averages the usage of every pod in a deployment to determine if it should scale, each "not ready" or non-communicative pod actually reduces the calculated load across all pods.
The more pods get stuck in one of those states, the lower the average pod resource consumption Kubernetes calculates for the deployment. This means it doesn't try to schedule more pods.
Because it doesn't schedule more pods, Kubernetes never thinks it's running out of worker node resources, so the Cluster Autoscaler is never called.
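A worked example of that averaging, assuming a 70% CPU target and the scaling formula from the Kubernetes docs (desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric)):

4 pods: three running hot at 90% of their CPU request, one stuck and counted as 0%
average = (90 + 90 + 90 + 0) / 4 = 67.5%
desiredReplicas = ceil(4 × 67.5 / 70) = ceil(3.86) = 4

No scale-up happens even though every healthy pod is over the target, so the Cluster Autoscaler never gets a reason to add nodes either.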
Why Pods Get Stuck
Recommended Resources
Henning Jacobs' Requests & Limits Crash Course (Very good)
CPU limits and aggressive throttling (The CFS bug he mentions has been patched)