Requests, Limits, and Autoscalers
How they (sometimes don't) work together
Anatomy of a Worker Node
DOKS worker nodes have some reserved memory and CPU for management processes like kubelet, kube-proxy, docker, cilium, cilium-operator, coredns, do-node-agent, kubelet-rubber-stamp, and the Operating System itself.
But what about the unreserved resources? We allocate those to pods based on the pod's request and limit values. These values tell Kubernetes how many resources a pod needs at minimum (request) and may use at maximum (limit).
[Diagram: a worker node's resources split into Reserved CPU/RAM, used by the system processes listed above, and Available CPU/RAM, where your pods run.]
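As a minimal sketch (names and values here are illustrative, not from the talk), a pod declares these values per container under resources:

apiVersion: v1
kind: Pod
metadata:
  name: your-pod            # hypothetical name
spec:
  containers:
  - name: app
    image: nginx:1.25       # placeholder image
    resources:
      requests:             # what the scheduler reserves for this pod
        cpu: 250m
        memory: 512Mi
      limits:               # the most the pod is allowed to use
        cpu: 500m
        memory: 512Mi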
Anatomy of a Worker Node
This is how the Kubernetes Scheduler sees a 4vCPU worker node (with regard to CPU).
But each of those CPUs is actually broken up into millicores at a rate of 1,000 millicores per 1vCPU (I just didn't want to make 4,000 squares).
The Kubernetes Scheduler only looks at the "requests" value in the app spec when planning pod placement, so here it sees 4,000 millicores available (minus system process reservations).
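You can inspect this allocatable pool yourself with kubectl; the output below is a rough sketch with illustrative numbers, not real DOKS values:

kubectl describe node <node-name>
...
Capacity:
  cpu:     4
  memory:  8152996Ki
Allocatable:
  cpu:     3900m
  memory:  6657892Ki
...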
What's in a Core?
In short, 1,000 millicores! But how does Kubernetes actually deal with a millicore?
Time multiplexing.
What's in a Core?
The "Completely Fair Scheduler" converts those millicores into milliseconds at a rate of 1ms per 10mc.
Then it doles out CPU usage in 100ms cycles.
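A few worked conversions at that rate, assuming the default 100ms period:

250m → 25ms of CPU time per 100ms cycle
500m → 50ms per cycle
1000m → the full 100ms of one core per cycle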
What's in a Core?
The CFS and requests combine to make the CPU schedule look something like this:
The CPU is broken into 100ms cycles, and in each cycle every pod is given a CPU allocation equal to its request.
Pods are allowed to use extra millicores up to their limit setting as long as there are extra cycles available.
Setting the Right Requests & Limits: CPU
Requests
Request size (in millicores) >= max execution time (in ms) * 10
That means if the longest-running process on the pod typically executes instructions on the CPU for 50ms, you should set requests to AT LEAST 500m.
Limits
Because CPU is a compressible resource, you can safely set CPU limits higher than your requests. If your worker node gets overloaded on CPU, Kubernetes throttles pods back down toward their request amounts, starting with the pods exceeding their requests by the most.
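A sketch of what that looks like in a container spec (numbers are illustrative):

resources:
  requests:
    cpu: 500m    # guaranteed 50ms of every 100ms cycle
  limits:
    cpu: "1"     # may burst up to a full core when spare cycles exist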
Why should I care about execution time?
[Diagram: a 500m request guarantees 50ms of each 100ms cycle, so a 40ms execution finishes within a single cycle.]
Why should I care about execution time?
[Diagram: a 250m request guarantees only 25ms of each 100ms cycle, so the same 40ms execution can no longer finish within one cycle's allocation.]
Why should I care about execution time?
[Diagram: the 40ms execution is split across two cycles' 25ms allocations, stretching to roughly 115ms of wall-clock time before it completes.]
Setting the Right Requests & Limits: RAM
RAM differs significantly from CPU in that it's an incompressible resource. That means we can't just throttle your RAM usage; RAM is state!
If a pod requests 1GB but has a limit of 2GB, the scheduler plans around the 1GB request while the pod can actually consume up to 2GB, so we can end up with Kubernetes trying to schedule new 1GB pods on a worker node that has no RAM actually free.
For that reason, there's one big rule for setting RAM requests & limits:
Limits = Requests
That ensures Kubernetes never allocates more RAM to a pod than has been planned for, reducing the likelihood of the OOM killer being triggered.
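In a container spec, that rule looks like this (size is illustrative):

resources:
  requests:
    memory: 512Mi
  limits:
    memory: 512Mi   # equal to the request, so usage never exceeds the plan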
The Three Musketeers
The Horizontal Pod Autoscaler: This service watches pod metrics to determine when more pods are needed in a deployment (see the example manifest below).
The Kubernetes Scheduler: This tries to schedule pods based on worker node utilization and requests.
The Cluster Autoscaler: If pods cannot be scheduled, this service creates additional worker nodes to allow for more scaling.
Three services that try to balance and run your cluster
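As a sketch of the HPA mentioned above (all names and numbers are illustrative), targeting 70% average CPU utilization of the pods' requests:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: your-app               # hypothetical deployment name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: your-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization      # percentage of the pods' CPU requests
        averageUtilization: 70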
The Basic Idea
If the Kubernetes Scheduler sees a worker node has "free capacity" (unrequested resources), it will attempt to schedule pods there.
If that free capacity doesn't exist (because the pods are using more than they requested), Kubernetes will temporarily throttle CPU down or trigger the OOM killer on a Droplet to destroy non-system pods.
The HPA only triggers when the pods in a deployment exceed the HPA's target metric on average.
The Cluster Autoscaler only triggers when Kubernetes says it can't schedule pods because there are insufficient unclaimed resources.
Why did my cluster fail to scale?
When a pod isn't sending metrics or isn't yet ready, the HPA doesn't ignore it. Instead, the HPA calculates that pod as using 0% of its available capacity.
Because the HPA averages the usage of every pod in a deployment to determine if it should scale, each "not ready" or non-communicative pod actually reduces the calculated load across all pods.
The more pods get stuck in one of those states, the lower the average pod resource consumption Kubernetes calculates for the deployment. This means it doesn't try to schedule more pods.
Because it doesn't schedule more pods, Kubernetes never thinks it's running out of worker node resources, so the Cluster Autoscaler is never called.
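A worked example of that averaging, assuming a 70% CPU target and the scaling formula from the Kubernetes docs (desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric)):

4 pods: three running hot at 90% of their CPU request, one stuck and counted as 0%
average = (90 + 90 + 90 + 0) / 4 = 67.5%
desiredReplicas = ceil(4 × 67.5 / 70) = ceil(3.86) = 4

No scale-up happens even though every healthy pod is over the target, so the Cluster Autoscaler never gets a reason to add nodes either.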
Why Pods Get Stuck
Recommended Resources
Henning Jacobs' Requests & Limits Crash Course (Very good)
CPU limits and aggressive throttling (The CFS bug he mentions has been patched)