1 of 34

K8s autoscaling

2 of 34

Why we need to scale

3 of 34

4 of 34

What kinds of scaling we can do in K8s

  • Vertical Scale
    • Increases and decreases pod CPU and memory
  • Horizontal Scale
    • Adds and removes pods
    • Adds and removes cluster nodes

5 of 34

Resources for Scaling

Resource Type: CPU and memory are each a resource type. A resource type has a base unit.

CPU resource units

Limits and requests for CPU resources are measured in CPU units. 1 CPU unit is equivalent to 1 physical or virtual core and equals 1000 millicpu (m), so 500m is half a CPU.

Memory resource units

Limits and requests for memory are measured in bytes. You can express memory as a plain integer or as a fixed-point number using one of these quantity suffixes: E, P, T, G, M, k, or their power-of-two equivalents: Ei, Pi, Ti, Gi, Mi, Ki. For example, 129M and 123Mi denote roughly the same amount.

6 of 34

Resource Example
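
A minimal sketch of such a resources section in a pod spec (the pod name, image, and values are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: resource-demo            # hypothetical name
spec:
  containers:
  - name: app
    image: nginx                 # illustrative image
    resources:
      requests:
        cpu: 250m                # 0.25 CPU unit
        memory: 64Mi             # 64 mebibytes
      limits:
        cpu: "1"                 # 1 CPU unit = 1000m
        memory: 128Mi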

7 of 34

How to Manually Scale

8 of 34

Ways to scale pods

  • ReplicationController: Ensures that a specified number of pod replicas are running at any one time.
  • ReplicaSet: The next-generation ReplicationController that supports the new set-based label selector. It's mainly used by Deployment as a mechanism to orchestrate pod creation, deletion, and updates.
  • Deployment: A higher-level API object that updates its underlying ReplicaSets and their Pods.

Example: kubectl scale deploy/application-cpu --replicas 2
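
The same scale-out can also be expressed declaratively by setting spec.replicas in the Deployment manifest and re-applying it. A fragment, assuming the application-cpu Deployment from the example:

# deployment.yaml (fragment)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: application-cpu
spec:
  replicas: 2        # desired pod count

kubectl apply -f deployment.yaml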

9 of 34

Add new Kubernetes Worker Node Manually

Prerequisites to bring up a worker node

  • Install OS
  • Network Config
  • Disable Swap Memory
  • Install Container Runtime
  • Install Kubernetes Components: kubeadm, kubelet, kubectl (see the sketch below)
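
A minimal host-setup sketch for the last two steps on a Debian/Ubuntu node, assuming the Kubernetes apt repository is already configured:

sudo swapoff -a                                  # disable swap immediately
sudo sed -i '/ swap / s/^/#/' /etc/fstab         # keep swap off across reboots
sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl  # Kubernetes components
sudo apt-mark hold kubelet kubeadm kubectl       # pin the versions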

10 of 34

Add new Kubernetes Worker Node Manually

  • Step 1: Get a join token:

kubeadm token create --print-join-command

  • Step 2: Get the discovery token CA cert hash:

openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //'

  • Step 3: Get the API server advertise address:

kubectl cluster-info

  • Step 4: Join the new worker node to the cluster:

kubeadm join <control-plane-host>:<control-plane-port> --token <token> --discovery-token-ca-cert-hash sha256:<hash>
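
Putting the pieces together with illustrative values (the address, token, and hash below are hypothetical):

kubeadm join 10.0.0.10:6443 \
    --token abcdef.0123456789abcdef \
    --discovery-token-ca-cert-hash sha256:e7f8a1...   # full 64-hex-digit hash from Step 2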

11 of 34

Why we need autoscaling

12 of 34

Revisit Scale Example

Without an autoscaling solution in place, the traditional approach to mitigating such scalability failures involves:

  1. an alert (on degradation/failures)
  2. intervention by a human operator
  3. root cause analysis
  4. scaling out the number of replicas

13 of 34

Problems with the Human Touch

  • There is likely to be a delay between the alert and the intervention.

  • Scalability failure is not the only risk to system reliability, so the operator must first identify and understand the cause of the failure. Is it a bug? A network issue? A DB issue? This further delays any action.

  • It is unlikely that human operators are watching a single service they know everything about; more likely they are monitoring everything. So when a service suffers a scalability failure, the operator first needs to gather information about it, such as its peak capacity and current load, before calculating the number of replicas required to handle that load.

14 of 34

How can autoscaling help?

  • With an autoscaler in place, the peak capacity of a service tends to be codified in its scaling configuration rather than merely documented.
  • Unlike humans, the autoscaler does not need a coffee break.
  • The autoscaler concerns itself with one task and one task only: responding to scalability triggers.

15 of 34

What kinds of autoscaling we can do in K8s

  • Vertical Pod Autoscaler (VPA): Increases and decreases pod CPU and memory
  • Horizontal Pod Autoscaler (HPA): Adds and removes pods
  • Cluster Autoscaler (CA): Adds and removes cluster nodes

16 of 34

Pod Level Scale

17 of 34

VPA

18 of 34

VPA

kubectl describe vpa application-cpu

vpa.yaml and deployment.yaml (the manifests shown on the slide; a sketch of vpa.yaml follows)
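
A minimal sketch of what vpa.yaml might look like, assuming the VPA v1 API and the application-cpu Deployment used throughout:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: application-cpu
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: application-cpu
  updatePolicy:
    updateMode: "Auto"    # VPA may evict pods to apply new requests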

19 of 34

VPA

20 of 34

VPA Limitations

  • VPA doesn't consider network and I/O.
  • VPA is not yet ready for JVM-based workloads, owing to limited visibility into the actual memory usage of the workload.
  • VPA is not aware of Kubernetes cluster infrastructure variables such as node size in terms of memory and CPU. For example, if VPA recommends 18 GB of memory but each node has only 16 GB, the pod stays Pending forever.
  • VPA won't work with HPA on the same CPU and memory metrics, because the two would race against each other.
  • VPA performance has not been tested in large clusters.

21 of 34

HPA

22 of 34

HPA

TargetNumOfPods = ceil(sum(CurrentPodsCPUUtilization) / Target)
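
A quick worked example, assuming a target of 50% CPU utilization and three pods currently running at 90%, 90%, and 60%:

sum(CurrentPodsCPUUtilization) = 90 + 90 + 60 = 240
TargetNumOfPods = ceil(240 / 50) = ceil(4.8) = 5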

23 of 34

24 of 34

HPA Example
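
A minimal sketch of an HPA manifest targeting the application-cpu Deployment (the min/max replicas and the 50% CPU target are assumptions):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: application-cpu
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: application-cpu
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50   # keep average CPU at 50% of requests

The imperative equivalent is kubectl autoscale deploy/application-cpu --cpu-percent=50 --min=1 --max=10.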

25 of 34

HPA Limitations

  • HPA only works for stateless applications that support running multiple instances in parallel.
  • HPA (and VPA) don't consider IOPS, network, or storage in their calculations.

26 of 34

Metrics Server

What’s it for

Metrics Server collects resource metrics from kubelets and exposes them in the Kubernetes API server through the Metrics API, for use by the Horizontal Pod Autoscaler and the Vertical Pod Autoscaler.

What isn’t it for

  • Non-Kubernetes clusters
  • An accurate source of resource usage metrics
  • Horizontal autoscaling based on resources other than CPU/memory

27 of 34

Metrics Server

  • cAdvisor: Daemon for collecting, aggregating and exposing container metrics included in Kubelet.
  • kubelet: Node agent for managing container resources. Resource metrics are accessible using the /metrics/resource and /stats kubelet API endpoints.
  • Summary API: API provided by the kubelet for discovering and retrieving per-node summarized stats available through the /stats endpoint.
  • metrics-server: Cluster addon component that collects and aggregates resource metrics pulled from each kubelet, and serves the Metrics API for use by HPA, VPA, and the kubectl top command (see the quick check below). Metrics Server is a reference implementation of the Metrics API.
  • Metrics API: Kubernetes API supporting access to CPU and memory used for workload autoscaling. To make this work in your cluster, you need an API extension server that provides the Metrics API.
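
With metrics-server installed, the pipeline can be sanity-checked from the command line:

kubectl top nodes                                      # per-node CPU/memory from the Metrics API
kubectl top pods                                       # per-pod usage; the same data HPA consumes
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes   # query the Metrics API directly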

28 of 34

Node Level Scale

29 of 34

Cluster Autoscaler

30 of 34

Cluster Autoscaler Behavior

c444e24e3915@cloudshell:~ (qwiklabs-gcp-03-a94f05d7b8a0)$ kubectl get nodes
NAME                                          STATUS   ROLES    AGE   VERSION
gke-scaling-demo-default-pool-b182e404-5l2v   Ready    <none>   6m    v1.20.8-gke.900

c444e24e3915@cloudshell:~ (qwiklabs-gcp-03-a94f05d7b8a0)$ kubectl scale deploy/application-cpu --replicas 2
deployment.apps/application-cpu scaled

c444e24e3915@cloudshell:~ (qwiklabs-gcp-03-a94f05d7b8a0)$ kubectl get pods
NAME                               READY   STATUS    RESTARTS   AGE
application-cpu-7879778795-8t8bn   1/1     Running   0          2m29s
application-cpu-7879778795-rzxc7   0/1     Pending   0          5s

The second replica does not fit on the single node, so it stays Pending; this pending pod is the signal the Cluster Autoscaler reacts to.

31 of 34

Cluster Autoscaler Behavior

Scale Up: pod event and node event (screenshots)

32 of 34

Cluster Autoscaler Behavior

Scale Down: node events (screenshots)

  • ScaleDown: node removed by cluster autoscaler
  • NodeNotReady
  • Deleting node
  • RemovingNode: Removing Node from Controller

33 of 34

Cluster Autoscaler Limitations

  • Cluster Autoscaler is not supported in on-premises environments until an autoscaler is implemented for on-premises deployments.
  • Only supports clusters of fewer than 1,000 nodes.
  • Scaling up is not immediate: a pod will stay in the Pending state for a few minutes until a new worker node is added.
  • Cluster Autoscaler works from resource requests, not actual usage; a pod that requests 1 CPU but uses only 100m still counts as a full CPU in its calculations.

34 of 34

References