Kubernetes Scalability:
A multi-dimensional analysis
Shyam Jeedigunta (@shyamjvs)
Maciej Rozacki (@mrozacki)
Background
FAQs by several devs/teams:
Goal
Address those concerns by:
Understanding Scalability
Scalability Limits
Scalability is not a single number (like 5000)
Yes, we "support" up to 5000 nodes in k8s
But that’s not even close to the whole story!
Let’s see what is...
[Figure: a single axis, # Nodes, marked at 5000]
Scalability Envelope
Scalability is a subspace of configurations
Think of it roughly as a higher-dimensional cube (not really a cube... see next slide)
If you’re within the envelope, you’re safe
By safe, we mean:
[Figure: the dimensions of the envelope - # Nodes, # Namespaces, Pod Churn, # Pods/node, # Services, # Secrets, # Backends/service, # Net LBs, # Ingresses]
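Purely as a mental model (not any official API or tool), the configuration space can be sketched in Go as a struct with one field per dimension from the figure above; "within the envelope" is then just a predicate over it. The names below are invented for illustration, and only a couple of the deck's limits are encoded:

```go
package main

import "fmt"

// ClusterConfig lists the envelope's dimensions from the figure above.
// (Hypothetical type, just to make "a subspace of configurations" concrete.)
type ClusterConfig struct {
	Nodes              int
	Namespaces         int
	PodChurnPerSec     float64
	PodsPerNode        int
	Services           int
	Secrets            int
	BackendsPerService int
	NetLBs             int
	Ingresses          int
}

// withinEnvelope stands in for the full set of constraints discussed in the
// rest of the deck; only an illustrative subset is encoded here.
func withinEnvelope(c ClusterConfig) bool {
	return c.Nodes <= 5000 && c.PodsPerNode <= 110
}

func main() {
	fmt.Println(withinEnvelope(ClusterConfig{Nodes: 5000, PodsPerNode: 30})) // true
}
```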
Properties of the Envelope
1. NOT a cuboid
Because... the dimensions are sometimes NOT independent.
So even if we support X1 = A and we support X2 = B,
it does NOT follow that we support (X1 = A, X2 = B) together.
[Figure: # Nodes vs # Pods/node. E.g. 5000 nodes and 110 pods/node are each supported on their own, but the corner (5000 nodes, 110 pods/node) is outside the envelope - don't even think about it!]
Properties of the Envelope
2. NOT convex
Because... the dimensions are sometimes NOT linearly dependent.
So even if we support configuration A and configuration B,
it does NOT follow that we support configuration (A+B)/2.
[Figure: # Services vs # Backends/service. E.g. 10k services and 250 backends/service are each supported on their own, but their midpoint (5k services, 125 backends/service) is outside the envelope - don't even think about it!]
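To see where convexity breaks, here's a back-of-envelope check (a sketch using the #Backends <= 50k bound that appears later in this deck): both extreme configurations keep the total backend count at 50k, but their midpoint multiplies out to 625k backends.

```go
package main

import "fmt"

// maxBackends is the total-backends bound quoted later in the deck.
const maxBackends = 50_000

func totalBackends(services, backendsPerService int) int {
	return services * backendsPerService
}

func main() {
	a := totalBackends(10_000, 5)    // 50,000 -> inside the envelope
	b := totalBackends(200, 250)     // 50,000 -> inside the envelope
	mid := totalBackends(5_000, 125) // roughly the midpoint of a and b
	fmt.Println(a <= maxBackends, b <= maxBackends, mid <= maxBackends) // true true false
}
```

The feasible region sits under a hyperbola (services x backends/service <= constant), and the region under a hyperbola is not convex.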
Properties of the Envelope
3. Tapers along each axis
As you move farther along one dimension, your cross-section wrt other dimensions gets smaller.
So don’t push too many dimensions at once!
Note that it means even a 5-node cluster can break if you push too much along some dimension(s).
[Figure: the envelope tapering along each of its dimensions - # Nodes, # Namespaces, Pod Churn, # Pods/node, # Services, # Secrets, # Backends/service, # Net LBs, # Ingresses]
Properties of the Envelope
4. Bounded
No axis can be arbitrarily pushed (even if all others are kept at minimum).
We have hard limits - mainly due to etcd size. So…
Total #Objects (built-in API objects + CRDs) ≤ X (~300,000*)
is a bounding box.
*It's a crude limit and assumes an etcd size of 4 GB (it may change in the future)
Source of cube image: https://en.wikipedia.org/wiki/Hypercube Source of cropped hyperbola image: http://inspirehep.net/record/1454384
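As a rough sanity check on that bounding box (back-of-envelope arithmetic on the numbers above only, not how the limit was derived):

```go
package main

import "fmt"

func main() {
	const etcdBytes = 4 << 30  // assumed etcd size from the footnote: 4 GiB
	const maxObjects = 300_000 // the crude total-object bound above
	// Roughly 14 KB of etcd space per object on average.
	fmt.Printf("~%.0f KB per object on average\n", float64(etcdBytes)/maxObjects/1024)
}
```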
Properties of the Envelope
5. Decomposable into smaller envelopes
Precisely computing the envelope boundaries is too hard a problem (O(2^#dimensions)).
Luckily, we can approximately break it into simpler envelopes, thanks to some independence among the dimensions.
Each envelope == some constraint
Let’s look at those...
[Figure: the full envelope expressed as a combination of smaller, lower-dimensional envelopes]
A few notes...
The scalability limits we’re about to discuss are:
In general, use discretion or consult SIG scalability if in doubt.
#Nodes vs #Pods/node
[Figure: # Nodes (x-axis, up to 5k) vs # Pods/node (y-axis, up to 110). The boundary passes roughly through (1300 nodes, 110 pods/node) and (5000 nodes, 30 pods/node). Annotations: "Kubelet starts getting overloaded past this point"; "Apiserver starts getting overloaded past this point".]
#Pods <= 150k  &  #Nodes <= 5k  &  #Pods/node <= 110
We assume the average #containers/pod is not too high (<= 2).
Having too many containers might reduce the limit of 110 because some resources are allocated per container.
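For concreteness, a minimal sketch of this constraint as a check; the limits are the ones quoted above, the helper itself is hypothetical:

```go
package main

import "fmt"

// nodesPodsOK encodes: #Pods <= 150k, #Nodes <= 5k, #Pods/node <= 110.
func nodesPodsOK(nodes, podsPerNode int) bool {
	return nodes <= 5_000 &&
		podsPerNode <= 110 &&
		nodes*podsPerNode <= 150_000
}

func main() {
	fmt.Println(nodesPodsOK(5_000, 30))  // true: 150k pods, right on the curve
	fmt.Println(nodesPodsOK(1_300, 110)) // true: ~143k pods
	fmt.Println(nodesPodsOK(5_000, 110)) // false: 550k pods - outside the envelope
}
```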
#Services vs #Backends/service
[Figure: # Services (ClusterIP) (x-axis, up to 10k) vs # Backends/service (y-axis, up to 250). The boundary passes roughly through (10k services, 5 backends/service) and (200 services, 250 backends/service). Annotations: "Endpoints traffic becomes larger after this (due to being quadratic in #backends)"; "Performance of iptables degrades with too many services in the KUBE-SVC chain after this".]
#Backends <= 50k  &  #Services <= 10k  &  #Backends/service <= 250
Note: you can have more backends if the majority of them belong to small services; e.g., we tested with a mix of services totaling 75k backends.
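On the "quadratic in #backends" annotation: one way to picture it (a hedged sketch, with the per-entry size assumed for illustration) is that the Endpoints object grows linearly with the number of backends and is rewritten in full on every backend change, so a rolling update of a large service writes on the order of N^2 bytes - before counting the fan-out of each update to every watching kube-proxy.

```go
package main

import "fmt"

// bytesPerEndpointEntry is an assumed figure for illustration, not a measurement.
const bytesPerEndpointEntry = 100

// rollingUpdateBytes estimates bytes written for one service's Endpoints object
// during a full rolling update: each backend change rewrites the whole object
// (~backends entries), and the update touches every backend once.
func rollingUpdateBytes(backends int) int {
	objectSize := backends * bytesPerEndpointEntry
	return backends * objectSize
}

func main() {
	for _, n := range []int{5, 250, 1000} {
		fmt.Printf("%5d backends -> ~%d KB written per rolling update\n",
			n, rollingUpdateBytes(n)/1000)
	}
}
```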
#Services/namespace
#Services <= 10k  &  #Services/namespace <= 5k
[Figure: # Namespaces (x-axis) vs # Services/namespace (y-axis, capped at 5k). The curve represents the limit on the total #Services we can have (e.g. 2 namespaces x 5k services/namespace = 10k). Past 5k services/namespace, the size of the service-linked env vars gets too big for the namespace, causing pod crashes.]
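The env-var failure mode comes from the docker-link-style variables (FOO_SERVICE_HOST, FOO_SERVICE_PORT, FOO_PORT_80_TCP_ADDR, and so on) injected for every Service in the pod's namespace. A rough estimate of how that block grows; the per-variable sizes below are assumptions for illustration:

```go
package main

import "fmt"

const (
	varsPerService = 7  // e.g. *_SERVICE_HOST, *_SERVICE_PORT, *_PORT_80_TCP_* ...
	bytesPerVar    = 50 // assumption: rough average size of one "NAME=value" pair
)

// envBytes estimates the size of the service-linked env block injected into
// every container in the namespace.
func envBytes(servicesInNamespace int) int {
	return servicesInNamespace * varsPerService * bytesPerVar
}

func main() {
	for _, n := range []int{100, 5_000, 10_000} {
		fmt.Printf("%6d services/namespace -> ~%d KB of env vars per container\n",
			n, envBytes(n)/1000)
	}
}
```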
Pod Churn
Pod churn <= 20/s
"Pod churn = (# Pod creates|updates|deletes) per second"
Some caveats:
- You can go above 20/s only if you're changing pods manually (outside of controllers), as the controller-manager has a default QPS limit of 20
- For deletions through GC, currently only a throughput of 10/s can be achieved, as each delete uses 2 API calls
- If pods belong to huge services, higher churn can affect control plane due to endpoints traffic
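To make the budget concrete, a quick back-of-envelope on the numbers above (the 10k-pod workload is just an example):

```go
package main

import (
	"fmt"
	"time"
)

// turnoverTime: how long it takes to churn through a given number of pods at a
// given sustained rate.
func turnoverTime(pods int, opsPerSec float64) time.Duration {
	return time.Duration(float64(pods)/opsPerSec) * time.Second
}

func main() {
	fmt.Println("churn 10k pods at 20/s:    ", turnoverTime(10_000, 20)) // 8m20s
	fmt.Println("GC-delete 10k pods at 10/s:", turnoverTime(10_000, 10)) // 16m40s
}
```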
#Nodes vs #Configs/node
We got rid of this limitation in k8s 1.12 after moving kubelets to watch secrets.
A few ways to mitigate it for versions < 1.12:
[Figure: # Nodes (x-axis, up to 5k, the limit for #nodes) vs # Configs/node (y-axis, capped at 200 due to the kubelet QPS limit). The boundary curve passes roughly through (5000 nodes, 30 configs/node); kubelets make too many "GET secrets/configmaps" calls beyond it.]
"#Configs/node = Avg (# unique secrets + # unique configmaps) needed per node"
Σnodes #Configs <= 150k  &  #Nodes <= 5k
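For intuition about the pre-1.12 behaviour (before kubelets watched secrets), a hedged sketch of the GET load; the 60-second refresh interval is an assumption for illustration, not the kubelet's exact refresh logic:

```go
package main

import "fmt"

// aggregateGetQPS: if every kubelet re-GETs each of its secrets/configmaps
// roughly once per refresh interval, this is the resulting apiserver load.
func aggregateGetQPS(nodes, configsPerNode int, refreshSeconds float64) float64 {
	return float64(nodes*configsPerNode) / refreshSeconds
}

func main() {
	// Right on the curve above: 5000 nodes * 30 configs/node = 150k configs.
	fmt.Printf("~%.0f GETs/s hitting the apiserver\n", aggregateGetQPS(5_000, 30, 60))
	// Per-node view at the 200 configs/node cap (which the deck attributes to
	// the kubelet QPS limit): ~3.3 GETs/s issued by a single kubelet.
	fmt.Printf("~%.1f GETs/s per kubelet\n", float64(200)/60)
}
```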
#Namespaces vs #Pods/namespace
[Figure: # Namespaces (x-axis, up to 10k) vs # Pods/namespace (y-axis, up to 3k). The boundary passes roughly through (10k namespaces, 15 pods/namespace) and (50 namespaces, 3k pods/namespace). Controllers may start seeing a performance drop as #pods per namespace increases; a large number of namespaces with few pods per namespace is fine.]
#Pods <= 150k  &  #Namespaces <= 10k  &  #Pods/namespace <= 3k
We got rid of the limitation on the x-axis in k8s 1.12 after moving kubelets to watch secrets.
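Tying property 5 ("decomposable into smaller envelopes") back to these slides: the per-slide constraints can be conjoined into a single check. This is a hypothetical sketch that only encodes the numeric limits quoted in this deck; it does not capture the caveats (containers per pod, config fan-out, GC throughput, and so on):

```go
package main

import "fmt"

// ClusterShape collects the dimensions used by the per-slide constraints above.
// The type and helper are invented for illustration.
type ClusterShape struct {
	Nodes, PodsPerNode       int
	Namespaces, PodsPerNS    int
	Services, BackendsPerSvc int
	ServicesPerNS            int
	PodChurnPerSec           float64
}

// withinEnvelope conjoins the decomposed envelopes from the previous slides.
func withinEnvelope(c ClusterShape) bool {
	return c.Nodes <= 5_000 && c.PodsPerNode <= 110 && c.Nodes*c.PodsPerNode <= 150_000 &&
		c.Namespaces <= 10_000 && c.PodsPerNS <= 3_000 &&
		c.Services <= 10_000 && c.BackendsPerSvc <= 250 && c.Services*c.BackendsPerSvc <= 50_000 &&
		c.ServicesPerNS <= 5_000 &&
		c.PodChurnPerSec <= 20
}

func main() {
	fmt.Println(withinEnvelope(ClusterShape{
		Nodes: 2_000, PodsPerNode: 50,
		Namespaces: 200, PodsPerNS: 500,
		Services: 1_000, BackendsPerSvc: 20, ServicesPerNS: 100,
		PodChurnPerSec: 10,
	})) // true
}
```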
Scalability: Next Steps
Knowing our bounds better
SIG scalability:
So…
If you’re a k8s developer:
If you’re a k8s user:
Where to find us?
SIG Scalability is happy to receive any feedback/questions through:
Tweet #SIGScalability or #K8sScalability with questions/feedback!
Thank you!