1 of 18

Kubernetes Scheduling

CS-548, April 2022

Yannis Sfakianakis

2 of 18

Purpose of scheduling in Kubernetes

  • Assigns Pods to Nodes

  • Preemption: Terminating Pods with lower priority to schedule Pods with higher priority

  • Eviction: Terminating Pods on Nodes, e.g., to reclaim resources

3 of 18

How does it work?

  • The scheduler watches for newly created Pods that have no Node assignment
  • It schedules each Pod in two steps: Filtering and Scoring
  • Filtering finds the set of Nodes where it is feasible to schedule the Pod, e.g., Nodes with enough available resources to meet the Pod's requests
    • If there is no feasible Node, the Pod remains unscheduled
  • Scoring ranks the feasible Nodes to choose the best placement
    • The scores are affected by affinity and anti-affinity specs, data locality, etc.
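
The two steps above can be sketched in a few lines of Python. This is a toy model, not the kube-scheduler's code: the `Node` and `Pod` classes, the CPU/memory-only filter, and the "most free CPU after placement" scoring rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpu: float  # unallocated CPU cores (hypothetical field)
    free_mem: int    # unallocated memory in MiB (hypothetical field)


@dataclass
class Pod:
    name: str
    cpu: float  # requested CPU cores
    mem: int    # requested memory in MiB


def filter_nodes(pod, nodes):
    """Filtering: keep only the nodes that can fit the pod's requests."""
    return [n for n in nodes if n.free_cpu >= pod.cpu and n.free_mem >= pod.mem]


def score_node(pod, node):
    """Scoring: toy rule that prefers the node with the most CPU left over."""
    return node.free_cpu - pod.cpu


def schedule(pod, nodes):
    """Return the chosen node name, or None if the pod stays unscheduled."""
    feasible = filter_nodes(pod, nodes)
    if not feasible:
        return None
    return max(feasible, key=lambda n: score_node(pod, n)).name


nodes = [Node("a", 1, 1024), Node("b", 4, 2048), Node("c", 8, 512)]
print(schedule(Pod("web", 2, 1024), nodes))   # only "b" passes the filter
print(schedule(Pod("huge", 16, 1024), nodes)) # no feasible node
```

The real scheduler runs many filter and score plugins and combines their weighted results; the sketch only shows the filter-then-rank shape of the decision.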

4 of 18

Node labels – Node selection (Filtering)

  • We can manually restrict the set of feasible Nodes for a Pod using Node labels and nodeSelector.
  • We label a set of Nodes and add a matching nodeSelector constraint to the Pod spec

kubectl label nodes master on-master=true
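
As a rough illustration of what the nodeSelector constraint means during filtering, the sketch below checks that every key/value pair in the selector is present in the Node's labels. The function name and the `worker-1` Node are hypothetical; only the `on-master=true` label comes from the command above.

```python
def matches_node_selector(node_labels, node_selector):
    """A node is feasible only if it carries every label in the selector."""
    return all(node_labels.get(k) == v for k, v in node_selector.items())


# Label values are strings, so "true" is the string the kubectl command sets.
node_labels = {
    "master": {"on-master": "true"},
    "worker-1": {},  # hypothetical unlabeled node
}
selector = {"on-master": "true"}

feasible = [name for name, labels in node_labels.items()
            if matches_node_selector(labels, selector)]
print(feasible)
```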

5 of 18

6 of 18

Node Affinity (Filtering – Scoring)

  • An extended version of node selection
  • requiredDuringSchedulingIgnoredDuringExecution
    • The Pod is not scheduled unless this rule is met
  • preferredDuringSchedulingIgnoredDuringExecution
    • The scheduler prefers Nodes that match, but the Pod can still be scheduled if no Node matches

7 of 18

apiVersion: v1
kind: Pod
metadata:
  name: with-affinity-anti-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/os
            operator: In
            values:
            - linux
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50
        preference:
          matchExpressions:
          - key: label-1
            operator: In
            values:
            - key-1
  containers:
  - name: with-node-affinity
    image: k8s.gcr.io/pause:2.0
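
To make the difference between the required and the preferred rule concrete, here is a minimal Python sketch that evaluates the same rules against a Node's labels. The helper names are hypothetical and only the `In`, `NotIn`, and `Exists` operators are covered; the required rule acts as a filter, while the preferred rule only adds its weight to the Node's score.

```python
def expr_matches(labels, expr):
    """Evaluate one matchExpressions entry (subset of operators only)."""
    key, op = expr["key"], expr["operator"]
    if op == "In":
        return labels.get(key) in expr["values"]
    if op == "NotIn":
        return labels.get(key) not in expr["values"]
    if op == "Exists":
        return key in labels
    raise ValueError(f"operator not covered by this sketch: {op}")


def term_matches(labels, term):
    """All expressions inside one term must match (AND)."""
    return all(expr_matches(labels, e) for e in term["matchExpressions"])


def is_feasible(labels, required_terms):
    """nodeSelectorTerms are ORed: one matching term is enough."""
    return any(term_matches(labels, t) for t in required_terms)


def preference_score(labels, preferred):
    """Sum the weights of the preferred terms that the node satisfies."""
    return sum(p["weight"] for p in preferred
               if term_matches(labels, p["preference"]))


required = [{"matchExpressions": [
    {"key": "kubernetes.io/os", "operator": "In", "values": ["linux"]}]}]
preferred = [{"weight": 50, "preference": {"matchExpressions": [
    {"key": "label-1", "operator": "In", "values": ["key-1"]}]}}]

node = {"kubernetes.io/os": "linux", "label-1": "key-1"}
print(is_feasible(node, required), preference_score(node, preferred))
```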

8 of 18

Anti Affinity – Taints and tolerations (Filtering)

  • Taints allow a node to repel a set of pods.
  • Tolerations allow the pods to schedule onto nodes with matching taints.

kubectl taint nodes node1 key1=value1:NoSchedule

tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"
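
The matching rule between a taint and a toleration can be sketched as follows. This is a simplified model with hypothetical function names: it handles the `Equal` and `Exists` operators and treats only `NoSchedule` taints as blocking, which is an assumption made to keep the example short.

```python
def tolerates(toleration, taint):
    """Return True if the toleration matches the taint."""
    # An empty effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # With Exists, an omitted key matches every taint key.
        return toleration.get("key") in (None, taint["key"])
    return (toleration.get("key") == taint["key"]
            and toleration.get("value") == taint.get("value"))


def can_schedule(pod_tolerations, node_taints):
    """Every NoSchedule taint on the node must be tolerated by the pod."""
    return all(
        any(tolerates(tol, taint) for tol in pod_tolerations)
        for taint in node_taints
        if taint["effect"] == "NoSchedule"
    )


taint = {"key": "key1", "value": "value1", "effect": "NoSchedule"}
toleration = {"key": "key1", "operator": "Equal",
              "value": "value1", "effect": "NoSchedule"}

print(can_schedule([toleration], [taint]))  # the pod tolerates the taint
print(can_schedule([], [taint]))            # the node repels the pod
```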

9 of 18

Resource bin packing (scoring)

  • It is a priority function that is used to fine-tune Pod placement into Nodes.
  • Users can specify a weight for each resource, and Kubernetes scores Nodes based on the request-to-capacity ratio
  • It includes two parameters: shape and resources
    • Shape: Tunes the function as least or most requested based on the utilization and score values
    • Resources: Specify the weight for each resource

10 of 18

Bin packing configuration example

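apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
# ...
- pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: RequestedToCapacityRatio
        resources:
        - name: intel.com/foo
          weight: 3
        - name: intel.com/bar
          weight: 5
        requestedToCapacityRatio:
          # Bin packing: score rises with utilization (0% -> 0, 100% -> 10).
          # Reverse the two scores to get least-requested (spreading) behavior.
          shape:
          - utilization: 0
            score: 0
          - utilization: 100
            score: 10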

11 of 18

How score for bin packing is calculated

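  • Resource utilization (%) = (requested resources + used resources) * 100 / available resources
  • Raw score = the shape function applied to the utilization; this example assumes a shape that maps 0% to score 0 and 100% to score 10, i.e., floor(utilization / 10)
  • Node score = weighted average of the raw scores, using the resource weights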

Requested resources:
  intel.com/foo: 2
  memory: 256 MB
  cpu: 2

Resource weights:
  intel.com/foo: 5
  memory: 1
  cpu: 3

Available:
  intel.com/foo: 4
  memory: 1 GB
  cpu: 8

Used:
  intel.com/foo: 1
  memory: 256 MB
  cpu: 1

intel.com/foo = resourceScoringFunction((2+1), 4)
              = (100 - ((4-3)*100/4))
              = (100 - 25)
              = 75                       # requested + used = 75% * available
              = rawScoringFunction(75)
              = 7                        # floor(75/10)

memory        = resourceScoringFunction((256+256), 1024)
              = (100 - ((1024-512)*100/1024))
              = 50                       # requested + used = 50% * available
              = rawScoringFunction(50)
              = 5                        # floor(50/10)

cpu           = resourceScoringFunction((2+1), 8)
              = (100 - ((8-3)*100/8))
              = 37.5                     # requested + used = 37.5% * available
              = rawScoringFunction(37.5)
              = 3                        # floor(37.5/10)

NodeScore     = ((7 * 5) + (5 * 1) + (3 * 3)) / (5 + 1 + 3)
              = 49 / 9
              = 5                        # floor(5.44)
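
The arithmetic above can be checked with a short Python sketch. The function names are hypothetical and the raw score assumes the linear shape used in this example (0% maps to 0, 100% maps to 10); the real plugin interpolates between arbitrary shape points.

```python
import math


def utilization_pct(requested, used, available):
    """Share of the node's resource taken after placing the pod, in percent."""
    return (requested + used) * 100 / available


def raw_score(utilization):
    """Assumed shape: 0% -> 0 and 100% -> 10, i.e. floor(utilization / 10)."""
    return math.floor(utilization / 10)


def node_score(resources, weights):
    """Weighted average of the per-resource raw scores, rounded down."""
    weighted = sum(raw_score(utilization_pct(*resources[r])) * weights[r]
                   for r in resources)
    return weighted // sum(weights[r] for r in resources)


# (requested, used, available) per resource, taken from the example above.
resources = {
    "intel.com/foo": (2, 1, 4),
    "memory": (256, 256, 1024),
    "cpu": (2, 1, 8),
}
weights = {"intel.com/foo": 5, "memory": 1, "cpu": 3}

print(node_score(resources, weights))  # (35 + 5 + 9) // 9 = 5
```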

12 of 18

Pod priority

  • We can create priority classes
  • We can add a priority class to a Pod with priorityClassName

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "This priority class should be used for XYZ service pods only."

13 of 18

Pod priority

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  priorityClassName: high-priority

14 of 18

Pod preemption

  • Newly created Pods wait in a queue before scheduling
  • The scheduler picks a Pod from the queue and tries to schedule it
  • The preemption logic is triggered for a pending Pod P if no Node is found that satisfies all of P's requirements
  • The preemption logic tries to find a Node where removing one or more Pods with lower priority than P would enable P to be scheduled
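
A simplified version of this search is sketched below. It is only a model under stated assumptions: Nodes and Pods are plain dictionaries with hypothetical fields, CPU is the only resource considered, and the lowest-priority Pods are picked as victims first. The real scheduler checks all of P's requirements and applies further rules when choosing among candidate Nodes.

```python
def find_preemption_victims(pending, nodes):
    """Return (node_name, victim_names) for the first node where evicting
    lower-priority pods frees enough CPU for the pending pod, else None."""
    for node in nodes:
        free = node["capacity"] - sum(p["cpu"] for p in node["pods"])
        victims = []
        # Walk the node's pods from lowest to highest priority.
        for pod in sorted(node["pods"], key=lambda p: p["priority"]):
            if free >= pending["cpu"]:
                break  # enough room already, stop evicting
            if pod["priority"] < pending["priority"]:
                victims.append(pod["name"])
                free += pod["cpu"]
        if free >= pending["cpu"]:
            return node["name"], victims
    return None


nodes = [{
    "name": "n1",
    "capacity": 4,
    "pods": [
        {"name": "a", "cpu": 2, "priority": 1000},
        {"name": "b", "cpu": 2, "priority": 10},
    ],
}]

print(find_preemption_victims({"name": "P", "cpu": 2, "priority": 500}, nodes))
print(find_preemption_victims({"name": "Q", "cpu": 2, "priority": 5}, nodes))
```

In the first call only Pod `b` has a lower priority than P, so it is the single victim; in the second call no Pod has a lower priority than Q, so preemption cannot help.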

15 of 18

Node-pressure eviction

  • The process by which the kubelet proactively terminates Pods to reclaim resources on Nodes
  • The kubelet monitors resources, e.g., CPU, memory, disk space, etc.
  • If one of these resources reaches a specific consumption level, the kubelet can proactively fail one or more Pods
  • Soft eviction thresholds
    • The kubelet terminates Pods gracefully, respecting eviction-max-pod-grace-period
  • Hard eviction thresholds
    • The kubelet terminates Pods immediately, with a 0s grace period

16 of 18

Pod selection for eviction

  • First, the kubelet checks whether the Pod's resource usage exceeds its requests
  • Second, it checks the Pod priority
  • Third, it checks the Pod's resource usage relative to its requests, i.e., how far the Pod's consumption exceeds what it requested
  • QoS classes:
    • Guaranteed Pods: every container sets resource requests == resource limits
    • Burstable Pods: at least one request or limit is set, but the Pod is not Guaranteed
    • BestEffort Pods: no resource requests or limits are set
  • Ranking of Pods for eviction (assuming same priority):
    • First, BestEffort or Burstable Pods whose resource consumption exceeds their requests
    • Second, Guaranteed Pods and Burstable Pods whose resource consumption does not exceed their requests
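
The three criteria can be read as a sort key, as in the following sketch. The Pod dictionaries and their `usage`, `request`, and `priority` fields are hypothetical, and a single resource stands in for whatever resource is under pressure.

```python
def eviction_order(pods):
    """Rank pods for eviction, most likely victim first."""
    def key(pod):
        over = pod["usage"] - pod["request"]
        return (
            0 if over > 0 else 1,  # 1. pods exceeding their requests go first
            pod["priority"],       # 2. then lower priority first
            -over,                 # 3. then the largest overshoot first
        )
    return [p["name"] for p in sorted(pods, key=key)]


pods = [
    {"name": "a", "usage": 300, "request": 100, "priority": 0},
    {"name": "b", "usage": 150, "request": 100, "priority": 0},
    {"name": "c", "usage": 50, "request": 100, "priority": 0},
    {"name": "d", "usage": 500, "request": 100, "priority": 100},
]

print(eviction_order(pods))
```

Pod `d` overshoots the most but has a higher priority, so it ranks after `a` and `b`; Pod `c` stays within its request and is evicted last.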

17 of 18

API-initiated Eviction

  • Users can request eviction by calling the Eviction API directly
  • Alternatively, they can use a Kubernetes language client that creates an Eviction object

{
  "apiVersion": "policy/v1",
  "kind": "Eviction",
  "metadata": {
    "name": "quux",
    "namespace": "default"
  }
}

curl -v -H 'Content-type: application/json' \
  https://your-cluster/api/v1/namespaces/default/pods/quux/eviction -d @evict.json
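
For a client written by hand, the same request can be assembled programmatically. The sketch below only builds the URL path and JSON body shown above and does not contact a cluster; `build_eviction_request` is a hypothetical helper, and a real call would also need the API server address and credentials.

```python
import json


def build_eviction_request(pod, namespace="default"):
    """Return the eviction subresource path and the JSON body to POST to it."""
    path = f"/api/v1/namespaces/{namespace}/pods/{pod}/eviction"
    body = {
        "apiVersion": "policy/v1",
        "kind": "Eviction",
        "metadata": {"name": pod, "namespace": namespace},
    }
    return path, json.dumps(body, indent=2)


path, body = build_eviction_request("quux")
print(path)
print(body)
```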

18 of 18

More details