Scheduling on Kubernetes

Scheduling

  • Kube-scheduler takes care of finding a node to schedule new Pods
  • Nodes are filtered according to specific requirements that may be set:
    • Resource requirements
    • Affinity and anti-affinity
    • Taints and tolerations, and more
  • The scheduler first finds feasible nodes then scores them; it then picks the node with the highest score
  • Once this node is found, the scheduler notifies the API server in a process called binding

From Scheduler to Kubelet

  • Once the scheduler decision has been made, it is picked up by the kubelet
  • The kubelet will instruct the container runtime, through the CRI, to fetch the image of the required container
  • After fetching the image, the container is created and started

Setting Node Preferences

  • The nodeSelector field in the pod.spec specifies a key-value pair that must
    match a label which is set on nodes that are eligible to run the Pod
  • Use kubectl label nodes worker1.example.com disktype=ssd to set the label on a node
  • Use nodeSelector with disktype: ssd in the pod.spec to match the Pod to nodes that carry that label (see the sketch below)
  • nodeName is part of the pod.spec and can be used to always run a Pod on a node with a specific name
    • Not recommended: if that node is not currently available, the Pod will never run
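
A minimal sketch of what a Pod manifest using nodeSelector could look like (the Pod name and image are illustrative; the disktype=ssd label matches the example above):

    apiVersion: v1
    kind: Pod
    metadata:
      name: selector-pod        # illustrative name; could serve as selector-pod.yaml below
    spec:
      containers:
      - name: nginx
        image: nginx
      nodeSelector:
        disktype: ssd           # only nodes labeled disktype=ssd are eligible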

Using Node Preferences

  • kubectl label nodes worker2 disktype=ssd
  • kubectl apply -f selector-pod.yaml

 

Affinity and Anti-Affinity

  • (Anti-)Affinity is used to define advanced scheduler rules
  • Node affinity is used to constrain the nodes that can receive a Pod by matching labels on those nodes
  • Inter-pod affinity constrains the nodes that can receive a Pod by matching labels of Pods already running on those nodes
  • Anti-affinity can only be applied between Pods

How it Works

  • A Pod that has a node affinity label of key=value will only be scheduled to
    nodes with a matching label
  • A Pod that has a Pod affinity label of key=value will only be scheduled to nodes running Pods with the matching label

 

Setting Node Affinity

  • To define node affinity, two different statements can be used
  • requiredDuringSchedulingIgnoredDuringExecution requires the node to meet the constraint that is defined
  • preferredDuringSchedulingIgnoredDuringExecution defines a soft affinity that is ignored if it cannot be fulfilled (see the sketch after this list)
  • At the moment, affinity is only applied while scheduling Pods, and cannot be used to change where Pods are already running
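
A sketch of how these two statements fit into pod.spec.affinity (the label keys and values are illustrative):

    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:   # hard requirement
          nodeSelectorTerms:
          - matchExpressions:
            - key: disktype
              operator: In
              values:
              - ssd
        preferredDuringSchedulingIgnoredDuringExecution:  # soft preference
        - weight: 1
          preference:
            matchExpressions:
            - key: zone
              operator: In
              values:
              - east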

Defining Affinity Labels

  • Affinity rules go beyond simple key=value label matching
  • A matchExpressions entry is used to define a key (the label), an operator, and optionally one or more values, for instance:

    • Matching any node that has type set to either blue or green

    • Matching any node where the key storage is defined

Examples:
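
Reconstructed sketches of the two matchExpressions just described, as they would appear under nodeSelectorTerms (the keys type and storage come from the descriptions above):

    # Matches any node that has type set to either blue or green
    - matchExpressions:
      - key: type
        operator: In
        values:
        - blue
        - green

    # Matches any node where the key storage is defined
    - matchExpressions:
      - key: storage
        operator: Exists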

 

TopologyKey

  • When defining Pod affinity and anti-affinity, a topologyKey property is
    required
  • The topologyKey refers to a label that exists on nodes, and typically has a format containing a slash
    • kubernetes.io/hostname
  • Using a topologyKey allows the Pods to be assigned only to hosts that match the topologyKey
  • This allows administrators to use zones where the workloads are implemented
  • If no matching topologyKey label is found on the host, the specified topologyKey is ignored in the affinity

Using Pod Anti-Affinity

  • kubectl create -f redis-with-pod-affinity.yaml
  • On a two-node cluster, one Pod stays in the Pending state
  • kubectl create -f web-with-pod-affinity.yaml
  • This will run web instances only on nodes where redis is running as well

The anti-affinity rule ensures that you’ll never get two instances of the same application running on the same node.

The affinity rule means that web instances will run only on nodes where redis is running as well (see the sketches below).
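
Sketches of the affinity sections those two files might contain (the Deployment wrappers are omitted, and the label values app=store for the redis Pods and app=web for the web Pods are assumptions):

    # redis Pod template (labeled app: store): never co-locate two redis Pods
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values: ["store"]
          topologyKey: kubernetes.io/hostname

    # web Pod template (labeled app: web): never co-locate two web Pods,
    # and only run on nodes where a redis (app: store) Pod is already running
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values: ["web"]
          topologyKey: kubernetes.io/hostname
      podAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values: ["store"]
          topologyKey: kubernetes.io/hostname

This explains the behavior described above: on a two-node cluster (assuming three redis replicas), the third redis Pod finds no node without a redis Pod and stays Pending, and the web Pods can only land on nodes that already run redis.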

Taints

  • Taints are applied to a node to mark that the node should not accept any
    Pod that doesn’t tolerate the taint
  • Tolerations are applied to Pods and allow (but do not require) Pods to schedule on nodes with matching Taints — so they are an exception to taints that are applied
  • Where Affinities are used on Pods to attract them to specific nodes, Taints allow a node to repel a set of Pods
  • Taints and Tolerations are used to ensure Pods are not scheduled on inappropriate nodes, and thus make sure that dedicated nodes can be configured for dedicated tasks

Taint Types

  • Three types of Taint can be applied:
    • NoSchedule: does not schedule new Pods
    • PreferNoSchedule: does not schedule new Pods, unless there is no other option
    • NoExecute: evicts Pods that are already running on the node (and does not schedule new ones)
  • If the Pod has a matching toleration, however, it will ignore the taint

Setting Taints

  • Taints are set in different ways
  • Control plane nodes automatically get taints that won’t schedule user Pods
  • When kubectl drain and kubectl cordon are used, a taint is applied on the target node
  • Taints can be set automatically by the cluster when critical conditions arise, such as a node running out of disk space
  • Administrators can use kubectl taint to set taints:
    • kubectl taint nodes worker1 key1=value1:NoSchedule
    • kubectl taint nodes worker1 key1=value1:NoSchedule- # the trailing - removes the taint

Tolerations

  • To allow a Pod to run on a node with a specific taint, a toleration can be
    used
  • This is essential for running core Kubernetes Pods on the control plane nodes
  • While creating taints and tolerations, a key and value are defined to allow for more specific access
    • kubectl taint nodes worker1 storage=ssd:NoSchedule
  • This will allow a Pod to run if it has a toleration containing the key storage and the value ssd

Taint Key and Value

  • While defining a toleration, the Pod needs a key, operator, and value:
    tolerations:
    - key: "storage"
      operator: "Equal"
      value: "ssd"
      effect: "NoSchedule" # matches the effect of the taint set on the node
  • The default value for the operator is “Equal”; as an alternative, “Exists” is commonly used
  • If the operator “Exists” is used, the key should match the taint key and the value is ignored (see the sketch below)
  • If the operator “Equal” is used, the key and value must match
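
For comparison, a minimal sketch of a toleration that uses the "Exists" operator; no value is needed:

    tolerations:
    - key: "storage"
      operator: "Exists"        # matches any taint with key storage, regardless of its value
      effect: "NoSchedule"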

Node Conditions and Taints

  • Node conditions can automatically create taints on nodes if one of the
    following applies

    • memory-pressure
    • disk-pressure
    • pid-pressure
    • unschedulable
    • network-unavailable
  • If any of these conditions apply, a taint is automatically set
  • Node conditions can be ignored by adding corresponding Pod tolerations (see the sketch below)
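
The taints created for these conditions use keys in the node.kubernetes.io/ namespace. A sketch of a toleration that ignores, for example, the disk-pressure condition:

    tolerations:
    - key: "node.kubernetes.io/disk-pressure"
      operator: "Exists"
      effect: "NoSchedule"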

Using Taints – commands

  • kubectl taint nodes worker1 storage=ssd:NoSchedule
  • kubectl describe nodes worker1
  • kubectl create deployment nginx-taint --image=nginx
  • kubectl scale deployment nginx-taint --replicas=3
  • kubectl get pods -o wide # will show that pods are all on worker2
  • kubectl create -f taint-toleration.yaml # will run
  • kubectl create -f taint-toleration2.yaml # will not run

All the nginx-taint Pods do not run because the taint is set on the control node and there is only one node: the control node will only accept Pods that tolerate the storage=ssd taint. Let’s create a toleration.

The nginx-ssd Pod has a toleration configured for the storage=ssd:NoSchedule taint and is running on the control node. The nginx-hdd Pod only has a toleration configured for storage=hdd:NoSchedule, so it is not running (see the sketch below).
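
A sketch of what taint-toleration.yaml could contain (the Pod name and image follow the demo description above; taint-toleration2.yaml would be identical except for value: "hdd", which does not match the storage=ssd taint):

    apiVersion: v1
    kind: Pod
    metadata:
      name: nginx-ssd
    spec:
      containers:
      - name: nginx
        image: nginx
      tolerations:
      - key: "storage"
        operator: "Equal"
        value: "ssd"
        effect: "NoSchedule"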

LimitRange

  • LimitRange is an API object that limits resource usage per container or Pod
    in a Namespace
  • It uses three relevant options (see the sketch below):
    • type: specifies whether it applies to Pods or containers
    • defaultRequest: the default resources the application will request if it does not specify its own
    • default: the default resource limits, i.e. the maximum resources the application can use if it does not specify its own limits
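
A sketch of what a limitrange.yaml (such as the one applied in the demo later on) could contain; the numbers are illustrative:

    apiVersion: v1
    kind: LimitRange
    metadata:
      name: limitrange
    spec:
      limits:
      - type: Container
        defaultRequest:           # applied when a container sets no request
          cpu: 200m
          memory: 128Mi
        default:                  # applied as the limit when a container sets no limit
          cpu: 500m
          memory: 256Mi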

Quota

  • Quota is an API object that limits the total amount of resources available in a Namespace (see the declarative sketch below)
  • If a Namespace is configured with Quota, applications in that Namespace must be configured with resource settings in pod.spec.containers.resources
  • Where the goal of the LimitRange is to set default restrictions for each application running in a Namespace, the goal of Quota is to define maximum resources that can be consumed within a Namespace by all applications
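
A declarative sketch that is roughly equivalent to the kubectl create quota command used below:

    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: qtest
      namespace: limited
    spec:
      hard:
        pods: "3"
        cpu: 100m
        memory: 500Mi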

Managing Quota

  • kubectl create quota qtest --hard pods=3,cpu=100m,memory=500Mi --namespace limited
  • kubectl describe quota --namespace limited
  • kubectl create deploy nginx --image=nginx:latest --replicas=3 -n limited
  • kubectl get all -n limited # no pods
  • kubectl describe rs/nginx-xxx -n limited # fails because no resource requests or limits have been set on the Deployment
  • kubectl set resources deploy nginx --requests cpu=100m,memory=5Mi --limits cpu=200m,memory=20Mi -n limited
  • kubectl get pods -n limited

The Deployment does not create running Pods because no resource limitations have been set on it, while the quota requires them. That can easily be fixed using kubectl set resources.

After setting resources, only one Pod is running because of the quota: the hard cpu value of 100m allows just a single Pod requesting cpu=100m. We can edit the quota and set the hard cpu value to 1000m instead of 100m.

Now we can see that all three Pods have been scheduled (a sketch of the resulting resources section follows below).
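
For reference, a sketch of the resources section that kubectl set resources adds to the Deployment's Pod template:

    spec:
      containers:
      - name: nginx
        image: nginx:latest
        resources:
          requests:
            cpu: 100m
            memory: 5Mi
          limits:
            cpu: 200m
            memory: 20Mi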

Defining LimitRange

  • kubectl explain limitrange.spec
  • kubectl create ns limited
  • kubectl apply -f limitrange.yaml -n limited
  • kubectl describe ns limited
  • kubectl run limitpod --image=nginx -n limited
  • kubectl describe pod limitpod -n limited

 

Lab: Configuring Taints

  • Create a taint on node worker2, that doesn’t allow new Pods to be
    scheduled that don’t have an SSD hard disk, unless they have the
    appropriate toleration set
  • Remove the taint after verifying that it works

To allow specific Pods to run on the tainted node, add a tolerations: section to the pod.spec of those Pods, and remove the taint once you have verified the behavior.
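
A possible solution sketch, assuming the taint key disktype=ssd is used:

    # 1. Taint the node
    kubectl taint nodes worker2 disktype=ssd:NoSchedule

    # 2. Verify: a Pod without a matching toleration is no longer scheduled on worker2.
    #    A Pod that is allowed to run there needs this in its pod.spec:
    tolerations:
    - key: "disktype"
      operator: "Equal"
      value: "ssd"
      effect: "NoSchedule"

    # 3. Remove the taint again (note the trailing -)
    kubectl taint nodes worker2 disktype=ssd:NoSchedule-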