Kubernetes Node Maintenance

Kubernetes Monitoring

  • Kubernetes monitoring is offered by the integrated Metrics Server
  • The server, after installation, exposes a standard API and can be used to
    expose custom metrics
  • Use kubectl top for a top-like view of resource usage (see the example below)
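
A quick way to check that monitoring works is sketched below; the commands only return data once the Metrics Server (set up in the next section) is running.

    kubectl top nodes                  # CPU and memory usage per node
    kubectl top pods --all-namespaces  # resource usage for all Pods
    kubectl top pods --sort-by=cpu     # sort Pods in the current namespace by CPU usage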

Setting up Metrics Server

  • See https://github.com/kubernetes-sigs/metrics-server.git
  • Read the GitHub documentation!
  • kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
  • kubectl -n kube-system get pods # look for metrics-server
  • kubectl -n kube-system edit deployment metrics-server
    • In spec.template.spec.containers.args, use the following (see the sketch after this list)
      • - --kubelet-insecure-tls
      • - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
  • kubectl -n kube-system logs metrics-server<TAB> should show “Generating self-signed cert” and “Serving securely on [::]:443”
  • kubectl top pods --all-namespaces will show the most active Pods
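
For reference, this is a sketch of how the relevant part of the edited metrics-server Deployment could look; the args that ship in components.yaml are left out and only the two added lines are shown.

    spec:
      template:
        spec:
          containers:
          - name: metrics-server
            args:
            # existing args stay in place; these two lines are added
            - --kubelet-insecure-tls
            - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname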

Let’s investigate the Metrics Server.

There is an issue in the Metrics Server, so we have to edit the metrics-server deployment and add the arguments shown above. After that, kubectl top starts returning data.

Etcd

  • Etcd is a core Kubernetes service that stores all resources that have
    been created in the cluster
  • It is started by the kubelet as a static Pod on the control node (see the check after this list)
  • Losing the etcd means losing all your configuration
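
Because the etcd runs as a static Pod, its manifest lives in the kubeadm default manifests directory on the control node; a quick check, assuming default kubeadm paths:

    ls /etc/kubernetes/manifests/etcd.yaml       # static Pod manifest picked up by the kubelet
    kubectl -n kube-system get pods | grep etcd  # the resulting etcd Pod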

Etcd Backup

  • To back up the etcd, root access is required to run the etcdctl tool
  • Use sudo apt install etcd-client to install this tool
  • etcdctl defaults to the wrong API version; fix this by setting it explicitly, as in sudo ETCDCTL_API=3 etcdctl ... snapshot save
  • To use etcdctl, you need to specify the etcd service API endpoint, as well as the cacert, cert, and key to be used
  • Values for all of these can be obtained by using ps aux | grep etcd

Backing up the Etcd

  • sudo apt install etcd-client
  • sudo etcdctl --help; sudo ETCDCTL_API=3 etcdctl --help
  • ps aux | grep etcd
  • sudo ETCDCTL_API=3 etcdctl --endpoints=localhost:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key get / --prefix --keys-only
  • sudo ETCDCTL_API=3 etcdctl --endpoints=localhost:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key snapshot save /tmp/etcdbackup.db

 

Verifying the Etcd Backup

  • sudo etcdutl --write-out=table snapshot status
    /tmp/etcdbackup.db
  • Just to be sure: cp /tmp/etcdbackup.db /tmp/etcdbackup.db.2

In case anything happens to one of these backups, we always have a spare copy.

Restoring the Etcd

  • sudo etcdutl snapshot restore /tmp/etcdbackup.db --data-dir /var/lib/etcd-backup restores the etcd backup in a non-default folder
  • To start using it, the Kubernetes core services must be stopped, after which the etcd can be reconfigured to use the new directory
  • To stop the core services, temporarily move /etc/kubernetes/manifests/*.yaml to /etc/kubernetes/
  • As the kubelet process periodically polls for static Pod files, the etcd process will disappear within a minute
  • Use sudo crictl ps to verify that it has been stopped
  • Once the etcd Pod has stopped, reconfigure the etcd to use the non-
    default etcd path
  • In etcd.yaml you’ll find a HostPath volume with the name etcd-data, pointing to the location where the Etcd files are found. Change this to the location where the restored files are
  • Move back the static Pod files to /etc/kubernetes/manifests/
  • Use sudo crictl ps to verify the Pods have restarted successfully
  • Next, kubectl get all should show the original Etcd resources

Restoring the Etcd Commands

  • kubectl delete --all deploy
  • cd /etc/kubernetes/manifests/
  • sudo mv * .. # this will stop all static Pods, including the control plane
  • sudo crictl ps
  • sudo etcdutl snapshot restore /tmp/etcdbackup.db --data-dir /var/lib/etcd-backup
  • sudo ls -l /var/lib/etcd-backup/
  • sudo vi /etc/kubernetes/etcd.yaml # change the etcd-data hostPath volume to /var/lib/etcd-backup (see the sketch after this list)
  • sudo mv ../*.yaml .
  • sudo crictl ps # should show the control plane containers running again
  • kubectl get deploy -A
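
The etcd-data volume in etcd.yaml looks roughly like this after the change; the type field follows the kubeadm default, and only the path is edited.

    volumes:
    - hostPath:
        path: /var/lib/etcd-backup   # was /var/lib/etcd before the restore
        type: DirectoryOrCreate
      name: etcd-data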

 

Cluster Nodes Upgrade

  • Kubernetes clusters can be upgraded from one minor version to the next;
    skipping minor versions (e.g., 1.23 to 1.25) is not supported
  • First, you’ll have to upgrade kubeadm
  • Next, you’ll need to upgrade the control plane node
  • After that, the worker nodes are upgraded
  • Use “Upgrading kubeadm clusters” from the documentation

Control Plane Node Upgrade Overview

  • upgrade kubeadm
  • use kubeadm upgrade plan to check available versions
  • use kubeadm upgrade apply v1.xx.y to run the upgrade
  • use kubectl drain controlnode --ignore-daemonsets
  • upgrade kubelet and kubectl, then restart the kubelet (see the sketch after this list)
  • use kubectl uncordon controlnode to bring back the control node
  • proceed with other nodes
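
A minimal sketch of these steps on an Ubuntu control plane node, assuming the Kubernetes package repository for the target minor version is already configured; 1.xx.y and controlnode are placeholders for the target version and the node name.

    sudo apt-mark unhold kubeadm
    sudo apt-get update && sudo apt-get install -y kubeadm='1.xx.y-*'
    sudo apt-mark hold kubeadm
    sudo kubeadm upgrade plan
    sudo kubeadm upgrade apply v1.xx.y
    kubectl drain controlnode --ignore-daemonsets
    sudo apt-mark unhold kubelet kubectl
    sudo apt-get install -y kubelet='1.xx.y-*' kubectl='1.xx.y-*'
    sudo apt-mark hold kubelet kubectl
    sudo systemctl daemon-reload && sudo systemctl restart kubelet
    kubectl uncordon controlnode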

 

High Availability Options

  • Stacked control plane nodes require less infrastructure, as the etcd
    members and control plane nodes are co-located

    • Control planes and etcd members are running together on the same node
    • For optimal protection, requires a minimum of 3 stacked control plane nodes
  • An external etcd cluster requires more infrastructure, as the control plane nodes and etcd members are separated
    • The etcd service runs on external nodes, so this requires twice the number of nodes

High Availability Requirements

  • In a Kubernetes HA cluster, a load balancer is needed to distribute the
    workload between the cluster nodes
  • The load balancer can be externally provided using open source software, or a load balancer appliance

Exploring Load Balancer Configuration

  • In the load balancer setup, HAProxy is running on each server to provide
    access to port 8443 on all IP addresses on that server (see the configuration sketch after this list)
  • Incoming traffic on port 8443 is forwarded to the kube-apiserver port 6443
  • The keepalived service is running on all HA nodes to provide a virtual IP address on one of the nodes
  • kubectl clients connect to this VIP on port 8443
  • Use the setup-lb-ubuntu.sh script provided in the GitHub repository for easy setup
  • Additional instructions are in the script
  • After running the load balancer setup, use nc 192.168.29.100 8443 to verify the availability of the load balancer IP and port
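
For illustration, a fragment of what the HAProxy configuration could look like for the 8443-to-6443 forwarding described above; the backend name and control node addresses are placeholders.

    frontend kubernetes-api
        bind *:8443
        mode tcp
        option tcplog
        default_backend kube-apiservers

    backend kube-apiservers
        mode tcp
        balance roundrobin
        option tcp-check
        server control1 192.168.29.101:6443 check
        server control2 192.168.29.102:6443 check
        server control3 192.168.29.103:6443 check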

Setting up a Highly Available Kubernetes Cluster

  • 3 VMs to be used as controllers in the cluster; Install K8s software but don’t
    set up the cluster yet
  • 2 VMs to be used as worker nodes; Install K8s software
  • Ensure /etc/hosts is set up for name resolution of all nodes and copy it to all nodes (see the example after this list)
  • Disable SELinux on all nodes if applicable
  • Disable the firewall if applicable
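
An example of the /etc/hosts entries to copy to all nodes; the host names and node addresses are hypothetical, apart from the VIP used elsewhere in this setup.

    192.168.29.100 kubevip     # virtual IP managed by keepalived
    192.168.29.101 control1
    192.168.29.102 control2
    192.168.29.103 control3
    192.168.29.111 worker1
    192.168.29.112 worker2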

Initializing the HA Setup

  • sudo kubeadm init --control-plane-endpoint "192.168.29.100:8443" --upload-certs
  • Save the output of the command, which shows the next steps
  • Configure networking
    • kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml
  • Copy the kubeadm join command that was printed after successfully initializing the first control node (see the sketch after this list)
    • Make sure to use the command that has --control-plane in it!
  • Complete setup on other control nodes as instructed
  • Use kubectl get nodes to verify setup
  • Continue and join worker nodes as instructed
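
The join commands printed by kubeadm init look roughly as follows; the token, CA certificate hash, and certificate key are placeholders and must be taken from your own kubeadm init output.

    # on the remaining control plane nodes
    sudo kubeadm join 192.168.29.100:8443 --token <token> \
        --discovery-token-ca-cert-hash sha256:<hash> \
        --control-plane --certificate-key <certificate-key>

    # on the worker nodes (no --control-plane and no --certificate-key)
    sudo kubeadm join 192.168.29.100:8443 --token <token> \
        --discovery-token-ca-cert-hash sha256:<hash>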

Configuring the HA Client

  • On the machine you want to use as operator workstation, create a ~/.kube
    directory and copy /etc/kubernetes/admin.conf from any control node to
    ~/.kube/config on the client machine (see the sketch after this list)
  • Install the kubectl utility
  • Ensure that host name resolution goes to the new control plane VIP
  • Verify using kubectl get nodes
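
A sketch of these steps on the operator workstation, assuming SSH access to one of the control nodes; control1 is a placeholder name.

    mkdir -p ~/.kube
    # admin.conf is only readable by root on the control node, hence the sudo
    ssh control1 sudo cat /etc/kubernetes/admin.conf > ~/.kube/config
    kubectl get nodes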

Testing it

  • On all nodes: find the VIP using ip a
  • On all nodes with kubectl configured, use kubectl get all to verify that the client is working
  • Shut down the node that has the VIP
  • Verify that kubectl get all still works
  • Troubleshooting: consider using sudo systemctl restart haproxy

Lab: Etcd Backup and Restore

  • Create a backup of the etcd
  • Remove a few resources (Pods and/or Deployments)
  • Restore the etcd backup and verify that this brings your resources back