Logging, Monitoring and Troubleshooting on Kubernetes

Resource Monitoring

  • kubectl get can be used on any resource and shows generic resource status
  • If metrics are collected, use kubectl top pods and kubectl top nodes to get performance-related information about Pods and Nodes
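The commands above only work if the Metrics Server add-on is installed in the cluster; a short sketch of typical usage:

```shell
# Requires the Metrics Server add-on; without it, kubectl top returns an error.
kubectl top nodes                       # CPU and memory usage per node
kubectl top pods                        # usage per Pod in the current namespace
kubectl top pods -A --sort-by=memory    # all namespaces, heaviest consumers first
```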


Troubleshooting Flow

  • Resources are first created in the Kubernetes etcd database
  • Use kubectl describe and kubectl get events to see how that has been going
  • After adding the resources to the database, the Pod application is started on the nodes where it is scheduled
  • Before it can be started, the Pod image needs to be fetched
    • Use sudo crictl images to get a list
  • Once the application is started, use kubectl logs to read the output of the application
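The flow above can be sketched as a command sequence (the Pod name mypod is a placeholder):

```shell
# 1. Check how the resource creation in etcd went
kubectl describe pod mypod
kubectl get events --field-selector involvedObject.name=mypod

# 2. On the node where the Pod was scheduled, confirm the image was fetched
sudo crictl images

# 3. Once the application is started, read its output
kubectl logs mypod
```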

Troubleshooting Pods

  • The first step is to use kubectl get, which will give a generic overview of
    Pod states
  • A Pod can be in any of the following states:
    • Pending: the Pod has been created in etcd, but is waiting for an eligible node
    • Running: the Pod is in a healthy state
    • Succeeded: the Pod has done its work and there is no need to restart it
    • Failed: one or more containers in the Pod have ended with an error code and will not be restarted
    • Unknown: the state could not be obtained, often related to network issues
    • Completed: the Pod has run to completion
    • CrashLoopBackOff: one or more containers in the Pod have exited with an error, and the kubelet keeps trying to restart them
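Note that Completed and CrashLoopBackOff are values shown in the STATUS column of kubectl get pods, while the underlying Pod phase is one of the first five states. A sketch of how to query the phase directly (mypod is a placeholder name):

```shell
# Print only the phase of a specific Pod
kubectl get pod mypod -o jsonpath='{.status.phase}'

# List all Pods in the cluster that are not in the Running phase
kubectl get pods -A --field-selector=status.phase!=Running
```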

Investigating Resource Problems

  • If kubectl get indicates that there is an issue, the next step is to use kubectl
    describe to get more information
  • kubectl describe shows API information about the resource and often has good indicators of what is going wrong
  • If kubectl describe shows that a Pod has an issue starting its primary container, use kubectl logs to investigate the application logs
  • If the Pod is running, but not behaving as expected, open an interactive shell on the Pod for further troubleshooting: kubectl exec -it mypod -- sh
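The investigation sequence above, with a few useful kubectl logs options (Pod and container names are placeholders):

```shell
kubectl get pods                        # generic overview of Pod states
kubectl describe pod mypod              # API information and events
kubectl logs mypod                      # output of the primary container
kubectl logs mypod -c mycontainer       # output of a specific container
kubectl logs mypod --previous           # output of the previous, crashed instance
kubectl exec -it mypod -- sh            # interactive shell for live troubleshooting
```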

Note: testdb is an individual, unmanaged Pod, so its specification cannot be updated in place; delete it and re-create it with the appropriate environment variable (see kubectl set env -h for the syntax used with managed resources).
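A sketch of that procedure; the variable name MYSQL_ROOT_PASSWORD is only an assumed example for a database Pod:

```shell
# testdb is an unmanaged (naked) Pod, so its spec cannot be changed in place
kubectl get pod testdb -o yaml > testdb.yaml   # save the current definition
kubectl delete pod testdb
# edit testdb.yaml to add the required environment variable, then:
kubectl apply -f testdb.yaml

# for managed resources (Deployments etc.) the variable can be set directly:
kubectl set env deployment/testdb MYSQL_ROOT_PASSWORD=secret
```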


Troubleshooting Cluster Nodes

  • Use kubectl cluster-info for a generic impression of cluster health
  • Use kubectl cluster-info dump for (too much) information coming from all the cluster log files
  • kubectl get nodes will give a generic overview of node health
  • kubectl get pods -n kube-system shows Kubernetes core services running on the control node
  • kubectl describe node nodename shows detailed information about a node; check the “Conditions” section for operational information
  • sudo systemctl status kubelet will show current status information about the kubelet
  • sudo systemctl restart kubelet allows you to restart it
  • sudo openssl x509 -in /var/lib/kubelet/pki/kubelet.crt -text allows you to inspect the kubelet certificate and verify that it is still valid
  • The kube-proxy Pods are running to ensure connectivity with worker nodes, use kubectl get pods -n kube-system for an overview
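The node health checks above as one command sequence (worker1 is a placeholder node name):

```shell
kubectl cluster-info
kubectl get nodes
kubectl describe node worker1 | grep -A8 Conditions    # operational status
kubectl get pods -n kube-system -o wide                # core services and kube-proxy
sudo systemctl status kubelet
# show only the validity period of the kubelet certificate
sudo openssl x509 -in /var/lib/kubelet/pki/kubelet.crt -noout -dates
```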


Application Access

  • To access applications running in the Pods, Services and Ingress are used
  • The Service resource uses a selector label to connect to Pods with a matching label
  • The Ingress resource connects to a Service and picks up its selector label to connect to the backend Pods directly
  • To troubleshoot application access, check the labels in all of these resources
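A sketch of checking that the labels match across the resources (the Service name webserver is a placeholder):

```shell
# Compare the Pod labels with the Service selector
kubectl get pods --show-labels
kubectl get svc webserver -o jsonpath='{.spec.selector}'

# If the selector matches Pod labels, the Service has endpoints;
# an empty list here means the selector is wrong
kubectl get endpoints webserver
```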

Example: the Service selector is wrong (it does not match the Pod labels). Change the selector to the label the Pods actually carry, in this example webserver.



  • Probes can be used to test access to Pods
  • They are a part of the Pod specification
  • A readinessProbe is used to make sure a Pod is not published as available (added to the Service endpoints) until the readinessProbe has succeeded
  • The livenessProbe is used to continue checking the availability of a Pod
  • The startupProbe was introduced for legacy applications that require additional startup time on first initialization
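A sketch of how the three probes look in a Pod specification; the Pod name and port are assumptions, not taken from the course files:

```shell
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: probes-demo              # hypothetical name
spec:
  containers:
  - name: web
    image: nginx
    readinessProbe:              # gates addition to Service endpoints
      httpGet:
        path: /
        port: 80
    livenessProbe:               # keeps checking; failure triggers a restart
      httpGet:
        path: /
        port: 80
    startupProbe:                # gives slow-starting apps extra time
      httpGet:
        path: /
        port: 80
      failureThreshold: 30
      periodSeconds: 10
EOF
```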


Probe Types

  • The probe itself is a simple test, which is often a command
  • The following probe types are defined in pod.spec.containers
    • exec: a command is executed and must return a zero exit value
    • httpGet: an HTTP request returns a response code between 200 and 399
    • tcpSocket: connectivity to a TCP socket (available port) is successful
  • The Kubernetes API provides 3 endpoints that indicate the current status of the API server
    • /healthz
    • /livez
    • /readyz
  • These endpoints can be used by different probes
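These API server endpoints can also be queried directly, which is useful when checking control plane health by hand:

```shell
# Query the API server health endpoints (requires valid cluster credentials)
kubectl get --raw='/healthz'
kubectl get --raw='/livez?verbose'     # per-check detail
kubectl get --raw='/readyz?verbose'
```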


Using Probes 

  • kubectl create -f busybox-ready.yaml
  • kubectl get pods: note the READY state, which is set to 0/1, meaning the Pod has started successfully but is not yet considered ready
  • kubectl edit pods busybox-ready and change /tmp/nothing to /etc/hosts. Notice this is not allowed.
  • kubectl exec -it busybox-ready -- /bin/sh
    • touch /tmp/nothing; exit
  • kubectl get pods: the Pod is now started, and since editing the probe in the running Pod is forbidden, creating /tmp/nothing from inside the container is the workaround; once the readinessProbe succeeds, busybox-ready shows READY 1/1
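The contents of busybox-ready.yaml are not shown in these notes; a plausible sketch consistent with the walkthrough (an exec readinessProbe checking for /tmp/nothing) would be:

```shell
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: busybox-ready
spec:
  containers:
  - name: busybox
    image: busybox
    command: ["sleep", "3600"]
    readinessProbe:
      exec:
        command: ["cat", "/tmp/nothing"]   # fails until the file exists
      periodSeconds: 10
EOF
```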


Using Probes example 2

  • kubectl create -f nginx-probes.yaml
  • kubectl get pods


Lab: Troubleshooting Kubernetes

  • Create a Pod that is running the Busybox container with the sleep 3600
    command. Configure a Probe that checks for the existence of the file
    /etc/hosts on that Pod.

Go to the Documentation: search “probes” -> Configure Liveness, Readiness and Startup Probes
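One possible solution sketch for this lab, assuming an exec probe on /etc/hosts (the Pod name is arbitrary):

```shell
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: busybox-probe
spec:
  containers:
  - name: busybox
    image: busybox
    command: ["sleep", "3600"]
    livenessProbe:
      exec:
        command: ["cat", "/etc/hosts"]   # succeeds as long as the file exists
EOF
```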



Lab: Troubleshooting Nodes

  • Use the appropriate tools to find out whether the cluster nodes are in good health