Cluster Troubleshooting
- An OpenShift cluster has two focal areas for troubleshooting
- OpenShift operators are cluster applications that can be monitored and fixed
 like any other application that runs in OpenShift
- OpenShift nodes can be monitored individually
- Other problems may come from version mismatches
Verifying Node Health
- oc get nodes is a good first step to investigate the current health of nodes
  - Any status other than Ready (for example NotReady) means that the control plane cannot use the node normally
 
- oc adm top nodes shows current node resource usage, based on statistics gathered by the metrics server
- oc describe node may be used to investigate recent events and resource usage
  - Events shows an event log
  - Allocated resources gives an overview of the resource requests and limits allocated on the node
  - Capacity shows the total capacity of the node
  - Non-terminated Pods shows the Pods currently running on the node
 
    $ oc whoami
    system:admin
    $ oc get nodes
    NAME        STATUS    ROLES     AGE       VERSION
    localhost   Ready     <none>    7d        v1.11.0+d4cacc0
    $ oc adm top nodes
    Error from server (NotFound): the server could not find the requested resource (get services https:heapster:)
    $ oc describe node | less
    $ oc describe node
    Name:               localhost
    Roles:              <none>
    Labels:             beta.kubernetes.io/arch=amd64
                        beta.kubernetes.io/os=linux
                        disktype=nvme
                        kubernetes.io/hostname=localhost
    Annotations:        volumes.kubernetes.io/controller-managed-attach-detach=true
    CreationTimestamp:  Sat, 22 Jul 2023 20:54:32 +0200
    Taints:             <none>
    Unschedulable:      false
    Conditions:
      Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
      ----             ------  -----------------                 ------------------                ------                       -------
      OutOfDisk        False   Sun, 30 Jul 2023 15:33:30 +0200   Sat, 22 Jul 2023 20:54:22 +0200   KubeletHasSufficientDisk     kubelet has sufficient disk space available
      MemoryPressure   False   Sun, 30 Jul 2023 15:33:30 +0200   Sat, 22 Jul 2023 20:54:22 +0200   KubeletHasSufficientMemory   kubelet has sufficient memory available
      DiskPressure     False   Sun, 30 Jul 2023 15:33:30 +0200   Sat, 22 Jul 2023 20:54:22 +0200   KubeletHasNoDiskPressure     kubelet has no disk pressure
      PIDPressure      False   Sun, 30 Jul 2023 15:33:30 +0200   Sat, 22 Jul 2023 20:54:22 +0200   KubeletHasSufficientPID      kubelet has sufficient PID available
      Ready            True    Sun, 30 Jul 2023 15:33:30 +0200   Sat, 22 Jul 2023 20:54:22 +0200   KubeletReady                 kubelet is posting ready status
    Addresses:
      InternalIP:  172.30.9.22
      Hostname:    localhost
    Capacity:
     cpu:            8
     hugepages-1Gi:  0
     hugepages-2Mi:  0
     memory:         7981844Ki
     pods:           250
    Allocatable:
     cpu:            8
     hugepages-1Gi:  0
     hugepages-2Mi:  0
     memory:         7879444Ki
     pods:           250
    System Info:
     Machine ID:                     a37388a4746444f1b3f079f777748845
     System UUID:                    6099DE02-9EA8-C210-7553-A7697F2C302A
     Boot ID:                        7c076895-85e9-45ce-ae2c-8bbe7127be73
     Kernel Version:                 3.10.0-1160.92.1.el7.x86_64
     OS Image:                       CentOS Linux 7 (Core)
     Operating System:               linux
     Architecture:                   amd64
     Container Runtime Version:      docker://24.0.3
     Kubelet Version:                v1.11.0+d4cacc0
     Kube-Proxy Version:             v1.11.0+d4cacc0
    Non-terminated Pods:             (45 in total)
      Namespace                      Name                                                       CPU Requests  CPU Limits  Memory Requests  Memory Limits
      ---------                      ----                                                       ------------  ----------  ---------------  -------------
      debug                          dnginx-88c7766dd-hlbtd                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)
      default                        bitginx-1-jzk9r                                            0 (0%)        0 (0%)      0 (0%)           0 (0%)
      default                        busybox                                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)
      default                        docker-registry-1-ctgff                                    100m (1%)     0 (0%)      256Mi (3%)       0 (0%)
      default                        lab4pod                                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)
      default                        linginx1-dc9f65f54-6zw8j                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)
      default                        linginx2-69bf6fc66b-mv6wx                                  0 (0%)        0 (0%)      0 (0%)           0 (0%)
      default                        nginx                                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)
      default                        nginx-cm                                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)
      default                        pv-pod                                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)
      default                        router-1-k8zgt                                             100m (1%)     0 (0%)      256Mi (3%)       0 (0%)
      default                        test1                                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)
      kube-dns                       kube-dns-t727w                                             0 (0%)        0 (0%)      0 (0%)           0 (0%)
      kube-proxy                     kube-proxy-cr7kh                                           0 (0%)        0 (0%)      0 (0%)           0 (0%)
      kube-system                    kube-controller-manager-localhost                          0 (0%)        0 (0%)      0 (0%)           0 (0%)
      kube-system                    kube-scheduler-localhost                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)
      kube-system                    master-api-localhost                                       0 (0%)        0 (0%)      0 (0%)           0 (0%)
      kube-system                    master-etcd-localhost                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)
      limits                         nee-597889d8c7-p6tc2                                       0 (0%)        0 (0%)      0 (0%)           0 (0%)
      love                           anti1                                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)
      love                           newpod-1-qgmpj                                             0 (0%)        0 (0%)      0 (0%)           0 (0%)
      love                           runonssd                                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)
      network-security               nginxlab-1-bcgkt                                           0 (0%)        0 (0%)      0 (0%)           0 (0%)
      nodesel                        simple-6f55965d79-mklpc                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)
      nodesel                        simple-6f55965d79-q8pq9                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)
      openshift-apiserver            openshift-apiserver-thwpd                                  0 (0%)        0 (0%)      0 (0%)           0 (0%)
      openshift-controller-manager   openshift-controller-manager-c9ms5                         0 (0%)        0 (0%)      0 (0%)           0 (0%)
      openshift-core-operators       openshift-service-cert-signer-operator-6d477f986b-jzcgw    0 (0%)        0 (0%)      0 (0%)           0 (0%)
      openshift-core-operators       openshift-web-console-operator-664b974ff5-px7gw            0 (0%)        0 (0%)      0 (0%)           0 (0%)
      openshift-service-cert-signer  apiservice-cabundle-injector-8ffbbb6dc-x9l4r               0 (0%)        0 (0%)      0 (0%)           0 (0%)
      openshift-service-cert-signer  service-serving-cert-signer-668c45d5f-lxvff                0 (0%)        0 (0%)      0 (0%)           0 (0%)
      openshift-web-console          webconsole-78f59b4bfb-qqv4p                                100m (1%)     0 (0%)      100Mi (1%)       0 (0%)
      quota-test                     bitginx-84b698ff5c-h5j57                                   10m (0%)      50m (0%)    5Mi (0%)         20Mi (0%)
      quota-test                     bitginx-84b698ff5c-qxwd5                                   10m (0%)      50m (0%)    5Mi (0%)         20Mi (0%)
      quota-test                     bitginx-84b698ff5c-xfnp4                                   10m (0%)      50m (0%)    5Mi (0%)         20Mi (0%)
      source-project                 nginx-access                                               0 (0%)        0 (0%)      0 (0%)           0 (0%)
      source-project                 nginx-noaccess                                             0 (0%)        0 (0%)      0 (0%)           0 (0%)
      target-project                 nginx-target-1-9kdn6                                       0 (0%)        0 (0%)      0 (0%)           0 (0%)
      template-project               anti1                                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)
      test-project                   nginxmany-5859c9dbb6-5xxr6                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)
      test-project                   nginxmany-5859c9dbb6-6ljwm                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)
      test-project                   nginxmany-5859c9dbb6-9lv6c                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)
      test-project                   nginxmany-5859c9dbb6-dgr7k                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)
      test-project                   nginxmany-5859c9dbb6-dl8wr                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)
      test-project                   nginxmany-5859c9dbb6-hk2sm                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)
    Allocated resources:
      (Total limits may be over 100 percent, i.e., overcommitted.)
      Resource  Requests    Limits
      --------  --------    ------
      cpu       330m (4%)   150m (1%)
      memory    627Mi (8%)  60Mi (0%)
    Events:     <none>
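On clusters with many nodes, reading this output by eye gets tedious. The following is a minimal sketch, not an official tool: two hypothetical helper functions (the names not_ready_nodes and allocated_resources are made up here) that read oc output from standard input. They assume the default table layout of oc get nodes and the section headings of oc describe node as shown above.

```shell
# Sketch only: both helpers read saved or piped `oc` output from stdin.

# Print the names of nodes whose STATUS column is not exactly "Ready".
# Assumes the default `oc get nodes` table with a header on the first line.
# A cordoned node ("Ready,SchedulingDisabled") is reported as well.
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Print only the "Allocated resources" section of `oc describe node`,
# stopping at the "Events:" line that follows it.
allocated_resources() {
  awk '/^Allocated resources:/ { show = 1 } /^Events:/ { show = 0 } show'
}

# Usage against a live cluster (nodename is a placeholder):
#   oc get nodes | not_ready_nodes
#   oc describe node nodename | allocated_resources
```

Reading from standard input keeps the helpers usable on output that was saved earlier, for example when the cluster is no longer reachable.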
Monitoring Operators
- Operators are the programs that are responsible for starting the different components running in the cluster
- These components are started by operators as DaemonSets or Deployments
- oc get clusteroperators shows the current status of operators
  - If an operator is in the Progressing state, it is currently being updated
  - If it is in the Degraded state, something is wrong and further investigation is required
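To spot operators that need attention without scanning the whole table, the status columns can be filtered. This is a sketch under assumptions: unhealthy_operators is a hypothetical helper name, and it expects the default oc get clusteroperators columns (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED, SINCE) with a non-empty VERSION in every row.

```shell
# Sketch only: reads the default `oc get clusteroperators` table from stdin.
# Prints every operator that is not Available, or is Progressing or Degraded.
# Assumption: the VERSION column is filled in; an empty VERSION (possible
# during installation) would shift the columns and confuse this filter.
unhealthy_operators() {
  awk 'NR > 1 && ($3 != "True" || $4 != "False" || $5 != "False") {
         printf "%s available=%s progressing=%s degraded=%s\n", $1, $3, $4, $5
       }'
}

# Usage against a live cluster:
#   oc get clusteroperators | unhealthy_operators
```

An empty result means that every operator reports Available=True, Progressing=False and Degraded=False.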
 
Analyzing Operators
- ClusterOperator resources are non-namespaced
- Each operator starts its resources in dedicated namespaces
  - Some operators use one namespace, some operators use more
 
- If an operator shows a degraded status in oc get co, investigate resources running in its namespace and use common tools to check their status
    $ oc get co
    NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
    authentication                             4.13.1    True        False         False      3d22h
    baremetal                                  4.13.1    True        False         False      263d
    cloud-controller-manager                   4.13.1    True        False         False      263d
    cloud-credential                           4.13.1    True        False         False      263d
    cluster-autoscaler                         4.13.1    True        False         False      263d
    config-operator                            4.13.1    True        False         False      263d
    console                                    4.13.1    True        False         False      84d
    control-plane-machine-set                  4.13.1    True        False         False      190d
    csi-snapshot-controller                    4.13.1    True        False         False      263d
    dns                                        4.13.1    True        False         False      133d
    etcd                                       4.13.1    True        False         False      133d
    image-registry                             4.13.1    True        False         False      190d
    ingress                                    4.13.1    True        False         False      59d
    insights                                   4.13.1    True        False         False      133d
    kube-apiserver                             4.13.1    True        False         False      263d
    kube-controller-manager                    4.13.1    True        False         False      263d
    kube-scheduler                             4.13.1    True        False         False      263d
    kube-storage-version-migrator              4.13.1    True        False         False      59d
    machine-api                                4.13.1    True        False         False      263d
    machine-approver                           4.13.1    True        False         False      263d
    machine-config                             4.13.1    True        False         False      41h
    marketplace                                4.13.1    True        False         False      263d
    monitoring                                 4.13.1    True        False         False      133d
    network                                    4.13.1    True        False         False      263d
    node-tuning                                4.13.1    True        False         False      59d
    openshift-apiserver                        4.13.1    True        False         False      10d
    openshift-controller-manager               4.13.1    True        False         False      190d
    openshift-samples                          4.13.1    True        False         False      59d
    operator-lifecycle-manager                 4.13.1    True        False         False      263d
    operator-lifecycle-manager-catalog         4.13.1    True        False         False      263d
    operator-lifecycle-manager-packageserver   4.13.1    True        False         False      131d
    service-ca                                 4.13.1    True        False         False      263d
    storage                                    4.13.1    True        False         False      263d
    $ oc get ns | grep monitoring
    Error from server (Forbidden): namespaces is forbidden: User "makarewicz-openshift" cannot list resource "namespaces" in API group "" at the cluster scope
    $ oc get ns
    Error from server (Forbidden): namespaces is forbidden: User "makarewicz-openshift" cannot list resource "namespaces" in API group "" at the cluster scope
    $ oc get all -n openshift-monitoring
    Error from server (Forbidden): pods is forbidden: User "makarewicz-openshift" cannot list resource "pods" in API group "" in the namespace "openshift-monitoring"
    Error from server (Forbidden): replicationcontrollers is forbidden: User "makarewicz-openshift" cannot list resource "replicationcontrollers" in API group "" in the namespace "openshift-monitoring"
    Error from server (Forbidden): services is forbidden: User "makarewicz-openshift" cannot list resource "services" in API group "" in the namespace "openshift-monitoring"
    Error from server (Forbidden): daemonsets.apps is forbidden: User "makarewicz-openshift" cannot list resource "daemonsets" in API group "apps" in the namespace "openshift-monitoring"
    Error from server (Forbidden): deployments.apps is forbidden: User "makarewicz-openshift" cannot list resource "deployments" in API group "apps" in the namespace "openshift-monitoring"
    Error from server (Forbidden): replicasets.apps is forbidden: User "makarewicz-openshift" cannot list resource "replicasets" in API group "apps" in the namespace "openshift-monitoring"
    Error from server (Forbidden): statefulsets.apps is forbidden: User "makarewicz-openshift" cannot list resource "statefulsets" in API group "apps" in the namespace "openshift-monitoring"
    Error from server (Forbidden): horizontalpodautoscalers.autoscaling is forbidden: User "makarewicz-openshift" cannot list resource "horizontalpodautoscalers" in API group "autoscaling" in the namespace "openshift-monitoring"
    Error from server (Forbidden): cronjobs.batch is forbidden: User "makarewicz-openshift" cannot list resource "cronjobs" in API group "batch" in the namespace "openshift-monitoring"
    Error from server (Forbidden): jobs.batch is forbidden: User "makarewicz-openshift" cannot list resource "jobs" in API group "batch" in the namespace "openshift-monitoring"
    Error from server (Forbidden): deploymentconfigs.apps.openshift.io is forbidden: User "makarewicz-openshift" cannot list resource "deploymentconfigs" in API group "apps.openshift.io" in the namespace "openshift-monitoring"
    Error from server (Forbidden): buildconfigs.build.openshift.io is forbidden: User "makarewicz-openshift" cannot list resource "buildconfigs" in API group "build.openshift.io" in the namespace "openshift-monitoring"
    Error from server (Forbidden): builds.build.openshift.io is forbidden: User "makarewicz-openshift" cannot list resource "builds" in API group "build.openshift.io" in the namespace "openshift-monitoring"
    Error from server (Forbidden): imagestreams.image.openshift.io is forbidden: User "makarewicz-openshift" cannot list resource "imagestreams" in API group "image.openshift.io" in the namespace "openshift-monitoring"
    Error from server (Forbidden): routes.route.openshift.io is forbidden: User "makarewicz-openshift" cannot list resource "routes" in API group "route.openshift.io" in the namespace "openshift-monitoring"
    Error from server (Forbidden): brokers.eventing.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "brokers" in API group "eventing.knative.dev" in the namespace "openshift-monitoring"
    Error from server (Forbidden): triggers.eventing.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "triggers" in API group "eventing.knative.dev" in the namespace "openshift-monitoring"
    Error from server (Forbidden): eventtypes.eventing.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "eventtypes" in API group "eventing.knative.dev" in the namespace "openshift-monitoring"
    Error from server (Forbidden): kafkasinks.eventing.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "kafkasinks" in API group "eventing.knative.dev" in the namespace "openshift-monitoring"
    Error from server (Forbidden): sequences.flows.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "sequences" in API group "flows.knative.dev" in the namespace "openshift-monitoring"
    Error from server (Forbidden): parallels.flows.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "parallels" in API group "flows.knative.dev" in the namespace "openshift-monitoring"
    Error from server (Forbidden): subscriptions.messaging.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "subscriptions" in API group "messaging.knative.dev" in the namespace "openshift-monitoring"
    Error from server (Forbidden): inmemorychannels.messaging.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "inmemorychannels" in API group "messaging.knative.dev" in the namespace "openshift-monitoring"
    Error from server (Forbidden): channels.messaging.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "channels" in API group "messaging.knative.dev" in the namespace "openshift-monitoring"
    Error from server (Forbidden): kafkachannels.messaging.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "kafkachannels" in API group "messaging.knative.dev" in the namespace "openshift-monitoring"
    Error from server (Forbidden): configurations.serving.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "configurations" in API group "serving.knative.dev" in the namespace "openshift-monitoring"
    Error from server (Forbidden): routes.serving.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "routes" in API group "serving.knative.dev" in the namespace "openshift-monitoring"
    Error from server (Forbidden): services.serving.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "services" in API group "serving.knative.dev" in the namespace "openshift-monitoring"
    Error from server (Forbidden): revisions.serving.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "revisions" in API group "serving.knative.dev" in the namespace "openshift-monitoring"
    Error from server (Forbidden): domainmappings.serving.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "domainmappings" in API group "serving.knative.dev" in the namespace "openshift-monitoring"
    Error from server (Forbidden): pingsources.sources.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "pingsources" in API group "sources.knative.dev" in the namespace "openshift-monitoring"
    Error from server (Forbidden): sinkbindings.sources.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "sinkbindings" in API group "sources.knative.dev" in the namespace "openshift-monitoring"
    Error from server (Forbidden): apiserversources.sources.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "apiserversources" in API group "sources.knative.dev" in the namespace "openshift-monitoring"
    Error from server (Forbidden): containersources.sources.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "containersources" in API group "sources.knative.dev" in the namespace "openshift-monitoring"
    Error from server (Forbidden): kafkasources.sources.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "kafkasources" in API group "sources.knative.dev" in the namespace "openshift-monitoring"
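Finding the namespaces that belong to an operator does not have to rely on guessing names. A ClusterOperator lists the objects it manages in status.relatedObjects, and the sketch below filters that list for namespaces. The function names operator_namespaces and inspect_operator are hypothetical, and the account running them is assumed to have read access to the operator namespaces, which the user in the transcript above does not have.

```shell
# Sketch only: assumes `oc` is logged in with sufficient read permissions.

# Print the namespaces that a ClusterOperator reports in status.relatedObjects.
# Argument: ClusterOperator name, for example "monitoring".
operator_namespaces() {
  oc get clusteroperator "$1" \
    -o jsonpath='{range .status.relatedObjects[*]}{.resource}{"\t"}{.name}{"\n"}{end}' |
    awk -F '\t' '$1 == "namespaces" { print $2 }'
}

# Show workloads and recent events in each of those namespaces.
inspect_operator() {
  for ns in $(operator_namespaces "$1"); do
    echo "== namespace: $ns"
    oc get pods,deployments,daemonsets -n "$ns"
    oc get events -n "$ns" --sort-by=.lastTimestamp
  done
}

# Usage:
#   inspect_operator monitoring
# To check permissions first:
#   oc auth can-i list pods -n openshift-monitoring
```

Filtering in awk rather than inside the jsonpath expression keeps the query simple, at the cost of one extra process.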
Verifying Cluster Versions
- oc get clusterversion shows details about the current version of the cluster that is used
- oc describe clusterversion shows more details about versions of the different components
- oc version shows the OpenShift version, the Kubernetes version, as well as the client version
    $ oc get clusterversion
    NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.13.1    True        False         59d     Cluster version is 4.13.1
    $ oc describe clusterversion
    Name:         version
    Namespace:
    Labels:       <none>
    Annotations:  <none>
    API Version:  config.openshift.io/v1
    Kind:         ClusterVersion
    Metadata:
      Creation Timestamp:  2022-11-08T19:49:58Z
      Generation:          12
      Resource Version:    1072696093
      UID:                 02fed286-c1d7-4b17-923d-54cb4de01696
    Spec:
      Channel:     fast-4.13
      Cluster ID:  8a263e14-10dc-4a52-925b-2d53054945a2
      Desired Update:
        Version:  4.13.1
    Status:
      Available Updates:
        Channels:
          candidate-4.13
          candidate-4.14
          fast-4.13
        Image:    quay.io/openshift-release-dev/ocp-release@sha256:3ca57045e070978b38c36d4c98e188795a6cb4b128130f9c8d7a08b47c133aba
        URL:      https://access.redhat.com/errata/RHSA-2023:4226
        Version:  4.13.6
        Channels:
          candidate-4.13
          candidate-4.14
          fast-4.13
          stable-4.13
        Image:    quay.io/openshift-release-dev/ocp-release@sha256:af19e94813478382e36ae1fa2ae7bbbff1f903dded6180f4eb0624afe6fc6cd4
        URL:      https://access.redhat.com/errata/RHSA-2023:4091
        Version:  4.13.5
        Channels:
          candidate-4.13
          candidate-4.14
          fast-4.13
          stable-4.13
        Image:    quay.io/openshift-release-dev/ocp-release@sha256:e3fb8ace9881ae5428ae7f0ac93a51e3daa71fa215b5299cd3209e134cadfc9c
        URL:      https://access.redhat.com/errata/RHSA-2023:3614
        Version:  4.13.4
        Channels:
          candidate-4.13
          candidate-4.14
          fast-4.13
          stable-4.13
        Image:    quay.io/openshift-release-dev/ocp-release@sha256:bc9835804046aa844c874d2cc37387ec95fe7e87d8ce96129fba78d465c932fa
        URL:      https://access.redhat.com/errata/RHSA-2023:3537
        Version:  4.13.3
        Channels:
          candidate-4.13
          candidate-4.14
          fast-4.13
          stable-4.13
        Image:    quay.io/openshift-release-dev/ocp-release@sha256:6ef3cf4bed1970d547dce08a6e334b675d361b212427c4493151dcad6e093d27
        URL:      https://access.redhat.com/errata/RHSA-2023:3367
        Version:  4.13.2
      Capabilities:
        Enabled Capabilities:
          CSISnapshot
          Console
          Insights
          NodeTuning
          Storage
          baremetal
          marketplace
          openshift-samples
        Known Capabilities:
          CSISnapshot
          Console
          Insights
          NodeTuning
          Storage
          baremetal
          marketplace
          openshift-samples
      Conditional Updates:
        Conditions:
          Last Transition Time:  2023-07-30T14:52:52Z
          Message:               The update is recommended, because none of the conditional update risks apply to this cluster.
          Reason:                AsExpected
          Status:                True
          Type:                  Recommended
        Release:
          Channels:
            candidate-4.13
            candidate-4.14
            fast-4.13
            stable-4.13
          Image:    quay.io/openshift-release-dev/ocp-release@sha256:e3fb8ace9881ae5428ae7f0ac93a51e3daa71fa215b5299cd3209e134cadfc9c
          URL:      https://access.redhat.com/errata/RHSA-2023:3614
          Version:  4.13.4
        Risks:
          Matching Rules:
            Promql:
              Promql:  group(network_attachment_definition_instances > 0) or 0 * group(network_attachment_definition_instances)
            Type:      PromQL
          Message:     Upgrade can get stuck on clusters that have multiple network attachments.
          Name:        MultiNetworkAttachmentsWhereaboutsVersion
          URL:         https://access.redhat.com/solutions/7024726
        Conditions:
          Last Transition Time:  2023-07-30T14:52:52Z
          Message:               The update is recommended, because none of the conditional update risks apply to this cluster.
          Reason:                AsExpected
          Status:                True
          Type:                  Recommended
        Release:
          Channels:
            candidate-4.13
            candidate-4.14
            fast-4.13
            stable-4.13
          Image:    quay.io/openshift-release-dev/ocp-release@sha256:bc9835804046aa844c874d2cc37387ec95fe7e87d8ce96129fba78d465c932fa
          URL:      https://access.redhat.com/errata/RHSA-2023:3537
          Version:  4.13.3
        Risks:
          Matching Rules:
            Promql:
              Promql:  group(network_attachment_definition_instances > 0) or 0 * group(network_attachment_definition_instances)
            Type:      PromQL
          Message:     Upgrade can get stuck on clusters that have multiple network attachments.
          Name:        MultiNetworkAttachmentsWhereaboutsVersion
          URL:         https://access.redhat.com/solutions/7024726
        Conditions:
          Last Transition Time:  2023-07-30T14:52:52Z
          Message:               The update is recommended, because none of the conditional update risks apply to this cluster.
          Reason:                AsExpected
          Status:                True
          Type:                  Recommended
        Release:
          Channels:
            candidate-4.13
            candidate-4.14
            fast-4.13
            stable-4.13
          Image:    quay.io/openshift-release-dev/ocp-release@sha256:6ef3cf4bed1970d547dce08a6e334b675d361b212427c4493151dcad6e093d27
          URL:      https://access.redhat.com/errata/RHSA-2023:3367
          Version:  4.13.2
        Risks:
          Matching Rules:
            Promql:
              Promql:  group(network_attachment_definition_instances > 0) or 0 * group(network_attachment_definition_instances)
            Type:      PromQL
          Message:     Upgrade can get stuck on clusters that have multiple network attachments.
          Name:        MultiNetworkAttachmentsWhereaboutsVersion
          URL:         https://access.redhat.com/solutions/7024726
      Conditions:
        Last Transition Time:  2023-01-21T01:57:45Z
        Status:                True
        Type:                  RetrievedUpdates
        Last Transition Time:  2023-05-31T23:04:38Z
        Message:               Capabilities match configured spec
        Reason:                AsExpected
        Status:                False
        Type:                  ImplicitlyEnabledCapabilities
        Last Transition Time:  2023-05-31T23:04:26Z
        Message:               Payload loaded version="4.13.1" image="quay.io/openshift-release-dev/ocp-release@sha256:9c92b5ec203ee7f81626cc4e9f02086484056a76548961e5895916f136302b1f" architecture="amd64"
        Reason:                PayloadLoaded
        Status:                True
        Type:                  ReleaseAccepted
        Last Transition Time:  2022-11-08T20:11:43Z
        Message:               Done applying 4.13.1
        Status:                True
        Type:                  Available
        Last Transition Time:  2023-07-28T20:56:25Z
        Status:                False
        Type:                  Failing
        Last Transition Time:  2023-06-01T00:29:13Z
        Message:               Cluster version is 4.13.1
        Status:                False
        Type:                  Progressing
        Last Transition Time:  2023-05-31T23:36:38Z
        Message:               Cluster operator operator-lifecycle-manager should not be upgraded between minor versions: ClusterServiceVersions blocking cluster upgrade: openshift-operators/openshift-pipelines-operator-rh.v1.10.2 is incompatible with OpenShift minor versions greater than 4.13
        Reason:                IncompatibleOperatorsInstalled
        Status:                False
        Type:                  Upgradeable
      Desired:
        Channels:
          candidate-4.13
          candidate-4.14
          fast-4.13
          stable-4.13
        Image:    quay.io/openshift-release-dev/ocp-release@sha256:9c92b5ec203ee7f81626cc4e9f02086484056a76548961e5895916f136302b1f
        URL:      https://access.redhat.com/errata/RHSA-2023:3304
        Version:  4.13.1
      History:
        Completion Time:    2023-06-01T00:29:13Z
        Image:              quay.io/openshift-release-dev/ocp-release@sha256:9c92b5ec203ee7f81626cc4e9f02086484056a76548961e5895916f136302b1f
        Started Time:       2023-05-31T23:04:26Z
        State:              Completed
        Verified:           true
        Version:            4.13.1
        Completion Time:    2023-05-27T07:59:25Z
        Image:              quay.io/openshift-release-dev/ocp-release@sha256:7ca5f8aa44bbc537c5a985a523d87365eab3f6e72abc50b7be4caae741e093f4
        Started Time:       2023-05-27T03:06:28Z
        State:              Completed
        Verified:           true
        Version:            4.12.17
        Completion Time:    2023-04-04T02:20:05Z
        Image:              quay.io/openshift-release-dev/ocp-release@sha256:96bf74ce789ccb22391deea98e0c5050c41b67cc17defbb38089d32226dba0b8
        Started Time:       2023-04-04T01:02:02Z
        State:              Completed
        Verified:           true
        Version:            4.12.9
        Completion Time:    2023-01-21T02:26:02Z
        Image:              quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18
        Started Time:       2023-01-21T01:05:32Z
        State:              Completed
        Verified:           true
        Version:            4.12.0
        Completion Time:    2022-11-22T19:11:59Z
        Image:              quay.io/openshift-release-dev/ocp-release@sha256:9ffb17b909a4fdef5324ba45ec6dd282985dd49d25b933ea401873183ef20bf8
        Started Time:       2022-11-22T18:15:57Z
        State:              Completed
        Verified:           true
        Version:            4.11.13
        Completion Time:    2022-11-08T20:11:43Z
        Image:              quay.io/openshift-release-dev/ocp-release@sha256:1ce5676839bca4f389cdc1c3ddc1a78ab033d4c554453ca7ef61a23e34da0803
        Started Time:       2022-11-08T19:50:00Z
        State:              Completed
        Verified:           false
        Version:            4.11.3
      Observed Generation:  12
      Version Hash:         aIi9LgsoS9c=
    Events:                 <none>
    $ oc version
    Client Version: 4.13.0-202303241616.p0.g92b1a3d.assembly.stream-92b1a3d
    Kustomize Version: v4.5.7
    Server Version: 4.13.1
    Kubernetes Version: v1.26.3+b404935
Understanding Nodes
- OpenShift worker nodes run CoreOS
- CoreOS is a minimized operating system that is managed like a container
  - No direct modifications allowed
- Most services on the CoreOS node run as containers
  - Investigate like any other container
- Some services are managed by systemd
  - CRI-O is the container engine that is required to run the containers
  - kubelet is the interface that allows the OpenShift cluster to schedule containers on top of the container engine
Investigating Node Logs
- oc adm node-logs nodename will show logs generated by a CoreOS node
- oc adm node-logs -u crio nodename will show logs generated by the CRI-O service
- oc adm node-logs -u kubelet nodename will show logs generated by the kubelet service
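The commands above can be wrapped in a small helper that saves the kubelet and CRI-O journals of one node to files for later inspection. This is a minimal sketch: the function name collect_node_logs, the default output directory, and the file naming are assumptions made for illustration, not part of oc.

```shell
# Sketch: save the journal of selected systemd units from one node.
# collect_node_logs is a hypothetical helper name, not an oc subcommand.
collect_node_logs() {
  local node="$1"
  local outdir="${2:-/tmp/node-logs}"   # assumed default location
  local unit
  mkdir -p "$outdir" || return 1
  for unit in kubelet crio; do
    # Same command as in the list above, redirected to one file per unit.
    oc adm node-logs -u "$unit" "$node" > "$outdir/$node-$unit.log" || return 1
  done
  # Print the directory so the caller knows where the files ended up.
  echo "$outdir"
}
```

Calling collect_node_logs nodename /tmp/node-logs would then leave nodename-kubelet.log and nodename-crio.log in that directory.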
Opening a Shell on a Node
- Opening a shell session on a node is not always necessary in a managed, full-stack automation OpenShift cluster, because of how the cluster is provided within the cloud
- Use oc debug node/nodename to open a debug shell on a node
  - The debug shell mounts the node root file system at the /host folder, which allows you to inspect files from the node
  - To run host binaries, use chroot /host
  - Notice that the host is running a minimal operating system and does not provide access to all Linux tools
  - Use systemctl status kubelet or systemctl status crio to investigate the status of these vital services
  - Use crictl ps for low-level information about CRI-O containers
- If the control plane is not running, you cannot use oc debug node
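The debug shell does not have to be interactive. As a sketch, the service check described above can be run in one go by passing a command after the double dash; node_service_state is a hypothetical helper name, and systemctl is-active is used here instead of systemctl status because it prints just one state line per unit.

```shell
# Sketch: report whether kubelet and CRI-O are active on a node.
# node_service_state is a hypothetical helper name.
node_service_state() {
  local node="$1"
  # Everything after "--" runs inside the debug pod; chroot /host switches
  # to the node's own root file system so the host's systemctl is used.
  oc debug "node/$node" -- chroot /host systemctl is-active kubelet crio
}
```

Like any oc debug node session, this still depends on a working control plane.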
Using Direct SSH Access
- You should not use direct SSH access
- If you want to do it anyway, look up the SSH keys; in some deployment scenarios they are stored on the client machine
- On CRC, use ssh -i ~/.crc/machines/crc/id_rsa coreos@$(crc ip) to open a shell as user coreos
Cluster Scaling
- Manual or automatic cluster scaling works through the Machine API
- Installing an additional worker node is not considered scaling!
- The Machine API is a standard component that runs as an operator
- This operator provides controllers that interact with cluster resources
- In a full-stack automated environment, the operator communicates with the provider to take care of cluster scaling
Machine API Custom Resources
- Machines are the compute units in the cluster
- MachineSets describe groups of machines, but not control plane nodes
  - MachineSets are to machines what replica sets are to Pods
  - A MachineSet includes labels that allow you to work with regions, zones, and instance types
  - When deployed to a public cloud, you’ll typically get one machine set per availability zone
- MachineHealthChecks verify the health of a machine and take action if required
Manually Scaling Machines
- Manually scaling the number of machines works in two ways:
  - Use oc scale
  - Change the number of replicas in the machine set resource
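Both approaches can be sketched as small shell helpers. The function names are hypothetical, and the openshift-machine-api namespace is assumed to be where the machine sets live, which is the default on a standard installation. The second helper changes spec.replicas with oc patch instead of an interactive edit.

```shell
# Namespace assumed to hold the Machine API resources (default install).
MACHINE_API_NS="openshift-machine-api"

# Accept only non-negative integers as a replica count.
is_count() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Way 1: oc scale. scale_machineset is a hypothetical helper name.
scale_machineset() {
  local name="$1" replicas="$2"
  is_count "$replicas" || { echo "replicas must be a non-negative integer" >&2; return 2; }
  oc scale machineset "$name" --replicas="$replicas" -n "$MACHINE_API_NS"
}

# Way 2: change spec.replicas in the machine set resource itself.
patch_machineset_replicas() {
  local name="$1" replicas="$2"
  is_count "$replicas" || { echo "replicas must be a non-negative integer" >&2; return 2; }
  oc patch machineset "$name" -n "$MACHINE_API_NS" --type merge \
    -p "{\"spec\":{\"replicas\":${replicas}}}"
}
```

Running oc get machinesets -n openshift-machine-api first shows the names that can be passed to either helper.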
Automatic Scaling
- Automatic scaling in full-stack automation requires two custom resources:
  - MachineAutoscaler
  - ClusterAutoscaler
- The Machine API operator must be operational in order to configure any type of scaling
- The machine autoscaler automatically scales the number of replicas based on load
- The cluster autoscaler enforces limits for the entire cluster
  - MaxNodesTotal sets the maximum number of nodes
  - MaxMemoryTotal sets the maximum amount of memory
Implementing AutoScaler
- To implement autoscaling, the following requirements must be met:
  - The cluster is deployed in full-stack automation
  - There is a cluster autoscaler resource
    - Set scaleDown to enabled: true to allow for downscaling as well
  - At least one machine autoscaler resource exists
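The two resources could be generated along the following lines. This is a sketch under assumptions: the API versions and field names (resourceLimits.maxNodesTotal, scaleDown.enabled, scaleTargetRef) reflect the autoscaling.openshift.io API as commonly documented for OpenShift 4.x and are worth checking with oc explain on your cluster, and the function names and all values are placeholders.

```shell
# Print a ClusterAutoscaler manifest with a cluster-wide node limit.
# cluster_autoscaler_manifest is a hypothetical helper name.
cluster_autoscaler_manifest() {
  local max_nodes="$1"
  cat <<EOF
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  resourceLimits:
    maxNodesTotal: ${max_nodes}
  scaleDown:
    enabled: true
EOF
}

# Print a MachineAutoscaler manifest that targets one machine set.
machine_autoscaler_manifest() {
  local machineset="$1" min="$2" max="$3"
  cat <<EOF
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: ${machineset}
  namespace: openshift-machine-api
spec:
  minReplicas: ${min}
  maxReplicas: ${max}
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: ${machineset}
EOF
}
```

Either manifest can be reviewed on screen first and then piped to oc apply -f - once the values look right.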
 
Cluster Updates
- OpenShift 4.x offers Over-the-Air (OTA) upgrades
- The OTA software distribution system manages controller manifests, cluster roles and other resources necessary to update a cluster
- OTA is offered as a service on https://cloud.redhat.com, which provides a web interface to easily perform the update
- OTA requires the cluster to have a persistent connection to the Internet
How OTA Works
- Prometheus-based telemetry is used to determine the update path
- Supported operators can be automatically updated
- Future versions will allow Independent Software Vendor (ISV) operators to be updated in this way as well
- From the cloud.redhat.com interface, an update channel can be selected to determine the version of OpenShift to update to
- Notice that rollback is not supported
OTA Update Flow
- First, all operators need to be updated to the newer version
- Next, the CoreOS images can be updated
  - The node will first pull the new image
  - Next, the image is written to disk
  - Then the bootloader is changed to boot the new image
  - To complete, the CoreOS machine reboots
Manually Updating the Cluster
- oc get clusterversion will show the current version
- oc adm upgrade will show if an upgrade is available
- oc adm upgrade --to-latest=true will upgrade to the latest version
- oc adm upgrade --to=version will upgrade to a specific version
- oc get clusterversion allows you to verify the update
- oc get clusteroperators will show if operators are in the right state
- oc describe clusterversion will show an overview of past upgrades
```shell
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.1    True        False         59d     Cluster version is 4.13.1
$ oc adm upgrade
Cluster version is 4.13.1
Upgradeable=False
  Reason: IncompatibleOperatorsInstalled
  Message: Cluster operator operator-lifecycle-manager should not be upgraded between minor versions: ClusterServiceVersions blocking cluster upgrade: openshift-operators/openshift-pipelines-operator-rh.v1.10.2 is incompatible with OpenShift minor versions greater than 4.13
Upstream is unset, so the cluster will use an appropriate default.
Channel: fast-4.13 (available channels: candidate-4.13, candidate-4.14, fast-4.13, stable-4.13)
Recommended updates:
  VERSION     IMAGE
  4.13.6      quay.io/openshift-release-dev/ocp-release@sha256:3ca57045e070978b38c36d4c98e188795a6cb4b128130f9c8d7a08b47c133aba
  4.13.5      quay.io/openshift-release-dev/ocp-release@sha256:af19e94813478382e36ae1fa2ae7bbbff1f903dded6180f4eb0624afe6fc6cd4
  4.13.4      quay.io/openshift-release-dev/ocp-release@sha256:e3fb8ace9881ae5428ae7f0ac93a51e3daa71fa215b5299cd3209e134cadfc9c
  4.13.3      quay.io/openshift-release-dev/ocp-release@sha256:bc9835804046aa844c874d2cc37387ec95fe7e87d8ce96129fba78d465c932fa
  4.13.2      quay.io/openshift-release-dev/ocp-release@sha256:6ef3cf4bed1970d547dce08a6e334b675d361b212427c4493151dcad6e093d27
$ oc adm upgrade --to-latest=true
error: Unable to upgrade: clusterversions.config.openshift.io "version" is forbidden: User "makarewicz-openshift" cannot patch resource "clusterversions" in API group "config.openshift.io" at the cluster scope
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.1    True        False         59d     Cluster version is 4.13.1
```
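Before starting an update, it can help to confirm that every cluster operator is in the right state. The following sketch filters the output of oc get clusteroperators; the helper names are hypothetical, and the filter assumes the usual column order NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED, so an operator with an empty VERSION column would be misread.

```shell
# List operators that are not Available=True, Progressing=False, Degraded=False.
# unhealthy_operators is a hypothetical helper name.
unhealthy_operators() {
  oc get clusteroperators --no-headers |
    awk '$3 != "True" || $4 != "False" || $5 != "False" { print $1 }'
}

# Return success only when no operator needs attention.
ready_for_update() {
  local bad
  bad="$(unhealthy_operators)"
  if [ -n "$bad" ]; then
    echo "operators needing attention:" >&2
    echo "$bad" >&2
    return 1
  fi
  return 0
}
```

A possible use is ready_for_update && oc adm upgrade, so the list of available updates is only consulted when the operators look healthy.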
Lab: Monitoring Cluster Health
- Use the appropriate tools to create a full cluster health report, and write the result of these commands to the file /tmp/health.txt
```shell
$ oc get nodes > /tmp/health.txt
$ oc describe nodes
Name:               localhost
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    disktype=nvme
                    kubernetes.io/hostname=localhost
Annotations:        volumes.kubernetes.io/controller-managed-attach-detach=true
CreationTimestamp:  Sat, 22 Jul 2023 20:54:32 +0200
Taints:             <none>
Unschedulable:      false
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  OutOfDisk        False   Sun, 30 Jul 2023 20:16:27 +0200   Sat, 22 Jul 2023 20:54:22 +0200   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure   False   Sun, 30 Jul 2023 20:16:27 +0200   Sat, 22 Jul 2023 20:54:22 +0200   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Sun, 30 Jul 2023 20:16:27 +0200   Sat, 22 Jul 2023 20:54:22 +0200   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Sun, 30 Jul 2023 20:16:27 +0200   Sat, 22 Jul 2023 20:54:22 +0200   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Sun, 30 Jul 2023 20:16:27 +0200   Sat, 22 Jul 2023 20:54:22 +0200   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  172.30.9.22
  Hostname:    localhost
Capacity:
 cpu:            8
 hugepages-1Gi:  0
 hugepages-2Mi:  0
 memory:         7981844Ki
 pods:           250
Allocatable:
 cpu:            8
 hugepages-1Gi:  0
 hugepages-2Mi:  0
 memory:         7879444Ki
 pods:           250
System Info:
 Machine ID:                     a37388a4746444f1b3f079f777748845
 System UUID:                    6099DE02-9EA8-C210-7553-A7697F2C302A
 Boot ID:                        7c076895-85e9-45ce-ae2c-8bbe7127be73
 Kernel Version:                 3.10.0-1160.92.1.el7.x86_64
 OS Image:                       CentOS Linux 7 (Core)
 Operating System:               linux
 Architecture:                   amd64
 Container Runtime Version:      docker://24.0.3
 Kubelet Version:                v1.11.0+d4cacc0
 Kube-Proxy Version:             v1.11.0+d4cacc0
Non-terminated Pods:             (45 in total)
  Namespace                      Name                                                       CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------                      ----                                                       ------------  ----------  ---------------  -------------
  debug                          dnginx-88c7766dd-hlbtd                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                        bitginx-1-jzk9r                                            0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                        busybox                                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                        docker-registry-1-ctgff                                    100m (1%)     0 (0%)      256Mi (3%)       0 (0%)
  default                        lab4pod                                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                        linginx1-dc9f65f54-6zw8j                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                        linginx2-69bf6fc66b-mv6wx                                  0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                        nginx                                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                        nginx-cm                                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                        pv-pod                                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                        router-1-k8zgt                                             100m (1%)     0 (0%)      256Mi (3%)       0 (0%)
  default                        test1                                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-dns                       kube-dns-t727w                                             0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-proxy                     kube-proxy-cr7kh                                           0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system                    kube-controller-manager-localhost                          0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system                    kube-scheduler-localhost                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system                    master-api-localhost                                       0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system                    master-etcd-localhost                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)
  limits                         nee-597889d8c7-p6tc2                                       0 (0%)        0 (0%)      0 (0%)           0 (0%)
  love                           anti1                                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)
  love                           newpod-1-qgmpj                                             0 (0%)        0 (0%)      0 (0%)           0 (0%)
  love                           runonssd                                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)
  network-security               nginxlab-1-bcgkt                                           0 (0%)        0 (0%)      0 (0%)           0 (0%)
  nodesel                        simple-6f55965d79-mklpc                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)
  nodesel                        simple-6f55965d79-q8pq9                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-apiserver            openshift-apiserver-thwpd                                  0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-controller-manager   openshift-controller-manager-c9ms5                         0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-core-operators       openshift-service-cert-signer-operator-6d477f986b-jzcgw    0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-core-operators       openshift-web-console-operator-664b974ff5-px7gw            0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-service-cert-signer  apiservice-cabundle-injector-8ffbbb6dc-x9l4r               0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-service-cert-signer  service-serving-cert-signer-668c45d5f-lxvff                0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-web-console          webconsole-78f59b4bfb-qqv4p                                100m (1%)     0 (0%)      100Mi (1%)       0 (0%)
  quota-test                     bitginx-84b698ff5c-h5j57                                   10m (0%)      50m (0%)    5Mi (0%)         20Mi (0%)
  quota-test                     bitginx-84b698ff5c-qxwd5                                   10m (0%)      50m (0%)    5Mi (0%)         20Mi (0%)
  quota-test                     bitginx-84b698ff5c-xfnp4                                   10m (0%)      50m (0%)    5Mi (0%)         20Mi (0%)
  source-project                 nginx-access                                               0 (0%)        0 (0%)      0 (0%)           0 (0%)
  source-project                 nginx-noaccess                                             0 (0%)        0 (0%)      0 (0%)           0 (0%)
  target-project                 nginx-target-1-9kdn6                                       0 (0%)        0 (0%)      0 (0%)           0 (0%)
  template-project               anti1                                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)
  test-project                   nginxmany-5859c9dbb6-5xxr6                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)
  test-project                   nginxmany-5859c9dbb6-6ljwm                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)
  test-project                   nginxmany-5859c9dbb6-9lv6c                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)
  test-project                   nginxmany-5859c9dbb6-dgr7k                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)
  test-project                   nginxmany-5859c9dbb6-dl8wr                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)
  test-project                   nginxmany-5859c9dbb6-hk2sm                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource  Requests    Limits
  --------  --------    ------
  cpu       330m (4%)   150m (1%)
  memory    627Mi (8%)  60Mi (0%)
Events:     <none>
$ oc describe nodes >> /tmp/health.txt
$ oc get co >> /tmp/health.txt
$ oc get clusterversion >> /tmp/health.txt
```


