Cluster Troubleshooting
- An OpenShift cluster has two focal areas for troubleshooting
- OpenShift operators are cluster applications that can be monitored and fixed like any other application that runs in OpenShift
- OpenShift nodes can be monitored individually
- Other problems may come from version mismatches
Verifying Node Health
oc get nodes
is a good first step to investigate the current health of nodes
- Any status other than Ready means that the node is not available to the control plane
oc adm top nodes
shows current node resource usage, based on statistics gathered by the metrics server
oc describe node
may be used to investigate recent events and resource usage
Events
shows an event log
Allocated resources
gives an overview of allocated resources and requests
Capacity
shows available capacity
Non-terminated Pods
shows the Pods currently running on the node
$ oc whoami
system:admin
$ oc get nodes
NAME        STATUS    ROLES     AGE       VERSION
localhost   Ready     <none>    7d        v1.11.0+d4cacc0
$ oc adm top nodes
Error from server (NotFound): the server could not find the requested resource (get services https:heapster:)
$ oc describe node | less
$ oc describe node
Name:               localhost
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    disktype=nvme
                    kubernetes.io/hostname=localhost
Annotations:        volumes.kubernetes.io/controller-managed-attach-detach=true
CreationTimestamp:  Sat, 22 Jul 2023 20:54:32 +0200
Taints:             <none>
Unschedulable:      false
Conditions:
  Type            Status  LastHeartbeatTime                 LastTransitionTime                Reason                      Message
  ----            ------  -----------------                 ------------------                ------                      -------
  OutOfDisk       False   Sun, 30 Jul 2023 15:33:30 +0200   Sat, 22 Jul 2023 20:54:22 +0200   KubeletHasSufficientDisk    kubelet has sufficient disk space available
  MemoryPressure  False   Sun, 30 Jul 2023 15:33:30 +0200   Sat, 22 Jul 2023 20:54:22 +0200   KubeletHasSufficientMemory  kubelet has sufficient memory available
  DiskPressure    False   Sun, 30 Jul 2023 15:33:30 +0200   Sat, 22 Jul 2023 20:54:22 +0200   KubeletHasNoDiskPressure    kubelet has no disk pressure
  PIDPressure     False   Sun, 30 Jul 2023 15:33:30 +0200   Sat, 22 Jul 2023 20:54:22 +0200   KubeletHasSufficientPID     kubelet has sufficient PID available
  Ready           True    Sun, 30 Jul 2023 15:33:30 +0200   Sat, 22 Jul 2023 20:54:22 +0200   KubeletReady                kubelet is posting ready status
Addresses:
  InternalIP:  172.30.9.22
  Hostname:    localhost
Capacity:
  cpu:            8
  hugepages-1Gi:  0
  hugepages-2Mi:  0
  memory:         7981844Ki
  pods:           250
Allocatable:
  cpu:            8
  hugepages-1Gi:  0
  hugepages-2Mi:  0
  memory:         7879444Ki
  pods:           250
System Info:
  Machine ID:                 a37388a4746444f1b3f079f777748845
  System UUID:                6099DE02-9EA8-C210-7553-A7697F2C302A
  Boot ID:                    7c076895-85e9-45ce-ae2c-8bbe7127be73
  Kernel Version:             3.10.0-1160.92.1.el7.x86_64
  OS Image:                   CentOS Linux 7 (Core)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://24.0.3
  Kubelet Version:            v1.11.0+d4cacc0
  Kube-Proxy Version:         v1.11.0+d4cacc0
Non-terminated Pods:          (45 in total)
  Namespace                      Name                                                     CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------                      ----                                                     ------------  ----------  ---------------  -------------
  debug                          dnginx-88c7766dd-hlbtd                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                        bitginx-1-jzk9r                                          0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                        busybox                                                  0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                        docker-registry-1-ctgff                                  100m (1%)     0 (0%)      256Mi (3%)       0 (0%)
  default                        lab4pod                                                  0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                        linginx1-dc9f65f54-6zw8j                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                        linginx2-69bf6fc66b-mv6wx                                0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                        nginx                                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                        nginx-cm                                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                        pv-pod                                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                        router-1-k8zgt                                           100m (1%)     0 (0%)      256Mi (3%)       0 (0%)
  default                        test1                                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-dns                       kube-dns-t727w                                           0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-proxy                     kube-proxy-cr7kh                                         0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system                    kube-controller-manager-localhost                        0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system                    kube-scheduler-localhost                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system                    master-api-localhost                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system                    master-etcd-localhost                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)
  limits                         nee-597889d8c7-p6tc2                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)
  love                           anti1                                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)
  love                           newpod-1-qgmpj                                           0 (0%)        0 (0%)      0 (0%)           0 (0%)
  love                           runonssd                                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)
  network-security               nginxlab-1-bcgkt                                         0 (0%)        0 (0%)      0 (0%)           0 (0%)
  nodesel                        simple-6f55965d79-mklpc                                  0 (0%)        0 (0%)      0 (0%)           0 (0%)
  nodesel                        simple-6f55965d79-q8pq9                                  0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-apiserver            openshift-apiserver-thwpd                                0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-controller-manager   openshift-controller-manager-c9ms5                       0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-core-operators       openshift-service-cert-signer-operator-6d477f986b-jzcgw  0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-core-operators       openshift-web-console-operator-664b974ff5-px7gw          0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-service-cert-signer  apiservice-cabundle-injector-8ffbbb6dc-x9l4r             0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-service-cert-signer  service-serving-cert-signer-668c45d5f-lxvff              0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-web-console          webconsole-78f59b4bfb-qqv4p                              100m (1%)     0 (0%)      100Mi (1%)       0 (0%)
  quota-test                     bitginx-84b698ff5c-h5j57                                 10m (0%)      50m (0%)    5Mi (0%)         20Mi (0%)
  quota-test                     bitginx-84b698ff5c-qxwd5                                 10m (0%)      50m (0%)    5Mi (0%)         20Mi (0%)
  quota-test                     bitginx-84b698ff5c-xfnp4                                 10m (0%)      50m (0%)    5Mi (0%)         20Mi (0%)
  source-project                 nginx-access                                             0 (0%)        0 (0%)      0 (0%)           0 (0%)
  source-project                 nginx-noaccess                                           0 (0%)        0 (0%)      0 (0%)           0 (0%)
  target-project                 nginx-target-1-9kdn6                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)
  template-project               anti1                                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)
  test-project                   nginxmany-5859c9dbb6-5xxr6                               0 (0%)        0 (0%)      0 (0%)           0 (0%)
  test-project                   nginxmany-5859c9dbb6-6ljwm                               0 (0%)        0 (0%)      0 (0%)           0 (0%)
  test-project                   nginxmany-5859c9dbb6-9lv6c                               0 (0%)        0 (0%)      0 (0%)           0 (0%)
  test-project                   nginxmany-5859c9dbb6-dgr7k                               0 (0%)        0 (0%)      0 (0%)           0 (0%)
  test-project                   nginxmany-5859c9dbb6-dl8wr                               0 (0%)        0 (0%)      0 (0%)           0 (0%)
  test-project                   nginxmany-5859c9dbb6-hk2sm                               0 (0%)        0 (0%)      0 (0%)           0 (0%)
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource  Requests    Limits
  --------  --------    ------
  cpu       330m (4%)   150m (1%)
  memory    627Mi (8%)  60Mi (0%)
Events:  <none>
Monitoring Operators
- Operators are the programs that are responsible for starting the different components running in the cluster
- These components are started by operators as DaemonSets or Deployments
oc get clusteroperators
shows the current status of operators
- If an operator is in the Progressing state, it is currently being updated
- If an operator is in the Degraded state, something is wrong and further investigation is required
Analyzing Operators
ClusterOperator
resources are non-namespaced
- Each operator starts its resources in dedicated namespaces
- Some operators use one namespace, some operators use more
- If an operator shows a degraded status in
oc get co
, investigate resources running in its namespace and use common tools to check their status
$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.13.1    True        False         False      3d22h
baremetal                                  4.13.1    True        False         False      263d
cloud-controller-manager                   4.13.1    True        False         False      263d
cloud-credential                           4.13.1    True        False         False      263d
cluster-autoscaler                         4.13.1    True        False         False      263d
config-operator                            4.13.1    True        False         False      263d
console                                    4.13.1    True        False         False      84d
control-plane-machine-set                  4.13.1    True        False         False      190d
csi-snapshot-controller                    4.13.1    True        False         False      263d
dns                                        4.13.1    True        False         False      133d
etcd                                       4.13.1    True        False         False      133d
image-registry                             4.13.1    True        False         False      190d
ingress                                    4.13.1    True        False         False      59d
insights                                   4.13.1    True        False         False      133d
kube-apiserver                             4.13.1    True        False         False      263d
kube-controller-manager                    4.13.1    True        False         False      263d
kube-scheduler                             4.13.1    True        False         False      263d
kube-storage-version-migrator              4.13.1    True        False         False      59d
machine-api                                4.13.1    True        False         False      263d
machine-approver                           4.13.1    True        False         False      263d
machine-config                             4.13.1    True        False         False      41h
marketplace                                4.13.1    True        False         False      263d
monitoring                                 4.13.1    True        False         False      133d
network                                    4.13.1    True        False         False      263d
node-tuning                                4.13.1    True        False         False      59d
openshift-apiserver                        4.13.1    True        False         False      10d
openshift-controller-manager               4.13.1    True        False         False      190d
openshift-samples                          4.13.1    True        False         False      59d
operator-lifecycle-manager                 4.13.1    True        False         False      263d
operator-lifecycle-manager-catalog         4.13.1    True        False         False      263d
operator-lifecycle-manager-packageserver   4.13.1    True        False         False      131d
service-ca                                 4.13.1    True        False         False      263d
storage                                    4.13.1    True        False         False      263d
$ oc get ns | grep monitoring
Error from server (Forbidden): namespaces is forbidden: User "makarewicz-openshift" cannot list resource "namespaces" in API group "" at the cluster scope
$ oc get ns
Error from server (Forbidden): namespaces is forbidden: User "makarewicz-openshift" cannot list resource "namespaces" in API group "" at the cluster scope
$ oc get all -n openshift-monitoring
Error from server (Forbidden): pods is forbidden: User "makarewicz-openshift" cannot list resource "pods" in API group "" in the namespace "openshift-monitoring"
Error from server (Forbidden): replicationcontrollers is forbidden: User "makarewicz-openshift" cannot list resource "replicationcontrollers" in API group "" in the namespace "openshift-monitoring"
Error from server (Forbidden): services is forbidden: User "makarewicz-openshift" cannot list resource "services" in API group "" in the namespace "openshift-monitoring"
Error from server (Forbidden): daemonsets.apps is forbidden: User "makarewicz-openshift" cannot list resource "daemonsets" in API group "apps" in the namespace "openshift-monitoring"
Error from server (Forbidden): deployments.apps is forbidden: User "makarewicz-openshift" cannot list resource "deployments" in API group "apps" in the namespace "openshift-monitoring"
Error from server (Forbidden): replicasets.apps is forbidden: User "makarewicz-openshift" cannot list resource "replicasets" in API group "apps" in the namespace "openshift-monitoring"
Error from server (Forbidden): statefulsets.apps is forbidden: User "makarewicz-openshift" cannot list resource "statefulsets" in API group "apps" in the namespace "openshift-monitoring"
Error from server (Forbidden): horizontalpodautoscalers.autoscaling is forbidden: User "makarewicz-openshift" cannot list resource "horizontalpodautoscalers" in API group "autoscaling" in the namespace "openshift-monitoring"
Error from server (Forbidden): cronjobs.batch is forbidden: User "makarewicz-openshift" cannot list resource "cronjobs" in API group "batch" in the namespace "openshift-monitoring"
Error from server (Forbidden): jobs.batch is forbidden: User "makarewicz-openshift" cannot list resource "jobs" in API group "batch" in the namespace "openshift-monitoring"
Error from server (Forbidden): deploymentconfigs.apps.openshift.io is forbidden: User "makarewicz-openshift" cannot list resource "deploymentconfigs" in API group "apps.openshift.io" in the namespace "openshift-monitoring"
Error from server (Forbidden): buildconfigs.build.openshift.io is forbidden: User "makarewicz-openshift" cannot list resource "buildconfigs" in API group "build.openshift.io" in the namespace "openshift-monitoring"
Error from server (Forbidden): builds.build.openshift.io is forbidden: User "makarewicz-openshift" cannot list resource "builds" in API group "build.openshift.io" in the namespace "openshift-monitoring"
Error from server (Forbidden): imagestreams.image.openshift.io is forbidden: User "makarewicz-openshift" cannot list resource "imagestreams" in API group "image.openshift.io" in the namespace "openshift-monitoring"
Error from server (Forbidden): routes.route.openshift.io is forbidden: User "makarewicz-openshift" cannot list resource "routes" in API group "route.openshift.io" in the namespace "openshift-monitoring"
Error from server (Forbidden): brokers.eventing.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "brokers" in API group "eventing.knative.dev" in the namespace "openshift-monitoring"
Error from server (Forbidden): triggers.eventing.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "triggers" in API group "eventing.knative.dev" in the namespace "openshift-monitoring"
Error from server (Forbidden): eventtypes.eventing.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "eventtypes" in API group "eventing.knative.dev" in the namespace "openshift-monitoring"
Error from server (Forbidden): kafkasinks.eventing.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "kafkasinks" in API group "eventing.knative.dev" in the namespace "openshift-monitoring"
Error from server (Forbidden): sequences.flows.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "sequences" in API group "flows.knative.dev" in the namespace "openshift-monitoring"
Error from server (Forbidden): parallels.flows.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "parallels" in API group "flows.knative.dev" in the namespace "openshift-monitoring"
Error from server (Forbidden): subscriptions.messaging.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "subscriptions" in API group "messaging.knative.dev" in the namespace "openshift-monitoring"
Error from server (Forbidden): inmemorychannels.messaging.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "inmemorychannels" in API group "messaging.knative.dev" in the namespace "openshift-monitoring"
Error from server (Forbidden): channels.messaging.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "channels" in API group "messaging.knative.dev" in the namespace "openshift-monitoring"
Error from server (Forbidden): kafkachannels.messaging.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "kafkachannels" in API group "messaging.knative.dev" in the namespace "openshift-monitoring"
Error from server (Forbidden): configurations.serving.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "configurations" in API group "serving.knative.dev" in the namespace "openshift-monitoring"
Error from server (Forbidden): routes.serving.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "routes" in API group "serving.knative.dev" in the namespace "openshift-monitoring"
Error from server (Forbidden): services.serving.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "services" in API group "serving.knative.dev" in the namespace "openshift-monitoring"
Error from server (Forbidden): revisions.serving.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "revisions" in API group "serving.knative.dev" in the namespace "openshift-monitoring"
Error from server (Forbidden): domainmappings.serving.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "domainmappings" in API group "serving.knative.dev" in the namespace "openshift-monitoring"
Error from server (Forbidden): pingsources.sources.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "pingsources" in API group "sources.knative.dev" in the namespace "openshift-monitoring"
Error from server (Forbidden): sinkbindings.sources.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "sinkbindings" in API group "sources.knative.dev" in the namespace "openshift-monitoring"
Error from server (Forbidden): apiserversources.sources.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "apiserversources" in API group "sources.knative.dev" in the namespace "openshift-monitoring"
Error from server (Forbidden): containersources.sources.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "containersources" in API group "sources.knative.dev" in the namespace "openshift-monitoring"
Error from server (Forbidden): kafkasources.sources.knative.dev is forbidden: User "makarewicz-openshift" cannot list resource "kafkasources" in API group "sources.knative.dev" in the namespace "openshift-monitoring"
Verifying Cluster Versions
oc get clusterversion
shows details about the current version of the cluster that is used
oc describe clusterversion
shows more details about the versions of the different components
oc version
shows the OpenShift version, the Kubernetes version, and the client version
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.1    True        False         59d     Cluster version is 4.13.1
$ oc describe clusterversion
Name:         version
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterVersion
Metadata:
  Creation Timestamp:  2022-11-08T19:49:58Z
  Generation:          12
  Resource Version:    1072696093
  UID:                 02fed286-c1d7-4b17-923d-54cb4de01696
Spec:
  Channel:     fast-4.13
  Cluster ID:  8a263e14-10dc-4a52-925b-2d53054945a2
  Desired Update:
    Version:  4.13.1
Status:
  Available Updates:
    Channels:
      candidate-4.13
      candidate-4.14
      fast-4.13
    Image:    quay.io/openshift-release-dev/ocp-release@sha256:3ca57045e070978b38c36d4c98e188795a6cb4b128130f9c8d7a08b47c133aba
    URL:      https://access.redhat.com/errata/RHSA-2023:4226
    Version:  4.13.6
    Channels:
      candidate-4.13
      candidate-4.14
      fast-4.13
      stable-4.13
    Image:    quay.io/openshift-release-dev/ocp-release@sha256:af19e94813478382e36ae1fa2ae7bbbff1f903dded6180f4eb0624afe6fc6cd4
    URL:      https://access.redhat.com/errata/RHSA-2023:4091
    Version:  4.13.5
    Channels:
      candidate-4.13
      candidate-4.14
      fast-4.13
      stable-4.13
    Image:    quay.io/openshift-release-dev/ocp-release@sha256:e3fb8ace9881ae5428ae7f0ac93a51e3daa71fa215b5299cd3209e134cadfc9c
    URL:      https://access.redhat.com/errata/RHSA-2023:3614
    Version:  4.13.4
    Channels:
      candidate-4.13
      candidate-4.14
      fast-4.13
      stable-4.13
    Image:    quay.io/openshift-release-dev/ocp-release@sha256:bc9835804046aa844c874d2cc37387ec95fe7e87d8ce96129fba78d465c932fa
    URL:      https://access.redhat.com/errata/RHSA-2023:3537
    Version:  4.13.3
    Channels:
      candidate-4.13
      candidate-4.14
      fast-4.13
      stable-4.13
    Image:    quay.io/openshift-release-dev/ocp-release@sha256:6ef3cf4bed1970d547dce08a6e334b675d361b212427c4493151dcad6e093d27
    URL:      https://access.redhat.com/errata/RHSA-2023:3367
    Version:  4.13.2
  Capabilities:
    Enabled Capabilities:
      CSISnapshot
      Console
      Insights
      NodeTuning
      Storage
      baremetal
      marketplace
      openshift-samples
    Known Capabilities:
      CSISnapshot
      Console
      Insights
      NodeTuning
      Storage
      baremetal
      marketplace
      openshift-samples
  Conditional Updates:
    Conditions:
      Last Transition Time:  2023-07-30T14:52:52Z
      Message:               The update is recommended, because none of the conditional update risks apply to this cluster.
      Reason:                AsExpected
      Status:                True
      Type:                  Recommended
    Release:
      Channels:
        candidate-4.13
        candidate-4.14
        fast-4.13
        stable-4.13
      Image:    quay.io/openshift-release-dev/ocp-release@sha256:e3fb8ace9881ae5428ae7f0ac93a51e3daa71fa215b5299cd3209e134cadfc9c
      URL:      https://access.redhat.com/errata/RHSA-2023:3614
      Version:  4.13.4
    Risks:
      Matching Rules:
        Promql:
          Promql:  group(network_attachment_definition_instances > 0) or 0 * group(network_attachment_definition_instances)
        Type:      PromQL
      Message:     Upgrade can get stuck on clusters that have multiple network attachments.
      Name:        MultiNetworkAttachmentsWhereaboutsVersion
      URL:         https://access.redhat.com/solutions/7024726
    Conditions:
      Last Transition Time:  2023-07-30T14:52:52Z
      Message:               The update is recommended, because none of the conditional update risks apply to this cluster.
      Reason:                AsExpected
      Status:                True
      Type:                  Recommended
    Release:
      Channels:
        candidate-4.13
        candidate-4.14
        fast-4.13
        stable-4.13
      Image:    quay.io/openshift-release-dev/ocp-release@sha256:bc9835804046aa844c874d2cc37387ec95fe7e87d8ce96129fba78d465c932fa
      URL:      https://access.redhat.com/errata/RHSA-2023:3537
      Version:  4.13.3
    Risks:
      Matching Rules:
        Promql:
          Promql:  group(network_attachment_definition_instances > 0) or 0 * group(network_attachment_definition_instances)
        Type:      PromQL
      Message:     Upgrade can get stuck on clusters that have multiple network attachments.
      Name:        MultiNetworkAttachmentsWhereaboutsVersion
      URL:         https://access.redhat.com/solutions/7024726
    Conditions:
      Last Transition Time:  2023-07-30T14:52:52Z
      Message:               The update is recommended, because none of the conditional update risks apply to this cluster.
      Reason:                AsExpected
      Status:                True
      Type:                  Recommended
    Release:
      Channels:
        candidate-4.13
        candidate-4.14
        fast-4.13
        stable-4.13
      Image:    quay.io/openshift-release-dev/ocp-release@sha256:6ef3cf4bed1970d547dce08a6e334b675d361b212427c4493151dcad6e093d27
      URL:      https://access.redhat.com/errata/RHSA-2023:3367
      Version:  4.13.2
    Risks:
      Matching Rules:
        Promql:
          Promql:  group(network_attachment_definition_instances > 0) or 0 * group(network_attachment_definition_instances)
        Type:      PromQL
      Message:     Upgrade can get stuck on clusters that have multiple network attachments.
      Name:        MultiNetworkAttachmentsWhereaboutsVersion
      URL:         https://access.redhat.com/solutions/7024726
  Conditions:
    Last Transition Time:  2023-01-21T01:57:45Z
    Status:                True
    Type:                  RetrievedUpdates
    Last Transition Time:  2023-05-31T23:04:38Z
    Message:               Capabilities match configured spec
    Reason:                AsExpected
    Status:                False
    Type:                  ImplicitlyEnabledCapabilities
    Last Transition Time:  2023-05-31T23:04:26Z
    Message:               Payload loaded version="4.13.1" image="quay.io/openshift-release-dev/ocp-release@sha256:9c92b5ec203ee7f81626cc4e9f02086484056a76548961e5895916f136302b1f" architecture="amd64"
    Reason:                PayloadLoaded
    Status:                True
    Type:                  ReleaseAccepted
    Last Transition Time:  2022-11-08T20:11:43Z
    Message:               Done applying 4.13.1
    Status:                True
    Type:                  Available
    Last Transition Time:  2023-07-28T20:56:25Z
    Status:                False
    Type:                  Failing
    Last Transition Time:  2023-06-01T00:29:13Z
    Message:               Cluster version is 4.13.1
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2023-05-31T23:36:38Z
    Message:               Cluster operator operator-lifecycle-manager should not be upgraded between minor versions: ClusterServiceVersions blocking cluster upgrade: openshift-operators/openshift-pipelines-operator-rh.v1.10.2 is incompatible with OpenShift minor versions greater than 4.13
    Reason:                IncompatibleOperatorsInstalled
    Status:                False
    Type:                  Upgradeable
  Desired:
    Channels:
      candidate-4.13
      candidate-4.14
      fast-4.13
      stable-4.13
    Image:    quay.io/openshift-release-dev/ocp-release@sha256:9c92b5ec203ee7f81626cc4e9f02086484056a76548961e5895916f136302b1f
    URL:      https://access.redhat.com/errata/RHSA-2023:3304
    Version:  4.13.1
  History:
    Completion Time:  2023-06-01T00:29:13Z
    Image:            quay.io/openshift-release-dev/ocp-release@sha256:9c92b5ec203ee7f81626cc4e9f02086484056a76548961e5895916f136302b1f
    Started Time:     2023-05-31T23:04:26Z
    State:            Completed
    Verified:         true
    Version:          4.13.1
    Completion Time:  2023-05-27T07:59:25Z
    Image:            quay.io/openshift-release-dev/ocp-release@sha256:7ca5f8aa44bbc537c5a985a523d87365eab3f6e72abc50b7be4caae741e093f4
    Started Time:     2023-05-27T03:06:28Z
    State:            Completed
    Verified:         true
    Version:          4.12.17
    Completion Time:  2023-04-04T02:20:05Z
    Image:            quay.io/openshift-release-dev/ocp-release@sha256:96bf74ce789ccb22391deea98e0c5050c41b67cc17defbb38089d32226dba0b8
    Started Time:     2023-04-04T01:02:02Z
    State:            Completed
    Verified:         true
    Version:          4.12.9
    Completion Time:  2023-01-21T02:26:02Z
    Image:            quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18
    Started Time:     2023-01-21T01:05:32Z
    State:            Completed
    Verified:         true
    Version:          4.12.0
    Completion Time:  2022-11-22T19:11:59Z
    Image:            quay.io/openshift-release-dev/ocp-release@sha256:9ffb17b909a4fdef5324ba45ec6dd282985dd49d25b933ea401873183ef20bf8
    Started Time:     2022-11-22T18:15:57Z
    State:            Completed
    Verified:         true
    Version:          4.11.13
    Completion Time:  2022-11-08T20:11:43Z
    Image:            quay.io/openshift-release-dev/ocp-release@sha256:1ce5676839bca4f389cdc1c3ddc1a78ab033d4c554453ca7ef61a23e34da0803
    Started Time:     2022-11-08T19:50:00Z
    State:            Completed
    Verified:         false
    Version:          4.11.3
  Observed Generation:  12
  Version Hash:         aIi9LgsoS9c=
Events:                 <none>
$ oc version
Client Version: 4.13.0-202303241616.p0.g92b1a3d.assembly.stream-92b1a3d
Kustomize Version: v4.5.7
Server Version: 4.13.1
Kubernetes Version: v1.26.3+b404935
Understanding Nodes
- OpenShift worker nodes run CoreOS
- CoreOS is a minimized operating system that is managed like a container
- No direct modifications allowed
- Most services on the CoreOS node run as containers
- Investigate like any other container
- Some services are managed by systemd
- CRI-O is the container engine that is required to run the containers
- kubelet is the interface that allows the OpenShift cluster to schedule containers on top of the container engine
Investigating Node Logs
oc adm node-logs nodename
will show logs generated by a CoreOS node
oc adm node-logs -u crio nodename
will show logs generated by the CRI-O service
oc adm node-logs -u kubelet nodename
will show logs generated by the kubelet service
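The log commands above can be combined into a quick node-level log check. These commands need a running cluster and appropriate privileges; my-node is a placeholder for a real node name from oc get nodes:

```shell
# List node names first; replace "my-node" below with a real name
oc get nodes

# All journal logs from the CoreOS node
oc adm node-logs my-node

# Only the CRI-O container engine unit
oc adm node-logs -u crio my-node

# Only the kubelet unit, limited to the most recent lines
oc adm node-logs -u kubelet --tail=20 my-node
```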
Opening a Shell on a Node
- Opening a shell session to nodes in a managed full-stack automation OpenShift cluster is rarely necessary, because the cluster nodes are managed by the cloud platform
- Use
oc debug node/nodename
to open a debug shell on a node
- The debug shell mounts the node root file system at the /host folder, which allows you to inspect files from the node
- To run host binaries, use
chroot /host
- Notice that the host is running a minimal operating system and does not provide access to all Linux tools
- Use
systemctl status kubelet
or
systemctl status crio
to investigate the status of these vital services
- Use
crictl ps
for low-level information about CRI-O containers
- If the control plane is not running, you cannot use
oc debug node
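A typical debug session following the steps above might look like this; the node name is a placeholder, and the session requires a running control plane:

```shell
# Open a debug pod on the node
oc debug node/my-node

# Inside the debug shell: make the node root file system the current root
chroot /host

# Check the vital node services
systemctl status kubelet
systemctl status crio

# Low-level container information from the CRI-O runtime
crictl ps
```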
Using Direct SSH Access
- You should not use direct SSH access
- If you want to do it anyway, in some deployment scenarios the SSH keys are stored on the client machine
- On CRC, use
ssh -i ~/.crc/machines/crc/id_rsa coreos@$(crc ip)
to open a shell as user coreos
Cluster Scaling
- Manual or automatic cluster scaling works through the Machine API
- Installing an additional worker node is not considered scaling!
- The Machine API is a standard component that runs as an operator
- This operator provides controllers that interact with cluster resources
- In a full-stack automated environment, it communicates with the provider to take care of cluster scaling
Machine API Custom Resources
- Machines are the compute units in the cluster
- MachineSets describe groups of machines, but not control plane nodes
- MachineSets are to Machines what ReplicaSets are to Pods
- A MachineSet includes labels that allow you to work with regions, zones, and instance types
- When deployed to public cloud, you’ll typically get one machine set per availability zone
- MachineHealthChecks verify the health of a machine and take action if required
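As a sketch of what a MachineHealthCheck looks like, the following resource watches worker machines and remediates any machine whose node stays NotReady for five minutes. The name, label selector, and threshold values are illustrative assumptions, not values from this document:

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example-healthcheck          # hypothetical name
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker
  unhealthyConditions:
    - type: Ready
      status: "False"
      timeout: 300s                  # remediate after 5 minutes NotReady
  maxUnhealthy: 40%                  # stop remediating if too many machines are unhealthy
```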
Manually Scaling Machines
- Manually scaling the number of machines works in two ways:
- Use
oc scale
- Change the number of replicas in the machine set resource
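A minimal sketch of both approaches; the machine set name is a placeholder that depends on your cluster, and the commands require a live cluster:

```shell
# List the machine sets to find their names
oc get machinesets -n openshift-machine-api

# Option 1: scale directly
oc scale machineset my-machineset --replicas=3 -n openshift-machine-api

# Option 2: edit spec.replicas in the MachineSet resource
oc edit machineset my-machineset -n openshift-machine-api
```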
Automatic scaling
- Automatic scaling in full-stack automation requires two custom resources:
MachineAutoscaler
ClusterAutoscaler
- The Machine API operator must be operational in order to configure any type of scaling
- The machine autoscaler automatically scales the number of replicas based on load
- The cluster autoscaler enforces limits for the entire cluster
MaxNodesTotal
sets the maximum number of nodes
MaxMemoryTotal
sets the maximum amount of memory
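A minimal ClusterAutoscaler sketch; the ClusterAutoscaler resource must be named default, and the limit values below are illustrative assumptions:

```yaml
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default              # the ClusterAutoscaler must be named "default"
spec:
  resourceLimits:
    maxNodesTotal: 10        # cluster-wide node cap
    cores:
      min: 8
      max: 128
    memory:                  # memory limits, in GiB
      min: 4
      max: 256
  scaleDown:
    enabled: true            # allow removing nodes again when load drops
```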
Implementing AutoScaler
- To implement autoscaling, the following requirements must be met:
- The cluster is deployed in full-stack automation
- There is a cluster autoscaler resource
- Set
scaleDown
to
enabled: true
to allow for downscaling as well
- At least one machine autoscaler resource exists
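A matching MachineAutoscaler sketch that targets one machine set; the resource name and the MachineSet name are placeholders:

```yaml
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: worker-autoscaler       # hypothetical name
  namespace: openshift-machine-api
spec:
  minReplicas: 1
  maxReplicas: 4
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: my-machineset         # replace with a real MachineSet name
```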
Cluster Updates
- OpenShift 4.x offers Over-the-Air (OTA) upgrades
- The OTA software distribution system manages controller manifests, cluster roles and other resources necessary to update a cluster
- OTA is offered as a service on https://cloud.redhat.com, which provides a web interface to easily perform the update
- OTA requires the cluster to have a persistent connection to the Internet
How OTA Works
- Prometheus-based telemetry is used to determine the update path
- Supported operators can be automatically updated
- Future versions will include Independent Software Vendor (ISV) operators to be updated in this way as well
- From the cloud.redhat.com interface, an update channel can be selected to determine the version of OpenShift to update to
- Notice that rollbacks are not supported
The OTA Update Flow
- First, all operators need to be updated to the newer version
- Next, the CoreOS images can be updated
- The node will first pull the new image
- Next, the image is written to disk
- Then the bootloader is changed to boot the new image
- To complete, the CoreOS machine reboots
Manually Updating the Cluster
oc get clusterversion
will show the current version
oc adm upgrade
will show if an upgrade is available
oc adm upgrade --to-latest=true
will upgrade to the latest version
oc adm upgrade --to=version
will upgrade to a specific version
oc get clusterversion
allows you to verify the update
oc get clusteroperators
will show if operators are in the right state
oc describe clusterversion
will show an overview of past upgrades
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.1    True        False         59d     Cluster version is 4.13.1
$ oc adm upgrade
Cluster version is 4.13.1

Upgradeable=False

  Reason: IncompatibleOperatorsInstalled
  Message: Cluster operator operator-lifecycle-manager should not be upgraded between minor versions: ClusterServiceVersions blocking cluster upgrade: openshift-operators/openshift-pipelines-operator-rh.v1.10.2 is incompatible with OpenShift minor versions greater than 4.13

Upstream is unset, so the cluster will use an appropriate default.
Channel: fast-4.13 (available channels: candidate-4.13, candidate-4.14, fast-4.13, stable-4.13)

Recommended updates:

  VERSION   IMAGE
  4.13.6    quay.io/openshift-release-dev/ocp-release@sha256:3ca57045e070978b38c36d4c98e188795a6cb4b128130f9c8d7a08b47c133aba
  4.13.5    quay.io/openshift-release-dev/ocp-release@sha256:af19e94813478382e36ae1fa2ae7bbbff1f903dded6180f4eb0624afe6fc6cd4
  4.13.4    quay.io/openshift-release-dev/ocp-release@sha256:e3fb8ace9881ae5428ae7f0ac93a51e3daa71fa215b5299cd3209e134cadfc9c
  4.13.3    quay.io/openshift-release-dev/ocp-release@sha256:bc9835804046aa844c874d2cc37387ec95fe7e87d8ce96129fba78d465c932fa
  4.13.2    quay.io/openshift-release-dev/ocp-release@sha256:6ef3cf4bed1970d547dce08a6e334b675d361b212427c4493151dcad6e093d27

$ oc adm upgrade --to-latest=true
error: Unable to upgrade: clusterversions.config.openshift.io "version" is forbidden: User "makarewicz-openshift" cannot patch resource "clusterversions" in API group "config.openshift.io" at the cluster scope
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.1    True        False         59d     Cluster version is 4.13.1
Lab: Monitoring Cluster Health
- Use the appropriate tools to create a full cluster health report, and write the result of these commands to the file `/tmp/health.txt`
$ oc get nodes > /tmp/health.txt
$ oc describe nodes
Name:               localhost
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    disktype=nvme
                    kubernetes.io/hostname=localhost
Annotations:        volumes.kubernetes.io/controller-managed-attach-detach=true
CreationTimestamp:  Sat, 22 Jul 2023 20:54:32 +0200
Taints:             <none>
Unschedulable:      false
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  OutOfDisk        False   Sun, 30 Jul 2023 20:16:27 +0200   Sat, 22 Jul 2023 20:54:22 +0200   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure   False   Sun, 30 Jul 2023 20:16:27 +0200   Sat, 22 Jul 2023 20:54:22 +0200   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Sun, 30 Jul 2023 20:16:27 +0200   Sat, 22 Jul 2023 20:54:22 +0200   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Sun, 30 Jul 2023 20:16:27 +0200   Sat, 22 Jul 2023 20:54:22 +0200   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Sun, 30 Jul 2023 20:16:27 +0200   Sat, 22 Jul 2023 20:54:22 +0200   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  172.30.9.22
  Hostname:    localhost
Capacity:
  cpu:            8
  hugepages-1Gi:  0
  hugepages-2Mi:  0
  memory:         7981844Ki
  pods:           250
Allocatable:
  cpu:            8
  hugepages-1Gi:  0
  hugepages-2Mi:  0
  memory:         7879444Ki
  pods:           250
System Info:
  Machine ID:                 a37388a4746444f1b3f079f777748845
  System UUID:                6099DE02-9EA8-C210-7553-A7697F2C302A
  Boot ID:                    7c076895-85e9-45ce-ae2c-8bbe7127be73
  Kernel Version:             3.10.0-1160.92.1.el7.x86_64
  OS Image:                   CentOS Linux 7 (Core)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://24.0.3
  Kubelet Version:            v1.11.0+d4cacc0
  Kube-Proxy Version:         v1.11.0+d4cacc0
Non-terminated Pods:          (45 in total)
  Namespace                      Name                                                       CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------                      ----                                                       ------------  ----------  ---------------  -------------
  debug                          dnginx-88c7766dd-hlbtd                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                        bitginx-1-jzk9r                                            0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                        busybox                                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                        docker-registry-1-ctgff                                    100m (1%)     0 (0%)      256Mi (3%)       0 (0%)
  default                        lab4pod                                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                        linginx1-dc9f65f54-6zw8j                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                        linginx2-69bf6fc66b-mv6wx                                  0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                        nginx                                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                        nginx-cm                                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                        pv-pod                                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                        router-1-k8zgt                                             100m (1%)     0 (0%)      256Mi (3%)       0 (0%)
  default                        test1                                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-dns                       kube-dns-t727w                                             0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-proxy                     kube-proxy-cr7kh                                           0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system                    kube-controller-manager-localhost                          0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system                    kube-scheduler-localhost                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system                    master-api-localhost                                       0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system                    master-etcd-localhost                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)
  limits                         nee-597889d8c7-p6tc2                                       0 (0%)        0 (0%)      0 (0%)           0 (0%)
  love                           anti1                                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)
  love                           newpod-1-qgmpj                                             0 (0%)        0 (0%)      0 (0%)           0 (0%)
  love                           runonssd                                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)
  network-security               nginxlab-1-bcgkt                                           0 (0%)        0 (0%)      0 (0%)           0 (0%)
  nodesel                        simple-6f55965d79-mklpc                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)
  nodesel                        simple-6f55965d79-q8pq9                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-apiserver            openshift-apiserver-thwpd                                  0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-controller-manager   openshift-controller-manager-c9ms5                         0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-core-operators       openshift-service-cert-signer-operator-6d477f986b-jzcgw    0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-core-operators       openshift-web-console-operator-664b974ff5-px7gw            0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-service-cert-signer  apiservice-cabundle-injector-8ffbbb6dc-x9l4r               0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-service-cert-signer  service-serving-cert-signer-668c45d5f-lxvff                0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-web-console          webconsole-78f59b4bfb-qqv4p                                100m (1%)     0 (0%)      100Mi (1%)       0 (0%)
  quota-test                     bitginx-84b698ff5c-h5j57                                   10m (0%)      50m (0%)    5Mi (0%)         20Mi (0%)
  quota-test                     bitginx-84b698ff5c-qxwd5                                   10m (0%)      50m (0%)    5Mi (0%)         20Mi (0%)
  quota-test                     bitginx-84b698ff5c-xfnp4                                   10m (0%)      50m (0%)    5Mi (0%)         20Mi (0%)
  source-project                 nginx-access                                               0 (0%)        0 (0%)      0 (0%)           0 (0%)
  source-project                 nginx-noaccess                                             0 (0%)        0 (0%)      0 (0%)           0 (0%)
  target-project                 nginx-target-1-9kdn6                                       0 (0%)        0 (0%)      0 (0%)           0 (0%)
  template-project               anti1                                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)
  test-project                   nginxmany-5859c9dbb6-5xxr6                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)
  test-project                   nginxmany-5859c9dbb6-6ljwm                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)
  test-project                   nginxmany-5859c9dbb6-9lv6c                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)
  test-project                   nginxmany-5859c9dbb6-dgr7k                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)
  test-project                   nginxmany-5859c9dbb6-dl8wr                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)
  test-project                   nginxmany-5859c9dbb6-hk2sm                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource  Requests    Limits
  --------  --------    ------
  cpu       330m (4%)   150m (1%)
  memory    627Mi (8%)  60Mi (0%)
Events:  <none>
$ oc describe nodes >> /tmp/health.txt
$ oc get co >> /tmp/health.txt
$ oc get clusterversion >> /tmp/health.txt
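The four commands from the lab can also be collected by a single script, with a header before each section so `/tmp/health.txt` stays readable. A sketch; the `oc` stub function here is an assumption added so the report logic can run without a cluster, and should be removed on a real one:

```shell
#!/bin/sh
# Sketch: build the lab's health report in /tmp/health.txt with a
# labelled section per command.
# Assumption: 'oc' is stubbed so this runs without a cluster; delete
# this function to use the real oc binary on PATH.
oc() { echo "(stub output for: oc $*)"; }

report=/tmp/health.txt
: > "$report"                     # truncate any previous report

for cmd in "get nodes" "describe nodes" "get co" "get clusterversion"; do
    printf '===== oc %s =====\n' "$cmd" >> "$report"
    oc $cmd >> "$report"          # unquoted on purpose: split into args
done

echo "wrote $report"
```

The section headers make it easy to jump to, say, the `oc get co` output when reviewing the report later.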