What happened?
When upgrading a cluster from v2.26 -> v2.27, the container-engine/cri-o role hangs indefinitely waiting for cri-o to start on the first cluster node being upgraded.
What did you expect to happen?
cri-o should successfully upgrade
How can we reproduce it (as minimally and precisely as possible)?
When using cri-o as the container engine, upgrade a cluster from v2.26.0 -> v2.27.0. Upgrading the first node in the cluster should fail because cri-o does not upgrade.
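To confirm you are hitting the same hang, checking the crio unit on the first node being upgraded with standard systemd / cri-tools commands should be enough (nothing kubespray-specific here):

# run on the first node being upgraded
systemctl status crio --no-pager        # is the unit stuck activating?
journalctl -u crio --no-pager -n 100    # recent cri-o logs around the restart
crictl ps -a                            # containers left over from before the upgrade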
OS
Linux 4.18.0-553.33.1.el8_10.x86_64 x86_64
NAME="Red Hat Enterprise Linux"
VERSION="8.10 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.10"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.10 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8"
BUG_REPORT_URL="https://issues.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.10
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.10"
Version of Ansible
(controller)
ansible [core 2.16.14]
config file = /custom/ansible.cfg
configured module search path = ['/custom/library', '/custom/kubespray/library']
ansible python module location = /opt/venv/lib64/python3.12/site-packages/ansible
ansible collection location = /opt/venv/ansible/collections
executable location = /opt/venv/bin/ansible
python version = 3.12.5 (main, Dec 3 2024, 00:00:00) [GCC 11.5.0 20240719 (Red Hat 11.5.0-2)] (/opt/venv/bin/python3.12)
jinja version = 3.1.5
libyaml = True
Version of Python
Python 3.12.5 (controller)
Version of Kubespray (commit)
9ec9b3a
Network plugin used
cilium
Full inventory with variables
we use a custom inventory plugin. Here are the kubespray crio variables set when creating/upgrading a cluster:
Command used to invoke ansible
ansible-playbook -i custom_plugin.yaml --become-method=sudo --become --become-user root upgrade-cluster.yml
Output of ansible run
crio section of the ansible run logs (trying to upgrade the first node):

TASK [container-engine/cri-o : Cri-o | ensure crio service is started and enabled] ***
ok: [my-control01]
Monday 20 January 2025 17:02:53 +0000 (0:00:00.731) 0:07:23.256 ********
^^^ hangs here
journal logs for cri-o:
Anything else we need to know
Looks like the issue is caused by switching cri-o to the crun container runtime in cri-o 1.31 / kubespray v2.27.0 (#11601).
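A quick way to see which runtime cri-o on a node is configured for, assuming the config lives under /etc/crio/ on the node:

# on a cluster node: shows runc before the upgrade and crun after it
grep -R "default_runtime" /etc/crio/
crio --version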
Checking the journal logs for cri-o on the node that failed to restart cri-o, we can see that crio is failing to stop containers. These containers were started using the runc container runtime (and cri-o 1.30.x).

/etc/crio/config.json before the upgrade (kubespray v2.26.0 / crio 1.30.x / runc) and after the upgrade (kubespray v2.27.0 / crio 1.31.x / crun):
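The before/after dumps themselves are not reproduced above; one way to see the mismatch on an affected node is to list the containers still tracked by runc after crio has switched to crun (assuming the default runc state directory /run/runc that cri-o uses for the runc runtime):

# containers created before the upgrade are still owned by runc,
# which matches the journal errors about crio failing to stop them
runc --root /run/runc list
crictl ps -a    # what cri-o itself still reports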
The issue is occurring because the existing containers need to be stopped before /etc/crio/config.json is updated: crio will only stop these containers if the runc config is still in use. So we need to stop any running containers before starting cri-o with the new crio.conf config file.

In the container-engine/cri-o role, the first place crio gets restarted is here:

kubespray/roles/container-engine/cri-o/tasks/main.yaml, line 230 in d2e51e7

This happens after the config files and the crio binaries have already been updated, so stopping crio (and its containers) is done with the updated files.
One suggestion for improving the role: during upgrades, crio should be stopped with the version & config that originally started the containers. This is a safer way to upgrade:

for p in $(crictl pods -q); do if [[ "$(crictl inspectp $p | jq -r .status.linux.namespaces.options.network)" != "NODE" ]]; then crictl rmp -f $p; fi; done
crictl rmp -fa
systemctl stop crio
(update CRI-O here, at this stage)
systemctl restart crio
systemctl restart kubelet
kubectl uncordon

It's better to always stop/start crio in this role: it will always stop any running containers, but it's a safer way to upgrade crio.
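As a stopgap until the role changes, a rough script form of the sequence above, to run on the node being upgraded (NODE_NAME and the exact CRI-O update step are placeholders, not kubespray internals):

#!/usr/bin/env bash
# Manual workaround sketch: stop everything with the OLD crio/runc
# before the new config and binaries are put in place.
set -euo pipefail

# Remove pods while the runc-based config is still active, so the old
# runtime can tear them down (the filter keeps pods that share the
# node's network namespace, as in the list above).
for p in $(crictl pods -q); do
  if [[ "$(crictl inspectp "$p" | jq -r .status.linux.namespaces.options.network)" != "NODE" ]]; then
    crictl rmp -f "$p"
  fi
done
crictl rmp -fa

# Stop crio while the old binary and config are still in place.
systemctl stop crio

# <-- update CRI-O binaries and config at this point -->

systemctl restart crio
systemctl restart kubelet

# Run from a host with kubectl access; NODE_NAME is a placeholder.
# kubectl uncordon "${NODE_NAME}"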