
upgrade from v3.1.2: cni plugin bridge failed: failed to allocate #13679

Closed
edsantiago opened this issue Mar 28, 2022 · 5 comments · Fixed by #13692
Labels: flakes (Flakes from Continuous Integration), locked - please file new issue/PR

Comments

edsantiago (Member) commented Mar 28, 2022

This one isn't new -- the first instance I see is from January -- but it's starting to happen daily

[+0033s] not ok 10 network - restart
         # (from function `die' in file test/upgrade/../system/[helpers.bash, line 500](https://github.com/containers/podman/blob/640c2d53a88f46e997d4e5a594cfc85a57e74d36/test/system/helpers.bash#L500),
         #  from function `run_podman' in file test/upgrade/../system/[helpers.bash, line 219](https://github.com/containers/podman/blob/640c2d53a88f46e997d4e5a594cfc85a57e74d36/test/system/helpers.bash#L219),
         #  in test file test/upgrade/[test-upgrade.bats, line 254](https://github.com/containers/podman/blob/640c2d53a88f46e997d4e5a594cfc85a57e74d36/test/system/test-upgrade.bats#L254))
         #   `run_podman start myrunningcontainer' failed with status 125
         # # podman stop -t0 myrunningcontainer
         # myrunningcontainer
         # # podman start myrunningcontainer
         # Error: unable to start container "c1e9132d70f2879283fe46d4e9aa5e4a0740766c3c2d38c3b61f38a2db7c13a6": plugin type="bridge" failed (add): cni plugin bridge failed: failed to allocate for range 0: 10.89.0.2 has been allocated to c1e9132d70f2879283fe46d4e9aa5e4a0740766c3c2d38c3b61f38a2db7c13a6, duplicate allocation is not allowed
         # [ rc=125 (** EXPECTED 0 **) ]

[Upgrade] 12 exec

@edsantiago edsantiago added the flakes Flakes from Continuous Integration label Mar 28, 2022

Luap99 commented Mar 28, 2022

Yeah, I think there was a reason why I added `podman start && podman stop` instead of `restart`.
I will take a look.


Luap99 commented Mar 28, 2022

Good news: I can reproduce it. Bad news: the upgrade test is badly broken (it hangs) when the host uses netavark, because the old podman will obviously still use CNI.


Luap99 commented Mar 29, 2022

I think I understand the root cause now.

Podman 4 uses a new db structure for networks, which Podman 3 can no longer read.
To understand how this flakes, we need to look at how podman container cleanup works. Running podman stop always causes a race between the podman container cleanup process and podman stop: both processes try to clean up the mounts/networks, but only one can do it (it is a locked operation). Normally this is not a problem, since both processes use the same version. In this particular test setup, however, the cleanup process is spawned with the old podman version.
On a slow system, such as in CI, the stop process usually wins and network cleanup works. On a fast system with many cores, however, the cleanup process wins, and the test fails 100% of the time for me locally.
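The race can be sketched with a small `flock` toy (illustrative only: the lock file, labels, and messages are made up for the sketch and are not podman's actual locking):

```shell
# Toy model of the stop-vs-cleanup race: two processes contend for one
# exclusive lock; the winner does the network teardown, the loser skips it.
lockfile=$(mktemp)
results=$(mktemp)

teardown() {
  (
    if flock -n 9; then
      echo "$1: tearing down network" >>"$results"
      sleep 1   # hold the lock while the other process attempts it
    else
      echo "$1: network already cleaned up, skipping" >>"$results"
    fi
  ) 9>"$lockfile"
}

teardown "podman stop" &
teardown "container cleanup" &
wait
cat "$results"
```

Which process wins depends purely on scheduling; exactly one of them performs the teardown, which is why the failure only shows up when the "wrong" (old-podman) process wins.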

One fix would be to remove the network connect/disconnect test that runs before the stop/start, since it triggers the v4 network db migration, but I do not want that because connect/disconnect should be tested. This would also explain why the upgrade tests from 2.X are not flaking: we skip network connect/disconnect there because it was only added in 3.0.

@mheon Any ideas on whether we could influence the behaviour between stop and cleanup so that stop always wins?

If that is not possible, we can manually do a network teardown with network disconnect before stopping; that should also work.

Luap99 added a commit to Luap99/libpod that referenced this issue Mar 29, 2022
With Podman 4 we support netavark; however, old versions will still use
CNI. Since netavark and CNI can conflict, we should not mix them.
Remove the network setup from the initial podman command and create the
directories manually to prevent such conflicts.

Also, the update to 4.0 changes the network db structure. While the
migration from 3.X to 4.0 is compatible, it fails the other way around.
In this test that happens because the cleanup process still uses the old
podman while the network connect/disconnect test has already changed the
db format. Therefore the cleanup process cannot see any networks and
does not tear them down. The following start then fails because the IP
address is already assigned.

Fixes containers#13679

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Luap99 added a commit to Luap99/libpod that referenced this issue Mar 29, 2022

mheon commented Mar 29, 2022

Usually podman stop will always win because it has a higher priority (assuming it is run interactively from a keyboard). However, if it's run from a script, we lose that.

Maybe we could deliberately nice our podman cleanup processes, to try to guarantee that other Podman processes get CPU time first? It's not a guarantee, but it's better than nothing.
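A minimal sketch of the renice idea (hypothetical: podman spawns its cleanup process internally, so this only demonstrates the `nice` mechanics, not how podman would actually wire it up):

```shell
# Spawn a child at a lower scheduling priority as a stand-in for the
# cleanup process; foreground work then tends to get CPU time first.
# `nice` with no operands prints the current niceness of the child.
cleanup_nice=$(nice -n 10 sh -c 'nice')
echo "cleanup would run at niceness $cleanup_nice"
```

Note that niceness only biases the scheduler; it does not serialize the two processes, so the lock-based skip path would still need to stay in place.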

Luap99 added a commit to Luap99/libpod that referenced this issue Mar 29, 2022

Luap99 commented Mar 29, 2022

> Usually podman stop will always win because it has a higher priority (assuming it is run keyboard-interactive). However, if run from a script, we lose that.

I cannot confirm this on my laptop; if I run `podman --log-level debug stop` interactively, it always shows `Network is already cleaned up, skipping...`

Anyway, I fixed it in the test; I don't think this is a real-world problem.

mheon pushed a commit to mheon/libpod that referenced this issue Mar 30, 2022
mheon pushed a commit to mheon/libpod that referenced this issue Mar 30, 2022
gbraad pushed a commit to gbraad-redhat/podman that referenced this issue Jul 13, 2022
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 20, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 20, 2023