Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Followon to #14559 #14685

Merged
merged 3 commits into from
Jun 27, 2022
Merged

Followon to #14559 #14685

merged 3 commits into from
Jun 27, 2022

Conversation

mheon
Copy link
Member

@mheon mheon commented Jun 21, 2022

Taking over #14559 from @vrothberg and trying to get CI to pass.

No changes (besides a rebase) yes, I just want to verify we have the same symptoms.

Fix a bug in the wait logic to enable Podman on Gitlab runner.

@openshift-ci openshift-ci bot added release-note approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jun 21, 2022
@Conan-Kudo
Copy link

The commit message saying "gitlab:" as the prefix makes no sense. Can you please drop it?

@mheon
Copy link
Member Author

mheon commented Jun 22, 2022

Got inside a CI VM to test. I was only able to reproduce an actual timeout once, and it was on podman rm -t 0, not podman kill. The cause was one of more stuck cleanup processes holding container locks. Still unclear as to why they were stuck, and how that could affect podman kill as it does in the CI runs (cleanup process should only be running after kill?).

@mheon
Copy link
Member Author

mheon commented Jun 22, 2022

ID'd some potential issues in the DB. Have not had a chance to test, but I'll let CI bake overnight.

@mheon
Copy link
Member Author

mheon commented Jun 22, 2022

Hot damn it's actually passing (minus a flake).

I'll clean this up in the morning, then we can get final review + merge done.

Copy link
Member

@giuseppe giuseppe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two non-blocking nits, otherwise LGTM

libpod/boltdb_state.go Outdated Show resolved Hide resolved
libpod/boltdb_state.go Show resolved Hide resolved
We should just silently fall through.  The log was flooding the
system-service logs when running Gitlab runner.

Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
This commit addresses three intertwined bugs to fix an issue when using
Gitlab runner on Podman.  The three bug fixes are not split into
separate commits as tests won't pass otherwise; avoidable noise when
bisecting future issues.

1) Podman conflated states: even when asking to wait for the `exited`
   state, Podman returned as soon as a container transitioned to
   `stopped`.  The issues surfaced in Gitlab tests to fail [1] as
   `conmon`'s buffers have not (yet) been emptied when attaching to a
   container right after a wait.  The race window was extremely narrow,
   and I only managed to reproduce with the Gitlab runner [1] unit
   tests.

2) The clearer separation between `exited` and `stopped` revealed a race
   condition predating the changes.  If a container is configured for
   autoremoval (e.g., via `run --rm`), the "run" process competes with
   the "cleanup" process running in the background.  The window of the
   race condition was sufficiently large that the "cleanup" process has
   already removed the container and storage before the "run" process
   could read the exit code and hence waited indefinitely.

   Address the exit-code race condition by recording exit codes in the
   main libpod database.  Exit codes can now be read from a database.
   When waiting for a container to exit, Podman first waits for the
   container to transition to `exited` and will then query the database
   for its exit code. Outdated exit codes are pruned during cleanup
   (i.e., non-performance critical) and when refreshing the database
   after a reboot.  An exit code is considered outdated when it is older
   than 5 minutes.

   While the race condition predates this change, the waiting process
   has apparently always been fast enough in catching the exit code due
   to issue 1): `exited` and `stopped` were conflated.  The waiting
   process hence caught the exit code after the container transitioned
   to `stopped` but before it `exited` and got removed.

3) With 1) and 2), Podman is now waiting for a container to properly
   transition to the `exited` state.  Some tests did not pass after 1)
   and 2) which revealed the third bug: `conmon` was executed with its
   working directory pointing to the OCI runtime bundle of the
   container.  The changed working directory broke resolving relative
   paths in the "cleanup" process.  The "cleanup" process error'ed
   before actually cleaning up the container and waiting "main" process
   ran indefinitely - or until hitting a timeout.  Fix the issue by
   executing `conmon` with the same working directory as Podman.

Note that fixing 3) *may* address a number of issues we have seen in the
past where for *some* reason cleanup processes did not fire.

[1] https://gitlab.com/gitlab-org/gitlab-runner/-/issues/27119#note_970712864

Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>

[MH: Minor reword of commit message]

Signed-off-by: Matthew Heon <mheon@redhat.com>
@mheon
Copy link
Member Author

mheon commented Jun 23, 2022

Force-pushed, should be ready now.

@mheon
Copy link
Member Author

mheon commented Jun 23, 2022

@baude @Luap99 @rhatdan PTAL

@Luap99
Copy link
Member

Luap99 commented Jun 23, 2022

build is failing

Firstly: don't prune exit codes after a refresh - instead, clear
the table entirely. We are guaranteed that all containers are
gone after a refresh, we should not worry about exit codes given
this.

Secondly: alter the way pruning was done. We were updating the DB
by calling Update from within an existing View, and stacking an
RW transaction on top of an existing RO one seems dodgy; further,
modifying a bucket while iterating over it with ForEach is
undefined behavior.

Hopefully this will resolve our CI issues.

Signed-off-by: Matthew Heon <mheon@redhat.com>
@mheon
Copy link
Member Author

mheon commented Jun 23, 2022

Oops, fixed

@mheon
Copy link
Member Author

mheon commented Jun 23, 2022

Ready to merge

Copy link
Member

@vrothberg vrothberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Thanks for kicking it over the finish line, @mheon. Great work!

@mheon
Copy link
Member Author

mheon commented Jun 24, 2022

@baude @Luap99 @rhatdan PTAL and merge

@vrothberg
Copy link
Member

@giuseppe PTAL

Copy link
Member

@giuseppe giuseppe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 27, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 27, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe, mheon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot merged commit 3176b3f into containers:main Jun 27, 2022
@runiq
Copy link

runiq commented Jun 27, 2022

As an outsider: Good job, everyone! :) Looking forward to (eventually) using Podman in Gitlab.

@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 21, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 21, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. release-note
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants