
Improve robustness of pod removal #3082

Merged

Conversation

@mheon (Member) commented May 8, 2019

Removing a pod requires first removing all containers in the pod. Libpod requires the state to remain consistent at all times, so references to a deleted pod must all be cleansed first.

Pods can have many containers in them. We presently iterate through all of them, and if an error occurs trying to clean up and remove any single container, we abort the entire operation (but cannot recover anything already removed - pod removal is not an atomic operation).

Because of this, if a removal error occurs partway through, we can end up with a pod in an inconsistent state that is no longer usable. What's worse, if the error is in the infra container, and it's persistent, we get zombie pods - completely unable to be removed.

When we saw some of these same issues with containers not in pods, we modified the removal code there to aggressively purge containers from the database, then try to clean up afterwards. Take the same approach here, and make cleanup errors nonfatal. Once we've gone ahead and removed containers, we need to see pod deletion through to the end - we'll log errors but keep going.

Also, fix some other small things (most notably, we didn't emit events for the removed containers).
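To make that flow concrete, here is a minimal Go sketch of the approach described above. All names here (removePod, purgeFromDB, cleanup) are hypothetical stand-ins, not the actual libpod code:

```go
package main

import (
	"fmt"
	"log"
)

// container is a minimal stand-in for a libpod container.
type container struct{ id string }

// purgeFromDB and cleanup are hypothetical stand-ins for the real
// libpod database and cleanup logic.
func purgeFromDB(c container) error { return nil }

func cleanup(c container) error {
	return fmt.Errorf("cleanup failed for %s", c.id)
}

// removePod removes every container in the pod, treating cleanup
// errors as nonfatal: log them and keep going, so pod deletion can
// always be seen through to the end.
func removePod(ctrs []container) {
	for _, c := range ctrs {
		// Purge the container from the database first so no stale
		// references to the pod remain, even if cleanup fails.
		if err := purgeFromDB(c); err != nil {
			log.Printf("error removing container %s from database: %v", c.id, err)
		}
		// Cleanup errors are nonfatal: log and keep going.
		if err := cleanup(c); err != nil {
			log.Printf("error cleaning up container %s: %v", c.id, err)
		}
	}
}

func main() {
	removePod([]container{{id: "infra"}, {id: "app"}})
}
```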

@openshift-ci-robot (Collaborator):
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mheon

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot added the approved and size/M labels May 8, 2019
@mheon (Member Author) commented May 8, 2019

I'm thinking about modifying the API to also return a map[string]error for errors removing individual containers, so we can accurately report the errors that occurred as we evicted each container (instead of only logging, as we do now).
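A sketch of that idea (hypothetical types and names, not code from this PR): per-container errors are collected in a map keyed by container ID and handed back to the caller.

```go
package main

import "fmt"

// Pod is a minimal stand-in; the real libpod types are more involved.
type Pod struct{ ctrIDs []string }

// removeContainer simulates a removal that can fail.
func removeContainer(id string) error {
	return fmt.Errorf("simulated removal failure")
}

// removePodContainers sketches the proposed API: per-container
// removal errors are collected in a map keyed by container ID rather
// than only being logged.
func removePodContainers(p *Pod) map[string]error {
	ctrErrors := make(map[string]error)
	for _, id := range p.ctrIDs {
		if err := removeContainer(id); err != nil {
			ctrErrors[id] = err // record the failure, keep removing the rest
		}
	}
	return ctrErrors
}

func main() {
	errs := removePodContainers(&Pod{ctrIDs: []string{"a1b2", "c3d4"}})
	for id, err := range errs {
		fmt.Printf("container %s: %v\n", id, err)
	}
}
```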

Ensure that, if an error occurs somewhere along the way when we
remove a pod, it's preserved until the end and returned, even as
we continue to remove the pod.

Signed-off-by: Matthew Heon <matthew.heon@pm.me>
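The pattern described in that commit, roughly (a simplified sketch, not the exact diff): keep the first error encountered, log any that follow, and return the preserved error once removal has run to completion.

```go
package main

import (
	"errors"
	"log"
)

// removeAll runs every removal step to completion, preserving the
// first error for the caller and logging any that follow.
func removeAll(steps []func() error) error {
	var removalErr error
	for _, step := range steps {
		if err := step(); err != nil {
			if removalErr == nil {
				removalErr = err // keep the first error
			} else {
				log.Printf("additional error during removal: %v", err)
			}
		}
	}
	return removalErr
}

func main() {
	err := removeAll([]func() error{
		func() error { return errors.New("cleanup failed") },
		func() error { return errors.New("storage teardown failed") },
	})
	log.Printf("removal finished with: %v", err)
}
```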
@mheon (Member Author) commented May 8, 2019

Pushed a patch that fixes this without API changes. It's less elegant, but gets the job done.

@mheon (Member Author) commented May 8, 2019

@baude I'm starting to think we need a 1.3.1 - between this and #3073 we merged some pretty major fixes for things getting messed up after a reboot

@mheon (Member Author) commented May 8, 2019

Test issues: this PR bumped CGroup removal from a warning that was never reported to a reported error. It looks like removing cgroupfs CGroups on CI is very unreliable?

@mheon (Member Author) commented May 8, 2019

@cevich I'm seeing the journalctl error messages out of the failed Ubuntu job here, for reference

@cevich (Member) commented May 8, 2019

@mheon thx, useful to know it goes with passing tests also.

(edit) Mmmm, I see it's coming from vendor code as well. I was going to send a PR and even add a unit test, but... sigh, it wouldn't be accepted upstream (most likely).

@giuseppe (Member) commented May 8, 2019

LGTM

 // Clean up network namespace, cgroups, mounts
 if err := ctr.cleanup(ctx); err != nil {
-	return err
+	if removalErr == nil {
+		removalErr = err
+	}
 }
Member: I think it would be good to add a logrus.debug here. Might make it easier later on to track down exactly what went south. Ditto the other "removalErr = err" lines below.
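For illustration, the suggestion amounts to something like this (a sketch only, not the merged code; the message text is made up):

```go
package main

import (
	"errors"

	"github.com/sirupsen/logrus"
)

func main() {
	logrus.SetLevel(logrus.DebugLevel)

	var removalErr error
	if err := errors.New("cleanup failed"); err != nil {
		if removalErr == nil {
			removalErr = err
			// The suggested debug line: record exactly what went south.
			logrus.Debugf("error during pod removal: %v", err)
		}
	}
	logrus.Infof("returning: %v", removalErr)
}
```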

@mheon (Member Author): I think most of these should already be good errors including the container ID, but I'll slap some Wrapfs on them to be sure.

@mheon (Member Author): Alright, I checked, and everything not already wrapped is returning a well-formatted, wrapped error.

Member: @mheon Thanks!

Member: And the errors are clear enough to know that it came from ctr.cleanup failing at line 231 vs ctr.teardownStorage failing at line 240?

@mheon (Member Author): If you grep for the specific strings we wrap with, yes - you'll be able to get the specific function that failed.

Member: OK then, I'll trust you on this one @mheon, thanks for the discussion.

libpod/runtime_pod_linux.go (review thread resolved)

@TomSweeneyRedHat (Member): LGTM

@jwhonce (Member) commented May 8, 2019

/lgtm

@openshift-ci-robot added the lgtm label May 8, 2019
@openshift-merge-robot merged commit 7b54ebb into containers:master May 8, 2019