Skip to content
This repository has been archived by the owner on Mar 9, 2022. It is now read-only.

Use wait instead of TaskExit. #1133

Merged
merged 1 commit into from
Apr 18, 2019
Merged

Conversation

Random-Liu
Copy link
Member

@Random-Liu Random-Liu commented Apr 12, 2019

Fixes #1132.

This PR:

  1. Uses Wait instead of TaskExit event. This should fix issues like Containers can't be stopped because TaskExit event failed to be published for runc.v2. containerd#3177 and Cannot stop the container: stop timeout containerd#3125;
  2. Removes big sandbox and container start lock, because with Wait we can control the timing of exit event generation to avoid race condition. This is very important, because sandbox/container start lock is the last potential long running lock we have today. And in GKE, we do see slow container start holds container lock and blocks docker ps for Docker.

Please note that for containers/sandboxes in unknown state, we still need to rely on containerd TaskExit event, because they were not successfully loaded, there are no corresponding exit monitors running.

Since we'll support containerd 1.2 for a long time, I'd like to cherry-pick this into 1.2.
/cc @crosbymichael

Signed-off-by: Lantao Liu lantaol@google.com

Copy link
Member

@crosbymichael crosbymichael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jterry75
Copy link
Contributor

LGTM! This is huge!

@Random-Liu
Copy link
Member Author

Random-Liu commented Apr 15, 2019

Actually I have a different idea about handling container/sandbox in unknown state (failed to be loaded at startup).

  1. Based on the state machine https://github.com/containerd/cri/blob/master/pkg/store/sandbox/status.go#L24-L53 and https://github.com/containerd/cri/blob/master/pkg/store/container/status.go#L31-L61, the only allowed operation to container/sandbox in unknown state is STOP;
  2. We've changed Kubernetes to always try stopping container/sandbox in unknown state before removing Stop container in unknown state before recreate or remove. kubernetes/kubernetes#73802

Actually we don't really care about whether a sandbox/container in unknown state exits itself or not, because Kubernetes will stop it right away anyway. So we don't need to constantly monitor it, we only need to start a synchronized Wait in stop for container/sandbox in unknown state, in this way we can completely get rid of the dependency on TaskExit.

I'll make the change.

@mikebrow
Copy link
Member

flying back today... will review in the morning!

Copy link
Member

@mikebrow mikebrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments

logrus.WithError(err).Errorf("failed to set start failure state for container %q", id)
}
}
// Reset starting if start failed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not in the retErr != nil block.. so resets in this defer if setContainerStarting was successful...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, forgot to remove the comment. Done.


// resetContainerStarting resets the container starting state on start failure. So
// that we could remove the container later.
func resetContainerStarting(container containerstore.Container) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above comment... .. Do we want this function to be idempotent? Or should it fail if called twice...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be idempotent

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair enough the set was not idempotent.. so the unset being so was unsettling ;-) But ok by me either way.

@@ -125,21 +125,16 @@ func TestSandboxStore(t *testing.T) {
assert.Equal(sb, got)
}

t.Logf("should not be able to get unknown sandbox")
t.Logf("should be able to get sandbox with Get")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... in unknown state

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -53,23 +53,17 @@ import (
// +-------------> DELETED

// State is the sandbox state we use in containerd/cri.
// It includes init and unknown, which are internal states not defined in CRI.
// It includes unknown, which are internal states not defined in CRI.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/s/are/is/

which is an internal state not defined in CRI.. (INIT state has been removed)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Signed-off-by: Lantao Liu <lantaol@google.com>
Copy link
Member

@mikebrow mikebrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/LGTM

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use Wait instead of TaskExit event.
6 participants