-
Notifications
You must be signed in to change notification settings - Fork 348
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
LGTM! This is huge! |
Actually I have a different idea about handling container/sandbox in unknown state (failed to be loaded at startup).
Actually we don't really care about whether a sandbox/container in unknown state exits itself or not, because Kubernetes will stop it right away anyway. So we don't need to constantly monitor it, we only need to start a synchronized I'll make the change. |
flying back today... will review in the morning! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comments
pkg/server/container_start.go
Outdated
logrus.WithError(err).Errorf("failed to set start failure state for container %q", id) | ||
} | ||
} | ||
// Reset starting if start failed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not in the retErr != nil block.. so resets in this defer if setContainerStarting was successful...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, forgot to remove the comment. Done.
|
||
// resetContainerStarting resets the container starting state on start failure. So | ||
// that we could remove the container later. | ||
func resetContainerStarting(container containerstore.Container) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see above comment... .. Do we want this function to be idempotent? Or should it fail if called twice...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be idempotent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fair enough the set was not idempotent.. so the unset being so was unsettling ;-) But ok by me either way.
pkg/store/sandbox/sandbox_test.go
Outdated
@@ -125,21 +125,16 @@ func TestSandboxStore(t *testing.T) { | |||
assert.Equal(sb, got) | |||
} | |||
|
|||
t.Logf("should not be able to get unknown sandbox") | |||
t.Logf("should be able to get sandbox with Get") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... in unknown state
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
pkg/store/sandbox/status.go
Outdated
@@ -53,23 +53,17 @@ import ( | |||
// +-------------> DELETED | |||
|
|||
// State is the sandbox state we use in containerd/cri. | |||
// It includes init and unknown, which are internal states not defined in CRI. | |||
// It includes unknown, which are internal states not defined in CRI. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/s/are/is/
which is an internal state not defined in CRI.. (INIT state has been removed)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Signed-off-by: Lantao Liu <lantaol@google.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/LGTM
Cherrypick #1133 release 1.2
Fixes #1132.
This PR:
Wait
instead ofTaskExit
event. This should fix issues like Containers can't be stopped because TaskExit event failed to be published for runc.v2. containerd#3177 and Cannot stop the container: stop timeout containerd#3125;Wait
we can control the timing of exit event generation to avoid race condition. This is very important, because sandbox/container start lock is the last potential long running lock we have today. And in GKE, we do see slow container start holds container lock and blocksdocker ps
for Docker.Please note that for containers/sandboxes in unknown state, we still need to rely on containerdTaskExit
event, because they were not successfully loaded, there are no corresponding exit monitors running.Since we'll support containerd 1.2 for a long time, I'd like to cherry-pick this into 1.2.
/cc @crosbymichael
Signed-off-by: Lantao Liu lantaol@google.com