Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

If node that runs the taskrun pod shutdown then retry will not work as expected #6558

Closed
yuzp1996 opened this issue Apr 19, 2023 · 20 comments
Closed
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@yuzp1996
Copy link
Contributor

yuzp1996 commented Apr 19, 2023

Expected Behavior

When the node shutdown and the tasrkrun pod running on the node will be failed. And if we set retry on taskrun we would expect the retried pod to start in a normal node and taskrun can continue to work.

Actual Behavior

Taskrun will use the failed pod as the work pod and will not create a new pod for retry.

Steps to Reproduce the Problem

  1. create a taskrun with retry in a k8s cluster that has multi nodes
  2. shutdown the node which running the pod
  3. check the status of taskrun

Additional Info

  • Kubernetes version:

    Output of kubectl version:

Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.6", GitCommit:"ff2c119726cc1f8926fb0585c74b25921e866a28", GitTreeState:"clean", BuildDate:"2023-01-18T19:22:09Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.6", GitCommit:"ff2c119726cc1f8926fb0585c74b25921e866a28", GitTreeState:"clean", BuildDate:"2023-01-18T19:15:26Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}
  • Tekton Pipeline version:
v0.41.0
@yuzp1996 yuzp1996 added the kind/bug Categorizes issue or PR as related to a bug. label Apr 19, 2023
@yuzp1996
Copy link
Contributor Author

yuzp1996 commented Apr 19, 2023

It is a normal retry that retry pods are different pods.

image

It is an unnormal retry that retry pods are always the first pod.

image

@yuzp1996
Copy link
Contributor Author

yuzp1996 commented Apr 19, 2023

I find it's because we only check the status of container however when node shutdown the container status will not change.

image

And I find kubectl get po will return the right pod status Terminating indicating that the pod is unnormal.
image

I think that k8s can recognise that the pod is in an unnormal status and then I read the related code.

If pod.DeletionTimestamp != nil and k8s will think this pod is in Terminating status.

image

I think maybe we can change the logic of detecting the taskrun pod status like this to resolve the problem.

// DidTaskRunFail check the status of pod to decide if related taskrun is failed
func DidTaskRunFail(pod *corev1.Pod) bool {
	f := pod.Status.Phase == corev1.PodFailed
	for _, s := range pod.Status.ContainerStatuses {
		if IsContainerStep(s.Name) {
			if s.State.Terminated != nil {
				f = f || s.State.Terminated.ExitCode != 0 || isOOMKilled(s)
			}
		}
	}
+	// when the node lost the pod will be in Terminating
+	f = f || pod.DeletionTimestamp != nil

	return f
}

@yuzp1996
Copy link
Contributor Author

/assign

yuzp1996 added a commit to yuzp1996/pipeline that referenced this issue Apr 22, 2023
When node shutdown then the retry pod will always use the same pod
because it can not recongize that the pod can not work anymore and
k8s can not delete it before the node recovered and it will cause
retry is actually not working.

I fix it by check the DeletionTimestamp to know if the pod is actually
not work any more.

Fix tektoncd#6558

Signed-off-by: yuzhipeng <zpyu@alauda.io>
yuzp1996 added a commit to yuzp1996/pipeline that referenced this issue Apr 22, 2023
When the node shutdown then the retry pod will always be the same pod
because it can not recognise that the pod can not work anymore and
k8s can not delete the pod before the node recovers and it causes the retry is actually not to work.

I fix it by checking the DeletionTimestamp to know if the pod is actually
not work any more.

Fix tektoncd#6558

Signed-off-by: yuzhipeng <zpyu@alauda.io>
@lbernick
Copy link
Member

lbernick commented Apr 24, 2023

I think it's worth considering whether we want a TaskRun to handle node failures in general or only in the case of retries. FYI @pritidesai @skaegi

Edit: discussed at API WG today; we only want to create a new pod if a taskrun has retries specified.

@pritidesai
Copy link
Member

pritidesai commented Apr 24, 2023

Thank you @yuzp1996 for reporting this issue 🙏

image

Why are these pod names same for multiple attempts? @XinruZhang @lbernick can we please reproduce this 🙏

As per our retry strategies, the pod names must be unique for each retry attempt if I am not mistaken.

@lbernick
Copy link
Member

@yuzp1996 behavior of retries was modified in v0.43.0. Are you able to reproduce this bug on v0.43.0 or newer?

@XinruZhang
Copy link
Member

I reproduced the error -- it seems like this behavior is unrelated to the "retries" feature: evicting the pod of the taskrun that is created out of the following pipelinerun has the same result:

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: pr-retries-
spec:
  serviceAccountName: 'default'
  pipelineSpec:
    tasks:
      - name: task-abnormal-pod
        taskSpec:
          steps:
          - name: echo
            image: alpine
            script: |
                sleep 50000 && exit 1

We discussed this issue in the API WG today: do we want to retry a taskrun that contains an unfinished pod that is evicted or terminated unexpectedly (node deletion etc)? People in the WG meeting generally agree that we should not retry it.

cc @yuzp1996 @pritidesai @lbernick

@pritidesai
Copy link
Member

thanks a bunch @XinruZhang 🙏

evicting the pod of the taskrun that is created out of the following pipelinerun has the same result

please explain what does the same result mean here. What happens to such pipelineRun without retry? Is pipelineRun kind of hangs until it times out?

@XinruZhang
Copy link
Member

XinruZhang commented Apr 24, 2023

Most likely here is the case (I've tested the following behavior):

When the pod is scheduled in the same node that tekton-pipelines-controller and tekton-pipelines-webhook are running on, then if we drain the node, all pods on that node are evicted, the reported behavior (the same pod will be created for the taskrun -- simply because the controller logic needs to be re-run 1) happens.

But if we only evict the pod created for the taskrun, the taskrun and pipelinerun faills:

Events:
  Type     Reason         Age                    From     Message
  ----     ------         ----                   ----     -------
  Normal   Started        15m                    TaskRun
  Normal   Pending        15m                    TaskRun  Pending
  Normal   Pending        15m                    TaskRun  pod status "Initialized":"False"; message: "containers with incomplete status: [prepare place-scripts]"
  Normal   Pending        15m                    TaskRun  pod status "Initialized":"False"; message: "containers with incomplete status: [place-scripts]"
  Normal   Pending        15m                    TaskRun  pod status "Ready":"False"; message: "containers with unready status: [step-echo]"
  Normal   Running        15m (x2 over 15m)      TaskRun  Not all Steps in the Task have finished executing
  Warning  Failed         6m27s                  TaskRun  "step-echo" exited with code 255 (image: "docker.io/library/alpine@sha256:124c7d2707904eea7431fffe91522a01e5a861a624ee31d03372cc1d138a3126"); for logs run: kubectl -n default logs pr-retries-z5mdd-task-abnormal-retries-pod -c step-echo
  Warning  InternalError  6m26s (x2 over 6m26s)  TaskRun  pods "pr-retries-z5mdd-task-abnormal-retries-pod" not found

Footnotes

  1. https://github.com/tektoncd/pipeline/blob/8e8c163446e8bef885bd628072320f9805529ac9/pkg/reconciler/taskrun/taskrun.go#L437-L504

@XinruZhang
Copy link
Member

I don't quite understand why "creating the pod" and "updating the podname in taskrun" would happen in two different reconcile loops -- why is there a case that PodName is not empty while the Pod is not found.

if tr.Status.PodName != "" {
pod, err = c.podLister.Pods(tr.Namespace).Get(tr.Status.PodName)
if k8serrors.IsNotFound(err) {
// Keep going, this will result in the Pod being created below.
} else if err != nil {
// This is considered a transient error, so we return error, do not update
// the task run condition, and return an error which will cause this key to
// be requeued for reconcile.
logger.Errorf("Error getting pod %q: %v", tr.Status.PodName, err)
return err
}
} else {

@XinruZhang
Copy link
Member

evicting the pod of the taskrun that is created out of the following pipelinerun has the same result

Please explain what does the same result mean here. What happens to such pipelineRun without retry? Is pipelineRun kind of hangs until it times out?

The same result means: the pod will be recreated (with the same PodName as this issue reports), and the PipelineRun will eventually succeed (or fail if that's what's defined by the PipelineTask).

@pritidesai
Copy link
Member

pritidesai commented Apr 24, 2023

Looking at the history, this check on tr.Status.PodName was deleted but then it was reverted back.

Introduced in 0f20c35 to stop looking up the pod for a taskRun by name but instead only look up by labelSelector.

This adds Reconciler.getPod, which looks up the Pod for a TaskRun by performing a label selector query on Pods, looking for the label we apply to Pods generated by TaskRuns.

If zero Pods are returned, it's the same as .status.podName being "". If multiple Pods are returned, that's an error.

This commit was reverted in #1944 because back then we had multiple pods associated to single taskRun object and was not easy to identify when to declare done for a taskRun.

Further reference: #1689

We can certainly update pod creation implementation since we have ownerReference implemented now but I think that could be something nice to have and not causing any issues here.

@pritidesai
Copy link
Member

pritidesai commented Apr 25, 2023

When the pod is scheduled in the same node that tekton-pipelines-controller and tekton-pipelines-webhook are running on, then if we drain the node, all pods on that node are evicted, the reported behavior (the same pod will be created for the taskrun -- simply because the controller logic needs to be re-run [1]

Let's focus our testing for this bug report in which the controller and taskRun pods are on different nodes. There is generally a separation of duties and access controls in actual deployments.

But if we only evict the pod created for the taskrun, the taskrun and pipelinerun faills:

Warning  Failed         6m27s                  TaskRun  "step-echo" exited with code 255 (image: "docker.io/library/alpine@sha256:124c7d2707904eea7431fffe91522a01e5a861a624ee31d03372cc1d138a3126"); for logs run: kubectl -n default logs pr-retries-z5mdd-task-abnormal-retries-pod -c step-echo
Warning  InternalError  6m26s (x2 over 6m26s)  TaskRun  pods "pr-retries-z5mdd-task-abnormal-retries-pod" not found

I am interpreting the taskRun and pipelineRun fails without any retries based on the status updates here ⬆️ which is as expected, right?

How does this InternalError impact the pipelineRun status?

@yuzp1996 can you please try this with the controller beyond 0.43 as @lbernick requested?

@XinruZhang
Copy link
Member

Thank you @pritidesai for tracing back to the history.

I am interpreting the taskRun and pipelineRun fails without any retries based on the status updates here ⬆️ which is as expected, right?

Yes yes, the TaskRun and PipelineRun failed directly which is expected because no Retries was specified.

Once I specify the Retries field, and then the pod is retried, see the status tansitions here:

pr-retries-vmbdz-task-abnormal-retries-pod   1/1     Running   0          41s
pr-retries-vmbdz-task-abnormal-retries-pod   1/1     Terminating   0          51s
pr-retries-vmbdz-task-abnormal-retries-pod   0/1     Terminating   0          52s
pr-retries-vmbdz-task-abnormal-retries-pod   0/1     Terminating   0          52s
pr-retries-vmbdz-task-abnormal-retries-pod   0/1     Terminating   0          52s
pr-retries-vmbdz-task-abnormal-retries-pod-retry1   0/1     Pending       0          0s
pr-retries-vmbdz-task-abnormal-retries-pod-retry1   0/1     Pending       0          0s
pr-retries-vmbdz-task-abnormal-retries-pod-retry1   0/1     Init:0/2      0          1s
pr-retries-vmbdz-task-abnormal-retries-pod-retry1   0/1     Init:1/2      0          2s
pr-retries-vmbdz-task-abnormal-retries-pod-retry1   0/1     PodInitializing   0          3s
pr-retries-vmbdz-task-abnormal-retries-pod-retry1   1/1     Running           0          4s
pr-retries-vmbdz-task-abnormal-retries-pod-retry1   1/1     Running           0          4s

P.S. I'm testing the behavior on the latest Tekton Pipelines code. cc @yuzp1996

@XinruZhang
Copy link
Member

XinruZhang commented Apr 25, 2023

Let's focus our testing for this bug report in which the controller and taskRun pods are on different nodes. There is generally a separation of duties and access controls in actual deployments.

Agreed that generally the deployment should be seperate. Curious is there any related guideline for end users to follow (Or is this just an industrial standard that we can assume most people know XD )?

Plus, what's the expected behavior if the they are deployed in the same node and then the node is drained. Should we fail the TaskRun. I would think so because the behavior it's better to align the behavior among different deployment setups).

@pritidesai pritidesai changed the title If node that runs the taskrun pod shutdown then retry will not work as except If node that runs the taskrun pod shutdown then retry will not work as expected May 2, 2023
@yuzp1996
Copy link
Contributor Author

yuzp1996 commented May 5, 2023

Thank you for your response! I'll try this with a controller above 0.43 and see if the problem persists.

@tekton-robot
Copy link
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 3, 2023
@tekton-robot
Copy link
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 2, 2023
@tekton-robot
Copy link
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot
Copy link
Collaborator

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants