If the node that runs the TaskRun pod shuts down, retry does not work as expected #6558
Comments
/assign |
When a node shuts down, the retry always uses the same pod, because the pod is not recognized as unable to work anymore and k8s cannot delete it before the node recovers, so the retry does not actually work. I fix it by checking the DeletionTimestamp to know whether the pod can no longer work. Fix tektoncd#6558 Signed-off-by: yuzhipeng <zpyu@alauda.io>
When the node shuts down, the retry pod is always the same pod, because the pod is not recognised as unable to work anymore and k8s cannot delete the pod before the node recovers, which means the retry does not actually work. I fix it by checking the DeletionTimestamp to know whether the pod is no longer working. Fix tektoncd#6558 Signed-off-by: yuzhipeng <zpyu@alauda.io>
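The check described in the commit message can be illustrated with a minimal sketch. This is not the actual Tekton change: `Pod`, `newTerminatingPod`, and `podUnusable` are hypothetical names, and the `Pod` struct is a stripped-down stand-in for the Kubernetes object, whose metadata carries a `DeletionTimestamp` that is set once deletion has been requested.

```go
package main

import (
	"fmt"
	"time"
)

// Pod is a hypothetical, stripped-down stand-in for a Kubernetes Pod.
// Only the field relevant to this sketch is modeled.
type Pod struct {
	Name string
	// DeletionTimestamp stays nil until deletion of the pod has been requested.
	DeletionTimestamp *time.Time
}

// newTerminatingPod builds a pod whose deletion has already been requested,
// as happens to a pod stuck on a node that has shut down.
func newTerminatingPod(name string) *Pod {
	now := time.Now()
	return &Pod{Name: name, DeletionTimestamp: &now}
}

// podUnusable reports whether a retry should stop relying on this pod.
// A missing pod or a pod marked for deletion counts as unusable.
func podUnusable(p *Pod) bool {
	return p == nil || p.DeletionTimestamp != nil
}

func main() {
	running := &Pod{Name: "example-pod"}
	terminating := newTerminatingPod("example-pod")

	fmt.Println(podUnusable(running))     // false
	fmt.Println(podUnusable(terminating)) // true
}
```

With a check of this kind, a pod that lingers in Terminating because its node is down would no longer be treated as the working pod for the next attempt.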
Edit: discussed at API WG today; we only want to create a new pod if a taskrun has retries specified. |
Thank you @yuzp1996 for reporting this issue 🙏 Why are these pod names the same across multiple attempts? @XinruZhang @lbernick can we please reproduce this 🙏 As per our |
I reproduced the error -- it seems like this behavior is unrelated to the "retries" feature: evicting the pod of the taskrun that is created out of the following pipelinerun has the same result:
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: pr-retries-
spec:
  serviceAccountName: 'default'
  pipelineSpec:
    tasks:
      - name: task-abnormal-pod
        taskSpec:
          steps:
            - name: echo
              image: alpine
              script: |
                sleep 50000 && exit 1
We discussed this issue in the API WG today: do we want to retry a taskrun that contains an unfinished pod that is evicted or terminated unexpectedly (node deletion etc)? People in the WG meeting generally agree that we should not retry it. |
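One possible reading of the WG outcome, combined with the earlier note that a new pod should be created only when retries are specified, can be sketched as a small decision function. All names here (`Action`, `nextAction`, and the three constants) are hypothetical and do not come from the Tekton code base; the sketch only encodes the rule as stated in this thread.

```go
package main

import "fmt"

// Action is a hypothetical label for what a reconciler could do next.
type Action string

const (
	KeepWaiting  Action = "keep-waiting"
	CreateNewPod Action = "create-new-pod"
	FailTaskRun  Action = "fail-taskrun"
)

// nextAction decides what to do with a TaskRun.
// podLost is true when the pod was evicted or is being deleted,
// retries is the configured number of retries,
// retriesUsed is how many retries have already been consumed.
func nextAction(podLost bool, retries, retriesUsed int) Action {
	if !podLost {
		return KeepWaiting
	}
	if retriesUsed < retries {
		return CreateNewPod
	}
	return FailTaskRun
}

func main() {
	fmt.Println(nextAction(true, 0, 0)) // fail-taskrun
	fmt.Println(nextAction(true, 1, 0)) // create-new-pod
	fmt.Println(nextAction(true, 1, 1)) // fail-taskrun
}
```

Under this reading, the PipelineRun above, which specifies no retries, would simply fail once its pod is evicted.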
thanks a bunch @XinruZhang 🙏
please explain what does the |
Most likely this is the case (I've tested the following behavior): When the pod is scheduled in the same node that But if we only evict the pod created for the taskrun, the taskrun and pipelinerun fail:
I don't quite understand why "creating the pod" and "updating the podname in taskrun" would happen in two different reconcile loops -- why is there a case that pipeline/pkg/reconciler/taskrun/taskrun.go Lines 437 to 448 in 8e8c163
The same result means: the pod will be recreated (with the same |
Looking at the history, this check on Introduced in 0f20c35 to stop looking up the pod for a taskRun by name but instead only look up by labelSelector.
This commit was reverted in #1944 because back then we had multiple pods associated with a single taskRun object and it was not easy to identify when to declare Further reference: #1689 We can certainly update the pod creation implementation since we have ownerReference implemented now, but I think that would be a nice-to-have and is not causing any issues here. |
Let's focus our testing for this bug report in which the controller and taskRun pods are on different nodes. There is generally a separation of duties and access controls in actual deployments.
I am interpreting the taskRun and pipelineRun fails without any retries based on the status updates here ⬆️ which is as expected, right? How does this @yuzp1996 can you please try this with the controller beyond |
Thank you @pritidesai for tracing back to the history.
Yes yes, the TaskRun and PipelineRun failed directly which is expected because no Once I specify the
pr-retries-vmbdz-task-abnormal-retries-pod 1/1 Running 0 41s
pr-retries-vmbdz-task-abnormal-retries-pod 1/1 Terminating 0 51s
pr-retries-vmbdz-task-abnormal-retries-pod 0/1 Terminating 0 52s
pr-retries-vmbdz-task-abnormal-retries-pod 0/1 Terminating 0 52s
pr-retries-vmbdz-task-abnormal-retries-pod 0/1 Terminating 0 52s
pr-retries-vmbdz-task-abnormal-retries-pod-retry1 0/1 Pending 0 0s
pr-retries-vmbdz-task-abnormal-retries-pod-retry1 0/1 Pending 0 0s
pr-retries-vmbdz-task-abnormal-retries-pod-retry1 0/1 Init:0/2 0 1s
pr-retries-vmbdz-task-abnormal-retries-pod-retry1 0/1 Init:1/2 0 2s
pr-retries-vmbdz-task-abnormal-retries-pod-retry1 0/1 PodInitializing 0 3s
pr-retries-vmbdz-task-abnormal-retries-pod-retry1 1/1 Running 0 4s
pr-retries-vmbdz-task-abnormal-retries-pod-retry1 1/1 Running 0 4s
P.S. I'm testing the behavior on the latest Tekton Pipelines code. cc @yuzp1996 |
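The watch output above shows the first attempt running in a pod named `...-pod` and the retry in `...-pod-retry1`. The naming pattern can be sketched as follows. `retryPodName` is a hypothetical helper inferred only from that output; it is not the function Tekton uses, and the behavior for attempts beyond the first retry is an assumption.

```go
package main

import "fmt"

// retryPodName returns the pod name for a given attempt.
// Attempt 0 is the original run and keeps the base name;
// later attempts get a "-retryN" suffix (assumed from the watch output).
func retryPodName(base string, attempt int) string {
	if attempt <= 0 {
		return base
	}
	return fmt.Sprintf("%s-retry%d", base, attempt)
}

func main() {
	base := "pr-retries-vmbdz-task-abnormal-retries-pod"
	fmt.Println(retryPodName(base, 0)) // pr-retries-vmbdz-task-abnormal-retries-pod
	fmt.Println(retryPodName(base, 1)) // pr-retries-vmbdz-task-abnormal-retries-pod-retry1
}
```

A distinct name per attempt is what lets the retry start while the original pod is still Terminating, which is the behavior the original report was missing.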
Agreed that generally the deployment should be separate. Curious: is there any related guideline for end users to follow (or is this just an industry standard that we can assume most people know XD)? Plus, what's the expected behavior if they are deployed on the same node and the node is then drained? Should we fail the TaskRun? I would think so, because it's better to align the behavior across different deployment setups. |
Thank you for your response! I'll try this with a controller above 0.43 and see if the problem persists. |
Issues go stale after 90d of inactivity. /lifecycle stale Send feedback to tektoncd/plumbing. |
Stale issues rot after 30d of inactivity. /lifecycle rotten Send feedback to tektoncd/plumbing. |
Rotten issues close after 30d of inactivity. /close Send feedback to tektoncd/plumbing. |
@tekton-robot: Closing this issue. In response to this:
Expected Behavior
When the node shuts down, the TaskRun pod running on that node fails. If we set retries on the TaskRun, we would expect the retried pod to start on a healthy node so that the TaskRun can continue to work.
Actual Behavior
The TaskRun keeps using the failed pod as its working pod and does not create a new pod for the retry.
Steps to Reproduce the Problem
Additional Info
Kubernetes version:
Output of kubectl version: