Slow/stuck reconciliation after 0.18.0 upgrade when completed Pods are cleaned up #3517
This seems like a pretty important issue to get sorted out, so I'm setting the priority higher.
@pritidesai there are multiple different pipelines used in the cluster. All of them, however, are really simple (fewer than 5 tasks, only a few parameters). My understanding is that #3524 is mostly relevant when there are far more.
hey @Fabian-K, I am cutting 0.18.1; I will update you once it's available.
Unfortunately, 0.18.1 does not fix the issue :(
@Fabian-K, to make sure I understand your pipeline: you have five independent tasks, all running in parallel and operating on a common git …
I ran a similar pipeline on my local cluster multiple times, and all five pods are created as expected. I am having a hard time reproducing this. Can you please check if the …
Hi @pritidesai, this issue is not related to #3126, sorry for creating confusion by linking it. To reproduce it:
@Fabian-K I will try to reproduce this on my local cluster. In the meantime, it might help if you could:
To change the log level, you can …
I created a fresh kind cluster, ran 17 pipelines with 40 tasks each, let everything finish, and then deleted all completed pods.
I then plotted the result:
The resulting graph shows that the pipeline run controller is not affected by the pod deletion (as expected).
I believe the issue happens when trying to stop sidecars. In case the pod is not found, we return a non-permanent (!) error: pipeline/pkg/reconciler/taskrun/taskrun.go, lines 215 to 217 at 473e3f3
Since at this point the taskrun is marked as done, I think it is safe to assume that if the pod was not found, we can just ignore the error and finish the reconcile. I will make a PR; it should be an easy fix.
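For illustration, here is a minimal Go sketch of that pattern: treating a NotFound error from stopping sidecars as a no-op once the TaskRun has a completion time. The function name, the stopSidecars signature, and the logger parameter are assumptions made for this sketch; it is not the actual taskrun.go code.

```go
// Illustrative sketch only; not the actual Tekton source.
package taskrun

import (
	"context"

	"go.uber.org/zap"
	k8serrors "k8s.io/apimachinery/pkg/api/errors"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
)

// stopSidecarsIgnoringMissingPod wraps a hypothetical stopSidecars helper so
// that, once the TaskRun has a completion time, a missing Pod no longer
// produces a transient error that keeps the TaskRun in the work queue.
func stopSidecarsIgnoringMissingPod(
	ctx context.Context,
	logger *zap.SugaredLogger,
	tr *v1beta1.TaskRun,
	stopSidecars func(context.Context, *v1beta1.TaskRun) error,
) error {
	// Only completed TaskRuns need their sidecars stopped.
	if tr.Status.CompletionTime == nil {
		return nil
	}
	if err := stopSidecars(ctx, tr); err != nil {
		if k8serrors.IsNotFound(err) {
			// The Pod was already deleted (for example by a cleanup job), so
			// no sidecar can still be running; treat this as a no-op instead
			// of returning an error and requeueing forever.
			logger.Infof("Pod for TaskRun %q not found while stopping sidecars, ignoring", tr.Name)
			return nil
		}
		return err
	}
	return nil
}
```

The key point is that only the NotFound case is swallowed; any other error still propagates so the reconcile can be retried.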
When the taskrun is "done", i.e. it has a completion time set, we try to stop sidecars. If the pod cannot be found, it has most likely been evicted, and in any case the sidecar is not running anymore, so we should not return a transient error.
Fixes #3517
Signed-off-by: Andrea Frittoli <andrea.frittoli@uk.ibm.com>
Thanks @Fabian-K for the further details, that was very helpful 👍 Thanks @afrittoli for the fix 🙏 Does this warrant 0.18.2?
Hi,
After upgrading Tekton Pipelines to v0.18.0, the reconciliation seems to be stuck, or at least really slow. Here is a screenshot of the tekton_workqueue_depth metric:
The controller log is full of repeated "pod not found" messages like the following.
We do have a cleanup job running in the cluster that deletes Pods of finished TaskRuns after some time. Before 0.18.0, this did not seem to be an issue for the controller.
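The cleanup job itself is not included in this issue; the following is only a rough Go sketch, assuming client-go and Tekton's `tekton.dev/taskRun` pod label, of the kind of job that removes Pods left behind by finished TaskRuns.

```go
// Rough sketch of a cleanup job that deletes Pods of finished TaskRuns.
// The namespace, label selector, and overall structure are assumptions; the
// actual job used in the cluster is not shown in this issue.
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("building in-cluster config: %v", err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("building clientset: %v", err)
	}

	ctx := context.Background()
	// Pods created for TaskRuns carry a tekton.dev/taskRun label (assumed here).
	pods, err := client.CoreV1().Pods("default").List(ctx, metav1.ListOptions{
		LabelSelector: "tekton.dev/taskRun",
	})
	if err != nil {
		log.Fatalf("listing TaskRun pods: %v", err)
	}

	for _, p := range pods.Items {
		// Only delete Pods in a terminal phase, i.e. whose TaskRun has finished.
		// A real job would presumably also apply an age threshold ("after some time").
		if p.Status.Phase != corev1.PodSucceeded && p.Status.Phase != corev1.PodFailed {
			continue
		}
		if err := client.CoreV1().Pods(p.Namespace).Delete(ctx, p.Name, metav1.DeleteOptions{}); err != nil {
			log.Printf("deleting pod %s/%s: %v", p.Namespace, p.Name, err)
		}
	}
}
```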
Thanks,
Fabian
Additional Info