Slow/stuck reconciliation after 0.18.0 upgrade when completed Pods are cleaned up #3517
This seems like a pretty important issue to get sorted out, so I'm setting the priority higher.
@pritidesai there are multiple different pipelines used in the cluster. All of them, however, are really simple (fewer than 5 tasks, only a few parameters). My understanding is that #3524 is mostly relevant when there are far more.
hey @Fabian-K, I am cutting 0.18.1; I will update you once it's available.
Unfortunately, 0.18.1 does not fix the issue :(
@Fabian-K, to make sure I understand your pipeline: you have five independent tasks, all running in parallel and operating on a common git …
I ran a similar pipeline on my local cluster multiple times, and all five pods are created as expected. I am having a hard time reproducing this. Can you please check if the …
Hi @pritidesai, this issue is not related to #3126, sorry for creating confusion by linking it. To reproduce it:
@Fabian-K I will try to reproduce this on my local cluster. In the meantime, it might help if you could:
To change the log level, you can …
I created a fresh kind cluster, ran 17 pipelines with 40 tasks each, let everything finish, and then deleted all completed pods.
I then plotted the result:
The resulting graph shows that the pipeline run controller is not affected by the pod deletion (as expected).
I believe the issue happens when trying to stop sidecars. In case the pod is not found, we return a non-permanent (!) error: pipeline/pkg/reconciler/taskrun/taskrun.go, lines 215 to 217 at 473e3f3
Since at this point the taskrun is marked as done, I think it is safe to assume that if the pod was not found, we can just ignore the error and finish the reconcile. I will make a PR; it should be an easy fix.
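For illustration, here is a minimal Go sketch of that pattern: treating a NotFound error from stopping sidecars as a no-op once the TaskRun has a completion time. The function name, the stopSidecars signature, and the logger parameter are assumptions made for this sketch; it is not the actual taskrun.go code.

```go
// Illustrative sketch only; not the actual Tekton source.
package taskrun

import (
	"context"

	"go.uber.org/zap"
	k8serrors "k8s.io/apimachinery/pkg/api/errors"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
)

// stopSidecarsIgnoringMissingPod wraps a hypothetical stopSidecars helper so
// that, once the TaskRun has a completion time, a missing Pod no longer
// produces a transient error that keeps the TaskRun in the work queue.
func stopSidecarsIgnoringMissingPod(
	ctx context.Context,
	logger *zap.SugaredLogger,
	tr *v1beta1.TaskRun,
	stopSidecars func(context.Context, *v1beta1.TaskRun) error,
) error {
	// Only completed TaskRuns need their sidecars stopped.
	if tr.Status.CompletionTime == nil {
		return nil
	}
	if err := stopSidecars(ctx, tr); err != nil {
		if k8serrors.IsNotFound(err) {
			// The Pod was already deleted (for example by a cleanup job), so
			// no sidecar can still be running; treat this as a no-op instead
			// of returning an error and requeueing forever.
			logger.Infof("Pod for TaskRun %q not found while stopping sidecars, ignoring", tr.Name)
			return nil
		}
		return err
	}
	return nil
}
```

The key point is that only the NotFound case is swallowed; any other error still propagates so the reconcile can be retried.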
When the taskrun is "done", i.e. it has a completion time set, we try to stop sidecars. If the pod cannot be found, it has most likely been evicted, and in any case the sidecar is not running anymore, so we should not return a transient error.
Fixes #3517
Signed-off-by: Andrea Frittoli <andrea.frittoli@uk.ibm.com>
Thanks @Fabian-K for the further details, that was very helpful 👍 Thanks @afrittoli for the fix 🙏 Does this warrant 0.18.2?
Hi,
After upgrading Tekton Pipelines to v0.18.0, the reconciliation seems to be stuck, or at least really slow. Here is a screenshot of the tekton_workqueue_depth metric:
The controller log is full of repeated "pod not found" messages like the following.
We do have a cleanup job running in the cluster that deletes Pods of finished TaskRuns after some time. Before 0.18.0, this did not seem to be an issue for the controller.
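The cleanup job itself is not included in this issue; the following is only a rough Go sketch, assuming client-go and Tekton's `tekton.dev/taskRun` pod label, of the kind of job that removes Pods left behind by finished TaskRuns.

```go
// Rough sketch of a cleanup job that deletes Pods of finished TaskRuns.
// The namespace, label selector, and overall structure are assumptions; the
// actual job used in the cluster is not shown in this issue.
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("building in-cluster config: %v", err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("building clientset: %v", err)
	}

	ctx := context.Background()
	// Pods created for TaskRuns carry a tekton.dev/taskRun label (assumed here).
	pods, err := client.CoreV1().Pods("default").List(ctx, metav1.ListOptions{
		LabelSelector: "tekton.dev/taskRun",
	})
	if err != nil {
		log.Fatalf("listing TaskRun pods: %v", err)
	}

	for _, p := range pods.Items {
		// Only delete Pods in a terminal phase, i.e. whose TaskRun has finished.
		// A real job would presumably also apply an age threshold ("after some time").
		if p.Status.Phase != corev1.PodSucceeded && p.Status.Phase != corev1.PodFailed {
			continue
		}
		if err := client.CoreV1().Pods(p.Namespace).Delete(ctx, p.Name, metav1.DeleteOptions{}); err != nil {
			log.Printf("deleting pod %s/%s: %v", p.Namespace, p.Name, err)
		}
	}
}
```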
Thanks,
Fabian
Additional Info