Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow/stuck reconciliation after 0.18.0 upgrade when completed Pods are cleaned up #3517

Closed
Fabian-K opened this issue Nov 12, 2020 · 17 comments · Fixed by #3542
Closed

Slow/stuck reconciliation after 0.18.0 upgrade when completed Pods are cleaned up #3517

Fabian-K opened this issue Nov 12, 2020 · 17 comments · Fixed by #3542
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@Fabian-K
Copy link
Contributor

Hi,

after upgrading tekton pipelines to v0.18.0, the reconciliation seems to be stuck or at least really slow. Here is a screenshot of the tekton_workqueue_depth metric:

image

The controller log is full of repeated "pod not found" messages like the following.

{"level":"error","ts":"2020-11-12T15:03:10.116Z","logger":"tekton.github.com-tektoncd-pipeline-pkg-reconciler-taskrun.Reconciler","caller":"controller/controller.go:528","msg":"Reconcile error","commit":"8eaaeaa","error":"pods \"honeycomb-beekeeper-ci-dfj8f-kustomize-tltx6-pod-qfjn5\" not found","stacktrace":"github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).handleErr\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:528\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:514\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).RunContext.func3\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:451"}
{"level":"error","ts":"2020-11-12T15:03:10.116Z","logger":"tekton.github.com-tektoncd-pipeline-pkg-reconciler-taskrun.Reconciler","caller":"taskrun/reconciler.go:294","msg":"Returned an error","commit":"8eaaeaa","knative.dev/traceid":"bd9fd972-9191-44b3-b040-028215d651d2","knative.dev/key":"site/honeycomb-beekeeper-ci-dfj8f-kustomize-tltx6","targetMethod":"ReconcileKind","targetMethod":"ReconcileKind","error":"pods \"honeycomb-beekeeper-ci-dfj8f-kustomize-tltx6-pod-qfjn5\" not found","stacktrace":"github.com/tektoncd/pipeline/pkg/client/injection/reconciler/pipeline/v1beta1/taskrun.(*reconcilerImpl).Reconcile\n\tgithub.com/tektoncd/pipeline/pkg/client/injection/reconciler/pipeline/v1beta1/taskrun/reconciler.go:294\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:513\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).RunContext.func3\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:451"}

We do have a cleanup job running in the cluster that deletes Pods of finished TaskRuns after some time. Before 0.18.0 this does not seem to be an issue for the controller.

Thanks,
Fabian

Additional Info

  • Kubernetes version: v1.18.9
  • Tekton Pipeline version: v0.18.0
@Fabian-K Fabian-K added the kind/bug Categorizes issue or PR as related to a bug. label Nov 12, 2020
@Fabian-K
Copy link
Contributor Author

Looks like this was eventually processed after 3 hours. Unfortunately, after every controller restart this repeats.

image

@afrittoli afrittoli added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Nov 16, 2020
@afrittoli
Copy link
Member

This seems a pretty important issue to get sorted, setting the priority higher.

@pritidesai
Copy link
Member

@Fabian-K can you please provide your environment details and yamls if they are shareable? We just fixed scaling issues with larger pipelines with PR #3524 and it has been cherry picked in 0.18 branch. Can you please run with those changes and see if you are still running into issues?

@pritidesai
Copy link
Member

I notice you have uploaded a pipelineRun here in other issue #3126

@Fabian-K
Copy link
Contributor Author

@pritidesai there are multiple different pipelines used in the cluster. All however are really simple (less than 5 tasks, only a few parameters). My understanding is that #3524 is mostly relevant if there are way more.
Are images for https://github.com/tektoncd/pipeline/tree/release-v0.18.x already available somewhere? I could deploy and verify if that already fixes the issue.

@pritidesai
Copy link
Member

hey @Fabian-K I am cutting 0.18.1, will update you once its available

@pritidesai
Copy link
Member

@Fabian-K 0.18.1 is here, please give it a try 🙏

@Fabian-K
Copy link
Contributor Author

Unfortunately, 0.18.1 does not fix the issue :(

@pritidesai
Copy link
Member

@Fabian-K to understand your pipeline, you have five independent tasks running all in parallel and operating on a common git pipelineResource:

  1. assemble
  2. unittest
  3. kustomize
  4. push-docker-images
  5. push-jib-images

I ran the similar pipeline on my local cluster multiple times, all five pods are getting created as expected. I am having hard time reproducing this.

Can you please check if the pods exist before running cleanup job?

@Fabian-K
Copy link
Contributor Author

Hi @pritidesai, this issue is not related to 3126, sorry for creating confusion by linking it. To reproduce it:

  • create a cluster and create some PipelineRuns in it. Any pipeline should work.
  • wait for all PipelineRuns to complete
  • delete the pods of the executed tasks (but keep the PipelineRuns and TaskRuns)
  • restart the tekton pipelines controller
  • check the tekton_workqueue_depth metric, this will take some time to reach 0 (much longer compared with versions before 0.18.0)

@afrittoli
Copy link
Member

@Fabian-K I will try to reproduce this on my local cluster. In the meanwhile, it might help if you could:

  • turn the log level of the controller to debug
  • grab the logs during the time when no PipelineRuns are running and the pods have been deleted

To change the log level, you can kubectl edit cm/config-logging -n tekton-pipelines and then:

   # Common configuration for all knative codebase
   zap-logger-config: |
     {
-      "level": "info",
+      "level": "debug",
       "development": false,
       "sampling": {
         "initial": 100,
@@ -49,5 +49,5 @@ data:
     }

   # Log level overrides
-  loglevel.controller: "info"
+  loglevel.controller: "debug"
   loglevel.webhook: "info"

@afrittoli
Copy link
Member

I created a fresh kind cluster, ran 17 pipelines 17 with 40 tasks each, let everything finish and then deleted all completed pods.
I collected the work_queue metrics for pipeline run and taskrun, staring during pipeline run execution, through the pod deletion:

$ while true; do curl -s http://localhost:9090/metrics | egrep '^tekton_work_queue_depth' | awk '{ print "'$(date +%F__%H:%M:%S)',"$1","$2 }' | sed 's/__/ /g' >> metrics.log; sleep 5; done

And plotted the result:

set xdata time
set timefmt '%Y-%m-%d %H:%M:%S'
set datafile separator ','
plot "<(grep taskrun metrics.log)" using 1:3 notitle with lines
replot "<(grep pipelinerun metrics.log)" using 1:3 notitle with lines

The resulting graph shows that the pipeline run controller is not affected by the pod deletion (as expected).
The taskrun graph spikes up because I ran all the deletions together, and each will trigger a reconcile, however the line goes down really slowly, as described in this issue:

image

@afrittoli
Copy link
Member

{"level":"info","ts":"2020-11-19T12:03:26.817Z","logger":"tekton.event-broadcaster","caller":"record/event.go:278","msg":"Event(v1.ObjectReference{Kind:\"TaskRun\", Namespace:\"default\", Name:\"pipeline-40-run-c4zcp-t9-vt7wk\", UID:\"703f55a9-7c24-4c13-8d21-1e12af8d633f\", APIVersion:\"tekton.dev/v1beta1\", ResourceVersion:\"24686\", FieldPath:\"\"}): type: 'Warning' reason: 'InternalError' pods \"pipeline-40-run-c4zcp-t9-vt7wk-pod-tcnsq\" not found","commit":"473e3f3"}
{"level":"debug","ts":"2020-11-19T12:03:26.817Z","logger":"tekton.github.com-tektoncd-pipeline-pkg-reconciler-taskrun.Reconciler","caller":"controller/controller.go:536","msg":"Requeuing key default/pipeline-40-run-c4zcp-t9-vt7wk due to non-permanent error (depth: 609)","commit":"473e3f3"}
{"level":"info","ts":"2020-11-19T12:03:26.818Z","logger":"tekton.github.com-tektoncd-pipeline-pkg-reconciler-taskrun.Reconciler","caller":"controller/controller.go:515","msg":"Reconcile failed. Time taken: 300.5814ms","commit":"473e3f3","knative.dev/traceid":"3cd7b15c-3350-4f93-8365-53e865c7576a","knative.dev/key":"default/pipeline-40-run-c4zcp-t9-vt7wk"}

@afrittoli
Copy link
Member

I believe the issue happens when trying to stopping sidecars. In case the pod is not found we return a non-permanent (!) error:

if k8serrors.IsNotFound(err) {
// failed to get the pod, return error without any sidecars
return nil, err

Since at this point the taskrun is marked as done, I think it is safe to assume that if the pod was not found, we can just ignore and finish reconcile. I will make a PR, it should be an easy fix.

@afrittoli afrittoli self-assigned this Nov 19, 2020
@afrittoli
Copy link
Member

Graph before applying the patch - it went down in about 30 min.

image

@afrittoli
Copy link
Member

Controller restart to apply the patch:

image

afrittoli added a commit to afrittoli/pipeline that referenced this issue Nov 19, 2020
When the taskrun is "done" i.e. it has a completion time set,
we try to stop sidecars. If the pod cannot be found, it has most
likely been evicted and in any case the sidecar is not running
anymore, so we should not return a transient error.

Fixes tektoncd#3517

Signed-off-by: Andrea Frittoli <andrea.frittoli@uk.ibm.com>
tekton-robot pushed a commit that referenced this issue Nov 19, 2020
When the taskrun is "done" i.e. it has a completion time set,
we try to stop sidecars. If the pod cannot be found, it has most
likely been evicted and in any case the sidecar is not running
anymore, so we should not return a transient error.

Fixes #3517

Signed-off-by: Andrea Frittoli <andrea.frittoli@uk.ibm.com>
@pritidesai
Copy link
Member

@Fabian-K for further details, it was very helpful 👍

Thanks @afrittoli for the fix 🙏

Does this warrant 0.18.2?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants