[TEP-0124] Proposal to add distributed tracing for tekton tasks and pipelines #839

Merged
merged 5 commits into from
Oct 12, 2022
119 changes: 119 additions & 0 deletions teps/0124-distributed-tracing-for-tasks-and-pipelines.md
@@ -0,0 +1,119 @@
---
status: proposed
title: Distributed tracing for Tasks and Pipelines
creation-date: '2022-09-30'
last-updated: '2022-09-30'
authors:
- '@kmjayadeep'
---

# TEP-0124: Distributed tracing for Tasks and Pipelines

<!-- toc -->
- [Summary](#summary)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Use Cases](#use-cases)
- [Requirements](#requirements)
- [Proposal](#proposal)
- [PipelineRun controller](#pipelinerun-controller)
- [TaskRun controller](#taskrun-controller)
- [Test Plan](#test-plan)
- [References](#references)
<!-- /toc -->

## Summary

With distributed tracing, we can track the time taken by each action in a pipeline, such as running reconciliation logic, fetching resources, and pulling images.
This allows developers to improve the reconciliation logic and allows end users to monitor and optimize their pipelines.

### Goals

* Implementation of OpenTelemetry tracing with Jaeger
* Instrumentation of PipelineRun and TaskRun reconciliation logic
* Ability to visualize pipeline and task reconciliation steps in Jaeger

### Non-Goals

* Instrumentation of sidecars and init containers
* Support for more tracing backends
* Instrumentation of individual steps in each task
* Adding events to each span to indicate what is happening inside each
method. This is an improvement task, and it can be done later once the
basic setup for tracing is in place. The scope of this proposal is
only to get the plumbing done by covering only the method boundaries.

### Use Cases

Pipeline and Task User:
* I would like to understand the duration of each task in my pipeline so that I can optimize the slow tasks to improve the pipeline execution speed

Tekton Developer:
* I would like to understand the duration of each reconciliation step, so that I can optimize the code to improve reconciliation performance
* When the pipelines are failing due to a bug, I would like to understand which reconciliation logic caused the issue so that I can easily fix the problem

### Requirements

* Trace all functions in the PipelineRun controller
* Trace all functions in the TaskRun controller
* Support Jaeger backend
* Propagate traces so that subsequent reconciles of the same resource belong to the same trace
* Propagate traces so that reconciles of a resource owned by a parent resource belong to a parent span from the parent resource
* Reconciles of different resources must belong to separate traces

## Proposal
Member

A small architecture diagram of how jaeger integrates with tekton

Contributor Author

I'm not sure how to come up with an architecture diagram for this. There are only two relevant components, the Tekton operator and Jaeger. The Tekton operator just records the spans during the reconcile process and flushes them to Jaeger at specific intervals.

Member

Sounds good. Just wanted to make sure nothing out of the ordinary was being introduced.

Member

Thanks @pxp928 - a diagram would be nice indeed.
We could add a diagram in the next PR when we propose this as implementable if that's ok


Initialize a tracer provider with Jaeger as the backend for each reconciler. The Jaeger collector base URL can be passed as an argument to the controller.

The following snippet shows how a tracer provider is initialized:

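```go
import (
	"go.opentelemetry.io/otel/exporters/jaeger"
	"go.opentelemetry.io/otel/sdk/resource"
	tracesdk "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.12.0" // version is an assumption
)

// newTracerProvider (illustrative name) batches spans to the Jaeger collector
// at url, e.g. "http://jaeger-collector.jaeger:14268/api/traces".
func newTracerProvider(url string) (*tracesdk.TracerProvider, error) {
	exp, err := jaeger.New(jaeger.WithCollectorEndpoint(jaeger.WithEndpoint(url)))
	if err != nil {
		return nil, err
	}
	tp := tracesdk.NewTracerProvider(
		tracesdk.WithBatcher(exp),
		// Record information about this application in a Resource.
		tracesdk.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("PipelineRunReconciler"),
		)),
	)
	return tp, nil
}
```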

If the Jaeger collector URL is not provided, the operator will continue working as before and tracing will be disabled. In this case, the TracerProvider is replaced with a no-op provider that does not record any spans.
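
The fallback described above is essentially a null-object pattern. The sketch below illustrates the selection logic using only the standard library. The names `Tracer`, `recordingTracer`, `noopTracer`, and `newTracer` are hypothetical stand-ins and not part of OpenTelemetry or Tekton; an actual implementation would presumably swap in OpenTelemetry's own no-op provider instead.

```go
package main

import "fmt"

// Tracer is a hypothetical, minimal stand-in for a tracing API.
type Tracer interface {
	// Start begins a span and returns a function that ends it.
	Start(name string) (end func())
}

// recordingTracer keeps the names of finished spans in memory.
type recordingTracer struct{ finished []string }

func (r *recordingTracer) Start(name string) func() {
	return func() { r.finished = append(r.finished, name) }
}

// noopTracer satisfies Tracer but records nothing.
type noopTracer struct{}

func (noopTracer) Start(string) func() { return func() {} }

// newTracer returns a no-op tracer when no collector URL is configured,
// so callers never need to check whether tracing is enabled.
func newTracer(collectorURL string) Tracer {
	if collectorURL == "" {
		return noopTracer{}
	}
	return &recordingTracer{}
}

func main() {
	for _, url := range []string{"", "http://jaeger-collector.jaeger:14268/api/traces"} {
		t := newTracer(url)
		end := t.Start("reconcile")
		end()
		_, disabled := t.(noopTracer)
		fmt.Printf("url=%q tracing disabled=%v\n", url, disabled)
	}
}
```

The reconciler code calls `Start` the same way in both cases, which is why the operator can keep working unchanged when tracing is off.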

### PipelineRun controller
A new trace will be initialized in the PipelineRun controller when a new PipelineRun CR is created. The span context will be propagated through the reconciliation methods
to instrument the actions and steps included in the reconciliation logic. The span context will be saved back to the PipelineRun CR as an annotation `tekton/pipelinerun-span-context`. This span context can
be retrieved during the next reconciliation loop for the same CR. This way, we will have a single parent span for the entire reconciliation logic for a single PipelineRun CR. This makes it easy to visualize
the multiple reconciliation steps involved for each PipelineRun.

When a TaskRun is created by the PipelineRun reconciler, the parent span context is passed as an annotation `tekton/taskrun-span-context` so that the TaskRun reconciler uses the same span as its parent.
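
To make the annotation round trip concrete, here is a minimal sketch of saving a span context into an annotation and reading it back on a later reconcile. This proposal does not specify a serialization format; the W3C `traceparent` layout (`00-<trace id>-<span id>-<flags>`) is used here purely as an assumption, and the `spanContext` type and the `encode`/`decode` helpers are hypothetical. A complete parser would also reject all-zero IDs.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// Annotation key named in this proposal.
const pipelineRunSpanContextKey = "tekton/pipelinerun-span-context"

// spanContext is a hypothetical, reduced view of a span context.
type spanContext struct {
	TraceID [16]byte
	SpanID  [8]byte
	Sampled bool
}

// encode renders the span context in W3C traceparent form (version 00).
func (sc spanContext) encode() string {
	flags := "00"
	if sc.Sampled {
		flags = "01"
	}
	return fmt.Sprintf("00-%s-%s-%s",
		hex.EncodeToString(sc.TraceID[:]), hex.EncodeToString(sc.SpanID[:]), flags)
}

// decode parses a version-00 traceparent value.
func decode(s string) (spanContext, error) {
	var sc spanContext
	p := strings.Split(s, "-")
	if len(p) != 4 || p[0] != "00" || len(p[1]) != 32 || len(p[2]) != 16 || len(p[3]) != 2 {
		return sc, errors.New("malformed span context")
	}
	if _, err := hex.Decode(sc.TraceID[:], []byte(p[1])); err != nil {
		return sc, err
	}
	if _, err := hex.Decode(sc.SpanID[:], []byte(p[2])); err != nil {
		return sc, err
	}
	flags, err := hex.DecodeString(p[3])
	if err != nil {
		return sc, err
	}
	sc.Sampled = flags[0]&0x01 == 1
	return sc, nil
}

func main() {
	sc := spanContext{TraceID: [16]byte{0x4b, 0xf9}, SpanID: [8]byte{0x00, 0xf0}, Sampled: true}

	// First reconcile: store the span context on the resource.
	annotations := map[string]string{pipelineRunSpanContextKey: sc.encode()}

	// Later reconcile: restore it so new spans join the same trace.
	restored, err := decode(annotations[pipelineRunSpanContextKey])
	fmt.Println(annotations[pipelineRunSpanContextKey], restored == sc, err)
}
```

Storing the context as a plain string keeps it visible on the resource's metadata and lets any controller that can read annotations continue the trace.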

### TaskRun controller
The TaskRun reconciler retrieves the parent span context propagated by the PipelineRun controller. If it is not present (the TaskRun was created directly by a user in this case), a new span will be created. It will be used to
instrument the logic in the same way as in the PipelineRun controller.
The span context should also be made available as environment variables (using the Downward API) to the containers running the tasks, so that a task container can continue the span if it supports tracing.
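
The parent-or-new decision in the TaskRun reconciler can be sketched as follows. The annotation value is assumed to use a `00-<trace id>-<span id>-<flags>` layout, which this proposal does not mandate, and the `span` type and `startTaskRunSpan` helper are hypothetical names used only for illustration.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"strings"
)

// Annotation key named in this proposal.
const taskRunSpanContextKey = "tekton/taskrun-span-context"

// span is a hypothetical record of where a span sits in a trace.
type span struct {
	TraceID  string
	SpanID   string
	ParentID string // empty for a root span
}

// newID returns n random bytes, hex encoded.
func newID(n int) string {
	b := make([]byte, n)
	if _, err := rand.Read(b); err != nil {
		panic(err) // no sensible recovery if the system RNG fails
	}
	return hex.EncodeToString(b)
}

// startTaskRunSpan joins the parent's trace when a well-formed annotation is
// present and starts a new trace otherwise (for example, when a user created
// the TaskRun directly).
func startTaskRunSpan(annotations map[string]string) span {
	s := span{SpanID: newID(8)}
	p := strings.Split(annotations[taskRunSpanContextKey], "-")
	if len(p) == 4 && len(p[1]) == 32 && len(p[2]) == 16 {
		s.TraceID, s.ParentID = p[1], p[2]
		return s
	}
	s.TraceID = newID(16)
	return s
}

func main() {
	owned := map[string]string{
		taskRunSpanContextKey: "00-0123456789abcdef0123456789abcdef-0123456789abcdef-01",
	}
	fmt.Printf("created by PipelineRun: %+v\n", startTaskRunSpan(owned))
	fmt.Printf("created by user:        %+v\n", startTaskRunSpan(nil))
}
```

Treating a missing or malformed annotation the same way means a TaskRun is always traced, whether or not it has a parent.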

### Test Plan

There must be unit tests for the recording of spans and e2e tests for context propagation through custom resources.

### POC

A POC was developed to check the feasibility of the implementation. It can be found [here](https://github.com/kmjayadeep/pipeline/tree/opentelemetry-poc).

A trace from PipelineRun looks like the screenshot below.

![Jaeger - PipelineRun](images/0124-jaeger.png "Jaeger - Pipelinerun")

## References

* [Instrument Tekton resources for tracing](https://github.com/tektoncd/pipeline/issues/2814)
* [OpenTelemetry](https://opentelemetry.io/)
* [OpenTelemetry instrumentation in Go](https://opentelemetry.io/docs/instrumentation/go/manual/)
* [Jaeger Tracing](https://www.jaegertracing.io/)
1 change: 1 addition & 0 deletions teps/README.md
@@ -287,3 +287,4 @@ This is the complete list of Tekton teps:
|[TEP-0118](0118-matrix-with-explicit-combinations-of-parameters.md) | Matrix with Explicit Combinations of Parameters | implementable | 2022-08-08 |
|[TEP-0119](0119-add-taskrun-template-in-pipelinerun.md) | Add taskRun template in PipelineRun | implementable | 2022-09-01 |
|[TEP-0120](0120-canceling-concurrent-pipelineruns.md) | Canceling Concurrent PipelineRuns | proposed | 2022-09-23 |
|[TEP-0124](0124-distributed-tracing-for-tasks-and-pipelines.md) | Distributed tracing for Tasks and Pipelines | proposed | 2022-09-30 |
Binary file added teps/images/0124-jaeger.png