
OpenTelemetry Proposal: Introduce semantic conventions for CI/CD observability #223

Status: Closed · wants to merge 5 commits
75 changes: 75 additions & 0 deletions text/0223-cicd-observability-OTEP.md
@@ -0,0 +1,75 @@
# OpenTelemetry Proposal: CI/CD Observability Support by OpenTelemetry

The OpenTelemetry project can serve Continuous Integration & Continuous Delivery (CI/CD) observability use cases.

Suggested change:
The OpenTelemetry project can serve integration, delivery, and deployment observability use cases.
In particular, it would be useful for implementations of widely adopted development practices, including but not limited to
Continuous Integration & Continuous Delivery (CI/CD), Continuous Deployment, Progressive Delivery, Canary deployments, and A/B testing.
All of these practices are automated by hundreds of different tools, so having a unified convention would help
improve interoperability and generalize the processing logic of tools that consume OpenTelemetry data.


## Motivation

OpenTelemetry is already known for DevOps use cases around monitoring production systems and reducing mean time to identification and resolution/recovery (MTTI/MTTR).
However, the project can also bring value to pre-production DevOps use cases by enabling monitoring of Continuous Integration & Continuous Delivery (CI/CD) pipelines. CI/CD observability helps reduce the Lead Time for Changes, another crucial [DORA metric](https://horovits.medium.com/improving-devops-performance-with-dora-metrics-918b9604f8e2), which measures how long it takes a commit to reach production.

This enhancement will broaden the project's target audience to include Release Engineering teams, and will open up a whole new value proposition for OpenTelemetry in the software release process, in close collaboration and integration with the CI/CD ecosystem, its specifications, and its tooling.

## Explanation

Lack of CI/CD observability results in an unnecessarily long Lead Time for Changes, a crucial metric that measures how long it takes a commit to reach production.

CI/CD tools today emit various telemetry data, whether logs, metrics, or traces, to report on the release pipeline state, help pinpoint flakiness, and accelerate root cause analysis of failures, whether they stem from the application code, a configuration, or the CI/CD environment. However, these tools do not follow any particular standard, specification, or semantic convention, which makes it hard to use observability tools to monitor these pipelines. Some of these tools provide observability visualization and analytics capabilities out of the box, but beyond the resulting tight coupling, the offered capabilities are often not enough, especially when one wishes to monitor aggregated information across different tools and different stages of the release process.

Some tools have started adopting OpenTelemetry, which is an important step toward standardization. A good example is [Jenkins](https://github.com/jenkinsci/jenkins), a popular open source CI project, which offers the [Jenkins OpenTelemetry plugin](https://plugins.jenkins.io/opentelemetry/) for emitting telemetry data in order to:

1. Visualize job and pipeline executions as distributed traces
2. Visualize Jenkins and pipeline health indicators
3. Troubleshoot Jenkins performance with distributed tracing of HTTP requests

GitHub Actions also supports sending OpenTelemetry data, using the otel-export-trace-action action.

Building CI/CD observability involves four stages: Collect → Store → Visualize → Alert. OpenTelemetry provides a unified way for the first step, namely collecting and ingesting the telemetry data in an open and uniform manner.

If you are a CI/CD tool builder, the specification and instrumentation will enable you to properly structure your telemetry, then package and emit it over OTLP. The OpenTelemetry specification will determine which data to collect, the semantic conventions for that data, and how different signal types can be correlated based on those conventions, to support downstream analytics of that data by various tools.
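
To make this concrete, here is a minimal sketch of how a CI/CD tool might emit a pipeline run as a trace over OTLP using the OpenTelemetry Python SDK. The attribute names used here (`cicd.pipeline.name`, `vcs.commit.sha`, `cicd.step.status`) are illustrative placeholders, not conventions defined by this OTEP.

```python
# Illustrative sketch only; attribute names are hypothetical placeholders,
# not an agreed-upon semantic convention.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Identify the emitting CI/CD tool via the resource.
provider = TracerProvider(resource=Resource.create({"service.name": "my-ci-server"}))
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))  # OTLP/gRPC, default endpoint
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("ci.pipeline.instrumentation")

# One span per pipeline run, one child span per step.
with tracer.start_as_current_span("pipeline-run") as run:
    run.set_attribute("cicd.pipeline.name", "build-and-deploy")  # hypothetical attribute
    run.set_attribute("vcs.commit.sha", "abc123")                # hypothetical attribute
    with tracer.start_as_current_span("step: unit-tests") as step:
        step.set_attribute("cicd.step.status", "success")        # hypothetical attribute
```
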
Contributor:

I think there is another potential spec here, considering that a lot of CI/CD is a combination of tooling used in a repeatable fashion.

For example, if you're able to generate a trace of your pipeline, the existing tooling is unaware of what is being used as your trace context. This means that your tools could be generating their own trace contexts that are disjoint from the CI/CD trace context, making it extremely difficult to combine the two.

@horovits (author), Jun 15, 2023:

VCS and the use cases you mention, such as "track PR size, contributors, and time to merge", are not in the scope of this OTEP. This proposal is about the release pipelines. It can carry metadata such as the build number being deployed, and perhaps the commit SHA, but it does not get into the internals of the build process. Monitoring VCS may be a valid use case for a separate proposal.


If you are an end user looking to gain observability over your pipelines, you will be able to collect OpenTelemetry-formatted telemetry using the OpenTelemetry Collector, then ingest, process, and export it to a rich ecosystem of observability analytics backend tools, independent of the CI/CD tools in use.

Here are some examples of the resulting observability visualizations in popular backend tools such as Jaeger, Grafana, and OpenSearch:

Monitoring Jenkins metrics for nodes, queues, jobs and executors with Grafana dashboard:
![Monitoring Jenkins metrics for nodes, queues, jobs and executors with Grafana dashboard](https://dytvr9ot2sszz.cloudfront.net/wp-content/uploads/2022/05/image6.png "Monitoring Jenkins metrics for nodes, queues, jobs and executors with Grafana dashboard")

Jenkins pipeline run visualized as a trace in the Timeline View in Jaeger UI:
![Jenkins pipeline run visualized as a trace in the Timeline View in Jaeger UI](https://dytvr9ot2sszz.cloudfront.net/wp-content/uploads/2022/05/image9.png)

OpenSearch dashboard for monitoring Jenkins pipelines:
![OpenSearch dashboard for monitoring Jenkins pipelines](https://dytvr9ot2sszz.cloudfront.net/wp-content/uploads/2022/05/image7.png)

Worth mentioning also the Elastic Stack, as the primary integration of the Jenkins OpenTelemetry plugin.

Suggested change:
Elastic Stack dashboard for monitoring Jenkins pipelines:
![Elastic Stack dashboard for monitoring Jenkins pipelines](https://raw.githubusercontent.com/jenkinsci/opentelemetry-plugin/master/docs/images/kibana_jenkins_overview_dashboard.png)

For more examples, see [this article](https://logz.io/learn/cicd-observability-jenkins/) on CI/CD observability using currently available open source tools.


## Internal details

The OpenTelemetry specification should be enhanced to cover semantics relevant to pipelines, such as the branch, build, step (ID, duration, status), commit SHA (or another unique ID), and run (type, status, duration). These should be geared toward observability into issues in the released application code.
In addition, release issues are oftentimes not code-based but environmental, stemming from problems in the build machines, garbage collection issues in the tool, or even a malstructured pipeline step. In order to provide observability into the CI/CD environment, especially one with a distributed execution mechanism, there is a need to monitor various entities such as nodes, queues, jobs, and executors (using the Jenkins terms; other tools have respective equivalents, which the specification should abstract through the semantic convention).
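
As an illustration of that environment-monitoring side, the sketch below (Python, OpenTelemetry metrics API) shows how a CI server could report queue depth per node. The metric name `cicd.worker.queue.length` and the `cicd.node.name` attribute are hypothetical placeholders, not a proposed convention.

```python
# Illustrative sketch; metric and attribute names are hypothetical.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader

reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
meter = metrics.get_meter("ci.environment.instrumentation")

# Track how many jobs are waiting for an executor, labeled per node.
queue_length = meter.create_up_down_counter(
    "cicd.worker.queue.length",  # hypothetical metric name
    unit="{job}",
    description="Number of jobs waiting for an executor",
)
queue_length.add(3, {"cicd.node.name": "agent-1"})   # jobs enqueued
queue_length.add(-1, {"cicd.node.name": "agent-1"})  # a job got an executor
```
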
Contributor:

This feels like another area that is conflating the goal of this OTEP.

Internal operations of Jenkins (or any build system) should be separated out into their own section.

Contributor:

If it is within your control (i.e. self-hosted runners), you can use the OTel Collector and auto-instrumentation agents (where possible and not provided by the vendor) on build nodes to surface this information.

Author:

Many times pipeline runs fail due to environmental issues rather than ones related to the deployed code.
I see discerning these two cases as a core value that CI/CD observability brings.
And while you can still use OTel to monitor your individual tools, the purpose of this OTEP is to standardize on this,
the same as we've done with the client-side instrumentation WG.

Indeed, issues with the environment in a distributed build system can be very difficult to track down, even if both the CI and the nodes are instrumented to emit OTel data.

For example, CI could use https://plugins.jenkins.io/opentelemetry/ to send traces of the job executions, and build agents could use https://github.com/prometheus/node_exporter and/or https://github.com/open-telemetry/opentelemetry-collector to export host metrics. Even in this scenario it can be difficult to correlate a failing pipeline run with the metrics of the host(s) that executed the build.

The plugin https://plugins.jenkins.io/opentelemetry-agent-metrics/ builds on https://plugins.jenkins.io/opentelemetry/ to solve this issue by running dedicated OTel collectors on each build agent and adding attributes to the metrics identifying the run and the main CI controller (https://github.com/jenkinsci/opentelemetry-agent-metrics-plugin/blob/main/src/main/resources/io/jenkins/plugins/onmonit/otel.yaml.tmpl)
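
As a rough sketch of that correlation idea (not the plugin's actual mechanism; the resource attribute keys below are hypothetical), an SDK or collector running on a build agent could stamp every metric it emits with attributes identifying the run and the CI controller:

```python
# Illustrative sketch: host metrics from a build agent carry resource
# attributes identifying the pipeline run and the CI controller, so they can
# be correlated with the pipeline trace. Attribute keys are hypothetical.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource

resource = Resource.create({
    "service.name": "build-agent",
    "cicd.controller.url": "https://ci.example.com",  # hypothetical key
    "cicd.run.id": "build-and-deploy#142",            # hypothetical key
})
reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(resource=resource, metric_readers=[reader]))

meter = metrics.get_meter("agent.host.metrics")
cpu_time = meter.create_counter("example.agent.cpu.time", unit="s")  # example metric name
cpu_time.add(1.5, {"state": "user"})
```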


The CDF (Continuous Delivery Foundation) has the Events Special Interest Group ([SIG Events](https://github.com/cdfoundation/sig-events)), which explores standardizing CI/CD events to facilitate interoperability (it is a work stream within the CDF SIG Interoperability). The group is working on [CDEvents](https://cdevents.dev/), a standardized event protocol that caters for technology-agnostic machine-to-machine communication in CI/CD systems. It makes sense to evaluate alignment between the standards.

Suggested change:
The CDF (Continuous Delivery Foundation) has the Events Special Interest Group ([SIG Events](https://github.com/cdfoundation/sig-events)), which explores standardizing CI/CD events to facilitate interoperability (it is a work stream within the CDF SIG Interoperability). The group is working on [CDEvents](https://cdevents.dev/), a standardized event protocol that caters for technology-agnostic machine-to-machine communication in CI/CD systems, on top of [CloudEvents](https://cloudevents.io/) or other carrier protocols. It makes sense to evaluate alignment between the standards.


OpenTelemetry instrumentation should then support collecting and emitting the new data.
Contributor:

This is more a matter of language semantics, but the instrumentation provided by OpenTelemetry doesn't need to be updated; rather, it is our definitions of the events that should be captured within this domain of interest.

Suggested change:
These defined events from the Continuous Delivery Foundation (CDF) should be merged into the semantic conventions for these systems to implement.


The OpenTelemetry Collector can then offer dedicated processors for these payloads, as well as new exporters for dedicated backend analytics tools, as such prove useful for release engineering needs beyond the existing ecosystem.
Contributor:

I would suggest moving this to a stretch goal, to be honest; contributing components to the Collector mostly requires contributors (hopefully from that company) to help maintain the component. Gathering interest from vendors to develop and contribute components may take a bit of effort, and I wouldn't want it to block this OTEP.


@kuisathaverat, May 17, 2023:

Not only can the CI/CD system send OpenTelemetry data; the tools used as part of the CI/CD pipeline can also send their own OpenTelemetry data. Distributed tracing allows combining all of these spans into a single stream of spans. This gives you more fine-grained details about your pipeline and process.

Some of the tools integrated into your pipelines can give you more details.
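
A minimal sketch of how such a tool could join the pipeline's trace, assuming the pipeline exposes a W3C trace context through a `TRACEPARENT` environment variable (a convention used by tools such as otel-cli, not yet a standard); the span names are illustrative:

```python
# Illustrative sketch: a tool invoked by a pipeline step picks up the parent
# trace context from the TRACEPARENT environment variable and emits its own
# child span, so all spans end up in a single trace.
import os

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("my-build-tool")

# Build a carrier dict from the environment and extract the parent context.
carrier = {"traceparent": os.environ.get("TRACEPARENT", "")}
parent_ctx = TraceContextTextMapPropagator().extract(carrier)

# The tool's span becomes a child of the pipeline step's span.
with tracer.start_as_current_span("compile", context=parent_ctx):
    pass  # run the tool's actual work here
```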

## Trade-offs and mitigations

Today’s tools already emit some telemetry, some of which may not easily fit into vendor-agnostic, unified semantic conventions. Such data can be accommodated within an extra baggage payload, which may be parsed in a tool-specific fashion.

## Prior art and alternatives

Today’s tools already emit some telemetry, which can be visualized by the tool’s designated backend, or by general-purpose tools with custom-built dashboards and queries for this specific data. These, however, use proprietary specifications.

## Open questions

Open questions include:
- Which entity model should be supported to best represent the CI/CD domain and its pipelines?

There is some work done in the Jenkins OpenTelemetry plugin to try to establish a general model for naming conventions.


- What are the common CI/CD workflows we aim to support?
@kuisathaverat, May 17, 2023:

The major CI/CD tools:

  • Jenkins
  • GitLab
  • GitHub Actions
  • CircleCI
  • TeamCity
  • ArgoCD
  • TravisCI
  • Azure DevOps
  • ...

And build systems/tools:

  • Maven (Java)
  • Gradle (Java)
  • Npm/yarn (Node.js)
  • Mage (Go)
  • CMake (C/C++)
  • Make
  • ...

Test frameworks:

  • Junit (Java, C, C++, ...)
  • Pytest (Python)
  • Jest (Node.js)
  • ...

Deploy/DevOps tools:

  • Ansible
  • Terraform
  • Puppet
  • Chef
  • ...

Contributor:

While it isn't mentioned here, I am happy to advocate internally for this to be added to Atlassian's Bitbucket Pipelines.

@horovits (author), Jun 15, 2023:

Good listing, thanks for putting it up.
I would first focus on CI/CD tools; build and test frameworks etc. can come as a separate phase, as they bring in new domains.


+Flux


+Tekton

- What are the primary tools that should be supported with instrumentation in order to gain critical mass on CI/CD coverage?
Contributor:

This should be made its own area of interest, mostly because there is a lot of scope here: how would we standardise tooling to work with, and link back to, the build pipeline natively?

- Is CDEvents a good fit as a specification to integrate with? What are the alignment, overlap, and gaps? If so, how do we establish the cross-foundation and cross-group collaboration in an effective manner?


OpenTelemetry is more general purpose. It has spans, metrics, and logs. Overall you can replace CDEvents with OpenTelemetry but not the other way around.

@afrittoli, Jun 22, 2023:

Thanks @horovits for this proposal and for mentioning CDEvents here!

CDEvents aims to define shared semantics for interoperability in the CI/CD space. Interoperability includes the observability space too - common semantics in the events generated by the different tools enable things like visualization, metrics and more across tools. The transport layer we use today for such events is CloudEvents; the specification is, however, decoupled from the underlying transport by design, and we were planning indeed to reach out to this community to discuss collaboration.

So, I definitely agree we should collaborate, I think it would be really valuable for the ecosystem.
Many tools are adopting OpenTelemetry and many are adopting CDEvents (or both) and having common semantics would be very beneficial.

I'd be happy to join one of the OpenTelemetry community meetings to present CDEvents, if that is helpful.

/cc @e-backmark-ericsson

CDEvents could be attached to spans as span events, thus giving the best of both worlds.

- How can we bring the existing ecosystem players, both open source and others, to form a consensus and leverage existing knowledge and experience?
- Which receivers are needed beyond OTLP to support the use cases and workflows?
- Which exporters are needed to support common backends?
- Which processors are needed to support the defined workflows?
Contributor:

I would consider these out of scope, considering that we are not looking to bring in additional instrumentation from tooling and vendors, but rather to have an agreed-upon convention that each implements.

Contributor:

+1 to this one.


Member:

Suggested change:
- How should the trace context be propagated?

There are tools which already implement some basic context propagation, such as the mentioned Jenkins plugin or otel-cli, which use environment variables for that use case.

Author:

I don't have an exhaustive list of tool integrations, and I'm glad to collect it with other contributors here.
However, I think the path here is analyzing the needed semantic conventions.
In this context, I should flag a subsequent PR opened to propose semantic conventions for deployments:
open-telemetry/opentelemetry-specification#3169
@thisthat, how do you see the alignment of these proposals?

Member:

Sorry for the late answer, @horovits.
I'd like to align the two proposals 👍 My PR focuses only on which attributes we should emit on a trace/log that describes a CI/CD pipeline. I think that as part of this OTEP we should address @secustor's point and try to agree on how the trace context can be propagated. This way, different tools can be used together and each one contributes spans to the trace, so there won't be blind spots.

@horovits I agree that shared semantics will help. For a tool to propagate context, it needs to know where to look for it in any incoming trigger, or in the data/policy used to decide to perform an action. It also needs to know how to enrich the context and where to send it to propagate it further. Shared semantics definitely help here, which is very much aligned with the mission of CDEvents.

CDEvents has a simple model for deployment-related events today; it would be great to have shared semantics here as well. /cc @AloisReitbauer I think this would be of interest to the App Delivery TAG as well.


@thschue FYI ^^

Member:

I would like to retrigger this discussion, as I'm facing some decisions which will have an effect on how we are tackling things.

I see three basic ways we should support propagating context:

HTTP headers

Basically, what is already in place with the W3C standard.
The use case here would be systems getting triggered by a webhook, e.g. VCS to CI system.

Environment variables

I'm imagining here one process handing over context to another, e.g. a pipeline run handing over context to a tool used inside of the run.

Other

Other ways are possible too, though I would add them to the specification and make them optional, e.g. CLI parameters or files containing the necessary information.
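
A minimal sketch of the environment-variable option, assuming the W3C `traceparent` format and a `TRACEPARENT` variable name (the convention used by otel-cli, not an OpenTelemetry standard), and assuming an SDK tracer provider has already been configured:

```python
# Illustrative sketch: a pipeline runner injects the active trace context into
# the environment of a child process (e.g. a build tool), so the tool can
# continue the same trace. TRACEPARENT is a convention used by tools such as
# otel-cli, not (yet) part of the OpenTelemetry specification.
import os
import subprocess

from opentelemetry import trace
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

tracer = trace.get_tracer("ci.pipeline.runner")  # assumes an SDK TracerProvider is set up

with tracer.start_as_current_span("step: build"):
    # Serialize the active span context into a W3C `traceparent` header value.
    carrier = {}
    TraceContextTextMapPropagator().inject(carrier)

    # Hand the context to the child process via its environment.
    env = dict(os.environ, TRACEPARENT=carrier.get("traceparent", ""))
    subprocess.run(["make", "build"], env=env, check=True)  # example build command
```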

Contributor:

@secustor - not sure if you've seen it, but there's another OTEP in progress that is specifically related to context propagation at the environment level: open-telemetry/opentelemetry-specification#740

@deejgregor has a working branch for that addition as well, which was brought up on today's SIG alongside this OTEP.

## Future possibilities

This OTEP will enable customized instrumentation options, as well as processing within the Collector, dedicated to the capabilities and evolution of CI/CD tools and the CI/CD domain.