Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for OpenTelemetry tracing #2796

Closed
sjonpaulbrown opened this issue Jul 13, 2022 · 4 comments
Closed

Add support for OpenTelemetry tracing #2796

sjonpaulbrown opened this issue Jul 13, 2022 · 4 comments
Assignees

Comments

@sjonpaulbrown
Copy link
Collaborator

Problem Definition

We would like to add support for tracing with OpenTelemetry. Currently, we support tracing with Open Tracer, but we would like to support OpenTelemetry so that we can integrate into our standard model for distributed tracing.

Proposed Solution

To add support for OpenTelemetry, we can add a new struct that implements the Tracer interface, and we need to add support for enabling tracing with OpenTelemetry.

Definition of Done

  • OpenTelemetry tracing is supported
  • Nodes can be started up with OpenTelemetry tracing enabled as opposed to starting up with OpenTracer
@sjonpaulbrown
Copy link
Collaborator Author

The initial work for this has been started here.

@ramtinms
Copy link
Contributor

ramtinms commented Jul 13, 2022

Some notes on how this should be handled:

  • we are currently hacking spans by overriding trace_ids to match entity ids (blockID, transactionID, etc), this is mostly done to circumvent having to deal with passing trace_id s over the libp2p and gossip network and makes it in a way that trace_id is known by nature of the entity.

  • we also disabled the controllable sampling from trace aggregators and do it based on the ID of the entity and sensitivity flag sets per node level. For example, currently we collect traces for every block that has a blockID starting with a zero byte). This prevents nodes' performance be controllable by trace aggregation agents and other external factors and keeps a balanced way of sampling according to the needs of the protocol.

@SaveTheRbtz
Copy link
Contributor

SaveTheRbtz commented Jul 14, 2022

Seems like OTEL supports overriding both:

  • IDGenerator gives hooks for generating custom TraceIDs.
  • Sampler provides an interface for making a decision on sampling.

This way we can likely make it backwards compatible with the current tracer. Let me try migrating existing code to it.

bors bot added a commit that referenced this issue Jul 26, 2022
2823: [All] OpenTracing to OpenTelemetry migration r=SaveTheRbtz a=SaveTheRbtz

Issue: #2796

OpenTracing spec is deprecated: opentracing/specification#163.  Switching to the supported OpenTelemetry implementation.

Note that new tracing code would honor [OTEL env vars](https://github.com/open-telemetry/opentelemetry-specification/blob/v1.12.0/specification/protocol/exporter.md) instead of JAEGER ones, e.g.:
`JAEGER_ENDPOINT` -> `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT`, `OTEL_EXPORTER_OTLP_TRACES_INSECURE`, etc

(cc: `@sjonpaulbrown` `@haroldsphinx)`

New API is slightly slower and consumes more memory but does substantially less allocs (~ -50%):
```
name                              old time/op    new time/op    delta
StartSpanFromParent-8                485ns ± 6%     639ns ± 1%  +31.71%  (p=0.000 n=10+9)
StartTransactionSpan/cacheHit-8      993ns ± 4%     927ns ± 3%   -6.69%  (p=0.000 n=10+10)
StartTransactionSpan/cacheMiss-8    2.24µs ± 1%    2.19µs ± 1%   -2.42%  (p=0.000 n=10+10)

name                              old alloc/op   new alloc/op   delta
StartSpanFromParent-8                 577B ± 0%      864B ± 0%  +49.74%  (p=0.000 n=10+10)
StartTransactionSpan/cacheHit-8       776B ± 1%      944B ± 0%  +21.65%  (p=0.000 n=10+8)
StartTransactionSpan/cacheMiss-8    2.16kB ± 1%    2.39kB ± 0%  +10.62%  (p=0.000 n=10+10)

name                              old allocs/op  new allocs/op  delta
StartSpanFromParent-8                 7.00 ± 0%      4.00 ± 0%  -42.86%  (p=0.000 n=10+10)
StartTransactionSpan/cacheHit-8       10.3 ± 7%       6.0 ± 0%  -41.75%  (p=0.000 n=10+10)
StartTransactionSpan/cacheMiss-8      35.3 ± 2%      18.0 ± 0%  -49.01%  (p=0.000 n=10+10)
```

An example of both extensive tracing and cadence tracing enabled:
<img width="1459" alt="Screen Shot 2022-07-25 at 10 21 07 PM" src="https://user-images.githubusercontent.com/169976/180929610-edbfdc54-e604-401f-aea5-501af2f861b2.png">

- [x] test against localnet's Tempo instance.
- [x] noop tracer.
- [x] log tracer.
- [x] merge cadence bits onflow/cadence#1824.
- [x] merge onflow/flow-core-contracts#298.
- [x] update cadence.

Co-authored-by: Alexey Ivanov <rbtz@dapperlabs.com>
@SaveTheRbtz
Copy link
Contributor

OpenTelemetry support has landed.

To control new library, feel free to switch from old JAEGER_* env vars to the new OTEL_EXPORTER_OTLP_TRACES_*, for example localnet uses following: a97f416.
Full list of env vars: https://github.com/open-telemetry/opentelemetry-specification/blob/v1.12.0/specification/protocol/exporter.md

(NB! We use gRPC exporter by default, so you need to use gRPC-enabled collector on the other side.)

Re-assigning back to @sjonpaulbrown for verification. (cc: @haroldsphinx)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants