Document limitations on span recording #3152

Closed
8 changes: 8 additions & 0 deletions specification/protocol/otlp.md
@@ -39,6 +39,7 @@ nodes such as collectors and telemetry backends.
- [Implementation Recommendations](#implementation-recommendations)
* [Multi-Destination Exporting](#multi-destination-exporting)
- [Known Limitations](#known-limitations)
* [Span Tracking](#span-tracking)
* [Request Acknowledgements](#request-acknowledgements)
+ [Duplicate Data](#duplicate-data)
- [Future Versions and Interoperability](#future-versions-and-interoperability)
@@ -643,6 +644,13 @@ speed of reception (within the available limits imposed by the size of the
client-side queue).

## Known Limitations

### Span Tracking

It is impossible to send incomplete spans, so if the span failed to complete for
Member

Why is it impossible?

Member

I just asked the same question 😆 #3152 (review)

Author

@abitrolly Jan 27, 2023

> It's unclear what "real-time tracking" means.

Watching spans in real time: when you have a long-running task, you can see its span and parent span before they are finished, in the middle of the run. This helps trace long-running jobs.

EDIT: Also to trace long-running jobs that were eventually terminated by a timeout or in any other way where the end of the span was not reached, recorded, or sent.

Author

> Why is it impossible?

Like @reyang said, "it is just the current exporter and protocol which don't support the concept of having separate data that represents start and stop events". The protocol doesn't support sending a span with no end attribute. That's why it is impossible to trace spans that were not completed using this protocol.

Member

First, separate concepts for start and stop events are not required for that. I feel like the only limitation is the restriction placed on the `end_time_unix_nano` field, i.e. if it's allowed to be 0 to indicate an incomplete span, then no other changes in the protocol are required to support the use case you describe.
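
A minimal sketch of that idea, assuming a convention that the current OTLP specification does not define: a zero `end_time_unix_nano` is read as "span still running". The `SpanRecord` class and the `is_incomplete` helper are hypothetical names used only for illustration; only the two timestamp field names come from the OTLP span message.

```python
from dataclasses import dataclass


@dataclass
class SpanRecord:
    """Hypothetical, simplified stand-in for an OTLP span message."""
    span_id: str
    name: str
    start_time_unix_nano: int
    end_time_unix_nano: int = 0  # assumption: 0 means "not ended yet"


def is_incomplete(span: SpanRecord) -> bool:
    # Under the assumed convention, a zero end time marks a running span.
    return span.end_time_unix_nano == 0


running = SpanRecord("a1", "batch-job", start_time_unix_nano=1_000)
finished = SpanRecord("b2", "http-request", 1_000, 2_500)
```

With such a convention, a receiver could show `running` as in progress and later replace it once a record with a non-zero end time arrives for the same span ID.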

Member

@reyang Jan 27, 2023

> > It's unclear what "real-time tracking" means.
>
> Watching spans in real time: when you have a long-running task, you can see its span and parent span before they are finished, in the middle of the run. This helps trace long-running jobs.
>
> EDIT: Also to trace long-running jobs that were eventually terminated by a timeout or in any other way where the end of the span was not reached, recorded, or sent.

Got it, thanks @abitrolly! Maybe this can be a solution https://github.com/open-telemetry/opentelemetry-specification/blob/main/experimental/trace/zpages.md?

> This zPage is also useful for debugging latency issues (slow parts of applications), deadlocks and instrumentation problems (running spans that don't end)...

I believe spans are not designed for long-running operations (e.g. I don't feel a span is the right tool to track a batch job which runs for 5 hours).

Author

@yurishkuro maybe there is a better page to document the OpenTelemetry limitations? I thought that the protocol encompasses all interactions between tools. I would actually prefer to define an actual solution rather than document limitations. For me, a good solution is one that makes the end time optional, not relying on a specific value being set.

Author

@reyang zPages is a bad alternative, because it requires polling over HTTP. Not all long-running processes that need to be traced expose a web server. Think CI/CD pipelines, for example.

If OpenTelemetry is a replacement for all other tracing protocols, it should support this tracing scenario firsthand (#2930).

Member

@abitrolly I am not sure documenting limitations is a particularly productive exercise, especially in this case, where I don't even see it as a limitation but rather as a missing feature that simply has not been high on the priority list. I think there is an opportunity here to spec a new feature to support long-running processes better. I am not saying that this is the only way to implement such a feature (i.e. you could go with a completely different, event-based protocol, aka a streaming implementation of the OTEL API), but in practice sticking with the existing protocol is a much easier path. For example, Jaeger already has the ability to receive multiple instances under the same span ID and merge them at query time, but that is probably not completely sufficient for this use case. I would prefer a better definition of the merge semantics and a clear spec of the protocol that indicates partial spans.
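
To make "merge semantics" concrete, here is a minimal sketch of combining several partial records that share a span ID. It does not reflect how Jaeger actually merges spans; `merge_partial_spans` and the dict layout are hypothetical, and the rule used (earliest start, latest end, later attributes override earlier ones) is only one possible definition.

```python
from typing import Dict, List


def merge_partial_spans(records: List[dict]) -> Dict[str, dict]:
    """Merge records sharing a span_id; later records win on attribute conflicts."""
    merged: Dict[str, dict] = {}
    for rec in records:
        cur = merged.get(rec["span_id"])
        if cur is None:
            # First record seen for this span ID; a missing end is stored as 0.
            merged[rec["span_id"]] = {
                "span_id": rec["span_id"],
                "start": rec["start"],
                "end": rec.get("end", 0),
                "attributes": dict(rec.get("attributes", {})),
            }
            continue
        cur["start"] = min(cur["start"], rec["start"])
        cur["end"] = max(cur["end"], rec.get("end", 0))
        cur["attributes"].update(rec.get("attributes", {}))
    return merged


records = [
    {"span_id": "a1", "start": 1_000, "attributes": {"stage": "build"}},
    {"span_id": "a1", "start": 1_000, "end": 9_000, "attributes": {"stage": "deploy"}},
]
result = merge_partial_spans(records)
```

Even this small sketch has to decide how conflicting attributes and missing end times are handled, which is the kind of detail a spec for partial spans would need to pin down.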

Author

@yurishkuro if OTLP is going to be labelled stable, this feature won't find its place in the spec, and the spec's limitations need to be described.

any reason (abnormal termination, exception, logical error), there won't be any
sign that the span even started. For the same reason real-time tracking of spans
is impossible. Spans are sent only when their end is reached.

### Request Acknowledgements
