Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Messaging Batch Receive - swap parent / link? #958

Closed
anuraaga opened this issue Sep 16, 2020 · 10 comments
Closed

Messaging Batch Receive - swap parent / link? #958

anuraaga opened this issue Sep 16, 2020 · 10 comments
Assignees
Labels
area:semantic-conventions Related to semantic conventions area:span-relationships Related to span relationships release:after-ga Not required before GA release, and not going to work on before GA spec:trace Related to the specification/trace directory

Comments

@anuraaga
Copy link
Contributor

In the spec for message batch receiving

https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/trace/semantic_conventions/messaging.md#batch-receiving

we define that processing spans should have a parent set to the receiving span and a link to the producing span. I feel this could be reversed to provide a better experience - not all tracing systems support links, so for a default experience, isn't it better for the processing span to be part of the same trace as the producing span, not the batch receiver which is an implementation detail of messaging?

@anuraaga anuraaga added the spec:trace Related to the specification/trace directory label Sep 16, 2020
@anuraaga
Copy link
Contributor Author

anuraaga commented Sep 16, 2020

For reference, I'm working on some instrumentation for lambda processing of SQS here

https://github.com/open-telemetry/opentelemetry-java-instrumentation/pull/1210/files

I'm trying to best-effort produce traces where possible, mostly to have best compatibility with tracing systems including ones without links. Principle is data is better than no data, and trace of request is better than only having a trace of message processing.

So I tried this sort of logic

https://github.com/open-telemetry/opentelemetry-java-instrumentation/pull/1210/files#diff-ed7a1d14efebc980171842b736da5616R88

User code, and therefore anything using auto instrumentation, can only process SQSEvent, there is no per-message processing model like other queue systems. So I try my best to connect a batch to a trace and in this case there is no process span.

If all messages in a batch have the same parent, use it as the parent of the receive span, add a link to message execution system span if it's present. Notably, this handles the case where there is a single message in the batch with a parent, and it's cool IMO that auto instrumentation would connect it to the trace. Single message batch is fairly common for SQS users.

Otherwise, with no common parent found, the parent of the receive span is set to the message execution system span, and a link is added for all of the parents among the messages.

Admittedly, this is not consistent behavior. But it feels like this provides the best UX if links are ignored by a tracing system. I haven't used links much myself, but from what I understand there are generally two types of links, and IIUC, these would represent those two types of links (span connected to message parent means link is type of follows_from, span connected to execution engine parent means links are all type of child_of? No clue please teach me :) ). Open to any thoughts on that.

Less complicated, unlike auto instrumentation, it is possible for our tracing library itself to provide a per-message processing model and I've tried this as well

https://github.com/open-telemetry/opentelemetry-java-instrumentation/pull/1210/files#diff-ed7a1d14efebc980171842b736da5616R65

In this case, a process span is created per message, and I set the parent of the process span to the message parent when available with a link to the execution engine span. This completely contradicts the spec :) And is why I've tried opening this issue to collect thoughts.

/cc @arminru

@arminru
Copy link
Member

arminru commented Sep 16, 2020

It is indeed a pity that without link support we will lose the link to the producing span in case of batch receiving. The processing span being a child of the batch receiving span is, however, consistent with the general parent-child relationship we define in the spec. The receive operation is the most direct cause of the processing operation and the initial message producing operation is only indirectly causing the processing, even though this might be interpreted as an implementation detail. It is rather a limitation of a batch receive operation, that it will usually start before the first message is deserialized and interpreted to get the trace ID and parent span ID from the message's metadata and that there can be multiple parents which is also not supported.

from what I understand there are generally two types of links, and IIUC, these would represent those two types of links (span connected to message parent means link is type of follows_from, span connected to execution engine parent means links are all type of child_of

There is currently no parent or link type in the spec. Adding parent types (child_of and follows_from) was discussed in order to represent sync and async children in #65 and #906 but no agreement was found there and therefore it will be addressed after GA.

Less complicated, unlike auto instrumentation, it is possible for our tracing library itself to provide a per-message processing model and I've tried this as well
In this case, a process span is created per message, and I set the parent of the process span to the message parent when available with a link to the execution engine span.

With "message parent" you mean the producing span and with "execution engine span" you mean the batch receive span, right?

@arminru
Copy link
Member

arminru commented Sep 16, 2020

Labeling it as "required for GA" to decide if we want to adapt the semantic convention here or not. This would be a breaking change and we are yet to come up with a proper solution for handling these in semantic conventions, especially since this is about changing the parent/child and link semantics rather than attributes, where we could introduce additional ones with deviating semantics and deprecate the old ones.

@arminru arminru added area:semantic-conventions Related to semantic conventions priority:p3 Lowest priority level release:required-for-ga Must be resolved before GA release, or nice to have before GA labels Sep 16, 2020
@Oberon00
Copy link
Member

Oberon00 commented Sep 16, 2020

This might be solvable with link attributes. E.g. we could prescribe a certain link attribute as identifier for this messaging "uplink" or just consider the first link of a messaging processing span special. Then the collector/exporter could be configured to do the translation for backends that do not support links. I think theoretically & semantically, the current conventions make sense as-is.

E.g. we could add "is_logical_main_cause"=true to the link and exporters for protocols without link support would set the parent to this link.

@anuraaga
Copy link
Contributor Author

@Oberon00 Using link attributes to allow exporters to map for what's best UX for them seems like a nice idea to me.

I'm presuming Dynatrace supports Links - could you describe how they're handled just to understand the implications on the model better? I can imagine 3 producer traces, and 1 receiver trace that handled the three independent messages as a batch. Does the trace for each of the producer traces also display the receiver trace on it?

The receive operation is the most direct cause of the processing operation and the initial message producing operation is only indirectly causing the processing, even though this might be interpreted as an implementation detail.

I do get the motivation for this but at the same time am not fully convinced. For example, unless it is fan-in, for a message processor of independent messages in a batch, a failure to process one message doesn't cause other messages in a batch to not be processed. But that failure to process the message affects the request that produced it - this causal relationship seems far more important. So wouldn't we want to model this using our parent/child model? Of course, if a Link type is child_or, then it is the same parent/child model. But in that case, I guess I don't know the reasoning to pick one over the other, though my intuitive decision making would be, using native parent to represent that real causal relationship within the request itself.

So I still want to propose the possibility of swapping parent / link for a processing span of an individual message. As a user viewing the trace, I expect the trace to look something like this all in the same trace - actually the API calls and receive span, while interesting and nice to have, are not representative of the trace that I was expecting, this is my business logic that handles a single request to submit two orders, for example. It might just be me, of course, I'm presenting this as an example and not truth but it has been very useful to me in the past.

-----------------Request------------------
---Produce1--- ---Produce2---
                                                                                ---Process1--- ---Process2---

I don't know the status of the request, or how long it took to process the order, etc, without seeing the consumption together. Do systems that support Link usually behave this way? If parent / link were swapped, then all systems would work this way I think. I haven't used a tracing system with links before, but I have traced requests like the above that span a message queue and appreciated this modeling - this was via my own manual instrumentation using Brave though.

Similar to how we usually define span name to a good baseline for systems that don't understand rich semantic attributes, I feel like parent can be considered the baseline for cause relationship. If we set parent in the way that's compatible with as many systems as possible, the subset that support links can do even better. Yes, this can also be shifted to the exporter as @Oberon00 mentions and it seems like an good idea too, but not sure it's better.

@Oberon00
Copy link
Member

Oberon00 commented Sep 17, 2020

I'm presuming Dynatrace supports Links

Actually we don't 😞 I have been informed that we actually do, since recently, but the below still applies:

We have had the same receive/process thing in our SDK for quite some time and what we did was basically use a workaround: We will suppress creation of the receive Span if it would be a root Span (i.e. if there is no active trace when it is created; the Dynatrace SDK does not support explicit parents). That way, the process span can get the incoming message as ordinary parent. However, if the receive span already has a parent, the incoming message parent will basically be ignored in the Dynatrace SDK.

@Oberon00
Copy link
Member

Oberon00 commented Sep 17, 2020

my intuitive decision making would be, using native parent to represent that real causal relationship within the request itself

Good point. However, consider the case where the receive span is caused by an incoming HTTP request. Now what is the "real causal relationship"? Incoming HTTP -> Receive -> Process 1..n, or Send 1..n -> Process 1..n. In both cases the other part of the trace gets orphaned in systems that don't support links.

I think we should define a good meta semantic convention for what to use links vs parent for.

Currently we have https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/overview.md#links-between-spans which may need to be updated. I don't think it currently matches our semantic conventions. I also don't think

When using the scatter/gather (also called fork/join) pattern, the root operation starts multiple downstream processing operations and all of them are aggregated back in a single Span. This last Span is linked to many operations it aggregates.

is implementable without adding links after span start, if you still want correct timings.

@anuraaga
Copy link
Contributor Author

@Oberon00 I'm not entirely sure what you mean by "receive span is caused by an incoming HTTP request". Is it the APIReceive span I mention in open-telemetry/semantic-conventions#652? If so, it's true I have no idea what to do with that span ;) Regardless of spec, I also don't have a great handle on the model and am trying to get it in that issue, appreciate any help. But my initial impression is if there are going to be orphans, I want the spans that aren't part of the actual business logic, receive in your example IIUC, to be the orphan.

But regardless of how the spec turns out, more clear semantic conventions on parent vs link would be great!

@Oberon00
Copy link
Member

I'm not entirely sure what you mean by "receive span is caused by an incoming HTTP request"

I was just thinking of a theoretical scenario. For example, you might have a REST endpoint "POST /flushmessages" that triggers the processing of all messages in a queue. The HTTP server span would be the parent of the (probably async) receive span(s) that then trigger the process spans.

@arminru arminru added release:after-ga Not required before GA release, and not going to work on before GA and removed priority:p3 Lowest priority level release:required-for-ga Must be resolved before GA release, or nice to have before GA labels Sep 25, 2020
@Oberon00 Oberon00 added the area:span-relationships Related to span relationships label Oct 12, 2020
@pyohannes pyohannes moved this to V1 - Stable Semantics in Spec: Messaging Semantics Mar 30, 2022
@pyohannes
Copy link
Contributor

@anuraaga (if you're still somewhere around), we presumably clarified this situation with merging open-telemetry/semantic-conventions#284.

@pyohannes pyohannes self-assigned this Nov 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:semantic-conventions Related to semantic conventions area:span-relationships Related to span relationships release:after-ga Not required before GA release, and not going to work on before GA spec:trace Related to the specification/trace directory
Projects
Status: V1 - Stable Semantics
Development

No branches or pull requests

4 participants