Messaging Batch Receive - swap parent / link? #958
Comments
For reference, I'm working on some instrumentation for Lambda processing of SQS here: https://github.com/open-telemetry/opentelemetry-java-instrumentation/pull/1210/files

I'm trying to produce traces on a best-effort basis wherever possible, mostly for the best compatibility with tracing systems, including ones without links. The principle is that data is better than no data, and a trace of the request is better than only having a trace of message processing. So I tried this sort of logic.

User code, and therefore anything using auto instrumentation, can only process

If all messages in a batch have the same parent, use it as the parent of the receive span, and add a link to the message execution system span if it's present. Notably, this handles the case where there is a single message in the batch with a parent, and it's cool IMO that auto instrumentation would connect it to the trace. Single-message batches are fairly common for SQS users.

Otherwise, with no common parent found, the parent of the receive span is set to the message execution system span, and a link is added for each of the parents among the messages.

Admittedly, this is not consistent behavior. But it feels like this provides the best UX if links are ignored by a tracing system. I haven't used links much myself, but from what I understand there are generally two types of links, and IIUC, these cases would represent those two types (span connected to message parent means the link is of type follows_from, span connected to execution engine parent means the links are all of type child_of? No clue, please teach me :) ). Open to any thoughts on that.

Less complicated: unlike with auto instrumentation, our tracing library itself can provide a per-message processing model, and I've tried this as well. In this case, a process span is created per message, and I set the parent of the process span to the message parent when available, with a link to the execution engine span.

This completely contradicts the spec :) which is why I've opened this issue to collect thoughts. /cc @arminru |
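A minimal sketch of the receive-span logic described in the comment above, not the code from the linked PR. `Ctx` and `plan_receive_span` are hypothetical names, and plain dataclasses stand in for real span contexts. It also assumes that "all messages have the same parent" means every message in the batch carries a parent and they are all identical; the comment does not say how a batch with some parentless messages should be handled.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Ctx:
    """Hypothetical stand-in for a span context (not the OpenTelemetry API)."""
    trace_id: str
    span_id: str


def plan_receive_span(
    message_parents: List[Optional[Ctx]],
    execution_ctx: Optional[Ctx],
) -> Tuple[Optional[Ctx], List[Ctx]]:
    """Return (parent, links) for the batch receive span."""
    present = [p for p in message_parents if p is not None]
    distinct = list(dict.fromkeys(present))  # de-duplicate, keep first-seen order

    # Assumption: a "common parent" requires every message to carry the same parent.
    all_have_parent = bool(message_parents) and len(present) == len(message_parents)
    if all_have_parent and len(distinct) == 1:
        # Join the producer's trace; link the execution system span if present.
        links = [execution_ctx] if execution_ctx is not None else []
        return distinct[0], links

    # No common parent: stay under the execution system span and
    # link every distinct message parent.
    return execution_ctx, distinct
```

With a single-message batch, the receive span joins the producer's trace and only links to the execution span; with mixed parents it stays under the execution span and links each producer once.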
It is indeed a pity that without link support we will lose the link to the producing span in case of batch receiving. The processing span being a child of the batch receiving span is, however, consistent with the general parent-child relationship we define in the spec. The receive operation is the most direct cause of the processing operation, and the initial message producing operation is only indirectly causing the processing, even though this might be interpreted as an implementation detail. Rather, it is a limitation of a batch receive operation that it will usually start before the first message is deserialized and its metadata interpreted to get the trace ID and parent span ID, and that there can be multiple parents, which is also not supported.
There is currently no parent or link type in the spec. Adding parent types (child_of and follows_from) was discussed in order to represent sync and async children in #65 and #906 but no agreement was found there and therefore it will be addressed after GA.
By "message parent" you mean the producing span and by "execution engine span" you mean the batch receive span, right? |
Labeling it as "required for GA" to decide if we want to adapt the semantic convention here or not. This would be a breaking change and we are yet to come up with a proper solution for handling these in semantic conventions, especially since this is about changing the parent/child and link semantics rather than attributes, where we could introduce additional ones with deviating semantics and deprecate the old ones. |
This might be solvable with link attributes. E.g. we could prescribe a certain link attribute as identifier for this messaging "uplink" or just consider the first link of a messaging processing span special. Then the collector/exporter could be configured to do the translation for backends that do not support links. I think theoretically & semantically, the current conventions make sense as-is. E.g. we could add |
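A rough sketch of the exporter-side translation suggested above, for backends that do not support links. Everything here is an assumption for illustration: the attribute name `messaging.uplink` is made up and not part of any convention, and `SpanData`, `Link` and `reparent_for_linkless_backend` are hypothetical types rather than collector or SDK APIs.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List, Optional

UPLINK_ATTR = "messaging.uplink"  # hypothetical attribute name, not in any spec


@dataclass
class Link:
    trace_id: str
    span_id: str
    attributes: Dict[str, object] = field(default_factory=dict)


@dataclass
class SpanData:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    links: List[Link] = field(default_factory=list)


def reparent_for_linkless_backend(span: SpanData) -> SpanData:
    """Return a copy whose parent is the link marked as the messaging uplink."""
    uplink = next((l for l in span.links if l.attributes.get(UPLINK_ATTR)), None)
    if uplink is None:
        return span  # no marked link: export the span unchanged
    # Move the span into the producer's trace, directly under the producing span.
    return replace(span, trace_id=uplink.trace_id, parent_span_id=uplink.span_id)
```

One caveat of this sketch: moving a processing span into another trace would also require rewriting the trace ID of its descendants, which a per-span translation like this does not do.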
@Oberon00 Using link attributes to allow exporters to map for what's best UX for them seems like a nice idea to me. I'm presuming Dynatrace supports
I do get the motivation for this but at the same time am not fully convinced. For example, unless it is fan-in, for a message processor of independent messages in a batch, a failure to process one message doesn't cause other messages in a batch to not be processed. But that failure to process the message affects the request that produced it - this causal relationship seems far more important. So wouldn't we want to model this using our parent/child model? Of course, if a So I still want to propose the possibility of swapping parent / link for a processing span of an individual message. As a user viewing the trace, I expect the trace to look something like this all in the same trace - actually the API calls and receive span, while interesting and nice to have, are not representative of the trace that I was expecting, this is my business logic that handles a single request to submit two orders, for example. It might just be me, of course, I'm presenting this as an example and not truth but it has been very useful to me in the past.
I don't know the status of the request, or how long it took to process the order, etc., without seeing the consumption together. Do systems that support Similar to how we usually define span name to be a good baseline for systems that don't understand rich semantic attributes, I feel like parent can be considered the baseline for causal relationships. If we set parent in the way that's compatible with as many systems as possible, the subset that support links can do even better. Yes, this can also be shifted to the exporter as @Oberon00 mentions and it seems like a good idea too, but I'm not sure it's better. |
We have had the same receive/process thing in our SDK for quite some time and what we did was basically use a workaround: We will suppress creation of the receive Span if it would be a root Span (i.e. if there is no active trace when it is created; the Dynatrace SDK does not support explicit parents). That way, the process span can get the incoming message as ordinary parent. However, if the receive span already has a parent, the incoming message parent will basically be ignored in the Dynatrace SDK. |
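The workaround described above can be sketched roughly as follows. This is not the Dynatrace SDK; `PlannedSpan`, `plan_batch_spans` and the `receive_id` parameter are hypothetical, and span IDs are plain strings.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PlannedSpan:
    name: str
    parent: Optional[str]  # span ID of the parent, None for a root span


def plan_batch_spans(
    active_span: Optional[str],
    message_parents: List[Optional[str]],
    receive_id: str = "receive-1",  # hypothetical ID for the receive span
) -> List[PlannedSpan]:
    """Plan the spans for one batch under the 'suppress root receive' rule."""
    if active_span is None:
        # The receive span would be a root span, so it is suppressed and each
        # process span takes its incoming message as an ordinary parent.
        return [PlannedSpan("process", p) for p in message_parents]

    # A trace is already active: keep the receive span under it. The incoming
    # message parents are ignored and the process spans hang off the receive span.
    spans = [PlannedSpan("receive", active_span)]
    spans += [PlannedSpan("process", receive_id) for _ in message_parents]
    return spans
```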
Good point. However, consider the case where the receive span is caused by an incoming HTTP request. Now what is the "real causal relationship"? Incoming HTTP -> Receive -> Process 1..n, or Send 1..n -> Process 1..n. In both cases the other part of the trace gets orphaned in systems that don't support links. I think we should define a good meta semantic convention for what to use links vs parent for. Currently we have https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/overview.md#links-between-spans which may need to be updated. I don't think it currently matches our semantic conventions. I also don't think
is implementable without adding links after span start, if you still want correct timings. |
@Oberon00 I'm not entirely sure what you mean by "receive span is caused by an incoming HTTP request". Is it the But regardless of how the spec turns out, clearer semantic conventions on parent vs link would be great! |
I was just thinking of a theoretical scenario. For example, you might have a REST endpoint "POST /flushmessages" that triggers the processing of all messages in a queue. The HTTP server span would be the parent of the (probably async) receive span(s) that then trigger the process spans. |
@anuraaga (if you're still somewhere around), we presumably clarified this situation with merging open-telemetry/semantic-conventions#284. |
In the spec for message batch receiving
https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/trace/semantic_conventions/messaging.md#batch-receiving
we define that processing spans should have a parent set to the receiving span and a link to the producing span. I feel this could be reversed to provide a better experience - not all tracing systems support links, so for a default experience, isn't it better for the processing span to be part of the same trace as the producing span, not the batch receiver which is an implementation detail of messaging?
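To make the difference concrete, here is a small sketch contrasting the layout in the spec with the reversed one proposed here. `Ctx`, `ProcessSpan` and `process_span` are hypothetical names, and the only modelling assumption is that a span belongs to the trace of its parent.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Ctx:
    """Hypothetical stand-in for a span context."""
    trace_id: str
    span_id: str


@dataclass(frozen=True)
class ProcessSpan:
    parent: Ctx
    links: Tuple[Ctx, ...]

    @property
    def trace_id(self) -> str:
        # A span belongs to the same trace as its parent.
        return self.parent.trace_id


def process_span(receive: Ctx, producer: Ctx, swap: bool = False) -> ProcessSpan:
    """Build a processing span in the spec layout, or the swapped layout."""
    if swap:
        # Proposed: parent is the producing span, link to the receive span.
        return ProcessSpan(parent=producer, links=(receive,))
    # Spec: parent is the receive span, link to the producing span.
    return ProcessSpan(parent=receive, links=(producer,))
```

In a backend that drops links, the spec layout leaves the processing span only in the receiver's trace, while the swapped layout keeps it in the producer's trace.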