-
Notifications
You must be signed in to change notification settings - Fork 162
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clarify relationship between messaging, faas, and RPC #652
Comments
Hi @anuraaga! It's really nice to see a real-life end-to-end example that is a bit more complex than the isolated examples we currently have in the spec, thanks for that! Regarding your questions: 1.) While the semantic convention does not prescribe which links should be applied, it could make sense to add links to the receive span to not lose that information if no (individual) processing spans can be created to preserve it as a parent. If - from the (auto-)instrumentation's and library's point of view - processing is finished when the messages are handed over to user code, just creating process spans instead of a receive span could make sense. I could imagine, however, that it does not necessarily make sense to look at each message independently in that function, which is why the semantic conventions suggests creating one receive span covering all messages as they are received in a batch. Therefore going with links might be the better solution. 2.) If producing the message is an operation that might be interesting, it should be tracked independently. Given that you list them with a duration of zero, it looks like you only create the spans to fulfill the messaging semantic conventions separately but don't track any actual work, right? I don't think our "top-level" semantic conventions consider being mixed together in one span, as you can see from the incompatible span name and kind guidelines. I don't see that zero duration spans as a problem but I'm not sure if there are backends that would not be able to cope with that. 3.) Both being separate looks correct to me. Why would you combine them? Those are different operations and concerns being tracked in my opinion. 4.) Adding to my last sentence on 2.), please give more detail on how this operation is invoked. I'd expect APIReceive to have a parent span rather than being an orphan. I would also expect it to be a parent or sibling of FaaSInitialize (or FaaSInvoke, if FaaSInitialize is not happening each time). |
a) Use Java library instrumentation of lambda with code change. Instead of using official Receive span created around the processing of the batch (instrumenting RequestHandler.handleRequest, which is implemented by us and instrumented by us). Receive span has no links Processing span for each message (instrumenting b) Use Java auto instrumentation with no code change at all. User is using Receive span created around the processing of the batch (instrumenting RequestHandler.handleRequest, which is implemented by the user but instrumented by us). Receive span has links to all the producers. No processing spans c) Use Java auto instrumentation, but also manually instrument user code to create processing spans. User is using Receive span created around the processing of the batch (instrumenting RequestHandler.handleRequest, which is implemented by the user but instrumented by us). Receive span has links to all the producers. (must be same as b, it's auto instrumentation and we have no idea what the user is doing underneath) Processing span for each message, created by user. Each processing span has a link to its producer. Observations that make me not totally clear on these, but they're not that big of a deal either is c) has duplication of links between receive span and producer span
This brought me a possibly more radical thought - should
|
Also fleshed out table and added an ugly ascii time chart :) One key point in the table is a new column on whether a span is created in the user's app or infrastructure (e.g., cloud provider). Infrastructure spans are generally not available, though in my example in practice FaaSInitialize and FaaSInvoke are present. But they're not generated with OTel and it will be years before they follow semantic conventions and should be considered as just parent windows of time. So I realized I was missing yet another span, one in the user's app that is generated by instrumentation so it can have semantic conventions. |
Noticed a possible corner case in our messaging spec - I don't think in lambda+SQS it's possible to robustly have a name like |
1.a) And the processing spans have the receive span as parent, right? This would be fine then. 1.b) Here the connections to the producing spans are lost. This could be solved by adding links from the receive span to the producing spans. You could also declare that span as processing span (since it actually encloses the processing of all messages) or declare no operation at all. Would you suggest the semantic convention to include these links and optionally also an attribute to define the kind of relationship? 1.c) That sounds legitimate to me and I think the duplication of links is acceptable as auto-instrumentation and user-instrumentation are colliding here. Since links are allowed to be added for any reason and don't have any semantic defined at all (beyond that links specify spans as "related"), automated analysis could only determine that the span linked from the processing (or receiving) span is the producer of the message by looking at the attributes ( |
2.) Yeah I think that having the producing span encompass the lower-level send call (implementation detail of the produce operation) would make sense. The messaging semantic conventions define whether it should be producer+consumer or client+server by whether it is part of a synchronous conversation (client sends request via a message, server responds to it via a message) or rather a "fire and forget" item of work (producer sends request via a message, consumer fulfills the request but does not render a (direct) response). Does that make sense to you? |
3.) The receive span is intended to track the time spent waiting on a message. If this time spent waiting is not of interest, which is, for example, the case when waiting/polling is hidden away in a framework and the instrumented routine is only invoked at the time a message arrives, the receive span can be omitted. Only a span with no This is explained in the paragraph below the |
4.) I see. If this is the main loop of that app, I also don't think tracking it with a span that spans over the entire lifetime of the process would make sense. |
@anuraaga Thanks for expanding the table and for the pretty piece of art!
What do you mean by that? Are you saying you made them up for your example but don't expect any spans like these to be created IRL (unless the cloud provider instruments their engine and deploys OTel in production while also exporting the spans to the same back end as the ones from the user domain)? |
@anuraaga I think it makes sense to discuss this separately. I could well imagine the same handler code being invoked at once for multiple sources in batch scenarios like you just described. We might need to consider this in the semconv. |
Thanks @arminru
Ah - I wasn't thinking enough about the "no operation" option. I might understand what the receive is now (hopefully, heh). I think the important piece missing from the convention is to indicate this is a batch. I think there are options of
I lean towards the latter two, it is about processing so makes sense as an operation. But I think one of these is needed to make clear this span is indicating multiple pieces of work, which may have absolutely no relation to each other. Ideally we'll have spans below for individual work which is even better, but if not, we need to know that the span information is lossy.
Yeah I think this makes sense, and the unclear part is about parenting. For example, we always consider a
Aha! Does this mean in my chart, rather than |
Yeah that's basically what I mean. Though my example has a caveat in that while OTel backends will have no way of getting any infra span, X-Ray backend currently does get |
Another point - @iNikem pointed out that my SQS instrumentation is fragile because we need to assign two spans to one |
You're right, it could make sense to explicitly indicate a batching operation. Introducing an attribute
What do you mean by that?
Is that stated anywhere in the spec?
CLIENT/SERVER vs. PRODUCER/CONSUMER indicates whether it is a synchronous conversation (client sends request via a message, server responds to it via a message) or rather a "fire and forget" item of work (producer sends request via a message, consumer fulfills the request but does not render a (direct) response).
Depending on the backend, this information can still be used to detect a dependency between these two.
Yes, I would say that this would be a reasonable way of modeling it. It could indeed be an infinite (or rather long running) span indeed, yes.
Having two spans within one method should be fine if they are used to describe two different concerns. Our "top-level" semantic conventions (FaaS invocations, RPC calls, messaging, ...) don't support being mixed together as you can see from the span name and span kind requirements but that's something we should probably state more clearly. Which issues do you see in having two separate spans? The FaaS invocation will likely have a broader scope than the actual processing of the message so this looks legitimate to me. |
@anuraaga: Can we maybe use the SQS guidance in open-telemetry/opentelemetry-specification#1442 as the base for something more broadly applicable to resolve this issue? |
Hey hey, Trying to get this issue to resolution, I will start by answering the original message, based on what we currently have (OTEP 220 ):
Process spans are work in progress, and discussing whether batch OR individual spans are needed is something we will discuss soon. See #657
Under the current model, no. We still separate Receive and Process Spans.
There's an ongoing discussion on this item, see #674
Yes, they are separate spans. The fact FaaSInvoke calls Receive is an implementation detail (or so it looks to me).
As this part is only about polling + invoking the FaaS when needed, i'd argue it's not related to Messaging, and it's more related to FaaS (where it can be considered a parent, either directly or via link). Will create an issue for that, if that makes sense. |
About other comments, I do feel there are a few details specific to SQS, e.g.
I'd love to get somebody to update the SQS instrumentation in Java as a prototype for this and other details. |
open-telemetry#652) * Add peer.service semantic convention to indicate the name of a target remote service. * Apply suggestions from code review Co-authored-by: Armin Ruech <armin.ruech@dynatrace.com> * Simplify * Remove fallback wording * Needs to be configured using instrumentation * CHANGELOG * Clarify relationship with rpc.service and peer.service and some examples * Clarify example * Move peer.service / rpc.service relationship explanation to rpc doc. * Apply suggestions from code review Co-authored-by: Christian Neumüller <christian+github@neumueller.me> * Update specification/trace/semantic_conventions/span-general.md Co-authored-by: Christian Neumüller <christian+github@neumueller.me> * Cleanup * Tweak * Update specification/trace/semantic_conventions/rpc.md Co-authored-by: Armin Ruech <armin.ruech@dynatrace.com> * Update CHANGELOG.md Co-authored-by: Armin Ruech <armin.ruech@dynatrace.com> * Apply suggestions from code review Co-authored-by: Armin Ruech <armin.ruech@dynatrace.com> * Update specification/trace/semantic_conventions/span-general.md Co-authored-by: Armin Ruech <armin.ruech@dynatrace.com> * TOC * Apply suggestions from code review Co-authored-by: Armin Ruech <armin.ruech@dynatrace.com> Co-authored-by: Armin Ruech <armin.ruech@dynatrace.com> Co-authored-by: Christian Neumüller <christian+github@neumueller.me>
@carlosalberto @Oberon00 @arminru We in the messaging workgroup think that the problems expressed in this issue are addressed by the recent changes in messaging semantic conventions. If there are no reservations from your side, we will close this issue. |
I want to clarify the relationship between messaging, RPC, and FaaS which is confusing to me right now. Let me give an example, based on my understanding of the current spec, and let me know if this aligns with expectations. This is an example of lambda + sqs, something I'm working on now and is sufficiently complicated that hopefully other cases are simpler.
SQS - an HTTP-based RPC API for publishing and receiving messages
Lambda - a FaaS that can be integrated with SQS. Lambda runtime polls for messages using SQS's RPC API, and when a message comes in, it runs a function with the received batch of messages
SQS Queue - NiceQueue
Lambda function - process_function
So I can invision many spans. These are all for the same request so up to the producer spans is all in the same trace
Request - Request span for the "trace", with current span lasts until end of production
Message1 - Message 1
Message2 - Message 2
Producer1 - Producer Span for Message 1
Producer2 - Producer Span for Message 2
APIPublish1 - Client span using SQS API to send Message1
APIPublish2 - Client span using SQS API to send Message2
PollMessages - Internal, infinite? span that corresponds to the lambda runtime process itself
APIReceive - Client span using SQS API to poll for messages and receive Message 1 and 2
FaaSInitialize - Span encompassing lambda function initialization and invocation. May have cold start
FaaSInvoke - Span encompassing lambda function invocation by runtime
FaasFunction - Span encmpassing lambda function execution in user's app
Receive - Consumer span encompassing the recived batch
Process1 - Processing span for Message1
Process2 - Processing span for Message2
Phew lots of spans. So does it look like this?
My currently open questions
This is an ideal case, but especially when using auto instrumentation, it may not be possible to have
Process1
andProcess2
- Lambda presents the entire batch to the user. ShouldReceive
also have links toProducer1
andProducer2
? Should Receive be a process span instead of a receive span?How to connect the API calls and the messaging spans. Completely separate like I listed? It means there are mysterious 0 duration producer spans and APIPublish are child or PRODUCER. Or combine e.g.,
Producer1
andApiPublish1
? Would the name beNiceQueue.send
but we still includerpc.service == SQS
,rpc.method == SendMessage
to not lose that information? Is such a span's kind producer or client?Are
FaaSInvoke
andReceive
separate or should be combined? In reality, if the runtime is a cloud provider, there may not be control andFaasInvoke
is automatically created.APIReceive
is a CLIENT span so I wouldn't expect it to be a parent of the rest - is it basically an orphan span? Anyways, while I may be able to model it as a parent of something if I was implementing all this myself, it wouldn't be for lambda = SQS.Appreciate any guidance :)
The text was updated successfully, but these errors were encountered: