Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify relationship between messaging, faas, and RPC #652

Closed
anuraaga opened this issue Sep 17, 2020 · 21 comments
Closed

Clarify relationship between messaging, faas, and RPC #652

anuraaga opened this issue Sep 17, 2020 · 21 comments

Comments

@anuraaga
Copy link
Contributor

anuraaga commented Sep 17, 2020

I want to clarify the relationship between messaging, RPC, and FaaS which is confusing to me right now. Let me give an example, based on my understanding of the current spec, and let me know if this aligns with expectations. This is an example of lambda + sqs, something I'm working on now and is sufficiently complicated that hopefully other cases are simpler.

SQS - an HTTP-based RPC API for publishing and receiving messages
Lambda - a FaaS that can be integrated with SQS. Lambda runtime polls for messages using SQS's RPC API, and when a message comes in, it runs a function with the received batch of messages

SQS Queue - NiceQueue
Lambda function - process_function

So I can invision many spans. These are all for the same request so up to the producer spans is all in the same trace
Request - Request span for the "trace", with current span lasts until end of production
Message1 - Message 1
Message2 - Message 2
Producer1 - Producer Span for Message 1
Producer2 - Producer Span for Message 2
APIPublish1 - Client span using SQS API to send Message1
APIPublish2 - Client span using SQS API to send Message2
PollMessages - Internal, infinite? span that corresponds to the lambda runtime process itself
APIReceive - Client span using SQS API to poll for messages and receive Message 1 and 2
FaaSInitialize - Span encompassing lambda function initialization and invocation. May have cold start
FaaSInvoke - Span encompassing lambda function invocation by runtime
FaasFunction - Span encmpassing lambda function execution in user's app
Receive - Consumer span encompassing the recived batch
Process1 - Processing span for Message1
Process2 - Processing span for Message2

Trace 1
|--------------------------OrderService.ProcessOrders-------------------------|
 |NiceQueue send|                                    |NiceQueue send|
                     |SQS.SendMessage|                               |SQS.SendMessage|


Trace 2
|-------------------------pollmessages------------------------------------------------------------------------------------------|
|APIReceive||APIReceive||APIReceive||APIReceive||APIReceive||APIReceive||APIReceive|
                                                                                    |-----FaaSInitialize-----------------|                                                                                                                  
                                                                                               |---FaasInvoke--------------|
                                                                                               |---FaasFunction------------|
                                                                                                  |--Receive--------------|
                                                                                                   |Process1||Process2|

Phew lots of spans. So does it look like this?

id name kind duration parent link semantic types owner
Request OrderService.ProcessOrders SERVER 50ms none none rpc,http user
Producer1 NiceQueue send PRODUCER 0 Request none messaging user
ApiPublish1 SQS.SendMessage CLIENT 20ms Producer1 none rpc,http user
Producer2 NiceQueue send PRODUCER 0 Request none messaging user
ApiPublish2 SQS.SendMessage CLIENT 20ms Producer2 none rpc, http user
PollMessages pollmessages INTERNAL infinity none none none infra
APIReceive SQS.ReceiveMessage CLIENT 30ms PollMessages none rpc, http infra
FaaSInitialize Lambda::Initialize INTERNAL 500ms none none none infra
FaaSInvoke Lambda::Invoke INTERNAL 300ms FaasInitialize none none infra
FaaSFunction process_function INTERNAL 300ms FaaSInvoke none faas user
Receive NiceQueue receive CONSUMER 250ms FaaSFunction none messaging user
Process1 NiceQueue process CONSUMER 100ms Receive Producer1 messaging user
Process2 NiceQueue process CONSUMER 100ms Receive Producer2 messaging user

My currently open questions

  • This is an ideal case, but especially when using auto instrumentation, it may not be possible to have Process1 and Process2 - Lambda presents the entire batch to the user. Should Receive also have links to Producer1 and Producer2? Should Receive be a process span instead of a receive span?

  • How to connect the API calls and the messaging spans. Completely separate like I listed? It means there are mysterious 0 duration producer spans and APIPublish are child or PRODUCER. Or combine e.g., Producer1 and ApiPublish1? Would the name be NiceQueue.send but we still include rpc.service == SQS, rpc.method == SendMessage to not lose that information? Is such a span's kind producer or client?

  • Are FaaSInvoke and Receive separate or should be combined? In reality, if the runtime is a cloud provider, there may not be control and FaasInvoke is automatically created.

  • APIReceive is a CLIENT span so I wouldn't expect it to be a parent of the rest - is it basically an orphan span? Anyways, while I may be able to model it as a parent of something if I was implementing all this myself, it wouldn't be for lambda = SQS.

Appreciate any guidance :)

@anuraaga
Copy link
Contributor Author

/cc @Oberon00 @arminru Sorry for summoning you but I'm enjoying hearing about messaging from you and hope to enjoy more here too :)

@arminru arminru added the question Further information is requested label Sep 17, 2020
@arminru
Copy link
Member

arminru commented Sep 17, 2020

Hi @anuraaga! It's really nice to see a real-life end-to-end example that is a bit more complex than the isolated examples we currently have in the spec, thanks for that!
In order to understand the whole compilation better, it would be helpful if you could mention which semantic conventions are applied in which span (I suppose it's only RPC in APIReceive and only Messaging in Receive, right?) and ideally arrange them in a timing diagram (either a simple ASCII chart or a quick drawing).

Regarding your questions:

1.) While the semantic convention does not prescribe which links should be applied, it could make sense to add links to the receive span to not lose that information if no (individual) processing spans can be created to preserve it as a parent. If - from the (auto-)instrumentation's and library's point of view - processing is finished when the messages are handed over to user code, just creating process spans instead of a receive span could make sense. I could imagine, however, that it does not necessarily make sense to look at each message independently in that function, which is why the semantic conventions suggests creating one receive span covering all messages as they are received in a batch. Therefore going with links might be the better solution.
I wonder if we should extend the messaging semantic conventions further to give guidance on links rather than showing possible solutions in the examples.

2.) If producing the message is an operation that might be interesting, it should be tracked independently. Given that you list them with a duration of zero, it looks like you only create the spans to fulfill the messaging semantic conventions separately but don't track any actual work, right? I don't think our "top-level" semantic conventions consider being mixed together in one span, as you can see from the incompatible span name and kind guidelines. I don't see that zero duration spans as a problem but I'm not sure if there are backends that would not be able to cope with that.
I would expect FaaSInitialize to either be a child of APIReceive or both to share common parent span in order to be connected. Can you illustrate in more detail how they are connected together?

3.) Both being separate looks correct to me. Why would you combine them? Those are different operations and concerns being tracked in my opinion.

4.) Adding to my last sentence on 2.), please give more detail on how this operation is invoked. I'd expect APIReceive to have a parent span rather than being an orphan. I would also expect it to be a parent or sibling of FaaSInitialize (or FaaSInvoke, if FaaSInitialize is not happening each time).

@anuraaga
Copy link
Contributor Author

  1. Let me give a few possible scenarios, based on the instrumentation restrictions, and see if they align with what you're saying

a) Use Java library instrumentation of lambda with code change. Instead of using official RequestHandler, which only supports receiving a batch of messages, users use a MessageHandler we provide.

Receive span created around the processing of the batch (instrumenting RequestHandler.handleRequest, which is implemented by us and instrumented by us). Receive span has no links

Processing span for each message (instrumenting opentelemetry-instrumentation-awslambda's special interface, MessageHandler.handleMessage, which is implemented by the user but instrumented by us). Each processing span has a link to its producer.

b) Use Java auto instrumentation with no code change at all. User is using RequestHandler.handleRequest.

Receive span created around the processing of the batch (instrumenting RequestHandler.handleRequest, which is implemented by the user but instrumented by us). Receive span has links to all the producers.

No processing spans

c) Use Java auto instrumentation, but also manually instrument user code to create processing spans. User is using RequestHandler.handleRequest but also calls effectively Tracer.startSpan (more likely, a helper we provide AwsLambdaMessagingTracer.startSpan which handles semantic conventions OK, while processing each message.

Receive span created around the processing of the batch (instrumenting RequestHandler.handleRequest, which is implemented by the user but instrumented by us). Receive span has links to all the producers. (must be same as b, it's auto instrumentation and we have no idea what the user is doing underneath)

Processing span for each message, created by user. Each processing span has a link to its producer.

Observations that make me not totally clear on these, but they're not that big of a deal either is

c) has duplication of links between receive span and producer span
a) isn't consistent with b)/c) aren't consistent since a) doesn't have links on receive span. Should it?

  1. Yeah mainly to fulfill the messaging conventions. FWIW, these spans are what particularly matter when understanding the request in the trace since they are business logic. However, they are implemented as API calls in practice. Actually I'm starting to realize the produce span should probably have a duration, and encompass the API call itself, since if the API call fails the production failed too. I wonder if that makes sense as a model.

This brought me a possibly more radical thought - should APIReceive be a SERVER span? It's an HTTP client call - but as car as causal relationship, it's caused by the ApiPublish spans, and normally this looks like a client-server interaction. I'm guessing the answer is no though - there is a different server involved and if it was part of the trace, things are different. It's not part of the trace coincidentally because message queue happens to not be traced, at least in this example.

  1. The function is invoked to receive the messages - they're one operation with a cause (message came in) and procedure (invoke function) and don't feel like two spans conceptually. I guess, I might not understand what the actual meaning of a receive span is.

  2. Good callout. While I don't think we'd usually have a span for it in practice, since it's an infinite duration span, I added a pollmessages span which hopefully makes it more clear. The runtime is an app that is basically an infinite loop calling APIReceive, and that would be its parent. But since we don't really want an infinite trace I think, each APIReceive is fairly independent, no real cause relationship between each, except for the fact that it happens to be invoked in the same code fragment only. So figured APIReceive is supposed to be parentless in practice.

@anuraaga
Copy link
Contributor Author

Also fleshed out table and added an ugly ascii time chart :) One key point in the table is a new column on whether a span is created in the user's app or infrastructure (e.g., cloud provider). Infrastructure spans are generally not available, though in my example in practice FaaSInitialize and FaaSInvoke are present. But they're not generated with OTel and it will be years before they follow semantic conventions and should be considered as just parent windows of time. So I realized I was missing yet another span, one in the user's app that is generated by instrumentation so it can have semantic conventions.

@anuraaga
Copy link
Contributor Author

Noticed a possible corner case in our messaging spec - I don't think in lambda+SQS it's possible to robustly have a name like NiceQueue receive. A function can be registered to multiple queues, so a received batch can correspond to multiple queues, only the messages themselves have a reference to their queue. One option is to check the messages and if they all have the same event source, use it, otherwise something like multiple_sources receive?

@arminru
Copy link
Member

arminru commented Sep 18, 2020

@anuraaga

1.a) And the processing spans have the receive span as parent, right? This would be fine then.

1.b) Here the connections to the producing spans are lost. This could be solved by adding links from the receive span to the producing spans. You could also declare that span as processing span (since it actually encloses the processing of all messages) or declare no operation at all. Would you suggest the semantic convention to include these links and optionally also an attribute to define the kind of relationship?

1.c) That sounds legitimate to me and I think the duplication of links is acceptable as auto-instrumentation and user-instrumentation are colliding here.

Since links are allowed to be added for any reason and don't have any semantic defined at all (beyond that links specify spans as "related"), automated analysis could only determine that the span linked from the processing (or receiving) span is the producer of the message by looking at the attributes (messaging.* present) and kind (either producer or client). I think this could even be sufficient but I'm not entirely sure if we should rely on just that. What do you think?

@arminru
Copy link
Member

arminru commented Sep 18, 2020

@anuraaga

2.) Yeah I think that having the producing span encompass the lower-level send call (implementation detail of the produce operation) would make sense.

The messaging semantic conventions define whether it should be producer+consumer or client+server by whether it is part of a synchronous conversation (client sends request via a message, server responds to it via a message) or rather a "fire and forget" item of work (producer sends request via a message, consumer fulfills the request but does not render a (direct) response). Does that make sense to you?

@arminru
Copy link
Member

arminru commented Sep 18, 2020

@anuraaga

3.) The receive span is intended to track the time spent waiting on a message. If this time spent waiting is not of interest, which is, for example, the case when waiting/polling is hidden away in a framework and the instrumented routine is only invoked at the time a message arrives, the receive span can be omitted. Only a span with no operation attribute being set would be created in this case and would track the operation handling (i.e., processing) the message.

This is explained in the paragraph below the message.operation attribute here:
https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/trace/semantic_conventions/messaging.md#messaging-attributes

@arminru
Copy link
Member

arminru commented Sep 18, 2020

@anuraaga

4.) I see. If this is the main loop of that app, I also don't think tracking it with a span that spans over the entire lifetime of the process would make sense.

@arminru
Copy link
Member

arminru commented Sep 18, 2020

@anuraaga Thanks for expanding the table and for the pretty piece of art!

Infrastructure spans are generally not available, though in my example in practice FaaSInitialize and FaaSInvoke are present. But they're not generated with OTel and it will be years before they follow semantic conventions and should be considered as just parent windows of time.

What do you mean by that? Are you saying you made them up for your example but don't expect any spans like these to be created IRL (unless the cloud provider instruments their engine and deploys OTel in production while also exporting the spans to the same back end as the ones from the user domain)?

@arminru
Copy link
Member

arminru commented Sep 18, 2020

Noticed a possible corner case in our messaging spec - I don't think in lambda+SQS it's possible to robustly have a name like NiceQueue receive. A function can be registered to multiple queues, so a received batch can correspond to multiple queues, only the messages themselves have a reference to their queue. One option is to check the messages and if they all have the same event source, use it, otherwise something like multiple_sources receive?

@anuraaga I think it makes sense to discuss this separately. I could well imagine the same handler code being invoked at once for multiple sources in batch scenarios like you just described. We might need to consider this in the semconv.

@anuraaga
Copy link
Contributor Author

Thanks @arminru

1.b) Here the connections to the producing spans are lost. This could be solved by adding links from the receive span to the producing spans. You could also declare that span as processing span (since it actually encloses the processing of all messages) or declare no operation at all. Would you suggest the semantic convention to include these links and optionally also an attribute to define the kind of relationship?

Ah - I wasn't thinking enough about the "no operation" option. I might understand what the receive is now (hopefully, heh). I think the important piece missing from the convention is to indicate this is a batch. I think there are options of

  • No operation, messaging.scope = batch
  • operation = process, messaging.scope = batch
  • operation = process_batch

I lean towards the latter two, it is about processing so makes sense as an operation. But I think one of these is needed to make clear this span is indicating multiple pieces of work, which may have absolutely no relation to each other. Ideally we'll have spans below for individual work which is even better, but if not, we need to know that the span information is lossy.

The messaging semantic conventions define whether it should be producer+consumer or client+server by whether it is part of a synchronous conversation (client sends request via a message, server responds to it via a message) or rather a "fire and forget" item of work (producer sends request via a message, consumer fulfills the request but does not render a (direct) response). Does that make sense to you?

Yeah I think this makes sense, and the unclear part is about parenting. For example, we always consider a CLIENT span to be the parent of a SERVER span in another service. This is a reason for not allowing nested CLIENT spans in instrumentation. I guess we have no similar notion for PRODUCER / CONSUMER? In particular, in our spec we allow nested CONSUMER spans and I want to confirm that's ok (@iNikem does too :) Also, for practical purposes, what is the difference between these PRODUCER and CONSUMER spans, and INTERNAL spans? Since the semantic conventions indicate the messaging semantics, and they don't have parent/child relationship, I'm not sure what these kinds are enabling us to do, where as CLIENT / SERVER allows us to create dependency graphs. Somehow, I feel like messaging.operation is enough (as long as we fill in send of course) without having these span kinds.

The receive span is intended to track the time spent waiting on a message.

Aha! Does this mean in my chart, rather than Receive being a child of FaasFunction, it's supposed to be a child of pollmessages and encompass the time around all of the APIReceive calls? If this is true, a lot of things get clearer for me - however I have a worry that this receive span could be considered an "infinite span" too. Though maybe we have an expectation that a message will come at least once in the future.

@anuraaga
Copy link
Contributor Author

anuraaga commented Sep 19, 2020

What do you mean by that? Are you saying you made them up for your example but don't expect any spans like these to be created IRL (unless the cloud provider instruments their engine and deploys OTel in production while also exporting the spans to the same back end as the ones from the user domain)?

Yeah that's basically what I mean. Though my example has a caveat in that while OTel backends will have no way of getting any infra span, X-Ray backend currently does get FaaSInitialize and FaaSRevoke, though naturally with no OTel conventions, but we can still parent to them. APIReceive and pollmessages I made up completely trying to better understand the relationship between messaging spans and API spans.

@anuraaga
Copy link
Contributor Author

Another point - @iNikem pointed out that my SQS instrumentation is fragile because we need to assign two spans to one handleRequest method. We need to decide whether this is a bug or a feature :) If FaaS span and message span are separate, then this is inevitable when an instrumented method has both meanings. Otherwise, we could define a way to combine into one span - the only issue really is span name. In that case I would probably prioritize the FaaS span name over the messaging one. What do you think?

@arminru
Copy link
Member

arminru commented Sep 24, 2020

@anuraaga

You're right, it could make sense to explicitly indicate a batching operation. Introducing an attribute
messaging.scope: not set (assumes single message) | "batch" or probably
messaging.batched_operation: false (default if not set) | true
could make sense, yes.

we need to know that the span information is lossy.

What do you mean by that?

This is a reason for not allowing nested CLIENT spans in instrumentation.

Is that stated anywhere in the spec?

Also, for practical purposes, what is the difference between these PRODUCER and CONSUMER spans, and INTERNAL spans?

CLIENT/SERVER vs. PRODUCER/CONSUMER indicates whether it is a synchronous conversation (client sends request via a message, server responds to it via a message) or rather a "fire and forget" item of work (producer sends request via a message, consumer fulfills the request but does not render a (direct) response).

I'm not sure what these kinds are enabling us to do, where as CLIENT / SERVER allows us to create dependency graphs

Depending on the backend, this information can still be used to detect a dependency between these two.

Does this mean in my chart, rather than Receive being a child of FaasFunction, it's supposed to be a child of pollmessages and encompass the time around all of the APIReceive calls? If this is true, a lot of things get clearer for me - however I have a worry that this receive span could be considered an "infinite span" too. Though maybe we have an expectation that a message will come at least once in the future.

Yes, I would say that this would be a reasonable way of modeling it. It could indeed be an infinite (or rather long running) span indeed, yes.

my SQS instrumentation is fragile because we need to assign two spans to one handleRequest method. We need to decide whether this is a bug or a feature :) If FaaS span and message span are separate, then this is inevitable when an instrumented method has both meanings. Otherwise, we could define a way to combine into one span - the only issue really is span name. In that case I would probably prioritize the FaaS span name over the messaging one. What do you think?

Having two spans within one method should be fine if they are used to describe two different concerns. Our "top-level" semantic conventions (FaaS invocations, RPC calls, messaging, ...) don't support being mixed together as you can see from the span name and span kind requirements but that's something we should probably state more clearly. Which issues do you see in having two separate spans? The FaaS invocation will likely have a broader scope than the actual processing of the message so this looks legitimate to me.

@Oberon00
Copy link
Member

Oberon00 commented Apr 7, 2021

@anuraaga: Can we maybe use the SQS guidance in open-telemetry/opentelemetry-specification#1442 as the base for something more broadly applicable to resolve this issue?

@carlosalberto
Copy link
Contributor

Hey hey,

Trying to get this issue to resolution, I will start by answering the original message, based on what we currently have (OTEP 220 ):

it may not be possible to have Process1 and Process2 - Lambda presents the entire batch to the user.

Process spans are work in progress, and discussing whether batch OR individual spans are needed is something we will discuss soon. See #657

Should Receive be a process span instead of a receive span?

Under the current model, no. We still separate Receive and Process Spans.

How to connect the API calls and the messaging spans. Completely separate like I listed? It means there are mysterious 0 duration producer spans and APIPublish are child or PRODUCER.

There's an ongoing discussion on this item, see #674

Are FaaSInvoke and Receive separate or should be combined? In reality, if the runtime is a cloud provider, there may not be control and FaasInvoke is automatically created.

Yes, they are separate spans. The fact FaaSInvoke calls Receive is an implementation detail (or so it looks to me).

APIReceive is a CLIENT span so I wouldn't expect it to be a parent of the rest - is it basically an orphan span?

As this part is only about polling + invoking the FaaS when needed, i'd argue it's not related to Messaging, and it's more related to FaaS (where it can be considered a parent, either directly or via link). Will create an issue for that, if that makes sense.

@carlosalberto
Copy link
Contributor

About other comments, I do feel there are a few details specific to SQS, e.g.

@iNikem pointed out that my SQS instrumentation is fragile because we need to assign two spans to one handleRequest method. We need to decide whether this is a bug or a feature :) If FaaS span and message span are separate, then this is inevitable when an instrumented method has both meanings.

I'd love to get somebody to update the SQS instrumentation in Java as a prototype for this and other details.

@carlosalberto
Copy link
Contributor

@arminru @Oberon00 any opinion?

@carlosalberto
Copy link
Contributor

Ping @Oberon00 @arminru

joaopgrassi pushed a commit to dynatrace-oss-contrib/semantic-conventions that referenced this issue Mar 21, 2024
open-telemetry#652)

* Add peer.service semantic convention to indicate the name of a target remote service.

* Apply suggestions from code review

Co-authored-by: Armin Ruech <armin.ruech@dynatrace.com>

* Simplify

* Remove fallback wording

* Needs to be configured using instrumentation

* CHANGELOG

* Clarify relationship with rpc.service and peer.service and some examples

* Clarify example

* Move peer.service / rpc.service relationship explanation to rpc doc.

* Apply suggestions from code review

Co-authored-by: Christian Neumüller <christian+github@neumueller.me>

* Update specification/trace/semantic_conventions/span-general.md

Co-authored-by: Christian Neumüller <christian+github@neumueller.me>

* Cleanup

* Tweak

* Update specification/trace/semantic_conventions/rpc.md

Co-authored-by: Armin Ruech <armin.ruech@dynatrace.com>

* Update CHANGELOG.md

Co-authored-by: Armin Ruech <armin.ruech@dynatrace.com>

* Apply suggestions from code review

Co-authored-by: Armin Ruech <armin.ruech@dynatrace.com>

* Update specification/trace/semantic_conventions/span-general.md

Co-authored-by: Armin Ruech <armin.ruech@dynatrace.com>

* TOC

* Apply suggestions from code review

Co-authored-by: Armin Ruech <armin.ruech@dynatrace.com>

Co-authored-by: Armin Ruech <armin.ruech@dynatrace.com>
Co-authored-by: Christian Neumüller <christian+github@neumueller.me>
@pyohannes
Copy link
Contributor

@carlosalberto @Oberon00 @arminru We in the messaging workgroup think that the problems expressed in this issue are addressed by the recent changes in messaging semantic conventions.

If there are no reservations from your side, we will close this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: V1 - Stable Semantics
Development

No branches or pull requests

7 participants