-
Notifications
You must be signed in to change notification settings - Fork 196
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FaaS: Change requirements regarding handling of AWS Lambda-provided SpanContext
(take 2)
#277
Conversation
…SpanContext` Signed-off-by: Anthony J Mirabella <a9@aneurysm9.com>
7998399
to
67fdf2a
Compare
Signed-off-by: Anthony J Mirabella <a9@aneurysm9.com>
@Aneurysm9 thanks for giving this another take Is my understanding of this change correct given Scenario 1 @jsuereth described in #272 (comment)?
And one other question (assuming I understood the above correctly):
|
This aligns with my understanding with the exception that the W3C
Yes. The Go Lambda instrumentation includes configuration options for specifying a A simple |
contain an incomplete trace context which indicates X-Ray isn’t enabled. The environment variable will be set and the | ||
`Context` will be valid and sampled only if AWS X-Ray has been enabled for the Lambda function. A user can | ||
disable AWS X-Ray for the function if the X-Ray Span Link is not desired. | ||
If the `_X_AMZN_TRACE_ID` environment variable is set, instrumentation MUST ensure that the value from that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should work. The instrumentation has control over both the carrier and the function that reads fields from that carrier (so it could always create a synthetic carrier / accessor function). @rapphil would you or someone at AWS be able to send a Java draft PR to prove this out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HI @trask I had previously created this POC rapphil/opentelemetry-java-instrumentation#1 for this proposal.
I will update it and share the details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a prototype but can't imagine that this would be difficult in JS. We have full control of the carrier and it is just a plain javascript object. There might be an additional allocation to ensure the original header object is not mutated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HI @trask I had previously created this POC rapphil/opentelemetry-java-instrumentation#1 for this proposal.
I will update it and share the details.
when you're ready, go ahead and send it as a draft PR so we can review / add comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@trask Here is the POC: open-telemetry/opentelemetry-java-instrumentation#9300
Unfortunately that dictates the API, but the SDK has default configuration specified that's in conflcit with that line, see: To be more explicit, Our Propagator-API spec says:
The original PR for that line states:
So, in reality, once you've configured an SDK the default is w3c traceparent + baggage. A noop propagation only happens when the SDK has not been configured. |
That is an optional part of the specification governing configuration of the SDK through environment variables and is not the case in all SDKs:
I think we generally agree on this, though may disagree on when or whether an SDK should cause a non-no-op global propagator to be configured in the absence of explicit user direction.
I believe that this line in the API spec requires explicit configuration and a default value being set by an SDK without explicit user configuration is in contravention of that spec. I believe the API spec is controlling here as the propagation API is entirely specified within the API spec and an SDK is a user of that propagation API, but that is not relevant to this discussion. |
I believe Go may be the only SDK that hasn't implemented this portion of the specification now, and it's my understanding they're adding this configuration now. I'm not sure the interpretation of the specification you're taking matches what we see in implementations across SDKs in the environment, nor do I think it represents a consensus view from the community. |
Signed-off-by: Anthony J Mirabella <a9@aneurysm9.com>
@Aneurysm9 Is |
I do not believe so. It is the only mechanism presently supported by AWS Lambda within SQS message headers if you want AWS Lambda to participate in the trace. To the extent that instrumentation does not offer user control over the construction of the propagation |
We had a discussion in the TC meeting about this issue. I'd like to recap the consensus and direction from that meeting:
Given these fundamental beliefs, we (the technical committee) propose the following:
I'll be ensuring this comment is shared in relevant PRs and issues, unfortuantely I could not find a central place for discussion, so I just chose the latest PR. |
it seems that this makes the in which case, I wonder if it would make sense to have two propagation formats:
|
I don't understand this. Propagation format and backend used are orthogonal concerns. The propagator should not know about the backend in use, only the carrier provided to it. What would be the functional difference between the two propagators you propose? I only see a difference in intent, not effect. To the extent that this PR modifies the AWS Lambda instrumentation guidance to ensure that the user's chosen propagation implementation can be used can we separate this discussion? Can you open a new issue detailing your concerns and the approach you would like us to consider? I think that discussion would likely be relevant to more than just Lambda instrumentation. |
… into fix/lambdaSpanCarrier
I have a specific scenario that would help if I could understand: How should a Dynatrace user of the OpenTelemetry Java agent configure My understanding (which could definitely be wrong) from above, is that if they use the default And so the Dynatrace user should use But how can this one setting serve both users who want to parent the X-Ray vended span and those who don't? |
basically I'm suggesting that it may help to have two differently named |
environment variable is included in the `Carrier` used to extract a parent `Context` with the `Field` name | ||
"X-Amzn-Trace-Id". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More specifically in the context of this PR, what I’m asking about is whether it would make sense to populate a different field name here (instead of overwriting the non-vended parent), so that the choice of which one to parent can be made by the configured propagator(s).
environment variable is included in the `Carrier` used to extract a parent `Context` with the `Field` name | |
"X-Amzn-Trace-Id". | |
environment variable is included in the `Carrier` used to extract a parent `Context` with the `Field` name | |
"X-Amzn-Vended-Trace-Id". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't believe that is appropriate as it would require different propagation formats to be used for extraction and injection and would involve OTel creating a new propagation format and attributing it to AWS. I believe the appropriate thing to do is to further extend the user agency offered by this specification to require that instrumentation give users control over how the carrier is constructed so that they can choose to either overlay or underlay the value from the Lambda execution environment as is appropriate for their use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
require that instrumentation give users control over how the carrier is constructed so that they can choose to either overlay or underlay the value from the Lambda execution environment as is appropriate for their use case
this makes sense to me
I think it brings us back to the question of default behavior, which my thought is that the default behavior should be the one which doesn't parent the vended span, since the vended span is specific to the X-Ray backend
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure it's a question of default behavior at that point. I'm not aware of any SDK that configures the default global propagator to be or include the AWS X-Ray propagator outside of the ADOT Java agent. To the extent that most SDK implementations that set a default global propagator use a composite including tracecontext, baggage
then the vended span would never be a parent. To the extent that a user has configured a different global propagator then we're out of the default behavior and into user-determined behavior and the user has the flexibility to choose the order they configure propagators in a composite propagator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the appropriate thing to do is to further extend the user agency offered by this specification to require that instrumentation give users control over how the carrier is constructed so that they can choose to either overlay or underlay the value from the Lambda execution environment as is appropriate for their use case.
I agree with extending user agency to pick which parent (or link) they want, but my understanding of the TC's "lean into propagators" meant that this would be better to do via propagators (and maybe propagator configuration) as opposed to via instrumentation configuration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're in agreement here. I might add only that instrumentation configuration should include propagator selection as there may be cases where specific instrumentation should be configured to use a different propagator, but that is orthogonal to how the instrumentation constructs the carrier to pass to whatever propagator it does use, which is the focus of this PR.
Not Dynatrace specific, but the Java AWS SDK instrumentation currently always sets the X-Ray system property to inject into SQS messages (since that one is for free), although there is also an option to additionally use the custom propagator. BTW, not only the property name but also which carrier you need to use is different for X-Ray vs everything else (message system vs message properties), so this would be really hard to represent in a generic propagator API (and personally I never thought it was a problem if instrumentations with library-specific propagation needs implement special handling directly) If you want to use the OpenTelemetry Java agent with Dynatrace, you can just do it like with any other OTLP/HTTP backend, there are no special considerations (but also no explicit support for that). We do offer our own Lambda layers that come with an agent that is unrelated to the OTel Java agent, though it does share some basic technology and most concepts with core opentelemetry-java. These agents support W3C and Dynatrace-proprietary headers. Because the backend processing is reliant on some additional routing information (+ tenant-specific parent span ID) stored in the tracestate header, it cannot support the X-Ray propagation format (at least some AWS services are known to strip/normalize unknown parts out of the X-Ray string). |
I have a specific scenario that would help if I could understand: How should a Dynatrace user of the OpenTelemetry Java agent configure OTEL_PROPAGATORS to work across SQS + Lambda?
We're not suggesting either/or here. The idea is we lean into multiple propagation formats today. For Java, this would mean:
Effectively, the order just determines the default parent when multiple propagation formats exist. So, for the lambda case where only xray propagation is used (e.g. in a message), then dynatrace would still see/parent the trace, albeit with missing spans. However, if the user had instrumentation where w3c headers propagate in the messages, they could parent directly if desired. This does suggest that a Dynatrace user may lose completely the xray span if they select the w3c propagation as their default. However:
TL;DR; We think that discussion will take some time and deserves some prototypes before we move forward. However, moving forward on using propagators and config for the current issue should open some doors for us to proceed with that work. |
@jsuereth I thought @Aneurysm9 said above that |
I believe a user / instrumentation could encode w3c headers as message attributes (instead of message system attributes / x-ray) if desired. IIUC from the messaging OTEP, that's how we plan to support most messaging systems (encoding traceparent as an attribute in available locations). |
That is how I understand it as well. It is part of why I objected to requiring AWS SDK instrumentation to use the X-Ray propagator because I believed it was an instrumentation error to only rely on message system attributes when a context of any form could be carried through message attributes.
More generally, to the extent that any system that requires X-Ray propagation to be used to have AWS services participate in a trace carries a user-defined payload then it is possible to carry context in any propagation format in that payload. |
I don't think carrying the context inside of the user-defined payload is a good solution for auto-instrumentation like the OTel Java agent(?) EDIT or does the "user-defined payload" include generic headers/attributes, which would allow you to add the context there without affecting the downstream consumer code? EDIT sorry, still processing, this may answer my question:
so we could automatically attach custom metadata (i.e. |
Yes, that is my understanding. From the AWS service perspective the message headers are part of the "user-defined payload" but from the user perspective they are separable from the message body in the same way HTTP headers are separable from a request body. I think the one example that has been raised that may not fit neatly into that pattern is putting an object in an S3 bucket and having that action trigger a Lambda function invocation, such as through EventBridge. It's not clear to me that such architectures are amenable to a cleanly defined one-size-fits-all solution and may depend largely on the structure and content of the object stored. |
This PR is a POC in Java that tries to implement the original proposal in this PR. |
@trask Yes, within the messaging group we discussed this at length and the plan is to propagate the If you want to read more, here's the relevant portion from the OTEP: Standards for context propagation cc @pyohannes |
Yes exactly. The related wording already made it from an OTEP into the semantic conventions: https://github.com/open-telemetry/semantic-conventions/blob/main/docs/messaging/messaging-spans.md#context-propagation. Note that the context we're referring to here isn't supposed to be changed by the intermediary/broker, but is intended to be used to correlate producer and consumer traces. The working we chose is intentionally vague, in order to be applicable to messaging systems in general. More specific requirements could only be made on the level of messaging protocols (have a look at this section linked by @joaopgrassi above), but that's likely outside the scope of OTel, and more a W3C topic (that's where the current standardization drafts for Trace Context for AMQP and MQTT live). If I understand correctly, some amount of confusion stems from the fact that in the AWS Lambda scenario, there can be two contexts to choose from for the consumer. In the messaging WG we were discussing the possibility of several layers of context: a message context layer tying together producers and consumers, and a transport context layer, tying together producers and intermediaries, and intermediaries and consumers. However, we decided to focus solely on the message context layer for an initial stable release of messaging semantic conventions. |
I think this AWS x-ray thing is quite specific and best handled with instrumentation-specific options (of course these can be standardized cross-language). I would not use it as a motivation to add or change anything in the generic propagator API, unless you find actual real other non-AWS use cases that would benefit from this. |
I'm not sure that I agree. I think there are use cases around migration from one context propagation format to another where both need to be accepted that could also benefit from enhancements to the specification regarding composite propagators. I do agree, however, that that is a separate discussion from what we're trying to address here. It would be useful, but is not necessary to reach a reasonable outcome. |
I think it's important to place a comment here as well: OTEP 220 was approved and accepted into OpenTelemetry. this means the default for message-passing in OpenTelemetry will be to link between producers and consumers. This is the direction we're moving and the acceptance of that PR represents consensus of this community and is not up for (re-)debate. Only the mechanism or how we migrate to this detail is up for debate. |
I can strike the paragraph regarding links from this PR or change if from |
this makes sense to me. this PR could then deal just with how to populate the carrier. this could help in the short-term because then different propagators can be configured to pick which parent you want. and it would still be helpful long-term once there is general spec support for composite propagators to fallback and extract span links if there's a parent has already been extracted. |
…environment context Signed-off-by: Anthony J Mirabella <a9@aneurysm9.com>
POC updated to match the latest changes |
`Context` will be valid and sampled only if AWS X-Ray has been enabled for the Lambda function. A user can | ||
disable AWS X-Ray for the function if the X-Ray Span Link is not desired. | ||
If the `_X_AMZN_TRACE_ID` environment variable is set, instrumentation MUST ensure that the value from that | ||
environment variable is included in the `Carrier` used to extract a parent `Context` with the `Field` name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to point out that as currently written, this document doesn't specify what should happen to X-Amzn-Trace-Id
if it already exists in the carrier before the environment variable value is applied to it. I think that's an important behavior to define.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does define the behavior. It says that the instrumentation MUST
ensure that value is in the carrier at that field name. The only way to do that it to set that value. Setting that value will override any value that may have existed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I'm saying that should be more explicit. It also doesn't give any option to users who wish to disable this behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Setting that value will override any value that may have existed.
what would be the downside for this to be added as a separate key in the carrier, e.g. x-amzn-env-trace-id
or x-amzn-vended-trace-id
(instead of overriding the existing carrier value), this way the propagator chain can have access to all of the different propagated trace contexts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Setting that value will override any value that may have existed.
what would be the downside for this to be added as a separate key in the carrier, e.g.
x-amzn-env-trace-id
orx-amzn-vended-trace-id
(instead of overriding the existing carrier value), this way the propagator chain can have access to all of the different propagated trace contexts?
I believe that I explained why that is not appropriate here.
I have repeatedly said that I believe that there needs to be more user agency afforded with regard to the construction of the carrier used for context extraction. Doing so would make it irrelevant what any of us think about what users want because they would be able to do what they want without configuration options or overrides. I have avoided going that far in this PR as here I am trying to fix a broken spec in the smallest possible way so that instrumentation can start offering more flexibility to users in a way that leaves the door open for further changes in that direction. Can we all agree that this change does that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
W3C can only propagate over http headers or sqs/sns message attributes. That's fine in simple architectures as long as users haven't added many of their own attributes. AWS imposes a limit of 10 message attributes which means w3c propagation can fail if the limit is already reached. X-Ray is not subject to this limit as it propagates over message system attributes.
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-message-metadata.html#sqs-message-system-attributes
Attributes also don't allow propagation when using S3 events or SNS->SQS type architectures (among other scenarios involving aws managed services). Tracing in these scenarios require x-ray. This was also the main motivation for open-telemetry/opentelemetry-specification#3212.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Attributes also don't allow propagation when using S3 events or SNS->SQS type architectures (among other scenarios involving aws managed services). Tracing in these scenarios require x-ray.
is there any AWS (or other) doc about this, I feel like this is an important point that I'd like to understand better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are a few links I found:
https://docs.aws.amazon.com/sns/latest/dg/sns-active-tracing.html
https://aws.amazon.com/about-aws/whats-new/2020/11/aws-x-ray-supports-trace-context-propagation-amazon-simple-storage-service-s3/
https://docs.aws.amazon.com/xray/latest/devguide/xray-services-sqs.html
The key point being this all applies only to x-ray.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At a high level - The answers to these questions should be sufficient for acceptance of this PR -
- Does this give users the ability to control parenting choice via Propagator config?
- Does any of the normative language prevent moving to linking as the default behavior for Producer/Consumer message passing? Does it limit updating Propagator specification to allow this?
IIUC the details correctly, the answers are:
- No. Users will not have the ability to configure whether ENV variable or attribute is parented in this solution. By fiat, the ENV will always win.
- Yes. The x-ray header for SQS is passed as an attribute. Because the ENV variable would override this, we'd no longer see the alternative parent in the carrier sent to Propgators.
(2) Is less important given we still need to prototype and define changes to Propagation, and we could find workaround, but (1) is concerning to me.
I like @trask's proposals to be more explicit here, possibly having two x-ray propagators one which uses ENV and the other which does not so users can prioritize between them if necessary, and then ideally this would work with an improved propagation framework.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @jsuereth,
Now that you have context with this proposal, there is also this other proposal that tries to solve the problem with a different direction and that I believe has "yes" as answer for questions 1 and 2: #164
While I agree that @trask 's proposal would work, I would prefer to avoid it given that this propagator would be specific to lambda (i.e.: users would not be able to use this propagator elsewhere) and it would not fully comply with the propagator API, specifically:
If a value can not be parsed from the carrier, for a cross-cutting concern, the implementation MUST NOT throw an exception and MUST NOT store a new value in the Context, in order to preserve any previously existing valid value.
Moreover, what would be the behaviour of the inject
operation of this propagator?
I would prefer that the lambda instrumentation prepare a valid carrier to be consumed by subsequent propagators.
This PR was marked stale due to lack of activity. It will be closed in 7 days. |
Closed as inactive. Feel free to reopen if this PR is still being worked on. |
This presents another alternative to #263 and #272. This is an initial effort to codify the conclusion reached by the FaaS SIG in January.