Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

aws-sdk sqs does not extract trace info when using promises #1680

Open
ekosz opened this issue Nov 9, 2021 · 17 comments
Open

aws-sdk sqs does not extract trace info when using promises #1680

ekosz opened this issue Nov 9, 2021 · 17 comments
Labels
bug Something isn't working

Comments

@ekosz
Copy link

ekosz commented Nov 9, 2021

At Lattice, we've been having issues with DD distributed traces and AWS SQS. Even though we see all of the aws.request spans created, none of them are connected to parent spans or the trace that created them in the first place. Finally after some digging / tracing I think I may know what it is.

We use sqs-consumer for managing our SQS queues. Looking the SQS dd-trace-js code, the extraction of the trace info happens at responseExtract. And that is ever called from wrapCb. And that is only called when there is a callback to wrap. But sqs-consumer uses the .promise() style of making requests to AWS. Which, as far as I can tell, makes it so the cb is undefined and makes it so we never call responseExtract.

Do you think I got that right? And if so do you know of another library that has this pattern that I could try and copy from to open a PR to fix?

Cheers!
Eric

@ekosz ekosz added the bug Something isn't working label Nov 9, 2021
@rochdev
Copy link
Member

rochdev commented Nov 9, 2021

This is a known issue that is unfortunately a limitation of Node support for promises. Without going into too much details, it binds the context of async_hooks to the context when promise.then() was called, which is not necessarily correct for message queue consumers which should follow the context of the previous resolve() instead. In theory this could be worked around with monkey-patching if the code doesn't use async/await, but if it does it's not possible since these are language constructs that are implemented directly in v8.

This is something we do plan to support, and this work has already started with nodejs/node#39283, but it will still require a few more changes to Node itself, and then we'll have to implement the changes in our scope manager before it can finally land in the aws-sdk integration.

ekosz added a commit to ekosz/dd-trace-js that referenced this issue Nov 9, 2021
When using `.promise()` there is no callback for `dd-trace-js` to wrap.
Before this change, that meant we never called tracer.extract on any
incoming requests.

This changes the functionality by creating a "fake" callback for us to
wrap and then immediately calling it.

Fixes DataDog#1680
@ekosz
Copy link
Author

ekosz commented Nov 9, 2021

@rochdev Hmm, I don't know if we're on the same page. I just opened up #1682 . I first added a failing test, then I was able to get it to pass.

I think this is less to do with V8 and more to do with the fact that certain code paths are currently not being called. That said, I could still be completely misunderstanding.

@rochdev
Copy link
Member

rochdev commented Nov 9, 2021

The test case focuses on a very specific usage of the library and doesn't address other patterns (for example async/await, Promise.all(), nested or chained then(), etc). This makes it incomplete to actually address promise support for aws-sdk as a whole.

However, I took a quick look at sqs-consumer, and since it only uses a handler registered as a configured callback, it should be straightforward to patch the library directly instead, so I would say that a sqs-consumer plugin would make more sense here. Since kafkajs uses the exact same handleMessage pattern, most of the code from that plugin could probably be used as a base.

Please let me know if this is something you want to try to implement and I can help if needed.

@rochdev
Copy link
Member

rochdev commented Nov 9, 2021

I just realized that the tests in #1682 don't verify if the context was propagated properly, so I should probably provide more background as to why this doesn't work.

Basically, on the consumer side, the instrumentation needs to be able to link with the following other spans to provide distributed tracing:

  • The span from the publisher, which is extracted from the message. This works regardless of callback VS promise usage.
  • Any spans that are created within the handler, which should be children of the handler span and thus ancestors of the publisher span.

With the callback approach this is easy since all that is needed is to wrap the callback to make sure the message span is on the current scope, which will then be automatically propagated by dd-trace to child spans. However, for promises, every promise created using then or await is bound to where then or await was called, which is chained outside of where the message is received. This means that within the then callback doing the instrumentation, there is no way to make the span active on the scopes of the other calls to then. This cannot be addressed without changes to how async_hooks work, which is why today we simply don't handle this at all.

Supporting promises will be a lot of work and will require several changes to Node itself and to dd-trace, so it's definitely not a small task. However, since for your case all that is needed is a sqs-consumer plugin, it should be possible to make it work with just patching Consumer.poll.

@ekosz
Copy link
Author

ekosz commented Nov 9, 2021

I get it now. Thank you so much for the explanation! Relatedly, I think we may have been bit by something similar when I tried contributing to the opentelemetry-js-contrib package.

I'll open a new PR to try to patch the sqs-consumer package using kafkajs as a template. Thank you!

@uptownhr
Copy link

I'm also using sqs-consumer and wondering what a patch to the Consumer.poll looks like. @ekosz did you by end up patching sqs-consumer?

@rochdev by any chance are there samples that I can follow to do this on my end?

@rochdev
Copy link
Member

rochdev commented Feb 2, 2022

@uptownhr Not a sample per se, but from a quick look at the library it looks like handleMessage is very similar to kafkajs eachMessage so the code would probably be almost identical other than the tags.

@leoncider
Copy link

leoncider commented Apr 22, 2022

very interested in this as well, @rochdev @ekosz any updates by any chance?

@rochdev
Copy link
Member

rochdev commented Apr 25, 2022

@leoncider Been wanting to work on this for a while but haven't had the time. We're going to try to land support for aws-sdk v3 soon, I'll try to see if I can add support for this as well at the same time.

@ixalon
Copy link

ixalon commented May 30, 2022

@rochdev Has there been any progress on this (or #1682)? Currently we receive no traces for any endpoint that calls a aws-sdk (v2) method with .promise() and we can't avoid this pattern as it's used by 3rd-party libraries (e.g. calling dynamodb via dynamodb-toolbox).

Is there anyway we can temporarily fork the aws-sdk plugin locally, add the fix from @ekosz and get dd-trace-js to load our version of the plugin rather than the built-in? (I've not been able to find docs on creating custom plugins). 🙏

Alternatively, is there some way we can skip generating spans involving aws-sdk and .promise() and ensure the code awaiting the promise continues in the same span as triggered the call? I'd love to have DynamoDB calls measurable in the APM, but I'd live without that if it means we can get traces for the endpoints with spans for everything else. I tried disabling the aws-sdk plugin but this does not help.

@rochdev
Copy link
Member

rochdev commented May 30, 2022

@ixalon If you are not using async/await it should be possible to instrument the call manually. If you are using async/await you might not be able to bind the asynchronous context properly unless the span is created in a wrapper. Unfortunately this is not something we can solve in the auto-instrumentation without changes to Node and to our data model.

I've not been able to find docs on creating custom plugins

Custom plugins are not supported right now, but it should be possible to manually instrument individual calls or use a wrapper to handle the tracing.

@ixalon
Copy link

ixalon commented May 30, 2022

If you are using async/await you might not be able to bind the asynchronous context properly unless the span is created in a wrapper.

Thanks for the quick response @rochdev. Unfortunately the library we rely heavily on uses await <awsSdkClient>.method().promise(), e.g https://github.com/jeremydaly/dynamodb-toolbox/blob/main/src/classes/Entity.ts#L292 so looks like we'll have to try the wrapping approach.

@ixalon
Copy link

ixalon commented May 31, 2022

@rochdev I had a good attempt at working around this issue within the traced application itself - I could get no combination of trace(), wrap() or bind() to work. Instead we are now applying a patch to 2.7.0 using @ekosz's PR which works perfectly (though means we're pinned to that version as 2.8.0 has seen a major refactor.

I'm not sure if there genuinely is an issue with async_hook here. I think it's just an oversight in the aws-sdk plugin that it makes assumptions about cb - which don't apply when using .promise(). We use async/await everywhere along side async_hook in this same application (for maintaining logging contexts and database transaction propagation) and there have been no issues.

@ixalon
Copy link

ixalon commented Jun 6, 2022

Hi @rochdev, unfortunately sticking with 2.7.0 is turning out to be problematic due to the memory leak (fixed by #2078 in 2.8.0).

Do you think #2099 may fix the issue we're seeing above? If so, when can we expect a release? 🙏

Edit: I can confirm #2099 fixes the issue we were having with traces involving AWS calls going missing. Many thanks @rochdev!

@Treparme
Copy link

Treparme commented Mar 5, 2023

any update on this? We also use sqs-consumer and would love to have our traces working with it :)

@JavaScriptBach
Copy link

+1 to this. I would love to know whether this feature is being prioritized and whether there are any workarounds we can use in the meantime. All our services communicate using SQS queues and sqs-consumer, so we currently have to implement request IDs and propagate them ourselves. Which is a bummer because this is a core dd-trace feature that is being advertised: https://www.datadoghq.com/blog/request-log-correlation/#the-solution-automatic-context-propagation

@jithin-zac
Copy link

Any update on this? We are facing a similar issue. We use promises as well for SQSClient from @aws-sdk/client-sqs library. The traces are available for the receiveMessage AWS operation but the trace context is not propagated further down to the consumers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants