Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds support for undici tracing #1642

Closed
wants to merge 23 commits into from

Conversation

flovilmart
Copy link

@flovilmart flovilmart commented Oct 5, 2021

What does this PR do?

This PR adds support for tracing undici, for the upcoming release.

Closes #1615

Motivation

Undici does not use the standard HTTP / HTTPS modules from node.js.
This is required so we can properly trace undici requests.

Plugin Checklist

Additional Notes

@flovilmart flovilmart force-pushed the undici-tracing branch 2 times, most recently from 18b1cfa to a8feda6 Compare October 5, 2021 20:55
@flovilmart flovilmart marked this pull request as ready for review October 5, 2021 21:12
@flovilmart flovilmart requested a review from a team as a code owner October 5, 2021 21:12
Qard
Qard previously approved these changes Oct 13, 2021
@flovilmart
Copy link
Author

@Qard , any suggestion regarding the node-ext and upstream koa failures? Is my setup wrong?

@Qard
Copy link
Contributor

Qard commented Oct 15, 2021

I'm pretty sure it's due to next.js tests trying to bundle it when it shouldn't. I believe there was a place to exclude things, but I forget where exactly. @rochdev might know how to deal with that.

@flovilmart
Copy link
Author

Got bitten by this; https://github.com/nodejs/node/blob/v16.11.1/lib/diagnostics_channel.js#L101-L103;

the channels are weak ref'd and I've spent about 2 days trying to understand why my tracing would not work while deployed and was working from the REPL / tests 😮‍💨
Will adjust the PR.

@rochdev
Copy link
Member

rochdev commented Oct 19, 2021

any suggestion regarding the node-ext and upstream koa failures? Is my setup wrong?

For upstream koa they are also failing on master so you can ignore them. For next I think rebasing might fix it since there was a recent change to how it's tested.

the channels are weak ref'd and I've spent about 2 days trying to understand why my tracing would not work

We were also bitten by this while implementing diagnostics_channel. This is supposed to be an implementation details but is impacting common use cases enough that it should be changed. I'll open a PR to Node when I have some time.

@flovilmart
Copy link
Author

is impacting common use cases enough that it should be changed

glad to know I'm not alone! Will rebase and see how it goes!

@flovilmart flovilmart force-pushed the undici-tracing branch 3 times, most recently from ae925f6 to 206ff99 Compare October 19, 2021 16:11
@flovilmart
Copy link
Author

@rochdev should I gate here for multiple registrations? or is the framework guaranteeing a single installation of the plugin?

@flovilmart flovilmart requested a review from Qard November 29, 2021 18:52
@flovilmart
Copy link
Author

Hi there! Any update on this one?

Copy link
Member

@rochdev rochdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late reply as we've been trying to move the plugin system to diagnostics_channel and figuring out the details of how to properly use it ourselves.

It turns out that undici cannot be instrumented using DC in its current state because it doesn't have the proper channels/events for context propagation, which we need for spans to be connected together properly and for tracking the execution and control flows. This is something we'll have to work with @mcollina to support for both undici and fastify when we have finalized the design. The short version is that in order for AsyncLocalStorage to be used properly, there must be points at which to call enterWith() both at the beginning and synchronous end of a function for a client. Without this we cannot activate the span properly without leaking to the callback. So we need a minimum of 4 events (start, end, async-end, error) and undici only implements 3.

In the meantime, can you alter the integration to monkey-patch undici instead of using DC? It's not ideal, but unfortunately it's the only way short-term, and it will have the benefit of supporting older versions as well. At least most of the current logic should remain untouched other than the instrumentation points.

const REFERENCE_CHILD_OF = opentracing.REFERENCE_CHILD_OF
const REFERENCE_NOOP = constants.REFERENCE_NOOP

function addError (span, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of code including this one seems to have been copied from the http plugin. Can it all be deduplicated with some sort of helper? Other helpers can be found in the plugins/util folder, so for example adding a http.js.

For this one specifically, it shouldn't even be needed at all since span.setTag('error', error) now handles this logic for you.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perfect! Let me have a look at how I can get the implementation going for the monkey patch as well.

@mcollina
Copy link

mcollina commented Dec 4, 2021

callback. So we need a minimum of 4 events (start, end, async-end, error) and undici only implements 3.

I do not have the time to look into it right now, but adding the missing one is just a PR away. It would probably take less time and be more maintenance friendly to just do that.

@flovilmart
Copy link
Author

@rochdev , if I recap, in order to use diagnostics_channels, we need to implement the event handling through async storage.
We cannot currently properly implement async_storage as we're missing an event from undici, (end or async-end I assume).
@mcollina or @rochdev do you happen to know where this 4th event need to be emitted from? I really wish to move forwards with diags channels, as the monkey patch implementation is not trivial coming from there.

@rochdev
Copy link
Member

rochdev commented Dec 6, 2021

@flovilmart The events that are needed are basically:

  • start: When the operation starts, which would usually be when the function from the public API is called. We need this event to create a span, create a new AsyncLocalStorage store, and add the span to the store.
  • end: When the operation ends but on the current stack, so not the entire asynchronous operation but only the synchronous part of it, which would usually be when the function from the public API returns. We need this event to restore the previous store from before the function was called.
  • error: When any error occurs, which is needed to attach the error to the current span.
  • async-end: When the operation finished completely, including anything asynchronous that it was waiting on. We need this to finish the span.

Additionally, while promises are handled automatically by async_hooks, callbacks are not, so any callback needs to be properly bound to the correct AsyncResource.

So an example fake request function could look something like this:

const { AsyncResource } = require('async_hooks')

function request (callback) {
  const req = {}

  callback = AsyncResource.bind(callback)

  startChannel.publish(req)

  doActualAsyncRequest(req, (err, res) => {
    if (err) {
      errorChannel.publish(err)
    }

    asyncEndChannel.publish(req)

    callback(req, res)
  })

  endChannel.publish(req)
}

@flovilmart
Copy link
Author

Oh great! Thanks @rochdev ; let me see how I can get this implemented.

@flovilmart
Copy link
Author

@rochdev , I'm curious; and perhaps the current implementation is completely wrong, but given that this implementation keeps a WeakMap of Request -> Spans, isn't it circumventing this problem by manually creating / terminating spans?
I'm currently running this implementation in a production service, and the traces are sane 🤔

@rochdev
Copy link
Member

rochdev commented Dec 7, 2021

@flovilmart The context is never activated in the current implementation, unlike http, so any child of undici would end up in a different context. Since this is a client, it's less likely to have children, and when it does it could be less often, for example tcp.connect, which is why it appears to be working properly for your use case. However, for tracing we shouldn't assume what can or cannot have children so this should be handled. The problem is that without the missing event, handling it would be impossible since you would have no way to know when the function returns, meaning that the new context holding the span would leak after the function call. The same issue would potentially affect the callback if it's not properly bound with async_hooks.

Unfortunately these are issues that are much easier to solve with monkey-patching, but since undici is in active development it should also be relatively straightforward to add the missing events/bindings to the library itself and then use that in this PR which would only require small changes.

We're also working on a doc to explain how to do tracing using DC, especially since it's much more involved to get it right compared to monkey-patching, which makes it more difficult to use for external contributors.

@flovilmart
Copy link
Author

Thanks a lot @rochdev for taking the time for this explanation! I concur that monkey patching is more straightforward then! Let me reevaluate both approaches.

@rochdev
Copy link
Member

rochdev commented Dec 7, 2021

@flovilmart Sorry for the runaround with this PR. We hadn't quite realized that the current events were incomplete when we recommended trying to use DC for this. Let me know if you need any help to bring this to the finish line or have any questions about either approach.

@flovilmart
Copy link
Author

@rochdev no problem! All's good; I did not realize we were missing so many events.

@artur-ma
Copy link

artur-ma commented Dec 26, 2021

@rochdev

async-end: When the operation finished completely, including anything asynchronous that it was waiting on. We need this to finish the span.

Does it mean the "async-end" should be emitted when body is consumed (read end from socket) ?
If so, @flovilmart correct me if im wrong the current implementation with WeakMap is wrong, since the user is receiving "Readable" instance which may be consumed at any time later, means the span does not include the socket read time.

@rochdev
Copy link
Member

rochdev commented Jan 10, 2022

@artur-ma According to the current implementation for http yes, it would be after the response has been received. However, these are just the bare minimum required for tracing to work at all, but it definitely wouldn't hurt to have more events that are more precise, for example adding some sort of response or headers-received or something similar, which I believe already exist in Undici although I haven't looked for a few weeks.

@flovilmart
Copy link
Author

flovilmart commented Jan 11, 2022

@artur-ma sure, this doesn't include body reading probably it would need it's own span outside of the request span. What do you think? I'm saying outside, as some users may not even read the body, but simply rely on the headers.

Comment on lines 311 to 316
const span = tracer.startSpan('http.request', {
childOf,
tags: {
[SPAN_KIND]: CLIENT
}
})
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we init the span early and pass it through the asyncLocalStorage. not sure this is exactly what was intended

'http.url': uri,
'service.name': getServiceName(tracer, config, request)
})
requestSpansMap.set(request, span)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we keep the request in the WeakMap as for some reason, they're not available in the subsequent diagnostics channels events. Which makes me wonder why? and also if we can rely on this span to exist in the undici:request:create event

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you open an issue about this? Or a PR to add them? saving the WeakMap will greatly reduce overhead.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean. Is that more efficient to store through the weak map than the async context?
or would you want something on the request object to add external context in the form of a weak map?

@flovilmart
Copy link
Author

@rochdev I believe I made some strides in the right direction:

  • calls are now wrapped
  • we use diagnostics channels in order to
    • inject span with all the request info
    • inject propagation headers in the request
    • capture connect / abort errors
    • capture status code
  • we use the wrapping to
    • initially create the spans
    • propagate the async context
    • properly wrap the pipeline command
    • finish the spans when the promise resolves

I also extracted commons from the http client wrapper as suggested.
The test suite covers both using client.[method] and undici.[method] and ensure both emit traces properly at a high level.

❓ If we want to be extensive, we should run the whole suite with all methods we'll probably find some wholes in there.

The instrumentation is a bit all over the place with the 2 wrapping strategies, I could attempt to simplify a bit 😅.

let me know what you think.

@flovilmart flovilmart requested a review from rochdev February 2, 2022 13:32
@flovilmart
Copy link
Author

@mcollina how would you approach instrumenting reading the response body? Is there a single member that I can wrap on the response and watch for an end event? Ideally, I'd love to add a child span that just represents reading the request body. Is it possible to get an event when the stream starts being consumed? Or should I simply finish the span only when the body end event is emitted?

@mcollina
Copy link

This is a bit tricky and I don't have the solution at the top of my head without digging deep into the code.

What I would try first is to add one event listeners for the 'close' event on the stream body itself.

@flovilmart
Copy link
Author

Thanks for the tip @mcollina !

@flovilmart
Copy link
Author

closing as another implementation is coming.

@flovilmart flovilmart closed this Mar 3, 2023
@flovilmart flovilmart deleted the undici-tracing branch January 15, 2024 23:20
@ezequielmiranda87
Copy link

ezequielmiranda87 commented Feb 20, 2024

Hey, there @flovilmart Just wanted to check in on this. Any updates on the new implementation? Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature Request: Undici / Global Fetch integration
7 participants