
Undici instrumentation breaks fastify throughput tracking #972

Closed
artur-ma opened this issue Nov 9, 2021 · 17 comments

@artur-ma commented Nov 9, 2021

Description

We enabled undici instrumentation in our production environment and saw a huge decrease in our service throughput, but all other KPIs were stable (which means the issue is with the metric; the actual throughput didn't drop).

We are using fastify with Node 16. Our base New Relic configuration is:

NEW_RELIC_FEATURE_FLAG_NEW_PROMISE_TRACKING=true
NEW_RELIC_FEATURE_FLAG_UNRESOLVED_PROMISE_CLEANUP=false

We enabled undici instrumentation (newrelic SDK v8.5.1) using these flags:

NEW_RELIC_FEATURE_FLAG_UNDICI_INSTRUMENTATION=true
NEW_RELIC_FEATURE_FLAG_UNDICI_ASYNC_TRACKING=false
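
(For reference, a minimal sketch of the base flags plus the undici flags expressed in a newrelic.js config file, assuming the agent's feature_flag config section mirrors the NEW_RELIC_FEATURE_FLAG_* environment variables; the app name is hypothetical.)

'use strict'
// newrelic.js — hedged equivalent of the environment variables above
exports.config = {
  app_name: ['my-fastify-service'], // hypothetical app name
  feature_flag: {
    new_promise_tracking: true,
    unresolved_promise_cleanup: false,
    undici_instrumentation: true,
    undici_async_tracking: false
  }
}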

Enabling these flags caused the first drop you see in the screenshot.

[screenshot: throughput graph showing the first drop]

Our next try was to set undici_async_tracking to "true":

NEW_RELIC_FEATURE_FLAG_UNDICI_INSTRUMENTATION=true
NEW_RELIC_FEATURE_FLAG_UNDICI_ASYNC_TRACKING=true

This change increased the throughput for a while after an app restart, but then the throughput dropped again (this is the second drop in the screenshot).

[screenshot: throughput graph showing the second drop]

After disabling undici instrumentation, throughput went back to normal.

Expected Behavior

I don't see how HTTP client instrumentation (undici) in my service (fastify) can cause the throughput metric to break. As I mentioned, the real throughput stayed the same; it's just the reported metric that broke.

Your Environment

Fastify v3.22.1
Node 16.11
undici 4.7.1

@bizob2828 (Member)

Thanks for logging this @artur-ma, I will take a look in a bit. Is there anything interesting to note about the fastify app? Can I just make a barebones fastify app, throw some traffic at it with the undici feature flag on and off, and notice this difference?

@artur-ma (Author) commented Nov 9, 2021

@bizob2828 I'm not sure about that; we have another service that uses undici as a gateway with fastify + the fastify-reply-from lib, and it works as expected.

I will try to run some local tests to understand where it comes from

@bizob2828 (Member)

Ok @artur-ma. I'll try to reproduce. If I can't, I'll probably have you go through our technical support to get the right info.

@bizob2828 (Member) commented Nov 11, 2021

@artur-ma I ran a test with a fastify + undici app. I ran with just

NEW_RELIC_FEATURE_FLAG_NEW_PROMISE_TRACKING=true
NEW_RELIC_FEATURE_FLAG_UNRESOLVED_PROMISE_CLEANUP=false

then with

NEW_RELIC_FEATURE_FLAG_NEW_PROMISE_TRACKING=true
NEW_RELIC_FEATURE_FLAG_UNRESOLVED_PROMISE_CLEANUP=false
NEW_RELIC_FEATURE_FLAG_UNDICI_INSTRUMENTATION=true
NEW_RELIC_FEATURE_FLAG_UNDICI_ASYNC_TRACKING=false

I used a tool called hey, and here is the throughput it measured:

Without undici instrumentation:

Summary:
  Total:	600.2235 secs
  Slowest:	4.9640 secs
  Fastest:	0.0156 secs
  Average:	0.1214 secs
  Requests/sec:	411.6900

With undici instrumentation and async tracking off:

Summary:
  Total:	600.2269 secs
  Slowest:	2.0420 secs
  Fastest:	0.0168 secs
  Average:	0.2268 secs
  Requests/sec:	220.4117

[screenshot: New Relic throughput graph, 2021-11-11 1:45 PM]

Throughput dropped to about 53% of the baseline, and the New Relic graph is aligned with the output hey was giving me:

411 reqs/sec * 60 ≈ 24.7k/min
220 reqs/sec * 60 ≈ 13.2k/min

I don't see a discrepancy between the actual throughput and what New Relic One is rendering. I recommend you file a ticket with our technical support team so they can collect environment info, logs, and possibly a repro case.

@artur-ma (Author)

Sure, will open a support ticket, thanks!

@artur-ma (Author)

Closing since we cannot reproduce it anymore. Very strange.

@artur-ma (Author) commented Nov 19, 2021

Reopening again

[screenshot: throughput drop recurring]

It happens sporadically. This time it happened 3 days after we changed the configuration, and I don't know what the trigger is. We also can't keep trace logs enabled for 3 days, and enabling trace logging while the issue is occurring would require an app restart, which makes the issue go away.
This is the throughput after the app restart:

[screenshot: throughput after restart]

@artur-ma artur-ma reopened this Nov 19, 2021
@bizob2828 (Member)

@artur-ma I'm sorry to see that the issue resurfaced. We recommend working with our support team over at support.newrelic.com; they can help deep-dive into the problem and collect the necessary information, some of which may be sensitive (such as logs).

@artur-ma (Author)

@bizob2828 After reaching out to support, they asked us to enable trace logs, which is impossible, as described above:

> we also can't keep trace logs enabled for 3 days, and enabling trace logging while the issue is occurring would require an app restart, which makes the issue go away

@michaelgoin (Member)

I've been able to reproduce this a couple of ways, but only one way consistently: by modifying an endpoint to hang for a period of time and then allowing traffic to resume (in this case by exiting, but there may be other ways) while other endpoints are being hit. It is quite contrived but at least demonstrates a problem. I have not toggled all the potential variables to confirm the exact contributing factors.

A standalone reproduction (of at least a case similar to what is documented here) can be found at https://github.com/michaelgoin/undici-fastify-repro. The default configuration, which doesn't lump everything under a single transaction, may not show a drop in traffic in the UI (in fact it may increase due to quick failures of what doesn't get lumped together), but the logs will show the state conflation.
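
For illustration only (not the linked repro), a minimal sketch along the lines of that description; the routes, port, and upstream URL are hypothetical:

// repro-sketch.js — hypothetical; run with the agent preloaded: node -r newrelic repro-sketch.js
'use strict'
const fastify = require('fastify')() // fastify v3-style usage, per the reported environment
const { request } = require('undici')

// Normal endpoint that makes an outbound undici call.
fastify.get('/ok', async () => {
  const { statusCode, body } = await request('http://example.com') // hypothetical upstream
  // Drain the response body so the connection can be reused.
  for await (const chunk of body) {
    void chunk
  }
  return { ok: true, statusCode }
})

// Endpoint that hangs for a long time before responding, mimicking the
// "endpoint hangs, then traffic resumes" scenario described above.
fastify.get('/hang', async () => {
  await new Promise((resolve) => setTimeout(resolve, 60000))
  return { ok: true }
})

fastify.listen(3000) // fastify v3 positional-port form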

@coreyarnold added the points: 5 and 1-2 weeks labels and removed the support label Mar 16, 2022
@artur-ma (Author)

Maybe this is somehow related: nodejs/node#42170
Even if it's not, it should be changed for now (until it is fixed in Node): to avoid garbage collection of the subscribers, you have to assign the channel to a variable.

Here, for example: https://github.com/newrelic/node-newrelic/blob/main/lib/instrumentation/undici.js#L72
it should be:

const createChannel = diagnosticsChannel.channel('undici:request:create');
createChannel.subscribe(({ request }) => {
  // ... existing handler body
});

@bizob2828 (Member)

Nice find @artur-ma. I'm going to see if it addresses the issue you raised here 🤞🏻

@bizob2828 (Member) commented Mar 30, 2022

@artur-ma I tried saving references to the channels and ran our test, and it is still an issue. However, that's not to say your issue hasn't been resolved; the repro case we created is very brute-force. So I'm going to make this suggested change, and perhaps you could retest. Or could you apply the patch below and test it yourself without us releasing a new version of the agent?

commit c43cbf2bd168e6901124bc6bcc850c7ffd1691ca
Author: Bob Evans <robert.evans25@gmail.com>
Date:   Wed Mar 30 12:09:24 2022 -0400

    updated diag channel subscriptions to be hard references

diff --git a/lib/instrumentation/undici.js b/lib/instrumentation/undici.js
index 17eafdc6..d1b2a8b2 100644
--- a/lib/instrumentation/undici.js
+++ b/lib/instrumentation/undici.js
@@ -69,7 +69,8 @@ module.exports = function addUndiciChannels(agent, undici, modName, shim) {
    * @param {object} params
    * @param {object} params.request undici request object
    */
-  diagnosticsChannel.channel('undici:request:create').subscribe(({ request }) => {
+  const requestCreate = diagnosticsChannel.channel('undici:request:create')
+  requestCreate.subscribe(({ request }) => {
     const parent = getParentSegment()
     request[SYMBOLS.PARENT_SEGMENT] = parent
     if (!parent || (parent && parent.opaque)) {
@@ -113,7 +114,8 @@ module.exports = function addUndiciChannels(agent, undici, modName, shim) {
    * @param {object} params.request undicie request object
    * @param {TLSSocket | net.Socket} socket active socket connection
    */
-  diagnosticsChannel.channel('undici:client:sendHeaders').subscribe(({ request, socket }) => {
+  const sendHeaders = diagnosticsChannel.channel('undici:client:sendHeaders')
+  sendHeaders.subscribe(({ request, socket }) => {
     const parentSegment = request[SYMBOLS.PARENT_SEGMENT]
     if (!parentSegment || (parentSegment && parentSegment.opaque)) {
       return
@@ -156,7 +158,8 @@ module.exports = function addUndiciChannels(agent, undici, modName, shim) {
    * @param {object} params.request undici request object
    * @param {object} params.response { statusCode, headers, statusText }
    */
-  diagnosticsChannel.channel('undici:request:headers').subscribe(({ request, response }) => {
+  const requestHeaders = diagnosticsChannel.channel('undici:request:headers')
+  requestHeaders.subscribe(({ request, response }) => {
     const activeSegment = request[SYMBOLS.SEGMENT]
     if (!activeSegment) {
       return
@@ -187,7 +190,8 @@ module.exports = function addUndiciChannels(agent, undici, modName, shim) {
    *
    * @param {object} params.request undici request object
    */
-  diagnosticsChannel.channel('undici:request:trailers').subscribe(({ request }) => {
+  const requestTrailers = diagnosticsChannel.channel('undici:request:trailers')
+  requestTrailers.subscribe(({ request }) => {
     endAndRestoreSegment(request)
   })
 
@@ -198,7 +202,8 @@ module.exports = function addUndiciChannels(agent, undici, modName, shim) {
    *
    * Note: This event occurs before the error handler so we will always log it for now.
    */
-  diagnosticsChannel.channel('undici:request:error').subscribe(({ request, error }) => {
+  const requestError = diagnosticsChannel.channel('undici:request:error')
+  requestError.subscribe(({ request, error }) => {
     endAndRestoreSegment(request, error)
   })
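
If it helps while retesting, here is one hedged way to watch, at runtime, whether a fresh lookup of the channel still reports a subscriber (hasSubscribers is part of Node's diagnostics_channel Channel API; interpreting a flip to false as the GC behavior from nodejs/node#42170 is an assumption):

'use strict'
const diagnosticsChannel = require('diagnostics_channel')

// Periodically log whether the channel currently reports any subscribers.
// If the originally subscribed Channel object were collected (per the linked
// Node issue), a fresh lookup here might report false even though subscribe()
// was called at startup.
setInterval(() => {
  const ch = diagnosticsChannel.channel('undici:request:create')
  console.log('undici:request:create hasSubscribers:', ch.hasSubscribers)
}, 60000).unref()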


@coreyarnold (Member)

@artur-ma Just wanted to give you a heads up that we're considering work related to this in the upcoming roadmap. We don't have a definitive timeline yet. I'll update when we do. Thanks!

@liesislukas commented Jan 30, 2023

FYI, I have an issue that seems related to undici: https://discuss.newrelic.com/t/node-i-see-two-separate-traces-but-no-connection-between-micro-services/192927

Yes, the issue is that undici is not supported. Can't wait to see it supported by New Relic.

https://undici.nodejs.org/#/

@coreyarnold coreyarnold removed the dnm2j label Aug 29, 2023

@bizob2828 bizob2828 self-assigned this Sep 1, 2023
@bizob2828 (Member)

@artur-ma I'm revisiting this after the work we just did to remove the undici feature flag and add native fetch instrumentation. I tested on 20.60 and 18.17.1, and I can no longer reproduce this issue with the brute-force repro mentioned above. I'm going to close this. If we can get more concrete steps to reproduce after releasing 11.1.0, feel free to reopen. Lastly, I apologize for the time it took to get around to this. 👋🏻 priorities 👋🏻
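
For anyone landing here later, a rough sketch of what this looks like without any feature flags, assuming an agent version of 11.1.0 or newer where undici/fetch instrumentation is on by default; the route, port, and upstream URL are hypothetical:

// app.js — run with the agent preloaded: node -r newrelic app.js
'use strict'
const fastify = require('fastify')()

fastify.get('/proxy', async () => {
  // Global fetch (Node 18+) is backed by undici, so this outbound call
  // should be captured as an external segment under the transaction.
  const res = await fetch('http://example.com') // hypothetical upstream
  return { upstreamStatus: res.status }
})

fastify.listen({ port: 3000 }) // fastify v4-style listen options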

@bizob2828 bizob2828 moved this to Done: Issues recently completed in Node.js Engineering Board Jul 15, 2024