Help debugging ECONNRESET problems #1062
Could be both. Try:
console.log('methods', [...error.gotOptions.retry.methods]);
console.log('statusCodes', [...error.gotOptions.retry.statusCodes]);
console.log('errorCodes', [...error.gotOptions.retry.errorCodes]);
Also the stack trace seems to be incomplete. Can you try executing
Unfortunately no. I've got some questions anyway:
Please try the following:
If none of the above helps, you may want to try Got 10. There are some breaking changes compared to Got 9, but it's got some nice features: fixed retry functionality (twice), DNS cache, Brotli support, Node.js 13 support, better async stacktraces, experimental pagination API and other bug fixes. I'm willing to send a PR with the upgrade if necessary :)
Got 10 has some bugs but I don't think they would affect Renovate. Anyway, half of the issues are going to be fixed in an upcoming rewrite of Got, because the current design has some limitations and the performance can be improved.
Also, Got 10 provides native TypeScript typings :)
Hi @szmarczak, I appreciate the very quick response! FYI this is only happening in production, so I am a little constrained as to how "debuggy" I can get or how much risk I can take, because there are well over 100k installed repos and any mistake can have large consequences (ask me how..) I am adding specific logging for
I need to check that this is OK to do in production. I notice
No. I've run Renovate locally with some of the commonly failing ones, but they download fine.
This would seem unlikely. The problem occurs hour after hour, and has "clusters" of particular package failures that indicate it's not random.
No
There seems to be a correlation that the ones that fail are mostly large. e.g. https://registry.npmjs.org/@typescript-eslint%2Feslint-plugin is 14MB. However there are other even larger ones that have had zero failures.
Running Node 12. Node 13 is not really an option considering it's unstable. I am a few releases behind on Node 12 though, so will upgrade that in production today.
I'd prefer not to, unless we have a reason to think it's the problem.
No
Can you give an example of this in a
Unfortunately my logger doesn't want to share errorCodes and statusCodes:
Any trick for this? Here is the code in case that helps: https://github.com/renovatebot/renovate/blob/237eeffe6fa74335c817d308431ee9551b431c3b/lib/datasource/npm/get.ts#L226-L235 We use the bunyan logger but have a custom serializer for
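(As an illustration of one way around this, a minimal sketch of a bunyan serializer that spreads the Set-typed retry options into plain arrays before logging; the serializer below is hypothetical, not Renovate's actual code:)
const bunyan = require('bunyan');

// Hypothetical serializer: copy the standard error fields, then convert the
// Set-typed retry options from Got 9 into arrays so bunyan can stringify them.
const errSerializer = (err) => ({
  ...bunyan.stdSerializers.err(err),
  errorCodes: err.gotOptions ? [...err.gotOptions.retry.errorCodes] : undefined,
  statusCodes: err.gotOptions ? [...err.gotOptions.retry.statusCodes] : undefined,
});

const logger = bunyan.createLogger({
  name: 'renovate', // illustrative name
  serializers: { err: errSerializer },
});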
Something else surprising to me now. I have increased retries to 5: https://github.com/renovatebot/renovate/blob/237eeffe6fa74335c817d308431ee9551b431c3b/lib/datasource/npm/get.ts#L111 How likely is it that we still end up with the same number of
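(For reference, a minimal sketch of raising the retry count to 5 on a Got 9 instance; this is illustrative, not the Renovate code linked above:)
const got = require('got');

// In Got 9, a plain number sets the maximum number of retries.
const client = got.extend({ retry: 5 });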
Yup. If you believe that unlimited stack trace is bad, you can do
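(The snippet being referred to did not survive in this transcript; it most likely concerns V8's Error.stackTraceLimit, so the following is an assumption rather than the exact code from the thread:)
// Capture more stack frames than the default 10, while keeping the limit
// finite so error objects stay reasonably small. The value is illustrative.
Error.stackTraceLimit = 50;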
This could be related to the network speed and/or the
Do they look like
Well, I know from experience that the http modules are the most stable in the newest Node.js versions...
My point is that your current DNS provider can point to overloaded servers or have outdated entries. Try providing a
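(The suggestion above is cut off; one plausible reading is providing a cached DNS lookup, e.g. via the cacheable-lookup package that Got 10 uses for its dnsCache option. A rough sketch under that assumption:)
const https = require('https');
const CacheableLookup = require('cacheable-lookup');

const cacheable = new CacheableLookup();
// Requests made through this agent now resolve hostnames via the cache.
cacheable.install(https.globalAgent);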
Yep, it is.
Normalized:
logger.warn(
  {
    err,
    errorCodes: [...err.gotOptions?.retry?.errorCodes],
    statusCodes: [...err.gotOptions?.retry?.statusCodes],
    regUrl,
    depName: name,
  },
  'npm registry failure'
);
Assuming the NPM uptime is ~100%, the chance is almost 0%.
Add a
I'll send you a PR to upgrade to Got 10 today. That should fix the problem with retries.
Thanks, I'll do that.
I'm not using
It's mostly scoped packages and mostly large ones, but still a few counterexamples too. I think scoped packages have a different "backend" within npmjs, so scoped vs non-scoped can definitely explain some differences at times. I saw the
FYI updating to Node v12.15.0 did not appear to have any impact on the stats. I worked out something else that may be making the problem worse. It appears that the custom got cache I added a long time ago blocked my own attempts at
After adding in "manual" retries (catching got errors and recursing) I was finally able to lower the error rate. Anecdotally it seems like most are recovering first-time, as I think we had both thought should happen. My best guess is that Renovate's custom got caching behaviour prevents got's own retry mechanism from succeeding. I wonder if it's possible to add a
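(A rough sketch of the "manual" retry approach described above, assuming it simply catches the Got error and recurses with a decremented counter; this is not Renovate's actual implementation:)
const got = require('got');

// Hypothetical helper: retry ourselves when Got's built-in retry never fires,
// e.g. because a caching wrapper swallows the error before Got can retry.
async function getWithRetry(url, options, retriesLeft = 5) {
  try {
    return await got(url, options);
  } catch (err) {
    const retriable = err.code === 'ECONNRESET' || err.name === 'ParseError';
    if (retriesLeft > 0 && retriable) {
      return getWithRetry(url, options, retriesLeft - 1);
    }
    throw err; // rethrow, as suggested later in the thread
  }
}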
Yes, because
So it seems that retries in Got 9 are buggy. I'd prefer to switch to Got 10.
Dunno. Possibly.
Should be possible. Just remember to rethrow the error. I'll send you a PR with Got 10 soon.
I did some basic work here: renovatebot/renovate#5469 but you still need to fix the types and run the tests. If something fails please let me know straight away :)
I discovered that unfortunately our
What does the
Note that it may be missing in the TS types:
got(url, {...options, readableHighWaterMark: 1024 * 1024 * 10}); // 10MB
It just stores (max) 10MB of the incoming response data instead of the default 16KB.
Unfortunately, it looks like a multi-megabyte dump in my log files whenever it happens! I checked a few and they appear to just be truncated JSON responses from npmjs, as if receiving data finished early.
Thanks. I did not find many references to this in my searches. Do you know of any references to people implementing this type of increase in a client without it causing something else to break?
I did the research on my own.
As you can see, the options are passed all along, so you don't need to set it manually on the
That's what I expected. Please let me know if the
Assuming I inserted the code correctly, it unfortunately has had zero effect on the number of errors observed in production: renovatebot/renovate@8598c5e. The errors are still about 75% ECONNRESET and 25% ParseError. Retrying solves the problem in each case though.
Weird. I'll try to reproduce the issue locally.
I think you should upgrade to Got 10 and see if it's gone.
Got 10 doesn't seem stable/battle-tested enough yet to deploy to something the size of Renovate's production (1m+ requests per hour). I think I'd be jumping out of the frying pan and into the fire...
Well, if you have many instances of Renovate you can just deploy to one and roll back if something doesn't work as it should. If that's still a no, I'll try to reproduce the issue locally :)
I would like to soon, but I need to think about how I can measure errors before and after to try to detect the things that would go wrong.
got 10 does not accept
I think we can remove it, as it seems to have no effect (assuming I configured it correctly).
It has no effect because it didn't work. Seems like
One more question: do the requests fail in bulk? Like if request A fails, does request B fail even 1s later?
There is some clustering in time, yes. Certainly the problems tend to occur together rather than separately, including when one repository depends on multiple of the problem libraries. I'll try to sort a snapshot of logs and see how pronounced the clustering is.
Is there any way to determine if we have too many
The past two days have seen a big problem in our production app that I've tentatively traced down to
By looking at your code I can say that it's impossible unless you're using https://github.com/sindresorhus/import-fresh
If you're using a custom instance of Agent you can use https://runkit.com/szmarczak/5e5e64d2c38f7e0013896198 or if you use the global one you can do https://runkit.com/szmarczak/5e5e6a5010e7d500137cb9ae
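(The two RunKit snippets aren't reproduced in this transcript; as a rough sketch of the global-agent variant, counting sockets that are currently open or queued on Node's default agents:)
const http = require('http');
const https = require('https');

// An http.Agent keeps per-host lists of active sockets and of requests
// waiting for a free socket; summing them gives a quick usage snapshot.
function countSockets(agent) {
  const open = Object.values(agent.sockets).reduce((n, list) => n + list.length, 0);
  const queued = Object.values(agent.requests).reduce((n, list) => n + list.length, 0);
  return { open, queued };
}

console.log('http', countSockets(http.globalAgent));
console.log('https', countSockets(https.globalAgent));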
Well, if there was a memory leak then the Node.js app would have crashed. If your CPU spins up to 100% then I'm pretty sure there's an infinite loop. It's best to use some resource monitoring tools like https://prometheus.io/
That's definitely a bug. As of now I cannot say whether it's in the Got core or whether the Got instance was configured improperly. When did this start to occur? Was it before renovatebot/renovate@18d2c52 or after?
Definitely before that commit; it started without any change. Maybe we can try the custom agent pattern to log sockets.
CPU sits at 0%. Adding copious logging messages appears to show that the last thing that happened is a
Renovate does support the global proxy agent, although it's not in use for the hosted app as it has direct internet connectivity. I'd need to do some code tracing to work out if it's ever called/initialized or not.
What are the Got options for these requests? Are they the same as in the first post?
FYI I was finally able to verify got v11 in production and the npm errors immediately stopped. Thank you for the assistance. |
@rarkins By the way, did you upgrade Node.js also when switching to Got v11? |
Nope, we switched to v12 a long time ago.
I know posting comments on closed issues is not very helpful, but we're seeing many
@kirillgroshkov It comes directly from the TCP socket, meaning either your network was disrupted or the end server is having issues. |
What would you like to discuss?
My tool uses got@9.6.0 and regularly gets these types of errors from npmjs:
I'm wondering, does the retry part above indicate that there's actually no retrying? Or could it indicate an incomplete stringifying of the object, etc.?
Additionally, is there any additional debugging of the err object I can do once such errors are thrown?