MIGRATION ISSUE: Lambdas seemingly not cleaning up/disposing of RDS connections properly after upgrade #6369
Hi @gcv-epalmer
It's not clear how you are instantiating the client, though. This reads as though in v2 you were instantiating the client outside of the handler, but in v3 you don't? The lazy-loading method here does not tell us where the client is actually being instantiated. Please make sure you read this. Thanks!
My apologies @RanVaknin. In both cases we are instantiating the client outside of the handler. The difference between v2 and v3 is that we now only instantiate the client when a method is first called, whereas previously clients were instantiated whenever the file was imported. Here's our full setup to make things clear: we have a core library that holds all of our AWS clients along with some helper methods. Our S3 client, for example, looks like this:
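The snippet itself didn't make it into this thread; a minimal sketch of the lazy-loading core-library module being described might look like the following (the file layout and the getS3Client / getObjectAsString helper names are assumptions for illustration):

import { S3 } from "@aws-sdk/client-s3";

// Lives at module scope, outside the handler, so it is reused across warm invocations.
let s3Client: S3 | undefined;

// The client is only constructed the first time a helper is called,
// not when the file is imported.
export function getS3Client(): S3 {
  if (!s3Client) {
    s3Client = new S3({});
  }
  return s3Client;
}

export async function getObjectAsString(Bucket: string, Key: string): Promise<string | undefined> {
  const response = await getS3Client().getObject({ Bucket, Key });
  return response.Body?.transformToString(); // consume the stream so the socket can be released
}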
Then our handler function looks like this:
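The original handler snippet is also missing here; a sketch of a handler consuming that core library, under the same assumptions, could look like this (the event shape is illustrative):

import { getObjectAsString } from "./core/s3"; // hypothetical path to the core library above

export const handler = async (event: { bucket: string; key: string }) => {
  // No clients are constructed per invocation; the core library creates them lazily and reuses them.
  const body = await getObjectAsString(event.bucket, event.key);
  return { statusCode: 200, body };
};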
Also, to give an update, we tested this on Node 18 with no success. The next thing we're going to test is setting keepAlive to false on all of the AWS clients (not just the S3 client, which is where we started testing).
Another update: I made a few more changes. After doing so, I started getting different errors from processes that run with a high level of parallelism via a Promise.all. After googling this, I'm seeing all sorts of similar issues with connections being kept alive. Ultimately, the best approach is to reduce the concurrency of our functions to reduce the total number of connections, but it's still troubling that all of this behavior changed just after updating the clients. I'm now trying to figure out, for each of our AWS clients, the best strategy for initialization, keepAlive, and the various socket/connection parameters so that things get back to normal.
I'm really not sure; networking is not my strong suit. I came to testing this parameter because I have run into similar issues before with RDS dropping connections from high-concurrency processes and causing them all to fail. Prior to updating our SDK to v3, our nightly process was churning through everything fine, and we made no adjustments to that process. After updating the SDK, the RDS issues suddenly started cropping up from it, and they are eerily similar to what I've seen before. Since nothing changed but the SDK, I started looking at the differences in the SDK, and the main one that stood out to me was the keep-alive behavior.
That's interesting, something I might want to try. Though would there be a risk of multiple lambda invocations in a single context destroying that socket while another invocation is using it? Or would each lambda invocation have its own socket?
Not 100% sure; we deploy our Lambdas with a Serverless CloudFormation template. The Lambdas are given an IAM role with policies granting permissions to the other AWS services under our account. The SDK seems to handle credentials under the hood.
Somewhat pseudo-coded, but:
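The pseudo-code itself was lost from this thread; a rough reconstruction of the kind of unbounded fan-out being described (helper names are assumptions) would be:

// Every element fires its S3/RDS work at once, so the number of open sockets
// and database connections grows with the size of the input array.
async function processAll(items: Array<{ bucket: string; key: string }>) {
  return Promise.all(
    items.map(async (item) => {
      const body = await getObjectAsString(item.bucket, item.key); // hypothetical core-library helper
      return body?.length ?? 0; // placeholder for the real per-item work
    })
  );
}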
I've since gone through and put this behind a limiter to cap the concurrency (roughly as sketched below). It was just very telling that as soon as I started playing with the settings of our various AWS service clients, different issues came up, all related to concurrency and TCP connections. And again, I'm terrible at networking stuff, I'm a code person, I'm just trying my best without fully understanding the issue at hand, so I apologize and appreciate all of your help :)
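The comment doesn't say which limiter was used; as one hedged example, p-limit is a common way to cap in-flight work so the same fan-out only keeps a bounded number of sockets and connections open:

import pLimit from "p-limit";

const limit = pLimit(25); // 25 concurrent items is illustrative, not a recommendation

async function processAllLimited(items: Array<{ bucket: string; key: string }>) {
  return Promise.all(
    items.map((item) =>
      limit(async () => {
        const body = await getObjectAsString(item.bucket, item.key); // hypothetical core-library helper
        return body?.length ?? 0; // placeholder for the real per-item work
      })
    )
  );
}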
Yes, I believe it's possible for invocations to close each other's sockets, so ideally this function call shouldn't be necessary. Nothing stands out as looking like a bottleneck. Some possibilities to explore:
const getObject = await s3.getObject({ ... });
await getObject.Body?.transformToString(); // consumes the stream and allows the socket to close.
import { NodeHttpHandler } from "@smithy/node-http-handler";
const requestHandler = NodeHttpHandler.create({
httpsAgent: { maxSockets: 1000, keepAlive: true }
});
const client1 = new DynamoDB({ requestHandler });
const client2 = new DynamoDB({ requestHandler });
... await requests ...
requestHandler.destroy(); // close sockets for both clients.
Pre-Migration Checklist
UPGRADING.md
Which JavaScript Runtime is this issue in?
Node.js (includes AWS Lambda)
AWS Lambda Usage
Describe the Migration Issue
Update: FWIW, looking into this being a possible issue with Node20: nodejs/node#47130
We'll be testing with Node18 tonight
We have a nightly process that runs a large number of concurrent Step Functions executions containing Lambdas that open connections to RDS to get some data. These Lambdas use SDK v3 to talk to S3, Secrets Manager, and DynamoDB. We are using AWS RDS Proxy and read replicas to help manage the high number of concurrent connections, and our code always closes the connection after getting data. This process had been running fine for months and months at our current scale, until we upgraded to AWS SDK v3 and moved our Lambdas to Node 20 (from Node 16). Suddenly after updating, the Lambdas in the step function are erroring out from RDS in a few ways:
unnamed prepared statement does not exist
could not send data to client: Connection reset by peer
Timed-out waiting to acquire database connection
canceling statement due to conflict with recovery
After looking at some of the performance charts in RDS, it's clear there is a change in behavior on the day that our updated SDK lambdas are deployed:
You can see our lambda function duration has also increased after this update:
I know the SDK should have nothing to do with connecting to RDS, but after lots of testing the only thing that alleviates the issue is going back to SDK v2 on Node 16. So I don't know where the source of the issue is; it's a chicken-or-egg scenario. Is the longer Lambda duration causing us to keep RDS connections open longer, therefore bogging down RDS to the point where it breaks? Or is there some behind-the-scenes change in how the RDS connections are being opened and closed that is causing Lambda environments to stay alive longer? I've been testing various logical changes and optimizations with no effect. Only by rolling the code back to SDK v2 and Node 16 does our performance come back to normal.
Code Comparison
Previously, our v2 code initialized clients outside of handlers, like this:
V2 Code Snippet:
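The v2 snippet was not captured here; a minimal sketch of the pattern being described, with clients constructed at import time (the module layout is an assumption), is:

import AWS from "aws-sdk";

// With aws-sdk v2, these clients are created as soon as this file is imported,
// before any handler code runs.
export const s3 = new AWS.S3();
export const dynamodb = new AWS.DynamoDB.DocumentClient();
export const secretsManager = new AWS.SecretsManager();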
After upgrading and doing some investigation, reading issues other folks have been having led me to implement a 'lazy-loading' strategy.
V3 Code Snippet:
After reading about the changes to keep-alive and maxSockets, I tried to configure our S3 client to behave more like SDK v2, as I've seen some GitHub issues related specifically to the S3 client, which our Lambdas use heavily. And here is an example of the RDS connection and query logic that we use within our Lambdas:
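Neither snippet survived in this thread. First, a sketch of an S3 client with keep-alive and socket limits pinned explicitly (the specific values are illustrative, not the author's actual settings):

import { S3 } from "@aws-sdk/client-s3";
import { NodeHttpHandler } from "@smithy/node-http-handler";
import { Agent } from "https";

const s3 = new S3({
  requestHandler: new NodeHttpHandler({
    // v2-like behavior is approximated by configuring the agent explicitly
    // instead of relying on the v3 defaults.
    httpsAgent: new Agent({ keepAlive: false, maxSockets: 50 }),
  }),
});

And a sketch of RDS connection and query logic using pg-promise, assuming connection details resolved from the environment / Secrets Manager and an explicit release after every query:

import pgPromise from "pg-promise";

const pgp = pgPromise();

// Connection details (via the RDS Proxy endpoint) are assumptions for illustration.
const db = pgp({
  host: process.env.DB_PROXY_HOST,
  database: process.env.DB_NAME,
  user: process.env.DB_USER,
  password: process.env.DB_PASSWORD,
  max: 1, // one pooled connection per warm Lambda container
});

export async function queryRows(sql: string, values: unknown[]) {
  const conn = await db.connect(); // acquire a connection explicitly
  try {
    return await conn.any(sql, values); // run the query
  } finally {
    conn.done(); // always release the connection, even on error
  }
}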
Observed Differences/Errors
It's hard to provide specific error messages because this isn't a simply reproducible issue. The issue seems to have something to do with concurrent Lambda executions that are talking to RDS. I took some performance metrics from CloudWatch logs on the Lambda to try to see if any one particular client or piece of logic is taking longer now, and honestly our code execution seems faster on SDK v3, but something behind the scenes is still causing the Lambdas to take longer.
Here are the metrics I gathered from a test with the SDK v3 / Node 20 code vs. SDK v2 / Node 16:
On 8/6, with SDK v2 and node 16, we had 586 total executions. Here is a breakdown of avg # of milliseconds to complete the different steps within our lambda function:
Here are average performance metrics for that same set of executions
And compared to AWS SDK v3 on node 20, the same set of metrics for 620 total executions
On SDK v2, our lambda logic on average took longer to complete, looking at the steps that take the longest:
Rules v2: 4652.07ms avg
Rules v3: 3415.17ms avg
savedynamo v2: 3415.17ms avg
savedynamo v3: 1897.37ms avg
Yet the lambda performance logs indicate the duration to be greater on v3:
v2 duration avg: 9125.73ms
v3 duration avg: 49139.67ms
And there is also a significant increase in cold starts, going from 12.11% on SDK v2 to 28.87% on SDK v3.
I opened an AWS Support ticket for this, and they pointed out that errors in the Lambda could explain the higher duration and increased cold starts. But the errors are coming from RDS as a result of the high number of concurrent connections, and I can't explain why that changed.
Additional Context
AWS services involved:
All Project AWS Dependencies:
"@aws-sdk/client-cloudwatch-logs": "3.490.0",
"@aws-sdk/client-dynamodb": "3.614.0",
"@aws-sdk/client-lambda": "3.614.0",
"@aws-sdk/client-location": "3.614.0",
"@aws-sdk/client-s3": "3.614.0",
"@aws-sdk/client-secrets-manager": "3.614.0",
"@aws-sdk/client-ses": "3.614.0",
"@aws-sdk/client-sfn": "3.614.0",
"@aws-sdk/client-sns": "3.614.0",
"@aws-sdk/client-sqs": "3.614.0",
"@aws-sdk/client-ssm": "3.614.0",
"@aws-sdk/client-sts": "3.614.0",
"@aws-sdk/lib-dynamodb": "3.614.0",
"@aws-sdk/lib-storage": "3.614.0",
"@aws-sdk/s3-request-presigner": "3.614.0",
"@aws-sdk/util-dynamodb": "3.614.0",
Our postgres library to connect to RDS:
"pg": "8.12.0",
"pg-promise": "11.9.1",
Notably, we currently aren't bundling our app. I've seen that as a recommendation, so I'm going to try it even though it feels unrelated to this issue.
We have all of the AWS client packages installed as devDependencies to utilize the preinstalled SDK in the Lambda environments. I tried doing some tests where we bundle and deploy some AWS SDK v2 code alongside v3, but the bundle size becomes too large. We either have to go full Node 16 with SDK v2, or Node 20 with SDK v3; there's no middle ground for us without tediously splitting apart our services and repositories and maintaining separate copies.
We deploy our Lambdas through CloudFormation serverless templates; here's how the particular Lambda in question is set up: