
Coldstart Performance/Bundle Size Regression in 3.577.0 #6144

Closed
perpil opened this issue May 30, 2024 · 27 comments
Assignees: kuhe
Labels: bug, closed-for-staleness, p2

Comments

@perpil commented May 30, 2024

Checkboxes for prior research

Describe the bug

I'm seeing a larger-than-normal regression in cold start performance and tree-shaken bundle size after 3.575. The table below shows cold start latency and esbuild-built bundle size by SDK version. It's a small sample size (3 runs), but I'm seeing +24 ms in cold start time and +146 KB in bundle size.


| aws sdk | node version | avg(@initDuration) | bundleSize | # runs |
| --- | --- | --- | --- | --- |
| 3.575.0 | v20.13.1 | 246.79 | 316588 | 3 |
| 3.577.0 | v20.13.1 | 270.0433 | 463023 | 3 |

SDK version number

@aws-sdk/client-dynamodb@3.577.0

Which JavaScript Runtime is this issue in?

Node.js

Details of the browser/Node.js/ReactNative version

v20.13.1

Reproduction Steps

Read the README in this Repro

Observed Behavior

Increased coldstart times and bundle size.

Expected Behavior

Minimal impact to coldstart and bundle size on tree-shaken code.

Possible Solution

Looking at the bundle analyzer, the biggest size impacts come from the @smithy/* packages, but fast-xml-parser also appears now; I haven't dug much beyond that.
(bundle analyzer screenshots: 3.575 repro and 3.577 repro)

Additional Information/Context

In the app that I upgraded, I saw a larger impact on cold start performance (>90 ms) between 3.563 and 3.577. However, I was using more clients: client-sts, client-dynamodb, client-iam, and lib-dynamodb. To keep things simple, I did the repro with just client-dynamodb.

@perpil perpil added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels May 30, 2024
@mymattcarroll

The size difference in the following packages has pushed our Lambda deployment package size over the hard limit of 262144000 bytes (250 MB):

  • @aws-sdk/client-backup
  • @aws-sdk/client-lambda
  • @aws-sdk/client-resource-groups-tagging-api
  • @aws-sdk/client-s3
  • @aws-sdk/client-secrets-manager
  • @aws-sdk/client-sesv2
  • @aws-sdk/client-sqs
  • @aws-sdk/client-ssm
  • @aws-sdk/client-textract
  • @aws-sdk/credential-providers
  • @aws-sdk/lib-storage
  • @aws-sdk/s3-request-presigner

@perpil (Author) commented May 30, 2024

@mymattcarroll Although there was a bump in size, I was seeing increases on the order of kilobytes, not megabytes. I limited my repro to one package though. Were you close to the 250 MB limit before? Do you have numbers for your size between 3.575 and 3.577 or later? If you can drop some numbers in here, that may help determine whether it is the same cause or a different one.

@kuhe kuhe self-assigned this May 30, 2024
@kuhe kuhe added p2 This is a standard priority issue and removed needs-triage This issue or PR still needs to be triaged. labels May 30, 2024
@kuhe (Contributor) commented May 30, 2024

You should mark fast-xml-parser as external if you don't need an XML-based client.

Another ~10 kB, which we (the SDK team) should fix, comes from the inclusion of tslib in 3.576.0; it can be removed.

After these fixes the difference should only be 134.6 kB -> 140.3 kB.

test code:

npm init -y
npm i @aws-sdk/client-dynamodb@3.577.0 esbuild --save-exact
echo "import { DynamoDBClient, ListTablesCommand } from \"@aws-sdk/client-dynamodb\";\
      \
      export { DynamoDBClient, ListTablesCommand };" > index.ts
npx esbuild --bundle index.ts --outfile=dist/bundle.js --minify --analyze --format=esm --main-fields=module,main --platform=node \
  --external:@aws-sdk/client-sts \
  --external:@aws-sdk/client-sso* \
  --external:@aws-sdk/credential-prov* \
  --external:@aws-sdk/token-providers \
  --external:fast-xml-parser
npm i @aws-sdk/client-dynamodb@3.575.0 --save-exact
npx esbuild --bundle index.ts --outfile=dist/bundle.js --minify --analyze --format=esm --main-fields=module,main --platform=node \
  --external:@aws-sdk/client-sts \
  --external:@aws-sdk/client-sso* \
  --external:@aws-sdk/credential-prov* \
  --external:@aws-sdk/token-providers \
  --external:fast-xml-parser
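
For anyone bundling with the CDK NodejsFunction rather than the esbuild CLI, here is a rough sketch of the equivalent bundling options (option names assumed from aws-cdk-lib v2; verify against your CDK version):

import { Runtime } from "aws-cdk-lib/aws-lambda";
import { NodejsFunction, OutputFormat } from "aws-cdk-lib/aws-lambda-nodejs";
import { Construct } from "constructs";

// Hypothetical construct factory; the entry path and construct id are placeholders.
export function createHandler(scope: Construct) {
  return new NodejsFunction(scope, "Handler", {
    entry: "src/index.ts",
    runtime: Runtime.NODEJS_20_X,
    bundling: {
      format: OutputFormat.ESM,       // prefer the tree-shakeable dist-es build
      mainFields: ["module", "main"],
      minify: true,
      // Setting externalModules overrides the NodejsFunction default, which on newer
      // runtimes treats the AWS SDK itself as external (i.e. not bundled at all).
      externalModules: [
        "@aws-sdk/client-sts",
        "@aws-sdk/token-providers",
        "fast-xml-parser",            // only safe if no XML-based client is used
      ],
    },
  });
}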

@kuhe (Contributor) commented May 30, 2024

Once we publish the fix in #6149 you shouldn't have to do anything differently.
Bundling from the dist-es (ESM, tree-shakeable) code should avoid importing both tslib and fast-xml-parser.

@kuhe kuhe added the pending-release This issue will be fixed by an approved PR that hasn't been released yet. label May 30, 2024
@mymattcarroll

Turns out our problem was package-lock.json containing duplicate versions of the AWS SDK packages. Regenerating our package-lock.json fixed our issue.

Credit to @divporter for teaching me about the npm dedupe command.

@perpil (Author) commented May 31, 2024

I just tried 3.588. I'm seeing better cold start times, but still ~50 KB of bloat. One thing I noticed is that @smithy/core was bundled as CJS instead of ESM. Here's the breakdown of where the size increases are coming from, if it helps:

Delta 3.575 -> 3.588
@smithy +56.1

  • smithy-client +15.6
  • signature-v4 same
  • core +12
  • middleware-retry +10.3
  • node-http-handler same
  • util-endpoints newly added (duplicate in @aws-sdk?) +12.8
  • middleware-endpoint +3.7
  • shared-ini-file-loader +3.3
  • types +2
  • util-stream same
  • protocol-http +1.6
  • property-provider +0.4
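
For reference, a per-package breakdown like this can be generated from esbuild's metafile. A minimal sketch using the esbuild JS API (the options mirror the CLI flags used earlier in the thread; the entry file name is assumed):

import { build, analyzeMetafile } from "esbuild";

const result = await build({
  entryPoints: ["index.ts"],   // assumed entry point
  bundle: true,
  minify: true,
  format: "esm",
  platform: "node",
  mainFields: ["module", "main"],
  outfile: "dist/bundle.js",
  metafile: true,              // emit per-input size metadata
  external: ["@aws-sdk/client-sts", "@aws-sdk/token-providers"],
});

// Prints an --analyze style breakdown grouped by input file/package.
console.log(await analyzeMetafile(result.metafile, { verbose: false }));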

@kuhe (Contributor) commented Jun 3, 2024

@siddsriv is making the same change in @smithy/core smithy-lang/smithy-typescript#1297

@perpil (Author) commented Jun 5, 2024

Numbers for 3.590 are much more in line with 3.575. I think we can close this now. Thanks for all your help here!


| aws sdk | node version | avg(@initDuration) | bundleSize | # runs |
| --- | --- | --- | --- | --- |
| 3.575.0 | v20.13.1 | 246.79 | 316588 | 3 |
| 3.577.0 | v20.13.1 | 270.0433 | 463023 | 3 |
| 3.590.0 | v20.13.1 | 241.88 | 318138 | 1 |

@perpil (Author) commented Jun 6, 2024

Actually, hold off on closing this. Although I saw better performance and size in the repro, I still saw poor performance when I updated my prod code to 3.590.0. I had to manually pin @smithy/core to 2.2.0 to get the proper bundle size, and I still wasn't seeing the same performance. Let me dig in a bit.

@perpil (Author) commented Jun 6, 2024

3.592 picks up the @smithy/core fix above and resolves the bundle size issue I was having in my prod code (not the repro).

Here's some data from my production API. The best performance for me was on 3.568; I'm still seeing about a 50 ms cold start delta and I'm still digging into it.


| @awsSdk | node | runtime | avg(initDuration) | avg(secrets loading) | max(bundleSize) |
| --- | --- | --- | --- | --- | --- |
| 3.590.0 | v20.13.1 | 2024-05-13T12:21:54.000Z | 294.255 | 136.25 | 1122060 |
| 3.592.0 | v20.13.1 | 2024-05-13T12:21:54.000Z | 299.41 | 134.875 | 840024 |
| 3.568.0 | v20.13.1 | 2024-05-13T12:21:54.000Z | 243.1 | 136 | 837877 |

@perpil (Author) commented Jun 7, 2024

I've updated the repro to include more clients: client-sts, client-dynamodb, lib-dynamodb, and client-iam. The performance difference between 3.575 and 3.592 is negligible, but you can clearly see the 50 ms difference between 3.568 and 3.592 here.

| aws sdk | node version | avg(@initDuration) | bundleSize | # runs |
| --- | --- | --- | --- | --- |
| 3.568.0 | v20.13.1 | 213.5387 | 496644 | 8 |
| 3.575.0 | v20.13.1 | 269.3038 | 504269 | 8 |
| 3.592.0 | v20.13.1 | 265.4675 | 504954 | 8 |

@dreamorosi

50 ms on these numbers is ~25% slower; I'm not sure I agree with the negligible part.

It would be great to understand what the trade-off for adding this was, and whether there's anything at all we can do to reverse it besides pinning to 3.575.

@perpil (Author) commented Jun 7, 2024

Hey @dreamorosi, the negligible part is between 3.575 and 3.592. However, I then compared 3.568 to 3.592 and saw the 50 ms difference when I replicated what I saw in my production code. I agree that 50 ms is a big regression, and you should be able to replicate my results with the code I've currently checked into the repo. I looked at a few things to see if I could identify the source and found some small contributors (shared-ini-file-loader and one other package I'm blanking on), but nothing that explains the 50 ms. The source of the latency no longer seems to be related to code size, because I tried minification and didn't see much difference.

@perpil (Author) commented Jun 8, 2024

Alright, here are some more benchmarks with my repro. It looks like the performance bombed in 3.575. I'd recommend taking a close look at #6088.

| aws sdk | node version | avg(@initDuration) | bundleSize | # runs |
| --- | --- | --- | --- | --- |
| 3.568.0 | v20.13.1 | 206.165 | 211662 | 2 |
| 3.569.0 | v20.13.1 | 211.84 | 211837 | 2 |
| 3.572.0 | v20.13.1 | 214.2233 | 211837 | 3 |
| 3.574.0 | v20.13.1 | 206.9233 | 211837 | 3 |
| 3.575.0 | v20.13.1 | 269.46 | 214626 | 2 |
| 3.592.0 | v20.13.1 | 270.725 | 214940 | 2 |

@kuhe (Contributor) commented Jun 10, 2024

Could you create a repro outside of Lambda that isolates the issue?

I have run the CDK example and the initDuration for v3.592.0 is in the 210-220 range. If there is an issue with the SDK it should be demonstrable in Node.js outside of Lambda.

Add a performance.now() mark at the top of the handler file and another at the top of the handler function. The difference should roughly correspond to the cold start initialization duration. I did this, and no difference in init duration showed up. I also looked at the diff between the two bundled files and there wasn't anything that would take 50 ms.

@kuhe kuhe added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. and removed pending-release This issue will be fixed by an approved PR that hasn't been released yet. labels Jun 10, 2024
@perpil (Author) commented Jun 10, 2024

I'll see what I can do. Just to confirm: you are comparing 3.574 to 3.592, initializing multiple clients (like the repro), and you don't see much difference?

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Jun 11, 2024
@kuhe (Contributor) commented Jun 11, 2024

Yes, I used the provided repository and ran both those versions many times using the provided execution command examples in the readme.

I also tested the generated bundles themselves with additional timing reports added by hand, but those consisted merely of:

const A = performance.now(); // top of file
...
const B = performance.now(); // top of handler function
...
console.log("init time", B - A); // before end of handler function

The contributing factors to initialization time in AWS Lambda are not entirely related to the user code, and fluctuate over time due to fleet provisioning. I don't know the details, so I'd like to remove that variance.

@jaapvanblaaderen

Not sure if I'm running into the same issue, but I saw something similar when upgrading to a newer SDK version and found this issue. The size of my Lambda function increased from 3.8 MB to 6.1 MB. I also use the built-in CDK NodejsFunction, which bundles with esbuild (CJS, not minifying the code).

When analyzing the bundle, it seems that @smithy/... packages are now part of every AWS SDK package and are responsible for the majority of the size.

AWS SDK 3.556.0: (bundle analyzer screenshot)

AWS SDK 3.592.0: (bundle analyzer screenshot)

@perpil (Author) commented Jun 13, 2024

@jaapvanblaaderen It could be related. I was seeing something similar with ESM modules until they made a change to the packaging.

@kuhe WRT the performance.now() benchmarking: I haven't been able to isolate this from Lambda yet, but I do have some new performance benchmarks where I've wrapped the imports outside of the handler with performance.now(). The @initTime in the timings below shows that, but I haven't yet updated the repro code since I'm OOTO today. What is interesting is that @initDuration and @initTime are consistently better for 3.574, and the percentage difference between the two SDK versions is consistent across both performance.now() and @initDuration. I did take a look at the diff between the actual bundled code and there wasn't much difference between 3.574 and 3.575. I'll see if I can isolate exactly what code is causing this performance regression.

| @awsSdk | @node | avg(initDuration) | max(@bundleSize) | runs | avg(@initTime) | min(@initTime) | max(@initTime) | pct(@initTime,90) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3.574.0 | v20.13.1 | 213.1845 | 211917 | 11 | 7.0253 | 6.646 | 8.3802 | 7.2597 |
| 3.575.0 | v20.13.1 | 264.9182 | 214706 | 11 | 8.1337 | 7.5145 | 10.1601 | 8.8453 |

@perpil (Author) commented Jun 15, 2024

I've narrowed it down to the exact line that got added in 3.575.

If I patch that line from:

export const keepAliveSupport = {
  supported: Boolean(typeof Request !== "undefined" && "keepalive" in new Request("https://[::1]")),
};

To:

export const keepAliveSupport = {
  supported: true,
};

Then the 50 ms latency hit disappears in 3.575. Wild guess: maybe DNS is getting involved and there is a 50 ms timeout?

@richarddd commented Jun 17, 2024

> I've narrowed it down to the exact line that got added in 3.575.
>
> If I patch that line from:
>
> export const keepAliveSupport = {
>   supported: Boolean(typeof Request !== "undefined" && "keepalive" in new Request("https://[::1]")),
> };
>
> To:
>
> export const keepAliveSupport = {
>   supported: true,
> };
>
> Then the 50 ms latency hit disappears in 3.575.

Hi @perpil. How did you measure/narrow it down to that specific line? This should only be relevant in Node 16, as Node 18+ has the Request object at global scope, so keepAlive will be true. This change was also added 9 months ago. Test here:

export const handler = async (event) => {
    const response = {
        statusCode: 200,
        body: JSON.stringify({
            "keepAlive": Boolean(typeof Request !== "undefined" && "keepalive" in new Request("https://[::1]"))
        }),
    };
    return response;
};

You also need to test initDuration + duration in order to evaluate performance. Code outside the handler contributes to initDuration and code inside contributes to duration.

@kuhe (Contributor) commented Jun 17, 2024

Thanks @perpil, that's a great help finding the most likely location of the problem. I will investigate how to fix it.

@perpil (Author) commented Jun 17, 2024

@richarddd The smithy dependencies were upgraded in 3.575, so this new code got pulled in. When you bundle using the repro further up in the thread, you'll see these lines in the generated index.mjs output, which weren't there in 3.574:

// node_modules/@smithy/fetch-http-handler/dist-es/fetch-http-handler.js
var keepAliveSupport = {
  supported: Boolean(typeof Request !== "undefined" && "keepalive" in new Request("https://[::1]"))
};

You'll also note that there are no references to that variable in the generated output, so it isn't necessary in the Node environment. When I remove that code, or hardcode it to true, I no longer see the 50 ms hit to @initDuration (it never showed up in @duration because the code runs outside the handler). If you want to do an accurate repro with your code, you should move the code outside the handler, compare it to not using Request at all, and look at the resulting @initDuration. As I understand it, fetch-http-handler is only supposed to be used in browser environments and node-http-handler is meant to be used when running in Lambda.
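
For anyone who wants to double-check which handler their Node.js client actually resolved at runtime, here's a small sketch of my own (not something the SDK team suggested); passing NodeHttpHandler explicitly is optional, since Node.js clients resolve it by default:

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { NodeHttpHandler } from "@smithy/node-http-handler";

// Explicit handler shown for illustration; omit it to use the default resolution.
const client = new DynamoDBClient({
  requestHandler: new NodeHttpHandler({ connectionTimeout: 1000 }),
});

// Inspect the resolved handler on the client config.
console.log(client.config.requestHandler.constructor.name); // expected: "NodeHttpHandler"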

@kuhe Thanks for looking into it; I think there are a couple of ways of solving it, so reach out if you need help. One thing you mentioned was to try using performance.now() to wrap the code outside the handler for the benchmark. The issue with this approach is that when the code is bundled, some of the code in index.mjs is actually executed before any of the imports, so you miss some of the latency. In the resulting bundle you get something like this, so @initDuration is what you should be using for an accurate benchmark:

var keepAliveSupport = ...
var start = performance.now();
...other code
var end = performance.now()

@kuhe (Contributor) commented Jun 17, 2024

If you update your install lockfile to pick up @smithy/fetch-http-handler@3.0.3 it should no longer perform the Request init in the module scope.

The reason fetch-http-handler appears now is that recently we added preliminary support for fetch in Node.js, i.e. using FetchHttpHandler in Node.js. The stream collector mixin now brings in some cross-over functionality from the fetch-http-handler package.
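
Roughly speaking, the fix defers the capability probe so it no longer runs at module load time; a simplified sketch of the pattern (not the actual @smithy/fetch-http-handler source):

// Cache is undefined until the first request actually needs the answer.
let keepAliveSupport: boolean | undefined;

function supportsKeepAlive(): boolean {
  if (keepAliveSupport === undefined) {
    // Touching the global Request class lazily keeps undici out of the cold start path
    // for callers that never go through the fetch handler.
    keepAliveSupport =
      typeof Request !== "undefined" && "keepalive" in new Request("https://[::1]");
  }
  return keepAliveSupport;
}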

@richarddd commented Jun 17, 2024

@perpil Sorry for the misunderstanding; I can now reproduce the issue. Great finding! The reason for this regression is that when the Request class is used, it's lazily loaded and pulls in a lot of dependencies (undici etc.) that add the 40-50 ms to cold starts. @kuhe, a repro can be done by simply doing:

const a = Request;

export const handler = async (event) => {
    const response = {
        statusCode: 200,
        body: JSON.stringify("slow"),
    };
    return response;
};

Comment out the first line and we're back to normal. Request in turn pulls in more lazily evaluated classes such as URL/Stream etc. when instantiated.

Source: https://github.com/nodejs/node/blob/f2381bc7cb120e43bb97ffeccd19dfdc946c8b2c/lib/internal/bootstrap/web/exposed-window-or-worker.js#L86
(basically a property that does require)
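
To observe this outside Lambda, as requested earlier, here is a minimal Node.js timing sketch of the first touch of Request; absolute numbers will vary by machine and Node version:

// Save as first-touch.mjs (name is arbitrary) and run with: node first-touch.mjs
const t0 = performance.now();
// The first reference to the global Request class triggers the lazy load of undici.
const keepalive = typeof Request !== "undefined" && "keepalive" in new Request("https://[::1]");
const t1 = performance.now();

console.log({ keepalive, firstTouchMs: Number((t1 - t0).toFixed(2)) });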

@perpil (Author) commented Jun 17, 2024

@kuhe I can confirm that SDK 3.598 with @smithy/fetch-http-handler 3.0.3 no longer has the cold start issue. The Request code is tree-shaken from my bundle now that it isn't in module scope. Thanks so much! 🚀

@kuhe kuhe added the closing-soon This issue will automatically close in 4 days unless further comments are made. label Jun 18, 2024
@github-actions github-actions bot added closed-for-staleness and removed closing-soon This issue will automatically close in 4 days unless further comments are made. labels Jun 22, 2024

github-actions bot commented Jul 6, 2024

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs and link to relevant comments in this thread.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 6, 2024