CRT client not respecting pod memory limits #4034

Closed
Lunkers opened this issue May 24, 2023 · 24 comments
Labels
bug This issue is a bug. closed-for-staleness crt-client p1 This is a high priority issue transfer-manager

Comments

@Lunkers commented May 24, 2023

Describe the bug

When uploading large files using the CRT S3 client, Kubernetes memory restrictions do not appear to be respected.
When uploading a large file in a Kubernetes pod using the code snippet from #4033, a number of anonymous file-read processes are spawned that ignore the pod's memory limits, attempting to use more memory than the pod is allowed.

Expected Behavior

The SDK upload processes should not consume all available RAM in the pod, and should respect Kubernetes limits.

Current Behavior

The pod consumes more and more memory for each upload created, rarely freeing memory. Sooner or later, Kubernetes kills the pod for using too many resources. The provided screenshot shows a pod with an 8GB memory limit:
[Screenshot: 2023-05-24 at 14:50:17]

Reproduction Steps

Use the transfer manager created in the snippet provided in #4033 to upload a large file (roughly 80-100 GB) in a Kubernetes pod.

Possible Solution

No response

Additional Information/Context

No response

AWS Java SDK version used

2.20.67

JDK version used

11

Operating System and version

Ubuntu 22.04 (Jammy Jellyfish)

@Lunkers Lunkers added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels May 24, 2023
@debora-ito (Member)

Question, as I'm not very familiar with Kubernetes:

The SDK upload processes should not consume all available RAM in the pod, and respect Kubernetes limits.

How is this limit set? Is it a "virtual" limit?

@debora-ito debora-ito added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. and removed needs-triage This issue or PR still needs to be triaged. labels Jun 6, 2023
@debora-ito debora-ito self-assigned this Jun 6, 2023
@SriDeepa-s3

I am getting a similar issue as well.

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Jun 10, 2023
@Lunkers (Author) commented Jun 12, 2023

@debora-ito I'm not a Kubernetes expert either, but as far as I know resource allocation is done using cgroups. Our infrastructure team theorizes that the CRT client may not respect cgroups correctly.

@SriDeepa-s3 We've managed to circumvent this by using a normal AsyncClient instead for S3 uploads, which has solved the issue.
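
For reference, the non-CRT path looks roughly like this (a minimal sketch; the region and other options here are placeholders, not our exact configuration):

    S3AsyncClient s3 = S3AsyncClient.builder()
            .region(Region.EU_WEST_1)   // placeholder region
            .build();
    // use this client for the uploads instead of S3AsyncClient.crtBuilder()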

@SriDeepa-s3

@debora-ito: is there any reason why the CRT client is not bound by the memory limit?

@debora-ito (Member)

@Lunkers noted. We'll investigate.

@debora-ito debora-ito added crt-client p2 This is a standard priority issue labels Jun 16, 2023
@debora-ito debora-ito removed their assignment Jun 20, 2023
@SriDeepa-s3 commented Jun 21, 2023

@debora-ito: any update on the above issue?

@graebm (Contributor) commented Jun 26, 2023

I started investigating this last week. I can reproduce this issue on my laptop when my internet speeds are much lower than the CRT S3 client's targetThroughputInGbps setting (it's 10Gbps by default, which is much higher than my home internet).

The CRT S3 client does use memory outside the JVM, so it is able to exceed Java's normal Runtime.maxMemory() (aka -XX:MaxHeapSize). The CRT S3 client's only real tuning knob right now is targetThroughputInGbps. That leads to an opaque series of calculations for how much concurrency is ideal, but based on my back-of-the-envelope math it should be using much less than 1 GiB of memory.

Something weird is definitely going on. The CRT's memory usage climbs well over 1 GiB in the first 60 seconds of my upload, before coming back down and settling well below 1 GiB for the remainder of the upload.

I'm still investigating; this is unacceptable.
But in the meantime you could lower targetThroughputInGbps.
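
For example (a sketch only; the 2.5 value is just an illustration, pick something close to your actual link speed):

    S3AsyncClient s3 = S3AsyncClient.crtBuilder()
            .targetThroughputInGbps(2.5) // default is 10.0, far above most home or pod network speeds
            .build();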

@graebm (Contributor) commented Jun 27, 2023

Actually, I'm not reproducing this. When my memory usage climbed over 1 GiB, I had been messing around and set targetThroughputInGbps=100, much higher than the default of 10.

With the default targetThroughputInGbps=10, memory usage stays solidly below 1 GiB (the JVM using 300 MiB, the CRT using 500 MiB).

@graebm (Contributor) commented Jun 27, 2023

I guess we'll need a more reliable repro case...

Are you certain there's only 1 instance of the CRT S3 client running?

You can see how much native memory the CRT S3 client is using by running with the -Daws.crt.memory.tracing=1 system property set, and then calling software.amazon.awssdk.crt.CRT.nativeMemory() to see the current usage (in bytes).

FWIW, you can see the JVM's memory usage by calling Runtime.getRuntime().totalMemory().
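
A minimal sketch of sampling both numbers together (assuming the JVM was started with -Daws.crt.memory.tracing=1):

    long crtNativeBytes = software.amazon.awssdk.crt.CRT.nativeMemory(); // native memory currently held by the CRT
    long jvmHeapBytes = Runtime.getRuntime().totalMemory();              // memory currently reserved for the JVM heap
    System.out.println("CRT native: " + crtNativeBytes + " B, JVM heap: " + jvmHeapBytes + " B");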

@SriDeepa-s3 commented Jun 28, 2023

Our scenario: we run a consumer that listens to one topic and uploads the input stream to S3.
For every request we create a new client with the user's credentials and close the client object after each upload.

Sample code in #4094.

Is there anything I am missing here on close?
I will provide an update on CRT and JVM memory usage as suggested.

@graebm (Contributor) commented Jun 28, 2023

If possible, I would revamp your code to only have 1 instance of the S3Client. The S3Client is built with the intention of being a singleton, handling multiple requests at once in an intelligent way that doesn't use too many resources. But if you have N instances, they will use N times the system resources.
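
A sketch of the singleton pattern (class and field names here are illustrative):

    final class S3Clients {
        // One shared client for the whole process; close it only at application shutdown.
        static final S3AsyncClient CRT_CLIENT = S3AsyncClient.crtBuilder().build();

        private S3Clients() {}
    }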

@SriDeepa-s3 commented Jul 3, 2023 via email

@neel1298

@debora-ito I'm not a Kubernetes expert either, but as far as I know resource allocation is done using cgroups. Our infrastructure team theorizes that the CRT client may not respect cgroups correctly.

@SriDeepa-s3 We've managed to circumvent this by using a normal AsyncClient instead for S3 uploads, which has solved the issue.

Hello, we are also facing similar issues, but with downloading. Can you tell me what you mean by normal AsyncClients?
Did you use S3AsyncClient.builder() instead of S3AsyncClient.crtBuilder()?
And did you see any major performance drops when using the normal client?

@graebm (Contributor) commented Jul 14, 2023

Is there a way to create the client once and override the credentials?

Darn, it's not currently possible in aws-sdk-java-v2. It's possible in the underlying native code to use different buckets and credentials per request, but that configuration still needs to be exposed at this level. Sorry. The team has this task in their backlog...

@dfinucane

This definitely seems like a bug in the CRT async client, because I'm using it with a single instance shared across the process, and native memory use (non-JVM heap) soars to 11 or even 14 GB during single-file uploads. I'm going to switch to the non-CRT async client until this gets resolved. I have the same problem where I'm running inside Kubernetes, which has a limit of 6 GB, and the pod is getting terminated for far exceeding that limit while the JVM heap hovers around 2 GB.

@dfinucane

For me, switching to the S3AsyncClient.builder().build() client did not resolve the issue. I'm using SDK version 2.20.103, and regardless of which client I use, many gigabytes of native (non-JVM heap) memory get used, the process exceeds its limit, and it is terminated. I thought the S3AsyncClient.builder().build() client was implemented in Java and therefore used the JVM heap, but that does not appear to be the case, because the leak I'm experiencing is not in the JVM heap.

@dfinucane

Upon further review of my situation, I do not believe I was experiencing a problem with S3TransferManager or the CRT async client. I wanted to point that out so that nobody wastes time researching a problem based on my last two comments.
I was misreading metrics and now believe that the spikes I was seeing were the sum of the memory used by the pod being terminated and the pod taking its place.
I believe my problem was the result of not leaving enough room for non-JVM-heap memory: my pod was allowed 6 GB and my heap was allowed to consume up to 75% of that, because I had used -XX:+UseContainerSupport with -XX:MaxRAMPercentage=75.0.
Sorry if I caused confusion with this ticket.
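
For anyone hitting the same configuration trap, the arithmetic is roughly: with a 6 GB pod limit, -XX:MaxRAMPercentage=75.0 allows the heap alone to grow to about 4.5 GB, leaving only around 1.5 GB for everything outside the heap (CRT native memory, direct buffers, metaspace, thread stacks). Lowering the percentage leaves more headroom for native allocations, for example (illustrative values only, app.jar is a placeholder):

    java -XX:+UseContainerSupport -XX:MaxRAMPercentage=50.0 -jar app.jar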

@debora-ito debora-ito added p1 This is a high priority issue and removed p2 This is a standard priority issue labels Jul 25, 2023
@davidh44 (Contributor) commented Aug 4, 2023

Is there a way to create the client once and override the credentials?

Darn, it's not currently possible in aws-sdk-java-v2. It's possible in the underlying native code to use different buckets and credentials per request, but that configuration still needs to be exposed at this level. Sorry. The team has this task in their backlog...

You can configure credentials per-request using AwsRequestOverrideConfiguration:

PutObjectRequest request = PutObjectRequest.builder()
                                                   .bucket("bucket")
                                                   .key("key")
                                                   .overrideConfiguration(o -> o.credentialsProvider(DefaultCredentialsProvider.create()))
                                                   .build();

To handle buckets in different regions than that of the S3Client, you can enable crossRegionAccessEnabled:

S3AsyncClient crossRegionS3Client = S3AsyncClient.crtBuilder()
                                                 .region(Region.EU_WEST_1)
                                                 .crossRegionAccessEnabled(true)
                                                 .build();

@rtjain21

Any updates on this issue?

@alexsander-terres

I am also wondering if there are any updates on this, since I was also experiencing memory leaks with crtBuilder.

@jassuncao

I experienced the same issue of a pod getting killed due to excessive memory usage.
This happens with both the CRT and the normal/pure Java client. In the case of the normal client, I suspect the cause is the use of direct memory buffers. I was using -Xms to limit the JVM memory usage, but this is not enough; I also had to set -XX:MaxDirectMemorySize.
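
For example (illustrative values, not a recommendation; note that -Xmx caps the maximum heap, while -Xms only sets its initial size):

    java -Xmx2g -XX:MaxDirectMemorySize=1g -jar app.jar

Without an explicit -XX:MaxDirectMemorySize, the direct-buffer limit defaults to roughly the maximum heap size, on top of the heap itself.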

@jason-weddington

@jassuncao Thanks for the information. We're working on additional configuration options to make this easier to manage and will update this issue when that update is released.

@jma-9code

I had a memory leak too with the CRT implementation (I was unable to control the Java native memory).

The only viable solution I've found for continuing to use high-level interfaces for multipart operations, especially for uploading files larger than 5 GB, is to use SDK v1.

@debora-ito (Member) commented Feb 7, 2024

We apologize for the long silence, we have some updates to share.

The CRT team made some changes in the CRT client core, released in aws-crt 0.29.7, and we observed improvements in the memory usage after running benchmarks using this new version.

We are also exposing a new attribute in the S3CrtAsyncClient that will provide more control over the utilized memory at the client level: maxNativeMemoryLimitInBytes. If a value is not provided, the CRT will attempt to limit native memory usage in an optimal way, based on parameters like target throughput. The new attribute was added via #4885 and will be available in today's release.

In summary:

  • Please upgrade to SDK version 2.23.5 (which includes aws-crt 0.29.7) or greater, and let us know if you see improvements regarding memory usage
  • Upgrade to version 2.23.20 if you want to provide a custom maxNativeMemoryLimitInBytes

Edit: updated with the version of the SDK that includes the new maxNativeMemoryLimitInBytes attribute.
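
A sketch of setting the new attribute (the 1 GiB value is only an example):

    S3AsyncClient s3 = S3AsyncClient.crtBuilder()
            .maxNativeMemoryLimitInBytes(1024L * 1024 * 1024) // example: cap CRT native memory at 1 GiB
            .build();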
