
Increasing memory usage #665

Closed
abhijitawachar opened this issue Aug 3, 2022 · 8 comments
Labels
stale (Issues / PRs with no activity), Status: Help Wanted, Type: Question (All types of questions to/from customers)

Comments

@abhijitawachar

Describe the bug
We have deployed v1.16.3 of the node termination handler. I noticed that its memory usage increases steadily over time until it reaches the pod memory limit and the pod is OOMKilled. Is there a memory leak somewhere?

[Screenshot: nth-memory-usage graph]

Application Logs
Following are the logs for the aws-node-termination-handler pod. The logs do not show any errors:

2022/07/28 07:15:01 INF Starting to serve handler /metrics, port 9092
2022/07/28 07:15:01 INF Starting to serve handler /healthz, port 8080
2022/07/28 07:15:01 INF Startup Metadata Retrieved metadata={"accountId":"xxxx","availabilityZone":"us-west-2b","instanceId":"i-xxxx","instanceLifeCycle":"on-demand","instanceType":"c6i.4xlarge","localHostname":"xxxx.us-west-2.compute.internal","privateIp":"x.x.x.x","publicHostname":"","publicIp":"","region":"us-west-2"}
2022/07/28 07:15:01 INF aws-node-termination-handler arguments:
dry-run: false,
node-name: xxxx.us-west-2.compute.internal,
pod-name: aws-node-termination-handler-b56bf578b-79x5m,
metadata-url: http://abcd,
kubernetes-service-host: x.x.x.X,
kubernetes-service-port: 443,
delete-local-data: true,
ignore-daemon-sets: true,
pod-termination-grace-period: -1,
node-termination-grace-period: 120,
enable-scheduled-event-draining: false,
enable-spot-interruption-draining: false,
enable-sqs-termination-draining: true,
enable-rebalance-monitoring: false,
enable-rebalance-draining: false,
metadata-tries: 3,
cordon-only: false,
taint-node: true,
taint-effect: NoSchedule,
exclude-from-load-balancers: false,
json-logging: false,
log-level: info,
webhook-proxy: ,
webhook-headers: ,
webhook-url: ,
webhook-template: ,
uptime-from-file: ,
enable-prometheus-server: true,
prometheus-server-port: 9092,
emit-kubernetes-events: true,
kubernetes-events-extra-annotations: ,
aws-region: us-west-2,
queue-url: https://xxxx,
check-asg-tag-before-draining: false,
managed-asg-tag: aws-node-termination-handler/managed,
use-provider-id: false,
aws-endpoint: ,
2022/07/28 07:15:01 INF Started watching for interruption events
2022/07/28 07:15:01 INF Kubernetes AWS Node Termination Handler has started successfully!
2022/07/28 07:15:01 INF Started watching for event cancellations
2022/07/28 07:15:01 INF Started monitoring for events event_type=SQS_TERMINATE
2022/07/28 07:20:12 INF Adding new event to the event store event={"AutoScalingGroupName":"xxxx","Description":"EC2 State Change event received. Instance i-xxxx went into shutting-down at 2022-07-28 07:20:11 +0000 UTC \n","EndTime":"0001-01-01T00:00:00Z","EventID":"ec2-state-change-event-xxxx","InProgress":false,"InstanceID":"i-xxxx","IsManaged":true,"Kind":"SQS_TERMINATE","NodeLabels":null,"NodeName":"xxxx.us-west-2.compute.internal","NodeProcessed":false,"Pods":null,"ProviderID":"aws:///us-west-2c/xxxx","StartTime":"2022-07-28T07:20:11Z","State":""}

  • Kubernetes version: v1.21
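For reference, the event object in that last log line ("Adding new event to the event store") decodes to roughly the following shape. This is reconstructed from the JSON fields in the log for illustration only; the Go types are guesses, not NTH's actual source:

```go
package sketch

import "time"

// InterruptionEvent is reconstructed from the JSON logged by NTH when it adds an
// event to its in-memory event store. Field types are assumptions for illustration.
type InterruptionEvent struct {
	AutoScalingGroupName string
	Description          string
	EndTime              time.Time
	EventID              string
	InProgress           bool
	InstanceID           string
	IsManaged            bool
	Kind                 string // e.g. "SQS_TERMINATE" in the log above
	NodeLabels           map[string]string
	NodeName             string
	NodeProcessed        bool
	Pods                 []string
	ProviderID           string
	StartTime            time.Time
	State                string
}
```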
AustinSiu self-assigned this Aug 4, 2022
@AustinSiu
Contributor

@abhijitawachar thanks for bringing this to our attention. I noticed that your metrics show a similar pattern of memory usage in the period between 5/4/22 and 5/12/22. We released v1.16.3 on 5/11/22, so it looks like this memory behavior may have existed prior to v1.16.3.
To help us track down the issue, could you tell us how far back you can see this behavior in your metrics, and which NTH version you had deployed at the time?

@abhijitawachar
Author

@AustinSiu thanks for looking into this.
We started using NTH on 5/3/22 with version v1.16.1 and upgraded to v1.16.3 on 5/13/22.
From the screenshot below, we can see this issue starting with v1.16.1. Since we only recently started using NTH, we're not sure how far back this behaviour goes.

[Screenshot: nth-mem-use graph]

snay2 self-assigned this and unassigned AustinSiu Aug 23, 2022
@abhijitawachar
Author

abhijitawachar commented Aug 29, 2022

@snay2 / @AustinSiu did you get a chance to look at this?

@snay2
Contributor

snay2 commented Aug 29, 2022

@abhijitawachar Yes, I am actively investigating this week. So far, I have been able to reproduce this behavior in a sandboxed environment. I'm currently deep diving into possible causes with a memory profiler.
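For anyone following along, the profiling approach is the standard Go one: pull heap profiles from the running binary and diff them over time. Below is a minimal sketch of wiring that up with the standard library's net/http/pprof package; the port and the surrounding main() are illustrative, not NTH's actual code.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Illustrative wiring only: expose pprof on a side port so heap profiles can be
	// captured from the running process, e.g.
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	// Diffing two heap profiles taken hours apart (pprof's -base flag) shows which
	// allocations are accumulating.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	select {} // stand-in for the long-running service loop
}
```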

@abhijitawachar
Author

Hey @snay2, did you get a chance to look into this?

@snay2
Contributor

snay2 commented Sep 9, 2022

My initial research from 2022-08-29 was inconclusive, and so far I haven't found a root cause. For now, I'm pausing active work on this issue. Sorry I don't have better news at this time. :(

However, we have long-term plans to build a more robust simulation system in our CI/CD process that will help us monitor for memory leaks. To help with that, we want to write test drivers that simulate real-world workloads (a rough sketch of one is below). Could you tell me a bit more about how your cluster is set up (Queue Processor or IMDS; how frequently NTH processes messages, and what kinds; etc.)? From your graphs, it looks like the NTH pod gets OOMKilled and restarted every few weeks; is that correct? And your graph is measuring the memory usage of the NTH pod itself, yes? Is this behavior causing any negative performance impact to your cluster?

That kind of information will help us design a realistic simulation system and hopefully track down this leak.
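For reference, here's the kind of test driver I have in mind: a minimal sketch that pushes synthetic EC2 state-change notifications (the event type seen in the logs above) into an SQS queue at a fixed rate, so NTH's memory usage can be graphed against a known workload. The queue URL, instance IDs, and message body are placeholders, not values from this cluster.

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

func main() {
	queueURL := "https://sqs.us-west-2.amazonaws.com/123456789012/nth-test-queue" // placeholder

	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-west-2")}))
	client := sqs.New(sess)

	// Enqueue one synthetic EC2 state-change notification every 30 seconds so the
	// message rate is known while NTH's memory usage is observed over time.
	for i := 0; ; i++ {
		body := fmt.Sprintf(`{
  "version": "0",
  "id": "test-%d",
  "detail-type": "EC2 Instance State-change Notification",
  "source": "aws.ec2",
  "time": %q,
  "region": "us-west-2",
  "resources": ["arn:aws:ec2:us-west-2:123456789012:instance/i-0123456789abcdef0"],
  "detail": {"instance-id": "i-0123456789abcdef0", "state": "shutting-down"}
}`, i, time.Now().UTC().Format(time.RFC3339))

		if _, err := client.SendMessage(&sqs.SendMessageInput{
			QueueUrl:    aws.String(queueURL),
			MessageBody: aws.String(body),
		}); err != nil {
			log.Fatalf("send failed: %v", err)
		}
		time.Sleep(30 * time.Second)
	}
}
```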

snay2 removed their assignment Sep 9, 2022
snay2 added the Type: Question, Status: Help Wanted, and stalebot-ignore labels Sep 9, 2022
jillmon removed the stalebot-ignore label Jan 19, 2023
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.

github-actions bot added the stale label Feb 19, 2023
@github-actions

This issue was closed because it has become stale with no activity.
