
Increasing memory usage #665

Closed
abhijitawachar opened this issue Aug 3, 2022 · 8 comments
Labels
stale (Issues / PRs with no activity), Status: Help Wanted, Type: Question (All types of questions to/from customers)

Comments

@abhijitawachar

Describe the bug
We have deployed v1.16.3 of the node termination handler. I noticed that its memory usage increases steadily over time until it reaches the pod memory limit and the pod is OOMKilled. Is there a memory leak somewhere?

[Screenshot: nth-memory-usage graph]

Application Logs
Following are the logs for the aws-node-termination-handler pod. The logs do not show any errors:

2022/07/28 07:15:01 INF Starting to serve handler /metrics, port 9092
2022/07/28 07:15:01 INF Starting to serve handler /healthz, port 8080
2022/07/28 07:15:01 INF Startup Metadata Retrieved metadata={"accountId":"xxxx","availabilityZone":"us-west-2b","instanceId":"i-xxxx","instanceLifeCycle":"on-demand","instanceType":"c6i.4xlarge","localHostname":"xxxx.us-west-2.compute.internal","privateIp":"x.x.x.x","publicHostname":"","publicIp":"","region":"us-west-2"}
2022/07/28 07:15:01 INF aws-node-termination-handler arguments:
dry-run: false,
node-name: xxxx.us-west-2.compute.internal,
pod-name: aws-node-termination-handler-b56bf578b-79x5m,
metadata-url: http://abcd,
kubernetes-service-host: x.x.x.X,
kubernetes-service-port: 443,
delete-local-data: true,
ignore-daemon-sets: true,
pod-termination-grace-period: -1,
node-termination-grace-period: 120,
enable-scheduled-event-draining: false,
enable-spot-interruption-draining: false,
enable-sqs-termination-draining: true,
enable-rebalance-monitoring: false,
enable-rebalance-draining: false,
metadata-tries: 3,
cordon-only: false,
taint-node: true,
taint-effect: NoSchedule,
exclude-from-load-balancers: false,
json-logging: false,
log-level: info,
webhook-proxy: ,
webhook-headers: ,
webhook-url: ,
webhook-template: ,
uptime-from-file: ,
enable-prometheus-server: true,
prometheus-server-port: 9092,
emit-kubernetes-events: true,
kubernetes-events-extra-annotations: ,
aws-region: us-west-2,
queue-url: https://xxxx,
check-asg-tag-before-draining: false,
managed-asg-tag: aws-node-termination-handler/managed,
use-provider-id: false,
aws-endpoint: ,
2022/07/28 07:15:01 INF Started watching for interruption events
2022/07/28 07:15:01 INF Kubernetes AWS Node Termination Handler has started successfully!
2022/07/28 07:15:01 INF Started watching for event cancellations
2022/07/28 07:15:01 INF Started monitoring for events event_type=SQS_TERMINATE
2022/07/28 07:20:12 INF Adding new event to the event store event={"AutoScalingGroupName":"xxxx","Description":"EC2 State Change event received. Instance i-xxxx went into shutting-down at 2022-07-28 07:20:11 +0000 UTC \n","EndTime":"0001-01-01T00:00:00Z","EventID":"ec2-state-change-event-xxxx","InProgress":false,"InstanceID":"i-xxxx","IsManaged":true,"Kind":"SQS_TERMINATE","NodeLabels":null,"NodeName":"xxxx.us-west-2.compute.internal","NodeProcessed":false,"Pods":null,"ProviderID":"aws:///us-west-2c/xxxx","StartTime":"2022-07-28T07:20:11Z","State":""}

  • Kubernetes version: v1.21
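For reference, the event object in that last log line ("Adding new event to the event store") decodes to roughly the following shape. This is reconstructed from the JSON fields in the log for illustration only; the Go types are guesses, not NTH's actual source:

```go
package sketch

import "time"

// InterruptionEvent is reconstructed from the JSON logged by NTH when it adds an
// event to its in-memory event store. Field types are assumptions for illustration.
type InterruptionEvent struct {
	AutoScalingGroupName string
	Description          string
	EndTime              time.Time
	EventID              string
	InProgress           bool
	InstanceID           string
	IsManaged            bool
	Kind                 string // e.g. "SQS_TERMINATE" in the log above
	NodeLabels           map[string]string
	NodeName             string
	NodeProcessed        bool
	Pods                 []string
	ProviderID           string
	StartTime            time.Time
	State                string
}
```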
AustinSiu self-assigned this Aug 4, 2022
@AustinSiu
Contributor

@abhijitawachar thanks for bringing this to our attention. I noticed that your metrics show a similar pattern of memory usage in the period between 5/4/22 and 5/12/22. We released v1.16.3 on 5/11/22, so it looks like this memory behavior may have existed prior to v1.16.3.
To help us track down the issue, could you tell us how far back you can see this behavior in your metrics, and which NTH version you had deployed at the time?

@abhijitawachar
Author

@AustinSiu thanks for looking into this.
We started using NTH on 5/3/22 with version v1.16.1 and upgraded to v1.16.3 on 5/13/22.
From the screenshot below, we can see this issue starting with v1.16.1. Since we only recently started using NTH, we're not sure how far back this behaviour goes.

[Screenshot: nth-mem-use graph]

snay2 self-assigned this and unassigned AustinSiu Aug 23, 2022
@abhijitawachar
Author

abhijitawachar commented Aug 29, 2022

@snay2 / @AustinSiu did you get a chance to look at this?

@snay2
Contributor

snay2 commented Aug 29, 2022

@abhijitawachar Yes, I am actively investigating this week. So far, I have been able to reproduce this behavior in a sandboxed environment. I'm currently deep diving into possible causes with a memory profiler.
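For anyone following along, the profiling approach is the standard Go one: pull heap profiles from the running binary and diff them over time. Below is a minimal sketch of wiring that up with the standard library's net/http/pprof package; the port and the surrounding main() are illustrative, not NTH's actual code.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Illustrative wiring only: expose pprof on a side port so heap profiles can be
	// captured from the running process, e.g.
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	// Diffing two heap profiles taken hours apart (pprof's -base flag) shows which
	// allocations are accumulating.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	select {} // stand-in for the long-running service loop
}
```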

@abhijitawachar
Author

Hey @snay2, did you get a chance to look into this?

@snay2
Contributor

snay2 commented Sep 9, 2022

My initial research from 2022-08-29 was inconclusive, and so far I haven't found a root cause. For now, I'm pausing active work on this issue. Sorry I don't have better news at this time. :(

However, we have long-term plans to build a more robust simulation system in our CI/CD process that will help us monitor for memory leaks. To help with that, we want to write test drivers that simulate real-world workloads (a rough sketch of one is below). Could you tell me a bit more about how your cluster is set up (Queue Processor or IMDS; how frequently NTH processes messages, and what kinds; etc.)? From your graphs, it looks like the NTH pod gets OOMKilled and restarted every few weeks; is that correct? And your graph is measuring the memory usage of the NTH pod itself, yes? Is this behavior causing any negative performance impact to your cluster?

That kind of information will help us design a realistic simulation system and hopefully track down this leak.
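For reference, here's the kind of test driver I have in mind: a minimal sketch that pushes synthetic EC2 state-change notifications (the event type seen in the logs above) into an SQS queue at a fixed rate, so NTH's memory usage can be graphed against a known workload. The queue URL, instance IDs, and message body are placeholders, not values from this cluster.

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

func main() {
	queueURL := "https://sqs.us-west-2.amazonaws.com/123456789012/nth-test-queue" // placeholder

	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-west-2")}))
	client := sqs.New(sess)

	// Enqueue one synthetic EC2 state-change notification every 30 seconds so the
	// message rate is known while NTH's memory usage is observed over time.
	for i := 0; ; i++ {
		body := fmt.Sprintf(`{
  "version": "0",
  "id": "test-%d",
  "detail-type": "EC2 Instance State-change Notification",
  "source": "aws.ec2",
  "time": %q,
  "region": "us-west-2",
  "resources": ["arn:aws:ec2:us-west-2:123456789012:instance/i-0123456789abcdef0"],
  "detail": {"instance-id": "i-0123456789abcdef0", "state": "shutting-down"}
}`, i, time.Now().UTC().Format(time.RFC3339))

		if _, err := client.SendMessage(&sqs.SendMessageInput{
			QueueUrl:    aws.String(queueURL),
			MessageBody: aws.String(body),
		}); err != nil {
			log.Fatalf("send failed: %v", err)
		}
		time.Sleep(30 * time.Second)
	}
}
```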

snay2 removed their assignment Sep 9, 2022
snay2 added the Type: Question, Status: Help Wanted, and stalebot-ignore labels Sep 9, 2022
jillmon removed the stalebot-ignore label Jan 19, 2023
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.

github-actions bot added the stale label Feb 19, 2023
@github-actions

This issue was closed because it has become stale with no activity.
