Frequent calls to DescribeInstances cause SQS backup and late notifications #494

Closed
mechanical-fish opened this issue Oct 7, 2021 · 23 comments
Labels
Priority: Low (This issue will not be seen by most users. The issue is a very specific use case or corner case) · stale (Issues / PRs with no activity) · Type: Enhancement (New feature or request)

Comments

@mechanical-fish
Contributor

aws-node-termination-handler seems to have the following logic (in pkg/monitor/sqsevent/sqs-monitor.go; a rough sketch follows the list):

  • Pull five messages from the SQS queue every 2 seconds with a 20-second-per-message visibility timeout
  • Attempt to call the ec2.DescribeInstances API once per message (as part of the retrieveNodeName function). If the SQS queue is consistently full, this is 2.5 calls to DescribeInstances per second per queue.
  • If DescribeInstances hits a rate limit, do not remove the message from the queue; instead leave it there to be retried 20 seconds later
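
A minimal sketch of that polling loop, assuming the aws-sdk-go v1 SQS client (the package and function names here are illustrative; the exact parameters live in sqs-monitor.go and may differ):

```go
// Sketch of the receive loop described above: pull up to 5 messages per
// iteration with a 20-second visibility timeout, roughly every 2 seconds.
package sqspoll

import (
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/sqs"
)

func poll(svc *sqs.SQS, queueURL string) {
	for range time.Tick(2 * time.Second) {
		out, err := svc.ReceiveMessage(&sqs.ReceiveMessageInput{
			QueueUrl:            aws.String(queueURL),
			MaxNumberOfMessages: aws.Int64(5),  // never more than 5 per cycle
			VisibilityTimeout:   aws.Int64(20), // message reappears after 20s if not deleted
		})
		if err != nil {
			continue
		}
		for range out.Messages {
			// each message leads to a DescribeInstances call in retrieveNodeName
		}
	}
}
```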

We are now operating at sufficient scale that DescribeInstances is often hitting API rate limits -- I'm not sure those limits are published, so I'm not sure whether the 2.5 calls per second from each aws-node-termination-handler are a significant cause of our rate limiting or merely one contributor. But once rate limiting begins we observe the following failure mode:

  • aws-node-termination-handler will never pull more than 5 messages every 2 seconds.
  • If it hits a rate limit the message stays on the queue to be redelivered 20 seconds later.
  • The queue gets backed up. We consistently observe 40-50 messages in flight on our busiest SQS queues, though never more than 50 (note that 5 messages * 20 seconds / 2 seconds = 50 messages). There is often a backup of hundreds of messages and the wait times are 50 seconds to 300 seconds or more.
  • At this point the majority of messages are handled so late that the instances are gone from EC2. The "good" news is that, when DescribeInstances manages to get through to the API, the message for an already-terminated instance gets deleted from the backed-up queue. The bad news is that it's way too late to warn Kubernetes by that point, so we are experiencing sudden unexpected failure of Kube pods.

I can think of a couple ways to change this code to help avoid this problem:

  • Stop making so many calls to DescribeInstances. The purpose of these calls is to translate EC2 instance IDs to node names. Instance IDs are said to be unique (and are certainly vanishingly unlikely to be reused in rapid succession) and private DNS names are not mutable, so this mapping cannot change and can be cached -- we can periodically call DescribeInstances for all instances and cache the results (see the sketch after this list).
  • Make the number of SQS messages retrieved per 2-second cycle a configurable parameter that can be tuned along with the number of workers. I'm not even sure why we fetch so few messages each time, except as a means of throttling the rate of DescribeInstances API calls, which could be managed differently as suggested above.
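
To make the caching idea concrete, here is a hypothetical sketch (none of these names come from the NTH codebase) of memoizing the instance-ID-to-node-name lookup so that only a cache miss costs a DescribeInstances call:

```go
// Hypothetical cache of the instance-ID -> node-name mapping, so that repeated
// SQS messages for the same instance don't each trigger DescribeInstances.
package nodenames

import (
	"fmt"
	"sync"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
	"github.com/aws/aws-sdk-go/service/ec2/ec2iface"
)

type Cache struct {
	mu     sync.Mutex
	client ec2iface.EC2API
	names  map[string]string // instance ID -> private DNS name
}

func NewCache(client ec2iface.EC2API) *Cache {
	return &Cache{client: client, names: map[string]string{}}
}

// NodeName returns the cached private DNS name for an instance, calling
// DescribeInstances only on a cache miss.
func (c *Cache) NodeName(instanceID string) (string, error) {
	c.mu.Lock()
	name, ok := c.names[instanceID]
	c.mu.Unlock()
	if ok {
		return name, nil
	}

	out, err := c.client.DescribeInstances(&ec2.DescribeInstancesInput{
		InstanceIds: []*string{aws.String(instanceID)},
	})
	if err != nil {
		return "", err
	}
	if len(out.Reservations) == 0 || len(out.Reservations[0].Instances) == 0 {
		return "", fmt.Errorf("instance %s not found", instanceID)
	}
	name = aws.StringValue(out.Reservations[0].Instances[0].PrivateDnsName)

	c.mu.Lock()
	c.names[instanceID] = name
	c.mu.Unlock()
	return name, nil
}
```

A map that only grows would eventually leak memory, which is where the expiring cache and periodic refresh discussed later in this thread come in.
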
@bwagner5 bwagner5 added Priority: High This issue will be seen by most users Type: Enhancement New feature or request labels Oct 8, 2021
@bwagner5
Contributor

bwagner5 commented Oct 8, 2021

Hey @mechanical-fish ! Thanks for raising this issue!

  • As a workaround to get NTH to pull more messages, you can increase the replica count as a Helm parameter or through kustomize. That will obviously put more strain on the DescribeInstances limit you are hitting, but the messages will be pulled faster.

I agree w/ the longer term solutions you propose. We can definitely increase the number of messages that NTH can pull at one time (the max is 10) and cache DescribeInstances calls.

DescribeInstances does have a published request rate limit: 10 QPS with a 50 QPS burst (token bucket).

@bwagner5
Contributor

bwagner5 commented Oct 8, 2021

I went ahead and put up the PR for increasing the max messages to 10.

I think we also need to dynamically adjust the VisibilityTimeout when processing the message since most nodes will take longer than 20 seconds to drain. We can also retry a bit faster when rate limited by adjusting the visibility timeout in that case.
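
For reference, changing the visibility timeout of an in-flight message is a single SQS call; a rough sketch, assuming the aws-sdk-go v1 client (the helper name is made up):

```go
package sketch

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/sqs"
)

// adjustVisibility extends or shortens how long an in-flight message stays
// hidden: lengthen it while a node drains, or shorten it so a rate-limited
// message is retried sooner than the default 20 seconds.
func adjustVisibility(svc *sqs.SQS, queueURL, receiptHandle string, seconds int64) error {
	_, err := svc.ChangeMessageVisibility(&sqs.ChangeMessageVisibilityInput{
		QueueUrl:          aws.String(queueURL),
		ReceiptHandle:     aws.String(receiptHandle),
		VisibilityTimeout: aws.Int64(seconds),
	})
	return err
}
```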

@ngoyal16
Contributor

ngoyal16 commented Oct 8, 2021

@mechanical-fish We are also facing the same issue. I have added caching of the instance ID to node name mapping to reduce the DescribeInstances calls.

@snay2
Contributor

snay2 commented Oct 8, 2021

Our cache implementation should have the following characteristics:

  1. Reduce the QPS rate of DescribeInstances calls while processing queue messages
  2. Memory consumption scales linearly, both up and down, with the size of the k8s cluster
    • Specifically, old data should be removed from the cache soon after it becomes obsolete
  3. Have a configurable invalidation frequency so customers can tune it per their cluster volatility or their AWS account API rate limits
    • A cluster with high node turnover or rapid scaling up/down may need to refresh the cache more frequently

One way to do this is to create a custom DescribeInstances provider that, on a configurable interval, fetches the necessary information about all instances in the cluster or all ASGs that NTH is monitoring. Then the implementation of retrieveNodeName uses this provider instead of calling the DescribeInstances API directly.

This is a form of the "constant work" pattern. Using such a pattern improves system stability under high load and unusual traffic patterns, such as a large influx of queue messages or a backed-up queue. Those circumstances will then not cause spikes in DescribeInstances calls that might exceed the rate limit. Rather, NTH will regularly call DescribeInstances on a predictable cadence. It may result in doing more work than necessary, but we trade off that extra work for more predictable performance under load.
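
A minimal sketch of what such a provider could look like (hypothetical names; it lists all instances unfiltered and ignores errors for brevity, whereas a real implementation would scope the query and handle failures):

```go
// Hypothetical constant-work provider: refresh the instance-ID -> node-name
// snapshot on a fixed interval, so bursts of queue messages never translate
// into bursts of DescribeInstances calls.
package provider

import (
	"sync"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
	"github.com/aws/aws-sdk-go/service/ec2/ec2iface"
)

type InstanceProvider struct {
	mu    sync.RWMutex
	names map[string]string
}

// Start refreshes the snapshot every interval, regardless of queue traffic.
func Start(client ec2iface.EC2API, interval time.Duration) *InstanceProvider {
	p := &InstanceProvider{names: map[string]string{}}
	go func() {
		for ; ; time.Sleep(interval) {
			fresh := map[string]string{}
			// Error ignored for brevity; a real implementation would log and retry.
			_ = client.DescribeInstancesPages(&ec2.DescribeInstancesInput{},
				func(page *ec2.DescribeInstancesOutput, lastPage bool) bool {
					for _, r := range page.Reservations {
						for _, i := range r.Instances {
							fresh[aws.StringValue(i.InstanceId)] = aws.StringValue(i.PrivateDnsName)
						}
					}
					return true
				})
			p.mu.Lock()
			p.names = fresh // stale entries drop out with each refresh
			p.mu.Unlock()
		}
	}()
	return p
}

// NodeName is what retrieveNodeName would consult instead of calling the API.
func (p *InstanceProvider) NodeName(instanceID string) (string, bool) {
	p.mu.RLock()
	defer p.mu.RUnlock()
	name, ok := p.names[instanceID]
	return name, ok
}
```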

@ngoyal16
Contributor

ngoyal16 commented Oct 8, 2021

@snay2 Sure, I am happy to help with that.

@bwagner5
Contributor

bwagner5 commented Oct 8, 2021

In the provider implementation, rather than using a plain map, we can use something like https://github.com/patrickmn/go-cache. I've used it before and it works pretty well. The provider implementation might be similar to this (a little simpler, though): https://github.com/awslabs/karpenter/blob/7d907ad539b45b46f3778e15a244820ae3468153/pkg/cloudprovider/aws/instancetypes.go
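
For reference, the go-cache API is small; an illustrative example (the key and value are made-up placeholders):

```go
// Illustrative go-cache usage: entries expire automatically, and an expired
// entry is simply a cache miss that falls through to the API call.
package cachedemo

import (
	"time"

	"github.com/patrickmn/go-cache"
)

func example() {
	// 10-minute default TTL, purge expired entries every 15 minutes.
	c := cache.New(10*time.Minute, 15*time.Minute)

	c.Set("i-0123456789abcdef0", "ip-10-0-1-23.ec2.internal", cache.DefaultExpiration)

	if name, found := c.Get("i-0123456789abcdef0"); found {
		_ = name.(string) // cache hit: no DescribeInstances call needed
	}
}
```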

@ngoyal16
Contributor

ngoyal16 commented Oct 8, 2021

If we add an expiring cache with a 10-minute TTL, do we still need to worry about cache invalidation, since entries will automatically expire after 10 minutes? It would also fix the memory leakage issue.

@bwagner5
Contributor

bwagner5 commented Oct 8, 2021

That's probably right; we can rely on time-based eviction.

@ngoyal16
Contributor

ngoyal16 commented Oct 8, 2021

@bwagner5 The go-cache module has been added.

@mechanical-fish
Contributor Author

mechanical-fish commented Oct 8, 2021

The idea of using go-cache with a constant-work pattern seems great!

Meanwhile, as I try to figure out why our queues are backed up, I have begun to wonder about the other uncached API calls beneath processSQSMessage, particularly autoscaling.DescribeAutoScalingInstances -- which, if I'm reading this code right, gets called at least once for most types of incoming SQS message. Because we've got CheckIfManaged turned on, this call might actually happen twice in succession every time such a message comes in, plus a call to ASG.DescribeTagsPages. All of this seems like an excellent opportunity to make more use of go-cache.

I also note that it looks like most of the calls to retrieveAutoScalingGroupName either swallow any API errors that may result or log them in a very indirect way. I'm going to try hacking together a version of NTH that does more logging in an attempt to sort out what API calls are getting exercised and how long they are taking.

@ngoyal16
Contributor

ngoyal16 commented Oct 8, 2021

I have one more question regarding setting this up. Currently we are running 3 EKS clusters in a single region in one account. Should we use a single queue for all of them or a different queue for each? What is the best practice for this?

@snay2
Contributor

snay2 commented Oct 8, 2021

@ngoyal16 Best practice is one queue per cluster. That avoids unnecessary processing and also prevents situations where NTH tries to handle messages from another cluster that it can't actually influence.

@snay2
Contributor

snay2 commented Oct 8, 2021

@mechanical-fish I look forward to hearing what you learn. If you think the additional logging would be beneficial to other customers as well, please do open a PR with it!

And yes, let's design the cache so it can support multiple data sources of instance information.

@ngoyal16
Contributor

@snay2 What happens if there are two VPCs with the same CIDR range, each running an EKS cluster, a node in each has exactly the same node name, and then one of those similarly named instances gets evicted or scaled out?

@snay2
Contributor

snay2 commented Oct 11, 2021

@ngoyal16 I don't have experience with this particular configuration, but the general best practice would be to have one queue per cluster, to avoid ambiguities like this that could arise from naming or IP address collisions across clusters. NTH doesn't have disambiguation logic for that kind of scenario; if you think it would be worthwhile to add it, feel free to open a separate issue as a feature request and we can discuss it with the community there.

@ngoyal16
Contributor

Could we add tag-value-based cleanup so that even if such a collision happens it is handled safely? Or we could use the cluster name or something like that. Sure, I will create a new feature request.

@mechanical-fish
Contributor Author

I've been off doing more research and, while I think this issue is real, I'm no longer convinced it's the root cause of our difficulties. I'm going to report two other issues that I've uncovered that have been causing our slowdown, the first of which is #498.

@mechanical-fish
Contributor Author

(Not to leave everyone in suspense, the second issue is "the code is good about checking the 'managed ASG tag' but it doesn't pull the events from other ASGs off the queue, so they just get redelivered over and over until the instances finally terminate" -- I have a PR that addresses this and also cuts back on the number of AWS API calls, but I want to at least make a half-hearted stab at updating the unit tests before uploading that.)

@mechanical-fish
Contributor Author

mechanical-fish commented Oct 13, 2021

Thanks for the prompt attention to #498 !

My second PR is #506 which I believe has actually addressed our immediate problems at my company.

If and when #506 lands, it introduces an internal data structure, the NodeInfo struct, which looks like a good candidate for prefetching-and-caching using the go-cache module. That would address the issue we're discussing and make NTH even more API-efficient.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.

@github-actions github-actions bot added the stale Issues / PRs with no activity label Nov 13, 2021
@mechanical-fish
Contributor Author

PRs #498 and #506, together with an increase to our API rate cap, have sufficiently lowered the urgency of this issue for my team that I haven't had further time to work on it.

I still think it would be fine to use e.g. go-cache and a strategy of periodically fetching NodeInfo information to reduce the number of API calls required during high-traffic periods when many instances are terminating at the same time. But the case for such a feature is harder to make now that nearly every one of our terminating instances can be handled with one API call.

@bwagner5 bwagner5 reopened this Nov 15, 2021
@bwagner5 bwagner5 removed the stale Issues / PRs with no activity label Nov 15, 2021
@jillmon jillmon added Priority: Low This issue will not be seen by most users. The issue is a very specific use case or corner case and removed Priority: High This issue will be seen by most users labels Nov 17, 2021
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.

@github-actions github-actions bot added the stale Issues / PRs with no activity label Dec 18, 2021
@github-actions

This issue was closed because it has become stale with no activity.
