Frequent calls to DescribeInstances cause SQS backup and late notifications #494

Closed
mechanical-fish opened this issue Oct 7, 2021 · 23 comments
Labels
Priority: Low (This issue will not be seen by most users. The issue is a very specific use case or corner case) · stale (Issues / PRs with no activity) · Type: Enhancement (New feature or request)

Comments

@mechanical-fish
Contributor

aws-node-termination-handler seems to have the following logic (in pkg/monitor/sqsevent/sqs-monitor.go; a rough sketch follows the list):

  • Pull five messages from the SQS queue every 2 seconds with a 20-second-per-message visibility timeout
  • Attempt to call the ec2.DescribeInstances API once per message (as part of the retrieveNodeName function). If the SQS queue is consistently full, this is 2.5 calls to DescribeInstances per second per queue.
  • If DescribeInstances hits a rate limit, do not remove the message from the queue; instead leave it there to be retried 20 seconds later
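
A minimal sketch of that polling loop, assuming the aws-sdk-go v1 SQS client (the package and function names here are illustrative; the exact parameters live in sqs-monitor.go and may differ):

```go
// Sketch of the receive loop described above: pull up to 5 messages per
// iteration with a 20-second visibility timeout, roughly every 2 seconds.
package sqspoll

import (
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/sqs"
)

func poll(svc *sqs.SQS, queueURL string) {
	for range time.Tick(2 * time.Second) {
		out, err := svc.ReceiveMessage(&sqs.ReceiveMessageInput{
			QueueUrl:            aws.String(queueURL),
			MaxNumberOfMessages: aws.Int64(5),  // never more than 5 per cycle
			VisibilityTimeout:   aws.Int64(20), // message reappears after 20s if not deleted
		})
		if err != nil {
			continue
		}
		for range out.Messages {
			// each message leads to a DescribeInstances call in retrieveNodeName
		}
	}
}
```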

We are now operating at sufficient scale that DescribeInstances is often hitting API rate limits -- I'm not sure those limits are published, so I'm not sure whether the 2.5 calls per second from each aws-node-termination-handler are a significant cause of our rate limiting or merely one contributor. But once rate limiting begins we observe the following failure mode:

  • aws-node-termination-handler will never pull more than 5 messages every 2 seconds.
  • If it hits a rate limit the message stays on the queue to be redelivered 20 seconds later.
  • The queue gets backed up. We consistently observe 40-50 messages in flight on our busiest SQS queues, though never more than 50 (note that 5 messages * 20 seconds / 2 seconds = 50 messages). There is often a backup of hundreds of messages and the wait times are 50 seconds to 300 seconds or more.
  • At this point the majority of messages are handled so late that the instances are gone from EC2. The "good" news is that, when DescribeInstances manages to get through to the API, the message for an already-terminated instance gets deleted from the backed-up queue. The bad news is that it's way too late to warn Kubernetes by that point, so we are experiencing sudden unexpected failure of Kube pods.

I can think of a couple ways to change this code to help avoid this problem:

  • Stop making so many calls to DescribeInstances. The purpose of these calls is to translate EC2 instance IDs to node names. Instance IDs are said to be unique (and are certainly vanishingly unlikely to be reused in rapid succession) and private DNS names are not mutable, so this mapping cannot change and can be cached -- we can periodically call DescribeInstances for all instances and cache the results (see the sketch after this list).
  • Make the number of SQS messages retrieved per 2-second cycle a configurable parameter that can be tuned along with the number of workers. I'm not even sure why we fetch so few messages each time, except as a means of throttling the rate of DescribeInstances API calls, which could be managed differently as suggested above.
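
To make the caching idea concrete, here is a hypothetical sketch (none of these names come from the NTH codebase) of memoizing the instance-ID-to-node-name lookup so that only a cache miss costs a DescribeInstances call:

```go
// Hypothetical cache of the instance-ID -> node-name mapping, so that repeated
// SQS messages for the same instance don't each trigger DescribeInstances.
package nodenames

import (
	"fmt"
	"sync"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
	"github.com/aws/aws-sdk-go/service/ec2/ec2iface"
)

type Cache struct {
	mu     sync.Mutex
	client ec2iface.EC2API
	names  map[string]string // instance ID -> private DNS name
}

func NewCache(client ec2iface.EC2API) *Cache {
	return &Cache{client: client, names: map[string]string{}}
}

// NodeName returns the cached private DNS name for an instance, calling
// DescribeInstances only on a cache miss.
func (c *Cache) NodeName(instanceID string) (string, error) {
	c.mu.Lock()
	name, ok := c.names[instanceID]
	c.mu.Unlock()
	if ok {
		return name, nil
	}

	out, err := c.client.DescribeInstances(&ec2.DescribeInstancesInput{
		InstanceIds: []*string{aws.String(instanceID)},
	})
	if err != nil {
		return "", err
	}
	if len(out.Reservations) == 0 || len(out.Reservations[0].Instances) == 0 {
		return "", fmt.Errorf("instance %s not found", instanceID)
	}
	name = aws.StringValue(out.Reservations[0].Instances[0].PrivateDnsName)

	c.mu.Lock()
	c.names[instanceID] = name
	c.mu.Unlock()
	return name, nil
}
```

A map that only grows would eventually leak memory, which is where the expiring cache and periodic refresh discussed later in this thread come in.
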
@bwagner5 bwagner5 added Priority: High This issue will be seen by most users Type: Enhancement New feature or request labels Oct 8, 2021
@bwagner5
Contributor

bwagner5 commented Oct 8, 2021

Hey @mechanical-fish ! Thanks for raising this issue!

  • As a workaround to get NTH to pull more messages, you can increase the replica count as a Helm parameter or through kustomize. That will obviously put more strain on the DescribeInstances limit you are hitting, but the messages will be pulled faster.

I agree w/ the longer term solutions you propose. We can definitely increase the number of messages that NTH can pull at one time (the max is 10) and cache DescribeInstances calls.

DescribeInstances does have a published request rate limit: 10 QPS with a 50 QPS burst (token bucket).

@bwagner5
Contributor

bwagner5 commented Oct 8, 2021

I went ahead and put up the PR for increasing the max messages to 10.

I think we also need to dynamically adjust the VisibilityTimeout when processing the message since most nodes will take longer than 20 seconds to drain. We can also retry a bit faster when rate limited by adjusting the visibility timeout in that case.
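
For reference, changing the visibility timeout of an in-flight message is a single SQS call; a rough sketch, assuming the aws-sdk-go v1 client (the helper name is made up):

```go
package sketch

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/sqs"
)

// adjustVisibility extends or shortens how long an in-flight message stays
// hidden: lengthen it while a node drains, or shorten it so a rate-limited
// message is retried sooner than the default 20 seconds.
func adjustVisibility(svc *sqs.SQS, queueURL, receiptHandle string, seconds int64) error {
	_, err := svc.ChangeMessageVisibility(&sqs.ChangeMessageVisibilityInput{
		QueueUrl:          aws.String(queueURL),
		ReceiptHandle:     aws.String(receiptHandle),
		VisibilityTimeout: aws.Int64(seconds),
	})
	return err
}
```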

@ngoyal16
Contributor

ngoyal16 commented Oct 8, 2021

@mechanical-fish We are also facing the same issue. I have added caching of the instance ID to node name mapping to reduce the DescribeInstances calls.

@snay2
Contributor

snay2 commented Oct 8, 2021

Our cache implementation should have the following characteristics:

  1. Reduce the QPS rate of DescribeInstances calls while processing queue messages
  2. Memory consumption scales linearly, both up and down, with the size of the k8s cluster
    • Specifically, old data should be removed from the cache soon after it becomes obsolete
  3. Have a configurable invalidation frequency so customers can tune it per their cluster volatility or their AWS account API rate limits
    • A cluster with high node turnover or rapid scaling up/down may need to refresh the cache more frequently

One way to do this is to create a custom DescribeInstances provider that, on a configurable interval, fetches the necessary information about all instances in the cluster or all ASGs that NTH is monitoring. Then the implementation of retrieveNodeName uses this provider instead of calling the DescribeInstances API directly.

This is a form of the "constant work" pattern. Using such a pattern improves system stability under high load and unusual traffic patterns, such as a large influx of queue messages or a backed-up queue. Those circumstances will then not cause spikes in DescribeInstances calls that might exceed the rate limit. Rather, NTH will regularly call DescribeInstances on a predictable cadence. It may result in doing more work than necessary, but we trade off that extra work for more predictable performance under load.
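
A minimal sketch of what such a provider could look like (hypothetical names; it lists all instances unfiltered and ignores errors for brevity, whereas a real implementation would scope the query and handle failures):

```go
// Hypothetical constant-work provider: refresh the instance-ID -> node-name
// snapshot on a fixed interval, so bursts of queue messages never translate
// into bursts of DescribeInstances calls.
package provider

import (
	"sync"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
	"github.com/aws/aws-sdk-go/service/ec2/ec2iface"
)

type InstanceProvider struct {
	mu    sync.RWMutex
	names map[string]string
}

// Start refreshes the snapshot every interval, regardless of queue traffic.
func Start(client ec2iface.EC2API, interval time.Duration) *InstanceProvider {
	p := &InstanceProvider{names: map[string]string{}}
	go func() {
		for ; ; time.Sleep(interval) {
			fresh := map[string]string{}
			// Error ignored for brevity; a real implementation would log and retry.
			_ = client.DescribeInstancesPages(&ec2.DescribeInstancesInput{},
				func(page *ec2.DescribeInstancesOutput, lastPage bool) bool {
					for _, r := range page.Reservations {
						for _, i := range r.Instances {
							fresh[aws.StringValue(i.InstanceId)] = aws.StringValue(i.PrivateDnsName)
						}
					}
					return true
				})
			p.mu.Lock()
			p.names = fresh // stale entries drop out with each refresh
			p.mu.Unlock()
		}
	}()
	return p
}

// NodeName is what retrieveNodeName would consult instead of calling the API.
func (p *InstanceProvider) NodeName(instanceID string) (string, bool) {
	p.mu.RLock()
	defer p.mu.RUnlock()
	name, ok := p.names[instanceID]
	return name, ok
}
```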

@ngoyal16
Contributor

ngoyal16 commented Oct 8, 2021

@snay2 Sure, I am happy to help with that.

@bwagner5
Contributor

bwagner5 commented Oct 8, 2021

In the provider implementation, rather than using a plain map, we can use something like https://github.com/patrickmn/go-cache. I've used it before and it works pretty well. The provider implementation might be similar to this (a little simpler, though): https://github.com/awslabs/karpenter/blob/7d907ad539b45b46f3778e15a244820ae3468153/pkg/cloudprovider/aws/instancetypes.go
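
For reference, the go-cache API is small; an illustrative example (the key and value are made-up placeholders):

```go
// Illustrative go-cache usage: entries expire automatically, and an expired
// entry is simply a cache miss that falls through to the API call.
package cachedemo

import (
	"time"

	"github.com/patrickmn/go-cache"
)

func example() {
	// 10-minute default TTL, purge expired entries every 15 minutes.
	c := cache.New(10*time.Minute, 15*time.Minute)

	c.Set("i-0123456789abcdef0", "ip-10-0-1-23.ec2.internal", cache.DefaultExpiration)

	if name, found := c.Get("i-0123456789abcdef0"); found {
		_ = name.(string) // cache hit: no DescribeInstances call needed
	}
}
```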

@ngoyal16
Contributor

ngoyal16 commented Oct 8, 2021

If we add an expiring cache with a 10-minute TTL, do we still need to worry about cache invalidation, since entries will automatically expire after 10 minutes? It would also fix the memory leakage issue.

@bwagner5
Contributor

bwagner5 commented Oct 8, 2021

That's probably right; we can rely on time-based eviction.

@ngoyal16
Contributor

ngoyal16 commented Oct 8, 2021

@bwagner5 The go-cache module has been added.

@mechanical-fish
Contributor Author

mechanical-fish commented Oct 8, 2021

The idea of using go-cache with a constant-work pattern seems great!

Meanwhile, as I try to figure out why our queues are backed up, I have begun to wonder about the other uncached API calls beneath processSQSMessage, particularly autoscaling.DescribeAutoScalingInstances -- which, if I'm reading this code right, gets called at least once for most types of incoming SQS message. Because we've got CheckIfManaged turned on, this call might actually happen twice in succession every time such a message comes in, plus a call to ASG.DescribeTagsPages. All of this seems like an excellent opportunity to make more use of go-cache.

I also note that it looks like most of the calls to retrieveAutoScalingGroupName either swallow any API errors that may result or log them in a very indirect way. I'm going to try hacking together a version of NTH that does more logging in an attempt to sort out what API calls are getting exercised and how long they are taking.

@ngoyal16
Contributor

ngoyal16 commented Oct 8, 2021

I have one more question regarding setting this up. Currently we are running 3 EKS clusters in a single region in one account. Should we use a single queue for all of them or a different queue for each? What is the best practice for this?

@snay2
Contributor

snay2 commented Oct 8, 2021

@ngoyal16 Best practice is one queue per cluster. That avoids unnecessary processing and also prevents situations where NTH tries to handle messages from another cluster that it can't actually influence.

@snay2
Contributor

snay2 commented Oct 8, 2021

@mechanical-fish I look forward to hearing what you learn. If you think the additional logging would be beneficial to other customers as well, please do open a PR with it!

And yes, let's design the cache so it can support multiple data sources of instance information.

@ngoyal16
Contributor

@snay2 What happens if there are two VPCs with the same CIDR range, each running an EKS cluster, a node in each has exactly the same node name, and then one of those similarly named instances gets evicted or scaled out?

@snay2
Contributor

snay2 commented Oct 11, 2021

@ngoyal16 I don't have experience with this particular configuration, but the general best practice would be to have one queue per cluster, to avoid ambiguities like this that could arise from naming or IP address collisions across clusters. NTH doesn't have disambiguation logic for that kind of scenario; if you think it would be worthwhile to add it, feel free to open a separate issue as a feature request and we can discuss it with the community there.

@ngoyal16
Contributor

Could we add tag-value-based cleanup so that even if such a collision happens it is handled safely? Or we could use the cluster name or something like that. Sure, I will create a new feature request.

@mechanical-fish
Contributor Author

I've been off doing more research and, while I think this issue is real, I'm no longer convinced it's the root cause of our difficulties. I'm going to report two other issues that I've uncovered that have been causing our slowdown, the first of which is #498.

@mechanical-fish
Contributor Author

(Not to leave everyone in suspense, the second issue is "the code is good about checking the 'managed ASG tag' but it doesn't pull the events from other ASGs off the queue, so they just get redelivered over and over until the instances finally terminate" -- I have a PR that addresses this and also cuts back on the number of AWS API calls, but I want to at least make a half-hearted stab at updating the unit tests before uploading that.)

@mechanical-fish
Contributor Author

mechanical-fish commented Oct 13, 2021

Thanks for the prompt attention to #498 !

My second PR is #506 which I believe has actually addressed our immediate problems at my company.

If and when #506 lands, it introduces an internal data structure, the NodeInfo struct, which looks like a good candidate for prefetching-and-caching using the go-cache module. That would address the issue we're discussing and make NTH even more API-efficient.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.

@github-actions github-actions bot added the stale Issues / PRs with no activity label Nov 13, 2021
@mechanical-fish
Contributor Author

PRs #498 and #506, together with an increase to our API rate cap, have sufficiently lowered the urgency of this issue for my team that I haven't had further time to work on it.

I still think it would be fine to use e.g. go-cache and a strategy of periodically fetching NodeInfo information to reduce the number of API calls required during high-traffic periods when many instances are terminating at the same time. But the case for such a feature is harder to make now that nearly every one of our terminating instances can be handled with one API call.

@bwagner5 bwagner5 reopened this Nov 15, 2021
@bwagner5 bwagner5 removed the stale Issues / PRs with no activity label Nov 15, 2021
@jillmon jillmon added Priority: Low This issue will not be seen by most users. The issue is a very specific use case or corner case and removed Priority: High This issue will be seen by most users labels Nov 17, 2021
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.

@github-actions github-actions bot added the stale Issues / PRs with no activity label Dec 18, 2021
@github-actions

This issue was closed because it has become stale with no activity.
