Frequent calls to DescribeInstances cause SQS backup and late notifications #494
Comments
Hey @mechanical-fish! Thanks for raising this issue!
I agree w/ the longer term solutions you propose. We can definitely increase the number of messages that NTH can pull at one time (the max is 10) and cache `DescribeInstances` calls. `DescribeInstances` does have a published request rate limit here, which is 10 QPS w/ a 50 QPS burst (token bucket).
I went ahead and put up the PR for increasing the max messages to 10. I think we also need to dynamically adjust the `VisibilityTimeout` when processing the message, since most nodes will take longer than 20 seconds to drain. We can also retry a bit faster when rate limited by adjusting the visibility timeout in that case.
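For illustration, a minimal sketch of those two adjustments using aws-sdk-go. The queue URL and the timeout values here are placeholders, and this is a sketch of the idea rather than NTH's actual code:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

func main() {
	sess := session.Must(session.NewSession())
	client := sqs.New(sess)
	queueURL := "https://sqs.us-east-1.amazonaws.com/123456789012/nth-queue" // placeholder

	// Pull up to 10 messages per receive call (the SQS maximum) instead of 5.
	out, err := client.ReceiveMessage(&sqs.ReceiveMessageInput{
		QueueUrl:            aws.String(queueURL),
		MaxNumberOfMessages: aws.Int64(10),
		WaitTimeSeconds:     aws.Int64(20), // long polling
	})
	if err != nil {
		log.Fatal(err)
	}

	for _, msg := range out.Messages {
		// Before starting a long drain, extend this message's visibility timeout
		// so it isn't redelivered mid-drain. The same call with a small value
		// could be used to retry sooner after a rate-limit error.
		_, err := client.ChangeMessageVisibility(&sqs.ChangeMessageVisibilityInput{
			QueueUrl:          aws.String(queueURL),
			ReceiptHandle:     msg.ReceiptHandle,
			VisibilityTimeout: aws.Int64(300), // e.g. 5 minutes for a slow drain
		})
		if err != nil {
			log.Printf("failed to extend visibility: %v", err)
		}
		// ... drain the node, then DeleteMessage on success ...
	}
}
```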
@mechanical-fish We are also facing the same issue. I have added caching for the instance ID to node name mapping to reduce the describe calls.
Our cache implementation should have a few specific characteristics. One way to get them is to create a custom provider. This is a form of the "constant work" pattern. Using such a pattern improves system stability under high load and unusual traffic patterns, such as a large influx of queue messages or a backed-up queue. Those circumstances will then not cause spikes in `DescribeInstances` calls.
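As a loose illustration of that constant-work shape (a fixed-interval refresh that makes the same number of `DescribeInstances` calls regardless of queue depth), here is a sketch using aws-sdk-go. The `NodeNameCache` type and the refresh interval are assumptions for this example, not NTH's actual design:

```go
package cache

import (
	"log"
	"sync"
	"time"

	"github.com/aws/aws-sdk-go/service/ec2"
	"github.com/aws/aws-sdk-go/service/ec2/ec2iface"
)

// NodeNameCache does the same amount of work every tick, whether the queue
// is empty or backed up, so bursts of messages never translate into bursts
// of EC2 API calls.
type NodeNameCache struct {
	EC2 ec2iface.EC2API

	mu sync.RWMutex
	m  map[string]string // instance ID -> private DNS name (node name)
}

// Run rebuilds the whole mapping once per interval with one paginated scan.
func (c *NodeNameCache) Run(interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		fresh := make(map[string]string)
		err := c.EC2.DescribeInstancesPages(&ec2.DescribeInstancesInput{},
			func(page *ec2.DescribeInstancesOutput, lastPage bool) bool {
				for _, r := range page.Reservations {
					for _, i := range r.Instances {
						if i.InstanceId != nil && i.PrivateDnsName != nil {
							fresh[*i.InstanceId] = *i.PrivateDnsName
						}
					}
				}
				return true
			})
		if err != nil {
			log.Printf("cache refresh failed, keeping stale data: %v", err)
			continue
		}
		c.mu.Lock()
		c.m = fresh // swap in the new snapshot atomically
		c.mu.Unlock()
	}
}

// NodeName answers lookups from the snapshot without any API call.
func (c *NodeNameCache) NodeName(instanceID string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	name, ok := c.m[instanceID]
	return name, ok
}
```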
@snay2 Sure, I am happy to help on the same.
In the provider implementation, rather than using a map, we can use something like this: https://github.com/patrickmn/go-cache. I've used this one before and it works pretty well. The provider implementation might be similar to this (a little simpler though): https://github.com/awslabs/karpenter/blob/7d907ad539b45b46f3778e15a244820ae3468153/pkg/cloudprovider/aws/instancetypes.go
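A rough sketch of what such a provider might look like with go-cache and aws-sdk-go. The `InstanceProvider` name, the 10-minute TTL, and the method shape are illustrative assumptions, not the actual implementation:

```go
package provider

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
	"github.com/aws/aws-sdk-go/service/ec2/ec2iface"
	gocache "github.com/patrickmn/go-cache"
)

type InstanceProvider struct {
	ec2   ec2iface.EC2API
	cache *gocache.Cache
}

func NewInstanceProvider(api ec2iface.EC2API) *InstanceProvider {
	// Entries expire after 10 minutes; expired entries are purged every
	// minute, which also bounds memory use without explicit invalidation.
	return &InstanceProvider{ec2: api, cache: gocache.New(10*time.Minute, 1*time.Minute)}
}

// NodeName translates an instance ID to its node name, hitting the
// DescribeInstances API only on a cache miss.
func (p *InstanceProvider) NodeName(instanceID string) (string, error) {
	if v, ok := p.cache.Get(instanceID); ok {
		return v.(string), nil // cache hit: no API call
	}
	out, err := p.ec2.DescribeInstances(&ec2.DescribeInstancesInput{
		InstanceIds: []*string{aws.String(instanceID)},
	})
	if err != nil {
		return "", err
	}
	if len(out.Reservations) == 0 || len(out.Reservations[0].Instances) == 0 {
		return "", fmt.Errorf("instance %s not found", instanceID)
	}
	name := aws.StringValue(out.Reservations[0].Instances[0].PrivateDnsName)
	p.cache.SetDefault(instanceID, name)
	return name, nil
}
```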
If we add this expiring cache with a 10 min TTL, do we still need to worry about cache invalidation, since entries will automatically expire after 10 min? It would also fix the memory leakage issue.
That's probably right, that we can rely on time-based eviction.
@bwagner5 go-cache module added.
The idea of using go-cache sounds good. Meanwhile, as I try to figure out why our queues are backed up, I have begun to wonder about the other uncached API calls beneath … I also note that it looks like most of the calls to …
I have one more question regarding setting this up. Currently we are running 3 EKS clusters in a single region in one account. Should we use a single queue for all of them or a different one for each? What's the best practice for this?
@ngoyal16 Best practice is one queue per cluster. That avoids unnecessary processing and also prevents situations where NTH tries to handle messages from another cluster that it can't actually influence.
@mechanical-fish I look forward to hearing what you learn. If you think the additional logging would be beneficial to other customers as well, please do open a PR with it! And yes, let's design the cache so it can support multiple data sources of instance information.
@snay2 What will happen if there are two VPCs with the same CIDR range, each running an EKS cluster, and one node in each has exactly the same node name, and one of those similarly named instances gets evicted or scaled out?
@ngoyal16 I don't have experience with this particular configuration, but the general best practice would be to have one queue per cluster, to avoid ambiguities like this that could arise from naming or IP address collisions across clusters. NTH doesn't have disambiguation logic for that kind of scenario; if you think it would be worthwhile to add it, feel free to open a separate issue as a feature request and we can discuss it with the community there.
Can we add tag-value-based cleanup this way, even if this kind of collision happens? Or we can use the cluster name or something like …
I've been off doing more research and, while I think this issue is real, I'm no longer convinced it's the root cause of our difficulties. I'm going to report two other issues that I've uncovered that have been causing our slowdown, the first of which is #498
(Not to leave everyone in suspense, the second issue is "the code is good about checking the 'managed ASG tag' but it doesn't pull the events from other ASGs off the queue, so they just get redelivered over and over until the instances finally terminate" -- I have a PR that addresses this and also cuts back on the number of AWS API calls, but I want to at least make a half-hearted stab at updating the unit tests before uploading that.)
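For concreteness, a rough sketch of the behavior described in that parenthetical, with a hypothetical `handleMessage` helper and a precomputed `managed` flag standing in for the ASG tag check. This is an illustration of the idea, not the actual PR:

```go
package sqsevent

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/sqs"
	"github.com/aws/aws-sdk-go/service/sqs/sqsiface"
)

// handleMessage deletes events for unmanaged ASGs instead of leaving them
// on the queue to be redelivered until the instance finally terminates.
func handleMessage(client sqsiface.SQSAPI, queueURL string, msg *sqs.Message, managed bool) error {
	if !managed {
		// Not our ASG: delete the message so it stops cycling through the queue.
		_, err := client.DeleteMessage(&sqs.DeleteMessageInput{
			QueueUrl:      aws.String(queueURL),
			ReceiptHandle: msg.ReceiptHandle,
		})
		return err
	}
	// ... otherwise cordon/drain the node, then delete the message ...
	return nil
}
```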
Thanks for the prompt attention to #498! My second PR is #506, which I believe has actually addressed our immediate problems at my company. If and when #506 lands, it introduces an internal data structure, the …
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.
PRs #498 and #506, together with an increase to our API rate cap, have sufficiently lowered the urgency of this issue for my team that I haven't had further time to work on it. I still think it would be fine to use e.g. go-cache …
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.
This issue was closed because it has become stale with no activity.
`aws-node-termination-handler` seems to have the following logic (in `pkg/monitor/sqsevent/sqs-monitor.go`):

- Call the `ec2.DescribeInstances` API once per message (as part of the `retrieveNodeName` function). If the SQS queue is consistently full, this is 2.5 calls to `DescribeInstances` per second per queue.
- If `DescribeInstances` hits a rate limit, do not remove the message from the queue; instead leave it there to be retried 20 seconds later.

We are now operating at sufficient scale that `DescribeInstances` is often hitting API rate limits -- I'm not sure those limits are published, so I'm not sure whether the 2.5 calls per second from each `aws-node-termination-handler` are a significant cause of our rate limiting, or merely one contributor. But once rate limiting begins we observe the following failure mode:

- `aws-node-termination-handler` will never pull more than 5 messages every 2 seconds.
- When a call to `DescribeInstances` manages to get through to the API, the message for an already-terminated instance gets deleted from the backed-up queue. The bad news is that it's way too late to warn Kubernetes by that point, so we are experiencing sudden unexpected failure of Kube pods.

I can think of a couple ways to change this code to help avoid this problem:

- Cache the results of `DescribeInstances`. The purpose of these calls is to translate EC2 instance IDs to node names. Instance IDs are said to be unique (and are certainly vanishingly unlikely to be reused in rapid succession) and private DNS names are not mutable, so this mapping cannot change and can be cached -- we can periodically call `DescribeInstances` for all instances and cache the results.
- Decouple pulling messages off the queue from the `DescribeInstances` API calls, which could be managed differently as suggested above.
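For reference, the retry-on-rate-limit behavior described in the first list above hinges on recognizing the throttling error. A minimal sketch of that check with aws-sdk-go's `awserr` package; `RequestLimitExceeded` is EC2's throttling error code, and the helper name is illustrative:

```go
package sqsevent

import (
	"github.com/aws/aws-sdk-go/aws/awserr"
)

// isThrottled reports whether an EC2 API error is a rate-limit response,
// in which case the message should stay on the queue for a later retry
// rather than being deleted.
func isThrottled(err error) bool {
	if aerr, ok := err.(awserr.Error); ok {
		return aerr.Code() == "RequestLimitExceeded"
	}
	return false
}
```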