
Redis peer management doesn't scale that well to 300 instances. #793

Closed
kentquirk opened this issue Jul 13, 2023 · 0 comments · Fixed by #794
Labels
type: bug Something isn't working

Comments

Contributor

kentquirk commented Jul 13, 2023

A customer running many (300) Refinery instances is seeing around a thousand Redis timeouts per hour. The Redis peer list updates rely on SCAN, which isn't especially fast; the keys are fetched in batches of 10, and the whole scan is repeated twice.

Consider:

  • Using a larger value for COUNT in the scan() function (see the sketch after this list)
  • Making that configurable
  • Replacing the individual keys with a Redis SET and fetching them all with a single call
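
For illustration, here's a minimal sketch of the first two options: a SCAN loop whose COUNT comes from a parameter rather than a hard-coded batch size. It assumes the redigo client (`github.com/gomodule/redigo/redis`); the function name and key pattern are hypothetical, not Refinery's actual code.

```go
package peer

import "github.com/gomodule/redigo/redis"

// scanPeerKeys pages through keys matching pattern, letting the caller choose
// the SCAN COUNT hint instead of relying on a small default batch size.
func scanPeerKeys(conn redis.Conn, pattern string, count int) ([]string, error) {
	var keys []string
	cursor := 0
	for {
		// COUNT is only a hint to Redis, but a larger value means far fewer
		// round trips when the keyspace holds just a few hundred peer entries.
		values, err := redis.Values(conn.Do("SCAN", cursor, "MATCH", pattern, "COUNT", count))
		if err != nil {
			return nil, err
		}
		if cursor, err = redis.Int(values[0], nil); err != nil {
			return nil, err
		}
		batch, err := redis.Strings(values[1], nil)
		if err != nil {
			return nil, err
		}
		keys = append(keys, batch...)
		// Redis signals a completed scan by returning cursor 0.
		if cursor == 0 {
			return keys, nil
		}
	}
}
```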

Some testing with COUNT set to 1000 gave these results:

```
BEFORE
  avg duration=33.230184ms count=36569 max duration=389.308ms    min duration=367.458µs
AFTER
  avg duration= 4.999529ms count=44768 max duration=166.902542ms min duration=190.833µs
```

The SET option doesn't offer the automatic key expiration that individual keys give us, so I think we'll just increase the value of COUNT and be happy with a roughly 6.5x improvement in average duration for now.
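
For context, a rough sketch of that trade-off (again assuming redigo; the key and set names and the 30-second TTL are made up for illustration): individual peer keys can each carry a TTL so dead peers age out automatically, while a Redis SET has no per-member expiration and would need explicit cleanup.

```go
package peer

import "github.com/gomodule/redigo/redis"

// Per-peer keys: each registration carries its own TTL, so a crashed peer
// simply drops off the peer list once its key expires.
func registerWithKey(conn redis.Conn, addr string) error {
	_, err := conn.Do("SET", "refinery-peer-"+addr, addr, "EX", 30)
	return err
}

// Single SET: reads become one SMEMBERS call, but members never expire on
// their own, so dead peers would have to be pruned by some other mechanism.
func registerWithSet(conn redis.Conn, addr string) error {
	_, err := conn.Do("SADD", "refinery-peers", addr)
	return err
}

func listPeersFromSet(conn redis.Conn) ([]string, error) {
	return redis.Strings(conn.Do("SMEMBERS", "refinery-peers"))
}
```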

kentquirk added the type: bug label Jul 13, 2023
TylerHelmuth added a commit that referenced this issue Jul 14, 2023

## Which problem is this PR solving?

This week we bumped the number of Refinery pods from 250 to 300. After that
change, we've noticed a lot more timeouts from Redis.

```
time="2023-07-13T17:49:28Z" level=error msg="redis scan encountered an error" err="redis scan timeout" keys_returned=92 timeoutSec=5s
time="2023-07-13T18:39:41Z" level=error msg="get members failed during watch" error="context deadline exceeded" name="http://host:8081/" oldPeers="[list_of_the_300_hosts:8081]" timeout=5s
```

After digging into the code, we noticed that peers are stored in Redis and
retrieved in batches of 10. Because there are 300 nodes, each node makes 30
requests to Redis to fetch all the peers, and does so twice, for 60 requests
total. This happens every 5 seconds, so our Redis instance is handling
18,000 requests every 5 seconds.
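
To make that arithmetic concrete, here is a tiny illustrative helper (not part of Refinery) that reproduces the request counts used in this description:

```go
package main

import "fmt"

// scanRequestsPerInterval estimates the total SCAN requests hitting Redis per
// refresh interval, assuming every node pages through the full peer list twice.
func scanRequestsPerInterval(nodes, batchSize int) int {
	pagesPerNode := (nodes + batchSize - 1) / batchSize // ceil(nodes / batchSize)
	return nodes * pagesPerNode * 2                     // each node runs the scan twice
}

func main() {
	fmt.Println(scanRequestsPerInterval(300, 10))    // 18000: today's 300 nodes, batches of 10
	fmt.Println(scanRequestsPerInterval(1000, 10))   // 200000: 1000 nodes, batches of 10
	fmt.Println(scanRequestsPerInterval(1000, 1000)) // 2000: 1000 nodes with COUNT 1000
}
```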

## Short description of the changes

A little napkin math: with 300 nodes and an average payload size of 26 bytes
(that's what `oldPeers` reports in our logs, anyway), Redis holds about
7.8 KB of peer data. That should be very easy to retrieve in a single request
and doesn't need to be broken into batches of 10.

This PR proposes bumping the batch size to 1000. For 1000 nodes, that would
put the total payload at about 26 KB and yield only 2000 requests to Redis
every 5 seconds (about 72 MB of bandwidth over those 5 seconds, too).
Leaving it as is would require 200,000 requests to Redis over 5 seconds for
1000 nodes.

Getting a little extreme: for 1M nodes, the total payload would be about
26 MB, and each node would make 2000 requests, so 2 billion requests to
Redis every 5 seconds, for a total of 72 GB transferred (1/1000 of the 26 MB
total payload, 2000 times). This might kill the Redis instance, but the data
itself is still "pretty small".

Fixes #793

---------

Co-authored-by: Kent Quirk <kentquirk@honeycomb.io>
Co-authored-by: Tyler Helmuth <12352919+TylerHelmuth@users.noreply.github.com>