store: More than 2000 Connections/s to S3 Backend #1492
Comments
Hi there, it seems like your index cache is too small. Could you check the index cache metrics and see if there are lots of evictions? If yes, then you might need to increase its size. Unfortunately, our current simple LRU-based index cache doesn't work well under lots of pressure, since it doesn't track which items are constantly being added and removed.
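A quick way to check this — assuming the Store gateway's default HTTP port (10902) and the standard in-memory index cache metric names — is to scrape its metrics endpoint directly; this is only a sketch:

```bash
# Sketch: inspect index cache pressure on the Store gateway.
# Assumes the Store gateway exposes metrics on its default HTTP port (10902).
curl -s http://localhost:10902/metrics \
  | grep -E 'thanos_store_index_cache_(requests_total|hits_total|items_added_total|items_evicted_total)'
```

An eviction counter that keeps growing roughly as fast as items are added suggests the cache is too small for the working set.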
Hi, thanks for looking into it. In fact I had a lot of those messages. I have now doubled both values and set: --index-cache-size=500MB --chunk-pool-size=4GB. I once read in Slack a rule of thumb about sizing the index cache after the biggest blocks, but I don't really know where I can see the size of the index cache for the biggest blocks. So I'm just going with the trial-and-error method.
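For context, both flags belong to the Store gateway. A minimal invocation using them might look like the sketch below; paths are placeholders, not the reporter's actual setup:

```bash
# Sketch only: Store gateway with the enlarged caches discussed above.
# --index-cache-size bounds the in-memory index cache;
# --chunk-pool-size bounds the memory pool used for reading chunks.
# Paths are placeholders.
thanos store \
  --data-dir=/var/thanos/store \
  --objstore.config-file=/etc/thanos/objstore.yml \
  --index-cache-size=500MB \
  --chunk-pool-size=4GB
```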
That seems like a pretty good rule of thumb; they are talking about the sizes of the indexes of the biggest blocks. Did increasing the cache size help you with this problem?
Hi, yes, it seems setting higher values for caching did indeed help with this issue.
In case of issues related to the exact bucket implementation, please ping the corresponding maintainer from the list here: https://github.com/thanos-io/thanos/blob/master/docs/storage.md
Pinging @bwplotka because of the S3 backend.
Thanos, Prometheus and Golang version used
What happened
Sometimes, Thanos Store makes more than 2000 connections per second to the S3 backend. The firewall then blocks any further connections for 1 minute (DDoS prevention). This occurs as infrequent spikes, and I can't yet relate them to any event. I did not see any queries being run on Grafana or the Thanos Query component shortly before or at the moment of those spikes that could have caused them.
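One way to correlate such spikes — offered here only as a sketch, assuming the Store gateway's default HTTP port (10902) — is to watch its object storage operation counters around the time of a spike:

```bash
# Sketch: watch how quickly the Store gateway's bucket operation counters grow,
# to compare against the firewall's 2000 connections/s threshold.
# Assumes the Store gateway's default HTTP port (10902).
watch -n 1 "curl -s http://localhost:10902/metrics | grep thanos_objstore_bucket_operations_total"
```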
I updated on 3 September 2019, around 19:00 GMT+02:00, from v0.6.1 to Thanos v0.7.0, hoping that the newly merged change would fix this issue.
What you expected to happen
I expected the connection rate not to spike that much.
How to reproduce it (as minimally and precisely as possible):
I can't relate the spikes to any event, so I'm not sure how to reproduce them.
Thanos Store Startup Configuration
Contents of --objstore.config-file
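For reference, an S3 client configuration for --objstore.config-file generally has the shape sketched below; bucket, endpoint, and credentials are placeholders, not the reporter's actual settings:

```bash
# Sketch: typical shape of an S3 objstore config file (all values are placeholders).
cat > /etc/thanos/objstore.yml <<'EOF'
type: S3
config:
  bucket: "example-thanos-bucket"
  endpoint: "s3.example.internal:9000"
  access_key: "PLACEHOLDER"
  secret_key: "PLACEHOLDER"
  insecure: false
EOF
```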
Full logs to relevant components
All Thanos components are started via systemd. Log output is transferred to /var/log/thanos.log via rsyslog. Hence the logs may also contain logs from Thanos Compactor & Query components.
Apparently the 2000/s connection threshold was reached at 13:04:11, but I can't see anything suspicious in the logs.
Anything else we need to know
In case it is relevant: we do not downsample the data and we keep it for 7 months.
Thanos Compact Systemd startup command (on same server):
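Given the 7-month retention and disabled downsampling mentioned above, such a policy is typically expressed with Compactor flags along these lines; paths and the exact duration are placeholders, not the actual unit file:

```bash
# Sketch: a Compactor invocation matching "no downsampling, keep data ~7 months".
# Paths are placeholders; 210d is only an approximation of 7 months.
thanos compact \
  --wait \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=/etc/thanos/objstore.yml \
  --retention.resolution-raw=210d \
  --downsampling.disable
```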
When querying, I see around 700 connections/s, nowhere near 2000/s.
When doing a tcpdump (sadly not at 13:04), I see lots of TCP ACKs with NOP & timestamp options but hardly any payload. I'm not sure whether those connections are causing the problem.
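To separate keep-alive ACK traffic from genuinely new connections, one can capture only the initial SYNs towards the S3 endpoint; a rough sketch, with the endpoint host and port as placeholders:

```bash
# Sketch: capture only new outgoing connections (initial SYNs, not SYN-ACKs)
# towards the S3 endpoint, to gauge the real connection rate rather than
# keep-alive ACK chatter. Host and port are placeholders for the S3 endpoint.
tcpdump -nn -i any \
  'dst host s3.example.internal and dst port 9000 and tcp[tcpflags] & (tcp-syn) != 0 and tcp[tcpflags] & (tcp-ack) == 0'
```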
Have you heard of or encountered something similar? I'm not really sure how to investigate further. If you need more data, please tell me and I'll see to it.
Environment:
uname -a: Linux p1-thanos-alsu001 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 15 17:36:42 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux