connect: limit agent cache size #4968
Labels: theme/connect, type/enhancement, type/umbrella-☂️
Currently the cache can grow without bound. I think there are several aspects we should address here, but I'm not sure how yet.
Reduced TTLs
Currently all our background-refresh cache types (certs, intentions, service discovery) have a TTL of 3 days, which was chosen to ensure we keep them in memory through a weekend-long outage of the servers.
`proxycfg.Manager` (added in 1.3.0) actually invalidates that rationale, since any resources needed by currently registered proxies are constantly blocked on and so won't be considered inactive. So we could set TTLs much more aggressively.

This is important because of the partition-by-token behaviour: an ideal application/proxy that rotates its ACL token every hour will generate all-new cache entries each hour, but we'll keep the old ones around for 3 days, consuming 72x the memory we actually need.
It's actually not that hard to do this safely now for any of the existing Connect use cases. If we want to support Connect clients that don't block on their certs but poll instead, the TTL only really needs to be as long as the maximum poll frequency we consider reasonable; 30-60 minutes seems like more than enough for anything that is trying to use Connect. If the client app crashes or is down for more than that TTL, a new cert would be generated when it came back, but that doesn't seem awful.
General Memory Limit
In general it would be nice to have a configurable limit on cache memory; ideally in bytes, but at least in number of entries. Then we'd want LRU behaviour to protect against accidental or malicious pathological cases that churn through cached requests (e.g. something using `?cached` service discovery queries is configured to watch thousands of different service prefixes across tens of datacenters and rotates its token every few hours). In the worst case this could end up consuming as much RAM on the client as the whole Raft state, multiplied by the number of token rotations/distinct client tokens used to fetch within the TTL.
This will require measuring or estimating cached response size, though. That's not impossible, but it may need additional work on the interface, since currently the cache just stores the result as an `interface{}`.