connect: limit agent cache size #4968
Labels: theme/connect, type/enhancement, type/umbrella-☂️
Currently the cache can grow without bound. I think there are several aspects we should address here, but I'm not sure how yet.
Reduced TTLs
Currently all our background-refresh cache types (certs, intentions, service discovery) have a TTL of 3 days, which was chosen to ensure we keep them in memory through a weekend-long outage of the servers.
`proxycfg.Manager` (added in 1.3.0) actually invalidates that rationale, since any resources needed by currently registered proxies are constantly blocked on and so won't be considered inactive. So we could set TTLs much more aggressively.

This is important because of the partition-by-token behaviour: an ideal application/proxy that rotates its ACL token every hour will generate all-new cache entries each hour, but we'll keep the old ones around for 3 days, consuming 72x the memory we actually need.
It's actually not that hard to do this safely now for any of the existing Connect use cases. If we want to support Connect clients that don't block on their certs but poll instead, the TTL only really needs to be as long as the maximum poll frequency we consider reasonable; 30-60 minutes seems like more than enough for anything that is trying to use Connect. If the client app crashes or is down for more than that TTL, a new cert would be generated when it came back, but that doesn't seem awful.
General Memory Limit
In general it would be nice to have a configurable limit on cache memory; ideally in bytes, but at least in number of entries. Then we'd want LRU behaviour to protect against accidental or malicious pathological cases that churn through cached requests (e.g. something using `?cached` service discovery queries is configured to watch thousands of different service prefixes across tens of datacenters and rotates its token every few hours). In the worst case this could end up consuming as much RAM on the client as the whole Raft state, multiplied by the number of token rotations/distinct client tokens used to fetch within the TTL.
This will require measuring or estimating cached response size, though. That's not impossible, but it may need additional work on the interface, since currently the cache just stores the result as an `interface{}`.