Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cache: Fix bug where connection errors can cause early cache expiry #9979

Merged
merged 4 commits into from
Apr 8, 2021

Conversation

banks
Copy link
Member

@banks banks commented Apr 7, 2021

Based on #9978.

This fixes a bug in the agent cache where TTL expiry times are only updated when successful updates are received.

There are two different ways this can cause items to be prematurely evicted from the cache:

  1. If the agent is disconnected from servers for at least one TTL period, the waiting client Get requests will go round their loop seeing only errors and not updating the expiry time even though they are still interested (i.e. the value is still "alive").
  2. If no updates are sent for one entire TTL then the waiting Get's will never see a valid update response and so never touch the expiry time meaning the value will be evicted.

Currently we don't run into this because until streaming, all cache-types had long TTLs of 72 hours.

It's also self-recovering in that the active clients will just cause the cache value to be refetched anyway even though it was evicted. For typical RPC-based cache types the cost is minimal but for leaf certificates, streaming and other possible future types the eviction causes connections to drop and data to be lost and recomputed which negatively effects performance or requires unnecessary work.

For streaming, this causes dropped connections which then take some time to reconnect based on the timing of the client which can mean delayed event delivery.

banks and others added 3 commits April 7, 2021 14:21
This mostly affects streaming since streaming will immediately return from Fetch calls when the state is Closed on eviction which causes the race condition every time.

However this also affects all other cache types if the fetch call happens to return between the eviction and then next time around the Get loop by any client.

There is a separate bug that allows cache items to be evicted even when there are active clients which is the trigger here.
@banks banks added theme/streaming Related to Streaming connections between server and client backport/1.9 labels Apr 7, 2021
@banks banks added this to the 1.9.x milestone Apr 7, 2021
@banks banks requested a review from a team April 7, 2021 13:55
@github-actions github-actions bot added the theme/agent-cache Agent Cache label Apr 7, 2021
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging April 7, 2021 13:58 Inactive
@vercel vercel bot temporarily deployed to Preview – consul April 7, 2021 13:58 Inactive
@banks banks changed the base branch from cache-fixes to master April 7, 2021 13:59
@banks banks changed the base branch from master to cache-fixes April 7, 2021 13:59
@@ -1,3 +1,3 @@
```release-note:bug
cache: fix a bug in the client agent cache where streaming could potentially leak resources. [[GH-9978](https://github.com/hashicorp/consul/pulls/9978)].
cache: fix a bug in the client agent cache where streaming could potentially leak resources. [[GH-9978](https://github.com/hashicorp/consul/pull/9978)].
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed this in the upstream PR branch too and reset base but it's still showing here 🤷

Copy link
Contributor

@dnephin dnephin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, this seems like a really important fix.

Base automatically changed from cache-fixes to master April 8, 2021 10:08
@banks banks merged commit 1406671 into master Apr 8, 2021
@banks banks deleted the cache-error-ttl-fix branch April 8, 2021 10:11
@hashicorp-ci
Copy link
Contributor

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/347197.

@hashicorp-ci
Copy link
Contributor

🍒✅ Cherry pick of commit 1406671 onto release/1.9.x succeeded!

hashicorp-ci pushed a commit that referenced this pull request Apr 8, 2021
…9979)

Fixes a cache bug where TTL is not updated while a value isn't changing or cache entry is returning fetch errors.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
theme/agent-cache Agent Cache theme/streaming Related to Streaming connections between server and client
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants