
Release state lock before catalog operations in State.SyncChanges #12485

Closed
wants to merge 1 commit

Conversation

wjordan
Contributor

@wjordan wjordan commented Mar 2, 2022

Another attempt at fixing #6616 and #8504.

Overview

The underlying issue is described in #6616, here is a summary:

Consul's 'anti-entropy' feature depends on periodic syncs of state changes from a local client Agent to the remote Server:

anti-entropy is a synchronization of the local agent state and the catalog. For example, when a user registers a new service or check with the agent, the agent in turn notifies the catalog that this new check exists. Similarly, when a check is deleted from the agent, it is consequently removed from the catalog as well.
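For context, the kind of agent-local registration that anti-entropy later pushes to the catalog looks roughly like this, using the official github.com/hashicorp/consul/api client (the service definition is just an example):

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Connect to the local agent (default address 127.0.0.1:8500).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Register a service with the *agent*. Anti-entropy is what then syncs
	// this registration to the server-side catalog via Catalog.Register.
	svc := &api.AgentServiceRegistration{
		ID:   "web-1",
		Name: "web",
		Port: 8080,
	}
	if err := client.Agent().ServiceRegister(svc); err != nil {
		log.Fatal(err)
	}
}
```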

This synchronization occurs in State.SyncChanges():

consul/agent/local/state.go, lines 1218 to 1222 in 73c91ed:

```go
// SyncChanges pushes checks, services and node info data which has been
// marked out of sync or deleted to the server.
func (l *State) SyncChanges() error {
	l.Lock()
	defer l.Unlock()
```

The (sync.RWMutex) l.Lock() / defer l.Unlock() at the start of this function prevents any other reads of the local agent state while SyncChanges is in progress. Unfortunately, actually synchronizing changes to the catalog requires RPC calls to Catalog.Register (syncNodeInfo, syncService, syncCheck) / Catalog.Deregister (deleteService, deleteCheck), which may block for a long time under high latency or intermittent network issues. The result is long blocking calls for agent-local queries, causing unexpected application delays: agent-local queries are expected to respond much more quickly than RPC calls, but while the mutex is held their latency is identical to that of the in-flight RPC.
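To make the impact concrete, here is a small self-contained program (not Consul code, just a sketch of the locking behavior) showing how a reader on the same RWMutex waits for the full duration of a simulated RPC when the write lock is held across it:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type localState struct {
	sync.RWMutex
}

// syncChanges mimics the current behavior: the write lock is held for the
// entire duration of a slow catalog RPC.
func (l *localState) syncChanges() {
	l.Lock()
	defer l.Unlock()
	time.Sleep(500 * time.Millisecond) // stand-in for a slow Catalog.Register call
}

// localQuery mimics an agent-local read such as /v1/agent/services.
func (l *localState) localQuery() time.Duration {
	start := time.Now()
	l.RLock()
	defer l.RUnlock()
	return time.Since(start)
}

func main() {
	l := &localState{}
	go l.syncChanges()
	time.Sleep(10 * time.Millisecond) // let syncChanges grab the lock first
	// The read blocks until the simulated RPC finishes (~490ms).
	fmt.Println("local query waited:", l.localQuery())
}
```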

Solution

The solution in this PR is to temporarily unlock the mutex during the SyncChanges-initiated RPC calls by wrapping them with l.Unlock() / l.Lock(). This allows other agent queries to take place while a catalog update is in-flight.
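A minimal sketch of that pattern, assuming a simplified State and a stub in place of the real l.Delegate.RPC call (names here are illustrative, not the exact ones in agent/local/state.go):

```go
package state

import (
	"sync"
	"time"
)

// State is a simplified stand-in for agent/local.State.
type State struct {
	sync.RWMutex
}

// catalogRegister stands in for a blocking Catalog.Register RPC.
func catalogRegister() error {
	time.Sleep(100 * time.Millisecond)
	return nil
}

// syncService is called with l.Lock() already held by SyncChanges. It
// temporarily releases the lock so agent-local reads can proceed while the
// RPC is in flight, then reacquires it before returning.
func (l *State) syncService() error {
	l.Unlock()
	err := catalogRegister()
	l.Lock()
	return err
}
```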

I also reorganized some logic so that the local state is updated before the RPC call (while the lock is still held), and rolled back (the original state before changes is restored) in the uncommon error case. This way, any additional updates to local state arriving while the RPC call is in-flight would set InSync = false properly and be correctly included in the next sync.
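Continuing the sketch above, the optimistic update plus rollback might look roughly like this (ServiceState and its single InSync flag are simplifications of the real local-state records):

```go
// ServiceState is a hypothetical per-service record with a sync flag.
type ServiceState struct {
	InSync bool
}

// syncServiceWithRollback marks the service in sync before the RPC so that a
// concurrent local update can flip InSync back to false and be picked up by
// the next sync. If the RPC fails, the optimistic update is reverted.
// (The real change has to be careful not to overwrite updates that arrived
// while the lock was released.)
func (l *State) syncServiceWithRollback(s *ServiceState) error {
	prev := *s // snapshot the pre-RPC state while the lock is still held
	s.InSync = true

	l.Unlock()
	err := catalogRegister()
	l.Lock()

	if err != nil {
		*s = prev // revert so the change is retried on the next sync
	}
	return err
}
```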

A unit test has been added to verify that SyncChanges is non-blocking with this PR.
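Not the exact test added here, but a hedged sketch of how a non-blocking assertion like that can be written (a stubbed state and coarse timings stand in for the real agent):

```go
package state_test

import (
	"sync"
	"testing"
	"time"
)

// fakeState models the patched behavior: the lock is released while the
// simulated catalog RPC is in flight.
type fakeState struct {
	sync.RWMutex
}

func (s *fakeState) syncChanges() {
	s.Lock()
	defer s.Unlock()
	// ... local bookkeeping under the lock ...
	s.Unlock()
	time.Sleep(500 * time.Millisecond) // blocking catalog RPC, lock released
	s.Lock()
}

func TestSyncChangesDoesNotBlockLocalReads(t *testing.T) {
	s := &fakeState{}
	go s.syncChanges()
	time.Sleep(10 * time.Millisecond) // let syncChanges start its "RPC"

	done := make(chan struct{})
	go func() {
		s.RLock()
		s.RUnlock()
		close(done)
	}()

	select {
	case <-done:
		// the local read completed while the sync "RPC" was still in flight
	case <-time.After(100 * time.Millisecond):
		t.Fatal("agent-local read blocked while SyncChanges was in progress")
	}
}
```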

I've been running this PR in production for several months without issue, and it has resulted in significantly reduced latencies for all /agent endpoints.

See Also

A previous attempt to solve this issue is at #6619, which raised some concerns around corner-cases and test coverage.
The approach in this PR avoids race conditions in a way that minimizes changes to the existing behavior, which I hope leads to fewer corner cases and makes it easier to review.

Prevents agent queries from waiting on the mutex while RPC
calls to sync changes with the catalog are in flight.
Updates local state before the RPC call, then reverts to the original state if an error is returned.
@Amier3
Contributor

Amier3 commented Mar 3, 2022

Thanks for the PR @wjordan, you've really been helping us out these past few months and we all appreciate it!

We'll likely take a look at this early next week. I haven't looked at the code yet, but I will caution that we might need to bring #11500 to completion before we can meaningfully review this.

@hashicorp-cla

hashicorp-cla commented Mar 12, 2022

CLA assistant check
All committers have signed the CLA.

@github-actions

This pull request has been automatically flagged for inactivity because it has not been acted upon in the last 60 days. It will be closed if no new activity occurs in the next 30 days. Please feel free to re-open to resurrect the change if you feel this has happened by mistake. Thank you for your contributions.

@github-actions github-actions bot added the meta/stale Automatically flagged for inactivity by stalebot label May 13, 2022
@github-actions

Closing due to inactivity. If you feel this was a mistake or you wish to re-open at any time in the future, please leave a comment and it will be re-surfaced for the maintainers to review.

@github-actions github-actions bot closed this Jun 14, 2022
@wjordan
Contributor Author

wjordan commented Oct 25, 2024

Bump, this PR is still relevant and I'd still like a review whenever possible. My production deployment has been depending on this improvement for the last few years and I'd still like to see it get off a fork and merged upstream at some point.

I'm happy to rebase the PR if needed but won't spend time on it unless it would be useful for actually getting it reviewed and merged, so please let me know.
