
Release state lock before catalog operations in State.SyncChanges #12485

Closed
wants to merge 1 commit

Conversation

wjordan
Contributor

@wjordan wjordan commented Mar 2, 2022

Another attempt at fixing #6616 and #8504.

Overview

The underlying issue is described in #6616, here is a summary:

Consul's 'anti-entropy' feature depends on periodic syncs of state changes from a local client Agent to the remote Server:

anti-entropy is a synchronization of the local agent state and the catalog. For example, when a user registers a new service or check with the agent, the agent in turn notifies the catalog that this new check exists. Similarly, when a check is deleted from the agent, it is consequently removed from the catalog as well.
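For context, the kind of agent-local registration that anti-entropy later pushes to the catalog looks roughly like this, using the official github.com/hashicorp/consul/api client (the service definition is just an example):

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Connect to the local agent (default address 127.0.0.1:8500).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Register a service with the *agent*. Anti-entropy is what then syncs
	// this registration to the server-side catalog via Catalog.Register.
	svc := &api.AgentServiceRegistration{
		ID:   "web-1",
		Name: "web",
		Port: 8080,
	}
	if err := client.Agent().ServiceRegister(svc); err != nil {
		log.Fatal(err)
	}
}
```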

This synchronization occurs in State.SyncChanges():

consul/agent/local/state.go, lines 1218 to 1222 in 73c91ed:

```go
// SyncChanges pushes checks, services and node info data which has been
// marked out of sync or deleted to the server.
func (l *State) SyncChanges() error {
	l.Lock()
	defer l.Unlock()
```

The (sync.RWMutex) l.Lock() / defer l.Unlock() at the start of this function prevents any other reads of the local agent state while SyncChanges is in progress. Unfortunately, actually synchronizing changes to the catalog requires RPC calls to Catalog.Register (syncNodeInfo, syncService, syncCheck) / Catalog.Deregister (deleteService, deleteCheck), which may block for a long time under high latency or intermittent network issues. The result is long blocking calls for agent-local queries, causing unexpected application delays: agent-local queries are expected to respond much more quickly than RPC calls, but while the mutex is held their latency is identical to that of the in-flight RPC.
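To make the impact concrete, here is a small self-contained program (not Consul code, just a sketch of the locking behavior) showing how a reader on the same RWMutex waits for the full duration of a simulated RPC when the write lock is held across it:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type localState struct {
	sync.RWMutex
}

// syncChanges mimics the current behavior: the write lock is held for the
// entire duration of a slow catalog RPC.
func (l *localState) syncChanges() {
	l.Lock()
	defer l.Unlock()
	time.Sleep(500 * time.Millisecond) // stand-in for a slow Catalog.Register call
}

// localQuery mimics an agent-local read such as /v1/agent/services.
func (l *localState) localQuery() time.Duration {
	start := time.Now()
	l.RLock()
	defer l.RUnlock()
	return time.Since(start)
}

func main() {
	l := &localState{}
	go l.syncChanges()
	time.Sleep(10 * time.Millisecond) // let syncChanges grab the lock first
	// The read blocks until the simulated RPC finishes (~490ms).
	fmt.Println("local query waited:", l.localQuery())
}
```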

Solution

The solution in this PR is to temporarily unlock the mutex during the SyncChanges-initiated RPC calls by wrapping them with l.Unlock() / l.Lock(). This allows other agent queries to take place while a catalog update is in-flight.
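A minimal sketch of that pattern, assuming a simplified State and a stub in place of the real l.Delegate.RPC call (names here are illustrative, not the exact ones in agent/local/state.go):

```go
package state

import (
	"sync"
	"time"
)

// State is a simplified stand-in for agent/local.State.
type State struct {
	sync.RWMutex
}

// catalogRegister stands in for a blocking Catalog.Register RPC.
func catalogRegister() error {
	time.Sleep(100 * time.Millisecond)
	return nil
}

// syncService is called with l.Lock() already held by SyncChanges. It
// temporarily releases the lock so agent-local reads can proceed while the
// RPC is in flight, then reacquires it before returning.
func (l *State) syncService() error {
	l.Unlock()
	err := catalogRegister()
	l.Lock()
	return err
}
```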

I also reorganized some logic so that the local state is updated before the RPC call (while the lock is still held), and rolled back (the original state before changes is restored) in the uncommon error case. This way, any additional updates to local state arriving while the RPC call is in-flight would set InSync = false properly and be correctly included in the next sync.
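Continuing the sketch above, the optimistic update plus rollback might look roughly like this (ServiceState and its single InSync flag are simplifications of the real local-state records):

```go
// ServiceState is a hypothetical per-service record with a sync flag.
type ServiceState struct {
	InSync bool
}

// syncServiceWithRollback marks the service in sync before the RPC so that a
// concurrent local update can flip InSync back to false and be picked up by
// the next sync. If the RPC fails, the optimistic update is reverted.
// (The real change has to be careful not to overwrite updates that arrived
// while the lock was released.)
func (l *State) syncServiceWithRollback(s *ServiceState) error {
	prev := *s // snapshot the pre-RPC state while the lock is still held
	s.InSync = true

	l.Unlock()
	err := catalogRegister()
	l.Lock()

	if err != nil {
		*s = prev // revert so the change is retried on the next sync
	}
	return err
}
```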

A unit test has been added to verify that SyncChanges is non-blocking with this PR.
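Not the exact test added here, but a hedged sketch of how a non-blocking assertion like that can be written (a stubbed state and coarse timings stand in for the real agent):

```go
package state_test

import (
	"sync"
	"testing"
	"time"
)

// fakeState models the patched behavior: the lock is released while the
// simulated catalog RPC is in flight.
type fakeState struct {
	sync.RWMutex
}

func (s *fakeState) syncChanges() {
	s.Lock()
	defer s.Unlock()
	// ... local bookkeeping under the lock ...
	s.Unlock()
	time.Sleep(500 * time.Millisecond) // blocking catalog RPC, lock released
	s.Lock()
}

func TestSyncChangesDoesNotBlockLocalReads(t *testing.T) {
	s := &fakeState{}
	go s.syncChanges()
	time.Sleep(10 * time.Millisecond) // let syncChanges start its "RPC"

	done := make(chan struct{})
	go func() {
		s.RLock()
		s.RUnlock()
		close(done)
	}()

	select {
	case <-done:
		// the local read completed while the sync "RPC" was still in flight
	case <-time.After(100 * time.Millisecond):
		t.Fatal("agent-local read blocked while SyncChanges was in progress")
	}
}
```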

I've been running this PR in production for several months without issue, and it has resulted in significantly reduced latencies for all /agent endpoints.

See Also

A previous attempt to solve this issue is at #6619, which raised some concerns around corner-cases and test coverage.
The approach in this PR avoids race conditions in a way that minimizes changes to the existing behavior, which I hope leads to fewer corner cases and makes it easier to review.

Prevents agent queries from waiting on the mutex while RPC
calls to sync changes with the catalog are in flight.
Updates local state before the RPC call, then reverts to the original state if an error is returned.
@Amier3
Contributor

Amier3 commented Mar 3, 2022

Thanks for the PR @wjordan, you've really been helping us out these past few months and we all appreciate it!

We'll likely take a look at this early next week. I haven't looked at the code yet, but I will caution that we might need to bring #11500 to completion before we can meaningfully review this.

@hashicorp-cla

hashicorp-cla commented Mar 12, 2022

CLA assistant check
All committers have signed the CLA.

@github-actions

This pull request has been automatically flagged for inactivity because it has not been acted upon in the last 60 days. It will be closed if no new activity occurs in the next 30 days. Please feel free to re-open to resurrect the change if you feel this has happened by mistake. Thank you for your contributions.

@github-actions github-actions bot added the meta/stale Automatically flagged for inactivity by stalebot label May 13, 2022
@github-actions

Closing due to inactivity. If you feel this was a mistake or you wish to re-open at any time in the future, please leave a comment and it will be re-surfaced for the maintainers to review.

@github-actions github-actions bot closed this Jun 14, 2022
@wjordan
Contributor Author

wjordan commented Oct 25, 2024

Bump, this PR is still relevant and I'd still like a review whenever possible. My production deployment has been depending on this improvement for the last few years and I'd still like to see it get off a fork and merged upstream at some point.

I'm happy to rebase the PR if needed but won't spend time on it unless it would be useful for actually getting it reviewed and merged, so please let me know.
