Release state lock before catalog operations in State.SyncChanges #12485
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Another attempt at fixing #6616 and #8504.
Overview
The underlying issue is described in #6616, here is a summary:
Consul's 'anti-entropy'' feature depends on periodic sync of state changes from a local client Agent to the remote Server:
This synchronization occurs in
State.SyncChanges()
:consul/agent/local/state.go
Lines 1218 to 1222 in 73c91ed
The (
sync.RWMutex
)l.Lock()
/defer l.Unlock()
at the start of this function prevents any other reads of the local agent state while SyncChanges is in progress. Unfortunately, actually synchronizing changes to the catalog requires RPC calls toCatalog.Register
(syncNodeInfo
,syncService
,syncCheck
) /Catalog.Deregister
(deleteService
,deleteCheck
), which may block for a long time in conditions of high latency or intermittent network issues. This results in long blocking calls for agent-local queries, causing unexpected application delays. Agent-local queries are expected to respond more quickly than RPC calls, but with the mutex the latency is identical to an RPC call when aSyncChanges
is in progress.Solution
The solution in this PR is to temporarily unlock the mutex during the
SyncChanges
-initiated RPC calls by wrapping them withl.Unlock() / l.Lock()
. This allows other agent queries to take place while a catalog update is in-flight.I also reorganized some logic so that the local state is updated before the RPC call (while the lock is still held), and rolled back (the original state before changes is restored) in the uncommon error case. This way, any additional updates to local state arriving while the RPC call is in-flight would set
InSync = false
properly and be correctly included in the next sync.A unit test has been added to verify that
SyncChanges
is non-blocking with this PR.I've been running this PR in production for several months without issue, and it has resulted in significantly reduced latencies for all
/agent
endpoints.See Also
A previous attempt to solve this issue is at #6619, which raised some concerns around corner-cases and test coverage.
The approach in this PR avoids race conditions in a way that minimizes changes to the existing behavior, which I hope would lead to less corner-cases and make it easier to review.