-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storeliveness: SupportManager improvements and bug fixes #130101
Conversation
2d1dfb0
to
8547230
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 4 files at r1, 2 of 2 files at r2, 8 of 8 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miraradeva)
-- commits
line 24 at r2:
"invariants"
-- commits
line 27 at r2:
s/it/is/
-- commits
line 28 at r2:
nit: the wording here is a bit off regarding singular vs. plural. "impact", "they"
pkg/kv/kvserver/storeliveness/requester_state.go
line 49 at r1 (raw file):
// needs to lock requesterStateHandler.mu for reading to ensure that no new // stores are added to the supportFrom map. recentlyQueried atomic.Value // recentlyQueriedState
atomic.Value
boxes values in an interface and incurs heap allocation costs as a result. It's not actually generic. We should use an atomic.Int32
instead.
pkg/kv/kvserver/storeliveness/requester_state.go
line 191 at r1 (raw file):
defer rsh.mu.RUnlock() for _, rs := range rsh.requesterState.supportFrom { if !rs.recentlyQueried.CompareAndSwap(active, inactive) {
It's tempting to use atomic.Int32.Add(1)
here to avoid having to think about races with these two CASs and concurrent Stores, but that doesn't quite work.
I guess the following does, but I doubt we want it:
if rs.recentlyQueried.Load() != idle {
rs.recentlyQueried.Add(1)
}
What you have here does seem to work, so no need to change it.
pkg/kv/kvserver/storeliveness/supporter_state.go
line 154 at r3 (raw file):
// write writes the supporter meta and supportFor to disk if they changed in // this update. Use a batch to avoid potentially writing multiple support states
s/Use/Accepts/
pkg/kv/kvserver/storeliveness/testdata/requester_state
line 164 at r1 (raw file):
support from: {Target:{NodeID:2 StoreID:2} Epoch:2 Expiration:410.000000000,0}
Should we show here that heartbeats are still sent to store 2?
pkg/kv/kvserver/storeliveness/support_manager_test.go
line 429 at r3 (raw file):
type testBatch struct { storage.Batch errorOnWrite bool
To avoid confusion in the future, should we add support for blockOnWrite
to testBatch
as well?
8547230
to
ce69163
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
"invariants"
Done
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
s/it/is/
Done
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: the wording here is a bit off regarding singular vs. plural. "impact", "they"
Done
pkg/kv/kvserver/storeliveness/requester_state.go
line 49 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
atomic.Value
boxes values in an interface and incurs heap allocation costs as a result. It's not actually generic. We should use anatomic.Int32
instead.
Done
pkg/kv/kvserver/storeliveness/requester_state.go
line 191 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
It's tempting to use
atomic.Int32.Add(1)
here to avoid having to think about races with these two CASs and concurrent Stores, but that doesn't quite work.I guess the following does, but I doubt we want it:
if rs.recentlyQueried.Load() != idle { rs.recentlyQueried.Add(1) }What you have here does seem to work, so no need to change it.
A circular linked list with 3 values is what we need.
pkg/kv/kvserver/storeliveness/supporter_state.go
line 154 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
s/Use/Accepts/
Done
pkg/kv/kvserver/storeliveness/testdata/requester_state
line 164 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Should we show here that heartbeats are still sent to store 2?
Done
pkg/kv/kvserver/storeliveness/support_manager_test.go
line 429 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
To avoid confusion in the future, should we add support for
blockOnWrite
totestBatch
as well?
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after the two remaining comments are addressed.
Reviewed 9 of 9 files at r4, 2 of 2 files at r5, 8 of 8 files at r6, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @miraradeva)
pkg/kv/kvserver/storeliveness/requester_state.go
line 166 at r4 (raw file):
rs := requestedSupport{ state: slpb.SupportState{Target: id, Epoch: rsh.requesterState.meta.MaxEpoch}, recentlyQueried: atomic.Int32{},
The zero value for atomic.Int32
is usable, so we don't need this line.
Arguably, we don't even need the rs.recentlyQueried.Store(active)
below either, but I'll defer to you on whether you'd think there's value in being explicit about the initial value of the field and not assuming that active
is the zero value.
pkg/kv/kvserver/storeliveness/requester_state.go
line 169 at r4 (raw file):
} // Adding a store is done in response to SupportFrom, so it's ok to set // recentlyQueried to true here. This also ensures the store will not be
s/true/active/
ce69163
to
a72c93e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten)
pkg/kv/kvserver/storeliveness/requester_state.go
line 166 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
The zero value for
atomic.Int32
is usable, so we don't need this line.Arguably, we don't even need the
rs.recentlyQueried.Store(active)
below either, but I'll defer to you on whether you'd think there's value in being explicit about the initial value of the field and not assuming thatactive
is the zero value.
I'll remove the initialization line, but I think I like the other assignment being explicit.
pkg/kv/kvserver/storeliveness/requester_state.go
line 169 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
s/true/active/
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 2 files at r8, 8 of 8 files at r9, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @miraradeva)
Previously, if a store had not appeared in a `SupportFrom` call for some period of time, it would get removed from the requester's support-from map. This introduced complex locking logic while removing the stores and not blocking other `SupportFrom` calls. This commit marks the stores as idle instead of actually removing them from the map. Heartbeats are sent only to non-idle stores. An added benefit of this approach is that support-from state is never lost (other than during restarts), which makes the algorithm easier to reason about. Part of: cockroachdb#125063 Release note: None
Previously, if a remote store was missing from the support-from map of a local support requester, and a heartbeat response arrives after the removal, the response would be processed. This breaks two invariants: (1) it adds an alternative way to add a store back to the map outside of a `SupportFrom` call, and (2) it's possible for the epoch in the message to exceed the epoch in the map (which is 0) by more than 1. The above don't impact the correctness of Store Liveness but they make the algorithm harder to reason about. This commit ensures the support requester ignores heartbeat responses from stores absent from the support-from map. Part of: cockroachdb#125063 Release note: None
Previously, the Support Manager would return upon encountering a persistence error, without resetting the {requester, supporter}StateForUpdate correctly. This resulted in a leaked update that is never reset and prevents future updates from being checked out subsequently. This commit consolidates the resetting logic with the checkIn/checkOut logic, which allows to test it in a single place. Fixes: cockroachdb#125063 Release note: None
a72c93e
to
f3bd773
Compare
TFTR! bors r=nvanbenschoten |
Build succeeded: |
This PR includes 3 commits: 2 of them fix a couple of
SupportManager
bugs, and the third one improves the process of stopping heartbeats for idle stores. See individual commits for more details.