Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storeliveness: SupportManager improvements and bug fixes #130101

Merged
merged 3 commits into from
Sep 6, 2024

Conversation

miraradeva
Copy link
Contributor

@miraradeva miraradeva commented Sep 4, 2024

This PR includes 3 commits: 2 of them fix a couple of SupportManager bugs, and the third one improves the process of stopping heartbeats for idle stores. See individual commits for more details.

@miraradeva miraradeva requested a review from a team as a code owner September 4, 2024 18:24
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@miraradeva miraradeva force-pushed the mira-125063-errors branch 4 times, most recently from 2d1dfb0 to 8547230 Compare September 5, 2024 16:29
Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 4 files at r1, 2 of 2 files at r2, 8 of 8 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miraradeva)


-- commits line 24 at r2:
"invariants"


-- commits line 27 at r2:
s/it/is/


-- commits line 28 at r2:
nit: the wording here is a bit off regarding singular vs. plural. "impact", "they"


pkg/kv/kvserver/storeliveness/requester_state.go line 49 at r1 (raw file):

	// needs to lock requesterStateHandler.mu for reading to ensure that no new
	// stores are added to the supportFrom map.
	recentlyQueried atomic.Value // recentlyQueriedState

atomic.Value boxes values in an interface and incurs heap allocation costs as a result. It's not actually generic. We should use an atomic.Int32 instead.


pkg/kv/kvserver/storeliveness/requester_state.go line 191 at r1 (raw file):

	defer rsh.mu.RUnlock()
	for _, rs := range rsh.requesterState.supportFrom {
		if !rs.recentlyQueried.CompareAndSwap(active, inactive) {

It's tempting to use atomic.Int32.Add(1) here to avoid having to think about races with these two CASs and concurrent Stores, but that doesn't quite work.

I guess the following does, but I doubt we want it:

if rs.recentlyQueried.Load() != idle {
    rs.recentlyQueried.Add(1)
}

What you have here does seem to work, so no need to change it.


pkg/kv/kvserver/storeliveness/supporter_state.go line 154 at r3 (raw file):

// write writes the supporter meta and supportFor to disk if they changed in
// this update. Use a batch to avoid potentially writing multiple support states

s/Use/Accepts/


pkg/kv/kvserver/storeliveness/testdata/requester_state line 164 at r1 (raw file):

support from:
{Target:{NodeID:2 StoreID:2} Epoch:2 Expiration:410.000000000,0}

Should we show here that heartbeats are still sent to store 2?


pkg/kv/kvserver/storeliveness/support_manager_test.go line 429 at r3 (raw file):

type testBatch struct {
	storage.Batch
	errorOnWrite bool

To avoid confusion in the future, should we add support for blockOnWrite to testBatch as well?

Copy link
Contributor Author

@miraradeva miraradeva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


-- commits line 24 at r2:

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"invariants"

Done


-- commits line 27 at r2:

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/it/is/

Done


-- commits line 28 at r2:

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: the wording here is a bit off regarding singular vs. plural. "impact", "they"

Done


pkg/kv/kvserver/storeliveness/requester_state.go line 49 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

atomic.Value boxes values in an interface and incurs heap allocation costs as a result. It's not actually generic. We should use an atomic.Int32 instead.

Done


pkg/kv/kvserver/storeliveness/requester_state.go line 191 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

It's tempting to use atomic.Int32.Add(1) here to avoid having to think about races with these two CASs and concurrent Stores, but that doesn't quite work.

I guess the following does, but I doubt we want it:

if rs.recentlyQueried.Load() != idle {
    rs.recentlyQueried.Add(1)
}

What you have here does seem to work, so no need to change it.

A circular linked list with 3 values is what we need.


pkg/kv/kvserver/storeliveness/supporter_state.go line 154 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/Use/Accepts/

Done


pkg/kv/kvserver/storeliveness/testdata/requester_state line 164 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should we show here that heartbeats are still sent to store 2?

Done


pkg/kv/kvserver/storeliveness/support_manager_test.go line 429 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

To avoid confusion in the future, should we add support for blockOnWrite to testBatch as well?

Done

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: after the two remaining comments are addressed.

Reviewed 9 of 9 files at r4, 2 of 2 files at r5, 8 of 8 files at r6, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @miraradeva)


pkg/kv/kvserver/storeliveness/requester_state.go line 166 at r4 (raw file):

		rs := requestedSupport{
			state:           slpb.SupportState{Target: id, Epoch: rsh.requesterState.meta.MaxEpoch},
			recentlyQueried: atomic.Int32{},

The zero value for atomic.Int32 is usable, so we don't need this line.

Arguably, we don't even need the rs.recentlyQueried.Store(active) below either, but I'll defer to you on whether you'd think there's value in being explicit about the initial value of the field and not assuming that active is the zero value.


pkg/kv/kvserver/storeliveness/requester_state.go line 169 at r4 (raw file):

		}
		// Adding a store is done in response to SupportFrom, so it's ok to set
		// recentlyQueried to true here. This also ensures the store will not be

s/true/active/

Copy link
Contributor Author

@miraradeva miraradeva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten)


pkg/kv/kvserver/storeliveness/requester_state.go line 166 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

The zero value for atomic.Int32 is usable, so we don't need this line.

Arguably, we don't even need the rs.recentlyQueried.Store(active) below either, but I'll defer to you on whether you'd think there's value in being explicit about the initial value of the field and not assuming that active is the zero value.

I'll remove the initialization line, but I think I like the other assignment being explicit.


pkg/kv/kvserver/storeliveness/requester_state.go line 169 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/true/active/

Done

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 2 files at r8, 8 of 8 files at r9, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @miraradeva)

Previously, if a store had not appeared in a `SupportFrom` call for some
period of time, it would get removed from the requester's support-from
map. This introduced complex locking logic while removing the stores
and not blocking other `SupportFrom` calls.

This commit marks the stores as idle instead of actually removing them
from the map. Heartbeats are sent only to non-idle stores. An added
benefit of this approach is that support-from state is never lost
(other than during restarts), which makes the algorithm easier to
reason about.

Part of: cockroachdb#125063

Release note: None
Previously, if a remote store was missing from the support-from map of
a local support requester, and a heartbeat response arrives after the
removal, the response would be processed. This breaks two invariants:
(1) it adds an alternative way to add a store back to the map outside
of a `SupportFrom` call, and (2) it's possible for the epoch in the
message to exceed the epoch in the map (which is 0) by more than 1.
The above don't impact the correctness of Store Liveness but they
make the algorithm harder to reason about.

This commit ensures the support requester ignores heartbeat responses
from stores absent from the support-from map.

Part of: cockroachdb#125063

Release note: None
Previously, the Support Manager would return upon encountering a
persistence error, without resetting the
{requester, supporter}StateForUpdate correctly. This resulted in a
leaked update that is never reset and prevents future updates from
being checked out subsequently.

This commit consolidates the resetting logic with the checkIn/checkOut
logic, which allows to test it in a single place.

Fixes: cockroachdb#125063

Release note: None
@miraradeva
Copy link
Contributor Author

TFTR!

bors r=nvanbenschoten

@craig craig bot merged commit 870c5fb into cockroachdb:master Sep 6, 2024
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants