Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: enqueue decom ranges at an interval behind a setting #130117

Merged

Conversation

kvoli
Copy link
Collaborator

@kvoli kvoli commented Sep 4, 2024

Introduce the ranges.decommissioning gauge metric, which represents
the number of ranges with at least one replica on a decommissioning
node.

The metric is reported by the leaseholder, or if there is no valid
leaseholder, the first live replica in the descriptor, similar to
(under|over)-replication metrics.

The metric can be used to approximately identify the distribution of
decommissioning work remaining across nodes, as the leaseholder replica
is responsible for triggering the replacement of decommissioning
replicas for its own range.

Informs: #130085
Release note (ops change): The ranges.decommissioning metric is added,
representing the number of ranges which have a replica on a
decommissioning node.


When kv.enqueue_in_replicate_queue_on_problem.interval is set to a
positive non-zero value, leaseholder replicas of ranges which are
underreplicated will be enqueued into the replicate queue every
kv.enqueue_in_replicate_queue_on_problem.interval interval.

When kv.enqueue_in_replicate_queue_on_problem.interval is set to 0,
no enqueueing on underreplication will take place, outside of the
regular replica scanner.

A recommended value for users enabling the enqueue (non-zero), is 15
minutes e.g.,

SET CLUSTER SETTING
kv.enqueue_in_replicate_queue_on_problem.interval='900s'

Resolves: #130085
Release note (ops change): The ranges.decommissioning metric is added,
representing the number of ranges which have a replica on a
decommissioning node.

@kvoli kvoli self-assigned this Sep 4, 2024
Copy link

blathers-crl bot commented Sep 4, 2024

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@kvoli kvoli force-pushed the 240904.enqueue-decommissioning-replicas-2 branch 2 times, most recently from a289462 to 0b8287f Compare September 4, 2024 21:19
Introduce the `ranges.decommissioning` gauge metric, which represents
the number of ranges with at least one replica on a decommissioning
node.

The metric is reported by the leaseholder, or if there is no valid
leaseholder, the first live replica in the descriptor, similar to
(under|over)-replication metrics.

The metric can be used to approximately identify the distribution of
decommissioning work remaining across nodes, as the leaseholder replica
is responsible for triggering the replacement of decommissioning
replicas for its own range.

Informs: cockroachdb#130085
Release note (ops change): The `ranges.decommissioning` metric is added,
representing the number of ranges which have a replica on a
decommissioning node.
@kvoli kvoli force-pushed the 240904.enqueue-decommissioning-replicas-2 branch 2 times, most recently from 7fbf34e to e5a15f1 Compare September 4, 2024 23:01
@kvoli kvoli changed the title kvserver: enqueue underrepl ranges at an interval disabled by default kvserver: enqueue decom ranges at an interval behind a setting Sep 4, 2024
@kvoli kvoli force-pushed the 240904.enqueue-decommissioning-replicas-2 branch from e5a15f1 to 4b6600c Compare September 4, 2024 23:05
@kvoli kvoli marked this pull request as ready for review September 4, 2024 23:24
@kvoli kvoli requested a review from a team as a code owner September 4, 2024 23:24
@kvoli kvoli added backport-23.1.x Flags PRs that need to be backported to 23.1 backport-23.2.x Flags PRs that need to be backported to 23.2. backport-24.1.x Flags PRs that need to be backported to 24.1. backport-24.2.x Flags PRs that need to be backported to 24.2 labels Sep 4, 2024
Copy link
Collaborator

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Thanks for digging into this by the way!

The patch itself looks good to merge and backport, esp given it's default off. Do you see a world where we turn this default on and subsume the logic introduced in #80993 into maybeEnqueueProblemRange? It should be doable with a bit of tweaking, right?

Separately, it feels a bit unsatisfying that there's a bug lurking here somewhere which we haven't found.

Reviewed 5 of 5 files at r1, 5 of 5 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @kvoli)


pkg/kv/kvserver/replica.go line 2581 at r2 (raw file):

// NOT be called unless the range is known to require action e.g.,
// decommissioning|underreplicated.
func (r *Replica) maybeEnqueueProblemRange(

Do we want to add some commentary about the motivation for introducing this method and why it isn't needed in the bug free case (which will also explain to readers why it is default off)?

When `kv.enqueue_in_replicate_queue_on_problem.interval` is set to a
positive non-zero value, leaseholder replicas of ranges which have
decommissioning replicas will be enqueued into the replicate queue every
`kv.enqueue_in_replicate_queue_on_problem.interval` interval.

When `kv.enqueue_in_replicate_queue_on_problem.interval` is set to 0,
no enqueueing on decommissioning will take place, outside of the regular
replica scanner.

A recommended value for users enabling the enqueue (non-zero), is at
least 15 minutes e.g.,

```
SET CLUSTER SETTING
kv.enqueue_in_replicate_queue_on_problem.interval='900s'
```

Resolves: cockroachdb#130085
Informs: cockroachdb#130199
Release note: None
@kvoli kvoli force-pushed the 240904.enqueue-decommissioning-replicas-2 branch from 4b6600c to 637cd5d Compare September 5, 2024 22:41
Copy link
Collaborator Author

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TYFTR!

Do you see a world where we turn this default on and subsume the logic introduced in #80993 into maybeEnqueueProblemRange? It should be doable with a bit of tweaking, right?

I do see a world where that occurs, probably on a new async task goroutine which iterates over every replica on the node at most every 30s-60s+ (adding a validator to the setting to prevent configuring less than that interval).

Separately, it feels a bit unsatisfying that there's a bug lurking here somewhere which we haven't found.

Indeed, I'd much rather we fix the bug than ship this patch realistically unless we can reproduce the problem ourselves that seems unlikely. I filed #130199 for tracking.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @arulajmani)


pkg/kv/kvserver/replica.go line 2581 at r2 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Do we want to add some commentary about the motivation for introducing this method and why it isn't needed in the bug free case (which will also explain to readers why it is default off)?

Good idea, I've added it here and also filed the issue for the bug #130199.

@kvoli
Copy link
Collaborator Author

kvoli commented Sep 5, 2024

bors r=arulajmani

@craig craig bot merged commit b175bad into cockroachdb:master Sep 6, 2024
23 checks passed
Copy link

blathers-crl bot commented Sep 6, 2024

Based on the specified backports for this PR, I applied new labels to the following linked issue(s). Please adjust the labels as needed to match the branches actually affected by the issue(s), including adding any known older branches.


Issue #130085: branch-release-23.1, branch-release-23.2, branch-release-24.1, branch-release-24.2.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Copy link

blathers-crl bot commented Sep 6, 2024

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from a98fe59 to blathers/backport-release-23.1-130117: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.


error creating merge commit from a98fe59 to blathers/backport-release-23.2-130117: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.2.x failed. See errors above.


error creating merge commit from a98fe59 to blathers/backport-release-24.1-130117: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 24.1.x failed. See errors above.


error creating merge commit from a98fe59 to blathers/backport-release-24.2-130117: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 24.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@kvoli
Copy link
Collaborator Author

kvoli commented Sep 6, 2024

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from a98fe59 to blathers/backport-release-23.1-130117: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.


error creating merge commit from a98fe59 to blathers/backport-release-23.2-130117: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.2.x failed. See errors above.


error creating merge commit from a98fe59 to blathers/backport-release-24.1-130117: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 24.1.x failed. See errors above.


error creating merge commit from a98fe59 to blathers/backport-release-24.2-130117: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 24.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

This will be fun. I'll cherry-pick the commits onto the release branches and sort out the conflicts tomorrow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1 backport-23.2.x Flags PRs that need to be backported to 23.2. backport-24.1.x Flags PRs that need to be backported to 24.1. backport-24.2.x Flags PRs that need to be backported to 24.2
Projects
None yet
Development

Successfully merging this pull request may close these issues.

server: add configurable decommission nudger
3 participants