release-23.1: kv: prevent lease interval regression during expiration-to-epoch promotion #130124

Merged

nvanbenschoten

@nvanbenschoten nvanbenschoten commented Sep 4, 2024

Backport:

Please see individual PRs for details.

/cc @cockroachdb/release


Release note (bug fix): Fixed a rare bug where a lease transfer could lead to a `side-transport update saw closed timestamp regression` panic. The bug could occur when a node was overloaded and failing to heartbeat its node liveness record.

Release justification: fixes rare, but serious panic.

tbg and others added 6 commits September 4, 2024 19:03
This can be triggered rapidly because each replica might call this as it tries
and fails to acquire a lease.
This commit adds a check that a replica does not perform a lease transfer if it
does not own the previous lease. This allows us to make a stronger assumption a
layer down.

Epic: None
Release note: None
…otion

Fixes cockroachdb#121480.
Fixes cockroachdb#122016.

This commit resolves a bug in the expiration-based to epoch-based lease
promotion transition, where the lease's effective expiration could be
allowed to regress. To prevent this, we detect when such cases are about
to occur and synchronously heartbeat the leaseholder's liveness record.
This works because the liveness record interval and the expiration-based
lease interval are the same, so a synchronous heartbeat ensures that the
liveness record has a later expiration than the prior lease by the time
the lease promotion goes into effect.

The code structure here leaves a lot to be desired, but since we're
going to be cleaning up and/or removing a lot of this code soon anyway,
I'm prioritizing backportability. This is therefore more targeted and
less general than it could be.

The resolution here also leaves something to be desired. A nicer fix
would be to introduce a minimum_lease_expiration field on epoch-based
leases so that we can locally ensure that the expiration does not
regress. This is what we plan to do for leader leases in the upcoming
release. We don't make this change because it would require a version
gate to avoid replica divergence, so it would not be backportable.

Release note (bug fix): Fixed a rare bug where a lease transfer could
lead to a `side-transport update saw closed timestamp regression` panic.
The bug could occur when a node was overloaded and failing to heartbeat
its node liveness record.
This commit adds a check that `args.PrevLease` is equivalent to
`cArgs.EvalCtx.GetLease()` to RequestLease. This ensures that the
validation here is consistent with the validation that was performed
when the lease request was constructed.
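
The shape of such a check can be sketched as follows. The `Lease` struct, the `sameLease` helper, and `evalRequestLease` are simplified, hypothetical stand-ins; the real evaluation code and its notion of lease equivalence are more involved.

```go
package main

import "fmt"

// Lease is a hypothetical, simplified lease descriptor.
type Lease struct {
	Sequence int64
	StoreID  int32
	Epoch    int64
}

// sameLease is an illustrative equivalence check: field-by-field equality.
func sameLease(a, b Lease) bool {
	return a == b
}

// evalRequestLease rejects the request when the previous lease named in the
// request no longer matches the lease the replica currently holds, so that
// evaluation relies on the same state the request was built from.
func evalRequestLease(current, prev, next Lease) (Lease, error) {
	if !sameLease(prev, current) {
		return current, fmt.Errorf("prev lease %+v does not match current lease %+v", prev, current)
	}
	return next, nil
}

func main() {
	current := Lease{Sequence: 5, StoreID: 1, Epoch: 3}
	stale := Lease{Sequence: 4, StoreID: 2, Epoch: 7}
	next := Lease{Sequence: 6, StoreID: 1, Epoch: 3}

	if _, err := evalRequestLease(current, stale, next); err != nil {
		fmt.Println("rejected:", err)
	}
	if l, err := evalRequestLease(current, current, next); err == nil {
		fmt.Println("applied sequence:", l.Sequence)
	}
}
```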

Release note: None
Epic: None
This commit deflakes the test by waiting for N1's view of N2's lease
expiration to match N2's view. This is important in the rare case
where N1 tries to increase N2's epoch, but it has a stale view of
the lease expiration time.

Epic: None

Release note: None
A race could occur when a replica queue and post lease application both
attempted to switch the lease type. This race would cause the queue to
not process the replica because the lease type had already changed. As a
result, lease preference violations might not have been quickly
resolved by the lease queue.

Read the lease under the same mutex used for requesting the lease, when
possibly switching the lease type.

Resolves: cockroachdb#123998
Release note: None
blathers-crl bot commented Sep 4, 2024

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Backports should only be created for serious issues or test-only changes.
  • Backports should not break backwards-compatibility.
  • Backports should change as little code as possible.
  • Backports should not change on-disk formats or node communication protocols.
  • Backports should not add new functionality (except as defined here).
  • Backports must not add, edit, or otherwise modify cluster versions; or add version gates.
  • All backports must be reviewed by the owning area's TL. For more information as to how that review should be conducted, please consult the backport policy.
If your backport adds new functionality, please ensure that the following additional criteria are satisfied:
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters. State changes must be further protected such that nodes running old binaries will not be negatively impacted by the new state (with a mixed version test added).
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.
  • Your backport must be accompanied by a post to the appropriate Slack
    channel (#db-backports-point-releases or #db-backports-XX-X-release) for awareness and discussion.

Also, please add a brief release justification to the body of your PR to justify this
backport.

@blathers-crl blathers-crl bot added the backport label Sep 4, 2024
@nvanbenschoten nvanbenschoten merged commit 79d3d47 into cockroachdb:release-23.1 Sep 5, 2024
6 checks passed
@nvanbenschoten nvanbenschoten deleted the backport23.1-123442 branch September 5, 2024 19:05