Skip to content
This repository has been archived by the owner on Jan 22, 2025. It is now read-only.

v1.17: Change getHealth to compare optimistically confirmed slots (backport of #33651) #33714

Merged
merged 1 commit into from
Oct 16, 2023

Conversation

mergify[bot]
Copy link
Contributor

@mergify mergify bot commented Oct 16, 2023

This is an automatic backport of pull request #33651 done by Mergify.


Mergify commands and options

More conditions and actions can be found in the documentation.

You can also trigger Mergify actions by commenting on this pull request:

  • @Mergifyio refresh will re-evaluate the rules
  • @Mergifyio rebase will rebase this PR on its base branch
  • @Mergifyio update will merge the base branch into this PR
  • @Mergifyio backport <destination> will backport this PR on <destination> branch

Additionally, on Mergify dashboard you can:

  • look at your merge queues
  • generate the Mergify configuration with the config editor.

Finally, you can contact us on https://mergify.com

The current getHealth mechanism checks a local accounts hash slot vs.
those of other nodes as specified by --known-validator. This is a
very coarse comparison given that the default for this value is 100
slots. More so, any nodes using a value larger than the default
(ie --incremental-snapshot-interval 500) will likely see getHealth
return status behind at some point.

Change the underlying mechanism of how health is computed. Instead of
using the accounts hash slots published in gossip, use the latest
optimistically confirmed slot from the cluster. Even when a node is
behind, it is able to observe cluster optimistically confirmed by slots
by viewing votes published in gossip.

Thus, the latest cluster optimistically confirmed slot can be compared
against the latest optimistically confirmed bank from replay to
determine health. This new comparison is much more granular, and not
needing to depend on individual known validators is also a plus.

(cherry picked from commit 8bd0e4c)
@codecov
Copy link

codecov bot commented Oct 16, 2023

Codecov Report

Merging #33714 (76ac0ac) into v1.17 (02f9c97) will decrease coverage by 0.1%.
The diff coverage is 97.1%.

@@            Coverage Diff            @@
##            v1.17   #33714     +/-   ##
=========================================
- Coverage    81.7%    81.7%   -0.1%     
=========================================
  Files         804      804             
  Lines      218124   218103     -21     
=========================================
- Hits       178412   178352     -60     
- Misses      39712    39751     +39     

@steviez
Copy link
Contributor

steviez commented Oct 16, 2023

Me adding the v1.17 label to the original PR implies my support for a v1.17 BP. Operators have been running into problems with getHealth for a while so this PR will make things better for the folks that have been hitting issues.

@steviez steviez requested review from t-nelson and jbiseda October 16, 2023 17:41
@t-nelson
Copy link
Contributor

1.16 viable?

rpc/src/rpc_health.rs Show resolved Hide resolved
@steviez
Copy link
Contributor

steviez commented Oct 16, 2023

1.16 viable?

I believe so, yes. I can specifically test this BP on-top of the latest v1.16 tag. I feel confident in the testing I did, but not sure if we'd want some soak time on v1.17 first given where v1.16 is in its' release cycle.

Another option to get more runtime would be a quick post in mb-validators + asking for volunteers to try it out (namely, anyone who encountered issues in the past)

@t-nelson
Copy link
Contributor

imo, this is fixing a regression in 1.16, so good to backport assuming the changes aren't totally reckless. which in this case they're mostly contained and clean

@steviez steviez merged commit 05ebb1f into v1.17 Oct 16, 2023
@steviez steviez deleted the mergify/bp/v1.17/pr-33651 branch October 16, 2023 18:04
@steviez
Copy link
Contributor

steviez commented Oct 16, 2023

imo, this is fixing a regression in 1.16, so good to backport assuming the changes aren't totally reckless. which in this case they're mostly contained and clean

True on fixing a regression. I'll start the BP to get CI rolling + do some manual testing on my end in parallel

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants