Skip to content
This repository has been archived by the owner on Jan 13, 2025. It is now read-only.

v1.16: Change getHealth to compare optimistically confirmed slots (backport of #33651) #33716

Merged
merged 2 commits into from
Oct 20, 2023

Conversation

mergify[bot]
Copy link
Contributor

@mergify mergify bot commented Oct 16, 2023

This is an automatic backport of pull request #33651 done by Mergify.
Cherry-pick of 8bd0e4c has failed:

On branch mergify/bp/v1.16/pr-33651
Your branch is up to date with 'origin/v1.16'.

You are currently cherry-picking commit 8bd0e4cd95.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   core/src/validator.rs
	modified:   docs/src/api/http.md
	modified:   docs/src/api/methods/_getHealth.mdx
	modified:   rpc/src/rpc_health.rs
	modified:   rpc/src/rpc_service.rs
	modified:   validator/src/cli.rs
	modified:   validator/src/main.rs

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   rpc/src/rpc.rs

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/github/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally


Mergify commands and options

More conditions and actions can be found in the documentation.

You can also trigger Mergify actions by commenting on this pull request:

  • @Mergifyio refresh will re-evaluate the rules
  • @Mergifyio rebase will rebase this PR on its base branch
  • @Mergifyio update will merge the base branch into this PR
  • @Mergifyio backport <destination> will backport this PR on <destination> branch

Additionally, on Mergify dashboard you can:

  • look at your merge queues
  • generate the Mergify configuration with the config editor.

Finally, you can contact us on https://mergify.com

The current getHealth mechanism checks a local accounts hash slot vs.
those of other nodes as specified by --known-validator. This is a
very coarse comparison given that the default for this value is 100
slots. More so, any nodes using a value larger than the default
(ie --incremental-snapshot-interval 500) will likely see getHealth
return status behind at some point.

Change the underlying mechanism of how health is computed. Instead of
using the accounts hash slots published in gossip, use the latest
optimistically confirmed slot from the cluster. Even when a node is
behind, it is able to observe cluster optimistically confirmed by slots
by viewing votes published in gossip.

Thus, the latest cluster optimistically confirmed slot can be compared
against the latest optimistically confirmed bank from replay to
determine health. This new comparison is much more granular, and not
needing to depend on individual known validators is also a plus.

(cherry picked from commit 8bd0e4c)

# Conflicts:
#	rpc/src/rpc.rs
@mergify mergify bot added the conflicts label Oct 16, 2023
@steviez steviez force-pushed the mergify/bp/v1.16/pr-33651 branch 3 times, most recently from 4daf22e to fd2c780 Compare October 16, 2023 20:16
@steviez steviez force-pushed the mergify/bp/v1.16/pr-33651 branch from fd2c780 to 9e78635 Compare October 16, 2023 20:25
@codecov
Copy link

codecov bot commented Oct 16, 2023

Codecov Report

Merging #33716 (9e78635) into v1.16 (08876b9) will increase coverage by 0.0%.
The diff coverage is 97.2%.

@@           Coverage Diff           @@
##            v1.16   #33716   +/-   ##
=======================================
  Coverage    81.8%    81.8%           
=======================================
  Files         766      766           
  Lines      209091   209073   -18     
=======================================
+ Hits       171144   171161   +17     
+ Misses      37947    37912   -35     

@steviez
Copy link
Contributor

steviez commented Oct 16, 2023

Per discussion elsewhere, we're considering this PR for backport based on the following criteria:

  • The change itself is pretty self-contained
  • The change addresses a regression in the reliability of getHealth that became more apparent in v1.16
    • More apparent in v1.16 with the default of putting accounts index on disk

I'm going to run this commit locally for a bit as an extra sanity check; after that, I'll reply back here and request reviewers to take a look

@steviez
Copy link
Contributor

steviez commented Oct 17, 2023

Here are the catchup logs from a node that was running this change with incremental snapshot interval of 1000 slots + health check distance of 25 slots. Again, I had solana-validator monitor running in a tmux session which issues a getHealth request every 2 seconds for updating the dashboard. I didn't hack in the metric to report these, but eye-balling it, you can see a smooth ramp up as the distance shrinks down to within the health distance

[2023-10-17T05:53:43.700589599Z ...] health check: behind by 983 slots: me=224178648, latest cluster=224179631
[2023-10-17T05:53:45.701774217Z ...] health check: behind by 976 slots: me=224178664, latest cluster=224179640
[2023-10-17T05:53:47.703151300Z ...] health check: behind by 967 slots: me=224178677, latest cluster=224179644
[2023-10-17T05:53:49.704430819Z ...] health check: behind by 957 slots: me=224178692, latest cluster=224179649
[2023-10-17T05:53:51.705498874Z ...] health check: behind by 949 slots: me=224178705, latest cluster=224179654
[2023-10-17T05:53:53.706725211Z ...] health check: behind by 943 slots: me=224178717, latest cluster=224179660
[2023-10-17T05:53:55.707977217Z ...] health check: behind by 933 slots: me=224178732, latest cluster=224179665
[2023-10-17T05:53:57.709067355Z ...] health check: behind by 922 slots: me=224178747, latest cluster=224179669
[2023-10-17T05:53:59.710241282Z ...] health check: behind by 913 slots: me=224178761, latest cluster=224179674
[2023-10-17T05:54:01.713162505Z ...] health check: behind by 907 slots: me=224178773, latest cluster=224179680
[2023-10-17T05:54:03.714523909Z ...] health check: behind by 902 slots: me=224178783, latest cluster=224179685
[2023-10-17T05:54:05.715704349Z ...] health check: behind by 900 slots: me=224178789, latest cluster=224179689
[2023-10-17T05:54:07.717163129Z ...] health check: behind by 894 slots: me=224178799, latest cluster=224179693
[2023-10-17T05:54:09.718236645Z ...] health check: behind by 887 slots: me=224178812, latest cluster=224179699
[2023-10-17T05:54:11.719584143Z ...] health check: behind by 878 slots: me=224178825, latest cluster=224179703
[2023-10-17T05:54:13.720842291Z ...] health check: behind by 868 slots: me=224178840, latest cluster=224179708
[2023-10-17T05:54:15.722065372Z ...] health check: behind by 863 slots: me=224178851, latest cluster=224179714
[2023-10-17T05:54:17.723183313Z ...] health check: behind by 854 slots: me=224178865, latest cluster=224179719
[2023-10-17T05:54:19.724551551Z ...] health check: behind by 847 slots: me=224178877, latest cluster=224179724
[2023-10-17T05:54:21.726344517Z ...] health check: behind by 836 slots: me=224178892, latest cluster=224179728
[2023-10-17T05:54:23.727742862Z ...] health check: behind by 828 slots: me=224178906, latest cluster=224179734
[2023-10-17T05:54:25.729200430Z ...] health check: behind by 821 slots: me=224178919, latest cluster=224179740
[2023-10-17T05:54:27.730559500Z ...] health check: behind by 812 slots: me=224178932, latest cluster=224179744
[2023-10-17T05:54:29.731938028Z ...] health check: behind by 803 slots: me=224178945, latest cluster=224179748
[2023-10-17T05:54:31.733381319Z ...] health check: behind by 800 slots: me=224178955, latest cluster=224179755
[2023-10-17T05:54:33.734538334Z ...] health check: behind by 793 slots: me=224178966, latest cluster=224179759
[2023-10-17T05:54:35.735607172Z ...] health check: behind by 784 slots: me=224178979, latest cluster=224179763
[2023-10-17T05:54:37.736667915Z ...] health check: behind by 776 slots: me=224178992, latest cluster=224179768
[2023-10-17T05:54:39.737918469Z ...] health check: behind by 769 slots: me=224179005, latest cluster=224179774
[2023-10-17T05:54:41.739613399Z ...] health check: behind by 761 slots: me=224179017, latest cluster=224179778
[2023-10-17T05:54:43.740797467Z ...] health check: behind by 754 slots: me=224179029, latest cluster=224179783
[2023-10-17T05:54:45.743715795Z ...] health check: behind by 744 slots: me=224179044, latest cluster=224179788
[2023-10-17T05:54:47.744870557Z ...] health check: behind by 742 slots: me=224179053, latest cluster=224179795
[2023-10-17T05:54:49.746268472Z ...] health check: behind by 725 slots: me=224179073, latest cluster=224179798
[2023-10-17T05:54:51.747630308Z ...] health check: behind by 714 slots: me=224179089, latest cluster=224179803
[2023-10-17T05:54:53.748898466Z ...] health check: behind by 705 slots: me=224179104, latest cluster=224179809
[2023-10-17T05:54:55.750107802Z ...] health check: behind by 697 slots: me=224179117, latest cluster=224179814
[2023-10-17T05:54:57.751258977Z ...] health check: behind by 687 slots: me=224179132, latest cluster=224179819
[2023-10-17T05:54:59.752620323Z ...] health check: behind by 673 slots: me=224179148, latest cluster=224179821
[2023-10-17T05:55:01.754342916Z ...] health check: behind by 663 slots: me=224179161, latest cluster=224179824
[2023-10-17T05:55:03.755754807Z ...] health check: behind by 656 slots: me=224179173, latest cluster=224179829
[2023-10-17T05:55:05.757197908Z ...] health check: behind by 647 slots: me=224179187, latest cluster=224179834
[2023-10-17T05:55:07.758611073Z ...] health check: behind by 640 slots: me=224179198, latest cluster=224179838
[2023-10-17T05:55:09.759881014Z ...] health check: behind by 640 slots: me=224179205, latest cluster=224179845
[2023-10-17T05:55:11.761178108Z ...] health check: behind by 633 slots: me=224179217, latest cluster=224179850
[2023-10-17T05:55:13.762437540Z ...] health check: behind by 625 slots: me=224179229, latest cluster=224179854
[2023-10-17T05:55:15.763731547Z ...] health check: behind by 622 slots: me=224179237, latest cluster=224179859
[2023-10-17T05:55:17.765136035Z ...] health check: behind by 616 slots: me=224179248, latest cluster=224179864
[2023-10-17T05:55:19.766496910Z ...] health check: behind by 609 slots: me=224179260, latest cluster=224179869
[2023-10-17T05:55:21.768351655Z ...] health check: behind by 602 slots: me=224179271, latest cluster=224179873
[2023-10-17T05:55:23.769528740Z ...] health check: behind by 592 slots: me=224179284, latest cluster=224179876
[2023-10-17T05:55:25.770708481Z ...] health check: behind by 588 slots: me=224179297, latest cluster=224179885
[2023-10-17T05:55:27.772047023Z ...] health check: behind by 578 slots: me=224179311, latest cluster=224179889
[2023-10-17T05:55:29.773222606Z ...] health check: behind by 569 slots: me=224179323, latest cluster=224179892
[2023-10-17T05:55:31.774437223Z ...] health check: behind by 564 slots: me=224179335, latest cluster=224179899
[2023-10-17T05:55:33.775556679Z ...] health check: behind by 559 slots: me=224179345, latest cluster=224179904
[2023-10-17T05:55:35.776760234Z ...] health check: behind by 556 slots: me=224179352, latest cluster=224179908
[2023-10-17T05:55:37.777891112Z ...] health check: behind by 552 slots: me=224179361, latest cluster=224179913
[2023-10-17T05:55:39.779145816Z ...] health check: behind by 550 slots: me=224179369, latest cluster=224179919
[2023-10-17T05:55:41.781470595Z ...] health check: behind by 548 slots: me=224179376, latest cluster=224179924
[2023-10-17T05:55:43.783121162Z ...] health check: behind by 545 slots: me=224179384, latest cluster=224179929
[2023-10-17T05:55:45.785704274Z ...] health check: behind by 536 slots: me=224179395, latest cluster=224179931
[2023-10-17T05:55:47.786965549Z ...] health check: behind by 535 slots: me=224179405, latest cluster=224179940
[2023-10-17T05:55:49.788343788Z ...] health check: behind by 527 slots: me=224179417, latest cluster=224179944
[2023-10-17T05:55:51.789748498Z ...] health check: behind by 523 slots: me=224179425, latest cluster=224179948
[2023-10-17T05:55:53.790994835Z ...] health check: behind by 514 slots: me=224179435, latest cluster=224179949
[2023-10-17T05:55:55.792198722Z ...] health check: behind by 517 slots: me=224179443, latest cluster=224179960
[2023-10-17T05:55:57.793506326Z ...] health check: behind by 512 slots: me=224179452, latest cluster=224179964
[2023-10-17T05:55:59.794775037Z ...] health check: behind by 508 slots: me=224179459, latest cluster=224179967
[2023-10-17T05:56:01.796394164Z ...] health check: behind by 502 slots: me=224179468, latest cluster=224179970
[2023-10-17T05:56:03.797793754Z ...] health check: behind by 503 slots: me=224179476, latest cluster=224179979
[2023-10-17T05:56:05.798945351Z ...] health check: behind by 492 slots: me=224179492, latest cluster=224179984
[2023-10-17T05:56:07.800028609Z ...] health check: behind by 477 slots: me=224179511, latest cluster=224179988
[2023-10-17T05:56:09.801193031Z ...] health check: behind by 467 slots: me=224179524, latest cluster=224179991
[2023-10-17T05:56:11.802244829Z ...] health check: behind by 463 slots: me=224179536, latest cluster=224179999
[2023-10-17T05:56:13.803454518Z ...] health check: behind by 458 slots: me=224179545, latest cluster=224180003
[2023-10-17T05:56:15.804952314Z ...] health check: behind by 448 slots: me=224179559, latest cluster=224180007
[2023-10-17T05:56:17.806320094Z ...] health check: behind by 443 slots: me=224179571, latest cluster=224180014
[2023-10-17T05:56:19.807780309Z ...] health check: behind by 435 slots: me=224179583, latest cluster=224180018
[2023-10-17T05:56:21.809528352Z ...] health check: behind by 428 slots: me=224179595, latest cluster=224180023
[2023-10-17T05:56:23.810996152Z ...] health check: behind by 418 slots: me=224179609, latest cluster=224180027
[2023-10-17T05:56:25.812291915Z ...] health check: behind by 413 slots: me=224179621, latest cluster=224180034
[2023-10-17T05:56:27.813643624Z ...] health check: behind by 406 slots: me=224179631, latest cluster=224180037
[2023-10-17T05:56:29.814963322Z ...] health check: behind by 404 slots: me=224179639, latest cluster=224180043
[2023-10-17T05:56:31.816156900Z ...] health check: behind by 390 slots: me=224179656, latest cluster=224180046
[2023-10-17T05:56:33.817525542Z ...] health check: behind by 380 slots: me=224179671, latest cluster=224180051
[2023-10-17T05:56:35.818553484Z ...] health check: behind by 370 slots: me=224179685, latest cluster=224180055
[2023-10-17T05:56:37.819825913Z ...] health check: behind by 360 slots: me=224179700, latest cluster=224180060
[2023-10-17T05:56:39.823171637Z ...] health check: behind by 347 slots: me=224179717, latest cluster=224180064
[2023-10-17T05:56:41.825044148Z ...] health check: behind by 341 slots: me=224179731, latest cluster=224180072
[2023-10-17T05:56:43.826341513Z ...] health check: behind by 334 slots: me=224179742, latest cluster=224180076
[2023-10-17T05:56:45.827510475Z ...] health check: behind by 325 slots: me=224179755, latest cluster=224180080
[2023-10-17T05:56:47.828723090Z ...] health check: behind by 318 slots: me=224179768, latest cluster=224180086
[2023-10-17T05:56:49.829901499Z ...] health check: behind by 309 slots: me=224179782, latest cluster=224180091
[2023-10-17T05:56:51.831124244Z ...] health check: behind by 300 slots: me=224179795, latest cluster=224180095
[2023-10-17T05:56:53.832181673Z ...] health check: behind by 289 slots: me=224179810, latest cluster=224180099
[2023-10-17T05:56:55.833313965Z ...] health check: behind by 283 slots: me=224179823, latest cluster=224180106
[2023-10-17T05:56:57.834601943Z ...] health check: behind by 275 slots: me=224179836, latest cluster=224180111
[2023-10-17T05:56:59.835664984Z ...] health check: behind by 264 slots: me=224179850, latest cluster=224180114
[2023-10-17T05:57:01.836947662Z ...] health check: behind by 253 slots: me=224179865, latest cluster=224180118
[2023-10-17T05:57:03.838419781Z ...] health check: behind by 247 slots: me=224179879, latest cluster=224180126
[2023-10-17T05:57:05.839937817Z ...] health check: behind by 237 slots: me=224179893, latest cluster=224180130
[2023-10-17T05:57:07.841257296Z ...] health check: behind by 224 slots: me=224179909, latest cluster=224180133
[2023-10-17T05:57:09.842383737Z ...] health check: behind by 215 slots: me=224179925, latest cluster=224180140
[2023-10-17T05:57:11.843648392Z ...] health check: behind by 207 slots: me=224179940, latest cluster=224180147
[2023-10-17T05:57:13.844964745Z ...] health check: behind by 198 slots: me=224179953, latest cluster=224180151
[2023-10-17T05:57:15.846252743Z ...] health check: behind by 188 slots: me=224179967, latest cluster=224180155
[2023-10-17T05:57:17.847432386Z ...] health check: behind by 184 slots: me=224179976, latest cluster=224180160
[2023-10-17T05:57:19.848869248Z ...] health check: behind by 183 slots: me=224179983, latest cluster=224180166
[2023-10-17T05:57:21.850336378Z ...] health check: behind by 176 slots: me=224179994, latest cluster=224180170
[2023-10-17T05:57:23.851538754Z ...] health check: behind by 168 slots: me=224180006, latest cluster=224180174
[2023-10-17T05:57:25.852817856Z ...] health check: behind by 161 slots: me=224180018, latest cluster=224180179
[2023-10-17T05:57:27.854197249Z ...] health check: behind by 157 slots: me=224180027, latest cluster=224180184
[2023-10-17T05:57:29.855331165Z ...] health check: behind by 147 slots: me=224180042, latest cluster=224180189
[2023-10-17T05:57:31.856595860Z ...] health check: behind by 137 slots: me=224180056, latest cluster=224180193
[2023-10-17T05:57:33.857799178Z ...] health check: behind by 132 slots: me=224180069, latest cluster=224180201
[2023-10-17T05:57:35.858915861Z ...] health check: behind by 123 slots: me=224180083, latest cluster=224180206
[2023-10-17T05:57:37.860190685Z ...] health check: behind by 113 slots: me=224180097, latest cluster=224180210
[2023-10-17T05:57:39.861255359Z ...] health check: behind by 100 slots: me=224180113, latest cluster=224180213
[2023-10-17T05:57:41.862748690Z ...] health check: behind by 91 slots: me=224180129, latest cluster=224180220
[2023-10-17T05:57:43.864099388Z ...] health check: behind by 79 slots: me=224180145, latest cluster=224180224
[2023-10-17T05:57:45.865677620Z ...] health check: behind by 67 slots: me=224180161, latest cluster=224180228
[2023-10-17T05:57:47.867187762Z ...] health check: behind by 57 slots: me=224180172, latest cluster=224180229
[2023-10-17T05:57:49.868555073Z ...] health check: behind by 55 slots: me=224180183, latest cluster=224180238
[2023-10-17T05:57:51.869900411Z ...] health check: behind by 46 slots: me=224180196, latest cluster=224180242
[2023-10-17T05:57:53.871387790Z ...] health check: behind by 35 slots: me=224180210, latest cluster=224180245
[2023-10-17T05:57:55.872762886Z ...] health check: behind by 26 slots: me=224180221, latest cluster=224180247

@steviez steviez requested a review from t-nelson October 17, 2023 17:39
Copy link
Contributor

@t-nelson t-nelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like this picked pretty clean?

:shipit:

@steviez
Copy link
Contributor

steviez commented Oct 20, 2023

looks like this picked pretty clean?

Yep, only issue stemmed from Bank::new_from_parent() getting updated to take an Arc<Bank> parent instead of &Arc<Bank> and calling clone on it internally

@steviez steviez merged commit 612616b into v1.16 Oct 20, 2023
18 checks passed
@steviez steviez deleted the mergify/bp/v1.16/pr-33651 branch October 20, 2023 08:25
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants