Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

checkpoint: improve HTTP error handling #6130

Open
wants to merge 5 commits into
base: develop
Choose a base branch
from

Conversation

ivan4th
Copy link
Contributor

@ivan4th ivan4th commented Jul 11, 2024

Motivation

When recovery.recovery-uri is configured and an error happens, such as network glitch or a temporary server error due to overload, go-spacemesh continues its execution without recovery, which may not be always intended behavior. This, for example, causes TestCheckpoint systest flakes sometimes.
The way to fix it is to retry a few times upon transient errors (such as network glitch or a 503 server error), terminating after all the retries are exhausted, continue without an error upon 404 Not Found, and terminate go-spacemesh in other cases.

Description

When recovering from an URL, only continue without failing if checkpoint data request fails with 404 code. If the request fails with other 4xx HTTP error code, go-spacemesh startup fails. Upon other errors, including 5xx HTTP errors and network errors, retry checkpoint data request several times.recovery.retry-max and recovery.retry-delay specify retry parameters, defaulting to 5 and 3s, respectively. go-spacemesh terminates if all the retries fail.

Test Plan

Systests should pass

When recovering from an URL, only continue without failing if
checkpoint data request fails with 404 code. If the request fails
with other 4xx HTTP error code, go-spacemesh startup fails. Upon
other errors, including 5xx HTTP errors and network errors, retry
checkpoint data request several times.
`recovery.retry-count` and `recovery.retry-interval` specify retry
parameters, defaulting to 5 and 3s, respectively.
Copy link

codecov bot commented Jul 11, 2024

Codecov Report

Attention: Patch coverage is 55.00000% with 18 lines in your changes missing coverage. Please review.

Project coverage is 81.8%. Comparing base (e70a108) to head (afcac72).
Report is 25 commits behind head on develop.

Files Patch % Lines
node/node.go 0.0% 10 Missing ⚠️
checkpoint/recovery.go 60.0% 5 Missing and 1 partial ⚠️
checkpoint/util.go 86.6% 2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           develop   #6130   +/-   ##
=======================================
  Coverage     81.7%   81.8%           
=======================================
  Files          302     301    -1     
  Lines        32637   32620   -17     
=======================================
+ Hits         26695   26710   +15     
+ Misses        4204    4181   -23     
+ Partials      1738    1729    -9     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

checkpoint/recovery.go Outdated Show resolved Hide resolved
When this flag is specified, go-spacemesh doesn't exit when recovery
URI request fails. Retries are still done, though.
This is needed b/c in the context of TestRecovery systest, all the
nodes are created with --recovery-uri flag passed, including the
bootnodes. This way, they can't successfully start unless the
bootstrapper is already running. But the bootstrapper requires an
working go-spacemesh node to start up, thus we get a circular
dependency.
Copy link
Member

@fasmat fasmat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we want to default to ignoring failing requests for checkpoints in system tests? Especially in the tests that do a checkpoint and restore these should be caught and logged and cause the test to fail early (failing to fetch the checkpoint) instead of having the restore "succeed" and then fail at assertion.

I think instead we need to make sure that the pods serving the checkpoints are up and available before starting the smeshing nodes. Wdys?

checkpoint/recovery.go Outdated Show resolved Hide resolved
checkpoint/recovery_test.go Show resolved Hide resolved
checkpoint/util.go Outdated Show resolved Hide resolved
@ivan4th
Copy link
Contributor Author

ivan4th commented Jul 24, 2024

I don't understand why we want to default to ignoring failing requests for checkpoints in system tests? Especially in the tests that do a checkpoint and restore these should be caught and logged and cause the test to fail early (failing to fetch the checkpoint) instead of having the restore "succeed" and then fail at assertion.

I think instead we need to make sure that the pods serving the checkpoints are up and available before starting the smeshing nodes. Wdys?

The problem is more complicated than that, I explained it in the commit message but perhaps should comment in the code:

This is needed b/c in the context of TestRecovery systest, all the nodes are created with --recovery-uri flag passed, including the bootnodes. This way, they can't successfully start unless the bootstrapper is already running. But the bootstrapper requires an working go-spacemesh node to start up, thus we get a circulardependency.

This circular dependency is rather hard to resolve, it involves avoiding bootstrapper restart but that's not easily doable b/c it will involve some substantial restructuring of systest code

@ivan4th
Copy link
Contributor Author

ivan4th commented Jul 25, 2024

Added a comment explaining why go-spacemesh needs to continue after unsuccessfully retrying checkpoint recovery several times. We may want to fix this unfortunate circular dependency of node-bootstrapper startup in a followup.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants