checkpoint: improve HTTP error handling #6130

ivan4th · 2024-07-11T21:45:39Z

Motivation

When recovery.recovery-uri is configured and an error happens, such as network glitch or a temporary server error due to overload, go-spacemesh continues its execution without recovery, which may not be always intended behavior. This, for example, causes TestCheckpoint systest flakes sometimes.
The way to fix it is to retry a few times upon transient errors (such as network glitch or a 503 server error), terminating after all the retries are exhausted, continue without an error upon 404 Not Found, and terminate go-spacemesh in other cases.

Description

When recovering from an URL, only continue without failing if checkpoint data request fails with 404 code. If the request fails with other 4xx HTTP error code, go-spacemesh startup fails. Upon other errors, including 5xx HTTP errors and network errors, retry checkpoint data request several times.recovery.retry-max and recovery.retry-delay specify retry parameters, defaulting to 5 and 3s, respectively. go-spacemesh terminates if all the retries fail.

Test Plan

Systests should pass

When recovering from an URL, only continue without failing if checkpoint data request fails with 404 code. If the request fails with other 4xx HTTP error code, go-spacemesh startup fails. Upon other errors, including 5xx HTTP errors and network errors, retry checkpoint data request several times. `recovery.retry-count` and `recovery.retry-interval` specify retry parameters, defaulting to 5 and 3s, respectively.

codecov · 2024-07-11T22:02:59Z

Codecov Report

Attention: Patch coverage is 55.00000% with 18 lines in your changes missing coverage. Please review.

Project coverage is 81.8%. Comparing base (e70a108) to head (afcac72).
Report is 25 commits behind head on develop.

Files	Patch %	Lines
node/node.go	0.0%	10 Missing ⚠️
checkpoint/recovery.go	60.0%	5 Missing and 1 partial ⚠️
checkpoint/util.go	86.6%	2 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##           develop   #6130   +/-   ##
=======================================
  Coverage     81.7%   81.8%           
=======================================
  Files          302     301    -1     
  Lines        32637   32620   -17     
=======================================
+ Hits         26695   26710   +15     
+ Misses        4204    4181   -23     
+ Partials      1738    1729    -9

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

checkpoint/recovery.go

When this flag is specified, go-spacemesh doesn't exit when recovery URI request fails. Retries are still done, though. This is needed b/c in the context of TestRecovery systest, all the nodes are created with --recovery-uri flag passed, including the bootnodes. This way, they can't successfully start unless the bootstrapper is already running. But the bootstrapper requires an working go-spacemesh node to start up, thus we get a circular dependency.

fasmat

I don't understand why we want to default to ignoring failing requests for checkpoints in system tests? Especially in the tests that do a checkpoint and restore these should be caught and logged and cause the test to fail early (failing to fetch the checkpoint) instead of having the restore "succeed" and then fail at assertion.

I think instead we need to make sure that the pods serving the checkpoints are up and available before starting the smeshing nodes. Wdys?

checkpoint/recovery.go

checkpoint/recovery_test.go

checkpoint/util.go

ivan4th · 2024-07-24T21:03:10Z

I don't understand why we want to default to ignoring failing requests for checkpoints in system tests? Especially in the tests that do a checkpoint and restore these should be caught and logged and cause the test to fail early (failing to fetch the checkpoint) instead of having the restore "succeed" and then fail at assertion.

I think instead we need to make sure that the pods serving the checkpoints are up and available before starting the smeshing nodes. Wdys?

The problem is more complicated than that, I explained it in the commit message but perhaps should comment in the code:

This is needed b/c in the context of TestRecovery systest, all the nodes are created with --recovery-uri flag passed, including the bootnodes. This way, they can't successfully start unless the bootstrapper is already running. But the bootstrapper requires an working go-spacemesh node to start up, thus we get a circulardependency.

This circular dependency is rather hard to resolve, it involves avoiding bootstrapper restart but that's not easily doable b/c it will involve some substantial restructuring of systest code

ivan4th · 2024-07-25T00:26:05Z

Added a comment explaining why go-spacemesh needs to continue after unsuccessfully retrying checkpoint recovery several times. We may want to fix this unfortunate circular dependency of node-bootstrapper startup in a followup.

ivan4th requested review from dshulyak, fasmat, poszu and acud as code owners July 11, 2024 21:45

poszu reviewed Jul 12, 2024

View reviewed changes

checkpoint/recovery.go Outdated Show resolved Hide resolved

checkpoint: use retryablehttp for checkpoint data retrieval

05a5cf6

ivan4th force-pushed the fix/recover-retries branch from ab5cdfb to 05a5cf6 Compare July 12, 2024 17:35

fasmat reviewed Jul 15, 2024

View reviewed changes

checkpoint/recovery.go Outdated Show resolved Hide resolved

checkpoint/recovery_test.go Show resolved Hide resolved

checkpoint/util.go Outdated Show resolved Hide resolved

Address comments

119cd53

Fix typo

afcac72

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

checkpoint: improve HTTP error handling #6130

checkpoint: improve HTTP error handling #6130

ivan4th commented Jul 11, 2024 •

edited

Loading

codecov bot commented Jul 11, 2024 •

edited

Loading

fasmat left a comment

ivan4th commented Jul 24, 2024 •

edited

Loading

ivan4th commented Jul 25, 2024

checkpoint: improve HTTP error handling #6130

Are you sure you want to change the base?

checkpoint: improve HTTP error handling #6130

Conversation

ivan4th commented Jul 11, 2024 • edited Loading

Motivation

Description

Test Plan

codecov bot commented Jul 11, 2024 • edited Loading

Codecov Report

fasmat left a comment

Choose a reason for hiding this comment

ivan4th commented Jul 24, 2024 • edited Loading

ivan4th commented Jul 25, 2024

ivan4th commented Jul 11, 2024 •

edited

Loading

codecov bot commented Jul 11, 2024 •

edited

Loading

ivan4th commented Jul 24, 2024 •

edited

Loading