historyarchive: Add round-robin, error-resilience, and back-off to the ArchivePool
#5224
Conversation
The checkpoint change reader retries history archive operations here: https://github.com/stellar/go/blob/master/ingest/checkpoint_change_reader.go#L132-L142 We should get rid of the retry logic in the above code because there's no use in having retries implemented in multiple layers.
I think we should start with a simpler implementation (e.g. retry at most …)
While I agree in practice, I don't think we should project behavior from one layer to another. For example, a user of the ingest package may choose to only use a single archive and yet still want retries.
Yeah, I generally agree, @tamirms. And I don't think we have any metrics that we're trying to fight against, per se: it's more that we wanted the … However, I think keeping it simple is a totally fine call.
this is a good point. would it make sense to have another archive wrapper which solely encapsulates the retry / backoff logic? then we could enable both the pooling and retry properties by composing the …
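The composition idea above could be sketched roughly as follows. This is a hypothetical illustration, not the real `historyarchive` types: `Archive`, `retryArchive`, and `flakyArchive` are stand-ins, and `GetPathHAS` is used as a representative single method of the much larger `ArchiveInterface`.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Archive is a stand-in for the subset of ArchiveInterface being
// wrapped (hypothetical; the real interface has many more methods).
type Archive interface {
	GetPathHAS(path string) (string, error)
}

// retryArchive wraps a single Archive with retries and a constant
// back-off, independent of any pooling logic. A pool could then be
// built out of retryArchive values to compose both properties.
type retryArchive struct {
	inner   Archive
	retries int
	backoff time.Duration
}

func (r retryArchive) GetPathHAS(path string) (s string, err error) {
	for i := 0; i <= r.retries; i++ {
		if s, err = r.inner.GetPathHAS(path); err == nil {
			return s, nil
		}
		time.Sleep(r.backoff)
	}
	return "", fmt.Errorf("all attempts failed: %w", err)
}

// flakyArchive fails a fixed number of times before succeeding; it
// exists only to demonstrate the wrapper.
type flakyArchive struct {
	failures int
	calls    *int
}

func (f *flakyArchive) GetPathHAS(path string) (string, error) {
	*f.calls++
	if *f.calls <= f.failures {
		return "", errors.New("transient failure")
	}
	return "has:" + path, nil
}

func main() {
	calls := 0
	a := retryArchive{inner: &flakyArchive{failures: 2, calls: &calls}, retries: 3}
	s, err := a.GetPathHAS("history/00/00/00/history-00000000.json")
	fmt.Println(s, err, calls)
}
```

Because the wrapper satisfies the same interface it wraps, it layers transparently under any pooling strategy.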
I think it's ok to include back-offs, but I'm not sure that we should have something sophisticated which tries to be clever about adjusting the backoff based on the estimated reliability of each archive in the pool.
It's a good idea, but I think no: retrying across the pool is better than retrying on each individual archive (in the sense that it is likelier to result in success), so we need the pool-level retries first. If we did add a …
The back-off is linear with the number of consecutive errors; there's no other adjustment. Here, we were (i.e. before the latest push) just preferring archives with a lower back-off. @tamirms I dropped the back-off related code in 75e50a1 and just kept a simple round-robin retry approach so that we could evaluate it. Do you think we should keep it simple and just start there?
if you compose … Assume …
I think having a constant back-off should be simple and effective |
@tamirms I've simplified the design somewhat in line with how we discussed it, but not to the same level of depth. Rather than designing a … This isn't as... equitable? as the previous solution (which distributed back-offs better to each individual archive), since it backs off on the entire round-robin for a single call, but I think it's still a substantial improvement relative to what we had before, and so it achieves the goals in #5167. It also classifies context errors as permanent, like you advised. PTA(nother)L before I do it for the remaining methods!
@Shaptic this approach looks good to me! |
Out of curiosity, why is random (assuming a uniform distribution) worse than round robin? The linked issue seems to imply that either would work as long as failure information is taken into account.
@MonsieurNicolas yeah, you're right that there's nothing inherently better about either, but because we want retries, it's better to have certainty that we won't reuse the same archive again (which is of course doable with random selection, too, but is cleaner with round robin).
🎉
makes sense. we just need to make sure to shuffle the list on startup so that you don't introduce a bias towards the first archive across deployments/restarts |
@MonsieurNicolas yep, great call-out! that's sort of being done with this line: `ap.curr = rand.Intn(len(ap.pool)) // don't necessarily start at zero` which effectively achieves the same thing: a random starting point for the round robin.
Hmm, thinking about that more: a random starting point is definitely not the same level of randomness as a shuffle when it comes to traversal (you could argue that you are still biasing against groups of archives, since most people will pass them in the same order), but since the goal is to alleviate concentrating load on a particular group during restarts, the same end goal is achieved either way.
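The two options being compared can be sketched side by side. This is illustrative only, assuming a pool of archive names rather than real archive clients; `randomStart` and `shufflePool` are hypothetical helpers, not `ArchivePool` code.

```go
package main

import (
	"fmt"
	"math/rand"
)

// randomStart picks a random entry point for the round-robin; the
// traversal order itself stays fixed (the option the PR implements).
func randomStart(n int) int {
	return rand.Intn(n)
}

// shufflePool randomizes the traversal order entirely, removing any
// bias toward the order callers happen to list archives in.
func shufflePool(pool []string) []string {
	shuffled := append([]string(nil), pool...) // copy; don't mutate the caller's slice
	rand.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})
	return shuffled
}

func main() {
	pool := []string{"archive-a", "archive-b", "archive-c"}
	fmt.Println("start at:", pool[randomStart(len(pool))])
	fmt.Println("order:", shufflePool(pool))
}
```

Both spread first-request load across archives on restart; only the shuffle also breaks up fixed adjacency between archives during traversal.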
What
This adds two main components to the `ArchivePool` implementation that add significant resilience to the interface: …

Why
Closes #5167.
Known limitations
For feedback purposes, I haven't actually implemented this for every `ArchiveInterface` method, as I'm seeking feedback prior to writing a bunch of repetitive code I may have to wipe out later.