
Subgraph failure retry fuzzing #4474

Closed
YaroShkvorets opened this issue Mar 17, 2023 · 5 comments

Comments

@YaroShkvorets
Contributor

At Pinax we see a pattern where, every 30 minutes, there is a large spike of requests to Firehose.

[image: Firehose request rate chart showing spikes roughly every 30 minutes]

Most likely the problem is that one or two indexers have several broken subgraphs, and every 30 minutes they all wake up together and start bombarding the Firehose, sometimes overloading the network. We need to find a way to smooth out these spikes.

There is also the question of why graph-node can't detect these failures and stop syncing subgraphs that appear to be failing deterministically, but that's a topic for another issue.

The 30-minute value comes from the retry backoff ceiling here

To solve this, I propose adding a fuzz_factor to the ExponentialBackoff struct. When initialized with a fuzz_factor, the delay will be randomly selected between delay*(1-fuzz_factor) and delay*(1+fuzz_factor).

I.e. if fuzz_factor=0.1 and the next delay is calculated to be 10 seconds, it will in fact be somewhere between 9 and 11 seconds.

The retry fuzz factor could be set by the indexer via an environment variable, similar to GRAPH_SUBGRAPH_ERROR_RETRY_CEIL_SECS here
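
Roughly, the change could look something like the sketch below. This is only an illustration, not graph-node's actual ExponentialBackoff code: the `jitter` field, the `GRAPH_SUBGRAPH_ERROR_RETRY_JITTER` variable name, and its default value are placeholders.

```rust
use std::time::Duration;

// Minimal sketch of an exponential backoff with jitter.
// NOTE: not graph-node's actual ExponentialBackoff; the `jitter` field and
// the GRAPH_SUBGRAPH_ERROR_RETRY_JITTER variable are placeholders.
struct ExponentialBackoff {
    attempt: u32,
    base: Duration,
    ceiling: Duration,
    /// 0.0 disables jitter; 0.1 means +/-10% around the computed delay.
    jitter: f64,
}

impl ExponentialBackoff {
    fn next_delay(&mut self) -> Duration {
        // Plain exponential backoff, capped at the ceiling (currently 30 minutes).
        let exp = self.base.as_secs_f64() * 2f64.powi(self.attempt as i32);
        let capped = exp.min(self.ceiling.as_secs_f64());
        self.attempt += 1;

        // Pick the actual delay uniformly from
        // [capped * (1 - jitter), capped * (1 + jitter)] so that failing
        // subgraphs don't all retry at exactly the same instant.
        let factor = 1.0 + self.jitter * (2.0 * rand::random::<f64>() - 1.0);
        Duration::from_secs_f64(capped * factor)
    }
}

fn jitter_from_env() -> f64 {
    // Hypothetical env var, read the same way GRAPH_SUBGRAPH_ERROR_RETRY_CEIL_SECS is.
    std::env::var("GRAPH_SUBGRAPH_ERROR_RETRY_JITTER")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(0.1)
}
```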

If this seems like a good idea, I can submit a PR for the feature, or I'm open to other possible solutions.

@maoueh
Contributor

maoueh commented Mar 17, 2023

Seems like a good idea to add some randomness to the backoff delay. The Google SRE guide recommends using jitter for retries, see https://cloud.google.com/blog/products/gcp/how-to-avoid-a-self-inflicted-ddos-attack-cre-life-lessons (point #2 specifically)

As for the fuzz naming, I think most libraries use the term jitter instead; at least that's the term I've seen most often in the wild.

@YaroShkvorets
Contributor Author

As for the fuzz naming, I think most libraries use the term jitter instead; at least that's the term I've seen most often in the wild.

Cool, I like jitter.

@YaroShkvorets
Contributor Author

Sent a PR: #4476

@azf20
Contributor

azf20 commented Mar 27, 2023

thanks @YaroShkvorets!

azf20 closed this as completed Mar 27, 2023