
Subgraph failure retry fuzzing #4474

Closed
YaroShkvorets opened this issue Mar 17, 2023 · 5 comments

Comments

@YaroShkvorets
Contributor

At Pinax we see a pattern where, every 30 minutes, there is a large spike of requests to Firehose.

[image: Firehose request rate chart showing spikes roughly every 30 minutes]

Most likely the problem is that one or two indexers have several broken subgraphs, and every 30 minutes they all wake up together and start bombarding the Firehose, sometimes overloading the network. We need to find a way to smooth out these spikes.

There is also the question of why graph-node can't detect these failures and stop syncing subgraphs that appear to be failing deterministically, but that's a topic for another issue.

The 30-minute value comes from the retry backoff ceiling here

To solve this, I propose adding a fuzz_factor to the ExponentialBackoff struct. When initialized with a fuzz_factor, the delay will be randomly selected between delay*(1-fuzz_factor) and delay*(1+fuzz_factor).

I.e. if fuzz_factor=0.1 and the next delay is calculated to be 10 seconds, it will in fact be somewhere between 9 and 11 seconds.

The retry fuzz factor could be set by the indexer via an environment variable, similar to GRAPH_SUBGRAPH_ERROR_RETRY_CEIL_SECS here
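
Roughly, the change could look something like the sketch below. This is only an illustration, not graph-node's actual ExponentialBackoff code: the `jitter` field, the `GRAPH_SUBGRAPH_ERROR_RETRY_JITTER` variable name, and its default value are placeholders.

```rust
use std::time::Duration;

// Minimal sketch of an exponential backoff with jitter.
// NOTE: not graph-node's actual ExponentialBackoff; the `jitter` field and
// the GRAPH_SUBGRAPH_ERROR_RETRY_JITTER variable are placeholders.
struct ExponentialBackoff {
    attempt: u32,
    base: Duration,
    ceiling: Duration,
    /// 0.0 disables jitter; 0.1 means +/-10% around the computed delay.
    jitter: f64,
}

impl ExponentialBackoff {
    fn next_delay(&mut self) -> Duration {
        // Plain exponential backoff, capped at the ceiling (currently 30 minutes).
        let exp = self.base.as_secs_f64() * 2f64.powi(self.attempt as i32);
        let capped = exp.min(self.ceiling.as_secs_f64());
        self.attempt += 1;

        // Pick the actual delay uniformly from
        // [capped * (1 - jitter), capped * (1 + jitter)] so that failing
        // subgraphs don't all retry at exactly the same instant.
        let factor = 1.0 + self.jitter * (2.0 * rand::random::<f64>() - 1.0);
        Duration::from_secs_f64(capped * factor)
    }
}

fn jitter_from_env() -> f64 {
    // Hypothetical env var, read the same way GRAPH_SUBGRAPH_ERROR_RETRY_CEIL_SECS is.
    std::env::var("GRAPH_SUBGRAPH_ERROR_RETRY_JITTER")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(0.1)
}
```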

If this seems like a good idea, I can submit a PR for the feature, or I'm open to other possible solutions.

@maoueh
Contributor

maoueh commented Mar 17, 2023

Seems like a good idea to add some randomness to the backoff delay. The Google SRE guide recommends using jitter for retries, see https://cloud.google.com/blog/products/gcp/how-to-avoid-a-self-inflicted-ddos-attack-cre-life-lessons (point #2 specifically)

As for the fuzz naming, I think most libraries use the term jitter instead; at least that's the term I've seen most often in the wild.

@YaroShkvorets
Contributor Author

As for the fuzz naming, I think most libraries use the term jitter instead; at least that's the term I've seen most often in the wild.

Cool, I like jitter.

@YaroShkvorets
Contributor Author

Sent a PR: #4476

@azf20
Contributor

azf20 commented Mar 27, 2023

thanks @YaroShkvorets!

azf20 closed this as completed Mar 27, 2023