Subgraph failure retry fuzzing #4474
At Pinax we see a pattern where, every 30 minutes, there is a large spike of requests to Firehose.
Most likely, one or two indexers have several broken subgraphs that all wake up together every 30 minutes and start bombarding the Firehose, sometimes overloading the network. We need to find a way to smooth out these spikes.
There is a question of why graph-node can't detect these failures and stop syncing subgraphs that appear to be failing deterministically, but that's a topic for another issue.
The 30-minute value comes from the retry backoff ceiling in graph-node.
To solve this, I propose adding a `fuzz_factor` to the `ExponentialBackoff` struct. When initialized with a `fuzz_factor`, the delay would be randomly selected between `delay*(1-fuzz_factor)` and `delay*(1+fuzz_factor)`. I.e. if `fuzz_factor=0.1` and the next delay is calculated to be 10 seconds, it will in fact be between 9 and 11 seconds.

The retry fuzz factor could be set by the indexer via an environment variable, similar to the existing `GRAPH_SUBGRAPH_ERROR_RETRY_CEIL_SECS`.
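To make the idea concrete, here is a minimal standalone sketch of a fuzzed exponential backoff. It is only an illustration of the proposal, not graph-node's actual `ExponentialBackoff` implementation: the struct fields, the `next_delay` method, the doubling-with-ceiling logic, and the use of the `rand` crate are assumptions made for the example.

```rust
use rand::Rng;
use std::time::Duration;

/// Hypothetical backoff with a fuzz factor; not the real graph-node struct.
struct ExponentialBackoff {
    base: Duration,
    ceiling: Duration,
    attempt: u32,
    /// Fraction by which each delay may deviate, e.g. 0.1 means +/-10%.
    fuzz_factor: f64,
}

impl ExponentialBackoff {
    fn next_delay(&mut self) -> Duration {
        // Plain exponential backoff capped at the ceiling (e.g. 30 minutes).
        let raw = self.base.as_secs_f64() * 2f64.powi(self.attempt as i32);
        let capped = raw.min(self.ceiling.as_secs_f64());
        self.attempt += 1;

        // Pick the actual delay uniformly in
        // [delay * (1 - fuzz_factor), delay * (1 + fuzz_factor)].
        let lo = capped * (1.0 - self.fuzz_factor);
        let hi = capped * (1.0 + self.fuzz_factor);
        Duration::from_secs_f64(rand::thread_rng().gen_range(lo..=hi))
    }
}

fn main() {
    // With fuzz_factor = 0.1, a computed 10s delay ends up between 9s and 11s,
    // so subgraphs that failed at the same moment drift apart over retries.
    let mut backoff = ExponentialBackoff {
        base: Duration::from_secs(10),
        ceiling: Duration::from_secs(1800), // 30-minute ceiling
        attempt: 0,
        fuzz_factor: 0.1,
    };
    for _ in 0..5 {
        println!("next retry in {:?}", backoff.next_delay());
    }
}
```

Because the jitter is symmetric around the computed delay, the expected backoff stays the same; retries that would otherwise fire in lockstep simply drift apart.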
If this seems like a good idea, I can submit a PR for this feature, or I'm happy to hear other possible solutions.
Comments

Seems like a good idea to add some randomness to the backoff element. The Google SRE guide recommends using jitter for retries; see https://cloud.google.com/blog/products/gcp/how-to-avoid-a-self-inflicted-ddos-attack-cre-life-lessons

Cool, I like it.

Sent a PR: #4476

Thanks @YaroShkvorets!