Decorrelated jitter does not jitter well #530
Thank you @george-polevoy for diving deeper on this than I had time to, and for sharing it with the community - this is super interesting. The decorrelated jitter strategy quoted came from that AWS team research; definitely open to investigating different options. First, could you say what the axes on the graphs represent? (A guess: the number of instances (vertical) of a specific delay of some-or-other timespan (horizontal) when averaged over 100000 iterations of the algorithm ... but could you state it precisely?) Second, would you be happy to share the test code which generated the results (maybe in a ZIP or GitHub gist)? It would be great for you/me/others to be able to validate, build further on your results, and explore how tweaking the algorithms affects outcomes.
I am definitely seeing in your graphs that the green line is much more smoothly distributed than the yellow line as generated in the tests. (EDIT: But it would be good to, eg, dig into the code as @grant-d suggests, to verify why)
I want to be sure I have understood what we mean by this. Do we mean that the average delay that the algorithm chooses to introduce before retrying (average perhaps in some sense of area under graph/percentiles), is lower? Or do we mean (for example) that the average overall time to achieve success is less, for each algorithm pitted against a failed system which recovers after some random amount of time? It would be really interesting to build on your experiments (if they do not cover this already) to be able to analyse the dimensions of overall latency to achieve success, combined with number of calls made to achieve success. Thank you again for sharing this with the community! |
George, would you mind trying your experiment with the |
@reisenberger Dylan, sure, there was a mistake: it is concurrency that is improved, not latency; I've updated the initial issue. The vertical axis represents the number of experiments. A virtual mass-retry event of 100000 simultaneous requests could result in a concurrency spike, which is what the graphs illustrate. If I try to graph fewer experiments, the graph is just too noisy. I understand that the absolute number of experiments is irrelevant, so it would be useful to factor it out and display a normalized probability distribution instead. I will try to come up with complete benchmark code, so others can explore and experiment.
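(As a purely illustrative sketch of that normalization step - the names are assumptions, not taken from George's benchmark code:)

```csharp
using System.Collections.Generic;
using System.Linq;

static class Histograms
{
    // histogram: delay bucket (ms) -> number of simulated retries landing in that bucket.
    // Dividing by the total turns raw counts into a probability distribution,
    // so the graph's shape no longer depends on how many experiments were run.
    public static IDictionary<int, double> Normalize(IDictionary<int, int> histogram)
    {
        double total = histogram.Values.Sum();
        return histogram.ToDictionary(kvp => kvp.Key, kvp => kvp.Value / total);
    }
}
```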
TL;DR: I agree with George - the guidance in the wiki is not optimal. It can lead to bad performance characteristics, which is further evidence for my suggestion to include a formal jitter in the box (#526).

@george-polevoy, my concerns are not just a preference; my suggestion is that instantiating Random inside the method may also be somewhat of a bug, perhaps leading to the correlations you're seeing in the ... There's probably further correlation due to the use of ... Both your ...

```csharp
[ThreadStatic] // Added, per George's observation re: thread safety
private static readonly Random r = new Random(); // Shared Random between all concurrent executions

IEnumerable<TimeSpan> DecorrelatedExponent()
{
    // code that uses 'r'
}
```
@grant-d Random is not thread safe, I'm afraid, and I'm using a random seed from Guid.GetHashCode, which is an alternative to using a thread-local variable.
Once again, your source of randomness is not random - see my notes above, and see the first code snippet here, where your pattern corresponds to the warning he has in that code.
(You are correct that I forgot the |
Thanks to both of you @george-polevoy and @grant-d for progressing this! I haven't had much time on this thread but note that:
So: should we just be following the recommendations here and here? (Very happy for further clarifications/corrections, as the above was just based on some brief reading!)
Thanks @reisenberger, I have used the various thread-static approaches but the locking one is interesting; I would not have expected to see advice to The |
Circling back to our use case: given we are calculating randomised delays of a far greater order of magnitude than any contention delay due to fine-grained locking, I guess we can safely follow the advice and not worry about the impact of locking.
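For illustration, a minimal sketch of the lock-based approach under discussion (the `SharedRandom` name and structure are mine, not taken from the linked recommendations):

```csharp
using System;

// One process-wide Random, guarded by a lock. The cost of the lock is
// negligible next to the retry delays being generated (milliseconds and up),
// and it avoids both the thread-safety issue of a shared unsynchronized
// Random and the same-seed issue of constructing new Random() per call.
internal static class SharedRandom
{
    private static readonly Random Instance = new Random();
    private static readonly object Sync = new object();

    public static double NextDouble()
    {
        lock (Sync)
        {
            return Instance.NextDouble();
        }
    }
}
```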
@grant-d Randomness is not an issue here. I will share code for you to verify. My computations are now single-threaded and deterministic, using a fixed random seed and a shared Random instance, and not using Polly. System clock issues can also be ruled out, because this is purely mathematical integration based on a deterministic pseudorandom sequence. The entire computation does not depend on .NET runtime issues; it's computed offline. It's just a math routine: I'm computing the delay-sequence function itself over virtual, regular time intervals.
I see the problem with your code. In addition to the min(cap, ...) that the AWS team suggested in their blog, you've also added max(seed, ...). Say we have 100 requests fail at time t0. What is the probability of scheduling a request at exactly t0 + 1s? It's 1/3, because all of the values lower than t0 + 1s clamp up to it. So you've got approximately 33 requests at the exact same timing, t0 + 1s. That completely explains the first two spikes on the graph, which are clearly visible. I've examined the code in the AWS blog, and they do something slightly different: they use random_between, which is not max(something, random.NextDouble()). I will follow up with the modelling code I'm using.
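To make the difference concrete, a rough sketch (my own, not code taken from the wiki or the PR) of the two shapes being contrasted; `seed`, `previous`, and `cap` are illustrative names, and the AWS line follows the `min(cap, random_between(base, sleep * 3))` form described in their blog post:

```csharp
using System;

static class JitterShapes
{
    // Shape George describes for the wiki version: an extra lower clamp.
    // Every draw below 'seed' collapses onto exactly 'seed', so a large
    // fraction of retries fire at the same instant - hence the spikes.
    public static double Clamped(Random random, double seed, double previous, double cap)
        => Math.Min(cap, Math.Max(seed, previous * 3 * random.NextDouble()));

    // AWS blog shape: a uniform draw *between* baseDelay and previous * 3
    // (random_between), so no probability mass piles up on the lower bound.
    public static double Aws(Random random, double baseDelay, double previous, double cap)
        => Math.Min(cap, baseDelay + (previous * 3 - baseDelay) * random.NextDouble());
}
```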
Agree - my code has a slightly different implementation from the wiki's too, though I am still refining it. I am curious to compare it with yours; please do post it as soon as you are ready.
[Edit: This is stale - see the related PR for the correct algorithm] For reference, here's my code:

```csharp
double range = MaxDelay.TotalMilliseconds - MinDelay.TotalMilliseconds; // Range

for (int i = 0; i < delays.Length; i++)
{
    double ms = range * _random.NextDouble(); // Ceiling
    ms += MinDelay.TotalMilliseconds;         // Floor
    delays[i] = TimeSpan.FromMilliseconds(ms);
}
```
Thanks, both, @george-polevoy @grant-d! It would be great to see graphs of the distributions generated by the revised algorithms.
Again, thank you both @george-polevoy @grant-d . @george-polevoy I think your analysis here is exactly correct. Thank you for spotting the (now obvious, yes) flaw in the algorithm here at time of writing. @grant-d The version in your PR (for which many thanks) also matches this analysis and the AWS algorithm, I believe. Thanks for all the work on this and the PR! (EDIT: PR suggestions will follow.) |
The upper-end clamp

Analogous to the lower-end clamp problem which you have eliminated, @george-polevoy, your contribution led me to reason again about the upper-end clamp. There is of course a potential upper-end clamp after enough iterations, because of the ceiling imposed at ... I drew users' attention to this in my existing guidance by emphasizing (from my own experience) that ... Otherwise, perhaps some pragmatic warnings will suffice. For example, for an initial ... Thoughts?
Finally, @george-polevoy : are you happy now with the behaviour of the AWS algorithm following the corrections you have introduced? (for which many thanks!). Or are there aspects of the 'advanced' algorithm from your original post you think still worth considering? I did dive deeper into the advanced algorithm - came up with a bunch of further questions - but won't lengthen this unless you are also interested in pursuing. It would be awesome to see how the "fixed" AWS algorithm looks also in your graphs. |
Alternatively, @grant-d , @george-polevoy , we could eliminate the upper-end-bunching by modifying the latest implementation like this:
Is this worth modelling (to confirm smoothness) in George's graphing system? Can anyone see disadvantages? One implication is that when |
This revisits my previous comment on implementations of
References: post by Stephen Toub; Jon Skeet 1; Jon Skeet 2.
The latest .NET Framework and .NET Core implementations do in fact differ, because the .NET Framework 4.7.2 implementation still (! - for backward compatibility, presumably) has ...
The recommendation I linked pertains to .NET Framework, not .NET Core. However, in targeting .NET Standard, Polly can of course still (as is well known) be consumed by builds against .NET Framework 4.5 and up. So we should still take account of the .NET Framework recommendation (EDIT: as PR #536 does) if for jitter we are targeting maximum randomness. Broadly, it is essential to make ...
@reisenberger The algorithm by AWS also has a flaw. On my graphs it looks smoother, but it still has spikes, because it's not truly decorrelated. @grant-d The problem with clamping using min/max is that it ruins randomness, flooding the distribution with constants. As for the API guarantees, you can use math to prove the MAX and MIN. For example, the maximum recovery interval for a retry sequence is the sum of a geometric progression, no matter how you randomize inside, as long as the limited-range random variable does not participate in any asymptotic function such as 1/x. My graphing framework turned out to be very useful for analyzing the problem. I will share it as promised once I clean up the code. Currently busy preparing for my DotNext talk this week.
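(As a sketch of the bound George alludes to - my own notation, not from his code: if the k-th delay is drawn from a bounded range whose upper limit grows geometrically by a factor $b$ from a first-try maximum $d_{\max}$, the worst-case total recovery time over $n$ retries is simply the sum of a geometric progression,

$$T_{\max} \;\le\; \sum_{k=0}^{n-1} d_{\max}\, b^{k} \;=\; d_{\max}\,\frac{b^{n}-1}{b-1},$$

so an upper bound can be proved regardless of how the randomisation behaves inside each bounded range.)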
Again, big thanks both @george-polevoy , @grant-d for everything you are contributing on this. Looking forward @george-polevoy to seeing more. |
I have a few more results for you to consider: https://gist.github.com/george-polevoy/c0c36c3c22c9c1fe67821b1d8255413a
Very nice, I like how clear the AWS clamping problem is (glad we fixed that). |
I think "soft exp, soft delay X" family looks best to me. It keeps exponential property, has reasonably low peak, low at start, and a hard limit of I have another framework which can be used to analyse concurrency issues. I will try to run a simulation with different kinds of retry strategies to see how they actually affect concurrency in fair-scheduling scenarios. Stochastic algorithms approach is different from deterministic reasoning, no need to clear the interval from 0 to the first retry completely, because you don't know the exact time of start of the transient failure condition anyway, why not to try during that interval, if it helps distribute retries evenly? |
Sorry George, I don't understand what you are getting at in the last paragraph, would you mind explaining again? |
Huge thanks @george-polevoy . Now digging in to this. (Bear with me if it takes a few days to respond; everything I do for Polly happens separately from paid-work ...) |
@grant-d Yes, I mean immediately. Instead of a minimum delay, there could be a half-decay-period parameter, which is more meaningful for the exponential function. This would be the point in time at which the probability drops to 1/2.
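(For reference, this is the standard identity relating the two - not the exact parametrisation of George's formula, which isn't shown here:

$$P(t) = e^{-\lambda t}, \qquad P(t_{1/2}) = \tfrac{1}{2} \;\Rightarrow\; t_{1/2} = \frac{\ln 2}{\lambda},$$

so specifying the half-decay period $t_{1/2}$ directly fixes the decay rate $\lambda = \ln 2 / t_{1/2}$.)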
@george-polevoy I've run out of time to dive into these in depth this week, due to the number of different strands going through on Polly, but I'm wanting to get to this soon. Thanks!
@george-polevoy Again, thank you. I have now digested the algorithm math and simulation basis - very nice. Nice use of the natural shape of ...

See the comment on @grant-d's PR: we could pull some of the new algorithms here into a separate ...

A remaining thought: it could be useful to understand, for the new algorithms, where the 50th percentile (say) falls - or to see the shape of the graph as a whole - for try 1, try 2, try 3, try 4 separately, to see how those 50th percentiles (or the distribution) progress with increasing tries. For example, looking at the graph (and formula) for 'exp jittered (Shifted)', it seems clear the 50th percentiles likely fall with the apex of each peak. But for 'soft exp, soft delay', we probably need to mine some data (or see a graph per try) to get a clear idea of how the 50th percentiles fall and progress. I've worked out how to amend your simulation code to generate this, so I can do it, but if it's quick for you / you feel like pulling that out, say for the 'soft exp, soft delay 3' case, that would be useful to see.

Why? Mostly we have focused on how to make the distribution smooth, to fair-schedule retries and reduce contention, perhaps with the background assumption that the underlying system is basically healthy. But I want to keep in focus the dimension that retries should happen with increasing delay, for the case of a system which is unhealthy/failing/struggling under load. For unhealthy systems, it can be important that later retries back off significantly, to reduce the potential for retries creating a self-DDoS attack. (Sure, the problem is stochastic, but broadly speaking, by the time fourth retries (say) start kicking in, we will be quadrupling the load.) (And of course there are other techniques to avoid this, like circuit-breaking on too many failures, but as a general principle it's helpful for retries to have an increasing back-off.) So, I would just like to dig in and be sure we understand this aspect - the degree of increasing back-off emerging from the new algorithms - before we go to the Polly public with a recommendation.

Thanks again, @grant-d, @george-polevoy, for everything we are doing on this.
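For anyone wanting to reproduce that per-try analysis, a minimal sketch (names and shape are mine, not taken from George's gist), assuming the simulation can expose (tryNumber, delay) samples:

```csharp
using System.Collections.Generic;
using System.Linq;

static class PerTryPercentiles
{
    // Group simulated samples by try number and report the median delay for
    // each, to check that later tries genuinely back off further.
    public static IDictionary<int, double> MedianByTry(IEnumerable<(int Try, double DelayMs)> samples)
    {
        return samples
            .GroupBy(s => s.Try)
            .ToDictionary(
                g => g.Key,
                g =>
                {
                    var ordered = g.Select(s => s.DelayMs).OrderBy(d => d).ToArray();
                    return ordered[ordered.Length / 2]; // 50th percentile (upper median)
                });
    }
}
```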
@george-polevoy @grant-d I adapted George's simulation code (for which huge thanks) to graph the retry-interval distribution for individual tries separately. I am happy that the "soft exp, soft delay" function does create clearly increasing backoff as the try number progresses. For example, for "soft exp, soft delay 3":

[Graph: Soft exp, soft delay 3 - distribution of retry intervals for try 0]
[Graph: Soft exp, soft delay 3 - distribution of retry intervals for try 1]
[Graph: Soft exp, soft delay 3 - distribution of retry intervals for try 2]
[Graph: Soft exp, soft delay 3 - distribution of retry intervals for try 3]

EDIT: Code used as the basis for these graphs, posted here.

EDIT 2: The same graphs for "decorrelated jitter (AWS)" clearly show the spikes we expected (particularly for tries 0, 1, 2), and also show (as expected) that "decorrelated jitter (AWS)" reaches back to make some second and third retries after near-zero delay (taking a chance on the system having recovered, if you like).
Thanks again @grant-d and @george-polevoy for the great investigation into jitter strategy. Polly v7 has now launched, and we have a Polly-Contrib. I propose - if you are in agreement - we take @grant-d 's PR #536 and your awesome jitter work here @george-polevoy, and make a Polly.Contrib repo Polly.Contrib.WaitAndRetryStrategies to publish it. The repo would give you:
(ie you can basically just rename from the template, place your code in, and you're good to go!) If this sounds good, I can clone the template to make you a Polly.Contrib.WaitAndRetryStrategies. (Polly.Contrib also allows me to "get out of the way" and lets contributors such as yourselves publish these contributions without being blocked on feedback or discussion about the exact shape of the API - although I'm still very much there to support if needed. And to evolve it, eg @george-polevoy, if you want to add something later with a half-life/half-decay parameter.) Also, if we go the contrib route, I would still want to feature this strongly from the main Polly, ie to replace/update the wiki jitter information and point to your contrib from there and from the readme. Let me know what you think.

@george-polevoy Also, I did some work with Cesar de la Torre around resilience in the Microsoft microservices sample app a couple of years ago; some of it ended up in this ebook, parts of which are also online here. Assuming we publish your jitter strategy to Polly.Contrib.WaitAndRetryStrategies, we can also suggest a PR to update that Microsoft jitter documentation to reference your better strategy.
That sounds like a great idea, thanks for thinking it through in such detail. The only request I'd make is that the name is a mouthful; I wonder if there is something more concise.
Appreciate your humbleness, but your involvement is critical to the success of contrib. My submission is much improved after your input. So you 'getting out of the way' is not on my priority list.
No worries @grant-d. We've wanted to get Polly.Contrib in place for a while, so that users can build additions without the Polly team having to sign off every decision; it gives everyone (me too) a bit more flexibility. The retry strategies are quite a small case (could go either way), but particularly the option to keep refining the jitter math points to Contrib.
I am good with To know: The choice becomes the namespace ie |
I've created the Polly.Contrib.WaitAndRetry repo and am sending you both (@grant-d @george-polevoy) invites to the Polly-Contrib organization. @george-polevoy I'd suggest we use your 'soft exp, soft delay X' algorithm as the best jitter candidate so far (unless you think different). I've invited you to the repo (ie you get write-access) so that you can continue to make changes/suggestions to refine the jitter algorithm further if you want; it's been awesome to have all your great input so far. @grant-d Are you good to pull the conclusions we came to on this/and other thread into the new repo? You can pull down a zip from Polly.Contrib.BlankTemplate, and just rename/add to form the new Polly.Contrib.WaitAndRetry, then upload back up to Polly.Contrib.WaitAndRetry. After you accept the Polly-Contrib invite you have write rights on the repo. Let me know when there's a nuget publish candidate and we can figure out the publish step (publish under a Contrib org or individually) Thanks hugely to you both for the awesome contribution! Let me know if any questions/anything I can help with. |
Initial commit here: Polly-Contrib/Polly.Contrib.WaitAndRetry#2. I just saw your note above about the template - I will retrofit it suitably.
@george-polevoy Are you happy if we include your 'soft exp, soft delay X' algorithm in the Polly.Contrib.WaitAndRetry repo? (We can do all the necessary work to move the code there.) |
@reisenberger Sure |
PR made to pull the new jitter formula into Polly.Contrib.WaitAndRetry : Polly-Contrib/Polly.Contrib.WaitAndRetry#11 |
Closing. The new jitter formula is now published to NuGet as part of Polly.Contrib.WaitAndRetry, and documented in the readme of that project with references back to here. Thanks to all, @george-polevoy @grant-d, for raising the issue and clarifying the math and logic so thoroughly. The Polly wiki on jitter has been similarly updated.
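For readers arriving later, a minimal usage sketch of the published package, hedged against the API at the time of writing (check the Polly.Contrib.WaitAndRetry readme for the current signatures):

```csharp
using System;
using Polly;
using Polly.Contrib.WaitAndRetry;
using Polly.Retry;

static class JitterUsage
{
    public static AsyncRetryPolicy Build()
    {
        // Generate the jittered backoff delays up front, then hand them to a
        // standard Polly wait-and-retry policy.
        var delays = Backoff.DecorrelatedJitterBackoffV2(
            medianFirstRetryDelay: TimeSpan.FromSeconds(1),
            retryCount: 5);

        return Policy
            .Handle<Exception>()
            .WaitAndRetryAsync(delays);
    }
}
```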
Summary: Jitter the retry delays to achieve low concurrency
I'm trying to use the suggested "decorrelated jitter" strategy. In my experiment, the suggested decorrelated jitter does not achieve what it is supposed to.
First, `new Random()` is not quite random when it comes to reducing concurrency. You can get multiple instances with the same seed if many retries are issued simultaneously.
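A small illustration of that seed problem (my own repro sketch; it applies to .NET Framework, where the parameterless constructor seeds from the system tick count - .NET Core seeds differently):

```csharp
using System;

class SeedDemo
{
    static void Main()
    {
        // On .NET Framework, new Random() is seeded from Environment.TickCount,
        // so two instances created within the same clock tick produce identical
        // sequences - i.e. "jittered" retries that all pick the same delays.
        var r1 = new Random();
        var r2 = new Random();
        Console.WriteLine(r1.Next(1000)); // frequently prints the same value...
        Console.WriteLine(r2.Next(1000)); // ...as this line
    }
}
```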
The following code achieves much lower concurrency and a better distribution.
In the attached screenshot you can see that the 'advanced' graph has a much lower maximum and a smoother distribution, which is highly desirable to avoid concurrency spikes.