
Current Window Incrementation Despite Blocking #110

Open
CariappaKGanapathi opened this issue Jan 27, 2025 · 13 comments


@CariappaKGanapathi

Currently, in the Increment function of the Redis implementation, the current window count is incremented regardless of whether the request will be allowed. When rate limiting APIs, this causes a problem: multiple rejected hits to the server keep pushing the next allowed time further and further into the future. Ideally, once a request is dropped, the time at which the next request can be accepted should not change. This can be achieved by incrementing the current window counter only for requests that were actually ALLOWED.

If there is a specific reason to increment the counter before checking for Allow, please let me know and I will create a PR to handle both cases.
If not, can the suggested behaviour become the default?

Code line:

return prevCount, incr.Val(), nil
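
For illustration, a rough sketch (not the library's code) of how a Redis backend could make the increment conditional, assuming a hypothetical capacity argument. A Lua script keeps the read-check-increment atomic; the capacity check here is a plain sum for brevity, whereas the real sliding window weights the previous window by its overlap:

```go
package limiterexample

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// conditionalIncr increments the current window key only while the combined
// count is still below the capacity; otherwise the counters are returned
// unchanged.
var conditionalIncr = redis.NewScript(`
local prev = tonumber(redis.call('GET', KEYS[1]) or 0)
local curr = tonumber(redis.call('GET', KEYS[2]) or 0)
local capacity = tonumber(ARGV[1])
if prev + curr < capacity then
  curr = redis.call('INCR', KEYS[2])
  redis.call('PEXPIRE', KEYS[2], ARGV[2])
end
return {prev, curr}
`)

// incrementIfAllowed is a hypothetical replacement for the Redis backend's
// Increment that takes a capacity argument and only counts allowed requests.
func incrementIfAllowed(ctx context.Context, cli *redis.Client, prevKey, currKey string, capacity int64, ttl time.Duration) (prevCount, currCount int64, err error) {
	res, err := conditionalIncr.Run(ctx, cli, []string{prevKey, currKey}, capacity, ttl.Milliseconds()).Int64Slice()
	if err != nil {
		return 0, 0, err
	}
	return res[0], res[1], nil
}
```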

@mennanov
Owner

Is the behavior you're referring to incorrect only for the Redis backend or for the actual SlidingWindow algorithm (all backends)?

@mennanov
Owner

I think I understand the problem you are probably referring to: the case when multiple clients make requests concurrently without coordination.

When the Limit() function returns ErrLimitExhausted with a delay duration d1, another client may call Limit() shortly after and get ErrLimitExhausted with a delay duration d2 (d2 > d1).
The first client waits for d1, but Limit() returns ErrLimitExhausted again because the other client's call incremented the counter.

As a quick workaround, you can try exponential backoff on the clients (instead of just waiting for the duration returned by the Limit() function).
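
A minimal client-side sketch of that workaround, assuming the Limit(ctx) (time.Duration, error) signature and limiters.ErrLimitExhausted exposed by this package:

```go
package limiterexample

import (
	"context"
	"errors"
	"time"

	"github.com/mennanov/limiters"
)

// doWithBackoff retries work() with exponential backoff layered on top of the
// delay returned by Limit(), so concurrent clients do not retry in lockstep.
func doWithBackoff(ctx context.Context, lim interface {
	Limit(context.Context) (time.Duration, error)
}, work func() error) error {
	backoff := 50 * time.Millisecond
	for {
		wait, err := lim.Limit(ctx)
		if err == nil {
			return work()
		}
		if !errors.Is(err, limiters.ErrLimitExhausted) {
			return err
		}
		if wait < backoff {
			wait = backoff // wait at least the growing backoff, not only the returned delay
		}
		select {
		case <-time.After(wait):
		case <-ctx.Done():
			return ctx.Err()
		}
		backoff *= 2
	}
}
```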

Counting just the allowed requests will likely require changing the SlidingWindowIncrementer interface (possibly adding a capacity argument to the Increment() function to check whether the counter should be incremented) or using a lock to read the current state and increment only when necessary (a better approach IMO).

Either solution introduces a breaking change and should probably be implemented as a separate sliding window algorithm.

What do you think?
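
For reference, the first option might look roughly like this, a hypothetical sketch with a capacity argument added to the existing Increment arguments (not the current API):

```go
package limiterexample

import (
	"context"
	"time"
)

// SlidingWindowIncrementerWithCapacity is a hypothetical variant of the
// SlidingWindowIncrementer interface: Increment receives the capacity so a
// backend can skip the increment when the limit is already exhausted.
type SlidingWindowIncrementerWithCapacity interface {
	// Increment increments the current window's counter only if the combined
	// count is still below capacity and returns the previous and current
	// window counts.
	Increment(ctx context.Context, prev, curr time.Time, ttl time.Duration, capacity int64) (prevCount, currCount int64, err error)
}
```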

@CariappaKGanapathi
Author

I believe the behaviour is the same regardless of the backend, because Increment is called from a single place and every backend's Increment increments the current counter unconditionally.

Called from:

prev, curr, err := s.backend.Increment(ctx, prevWindow, currWindow, ttl+s.rate)

Yes, I agree with the approach of adding a capacity argument to the Increment() function. Currently we use a mitigation cache key with a TTL equal to the earliest time the next request can be accepted, in order to skip the rate limit check altogether. Maybe that could be an additional feature too, for performance?
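
A rough sketch of that mitigation-key idea (hypothetical names, Redis used for the key; not part of this library): once a request is throttled with delay d, a key with TTL d is stored, and every subsequent check short-circuits until it expires, so blocked traffic never touches the window counters.

```go
package limiterexample

import (
	"context"
	"errors"
	"time"

	"github.com/mennanov/limiters"
	"github.com/redis/go-redis/v9"
)

// allow skips the limiter entirely while a mitigation key is alive and only
// consults Limit() once the known block window has elapsed.
func allow(ctx context.Context, cli *redis.Client, key string, limit func(context.Context) (time.Duration, error)) (bool, error) {
	blocked, err := cli.Exists(ctx, key+":mitigated").Result()
	if err != nil {
		return false, err
	}
	if blocked > 0 {
		return false, nil // still inside a known block window: skip the check altogether
	}
	wait, err := limit(ctx)
	if err == nil {
		return true, nil
	}
	if !errors.Is(err, limiters.ErrLimitExhausted) {
		return false, err
	}
	// Remember the block until the returned delay elapses.
	if setErr := cli.Set(ctx, key+":mitigated", 1, wait).Err(); setErr != nil {
		return false, setErr
	}
	return false, nil
}
```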

@leeym
Collaborator

leeym commented Jan 29, 2025 via email

@mennanov
Owner

mennanov commented Jan 30, 2025

Yes, I agree with the approach of adding a capacity argument to the Increment() function. Currently we use a mitigation cache key with a TTL equal to the earliest time the next request can be accepted, in order to skip the rate limit check altogether. Maybe that could be an additional feature too, for performance?

To be honest, I don't like the approach of adding the capacity argument to the Increment() function, because it would force Increment() to implement the actual throttling logic, which is too much to ask of the SlidingWindowIncrementer abstraction.

Moreover, that logic would still require reading the values from the backend, doing the math, and potentially writing the new state back (if the request is not throttled). That introduces a race condition and requires a lock.

That being said, I would rather proceed with the lock approach, like the other limiters in this repo.
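
A rough sketch of that lock-based approach (hypothetical interfaces; the capacity check and wait calculation are simplified, since the real algorithm weights the previous window by its overlap with the sliding window):

```go
package limiterexample

import (
	"context"
	"errors"
	"time"
)

// errLimitExhausted is a local stand-in for the error returned when throttled.
var errLimitExhausted = errors.New("requests limit exhausted")

// distLocker and counterStore are minimal interfaces for this sketch; the repo
// has its own locking and backend abstractions.
type distLocker interface {
	Lock(ctx context.Context) error
	Unlock(ctx context.Context) error
}

type counterStore interface {
	Get(ctx context.Context, prev, curr time.Time) (prevCount, currCount int64, err error)
	Add(ctx context.Context, curr time.Time, ttl time.Duration) error
}

// limitWithLock reads the counters under a lock and increments only when the
// request is allowed, so throttled requests never inflate the window.
func limitWithLock(ctx context.Context, l distLocker, s counterStore, prevW, currW time.Time, window time.Duration, capacity int64) (time.Duration, error) {
	if err := l.Lock(ctx); err != nil {
		return 0, err
	}
	defer l.Unlock(ctx)

	prev, curr, err := s.Get(ctx, prevW, currW)
	if err != nil {
		return 0, err
	}
	if prev+curr >= capacity {
		// Throttled: do NOT increment; ask the client to retry once the
		// current window has rolled over (simplified wait calculation).
		return time.Until(currW.Add(window)), errLimitExhausted
	}
	return 0, s.Add(ctx, currW, window)
}
```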

@mennanov
Owner

We can update the calculation to use the smaller of the counter and the capacity, so once the counter exceeds the capacity, it won't further increase the wait time.

Sorry, I don't quite follow. The idea behind the returned time duration is that the client delays its retry by that duration so the retry won't get throttled again.
However, in a situation with multiple clients and a high volume of requests, a subsequent throttled request "invalidates" the delays returned to the previous clients.

@leeym
Collaborator

leeym commented Jan 30, 2025 via email

@CariappaKGanapathi
Author

Moreover, that logic would still require reading the values from the backend, doing the math, and potentially writing the new state back (if the request is not throttled). That introduces a race condition and requires a lock.

I believe this is the approach you mentioned above (correct me if I am wrong):

Limit() {
  • fetch counters
  • if the request has to be blocked -> do not increment, simply return the delay time
  • if the request can go through -> increment the counter and return
}

Locks being expensive and affecting performance is my only issue with this approach.

@leeym's approach of blocking requests is what I have implemented with the mitigation logic.

In a nutshell: when we know a request has to be retried after a certain delay d1, do not calculate or increment for any further calls until d1 elapses.

@CariappaKGanapathi
Author

@mennanov @leeym Can we settle on a solution? I would like to work on it once everyone is aligned.

@mennanov
Owner

mennanov commented Feb 8, 2025

My proposal is to cap the counter at the capacity at line 59, so once the counter reaches the capacity, it will not push the total further.

prev = int64(math.Min(float64(prev), float64(s.capacity)))
curr = int64(math.Min(float64(curr), float64(s.capacity)))

Interesting approach. @leeym would you be able to create a draft PR with it? Ideally I would like to see a test that explains the logic behind capping prev and curr with the capacity.

The approach with the Lock() seems solid, but it also somewhat defeats the purpose of this sliding window approach, because its main selling point (IMO) is the lack of a lock.
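
For illustration, the usual sliding-window estimate with that capping applied might look like the snippet below (not the library's code): once curr exceeds the capacity, the estimate, and therefore the wait derived from it, stops growing, even though the raw backend counter keeps increasing.

```go
package limiterexample

import (
	"math"
	"time"
)

// weightedCount caps the raw backend counters at the capacity before applying
// the standard sliding-window weighting; elapsed is how far we are into the
// current window of length `window`.
func weightedCount(prev, curr, capacity int64, elapsed, window time.Duration) float64 {
	prev = int64(math.Min(float64(prev), float64(capacity)))
	curr = int64(math.Min(float64(curr), float64(capacity)))
	overlap := 1 - float64(elapsed)/float64(window)
	return float64(prev)*overlap + float64(curr)
}
```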

@CariappaKGanapathi
Author

CariappaKGanapathi commented Feb 10, 2025

@leeym If you cap the counter after the increment is done, how does it help?

Won't the next request still fetch the current counter from the backend (which has already been increased) and increase it further, leading to ErrLimitExhausted continuously?

The way I see it, you are only capping the wait time. Since the current counter in the backend increments without capping, it will continue to block for longer than the returned wait time.

The only thing this fix does is artificially report the same wait time as the first block. Once the wait time is due, a request can still be blocked if a few more requests were made in the meantime.

Please correct me if my understanding is wrong.

What I am trying to convey is that once a request is blocked, it should ideally not be counted as a valid request, since the application has not "served" it. Requests that were blocked should not affect the counter at all.

@leeym
Collaborator

leeym commented Feb 10, 2025 via email

@CariappaKGanapathi
Author

Hello @leeym

I am still confused about how this works. I have tried my best to explain my scenario using the data below; there is a link to the Google Sheet to look at the formulas and structure I have used.

[image: spreadsheet with the scenario data]

In the test case you have written, I think the minimum works as expected because the entire capacity is being used. My issue is when a counter in the current window that is below the capacity has been blocked.
