add Per Resource Rate Limits #22
From that issue (discord/discord-api-docs#5557 (comment)): what is the significance of the word "often" (which implies that certain routes may not account for top-level resources)?
Is a user safe assuming that all routes that include channel, guild, and webhook parameters are per-resource?
Does top-level refer to endpoints that contain those resources first?
Progress can now be made!
A new rate limit implementation that will adhere to Discord Rate Limits can be created using the following information.

Information

It all starts with a request that we want to send to Discord. Unfortunately, resources aren't unlimited, so Discord must rate limit requests accordingly. As an API Consumer, we do NOT want to hit that rate limit (because it makes our API unreliable). In other words, we want to make sure that each request we send has the highest chance of success. Discord adds pressure by banning people who keep hitting rate limits.

We want our requests to be sent in a performant manner. This means that we must send requests within the rate limit that Discord provides. However, in an effort to provide ideal performance for their own services, Discord rate limits requests per-route, with different rate limits per route. In other words, we cannot simply allow X requests per second, nor X requests per second per route. Instead, we must handle the rate limits as Discord provides them (from responses). Even if we didn't want to handle requests in a performant manner, we would still have to handle rate limits, since Discord can change them dynamically.

Since there is no way to determine a rate limit prior to sending a request, we must use an internal rate limiter that matches Discord's rate limiter. The complexity of that has already been discussed and implemented in pull request #14. The remaining complexity stems from the handling of the rate limiter, such that we can rate limit each request per-route (with different rate limit algorithms). If there were only one rate limit algorithm strategy, rate limiting would be as easy as mapping each Route to a Rate Limit Bucket Hash (that Discord provides). When that bucket is updated (or deleted), we could simply update the internal state of the Rate Limit Bucket.
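The single-algorithm mapping described above (Route to Bucket Hash to bucket state) can be sketched as follows. This is a minimal, hypothetical illustration; the type and method names are invented and are not Disgo's actual API.

```go
package main

import (
	"fmt"
	"sync"
)

// Bucket mirrors the rate limit state Discord returns in response
// headers (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset).
type Bucket struct {
	Limit     int
	Remaining int
	Reset     float64 // epoch seconds
}

// RateLimiter sketches the "one algorithm" case: every route maps to
// exactly one Discord bucket hash, and every hash maps to one bucket.
type RateLimiter struct {
	mu      sync.Mutex
	hashes  map[string]string  // route ID -> Discord bucket hash
	buckets map[string]*Bucket // bucket hash -> bucket state
}

func NewRateLimiter() *RateLimiter {
	return &RateLimiter{hashes: map[string]string{}, buckets: map[string]*Bucket{}}
}

// Update records the bucket hash that a response's X-RateLimit-Bucket
// header assigned to a route, remapping the route if Discord changed it.
func (r *RateLimiter) Update(routeID, hash string, b *Bucket) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.hashes[routeID] = hash
	r.buckets[hash] = b
}

// Bucket returns the bucket currently associated with a route, or nil
// for a route we have never seen (the Default Bucket case).
func (r *RateLimiter) Bucket(routeID string) *Bucket {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.buckets[r.hashes[routeID]]
}

func main() {
	rl := NewRateLimiter()
	rl.Update("GET /gateway/bot", "abc123", &Bucket{Limit: 2, Remaining: 1})
	fmt.Println(rl.Bucket("GET /gateway/bot").Remaining) // 1
}
```

When Discord moves a route to a different bucket hash, a single `Update` call remaps the route; the stale bucket can be garbage collected separately.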
Unfortunately, the addition of another rate limit algorithm (per-resource) — which can be observed on the same route — complicates this logic by adding another way for the Bucket's state to change. To be specific, a Rate Limit Bucket Hash will not match the Hash of the previous request in multiple scenarios (provided below).

Per-Route

Per-Resource
In any case, we always have to account for the fact that a bucket can be updated or deleted (by reading response headers). However, the implementation's complexity lies in handling per-resource routes in a different manner than per-route routes. In other words, in order to figure out whether a route's rate limit bucket has updated, you need to know whether the route is a per-resource route or NOT. Otherwise, you have to guess using a number of detectability methods that add unnecessary overhead to the program. Luckily, the addition of the information in #22 (comment) provides a specification we can use to determine this before sending a route (with certain exceptions).

Implementation

The above showcases why rate limits cannot be handled through the use of — and only of — Rate Limit Bucket Hashes. Instead, for any given request, we need to know the Current Bucket Hash, the Route, and the Route's algorithm. When we do not yet have the hash (for the first request), we implement the Default Bucket workaround. We can use the specification above to determine whether we need to track hashes per-route or per-route-resource for every Discord route.

Prior to a more direct explanation of an implementation, we should acknowledge that — due to per-route-resource rate limits — there is not a static capacity for the number of Rate Limit Bucket Hashes in the Rate Limit Bucket Map. In other words, using per-resource routes involves more overhead as you increase the number of different resources that you interact with. As an example, using 4 guilds in a per-resource route entails storing 4 guild rate limit buckets in the map. Using copygen, we can specify which send functions involve per-resource routes (i.e. contain a top-level resource) and have the RouteID be set accordingly from the Send function (without overhead).
As an example, per-route route IDs may be

Exception

The above rate limit implementation assumes that Discord will not change a rate limit from per-route to per-resource. Moreover, it assumes that these changes will be infrequent enough that the User can fix them by simply restarting the bot (alongside a quick update; in the Go language this requires recompilation unless an external service is used). This decision has two implications:

Per-Route -> Per-Resource
Per-Resource -> Per-Route
Handling Exceptions

As a thought experiment, let's also explore a mechanism that attempts to avoid the above scenario entirely. This involves the usage of detectability (which adds a slight overhead, but not enough to affect the max RPS of 50). One could argue that you never need to classify routes by algorithm, since a Bucket Hash is the single source of truth for the rate limit. The issue is that — to support concurrent requests — there must be an internal rate limiter which also updates when the bucket changes. This leads to a requirement for the following functionality: if the rate limit algorithm changes, all buckets for that route should be invalidated (such that issues in the above scenario do not occur). The following table showcases all the possibilities of rate-limit algorithm detection using the information available to each request, by sending the same route multiple times. Note that the possibilities involving a change from a
In the last case, we can NOT guarantee whether a new bucket occurs.
The main issue is revealed here. In 2, without prior classification, every subsequent bucket change for a per-resource route could signify a rate limit update. Following this, every subsequent bucket change for a per-route resource (when it happens to be different) could signify a rate limit update (rather than an algorithm update). Thus, that information is irrelevant to the rate limit algorithm detection and invalidation mechanism. The same logic applies to tracking the number of buckets per route: you can have a Route Bucket with 3 requests consisting of different resources still be a per-route rate-limit algorithm, if the bucket updated in between each request. Not to mention that there is no way to send AND ALSO process two requests at the exact same time. There are no criteria to invalidate a rate limit algorithm change beyond enhanced tracking of 429s. This is a product of the lack of consistency (and inability to contact a middleman server) in the Discord Rate Limit specification.

Nonetheless, we can also explore 429 tracking, which would invalidate the detected ALGORITHM for a given route when the bucket encounters a 429.

Discord shares headers on 429s and tells you whether it's a shared (per-resource) rate limit or a user (per-route) rate limit: see the exception section for documentation on the difference. The only reason Discord would NOT provide this (429 header) on regular requests is bandwidth. In essence, this decision is the software equivalent of an MBA cutting costs by removing napkins from the restaurant. You just make it harder on the end user, but Discord shows time and time again... Anyways, one could use either the per-resource specification (that contains exceptions and mistakes) or one of the above methods in an attempt to infer which algorithm is being used. Then, upon hitting a rate limit (that is not global), you can use the 429 header to prune all invalid buckets accordingly. That solves the problem.
That would make Discord a genius, except for the fact that this wasn't explicitly stated. And the fact that I had to "bully" them to get to this point. Yet in any case, that implementation is reliant on some level of consistency, which means exceptions need to be documented. As an example,

Implication

Exceptions and mistakes result in rate limit algorithms that are suboptimal. Using

In any case, using the invalidation implementation described above could result in the correct (per-resource) rate limit algorithm being assigned to

When a
At the end of the day, a map of

The above problem can technically be solved by classifying every route as per-route. This entails hashing EVERY resource in any route, such that resources that aren't top-level but are still rate limited per-resource are detected. However, this adds extreme overhead (in a similar manner to adding unnecessary key entries). That's why Discord ends up on the left end of the curve. The above implication is why it's important for Discord to stick to its specification rather than have — in their own words — "mistakes and exceptions".

Implementation

Use Copygen to create a map of route IDs to algorithm types. Then, for every request, use the algorithm to determine the hash of the request. User (per-route) is simply the ID, and Shared (per-resource) uses the respective resource. From here, the current implementation applies. Add an invalidation mechanism upon 429 detection accordingly.
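That implementation can be sketched as follows, assuming a generated map of route IDs to algorithm names. The routes and the table here are placeholders, not the actual output of Copygen.

```go
package main

import "fmt"

// routeAlgorithms is a hypothetical stand-in for the map that Copygen
// would generate from the endpoint list; it is not Disgo's real table.
var routeAlgorithms = map[string]string{
	"PUT /channels/{channel.id}/messages/{message.id}/reactions/{emoji}/@me": "shared",
	"GET /gateway/bot": "user",
}

// bucketKey applies the rule described above: a user (per-route) rate
// limit is keyed by the route ID alone, while a shared (per-resource)
// rate limit also includes the top-level resource ID.
func bucketKey(routeID, resourceID string) string {
	if routeAlgorithms[routeID] == "shared" {
		return routeID + ":" + resourceID
	}
	return routeID
}

func main() {
	fmt.Println(bucketKey("GET /gateway/bot", ""))
	fmt.Println(bucketKey("PUT /channels/{channel.id}/messages/{message.id}/reactions/{emoji}/@me", "41231"))
}
```

A 429-driven invalidation mechanism would then rewrite entries in `routeAlgorithms` (or an overlay of it) and drop the stale keys.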
A decision must be made on the specification of exceptions for a full (correct) implementation: discord/discord-api-docs#5605

If exceptions are not documented, the safest method of rate-limiting is to hash every route in a path. As an example, use both
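Hashing every resource in a path could be sketched like this; the key format and helper names are invented for illustration only.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// snowflake matches Discord snowflake IDs (17-19 digit integers).
var snowflake = regexp.MustCompile(`\b\d{17,19}\b`)

// pathKey hashes every resource in a path: the bucket key contains
// every snowflake, so a change in ANY resource (top-level or not)
// yields a distinct bucket. This is the "safest method" described
// above at its most aggressive, not Disgo's actual behavior.
func pathKey(method, path string) string {
	ids := snowflake.FindAllString(path, -1)
	template := snowflake.ReplaceAllString(path, "{id}")
	return method + " " + template + ":" + strings.Join(ids, "/")
}

func main() {
	fmt.Println(pathKey("GET", "/channels/123456789012345678/messages/876543210987654321"))
}
```

The cost is exactly the overhead described above: every distinct resource combination allocates its own bucket entry, even when Discord would have placed them all in one bucket.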
Discord has responded that exceptions will NOT be stated. As such, Disgo will implement a rate limit algorithm that is flexible enough to account for exceptions, but not 100% without fault.

Implementation
Deliverables

As described in #22 (comment), this implementation will do the following:
To be specific, it will NOT do the following:
Last Point (Discussion)

The last point is not an issue with the implementation, but rather the specification. While it can be resolved — via implementation — by hashing every route resource, this would result in every request being hashed and then pointed to a bucket. This has many implications due to the requirement of the default mechanism (by the API Producer) and is largely inefficient. The current solution is for the developer to identify the specification and correct its implementation in the code. In other words, when the specification is changed, the code needs to be changed with it. Thus, this problem is a byproduct of an inconsistent specification.

Discord does not provide a programmatic method to access the specification. This is why a developer who experiences a rate-limit change involving an unspecified exception must identify the change, implement it in the exception function, then redeploy the code. Redeployment can be avoided through hot-reloading — with the use of an interpreter for the hashing function — at the cost of performance. Cases such as these — unspecified rate-limit algorithm changes — are expected to be rare or problematic enough that they prompt a restart of the bot. As a result, an interpreter will not be used.
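The exception-function workflow described above might be sketched as follows. Every name here is hypothetical, and the overridden route is a made-up example; the point is only that the exception list is compiled in, so changing it requires a recompile and redeploy.

```go
package main

import "fmt"

// RouteException is the signature of the hypothetical "exception
// function": given a route ID, it reports whether the documented
// algorithm should be overridden, and with what. A developer edits
// this function (and recompiles) upon discovering an unspecified
// rate-limit change.
type RouteException func(routeID string) (override string, ok bool)

// exceptions is the compiled-in exception list. Swapping this for an
// interpreter-backed implementation is the hot-reloading alternative
// mentioned above, at the cost of performance.
var exceptions RouteException = func(routeID string) (string, bool) {
	switch routeID {
	// Hypothetical: documented as shared, observed to be per-route.
	case "GET /channels/{channel.id}":
		return "user", true
	}
	return "", false
}

// algorithmFor resolves a route's algorithm, letting the exception
// function win over the documented specification.
func algorithmFor(routeID, documented string) string {
	if override, ok := exceptions(routeID); ok {
		return override
	}
	return documented
}

func main() {
	fmt.Println(algorithmFor("GET /channels/{channel.id}", "shared"))
	fmt.Println(algorithmFor("GET /gateway/bot", "user"))
}
```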
While attempting to create a test for per-resource routes in #34, I kept running into routes that were specified as per-resource by the documentation, but were actually per-route. For example, using two different channel parameters with GetChannel resulted in the same Rate Limit Bucket.
Then, I attempted to use GetChannelMessages and ran into the same behavior. Then, I used the GetReactions emoji endpoint with two different channels and two different emojis... only to hit the same bucket for each. In search of more information, I encountered this issue: discord/discord-api-docs#5312. That's when it occurred to me that Discord is likely applying BOTH a per-route and a per-resource rate limit to certain routes. Such that per-resource does not refer to a per-resource rate limit (per route per the user), but rather an additional per-resource rate limit on the route (for certain resources on that route). YET THIS IS NOT SPECIFIED IN THE GODDAMN DOCUMENTATION... FUUUUUUUUUUUUUUU-

So what can we do? The Disgo rate limit algorithm is designed to be extendable, such that the ability to use multiple buckets per route request could be implemented (by using slices of bucket IDs). Yet no other wrapper does this. In addition, adhering to per-resource rate limits isn't necessarily required. Other API Wrappers don't adhere to them, and 429s against them aren't counted against the user anyways. In the context above, a per-resource rate limit is affected by other users. Such that adhering to them is a moot point: the state of a per-resource rate limit will always be manipulated by factors the program cannot keep track of. One could argue that attempts to keep track of a per-resource rate limit would allow a bot to be more efficient (by sending fewer requests using estimation). The alternative — which developers must implement right now — is to send a request multiple times until a

Implication

The above information can be confirmed by researching the introduction of Cloudflare Bans to the API. Discord stated that most bots — prior to Discord implementing excessive-429 bans — would send a request until it was successful, rather than adhere to the rate limit.
So it seems that the per-route rate limits are used to limit the number of times the user can send a request (in any given period), while the per-resource rate limits simply limit the usage of the resource itself. The user is still required to barrage the Discord server with a request until that request is successful. To reiterate, a user is expected to send a request as many times as possible until it is successful. The

Viewing the rate limit header examples with this information showcases that per-resource rate limit buckets aren't even provided.
The bucket has tokens remaining despite the rate limit occurring, indicating that the bucket being shown still applies to the user. This isn't specified in the documentation, though.

Implementation

Based on the above information, there is no need for invalidation or detection beyond updating per-route buckets when they change. There is no need to implement separate rate limit algorithms for each route, because the user is only expected to manage the per-route rate limit. One could argue here that the documentation is not only unclear, but misleading... That's for another article. Most important is that Disgo already allows the user to control the rate limits that the user is expected to control. As far as implementation goes, Disgo could assist the user by supplying a way to retry a request until it is successful. Currently, the user must set an arbitrary value to
reverts fill hashing function map #22 (comment)
Jesus christ, the rate limit docs really are atrocious. Feels like some employee offhandedly wrote the rate limit docs in 30 minutes and then the entire Discord team forgot about it since.
Rate Limits have been merged in #14 with one caveat.
Problem
Per Resource (and Per Resource Per Route) rate limits are not yet implemented.
The main reason is a lack of information from Discord regarding the following statement: "During calculation, per-route rate limits often account for top-level resources within the path using an identifier". This statement is rather ambiguous because "often" implies that it does NOT apply to every request containing a top-level resource, or started by a top-level resource. The problem is that this statement is also the only sentence regarding Discord's Per Resource Rate Limits, barring `X-RateLimit-Scope`, which only indicates that a 429'd request hit a Per Resource Rate Limit. There is also no clarification of why `X-RateLimit-Scope` calls Per Resource rate limits `shared`, which could imply that multiple users can trigger them rather than just the bot.

If we knew the pattern to identify top-level resources, it could be implemented in Disgo's current form. In the worst case (a misidentification of a Route rate limit as a Per Resource rate limit), excessive maps of RouteIDs to Hashes are made when the Bucket remains the same; but the Bucket isn't actually duplicated. Unfortunately, that pattern is not confirmed.
Most, if not all, libraries don't care about Per Resource rate limits since they are not counted against you. As an example where Per Resource rate limits are handled, DiscordGo creates a custom Per Route rate limit bucket for a Per Resource (Emoji) rate limit, limiting its efficiency. Discord itself states that 429s are inevitable for rate limits such as Emojis, but then also disapproves of the strategy of hitting a 429 and adjusting from there.
Implementation
The implementation of Per Resource (and Per Resource Per Route) rate limits in Disgo is rather straightforward.

`uint16` to `strings` (which means the rate limiter interface must use strings).

`Send` functions of endpoints in the list of known Per Resource Routes (using Copygen). These endpoints would provide the route-number (i.e. `1`), top-level resource (i.e. `g`, `c`, `w`), and top-level resource ID (i.e. snowflake `42123...`) to the `SendRequest` function. This would result in a hash that looks something like `1g41231`, which does not collide with Discord's bucket hash, while still remaining unique to a per-resource route.

Solution
Identification
In order to solve this issue, we must first identify the pattern (or routes) used as Per Resource rate limits. As an example, we know that Emoji endpoints are all Per Resource Per Route rate limits based on a given channel ID. There are multiple ways we can identify these routes, with the simplest being discord/discord-api-docs#5144. Otherwise, we may need to create the following experiments.
Experiment 1
Create a test that attempts every request up to 50 times a second until a 429 is hit (in that second), then outputs the `X-RateLimit-Scope` with the given route accordingly. If a rate limit is NOT hit, there is no rate limit for that request. Technically, the chance of missing rate limits that span longer than a second (i.e. the Global Gateway Rate Limit) is still possible. However, a `shared` (Per Resource) rate limit is expected to contain rate limit headers. As a result, these could be used (instead of a specific time frame) to trigger 429s.

Experiment 2
Implement the Per Resource feature on all routes as-is and use logging and/or a detection feature to identify when a Route is switching too much or too little. This fast-tracks the implementation (by requiring it for this experiment), but this experiment may not be worth the cognitive effort.