This repository has been archived by the owner on Apr 19, 2024. It is now read-only.

Handling throttling(429) at deployment(s) level under a single Instance #7

Open
jayendranarumugam opened this issue Jan 8, 2024 · 8 comments

Comments

@jayendranarumugam

Currently, we handle the 429 at the endpoint level (skipping the deployments). However, the TPM/RPM limits are defined at the deployment level.

A single instance can have multiple deployments, such as gpt3, gpt4, etc. In this scenario, how can we handle the 429?

The current logic can handle only a single deployment: if that deployment (let's assume gpt3.5turbo) returns a 429, the endpoint can be marked as throttling. But if there are multiple deployments, we cannot simply mark the endpoint as throttling, because it may still be able to serve other deployments (like gpt4, gpt4-turbo, etc.).

How can we serve multiple deployments under a single instance?

@jayendranarumugam jayendranarumugam changed the title Handling throttling at (429) at Deployment level under a single Instance Handling throttling at (429) at deployment(s) level under a single Instance Jan 8, 2024
@jayendranarumugam jayendranarumugam changed the title Handling throttling at (429) at deployment(s) level under a single Instance Handling throttling(429) at deployment(s) level under a single Instance Jan 8, 2024
@andredewes
Collaborator

You can do that by specifying the full deployment model in the BACKEND_X_URL setting. For example:

Then the load balancer will forward that full path to the chosen backend. However, on the client side you need to remove the path you added to the backend URL when sending requests. For example, in your client you go from:

  • [your_load_balancer_url]/openai/deployments/gpt35turbo/chat/completions?api-version=2023-07-01-preview

To

  • [your_load_balancer_url]/chat/completions?api-version=2023-07-01-preview

Otherwise, that part of the path will be duplicated. Let me know whether this works for you. Some client-side SDKs automatically add parts of the URL, which can remove your flexibility to send whatever path you want. Let me know if that's your case and we can work on another feature to facilitate it!
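To make the path duplication concrete, here is a minimal Python sketch. It assumes the load balancer simply appends the incoming path and query to the configured backend URL, as described above. The `forward_url` helper and the backend host name are hypothetical (the original example value for BACKEND_X_URL is not shown in this thread), and this is not the load balancer's actual code.

```python
from urllib.parse import urlsplit, urlunsplit


def forward_url(backend_url: str, request_path_and_query: str) -> str:
    """Append the incoming path and query to the backend base URL.

    Hypothetical helper: it only illustrates the concatenation
    behaviour described in the comment above.
    """
    base = urlsplit(backend_url)
    incoming = urlsplit(request_path_and_query)
    path = base.path.rstrip("/") + "/" + incoming.path.lstrip("/")
    return urlunsplit((base.scheme, base.netloc, path, incoming.query, ""))


# Hypothetical backend value that already contains the deployment path.
BACKEND_1_URL = "https://my-instance.example.com/openai/deployments/gpt35turbo"

# Client sends only the short path: the deployment segment appears once.
short = forward_url(
    BACKEND_1_URL, "/chat/completions?api-version=2023-07-01-preview"
)

# Client still sends the full path: the deployment segment is duplicated.
full = forward_url(
    BACKEND_1_URL,
    "/openai/deployments/gpt35turbo/chat/completions"
    "?api-version=2023-07-01-preview",
)

print(short)
print(full)
```

The second URL contains `/openai/deployments/gpt35turbo` twice, which is why the client has to drop that prefix.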

@jayendranarumugam
Author

Thanks @andredewes for the quick reply. While this allows us to hit the deployment-specific route, I cannot scale this solution to support multiple deployments across different instances (grouping all the instances by deployment).

Let me describe my use case.

I have 2 OpenAI instances (Instance-1 and Instance-2).

Both of these instances have 2 deployments: gpt35turbodeployment and gpt4deployment.

  • When a gpt4 request comes in, e.g. https://localhost:7151/openai/deployments/gpt4/, it should be checked against the following available backends:

    • gpt4deployment at Instance-1
    • gpt4deployment at Instance-2

    Throttling behaviour: if gpt4deployment at Instance-1 is being throttled, we need to route all gpt4 traffic to Instance-2, but this should not affect traffic for the other deployment (gpt35turbo).

  • When a gpt35turbo request comes in, e.g. https://localhost:7151/openai/deployments/gpt35turbo/, it should be checked against the following available backends:

    • gpt35turbodeployment at Instance-1
    • gpt35turbodeployment at Instance-2

    Throttling behaviour: if gpt35turbodeployment at Instance-2 is being throttled, we need to route all gpt35turbodeployment traffic to Instance-1, but this should not affect traffic for the other deployment (gpt4deployment).

The assumption here is that, for a given deployment, the name is the same across all instances.
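The routing behaviour requested above can be sketched in Python as follows. This is only an illustration of the idea of tracking throttling per (instance, deployment) pair instead of per endpoint. The `DeploymentRouter` class, its method names, and the "first non-throttled instance wins" selection order are all assumptions made for this sketch, not part of the load balancer.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class DeploymentRouter:
    """Hypothetical router keeping throttle state per (instance, deployment)."""

    # deployment name -> instances that host a deployment with that name
    backends: Dict[str, List[str]]
    # (instance, deployment) -> time (in seconds) until which it is throttled
    throttled_until: Dict[Tuple[str, str], float] = field(default_factory=dict)

    def mark_throttled(
        self, instance: str, deployment: str, retry_after_s: float, now: float
    ) -> None:
        """Record a 429 for one deployment on one instance only."""
        self.throttled_until[(instance, deployment)] = now + retry_after_s

    def pick(self, deployment: str, now: float) -> Optional[str]:
        """Return the first instance whose deployment is not throttled."""
        for instance in self.backends.get(deployment, []):
            if self.throttled_until.get((instance, deployment), 0.0) <= now:
                return instance
        return None  # every backend for this deployment is throttled


router = DeploymentRouter(
    backends={
        "gpt4deployment": ["Instance-1", "Instance-2"],
        "gpt35turbodeployment": ["Instance-1", "Instance-2"],
    }
)

# gpt4deployment on Instance-1 returns a 429 at t=100 with a 30 s back-off.
router.mark_throttled("Instance-1", "gpt4deployment", retry_after_s=30, now=100)

gpt4_target = router.pick("gpt4deployment", now=110)
gpt35_target = router.pick("gpt35turbodeployment", now=110)
gpt4_after_backoff = router.pick("gpt4deployment", now=131)
```

While the back-off is active, gpt4 traffic moves to Instance-2, and gpt35turbo traffic is unaffected because its key in `throttled_until` is separate.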

@andredewes
Collaborator

I think I understand what you're trying to say. You don't want to mix GPT35 and GPT4 from your applications' perspective, because they are already specific about their desired model within the request. You don't want an app sending a /gpt4 path to end up in a /gpt35 backend.

In this scenario, wouldn't it make more sense to deploy two instances of the load balancer, one for your GPT3.5 endpoints and the other for the GPT4 applications?

@jayendranarumugam
Author

> I think I understand what you're trying to say. You don't want to mix GPT35 and GPT4 from your applications' perspective, because they are already specific about their desired model within the request. You don't want an app sending a /gpt4 path to end up in a /gpt35 backend.

Exactly.

> In this scenario, wouldn't it make more sense to deploy two instances of the load balancer, one for your GPT3.5 endpoints and the other for the GPT4 applications?

Isn't that too much infra/cost to handle? Also, consider the enterprise level, where the number of models will keep growing. Would adding a load balancer for each model be a good design from a scalability standpoint? YARP already supports multiple clusters and destinations with multiple routes. Can we use that capability to implement this within a single load balancer?

@andredewes
Collaborator

Good question. I think this is a balance between ease of use and how complex the code and configuration can become. One of the drawbacks of this solution is that YARP still doesn't support retries natively (see microsoft/reverse-proxy#56). Once it does, we will need to reevaluate this code completely, and any YARP-style configuration will become much more straightforward to implement.

Now, coming back to your concern about capacity: if you check the memory consumption of this container, you will see it stays around 40-60MB after its initial startup, and it goes up and down depending on how much traffic you have. That's still an exceptionally low and acceptable consumption, IMO.

Can we revisit this topic once YARP natively implements HTTP retries, and keep it simple for now?

@jayendranarumugam
Author

> One of the drawbacks of this solution is that YARP still doesn't support retries natively (check here microsoft/reverse-proxy#56)

Thanks for this. I believe this is why you implemented the passive mode with the custom ThrottlingHealthPolicy? Can't it be scaled to a multi-cluster YARP design?

@andredewes
Collaborator

The custom ThrottlingHealthPolicy is needed because we want the passive health checker to mark the backend as "unhealthy" only for the time specified in the Retry-After HTTP response header. This logic is not built into any standard YARP policy.

And yes, it is possible to scale to multi-cluster. This is becoming a more common requirement lately and we're planning to implement this in the coming months.
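As an illustration of the Retry-After idea described above, here is a small Python sketch that turns a Retry-After header value into a back-off duration. It is not the ThrottlingHealthPolicy code from this project (that policy is written against YARP); the function name `retry_after_seconds` is hypothetical. It relies only on the fact that HTTP allows Retry-After to be either a number of seconds or an HTTP date.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(header_value: str, now: datetime) -> Optional[float]:
    """Return how long a backend should stay "unhealthy", in seconds.

    Accepts both Retry-After forms: delta-seconds ("30") and an
    HTTP date ("Wed, 21 Oct 2015 07:28:00 GMT").
    Returns None when the value cannot be parsed.
    """
    value = header_value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None
    if when.tzinfo is None:
        # Treat a date without zone information as UTC.
        when = when.replace(tzinfo=timezone.utc)
    # A date in the past means the backend can be used again immediately.
    return max(0.0, (when - now).total_seconds())


now = datetime(2015, 10, 21, 7, 27, 0, tzinfo=timezone.utc)
from_seconds = retry_after_seconds("30", now)
from_date = retry_after_seconds("Wed, 21 Oct 2015 07:28:00 GMT", now)
unparsable = retry_after_seconds("soon", now)
```

A health policy built on this would keep the backend out of rotation until `now` plus the returned number of seconds, and fall back to some default when the header is missing or unparsable.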

@andredewes
Collaborator

The custom ThrottlingHealthPolicy is needed because we want the passive health checker to mark the backend as "unhealthy" only for the time specified in the Retry-After HTTP response header. This logic is not built into any standard YARP policy.

And yes, it is possible to scale to multi-cluster. This is becoming a more common requirement lately and we're planning to implement this in the coming weeks.


2 participants