[Serve] Faster bulk imperative Serve Application deploys (ray-project…
…#49168)

## Why are these changes needed?

Our pattern of using Ray Serve has us deploying many hundreds/thousands of apps using the imperative API (`serve.run`). This ends up being very slow because the Controller needs to checkpoint as part of every RPC. It would be significantly more efficient to batch the deploys so that we can checkpoint fewer times.

This PR adds a new `serve.run_many()` public API, marked as developer-only, that can submit many applications to the Serve Controller in one RPC, with just a single checkpoint being saved after all of those applications are registered. The entire existing code path (including `serve.run()`) is refactored to be bulk operations under the hood (`serve.run()` calls `serve.run_many()`).

To further help with our particular use case, where the applications are being deployed from a controller that doesn't care about waiting for e.g. ingress deployment creation, the new code path also has fine-grained control over which things are waited for.

---

Just introducing a batch API isn't sufficient to actually provide a meaningful speedup. As mentioned above, the thing that is slow is the checkpointing, and right now, the checkpointing is very granular: the various stateful components checkpoint themselves at the bottom of the call stack, so even a single RPC might cause them to checkpoint multiple times right now.
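To make the cost model concrete, here is a minimal toy sketch of why batching reduces checkpoint count. It does not use Ray at all: `ToyController`, `deploy_one`, `deploy_many`, and `chunked` are hypothetical names invented for illustration, and the model assumes exactly one checkpoint per mutating RPC (the real Controller could checkpoint more than once per RPC before this change).

```python
from typing import Dict, List


class ToyController:
    """Hypothetical stand-in for a controller that persists state per mutating RPC."""

    def __init__(self) -> None:
        self.apps: Dict[str, dict] = {}
        self.checkpoints_written = 0

    def _checkpoint(self) -> None:
        # A real controller would serialize its state to durable storage here;
        # that is the expensive step this sketch only counts.
        self.checkpoints_written += 1

    def deploy_one(self, name: str, config: dict) -> None:
        # One RPC per application: one checkpoint per application.
        self.apps[name] = config
        self._checkpoint()

    def deploy_many(self, batch: Dict[str, dict]) -> None:
        # One RPC per batch: register everything, then checkpoint once.
        self.apps.update(batch)
        self._checkpoint()


def chunked(items: List[str], size: int) -> List[List[str]]:
    """Split items into consecutive groups of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


names = [f"app-{i}" for i in range(5000)]

one_by_one = ToyController()
for name in names:
    one_by_one.deploy_one(name, {"num_replicas": 1})

batched = ToyController()
for group in chunked(names, 100):
    batched.deploy_many({name: {"num_replicas": 1} for name in group})

print(one_by_one.checkpoints_written, batched.checkpoints_written)  # 5000 50
```

Both controllers end up with the same registered apps; only the number of checkpoints differs.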
Below I've tried to map out all the reasons that the `Application/DeploymentStateManager`s might checkpoint:

```mermaid
graph TD;
    deployment_state_set_target_state[DeploymentState._set_target_state] --> dsm_checkpoint[DeploymentStateManager._save_checkpoint_func]
    deployment_state_deploy[DeploymentState.deploy] --> deployment_state_set_target_state
    deployment_state_manager_deploy[DeploymentStateManager.deploy] --> deployment_state_deploy
    application_state_apply_deployment_info[ApplicationState.apply_deployment_info] --> deployment_state_manager_deploy
    application_state_reconcile_target_deployments[ApplicationState._reconcile_target_deployments] --x application_state_apply_deployment_info
    application_state_update[ApplicationState.update] --> application_state_reconcile_target_deployments
    application_state_manager_update[ApplicationStateManager.update] --x application_state_update
    serve_controller_run_control_loop[ServeController.run_control_loop] --> application_state_manager_update

    deployment_state_set_target_state_deleting[DeploymentState._set_target_state_deleting] --> dsm_checkpoint
    deployment_state_delete[DeploymentState.delete] --> deployment_state_set_target_state_deleting
    deployment_state_manager_delete_deployment[DeploymentStateManager.delete_deployment] --> deployment_state_delete
    application_state_delete_deployment[ApplicationState._delete_deployment] --> deployment_state_manager_delete_deployment
    application_state_reconcile_target_deployments --> application_state_delete_deployment

    deployment_state_autoscale[DeploymentState.autoscale] --> deployment_state_set_target_state
    deployment_state_manager_update[DeploymentStateManager.update] --> deployment_state_autoscale
    serve_controller_run_control_loop --> deployment_state_manager_update

    as_set_target_state[ApplicationState._set_target_state] --> asm_checkpoint[ApplicationStateManager._save_checkpoint_func]
    as_recover_target_state_from_checkpoint[ApplicationState.recover_target_state_from_checkpoint] --> as_set_target_state
    asm_recover_from_checkpoint[ApplicationStateManager._recover_from_checkpoint] --> as_recover_target_state_from_checkpoint
    asm_init[ApplicationStateManager.__init__] --> asm_recover_from_checkpoint
    sc_init[ServeController.__init__] --> asm_init

    as_set_target_state_deleting[ApplicationState._set_target_state_deleting] --> as_set_target_state
    as_delete[ApplicationState.delete] --> as_set_target_state_deleting
    asm_delete_app[ApplicationStateManager.delete_app] --> as_delete
    sc_delete_apps[ServeController.delete_apps] --x asm_delete_app
    RPC --> sc_delete_apps

    as_clear_target_state_and_store_config[ApplicationState._clear_target_state_and_store_config] --> as_set_target_state
    as_apply_app_config[ApplicationState.apply_app_config] --> as_clear_target_state_and_store_config
    asm_apply_app_configs[ApplicationStateManager.apply_app_configs] --x as_apply_app_config
    sc_apply_config[ServeController.apply_config] --> asm_apply_app_configs
    RPC --> sc_apply_config

    as_deploy_app[ApplicationState.deploy_app] --> as_set_target_state
    asm_deploy_app[ApplicationStateManager.deploy_app] --> as_deploy_app
    sc_deploy_application[ServeController.deploy_application] --> asm_deploy_app
    RPC --> sc_deploy_application

    as_apply_app_config --> as_set_target_state
```

So, in addition to the batch API that the client sees, I've refactored where these checkpoints are done so that they happen at the *top* of those call stacks instead of at the bottom.

- We still checkpoint before (now just before) returning an RPC that mutates state.
- We still checkpoint after making any changes to internal state and before issuing any commands to the cluster to e.g. start/stop replicas (just not *immediately* after making the internal state change).

I did *not* change the `EndpointState`'s checkpointing because it hasn't shown up in our flamegraphs.
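The shape of that refactor can be sketched with a toy model. This is not the actual Serve code: `BottomCheckpointing`, `TopCheckpointing`, `deploy_rpc`, `control_loop_step`, and the event strings are hypothetical, and the sketch only records the order of events to show the two invariants listed above (checkpoint just before the RPC returns, and after state changes but before cluster commands).

```python
from typing import Dict, List


class BottomCheckpointing:
    """'Before' shape: every target-state change checkpoints immediately."""

    def __init__(self) -> None:
        self.targets: Dict[str, int] = {}
        self.events: List[str] = []

    def _set_target_state(self, name: str, replicas: int) -> None:
        self.targets[name] = replicas
        self.events.append("checkpoint")  # bottom of the call stack

    def deploy_rpc(self, deployments: Dict[str, int]) -> None:
        for name, replicas in deployments.items():
            self._set_target_state(name, replicas)


class TopCheckpointing:
    """'After' shape: state changes are plain mutations; callers checkpoint once."""

    def __init__(self) -> None:
        self.targets: Dict[str, int] = {}
        self.events: List[str] = []

    def _set_target_state(self, name: str, replicas: int) -> None:
        self.targets[name] = replicas  # no checkpoint here

    def deploy_rpc(self, deployments: Dict[str, int]) -> None:
        for name, replicas in deployments.items():
            self._set_target_state(name, replicas)
        self.events.append("checkpoint")  # once, just before the RPC returns

    def control_loop_step(self, scale_to: Dict[str, int]) -> None:
        for name, replicas in scale_to.items():
            self._set_target_state(name, replicas)
        # Persist the new targets before acting on them in the cluster.
        self.events.append("checkpoint")
        for name in scale_to:
            self.events.append(f"start_replicas:{name}")


request = {"a": 1, "b": 1, "c": 1}

bottom = BottomCheckpointing()
bottom.deploy_rpc(request)

top = TopCheckpointing()
top.deploy_rpc(request)
top.control_loop_step({"a": 2, "b": 2})

print(bottom.events.count("checkpoint"))  # 3
print(top.events)
# ['checkpoint', 'checkpoint', 'start_replicas:a', 'start_replicas:b']
```

In this model the bottom-checkpointing variant saves once per deployment touched by the RPC, while the top-checkpointing variant saves once per RPC or control-loop step, regardless of how many state objects changed.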
---

Before these changes, deploying 5k Serve apps, each with one deployment, took >1 hour and would often never finish because the Serve Controller would become unresponsive and KubeRay would end up restarting the cluster. With these changes, deploying 5k Serve apps with a batch size of 100 per API call only takes about 90 seconds!

## Related issue number

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [x] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Josh Karpel <josh.karpel@gmail.com>
Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com>