
[core] Autoscaler consistency fix [2/2] #40254

Closed
wants to merge 8 commits

Conversation

vitsai
Contributor

@vitsai vitsai commented Oct 11, 2023

Autoscaler currently does not report a consistent snapshot when polling cluster state. Point-in-time consistency aside, the vast majority of issues show that resources allocated to a task are often double-counted between load (queued for scheduling) and usage (dispatched), which may occur on different nodes when there is spillback. The end result is that autoscaler may provision unnecessary extra nodes for double-counted load (the task has already been dispatched, but the load for the task still exists) before scaling back down, which causes undesirable churn.

While guaranteeing a fix to this issue is complex, one simple change that drastically decreases the probability of an inconsistent snapshot without sacrificing performance is to report load and usage for a given node together. By default, node usage changes are pushed at a 100ms cadence by each raylet to every other raylet through Ray Syncer for scheduling purposes. Load, on the other hand, is reported only to GCS in response to a 1s poll that GCS sends to every node. Even when there is no network error, the reporting delta between load and usage can easily reach 900ms.

The fix is to report load and usage together for autoscaler accounting purposes, independently of Ray Syncer. While this still does not guarantee the inconsistency cannot occur (GCS polls by sending an RPC to every node and collecting the responses, so in theory a task could still be double-counted across nodes if spillback happens between two nodes' responses), it significantly decreases the chances.
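
To illustrate the idea, here is a minimal standalone sketch (hypothetical names, not Ray's actual classes or RPC plumbing): the point is that a node answers the autoscaler poll by filling both load and usage from one locally consistent snapshot, so a single task can never appear in both.

// Minimal sketch with hypothetical names: the poll handler returns load and
// usage from the same critical section, so a task that has just been
// dispatched can never be counted in both maps.
#include <iostream>
#include <map>
#include <mutex>
#include <string>

struct NodeReport {
  std::map<std::string, double> usage;  // resources held by dispatched tasks
  std::map<std::string, double> load;   // resources demanded by queued tasks
};

class LocalResourceState {
 public:
  void OnTaskQueued(const std::string &resource, double amount) {
    std::lock_guard<std::mutex> lock(mu_);
    load_[resource] += amount;
  }
  void OnTaskDispatched(const std::string &resource, double amount) {
    std::lock_guard<std::mutex> lock(mu_);
    // Demand moves from load to usage under one lock, so no reader can
    // observe the same task in both maps.
    load_[resource] -= amount;
    usage_[resource] += amount;
  }
  // What the (hypothetical) GCS poll handler would call: both halves of the
  // report come from a single snapshot.
  NodeReport Snapshot() const {
    std::lock_guard<std::mutex> lock(mu_);
    return NodeReport{usage_, load_};
  }

 private:
  mutable std::mutex mu_;
  std::map<std::string, double> usage_;
  std::map<std::string, double> load_;
};

int main() {
  LocalResourceState state;
  state.OnTaskQueued("CPU", 4);
  state.OnTaskDispatched("CPU", 4);
  NodeReport report = state.Snapshot();
  std::cout << "CPU load=" << report.load["CPU"]
            << " usage=" << report.usage["CPU"] << std::endl;  // load=0 usage=4
  return 0;
}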

For other approaches to the general consistency problem (as well as autoscaling churn from transient demand), please see:
https://docs.google.com/document/d/10VhjYQWIGMiOXN7FGw-aUaOn43Y2Z8t5QDqq5n4gBkQ/edit
https://docs.google.com/document/d/10fy0mSLK5p0EHVzMvSSlSsFXPljsOHWcM7ZU6RWkxGU/edit

Note also that while autoscaler state is now almost entirely decoupled from other GCS/scheduling state, it still relies on GcsNodeInfo in GcsNodeManager to provide dead-node information, which arrives at a different cadence than the one at which autoscaler information is polled.

This change also adds a repro and enables an existing one that had been disabled.

Why are these changes needed?

Multiple users have run into this double-counting behavior and reported the resulting autoscaling churn.

Related issue number

#36926

Closes #40254

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@vitsai vitsai changed the title from "[core] Autoscaler consistency fix prototype" to "[core] Autoscaler consistency fix" on Oct 12, 2023
@vitsai vitsai marked this pull request as ready for review October 12, 2023 00:31
Contributor

@rickyyx rickyyx left a comment


Looks pretty good to me! Just a question on the testing. Great work.

nit: Let's add TODOs at the places where the old logic could be deprecated.

@DmitriGekhtman
Contributor

Any chance the Google Docs in the PR description could be updated with global read access?

@rickyyx rickyyx self-requested a review October 12, 2023 19:38
Contributor

@rickyyx rickyyx left a comment


Chatted with @jjyao offline. I think we should also try to fix the v1 autoscaler (the fix shouldn't be too hard). We should probably just make sure both v1 and v2 get the resource data as a separate snapshot for the autoscaler?

@rickyyx rickyyx added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 12, 2023
@vitsai
Contributor Author

vitsai commented Oct 16, 2023

@DmitriGekhtman For now they are still internal, as they haven't been finalized yet.

@vitsai vitsai changed the base branch from master to report-usage-v2 October 16, 2023 19:38
@vitsai
Contributor Author

vitsai commented Oct 16, 2023

Separated out the original v2 fixes into this PR: #40369

Contributor

@rickyyx rickyyx left a comment


I think we need more tests for v1 (or at least to alter the existing test).

The v2 tests in this PR call get_cluster_status directly, which is not active in v1.

@@ -335,6 +342,43 @@ class MonitorGrpcService : public GrpcService {
MonitorGcsServiceHandler &service_handler_;
};

// Legacy RPC interface for supporting autoscaler v1
class LegacyAutoscalerGcsServiceHandler {
Contributor


Could we just piggyback on the existing autoscaler state service for HandleGetAllResourceUsage?

Contributor Author


It's possible; we would just have to make new message types for GetAllResourceUsageRequest and GetAllResourceUsageReply, marshal them to Python, and regenerate the Python protos. We could also do it by having autoscaler.proto take a dependency on gcs.proto, which feels hacky (the "temporary" dependency could grow over time if we're not careful) but is a smaller change.

Adding a small service handler minimizes mixing the new autoscaler with the old one by creating a concrete piece of "obvious tech debt" that can be cleanly removed later. The size of the code change would be about the same as doing it in autoscaler.proto, though that route would avoid the overhead of another service handler. An illustrative-only sketch of this option follows.
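
Illustrative-only sketch of the "small service handler" option: the class name mirrors the snippet above, but the request/reply types and the method signature here are placeholders, not the actual Ray interfaces.

#include <functional>
#include <string>

// Placeholder stand-ins for the real protobuf messages and gRPC reply callback.
struct GetAllResourceUsageRequest {};
struct GetAllResourceUsageReply {
  std::string serialized_resource_usage_batch;
};
using SendReplyCallback = std::function<void(bool ok)>;

// Deliberately tiny interface: a single RPC kept alive only for autoscaler v1,
// so the whole class can be deleted once v1 is retired.
class LegacyAutoscalerGcsServiceHandler {
 public:
  virtual ~LegacyAutoscalerGcsServiceHandler() = default;
  virtual void HandleGetAllResourceUsage(GetAllResourceUsageRequest request,
                                         GetAllResourceUsageReply *reply,
                                         SendReplyCallback callback) = 0;
};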

@vitsai vitsai changed the title from "[core] Autoscaler consistency fix" to "[core] Autoscaler consistency fix [2/2]" on Oct 16, 2023
@vitsai vitsai linked an issue Oct 16, 2023 that may be closed by this pull request
Signed-off-by: vitsai <vitsai@cs.stanford.edu>
Signed-off-by: vitsai <vitsai@cs.stanford.edu>
Signed-off-by: vitsai <vitsai@cs.stanford.edu>
@vitsai
Contributor Author

vitsai commented Oct 24, 2023

Alternate approach on #40488. Let's see which one is more promising.


stale bot commented Dec 15, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Dec 15, 2023
Development

Successfully merging this pull request may close these issues.

[Autoscaler] starts extra workers than necessary