
[core] Autoscaler consistency fix [2/2] #40254

Closed
wants to merge 8 commits

Conversation

vitsai
Contributor

@vitsai vitsai commented Oct 11, 2023

Autoscaler currently does not report a consistent snapshot when polling cluster state. Point-in-time consistency aside, the vast majority of issues show that resources allocated to a task are often double-counted between load (queued for scheduling) and usage (dispatched), which may occur on different nodes when there is spillback. The end result is that autoscaler may provision unnecessary extra nodes for double-counted load (the task has already been dispatched, but the load for the task still exists) before scaling back down, which causes undesirable churn.

While guaranteeing a fix to this issue is complex, one simple change that drastically decreases the probability of an inconsistent snapshot without sacrificing performance is to report load and usage for a given node together. By default, node usage changes are pushed at a 100ms cadence by each raylet to every other raylet through Ray Syncer for scheduling purposes. Load, on the other hand, is reported only to GCS in response to a 1s poll that GCS sends to every node. Even when there is no network error, the reporting delta between load and usage can easily reach 900ms.

The fix is to report load and usage together for autoscaler accounting purposes, independently of Ray Syncer. While this still does not guarantee the inconsistency cannot occur (GCS polls by sending an RPC to every node and collecting the responses, so in theory a task could still be double-counted across nodes if spillback happens between two nodes' responses), it significantly decreases the chances.
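
To illustrate the idea, here is a minimal standalone sketch (hypothetical names, not Ray's actual classes or RPC plumbing): the point is that a node answers the autoscaler poll by filling both load and usage from one locally consistent snapshot, so a single task can never appear in both.

// Minimal sketch with hypothetical names: the poll handler returns load and
// usage from the same critical section, so a task that has just been
// dispatched can never be counted in both maps.
#include <iostream>
#include <map>
#include <mutex>
#include <string>

struct NodeReport {
  std::map<std::string, double> usage;  // resources held by dispatched tasks
  std::map<std::string, double> load;   // resources demanded by queued tasks
};

class LocalResourceState {
 public:
  void OnTaskQueued(const std::string &resource, double amount) {
    std::lock_guard<std::mutex> lock(mu_);
    load_[resource] += amount;
  }
  void OnTaskDispatched(const std::string &resource, double amount) {
    std::lock_guard<std::mutex> lock(mu_);
    // Demand moves from load to usage under one lock, so no reader can
    // observe the same task in both maps.
    load_[resource] -= amount;
    usage_[resource] += amount;
  }
  // What the (hypothetical) GCS poll handler would call: both halves of the
  // report come from a single snapshot.
  NodeReport Snapshot() const {
    std::lock_guard<std::mutex> lock(mu_);
    return NodeReport{usage_, load_};
  }

 private:
  mutable std::mutex mu_;
  std::map<std::string, double> usage_;
  std::map<std::string, double> load_;
};

int main() {
  LocalResourceState state;
  state.OnTaskQueued("CPU", 4);
  state.OnTaskDispatched("CPU", 4);
  NodeReport report = state.Snapshot();
  std::cout << "CPU load=" << report.load["CPU"]
            << " usage=" << report.usage["CPU"] << std::endl;  // load=0 usage=4
  return 0;
}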

For other approaches to the general consistency problem (as well as autoscaling churn from transient demand), please see:
https://docs.google.com/document/d/10VhjYQWIGMiOXN7FGw-aUaOn43Y2Z8t5QDqq5n4gBkQ/edit
https://docs.google.com/document/d/10fy0mSLK5p0EHVzMvSSlSsFXPljsOHWcM7ZU6RWkxGU/edit

Note also that while autoscaler state is now almost entirely decoupled from other GCS/scheduling state, it still relies on GcsNodeInfo in GcsNodeManager to provide dead-node information, which arrives at a different cadence than the one at which autoscaler information is polled.

This change also adds a repro and enables an existing one that had been disabled.

Why are these changes needed?

Multiple users have run into this double-counting behavior and reported the resulting autoscaling churn.

Related issue number

#36926

Closes #40254

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@vitsai vitsai changed the title from "[core] Autoscaler consistency fix prototype" to "[core] Autoscaler consistency fix" on Oct 12, 2023
@vitsai vitsai marked this pull request as ready for review October 12, 2023 00:31
Contributor

@rickyyx rickyyx left a comment


Looks pretty good to me! Just a question on the testing. Great work.

nit: Let's add TODOs at the places where the old logic could be deprecated.

@DmitriGekhtman
Contributor

Any chance the Google Docs in the PR description could be updated with global read access?

@rickyyx rickyyx self-requested a review October 12, 2023 19:38
Contributor

@rickyyx rickyyx left a comment


Chatted with @jjyao offline. I think we should also try to fix the v1 autoscaler (the fix shouldn't be too hard). We should probably just make sure both v1 and v2 get the resource data as a separate snapshot for the autoscaler?

@rickyyx rickyyx added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 12, 2023
@vitsai
Contributor Author

vitsai commented Oct 16, 2023

@DmitriGekhtman For now they are still internal, as they haven't been finalized yet.

@vitsai vitsai changed the base branch from master to report-usage-v2 October 16, 2023 19:38
@vitsai
Contributor Author

vitsai commented Oct 16, 2023

Separated out the original v2 fixes into this PR: #40369

Contributor

@rickyyx rickyyx left a comment


I think we need more tests for v1 (or at least to alter the existing test).

The v2 tests in this PR call get_cluster_status directly, which is not active in v1.

@@ -335,6 +342,43 @@ class MonitorGrpcService : public GrpcService {
MonitorGcsServiceHandler &service_handler_;
};

// Legacy RPC interface for supporting autoscaler v1
class LegacyAutoscalerGcsServiceHandler {
Contributor


Could we just piggyback on the existing autoscaler state service for HandleGetAllResourceUsage?

Contributor Author


It's possible; we would just have to make new message types for GetAllResourceUsageRequest and GetAllResourceUsageReply, marshal them to Python, and regenerate the Python protos. We could also do it by having autoscaler.proto take a dependency on gcs.proto, which feels hacky (the "temporary" dependency could grow over time if we're not careful) but is a smaller change.

Adding a small service handler minimizes mixing the new autoscaler with the old one by creating a concrete piece of "obvious tech debt" that can be cleanly removed later. The size of the code change would be about the same as doing it in autoscaler.proto, though that route would avoid the overhead of another service handler. An illustrative-only sketch of this option follows.
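
Illustrative-only sketch of the "small service handler" option: the class name mirrors the snippet above, but the request/reply types and the method signature here are placeholders, not the actual Ray interfaces.

#include <functional>
#include <string>

// Placeholder stand-ins for the real protobuf messages and gRPC reply callback.
struct GetAllResourceUsageRequest {};
struct GetAllResourceUsageReply {
  std::string serialized_resource_usage_batch;
};
using SendReplyCallback = std::function<void(bool ok)>;

// Deliberately tiny interface: a single RPC kept alive only for autoscaler v1,
// so the whole class can be deleted once v1 is retired.
class LegacyAutoscalerGcsServiceHandler {
 public:
  virtual ~LegacyAutoscalerGcsServiceHandler() = default;
  virtual void HandleGetAllResourceUsage(GetAllResourceUsageRequest request,
                                         GetAllResourceUsageReply *reply,
                                         SendReplyCallback callback) = 0;
};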

@vitsai vitsai changed the title from "[core] Autoscaler consistency fix" to "[core] Autoscaler consistency fix [2/2]" on Oct 16, 2023
@vitsai vitsai linked an issue Oct 16, 2023 that may be closed by this pull request
Signed-off-by: vitsai <vitsai@cs.stanford.edu>
Signed-off-by: vitsai <vitsai@cs.stanford.edu>
Signed-off-by: vitsai <vitsai@cs.stanford.edu>
@vitsai
Contributor Author

vitsai commented Oct 24, 2023

Alternate approach on #40488. Let's see which one is more promising.


stale bot commented Dec 15, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Dec 15, 2023
Development

Successfully merging this pull request may close these issues.

[Autoscaler] starts extra workers than necessary