[core] Autoscaler consistency fix [2/2] #40254
Conversation
Looks pretty good to me! Just a question on the testing. Great work.
nit: Let's add a TODO at the places where the old logic can be deprecated.
Any chance the Google Docs in the PR description could be updated with global read access?
Chatted with @jjyao offline. I think we should also try to fix the v1 autoscaler (the fix shouldn't be too hard). We would probably just make sure both v1 and v2 get the resource data as a separate snapshot for the autoscaler?
@DmitriGekhtman for now they are still internal, as they haven't been finalized yet.
Separated out the original v2 fixes into this PR: #40369
I think we need more tests for v1 (or at least to alter the existing test). The v2 tests in this PR call get_cluster_status directly, which is not active in v1.
@@ -335,6 +342,43 @@ class MonitorGrpcService : public GrpcService {
  MonitorGcsServiceHandler &service_handler_;
};

// Legacy RPC interface for supporting autoscaler v1
class LegacyAutoscalerGcsServiceHandler {
Could we just piggyback on the existing autoscaler state service for HandleGetAllResourceUsage?
It's possible; we would just have to make new message types for GetAllResourceUsageRequest and GetAllResourceUsageReply, marshal them to Python, and regenerate the Python protos. We could also do it by having autoscaler.proto take a dependency on gcs.proto, which feels hacky (because the "temporary" dependency could grow over time if we're not careful) but is a smaller change.
Adding a small service handler minimizes mixing the new autoscaler with the old autoscaler by creating a concrete piece of "obvious tech debt" that can be removed fairly cleanly later. The size of the code change compared to doing it in autoscaler.proto would be around the same, but we wouldn't have the overhead of another service handler.
Signed-off-by: vitsai <vitsai@cs.stanford.edu>
Alternate approach on #40488. Let's see which one is more promising.
Autoscaler currently does not report a consistent snapshot when polling cluster state. Point-in-time consistency aside, most of the observed issues stem from resources allocated to a task being double-counted between load (queued for scheduling) and usage (dispatched), which may be reported from different nodes when there is spillback. The end result is that the autoscaler may provision unnecessary extra nodes for the double-counted load (the task has already been dispatched, but the load for the task still appears) before scaling back down, which causes undesirable churn.
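To make the double-counting concrete, here is a minimal, purely illustrative Python sketch (the type and function below are hypothetical, not Ray APIs) of how a task that is queued on one node when load is collected but dispatched to another node by the time usage is collected gets counted twice:

```python
from dataclasses import dataclass


@dataclass
class NodeSnapshot:
    """Toy per-node view (hypothetical, not Ray code)."""
    load_cpus: float = 0.0   # CPUs requested by tasks still queued for scheduling
    usage_cpus: float = 0.0  # CPUs held by tasks that have already been dispatched


def apparent_cpu_demand(load_view, usage_view):
    """Sum load and usage the way a naive accounting pass might when the two
    views were collected at different times (and possibly from different nodes)."""
    return sum(n.load_cpus for n in load_view) + sum(n.usage_cpus for n in usage_view)


# When load is collected, a 4-CPU task is still queued on node A...
load_view = [NodeSnapshot(load_cpus=4.0), NodeSnapshot()]
# ...but by the time usage is collected, it has spilled back and started on node B.
usage_view = [NodeSnapshot(), NodeSnapshot(usage_cpus=4.0)]

# The same task shows up once as load and once as usage: 8 CPUs of apparent
# demand for a 4-CPU task, which can trigger an unnecessary upscale.
print(apparent_cpu_demand(load_view, usage_view))  # -> 8.0
```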
While guaranteeing a fix to this issue is complex, one simple change that drastically decreases the probability of an inconsistent snapshot and does not sacrifice performance is to report load and usage for a given node together. By default, node usage changes are pushed at a 100ms cadence by each raylet to every other raylet through Ray Syncer for scheduling purposes. On the other hand, load is reported only to GCS in response to a 1s poll that GCS sends to every node. Even when there is no network error, the potential delta in reporting easily reaches 900ms between load and usage.
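A rough back-of-the-envelope check of that skew, using the default periods mentioned above (the helper below is hypothetical; it just restates the arithmetic):

```python
def worst_case_report_skew_ms(usage_push_period_ms: int = 100,
                              load_poll_period_ms: int = 1000) -> int:
    """Rough worst-case gap between the usage view (pushed every ~100ms via
    Ray Syncer) and the load view (polled by GCS every ~1s), assuming the two
    schedules are completely unsynchronized."""
    # Usage is at most about one push period old, while load can be nearly a
    # full poll period old, so the two views can easily describe cluster
    # states that are roughly (poll period - push period) apart.
    return load_poll_period_ms - usage_push_period_ms


print(worst_case_report_skew_ms())  # -> 900, matching the ~900ms delta above
```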
Fix this so that load and usage are reported together for autoscaler accounting purposes independently of Ray Syncer. While this still does not guarantee the inconsistency doesn't exist (GCS polls by sending an RPC to all the nodes and collecting the responses, so in theory a task could still be double-counted between nodes if spillback occurs between when two of them respond), it does significantly decrease the chances.
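Conceptually, the fix keeps each node's load and usage paired in a single poll reply, so the autoscaler's per-node view is internally consistent even though cross-node skew can remain. A hypothetical Python sketch of that shape (none of these names are the actual Ray messages or handlers):

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class NodeResourceReport:
    """Hypothetical combined per-node report: load and usage are captured
    together by the node at the moment it answers the poll."""
    node_id: str
    load_by_shape: Dict[str, int]  # e.g. {"CPU:4": 1} tasks still queued
    usage: Dict[str, float]        # e.g. {"CPU": 4.0} resources currently in use


def poll_node(node_id: str) -> NodeResourceReport:
    # Stand-in for the per-node RPC; both fields are built from the same
    # local state, so a task cannot drift between them for this node.
    return NodeResourceReport(node_id, load_by_shape={}, usage={"CPU": 0.0})


def collect_cluster_snapshot(node_ids: List[str]) -> List[NodeResourceReport]:
    """GCS-side sketch: one poll per node, with load and usage arriving paired.
    Cross-node skew is still possible (nodes answer at different times), but a
    single task can no longer be split between a pushed usage update and a
    separately polled load report for the same node."""
    return [poll_node(n) for n in node_ids]


snapshot = collect_cluster_snapshot(["node-a", "node-b"])
```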
For other approaches to the general consistency problem (as well as autoscaling churn from transient demand), please see:
https://docs.google.com/document/d/10VhjYQWIGMiOXN7FGw-aUaOn43Y2Z8t5QDqq5n4gBkQ/edit
https://docs.google.com/document/d/10fy0mSLK5p0EHVzMvSSlSsFXPljsOHWcM7ZU6RWkxGU/edit
Note also that while autoscaler state is now almost entirely decoupled from other GCS/scheduling state, it still relies on GcsNodeInfo in GcsNodeManager for dead-node information, which arrives at a different cadence than the one at which autoscaler information is polled.
This change also adds a repro and enables an existing one that had been disabled.
Why are these changes needed?
Lots of users have run into this and complained.
Related issue number
#36926
Closes #40254
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.