Fix Incorrect Cluster Size Update in ExperimentalControlLoop When Scaling Down is Disabled #728
Conversation
Uploaded Artifacts: To use these artifacts in your Gradle project, paste the following lines in your build.gradle.
Nice catch!
logger.info("scaleDownStage decrementing number of workers from {} to {}", numCurrentWorkers, desiredWorkers);
cancelOutstandingScalingRequest();
final StageScalingPolicy scalingPolicy = stageSchedulingInfo.getScalingPolicy();
if (scalingPolicy != null && scalingPolicy.isAllowAutoScaleManager() && !jobAutoscalerManager.isScaleDownEnabled()) {
    logger.warn("Scaledown is disabled for all autoscaling strategy. For stage {} of job {}", stage, jobId);
-   return;
+   return false;
}
final Subscription subscription = masterClientApi.scaleJobStage(jobId, stage, desiredWorkers, reason)
why is this an observable? it feels like there's no need for a stream here?
Good point. I didn't check the full logic, so I don't really know.
@@ -38,7 +38,9 @@ protected Double processStep(Tuple3<String, Double, Integer> tup) {
String reason = tup._1;
if (desiredNumWorkers < tup._3) {
-   scaler.scaleDownStage(tup._3, desiredNumWorkers, reason);
+   if (!scaler.scaleDownStage(tup._3, desiredNumWorkers, reason)) {
+       return tup._3 * 1.0;
can you add a comment on what this is doing?
Force-pushed from 1f3181e to 85a8242
Context
Background
While monitoring the autoscaling behavior of our clusters, I observed an unexpected scale-down event in the shared MRE source at ~3 AM PST, despite scaling down being disabled (a new feature introduced for the next live event). This anomaly was not present in the PlayAPI source, which behaved correctly under the same conditions.
Issue
Upon investigation, I discovered that the ExperimentalControlLoop class, used by the Clutch RPS scaling policy, was incorrectly updating the currentSize to the desiredSize even when the actuator did not perform a scale-down because scaling down was disabled. This led to a discrepancy between currentSize and the actual cluster size, causing future scaling decisions to be based on incorrect data.
Impact
This behavior resulted in a critical issue where, upon scaling up, the system could mistakenly reduce the cluster size. For example, with scaling down disabled:
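(The numbers below are illustrative, not taken from the incident.) Suppose a stage has 10 workers and the policy computes a desired size of 6. With scale-down disabled the actuator does nothing, yet the control loop still records currentSize = 6. When the policy later scales up to, say, 8 workers, the request is issued against that stale value and the cluster is actually reduced from 10 to 8.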
Solution
The fix ensures that the currentSize is not updated when the actuator is unable to perform a scale-down due to scaling being disabled. This maintains the integrity of the cluster size data, preventing erroneous scaling operations.
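As a minimal sketch of the pattern the fix applies (the class and method signatures below are simplified stand-ins, not the actual Mantis code; only the control flow mirrors the change in this PR):

```java
// Minimal sketch of the fix: currentSize is adopted only when the actuator
// actually performed (or was allowed to perform) the scale-down.
public class ControlLoopSketch {

    /** Hypothetical actuator interface; scaleDownStage reports whether it scaled. */
    interface Scaler {
        boolean scaleDownStage(int currentWorkers, int desiredWorkers, String reason);
        void scaleUpStage(int currentWorkers, int desiredWorkers, String reason);
    }

    private final Scaler scaler;
    private double currentSize;

    public ControlLoopSketch(Scaler scaler, double initialSize) {
        this.scaler = scaler;
        this.currentSize = initialSize;
    }

    /** One control-loop step: apply the policy's desired size. */
    public double processStep(int desiredWorkers, String reason) {
        if (desiredWorkers < currentSize) {
            // Before the fix, currentSize was set to desiredWorkers unconditionally.
            if (!scaler.scaleDownStage((int) currentSize, desiredWorkers, reason)) {
                // Scale-down is disabled: keep the real size so later scale-ups
                // are not computed from a value smaller than the actual cluster.
                return currentSize;
            }
        } else if (desiredWorkers > currentSize) {
            scaler.scaleUpStage((int) currentSize, desiredWorkers, reason);
        }
        currentSize = desiredWorkers;
        return currentSize;
    }
}
```

The essential point is that the value fed back into the next iteration stays at the real worker count whenever the actuator declines to scale down.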
Testing