
Fix Incorrect Cluster Size Update in ExperimentalControlLoop When Scaling Down is Disabled #728

Merged: 3 commits into master from fix-worker-count-on-scale-disable on Dec 6, 2024

Conversation

fdc-ntflx (Collaborator) commented:

Context

Background

While monitoring the autoscaling behavior of our clusters, I observed an unexpected scale-down event in the shared MRE source at around 3 AM PST, despite scale-down being disabled (a new feature introduced for the next live event). This anomaly was not present in the PlayAPI source, which behaved correctly under the same conditions.

Issue

Upon investigation, I discovered that the ExperimentalControlLoop class, used by the Clutch RPS scaling policy, was incorrectly updating the currentSize to the desiredSize even when the actuator did not perform a scale-down due to scaling being disabled. This led to a discrepancy between the currentSize and the actual cluster size, causing future scaling decisions to be based on incorrect data.
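
In effect, the pre-fix bookkeeping behaved like the sketch below (hypothetical names, not the actual Mantis code): the loop recorded the desired size as its new current size even when the actuator declined to act.

public class StaleSizeSketch {
    public static void main(String[] args) {
        int actualWorkers = 100;  // what is really running in the cluster
        int currentSize = 100;    // what the control loop believes
        int desiredSize = 50;     // the scaler wants to shrink, but scale-down is disabled

        // The actuator is a no-op because scale-down is disabled,
        // yet the loop still adopts desiredSize as its new currentSize.
        currentSize = desiredSize;

        System.out.println("believed=" + currentSize + ", actual=" + actualWorkers);
        // believed=50, actual=100: later "scale up" decisions use the wrong baseline.
    }
}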

Impact

This behavior resulted in a critical issue where, upon scaling up, the system could mistakenly reduce the cluster size. For example, with scaling down disabled:

  • The currentSize was reduced from 100 to 50, while the actual size remained 100.
  • When traffic increased, the scaler decided to "scale up" to 60, which effectively reduced the cluster from 100 to 60 nodes—a scale-down operation.

Solution

The fix ensures that the currentSize is not updated when the actuator is unable to perform a scale-down due to scaling being disabled. This maintains the integrity of the cluster size data, preventing erroneous scaling operations.
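
A minimal sketch of the intended behavior, using hypothetical names rather than the real Mantis classes (the actual change is in the diffs discussed below): the actuator reports whether it really scaled down, and the control loop only adopts the desired size when it did.

class ScaleDownGuardSketch {
    private final boolean scaleDownEnabled;

    ScaleDownGuardSketch(boolean scaleDownEnabled) {
        this.scaleDownEnabled = scaleDownEnabled;
    }

    // Actuator: returns true only if a scale-down was actually performed.
    boolean scaleDownStage(int currentWorkers, int desiredWorkers, String reason) {
        if (!scaleDownEnabled) {
            return false; // scale-down disabled: the cluster is left untouched
        }
        // ... issue the real scale-down request here ...
        return true;
    }

    // Control loop step: keep the old size unless the actuator actually acted.
    int nextCurrentSize(int currentSize, int desiredSize, String reason) {
        if (desiredSize < currentSize && !scaleDownStage(currentSize, desiredSize, reason)) {
            return currentSize; // stay in sync with the real cluster size
        }
        return desiredSize;
    }
}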

Testing

  • Added unit tests to simulate scenarios where scaling down is disabled, verifying that the currentSize remains unchanged (a sketch of such a test follows this list).
  • Confirmed that the system behaves correctly when scaling is enabled, with appropriate updates to the currentSize.
  • TODO: apply this change to SharedMRE and test again in production
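
For illustration, a test along these lines could look like the sketch below, written against the hypothetical ScaleDownGuardSketch from the Solution section and JUnit 5; the real tests target the actual Mantis classes.

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

class ScaleDownGuardSketchTest {

    @Test
    void keepsCurrentSizeWhenScaleDownIsDisabled() {
        ScaleDownGuardSketch loop = new ScaleDownGuardSketch(false); // scale-down disabled
        assertEquals(100, loop.nextCurrentSize(100, 50, "low RPS"));
    }

    @Test
    void followsDesiredSizeWhenScaleDownIsEnabled() {
        ScaleDownGuardSketch loop = new ScaleDownGuardSketch(true);
        assertEquals(50, loop.nextCurrentSize(100, 50, "low RPS"));
    }
}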

github-actions bot commented Dec 5, 2024

Uploaded Artifacts

To use these artifacts in your Gradle project, paste the following lines in your build.gradle.

resolutionStrategy {
    force "io.mantisrx:mantis-discovery-proto:0.1.0-20241206.144640-558"
    force "io.mantisrx:mantis-common-serde:0.1.0-20241206.144640-558"
    force "io.mantisrx:mantis-common:0.1.0-20241206.144640-558"
    force "io.mantisrx:mantis-client:0.1.0-20241206.144640-559"
    force "io.mantisrx:mantis-remote-observable:0.1.0-20241206.144640-559"
    force "io.mantisrx:mantis-runtime:0.1.0-20241206.144640-559"
    force "io.mantisrx:mantis-rxcontrol:0.1.0-20241206.144640-32"
    force "io.mantisrx:mantis-network:0.1.0-20241206.144640-558"
    force "io.mantisrx:mantis-shaded:0.1.0-20241206.144640-557"
    force "io.mantisrx:mantis-testcontainers:0.1.0-20241206.144640-228"
    force "io.mantisrx:mantis-runtime-executor:0.1.0-20241206.144640-94"
    force "io.mantisrx:mantis-connector-job-source:0.1.0-20241206.144640-10"
    force "io.mantisrx:mantis-connector-kafka:0.1.0-20241206.144640-559"
    force "io.mantisrx:mantis-runtime-loader:0.1.0-20241206.144640-559"
    force "io.mantisrx:mantis-connector-iceberg:0.1.0-20241206.144640-557"
    force "io.mantisrx:mantis-control-plane-core:0.1.0-20241206.144640-552"
    force "io.mantisrx:mantis-control-plane-dynamodb:0.1.0-20241206.144640-19"
    force "io.mantisrx:mantis-control-plane-server:0.1.0-20241206.144640-552"
    force "io.mantisrx:mantis-examples-core:0.1.0-20241206.144640-552"
    force "io.mantisrx:mantis-connector-publish:0.1.0-20241206.144640-558"
    force "io.mantisrx:mantis-control-plane-client:0.1.0-20241206.144640-558"
    force "io.mantisrx:mantis-examples-groupby-sample:0.1.0-20241206.144640-552"
    force "io.mantisrx:mantis-examples-jobconnector-sample:0.1.0-20241206.144640-552"
    force "io.mantisrx:mantis-examples-mantis-publish-sample:0.1.0-20241206.144640-552"
    force "io.mantisrx:mantis-examples-sine-function:0.1.0-20241206.144640-552"
    force "io.mantisrx:mantis-examples-twitter-sample:0.1.0-20241206.144640-552"
    force "io.mantisrx:mantis-examples-synthetic-sourcejob:0.1.0-20241206.144640-552"
    force "io.mantisrx:mantis-examples-wordcount:0.1.0-20241206.144640-552"
    force "io.mantisrx:mantis-publish-core:0.1.0-20241206.144640-552"
    force "io.mantisrx:mantis-publish-netty:0.1.0-20241206.144640-551"
    force "io.mantisrx:mantis-server-worker-client:0.1.0-20241206.144640-553"
    force "io.mantisrx:mantis-server-agent:0.1.0-20241206.144640-552"
    force "io.mantisrx:mantis-source-job-publish:0.1.0-20241206.144640-553"
    force "io.mantisrx:mantis-publish-netty-guice:0.1.0-20241206.144640-553"
    force "io.mantisrx:mantis-source-job-kafka:0.1.0-20241206.144640-553"
}


github-actions bot commented Dec 5, 2024

Test Results

617 tests  +2   607 ✅ +2   7m 35s ⏱️ -12s
142 suites ±0    10 💤 ±0 
142 files   ±0     0 ❌ ±0 

Results for commit 85a8242. ± Comparison against base commit 0419367.

♻️ This comment has been updated with latest results.

@Andyz26 (Collaborator) left a comment:

Nice catch!

      logger.info("scaleDownStage decrementing number of workers from {} to {}", numCurrentWorkers, desiredWorkers);
      cancelOutstandingScalingRequest();
      final StageScalingPolicy scalingPolicy = stageSchedulingInfo.getScalingPolicy();
      if (scalingPolicy != null && scalingPolicy.isAllowAutoScaleManager() && !jobAutoscalerManager.isScaleDownEnabled()) {
          logger.warn("Scaledown is disabled for all autoscaling strategy. For stage {} of job {}", stage, jobId);
-         return;
+         return false;
      }
      final Subscription subscription = masterClientApi.scaleJobStage(jobId, stage, desiredWorkers, reason)
Contributor commented:
why is this an observable? it feels like there's no need for a stream here?

fdc-ntflx (Collaborator, Author) replied:
Good point. I didn't check the full logic, so I don't really know.

@@ -38,7 +38,9 @@ protected Double processStep(Tuple3<String, Double, Integer> tup) {

      String reason = tup._1;
      if (desiredNumWorkers < tup._3) {
-         scaler.scaleDownStage(tup._3, desiredNumWorkers, reason);
+         if (!scaler.scaleDownStage(tup._3, desiredNumWorkers, reason)) {
+             return tup._3 * 1.0;
Contributor commented:
can you add a comment on what this is doing?

@fdc-ntflx temporarily deployed to Integrate Pull Request with GitHub Actions on December 6, 2024 14:45
@fdc-ntflx force-pushed the fix-worker-count-on-scale-disable branch from 1f3181e to 85a8242 on December 6, 2024 14:46
@fdc-ntflx temporarily deployed to Integrate Pull Request with GitHub Actions on December 6, 2024 14:46
@fdc-ntflx merged commit f1db4c4 into master on Dec 6, 2024
7 checks passed
@fdc-ntflx deleted the fix-worker-count-on-scale-disable branch on December 6, 2024 14:54