
Fix Incorrect Cluster Size Update in ExperimentalControlLoop When Scaling Down is Disabled #728

Merged: 3 commits into master from fix-worker-count-on-scale-disable on Dec 6, 2024

Conversation

fdc-ntflx (Collaborator) commented:

Context

Background

While monitoring the autoscaling behavior of our clusters, I observed an unexpected scale-down event in the shared MRE source at around 3 AM PST, despite scale-down being disabled (a new feature introduced for the next live event). This anomaly was not present in the PlayAPI source, which behaved correctly under the same conditions.

Issue

Upon investigation, I discovered that the ExperimentalControlLoop class, used by the Clutch RPS scaling policy, was incorrectly updating the currentSize to the desiredSize even when the actuator did not perform a scale-down due to scaling being disabled. This led to a discrepancy between the currentSize and the actual cluster size, causing future scaling decisions to be based on incorrect data.
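
In effect, the pre-fix bookkeeping behaved like the sketch below (hypothetical names, not the actual Mantis code): the loop recorded the desired size as its new current size even when the actuator declined to act.

public class StaleSizeSketch {
    public static void main(String[] args) {
        int actualWorkers = 100;  // what is really running in the cluster
        int currentSize = 100;    // what the control loop believes
        int desiredSize = 50;     // the scaler wants to shrink, but scale-down is disabled

        // The actuator is a no-op because scale-down is disabled,
        // yet the loop still adopts desiredSize as its new currentSize.
        currentSize = desiredSize;

        System.out.println("believed=" + currentSize + ", actual=" + actualWorkers);
        // believed=50, actual=100: later "scale up" decisions use the wrong baseline.
    }
}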

Impact

This behavior resulted in a critical issue where, upon scaling up, the system could mistakenly reduce the cluster size. For example, with scaling down disabled:

  • The currentSize was reduced from 100 to 50, while the actual size remained 100.
  • When traffic increased, the scaler decided to "scale up" to 60, which effectively reduced the cluster from 100 to 60 nodes—a scale-down operation.

Solution

The fix ensures that the currentSize is not updated when the actuator is unable to perform a scale-down due to scaling being disabled. This maintains the integrity of the cluster size data, preventing erroneous scaling operations.
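
A minimal sketch of the intended behavior, using hypothetical names rather than the real Mantis classes (the actual change is in the diffs discussed below): the actuator reports whether it really scaled down, and the control loop only adopts the desired size when it did.

class ScaleDownGuardSketch {
    private final boolean scaleDownEnabled;

    ScaleDownGuardSketch(boolean scaleDownEnabled) {
        this.scaleDownEnabled = scaleDownEnabled;
    }

    // Actuator: returns true only if a scale-down was actually performed.
    boolean scaleDownStage(int currentWorkers, int desiredWorkers, String reason) {
        if (!scaleDownEnabled) {
            return false; // scale-down disabled: the cluster is left untouched
        }
        // ... issue the real scale-down request here ...
        return true;
    }

    // Control loop step: keep the old size unless the actuator actually acted.
    int nextCurrentSize(int currentSize, int desiredSize, String reason) {
        if (desiredSize < currentSize && !scaleDownStage(currentSize, desiredSize, reason)) {
            return currentSize; // stay in sync with the real cluster size
        }
        return desiredSize;
    }
}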

Testing

  • Added unit tests to simulate scenarios where scaling down is disabled, verifying that the currentSize remains unchanged (a sketch of such a test follows this list).
  • Confirmed that the system behaves correctly when scaling is enabled, with appropriate updates to the currentSize.
  • TODO: apply this change to SharedMRE and test again in production
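
For illustration, a test along these lines could look like the sketch below, written against the hypothetical ScaleDownGuardSketch from the Solution section and JUnit 5; the real tests target the actual Mantis classes.

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

class ScaleDownGuardSketchTest {

    @Test
    void keepsCurrentSizeWhenScaleDownIsDisabled() {
        ScaleDownGuardSketch loop = new ScaleDownGuardSketch(false); // scale-down disabled
        assertEquals(100, loop.nextCurrentSize(100, 50, "low RPS"));
    }

    @Test
    void followsDesiredSizeWhenScaleDownIsEnabled() {
        ScaleDownGuardSketch loop = new ScaleDownGuardSketch(true);
        assertEquals(50, loop.nextCurrentSize(100, 50, "low RPS"));
    }
}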

github-actions bot commented Dec 5, 2024

Uploaded Artifacts

To use these artifacts in your Gradle project, paste the following lines in your build.gradle.

resolutionStrategy {
    force "io.mantisrx:mantis-discovery-proto:0.1.0-20241206.144640-558"
    force "io.mantisrx:mantis-common-serde:0.1.0-20241206.144640-558"
    force "io.mantisrx:mantis-common:0.1.0-20241206.144640-558"
    force "io.mantisrx:mantis-client:0.1.0-20241206.144640-559"
    force "io.mantisrx:mantis-remote-observable:0.1.0-20241206.144640-559"
    force "io.mantisrx:mantis-runtime:0.1.0-20241206.144640-559"
    force "io.mantisrx:mantis-rxcontrol:0.1.0-20241206.144640-32"
    force "io.mantisrx:mantis-network:0.1.0-20241206.144640-558"
    force "io.mantisrx:mantis-shaded:0.1.0-20241206.144640-557"
    force "io.mantisrx:mantis-testcontainers:0.1.0-20241206.144640-228"
    force "io.mantisrx:mantis-runtime-executor:0.1.0-20241206.144640-94"
    force "io.mantisrx:mantis-connector-job-source:0.1.0-20241206.144640-10"
    force "io.mantisrx:mantis-connector-kafka:0.1.0-20241206.144640-559"
    force "io.mantisrx:mantis-runtime-loader:0.1.0-20241206.144640-559"
    force "io.mantisrx:mantis-connector-iceberg:0.1.0-20241206.144640-557"
    force "io.mantisrx:mantis-control-plane-core:0.1.0-20241206.144640-552"
    force "io.mantisrx:mantis-control-plane-dynamodb:0.1.0-20241206.144640-19"
    force "io.mantisrx:mantis-control-plane-server:0.1.0-20241206.144640-552"
    force "io.mantisrx:mantis-examples-core:0.1.0-20241206.144640-552"
    force "io.mantisrx:mantis-connector-publish:0.1.0-20241206.144640-558"
    force "io.mantisrx:mantis-control-plane-client:0.1.0-20241206.144640-558"
    force "io.mantisrx:mantis-examples-groupby-sample:0.1.0-20241206.144640-552"
    force "io.mantisrx:mantis-examples-jobconnector-sample:0.1.0-20241206.144640-552"
    force "io.mantisrx:mantis-examples-mantis-publish-sample:0.1.0-20241206.144640-552"
    force "io.mantisrx:mantis-examples-sine-function:0.1.0-20241206.144640-552"
    force "io.mantisrx:mantis-examples-twitter-sample:0.1.0-20241206.144640-552"
    force "io.mantisrx:mantis-examples-synthetic-sourcejob:0.1.0-20241206.144640-552"
    force "io.mantisrx:mantis-examples-wordcount:0.1.0-20241206.144640-552"
    force "io.mantisrx:mantis-publish-core:0.1.0-20241206.144640-552"
    force "io.mantisrx:mantis-publish-netty:0.1.0-20241206.144640-551"
    force "io.mantisrx:mantis-server-worker-client:0.1.0-20241206.144640-553"
    force "io.mantisrx:mantis-server-agent:0.1.0-20241206.144640-552"
    force "io.mantisrx:mantis-source-job-publish:0.1.0-20241206.144640-553"
    force "io.mantisrx:mantis-publish-netty-guice:0.1.0-20241206.144640-553"
    force "io.mantisrx:mantis-source-job-kafka:0.1.0-20241206.144640-553"
}


github-actions bot commented Dec 5, 2024

Test Results

617 tests  +2   607 ✅ +2   7m 35s ⏱️ -12s
142 suites ±0    10 💤 ±0 
142 files   ±0     0 ❌ ±0 

Results for commit 85a8242. ± Comparison against base commit 0419367.

♻️ This comment has been updated with latest results.

@Andyz26 (Collaborator) left a comment:

Nice catch!

      logger.info("scaleDownStage decrementing number of workers from {} to {}", numCurrentWorkers, desiredWorkers);
      cancelOutstandingScalingRequest();
      final StageScalingPolicy scalingPolicy = stageSchedulingInfo.getScalingPolicy();
      if (scalingPolicy != null && scalingPolicy.isAllowAutoScaleManager() && !jobAutoscalerManager.isScaleDownEnabled()) {
          logger.warn("Scaledown is disabled for all autoscaling strategy. For stage {} of job {}", stage, jobId);
-         return;
+         return false;
      }
      final Subscription subscription = masterClientApi.scaleJobStage(jobId, stage, desiredWorkers, reason)
Contributor commented:
why is this an observable? it feels like there's no need for a stream here?

fdc-ntflx (Collaborator, Author) replied:
Good point. I didn't check the full logic, so I don't really know.

@@ -38,7 +38,9 @@ protected Double processStep(Tuple3<String, Double, Integer> tup) {

      String reason = tup._1;
      if (desiredNumWorkers < tup._3) {
-         scaler.scaleDownStage(tup._3, desiredNumWorkers, reason);
+         if (!scaler.scaleDownStage(tup._3, desiredNumWorkers, reason)) {
+             return tup._3 * 1.0;
Contributor commented:
can you add a comment on what this is doing?

@fdc-ntflx temporarily deployed to Integrate Pull Request with GitHub Actions on December 6, 2024 14:45
@fdc-ntflx force-pushed the fix-worker-count-on-scale-disable branch from 1f3181e to 85a8242 on December 6, 2024 14:46
@fdc-ntflx temporarily deployed to Integrate Pull Request with GitHub Actions on December 6, 2024 14:46
@fdc-ntflx merged commit f1db4c4 into master on Dec 6, 2024
7 checks passed
@fdc-ntflx deleted the fix-worker-count-on-scale-disable branch on December 6, 2024 14:54