Fix losing job on control plane restart #594

Andyz26 · 2023-11-29T01:20:22Z

Context

Currently, any failure while loading a job from the persistence layer causes the job to be dropped without proper cleanup.
This change updates the logic as follows:

If two duplicate workers are found on the same stage index, the one with the newer worker number is loaded as the active worker.
The duplicate with the older worker number is then terminated by the job actor when it sends a heartbeat whose worker number does not match the active worker number.
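The selection rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `DuplicateWorkerResolver`, `WorkerRecord`, `selectActive`, and `isStale` are hypothetical names, and the string slot key is an assumption made only to keep the example short.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these types are the actual Mantis classes.
final class DuplicateWorkerResolver {

    // Simplified stand-in for persisted worker metadata.
    static final class WorkerRecord {
        final int stageNum;
        final int workerIndex;
        final int workerNumber;

        WorkerRecord(int stageNum, int workerIndex, int workerNumber) {
            this.stageNum = stageNum;
            this.workerIndex = workerIndex;
            this.workerNumber = workerNumber;
        }
    }

    // Assumed key format identifying one worker slot within a stage.
    private static String slot(WorkerRecord w) {
        return w.stageNum + ":" + w.workerIndex;
    }

    // For every (stage, index) slot, keep only the record with the highest worker number.
    static Map<String, WorkerRecord> selectActive(List<WorkerRecord> loaded) {
        Map<String, WorkerRecord> active = new HashMap<>();
        for (WorkerRecord w : loaded) {
            active.merge(slot(w), w, (current, candidate) ->
                    candidate.workerNumber > current.workerNumber ? candidate : current);
        }
        return active;
    }

    // Mirrors the heartbeat check: a sender whose worker number differs from the
    // active one for its slot is treated as stale and would be terminated.
    static boolean isStale(Map<String, WorkerRecord> active, WorkerRecord heartbeatFrom) {
        WorkerRecord current = active.get(slot(heartbeatFrom));
        return current == null || current.workerNumber != heartbeatFrom.workerNumber;
    }
}
```

With two records on stage 1, index 0 (worker numbers 3 and 7), `selectActive` keeps number 7, and a later heartbeat from number 3 is reported as stale by `isStale`.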

Checklist

./gradlew build compiles code correctly
Added new tests where applicable
./gradlew test passes all tests
Extended README or added javadocs where applicable

github-actions · 2023-11-29T01:26:50Z

Test Results

131 files +1 131 suites +1 7m 31s ⏱️ -47s
548 tests +1 539 ✔️ +1 8 💤 ±0 1 ❌ ±0
549 runs +1 540 ✔️ +1 8 💤 ±0 1 ❌ ±0

For more details on these failures, see this check.

Results for commit c9c4fc7. ± Comparison against base commit fe6604d.

♻️ This comment has been updated with latest results.

sundargates · 2023-11-29T19:11:14Z

...er/src/main/java/io/mantisrx/server/master/persistence/KeyValueBasedPersistenceProvider.java

+                    // If there are duplicate workers on the same stage index, only attach the one with latest
+                    // worker number. The stale workers will not present in the stage metadata thus gets terminated
+                    // when it sends heartbeats to its JobActor.


It looks like this logic is being decided at the MantisJobWorkerMetadataWritable level. Could we instead do this here? We already have two different building blocks today - one to add and another to replace. Let's use them instead.

It's also not clear to me what the boolean represents in this new method call.

"one to add and another to replace": I think these are just different names used for different components. I can only find one path to add or replace workers in the code.

I can rename "addWorkerMedata" to "tryAddOrReplaceWorker".

sundargates · 2023-11-29T19:11:43Z

...er/src/main/java/io/mantisrx/server/master/persistence/KeyValueBasedPersistenceProvider.java

                }
                jobMetas.add(DataFormatAdapter.convertMantisJobWriteableToMantisJobMetadata(jobMeta, eventPublisher));
            } catch (Exception e) {
-                logger.warn("Exception loading job {}", jobId, e);
+                logger.error("Exception loading job {}", jobId, e);


Should this be critical instead?

There is no critical level on the logger?

sundargates · 2023-11-29T19:22:59Z

Can you add some tests to test this behavior?

Andyz26 · 2023-11-29T19:27:31Z

Can you add some tests to test this behavior?

Unit test added in the second commit.

sundargates · 2023-11-29T23:56:45Z

...ol-plane-server/src/main/java/io/mantisrx/server/master/store/MantisJobMetadataWritable.java

+     * @param workerMetadata new worker metadata instance.
+     * @return true if the given worker metadata instance is added to this job.
+     */
+    public boolean tryAddOrReplaceWorker(int stageNum, MantisWorkerMetadata workerMetadata) {


Can you return the OldWorkerMetadata instead? I'm unsure of the boolean return value's meaning.
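One possible reading of this suggestion is sketched below: the method returns the worker that ends up not being attached, so the caller can see exactly which instance is stale instead of interpreting a boolean. This is only an illustration under assumptions. `StageWorkers`, `Worker`, `addOrReplace`, and `activeAt` are hypothetical names, and the sketch is not the signature that was merged.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; not the MantisJobMetadataWritable API.
final class StageWorkers {

    // Simplified stand-in for worker metadata within a single stage.
    static final class Worker {
        final int workerIndex;
        final int workerNumber;

        Worker(int workerIndex, int workerNumber) {
            this.workerIndex = workerIndex;
            this.workerNumber = workerNumber;
        }
    }

    private final Map<Integer, Worker> byIndex = new HashMap<>();

    // Attaches the candidate if its slot is free or if it is newer than the current
    // occupant. Returns the worker left unattached (the stale one), or empty if the
    // slot was free. Ties keep the existing entry.
    Optional<Worker> addOrReplace(Worker candidate) {
        Worker current = byIndex.get(candidate.workerIndex);
        if (current == null) {
            byIndex.put(candidate.workerIndex, candidate);
            return Optional.empty();
        }
        if (candidate.workerNumber > current.workerNumber) {
            byIndex.put(candidate.workerIndex, candidate);
            return Optional.of(current);
        }
        return Optional.of(candidate);
    }

    // Returns the worker currently attached at the given index, or null if none.
    Worker activeAt(int workerIndex) {
        return byIndex.get(workerIndex);
    }
}
```

Returning the displaced instance makes the call site self-describing: an empty result means a plain add, and a present result identifies the worker that should be cleaned up or logged.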

patch lost job on restart

5fd2419

Andyz26 requested review from nickmahilani, jeffchao, piygoyal, calvin681, sundargates, hmitnflx, markcho and fdc-ntflx as code owners November 29, 2023 01:20

Andyz26 had a problem deploying to Integrate Pull Request November 29, 2023 01:20 — with GitHub Actions Failure

ut

0bd0287

Andyz26 had a problem deploying to Integrate Pull Request November 29, 2023 18:52 — with GitHub Actions Failure

sundargates reviewed Nov 29, 2023

rename

092004c

Andyz26 had a problem deploying to Integrate Pull Request November 29, 2023 21:28 — with GitHub Actions Failure

Merge branch 'master' into andyz/patchJobLostOnStart

89e5fd0

Andyz26 had a problem deploying to Integrate Pull Request November 29, 2023 22:24 — with GitHub Actions Failure

sundargates reviewed Nov 29, 2023

comment

c9c4fc7

Andyz26 had a problem deploying to Integrate Pull Request November 30, 2023 00:17 — with GitHub Actions Failure

Andyz26 merged commit 0b7e629 into master Nov 30, 2023

Andyz26 deleted the andyz/patchJobLostOnStart branch November 30, 2023 01:21
