Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix losing job on control plane restart #594

Merged
merged 5 commits into from
Nov 30, 2023
Merged

Conversation

Andyz26
Copy link
Collaborator

@Andyz26 Andyz26 commented Nov 29, 2023

Context

Currently, any failure during the loading job from the persistency layer step causes the job to be dropped without proper cleanup.
Here the logic is updated as follows:

  • If two duplicate workers on the same stage index are found, load the one with the newer worker number to use as active.
  • The dupe one with the older worker number then gets terminated in the job actor when it sends the heartbeat and doesn't match with the active worker number.

Checklist

  • ./gradlew build compiles code correctly
  • Added new tests where applicable
  • ./gradlew test passes all tests
  • Extended README or added javadocs where applicable

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
Copy link

github-actions bot commented Nov 29, 2023

Test Results

131 files  +1  131 suites  +1   7m 31s ⏱️ -47s
548 tests +1  539 ✔️ +1  8 💤 ±0  1 ±0 
549 runs  +1  540 ✔️ +1  8 💤 ±0  1 ±0 

For more details on these failures, see this check.

Results for commit c9c4fc7. ± Comparison against base commit fe6604d.

♻️ This comment has been updated with latest results.

@Andyz26 Andyz26 had a problem deploying to Integrate Pull Request November 29, 2023 18:52 — with GitHub Actions Failure
Comment on lines +402 to +404
// If there are duplicate workers on the same stage index, only attach the one with latest
// worker number. The stale workers will not present in the stage metadata thus gets terminated
// when it sends heartbeats to its JobActor.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this logic is being decided at the MantisJobWorkerMetadataWritable level. Could we instead do this here? We already have two different building blocks today - one to add and another to replace. Let's use them instead.

It's also not clear to me what the boolean represents in this new method call.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"one to add and another to replace,": i think it's just different names used for different components. I only find one path to add/replace workers in the code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can do "addWorkerMedata" -> "tryAddOrReplaceWorker"

}
jobMetas.add(DataFormatAdapter.convertMantisJobWriteableToMantisJobMetadata(jobMeta, eventPublisher));
} catch (Exception e) {
logger.warn("Exception loading job {}", jobId, e);
logger.error("Exception loading job {}", jobId, e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be critical instead?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no critical level at logger?

@sundargates
Copy link
Contributor

Can you add some tests to test this behavior?

@Andyz26
Copy link
Collaborator Author

Andyz26 commented Nov 29, 2023

Can you add some tests to test this behavior?

UT added in the 2nd commit

@Andyz26 Andyz26 had a problem deploying to Integrate Pull Request November 29, 2023 21:28 — with GitHub Actions Failure
@Andyz26 Andyz26 had a problem deploying to Integrate Pull Request November 29, 2023 22:24 — with GitHub Actions Failure
* @param workerMetadata new worker metadata instance.
* @return true if the given worker metadata instance is added to this job.
*/
public boolean tryAddOrReplaceWorker(int stageNum, MantisWorkerMetadata workerMetadata) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you return the OldWorkerMetadata instead? I'm unsure of the boolean return value's meaning.

@Andyz26 Andyz26 had a problem deploying to Integrate Pull Request November 30, 2023 00:17 — with GitHub Actions Failure
@Andyz26 Andyz26 merged commit 0b7e629 into master Nov 30, 2023
@Andyz26 Andyz26 deleted the andyz/patchJobLostOnStart branch November 30, 2023 01:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants