-
Notifications
You must be signed in to change notification settings - Fork 203
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix losing job on control plane restart #594
Conversation
Test Results131 files +1 131 suites +1 7m 31s ⏱️ -47s For more details on these failures, see this check. Results for commit c9c4fc7. ± Comparison against base commit fe6604d. ♻️ This comment has been updated with latest results. |
// If there are duplicate workers on the same stage index, only attach the one with latest | ||
// worker number. The stale workers will not present in the stage metadata thus gets terminated | ||
// when it sends heartbeats to its JobActor. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this logic is being decided at the MantisJobWorkerMetadataWritable level. Could we instead do this here? We already have two different building blocks today - one to add and another to replace. Let's use them instead.
It's also not clear to me what the boolean represents in this new method call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"one to add and another to replace,": i think it's just different names used for different components. I only find one path to add/replace workers in the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can do "addWorkerMedata" -> "tryAddOrReplaceWorker"
} | ||
jobMetas.add(DataFormatAdapter.convertMantisJobWriteableToMantisJobMetadata(jobMeta, eventPublisher)); | ||
} catch (Exception e) { | ||
logger.warn("Exception loading job {}", jobId, e); | ||
logger.error("Exception loading job {}", jobId, e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be critical instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is no critical level at logger?
Can you add some tests to test this behavior? |
UT added in the 2nd commit |
* @param workerMetadata new worker metadata instance. | ||
* @return true if the given worker metadata instance is added to this job. | ||
*/ | ||
public boolean tryAddOrReplaceWorker(int stageNum, MantisWorkerMetadata workerMetadata) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you return the OldWorkerMetadata instead? I'm unsure of the boolean return value's meaning.
Context
Currently, any failure during the loading job from the persistency layer step causes the job to be dropped without proper cleanup.
Here the logic is updated as follows:
Checklist
./gradlew build
compiles code correctly./gradlew test
passes all tests