Ensure worker failed before first heartbeat gets resubmission #731

Merged
merged 1 commit into from
Dec 13, 2024

@Andyz26 Andyz26 commented Dec 13, 2024

Context

Problem:
The missing-heartbeat logic relies on the last heartbeat (HB) timestamp to determine whether a worker should be resubmitted by the job actor or by the scheduler. This breaks down for workers that fail right after being scheduled but before they are able to emit their first heartbeat message. The problem is amplified when the rate of transient worker startup failures rises due to other issues.

Solution:
Instead of the last heartbeat, use the launched event status to decide whether the job actor should take over resubmission from the scheduler.
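
The change in ownership can be sketched roughly as below. This is an illustrative sketch only, not the code from this PR: the class `ResubmitOwnerSketch`, the `Owner` enum, and both method names are hypothetical, and it assumes the old check treated a missing heartbeat timestamp as "still owned by the scheduler".

```java
import java.time.Instant;
import java.util.Optional;

/** Hypothetical sketch; none of these names come from the actual codebase. */
final class ResubmitOwnerSketch {

    /** Which component is responsible for resubmitting a failed worker. */
    enum Owner { SCHEDULER, JOB_ACTOR }

    /**
     * Assumed old behavior: ownership is derived from the last heartbeat.
     * A worker that never sent a heartbeat stays with the scheduler, even
     * if it was already launched and then died during startup.
     */
    static Owner ownerByLastHeartbeat(Optional<Instant> lastHeartbeat) {
        return lastHeartbeat.isPresent() ? Owner.JOB_ACTOR : Owner.SCHEDULER;
    }

    /**
     * Proposed behavior: ownership is derived from the launched event.
     * Once the worker has been launched, the job actor takes over,
     * whether or not a heartbeat has arrived yet.
     */
    static Owner ownerByLaunchedEvent(boolean launchedEventSeen) {
        return launchedEventSeen ? Owner.JOB_ACTOR : Owner.SCHEDULER;
    }
}
```

For a worker that was launched but failed before its first heartbeat, `ownerByLastHeartbeat(Optional.empty())` yields `SCHEDULER`, while `ownerByLaunchedEvent(true)` yields `JOB_ACTOR`, so under these assumptions the job actor is the one that picks up the resubmission.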

Checklist

  • ./gradlew build compiles code correctly
  • Added new tests where applicable
  • ./gradlew test passes all tests
  • Extended README or added javadocs where applicable

Test Results

618 tests (+1): 608 passed ✅ (+1), 10 skipped 💤 (±0), 0 failed ❌ (±0)
142 suites (±0), 142 files (±0)
Duration: 7m 47s ⏱️ (+8s)

Results for commit dcd1493. ± Comparison against base commit d01775b.

@fdc-ntflx fdc-ntflx left a comment

LGTM!

@sundargates sundargates left a comment

LGTM!

@Andyz26 Andyz26 merged commit 735435f into master Dec 13, 2024
4 of 5 checks passed
@Andyz26 Andyz26 deleted the andyz/fixStuckWorkerInStartUp branch December 13, 2024 19:07