Resubmit Worker Allocations #725

Merged

Conversation

kmg-stripe
Collaborator

A recent change removed worker resubmits when workers are stuck in accepted: #719

Our underlying scheduler will not retry the allocations, so we need a way to conditionally enable the ability to resubmit.
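
As a rough illustration of what "conditionally enable" could look like, here is a minimal sketch. It is not the code in this PR: the class name `ResubmitPolicy`, the `resubmitEnabled` flag, and the `acceptedTimeout` threshold are all hypothetical names introduced for the example, and the use of a timeout to decide that a worker is "stuck" is an assumption.

```java
import java.time.Duration;
import java.time.Instant;

/**
 * Hypothetical sketch only. None of these names come from the Mantis code base.
 * Decides whether a worker that is still in the accepted state should be resubmitted.
 */
final class ResubmitPolicy {
    // Assumed config flag: operators whose scheduler already retries leave this off.
    private final boolean resubmitEnabled;
    // Assumed threshold: how long a worker may sit in accepted before it counts as stuck.
    private final Duration acceptedTimeout;

    ResubmitPolicy(boolean resubmitEnabled, Duration acceptedTimeout) {
        this.resubmitEnabled = resubmitEnabled;
        this.acceptedTimeout = acceptedTimeout;
    }

    /** Returns true only when resubmits are enabled and the worker has exceeded the timeout. */
    boolean shouldResubmit(Instant acceptedAt, Instant now) {
        if (!resubmitEnabled) {
            return false; // resubmits stay off unless explicitly enabled
        }
        Duration inAccepted = Duration.between(acceptedAt, now);
        return inAccepted.compareTo(acceptedTimeout) > 0;
    }
}
```

With the flag off, the policy never resubmits, which matches the behavior after #719. With the flag on, only workers that have been in accepted longer than the threshold are resubmitted.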

Context

I did not see unit tests for the functionality that was removed. I'd be happy to add them, but would like to get this merged first, since we had to pin master to a different version than the agents to avoid stuck workers.

Checklist

  • ./gradlew build compiles code correctly
  • Added new tests where applicable
  • ./gradlew test passes all tests
  • Extended README or added javadocs where applicable

@kmg-stripe
Collaborator Author

cc: @Andyz26

Test Results

615 tests  ±0   605 ✅ ±0   8m 4s ⏱️ ±0s
142 suites ±0    10 💤 ±0 
142 files   ±0     0 ❌ ±0 

Results for commit b59503c. ± Comparison against base commit a5874b2.

@Andyz26
Collaborator

Andyz26 commented Nov 13, 2024

@kmg-stripe this LGTM. Just curious: what was the error/root cause you saw that left the workers stuck in accepted? For us, that usually happens when the new job artifact contains bugs or fails to init, so a retry doesn't help in those cases (and sometimes it pollutes our agent pool, e.g. the error causes frequent agent crashes or fills the disk).

@kmg-stripe
Collaborator Author

kmg-stripe commented Nov 13, 2024

@Andyz26 thanks! Yup, this was it. On our end, this was triggered by the underlying ASG being slow to spin up new instances. It is an internal limitation that we hope to fix soon, but we need to tolerate it as "expected behavior" for now.

@kmg-stripe kmg-stripe merged commit 4da2e05 into Netflix:master Nov 13, 2024
4 of 5 checks passed
@kmg-stripe kmg-stripe had a problem deploying to Integrate Pull Request December 13, 2024 20:28 — with GitHub Actions Failure