Enhance failure detection to also recognize failure status of top level resources #74

Closed
dgrove-oss opened this issue Apr 5, 2024 · 3 comments

@dgrove-oss
Collaborator

MCAD provided a mechanism for encoding conditions on the top-level resources that, if present, would indicate success or failure. This was used to augment the pod-level status information.

We need to design and implement a similar mechanism for the v1beta2 AppWrapper.

@dgrove-oss
Collaborator Author

We've discussed this further and primarily see value in recognizing when a top-level resource has failed. This would allow us to transition the AppWrapper to failed immediately (without relying on pod-level health).

There seems to be less value in a resource-level success status, because we would still have to wait for all pods to reach a Completed state before we could release resources (if we released eagerly, we could cause quota overages).
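
As a rough illustration of the asymmetry described above, here is a minimal Python sketch. It is not the AppWrapper controller's actual code: the `FAILURE_CONDITIONS` table, the function names, and the returned phase strings are all hypothetical, and resources are modeled as plain dicts shaped like the JSON a Kubernetes API server returns. The `Failed` condition type for PyTorchJob is assumed for the example. A failure condition on a top-level resource fails the AppWrapper right away, while success is still decided from the pods.

```python
from typing import Any, Mapping, Sequence

# Hypothetical table mapping a resource kind to the condition type that
# signals failure. The entry below is an assumption for illustration only.
FAILURE_CONDITIONS = {
    "PyTorchJob": "Failed",
}


def top_level_failed(resource: Mapping[str, Any]) -> bool:
    """Return True if the resource carries its kind's failure condition."""
    cond_type = FAILURE_CONDITIONS.get(resource.get("kind", ""))
    if cond_type is None:
        # No useful operator-level failure status is known for this kind.
        return False
    conditions = resource.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == cond_type and c.get("status") == "True"
        for c in conditions
    )


def appwrapper_phase(
    resources: Sequence[Mapping[str, Any]],
    pod_phases: Sequence[str],
) -> str:
    """Simplified phase decision (hypothetical phase names)."""
    # Fail fast: a failed top-level resource is enough, whatever the pods say.
    if any(top_level_failed(r) for r in resources):
        return "Failed"
    # Success is decided from the pods only, so resources are never
    # released while a pod is still running ("Succeeded" is the pod phase
    # that kubectl displays as Completed).
    if pod_phases and all(p == "Succeeded" for p in pod_phases):
        return "Succeeded"
    return "Running"
```

Note that a resource-level `Succeeded` condition is deliberately ignored in this sketch: with one pod still running, `appwrapper_phase` keeps returning `"Running"`, which mirrors the quota-overage concern.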

@dgrove-oss self-assigned this on Jun 19, 2024
@dgrove-oss changed the title from "Optionally compute success/failure from the status of the top level resource" to "Enhance failure detection to also recognize failure status of top level resources" on Jun 19, 2024
@dgrove-oss
Collaborator Author

#163 does this for PyTorchJobs

@dgrove-oss
Collaborator Author

Closing; we've now implemented this for all the supported GVKs that have useful operator-level failure statuses.
