
[Docs][KubeRay] Update RayJob doc for backoffLimit and DELETE_RAYJOB_CR_AFTER_JOB_FINISHES #47445

Merged

Conversation

MortalHappiness
Member

Why are these changes needed?

Update the RayJob doc to describe the backoffLimit and DELETE_RAYJOB_CR_AFTER_JOB_FINISHES behaviors introduced in the following PRs:

Related issue number

N/A
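
For reviewers unfamiliar with the feature, here is a minimal, hypothetical RayJob sketch showing where `backoffLimit` would sit in the spec. All names and values below are placeholders, and the retry semantics in the comment are an assumption about the documented behavior rather than something stated in this PR:

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample                  # placeholder name
spec:
  entrypoint: python /home/ray/samples/sample_code.py   # placeholder entrypoint
  backoffLimit: 2                      # assumed meaning: retry the Ray job up to 2 more times before marking the RayJob Failed
  # rayClusterSpec: ...                # RayCluster definition omitted for brevity
```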

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@MortalHappiness
Member Author

@kevin85421 @jjyao @andrewsykim PTAL

MortalHappiness changed the title [Docs][KubeRay] Update RayJob doc for backoffLimit and DELETE_RAYJOB_CR_AFTER_JOB_FINISHES Sep 1, 2024

kevin85421 self-assigned this Sep 2, 2024
kevin85421 added the kuberay (Issues for the Ray/Kuberay integration that are tracked on the Ray side) label Sep 3, 2024
anyscalesam added the P1 (Issue that should be fixed within a few weeks) and core (Issues that should be addressed in Ray Core) labels Sep 3, 2024
kevin85421 (Member) left a comment

Can you add some comments indicating which version started supporting these APIs?

* Automatic resource cleanup
* `shutdownAfterJobFinishes` (Optional): Determines whether to recycle the RayCluster after the Ray job finishes. The default value is false.
* `ttlSecondsAfterFinished` (Optional): Only works if `shutdownAfterJobFinishes` is true. The KubeRay operator deletes the RayCluster and the submitter `ttlSecondsAfterFinished` seconds after the Ray job finishes. The default value is 0.
* `activeDeadlineSeconds` (Optional): If the RayJob doesn't transition the `JobDeploymentStatus` to `Complete` or `Failed` within `activeDeadlineSeconds`, the KubeRay operator transitions the `JobDeploymentStatus` to `Failed`, citing `DeadlineExceeded` as the reason.
* `DELETE_RAYJOB_CR_AFTER_JOB_FINISHES` (Optional): If this environment variable is set to true, the RayJob custom resource itself will be deleted if `shutdownAfterJobFinishes` is also set to true. Note that all resources created by the RayJob will be deleted, including the K8s Job.
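
For readers skimming this thread, a minimal sketch of how these cleanup settings might appear together in a RayJob manifest (all names and values below are illustrative, not taken from the PR):

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample                  # placeholder name
spec:
  entrypoint: python /home/ray/samples/sample_code.py   # placeholder entrypoint
  shutdownAfterJobFinishes: true       # recycle the RayCluster once the Ray job finishes
  ttlSecondsAfterFinished: 60          # delete the RayCluster and submitter 60 seconds after the job finishes
  activeDeadlineSeconds: 600           # mark the RayJob Failed (DeadlineExceeded) if it isn't Complete or Failed within 600 seconds
  # rayClusterSpec: ...                # RayCluster definition omitted for brevity
```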
Member


I think this environment variable should be set in the KubeRay operator, not the CR. Could we clarify that?

Member Author


Done.

MortalHappiness force-pushed the docs/update-rayjob-backoff-delete branch 2 times, most recently from 69e9af2 to d215030 on September 4, 2024 at 16:46
* Automatic resource cleanup
* `shutdownAfterJobFinishes` (Optional): Determines whether to recycle the RayCluster after the Ray job finishes. The default value is false.
* `ttlSecondsAfterFinished` (Optional): Only works if `shutdownAfterJobFinishes` is true. The KubeRay operator deletes the RayCluster and the submitter `ttlSecondsAfterFinished` seconds after the Ray job finishes. The default value is 0.
* `activeDeadlineSeconds` (Optional): If the RayJob doesn't transition the `JobDeploymentStatus` to `Complete` or `Failed` within `activeDeadlineSeconds`, the KubeRay operator transitions the `JobDeploymentStatus` to `Failed`, citing `DeadlineExceeded` as the reason.
* `DELETE_RAYJOB_CR_AFTER_JOB_FINISHES` (Optional, added in version 1.2.0): This environment variable should be set for the KubeRay operator, not the RayJob resource. If this environment variable is set to true, the RayJob custom resource itself will be deleted if `shutdownAfterJobFinishes` is also set to true. Note that all resources created by the RayJob will be deleted, including the K8s Job.
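
To make the operator-side setting concrete, here is a hedged fragment of the KubeRay operator Deployment showing where the environment variable would go. The deployment and container names are assumed defaults from a typical install and may differ in yours; most Deployment fields are omitted:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kuberay-operator               # assumed default name; adjust to your installation
spec:
  template:
    spec:
      containers:
        - name: kuberay-operator       # assumed container name
          env:
            - name: DELETE_RAYJOB_CR_AFTER_JOB_FINISHES
              value: "true"            # with shutdownAfterJobFinishes: true, the RayJob CR itself is deleted
  # selector, image, and the remaining Deployment fields are omitted for brevity
```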
Contributor


I still find it confusing to add DELETE_RAYJOB_CR_AFTER_JOB_FINISHES here, because the other entries are part of the RayJob API, whereas this is an environment variable in KubeRay.

Maybe it's time for a dedicated section in the KubeRay installation docs about environment variables?

Member Author


Agreed. But I can't find a good place to put the KubeRay-related configuration. @kevin85421 @andrewsykim, where do you recommend putting it?

Contributor


Maybe not a blocker for this PR, but we should consider a "Configuring KubeRay" page. There are other environment variables that aren't documented yet.

Member


Opened issue ray-project/kuberay#2356 to track the progress.

angelinalg (Contributor) left a comment


Just some style nits. Please consider using Vale to find these issues in the future. Please excuse any inaccuracies I introduced in my suggestions and correct as needed. Happy to answer any questions you have about the suggestions. Thanks for your contribution!

* Automatic resource cleanup
* `shutdownAfterJobFinishes` (Optional): Determines whether to recycle the RayCluster after the Ray job finishes. The default value is false.
* `ttlSecondsAfterFinished` (Optional): Only works if `shutdownAfterJobFinishes` is true. The KubeRay operator deletes the RayCluster and the submitter `ttlSecondsAfterFinished` seconds after the Ray job finishes. The default value is 0.
* `activeDeadlineSeconds` (Optional): If the RayJob doesn't transition the `JobDeploymentStatus` to `Complete` or `Failed` within `activeDeadlineSeconds`, the KubeRay operator transitions the `JobDeploymentStatus` to `Failed`, citing `DeadlineExceeded` as the reason.
* `DELETE_RAYJOB_CR_AFTER_JOB_FINISHES` (Optional, added in version 1.2.0): This environment variable should be set for the KubeRay operator, not the RayJob resource. If this environment variable is set to true, the RayJob custom resource itself will be deleted if `shutdownAfterJobFinishes` is also set to true. Note that all resources created by the RayJob will be deleted, including the K8s Job.
Contributor


Suggested change
* `DELETE_RAYJOB_CR_AFTER_JOB_FINISHES` (Optional, added in version 1.2.0): This environment variable should be set for the KubeRay operator, not the RayJob resource. If this environment variable is set to true, the RayJob custom resource itself will be deleted if `shutdownAfterJobFinishes` is also set to true. Note that all resources created by the RayJob will be deleted, including the K8s Job.
* `DELETE_RAYJOB_CR_AFTER_JOB_FINISHES` (Optional, added in version 1.2.0): Set this environment variable for the KubeRay operator, not the RayJob resource. If you set this environment variable to true, the RayJob custom resource itself is deleted if you also set `shutdownAfterJobFinishes` to true. Note that KubeRay deletes all resources created by the RayJob, including the Kubernetes Job.

MortalHappiness (Member, Author) commented on Sep 10, 2024


Oh, I thought I had no style errors, because here it says that the editorial style is enforced in CI by Vale, and the CI passed without errors. I've pushed updates. Is the CI broken, such that it didn't catch these errors?

MortalHappiness and others added 3 commits September 10, 2024 20:35
…CR_AFTER_JOB_FINISHES

Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
…rt.md

Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
kevin85421 added the go (add ONLY when ready to merge, run all tests) label Sep 17, 2024
jjyao merged commit 265dcc6 into ray-project:master Sep 18, 2024
6 checks passed
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
…CR_AFTER_JOB_FINISHES (ray-project#47445)

Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>