Make infeasible tasks error much more obvious #45909
Labels
core
Issues that should be addressed in Ray Core
enhancement
Request for new feature and/or capability
observability
Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling
P1
Issue that should be fixed within a few weeks
Comments
adding for consideration in next sprint
can we consider this for next sprint? This comes up in ray data as well.
@jjyao what would be the T-shirt size for this?
edoakes pushed a commit that referenced this issue on Feb 24, 2025
…e Manager (#50200)

This is part 2 of the PRs to resolve #45909. This PR:
- Added a `CancelTaskWithResourceShapes` to the `NodeManagerService`.
- Added the corresponding calls to the `RayletClient` as well.
- The `CancelTaskWithResourceShape` will leverage the functions in `ClusterTaskManager` to cancel the tasks that match certain resource shapes on the node.
- Added the corresponding tests as well.

There will be one more follow-up PR that uses the API to cancel tasks whose requirements match the infeasible task resource requirements obtained from the autoscaler.

Also, note that the implementation assumes:
1. The same resource shape should be either feasible or infeasible on a certain node
2. The [PlacementConstraint](url) doesn't impact whether a task is infeasible or not

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
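To make the idea in the commit message above concrete, here is a minimal pure-Python sketch of cancelling queued tasks by resource shape. It is not Ray's implementation (which lives in the raylet, not in Python): `PendingTask`, `to_shape`, and `cancel_tasks_with_resource_shapes` are hypothetical names, and treating "matches" as an exact match on the full resource dictionary is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Tuple

# A "shape" is the set of (resource name, amount) pairs a task asks for.
Shape = FrozenSet[Tuple[str, float]]


def to_shape(resources: Dict[str, float]) -> Shape:
    # Normalise so that key order and int/float differences do not matter;
    # zero-valued entries are dropped (an assumption of this sketch).
    return frozenset((name, float(amount)) for name, amount in resources.items() if amount)


@dataclass
class PendingTask:
    # Hypothetical stand-in for a task waiting in a node's scheduling queue.
    task_id: str
    resources: Dict[str, float]


def cancel_tasks_with_resource_shapes(
    queue: List[PendingTask], shapes: Iterable[Dict[str, float]]
) -> Tuple[List[PendingTask], List[PendingTask]]:
    # Split the queue into tasks to keep and tasks to cancel.
    # Assumption: a task matches only if its shape is exactly one of the targets.
    targets = {to_shape(shape) for shape in shapes}
    kept: List[PendingTask] = []
    cancelled: List[PendingTask] = []
    for task in queue:
        if to_shape(task.resources) in targets:
            cancelled.append(task)
        else:
            kept.append(task)
    return kept, cancelled


queue = [
    PendingTask("a", {"CPU": 1}),
    PendingTask("b", {"GPU": 4}),
    PendingTask("c", {"CPU": 1, "GPU": 4}),
]
kept, cancelled = cancel_tasks_with_resource_shapes(queue, [{"GPU": 4.0}])
```

Only task `b` is cancelled here, because `c` requests a different shape (`CPU` plus `GPU`). Matching on the whole shape lines up with the first assumption listed in the commit message: a given shape is either feasible or infeasible on a node, so every task with that shape can be handled the same way.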
not done yet
edoakes pushed a commit that referenced this issue on Mar 2, 2025
…ler State (#50886)

This is the last PR for the feature of early termination of infeasible tasks.

The previous PRs:
(1) added logic in GCS to obtain the per-node infeasible resource requests from the autoscaler state;
(2) added the new API in raylet_client, node_manager and cluster_task_manager to cancel tasks with certain resource request shapes.

This PR integrates the above PRs:
(1) added the logic to call the cancel resource shape API based on the per-node infeasible requests from the autoscaler
(2) put the feature behind a Ray config that is on by default
(3) small improvements on previous PRs (logging, comments, messages, an early exit when getting the per-node infeasible requests)
(4) added integration tests for both the normal task scheduling case and the actor creation case

With the change, infeasible tasks (both normal tasks and actor creation tasks) can be terminated early by default.

Closes #45909

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
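The integration step described in this commit message can be sketched roughly as follows. This is an illustration in plain Python under stated assumptions, not the actual GCS code: `cancel_infeasible_requests`, `cancel_on_node`, and the `enabled` flag are hypothetical names (the source does not give the name of the Ray config), and the per-node mapping stands in for whatever the autoscaler state reports.

```python
from typing import Callable, Dict, List

Shape = Dict[str, float]


def cancel_infeasible_requests(
    per_node_infeasible: Dict[str, List[Shape]],
    cancel_on_node: Callable[[str, List[Shape]], None],
    enabled: bool = True,
) -> int:
    # `enabled` stands in for the config flag mentioned in the commit message
    # (default on). The real flag name is not given in the source.
    if not enabled:
        return 0
    # Early exit when the autoscaler state reports nothing infeasible.
    if not per_node_infeasible:
        return 0
    calls = 0
    for node_id, shapes in per_node_infeasible.items():
        if shapes:
            # Hypothetical stand-in for the per-node cancel-by-shape call.
            cancel_on_node(node_id, shapes)
            calls += 1
    return calls


recorded = []
sent = cancel_infeasible_requests(
    {"node-1": [{"GPU": 8.0}], "node-2": []},
    lambda node_id, shapes: recorded.append((node_id, shapes)),
)
skipped = cancel_infeasible_requests(
    {"node-1": [{"GPU": 8.0}]},
    lambda node_id, shapes: recorded.append((node_id, shapes)),
    enabled=False,
)
```

Passing the cancel call in as a function keeps the sketch independent of any particular client, and makes it easy to see that nothing is sent when the flag is off or a node has no infeasible shapes.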
Many times I see users run into common error scenarios that end up hanging the entire workload.
num_cpus=0
Although we show a warning in the events, those can be hard to catch. We should turn infeasible requests into errors and have the task itself return an error.
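As a rough illustration of the requested behavior (fail fast instead of hanging), here is a minimal pure-Python sketch. It does not use or mirror Ray's scheduler: `InfeasibleTaskError`, `fits`, `check_feasible`, and the example cluster are hypothetical, and comparing a request against each node's total capacity is an assumption about what "infeasible" means.

```python
from typing import Dict, List

Resources = Dict[str, float]


class InfeasibleTaskError(RuntimeError):
    """Hypothetical error: no node in the cluster can ever satisfy the request."""


def fits(request: Resources, node_total: Resources) -> bool:
    # A request fits a node if every requested amount is within the node's total.
    return all(node_total.get(name, 0.0) >= amount for name, amount in request.items())


def check_feasible(request: Resources, node_totals: List[Resources]) -> None:
    # Compare against total (not currently free) capacity: a busy node frees up
    # eventually, but a node that lacks the resource never will.
    if not any(fits(request, total) for total in node_totals):
        raise InfeasibleTaskError(f"no node can satisfy {request}")


cluster = [{"CPU": 8.0}, {"CPU": 4.0, "GPU": 1.0}]

# Feasible: the second node has enough CPU and GPU in total.
check_feasible({"CPU": 2.0, "GPU": 1.0}, cluster)

# Infeasible: no node has two GPUs, so the caller gets an error instead of a hang.
try:
    check_feasible({"GPU": 2.0}, cluster)
    outcome = "scheduled"
except InfeasibleTaskError as err:
    outcome = f"error: {err}"
```

The point of the sketch is the contrast with a warning in the events: the error reaches the code that submitted the task, so the workload stops with a clear message rather than waiting indefinitely.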