Make infeasible tasks error much more obvious #45909

Closed
alanwguo opened this issue Jun 12, 2024 · 4 comments · Fixed by #50200 or #50886
Assignees
Labels
core: Issues that should be addressed in Ray Core
enhancement: Request for new feature and/or capability
observability: Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling
P1: Issue that should be fixed within a few weeks

Comments

@alanwguo
Contributor

alanwguo commented Jun 12, 2024

I often see users run into common error scenarios that end up hanging the entire workload:

  1. Scheduling something on the head node when the head node has num_cpus=0
  2. Scheduling a task that requests more CPUs or GPUs than are available in the cluster (with autoscaling disabled)

Although we show a warning in the events, those can be hard to catch. We should make infeasible requests errors and have the task itself return an error.
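
To make the two scenarios concrete, here is a minimal sketch in plain Python of what "infeasible" means and of failing fast instead of hanging. It does not use Ray and is not Ray's scheduler; `fits`, `is_infeasible`, `submit`, and `InfeasibleTaskError` are hypothetical names chosen for illustration, and autoscaling is assumed to be disabled.

```python
from typing import Dict, List

# A resource shape maps resource names to amounts, e.g. {"CPU": 1, "GPU": 2}.
ResourceShape = Dict[str, float]


class InfeasibleTaskError(RuntimeError):
    """Hypothetical error raised when no node could ever run the request."""


def fits(request: ResourceShape, node_total: ResourceShape) -> bool:
    """True if the node's total resources cover every requested amount."""
    return all(node_total.get(name, 0.0) >= amount for name, amount in request.items())


def is_infeasible(request: ResourceShape, cluster_nodes: List[ResourceShape]) -> bool:
    """A request is infeasible when no node in the (fixed-size) cluster fits it."""
    return not any(fits(request, node) for node in cluster_nodes)


def submit(request: ResourceShape, cluster_nodes: List[ResourceShape]) -> str:
    """Raise immediately for infeasible requests instead of queueing forever."""
    if is_infeasible(request, cluster_nodes):
        raise InfeasibleTaskError(f"No node can satisfy {request}")
    return "queued"


# Scenario 1: a head-node-only cluster started with num_cpus=0.
head_only = [{"CPU": 0}]
# Scenario 2: a cluster with fewer GPUs than the task requests.
small_cluster = [{"CPU": 0}, {"CPU": 4, "GPU": 1}]
```

With this sketch, `submit({"CPU": 1}, head_only)` and `submit({"GPU": 2}, small_cluster)` both raise, while `submit({"CPU": 1}, small_cluster)` is queued.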

@alanwguo alanwguo added observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling core Issues that should be addressed in Ray Core labels Jun 12, 2024
@anyscalesam anyscalesam added p0.5 enhancement Request for new feature and/or capability labels Jun 24, 2024
@anyscalesam
Contributor

adding for consideration in next sprint

@alanwguo
Contributor Author

can we consider this for next sprint? This comes up in ray data as well.

@anyscalesam
Contributor

@jjyao what would be the tshirt size for this?

@jjyao jjyao added P1 Issue that should be fixed within a few weeks and removed P0.5 labels Oct 30, 2024
@MengjinYan MengjinYan self-assigned this Nov 25, 2024
edoakes pushed a commit that referenced this issue Feb 24, 2025
…e Manager (#50200)

This is part 2 of the PRs to resolve #45909.

This PR:
- Added a `CancelTaskWithResourceShapes` to the `NodeManagerService`. 
- The corresponding calls are added to the `RayletClient` as well
- The `CancelTaskWithResourceShape` will leverage the functions in
`ClusterTaskManager` to cancel the tasks that match certain resource
shapes on the node.
- The corresponding tests are added as well.

There will be one more follow-up PR that uses this API to cancel the
tasks whose requirements match the infeasible tasks' resource
requirements obtained from the autoscaler.

Also, note that the implementation assumes:
1. The same resource shape should be either feasible or infeasible on a
certain node
2. The [PlacementConstraint](url) doesn't impact whether a task is
infeasible or not
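
The idea of cancelling queued tasks by resource shape can be sketched in plain Python as follows. This is an illustration only, not the C++ code in `ClusterTaskManager`; `ToyTaskQueue`, `QueuedTask`, `to_shape`, and `cancel_tasks_with_resource_shapes` are hypothetical names. It relies on assumption 1 above: a shape is either feasible or infeasible on a node, so matching on the shape alone is enough.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Tuple

# Hashable form of a resource request so shapes can be compared as a whole.
Shape = FrozenSet[Tuple[str, float]]


def to_shape(resources: Dict[str, float]) -> Shape:
    """Convert a resource dict into an order-independent, hashable shape."""
    return frozenset(resources.items())


@dataclass
class QueuedTask:
    task_id: str
    resources: Dict[str, float]


class ToyTaskQueue:
    """Hypothetical stand-in for a node's queue of pending tasks."""

    def __init__(self) -> None:
        self.pending: List[QueuedTask] = []
        self.cancelled: List[str] = []

    def cancel_tasks_with_resource_shapes(
        self, shapes: Iterable[Dict[str, float]]
    ) -> int:
        """Cancel every pending task whose request exactly matches a target shape.

        Returns the number of tasks cancelled by this call.
        """
        targets = {to_shape(s) for s in shapes}
        kept: List[QueuedTask] = []
        cancelled_now = 0
        for task in self.pending:
            if to_shape(task.resources) in targets:
                self.cancelled.append(task.task_id)
                cancelled_now += 1
            else:
                kept.append(task)
        self.pending = kept
        return cancelled_now


queue = ToyTaskQueue()
queue.pending = [
    QueuedTask("a", {"CPU": 1}),
    QueuedTask("b", {"GPU": 8}),
    QueuedTask("c", {"GPU": 8}),
    QueuedTask("d", {"CPU": 1, "GPU": 8}),
]
num_cancelled = queue.cancel_tasks_with_resource_shapes([{"GPU": 8}])
```

Note that the match is exact: task `d` requests `{"CPU": 1, "GPU": 8}`, which is a different shape from `{"GPU": 8}`, so it stays in the queue.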

---------

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
@edoakes
Contributor

edoakes commented Feb 24, 2025

not done yet

@edoakes edoakes reopened this Feb 24, 2025
kevin85421 pushed a commit to kevin85421/ray that referenced this issue Feb 28, 2025
…e Manager (ray-project#50200)

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Signed-off-by: kaihsun <kaihsun@anyscale.com>
edoakes pushed a commit that referenced this issue Mar 2, 2025
…ler State (#50886)

This is the last PR for the feature of early termination of infeasible
tasks.

The previous PRs:
(1) added logic in GCS to obtain the per node infeasible resource
requests from the autoscaler state;
(2) added the new API in raylet_client, node_manager and
cluster_task_manager to cancel tasks with certain resource request
shapes

This PR added the integration of the above PRs:
(1) added the logic to call the cancel resource shape API based on the
per node infeasible requests from the autoscaler
(2) put the feature behind a Ray config that is on by default
(3) small improvements on previous PRs (logging, comments, messages, add
an early exit when getting the per node infeasible requests)
(4) added integration tests for both normal task scheduling case and
actor creation case

With this change, infeasible tasks (both normal tasks and actor
creation tasks) are terminated early by default.
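
The integration step described above can be pictured with the following plain-Python sketch. It is not the GCS implementation; `cancel_infeasible_requests`, `cancel_on_node`, and the `enabled` flag are hypothetical stand-ins for the per-node infeasible requests from the autoscaler state, the cancel API, and the Ray config switch.

```python
from typing import Callable, Dict, List

Shape = Dict[str, float]


def cancel_infeasible_requests(
    per_node_infeasible: Dict[str, List[Shape]],
    cancel_on_node: Callable[[str, List[Shape]], None],
    enabled: bool = True,
) -> List[str]:
    """Ask each node to cancel tasks matching its infeasible resource shapes.

    Returns the IDs of the nodes that were contacted.
    """
    # Early exit: feature switched off, or nothing is infeasible.
    if not enabled or not per_node_infeasible:
        return []

    contacted: List[str] = []
    for node_id, shapes in per_node_infeasible.items():
        if not shapes:
            continue  # this node has no infeasible requests
        cancel_on_node(node_id, shapes)
        contacted.append(node_id)
    return contacted


# Record the calls instead of performing real RPCs.
calls: List[tuple] = []


def record_cancel(node_id: str, shapes: List[Shape]) -> None:
    calls.append((node_id, shapes))


infeasible = {"node-1": [{"GPU": 8}], "node-2": []}
contacted_nodes = cancel_infeasible_requests(infeasible, record_cancel)
```

Here only `node-1` is contacted, because `node-2` reports no infeasible shapes; passing `enabled=False` skips the cancellation entirely.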

Closes #45909

---------

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
@exalate-issue-sync exalate-issue-sync bot reopened this Mar 2, 2025
VamshikShetty pushed a commit to VamshikShetty/ray that referenced this issue Mar 3, 2025
…ler State (ray-project#50886)

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Signed-off-by: vs030455 <vamshikdshetty@gmail.com>
Michaelhess17 pushed a commit to Michaelhess17/ray that referenced this issue Mar 3, 2025
…e Manager (ray-project#50200)

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Michaelhess17 pushed a commit to Michaelhess17/ray that referenced this issue Mar 3, 2025
…ler State (ray-project#50886)

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
xsuler pushed a commit to antgroup/ant-ray that referenced this issue Mar 4, 2025
…e Manager (ray-project#50200)

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
xsuler pushed a commit to antgroup/ant-ray that referenced this issue Mar 4, 2025
…ler State (ray-project#50886)

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
5 participants