[Core] Fix check failure due to negative available resource #50517

Merged
jjyao merged 4 commits into ray-project:master from jjyao/overrr on Feb 14, 2025

@jjyao jjyao commented Feb 13, 2025

Why are these changes needed?

The sequence of events that can trigger the check failure:

  1. A node has PG resources {"CPU_group_aaa": 2, "CPU_group_0_aaa": 1, "CPU_group_1_aaa": 1}
  2. Worker1 acquires 1 PG CPU resource; the node's available resources become {"CPU_group_aaa": 1, "CPU_group_0_aaa": 0, "CPU_group_1_aaa": 1}
  3. Worker1 calls ray.get(), which temporarily releases its CPU resource; the node's available resources become {"CPU_group_aaa": 2, "CPU_group_0_aaa": 1, "CPU_group_1_aaa": 1}
  4. Worker2 acquires 1 PG CPU resource; the node's available resources become {"CPU_group_aaa": 1, "CPU_group_0_aaa": 0, "CPU_group_1_aaa": 1}
  5. ray.get() returns in worker1 and it reacquires the resource; the node's available resources become {"CPU_group_aaa": 0, "CPU_group_0_aaa": -1, "CPU_group_1_aaa": 1}
  6. Worker3 tries to acquire 1 PG CPU resource, and the check failure happens because it can get CPU_group_1_aaa but not CPU_group_aaa.

The fix is to allow CPU_group_aaa to go negative as well, so that it matches the indexed PG CPU resources.
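
The sequence above can be replayed with a small toy model. This is only a sketch of the bookkeeping, not Ray's scheduler code: the `adjust` helper and the `stepN` names are hypothetical, and the last step assumes the fix simply lets the wildcard resource CPU_group_aaa drop below zero the same way the indexed resources already do.

```python
# Toy model of a node's available placement group (PG) resources.
# Hypothetical helper names; this does not call into Ray.

def adjust(available, bundle_index, delta):
    """Apply `delta` to the wildcard and the indexed PG CPU resource.

    Negative results are allowed, and a copy of the new state is returned.
    """
    available["CPU_group_aaa"] += delta
    available[f"CPU_group_{bundle_index}_aaa"] += delta
    return dict(available)


# Step 1: initial PG resources on the node.
available = {"CPU_group_aaa": 2, "CPU_group_0_aaa": 1, "CPU_group_1_aaa": 1}

step2 = adjust(available, 0, -1)  # worker1 acquires 1 CPU from bundle 0
step3 = adjust(available, 0, +1)  # worker1 blocks in ray.get() and releases it
step4 = adjust(available, 0, -1)  # worker2 acquires 1 CPU from bundle 0
step5 = adjust(available, 0, -1)  # worker1 returns and reacquires its CPU

# Step 6: worker3 wants 1 CPU from bundle 1. The indexed resource has
# capacity, but the wildcard resource does not, which is the mismatch
# that triggered the check failure.
indexed_ok = available["CPU_group_1_aaa"] >= 1
wildcard_ok = available["CPU_group_aaa"] >= 1

# Assumed behavior after the fix: the wildcard may go negative too.
step6 = adjust(available, 1, -1)

for name, state in [("step5", step5), ("step6", step6)]:
    print(name, state)
```

In this model the state after step 5 matches the description, and step 6 shows the indexed resource satisfying the request while the wildcard resource cannot unless it is allowed to go negative.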

Related issue number

Closes #50433

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@jjyao jjyao added the "go" label (add ONLY when ready to merge, run all tests) Feb 13, 2025
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@jjyao jjyao marked this pull request as ready for review February 13, 2025 19:53
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@jjyao jjyao requested review from MengjinYan and dentiny February 13, 2025 23:56
@MengjinYan MengjinYan left a comment

Thanks!

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@jjyao jjyao merged commit 2e37145 into ray-project:master Feb 14, 2025
5 checks passed
@jjyao jjyao deleted the jjyao/overrr branch February 14, 2025 17:04
400Ping pushed a commit to 400Ping/ray that referenced this pull request Feb 20, 2025
…ect#50517)

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: 400Ping <43886578+400Ping@users.noreply.github.com>
israbbani pushed a commit that referenced this pull request Feb 25, 2025
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
xsuler pushed a commit to antgroup/ant-ray that referenced this pull request Mar 4, 2025
Successfully merging this pull request may close these issues.

[Core] Crash in raylet (ray::RayLog::~RayLog)
3 participants