Revert "Set all_reduce_token to None when exiting" #6321
Conversation
Let's wait and manually verify all GPU tests.
The GPU CI fails, but the CI log doesn't show any failing test. However, I can see the error:
I will try to reproduce with the last test that ran.
Running the last test from the CI run gave me:
The test reports OK, but it actually failed. The same error also appeared for other tests, such as
I also checked the GPU CI log. Before this PR, the GPU CI actually fails with an error, for example:
With this PR, I no longer see any such error message. So after this PR is merged, the next step is to fix the GPU CI so that this kind of error won't go unnoticed.
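One way the follow-up CI fix could work is to scan the captured log for fatal-error markers and fail the job even when the test harness itself reports OK. This is a hypothetical sketch, not the actual CI script; the marker patterns (`Check failed` and a glog-style `F`+digits fatal prefix) are assumptions:

```python
import re

# Hypothetical helper: the pattern list is an assumption, not the
# project's actual CI error filter.
def log_has_hidden_errors(log_text, patterns=(r"Check failed", r"\bF\d{4}\b")):
    # Return True if any known fatal-error marker appears in the log,
    # even when the test harness printed OK at the end.
    return any(re.search(p, log_text) for p in patterns)
```

The CI job could run this over the captured log after the tests finish and exit nonzero whenever it returns True, so an "OK but actually failed" run turns the build red.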
Note: I also need to revert #6268 to make the CI pass. #6268 was backported to r2.2 as well, so we may want to revert it in the r2.2 release branch too. @ManfeiBai @JackCaoG
@vanbasten23, sorry that #6247 broke the CI. I think the error occurs because the ComputationClient has already exited in
#6247 fixes the issue in #6246, which is a common scenario during testing.
By the way, when using xla_multiprocessing, #6246 will not arise, so mp_test_early_exit.py is not needed. This test should be placed in
Reverts #6247
See #6320 and #6247 (comment) for details. Also, we observed an error in the GPU CI result for #6247 even though the GPU CI didn't catch it. This needs investigation.