Catch cuda driver shutdown error in NCCLWatchdog (pytorch#106503)
There is a design flaw in NCCLWatchdog: it spawns threads that
talk to the CUDA API, but the CUDA API may already have been
deinitialized, creating a race.

This is a known issue with widespread impact
(pytorch#90848).

I tested this fix with the repro command for pytorch#82632 by running `NCCL_DESYNC_DEBUG=1 CUDA_LAUNCH_BLOCKING=1 python test/distributed/test_c10d_nccl.py -k test_find_unused_parameters_kwarg_debug_detail`. Instead of crashing, the run now emits log messages containing the exception string for the CUDA driver shutdown error.

A partial fix has already landed, but it was applied too narrowly:
pytorch@ec071a0

This PR copies the previous fix and applies it to one more case,
plugging one more hole. We probably need to do a more thorough review
and either plug all the holes or design this differently.
Pull Request resolved: pytorch#106503
Approved by: https://github.com/kwen2501
wconstab authored and pytorchmergebot committed Aug 3, 2023
1 parent c9c2b14 commit a6f7dd4
Showing 1 changed file with 14 additions and 4 deletions.
torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
@@ -399,12 +399,22 @@ bool ProcessGroupNCCL::WorkNCCL::finishedGPUExecution() {
 }

 bool ProcessGroupNCCL::WorkNCCL::startedGPUExecutionInternal() const {
-  for (const auto i : c10::irange(devices_.size())) {
-    // Checking the work's corresponding CUDA events' status
-    if (!(*ncclStartEvents_)[i].query()) {
-      return false;
+  try {
+    for (const auto i : c10::irange(devices_.size())) {
+      // Checking the work's corresponding CUDA events' status
+      if (!(*ncclStartEvents_)[i].query()) {
+        return false;
+      }
     }
+  } catch (const std::exception& e) {
+    if (std::string(e.what()).find("driver shutting down") ==
+        std::string::npos) {
+      throw;
+    }
+    LOG(INFO) << "[Rank " << rank_
+              << "] Event query failed with exception: " << e.what();
   }
+
   return true;
 }
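
Since the commit message notes that this guard has now been copy-pasted into a second call site, the pattern could in principle be factored into one helper. The following is only a minimal standalone sketch of that idea, not code from PyTorch: the names `isDriverShutdownError` and `queryIgnoringDriverShutdown` are hypothetical, and `std::clog` stands in for the `LOG(INFO)` macro used in the diff. It keeps the same behavior as the change above: exceptions whose message contains "driver shutting down" are logged and swallowed, and everything else is rethrown.

```cpp
#include <exception>
#include <functional>
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical helper (not a PyTorch API): true when the exception message
// contains the same substring the commit matches on.
inline bool isDriverShutdownError(const std::exception& e) {
  return std::string(e.what()).find("driver shutting down") !=
      std::string::npos;
}

// Hypothetical wrapper (not a PyTorch API): runs `query` and returns its
// result. If `query` throws a driver-shutdown error, the error is logged and
// `fallback` is returned instead. Any other exception is rethrown unchanged.
inline bool queryIgnoringDriverShutdown(
    const std::function<bool()>& query,
    bool fallback) {
  try {
    return query();
  } catch (const std::exception& e) {
    if (!isDriverShutdownError(e)) {
      throw;
    }
    // Stand-in for LOG(INFO) in the real code.
    std::clog << "Event query failed with exception: " << e.what() << '\n';
    return fallback;
  }
}
```

Passing `true` as `fallback` would mirror the diff above, where control falls through to `return true` after the shutdown error is logged. Whether `true` is the right answer for every call site is a design question this sketch does not settle.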
