Add patches for PyTorch 1.7.1 avoiding failures on POWER and A100 #12753
Conversation
@boegelbot please test @ generoso
@boegel: Request for testing this PR well received on generoso PR test command '
Test results coming soon (I hope)... - notification for comment with ID 827555748 processed. Message to humans: this is just bookkeeping information for me,
Test report by @boegelbot
Test report by @boegel
I was running my last test build on a node that has the latest CUDA driver (465.19.01), which seems to have an impact on the tests for
Ever seen something like that @Flamefire? Perhaps the test is just unstable and it was bad luck? Edit: this also happens without the extra patches being added here, BTW...
Was your update to CUDA 11.2? The PyTorch guys have seen those: pytorch/pytorch#51905 Edit: that test was on foss, no CUDA. Running on fosscuda now
Test report by @boegel
Test failure on
Driver version is
Test report by @boegel
Test report by @Flamefire
@Flamefire Fix with latest CUDA drivers confirmed, but trouble on other systems?
@verdurin Does that system have the latest patches for OpenBLAS? Because that is a CPU test failure which works on our POWER system. @boegel IIRC it never worked on our A100 system (the previous test was foss only). FWIW the upstream issue is pytorch/pytorch#52278. Maybe I'll just disable
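For context on what disabling a test looks like on the EasyBuild side: the PyTorch easyblock accepts an excluded_tests easyconfig parameter, a dict mapping an architecture name ('' for all architectures) to test modules to skip. The fragment below is only a sketch with placeholder test names, not the actual change made for this PR; check the PyTorch easyblock for the exact parameter semantics.

```python
# Sketch of an easyconfig fragment (easyconfigs are plain Python assignments).
# The test names here are placeholders, not the tests discussed in this PR.
excluded_tests = {
    # skipped on all architectures
    '': [
        'test_cpp_extensions_jit',            # placeholder: e.g. flaky with newer CUDA drivers
    ],
    # skipped only on POWER
    'POWER': [
        'distributed/test_distributed_fork',  # placeholder: e.g. hangs on POWER nodes
    ],
}
```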
@boegel If it helps, this is what I currently have installed on the AMD EPYC 7552 48-Core Processor machine I have access to:
@Flamefire this is the same node I mentioned recently. It has the following patches:
Test report by @Flamefire
Test report by @Flamefire
Test report by @Flamefire
Test report by @branfosj
I've seen that for PyTorch 1.8.1 on our A100 GPUs but never on the older ones. Maybe that is CUDA driver version related? Still need to check if I see this on our new partition with this EC here too. Might be that our (EB) NCCL is faulty.
The failed test was also with A100s, NVidia driver 460.73.01. I have this built fine on a system with a single P100 (NVidia driver 460.32.03). PyTorch 1.8.1 failed on the A100s with the same problem that you have reported to PyTorch. That also failed with a self-built NCCL 2.9.8.
Also confirmed with PyTorch 1.8.1 fosscuda/2020b and its submodule NCCL (2.7.8)
@boegelbot please test @ generoso
@boegel: Request for testing this PR well received on generoso PR test command '
Test results coming soon (I hope)... - notification for comment with ID 851550784 processed. Message to humans: this is just bookkeeping information for me,
Test report by @Flamefire
Test report by @Flamefire
Test report by @branfosj
Test report by @branfosj
Test report by @boegel
Test report by @Flamefire
The failed tests are the bottleneck_test, which will be disabled by easybuilders/easybuild-easyblocks#2450, and a flaky test with fosscuda-2020b where a timeout is reached too early, i.e. all are non-critical.
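For readers wondering how such tests get disabled in practice: PyTorch's test driver (test/run_test.py) accepts an --exclude option listing test modules to skip, and the easyblock assembles the test command from the excluded test names. The snippet below is a simplified sketch with placeholder names, not the actual code from easybuilders/easybuild-easyblocks#2450.

```python
# Simplified sketch: build a PyTorch test command that skips some test modules.
# The test names are placeholders and the real easyblock may use different flags.
excluded_tests = ['test_utils', 'distributed/test_distributed_fork']

# run_test.py takes all excluded modules after a single --exclude flag
exclude_args = '--exclude ' + ' '.join(excluded_tests) if excluded_tests else ''
test_cmd = 'cd test && PYTHONUNBUFFERED=1 python run_test.py --verbose %s' % exclude_args
print(test_cmd)
```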
OK, thanks for clarifying @Flamefire!
lgtm
Going in, thanks @Flamefire!
(created using eb --new-pr)