
Add patches for PyTorch 1.7.1 avoiding failures on POWER and A100 #12753

Merged

Conversation

@Flamefire (Contributor)

(created using eb --new-pr)

@branfosj added this to the next release (4.3.5?) milestone on Apr 26, 2021
@boegel (Member) commented Apr 27, 2021

@boegelbot please test @ generoso
EB_ARGS="PyTorch-1.7.1-foss-2020b.eb"
CORE_COUNT=16

@boegelbot (Collaborator)

@boegel: Request for testing this PR well received on generoso

PR test command 'EB_PR=12753 EB_ARGS="PyTorch-1.7.1-foss-2020b.eb" /apps/slurm/default/bin/sbatch --job-name test_PR_12753 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 16925

Test results coming soon (I hope)...

- notification for comment with ID 827555748 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot (Collaborator)

Test report by @boegelbot
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
generoso-c1-s-2 - Linux centos linux 8.2.2004, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/310ff178b1ae79051a55d52a3b7dd515 for a full test report.

@boegel (Member) commented Apr 27, 2021

Test report by @boegel
FAILED
Build succeeded for 2 out of 3 (3 easyconfigs in total)
node3300.joltik.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), Python 3.6.8
See https://gist.github.com/4f3971033005d330fbad209ad8bcbc8e for a full test report.

@boegel (Member) commented Apr 28, 2021

I was running my last test build on a node that has the latest CUDA driver (465.19.01), which seems to have an impact on the tests for PyTorch-1.7.1-fosscuda-2020b.eb?
I'm not seeing those failing tests on a node that didn't get the CUDA driver update yet (still running 455.23.05)...

FAIL: test_adadelta (__main__.TestOptim)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/eb-4st2qifr/tmp_vktr7wn/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 396, in wrapper
    fn(*args, **kwargs)
  File "test_optim.py", line 491, in test_adadelta
    self._test_basic_cases(
  File "test_optim.py", line 204, in _test_basic_cases
    self._test_state_dict(
  File "test_optim.py", line 192, in _test_state_dict
    self.assertEqual(weight, weight_cuda)
  File "/tmp/eb-4st2qifr/tmp_vktr7wn/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1136, in assertEqual
    self.assertTrue(result, msg=msg)
AssertionError: False is not true : Tensors failed to compare as equal! With rtol=1.3e-06 and atol=1e-05, found 2 element(s) (out of 50) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 1.2218952178955078e-05 (-0.7925083041191101 vs. -0.7924960851669312), which occurred at index (9, 1).

----------------------------------------------------------------------
Ran 103 tests in 72.666s

FAILED (failures=1)

Ever seen something like that @Flamefire? Perhaps the test is just unstable, and it was bad luck?

edit: This also happens without the extra patches being added here BTW...
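
For reference, that assertEqual boils down to the usual allclose criterion |actual - expected| <= atol + rtol * |expected|. A minimal sketch (not part of test_optim.py, values copied from the traceback above) shows the reported 1.22e-05 difference falling just outside the allowed margin:

```python
# Worked check of the tolerance, with the two values copied from the traceback
# above; illustration only, not part of test_optim.py.
import torch

atol, rtol = 1e-05, 1.3e-06
expected = torch.tensor(-0.7924960851669312, dtype=torch.float64)  # CPU weight
actual = torch.tensor(-0.7925083041191101, dtype=torch.float64)    # CUDA weight

margin = atol + rtol * expected.abs()   # ~1.103e-05
diff = (actual - expected).abs()        # ~1.222e-05, just outside the margin

print(diff.item(), margin.item(),
      torch.allclose(actual, expected, rtol=rtol, atol=atol))  # -> False
```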

@Flamefire (Contributor, Author) commented Apr 28, 2021

My test just finished; that was on AMD with A100s, so no, I haven't seen that.

Was your update to CUDA 11.2? The PyTorch developers have seen failures like those: pytorch/pytorch#51905

Edit: that test was on foss, i.e. without CUDA. Running on fosscuda now.

@boegel (Member) commented Apr 28, 2021

Test report by @boegel
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
node3300.joltik.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), Python 3.6.8
See https://gist.github.com/2e3c2453ed2e744ce302dea6809f385f for a full test report.

@verdurin (Member)

Test failure on ppc64le:

======================================================================
FAIL: test_norm_matrix_cpu_float64 (__main__.TestLinalgCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/eb-zf1wbxny/tmpbooavsxk/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 273, in instantiated_test
    result = test_fn(self, *args)
  File "/tmp/eb-zf1wbxny/tmpbooavsxk/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 508, in dep_fn
    return fn(slf, device, *args, **kwargs)
  File "test_linalg.py", line 314, in test_norm_matrix
    run_test_case(input, ord, dim, keepdim)
  File "test_linalg.py", line 289, in run_test_case
    self.assertEqual(result, result_numpy, msg=msg)
  File "/tmp/eb-zf1wbxny/tmpbooavsxk/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 1162, in assertEqual
    exact_dtype=exact_dtype, exact_device=exact_device)
  File "/tmp/eb-zf1wbxny/tmpbooavsxk/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 1162, in assertEqual
    exact_dtype=exact_dtype, exact_device=exact_device)
  File "/tmp/eb-zf1wbxny/tmpbooavsxk/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 1058, in assertEqual
    exact_dtype=exact_dtype, exact_device=exact_device)
  File "/tmp/eb-zf1wbxny/tmpbooavsxk/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 1173, in assertEqual
    self.assertTrue(result, msg=msg)
AssertionError: False is not true : input.size()=torch.Size([1000, 1000]), ord=-2, dim=None, keepdim=True, dtype=torch.float64

Driver version is 440.118.02.
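
For context, test_norm_matrix compares torch.linalg.norm against numpy.linalg.norm; for ord=-2 that is the smallest singular value, which on the CPU goes through this build's BLAS/LAPACK. A rough standalone sketch of the comparison (random input, not the test's exact data):

```python
# Rough standalone sketch (random input, not the test's exact data) of the
# comparison that test_norm_matrix_cpu_float64 performs for ord=-2, i.e. the
# smallest singular value of the matrix.
import numpy as np
import torch

torch.manual_seed(0)
x = torch.randn(1000, 1000, dtype=torch.float64)

result = torch.linalg.norm(x, ord=-2, keepdim=True)              # via this build's BLAS/LAPACK
result_numpy = np.linalg.norm(x.numpy(), ord=-2, keepdims=True)  # via numpy's LAPACK

# The test asserts these agree within tolerance; the AssertionError above means
# they did not on that ppc64le node.
print(result.item(), result_numpy.item())
```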

@boegel (Member) commented Apr 29, 2021

Test report by @boegel
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
node3300.joltik.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), Python 3.6.8
See https://gist.github.com/db6c727c6719225627d9046e43f9ede0 for a full test report.

@Flamefire (Contributor, Author)

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
taurusi8028 - Linux centos linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), Python 2.7.5
See https://gist.github.com/7444d3c63482ac1565bb481189778606 for a full test report.

@boegel (Member) commented Apr 29, 2021

@Flamefire So the fix with the latest CUDA drivers is confirmed, but there's trouble on other systems?

@Flamefire (Contributor, Author)

@verdurin Does that system have the latest patches for OpenBLAS? That is a CPU test failure, and the test passes on our POWER system.

@boegel IIRC it never worked on our A100 system (the previous test was foss only). FWIW, the upstream issue is pytorch/pytorch#52278

Maybe I'll just disable test_nn for 2020b?
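
If it comes to that, a hypothetical sketch of what the exclusion could look like in the easyconfig, assuming the PyTorch easyblock's excluded_tests parameter (a dict mapping an architecture string to test names, with '' applying to all architectures); the entries are illustrative only:

```python
# Hypothetical sketch only, not part of this PR: what disabling test_nn could
# look like in PyTorch-1.7.1-fosscuda-2020b.eb, assuming the PyTorch easyblock's
# `excluded_tests` parameter (a dict mapping an architecture string to test
# names, where '' applies to all architectures). The entries are illustrative.
excluded_tests = {
    '': [
        # exclusions already present in the easyconfig would stay here
        'test_nn',  # hypothetical: skip the suite failing on A100 (pytorch/pytorch#52278)
    ],
}
```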

@sassy-crick (Collaborator)

> I was running my last test build on a node that has the latest CUDA driver (465.19.01), which seems to have an impact on the tests for PyTorch-1.7.1-fosscuda-2020b.eb?
> I'm not seeing those failing tests on a node that didn't get the CUDA driver update yet (still running 455.23.05)...
>
> Ever seen something like that @Flamefire? Perhaps the test is just unstable, and it was bad luck?
>
> edit: This also happens without the extra patches being added here BTW...

@boegel If it helps, this is what is currently installed on the AMD EPYC 7552 48-Core Processor machine I have access to:
NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2

@verdurin (Member)

@Flamefire this is the same node I mentioned recently. It has the following patches:

OpenBLAS-0.3.7_fix-build-on-arm-tsv110.patch
OpenBLAS-0.3.7_fix-missing-sync-on-power.patch
OpenBLAS-0.3.7_fix-possible-memory-leak-after-fork.patch
OpenBLAS-0.3.7_reinit-threads-after-fork.patch
OpenBLAS-0.3.8_fix-dscal-inline-asm.patch

@Flamefire (Contributor, Author)

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
taurusml20 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), Python 2.7.5
See https://gist.github.com/99818b322c357a18b81a8f787580c05a for a full test report.

@Flamefire (Contributor, Author)

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
taurusml20 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), Python 2.7.5
See https://gist.github.com/b61c71011808e661d7049267fca2a584 for a full test report.

@Flamefire (Contributor, Author)

Test report by @Flamefire
SUCCESS
Build succeeded for 16 out of 16 (2 easyconfigs in total)
taurusml13 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), Python 2.7.5
See https://gist.github.com/4cb3e4a1164017b0c2ab72b431a6bdeb for a full test report.

@branfosj (Member)

Test report by @branfosj
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
bask-pg0309u25a - Linux RHEL 8.3, x86_64, Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60GHz (icelake), Python 3.6.8
See https://gist.github.com/f8c92f2e5ea97526fba6ffde80e1201a for a full test report.

@Flamefire (Contributor, Author)

@branfosj

> what(): CUDA error: uncorrectable NVLink error detected during the execution

I've seen that for PyTorch 1.8.1 on our A100 GPUs but never on the older ones. Maybe that's related to the CUDA driver version? I still have to check whether I see this on our new partition with this EC too. It might be that our (EB) NCCL is faulty.
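
A minimal, hedged sketch (not part of this PR; the script name and setup are illustrative) for exercising NCCL over NVLink outside the PyTorch test suite, to check whether the error reproduces independently of the tests:

```python
# Hedged sketch, not from this PR: a standalone NCCL/NVLink sanity check,
# roughly what the failing distributed tests exercise. The script name
# (nccl_check.py) is illustrative. Launch one process per GPU with:
#   python -m torch.distributed.launch --nproc_per_node=<gpus> nccl_check.py
# which sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE and passes --local_rank.
import argparse

import torch
import torch.distributed as dist


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend="nccl")  # reads the env:// settings set by launch

    # A single all_reduce forces NCCL communication between the GPUs; an
    # "uncorrectable NVLink error" like the one above would surface here.
    t = torch.full((1024,), float(args.local_rank + 1), device="cuda")
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}: all_reduce OK, t[0] = {t[0].item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```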

@branfosj (Member)

> what(): CUDA error: uncorrectable NVLink error detected during the execution
>
> I've seen that for PyTorch 1.8.1 on our A100 GPUs but never on the older ones. Maybe that's related to the CUDA driver version? I still have to check whether I see this on our new partition with this EC too. It might be that our (EB) NCCL is faulty.

The failed test was also with A100s, NVIDIA driver 460.73.01.

I have this built fine on a system with a single P100 (NVIDIA driver 460.32.03).

PyTorch 1.8.1 failed on the A100s with the same problem that you have reported to PyTorch. That also failed with a self-built NCCL 2.9.8.

@Flamefire (Contributor, Author)

> That also failed with a self-built NCCL 2.9.8.

Also confirmed with PyTorch 1.8.1 fosscuda/2020b and its submodule NCCL (2.7.8).
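
For completeness, a quick way to check which NCCL a given PyTorch build actually uses (sketch, not from this PR); on these PyTorch versions torch.cuda.nccl.version() returns an int such as 2708 for NCCL 2.7.8:

```python
# Sketch, not from this PR: check which NCCL a given PyTorch build actually
# uses, e.g. to tell the bundled submodule NCCL apart from an external one.
import torch
import torch.cuda.nccl as nccl

print(torch.__version__, torch.version.cuda)
# On these PyTorch versions this returns an int, e.g. 2708 -> NCCL 2.7.8
# (newer PyTorch releases return a (major, minor, patch) tuple).
print("NCCL version:", nccl.version())
```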

@boegel modified the milestones: 4.4.0, release after 4.4.0 (May 27, 2021)
@boegel (Member) commented May 31, 2021

@boegelbot please test @ generoso
CORE_CNT=16

@boegelbot (Collaborator)

@boegel: Request for testing this PR well received on generoso

PR test command 'EB_PR=12753 EB_ARGS= /apps/slurm/default/bin/sbatch --job-name test_PR_12753 --ntasks="16" ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 17326

Test results coming soon (I hope)...

- notification for comment with ID 851550784 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@Flamefire (Contributor, Author)

Test report by @Flamefire
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
taurusml4 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), Python 2.7.5
See https://gist.github.com/7ff4ca773bf0621e4829253d878785dc for a full test report.

@boegel modified the milestones: release after 4.4.0, 4.4.0 (May 31, 2021)
@Flamefire (Contributor, Author)

Test report by @Flamefire
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
taurusa4 - Linux centos linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz (broadwell), Python 2.7.5
See https://gist.github.com/8ec1f1b47ced050a742842d085f1e5fe for a full test report.

@branfosj (Member)

Test report by @branfosj
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bear-pg0211u13a.bear.cluster - Linux RHEL 8.3, x86_64, Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz (cascadelake), Python 3.6.8
See https://gist.github.com/14cea78c30fb1f83200be1af0c1ab7d3 for a full test report.

@branfosj (Member)

Test report by @branfosj
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
bear-pg0212u15b.bear.cluster - Linux RHEL 8.3, x86_64, Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz (broadwell), Python 3.6.8
See https://gist.github.com/0af507f398d3918bdbe9ddd1b5ccc155 for a full test report.

@boegel (Member) commented May 31, 2021

Test report by @boegel
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
node3302.joltik.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), Python 3.6.8
See https://gist.github.com/e1a6c67d1f21a63b28ce877458461ba0 for a full test report.

@Flamefire (Contributor, Author)

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
taurusa7 - Linux centos linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz (broadwell), Python 2.7.5
See https://gist.github.com/8a4e86dfcedde1e7eec268336ded2c88 for a full test report.

@Flamefire (Contributor, Author)

The failed tests are bottleneck_test, which will be disabled by easybuilders/easybuild-easyblocks#2450, and a flaky test with fosscuda-2020b where a timeout is reached too early, i.e. all of them are non-critical.
The latter is annoying though, and might be related to the increased test times I'm seeing for 2020b over 2020a over 2019b, which I'm still investigating.

@boegel (Member) commented Jun 1, 2021

> The failed tests are bottleneck_test, which will be disabled by easybuilders/easybuild-easyblocks#2450, and a flaky test with fosscuda-2020b where a timeout is reached too early, i.e. all of them are non-critical.
> The latter is annoying though, and might be related to the increased test times I'm seeing for 2020b over 2020a over 2019b, which I'm still investigating.

OK, thanks for clarifying @Flamefire!

@boegel (Member) left a comment

lgtm

@boegel (Member) commented Jun 1, 2021

Going in, thanks @Flamefire!

@boegel merged commit c170815 into easybuilders:develop on Jun 1, 2021
@Flamefire deleted the 20210426111814_new_pr_PyTorch171 branch on June 1, 2021 at 09:11