
Add patches for PyTorch 1.7.1 avoiding failures on POWER and A100 #12753

Merged

Conversation

@Flamefire (Contributor)

(created using eb --new-pr)

@branfosj added this to the next release (4.3.5?) milestone on Apr 26, 2021
@boegel (Member) commented Apr 27, 2021

@boegelbot please test @ generoso
EB_ARGS="PyTorch-1.7.1-foss-2020b.eb"
CORE_COUNT=16

@boegelbot (Collaborator)

@boegel: Request for testing this PR well received on generoso

PR test command 'EB_PR=12753 EB_ARGS="PyTorch-1.7.1-foss-2020b.eb" /apps/slurm/default/bin/sbatch --job-name test_PR_12753 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 16925

Test results coming soon (I hope)...

- notification for comment with ID 827555748 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot (Collaborator)

Test report by @boegelbot
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
generoso-c1-s-2 - Linux centos linux 8.2.2004, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/310ff178b1ae79051a55d52a3b7dd515 for a full test report.

@boegel (Member) commented Apr 27, 2021

Test report by @boegel
FAILED
Build succeeded for 2 out of 3 (3 easyconfigs in total)
node3300.joltik.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), Python 3.6.8
See https://gist.github.com/4f3971033005d330fbad209ad8bcbc8e for a full test report.

@boegel (Member) commented Apr 28, 2021

I was running my last test build on a node that has the latest CUDA driver (465.19.01), which seems to have an impact on the tests for PyTorch-1.7.1-fosscuda-2020b.eb?
I'm not seeing those failing tests on a node that didn't get the CUDA driver update yet (still running 455.23.05)...

FAIL: test_adadelta (__main__.TestOptim)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/eb-4st2qifr/tmp_vktr7wn/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 396, in wrapper
    fn(*args, **kwargs)
  File "test_optim.py", line 491, in test_adadelta
    self._test_basic_cases(
  File "test_optim.py", line 204, in _test_basic_cases
    self._test_state_dict(
  File "test_optim.py", line 192, in _test_state_dict
    self.assertEqual(weight, weight_cuda)
  File "/tmp/eb-4st2qifr/tmp_vktr7wn/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1136, in assertEqual
    self.assertTrue(result, msg=msg)
AssertionError: False is not true : Tensors failed to compare as equal! With rtol=1.3e-06 and atol=1e-05, found 2 element(s) (out of 50) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 1.2218952178955078e-05 (-0.7925083041191101 vs. -0.7924960851669312), which occurred at index (9, 1).

----------------------------------------------------------------------
Ran 103 tests in 72.666s

FAILED (failures=1)

Ever seen something like that @Flamefire? Perhaps the test is just unstable, and it was bad luck?

edit: This also happens without the extra patches being added here BTW...
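
For reference, that assertEqual boils down to the usual allclose criterion |actual - expected| <= atol + rtol * |expected|. A minimal sketch (not part of test_optim.py, values copied from the traceback above) shows the reported 1.22e-05 difference falling just outside the allowed margin:

```python
# Worked check of the tolerance, with the two values copied from the traceback
# above; illustration only, not part of test_optim.py.
import torch

atol, rtol = 1e-05, 1.3e-06
expected = torch.tensor(-0.7924960851669312, dtype=torch.float64)  # CPU weight
actual = torch.tensor(-0.7925083041191101, dtype=torch.float64)    # CUDA weight

margin = atol + rtol * expected.abs()   # ~1.103e-05
diff = (actual - expected).abs()        # ~1.222e-05, just outside the margin

print(diff.item(), margin.item(),
      torch.allclose(actual, expected, rtol=rtol, atol=atol))  # -> False
```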

@Flamefire (Contributor, Author) commented Apr 28, 2021

My test just finished; that was on AMD with A100s, so no, I haven't seen that.

Was your update to CUDA 11.2? The PyTorch developers have seen failures like those: pytorch/pytorch#51905

Edit: that test was on foss, i.e. without CUDA. Running on fosscuda now.

@boegel (Member) commented Apr 28, 2021

Test report by @boegel
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
node3300.joltik.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), Python 3.6.8
See https://gist.github.com/2e3c2453ed2e744ce302dea6809f385f for a full test report.

@verdurin (Member)

Test failure on ppc64le:

======================================================================
FAIL: test_norm_matrix_cpu_float64 (__main__.TestLinalgCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/eb-zf1wbxny/tmpbooavsxk/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 273, in instantiated_test
    result = test_fn(self, *args)
  File "/tmp/eb-zf1wbxny/tmpbooavsxk/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 508, in dep_fn
    return fn(slf, device, *args, **kwargs)
  File "test_linalg.py", line 314, in test_norm_matrix
    run_test_case(input, ord, dim, keepdim)
  File "test_linalg.py", line 289, in run_test_case
    self.assertEqual(result, result_numpy, msg=msg)
  File "/tmp/eb-zf1wbxny/tmpbooavsxk/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 1162, in assertEqual
    exact_dtype=exact_dtype, exact_device=exact_device)
  File "/tmp/eb-zf1wbxny/tmpbooavsxk/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 1162, in assertEqual
    exact_dtype=exact_dtype, exact_device=exact_device)
  File "/tmp/eb-zf1wbxny/tmpbooavsxk/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 1058, in assertEqual
    exact_dtype=exact_dtype, exact_device=exact_device)
  File "/tmp/eb-zf1wbxny/tmpbooavsxk/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 1173, in assertEqual
    self.assertTrue(result, msg=msg)
AssertionError: False is not true : input.size()=torch.Size([1000, 1000]), ord=-2, dim=None, keepdim=True, dtype=torch.float64

Driver version is 440.118.02.
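
For context, test_norm_matrix compares torch.linalg.norm against numpy.linalg.norm; for ord=-2 that is the smallest singular value, which on the CPU goes through this build's BLAS/LAPACK. A rough standalone sketch of the comparison (random input, not the test's exact data):

```python
# Rough standalone sketch (random input, not the test's exact data) of the
# comparison that test_norm_matrix_cpu_float64 performs for ord=-2, i.e. the
# smallest singular value of the matrix.
import numpy as np
import torch

torch.manual_seed(0)
x = torch.randn(1000, 1000, dtype=torch.float64)

result = torch.linalg.norm(x, ord=-2, keepdim=True)              # via this build's BLAS/LAPACK
result_numpy = np.linalg.norm(x.numpy(), ord=-2, keepdims=True)  # via numpy's LAPACK

# The test asserts these agree within tolerance; the AssertionError above means
# they did not on that ppc64le node.
print(result.item(), result_numpy.item())
```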

@boegel (Member) commented Apr 29, 2021

Test report by @boegel
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
node3300.joltik.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), Python 3.6.8
See https://gist.github.com/db6c727c6719225627d9046e43f9ede0 for a full test report.

@Flamefire (Contributor, Author)

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
taurusi8028 - Linux centos linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), Python 2.7.5
See https://gist.github.com/7444d3c63482ac1565bb481189778606 for a full test report.

@boegel (Member) commented Apr 29, 2021

@Flamefire So the fix with the latest CUDA drivers is confirmed, but there's trouble on other systems?

@Flamefire (Contributor, Author)

@verdurin Does that system have the latest patches for OpenBLAS? That is a CPU test failure, and the test passes on our POWER system.

@boegel IIRC it never worked on our A100 system (the previous test was foss only). FWIW, the upstream issue is pytorch/pytorch#52278

Maybe I'll just disable test_nn for 2020b?
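
If it comes to that, a hypothetical sketch of what the exclusion could look like in the easyconfig, assuming the PyTorch easyblock's excluded_tests parameter (a dict mapping an architecture string to test names, with '' applying to all architectures); the entries are illustrative only:

```python
# Hypothetical sketch only, not part of this PR: what disabling test_nn could
# look like in PyTorch-1.7.1-fosscuda-2020b.eb, assuming the PyTorch easyblock's
# `excluded_tests` parameter (a dict mapping an architecture string to test
# names, where '' applies to all architectures). The entries are illustrative.
excluded_tests = {
    '': [
        # exclusions already present in the easyconfig would stay here
        'test_nn',  # hypothetical: skip the suite failing on A100 (pytorch/pytorch#52278)
    ],
}
```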

@sassy-crick (Collaborator)

> I was running my last test build on a node that has the latest CUDA driver (465.19.01), which seems to have an impact on the tests for PyTorch-1.7.1-fosscuda-2020b.eb?
> I'm not seeing those failing tests on a node that didn't get the CUDA driver update yet (still running 455.23.05)...
>
> Ever seen something like that @Flamefire? Perhaps the test is just unstable, and it was bad luck?
>
> edit: This also happens without the extra patches being added here BTW...

@boegel If it helps, this is what is currently installed on the AMD EPYC 7552 48-Core Processor machine I have access to:
NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2

@verdurin (Member)

@Flamefire this is the same node I mentioned recently. It has the following patches:

OpenBLAS-0.3.7_fix-build-on-arm-tsv110.patch
OpenBLAS-0.3.7_fix-missing-sync-on-power.patch
OpenBLAS-0.3.7_fix-possible-memory-leak-after-fork.patch
OpenBLAS-0.3.7_reinit-threads-after-fork.patch
OpenBLAS-0.3.8_fix-dscal-inline-asm.patch

@Flamefire (Contributor, Author)

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
taurusml20 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), Python 2.7.5
See https://gist.github.com/99818b322c357a18b81a8f787580c05a for a full test report.

@Flamefire (Contributor, Author)

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
taurusml20 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), Python 2.7.5
See https://gist.github.com/b61c71011808e661d7049267fca2a584 for a full test report.

@Flamefire (Contributor, Author)

Test report by @Flamefire
SUCCESS
Build succeeded for 16 out of 16 (2 easyconfigs in total)
taurusml13 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), Python 2.7.5
See https://gist.github.com/4cb3e4a1164017b0c2ab72b431a6bdeb for a full test report.

@branfosj (Member)

Test report by @branfosj
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
bask-pg0309u25a - Linux RHEL 8.3, x86_64, Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60GHz (icelake), Python 3.6.8
See https://gist.github.com/f8c92f2e5ea97526fba6ffde80e1201a for a full test report.

@Flamefire (Contributor, Author)

@branfosj

> what(): CUDA error: uncorrectable NVLink error detected during the execution

I've seen that for PyTorch 1.8.1 on our A100 GPUs but never on the older ones. Maybe that's related to the CUDA driver version? I still have to check whether I see this on our new partition with this EC too. It might be that our (EB) NCCL is faulty.
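
A minimal, hedged sketch (not part of this PR; the script name and setup are illustrative) for exercising NCCL over NVLink outside the PyTorch test suite, to check whether the error reproduces independently of the tests:

```python
# Hedged sketch, not from this PR: a standalone NCCL/NVLink sanity check,
# roughly what the failing distributed tests exercise. The script name
# (nccl_check.py) is illustrative. Launch one process per GPU with:
#   python -m torch.distributed.launch --nproc_per_node=<gpus> nccl_check.py
# which sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE and passes --local_rank.
import argparse

import torch
import torch.distributed as dist


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend="nccl")  # reads the env:// settings set by launch

    # A single all_reduce forces NCCL communication between the GPUs; an
    # "uncorrectable NVLink error" like the one above would surface here.
    t = torch.full((1024,), float(args.local_rank + 1), device="cuda")
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}: all_reduce OK, t[0] = {t[0].item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```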

@branfosj (Member)

> what(): CUDA error: uncorrectable NVLink error detected during the execution
>
> I've seen that for PyTorch 1.8.1 on our A100 GPUs but never on the older ones. Maybe that's related to the CUDA driver version? I still have to check whether I see this on our new partition with this EC too. It might be that our (EB) NCCL is faulty.

The failed test was also with A100s, NVIDIA driver 460.73.01.

I have this built fine on a system with a single P100 (NVIDIA driver 460.32.03).

PyTorch 1.8.1 failed on the A100s with the same problem that you have reported to PyTorch. That also failed with a self-built NCCL 2.9.8.

@Flamefire (Contributor, Author)

> That also failed with a self-built NCCL 2.9.8.

Also confirmed with PyTorch 1.8.1 fosscuda/2020b and its submodule NCCL (2.7.8).
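
For completeness, a quick way to check which NCCL a given PyTorch build actually uses (sketch, not from this PR); on these PyTorch versions torch.cuda.nccl.version() returns an int such as 2708 for NCCL 2.7.8:

```python
# Sketch, not from this PR: check which NCCL a given PyTorch build actually
# uses, e.g. to tell the bundled submodule NCCL apart from an external one.
import torch
import torch.cuda.nccl as nccl

print(torch.__version__, torch.version.cuda)
# On these PyTorch versions this returns an int, e.g. 2708 -> NCCL 2.7.8
# (newer PyTorch releases return a (major, minor, patch) tuple).
print("NCCL version:", nccl.version())
```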

@boegel modified the milestones: 4.4.0, release after 4.4.0 (May 27, 2021)
@boegel (Member) commented May 31, 2021

@boegelbot please test @ generoso
CORE_CNT=16

@boegelbot (Collaborator)

@boegel: Request for testing this PR well received on generoso

PR test command 'EB_PR=12753 EB_ARGS= /apps/slurm/default/bin/sbatch --job-name test_PR_12753 --ntasks="16" ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 17326

Test results coming soon (I hope)...

- notification for comment with ID 851550784 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@Flamefire (Contributor, Author)

Test report by @Flamefire
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
taurusml4 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), Python 2.7.5
See https://gist.github.com/7ff4ca773bf0621e4829253d878785dc for a full test report.

@boegel modified the milestones: release after 4.4.0, 4.4.0 (May 31, 2021)
@Flamefire (Contributor, Author)

Test report by @Flamefire
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
taurusa4 - Linux centos linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz (broadwell), Python 2.7.5
See https://gist.github.com/8ec1f1b47ced050a742842d085f1e5fe for a full test report.

@branfosj (Member)

Test report by @branfosj
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bear-pg0211u13a.bear.cluster - Linux RHEL 8.3, x86_64, Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz (cascadelake), Python 3.6.8
See https://gist.github.com/14cea78c30fb1f83200be1af0c1ab7d3 for a full test report.

@branfosj (Member)

Test report by @branfosj
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
bear-pg0212u15b.bear.cluster - Linux RHEL 8.3, x86_64, Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz (broadwell), Python 3.6.8
See https://gist.github.com/0af507f398d3918bdbe9ddd1b5ccc155 for a full test report.

@boegel (Member) commented May 31, 2021

Test report by @boegel
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
node3302.joltik.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), Python 3.6.8
See https://gist.github.com/e1a6c67d1f21a63b28ce877458461ba0 for a full test report.

@Flamefire (Contributor, Author)

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
taurusa7 - Linux centos linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz (broadwell), Python 2.7.5
See https://gist.github.com/8a4e86dfcedde1e7eec268336ded2c88 for a full test report.

@Flamefire (Contributor, Author)

The failed tests are bottleneck_test, which will be disabled by easybuilders/easybuild-easyblocks#2450, and a flaky test with fosscuda-2020b where a timeout is reached too early, i.e. all of them are non-critical.
The latter is annoying though, and might be related to the increased test times I'm seeing for 2020b over 2020a over 2019b, which I'm still investigating.

@boegel (Member) commented Jun 1, 2021

> The failed tests are bottleneck_test, which will be disabled by easybuilders/easybuild-easyblocks#2450, and a flaky test with fosscuda-2020b where a timeout is reached too early, i.e. all of them are non-critical.
> The latter is annoying though, and might be related to the increased test times I'm seeing for 2020b over 2020a over 2019b, which I'm still investigating.

OK, thanks for clarifying @Flamefire!

@boegel (Member) left a comment

lgtm

@boegel (Member) commented Jun 1, 2021

Going in, thanks @Flamefire!

@boegel merged commit c170815 into easybuilders:develop on Jun 1, 2021
@Flamefire deleted the 20210426111814_new_pr_PyTorch171 branch on June 1, 2021 at 09:11