
Add TransformerEngine to PT 2.0 training images #3315

Merged: 40 commits merged into aws:master on Sep 26, 2023

Conversation

@arjkesh (Contributor) commented Sep 7, 2023

GitHub Issue #, if available:

Note:

  • If merging this PR should also close the associated Issue, please also add that Issue # to the Linked Issues section on the right.

  • All PRs are checked weekly for staleness. This PR will be closed if not updated in 30 days.

Description

  • Add transformer engine and flash attention support to CU121 images (see the illustrative sketch after this list)
  • Add associated tests on heavy instance types
  • Add CUDNN (required dependency of transformer engine)
  • Add future test to match CUDNN versions in torch/dlc
  • Patch requirements in existing DLC
  • Add NCCL_ASYNC_ERROR_HANDLING=1 env
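For orientation (not part of this PR), here is a minimal sketch of what the added packages enable inside the container: running a Transformer Engine layer under FP8 autocast. It assumes an FP8-capable GPU (Hopper or newer), the torch and transformer-engine packages from the image, and arbitrary layer/batch sizes chosen only for illustration.

import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Arbitrary sizes for illustration; FP8 GEMMs want dimensions that are multiples of 16.
layer = te.Linear(768, 768, bias=True).cuda()
x = torch.randn(16, 768, device="cuda")

# Delayed-scaling FP8 recipe using the E4M3 format.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.E4M3)
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = layer(x)
print(out.shape)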

Tests run

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@aws-deep-learning-containers-ci bot added the build, ec2, pytorch, Size:S, and test labels on Sep 7, 2023
@roywei (Contributor) left a comment

Let's also update the env var NCCL_ASYNC_ERROR_HANDLING=1 per customer request; this will make sure PyTorch errors out properly during distributed training.
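As a hedged aside (not code from this PR): PyTorch's NCCL process group reads this variable at initialization, so it has to be in the environment before init_process_group is called. The Dockerfile ENV line in the diff below achieves that, and training code can also set it defensively:

import os
# Assumed usage: must be set before torch.distributed.init_process_group(backend="nccl")
# so that NCCL failures surface as errors instead of silent hangs.
os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")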

@arjkesh (Contributor, Author) commented Sep 19, 2023:

/rerun

@arjkesh marked this pull request as ready for review on September 26, 2023 01:15
@arjkesh requested a review from a team as a code owner on September 26, 2023 01:15
@roywei previously approved these changes on Sep 26, 2023
# Install flash attn and NVIDIA transformer engine
RUN MAX_JOBS=4 pip install flash-attn==2.0.4 --no-build-isolation
RUN pip install git+https://github.com/NVIDIA/TransformerEngine.git@release_v0.12
ENV NCCL_ASYNC_ERROR_HANDLING=1
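
As a quick, hedged sanity check (not part of the PR's test suite), both packages installed above should import cleanly inside the built image; flash_attn exposes a __version__ attribute and Transformer Engine's PyTorch bindings live under transformer_engine.pytorch:

import flash_attn
import transformer_engine.pytorch as te  # needs the image's CUDA toolchain available

print("flash-attn:", flash_attn.__version__)
print("transformer engine Linear available:", hasattr(te, "Linear"))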
Contributor: ENV NCCL_ASYNC_ERROR_HANDLING=1 is already defined on line 63 of this Dockerfile.

@arjkesh (Author): Ack, added because of a different review comment, will remove.

pytorch_training, ec2_connection, region, gpu_only, ec2_instance_type, pt21_and_above_only
):
"""
PT 2.1 reintroduces a dependency on CUDNN to support NVDA TransformerEngine. This test is to ensure that torch CUDNN matches system CUDNN in the container.
Contributor: there is no PT 2.1 yet

@arjkesh (Author): There is no PT 2.1 yet; this is an anticipatory test to ensure that the torch binaries are compiled with the same cuDNN version that exists in the container.

).stdout.split()[-1]

cudnn_from_torch = ec2_connection.run(
f"nvidia-docker exec --user root {container_name} python -c 'from torch.backends import cudnn; print(cudnn.version())'",
Contributor: this cudnn comes from pytorch and not from the installed OS package, right?

@arjkesh (Author): This cudnn represents the cuDNN version that torch is compiled with, not the DLC cuDNN version. Torch essentially links statically against cuDNN; while it doesn't appear to be a big issue if the compile-time and system versions differ slightly, this test is added for future safety so that the versions don't go out of sync.
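
A minimal sketch of the comparison described above, assuming the container exposes the cuDNN header at the conventional /usr/include/cudnn_version.h path (the actual DLC test drives this through nvidia-docker exec instead):

import re
import subprocess
import torch

# cuDNN version torch was compiled against, encoded as e.g. 8902 for 8.9.2.
torch_cudnn = torch.backends.cudnn.version()

# System cuDNN, read from the header installed by the OS package (assumed path).
header = subprocess.run(
    ["cat", "/usr/include/cudnn_version.h"], capture_output=True, text=True, check=True
).stdout
major, minor, patch = (
    int(re.search(rf"#define CUDNN_{field}\s+(\d+)", header).group(1))
    for field in ("MAJOR", "MINOR", "PATCHLEVEL")
)
system_cudnn = major * 1000 + minor * 100 + patch  # matches torch's encoding for cuDNN 8.x

assert torch_cudnn == system_cudnn, f"torch cuDNN {torch_cudnn} != system cuDNN {system_cudnn}"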

test/dlc_tests/ec2/test_transformerengine.py (comment thread resolved)
@arjkesh merged commit 5241309 into aws:master on Sep 26, 2023
1 check passed