
Diffusion Transformer Training Pipeline #10843

Merged 6 commits on Oct 13, 2024
Conversation

@zpx01 (Contributor) commented Oct 10, 2024

What does this PR do?

Implements end-to-end diffusion transformer (DiT) pretraining / fine-tuning.

Collection: diffusion

Changelog

  • Adds a DiT model implementation with cross-attention support.
  • Adds an EDM diffusion sampler for higher-order sampling (see the sketch after this list).
  • Adds a training script for training DiT models on text/image datasets.
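
For context on "higher-order sampling": the EDM sampler of Karras et al. (2022) integrates the probability-flow ODE with Heun's second-order method. Below is a minimal sketch of one such step, assuming a trained denoiser denoise(x, sigma); the names here are placeholders, not the API added by this PR.

    def edm_heun_step(denoise, x, sigma, sigma_next):
        # x is the current noisy sample (a torch.Tensor); sigma > sigma_next >= 0.
        # Slope of the probability-flow ODE at the current noise level.
        d = (x - denoise(x, sigma)) / sigma
        # First-order (Euler) proposal.
        x_euler = x + (sigma_next - sigma) * d
        if sigma_next == 0:
            # The final step down to sigma = 0 stays first-order.
            return x_euler
        # Heun correction: average the slopes at sigma and sigma_next.
        d_next = (x_euler - denoise(x_euler, sigma_next)) / sigma_next
        return x + (sigma_next - sigma) * 0.5 * (d + d_next)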

Usage

The README contains instructions on how to launch training.

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove the label and add it again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre-checks:

  • Make sure you have read and followed the Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex)
    • Reviewer: Does the PR have correct import guards for all optional libraries? (A sketch of such a guard follows this list.)
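
A minimal sketch of what such an import guard typically looks like (the names are illustrative, not NeMo's actual helpers):

    try:
        import apex  # optional dependency; may be absent

        HAVE_APEX = True
    except ImportError:
        HAVE_APEX = False

    def run_fused_op(x):
        # Fail early with a clear message instead of a NameError deep in the call stack.
        if not HAVE_APEX:
            raise ImportError("This feature requires Apex; install it or use a supported fallback.")
        ...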

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

# limitations under the License.

import math
from typing import Dict, Literal, Optional

import numpy as np
import torch
import torch.nn.functional as F
from diffusers.models.embeddings import TimestepEmbedding, get_3d_sincos_pos_embed
from einops import rearrange
from einops.layers.torch import Rearrange

Check notice: Code scanning / CodeQL: Unused import (Note)
Import of 'math' is not used.
Import of 'Dict' is not used.
Import of 'Literal' is not used.
Import of 'Optional' is not used.
Import of 'np' is not used.
Import of 'F' is not used.
Import of 'Rearrange' is not used.

nemo/collections/diffusion/models/dit/dit_layer_spec.py (alert dismissed)

import torch
import torch.nn as nn
import torch.nn.functional as F

Check notice: Code scanning / CodeQL: Unused import (Note)
Import of 'F' is not used.

import importlib
import warnings
from dataclasses import dataclass, field

Check notice: Code scanning / CodeQL: Unused import (Note)
Import of 'field' is not used.
import numpy as np
import torch
import torch.distributed
from einops import rearrange

Check notice: Code scanning / CodeQL: Unused import (Note)
Import of 'rearrange' is not used.

Comment on lines +165 to +167:

def training_step(
    self, data_batch: dict[str, torch.Tensor], iteration: int
) -> tuple[dict[str, torch.Tensor], torch.Tensor]:

Check notice: Code scanning / CodeQL: Returning tuples with varying lengths (Note)
EDMPipeline.training_step returns a tuple of size 2 and a tuple of size 3.
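
One conventional way to resolve this notice (not necessarily how this PR does; TrainStepOutput and extras are hypothetical names) is to return a fixed-shape structure:

    from typing import NamedTuple, Optional

    import torch

    class TrainStepOutput(NamedTuple):
        outputs: dict                   # e.g. dict[str, torch.Tensor]
        loss: torch.Tensor
        extras: Optional[dict] = None   # always present, may be None

    # Every return site then has the same length:
    #     return TrainStepOutput(output_batch, loss)
    #     return TrainStepOutput(output_batch, loss, extras)
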
@ethanhe42 (Member) commented:

Can you fix "Isort and Black Formatting / reformat_with_isort_and_black (pull_request_target)"?

@ko3n1g added and removed the Run CICD label on Oct 11, 2024
Signed-off-by: Zeeshan Patel <zeeshanp@berkeley.edu>

@zpx01 (Contributor, Author) commented Oct 11, 2024:

> Can you fix "Isort and Black Formatting / reformat_with_isort_and_black (pull_request_target)"?

@ethanhe42 this is fixed now; it should be ready to merge.

from einops import rearrange
from einops.layers.torch import Rearrange
from megatron.core import parallel_state
from megatron.core.models.common.embeddings.rotary_pos_embedding import get_pos_emb_on_this_cp_rank
from megatron.core.transformer.module import MegatronModule
from torch import nn

Check notice: Code scanning / CodeQL: Unused import (Note)
Import of 'get_pos_emb_on_this_cp_rank' is not used.
Import of 'nn' is not used.

nemo/collections/diffusion/models/dit/dit_model.py (alert dismissed)

@ethanhe42 (Member) commented:

Seems that one is still failing: "Code scanning results / CodeQL".

@ethanhe42 enabled auto-merge (squash) October 11, 2024 20:52
@ethanhe42 enabled auto-merge (squash) October 13, 2024 07:02
@ethanhe42 merged commit ce21ffb into NVIDIA:main Oct 13, 2024
165 of 169 checks passed