Bump Dockerfile.ci (2024-09-09) (#10423)
* [🤠]: Howdy folks, let's bump `Dockerfile.ci` to 8307fcd!

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* update TE import paths

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

* Update parallelisms.rst

fix sed typo.

Signed-off-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>

* fix for mcore dist opt refactor: move overlap_grad_reduce/overlap_param_gather to ddp config

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

* remove overlap_grad_reduce overlap_param_gather from autoconfig

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* subclass TransformerConfig because MegatronModule expects it to have an fp8 attribute

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* revert change; Use ModelParallelConfig & add fp8

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix: set NVTE_APPLY_QK_LAYER_SCALING=1

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Co-authored-by: pablo-garay <7166088+pablo-garay@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: Pablo Garay <palenq@gmail.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
6 people authored and monica-sekoyan committed Oct 11, 2024
1 parent 8670e28 commit 9e1a72b
Showing 25 changed files with 91 additions and 45 deletions.
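Most of the recipe diffs below share one functional change from the Megatron Core distributed-optimizer refactor: overlap_grad_reduce and overlap_param_gather move off the optimizer config and onto DistributedDataParallelConfig, which the recipes now pass to the strategy. Below is a minimal sketch of that pattern, assuming the strategy type is nl.MegatronStrategy as in these recipes; it is illustrative, not an excerpt from the diff.

import nemo_run as run
from megatron.core.distributed import DistributedDataParallelConfig
from nemo import lightning as nl

# Overlap of grad reduce / param gather is now requested on the DDP config (assumed usage).
ddp = run.Config(
    DistributedDataParallelConfig,
    check_for_nan_in_grad=True,
    grad_reduce_in_fp32=True,
    overlap_grad_reduce=True,
    overlap_param_gather=True,
)

strategy = run.Config(
    nl.MegatronStrategy,
    gradient_as_bucket_view=True,
    ddp=ddp,
)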
2 changes: 1 addition & 1 deletion Dockerfile.ci
@@ -39,7 +39,7 @@ RUN pip install nemo_run@git+https://github.com/NVIDIA/NeMo-Run.git@${NEMU_RUN_T
ARG TE_TAG=7d576ed25266a17a7b651f2c12e8498f67e0baea
ARG MODELOPT_VERSION=0.15.0

-ARG MCORE_TAG=01945b98d1ea3a2acb5e8301e181a328104f4856
+ARG MCORE_TAG=8307fcda5fff57ab0e77131b09bf37da997ec1f2

ARG APEX_TAG=810ffae374a2b9cb4b5c5e28eaeca7d7998fca0c
RUN \
2 changes: 1 addition & 1 deletion docs/source/features/parallelisms.rst
@@ -266,7 +266,7 @@ Implement Context Parallelism
NeMo Framework leverages functionality from both Megatron Core and Transformer Engine to implement CP efficiently. During forward propagation, each GPU handles a segment of the sequence and stores only the necessary Key and Value (KV) pairs. In the backward pass, these KV pairs are reassembled across GPUs using communication schemes such as all-gather and reduce-scatter, implemented as point-to-point communications in a ring topology. This method reduces the memory footprint significantly while maintaining computational efficiency.
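To make the ring exchange concrete, here is a toy sketch of the idea, not NeMo's or Transformer Engine's implementation: each rank keeps its sequence chunk's KV block and passes blocks around a ring with point-to-point sends and receives. The function name ring_exchange and the single-tensor KV block are illustrative assumptions, and an initialized torch.distributed process group is required.

import torch
import torch.distributed as dist

def ring_exchange(kv_block: torch.Tensor) -> list:
    # After world_size - 1 steps, every rank has seen every rank's KV block.
    rank, world = dist.get_rank(), dist.get_world_size()
    send_to, recv_from = (rank + 1) % world, (rank - 1) % world
    blocks = [kv_block]
    recv_buf = torch.empty_like(kv_block)
    for _ in range(world - 1):
        # The real implementation overlaps these P2P transfers with attention compute.
        reqs = [dist.isend(blocks[-1].contiguous(), send_to), dist.irecv(recv_buf, recv_from)]
        for req in reqs:
            req.wait()
        blocks.append(recv_buf.clone())
    return blocks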

Visit our source code for more insights into the implementation:
-- `Megatron Core wrappers for Transformer Engine <https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/transformer/custom_layers/transformer_engine.py>`_
+- `Megatron Core wrappers for Transformer Engine <https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/extensions/transformer_engine.py>`_
- `Transformer Engine attention modules <https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/attention.py>`_


8 changes: 8 additions & 0 deletions nemo/collections/llm/recipes/llama3_70b.py
@@ -18,6 +18,7 @@
import nemo_run as run
import pytorch_lightning as pl
import torch
+from megatron.core.distributed import DistributedDataParallelConfig
from pytorch_lightning.callbacks.callback import Callback

from nemo import lightning as nl
@@ -109,6 +110,13 @@ def trainer(
gradient_as_bucket_view=True,
ckpt_async_save=True,
ckpt_parallel_load=True,
+ddp=run.Config(
+DistributedDataParallelConfig,
+check_for_nan_in_grad=True,
+grad_reduce_in_fp32=True,
+overlap_grad_reduce=True,
+overlap_param_gather=True,
+),
)

trainer = run.Config(
8 changes: 8 additions & 0 deletions nemo/collections/llm/recipes/llama3_8b.py
@@ -18,6 +18,7 @@
import nemo_run as run
import pytorch_lightning as pl
import torch
+from megatron.core.distributed import DistributedDataParallelConfig
from pytorch_lightning.callbacks.callback import Callback

from nemo import lightning as nl
@@ -109,6 +110,13 @@ def trainer(
gradient_as_bucket_view=True,
ckpt_async_save=True,
ckpt_parallel_load=True,
+ddp=run.Config(
+DistributedDataParallelConfig,
+check_for_nan_in_grad=True,
+grad_reduce_in_fp32=True,
+overlap_grad_reduce=True,
+overlap_param_gather=True,
+),
)

trainer = run.Config(
8 changes: 8 additions & 0 deletions nemo/collections/llm/recipes/mistral.py
@@ -18,6 +18,7 @@
import nemo_run as run
import pytorch_lightning as pl
import torch
+from megatron.core.distributed import DistributedDataParallelConfig
from pytorch_lightning.callbacks.callback import Callback

from nemo import lightning as nl
@@ -105,6 +106,13 @@ def trainer(
ckpt_include_optimizer=True,
ckpt_async_save=True,
ckpt_parallel_load=True,
+ddp=run.Config(
+DistributedDataParallelConfig,
+check_for_nan_in_grad=True,
+grad_reduce_in_fp32=True,
+overlap_grad_reduce=True,
+overlap_param_gather=True,
+),
)

trainer = run.Config(
8 changes: 8 additions & 0 deletions nemo/collections/llm/recipes/mixtral_8x3b.py
@@ -18,6 +18,7 @@
import nemo_run as run
import pytorch_lightning as pl
import torch
+from megatron.core.distributed import DistributedDataParallelConfig
from pytorch_lightning.callbacks.callback import Callback

from nemo import lightning as nl
@@ -107,6 +108,13 @@ def trainer(
gradient_as_bucket_view=True,
ckpt_async_save=True,
ckpt_parallel_load=True,
+ddp=run.Config(
+DistributedDataParallelConfig,
+check_for_nan_in_grad=True,
+grad_reduce_in_fp32=True,
+overlap_grad_reduce=True,
+overlap_param_gather=True,
+),
)

trainer = run.Config(
2 changes: 2 additions & 0 deletions nemo/collections/llm/recipes/mixtral_8x7b.py
@@ -111,6 +111,8 @@ def trainer(
DistributedDataParallelConfig,
check_for_nan_in_grad=True,
grad_reduce_in_fp32=True,
+overlap_grad_reduce=True,
+overlap_param_gather=True,
),
)

4 changes: 2 additions & 2 deletions nemo/collections/llm/recipes/optim/adam.py
@@ -31,8 +31,8 @@ def distributed_fused_adam_with_cosine_annealing(max_lr: float = 1e-4) -> run.Co
adam_beta2=0.95,
adam_eps=1e-5,
use_distributed_optimizer=True,
-overlap_grad_reduce=True,
-overlap_param_gather=True,
+# overlap_grad_reduce=True,
+# overlap_param_gather=True,
clip_grad=1.0,
)
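Note that the two overlap flags are commented out above rather than re-homed here: after the Megatron Core refactor they belong to DistributedDataParallelConfig (see the recipe diffs earlier in this commit), not to the optimizer recipe. A hedged usage sketch of the updated helper:

from nemo.collections.llm.recipes.optim.adam import distributed_fused_adam_with_cosine_annealing

optim = distributed_fused_adam_with_cosine_annealing(max_lr=1e-4)
# If overlap is still wanted, set overlap_grad_reduce / overlap_param_gather on the
# DistributedDataParallelConfig passed to the strategy, not here (assumed workflow).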

@@ -80,8 +80,6 @@ def get_optim(self) -> Config[OptimizerConfig]:
"bf16": True,
"adam_beta1": 0.9,
"adam_beta2": 0.95,
"overlap_grad_reduce": True,
"overlap_param_gather": True,
"clip_grad": 1.0,
"adam_eps": 1e-5,
}
@@ -64,18 +64,18 @@
from megatron.core import parallel_state
from megatron.core.distributed import DistributedDataParallel as McoreDDP
from megatron.core.distributed import DistributedDataParallelConfig
-from megatron.core.fusions.fused_bias_dropout import get_bias_dropout_add
-from megatron.core.models.gpt import GPTModel as MCoreGPTModel
-from megatron.core.models.vision.clip_vit_model import CLIPViTModel
-from megatron.core.pipeline_parallel.schedules import get_forward_backward_func
-from megatron.core.transformer.attention import CrossAttention, CrossAttentionSubmodules
-from megatron.core.transformer.custom_layers.transformer_engine import (
+from megatron.core.extensions.transformer_engine import (
TEColumnParallelLinear,
TEDotProductAttention,
TELayerNormColumnParallelLinear,
TENorm,
TERowParallelLinear,
)
+from megatron.core.fusions.fused_bias_dropout import get_bias_dropout_add
+from megatron.core.models.gpt import GPTModel as MCoreGPTModel
+from megatron.core.models.vision.clip_vit_model import CLIPViTModel
+from megatron.core.pipeline_parallel.schedules import get_forward_backward_func
+from megatron.core.transformer.attention import CrossAttention, CrossAttentionSubmodules
from megatron.core.transformer.enums import AttnMaskType as MCoreAttnMaskType
from megatron.core.transformer.identity_op import IdentityOp
from megatron.core.transformer.mlp import MLP, MLPSubmodules
@@ -14,16 +14,16 @@


try:
-from megatron.core.fusions.fused_bias_dropout import get_bias_dropout_add
-from megatron.core.fusions.fused_layer_norm import FusedLayerNorm
-from megatron.core.tensor_parallel.layers import ColumnParallelLinear, RowParallelLinear
-from megatron.core.transformer.attention import SelfAttention, SelfAttentionSubmodules
-from megatron.core.transformer.custom_layers.transformer_engine import (
+from megatron.core.extensions.transformer_engine import (
TEColumnParallelLinear,
TEDotProductAttention,
TENorm,
TERowParallelLinear,
)
+from megatron.core.fusions.fused_bias_dropout import get_bias_dropout_add
+from megatron.core.fusions.fused_layer_norm import FusedLayerNorm
+from megatron.core.tensor_parallel.layers import ColumnParallelLinear, RowParallelLinear
+from megatron.core.transformer.attention import SelfAttention, SelfAttentionSubmodules
from megatron.core.transformer.dot_product_attention import DotProductAttention
from megatron.core.transformer.enums import AttnMaskType
from megatron.core.transformer.identity_op import IdentityOp
@@ -59,7 +59,11 @@
self_attn_bda=get_bias_dropout_add,
post_att_layernorm=TENorm,
mlp=ModuleSpec(
-module=MLP, submodules=MLPSubmodules(linear_fc1=TEColumnParallelLinear, linear_fc2=TERowParallelLinear,),
+module=MLP,
+submodules=MLPSubmodules(
+linear_fc1=TEColumnParallelLinear,
+linear_fc2=TERowParallelLinear,
+),
),
mlp_bda=get_bias_dropout_add,
post_mlp_layernorm=TENorm,
@@ -84,7 +88,11 @@
self_attn_bda=get_bias_dropout_add,
post_att_layernorm=FusedLayerNorm,
mlp=ModuleSpec(
-module=MLP, submodules=MLPSubmodules(linear_fc1=ColumnParallelLinear, linear_fc2=RowParallelLinear,),
+module=MLP,
+submodules=MLPSubmodules(
+linear_fc1=ColumnParallelLinear,
+linear_fc2=RowParallelLinear,
+),
),
mlp_bda=get_bias_dropout_add,
post_mlp_layernorm=FusedLayerNorm,
@@ -15,14 +15,14 @@
from nemo.collections.nlp.modules.common.megatron.utils import ApexGuardDefaults

try:
-from megatron.core.fusions.fused_bias_dropout import get_bias_dropout_add
-from megatron.core.transformer.attention import SelfAttention, SelfAttentionSubmodules
-from megatron.core.transformer.custom_layers.transformer_engine import (
+from megatron.core.extensions.transformer_engine import (
TEColumnParallelLinear,
TEDotProductAttention,
TENorm,
TERowParallelLinear,
)
+from megatron.core.fusions.fused_bias_dropout import get_bias_dropout_add
+from megatron.core.transformer.attention import SelfAttention, SelfAttentionSubmodules
from megatron.core.transformer.enums import AttnMaskType
from megatron.core.transformer.identity_op import IdentityOp
from megatron.core.transformer.mlp import MLP, MLPSubmodules
@@ -62,7 +62,11 @@ def get_falcon_layer_spec() -> ModuleSpec:
self_attn_bda=get_bias_dropout_add,
pre_mlp_layernorm=TENorm,
mlp=ModuleSpec(
-module=MLP, submodules=MLPSubmodules(linear_fc1=TEColumnParallelLinear, linear_fc2=TERowParallelLinear,),
+module=MLP,
+submodules=MLPSubmodules(
+linear_fc1=TEColumnParallelLinear,
+linear_fc2=TERowParallelLinear,
+),
),
mlp_bda=get_bias_dropout_add,
)
@@ -17,11 +17,11 @@

import torch
from megatron.core import parallel_state, tensor_parallel
+from megatron.core.extensions.transformer_engine import TENorm, TERowParallelLinear
from megatron.core.fusions.fused_softmax import FusedScaleMaskSoftmax
from megatron.core.packed_seq_params import PackedSeqParams
from megatron.core.tensor_parallel import ColumnParallelLinear
from megatron.core.transformer import MegatronModule, TransformerConfig
-from megatron.core.transformer.custom_layers.transformer_engine import TENorm, TERowParallelLinear
from megatron.core.transformer.enums import AttnMaskType
from megatron.core.transformer.utils import attention_mask_func
from megatron.core.utils import divide
@@ -12,10 +12,10 @@
# See the License for the specific language governing permissions and
# limitations under the License.

+from megatron.core.extensions.transformer_engine import TELayerNormColumnParallelLinear
from megatron.core.fusions.fused_bias_dropout import get_bias_dropout_add
from megatron.core.transformer import ModuleSpec, TransformerLayer, TransformerLayerSubmodules
from megatron.core.transformer.attention import SelfAttention, SelfAttentionSubmodules
-from megatron.core.transformer.custom_layers.transformer_engine import TELayerNormColumnParallelLinear
from megatron.core.transformer.enums import AttnMaskType
from megatron.core.transformer.mlp import MLP, MLPSubmodules

@@ -13,10 +13,10 @@
# limitations under the License.

try:
+from megatron.core.extensions.transformer_engine import TENorm
from megatron.core.fusions.fused_bias_dropout import get_bias_dropout_add
from megatron.core.tensor_parallel.layers import ColumnParallelLinear, RowParallelLinear
from megatron.core.transformer.attention import SelfAttention, SelfAttentionSubmodules
-from megatron.core.transformer.custom_layers.transformer_engine import TENorm
from megatron.core.transformer.dot_product_attention import DotProductAttention
from megatron.core.transformer.enums import AttnMaskType
from megatron.core.transformer.identity_op import IdentityOp
@@ -12,6 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from torch import Tensor, nn

from nemo.collections.nlp.models.language_modeling.megatron.griffin.griffin_layer_spec import (
griffin_mqa_layer_with_transformer_engine_spec,
griffin_recurrent_layer_with_transformer_engine_spec,
@@ -20,9 +21,9 @@

try:
from megatron.core import parallel_state, tensor_parallel
+from megatron.core.extensions.transformer_engine import TENorm, te_checkpoint
from megatron.core.models.common.language_module.language_module import LanguageModule
from megatron.core.packed_seq_params import PackedSeqParams
-from megatron.core.transformer.custom_layers.transformer_engine import TENorm, te_checkpoint
from megatron.core.transformer.spec_utils import build_module
from megatron.core.transformer.transformer_config import TransformerConfig

@@ -12,13 +12,13 @@
# See the License for the specific language governing permissions and
# limitations under the License.

-from megatron.core.fusions.fused_bias_dropout import get_bias_dropout_add
-from megatron.core.transformer.attention import SelfAttention, SelfAttentionSubmodules
-from megatron.core.transformer.custom_layers.transformer_engine import (
+from megatron.core.extensions.transformer_engine import (
TEDotProductAttention,
TELayerNormColumnParallelLinear,
TERowParallelLinear,
)
+from megatron.core.fusions.fused_bias_dropout import get_bias_dropout_add
+from megatron.core.transformer.attention import SelfAttention, SelfAttentionSubmodules
from megatron.core.transformer.enums import AttnMaskType
from megatron.core.transformer.identity_op import IdentityOp
from megatron.core.transformer.mlp import MLP, MLPSubmodules
@@ -53,7 +53,10 @@
self_attn_bda=get_bias_dropout_add,
mlp=ModuleSpec(
module=MLP,
-submodules=MLPSubmodules(linear_fc1=TELayerNormColumnParallelLinear, linear_fc2=TERowParallelLinear,),
+submodules=MLPSubmodules(
+linear_fc1=TELayerNormColumnParallelLinear,
+linear_fc2=TERowParallelLinear,
+),
),
mlp_bda=get_bias_dropout_add,
),
@@ -74,7 +77,10 @@
recurrent_bda=get_bias_dropout_add,
mlp=ModuleSpec(
module=MLP,
-submodules=MLPSubmodules(linear_fc1=TELayerNormColumnParallelLinear, linear_fc2=TERowParallelLinear,),
+submodules=MLPSubmodules(
+linear_fc1=TELayerNormColumnParallelLinear,
+linear_fc2=TERowParallelLinear,
+),
),
mlp_bda=get_bias_dropout_add,
),
5 changes: 1 addition & 4 deletions nemo/collections/nlp/modules/common/hyena/hyena.py
@@ -23,10 +23,7 @@
import torch
import torch.nn as nn
from einops import rearrange
-from megatron.core.transformer.custom_layers.transformer_engine import (
-TELayerNormColumnParallelLinear,
-TERowParallelLinear,
-)
+from megatron.core.extensions.transformer_engine import TELayerNormColumnParallelLinear, TERowParallelLinear
from megatron.core.transformer.identity_op import IdentityFuncOp, IdentityOp
from megatron.core.transformer.spec_utils import ModuleSpec, build_module
from megatron.core.transformer.transformer_config import TransformerConfig
5 changes: 1 addition & 4 deletions nemo/collections/nlp/modules/common/hyena/hyena_spec.py
@@ -1,9 +1,6 @@
import torch.nn as nn
+from megatron.core.extensions.transformer_engine import TELayerNormColumnParallelLinear, TERowParallelLinear
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_with_transformer_engine_spec
-from megatron.core.transformer.custom_layers.transformer_engine import (
-TELayerNormColumnParallelLinear,
-TERowParallelLinear,
-)
from megatron.core.transformer.spec_utils import ModuleSpec

from nemo.collections.nlp.modules.common.hyena.hyena import (
@@ -15,14 +15,14 @@
import torch
import torch.nn.functional as F
from megatron.core import InferenceParams
+from megatron.core.extensions.transformer_engine import SplitAlongDim
from megatron.core.fusions.fused_bias_geglu import bias_geglu_impl
from megatron.core.fusions.fused_bias_gelu import bias_gelu_impl
from megatron.core.fusions.fused_bias_swiglu import bias_swiglu_impl
from megatron.core.models.common.embeddings.language_model_embedding import LanguageModelEmbedding
from megatron.core.models.common.embeddings.rotary_pos_embedding import apply_rotary_pos_emb
from megatron.core.packed_seq_params import PackedSeqParams
from megatron.core.transformer.attention import SelfAttention
-from megatron.core.transformer.custom_layers.transformer_engine import SplitAlongDim
from megatron.core.transformer.mlp import MLP
from megatron.core.transformer.moe.experts import SequentialMLP
from megatron.core.transformer.transformer_block import TransformerBlock
2 changes: 1 addition & 1 deletion nemo/lightning/_strategy_lib.py
@@ -164,7 +164,7 @@ def megatron_lazy_init_context(config) -> Generator[None, None, None]:
def monkey_patched(c):
return {"device": "meta"}

-from megatron.core.transformer.custom_layers import transformer_engine as _te
+from megatron.core.extensions import transformer_engine as _te

original = _te._get_extra_te_kwargs # noqa: SLF001
_te._get_extra_te_kwargs = monkey_patched # noqa: SLF001
2 changes: 1 addition & 1 deletion nemo/lightning/fabric/strategies.py
@@ -302,7 +302,7 @@ def megatron_context(self) -> Generator[None, None, None]:
def monkey_patched(config):
return {"device": "meta"}

-from megatron.core.transformer.custom_layers import transformer_engine as _te
+from megatron.core.extensions import transformer_engine as _te

original = _te._get_extra_te_kwargs # noqa: SLF001
_te._get_extra_te_kwargs = monkey_patched # noqa: SLF001
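Since both monkey-patched modules (and the imports updated throughout this commit) now target megatron.core.extensions.transformer_engine, code that must run against both older and newer Megatron-LM checkouts could guard the import. This shim is a suggestion, not part of the commit:

try:
    # New location after the Megatron Core refactor.
    from megatron.core.extensions import transformer_engine as _te
except ImportError:
    # Pre-refactor location.
    from megatron.core.transformer.custom_layers import transformer_engine as _te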