Commit
Merge branch 'main' into rnnt_cuda_graphs_default
Showing 26 changed files with 513 additions and 617 deletions.
@@ -0,0 +1,32 @@
Converting from Megatron-LM
===========================

NVIDIA NeMo and NVIDIA Megatron-LM share many underlying technologies. This document provides guidance for migrating your project from Megatron-LM to NVIDIA NeMo.

Converting Checkpoints
----------------------

You can convert GPT-style model checkpoints trained with Megatron-LM into a NeMo-compatible format using the provided example script:

.. code-block:: bash

    python <NeMo_ROOT_FOLDER>/examples/nlp/language_modeling/megatron_lm_ckpt_to_nemo.py \
        --checkpoint_folder <path_to_PTL_checkpoints_folder> \
        --checkpoint_name megatron_gpt--val_loss=99.99-step={steps}-consumed_samples={consumed}.0 \
        --nemo_file_path <path_to_output_nemo_file> \
        --model_type <megatron_model_type> \
        --tensor_model_parallel_size <tensor_model_parallel_size> \
        --pipeline_model_parallel_size <pipeline_model_parallel_size> \
        --gpus_per_node <gpus_per_node>

Resuming Training
-----------------

To resume training from a converted Megatron-LM checkpoint, set the training parameters so that the learning rate schedule matches the original run. Use the following setting for the ``trainer.max_steps`` parameter in your NeMo training configuration:

.. code-block:: none

    trainer.max_steps=round(lr-warmup-fraction * lr-decay-iters + lr-decay-iters)

With this setting, the learning rate scheduler in NeMo continues from where it left off in Megatron-LM. The ``lr-warmup-fraction`` and ``lr-decay-iters`` values are the arguments used in the original Megatron-LM training setup.

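
The ``trainer.max_steps`` formula can be evaluated ahead of time with a few lines of Python. This is a minimal sketch: the helper name ``nemo_max_steps`` and the example values are hypothetical, and only the arithmetic comes from the formula given in this document.

```python
def nemo_max_steps(lr_warmup_fraction: float, lr_decay_iters: int) -> int:
    """Evaluate round(lr-warmup-fraction * lr-decay-iters + lr-decay-iters)."""
    if not 0.0 <= lr_warmup_fraction <= 1.0:
        raise ValueError("lr_warmup_fraction is expected to lie in [0, 1]")
    if lr_decay_iters <= 0:
        raise ValueError("lr_decay_iters must be positive")
    return round(lr_warmup_fraction * lr_decay_iters + lr_decay_iters)


# Hypothetical values; substitute the arguments of your Megatron-LM run.
max_steps = nemo_max_steps(lr_warmup_fraction=0.01, lr_decay_iters=300_000)
print(f"trainer.max_steps={max_steps}")  # prints trainer.max_steps=303000
```

The printed string can be passed as an override to the NeMo training configuration.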
@@ -0,0 +1,22 @@
Community Checkpoint Converter
==============================

We provide easy-to-use tools for converting community checkpoints into the NeMo format. Converted checkpoints can be used for resuming training, Supervised Fine-Tuning (SFT), Parameter-Efficient Fine-Tuning (PEFT), and deployment.

The following guides are available for end users and developers:

- **User Guide**: Detailed steps for converting community model checkpoints for further training or deployment within NeMo. See the :doc:`user_guide`.

- **Developer Guide**: Instructions for implementing converters for community model checkpoints, allowing for broader compatibility and integration within the NeMo ecosystem. See the :doc:`dev_guide`.

- **Megatron-LM Checkpoint Conversion**: NVIDIA NeMo and NVIDIA Megatron-LM share several foundational technologies. You can convert GPT-style model checkpoints trained with Megatron-LM into the NeMo Framework using our scripts. See :doc:`convert_mlm`.

Access the guides directly through the links below:

.. toctree::
   :maxdepth: 1
   :caption: Conversion Guides

   user_guide
   dev_guide
   convert_mlm
@@ -0,0 +1,70 @@
================
NeMo Collections
================

Documentation for the individual collections

.. toctree::
   :maxdepth: 1
   :caption: Large Language Models (LLMs)
   :name: Large Language Models
   :titlesonly:

   nlp/nemo_megatron/intro
   nlp/models
   nlp/machine_translation/machine_translation
   nlp/megatron_onnx_export
   nlp/quantization
   nlp/api


.. toctree::
   :maxdepth: 1
   :caption: Speech AI
   :name: Speech AI
   :titlesonly:

   asr/intro
   asr/speech_classification/intro
   asr/speaker_recognition/intro
   asr/speaker_diarization/intro
   asr/ssl/intro
   asr/speech_intent_slot/intro


.. toctree::
   :maxdepth: 1
   :caption: Multimodal Models (MMs)
   :name: Multimodal
   :titlesonly:

   multimodal/mllm/intro
   multimodal/vlm/intro
   multimodal/text2img/intro
   multimodal/nerf/intro
   multimodal/api


.. toctree::
   :maxdepth: 1
   :caption: Text To Speech (TTS)
   :name: Text To Speech
   :titlesonly:

   tts/intro

.. toctree::
   :maxdepth: 1
   :caption: Vision (CV)
   :name: vision
   :titlesonly:

   vision/intro

.. toctree::
   :maxdepth: 1
   :caption: Common
   :name: Common
   :titlesonly:

   common/intro
@@ -0,0 +1,48 @@
Memory Optimizations
====================

Parallelism
-----------
Refer to :doc:`Parallelism <./parallelism>`.

Flash Attention
---------------

Overview
^^^^^^^^

Flash Attention is a method designed to improve the efficiency of Transformer models, which are widely used in applications such as Natural Language Processing (NLP). Standard Transformers are slow and memory-hungry on long sequences because self-attention has quadratic time and memory complexity in the sequence length. FlashAttention is an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads and writes between the GPU's high bandwidth memory (HBM) and on-chip SRAM. As a result, it has lower IO complexity than standard attention mechanisms.

Turn Flash Attention On and Off
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the NeMo Framework, Flash Attention is supported through the Transformer Engine, which includes Flash Attention 2. Flash Attention is enabled by default, but the Transformer Engine may switch to a different kernel if the tensor dimensions are not optimal for Flash Attention. To disable Flash Attention completely, set the environment variable ``NVTE_FLASH_ATTN=0``.

For more details on the supported dot-product attention backends, refer to the Transformer Engine source code available at `Transformer Engine's Attention Mechanism <https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/attention.py>`_.
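
The environment variable can also be set from Python before training starts. The following is a minimal sketch, not part of NeMo: the helper ``set_flash_attention`` is hypothetical, and it assumes the variable must be in place before the Transformer Engine is imported or initialized in the same process.

```python
import os


def set_flash_attention(enabled: bool) -> None:
    """Toggle Flash Attention through the NVTE_FLASH_ATTN environment variable."""
    if enabled:
        # Removing the variable restores the default, which is "enabled".
        os.environ.pop("NVTE_FLASH_ATTN", None)
    else:
        os.environ["NVTE_FLASH_ATTN"] = "0"


# Assumption: call this before importing Transformer Engine or NeMo.
set_flash_attention(False)
print(os.environ.get("NVTE_FLASH_ATTN"))
```

Child processes started afterwards, for example distributed training workers, inherit the setting.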

.. bibliography:: ./nlp_all.bib
    :style: plain
    :labelprefix: nlp-megatron
    :keyprefix: nlp-megatron-

Activation Recomputation
------------------------

Overview
^^^^^^^^

Full Activation Recomputation
"""""""""""""""""""""""""""""
This method recomputes all intermediate activations during the backward pass instead of storing them during the forward pass. It gives the largest memory savings at the cost of the most computational overhead, because every activation is recomputed when it is needed.

Partial Activation Recomputation
""""""""""""""""""""""""""""""""
This method recomputes the activations of only a subset of layers during the backward pass. It is a trade-off between full recomputation and no recomputation, balancing memory savings against computational cost.

Selective Activation Recomputation
""""""""""""""""""""""""""""""""""
This method significantly reduces the memory footprint of activations through selective activation checkpointing: only the crucial activations are stored, and the others are recomputed as needed. It is particularly useful in large models, where it reduces memory usage while keeping the computational cost under control.

For more details, refer to `Reducing Activation Recomputation in Large Transformer Models <https://arxiv.org/abs/2205.05198>`_.
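
The trade-off between stored and recomputed activations can be illustrated with a toy example. The sketch below is plain Python and is not how NeMo implements recomputation: the scalar "layers", the ``keep_every`` parameter, and both functions are hypothetical. With ``keep_every=1`` every layer input is stored (no recomputation). With ``keep_every`` equal to the number of layers, only the first input is stored and the rest is recomputed, which resembles full recomputation. Values in between store some inputs and recompute the others.

```python
import math

# Each toy "layer" is a pair: (function, derivative of the function).
LAYERS = [
    (math.sin, math.cos),
    (lambda x: x * x, lambda x: 2.0 * x),
    (math.exp, math.exp),
    (math.tanh, lambda x: 1.0 - math.tanh(x) ** 2),
]


def forward(x, layers, keep_every):
    """Run the chain, storing only the input of every `keep_every`-th layer."""
    saved = {}
    for i, (f, _) in enumerate(layers):
        if i % keep_every == 0:
            saved[i] = x
        x = f(x)
    return x, saved


def backward(layers, saved, keep_every):
    """Return d(output)/d(input) and the number of recomputed layer calls."""
    grad, recomputed = 1.0, 0
    for start in sorted(saved, reverse=True):
        segment = layers[start:start + keep_every]
        # Rebuild the layer inputs that were not stored in the forward pass.
        inputs = [saved[start]]
        for f, _ in segment[:-1]:
            inputs.append(f(inputs[-1]))
            recomputed += 1
        # Chain rule, from the last layer of the segment back to the first.
        for (_, df), x_in in zip(reversed(segment), reversed(inputs)):
            grad *= df(x_in)
    return grad, recomputed


for keep_every in (1, 2, len(LAYERS)):
    _, saved = forward(0.5, LAYERS, keep_every)
    grad, recomputed = backward(LAYERS, saved, keep_every)
    print(f"stored={len(saved)} recomputed={recomputed} grad={grad:.6f}")
```

All three settings produce the same gradient. Only the number of stored values and the number of extra forward calls change.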

.. bibliography:: ./nlp_all.bib
    :style: plain
    :labelprefix: nlp-megatron
    :keyprefix: nlp-megatron-
@@ -0,0 +1,6 @@
.. _mix_precision:

Mixed Precision Training
------------------------

Mixed precision training improves computational efficiency by performing most operations in half-precision (FP16 or BF16) or FP8 formats, while keeping a minimal amount of data in single precision (FP32) to preserve critical information in key areas of the network. NeMo supports FP16, BF16, and FP8 (via Transformer Engine) across most models. Further details will be provided shortly.
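
One reason to keep some data in single precision is that small updates can vanish when they are accumulated in half precision. The following toy sketch uses only the Python standard library to round values to FP16 and FP32. It is an illustration under assumed values (a weight of 1.0 and an update of 1e-4) and does not reflect how NeMo or Transformer Engine implement mixed precision.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


update = 1e-4  # hypothetical per-step weight update
steps = 100

fp16_weight = to_fp16(1.0)
fp32_master = to_fp32(1.0)
for _ in range(steps):
    # Accumulating in FP16: the update is below the FP16 resolution near 1.0.
    fp16_weight = to_fp16(fp16_weight + to_fp16(update))
    # Accumulating in an FP32 copy keeps the contribution of every step.
    fp32_master = to_fp32(fp32_master + to_fp32(update))

print(f"FP16 weight after {steps} steps: {fp16_weight}")
print(f"FP32 weight after {steps} steps: {fp32_master:.4f}")
```

In this example the FP16 weight stays at 1.0, while the FP32 copy reaches approximately 1.01.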