Use model-cast-to-bfloat16 rather than AMP-to-bfloat16 for inference. #9198

galv · 2024-05-14T18:15:06Z

What does this PR do ?

I demonstrate, using transcribe_speech.py, that simply casting the entire model to bfloat16 gives about 15% higher throughput (RTFx) than using automatic mixed precision. The reasons why are discussed in #9086.

There are a few small modifications to our conformer encoder and multi head attention implementations to make this work. Basically, torch.float32 was unintentionally used at a few points.

I also disable updates to the batch norm statistics in this PR during inference. See the relevant comment in module.py.

Note that casting the preprocessor's input waveform to bfloat16 causes a serious accuracy degradation. I provide a warning about this. It is better to just run the preprocessor in float32, and then cast its output to bfloat16 (which is what I do).
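To get a feel for why the raw waveform is a poor fit for bfloat16, it may help to count how many distinct sample values survive the cast. The sketch below is not taken from the PR: it emulates the float32-to-bfloat16 rounding in pure Python, and both the helper name `to_bfloat16` and the choice of 16-bit PCM input are assumptions made for illustration.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Emulate a float32 -> bfloat16 cast (round to nearest even); return as float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps only the upper 16 bits of the float32 bit pattern,
    # so round the lower 16 bits away before masking them off.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Assumption: 16-bit PCM audio normalized to [-1, 1), one value per PCM level.
samples = [i / 32768 for i in range(-32768, 32768)]
distinct_bf16 = len({to_bfloat16(s) for s in samples})

print(f"distinct float32 levels:  {len(set(samples))}")
print(f"distinct bfloat16 levels: {distinct_bf16}")
```

With only 8 bits of significand, most neighbouring PCM levels collapse onto the same bfloat16 value, which is consistent with keeping the preprocessor in float32 and casting only its output.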

I have verified that this works with Parakeet CTC 1.1B and Parakeet RNN-T 1.1B. I will upload a table showing WER and RTFx throughput when running transcribe_speech.py, using the new casting method and the old AMP method. RTFx improves by about 15%, and WER stays about the same.

This image demonstrates (1) the increase in RTFx across a range of datasets and (2) the fact that WER does not degrade.

Collection: ASR

Usage

See transcribe_speech.py. Note that several parts of the code use AMP rather than simply casting the model.

Additional Information

Related to Draft: Fix the "cast ping pong" problem when we run AMP inference. #9086

examples/asr/transcribe_speech.py

nithinraok · 2024-05-14T18:24:02Z

nemo/collections/asr/parts/utils/transcribe_utils.py

+                    sound = sf.SoundFile(item["audio_filepath"])
+                    duration = sound.frames / sound.samplerate
+                    item["duration"] = duration
+                print(json.dumps(item), file=durations_f, flush=True)


why print is required here?

It's not. I copied it from: https://github.com/NVIDIA/NeMo/pull/9198/files/9a928206ec297f0b2652f4c382883c3bb29fd8a3#diff-f7b768bed0ebc1145787134787f17cc8dfb0defbf8758c8cb759b96e39fbbec5R361

In fact, there is a subtle bug with using print(). Flushing at a new line is not guaranteed. So the code I linked to actually chops off the last few lines of a manifest file. That was one of the reasons why I took a while to get this PR up. FYI @pzelasko , since I think you worked on pre-sorting before.

I will change from using print() to simply do json.write or json.save or w/e as an update to this PR.

https://docs.python.org/3/library/functions.html#print

I think I raised a concern about this print with @pzelasko during review but didn't think much of it. Let's just use json write or file.write as normal.

Indeed my bad; I forgot to add f.flush() at the end of line writing. f.write() would have the same issue. print is just fine IMO but if you want to change it I don't mind.
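One way to avoid the truncated-manifest problem discussed in this thread is to write through a context manager, so the file is flushed and closed before anyone reads it. This is a minimal sketch rather than the code in the PR; the function name `write_manifest` is hypothetical.

```python
import json
from typing import Iterable


def write_manifest(items: Iterable[dict], path: str) -> None:
    """Write one JSON object per line (NeMo-style manifest)."""
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item) + "\n")
    # Leaving the `with` block closes the file, which flushes any
    # buffered lines, so the last entries are not lost.
```

Whether the lines are produced with print() or f.write() matters less than making sure the file is flushed or closed before it is consumed.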

nithinraok · 2024-05-14T18:25:55Z

nemo/collections/asr/parts/utils/transcribe_utils.py

+                if item.get("duration") is None:
+                    sound = sf.SoundFile(item["audio_filepath"])
+                    duration = sound.frames / sound.samplerate
+                    item["duration"] = duration


soundfile reads full file when duration is not mentioned, is it chosen here for performance?

We need to get the duration of the audio file if we want to sort by duration. Sorting by duration can speed you up by a factor of two or so, IIRC. (I have the real data if you want to see it.) I'm not sure if I want to keep the change. It turns out that simply loading these audio files serially just to get their duration is slower than actually transcribing the data, annoyingly...

The duration fields are all "null" for the dataset I used here: https://bc.ngc.nvidia.com/datasets/1618966

So I threw this shim code in.

It is probably better to just update the duration field and make a new version of that dataset.

And throw an exception when someone wants to sort by duration but there is no duration field.

I like the second idea better. Rather not read a full file to get duration. Error out if sorting is necessary but duration isn't provided in any file.

Yeah, the reason I asked is that, unfortunately, reading and writing is where most of the time is spent.

It's more complicated than that: if the format has a header, such as WAV or FLAC, soundfile will only read the header unless you call .read() on that object. But if it's something like MP3, it may read the whole thing.
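The "error out instead of reading audio" option preferred above could be sketched as follows. The function name `sort_by_duration` and the `descending` flag are hypothetical and only illustrate the idea; this is not the code in transcribe_utils.py.

```python
from typing import Iterable, List


def sort_by_duration(items: Iterable[dict], descending: bool = True) -> List[dict]:
    """Sort manifest entries by their "duration" field.

    Raises ValueError instead of opening audio files when a duration is missing.
    """
    items = list(items)
    missing = [it.get("audio_filepath", "<unknown>") for it in items if it.get("duration") is None]
    if missing:
        raise ValueError(
            f"Cannot sort by duration: {len(missing)} manifest entries have no "
            f"'duration' field (first: {missing[0]})."
        )
    return sorted(items, key=lambda it: it["duration"], reverse=descending)
```

Failing early keeps the transcription script from silently spending more time probing audio files than transcribing them.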

nemo/core/classes/module.py

nemo/collections/asr/parts/submodules/multi_head_attention.py

galv · 2024-05-15T00:09:45Z

FYI, the PR description has been updated to describe the benefit from using this mode of execution instead of AMP

examples/asr/transcribe_speech.py

nemo/collections/asr/modules/audio_preprocessing.py

titu1994 · 2024-05-15T00:29:11Z

nemo/collections/asr/parts/submodules/conformer_modules.py

@@ -348,7 +348,7 @@ def forward(self, x, pad_mask=None, cache=None):
            x = self.pointwise_activation(x)

        if pad_mask is not None:
-            x = x.float().masked_fill(pad_mask.unsqueeze(1), 0.0)
+            x = x.masked_fill(pad_mask.unsqueeze(1), 0.0)


Hmm can we cast to appropriate dtype explicitly. Vahid added this for a particular reason which I forget

@VahidooX can you comment on why you added this cast? It is immediately followed by self.depthwise_conv(x, cache=cache), which will end up casting the input x back down to fp16 when you run in automatic mixed precision. The only reason I can think of for why you might do this is if your cache is in float32, and adding a float16 value to the float32 cache caused a problem.

Haven't heard back. Dropping this one.

My bad, I wasn't notified of the ping. I am extremely wary of modifying any part of the MHA or RelPosMHA operations. I'd seriously like them to continue operating at fp32. Vahid worked a lot on this module; he has these explicit casts for a reason, because he was unable to get stable training at all on fp16 and bf16 without them.

The goal of this PR was to affect only inference. This affects training, so I'd request reverting any changes to any file concerning MHA or RelPosMHA.

A middle ground is the following: check self.training. If true, use the explicit fp32 casts here and below; otherwise perform no cast (and we explicitly cast to fp16/bf16 anyway, so it won't make a difference in inference).

I sent @galv the details on this casting. It is not used for train or inference in nemo. @borisfom added that, I guess to make it work with TRT or ONNX conversions. Let's check it out with Boris before dropping that.
Here is the PR for this specific one:
https://github.com/NVIDIA/NeMo/pull/3787/files

Sure if that works without the cast, do remove it - both ONNX exporter and TRT went very far since then.
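The middle ground proposed in this thread (fp32 casts only while training) could look roughly like the following. This is a sketch, not the code merged in the PR: the module name `MaskedSoftmax` and its signature are hypothetical, and it assumes PyTorch is installed.

```python
import torch
import torch.nn as nn


class MaskedSoftmax(nn.Module):
    """Hypothetical module: softmax in fp32 during training, input dtype otherwise."""

    def forward(self, scores: torch.Tensor, pad_mask: torch.Tensor) -> torch.Tensor:
        if self.training:
            # Safety precaution for training stability: compute in float32.
            probs = torch.softmax(scores.float(), dim=-1)
        else:
            # Inference: stay in the dtype the model was cast to (e.g. bfloat16),
            # so no up-cast/down-cast round trip is introduced.
            probs = torch.softmax(scores, dim=-1)
        # Zero out padded positions; masked_fill preserves the tensor's dtype.
        return probs.masked_fill(pad_mask, 0.0)
```

In eval mode with a bfloat16 input the output stays bfloat16, while in train mode the same call returns float32.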

titu1994 · 2024-05-15T00:30:02Z

nemo/collections/asr/parts/submodules/multi_head_attention.py

        """Reset and extend the positional encodings if needed."""
        needed_size = 2 * length - 1
        if hasattr(self, 'pe') and self.pe.size(1) >= needed_size:
            return
        # positions would be from negative numbers to positive
        # positive positions would be used for left positions and negative for right positions
+        # fix this


I talk to myself in comments often, sorry.

nemo/collections/asr/parts/submodules/multi_head_attention.py

nemo/core/classes/module.py

examples/asr/transcribe_speech.py

nemo/collections/asr/parts/utils/transcribe_utils.py

examples/asr/transcribe_speech.py

galv · 2024-05-17T20:25:48Z

This is ready for another round of review. Unfortunately the black + isort change made a lot of noise. But basically I removed a lot of cruft. This PR basically does only three things now:

Modify transcribe_speech.py to run using casting to float16/bfloat16 instead of AMP
Modify transcribe_speech.py to output RTFx scores, using the calculate_rtfx=True config. It is set to false by default (but I wouldn't be opposed to setting it to True by default later for the sake of educating users about their throughput results).
Make small changes to the positional encoding buffer and a call to masked_fill() to make sure that data is not accidentally cast up to float32.
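For readers unfamiliar with the metric, here is a minimal sketch of an RTFx calculation. It assumes RTFx is defined as total audio duration divided by wall-clock processing time (the inverse real-time factor); the function name `compute_rtfx` is hypothetical and this is not the implementation in transcribe_speech.py.

```python
from typing import Iterable


def compute_rtfx(durations_s: Iterable[float], elapsed_s: float) -> float:
    """Return seconds of audio processed per second of wall-clock time."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    total_audio_s = sum(durations_s)
    return total_audio_s / elapsed_s


# Example: 30 s of audio transcribed in 0.5 s of wall-clock time.
print(compute_rtfx([10.0, 20.0], 0.5))
```

Higher is better: under this definition, a 15% RTFx improvement means 15% more audio transcribed in the same wall-clock time.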

examples/asr/transcribe_speech.py

nithinraok

Minor comments, LGTM otherwise

nemo/collections/asr/parts/utils/transcribe_utils.py

examples/asr/transcribe_speech.py

nithinraok

Revert the duration-required field, and also make sure to compute RTF when duration is provided. It's not mandatory to provide duration.

All issues covered.

galv · 2024-05-29T21:12:40Z

@nithinraok @pzelasko @titu1994 I think this change is good to go at this point.

nithinraok

LGTM!

titu1994

Overall everything looks good, apart from changes to MHA and RelPosMHA. Please see comments.

titu1994 · 2024-05-31T00:44:55Z

nemo/collections/asr/parts/submodules/multi_head_attention.py

        )

        global_attn_scores = global_attn_scores.view(batch_size * self.h, max_num_global_attn_indices, seq_len)

        # compute global attn probs
-        global_attn_probs_float = nn.functional.softmax(global_attn_scores, dim=-1, dtype=torch.float32)
+        global_attn_probs_float = nn.functional.softmax(global_attn_scores, dim=-1)


pzelasko · 2024-06-04T20:19:54Z

Looks good to me!

Fix the "cast ping pong" problem when we run AMP inference.

This has been tested only for Parakeet-CTC-1.1B right now. This problem certainly exists elsewhere.

Automatic mixed precision and inference do not play well together. First, automatic mixed precision was created back when neural networks were much simpler. In particular, they did not have softmax and layer norm as frequent operations. In the era of transformers, softmax and layer norm are very common. AMP will unconditionally output fp32 outputs from these operations, even if their inputs are fp16. See here: https://pytorch.org/docs/stable/amp.html#cuda-ops-that-can-autocast-to-float32

This is no longer necessary, now that layer norm does accumulation in fp32 in pytorch, even if the input is fp16: pytorch/pytorch#66707

Do inference by casting model to bfloat16, not by using AMP.
Do feature preprocessing in float32 for accuracy. Warn if someone tries to input a non-float32 tensor. Always create the output in the type the rest of the model expects.
Sort manifests by duration.

Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>

Always cast softmax inputs to float32 when in training mode.

While we don't need this for accurate results in b/float16, this is a safety precaution to make sure that training accuracy does not regress.

Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>

titu1994

Thanks for the fixes !

* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Try adding distributed format of checkpoint Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Allow dist checkpoint to be non-strict Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Fix Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Some fixes for PP + dist ckpt in Neva Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * fix peft Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * few fixes for lora Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * checkpoint updates Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <yaoyu-33@users.noreply.github.com> * bug fix Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Add HF siglip vision encoder Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * handle steerlm label in nv_dpo template Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * Add neva dist checkpoint converter Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <yaoyu-33@users.noreply.github.com> * fix CLEAN RESPONSE logic to not use last EOS Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * strip extra_id_1 from clean response Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * change inference time image processor Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * resolve comments Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * remove open_clip vision encoder for siglip Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * update neva dist ckpt apis Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <yaoyu-33@users.noreply.github.com> * fix return Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * resolve CLEAN RESPONSE multiturn issue Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * code format Signed-off-by: 
HuiyingLi <willwin.lee@gmail.com> * fixes for isort Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * refac image processor loading to util Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * black and isort Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * move crop size assertion Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * few neva fixes Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com> --------- Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Signed-off-by: yaoyu-33 <yaoyu-33@users.noreply.github.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com> Co-authored-by: yaoyu-33 <yaoyu.094@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: yaoyu-33 <54727607+yaoyu-33@users.noreply.github.com> Co-authored-by: yaoyu-33 <yaoyu-33@users.noreply.github.com> Co-authored-by: Pablo Garay <palenq@gmail.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * [Nemo CICD] timeouts fix (#9407) * timeouts fix * timeouts fix Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Removing un-used ModelConfig class (#9389) Co-authored-by: Chen Cui <chcui@nvidia.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Extend multimodal/speech_llm with lhotse, t5 and bestow supports (#9169) * Fixes * Docs fix * Add support for custom NeMo fields in Lhotse-NeMo adapters (attach to cut.custom) * Add support for custom NeMo fields in Lhotse-NeMo adapters (attach to cut.custom) * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * support distributed_fused_adam Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support distributed_fused_adam Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * Add support for sharded NeMo manifest files * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see 
https://pre-commit.ci * support megatron_amp_O2 Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * Support heterogeneous sampling rates in non tarred NeMo manifests * migrate to PTL2.0 Signed-off-by: stevehuang52 <heh@nvidia.com> * clean up Signed-off-by: stevehuang52 <heh@nvidia.com> * update manifest util Signed-off-by: stevehuang52 <heh@nvidia.com> * Support multiple tokenizer/parser types, aggregate tokenizers, and custom language fields * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * agg and normal tokenizers actually work * Support weights for NeMo tarred manifests * Temporarily hardcoded pnc stripping/lowercasing * fix * make pnc hack configurable from the config and disabled by default * fix the hack * migrate to ptl2.1 to support multiple dataloaders Signed-off-by: stevehuang52 <heh@nvidia.com> * support encoder overwrite Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * update misc Signed-off-by: stevehuang52 <heh@nvidia.com> * fix eval and clean up Signed-off-by: stevehuang52 <heh@nvidia.com> * support add_sep for perception model Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * fix https://github.com/Lightning-AI/pytorch-lightning/issues/18803 Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * add_bos Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * Transformer decoder with conditioning for canary (#8091) * initial commit for multi-task conf-enc transf-dec for canary Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> * removing decoder states caching during training Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- 
Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> Co-authored-by: Krishna Puvvada <kpuvvada@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Option to limit the number of open streams (#8095) * audio signal support in multi Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * update asr evaluator Signed-off-by: stevehuang52 <heh@nvidia.com> * fix from https://github.com/NVIDIA/NeMo/commit/fcc0f9f6ff7947c3c7fba3ed17d8ec8af6391397 and https://github.com/NVIDIA/NeMo/commit/f97c9016e6438ca4174b66bf9c3e248b28197aaa Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * transcribe fn for Canary models (#8110) * improve readability Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> * adding context in transcribe function for ConfTransfModels Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> * supporting relative paths in transcribe function for canary Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> * removing cuts.sort_by_duration in __getitem__ to maintain manifest order during inference Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> Co-authored-by: Krishna Puvvada <kpuvvada@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * update for evaluation Signed-off-by: stevehuang52 <heh@nvidia.com> * update for eval Signed-off-by: stevehuang52 <heh@nvidia.com> * update for evaluation Signed-off-by: stevehuang52 <heh@nvidia.com> * fix bleu Signed-off-by: stevehuang52 <heh@nvidia.com> * fix typo Signed-off-by: stevehuang52 <heh@nvidia.com> * Add missing audio_filepath validation for Canary (#8119) * Add missing audio_filepath validation for Canary * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: 
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * add default concat_sampling_probabilities Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support lhotse dataset in speechllm Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * bypass get_iterator_k_split Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * tmp fix Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * try to use fixed batch with megatron Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * add batch logging Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support unfrozen llm Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * Create README.md Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> * Update README.md Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> * Update README.md Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> * update Signed-off-by: stevehuang52 <heh@nvidia.com> * rename Signed-off-by: stevehuang52 <heh@nvidia.com> * add llama prompt template Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * update and refactor Signed-off-by: stevehuang52 <heh@nvidia.com> * support sample alpha Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support lhotse validation set and canary pretrained ckpt with pseudo label Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * make sure backward compatibility Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * remove pad Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * make sure asr_model is frozen Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support greedy decoding Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * valid on lhotse Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * fix multi dataloader in val case for lhotse SALM; add default data names; keep asr model tokenizer by default to enable adding canary dataset 
Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * remove the bruteforce _keep_special_tokens implementation Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * decoding_ratio and convert_canary_prompt_to_text support Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * canary_tokens_augment_ratio Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * debug Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * bug fix Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * fix lhotse based eval of llama canary model Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support some overwrite for eval Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support zero shot prompt in training Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support cross attention based SALM Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support cross attention based SALM Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * fix for batch train/valid of cross Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support learnable gate and plotting Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support using pseudo label in prompt rather than cross att Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * bug fix for perception cfg and context tokens shift Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * DentityConnectorsAdd Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * fix ckpt saving Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * Support RnnGatedCrossAttention Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * add include_ffw and fix _optimizer_param_groups for all unfrozen run Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support grad acc when using bucket Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support TransformerCrossAttention Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support ProjectTransformerCrossAttention 
Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support ++model.use_am_tokenizer ++model.override_vocab_size ++model.override.hidden_size Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support question set on val without canary Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support load_audio_encoder and wip in optim_param_groups Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * minor fix for audio pretrain model init Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * simplify canary_tokens_augment Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * use question in the manifest if it exists Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support dataset weighting for non tar Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * Update SpeechLLM code (#8475) * add pleasefixme marker for potential failed nightly tests. (#7678) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * Add new text segmentation library for better TTS quality (#7645) * Add new text segmentation library for better TTS quality * Update zh_cn_pinyin.py added detailed instruction on how to install pkuseg. Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * Update requirements_tts.txt remove pkuseg as the default dependency of NeMo TTS, and instead, direct users to manually install pkuseg if they really need. 
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> --------- Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * Create PrecisionPlugin for megatron_ckpt_to_nemo.py trainer (#7767) (#7774) * Create PrecisionPlugin for megatron_ckpt_to_nemo.py trainer * Add ddp_find_unused_parameters_true for punctuation_capitalization_train_evaluate.py * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add '32-true' for precision values --------- Signed-off-by: Abhishree <abhishreetm@gmail.com> Signed-off-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com> Co-authored-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * fix(clustering_diarizer.py): fix typo (#7772) Signed-off-by: Jean-Louis Queguiner <jean-louis.queguiner@gadz.org> * fix(diarization-README): typo (#7771) Signed-off-by: Jean-Louis Queguiner <jean-louis.queguiner@gadz.org> * Fix bug wrt change decoding strategy for bpe models (#7762) (#7764) * Fix bug wrt change decoding strategy for bpe models * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: smajumdar <titu1994@gmail.com> Co-authored-by: Somshubra Majumdar <titu1994@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Remove incorrect extra argument for load_from_checkpoint_dir() (#7500) Signed-off-by: Robin Dong <robin.k.dong@gmail.com> Co-authored-by: Eric Harper <complex451@gmail.com> * Add nemo to mcore GPT conversion script (#7730) * add conversion script Signed-off-by: Chen Cui <chcui@nvidia.com> * [pre-commit.ci] auto fixes from 
pre-commit.com hooks for more information, see https://pre-commit.ci * remove references to 'ckpt' Signed-off-by: Chen Cui <chcui@nvidia.com> * add one more sanity check to make sure there is no unexpected keys in state dict Signed-off-by: Chen Cui <chcui@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * make cpu loading work Signed-off-by: Chen Cui <chcui@nvidia.com> * make script work for llama2 models Signed-off-by: Chen Cui <chcui@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * address code check Signed-off-by: Chen Cui <chcui@nvidia.com> * remove trainer precision (was for old sanity check) Signed-off-by: Chen Cui <chcui@nvidia.com> * fix script for llama2 model Signed-off-by: Chen Cui <chcui@nvidia.com> * remove commented code Signed-off-by: Chen Cui <chcui@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Chen Cui <chcui@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> * Fix bug in ConditionalInput: cat along the feature dim, not the batch dim (#7785) Signed-off-by: anferico <f.cariaggi4@gmail.com> * Add some docs and update scripts for ASR (#7790) * Add some docs and update scripts Signed-off-by: smajumdar <titu1994@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: smajumdar <titu1994@gmail.com> Signed-off-by: Somshubra Majumdar <titu1994@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * set context for text memmap to fork (#7784) * set context for text memmap to fork Signed-off-by: arendu <adithyare@nvidia.com> * typo Signed-off-by: arendu <adithyare@nvidia.com> --------- Signed-off-by: arendu 
<adithyare@nvidia.com> * add training with multiple audios Signed-off-by: stevehuang52 <heh@nvidia.com> * Support flash decoding (#7744) * Add flash-decoding Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com> * Fix Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com> --------- Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Yang Zhang <yzhang123@users.noreply.github.com> * Change accelerator to 'auto' in nlp_checkpoint_port.py (#7761) * Change accelerator to 'auto' in nlp_checkpoint_port.py (#7747) * Change accelerator to auto Signed-off-by: Abhishree <abhishreetm@gmail.com> * Pass omegaconf object to trainer in nlp_checkpoint_port.py Signed-off-by: Abhishree <abhishreetm@gmail.com> * Pass omegaconf object to trainer in export.py Signed-off-by: Abhishree <abhishreetm@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Abhishree <abhishreetm@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Signed-off-by: Abhishree <abhishreetm@gmail.com> * docs: fix typos (#7758) Signed-off-by: shuoer86 <129674997+shuoer86@users.noreply.github.com> Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Abhishree <abhishreetm@gmail.com> * Snake act (#7736) Signed-off-by: Abhishree <abhishreetm@gmail.com> * Update gpt_dataset.py (#6963) Signed-off-by: Xin Yao <xiny@nvidia.com> Co-authored-by: Sandeep Subramanian 
<sandeep.subramanian.1@umontreal.ca> Signed-off-by: Abhishree <abhishreetm@gmail.com> --------- Signed-off-by: Abhishree <abhishreetm@gmail.com> Signed-off-by: shuoer86 <129674997+shuoer86@users.noreply.github.com> Signed-off-by: Xin Yao <xiny@nvidia.com> Co-authored-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Co-authored-by: shuoer86 <129674997+shuoer86@users.noreply.github.com> Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: Nithin Rao <nithinrao.koluguri@gmail.com> Co-authored-by: Xin Yao <yaox12@outlook.com> Co-authored-by: Sandeep Subramanian <sandeep.subramanian.1@umontreal.ca> * Add selection criteria for reference audios in the `GlobalStyleToken` submodule (#7788) * add selection criteria for reference audios Signed-off-by: anferico <f.cariaggi4@gmail.com> * Update configuration files Signed-off-by: anferico <f.cariaggi4@gmail.com> * add informative comment in config files Signed-off-by: anferico <f.cariaggi4@gmail.com> * sample random index for reference audio selection Signed-off-by: anferico <f.cariaggi4@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: anferico <f.cariaggi4@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * update text server to support compute logprobs (#7733) * update text server to support compute logprobs * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix typo --------- Signed-off-by: Zhilin Wang <zhilinw@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * add multi-layer feat extract and fix random question insertion Signed-off-by: stevehuang52 <heh@nvidia.com> * Configure 
MCore logger (#7781) Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com> * Revert "PEFT eval fix (#7626) (#7638)" (#7693) This reverts commit f03dd660bd26d88fd569e76c6f74b83a7c203ff9. * remove TN from ctc_segm tut (#7807) Signed-off-by: Evelina <ebakhturina@nvidia.com> * [TTS] Support audio offsets in TTS data loaders (#7156) * [TTS] Support audio offsets in TTS data loaders Signed-off-by: Ryan <rlangman@nvidia.com> * [TTS] Change docstring mentions of .pt to .npy Signed-off-by: Ryan <rlangman@nvidia.com> --------- Signed-off-by: Ryan <rlangman@nvidia.com> * Update Apex install command in Dockerfile (#7794) (#7804) * move core install to /workspace (#7706) * update apex install in dockerfile * use fetch head --------- Signed-off-by: Abhinav Khattar <aklife97@gmail.com> Signed-off-by: eharper <eharper@nvidia.com> Co-authored-by: Eric Harper <complex451@gmail.com> Co-authored-by: Abhinav Khattar <aklife97@gmail.com> * fix typo Signed-off-by: stevehuang52 <heh@nvidia.com> * Nemo to HF converter for LLaMA model (#7770) * Create config_llama_truncate.yaml Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> * Add files via upload Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> * Update convert_nemo_llama_to_hf.py Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update config_llama_truncate.yaml Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> * Update convert_nemo_llama_to_hf.py Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update convert_nemo_llama_to_hf.py Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> * clean up trainer * remove dependency on yaml config. load config from nemo file instead. 
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * enable ckpt saving into other precision formats * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * support 70b + cleanup qkv slice logic * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix bug * move hf model folder code from comment to function and add instruction to run * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> Signed-off-by: Chen Cui <chcui@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Co-authored-by: Chen Cui <chcui@nvidia.com> * Save best NeMo model only when necessary (#7836) Signed-off-by: Ante Jukić <ajukic@nvidia.com> * add guard if its a distributed checkpoint (#7845) Signed-off-by: Gerald Shen <geshen@nvidia.com> * Fix tn duplex (#7808) * fix duplex tn infer Signed-off-by: Evelina <ebakhturina@nvidia.com> * fix typo Signed-off-by: Evelina <ebakhturina@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix TN docs Signed-off-by: Evelina <ebakhturina@nvidia.com> --------- Signed-off-by: Evelina <ebakhturina@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Update transformers cache on Jenkins (#7854) * update transformers cache Signed-off-by: eharper <eharper@nvidia.com> * update Signed-off-by: eharper <eharper@nvidia.com> * add cd Signed-off-by: eharper <eharper@nvidia.com> --------- Signed-off-by: eharper <eharper@nvidia.com> * Update README.rst for container update (#7844) Signed-off-by: fayejf <36722593+fayejf@users.noreply.github.com> * Add support for finetuning with 
huggingface datasets (#7834) * add finetune with huggingface dataset Signed-off-by: stevehuang52 <heh@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update yaml Signed-off-by: stevehuang52 <heh@nvidia.com> * update Signed-off-by: stevehuang52 <heh@nvidia.com> * update and refactor Signed-off-by: stevehuang52 <heh@nvidia.com> * add extrac hf text and update Signed-off-by: stevehuang52 <heh@nvidia.com> * update and refactor Signed-off-by: stevehuang52 <heh@nvidia.com> * move dataset dependency to common Signed-off-by: stevehuang52 <heh@nvidia.com> * add docstring Signed-off-by: stevehuang52 <heh@nvidia.com> * Add to Dics Signed-off-by: Nithin Rao Koluguri <nithinraok> * add ci test Signed-off-by: Nithin Rao Koluguri <nithinraok> * add max steps in jenkins Signed-off-by: Nithin Rao Koluguri <nithinraok> * reduce max steps Signed-off-by: Nithin Rao Koluguri <nithinraok> * jenkins test Signed-off-by: Nithin Rao Koluguri <nithinraok> * add bs=2 Signed-off-by: Nithin Rao Koluguri <nithinraok> --------- Signed-off-by: stevehuang52 <heh@nvidia.com> Signed-off-by: Nithin Rao Koluguri <nithinraok> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Nithin Rao Koluguri <nithinraok> Co-authored-by: Nithin Rao <nithinrao.koluguri@gmail.com> * Multimodal merge (#7728) * ControlNet TRT export * Final MR before release * SD2 update * Fixed export issue * Fix for instruct p2p and reformat * Fix SD export issue * Add nemo clip export for DB * Fix ins pix2pix * fix sd2 config * [Mingyuan Ma] BF16 and SD conversion script * [Imagen] NHWC Feature * Fix .nemo loading issue for NeMo CLIP in SD * NeMo r1.20.0 Multimodal Merge * fix the inductor issue in inference * Fix inductor loading .nemo issue * Add Neva Model Support * Imagen Optimizations * Neva inference code * NeMo TOT 1.21 to Internal/main * Update neva_inference.yaml * REBASING for latest code changes * 
Update internal/main to main tot * Parallel DDIM implementation * 1. Fixing indentation bug. (#7352) Signed-off-by: Micha Livne <mlivne@nvidia.com> * NeMo MCore llama2 support + MCore PEFT adapters (#7299) * start adding gpt from megatron core path Signed-off-by: ericharper <complex451@gmail.com> * set model parallel config Signed-off-by: ericharper <complex451@gmail.com> * use model parallel config object Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update args Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * set vp size to none if it is 1 Signed-off-by: ericharper <complex451@gmail.com> * set vp size to none if it is 1 Signed-off-by: ericharper <complex451@gmail.com> * add TransformerConfig Signed-off-by: ericharper <complex451@gmail.com> * start updating to TransformerConfig Signed-off-by: ericharper <complex451@gmail.com> * add todo Signed-off-by: ericharper <complex451@gmail.com> * revert to model parallel config Signed-off-by: ericharper <complex451@gmail.com> * add hidden_size to model_parallel_config Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove imports Signed-off-by: ericharper <complex451@gmail.com> * revert Signed-off-by: ericharper <complex451@gmail.com> * remove import Signed-off-by: ericharper <complex451@gmail.com> * small clean up Signed-off-by: ericharper <complex451@gmail.com> * update hidden size in peft base model, add mcore commit to jenkins Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update module args Signed-off-by: ericharper <complex451@gmail.com> * add config obj to flash attention tests Signed-off-by: 
ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove args Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove sequence parallel arg Signed-off-by: ericharper <complex451@gmail.com> * update args Signed-off-by: ericharper <complex451@gmail.com> * add config to self Signed-off-by: ericharper <complex451@gmail.com> * update args Signed-off-by: ericharper <complex451@gmail.com> * update args Signed-off-by: ericharper <complex451@gmail.com> * update args Signed-off-by: ericharper <complex451@gmail.com> * add config to test Signed-off-by: ericharper <complex451@gmail.com> * get hidden_size from config Signed-off-by: ericharper <complex451@gmail.com> * add try except Signed-off-by: ericharper <complex451@gmail.com> * use default Signed-off-by: ericharper <complex451@gmail.com> * update config with hidden size Signed-off-by: ericharper <complex451@gmail.com> * remove arg Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * comment out jenkins test Signed-off-by: ericharper <complex451@gmail.com> * revert import Signed-off-by: ericharper <complex451@gmail.com> * build transformer config Signed-off-by: ericharper <complex451@gmail.com> * add model to provider func Signed-off-by: ericharper <complex451@gmail.com> * update forward and float16 wrapper Signed-off-by: ericharper <complex451@gmail.com> * instantiate model parallel config after init model parallel Signed-off-by: ericharper <complex451@gmail.com> * set virtual rank Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add GQA config to megatron gpt model (#7096) * Add GQA config in gpt config file Signed-off-by: jasonwan 
<jasonwan@nvidia.com> * Verify mcore is enabled when using GQA Signed-off-by: jasonwan <jasonwan@nvidia.com> --------- Signed-off-by: jasonwan <jasonwan@nvidia.com> * revert Signed-off-by: ericharper <complex451@gmail.com> * mcore llama2 ckpt conversion & small fix Signed-off-by: jasonwan <jasonwan@nvidia.com> * Add inference & sft config by Hongbin Co-authored-by: Hongbin Liu <hongbinl@nvidia.com> Signed-off-by: jasonwan <jasonwan@nvidia.com> * fix config Signed-off-by: jasonwan <jasonwan@nvidia.com> * add inference param. update TP/PP script to support mcore gpt Signed-off-by: jasonwan <jasonwan@nvidia.com> * p-tuning Signed-off-by: jasonwan <jasonwan@nvidia.com> * modify ckpt conversion script (adding model cast) Signed-off-by: jasonwan <jasonwan@nvidia.com> * ckpt conversion use relative path for config Signed-off-by: jasonwan <jasonwan@nvidia.com> * start adding gpt from megatron core path Signed-off-by: ericharper <complex451@gmail.com> * set model parallel config Signed-off-by: ericharper <complex451@gmail.com> * use model parallel config object Signed-off-by: ericharper <complex451@gmail.com> * update args Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * set vp size to none if it is 1 Signed-off-by: ericharper <complex451@gmail.com> * set vp size to none if it is 1 Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add TransformerConfig Signed-off-by: ericharper <complex451@gmail.com> * start updating to TransformerConfig Signed-off-by: ericharper <complex451@gmail.com> * add todo Signed-off-by: ericharper <complex451@gmail.com> * revert to model parallel config Signed-off-by: ericharper <complex451@gmail.com> * add hidden_size to model_parallel_config Signed-off-by: ericharper <complex451@gmail.com> * remove imports Signed-off-by: ericharper 
<complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove import Signed-off-by: ericharper <complex451@gmail.com> * small clean up Signed-off-by: ericharper <complex451@gmail.com> * update hidden size in peft base model, add mcore commit to jenkins Signed-off-by: ericharper <complex451@gmail.com> * update module args Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add config obj to flash attention tests Signed-off-by: ericharper <complex451@gmail.com> * remove args Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove sequence parallel arg Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update args Signed-off-by: ericharper <complex451@gmail.com> * add config to self Signed-off-by: ericharper <complex451@gmail.com> * update args Signed-off-by: ericharper <complex451@gmail.com> * update args Signed-off-by: ericharper <complex451@gmail.com> * update args Signed-off-by: ericharper <complex451@gmail.com> * add config to test Signed-off-by: ericharper <complex451@gmail.com> * get hidden_size from config Signed-off-by: ericharper <complex451@gmail.com> * add try except Signed-off-by: ericharper <complex451@gmail.com> * use default Signed-off-by: ericharper <complex451@gmail.com> * update config with hidden size Signed-off-by: ericharper <complex451@gmail.com> * remove arg Signed-off-by: ericharper <complex451@gmail.com> * comment out jenkins test Signed-off-by: ericharper <complex451@gmail.com> * revert import Signed-off-by: ericharper <complex451@gmail.com> * remove optimizer_idx Signed-off-by: eharper <eharper@nvidia.com> * prefetch num microbatches Signed-off-by: eharper 
<eharper@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * start adding gpt from megatron core path Signed-off-by: ericharper <complex451@gmail.com> * set model parallel config Signed-off-by: ericharper <complex451@gmail.com> * use model parallel config object Signed-off-by: ericharper <complex451@gmail.com> * update args Signed-off-by: ericharper <complex451@gmail.com> * fix for p-tuning sequence parallel Signed-off-by: jasonwan <jasonwan@nvidia.com> * support SFT/distOpt mcore (#7207) * add inference param. update TP/PP script to support mcore gpt * p-tuning Signed-off-by: jasonwan <jasonwan@nvidia.com> * change layer names for SFT Signed-off-by: Hongbin Liu <hongbinl@nvidia.com> * fix bug in SFT Signed-off-by: Hongbin Liu <hongbinl@nvidia.com> --------- Signed-off-by: jasonwan <jasonwan@nvidia.com> Signed-off-by: Hongbin Liu <hongbinl@nvidia.com> Co-authored-by: Hongbin Liu <hongbinl@nvidia.com> Co-authored-by: jasonwan <jasonwan@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * start updating to TransformerConfig Signed-off-by: ericharper <complex451@gmail.com> * revert to model parallel config Signed-off-by: ericharper <complex451@gmail.com> * add hidden_size to model_parallel_config Signed-off-by: ericharper <complex451@gmail.com> * remove imports Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update module args Signed-off-by: ericharper <complex451@gmail.com> * add config to self Signed-off-by: ericharper <complex451@gmail.com> * build transformer config Signed-off-by: ericharper <complex451@gmail.com> * add model to provider func Signed-off-by: ericharper <complex451@gmail.com> * update forward and float16 wrapper Signed-off-by: ericharper <complex451@gmail.com> * instantiate model parallel config after init model 
parallel Signed-off-by: ericharper <complex451@gmail.com> * set virtual rank Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add GQA config to megatron gpt model (#7096) * Add GQA config in gpt config file Signed-off-by: jasonwan <jasonwan@nvidia.com> * Verify mcore is enabled when using GQA Signed-off-by: jasonwan <jasonwan@nvidia.com> --------- Signed-off-by: jasonwan <jasonwan@nvidia.com> * revert Signed-off-by: ericharper <complex451@gmail.com> * remove import Signed-off-by: eharper <eharper@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * rollback model cast for p-tuning Signed-off-by: jasonwan <jasonwan@nvidia.com> * update for dist adam Signed-off-by: eharper <eharper@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * use get_gpt_module_list Signed-off-by: eharper <eharper@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update ckpt conversion script Signed-off-by: jasonwan <jasonwan@nvidia.com> * ptl2.0 patch for llama config Signed-off-by: jasonwan <jasonwan@nvidia.com> * add plugins to trainer in scripts Signed-off-by: jasonwan <jasonwan@nvidia.com> * fix activation checkpointing mcore Signed-off-by: jasonwan <jasonwan@nvidia.com> * fix variable names Signed-off-by: jasonwan <jasonwan@nvidia.com> * overwrite normalization type for mcore/te Signed-off-by: jasonwan <jasonwan@nvidia.com> * Update megatron_llama_sft.yaml Signed-off-by: Jason Wang <jasonwan@nvidia.com> * add PEFT adapter support for mcore gpt path (#7276) * implementation for mcore adapter/mxins Signed-off-by: jasonwan <jasonwan@nvidia.com> * small fix for lora and ptuning Signed-off-by: jasonwan <jasonwan@nvidia.com> * support layerwise peft Signed-off-by: jasonwan <jasonwan@nvidia.com> * 
support multiple target layers Signed-off-by: jasonwan <jasonwan@nvidia.com> * support lora GQA Signed-off-by: jasonwan <jasonwan@nvidia.com> * support amp O2 Signed-off-by: jasonwan <jasonwan@nvidia.com> * revert & more O2 fix Signed-off-by: jasonwan <jasonwan@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * lora inject to attention Signed-off-by: jasonwan <jasonwan@nvidia.com> * support …

…#9198) * Fix the "cast ping pong" problem when we run AMP inference. This has been tested only for Parakeet-CTC-1.1B right now. This problem certainly exists elsewhere. Automatic mixed precision and inference do not play well together. First, automatic mixed precision was created back when neural networks were much simpler. In particular, they did not have softmax and layer norm as frequent operations. In the era of transformers, softmax and layer norm are very common. AMP will unconditionally output fp32 outputs from these operations, even if their inputs are fp16. See here: https://pytorch.org/docs/stable/amp.html#cuda-ops-that-can-autocast-to-float32 This is no longer necessary, now that layer norm does accumulation in fp32 in pytorch, even if the input is fp16: pytorch/pytorch#66707 Do inference by casting model to bfloat16, not by using AMP. Do feature preprocessing in float32 for accuracy. Warn if someone tries to input a non-float32 tensor. Always create the output in the type the rest of the model expects. Sort manifests by duration. Signed-off-by: Daniel Galvez <dgalvez@nvidia.com> * Always cast softmax inputs to float32 when in training mode. While we don't need this for accurate results in b/float16, this is a safety precaution to make sure that training accuracy does not regress. Signed-off-by: Daniel Galvez <dgalvez@nvidia.com> --------- Signed-off-by: Daniel Galvez <dgalvez@nvidia.com> Signed-off-by: Jan Lasek <janek.lasek@gmail.com>
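The commit message above says to do feature preprocessing in float32 for accuracy. A minimal sketch of why, using only the standard library: bfloat16 keeps the float32 exponent but only 8 significant bits, so neighbouring 16-bit PCM sample values collapse onto the same number when the raw waveform is cast. The helper `to_bfloat16` and the constant `PCM_SCALE` are illustrative names, not part of NeMo or PyTorch, and the sketch assumes finite inputs and round-to-nearest-even.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Round a finite value to the nearest bfloat16 and return it as a float.

    Illustrative helper (an assumption, not a NeMo or PyTorch API): bfloat16
    is the top 16 bits of an IEEE float32, so we round the float32 bit
    pattern to nearest-even at bit 16. NaN and infinity are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round-to-nearest-even: add 0x7FFF plus the lowest bit that survives.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Two adjacent 16-bit PCM samples, normalized to [-1, 1) as a waveform would be.
PCM_SCALE = 32768
a = 12345 / PCM_SCALE
b = 12346 / PCM_SCALE

print(to_bfloat16(a) == to_bfloat16(b))  # True: both samples become the same value
print(to_bfloat16(a) * PCM_SCALE)        # 12352.0: the nearest representable level
```

In the range [0.25, 0.5) the bfloat16 grid spacing is 2^-9, which is 64 PCM levels, so the waveform is effectively requantized before any features are computed. Running the preprocessor in float32 and casting only its output to bfloat16 avoids this.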

…rategy (#9387) * Integrating mcore's DistributedDataParallel into MegatronStrategy Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Apply isort and black reformatting Signed-off-by: marcromeyn <marcromeyn@users.noreply.github.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Apply ddp-hooks from pytorch only when needed Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * bugfix if using mcore distOpt with sft (#9356) * bugfix if using mcore distOpt Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * Apply isort and black reformatting Signed-off-by: akoumpa <akoumpa@users.noreply.github.com> --------- Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> Signed-off-by: akoumpa <akoumpa@users.noreply.github.com> Co-authored-by: akoumpa <akoumpa@users.noreply.github.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * fix typo infer_seq_lenght -> infer_seq_length (#9370) Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> Co-authored-by: Marc Romeyn <mromeijn@nvidia.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Rachitg/ag (#9083) * Rachitg/ag (#9081) * disable overlap for qkv Signed-off-by: Rachit Garg <rachitg@login-eos01.eos.clusters.nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * bug fix * bugfix --------- Signed-off-by: Rachit Garg <rachitg@login-eos01.eos.clusters.nvidia.com> Signed-off-by: Rachit Garg <rachitgarg91@gmail.com> Co-authored-by: Rachit Garg <rachitg@login-eos01.eos.clusters.nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Apply isort and black reformatting Signed-off-by: michal2409 <michal2409@users.noreply.github.com> --------- Signed-off-by: Rachit Garg <rachitg@login-eos01.eos.clusters.nvidia.com> Signed-off-by: Rachit Garg <rachitgarg91@gmail.com> Signed-off-by: michal2409 <michal2409@users.noreply.github.com> Co-authored-by: Rachit Garg <rachitgarg91@gmail.com> 
Co-authored-by: Rachit Garg <rachitg@login-eos01.eos.clusters.nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Michal Futrega <mfutrega@nvidia.com> Co-authored-by: michal2409 <michal2409@users.noreply.github.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Adding the original change made for label_models (#9377) (#9378) Signed-off-by: Taejin Park <tango4j@gmail.com> Co-authored-by: Taejin Park <tango4j@gmail.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Dgalvez/fix greedy batch strategy name r2.0.0rc0 (#9243) (#9253) * Lazily warn about using greedy strategy instead of greedy_batch strategy. Previously, the warning would often run spuriously, since several existing code paths simply call "change_decoding_strategy()" after having first initialized a Module, rather than changing the config before initializing the Module. This can be confusing. The only problem I can see with this is that using logging inside a forward() method might interfere with some compiler toolkits like Torchscript or thunder.compile. Presumably it would be easy to add a conditional statement to avoid this statement in a compiler context if necessary. Signed-off-by: Daniel Galvez <dgalvez@nvidia.com> Co-authored-by: Daniel Galvez <galv@users.noreply.github.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Update README.rst (#9393) Revised content per https://gitlab-master.nvidia.com/nemo-framework-tme/documentation/-/issues/25. Also removed reference to NIMs in LLMs and MMs Deployment and Optimization. It should be NVIDIA NeMo Microservices and not NIM. Removed nemo:24.03.framework and nemo:24.01.speech in Docker Containers section and replaced with 24.05 . Please verify all changes. 
Signed-off-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * a2a fix removed tp world size and group from init (#8944) (#8952) Signed-off-by: Anmol Gupta <14880251+anmolgupt@users.noreply.github.com> Co-authored-by: anmolgupt <14880251+anmolgupt@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Add config option for FP32 embedding grads (#8953) * Add config option for FP32 embedding grads (#8946) Signed-off-by: Tim Moon <tmoon@nvidia.com> * Apply isort and black reformatting Signed-off-by: ericharper <ericharper@users.noreply.github.com> --------- Signed-off-by: Tim Moon <tmoon@nvidia.com> Signed-off-by: ericharper <ericharper@users.noreply.github.com> Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Co-authored-by: ericharper <ericharper@users.noreply.github.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Changes to enable CUDA graph for LLM (#8955) * Changes to enable CUDA graph for LLM (#8751) * Use next instead of get_batch Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * CUDA graph changes Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Change to enable CG with weight caching Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Revert "Use next instead of get_batch" This reverts commit 0021bb444cdd1b27674fc0cfea909c1a42475336. Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Copy jbaczek/mcore_parallel_state_api_change branch leaving out changes to nemo/export/quantize/quantizer.py Signed-off-by: Jan Baczek <jbaczek@nvidia.com> Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Revert "Copy jbaczek/mcore_parallel_state_api_change branch leaving out changes to nemo/export/quantize/quantizer.py" This reverts commit b4f736ed2b39f6c48d2868ac3febb82c763ab3fb. 
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Remove skip_weight_update argument Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Bug fix + cleanup Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Cleanup Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Use new TE API for FP8 Param transpose Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Change config param cuda_graph to enable_cuda_graph Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Enable TE RNGStatesTracker through config Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Change te_rng_tracker to use_te_rng_tracker Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * FP8 weight transpose handled inside TE Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Cleanup Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Revert "Revert "Copy jbaczek/mcore_parallel_state_api_change branch leaving out changes to nemo/export/quantize/quantizer.py"" This reverts commit e31862481216f9adf7fa584a0c0262916c935639. 
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Fix merge conflicts Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Fix merge conflicts Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Fix merge conflicts Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> --------- Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> Signed-off-by: Jan Baczek <jbaczek@nvidia.com> Co-authored-by: Jaemin Choi <jaeminc@nvidia.com> Co-authored-by: Jan Baczek <jbaczek@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Apply isort and black reformatting Signed-off-by: ericharper <ericharper@users.noreply.github.com> --------- Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> Signed-off-by: Jan Baczek <jbaczek@nvidia.com> Signed-off-by: ericharper <ericharper@users.noreply.github.com> Co-authored-by: vasunvidia <108759426+vasunvidia@users.noreply.github.com> Co-authored-by: Jaemin Choi <jaeminc@nvidia.com> Co-authored-by: Jan Baczek <jbaczek@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Co-authored-by: ericharper <ericharper@users.noreply.github.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Enhance Distributed Adam (#9051) * Enhance Distributed Adam (#9037) * Fix deprecated env. Signed-off-by: Wil Kong <alpha0422@gmail.com> * Use user desired value for distributed adam. Signed-off-by: Wil Kong <alpha0422@gmail.com> * Preserve memory format in parameter buffer of distributed adam. Signed-off-by: Wil Kong <alpha0422@gmail.com> * Fix the contiguous_param_buffer bug about bprop overlap and redundant copy after all-gather. Signed-off-by: Wil Kong <alpha0422@gmail.com> * Provide API to lock SHArP tree for distributed adam within nodes. 
Signed-off-by: Wil Kong <alpha0422@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Wil Kong <alpha0422@gmail.com> --------- Signed-off-by: Wil Kong <alpha0422@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Apply isort and black reformatting Signed-off-by: ericharper <ericharper@users.noreply.github.com> --------- Signed-off-by: Wil Kong <alpha0422@gmail.com> Signed-off-by: ericharper <ericharper@users.noreply.github.com> Co-authored-by: Wil Kong <alpha0422@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Co-authored-by: ericharper <ericharper@users.noreply.github.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Force diarizer to use CUDA if cuda is available and if device=None. (#9380) (#9390) * Fixed clustering diarizer to load MSDD to GPU by default if cuda on * Fixed clustering diarizer to load MSDD to GPU by default if cuda on * Apply isort and black reformatting --------- Signed-off-by: Taejin Park <tango4j@gmail.com> Signed-off-by: tango4j <tango4j@users.noreply.github.com> Co-authored-by: Taejin Park <tango4j@gmail.com> Co-authored-by: tango4j <tango4j@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * ci: Properly catch failed tests by introduction of workflow templates (#9324) * ci: Refactor tests into reusable template Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * ci: Fix sending alerts on failure Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * fix Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * fix Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * disable slack Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * fix Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * fix alerting Signed-off-by: Oliver Koenig 
<okoenig@nvidia.com> * fix Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * ci: Increase timeout for `L0_Unit_Tests_CPU` Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * fix Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * increase timeout Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * increase timeout for `Speech_Checkpoints_tests` Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * improve readability Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * fix Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * fix Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * fix Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * test Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * test Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * fix Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * fix Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * finalize Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * fix Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * add missing rm statement for `L2_PTQ_Llama2_Export_Only` Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * all your comments are belong to us Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * remove github output Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * revive more comments Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * add L2: ASR dev run - part two Signed-off-by: Oliver Koenig <okoenig@nvidia.com> --------- Signed-off-by: Oliver Koenig <okoenig@nvidia.com> Signed-off-by: Pablo Garay <palenq@gmail.com> Co-authored-by: Pablo Garay <palenq@gmail.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Fix T5 G2P Input and Output Types (#9224) (#9269) * fix t5 g2p model * Apply isort and black reformatting --------- Signed-off-by: Jason <jasoli@nvidia.com> Signed-off-by: blisc <blisc@users.noreply.github.com> Co-authored-by: Jason <jasoli@nvidia.com> Co-authored-by: blisc <blisc@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Signed-off-by: Marc Romeyn 
<mromeijn@nvidia.com> * Use model-cast-to-bfloat16 rather than AMP-to-bfloat16 for inference. (#9198) * Fix the "cast ping pong" problem when we run AMP inference. This has been tested only for Parakeet-CTC-1.1B right now. This problem certainly exists elsewhere. Automatic mixed precision and inference do not play well together. First, automatic mixed precision was created back when neural networks were much simpler. In particular, they did not have softmax and layer norm as frequent operations. In the era of transformers, softmax and layer norm are very common. AMP will unconditionally output fp32 outputs from these operations, even if their inputs are fp16. See here: https://pytorch.org/docs/stable/amp.html#cuda-ops-that-can-autocast-to-float32 This is no longer necessary, now that layer norm does accumulation in fp32 in pytorch, even if the input is fp16: https://github.com/pytorch/pytorch/issues/66707 Do inference by casting model to bfloat16, not by using AMP. Do feature preprocessing in float32 for accuracy. Warn if someone tries to input a non-float32 tensor. Always create the output in the type the rest of the model expects. Sort manifests by duration. Signed-off-by: Daniel Galvez <dgalvez@nvidia.com> * Always cast softmax inputs to float32 when in training mode. While we don't need this for accurate results in b/float16, this is a safety precaution to make sure that training accuracy does not regress. 
Signed-off-by: Daniel Galvez <dgalvez@nvidia.com> --------- Signed-off-by: Daniel Galvez <dgalvez@nvidia.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Huvu/rag pipeline citest (#9384) * huvu/NeMo_rag_citest first commit * adding llama-index to dependency * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * adjusting data/models path in ci-test to dependency * putting llama-index to optional * update cicd-main.yml --------- Co-authored-by: Huy Vu2 <huvu@login-eos01.eos.clusters.nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Re-org export code (#9353) * reorg the export code Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * Apply isort and black reformatting Signed-off-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com> * replaced log with raise Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * add converter and loader folders Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * move nemo_ckpt_convert into the converter folder Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * move nemo_file into loader folder Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * reorg converter Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * Apply isort and black reformatting Signed-off-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com> * continue to reorg converter Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * Apply isort and black reformatting Signed-off-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com> * continue to reorg Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * move nemo file back into nemo folder Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * renamed nemo folder to nemo_ckpt_loader Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * remove unused function Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * removed nemo file Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * 
Apply isort and black reformatting Signed-off-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com> * moved a function to tensorrt_llm_run file Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * Apply isort and black reformatting Signed-off-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com> * Remove unused imports Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * Apply isort and black reformatting Signed-off-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com> * import csv added Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> --------- Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> Signed-off-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com> Co-authored-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * ci: Fix `L2_Segmentation_Tool_Parallel_ctc_segmentation_test_L2_Eng_CitriNet_with_wav` (#9399) Signed-off-by: Oliver Koenig <okoenig@nvidia.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * disable overlap for qkv (#9079) * disable overlap for qkv (#9072) * disable overlap for qkv Signed-off-by: Rachit Garg <rachitg@login-eos01.eos.clusters.nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Rachit Garg <rachitg@login-eos01.eos.clusters.nvidia.com> Co-authored-by: Rachit Garg <rachitg@login-eos01.eos.clusters.nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Apply isort and black reformatting Signed-off-by: michal2409 <michal2409@users.noreply.github.com> --------- Signed-off-by: Rachit Garg <rachitg@login-eos01.eos.clusters.nvidia.com> Signed-off-by: michal2409 <michal2409@users.noreply.github.com> Signed-off-by: Michal Futrega <mfutrega@nvidia.com> Co-authored-by: Rachit Garg <rachitgarg91@gmail.com> Co-authored-by: Rachit Garg <rachitg@login-eos01.eos.clusters.nvidia.com> Co-authored-by: pre-commit-ci[bot] 
<66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Michal Futrega <mfutrega@nvidia.com> Co-authored-by: michal2409 <michal2409@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Fix circular import for MM dataprep notebook (#9287) (#9292) * update launcher name and fix mm circular import * Apply isort and black reformatting --------- Signed-off-by: Chen Cui <chcui@nvidia.com> Signed-off-by: cuichenx <cuichenx@users.noreply.github.com> Co-authored-by: Chen Cui <chcui@nvidia.com> Co-authored-by: cuichenx <cuichenx@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * add check if num layers is divisible by pp size (#9208) (#9298) * add check if num_layers % pp == 0 * Apply isort and black reformatting * move num_layers / pp check to build_transformer_config --------- Signed-off-by: dimapihtar <dpihtar@gmail.com> Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com> Co-authored-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com> Co-authored-by: dimapihtar <dimapihtar@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Add HF siglip vision encoder (#9185) * temp save Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * temp save 2 Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * update code Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * enable seq packing Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * fix neva and clip Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Enable parallel seq packing algo and few other fixes Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Pipeline parallel support Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Update data preprocess Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * fix few pp issues Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * enable sequence packing w/ PP 
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Fix cu_seqlens in inputs Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * add assert Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Depend on PP to decide whether do padding Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add docstring Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Fix few evaluation issues Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Fix few PP evaluation issues Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Address comments Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add llama3 template Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * address comments Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Fix license Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Fix llama3 Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Few fixes Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Few neva bugs Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Few neva bugs Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Few neva bugs Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * llama3 inference fix Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Force vision encoder to run in fp32 Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Revert "Force vision encoder to run in fp32" This reverts commit 9d2160d96cb3e2a27a18538950ef43b4482c04da. 
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Try adding distributed format of checkpoint Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Allow dist checkpoint to be non-strict Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Fix Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Some fixes for PP + dist ckpt in Neva Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * fix peft Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * few fixes for lora Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * checkpoint updates Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <yaoyu-33@users.noreply.github.com> * bug fix Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Add HF siglip vision encoder Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * handle steerlm label in nv_dpo template Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * Add neva dist checkpoint converter Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <yaoyu-33@users.noreply.github.com> * fix CLEAN RESPONSE logic to not use last EOS Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * strip extra_id_1 from clean response Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * change inference time image processor Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * resolve comments Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * remove open_clip vision encoder for siglip Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * update neva dist ckpt apis Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <yaoyu-33@users.noreply.github.com> * fix return Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * resolve CLEAN RESPONSE multiturn issue Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * code format Signed-off-by: 
HuiyingLi <willwin.lee@gmail.com> * fixes for isort Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * refac image processor loading to util Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * black and isort Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * move crop size assertion Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * few neva fixes Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com> --------- Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Signed-off-by: yaoyu-33 <yaoyu-33@users.noreply.github.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com> Co-authored-by: yaoyu-33 <yaoyu.094@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: yaoyu-33 <54727607+yaoyu-33@users.noreply.github.com> Co-authored-by: yaoyu-33 <yaoyu-33@users.noreply.github.com> Co-authored-by: Pablo Garay <palenq@gmail.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * [Nemo CICD] timeouts fix (#9407) * timeouts fix * timeouts fix Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Removing un-used ModelConfig class (#9389) Co-authored-by: Chen Cui <chcui@nvidia.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Extend multimodal/speech_llm with lhotse, t5 and bestow supports (#9169) * Fixes * Docs fix * Add support for custom NeMo fields in Lhotse-NeMo adapters (attach to cut.custom) * Add support for custom NeMo fields in Lhotse-NeMo adapters (attach to cut.custom) * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * support distributed_fused_adam Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support distributed_fused_adam Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * Add support for sharded NeMo manifest files * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see 
https://pre-commit.ci * support megatron_amp_O2 Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * Support heterogeneous sampling rates in non tarred NeMo manifests * migrate to PTL2.0 Signed-off-by: stevehuang52 <heh@nvidia.com> * clean up Signed-off-by: stevehuang52 <heh@nvidia.com> * update manifest util Signed-off-by: stevehuang52 <heh@nvidia.com> * Support multiple tokenizer/parser types, aggregate tokenizers, and custom language fields * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * agg and normal tokenizers actually work * Support weights for NeMo tarred manifests * Temporarily hardcoded pnc stripping/lowercasing * fix * make pnc hack configurable from the config and disabled by default * fix the hack * migrate to ptl2.1 to support multiple dataloaders Signed-off-by: stevehuang52 <heh@nvidia.com> * support encoder overwrite Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * update misc Signed-off-by: stevehuang52 <heh@nvidia.com> * fix eval and clean up Signed-off-by: stevehuang52 <heh@nvidia.com> * support add_sep for perception model Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * fix https://github.com/Lightning-AI/pytorch-lightning/issues/18803 Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * add_bos Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * Transformer decoder with conditioning for canary (#8091) * initial commit for multi-task conf-enc transf-dec for canary Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> * removing decoder states caching during training Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- 
Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> Co-authored-by: Krishna Puvvada <kpuvvada@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Option to limit the number of open streams (#8095) * audio signal support in multi Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * update asr evaluator Signed-off-by: stevehuang52 <heh@nvidia.com> * fix from https://github.com/NVIDIA/NeMo/commit/fcc0f9f6ff7947c3c7fba3ed17d8ec8af6391397 and https://github.com/NVIDIA/NeMo/commit/f97c9016e6438ca4174b66bf9c3e248b28197aaa Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * transcribe fn for Canary models (#8110) * improve readability Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> * adding context in transcribe function for ConfTransfModels Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> * supporting relative paths in transcribe function for canary Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> * removing cuts.sort_by_duration in __getitem__ to maintain manifest order during inference Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> Co-authored-by: Krishna Puvvada <kpuvvada@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * update for evaluation Signed-off-by: stevehuang52 <heh@nvidia.com> * update for eval Signed-off-by: stevehuang52 <heh@nvidia.com> * update for evaluation Signed-off-by: stevehuang52 <heh@nvidia.com> * fix bleu Signed-off-by: stevehuang52 <heh@nvidia.com> * fix typo Signed-off-by: stevehuang52 <heh@nvidia.com> * Add missing audio_filepath validation for Canary (#8119) * Add missing audio_filepath validation for Canary * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: 
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * add default concat_sampling_probabilities Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support lhotse dataset in speechllm Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * bypass get_iterator_k_split Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * tmp fix Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * try to use fixed batch with megatron Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * add batch logging Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support unfrozen llm Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * Create README.md Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> * Update README.md Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> * Update README.md Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> * update Signed-off-by: stevehuang52 <heh@nvidia.com> * rename Signed-off-by: stevehuang52 <heh@nvidia.com> * add llama prompt template Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * update and refactor Signed-off-by: stevehuang52 <heh@nvidia.com> * support sample alpha Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support lhotse validation set and canary pretrained ckpt with pseudo label Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * make sure backward compatibility Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * remove pad Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * make sure asr_model is frozen Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support greedy decoding Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * valid on lhotse Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * fix multi dataloader in val case for lhotse SALM; add default data names; keep asr model tokenizer by default to enable adding canary dataset 
Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * remove the bruteforce _keep_special_tokens implementation Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * decoding_ratio and convert_canary_prompt_to_text support Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * canary_tokens_augment_ratio Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * debug Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * bug fix Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * fix lhotse based eval of llama canary model Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support some overwrite for eval Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support zero shot prompt in training Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support cross attention based SALM Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support cross attention based SALM Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * fix for batch train/valid of cross Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support learnable gate and plotting Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support using pseudo label in prompt rather than cross att Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * bug fix for perception cfg and context tokens shift Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * DentityConnectorsAdd Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * fix ckpt saving Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * Support RnnGatedCrossAttention Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * add include_ffw and fix _optimizer_param_groups for all unfrozen run Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support grad acc when using bucket Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support TransformerCrossAttention Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support ProjectTransformerCrossAttention 
Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support ++model.use_am_tokenizer ++model.override_vocab_size ++model.override.hidden_size Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support question set on val without canary Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support load_audio_encoder and wip in optim_param_groups Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * minor fix for audio pretrain model init Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * simplify canary_tokens_augment Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * use question in the manifest if it exists Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support dataset weighting for non tar Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * Update SpeechLLM code (#8475) * add pleasefixme marker for potential failed nightly tests. (#7678) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * Add new text segmentation library for better TTS quality (#7645) * Add new text segmentation library for better TTS quality * Update zh_cn_pinyin.py added detailed instruction on how to install pkuseg. Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * Update requirements_tts.txt remove pkuseg as the default dependency of NeMo TTS, and instead, direct users to manually install pkuseg if they really need. 
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> --------- Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * Create PrecisionPlugin for megatron_ckpt_to_nemo.py trainer (#7767) (#7774) * Create PrecisionPlugin for megatron_ckpt_to_nemo.py trainer * Add ddp_find_unused_parameters_true for punctuation_capitalization_train_evaluate.py * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add '32-true' for precision values --------- Signed-off-by: Abhishree <abhishreetm@gmail.com> Signed-off-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com> Co-authored-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * fix(clustering_diarizer.py): fix typo (#7772) Signed-off-by: Jean-Louis Queguiner <jean-louis.queguiner@gadz.org> * fix(diarization-README): typo (#7771) Signed-off-by: Jean-Louis Queguiner <jean-louis.queguiner@gadz.org> * Fix bug wrt change decoding strategy for bpe models (#7762) (#7764) * Fix bug wrt change decoding strategy for bpe models * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: smajumdar <titu1994@gmail.com> Co-authored-by: Somshubra Majumdar <titu1994@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Remove incorrect extra argument for load_from_checkpoint_dir() (#7500) Signed-off-by: Robin Dong <robin.k.dong@gmail.com> Co-authored-by: Eric Harper <complex451@gmail.com> * Add nemo to mcore GPT conversion script (#7730) * add conversion script Signed-off-by: Chen Cui <chcui@nvidia.com> * [pre-commit.ci] auto fixes from 
pre-commit.com hooks for more information, see https://pre-commit.ci * remove references to 'ckpt' Signed-off-by: Chen Cui <chcui@nvidia.com> * add one more sanity check to make sure there is no unexpected keys in state dict Signed-off-by: Chen Cui <chcui@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * make cpu loading work Signed-off-by: Chen Cui <chcui@nvidia.com> * make script work for llama2 models Signed-off-by: Chen Cui <chcui@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * address code check Signed-off-by: Chen Cui <chcui@nvidia.com> * remove trainer precision (was for old sanity check) Signed-off-by: Chen Cui <chcui@nvidia.com> * fix script for llama2 model Signed-off-by: Chen Cui <chcui@nvidia.com> * remove commented code Signed-off-by: Chen Cui <chcui@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Chen Cui <chcui@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> * Fix bug in ConditionalInput: cat along the feature dim, not the batch dim (#7785) Signed-off-by: anferico <f.cariaggi4@gmail.com> * Add some docs and update scripts for ASR (#7790) * Add some docs and update scripts Signed-off-by: smajumdar <titu1994@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: smajumdar <titu1994@gmail.com> Signed-off-by: Somshubra Majumdar <titu1994@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * set context for text memmap to fork (#7784) * set context for text memmap to fork Signed-off-by: arendu <adithyare@nvidia.com> * typo Signed-off-by: arendu <adithyare@nvidia.com> --------- Signed-off-by: arendu 
<adithyare@nvidia.com> * add training with multiple audios Signed-off-by: stevehuang52 <heh@nvidia.com> * Support flash decoding (#7744) * Add flash-decoding Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com> * Fix Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com> --------- Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Yang Zhang <yzhang123@users.noreply.github.com> * Change accelerator to 'auto' in nlp_checkpoint_port.py (#7761) * Change accelerator to 'auto' in nlp_checkpoint_port.py (#7747) * Change accelerator to auto Signed-off-by: Abhishree <abhishreetm@gmail.com> * Pass omegaconf object to trainer in nlp_checkpoint_port.py Signed-off-by: Abhishree <abhishreetm@gmail.com> * Pass omegaconf object to trainer in export.py Signed-off-by: Abhishree <abhishreetm@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Abhishree <abhishreetm@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Signed-off-by: Abhishree <abhishreetm@gmail.com> * docs: fix typos (#7758) Signed-off-by: shuoer86 <129674997+shuoer86@users.noreply.github.com> Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Abhishree <abhishreetm@gmail.com> * Snake act (#7736) Signed-off-by: Abhishree <abhishreetm@gmail.com> * Update gpt_dataset.py (#6963) Signed-off-by: Xin Yao <xiny@nvidia.com> Co-authored-by: Sandeep Subramanian 
<sandeep.subramanian.1@umontreal.ca> Signed-off-by: Abhishree <abhishreetm@gmail.com> --------- Signed-off-by: Abhishree <abhishreetm@gmail.com> Signed-off-by: shuoer86 <129674997+shuoer86@users.noreply.github.com> Signed-off-by: Xin Yao <xiny@nvidia.com> Co-authored-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Co-authored-by: shuoer86 <129674997+shuoer86@users.noreply.github.com> Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: Nithin Rao <nithinrao.koluguri@gmail.com> Co-authored-by: Xin Yao <yaox12@outlook.com> Co-authored-by: Sandeep Subramanian <sandeep.subramanian.1@umontreal.ca> * Add selection criteria for reference audios in the `GlobalStyleToken` submodule (#7788) * add selection criteria for reference audios Signed-off-by: anferico <f.cariaggi4@gmail.com> * Update configuration files Signed-off-by: anferico <f.cariaggi4@gmail.com> * add informative comment in config files Signed-off-by: anferico <f.cariaggi4@gmail.com> * sample random index for reference audio selection Signed-off-by: anferico <f.cariaggi4@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: anferico <f.cariaggi4@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * update text server to support compute logprobs (#7733) * update text server to support compute logprobs * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix typo --------- Signed-off-by: Zhilin Wang <zhilinw@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * add multi-layer feat extract and fix random question insertion Signed-off-by: stevehuang52 <heh@nvidia.com> * Configure 
MCore logger (#7781) Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com> * Revert "PEFT eval fix (#7626) (#7638)" (#7693) This reverts commit f03dd660bd26d88fd569e76c6f74b83a7c203ff9. * remove TN from ctc_segm tut (#7807) Signed-off-by: Evelina <ebakhturina@nvidia.com> * [TTS] Support audio offsets in TTS data loaders (#7156) * [TTS] Support audio offsets in TTS data loaders Signed-off-by: Ryan <rlangman@nvidia.com> * [TTS] Change docstring mentions of .pt to .npy Signed-off-by: Ryan <rlangman@nvidia.com> --------- Signed-off-by: Ryan <rlangman@nvidia.com> * Update Apex install command in Dockerfile (#7794) (#7804) * move core install to /workspace (#7706) * update apex install in dockerfile * use fetch head --------- Signed-off-by: Abhinav Khattar <aklife97@gmail.com> Signed-off-by: eharper <eharper@nvidia.com> Co-authored-by: Eric Harper <complex451@gmail.com> Co-authored-by: Abhinav Khattar <aklife97@gmail.com> * fix typo Signed-off-by: stevehuang52 <heh@nvidia.com> * Nemo to HF converter for LLaMA model (#7770) * Create config_llama_truncate.yaml Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> * Add files via upload Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> * Update convert_nemo_llama_to_hf.py Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update config_llama_truncate.yaml Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> * Update convert_nemo_llama_to_hf.py Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update convert_nemo_llama_to_hf.py Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> * clean up trainer * remove dependency on yaml config. load config from nemo file instead. 
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * enable ckpt saving into other precision formats * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * support 70b + cleanup qkv slice logic * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix bug * move hf model folder code from comment to function and add instruction to run * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> Signed-off-by: Chen Cui <chcui@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Co-authored-by: Chen Cui <chcui@nvidia.com> * Save best NeMo model only when necessary (#7836) Signed-off-by: Ante Jukić <ajukic@nvidia.com> * add guard if its a distributed checkpoint (#7845) Signed-off-by: Gerald Shen <geshen@nvidia.com> * Fix tn duplex (#7808) * fix duplex tn infer Signed-off-by: Evelina <ebakhturina@nvidia.com> * fix typo Signed-off-by: Evelina <ebakhturina@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix TN docs Signed-off-by: Evelina <ebakhturina@nvidia.com> --------- Signed-off-by: Evelina <ebakhturina@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Update transformers cache on Jenkins (#7854) * update transformers cache Signed-off-by: eharper <eharper@nvidia.com> * update Signed-off-by: eharper <eharper@nvidia.com> * add cd Signed-off-by: eharper <eharper@nvidia.com> --------- Signed-off-by: eharper <eharper@nvidia.com> * Update README.rst for container update (#7844) Signed-off-by: fayejf <36722593+fayejf@users.noreply.github.com> * Add support for finetuning with 
huggingface datasets (#7834) * add finetune with huggingface dataset Signed-off-by: stevehuang52 <heh@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update yaml Signed-off-by: stevehuang52 <heh@nvidia.com> * update Signed-off-by: stevehuang52 <heh@nvidia.com> * update and refactor Signed-off-by: stevehuang52 <heh@nvidia.com> * add extrac hf text and update Signed-off-by: stevehuang52 <heh@nvidia.com> * update and refactor Signed-off-by: stevehuang52 <heh@nvidia.com> * move dataset dependency to common Signed-off-by: stevehuang52 <heh@nvidia.com> * add docstring Signed-off-by: stevehuang52 <heh@nvidia.com> * Add to Dics Signed-off-by: Nithin Rao Koluguri <nithinraok> * add ci test Signed-off-by: Nithin Rao Koluguri <nithinraok> * add max steps in jenkins Signed-off-by: Nithin Rao Koluguri <nithinraok> * reduce max steps Signed-off-by: Nithin Rao Koluguri <nithinraok> * jenkins test Signed-off-by: Nithin Rao Koluguri <nithinraok> * add bs=2 Signed-off-by: Nithin Rao Koluguri <nithinraok> --------- Signed-off-by: stevehuang52 <heh@nvidia.com> Signed-off-by: Nithin Rao Koluguri <nithinraok> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Nithin Rao Koluguri <nithinraok> Co-authored-by: Nithin Rao <nithinrao.koluguri@gmail.com> * Multimodal merge (#7728) * ControlNet TRT export * Final MR before release * SD2 update * Fixed export issue * Fix for instruct p2p and reformat * Fix SD export issue * Add nemo clip export for DB * Fix ins pix2pix * fix sd2 config * [Mingyuan Ma] BF16 and SD conversion script * [Imagen] NHWC Feature * Fix .nemo loading issue for NeMo CLIP in SD * NeMo r1.20.0 Multimodal Merge * fix the inductor issue in inference * Fix inductor loading .nemo issue * Add Neva Model Support * Imagen Optimizations * Neva inference code * NeMo TOT 1.21 to Internal/main * Update neva_inference.yaml * REBASING for latest code changes * 
Update internal/main to main tot * Parallel DDIM implementation * 1. Fixing indentation bug. (#7352) Signed-off-by: Micha Livne <mlivne@nvidia.com> * NeMo MCore llama2 support + MCore PEFT adapters (#7299) * start adding gpt from megatron core path Signed-off-by: ericharper <complex451@gmail.com> * set model parallel config Signed-off-by: ericharper <complex451@gmail.com> * use model parallel config object Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update args Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * set vp size to none if it is 1 Signed-off-by: ericharper <complex451@gmail.com> * set vp size to none if it is 1 Signed-off-by: ericharper <complex451@gmail.com> * add TransformerConfig Signed-off-by: ericharper <complex451@gmail.com> * start updating to TransformerConfig Signed-off-by: ericharper <complex451@gmail.com> * add todo Signed-off-by: ericharper <complex451@gmail.com> * revert to model parallel config Signed-off-by: ericharper <complex451@gmail.com> * add hidden_size to model_parallel_config Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove imports Signed-off-by: ericharper <complex451@gmail.com> * revert Signed-off-by: ericharper <complex451@gmail.com> * remove import Signed-off-by: ericharper <complex451@gmail.com> * small clean up Signed-off-by: ericharper <complex451@gmail.com> * update hidden size in peft base model, add mcore commit to jenkins Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update module args Signed-off-by: ericharper <complex451@gmail.com> * add config obj to flash attention tests Signed-off-by: 
ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove args Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove sequence parallel arg Signed-off-by: ericharper <complex451@gmail.com> * update args Signed-off-by: ericharper <complex451@gmail.com> * add config to self Signed-off-by: ericharper <complex451@gmail.com> * update args Signed-off-by: ericharper <complex451@gmail.com> * update args Signed-off-by: ericharper <complex451@gmail.com> * update args Signed-off-by: ericharper <complex451@gmail.com> * add config to test Signed-off-by: ericharper <complex451@gmail.com> * get hidden_size from config Signed-off-by: ericharper <complex451@gmail.com> * add try except Signed-off-by: ericharper <complex451@gmail.com> * use default Signed-off-by: ericharper <complex451@gmail.com> * update config with hidden size Signed-off-by: ericharper <complex451@gmail.com> * remove arg Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * comment out jenkins test Signed-off-by: ericharper <complex451@gmail.com> * revert import Signed-off-by: ericharper <complex451@gmail.com> * build transformer config Signed-off-by: ericharper <complex451@gmail.com> * add model to provider func Signed-off-by: ericharper <complex451@gmail.com> * update forward and float16 wrapper Signed-off-by: ericharper <complex451@gmail.com> * instantiate model parallel config after init model parallel Signed-off-by: ericharper <complex451@gmail.com> * set virtual rank Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add GQA config to megatron gpt model (#7096) * Add GQA config in gpt config file Signed-off-by: jasonwan 
<jasonwan@nvidia.com> * Verify mcore is enabled when using GQA Signed-off-by: jasonwan <jasonwan@nvidia.com> --------- Signed-off-by: jasonwan <jasonwan@nvidia.com> * revert Signed-off-by: ericharper <complex451@gmail.com> * mcore llama2 ckpt conversion & small fix Signed-off-by: jasonwan <jasonwan@nvidia.com> * Add inference & sft config by Hongbin Co-authored-by: Hongbin Liu <hongbinl@nvidia.com> Signed-off-by: jasonwan <jasonwan@nvidia.com> * fix config Signed-off-by: jasonwan <jasonwan@nvidia.com> * add inference param. update TP/PP script to support mcore gpt Signed-off-by: jasonwan <jasonwan@nvidia.com> * p-tuning Signed-off-by: jasonwan <jasonwan@nvidia.com> * modify ckpt conversion script (adding model cast) Signed-off-by: jasonwan <jasonwan@nvidia.com> * ckpt conversion use relative path for config Signed-off-by: jasonwan <jasonwan@nvidia.com> * start adding gpt from megatron core path Signed-off-by: ericharper <complex451@gmail.com> * set model parallel config Signed-off-by: ericharper <complex451@gmail.com> * use model parallel config object Signed-off-by: ericharper <complex451@gmail.com> * update args Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * set vp size to none if it is 1 Signed-off-by: ericharper <complex451@gmail.com> * set vp size to none if it is 1 Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add TransformerConfig Signed-off-by: ericharper <complex451@gmail.com> * start updating to TransformerConfig Signed-off-by: ericharper <complex451@gmail.com> * add todo Signed-off-by: ericharper <complex451@gmail.com> * revert to model parallel config Signed-off-by: ericharper <complex451@gmail.com> * add hidden_size to model_parallel_config Signed-off-by: ericharper <complex451@gmail.com> * remove imports Signed-off-by: ericharper 
<complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove import Signed-off-by: ericharper <complex451@gmail.com> * small clean up Signed-off-by: ericharper <complex451@gmail.com> * update hidden size in peft base model, add mcore commit to jenkins Signed-off-by: ericharper <complex451@gmail.com> * update module args Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add config obj to flash attention tests Signed-off-by: ericharper <complex451@gmail.com> * remove args Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove sequence parallel arg Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update args Signed-off-by: ericharper <complex451@gmail.com> * add config to self Signed-off-by: ericharper <complex451@gmail.com> * update args Signed-off-by: ericharper <complex451@gmail.com> * update args Signed-off-by: ericharper <complex451@gmail.com> * update args Signed-off-by: ericharper <complex451@gmail.com> * add config to test Signed-off-by: ericharper <complex451@gmail.com> * get hidden_size from config Signed-off-by: ericharper <complex451@gmail.com> * add try except Signed-off-by: ericharper <complex451@gmail.com> * use default Signed-off-by: ericharper <complex451@gmail.com> * update config with hidden size Signed-off-by: ericharper <complex451@gmail.com> * remove arg Signed-off-by: ericharper <complex451@gmail.com> * comment out jenkins test Signed-off-by: ericharper <complex451@gmail.com> * revert import Signed-off-by: ericharper <complex451@gmail.com> * remove optimizer_idx Signed-off-by: eharper <eharper@nvidia.com> * prefetch num microbatches Signed-off-by: eharper 
<eharper@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * start adding gpt from megatron core path Signed-off-by: ericharper <complex451@gmail.com> * set model parallel config Signed-off-by: ericharper <complex451@gmail.com> * use model parallel config object Signed-off-by: ericharper <complex451@gmail.com> * update args Signed-off-by: ericharper <complex451@gmail.com> * fix for p-tuning sequence parallel Signed-off-by: jasonwan <jasonwan@nvidia.com> * support SFT/distOpt mcore (#7207) * add inference param. update TP/PP script to support mcore gpt * p-tuning Signed-off-by: jasonwan <jasonwan@nvidia.com> * change layer names for SFT Signed-off-by: Hongbin Liu <hongbinl@nvidia.com> * fix bug in SFT Signed-off-by: Hongbin Liu <hongbinl@nvidia.com> --------- Signed-off-by: jasonwan <jasonwan@nvidia.com> Signed-off-by: Hongbin Liu <hongbinl@nvidia.com> Co-authored-by: Hongbin Liu <hongbinl@nvidia.com> Co-authored-by: jasonwan <jasonwan@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * start updating to TransformerConfig Signed-off-by: ericharper <complex451@gmail.com> * revert to model parallel config Signed-off-by: ericharper <complex451@gmail.com> * add hidden_size to model_parallel_config Signed-off-by: ericharper <complex451@gmail.com> * remove imports Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update module args Signed-off-by: ericharper <complex451@gmail.com> * add config to self Signed-off-by: ericharper <complex451@gmail.com> * build transformer config Signed-off-by: ericharper <complex451@gmail.com> * add model to provider func Signed-off-by: ericharper <complex451@gmail.com> * update forward and float16 wrapper Signed-off-by: ericharper <complex451@gmail.com> * instantiate model parallel config after init model 
parallel Signed-off-by: ericharper <complex451@gmail.com> * set virtual rank Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add GQA config to megatron gpt model (#7096) * Add GQA config in gpt config file Signed-off-by: jasonwan <jasonwan@nvidia.com> * Verify mcore is enabled when using GQA Signed-off-by: jasonwan <jasonwan@nvidia.com> --------- Signed-off-by: jasonwan <jasonwan@nvidia.com> * revert Signed-off-by: ericharper <complex451@gmail.com> * remove import Signed-off-by: eharper <eharper@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * rollback model cast for p-tuning Signed-off-by: jasonwan <jasonwan@nvidia.com> * update for dist adam Signed-off-by: eharper <eharper@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * use get_gpt_module_list Signed-off-by: eharper <eharper@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update ckpt conversion script Signed-off-by: jasonwan <jasonwan@nvidia.com> * ptl2.0 patch for llama config Signed-off-by: jasonwan <jasonwan@nvidia.com> * add plugins to trainer in scripts Signed-off-by: jasonwan <jasonwan@nvidia.com> * fix activation checkpointing mcore Signed-off-by: jasonwan <jasonwan@nvidia.com> * fix variable names Signed-off-by: jasonwan <jasonwan@nvidia.com> * overwrite normalization type for mcore/te Signed-off-by: jasonwan <jasonwan@nvidia.com> * Update megatron_llama_sft.yaml Signed-off-by: Jason Wang <jasonwan@nvidia.com> * add PEFT adapter support for mcore gpt path (#7276) * implementation for mcore adapter/mxins Signed-off-by: jasonwan <jasonwan@nvidia.com> * small fix for lora and ptuning Signed-off-by: jasonwan <jasonwan@nvidia.com> * support layerwise peft Signed-off-by: jasonwan <jasonwan@nvidia.com> * 
support multiple target layers Signed-off-by: jasonwan <jasonwan@nvidia.com> * support lora GQA Signed-off-by: jasonwan <jasonwan@nvidia.com> * support amp O2 Signed-off-by: jasonwan <jasonwan@nvidia.com> * revert & more O2 fix Signed-off-by: jasonwan <jasonwan@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * lora inject to attention Signed-off-by: jasonwan <jasonwan@nvidia.com> * support …

…rategy (NVIDIA#9387) * Integrating mcore's DistributedDataParallel into MegatronStrategy Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Apply isort and black reformatting Signed-off-by: marcromeyn <marcromeyn@users.noreply.github.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Apply ddp-hooks from pytorch only when needed Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * bugfix if using mcore distOpt with sft (#9356) * bugfix if using mcore distOpt Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * Apply isort and black reformatting Signed-off-by: akoumpa <akoumpa@users.noreply.github.com> --------- Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> Signed-off-by: akoumpa <akoumpa@users.noreply.github.com> Co-authored-by: akoumpa <akoumpa@users.noreply.github.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * fix typo infer_seq_lenght -> infer_seq_length (#9370) Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> Co-authored-by: Marc Romeyn <mromeijn@nvidia.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Rachitg/ag (#9083) * Rachitg/ag (#9081) * disable overlap for qkv Signed-off-by: Rachit Garg <rachitg@login-eos01.eos.clusters.nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * bug fix * bugfix --------- Signed-off-by: Rachit Garg <rachitg@login-eos01.eos.clusters.nvidia.com> Signed-off-by: Rachit Garg <rachitgarg91@gmail.com> Co-authored-by: Rachit Garg <rachitg@login-eos01.eos.clusters.nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Apply isort and black reformatting Signed-off-by: michal2409 <michal2409@users.noreply.github.com> --------- Signed-off-by: Rachit Garg <rachitg@login-eos01.eos.clusters.nvidia.com> Signed-off-by: Rachit Garg <rachitgarg91@gmail.com> Signed-off-by: michal2409 <michal2409@users.noreply.github.com> Co-authored-by: Rachit Garg 
<rachitgarg91@gmail.com> Co-authored-by: Rachit Garg <rachitg@login-eos01.eos.clusters.nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Michal Futrega <mfutrega@nvidia.com> Co-authored-by: michal2409 <michal2409@users.noreply.github.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Adding the original change made for label_models (#9377) (#9378) Signed-off-by: Taejin Park <tango4j@gmail.com> Co-authored-by: Taejin Park <tango4j@gmail.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Dgalvez/fix greedy batch strategy name r2.0.0rc0 (#9243) (#9253) * Lazily warn about using greedy strategy instead of greedy_batch strategy. Previously, the warning would often run spuriously, since several existing code paths simply call "change_decoding_strategy()" after having first initialized a Module, rather than changing the config before initializing the Module. This can be confusing. The only problem I can see with this is that using logging inside a forward() method might interfere with some compiler toolkits like Torchscript or thunder.compile. Presumably it would be easy to add a conditional statement to avoid this statement in a compiler context if necessary. Signed-off-by: Daniel Galvez <dgalvez@nvidia.com> Co-authored-by: Daniel Galvez <galv@users.noreply.github.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Update README.rst (#9393) Revised content per https://gitlab-master.nvidia.com/nemo-framework-tme/documentation/-/issues/25. Also removed reference to NIMs in LLMs and MMs Deployment and Optimization. It should be NVIDIA NeMo Microservices and not NIM. Removed nemo:24.03.framework and nemo:24.01.speech in Docker Containers section and replaced with 24.05 . Please verify all changes. 
Signed-off-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * a2a fix removed tp world size and group from init (#8944) (#8952) Signed-off-by: Anmol Gupta <14880251+anmolgupt@users.noreply.github.com> Co-authored-by: anmolgupt <14880251+anmolgupt@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Add config option for FP32 embedding grads (#8953) * Add config option for FP32 embedding grads (#8946) Signed-off-by: Tim Moon <tmoon@nvidia.com> * Apply isort and black reformatting Signed-off-by: ericharper <ericharper@users.noreply.github.com> --------- Signed-off-by: Tim Moon <tmoon@nvidia.com> Signed-off-by: ericharper <ericharper@users.noreply.github.com> Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Co-authored-by: ericharper <ericharper@users.noreply.github.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Changes to enable CUDA graph for LLM (#8955) * Changes to enable CUDA graph for LLM (#8751) * Use next instead of get_batch Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * CUDA graph changes Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Change to enable CG with weight caching Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Revert "Use next instead of get_batch" This reverts commit 0021bb444cdd1b27674fc0cfea909c1a42475336. Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Copy jbaczek/mcore_parallel_state_api_change branch leaving out changes to nemo/export/quantize/quantizer.py Signed-off-by: Jan Baczek <jbaczek@nvidia.com> Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Revert "Copy jbaczek/mcore_parallel_state_api_change branch leaving out changes to nemo/export/quantize/quantizer.py" This reverts commit b4f736ed2b39f6c48d2868ac3febb82c763ab3fb. 
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Remove skip_weight_update argument Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Bug fix + cleanup Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Cleanup Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Use new TE API for FP8 Param transpose Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Change config param cuda_graph to enable_cuda_graph Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Enable TE RNGStatesTracker through config Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Change te_rng_tracker to use_te_rng_tracker Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * FP8 weight transpose handled inside TE Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Cleanup Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Revert "Revert "Copy jbaczek/mcore_parallel_state_api_change branch leaving out changes to nemo/export/quantize/quantizer.py"" This reverts commit e31862481216f9adf7fa584a0c0262916c935639. 
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Fix merge conflicts Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Fix merge conflicts Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Fix merge conflicts Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> --------- Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> Signed-off-by: Jan Baczek <jbaczek@nvidia.com> Co-authored-by: Jaemin Choi <jaeminc@nvidia.com> Co-authored-by: Jan Baczek <jbaczek@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Apply isort and black reformatting Signed-off-by: ericharper <ericharper@users.noreply.github.com> --------- Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> Signed-off-by: Jan Baczek <jbaczek@nvidia.com> Signed-off-by: ericharper <ericharper@users.noreply.github.com> Co-authored-by: vasunvidia <108759426+vasunvidia@users.noreply.github.com> Co-authored-by: Jaemin Choi <jaeminc@nvidia.com> Co-authored-by: Jan Baczek <jbaczek@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Co-authored-by: ericharper <ericharper@users.noreply.github.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Enhance Distributed Adam (#9051) * Enhance Distributed Adam (#9037) * Fix deprecated env. Signed-off-by: Wil Kong <alpha0422@gmail.com> * Use user desired value for distributed adam. Signed-off-by: Wil Kong <alpha0422@gmail.com> * Preserve memory format in parameter buffer of distributed adam. Signed-off-by: Wil Kong <alpha0422@gmail.com> * Fix the contiguous_param_buffer bug about bprop overlap and redundant copy after all-gather. Signed-off-by: Wil Kong <alpha0422@gmail.com> * Provide API to lock SHArP tree for distributed adam within nodes. 
Signed-off-by: Wil Kong <alpha0422@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Wil Kong <alpha0422@gmail.com> --------- Signed-off-by: Wil Kong <alpha0422@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Apply isort and black reformatting Signed-off-by: ericharper <ericharper@users.noreply.github.com> --------- Signed-off-by: Wil Kong <alpha0422@gmail.com> Signed-off-by: ericharper <ericharper@users.noreply.github.com> Co-authored-by: Wil Kong <alpha0422@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Co-authored-by: ericharper <ericharper@users.noreply.github.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Force diarizer to use CUDA if cuda is available and if device=None. (#9380) (#9390) * Fixed clustering diarizer to load MSDD to GPU by default if cuda on * Fixed clustering diarizer to load MSDD to GPU by default if cuda on * Apply isort and black reformatting --------- Signed-off-by: Taejin Park <tango4j@gmail.com> Signed-off-by: tango4j <tango4j@users.noreply.github.com> Co-authored-by: Taejin Park <tango4j@gmail.com> Co-authored-by: tango4j <tango4j@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * ci: Properly catch failed tests by introduction of workflow templates (#9324) * ci: Refactor tests into reusable template Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * ci: Fix sending alerts on failure Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * fix Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * fix Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * disable slack Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * fix Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * fix alerting Signed-off-by: Oliver Koenig 
<okoenig@nvidia.com> * fix Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * ci: Increase timeout for `L0_Unit_Tests_CPU` Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * fix Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * increase timeout Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * increase timeout for `Speech_Checkpoints_tests` Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * improve readability Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * fix Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * fix Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * fix Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * test Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * test Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * fix Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * fix Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * finalize Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * fix Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * add missing rm statement for `L2_PTQ_Llama2_Export_Only` Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * all your comments are belong to us Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * remove github output Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * revive more comments Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * add L2: ASR dev run - part two Signed-off-by: Oliver Koenig <okoenig@nvidia.com> --------- Signed-off-by: Oliver Koenig <okoenig@nvidia.com> Signed-off-by: Pablo Garay <palenq@gmail.com> Co-authored-by: Pablo Garay <palenq@gmail.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Fix T5 G2P Input and Output Types (#9224) (#9269) * fix t5 g2p model * Apply isort and black reformatting --------- Signed-off-by: Jason <jasoli@nvidia.com> Signed-off-by: blisc <blisc@users.noreply.github.com> Co-authored-by: Jason <jasoli@nvidia.com> Co-authored-by: blisc <blisc@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Signed-off-by: Marc Romeyn 
<mromeijn@nvidia.com> * Use model-cast-to-bfloat16 rather than AMP-to-bfloat16 for inference. (#9198) * Fix the "cast ping pong" problem when we run AMP inference. This has been tested only for Parakeet-CTC-1.1B right now. This problem certainly exists elsewhere. Automatic mixed precision and inference do not play well together. First, automatic mixed precision was created back when neural networks were much simpler. In particular, they did not have softmax and layer norm as frequent operations. In the era of transformers, softmax and layer norm are very common. AMP will unconditionally produce fp32 outputs from these operations, even if their inputs are fp16. See here: https://pytorch.org/docs/stable/amp.html#cuda-ops-that-can-autocast-to-float32 This is no longer necessary, now that layer norm does accumulation in fp32 in PyTorch, even if the input is fp16: https://github.com/pytorch/pytorch/issues/66707 Do inference by casting the model to bfloat16, not by using AMP. Do feature preprocessing in float32 for accuracy. Warn if someone tries to input a non-float32 tensor. Always create the output in the type the rest of the model expects. Sort manifests by duration. Signed-off-by: Daniel Galvez <dgalvez@nvidia.com> * Always cast softmax inputs to float32 when in training mode. While we don't need this for accurate results in b/float16, this is a safety precaution to make sure that training accuracy does not regress. 
Signed-off-by: Daniel Galvez <dgalvez@nvidia.com> --------- Signed-off-by: Daniel Galvez <dgalvez@nvidia.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Huvu/rag pipeline citest (#9384) * huvu/NeMo_rag_citest first commit * adding llama-index to dependency * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * adjusting data/models path in ci-test to dependency * putting llama-index to optional * update cicd-main.yml --------- Co-authored-by: Huy Vu2 <huvu@login-eos01.eos.clusters.nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Re-org export code (#9353) * reorg the export code Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * Apply isort and black reformatting Signed-off-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com> * replaced log with raise Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * add converter and loader folders Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * move nemo_ckpt_convert into the converter folder Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * move nemo_file into loader folder Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * reorg converter Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * Apply isort and black reformatting Signed-off-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com> * continue to reorg converter Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * Apply isort and black reformatting Signed-off-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com> * continue to reorg Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * move nemo file back into nemo folder Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * renamed nemo folder to nemo_ckpt_loader Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * remove unused function Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * removed nemo file Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * 
Apply isort and black reformatting Signed-off-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com> * moved a function to tensorrt_llm_run file Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * Apply isort and black reformatting Signed-off-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com> * Remove unused imports Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * Apply isort and black reformatting Signed-off-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com> * import csv added Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> --------- Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> Signed-off-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com> Co-authored-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * ci: Fix `L2_Segmentation_Tool_Parallel_ctc_segmentation_test_L2_Eng_CitriNet_with_wav` (#9399) Signed-off-by: Oliver Koenig <okoenig@nvidia.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * disable overlap for qkv (#9079) * disable overlap for qkv (#9072) * disable overlap for qkv Signed-off-by: Rachit Garg <rachitg@login-eos01.eos.clusters.nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Rachit Garg <rachitg@login-eos01.eos.clusters.nvidia.com> Co-authored-by: Rachit Garg <rachitg@login-eos01.eos.clusters.nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Apply isort and black reformatting Signed-off-by: michal2409 <michal2409@users.noreply.github.com> --------- Signed-off-by: Rachit Garg <rachitg@login-eos01.eos.clusters.nvidia.com> Signed-off-by: michal2409 <michal2409@users.noreply.github.com> Signed-off-by: Michal Futrega <mfutrega@nvidia.com> Co-authored-by: Rachit Garg <rachitgarg91@gmail.com> Co-authored-by: Rachit Garg <rachitg@login-eos01.eos.clusters.nvidia.com> Co-authored-by: pre-commit-ci[bot] 
<66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Michal Futrega <mfutrega@nvidia.com> Co-authored-by: michal2409 <michal2409@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Fix circular import for MM dataprep notebook (#9287) (#9292) * update launcher name and fix mm circular import * Apply isort and black reformatting --------- Signed-off-by: Chen Cui <chcui@nvidia.com> Signed-off-by: cuichenx <cuichenx@users.noreply.github.com> Co-authored-by: Chen Cui <chcui@nvidia.com> Co-authored-by: cuichenx <cuichenx@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * add check if num layers is divisible by pp size (#9208) (#9298) * add check if num_layers % pp == 0 * Apply isort and black reformatting * move num_layers / pp check to build_transformer_config --------- Signed-off-by: dimapihtar <dpihtar@gmail.com> Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com> Co-authored-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com> Co-authored-by: dimapihtar <dimapihtar@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Add HF siglip vision encoder (#9185) * temp save Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * temp save 2 Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * update code Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * enable seq packing Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * fix neva and clip Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Enable parallel seq packing algo and few other fixes Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Pipeline parallel support Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Update data preprocess Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * fix few pp issues Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * enable sequence packing w/ PP 
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Fix cu_seqlens in inputs Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * add assert Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Depend on PP to decide whether do padding Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add docstring Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Fix few evaluation issues Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Fix few PP evaluation issues Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Address comments Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add llama3 template Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * address comments Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Fix license Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Fix llama3 Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Few fixes Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Few neva bugs Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Few neva bugs Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Few neva bugs Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * llama3 inference fix Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Force vision encoder to run in fp32 Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Revert "Force vision encoder to run in fp32" This reverts commit 9d2160d96cb3e2a27a18538950ef43b4482c04da. 
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Try adding distributed format of checkpoint Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Allow dist checkpoint to be non-strict Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Fix Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Some fixes for PP + dist ckpt in Neva Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * fix peft Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * few fixes for lora Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * checkpoint updates Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <yaoyu-33@users.noreply.github.com> * bug fix Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Add HF siglip vision encoder Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * handle steerlm label in nv_dpo template Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * Add neva dist checkpoint converter Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <yaoyu-33@users.noreply.github.com> * fix CLEAN RESPONSE logic to not use last EOS Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * strip extra_id_1 from clean response Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * change inference time image processor Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * resolve comments Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * remove open_clip vision encoder for siglip Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * update neva dist ckpt apis Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <yaoyu-33@users.noreply.github.com> * fix return Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * resolve CLEAN RESPONSE multiturn issue Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * code format Signed-off-by: 
HuiyingLi <willwin.lee@gmail.com> * fixes for isort Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * refac image processor loading to util Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * black and isort Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * move crop size assertion Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * few neva fixes Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com> --------- Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Signed-off-by: yaoyu-33 <yaoyu-33@users.noreply.github.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com> Co-authored-by: yaoyu-33 <yaoyu.094@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: yaoyu-33 <54727607+yaoyu-33@users.noreply.github.com> Co-authored-by: yaoyu-33 <yaoyu-33@users.noreply.github.com> Co-authored-by: Pablo Garay <palenq@gmail.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * [Nemo CICD] timeouts fix (#9407) * timeouts fix * timeouts fix Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Removing un-used ModelConfig class (#9389) Co-authored-by: Chen Cui <chcui@nvidia.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Extend multimodal/speech_llm with lhotse, t5 and bestow supports (#9169) * Fixes * Docs fix * Add support for custom NeMo fields in Lhotse-NeMo adapters (attach to cut.custom) * Add support for custom NeMo fields in Lhotse-NeMo adapters (attach to cut.custom) * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * support distributed_fused_adam Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support distributed_fused_adam Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * Add support for sharded NeMo manifest files * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see 
https://pre-commit.ci * support megatron_amp_O2 Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * Support heterogeneous sampling rates in non tarred NeMo manifests * migrate to PTL2.0 Signed-off-by: stevehuang52 <heh@nvidia.com> * clean up Signed-off-by: stevehuang52 <heh@nvidia.com> * update manifest util Signed-off-by: stevehuang52 <heh@nvidia.com> * Support multiple tokenizer/parser types, aggregate tokenizers, and custom language fields * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * agg and normal tokenizers actually work * Support weights for NeMo tarred manifests * Temporarily hardcoded pnc stripping/lowercasing * fix * make pnc hack configurable from the config and disabled by default * fix the hack * migrate to ptl2.1 to support multiple dataloaders Signed-off-by: stevehuang52 <heh@nvidia.com> * support encoder overwrite Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * update misc Signed-off-by: stevehuang52 <heh@nvidia.com> * fix eval and clean up Signed-off-by: stevehuang52 <heh@nvidia.com> * support add_sep for perception model Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * fix https://github.com/Lightning-AI/pytorch-lightning/issues/18803 Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * add_bos Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * Transformer decoder with conditioning for canary (#8091) * initial commit for multi-task conf-enc transf-dec for canary Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> * removing decoder states caching during training Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- 
Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> Co-authored-by: Krishna Puvvada <kpuvvada@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Option to limit the number of open streams (#8095) * audio signal support in multi Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * update asr evaluator Signed-off-by: stevehuang52 <heh@nvidia.com> * fix from https://github.com/NVIDIA/NeMo/commit/fcc0f9f6ff7947c3c7fba3ed17d8ec8af6391397 and https://github.com/NVIDIA/NeMo/commit/f97c9016e6438ca4174b66bf9c3e248b28197aaa Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * transcribe fn for Canary models (#8110) * improve readability Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> * adding context in transcribe function for ConfTransfModels Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> * supporting relative paths in transcribe function for canary Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> * removing cuts.sort_by_duration in __getitem__ to maintain manifest order during inference Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> Co-authored-by: Krishna Puvvada <kpuvvada@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * update for evaluation Signed-off-by: stevehuang52 <heh@nvidia.com> * update for eval Signed-off-by: stevehuang52 <heh@nvidia.com> * update for evaluation Signed-off-by: stevehuang52 <heh@nvidia.com> * fix bleu Signed-off-by: stevehuang52 <heh@nvidia.com> * fix typo Signed-off-by: stevehuang52 <heh@nvidia.com> * Add missing audio_filepath validation for Canary (#8119) * Add missing audio_filepath validation for Canary * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: 
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * add default concat_sampling_probabilities Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support lhotse dataset in speechllm Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * bypass get_iterator_k_split Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * tmp fix Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * try to use fixed batch with megatron Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * add batch logging Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support unfrozen llm Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * Create README.md Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> * Update README.md Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> * Update README.md Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> * update Signed-off-by: stevehuang52 <heh@nvidia.com> * rename Signed-off-by: stevehuang52 <heh@nvidia.com> * add llama prompt template Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * update and refactor Signed-off-by: stevehuang52 <heh@nvidia.com> * support sample alpha Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support lhotse validation set and canary pretrained ckpt with pseudo label Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * make sure backward compatibility Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * remove pad Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * make sure asr_model is frozen Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support greedy decoding Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * valid on lhotse Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * fix multi dataloader in val case for lhotse SALM; add default data names; keep asr model tokenizer by default to enable adding canary dataset 
Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * remove the bruteforce _keep_special_tokens implementation Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * decoding_ratio and convert_canary_prompt_to_text support Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * canary_tokens_augment_ratio Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * debug Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * bug fix Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * fix lhotse based eval of llama canary model Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support some overwrite for eval Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support zero shot prompt in training Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support cross attention based SALM Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support cross attention based SALM Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * fix for batch train/valid of cross Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support learnable gate and plotting Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support using pseudo label in prompt rather than cross att Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * bug fix for perception cfg and context tokens shift Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * DentityConnectorsAdd Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * fix ckpt saving Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * Support RnnGatedCrossAttention Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * add include_ffw and fix _optimizer_param_groups for all unfrozen run Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support grad acc when using bucket Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support TransformerCrossAttention Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support ProjectTransformerCrossAttention 
Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support ++model.use_am_tokenizer ++model.override_vocab_size ++model.override.hidden_size Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support question set on val without canary Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support load_audio_encoder and wip in optim_param_groups Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * minor fix for audio pretrain model init Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * simplify canary_tokens_augment Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * use question in the manifest if it exists Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support dataset weighting for non tar Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * Update SpeechLLM code (#8475) * add pleasefixme marker for potential failed nightly tests. (#7678) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * Add new text segmentation library for better TTS quality (#7645) * Add new text segmentation library for better TTS quality * Update zh_cn_pinyin.py added detailed instruction on how to install pkuseg. Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * Update requirements_tts.txt remove pkuseg as the default dependency of NeMo TTS, and instead, direct users to manually install pkuseg if they really need. 
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> --------- Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * Create PrecisionPlugin for megatron_ckpt_to_nemo.py trainer (#7767) (#7774) * Create PrecisionPlugin for megatron_ckpt_to_nemo.py trainer * Add ddp_find_unused_parameters_true for punctuation_capitalization_train_evaluate.py * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add '32-true' for precision values --------- Signed-off-by: Abhishree <abhishreetm@gmail.com> Signed-off-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com> Co-authored-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * fix(clustering_diarizer.py): fix typo (#7772) Signed-off-by: Jean-Louis Queguiner <jean-louis.queguiner@gadz.org> * fix(diarization-README): typo (#7771) Signed-off-by: Jean-Louis Queguiner <jean-louis.queguiner@gadz.org> * Fix bug wrt change decoding strategy for bpe models (#7762) (#7764) * Fix bug wrt change decoding strategy for bpe models * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: smajumdar <titu1994@gmail.com> Co-authored-by: Somshubra Majumdar <titu1994@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Remove incorrect extra argument for load_from_checkpoint_dir() (#7500) Signed-off-by: Robin Dong <robin.k.dong@gmail.com> Co-authored-by: Eric Harper <complex451@gmail.com> * Add nemo to mcore GPT conversion script (#7730) * add conversion script Signed-off-by: Chen Cui <chcui@nvidia.com> * [pre-commit.ci] auto fixes from 
pre-commit.com hooks for more information, see https://pre-commit.ci * remove references to 'ckpt' Signed-off-by: Chen Cui <chcui@nvidia.com> * add one more sanity check to make sure there is no unexpected keys in state dict Signed-off-by: Chen Cui <chcui@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * make cpu loading work Signed-off-by: Chen Cui <chcui@nvidia.com> * make script work for llama2 models Signed-off-by: Chen Cui <chcui@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * address code check Signed-off-by: Chen Cui <chcui@nvidia.com> * remove trainer precision (was for old sanity check) Signed-off-by: Chen Cui <chcui@nvidia.com> * fix script for llama2 model Signed-off-by: Chen Cui <chcui@nvidia.com> * remove commented code Signed-off-by: Chen Cui <chcui@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Chen Cui <chcui@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> * Fix bug in ConditionalInput: cat along the feature dim, not the batch dim (#7785) Signed-off-by: anferico <f.cariaggi4@gmail.com> * Add some docs and update scripts for ASR (#7790) * Add some docs and update scripts Signed-off-by: smajumdar <titu1994@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: smajumdar <titu1994@gmail.com> Signed-off-by: Somshubra Majumdar <titu1994@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * set context for text memmap to fork (#7784) * set context for text memmap to fork Signed-off-by: arendu <adithyare@nvidia.com> * typo Signed-off-by: arendu <adithyare@nvidia.com> --------- Signed-off-by: arendu 
<adithyare@nvidia.com> * add training with multiple audios Signed-off-by: stevehuang52 <heh@nvidia.com> * Support flash decoding (#7744) * Add flash-decoding Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com> * Fix Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com> --------- Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Yang Zhang <yzhang123@users.noreply.github.com> * Change accelerator to 'auto' in nlp_checkpoint_port.py (#7761) * Change accelerator to 'auto' in nlp_checkpoint_port.py (#7747) * Change accelerator to auto Signed-off-by: Abhishree <abhishreetm@gmail.com> * Pass omegaconf object to trainer in nlp_checkpoint_port.py Signed-off-by: Abhishree <abhishreetm@gmail.com> * Pass omegaconf object to trainer in export.py Signed-off-by: Abhishree <abhishreetm@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Abhishree <abhishreetm@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Signed-off-by: Abhishree <abhishreetm@gmail.com> * docs: fix typos (#7758) Signed-off-by: shuoer86 <129674997+shuoer86@users.noreply.github.com> Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Abhishree <abhishreetm@gmail.com> * Snake act (#7736) Signed-off-by: Abhishree <abhishreetm@gmail.com> * Update gpt_dataset.py (#6963) Signed-off-by: Xin Yao <xiny@nvidia.com> Co-authored-by: Sandeep Subramanian 
<sandeep.subramanian.1@umontreal.ca> Signed-off-by: Abhishree <abhishreetm@gmail.com> --------- Signed-off-by: Abhishree <abhishreetm@gmail.com> Signed-off-by: shuoer86 <129674997+shuoer86@users.noreply.github.com> Signed-off-by: Xin Yao <xiny@nvidia.com> Co-authored-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Co-authored-by: shuoer86 <129674997+shuoer86@users.noreply.github.com> Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: Nithin Rao <nithinrao.koluguri@gmail.com> Co-authored-by: Xin Yao <yaox12@outlook.com> Co-authored-by: Sandeep Subramanian <sandeep.subramanian.1@umontreal.ca> * Add selection criteria for reference audios in the `GlobalStyleToken` submodule (#7788) * add selection criteria for reference audios Signed-off-by: anferico <f.cariaggi4@gmail.com> * Update configuration files Signed-off-by: anferico <f.cariaggi4@gmail.com> * add informative comment in config files Signed-off-by: anferico <f.cariaggi4@gmail.com> * sample random index for reference audio selection Signed-off-by: anferico <f.cariaggi4@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: anferico <f.cariaggi4@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * update text server to support compute logprobs (#7733) * update text server to support compute logprobs * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix typo --------- Signed-off-by: Zhilin Wang <zhilinw@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * add multi-layer feat extract and fix random question insertion Signed-off-by: stevehuang52 <heh@nvidia.com> * Configure 
MCore logger (#7781) Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com> * Revert "PEFT eval fix (#7626) (#7638)" (#7693) This reverts commit f03dd660bd26d88fd569e76c6f74b83a7c203ff9. * remove TN from ctc_segm tut (#7807) Signed-off-by: Evelina <ebakhturina@nvidia.com> * [TTS] Support audio offsets in TTS data loaders (#7156) * [TTS] Support audio offsets in TTS data loaders Signed-off-by: Ryan <rlangman@nvidia.com> * [TTS] Change docstring mentions of .pt to .npy Signed-off-by: Ryan <rlangman@nvidia.com> --------- Signed-off-by: Ryan <rlangman@nvidia.com> * Update Apex install command in Dockerfile (#7794) (#7804) * move core install to /workspace (#7706) * update apex install in dockerfile * use fetch head --------- Signed-off-by: Abhinav Khattar <aklife97@gmail.com> Signed-off-by: eharper <eharper@nvidia.com> Co-authored-by: Eric Harper <complex451@gmail.com> Co-authored-by: Abhinav Khattar <aklife97@gmail.com> * fix typo Signed-off-by: stevehuang52 <heh@nvidia.com> * Nemo to HF converter for LLaMA model (#7770) * Create config_llama_truncate.yaml Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> * Add files via upload Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> * Update convert_nemo_llama_to_hf.py Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update config_llama_truncate.yaml Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> * Update convert_nemo_llama_to_hf.py Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update convert_nemo_llama_to_hf.py Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> * clean up trainer * remove dependency on yaml config. load config from nemo file instead. 
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * enable ckpt saving into other precision formats * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * support 70b + cleanup qkv slice logic * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix bug * move hf model folder code from comment to function and add instruction to run * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> Signed-off-by: Chen Cui <chcui@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Co-authored-by: Chen Cui <chcui@nvidia.com> * Save best NeMo model only when necessary (#7836) Signed-off-by: Ante Jukić <ajukic@nvidia.com> * add guard if its a distributed checkpoint (#7845) Signed-off-by: Gerald Shen <geshen@nvidia.com> * Fix tn duplex (#7808) * fix duplex tn infer Signed-off-by: Evelina <ebakhturina@nvidia.com> * fix typo Signed-off-by: Evelina <ebakhturina@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix TN docs Signed-off-by: Evelina <ebakhturina@nvidia.com> --------- Signed-off-by: Evelina <ebakhturina@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Update transformers cache on Jenkins (#7854) * update transformers cache Signed-off-by: eharper <eharper@nvidia.com> * update Signed-off-by: eharper <eharper@nvidia.com> * add cd Signed-off-by: eharper <eharper@nvidia.com> --------- Signed-off-by: eharper <eharper@nvidia.com> * Update README.rst for container update (#7844) Signed-off-by: fayejf <36722593+fayejf@users.noreply.github.com> * Add support for finetuning with 
huggingface datasets (#7834) * add finetune with huggingface dataset Signed-off-by: stevehuang52 <heh@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update yaml Signed-off-by: stevehuang52 <heh@nvidia.com> * update Signed-off-by: stevehuang52 <heh@nvidia.com> * update and refactor Signed-off-by: stevehuang52 <heh@nvidia.com> * add extrac hf text and update Signed-off-by: stevehuang52 <heh@nvidia.com> * update and refactor Signed-off-by: stevehuang52 <heh@nvidia.com> * move dataset dependency to common Signed-off-by: stevehuang52 <heh@nvidia.com> * add docstring Signed-off-by: stevehuang52 <heh@nvidia.com> * Add to Dics Signed-off-by: Nithin Rao Koluguri <nithinraok> * add ci test Signed-off-by: Nithin Rao Koluguri <nithinraok> * add max steps in jenkins Signed-off-by: Nithin Rao Koluguri <nithinraok> * reduce max steps Signed-off-by: Nithin Rao Koluguri <nithinraok> * jenkins test Signed-off-by: Nithin Rao Koluguri <nithinraok> * add bs=2 Signed-off-by: Nithin Rao Koluguri <nithinraok> --------- Signed-off-by: stevehuang52 <heh@nvidia.com> Signed-off-by: Nithin Rao Koluguri <nithinraok> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Nithin Rao Koluguri <nithinraok> Co-authored-by: Nithin Rao <nithinrao.koluguri@gmail.com> * Multimodal merge (#7728) * ControlNet TRT export * Final MR before release * SD2 update * Fixed export issue * Fix for instruct p2p and reformat * Fix SD export issue * Add nemo clip export for DB * Fix ins pix2pix * fix sd2 config * [Mingyuan Ma] BF16 and SD conversion script * [Imagen] NHWC Feature * Fix .nemo loading issue for NeMo CLIP in SD * NeMo r1.20.0 Multimodal Merge * fix the inductor issue in inference * Fix inductor loading .nemo issue * Add Neva Model Support * Imagen Optimizations * Neva inference code * NeMo TOT 1.21 to Internal/main * Update neva_inference.yaml * REBASING for latest code changes * 
Update internal/main to main tot * Parallel DDIM implementation * 1. Fixing indentation bug. (#7352) Signed-off-by: Micha Livne <mlivne@nvidia.com> * NeMo MCore llama2 support + MCore PEFT adapters (#7299) * start adding gpt from megatron core path Signed-off-by: ericharper <complex451@gmail.com> * set model parallel config Signed-off-by: ericharper <complex451@gmail.com> * use model parallel config object Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update args Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * set vp size to none if it is 1 Signed-off-by: ericharper <complex451@gmail.com> * set vp size to none if it is 1 Signed-off-by: ericharper <complex451@gmail.com> * add TransformerConfig Signed-off-by: ericharper <complex451@gmail.com> * start updating to TransformerConfig Signed-off-by: ericharper <complex451@gmail.com> * add todo Signed-off-by: ericharper <complex451@gmail.com> * revert to model parallel config Signed-off-by: ericharper <complex451@gmail.com> * add hidden_size to model_parallel_config Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove imports Signed-off-by: ericharper <complex451@gmail.com> * revert Signed-off-by: ericharper <complex451@gmail.com> * remove import Signed-off-by: ericharper <complex451@gmail.com> * small clean up Signed-off-by: ericharper <complex451@gmail.com> * update hidden size in peft base model, add mcore commit to jenkins Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update module args Signed-off-by: ericharper <complex451@gmail.com> * add config obj to flash attention tests Signed-off-by: 
ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove args Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove sequence parallel arg Signed-off-by: ericharper <complex451@gmail.com> * update args Signed-off-by: ericharper <complex451@gmail.com> * add config to self Signed-off-by: ericharper <complex451@gmail.com> * update args Signed-off-by: ericharper <complex451@gmail.com> * update args Signed-off-by: ericharper <complex451@gmail.com> * update args Signed-off-by: ericharper <complex451@gmail.com> * add config to test Signed-off-by: ericharper <complex451@gmail.com> * get hidden_size from config Signed-off-by: ericharper <complex451@gmail.com> * add try except Signed-off-by: ericharper <complex451@gmail.com> * use default Signed-off-by: ericharper <complex451@gmail.com> * update config with hidden size Signed-off-by: ericharper <complex451@gmail.com> * remove arg Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * comment out jenkins test Signed-off-by: ericharper <complex451@gmail.com> * revert import Signed-off-by: ericharper <complex451@gmail.com> * build transformer config Signed-off-by: ericharper <complex451@gmail.com> * add model to provider func Signed-off-by: ericharper <complex451@gmail.com> * update forward and float16 wrapper Signed-off-by: ericharper <complex451@gmail.com> * instantiate model parallel config after init model parallel Signed-off-by: ericharper <complex451@gmail.com> * set virtual rank Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add GQA config to megatron gpt model (#7096) * Add GQA config in gpt config file Signed-off-by: jasonwan 
<jasonwan@nvidia.com> * Verify mcore is enabled when using GQA Signed-off-by: jasonwan <jasonwan@nvidia.com> --------- Signed-off-by: jasonwan <jasonwan@nvidia.com> * revert Signed-off-by: ericharper <complex451@gmail.com> * mcore llama2 ckpt conversion & small fix Signed-off-by: jasonwan <jasonwan@nvidia.com> * Add inference & sft config by Hongbin Co-authored-by: Hongbin Liu <hongbinl@nvidia.com> Signed-off-by: jasonwan <jasonwan@nvidia.com> * fix config Signed-off-by: jasonwan <jasonwan@nvidia.com> * add inference param. update TP/PP script to support mcore gpt Signed-off-by: jasonwan <jasonwan@nvidia.com> * p-tuning Signed-off-by: jasonwan <jasonwan@nvidia.com> * modify ckpt conversion script (adding model cast) Signed-off-by: jasonwan <jasonwan@nvidia.com> * ckpt conversion use relative path for config Signed-off-by: jasonwan <jasonwan@nvidia.com> * start adding gpt from megatron core path Signed-off-by: ericharper <complex451@gmail.com> * set model parallel config Signed-off-by: ericharper <complex451@gmail.com> * use model parallel config object Signed-off-by: ericharper <complex451@gmail.com> * update args Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * set vp size to none if it is 1 Signed-off-by: ericharper <complex451@gmail.com> * set vp size to none if it is 1 Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add TransformerConfig Signed-off-by: ericharper <complex451@gmail.com> * start updating to TransformerConfig Signed-off-by: ericharper <complex451@gmail.com> * add todo Signed-off-by: ericharper <complex451@gmail.com> * revert to model parallel config Signed-off-by: ericharper <complex451@gmail.com> * add hidden_size to model_parallel_config Signed-off-by: ericharper <complex451@gmail.com> * remove imports Signed-off-by: ericharper 
<complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove import Signed-off-by: ericharper <complex451@gmail.com> * small clean up Signed-off-by: ericharper <complex451@gmail.com> * update hidden size in peft base model, add mcore commit to jenkins Signed-off-by: ericharper <complex451@gmail.com> * update module args Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add config obj to flash attention tests Signed-off-by: ericharper <complex451@gmail.com> * remove args Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove sequence parallel arg Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update args Signed-off-by: ericharper <complex451@gmail.com> * add config to self Signed-off-by: ericharper <complex451@gmail.com> * update args Signed-off-by: ericharper <complex451@gmail.com> * update args Signed-off-by: ericharper <complex451@gmail.com> * update args Signed-off-by: ericharper <complex451@gmail.com> * add config to test Signed-off-by: ericharper <complex451@gmail.com> * get hidden_size from config Signed-off-by: ericharper <complex451@gmail.com> * add try except Signed-off-by: ericharper <complex451@gmail.com> * use default Signed-off-by: ericharper <complex451@gmail.com> * update config with hidden size Signed-off-by: ericharper <complex451@gmail.com> * remove arg Signed-off-by: ericharper <complex451@gmail.com> * comment out jenkins test Signed-off-by: ericharper <complex451@gmail.com> * revert import Signed-off-by: ericharper <complex451@gmail.com> * remove optimizer_idx Signed-off-by: eharper <eharper@nvidia.com> * prefetch num microbatches Signed-off-by: eharper 
<eharper@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * start adding gpt from megatron core path Signed-off-by: ericharper <complex451@gmail.com> * set model parallel config Signed-off-by: ericharper <complex451@gmail.com> * use model parallel config object Signed-off-by: ericharper <complex451@gmail.com> * update args Signed-off-by: ericharper <complex451@gmail.com> * fix for p-tuning sequence parallel Signed-off-by: jasonwan <jasonwan@nvidia.com> * support SFT/distOpt mcore (#7207) * add inference param. update TP/PP script to support mcore gpt * p-tuning Signed-off-by: jasonwan <jasonwan@nvidia.com> * change layer names for SFT Signed-off-by: Hongbin Liu <hongbinl@nvidia.com> * fix bug in SFT Signed-off-by: Hongbin Liu <hongbinl@nvidia.com> --------- Signed-off-by: jasonwan <jasonwan@nvidia.com> Signed-off-by: Hongbin Liu <hongbinl@nvidia.com> Co-authored-by: Hongbin Liu <hongbinl@nvidia.com> Co-authored-by: jasonwan <jasonwan@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * start updating to TransformerConfig Signed-off-by: ericharper <complex451@gmail.com> * revert to model parallel config Signed-off-by: ericharper <complex451@gmail.com> * add hidden_size to model_parallel_config Signed-off-by: ericharper <complex451@gmail.com> * remove imports Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update module args Signed-off-by: ericharper <complex451@gmail.com> * add config to self Signed-off-by: ericharper <complex451@gmail.com> * build transformer config Signed-off-by: ericharper <complex451@gmail.com> * add model to provider func Signed-off-by: ericharper <complex451@gmail.com> * update forward and float16 wrapper Signed-off-by: ericharper <complex451@gmail.com> * instantiate model parallel config after init model 
parallel Signed-off-by: ericharper <complex451@gmail.com> * set virtual rank Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add GQA config to megatron gpt model (#7096) * Add GQA config in gpt config file Signed-off-by: jasonwan <jasonwan@nvidia.com> * Verify mcore is enabled when using GQA Signed-off-by: jasonwan <jasonwan@nvidia.com> --------- Signed-off-by: jasonwan <jasonwan@nvidia.com> * revert Signed-off-by: ericharper <complex451@gmail.com> * remove import Signed-off-by: eharper <eharper@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * rollback model cast for p-tuning Signed-off-by: jasonwan <jasonwan@nvidia.com> * update for dist adam Signed-off-by: eharper <eharper@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * use get_gpt_module_list Signed-off-by: eharper <eharper@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update ckpt conversion script Signed-off-by: jasonwan <jasonwan@nvidia.com> * ptl2.0 patch for llama config Signed-off-by: jasonwan <jasonwan@nvidia.com> * add plugins to trainer in scripts Signed-off-by: jasonwan <jasonwan@nvidia.com> * fix activation checkpointing mcore Signed-off-by: jasonwan <jasonwan@nvidia.com> * fix variable names Signed-off-by: jasonwan <jasonwan@nvidia.com> * overwrite normalization type for mcore/te Signed-off-by: jasonwan <jasonwan@nvidia.com> * Update megatron_llama_sft.yaml Signed-off-by: Jason Wang <jasonwan@nvidia.com> * add PEFT adapter support for mcore gpt path (#7276) * implementation for mcore adapter/mxins Signed-off-by: jasonwan <jasonwan@nvidia.com> * small fix for lora and ptuning Signed-off-by: jasonwan <jasonwan@nvidia.com> * support layerwise peft Signed-off-by: jasonwan <jasonwan@nvidia.com> * 
support multiple target layers Signed-off-by: jasonwan <jasonwan@nvidia.com> * support lora GQA Signed-off-by: jasonwan <jasonwan@nvidia.com> * support amp O2 Signed-off-by: jasonwan <jasonwan@nvidia.com> * revert & more O2 fix Signed-off-by: jasonwan <jasonwan@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * lora inject to attention Signed-off-by: jasonwan <jasonwan@nvidia.com> * support …

…NVIDIA#9198) * Fix the "cast ping pong" problem when we run AMP inference. This has been tested only for Parakeet-CTC-1.1B right now. This problem certainly exists elsewhere. Automatic mixed precision and inference do not play well together. First, automatic mixed precision was created back when neural networks were much simpler. In particular, they did not have softmax and layer norm as frequent operations. In the era of transformers, softmax and layer norm are very common. AMP will unconditionally output fp32 outputs from these operations, even if their inputs are fp16. See here: https://pytorch.org/docs/stable/amp.html#cuda-ops-that-can-autocast-to-float32 This is no longer necessary, now that layer norm does accumulation in fp32 in pytorch, even if the input is fp16: pytorch/pytorch#66707 Do inference by casting model to bfloat16, not by using AMP. Do feature preprocessing in float32 for accuracy. Warn if someone tries to input a non-float32 tensor. Always create the output in the type the rest of the model expects. Sort manifests by duration. Signed-off-by: Daniel Galvez <dgalvez@nvidia.com> * Always cast softmax inputs to float32 when in training mode. While we don't need this for accurate results in b/float16, this is a safety precaution to make sure that training accuracy does not regress. Signed-off-by: Daniel Galvez <dgalvez@nvidia.com> --------- Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>
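
The "cast ping pong" described in the commit message above can be illustrated with a toy dtype-propagation model. This is a minimal sketch in plain Python, not PyTorch code and not the NeMo implementation: the op names, the two op sets, and the `count_casts` helper are hypothetical simplifications of the autocast policy (weight casts, which autocast caches, are ignored). It only counts how often an activation would change dtype under each strategy.

```python
# Toy model of activation dtype changes; hypothetical helper, not a PyTorch API.
FP32_OPS = {"softmax", "layer_norm"}  # assumed: ops that autocast runs in float32
LOW_OPS = {"linear", "matmul"}        # assumed: ops that autocast runs in low precision


def count_casts(ops, mode, low="bfloat16"):
    """Return (number of activation casts, final dtype) for a sequence of ops."""
    dtype = "float32"  # the preprocessor output is float32 in both modes
    casts = 0
    for op in ops:
        if mode == "amp":
            if op in FP32_OPS:
                wanted = "float32"
            elif op in LOW_OPS:
                wanted = low
            else:
                wanted = dtype  # other ops keep whatever dtype they receive
        elif mode == "model_cast":
            wanted = low  # the whole model was cast once, so every op runs in low precision
        else:
            raise ValueError(f"unknown mode: {mode}")
        if dtype != wanted:
            casts += 1
            dtype = wanted
    return casts, dtype


# Two simplified transformer-style blocks.
block = ["layer_norm", "linear", "softmax", "matmul", "linear"]
amp_casts, _ = count_casts(block * 2, "amp")
cast_casts, _ = count_casts(block * 2, "model_cast")
print(f"AMP: {amp_casts} casts, model cast: {cast_casts} cast")
```

In this toy model the AMP path converts the activation back and forth around every softmax and layer norm, while the model-cast path converts it once, right after the float32 preprocessor. The real speedup depends on the model and hardware; the numbers here only count dtype transitions.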

…rategy (NVIDIA#9387) * Integrating mcore's DistributedDataParallel into MegatronStrategy Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Apply isort and black reformatting Signed-off-by: marcromeyn <marcromeyn@users.noreply.github.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Apply ddp-hooks from pytorch only when needed Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * bugfix if using mcore distOpt with sft (#9356) * bugfix if using mcore distOpt Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * Apply isort and black reformatting Signed-off-by: akoumpa <akoumpa@users.noreply.github.com> --------- Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> Signed-off-by: akoumpa <akoumpa@users.noreply.github.com> Co-authored-by: akoumpa <akoumpa@users.noreply.github.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * fix typo infer_seq_lenght -> infer_seq_length (#9370) Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> Co-authored-by: Marc Romeyn <mromeijn@nvidia.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Rachitg/ag (#9083) * Rachitg/ag (#9081) * disable overlap for qkv Signed-off-by: Rachit Garg <rachitg@login-eos01.eos.clusters.nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * bug fix * bugfix --------- Signed-off-by: Rachit Garg <rachitg@login-eos01.eos.clusters.nvidia.com> Signed-off-by: Rachit Garg <rachitgarg91@gmail.com> Co-authored-by: Rachit Garg <rachitg@login-eos01.eos.clusters.nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Apply isort and black reformatting Signed-off-by: michal2409 <michal2409@users.noreply.github.com> --------- Signed-off-by: Rachit Garg <rachitg@login-eos01.eos.clusters.nvidia.com> Signed-off-by: Rachit Garg <rachitgarg91@gmail.com> Signed-off-by: michal2409 <michal2409@users.noreply.github.com> Co-authored-by: Rachit Garg 
<rachitgarg91@gmail.com> Co-authored-by: Rachit Garg <rachitg@login-eos01.eos.clusters.nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Michal Futrega <mfutrega@nvidia.com> Co-authored-by: michal2409 <michal2409@users.noreply.github.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Adding the original change made for label_models (#9377) (#9378) Signed-off-by: Taejin Park <tango4j@gmail.com> Co-authored-by: Taejin Park <tango4j@gmail.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Dgalvez/fix greedy batch strategy name r2.0.0rc0 (#9243) (#9253) * Lazily warn about using greedy strategy instead of greedy_batch strategy. Previously, the warning would often run spuriously, since several existing code paths simply call "change_decoding_strategy()" after having first initialized a Module, rather than changing the config before initializing the Module. This can be confusing. The only problem I can see with this is that using logging inside a forward() method might interfere with some compiler toolkits like Torchscript or thunder.compile. Presumably it would be easy to add a conditional statement to avoid this statement in a compiler context if necessary. Signed-off-by: Daniel Galvez <dgalvez@nvidia.com> Co-authored-by: Daniel Galvez <galv@users.noreply.github.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Update README.rst (#9393) Revised content per https://gitlab-master.nvidia.com/nemo-framework-tme/documentation/-/issues/25. Also removed reference to NIMs in LLMs and MMs Deployment and Optimization. It should be NVIDIA NeMo Microservices and not NIM. Removed nemo:24.03.framework and nemo:24.01.speech in Docker Containers section and replaced with 24.05 . Please verify all changes. 
Signed-off-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * a2a fix removed tp world size and group from init (#8944) (#8952) Signed-off-by: Anmol Gupta <14880251+anmolgupt@users.noreply.github.com> Co-authored-by: anmolgupt <14880251+anmolgupt@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Add config option for FP32 embedding grads (#8953) * Add config option for FP32 embedding grads (#8946) Signed-off-by: Tim Moon <tmoon@nvidia.com> * Apply isort and black reformatting Signed-off-by: ericharper <ericharper@users.noreply.github.com> --------- Signed-off-by: Tim Moon <tmoon@nvidia.com> Signed-off-by: ericharper <ericharper@users.noreply.github.com> Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Co-authored-by: ericharper <ericharper@users.noreply.github.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Changes to enable CUDA graph for LLM (#8955) * Changes to enable CUDA graph for LLM (#8751) * Use next instead of get_batch Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * CUDA graph changes Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Change to enable CG with weight caching Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Revert "Use next instead of get_batch" This reverts commit 0021bb444cdd1b27674fc0cfea909c1a42475336. Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Copy jbaczek/mcore_parallel_state_api_change branch leaving out changes to nemo/export/quantize/quantizer.py Signed-off-by: Jan Baczek <jbaczek@nvidia.com> Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Revert "Copy jbaczek/mcore_parallel_state_api_change branch leaving out changes to nemo/export/quantize/quantizer.py" This reverts commit b4f736ed2b39f6c48d2868ac3febb82c763ab3fb. 
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Remove skip_weight_update argument Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Bug fix + cleanup Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Cleanup Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Use new TE API for FP8 Param transpose Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Change config param cuda_graph to enable_cuda_graph Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Enable TE RNGStatesTracker through config Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Change te_rng_tracker to use_te_rng_tracker Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * FP8 weight transpose handled inside TE Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Cleanup Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Revert "Revert "Copy jbaczek/mcore_parallel_state_api_change branch leaving out changes to nemo/export/quantize/quantizer.py"" This reverts commit e31862481216f9adf7fa584a0c0262916c935639. 
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Fix merge conflicts Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Fix merge conflicts Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> * Fix merge conflicts Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> --------- Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> Signed-off-by: Jan Baczek <jbaczek@nvidia.com> Co-authored-by: Jaemin Choi <jaeminc@nvidia.com> Co-authored-by: Jan Baczek <jbaczek@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Apply isort and black reformatting Signed-off-by: ericharper <ericharper@users.noreply.github.com> --------- Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> Signed-off-by: Jan Baczek <jbaczek@nvidia.com> Signed-off-by: ericharper <ericharper@users.noreply.github.com> Co-authored-by: vasunvidia <108759426+vasunvidia@users.noreply.github.com> Co-authored-by: Jaemin Choi <jaeminc@nvidia.com> Co-authored-by: Jan Baczek <jbaczek@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Co-authored-by: ericharper <ericharper@users.noreply.github.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Enhance Distributed Adam (#9051) * Enhance Distributed Adam (#9037) * Fix deprecated env. Signed-off-by: Wil Kong <alpha0422@gmail.com> * Use user desired value for distributed adam. Signed-off-by: Wil Kong <alpha0422@gmail.com> * Preserve memory format in parameter buffer of distributed adam. Signed-off-by: Wil Kong <alpha0422@gmail.com> * Fix the contiguous_param_buffer bug about bprop overlap and redundant copy after all-gather. Signed-off-by: Wil Kong <alpha0422@gmail.com> * Provide API to lock SHArP tree for distributed adam within nodes. 
Signed-off-by: Wil Kong <alpha0422@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Wil Kong <alpha0422@gmail.com> --------- Signed-off-by: Wil Kong <alpha0422@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Apply isort and black reformatting Signed-off-by: ericharper <ericharper@users.noreply.github.com> --------- Signed-off-by: Wil Kong <alpha0422@gmail.com> Signed-off-by: ericharper <ericharper@users.noreply.github.com> Co-authored-by: Wil Kong <alpha0422@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Co-authored-by: ericharper <ericharper@users.noreply.github.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Force diarizer to use CUDA if cuda is available and if device=None. (#9380) (#9390) * Fixed clustering diarizer to load MSDD to GPU by default if cuda on * Fixed clustering diarizer to load MSDD to GPU by default if cuda on * Apply isort and black reformatting --------- Signed-off-by: Taejin Park <tango4j@gmail.com> Signed-off-by: tango4j <tango4j@users.noreply.github.com> Co-authored-by: Taejin Park <tango4j@gmail.com> Co-authored-by: tango4j <tango4j@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * ci: Properly catch failed tests by introduction of workflow templates (#9324) * ci: Refactor tests into reusable template Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * ci: Fix sending alerts on failure Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * fix Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * fix Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * disable slack Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * fix Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * fix alerting Signed-off-by: Oliver Koenig 
<okoenig@nvidia.com> * fix Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * ci: Increase timeout for `L0_Unit_Tests_CPU` Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * fix Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * increase timeout Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * increase timeout for `Speech_Checkpoints_tests` Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * improve readability Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * fix Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * fix Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * fix Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * test Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * test Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * fix Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * fix Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * finalize Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * fix Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * add missing rm statement for `L2_PTQ_Llama2_Export_Only` Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * all your comments are belong to us Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * remove github output Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * revive more comments Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * add L2: ASR dev run - part two Signed-off-by: Oliver Koenig <okoenig@nvidia.com> --------- Signed-off-by: Oliver Koenig <okoenig@nvidia.com> Signed-off-by: Pablo Garay <palenq@gmail.com> Co-authored-by: Pablo Garay <palenq@gmail.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Fix T5 G2P Input and Output Types (#9224) (#9269) * fix t5 g2p model * Apply isort and black reformatting --------- Signed-off-by: Jason <jasoli@nvidia.com> Signed-off-by: blisc <blisc@users.noreply.github.com> Co-authored-by: Jason <jasoli@nvidia.com> Co-authored-by: blisc <blisc@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Signed-off-by: Marc Romeyn 
<mromeijn@nvidia.com> * Use model-cast-to-bfloat16 rather than AMP-to-bfloat16 for inference. (#9198) * Fix the "cast ping pong" problem when we run AMP inference. This has been tested only for Parakeet-CTC-1.1B right now. This problem certainly exists elsewhere. Automatic mixed precision and inference do not play well together. First, automatic mixed precision was created back when neural networks were much simpler. In particular, they did not have softmax and layer norm as frequent operations. In the era of transformers, softmax and layer norm are very common. AMP will unconditionally output fp32 outputs from these operations, even if their inputs are fp16. See here: https://pytorch.org/docs/stable/amp.html#cuda-ops-that-can-autocast-to-float32 This is no longer necessary, now that layer norm does accumulation in fp32 in pytorch, even if the input is fp16: https://github.com/pytorch/pytorch/issues/66707 Do inference by casting model to bfloat16, not by using AMP. Do feature preprocessing in float32 for accuracy. Warn if someone tries to input a non-float32 tensor. Always create the output in the type the rest of the model expects. Sort manifests by duration. Signed-off-by: Daniel Galvez <dgalvez@nvidia.com> * Always cast softmax inputs to float32 when in training mode. While we don't need this for accurate results in b/float16, this is a safety precaution to make sure that training accuracy does not regress. 
Signed-off-by: Daniel Galvez <dgalvez@nvidia.com> --------- Signed-off-by: Daniel Galvez <dgalvez@nvidia.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Huvu/rag pipeline citest (#9384) * huvu/NeMo_rag_citest first commit * adding llama-index to dependency * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * adjusting data/models path in ci-test to dependency * putting llama-index to optional * update cicd-main.yml --------- Co-authored-by: Huy Vu2 <huvu@login-eos01.eos.clusters.nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Re-org export code (#9353) * reorg the export code Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * Apply isort and black reformatting Signed-off-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com> * replaced log with raise Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * add converter and loader folders Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * move nemo_ckpt_convert into the converter folder Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * move nemo_file into loader folder Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * reorg converter Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * Apply isort and black reformatting Signed-off-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com> * continue to reorg converter Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * Apply isort and black reformatting Signed-off-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com> * continue to reorg Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * move nemo file back into nemo folder Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * renamed nemo folder to nemo_ckpt_loader Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * remove unused function Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * removed nemo file Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * 
Apply isort and black reformatting Signed-off-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com> * moved a function to tensorrt_llm_run file Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * Apply isort and black reformatting Signed-off-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com> * Remove unused imports Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> * Apply isort and black reformatting Signed-off-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com> * import csv added Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> --------- Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com> Signed-off-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com> Co-authored-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * ci: Fix `L2_Segmentation_Tool_Parallel_ctc_segmentation_test_L2_Eng_CitriNet_with_wav` (#9399) Signed-off-by: Oliver Koenig <okoenig@nvidia.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * disable overlap for qkv (#9079) * disable overlap for qkv (#9072) * disable overlap for qkv Signed-off-by: Rachit Garg <rachitg@login-eos01.eos.clusters.nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Rachit Garg <rachitg@login-eos01.eos.clusters.nvidia.com> Co-authored-by: Rachit Garg <rachitg@login-eos01.eos.clusters.nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Apply isort and black reformatting Signed-off-by: michal2409 <michal2409@users.noreply.github.com> --------- Signed-off-by: Rachit Garg <rachitg@login-eos01.eos.clusters.nvidia.com> Signed-off-by: michal2409 <michal2409@users.noreply.github.com> Signed-off-by: Michal Futrega <mfutrega@nvidia.com> Co-authored-by: Rachit Garg <rachitgarg91@gmail.com> Co-authored-by: Rachit Garg <rachitg@login-eos01.eos.clusters.nvidia.com> Co-authored-by: pre-commit-ci[bot] 
<66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Michal Futrega <mfutrega@nvidia.com> Co-authored-by: michal2409 <michal2409@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Fix circular import for MM dataprep notebook (#9287) (#9292) * update launcher name and fix mm circular import * Apply isort and black reformatting --------- Signed-off-by: Chen Cui <chcui@nvidia.com> Signed-off-by: cuichenx <cuichenx@users.noreply.github.com> Co-authored-by: Chen Cui <chcui@nvidia.com> Co-authored-by: cuichenx <cuichenx@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * add check if num layers is divisible by pp size (#9208) (#9298) * add check if num_layers % pp == 0 * Apply isort and black reformatting * move num_layers / pp check to build_transformer_config --------- Signed-off-by: dimapihtar <dpihtar@gmail.com> Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com> Co-authored-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com> Co-authored-by: dimapihtar <dimapihtar@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Add HF siglip vision encoder (#9185) * temp save Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * temp save 2 Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * update code Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * enable seq packing Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * fix neva and clip Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Enable parallel seq packing algo and few other fixes Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Pipeline parallel support Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Update data preprocess Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * fix few pp issues Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * enable sequence packing w/ PP 
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Fix cu_seqlens in inputs Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * add assert Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Depend on PP to decide whether do padding Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add docstring Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Fix few evaluation issues Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Fix few PP evaluation issues Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Address comments Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add llama3 template Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * address comments Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Fix license Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Fix llama3 Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Few fixes Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Few neva bugs Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Few neva bugs Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Few neva bugs Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * llama3 inference fix Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Force vision encoder to run in fp32 Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Revert "Force vision encoder to run in fp32" This reverts commit 9d2160d96cb3e2a27a18538950ef43b4482c04da. 
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Try adding distributed format of checkpoint Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Allow dist checkpoint to be non-strict Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Fix Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Some fixes for PP + dist ckpt in Neva Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * fix peft Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * few fixes for lora Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * checkpoint updates Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <yaoyu-33@users.noreply.github.com> * bug fix Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Add HF siglip vision encoder Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * handle steerlm label in nv_dpo template Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * Add neva dist checkpoint converter Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <yaoyu-33@users.noreply.github.com> * fix CLEAN RESPONSE logic to not use last EOS Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * strip extra_id_1 from clean response Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * change inference time image processor Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * resolve comments Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * remove open_clip vision encoder for siglip Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * update neva dist ckpt apis Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <yaoyu-33@users.noreply.github.com> * fix return Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> * resolve CLEAN RESPONSE multiturn issue Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * code format Signed-off-by: 
HuiyingLi <willwin.lee@gmail.com> * fixes for isort Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * refac image processor loading to util Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * black and isort Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * move crop size assertion Signed-off-by: HuiyingLi <willwin.lee@gmail.com> * few neva fixes Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com> --------- Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Signed-off-by: yaoyu-33 <yaoyu-33@users.noreply.github.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com> Co-authored-by: yaoyu-33 <yaoyu.094@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: yaoyu-33 <54727607+yaoyu-33@users.noreply.github.com> Co-authored-by: yaoyu-33 <yaoyu-33@users.noreply.github.com> Co-authored-by: Pablo Garay <palenq@gmail.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * [Nemo CICD] timeouts fix (#9407) * timeouts fix * timeouts fix Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Removing un-used ModelConfig class (#9389) Co-authored-by: Chen Cui <chcui@nvidia.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Extend multimodal/speech_llm with lhotse, t5 and bestow supports (#9169) * Fixes * Docs fix * Add support for custom NeMo fields in Lhotse-NeMo adapters (attach to cut.custom) * Add support for custom NeMo fields in Lhotse-NeMo adapters (attach to cut.custom) * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * support distributed_fused_adam Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support distributed_fused_adam Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * Add support for sharded NeMo manifest files * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see 
https://pre-commit.ci * support megatron_amp_O2 Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * Support heterogeneous sampling rates in non tarred NeMo manifests * migrate to PTL2.0 Signed-off-by: stevehuang52 <heh@nvidia.com> * clean up Signed-off-by: stevehuang52 <heh@nvidia.com> * update manifest util Signed-off-by: stevehuang52 <heh@nvidia.com> * Support multiple tokenizer/parser types, aggregate tokenizers, and custom language fields * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * agg and normal tokenizers actually work * Support weights for NeMo tarred manifests * Temporarily hardcoded pnc stripping/lowercasing * fix * make pnc hack configurable from the config and disabled by default * fix the hack * migrate to ptl2.1 to support multiple dataloaders Signed-off-by: stevehuang52 <heh@nvidia.com> * support encoder overwrite Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * update misc Signed-off-by: stevehuang52 <heh@nvidia.com> * fix eval and clean up Signed-off-by: stevehuang52 <heh@nvidia.com> * support add_sep for perception model Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * fix https://github.com/Lightning-AI/pytorch-lightning/issues/18803 Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * add_bos Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * Transformer decoder with conditioning for canary (#8091) * initial commit for multi-task conf-enc transf-dec for canary Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> * removing decoder states caching during training Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- 
Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> Co-authored-by: Krishna Puvvada <kpuvvada@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Option to limit the number of open streams (#8095) * audio signal support in multi Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * update asr evaluator Signed-off-by: stevehuang52 <heh@nvidia.com> * fix from https://github.com/NVIDIA/NeMo/commit/fcc0f9f6ff7947c3c7fba3ed17d8ec8af6391397 and https://github.com/NVIDIA/NeMo/commit/f97c9016e6438ca4174b66bf9c3e248b28197aaa Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * transcribe fn for Canary models (#8110) * improve readability Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> * adding context in transcribe function for ConfTransfModels Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> * supporting relative paths in transcribe function for canary Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> * removing cuts.sort_by_duration in __getitem__ to maintain manifest order during inference Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> Co-authored-by: Krishna Puvvada <kpuvvada@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * update for evaluation Signed-off-by: stevehuang52 <heh@nvidia.com> * update for eval Signed-off-by: stevehuang52 <heh@nvidia.com> * update for evaluation Signed-off-by: stevehuang52 <heh@nvidia.com> * fix bleu Signed-off-by: stevehuang52 <heh@nvidia.com> * fix typo Signed-off-by: stevehuang52 <heh@nvidia.com> * Add missing audio_filepath validation for Canary (#8119) * Add missing audio_filepath validation for Canary * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: 
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * add default concat_sampling_probabilities Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support lhotse dataset in speechllm Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * bypass get_iterator_k_split Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * tmp fix Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * try to use fixed batch with megatron Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * add batch logging Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support unfrozen llm Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * Create README.md Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> * Update README.md Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> * Update README.md Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> * update Signed-off-by: stevehuang52 <heh@nvidia.com> * rename Signed-off-by: stevehuang52 <heh@nvidia.com> * add llama prompt template Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * update and refactor Signed-off-by: stevehuang52 <heh@nvidia.com> * support sample alpha Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support lhotse validation set and canary pretrained ckpt with pseudo label Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * make sure backward compatibility Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * remove pad Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * make sure asr_model is frozen Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support greedy decoding Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * valid on lhotse Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * fix multi dataloader in val case for lhotse SALM; add default data names; keep asr model tokenizer by default to enable adding canary dataset 
Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * remove the bruteforce _keep_special_tokens implementation Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * decoding_ratio and convert_canary_prompt_to_text support Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * canary_tokens_augment_ratio Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * debug Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * bug fix Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * fix lhotse based eval of llama canary model Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support some overwrite for eval Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support zero shot prompt in training Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support cross attention based SALM Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support cross attention based SALM Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * fix for batch train/valid of cross Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support learnable gate and plotting Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support using pseudo label in prompt rather than cross att Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * bug fix for perception cfg and context tokens shift Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * DentityConnectorsAdd Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * fix ckpt saving Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * Support RnnGatedCrossAttention Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * add include_ffw and fix _optimizer_param_groups for all unfrozen run Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support grad acc when using bucket Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support TransformerCrossAttention Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support ProjectTransformerCrossAttention 
Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support ++model.use_am_tokenizer ++model.override_vocab_size ++model.override.hidden_size Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support question set on val without canary Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support load_audio_encoder and wip in optim_param_groups Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * minor fix for audio pretrain model init Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * simplify canary_tokens_augment Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * use question in the manifest if it exists Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * support dataset weighting for non tar Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> * Update SpeechLLM code (#8475) * add pleasefixme marker for potential failed nightly tests. (#7678) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * Add new text segmentation library for better TTS quality (#7645) * Add new text segmentation library for better TTS quality * Update zh_cn_pinyin.py added detailed instruction on how to install pkuseg. Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * Update requirements_tts.txt remove pkuseg as the default dependency of NeMo TTS, and instead, direct users to manually install pkuseg if they really need. 
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> --------- Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * Create PrecisionPlugin for megatron_ckpt_to_nemo.py trainer (#7767) (#7774) * Create PrecisionPlugin for megatron_ckpt_to_nemo.py trainer * Add ddp_find_unused_parameters_true for punctuation_capitalization_train_evaluate.py * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add '32-true' for precision values --------- Signed-off-by: Abhishree <abhishreetm@gmail.com> Signed-off-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com> Co-authored-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * fix(clustering_diarizer.py): fix typo (#7772) Signed-off-by: Jean-Louis Queguiner <jean-louis.queguiner@gadz.org> * fix(diarization-README): typo (#7771) Signed-off-by: Jean-Louis Queguiner <jean-louis.queguiner@gadz.org> * Fix bug wrt change decoding strategy for bpe models (#7762) (#7764) * Fix bug wrt change decoding strategy for bpe models * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: smajumdar <titu1994@gmail.com> Co-authored-by: Somshubra Majumdar <titu1994@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Remove incorrect extra argument for load_from_checkpoint_dir() (#7500) Signed-off-by: Robin Dong <robin.k.dong@gmail.com> Co-authored-by: Eric Harper <complex451@gmail.com> * Add nemo to mcore GPT conversion script (#7730) * add conversion script Signed-off-by: Chen Cui <chcui@nvidia.com> * [pre-commit.ci] auto fixes from 
pre-commit.com hooks for more information, see https://pre-commit.ci * remove references to 'ckpt' Signed-off-by: Chen Cui <chcui@nvidia.com> * add one more sanity check to make sure there is no unexpected keys in state dict Signed-off-by: Chen Cui <chcui@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * make cpu loading work Signed-off-by: Chen Cui <chcui@nvidia.com> * make script work for llama2 models Signed-off-by: Chen Cui <chcui@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * address code check Signed-off-by: Chen Cui <chcui@nvidia.com> * remove trainer precision (was for old sanity check) Signed-off-by: Chen Cui <chcui@nvidia.com> * fix script for llama2 model Signed-off-by: Chen Cui <chcui@nvidia.com> * remove commented code Signed-off-by: Chen Cui <chcui@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Chen Cui <chcui@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> * Fix bug in ConditionalInput: cat along the feature dim, not the batch dim (#7785) Signed-off-by: anferico <f.cariaggi4@gmail.com> * Add some docs and update scripts for ASR (#7790) * Add some docs and update scripts Signed-off-by: smajumdar <titu1994@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: smajumdar <titu1994@gmail.com> Signed-off-by: Somshubra Majumdar <titu1994@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * set context for text memmap to fork (#7784) * set context for text memmap to fork Signed-off-by: arendu <adithyare@nvidia.com> * typo Signed-off-by: arendu <adithyare@nvidia.com> --------- Signed-off-by: arendu 
<adithyare@nvidia.com> * add training with multiple audios Signed-off-by: stevehuang52 <heh@nvidia.com> * Support flash decoding (#7744) * Add flash-decoding Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com> * Fix Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com> --------- Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Yang Zhang <yzhang123@users.noreply.github.com> * Change accelerator to 'auto' in nlp_checkpoint_port.py (#7761) * Change accelerator to 'auto' in nlp_checkpoint_port.py (#7747) * Change accelerator to auto Signed-off-by: Abhishree <abhishreetm@gmail.com> * Pass omegaconf object to trainer in nlp_checkpoint_port.py Signed-off-by: Abhishree <abhishreetm@gmail.com> * Pass omegaconf object to trainer in export.py Signed-off-by: Abhishree <abhishreetm@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Abhishree <abhishreetm@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Signed-off-by: Abhishree <abhishreetm@gmail.com> * docs: fix typos (#7758) Signed-off-by: shuoer86 <129674997+shuoer86@users.noreply.github.com> Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Abhishree <abhishreetm@gmail.com> * Snake act (#7736) Signed-off-by: Abhishree <abhishreetm@gmail.com> * Update gpt_dataset.py (#6963) Signed-off-by: Xin Yao <xiny@nvidia.com> Co-authored-by: Sandeep Subramanian 
<sandeep.subramanian.1@umontreal.ca> Signed-off-by: Abhishree <abhishreetm@gmail.com> --------- Signed-off-by: Abhishree <abhishreetm@gmail.com> Signed-off-by: shuoer86 <129674997+shuoer86@users.noreply.github.com> Signed-off-by: Xin Yao <xiny@nvidia.com> Co-authored-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Co-authored-by: shuoer86 <129674997+shuoer86@users.noreply.github.com> Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: Nithin Rao <nithinrao.koluguri@gmail.com> Co-authored-by: Xin Yao <yaox12@outlook.com> Co-authored-by: Sandeep Subramanian <sandeep.subramanian.1@umontreal.ca> * Add selection criteria for reference audios in the `GlobalStyleToken` submodule (#7788) * add selection criteria for reference audios Signed-off-by: anferico <f.cariaggi4@gmail.com> * Update configuration files Signed-off-by: anferico <f.cariaggi4@gmail.com> * add informative comment in config files Signed-off-by: anferico <f.cariaggi4@gmail.com> * sample random index for reference audio selection Signed-off-by: anferico <f.cariaggi4@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: anferico <f.cariaggi4@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * update text server to support compute logprobs (#7733) * update text server to support compute logprobs * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix typo --------- Signed-off-by: Zhilin Wang <zhilinw@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * add multi-layer feat extract and fix random question insertion Signed-off-by: stevehuang52 <heh@nvidia.com> * Configure 
MCore logger (#7781) Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com> * Revert "PEFT eval fix (#7626) (#7638)" (#7693) This reverts commit c24bb454bf1fa6f5820f1805c6387254a73220b9. * remove TN from ctc_segm tut (#7807) Signed-off-by: Evelina <ebakhturina@nvidia.com> * [TTS] Support audio offsets in TTS data loaders (#7156) * [TTS] Support audio offsets in TTS data loaders Signed-off-by: Ryan <rlangman@nvidia.com> * [TTS] Change docstring mentions of .pt to .npy Signed-off-by: Ryan <rlangman@nvidia.com> --------- Signed-off-by: Ryan <rlangman@nvidia.com> * Update Apex install command in Dockerfile (#7794) (#7804) * move core install to /workspace (#7706) * update apex install in dockerfile * use fetch head --------- Signed-off-by: Abhinav Khattar <aklife97@gmail.com> Signed-off-by: eharper <eharper@nvidia.com> Co-authored-by: Eric Harper <complex451@gmail.com> Co-authored-by: Abhinav Khattar <aklife97@gmail.com> * fix typo Signed-off-by: stevehuang52 <heh@nvidia.com> * Nemo to HF converter for LLaMA model (#7770) * Create config_llama_truncate.yaml Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> * Add files via upload Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> * Update convert_nemo_llama_to_hf.py Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update config_llama_truncate.yaml Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> * Update convert_nemo_llama_to_hf.py Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update convert_nemo_llama_to_hf.py Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> * clean up trainer * remove dependency on yaml config. load config from nemo file instead. 
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * enable ckpt saving into other precision formats * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * support 70b + cleanup qkv slice logic * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix bug * move hf model folder code from comment to function and add instruction to run * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Utkarsh <49331882+uppalutkarsh@users.noreply.github.com> Signed-off-by: Chen Cui <chcui@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Co-authored-by: Chen Cui <chcui@nvidia.com> * Save best NeMo model only when necessary (#7836) Signed-off-by: Ante Jukić <ajukic@nvidia.com> * add guard if its a distributed checkpoint (#7845) Signed-off-by: Gerald Shen <geshen@nvidia.com> * Fix tn duplex (#7808) * fix duplex tn infer Signed-off-by: Evelina <ebakhturina@nvidia.com> * fix typo Signed-off-by: Evelina <ebakhturina@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix TN docs Signed-off-by: Evelina <ebakhturina@nvidia.com> --------- Signed-off-by: Evelina <ebakhturina@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Update transformers cache on Jenkins (#7854) * update transformers cache Signed-off-by: eharper <eharper@nvidia.com> * update Signed-off-by: eharper <eharper@nvidia.com> * add cd Signed-off-by: eharper <eharper@nvidia.com> --------- Signed-off-by: eharper <eharper@nvidia.com> * Update README.rst for container update (#7844) Signed-off-by: fayejf <36722593+fayejf@users.noreply.github.com> * Add support for finetuning with 
huggingface datasets (#7834) * add finetune with huggingface dataset Signed-off-by: stevehuang52 <heh@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update yaml Signed-off-by: stevehuang52 <heh@nvidia.com> * update Signed-off-by: stevehuang52 <heh@nvidia.com> * update and refactor Signed-off-by: stevehuang52 <heh@nvidia.com> * add extrac hf text and update Signed-off-by: stevehuang52 <heh@nvidia.com> * update and refactor Signed-off-by: stevehuang52 <heh@nvidia.com> * move dataset dependency to common Signed-off-by: stevehuang52 <heh@nvidia.com> * add docstring Signed-off-by: stevehuang52 <heh@nvidia.com> * Add to Dics Signed-off-by: Nithin Rao Koluguri <nithinraok> * add ci test Signed-off-by: Nithin Rao Koluguri <nithinraok> * add max steps in jenkins Signed-off-by: Nithin Rao Koluguri <nithinraok> * reduce max steps Signed-off-by: Nithin Rao Koluguri <nithinraok> * jenkins test Signed-off-by: Nithin Rao Koluguri <nithinraok> * add bs=2 Signed-off-by: Nithin Rao Koluguri <nithinraok> --------- Signed-off-by: stevehuang52 <heh@nvidia.com> Signed-off-by: Nithin Rao Koluguri <nithinraok> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Nithin Rao Koluguri <nithinraok> Co-authored-by: Nithin Rao <nithinrao.koluguri@gmail.com> * Multimodal merge (#7728) * ControlNet TRT export * Final MR before release * SD2 update * Fixed export issue * Fix for instruct p2p and reformat * Fix SD export issue * Add nemo clip export for DB * Fix ins pix2pix * fix sd2 config * [Mingyuan Ma] BF16 and SD conversion script * [Imagen] NHWC Feature * Fix .nemo loading issue for NeMo CLIP in SD * NeMo r1.20.0 Multimodal Merge * fix the inductor issue in inference * Fix inductor loading .nemo issue * Add Neva Model Support * Imagen Optimizations * Neva inference code * NeMo TOT 1.21 to Internal/main * Update neva_inference.yaml * REBASING for latest code changes * 
Update internal/main to main tot * Parallel DDIM implementation * 1. Fixing indentation bug. (#7352) Signed-off-by: Micha Livne <mlivne@nvidia.com> * NeMo MCore llama2 support + MCore PEFT adapters (#7299) * start adding gpt from megatron core path Signed-off-by: ericharper <complex451@gmail.com> * set model parallel config Signed-off-by: ericharper <complex451@gmail.com> * use model parallel config object Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update args Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * set vp size to none if it is 1 Signed-off-by: ericharper <complex451@gmail.com> * set vp size to none if it is 1 Signed-off-by: ericharper <complex451@gmail.com> * add TransformerConfig Signed-off-by: ericharper <complex451@gmail.com> * start updating to TransformerConfig Signed-off-by: ericharper <complex451@gmail.com> * add todo Signed-off-by: ericharper <complex451@gmail.com> * revert to model parallel config Signed-off-by: ericharper <complex451@gmail.com> * add hidden_size to model_parallel_config Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove imports Signed-off-by: ericharper <complex451@gmail.com> * revert Signed-off-by: ericharper <complex451@gmail.com> * remove import Signed-off-by: ericharper <complex451@gmail.com> * small clean up Signed-off-by: ericharper <complex451@gmail.com> * update hidden size in peft base model, add mcore commit to jenkins Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update module args Signed-off-by: ericharper <complex451@gmail.com> * add config obj to flash attention tests Signed-off-by: 
ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove args Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove sequence parallel arg Signed-off-by: ericharper <complex451@gmail.com> * update args Signed-off-by: ericharper <complex451@gmail.com> * add config to self Signed-off-by: ericharper <complex451@gmail.com> * update args Signed-off-by: ericharper <complex451@gmail.com> * update args Signed-off-by: ericharper <complex451@gmail.com> * update args Signed-off-by: ericharper <complex451@gmail.com> * add config to test Signed-off-by: ericharper <complex451@gmail.com> * get hidden_size from config Signed-off-by: ericharper <complex451@gmail.com> * add try except Signed-off-by: ericharper <complex451@gmail.com> * use default Signed-off-by: ericharper <complex451@gmail.com> * update config with hidden size Signed-off-by: ericharper <complex451@gmail.com> * remove arg Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * comment out jenkins test Signed-off-by: ericharper <complex451@gmail.com> * revert import Signed-off-by: ericharper <complex451@gmail.com> * build transformer config Signed-off-by: ericharper <complex451@gmail.com> * add model to provider func Signed-off-by: ericharper <complex451@gmail.com> * update forward and float16 wrapper Signed-off-by: ericharper <complex451@gmail.com> * instantiate model parallel config after init model parallel Signed-off-by: ericharper <complex451@gmail.com> * set virtual rank Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add GQA config to megatron gpt model (#7096) * Add GQA config in gpt config file Signed-off-by: jasonwan 
<jasonwan@nvidia.com> * Verify mcore is enabled when using GQA Signed-off-by: jasonwan <jasonwan@nvidia.com> --------- Signed-off-by: jasonwan <jasonwan@nvidia.com> * revert Signed-off-by: ericharper <complex451@gmail.com> * mcore llama2 ckpt conversion & small fix Signed-off-by: jasonwan <jasonwan@nvidia.com> * Add inference & sft config by Hongbin Co-authored-by: Hongbin Liu <hongbinl@nvidia.com> Signed-off-by: jasonwan <jasonwan@nvidia.com> * fix config Signed-off-by: jasonwan <jasonwan@nvidia.com> * add inference param. update TP/PP script to support mcore gpt Signed-off-by: jasonwan <jasonwan@nvidia.com> * p-tuning Signed-off-by: jasonwan <jasonwan@nvidia.com> * modify ckpt conversion script (adding model cast) Signed-off-by: jasonwan <jasonwan@nvidia.com> * ckpt conversion use relative path for config Signed-off-by: jasonwan <jasonwan@nvidia.com> * start adding gpt from megatron core path Signed-off-by: ericharper <complex451@gmail.com> * set model parallel config Signed-off-by: ericharper <complex451@gmail.com> * use model parallel config object Signed-off-by: ericharper <complex451@gmail.com> * update args Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * set vp size to none if it is 1 Signed-off-by: ericharper <complex451@gmail.com> * set vp size to none if it is 1 Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add TransformerConfig Signed-off-by: ericharper <complex451@gmail.com> * start updating to TransformerConfig Signed-off-by: ericharper <complex451@gmail.com> * add todo Signed-off-by: ericharper <complex451@gmail.com> * revert to model parallel config Signed-off-by: ericharper <complex451@gmail.com> * add hidden_size to model_parallel_config Signed-off-by: ericharper <complex451@gmail.com> * remove imports Signed-off-by: ericharper 
<complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove import Signed-off-by: ericharper <complex451@gmail.com> * small clean up Signed-off-by: ericharper <complex451@gmail.com> * update hidden size in peft base model, add mcore commit to jenkins Signed-off-by: ericharper <complex451@gmail.com> * update module args Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add config obj to flash attention tests Signed-off-by: ericharper <complex451@gmail.com> * remove args Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove sequence parallel arg Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update args Signed-off-by: ericharper <complex451@gmail.com> * add config to self Signed-off-by: ericharper <complex451@gmail.com> * update args Signed-off-by: ericharper <complex451@gmail.com> * update args Signed-off-by: ericharper <complex451@gmail.com> * update args Signed-off-by: ericharper <complex451@gmail.com> * add config to test Signed-off-by: ericharper <complex451@gmail.com> * get hidden_size from config Signed-off-by: ericharper <complex451@gmail.com> * add try except Signed-off-by: ericharper <complex451@gmail.com> * use default Signed-off-by: ericharper <complex451@gmail.com> * update config with hidden size Signed-off-by: ericharper <complex451@gmail.com> * remove arg Signed-off-by: ericharper <complex451@gmail.com> * comment out jenkins test Signed-off-by: ericharper <complex451@gmail.com> * revert import Signed-off-by: ericharper <complex451@gmail.com> * remove optimizer_idx Signed-off-by: eharper <eharper@nvidia.com> * prefetch num microbatches Signed-off-by: eharper 
<eharper@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * start adding gpt from megatron core path Signed-off-by: ericharper <complex451@gmail.com> * set model parallel config Signed-off-by: ericharper <complex451@gmail.com> * use model parallel config object Signed-off-by: ericharper <complex451@gmail.com> * update args Signed-off-by: ericharper <complex451@gmail.com> * fix for p-tuning sequence parallel Signed-off-by: jasonwan <jasonwan@nvidia.com> * support SFT/distOpt mcore (#7207) * add inference param. update TP/PP script to support mcore gpt * p-tuning Signed-off-by: jasonwan <jasonwan@nvidia.com> * change layer names for SFT Signed-off-by: Hongbin Liu <hongbinl@nvidia.com> * fix bug in SFT Signed-off-by: Hongbin Liu <hongbinl@nvidia.com> --------- Signed-off-by: jasonwan <jasonwan@nvidia.com> Signed-off-by: Hongbin Liu <hongbinl@nvidia.com> Co-authored-by: Hongbin Liu <hongbinl@nvidia.com> Co-authored-by: jasonwan <jasonwan@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * start updating to TransformerConfig Signed-off-by: ericharper <complex451@gmail.com> * revert to model parallel config Signed-off-by: ericharper <complex451@gmail.com> * add hidden_size to model_parallel_config Signed-off-by: ericharper <complex451@gmail.com> * remove imports Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update module args Signed-off-by: ericharper <complex451@gmail.com> * add config to self Signed-off-by: ericharper <complex451@gmail.com> * build transformer config Signed-off-by: ericharper <complex451@gmail.com> * add model to provider func Signed-off-by: ericharper <complex451@gmail.com> * update forward and float16 wrapper Signed-off-by: ericharper <complex451@gmail.com> * instantiate model parallel config after init model 
parallel Signed-off-by: ericharper <complex451@gmail.com> * set virtual rank Signed-off-by: ericharper <complex451@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add GQA config to megatron gpt model (#7096) * Add GQA config in gpt config file Signed-off-by: jasonwan <jasonwan@nvidia.com> * Verify mcore is enabled when using GQA Signed-off-by: jasonwan <jasonwan@nvidia.com> --------- Signed-off-by: jasonwan <jasonwan@nvidia.com> * revert Signed-off-by: ericharper <complex451@gmail.com> * remove import Signed-off-by: eharper <eharper@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * rollback model cast for p-tuning Signed-off-by: jasonwan <jasonwan@nvidia.com> * update for dist adam Signed-off-by: eharper <eharper@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * use get_gpt_module_list Signed-off-by: eharper <eharper@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update ckpt conversion script Signed-off-by: jasonwan <jasonwan@nvidia.com> * ptl2.0 patch for llama config Signed-off-by: jasonwan <jasonwan@nvidia.com> * add plugins to trainer in scripts Signed-off-by: jasonwan <jasonwan@nvidia.com> * fix activation checkpointing mcore Signed-off-by: jasonwan <jasonwan@nvidia.com> * fix variable names Signed-off-by: jasonwan <jasonwan@nvidia.com> * overwrite normalization type for mcore/te Signed-off-by: jasonwan <jasonwan@nvidia.com> * Update megatron_llama_sft.yaml Signed-off-by: Jason Wang <jasonwan@nvidia.com> * add PEFT adapter support for mcore gpt path (#7276) * implementation for mcore adapter/mxins Signed-off-by: jasonwan <jasonwan@nvidia.com> * small fix for lora and ptuning Signed-off-by: jasonwan <jasonwan@nvidia.com> * support layerwise peft Signed-off-by: jasonwan <jasonwan@nvidia.com> * 
support multiple target layers Signed-off-by: jasonwan <jasonwan@nvidia.com> * support lora GQA Signed-off-by: jasonwan <jasonwan@nvidia.com> * support amp O2 Signed-off-by: jasonwan <jasonwan@nvidia.com> * revert & more O2 fix Signed-off-by: jasonwan <jasonwan@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * lora inject to attention Signed-off-by: jasonwan <jasonwan@nvidia.com> * support …

galv requested review from titu1994, pzelasko and nithinraok and removed request for titu1994 and pzelasko May 14, 2024 18:15

github-actions bot added core Changes to NeMo Core ASR labels May 14, 2024

galv requested a review from pzelasko May 14, 2024 18:15

galv mentioned this pull request May 14, 2024

Draft: Fix the "cast ping pong" problem when we run AMP inference. #9086

Closed

nithinraok requested changes May 14, 2024

galv commented May 14, 2024

nemo/collections/asr/parts/submodules/multi_head_attention.py Outdated, resolved

galv commented May 14, 2024

nemo/collections/asr/parts/submodules/multi_head_attention.py Outdated, resolved

titu1994 reviewed May 15, 2024

github-actions bot removed the core Changes to NeMo Core label May 17, 2024

galv force-pushed the dgalvez/fix-autocast-slowness-2 branch from fbf0a05 to a97048e Compare May 17, 2024 18:23

github-advanced-security bot found potential problems May 17, 2024

examples/asr/transcribe_speech.py Fixed

examples/asr/transcribe_speech.py Fixed

nemo/collections/asr/parts/utils/transcribe_utils.py Fixed

nemo/collections/asr/parts/utils/transcribe_utils.py Fixed

galv force-pushed the dgalvez/fix-autocast-slowness-2 branch 2 times, most recently from 96cdaca to 82c607c Compare May 17, 2024 18:35

github-advanced-security bot found potential problems May 17, 2024

examples/asr/transcribe_speech.py Dismissed

galv force-pushed the dgalvez/fix-autocast-slowness-2 branch from 8e87d9d to bf017ac Compare May 17, 2024 20:20

galv requested review from titu1994 and nithinraok May 17, 2024 20:22

github-advanced-security bot found potential problems May 17, 2024

examples/asr/transcribe_speech.py Dismissed

nithinraok approved these changes May 17, 2024

nemo/collections/asr/parts/utils/transcribe_utils.py Outdated, resolved

nemo/collections/asr/parts/utils/transcribe_utils.py Outdated, resolved

nithinraok added the Run CICD label May 17, 2024

nithinraok reviewed May 17, 2024

examples/asr/transcribe_speech.py Resolved

nithinraok previously requested changes May 17, 2024

galv requested a review from nithinraok May 29, 2024 18:28

galv force-pushed the dgalvez/fix-autocast-slowness-2 branch from 0f5385d to f702953 Compare May 29, 2024 18:31

galv added Run CICD and removed Run CICD labels May 29, 2024

nithinraok previously approved these changes May 30, 2024

titu1994 requested changes May 31, 2024

galv dismissed nithinraok’s stale review via 9e8dc25 June 5, 2024 21:45

galv added 2 commits June 5, 2024 14:48

Always cast softmax inputs to float32 when in training mode.

2a6f156

While we don't need this for accurate results in bfloat16/float16, this is a safety precaution to make sure that training accuracy does not regress.

Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>
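To illustrate the idea in this commit message, here is a minimal sketch of a softmax that upcasts to float32 only in training mode. It is not the code from this PR: the function name `attention_softmax`, the `training` argument, and the cast back to the input dtype are assumptions made for the example. It relies only on the `dtype` argument of `torch.nn.functional.softmax`, which casts the input before the softmax is computed.

```python
import torch
import torch.nn.functional as F


def attention_softmax(scores: torch.Tensor, training: bool) -> torch.Tensor:
    """Hypothetical helper: softmax over the last dim of attention scores."""
    if training:
        # Training: compute the softmax in float32 as a safety precaution,
        # then cast back so downstream matmuls stay in the model dtype
        # (the cast back is an assumption of this sketch).
        probs = F.softmax(scores, dim=-1, dtype=torch.float32)
        return probs.to(scores.dtype)
    # Inference: stay in the model dtype (e.g. bfloat16) and skip the extra casts.
    return F.softmax(scores, dim=-1)


torch.manual_seed(0)
# Fake attention scores with shape (batch, heads, time, time) in bfloat16.
scores = torch.randn(2, 4, 8, 8, dtype=torch.bfloat16)
train_probs = attention_softmax(scores, training=True)
eval_probs = attention_softmax(scores, training=False)
print(train_probs.dtype, eval_probs.dtype)
```

Both paths return a tensor in the input dtype; only the precision of the intermediate computation differs. In a module, `training` would typically come from `self.training`.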

galv force-pushed the dgalvez/fix-autocast-slowness-2 branch from 9e8dc25 to 2a6f156 Compare June 5, 2024 21:50

galv added Run CICD and removed Run CICD labels Jun 5, 2024

titu1994 approved these changes Jun 6, 2024

titu1994 merged commit fc2e693 into NVIDIA:main Jun 6, 2024
130 checks passed

anteju mentioned this pull request Jul 17, 2024

[Audio] Metric with Squim objective and MOS #9751

Merged

8 tasks

ko3n1g mentioned this pull request Jul 18, 2024

Release 2.0.0rc1 #9786

Closed

2 tasks

nithinraok May 14, 2024

galv May 14, 2024

titu1994 May 15, 2024

pzelasko Jun 4, 2024

nithinraok May 14, 2024

galv May 14, 2024

titu1994 May 15, 2024

nithinraok May 15, 2024

pzelasko Jun 4, 2024

galv commented May 15, 2024

titu1994 May 15, 2024

galv May 15, 2024

galv May 29, 2024

titu1994 May 31, 2024

titu1994 May 31, 2024

VahidooX Jun 3, 2024

borisfom Jun 4, 2024

titu1994 May 15, 2024

galv May 15, 2024

titu1994 May 15, 2024

titu1994 May 15, 2024

galv commented May 17, 2024

nithinraok left a comment

nithinraok left a comment

galv commented May 29, 2024

nithinraok left a comment

titu1994 left a comment

titu1994 May 31, 2024

titu1994 May 31, 2024

titu1994 May 31, 2024

pzelasko commented Jun 4, 2024

titu1994 left a comment
