Releases: NVIDIA/NeMo
NVIDIA Neural Modules 2.1.0rc2
Prerelease: NVIDIA Neural Modules 2.1.0rc2 (2024-12-21)
NVIDIA Neural Modules 2.1.0rc1
Prerelease: NVIDIA Neural Modules 2.1.0rc1 (2024-12-20)
NVIDIA Neural Modules 2.1.0rc0
[🤠]: Howdy folks, let's release NeMo `r2.1.0`! (#11556)
NVIDIA Neural Modules 2.0.0
Highlights
Large language models & Multimodal
- Training
  - Long context recipe
  - PyTorch Native FSDP 1
- Models
  - Llama 3
  - Mixtral
  - Nemotron
- NeMo 1.0
Export
- TensorRT-LLM v0.12 integration (see the export sketch after this list)
- LoRA support for vLLM
- FP8 checkpoint
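As a rough illustration of the export path highlighted above, the sketch below drives the nemo.export TensorRT-LLM wrapper; the checkpoint path, engine directory, and model_type value are placeholder assumptions, and keyword names can vary between NeMo releases.
# A hedged sketch, not the exact release API surface: export a .nemo
# checkpoint to a TensorRT-LLM engine, then run a quick smoke test.
from nemo.export import TensorRTLLM

exporter = TensorRTLLM(model_dir="/tmp/trt_llm_engine")  # engine output dir (placeholder)
exporter.export(
    nemo_checkpoint_path="/models/llama3-8b.nemo",  # hypothetical checkpoint path
    model_type="llama",
)
output = exporter.forward(["What is NeMo?"])  # generation kwargs vary by version
print(output)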
ASR
- Parakeet large (ASR with punctuation and capitalization)
- Added Uzbek offline and Georgian streaming models
- Efficient bucketing optimization to improve batch-size utilization on GPUs (see the sketch after this list)
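The bucketing item above refers to Lhotse dynamic bucketing; a minimal sketch of the relevant train_ds options follows, assuming the key names from NeMo's Lhotse dataloading docs (the values are illustrative, not release defaults).
# Hedged sketch: Lhotse dynamic bucketing settings for an ASR train_ds section.
from omegaconf import OmegaConf

train_ds = OmegaConf.create({
    "use_lhotse": True,       # switch dataloading to Lhotse
    "use_bucketing": True,    # group utterances of similar duration to cut padding
    "num_buckets": 30,        # finer buckets -> tighter duration ranges (placeholder value)
    "batch_duration": 600.0,  # seconds of audio per batch, i.e. dynamic batch size (placeholder)
})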
Detailed Changelogs
ASR
Changelog
- add parakeet-tdt_ctc-110m model by @nithinraok :: PR: #10461
- fix asr finetune by @stevehuang52 :: PR: #10508
- replace unbiased with correction by @nithinraok :: PR: #10555
- Update Multi_Task_Adapters.ipynb by @pzelasko :: PR: #10600
- Fix asr warnings by @nithinraok :: PR: #10469
- Fix typo in ASR RNNT BPE model by @pzelasko :: PR: #10742
- TestEncDecMultiTaskModel for canary parallel by @karpnv :: PR: #10740
- fix chunked infer by @stevehuang52 :: PR: #10581
- training code for hybrid-autoregressive inference model by @hainan-xv :: PR: #10841
- remove stacking operation from batched functions by @lilithgrigoryan :: PR: #10524
- Add lhotse fixes for rnnt model training and WER hanging issue with f… by @nithinraok :: PR: #10821
- Fix ASR tests by @artbataev :: PR: #10794
- [Fix] Fixed sampler override and audio_key in prepare_audio_data by @anteju :: PR: #10980
- [WIP] Add docs for NEST SSL by @stevehuang52 :: PR: #10804
- Akoumparouli/mixtral recipe fix r2.0.0 by @akoumpa :: PR: #10994
- TDT compute timestamps option and Extra Whitespace handling for SPE by @monica-sekoyan :: PR: #10875
- ci: Switch to CPU only runner by @ko3n1g :: PR: #11035
- Fix timestamps tests by @monica-sekoyan :: PR: #11053
- ci: Pin release freeze by @ko3n1g :: PR: #11143
- Fix RNN-T loss memory usage by @artbataev :: PR: #11144
- Added deprecation notice by @Ssofja :: PR: #11133
- Fixes for Canary adapters tutorial by @pzelasko :: PR: #11184
- add ipython import guard by @nithinraok :: PR: #11191
- Self Supervised Pre-Training tutorial Fix by @monica-sekoyan :: PR: #11206
- update the return type by @nithinraok :: PR: #11210
- Timestamps to transcribe by @nithinraok :: PR: #10950
- [Doc fixes] update file names, installation instructions, bad links by @erastorgueva-nv :: PR: #11045
- Beam search algorithm implementation for TDT models by @lilithgrigoryan :: PR: #10903
TTS
Changelog
- Fix asr warnings by @nithinraok :: PR: #10469
- Make nemo text processing optional in TTS by @blisc :: PR: #10584
- [Doc fixes] update file names, installation instructions, bad links by @erastorgueva-nv :: PR: #11045
NLP / NMT
Changelog
- MCORE interface for TP-only FP8 AMAX reduction by @erhoo82 :: PR: #10437
- Remove Apex dependency if not using MixedFusedLayerNorm by @cuichenx :: PR: #10468
- Add missing import guards for causal_conv1d and mamba_ssm dependencies by @janekl :: PR: #10429
- Update doc for fp8 trt-llm export by @Laplasjan107 :: PR: #10444
- Remove running validating after finetuning by @huvunvidia :: PR: #10560
- Extending modelopt spec for TEDotProductAttention by @janekl :: PR: #10523
- Fix mb_calculator import in lora tutorial by @BoxiangW :: PR: #10624
- .nemo conversion bug fix by @dimapihtar :: PR: #10598
- Require setuptools>=70 and update deprecated api by @thomasdhc :: PR: #10659
- Akoumparouli/fix get tokenizer list by @akoumpa :: PR: #10596
- [McoreDistOptim] fix the naming to match apex.dist by @gdengk :: PR: #10707
- [fix] Ensures disabling exp_manager with exp_manager=null does not error by @terrykong :: PR: #10651
- [feat] Update get_model_parallel_src_rank to support tp-pp-dp ordering by @terrykong :: PR: #10652
- feat: Migrate GPTSession refit path in Nemo export to ModelRunner for Aligner by @terrykong :: PR: #10654
- [MCoreDistOptim] Add assertions for McoreDistOptim and fix fp8 arg specs by @gdengk :: PR: #10748
- Fix for crashes with tensorboard_logger=false and VP + LoRA by @vysarge :: PR: #10792
- Adding init_model_parallel to FabricMegatronStrategy by @marcromeyn :: PR: #10733
- Moving steps to MegatronParallel to improve UX for Fabric by @marcromeyn :: PR: #10732
- Adding setup_megatron_optimizer to FabricMegatronStrategy by @marcromeyn :: PR: #10833
- Make FabricMegatronMixedPrecision match MegatronMixedPrecision by @marcromeyn :: PR: #10835
- Fix VPP bug in MegatronStep by @marcromeyn :: PR: #10847
- Expose drop_last in MegatronDataSampler by @farhadrgh :: PR: #10837
- Move collectiob.nlp imports inline for t5 by @marcromeyn :: PR: #10877
- Use a context-manager when opening files by @akoumpa :: PR: #10895
- ckpt convert bug fixes by @dimapihtar :: PR: #10878
- remove deprecated ci tests by @dimapihtar :: PR: #10922
- Update T5 tokenizer (adding additional tokens to tokenizer config) by @huvunvidia :: PR: #10972
- Add support and recipes for HF models via AutoModelForCausalLM by @akoumpa :: PR: #10962
- gpt3 175b cli by @malay-nagda :: PR: #10985
- Fix for crash with LoRA + tp_overlap_comm=false + sequence_parallel=true by @vysarge :: PR: #10920
- Update BaseMegatronSampler for compatibility with PTL's _BatchProgress by @ashors1 :: PR: #11016
- add deprecation note by @dimapihtar :: PR: #11024
- Update ModelOpt Width Pruning example defaults by @kevalmorabia97 :: PR: #10902
- switch to NeMo 2.0 recipes by @dimapihtar :: PR: #10948
- NeMo 1.0: upcycle dense to moe by @akoumpa :: PR: #11002
- Update mcore parallelism initialization in nemo2 by @yaoyu-33 :: PR: #10643
- Gemma2 in Nemo2 with Recipes by @suiyoubi :: PR: #11037
- Add Packed Seq option to GPT based models by @suiyoubi :: PR: #11100
- Fix MCoreGPTModel import in llm.gpt.model.base by @hemildesai :: PR: #11109
- TP+MoE peft fix by @akoumpa :: PR: #11114
- GPT recipes to use full te spec by @JimmyZhang12 :: PR: #11119
- Virtual pipeline parallel support for LoRA in NLPAdapterModelMixin by @vysarge :: PR: #11128
- update nemo args for mcore flash decode arg change by @HuiyingLi :: PR: #11138
- Call ckpt_to_weights_subdir from MegatronCheckpointIO by @ashors1 :: PR: #10897
- fix typo by @dimapihtar :: PR: #11234
- [Doc fixes] update file names, installation instructions, bad links by @erastorgueva-nv :: PR: #11045
- fix(export): GPT models w/ bias=False convert properly by @terrykong :: PR: #11255
NVIDIA Neural Modules 2.0.0rc1
Highlights
Large language models
- PEFT: QLoRA support, LoRA/QLora for Mixture-of-Experts (MoE) dense layer
- State Space Models & Hybrid Architecture support (Mamba2 and NV-Mamba2-hybrid)
- Support Nemotron, Minitron, Gemma2, Qwen, RAG
- Custom Tokenizer training in NeMo
- Update the Auto-Configurator for EP, CP and FSDP
Multimodal
- NeVA: Add SOTA LLM backbone support (Mixtral/LLaMA3) and suite of model parallelism support (PP/EP)
- Support Language Instructed Temporal-Localization Assistant (LITA) on top of video NeVA
ASR
- SpeechLM and SALM
- Adapters for Canary Customization
- PyTorch CUDA allocator optimization in PyTorch 2.2 improves training speed by up to 30% for all ASR models
- CUDA Graphs for Transducer Inference
- Replaced webdataset with Lhotse - gives up to 2x speedup
- Transcription Improvements - Speedup and QoL Changes
- ASR Prompt Formatter for multimodal Canary
Export & Deploy
- In-framework PyTriton deployment with backends:
  - PyTorch
  - vLLM
  - TRT-LLM (updated to 0.10)
- TRT-LLM C++ runtime
Detailed Changelogs
ASR
Changelog
- Support dataloader as input to audio for transcription by @titu1994 :: PR: #9201
- Clean up dev docs collection section by @yaoyu-33 :: PR: #9205
- Fix Online_Offline_Microphone_VAD_Demo.ipynb by @stevehuang52 :: PR: #9251
- Remove .nemo instead of renaming by @mikolajblaz :: PR: #9281
- Fix GreedyBatchedCTCInfer regression from GreedyCTCInfer. by @galv :: PR: #9347
- Revert "Fix GreedyBatchedCTCInfer regression from GreedyCTCInfer." by @titu1994 :: PR: #9351
- Prompt formatter API and canary transcribe tensor input support by @pzelasko :: PR: #9206
- Fix prompt formatter's defaults=None case in multi-task model by @pzelasko :: PR: #9366
- move AED chunked infer script by @stevehuang52 :: PR: #9367
- Use model-cast-to-bfloat16 rather than AMP-to-bfloat16 for inference. by @galv :: PR: #9198
- ci: Fix `L2_Segmentation_Tool_Parallel_ctc_segmentation_test_L2_Eng_C… by @ko3n1g :: PR: #9399
- Fix logging message for ASR by @titu1994 :: PR: #9469
- Add support to change Multi task model prompt by @titu1994 :: PR: #9542
- Enable encoder adapters for Canary and MultiTaskAED models by @titu1994 :: PR: #9409
- Audio model collection by @anteju :: PR: #9263
- TitaNet Batch Verify Speaker by @monica-sekoyan :: PR: #9337
- Fix the arguments of forward_for_export function in msdd_models by @tango4j :: PR: #9624
- chore: Pin branch in notebooks by @ko3n1g :: PR: #9697
- refactor: notebook branch release by @ko3n1g :: PR: #9711
- Canary Adapters tutorial (#9670) by @nithinraok :: PR: #9777
- typos and branch name update to r2.0.0rc1 by @nithinraok :: PR: #9846
- Fix RNNT alignments test by @artbataev :: PR: #9770
- By default trust remote code from HF Datasets by @nithinraok :: PR: #9886
- Temporarily disable cuda graph based RNN-T greedy inference for r2.0.0rc1 by @galv :: PR: #9904
- Enable CUDA graphs by default, but require CUDA 12.6 for full graphs by @artbataev :: PR: #9919
- update branch name for script by @nithinraok :: PR: #9936
- updte branch by @nithinraok :: PR: #9942
TTS
Changelog
LLM/Multimodal
Changelog
- Update nemo.export module for quantized models by @janekl :: PR: #9218
- Add save option to the TRT-LLM export test script by @oyilmaz-nvidia :: PR: #9221
- Checkpoint resuming compatible for 2403 container by @suiyoubi :: PR: #9199
- Clean up dev docs collection section by @yaoyu-33 :: PR: #9205
- use get with fallback when reading checkpoint_callback_params by @akoumpa :: PR: #9223
- Revert rope fusion defaults by @cuichenx :: PR: #9237
- fix import by @akoumpa :: PR: #9240
- Add TRT-LLM params like max_num_tokens and opt_num_tokens by @oyilmaz-nvidia :: PR: #9210
- sum-reduce grad_norm in DP+CP domain by @erhoo82 :: PR: #9262
- Alit/bert convert fix by @JRD971000 :: PR: #9285
- conv1d stable version by @JRD971000 :: PR: #9330
- Fix trainer builder when exp_manager is not in config by @yaoyu-33 :: PR: #9293
- Fix Peft Weights Loading in NeVA by @yaoyu-33 :: PR: #9341
- Skip sequence_parallel allreduce when using Mcore DistOpt by @akoumpa :: PR: #9344
- Fix FSDP gradient calculation with orig params by @janEbert :: PR: #9335
- TRT-LLM Export Code Cleanup by @oyilmaz-nvidia :: PR: #9270
- support null/None truncation field by @arendu :: PR: #9355
- NeVa token fusion by @paul-gibbons :: PR: #9245
- bugfix if using mcore distOpt with sft by @akoumpa :: PR: #9356
- Re-org export code by @oyilmaz-nvidia :: PR: #9353
- QLoRA by @cuichenx :: PR: #9340
- PeFT fix for distOpt by @akoumpa :: PR: #9392
- [NeMo-UX] Integrating mcore's DistributedDataParallel into MegatronStrategy by @marcromeyn :: PR: #9387
- cherry pick of #9266 by @dimapihtar :: PR: #9411
- Enable specifying alpha for PTQ INT8 SmoothQuant method by @janekl :: PR: #9423
- add support for new mcore ds features by @dimapihtar :: PR: #9388
- LoRA for MoE Layer by @cuichenx :: PR: #9396
- Mistral-7B: apply user's precision to output checkpoint by @akoumpa :: PR: #9222
- Add option to merge distributed optimizer buckets by @timmoon10 :: PR: #9414
- TRT-LLM 0.10 Update by @oyilmaz-nvidia :: PR: #9402
- In-framework deployment by @oyilmaz-nvidia :: PR: #9438
- Bugfix missing variables and argument changes to MegatronPretrainingRandomSampler by @jstjohn :: PR: #9458
- Hyena Operator by @guyjacob :: PR: #9264
- Refactor Quantizer for reusing in QAT by @kevalmorabia97 :: PR: #9276
- move load state dict after initialize parallel state in nlp_model by @ryxli :: PR: #9382
- Enable user to optionally upgrade Megatron by @jstjohn :: PR: #9478
- Fix unwrap model by @cuichenx :: PR: #9480
- fix operator precedence by @akoumpa :: PR: #9403
- [NeMo-UX] Adding context- & expert-parallelism to MegatronStrategy by @marcromeyn :: PR: #9525
- update mcoreddp call by @akoumpa :: PR: #9345
- mcore distOpt restore fix by @akoumpa :: PR: #9421
- vLLM Export Support by @apanteleev :: PR: #9381
- PL: Delete precision if using plugin. TODO switch to MegatronTrainerB… by @akoumpa :: PR: #9535
- extend get_gpt_layer_modelopt_spec to support MoE by @akoumpa :: PR: #9532
- fix mock data generation for legacy dataset by @dimapihtar :: PR: #9530
- add reset learning rate functionality by @dimapihtar :: PR: #9372
- Use closed-formula to round by multiple by @akoumpa :: PR: #9307
- GPU unit tests: Mark flaky tests to be fixed by @pablo-garay :: PR: #9559
- Consolidate gpt continue training script into pretraining script by @yaoyu-33 :: PR: #9413
- Enable encoder adapters for Canary and MultiTaskAED models by @titu1994 :: PR: #9409
- PTQ refinements by @janekl :: PR: #9574
- Add ModelOpt QAT example for Llama2 SFT model by @kevalmorabia97 :: PR: #9326
- Multimodal projection layer adapter fix for PP>1 by @paul-gibbons :: PR: #9445
- Add offline quantization script for QLoRA deployment by @cuichenx :: PR: #9455
- Make QLoRA more model-agnostic by @cuichenx :: PR: #9488
- Set n_gpu to None in nemo export by @oyilmaz-nvidia :: PR: #9593
- [NeMo-UX] Fix Megatron-optimizer by @marcromeyn :: PR: #9599
- Chat template support for megatron_gpt_eval.py by @akoumpa :: PR: #9354
- [NeMo-UX] Add PEFT by @cuichenx :: PR: #9490
- Alit/mamba tmp by @JRD971000 :: PR: #9612
- Enable MCore checkpointing optimizations by @mikolajblaz :: PR: #9505
- Change mixtral moe key name for trt-llm by @oyilmaz-nvidia :: PR: #9620
- fix ckpt load bug by @dimapihtar :: PR: #9621
- Alit/mamba by @JRD971000 :: PR: #9575
- Unwrap ckpt_io for model opt (async save) by @mikolajblaz :: PR: #9622
- MCore T5 support for NeMo - Training by @huvunvidia :: PR: #9432
- [Nemo-UX] Expose transformer_layer_spec inside GPTConfig by @marcromeyn :: PR: #9592
- Update NeMo Clip to Use MCore Modules by @yaoyu-33 :: PR: #9594
- Mistral + Mixtral Support for NeVa by @paul-gibbons :: PR: #9459
- Adding support for mcore generate by @shanmugamr1992 :: PR: #9566
- Improve error messaging during trt-llm export by @oyilmaz-nvidia :: PR: #9638
- [Cherrypick] support lora when kv_channel != hidden_size / num_heads by @cuichenx :: PR: #9644
- Parametrize FPS group by @mikolajblaz :: PR: #9648
- Cherry-pick megatron export fix from main by @borisfom :: PR: #9643
- add documentation for reset_lr feature by @dimapihtar
- chore: Pin branch in notebooks by @ko3n1g :: PR: #9697
- Cherry pick: LITA Integration by @Slyne :: PR: #9684
- SDXL improvements (and support for Draft+) by @rohitrango :: PR: #9654
- Gemma 2 by @cuichenx :: PR: #9672
- Allows non-strict load with distributed checkpoints by @mikolajblaz :: PR: #9613
- refactor: notebook branch release by @ko3n1g :: PR: #9711
- [NeMo-UX] Make TE and Apex dependencies optional by @ashors1 :: PR: #9550
- Alit/r2.0.0 by @JRD971000 :: PR: #9718
- Manually cherry-pick from PR 9679 (PR to main - Support SFT/Eval/PEFT for mcore T5) by @huvunvidia :: PR: #9737
- In framework export by @oyilmaz-nvidia :: PR: #9658
- T5 changes based on mcore changes by @pablo-garay :: PR: #9829
- [NeMo-UX] Use single instance of loss reductions in GPTModel by @hemildesai :: PR: #9801
- deprecate NeMo NLP tutorial by @dimapihtar :: PR: #9864
- Disable nvFuser setup with PyTorch 23.11 and later by @athitten :: PR: #9837
- make torch_dist ckpt strategy as default by @dimapihtar :: PR: #9852
- add rampup bs documentation by @dimapihtar :: PR: #9884
- copy of #9576 by @dimapihtar :: PR: #9986
- Support Nvidia Torch and Arch versions by @thomasdhc :: PR: #9897
-...
NVIDIA Neural Modules 2.0.0rc0
Highlights
LLM and MM
Models
- Megatron Core RETRO
  - Pre-training
  - Zero-shot Evaluation
- Pretraining, conversion, evaluation, SFT, and PEFT for:
  - Mixtral 8X22B
  - Llama 3
  - SpaceGemma
- Embedding Models Fine Tuning
  - Mistral
  - BERT
- BERT models
  - Context Parallel
  - Distributed checkpoint
- Video capabilities with NeVa
Performance
- Distributed Checkpointing
  - Torch native backend
  - Parallel read/write
  - Async write
- Multimodal LLM (LLAVA/NeVA)
  - Pipeline Parallelism support
  - Sequence packing support
Export
- Integration of Export & Deploy Modules into NeMo Framework container
- Upgrade to TRT-LLM 0.9
Speech (ASR & TTS)
Models
- AED Multi Task Models (Canary) - Multi-Task Multi-Lingual Speech Recognition / Speech Translation model
- Multimodal Domain - Speech LLM supporting SALM Model
- Parakeet-tdt_ctc-1.1b Model - RTFx of > 1500 (can transcribe 1500 seconds of audio in 1 second)
- Audio Codec 16kHz Small - NeMo Neural Audio Codec for discretizing speech for use in LLMs
- mel_codec_22khz_medium
- mel_codec_44khz_medium
Perf Improvements
- Transcribe() upgrade - Enables one-line transcription with files, tensors, and data loaders (see the sketch after this list)
- Frame looping algorithm for RNNT faster decoding - Improves Real Time Factor (RTF) by 2-3x
- Cuda Graphs + Label-Looping algorithm for RNN-T and TDT Decoding - Transducer Greedy decoding at over 1500x RTFx, on par with CTC Non-Autoregressive models
- Semi Sorted Batching support - External User contribution that speeds up training by 15-30%.
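As referenced in the Transcribe() item above, here is a minimal sketch of the one-line API; the checkpoint name and audio path are placeholders, and newer releases may return hypothesis objects instead of plain strings.
# Hedged sketch of one-line transcription with a released Parakeet checkpoint.
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-1.1b")  # placeholder name
transcripts = asr_model.transcribe(["sample.wav"], batch_size=4)  # accepts files, tensors, or loaders
print(transcripts[0])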
Customization
- Context biasing for CTC word stamping - Improve accuracy for custom vocabulary and pronunciation
- Longform Inference
- Longform inference support for AED models
- Transcription of multi-channel audio for AED models
Misc
- Upgraded webdataset - Speech and LLM / Multimodal unified container
Detailed Changelogs
ASR
Changelog
- Enable using hybrid asr models in CTC Segmentation tool by @erastorgueva-nv :: PR: #8828
- TDT confidence fix by @GNroy :: PR: #8982
- Fix union type annotations for autodoc+mock-import rendering by @pzelasko :: PR: #8956
- NeMo dev doc restructure by @yaoyu-33 :: PR: #8896
- Improved random seed configuration for Lhotse dataloaders with docs by @pzelasko :: PR: #9001
- Fix #8948, allow preprocessor to be stream captured to a cuda graph when doing per_feature normalization by @galv :: PR: #8964
- [ASR] Support for transcription of multi-channel audio for AED models by @anteju :: PR: #9007
- Add ASR latest news by @titu1994 :: PR: #9073
- Fix docs errors and most warnings by @erastorgueva-nv :: PR: #9006
- PyTorch CUDA allocator optimization for dynamic batch shape dataloading in ASR by @pzelasko :: PR: #9061
- RNN-T and TDT inference: use CUDA graphs by default by @artbataev :: PR: #8972
- Fix #8891 by supported GPU-side batched CTC Greedy Decoding by @galv :: PR: #9100
- Update branch for notebooks and ci in release by @ericharper :: PR: #9189
- Enable CUDA graphs by default only for transcription by @artbataev :: PR: #9196
- rename paths2audiofiles to audio by @nithinraok :: PR: #9209
- Fix ASR_Context_Biasing.ipynb contains FileNotFoundError by @andrusenkoau :: PR: #9233
- Cherrypick: Support dataloader as input to audio for transcription (#9201) by @titu1994 :: PR: #9235
- Update Online_Offline_Microphone_VAD_Demo.ipynb by @stevehuang52 :: PR: #9252
- Dgalvez/fix greedy batch strategy name r2.0.0rc0 by @galv :: PR: #9243
- Accept None as an argument to decoder_lengths in GreedyBatchedCTCInfer::forward by @galv :: PR: #9246
- Fix loading github raw images on notebook by @nithinraok :: PR: #9282
- typos by @nithinraok :: PR: #9314
- Re-enable cuda graphs in training modes. by @galv :: PR: #9338
- add large model stable training fix and contrastive loss update for variable seq by @nithinraok :: PR: #9259
- Fix conv1d package in r2.0.0rc0 by @pablo-garay :: PR: #9369
- Fix GreedyBatchedCTCInfer regression from GreedyCTCInfer. (#9347) by @titu1994 :: PR: #9350
- Make a backward compatibility for old MSDD configs in label models by @tango4j :: PR: #9377
- Force diarizer to use CUDA if cuda is available and if device=None. by @tango4j :: PR: #9380
TTS
Changelog
LLM and MM
Changelog
- Rachitg/dpa by @rachitgarg91 :: PR: #8911
- Remove precision args in trainer due to PTL update by @yaoyu-33 :: PR: #8908
- Huvu/mcore retro by @huvunvidia :: PR: #8861
- fsdp tp > 1 bug fix by @dimapihtar :: PR: #8947
- Fix memory leak at loss func by @minitu :: PR: #8868
- change the condition for get qkv tensor from linear_qkv output in mcoremixin by @HuiyingLi :: PR: #8965
- Add safety checks for 'data' key in MegatronGPTModel cfg by @HuiyingLi :: PR: #8991
- [NeMo-UX] Adding MegatronParallel by @cuichenx :: PR: #8987
- Skip top_p computations when set to 1.0 by @odelalleau :: PR: #8905
- Gemma bug by @cuichenx :: PR: #8962
- [NeMo-UX] Adding megatron strategy by @marcromeyn :: PR: #8995
- Quantized checkpoint support in export and deploy modules by @janekl :: PR: #8859
- add geglu to mlp swap by @JRD971000 :: PR: #8999
- add timeout for new_group by @acphile :: PR: #8998
- Zero-shot evaluation pipeline for mcore RETRO by @huvunvidia :: PR: #8941
- Added fusion for squared relu by @sanandaraj5597 :: PR: #8963
- Developer Documents for mcore RETRO by @huvunvidia :: PR: #9026
- [NeMo-UX] Adding GPTModel & MockDataModule by @marcromeyn :: PR: #9011
- Adding unit test for mcore RETRO model by @huvunvidia :: PR: #9022
- docs and simplification of cmd args by @arendu :: PR: #8979
- [NeMo-UX] Add checkpoint-io to MegatronStrategy by @marcromeyn :: PR: #9057
- Enable Sequence Packing and Pipeline Parallel in NeVA by @yaoyu-33 :: PR: #8957
- Mingyuanm/add back fp8 support to sd by @Victor49152 :: PR: #9070
- unfused lora by @arendu :: PR: #9004
- Handle case where num_query_groups is set to null for LoRA config setup by @vysarge :: PR: #9075
- Alit/griffin by @JRD971000 :: PR: #9021
- Implement DistributedCheckpointIO by @mikolajblaz :: PR: #9016
- Video Neva Pretraining + Inference Implementation by @paul-gibbons :: PR: #9095
- HF to .nemo for Mixtral-8x22B-instruct by @akoumpa :: PR: #9060
- mcore ds updates by @dimapihtar :: PR: #8951
- Alit/griffin perf by @JRD971000 :: PR: #9107
- Add assert for max_steps to be positive in MegatronGPTSFTModel by @athitten :: PR: #9110
- Extend sequence length padding for GPT SFT to account for context parallel by @vysarge :: PR: #8869
- Update gpt dataset config parameter for mock by @thomasdhc :: PR: #9118
- Add Mcore DistributedDataParallel and distributed optimizer into Nemo by @gdengk :: PR: #9034
- Revert "Add assert for max_steps to be positive in MegatronGPTSFTMode… by @pablo-garay :: PR: #9128
- scripts to convert HF lora to nemo by @arendu :: PR: #9102
- Prevent duplicated checkpoints by @mikolajblaz :: PR: #9015
- add TN/ITN link in speech tools list by @erastorgueva-nv :: PR: #9142
- Cleanup deprecated files and temporary changes by @cuichenx :: PR: #9088
- Use DP+CP groups as the FSDP sharding domain by @erhoo82 :: PR: #9145
- CUDA memory profile by @erhoo82 :: PR: #9096
- Fix missing func for T5 model by @gdengk :: PR: #9141
- Add knob for load_directly_on_device by @mikolajblaz :: PR: #9125
- Revert rope fusion defaults by @cuichenx :: PR: #9238
- Update nemo.export module for quantized models by @janekl :: PR: #9250
- Fix circular import for MM dataprep notebook by @cuichenx :: PR: #9287
- neva media_type + text generation default fix by @paul-gibbons :: PR: #9257
- fix lora and ptuning and isort/black by @oyilmaz-nvidia :: PR: #9290
- add check if num layers is divisible by pp size by @dimapihtar :: PR: #9208
- Fix P-tuning for Llama based models by @apanteleev :: PR: #9297
- add deprecation warnings by @pablo-garay :: PR: #9266
- move pooler under post_process by @dimapihtar :: PR: #9328
- add deprecation note for nmt by @dimapihtar :: PR: #9342
- Fix incorrect checkpoint removal logic (#9192) by @mikolajblaz :: PR: #9204
- fix fp16 precision issue by @dimapihtar :: PR: #9376
- Fix module.training for Neva in FusedAttn backward which causes nan by @yaoyu-33 :: PR: #8877
Export
Changelog
- Updates for TRT-LLM 0.9 by @oyilmaz-nvidia :: PR: #8873
- Mingyuanm/sdxl export by @Victor49152 :: PR: #8926
- Avoid unpacking NeMo checkpoints before exporting to TRT-LLM by @apanteleev :: PR: #8866
- Update gemma for trt-llm 0.9 by @oyilmaz-nvidia :: PR: #8974
- TRT-LLM export P-tuning related fixes by @apanteleev :: PR: #8863
General Improvements
Changelog
- Update package info by @ericharper :: PR: #8793
- [Nemo CICD] Update mcore 4.13.24 by @pablo-garay :: PR: #8917
- Akoumparouli/low mem mixtral ckpt converter by @akoumpa :: PR: #8895
- Adding RETRO tests to Action Tests (cicd-main.yml) by @huvunvidia :: PR: #8942
- Akoumparouli/fix sd train 2 by @akoumpa :: PR: #8883
- Update te install for jenkins by @ericharper :: PR: #8954
- [Nemo CICD] Add last job depending on others for blocking check by @pablo-garay :: PR: #8959
- Minor quantization...
NVIDIA Neural Modules 1.23.0
Highlights
Models
Nvidia Starcoder 2 - 15B
- Announcement - https://developer.nvidia.com/blog/unlock-your-llm-coding-potential-with-starcoder2/
- AI Foundation Model Inference - https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-foundation/models/starcoder2-15b
- https://huggingface.co/bigcode/starcoder2-15b
NeMo Canary
Announcement - https://nvidia.github.io/NeMo/blogs/2024/2024-02-canary/
NeMo LLM
- Falcon
- Code Llama
- StarCoder
- GPT perf improvements
- Context parallelism
- Mistral
- Mixtral (without expert parallelism)
- Mcore GPT Dataset integration
NeMo MM
- CLIP
- Stable Diffusion (supporting LoRA)
- Imagen
- ControlNet (for SD)
- Instruct pix2pix (for SD)
- LLAVA
- NeVA
- DreamFusion++
- NSFW filtering
NeMo ASR
- Lhotse Dataloading support #7880
- Canary: Multi task multi lingual ASR #8242
- LongForm Audio for Diarization #7737
- Faster algorithm for RNN-T Greedy #7926
- Cache-Aware streaming notebook #8296
NeMo TTS
NeMo Vision
Known Issues
ASR
RNNT WER calculation when fused batch size > 1 during validation / test step()
Previously (in r1.22.0 and earlier), the RNNT metric was stateful while the CTC one was not, so the WER calculation in the RNNT joint worked correctly for fused operation. With the unification of metrics in r1.23.0, a bug was introduced where only the last sub-batch's scores are calculated and are not accumulated. This is patched via #8587 and will be fixed in the next release.
Workaround: Explicitly disable fused batch size during inference using the following snippet:
from omegaconf import open_dict

model = ...  # your RNNT model, e.g. loaded via ASRModel.restore_from() or from_pretrained()
decoding_cfg = model.cfg.decoding
with open_dict(decoding_cfg):
    decoding_cfg.fused_batch_size = -1  # -1 disables fused batching during decoding
model.change_decoding_strategy(decoding_cfg)
Note: This bug does not affect scores calculated via model.transcribe() (since it does not calculate metrics during inference, just text) or when using the transcribe_speech.py or speech_to_text_eval.py scripts in examples/asr.
Two unit tests fail due to a change in expected results caused by a Lhotse version update.
Container
For additional information regarding NeMo containers, please visit: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo
docker pull nvcr.io/nvidia/nemo:24.01.speech
Detailed Changelogs
ASR
Changelog
- Update link to yaml file in ASR_with_Transducers.ipynb by @Faith-Nchifor :: PR: #8014
- Use convert_hf_dataset_to_nemo by @karpnv :: PR: #8017
- Update asr_language_modeling.rst: Add a missing word by @martin0258 :: PR: #8007
- spelling mistake by @orena1 :: PR: #7903
- update asr eval by @stevehuang52 :: PR: #8045
- fix noise aug by @stevehuang52 :: PR: #8057
- Various fixes for typos and urls by @titu1994 :: PR: #8066
- [Fix] Increase length check tolerance to prevent test failing by @anteju :: PR: #8067
- Add text metrics to asr eval by @stevehuang52 :: PR: #8087
- fix device setting to allow using accelerator cpu by @orena1 :: PR: #8084
- .ctm in data simulator annotator compliant with RT-09 specification by @popcornell :: PR: #8004
- Fix AST eval by @stevehuang52 :: PR: #8112
- fix: numba.*_num_threads resets torch num_threads #8141 by @itzsimpl :: PR: #8145
- Update dependencies by @titu1994 :: PR: #8156
- NeMo + Lhotse integration by @pzelasko :: PR: #7880
- Speedup RNN-T greedy decoding by @artbataev :: PR: #7926
- [docker] Install k2 before NeMo for faster image rebuilding by @pzelasko :: PR: #8204
- [docs] Add --force_codec to tarred dataset creation examples by @pzelasko :: PR: #8227
- Temporarily use the previous RNN-T decoding algorithm as default by @artbataev :: PR: #8226
- Make TDT inference not require duration params by @hainan-xv :: PR: #8207
- Cache Aware Streaming tutorial notebook by @erastorgueva-nv :: PR: #8296
- fix path location and branch by @nithinraok :: PR: #8304
- Attention encoder-decoder models for multiple speech-to-text tasks … by @titu1994 :: PR: #8324
- Remove asr webapp by @titu1994 :: PR: #8347
- remove target at model level in aed model config [ASR] by @krishnacpuvvada :: PR: #8351
- Add change_vocabulary and save_tokenizers() support to Multitask ASR models by @titu1994 :: PR: #8357
- Change default beam size by @titu1994 :: PR: #8371
- adding jenkins test for speech_to_text_aed model by @krishnacpuvvada :: PR: #8368
- Add Finetuning tutorial with HF Datasets by @nithinraok :: PR: #8356
- wer fix by @tbartley94 :: PR: #8404
- add ensemble decoding fix by @nithinraok :: PR: #8427
- Update k2 by @artbataev :: PR: #8492
TTS
Changelog
- [TTS] Scale sampler steps by number of devices by @rlangman :: PR: #7947
- Add All Multimodal Source Code Part 2: Text to image, x to nerf by @yaoyu-33 :: PR: #7970
- [TTS] Add period discriminator and feature matching loss to codec recipe by @rlangman :: PR: #7884
- Added VectorQuantizer base class by @anteju :: PR: #8011
LLMS
Changelog
- Add interface to set NCCL options of each process group by @erhoo82 :: PR: #7923
- Support O2 training of PEFT and SFT by @cuichenx :: PR: #7971
- [NLP] Access scaler only in FP16 case by @janekl :: PR: #7916
- [NLP] Minor improvements in Llama conversion script by @janekl :: PR: #7978
- [NLP] Use helpers from utils_funcs.py in Llama conversion by @janekl :: PR: #7979
- [NLP] Remove replace_sampler_ddp (deprecated in Trainer) by @janekl :: PR: #7981
- Reworked MegatronPretrainingRandomBatchSampler to correctly handle epochs > 1 by @trias702 :: PR: #7920
- Remove deprecated arguments from TE's TransformerLayer by @jbaczek :: PR: #7917
- Add All Multimodal Source Code by @yaoyu-33 :: PR: #7791
- First draft of mcore bert model in NeMo by @shanmugamr1992 :: PR: #7814
- Support Falcon Variants (7B/40B/180B) in Mcore NeMo by @xuanzic :: PR: #7666
- FSDP + Tensor Parallelism by @erhoo82 :: PR: #7897
- Packed Sequence by @cuichenx :: PR: #7945
- Adding method back that was removed accidentally by @ericharper :: PR: #8038
- [NLP] ArtifactItem with init=True to make it debuggable by @janekl :: PR: #7980
- SFT patch: (1) enable sequence parallelism and (2) enable profile by @erhoo82 :: PR: #7963
- migration to PTL 2.0 for spellmapper model by @bene-ges :: PR: #7924
- Change the megatron config lr scheduler default and fix to change partitions script by @shan18 :: PR: #8094
- (1) Add SHARP interface to M-CORE, (2) use send/recv to send train loss to the first rank instead of b-cast by @erhoo82 :: PR: #7793
- Reconfigure limit_val_batches only for int by @athitten :: PR: #8099
- Fixing wrapper and moving it to base class by @shanmugamr1992 :: PR: #8055
- fix gated_linear_unit bug by @Agoniii :: PR: #8042
- Fix Adapter for MCore models by @cuichenx :: PR: #8124
- add war fix for sync issues by @gshennvm :: PR: #8130
- Improve PEFT UX by @cuichenx :: PR: #8131
- Enhance flexibility by passing callbacks as method argument by @michal2409 :: PR: #8015
- context parallelism by @xrennvidia :: PR: #7739
- Make pipelined TP comm overlap available with mcore by @erhoo82 :: PR: #8005
- remove deprecated scripts by @arendu :: PR: #8138
- adding OnlineSampleMapping by @arendu :: PR: #8137
- Add distopt support for FP8 params and BF16 optimizer state by @timmoon10 :: PR: #7909
- Revert adding OnlineSampleMapping by @pablo-garay :: PR: #8164
- Token count and sequence length logging for MegatronGPTSFTModel by @vysarge :: PR: #8136
- Use latest apex internal API by @jbaczek :: PR: #8129
- tune specific params in the base model by @arendu :: PR: #7745
- Virtual pipeline parallel support for MegatronGPTSFTModel by @vysarge :: PR: #7964
- removed deprecated peft model by @arendu :: PR: #8183
- remove more deprecated files by @arendu :: PR: #8169
- Pre-generate cu_seqlens argmin and max_seqlen to remove host-to-device sync by @erhoo82 :: PR: #8108
- Add the interface to use SHARP to FSDP strategy by @erhoo82 :: PR: #8202
- Multimodal required NLP base model changes by @yaoyu-33 :: PR: #8188
- [NLP] Improve and unify loading state_dict for community models by @janekl :: PR: #7977
- Rename Finetuning Scripts by @cuichenx :: PR: #8201
- Final multimodal PR with our recent developments on MM side by @yaoyu-33 :: PR: #8127
- Add include_text parameter to SFT dataloaders by @Kipok :: PR: #8198
- Add random_seed argument to generate by @Kipok :: PR: #8162
- Added support for neptune logger by @harishankar-gopalan :: PR: #8210
- Pre-compute max_seqlen and cu_seqlens_argmin in all model-parallel cases by @erhoo82 :: PR: #8222
- Use PackedSeqParams in accordance with changes in Megatron-LM by @cuichenx :: PR: #8205
- Fix to peft & virtual pipeline parallel unsupported check by @vysarge :: PR: #8216
- Fixed the tp overlap switch by @sanandaraj5597 :: PR: #8195
- add knobs for rope/swiglu fusion by @lhb8125 :: PR: #8184
- Added sample cpu_offloading switch to YAML by @sanandaraj5597 :: PR: #8148
- Syncing random seed between ranks in generate by @Kipok :: PR: #8230
- add first_val_step to mcore scheduler by @JimmyZhang12 :: PR: #8150
- Correct padding for SFT input data to account for sequence parallel + TE's fp8 op dimension requirements by @vysarge :: PR: #8240
- Mistral 7b conversion script by @akoumpa :: PR: #8052
- switch to mcore dataset [with FIM support] by @dimapihtar :: PR: #8149
- Mixtral to NeMo conversion script. by @akoumpa :: PR: #8155
- fixes to accomendate mcore changes by @HuiyingLi :: PR: #8261
- Allow MegatronPretrainingRandomSample...
NVIDIA Neural Modules 1.22.0
Highlights
Models
NeMo Parakeet
Announcement - https://nvidia.github.io/NeMo/blogs/2024/2024-01-parakeet/
- https://huggingface.co/nvidia/parakeet-rnnt-1.1b
- https://huggingface.co/nvidia/parakeet-ctc-1.1b
- https://huggingface.co/nvidia/parakeet-rnnt-0.6b
- https://huggingface.co/nvidia/parakeet-ctc-0.6b
NeMo Parakeet-TDT
Announcement - https://nvidia.github.io/NeMo/blogs/2024/2024-01-parakeet-tdt/
ASR
- stt_en_fastconformer_transducer_large_ls #7641
- stt_en_fastconformer_ctc_large_ls #7641
- stt_en_fastconformer_hybrid_large_streaming_multi
- stt_nl_fastconformer_hybrid_large_pc
- stt_fa_fastconformer_hybrid_large
NeMo ASR
- Multi-lookahead cache-aware streaming Conformer #6711
- Automatic Lip Reading Recognition (ALR) - ASR/CV (Visual ASR) by @burchim #7330
- Speech enhancement tutorial #6492
- Support punctuation error rate #7538
Container
For additional information regarding NeMo containers, please visit: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo
docker pull nvcr.io/nvidia/nemo:23.10
Detailed Changelogs
ASR
Changelog
- Fix missing pip package 'einops' by @RobinDong :: PR: #7397
- Fix failure of installing pyaudio in Online_Offline_Speech_Commands_Demo.ipynb by @RobinDong :: PR: #7396
- [ASR] Confidence measure -> method renames by @GNroy :: PR: #7434
- RNN-T confidence and alignment bugfix by @GNroy :: PR: #7381
- Automatic Lip Reading Recognition (ALR) - ASR/CV (Visual ASR) by @burchim :: PR: #7330
- [TTS] Read audio as int32 to avoid flac read errors by @rlangman :: PR: #7477
- Fix typos in confidence tutorial notebooks by @Kipok :: PR: #7581
- Safeguard nemo_text_processing installation on ARM by @blisc :: PR: #7485
- add fc large ls models by @nithinraok :: PR: #7641
- [ASR] RNN-T greedy decoding max_frames fix for alignment and confidence by @GNroy :: PR: #7635
- Create per.py by @ssh-meister :: PR: #7538
- Update docs: readme, getting started, ASR intro by @erastorgueva-nv :: PR: #7679
- [ASR] Multichannel mask estimator with flex number of channels by @anteju :: PR: #7317
- Fix code block typo in docs by @erastorgueva-nv :: PR: #7717
- Replace gpus with devices by @athitten :: PR: #7743
- docs: fix typos by @shuoer86 :: PR: #7758
- Snake act by @nithinraok :: PR: #7736
- fix(clustering_diarizer.py): fix typo by @jqueguiner :: PR: #7772
- Add some docs and update scripts for ASR by @titu1994 :: PR: #7790
- remove TN from ctc_segm tut by @ekmb :: PR: #7807
- Add support for finetuning with huggingface datasets by @stevehuang52 :: PR: #7834
- Adding long-form audio speaker diarization (clustering) class and functions by @tango4j :: PR: #7737
- Fix k2 installation: update for latest PyTorch, move script to dir by @artbataev :: PR: #7887
- [ASR] GSS-based mask estimator by @anteju :: PR: #7849
- add Dutch P&C FC model info by @zhehuaichen :: PR: #7892
- Add checks for unit tests that are looking for data from CI machine by @ericharper :: PR: #7943
- update branch name by @nithinraok :: PR: #7990
- fix librosa display issue by @nithinraok :: PR: #7991
- Fixes Notebooks for ASR by @titu1994 :: PR: #7994
- cherry pick bug 4405781 by @karpnv :: PR: #8044
- fix noise augmentation by @stevehuang52 :: PR: #8056
- Fix various issues with broken links and bugs by @titu1994 :: PR: #8064
- run with non-dev option by @nithinraok :: PR: #8077
- update broken links by @nithinraok :: PR: #8079
- langid bug fix by @karpnv :: PR: #8134
TTS
Changelog
- Add steps for document of getting dataset 'SF Bilingual Speech' by @RobinDong :: PR: #7378
- Fix checking of cuda/cpu device for inputs of Decoder by @RobinDong :: PR: #7444
- Fix failure of ljspeech's get_data.py by @RobinDong :: PR: #7430
- [TTS] Fix audio codec type checks by @rlangman :: PR: #7373
- [TTS] Add dataset to path of logged artifacts by @rlangman :: PR: #7462
- Fix adding positional embeddings in-place in FFTransformerDecoder by @The0nix :: PR: #7440
- Add dataset 'AISHELL-3' from OpenSLR for training mandarin TTS by @RobinDong :: PR: #7409
- [TTS] Fix FastPitch data prep tutorial by @rlangman :: PR: #7524
- add italian tokenization by @GiacomoLeoneMaria :: PR: #7486
- Remap speakers to continuous range of speaker_id for dataset AISHELL3 by @RobinDong :: PR: #7536
- add ItalianPhonemesTokenizer by @GiacomoLeoneMaria :: PR: #7587
- [TTS] Add STFT and SI-SDR loss to audio codec recipe by @rlangman :: PR: #7468
- Fix typo in audio codec config, encoder target by @anteju :: PR: #7697
- Group-residual vector quantizer by @anteju :: PR: #7643
- French g2p with pronunciation dictionary by @mgrafu :: PR: #7601
- add pleasefixme marker for potential failed nightly tests. by @XuesongYang :: PR: #7678
- Add new text segmentation library for better TTS quality by @RobinDong :: PR: #7645
- ConditionalInput: cat along the feature dim, not the batch dim by @anferico :: PR: #7785
- Add selection criteria for reference audios in the submodule by @anferico :: PR: #7788
- [Codec] Update codec checkpoint config by @anteju :: PR: #7835
- [Codec] Finite scalar quantizer by @anteju :: PR: #7886
- Tar codec by @nithinraok :: PR: #7867
LLM
Changelog
- Allow disabling sanity checking when num_sanity_val_steps=0 by @athitten :: PR: #7413
- Add comprehensive error messages by @PeganovAnton :: PR: #7261
- layer selection for ia3 by @arendu :: PR: #7417
- Add rope dynamic linear scaling by @hsiehjackson :: PR: #7437
- Fix sft dataset truncation by @hsiehjackson :: PR: #7464
- fix bug when loading dist ckpt in peft by @lhb8125 :: PR: #7452
- Fix sft chat dataset truncation by @hsiehjackson :: PR: #7478
- SFT model parallel fix for dist ckpt by @aklife97 :: PR: #7511
- remove auto generated examples by @arendu :: PR: #7510
- Add the argument to by @odelalleau :: PR: #7264
- PEFT GPT & T5 Refactor by @meatybobby :: PR: #7308
- fix a typo by @BestJuly :: PR: #7496
- StarCoder SFT test + bump PyT NGC image to 23.09 by @janekl :: PR: #7540
- fix llama2 70b lora tuning bug by @cuichenx :: PR: #7622
- generalized chat sft prompt by @yidong72 :: PR: #7655
- Set base frequency from config by @shan18 :: PR: #7734
- Megatron LLM documentation updates by @ssh-meister :: PR: #7400
- Remove incorrect extra argument of load_from_checkpoint_dir() by @RobinDong :: PR: #7500
- Add nemo to mcore GPT conversion script by @cuichenx :: PR: #7730
- set context for text memmap to fork by @arendu :: PR: #7784
- Support flash decoding by @hsiehjackson :: PR: #7744
- update text server to support compute logprobs by @Zhilin123 :: PR: #7733
- Revert PEFT eval fix by @ericharper :: PR: #7693
- Fix tn duplex by @ekmb :: PR: #7808
- Multimodal merge by @yaoyu-33 :: PR: #7728
- Fix flash decoding precision by @hsiehjackson :: PR: #7852
- Removing duplicate Megatron-LM installation by @Davood-M :: PR: #7864
- adding special_tokens from tokenizer config for transformer-lm model by @clumsy :: PR: #7613
- Add Adapter and IA3 support for MCore models by @cuichenx :: PR: #7750
- Add back import guard by @cuichenx :: PR: #7882
- Change FP8 Defaults by @cuichenx :: PR: #7894
- Added knob for ub_tp_comm_overlap for the MCORE pass by @sanandaraj5597 :: PR: #7902
- Upgrade NeMo to latest mcore and TE by @dimapihtar :: PR: #7862
- Pad sequences to multiples of 16 for GPTSFTDataset by @vysarge :: PR: #7904
- upgrade to latest mcore and TE by @dimapihtar :: PR: #7908
- added missing torch import by @Davood-M :: PR: #7913
- Fix CPU initialization of GPT models by @cuichenx :: PR: #7889
- Fix pinned triton version by @hsiehjackson :: PR: #7925
- fix tp_overlap config var name by @xrennvidia :: PR: #7928
- only enable query key scaling during fp16 by @gshennvm :: PR: #7946
- Fix for gpt3 eval hang with PP (a dtype issue) by @yaoyu-33 :: PR: #7927
- Pass in rotary_base to mcore and from HF by @Kipok :: PR: #7933
- Use NLPDDPStrategyNotebook in Multitask_Prompt_and_PTuning.ipynb by @athitten :: PR: #8061
General Improvements
Changelog
- Add fix for max time to quit trainer gracefully, without running validation by @SeanNaren :: PR: #7731
- SDE Tutorial minor fix by @Jorjeous :: PR: #7598
- Temporary pin Lightning-Utilities version due to broken NamedTuple by @artbataev :: PR: #8022
- Karpnv/issue 7320 by @karpnv :: PR: #7418
- Speech Simulator, update README.md: output_path --> output_manifest_filepath by @popcornell :: PR: #7442
- Fix None dataloader issue in PTL2.0 by @KunalDhawan :: PR: #7455
- HF StarCoder to NeMo conversion script by @janekl :: PR: #7421
- [doc] fix broken link by @stas00 :: PR: #7481
- dllogger - log on rank 0 only by @stas00 :: PR: #7513
- Add two youtube introductory videos to README and Docs. by @XuesongYang :: PR: #7570
- defaults changed by @arendu :: PR: #7600
- Bound transformers version in requirements by @athitten :: PR: #7620
- Fix import error no module name model_utils by @menon92 :: PR: #7629
- Fix in the confidence ensemble test by @Kipok :: PR: #7682
- move core install to /workspace by @aklife97 :: PR: #7706
- distributed checkpoint average script by @yidong72 :: PR: #7721
- fix hybrid eval by @karpnv :: PR: #7757
- fix(diarization-README): typo by @jqueguiner :: PR: #7771
- Configure MCore logger by @mikolajblaz :: PR: #7781
- Nemo to HF converter for LLaMA model by @uppalutkarsh :: PR: #7770
- [Fix] Save best NeMo model only when necessary by @anteju :: PR: #7836
- add guard if its a distributed checkpoint b...
NVIDIA Neural Modules 1.21.0
Highlights
Models
NeMo ASR
- Multi-lookahead cache-aware streaming
- Speech enhancement tutorial #6492
- Online code switching dataset #6579
NeMo TTS
- AudioCodec: Training recipe for EnCodec #6852
NeMo Framework
NeMo Core
- Update to PTL 2.0 #6433
NeMo Tools
- Forced aligner tutorial #7210
Container
For additional information regarding NeMo containers, please visit: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo
docker pull nvcr.io/nvidia/nemo:23.08
ASR
Changelog
- Fix require_grad typos by @kit1980 :: PR: #6930
- rnnt_greedy_decoding.py: typos? auto-repressively -> auto-regressively by @vadimkantorov :: PR: #6989
- Adding tutorial for confidence ensembles by @Kipok :: PR: #6932
- Add support for Numba FP16 RNNT Loss by @titu1994 :: PR: #6991
- fix install_beamsearch_decoders by @karpnv :: PR: #7011
- rnnt and char utils by @karpnv :: PR: #6971
- ASR Confidence update and tutorial by @GNroy :: PR: #6810
- st standalone model by @AlexGrinch :: PR: #6969
- Fix typo in ASR-TTS tutorial by @artbataev :: PR: #7049
- Update Frame-VAD doc and fix onnx export by @stevehuang52 :: PR: #7076
- Fast Conformer global token fix by @sam1373 :: PR: #7085
- Added script to extract ASR CTC and RNNT models from ASR hybrid models by @trias702 :: PR: #7092
- Fix absolute path in path join call by @kingjan1999 :: PR: #7099
- NeMo ASR Demo by @lleaver :: PR: #7110
- Fix plot function in vad_utils.py by @stevehuang52 :: PR: #7113
- Fixed small bug with NoisePerturbationWithNormalization by @trias702 :: PR: #7118
- Merge release r1.20.0 to main by @ericharper :: PR: #7167
- minor fix for conformer subsampling docstring. by @XuesongYang :: PR: #7195
- [ASR] Fix GPU memory leak in transcribe_speech.py by @rlangman :: PR: #7249
- Adding Multilingual, Code-Switched, and Hybrid ASR models by @KunalDhawan :: PR: #7250
- fix partial transcribe by @stevehuang52 :: PR: #7284
- Conv1d subsampling by @burchim :: PR: #7294
- add bf16 inference support and fix seq_len stft issue by @nithinraok :: PR: #7338
- Add finetuning scripts by @nithinraok :: PR: #7263
- Move parameter: trainer -> exp_manager (for PTL 2.0) by @artbataev :: PR: #7339
- Fix typos by @omahs :: PR: #7361
- Fix wrong calling of librosa.get_duration() in notebook by @RobinDong :: PR: #7376
- RNN-T confidence and alignment bugfix (#7381) by @GNroy :: PR: #7459
- update branch by @nithinraok :: PR: #7488
- Replace strategy = None with strategy = auto for notebooks by @athitten :: PR: #7521
- Fix PTL2.0 related ASR bugs in r1.21.0: Val metrics logging, None dataloader issue by @KunalDhawan :: PR: #7531
- gpus -> devices by @nithinraok :: PR: #7542
- [BugFix] Add missing quotes for auto strategy in tutorial notebooks by @athitten :: PR: #7541
- Append output of val_step to self.validation_step_outputs in EncMaskDecAudioToAudioModel by @athitten :: PR: #7543
- fix validation_step_outputs initialization for multi-dataloader by @KunalDhawan :: PR: #7546
- Append val/test output to instance variable in EncDecSpeakerLabelModel by @athitten :: PR: #7562
- update strategy by @nithinraok :: PR: #7577
- Typo fixes by @Kipok :: PR: #7591
- Fix metrics for SE tutorial by @anteju :: PR: #7604
- fix ssl models ptl monitor val through logging by @nithinraok :: PR: #7608
- Fix py3.11 dataclasses issue by @titu1994 :: PR: #7582
- bugfix: trainer.gpus, trainer.strategy, trainer.accelerator by @XuesongYang :: PR: #7621
- Safeguard nemo_text_processing installation on ARM (#7485) by @blisc :: PR: #7619
- [ASR] Fix type error in jasper by @rlangman :: PR: #7636
- Fix vad & speech command tutorial - onnx by @fayejf :: PR: #7671
- Replace strategy='dp'/None with 'auto' by @athitten :: PR: #7681
- Fix multi rank finetune for ASR by @titu1994 :: PR: #7684
- fix ptl_bugs in slu_models.py by @jzi040941 :: PR: #7689
- Add NLPDDPStrategyNotebook and change trainer gpus to devices by @athitten :: PR: #7741
- Updated installation of ctc-decoders by @vsl9 :: PR: #7746
- Fix bug wrt change decoding strategy for bpe models by @titu1994 :: PR: #7762
TTS
Changelog
- [TTS] Add cosine distance option to TTS aligner by @rlangman :: PR: #6806
- [TTS] Add tutorial for TTS data prep scripts by @rlangman :: PR: #6922
- update TTS readme by @XuesongYang :: PR: #7088
- [TTS] Create EnCodec training recipe by @rlangman :: PR: #6852
- [TTS][ZH] add Chinese TTS recipes based on IPA symbol sets. by @XuesongYang :: PR: #6893
- [TTS] Add output audio format to preprocessing by @rlangman :: PR: #6889
- [TTS] Remove nested TTS configs by @rlangman :: PR: #7154
- [TTS] Fix TTS recipes with PTL 2.0 by @rlangman :: PR: #7188
- [TTS] Add license to ported EnCodec code by @rlangman :: PR: #7197
- [Fix] Discriminator update in AudioCodecModel by @anteju :: PR: #7209
- Adapter ipa Tutorial and config update by @styagi130 :: PR: #7260
- [TTS] Audio codec fixes by @rlangman :: PR: #7266
- [TTS] minor fix typos and input_types by @XuesongYang :: PR: #7272
- specify explicitly to set pretrained model paths by @styagi130 :: PR: #7305
- [TTS] Update AudioCodec API by @anteju :: PR: #7310
- [TTS] Add additional config to preprocess_text and compute_feature_stats by @rlangman :: PR: #7321
- [TTS] Change audio codec token type to TokenIndex by @rlangman :: PR: #7356
- fixed trainer.strategy=auto from None. by @XuesongYang :: PR: #7369
- [TTS] Added a callback for logging initial data by @anteju :: PR: #7384
- [TTS] bugfix: trainer.accelerator=auto from None. by @XuesongYang :: PR: #7492
- bugfix: specify trainer.strategy=auto when devices=1 by @XuesongYang :: PR: #7509
- Fix dimensionality in get_dist function by @redoctopus :: PR: #7506
- Fix TTS FastPitch tutorial by @hsiehjackson :: PR: #7494
- [TTS] remove curly braces from in jupyer notebook cell. by @XuesongYang :: PR: #7554
- [TTS] fixed trainer's accelerator and strategy. by @XuesongYang :: PR: #7569
- Change hifigan finetune strategy to ddp_find_unused_parameters_true by @hsiehjackson :: PR: #7579
- Fix validation in G2PModel and ThutmoseTaggerModel by @athitten :: PR: #7597
- [TTS] Fix FastPitch data prep tutorial by @rlangman :: PR: #7602
- [TTS] Add dataset to path of logged artifacts by @rlangman :: PR: #7651
NLP / NMT
Changelog
- Minor MPT-7B fixes and creation script update by @trias702 :: PR: #6982
- remove hard coded input and output fields by @arendu :: PR: #7008
- RoPE length extrapolation with interpolation by @MaximumEntropy :: PR: #7005
- add async + distopt to sft by @MaximumEntropy :: PR: #7018
- ptuning inference table bug fix by @arendu :: PR: #7015
- Fix missing import for GPT SFT by @MaximumEntropy :: PR: #7026
- Add end_strings to SamplingParams by @markelsanz14 :: PR: #6986
- Fix race condition for downloading cache when executing with multi-node by @findkim :: PR: #7016
- added back the retro documents. by @yidong72 :: PR: #7033
- remove pos emb from state dict for old models by @ekmb :: PR: #7068
- memmap worker arg by @arendu :: PR: #7062
- Disable distopt contiguous param buffer by default by @timmoon10 :: PR: #7095
- [Fix] load_state_dict in nlp_model.py by @stevehuang52 :: PR: #7086
- Fix tokenizer file caching where torch.distributed may not be initialized yet by @findkim :: PR: #7061
- freeze base mode on init during peft by @arendu :: PR: #7152
- Include the scripts for preprocessing OAST and unit tests for chat sft datasets by @yidong72 :: PR: #7112
- T5 metrics fix by @jubick1337 :: PR: #7037
- megatron gpt training fix by @anmolgupt :: PR: #7199
- Fix T5 using FA by @hsiehjackson :: PR: #7196
- fix-causal-fa-infer by @hsiehjackson :: PR: #7200
- Fix gpt trainer test by @hsiehjackson :: PR: #6915
- Load ub_cfg from hydra config by @jbaczek :: PR: #7003
- Fixes for lightning 2.0 upgrade by @athitten :: PR: #7176
- Fix which was off by one batch by @odelalleau :: PR: #7212
- Start using ModelParallelConfig from Megatron Core by @ericharper :: PR: #6885
- deprecation warning by @arendu :: PR: #7193
- Fix attention mask inference by @hsiehjackson :: PR: #7213
- Use GPTModel from mcore by @ericharper :: PR: #7093
- Add bf16-mixed and 16-mixed in module.py by @athitten :: PR: #7227
- Refactor LLM pretraining examples by @maanug-nv :: PR: #7159
- Add only trainable parameters to optimizer group in PEFT by @guyueh1 :: PR: #7230
- Dummy class for ModelParallelConfig by @ericharper :: PR: #7254
- [TN][Docs] update language coverage matrix and refs by @mgrafu :: PR: #7247
- tied weights for adapters by @arendu :: PR: #6928
- Fix skip generation by @hsiehjackson :: PR: #7270
- Hidden transforms model parallel config + CI with Perceiver by @michalivne :: PR: #7241
- Fix restore sequence parallel by @hsiehjackson :: PR: #7273
- fix ptuning and lora model_parallel_config by @blahBlahhhJ :: PR: #7287
- Fix adapters and ptuning for amp O2 by @guyueh1 :: PR: #7285
- remove additional line in peft state dict by @blahBlahhhJ :: PR: #7293
- loss mask aware final layer applicaiton by @arendu :: PR: #7275
- Adding server option to peft eval by @Davood-M :: PR: #7292
- migrated class CSVFieldsMemmapDataset from BioNeMo by @dorotat-nv :: PR: #7314
- remove old prompt table for storing cached ptunig representations by @arendu :: PR: #7295
- Bugfix and optimization in by @odelalleau :: PR: #7267
- Set a default value when getting by @yaox12 :: PR: #7115
- Distributed checkpointing with mcore GPT by @ericharper :: PR: #7116
- Fix activation checkpoint by @hsiehjackson :: PR: #7334
- Replace prefetch with val iterator check in megatron models by @athitten :: PR: #7318
- Fixing indentation bug in indexed_dataset memory d...
NVIDIA Neural Modules 1.20.0
Highlights
Models
- STT En Fast Conformer CTC XXLarge - 1.2 B param Fast Conformer CTC
- STT En Fast Conformer Transducer XXLarge - 1.2 B param Fast Conformer Transducer
- STT En Fast Conformer Transducer XLarge - XLarge Fast Conformer English
- STT En Fast Conformer CTC XLarge - XLarge Fast Conformer CTC
- STT En Fast Conformer Transducer XLarge - XLarge Fast Conformer Transducer
- STT En Fast Conformer CTC Large - Large Fast Conformer CTC
- STT En Fast Conformer Transducer Large - Large Fast Conformer Transducer
- STT It Fast Conformer Hybrid Large P&C - Large P&C Italian Fast Conformer
- STT Ua Fast Conformer Hybrid Large P&C - Large Ukrainian Fast Conformer
NeMo ASR
- Graph-RNN-T #6168
- WildCard-RNN-T #6168
- Confidence Ensembles for ASR
- Token-and-Duration Transducer (TDT) #6536
- Spellchecking ASR #6179
- Numba FP16 RNNT Loss #6991
NeMo TTS
- TTS Adapter Customization
- TTS Dataloader Framework
NeMo Framework
- LoRA for T5 and mT5 #6612
- Flash Attention integration #6666
- Mosaic 7B compatibility
- Models with LongContext (32K) #6666, #6687, #6773
NeMo Tools
- Speech Data Explorer: Utterance-level ASR model comparison #6669
- Speech Data Processor: Spanish P&C
- NeMo Forced Aligner: Large sequence alignment + memory reduction #6695
Container
For additional information regarding NeMo containers, please visit: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo
docker pull nvcr.io/nvidia/nemo:23.06
Detailed Changelogs
ASR
Changelog
- [ASR] Adding ssl config for fast-conformer by @krishnacpuvvada :: PR: #6672
- Fix for interctc test random failure by @Kipok :: PR: #6644
- sharded manifests docs by @bmwshop :: PR: #6751
- [TTS] Implement new vocoder dataset by @rlangman :: PR: #6670
- TDT model pull request by @hainan-xv :: PR: #6536
- Spec aug fix by @tbartley94 :: PR: #6775
- Support large inputs to Conformer and Fast Conformer by @bmwshop :: PR: #6556
- sharded manifests updated docs by @bmwshop :: PR: #6833
- added fc-xl, xxl and titanet-s models by @nithinraok :: PR: #6832
- Multi-lookahead cache-aware streaming models by @VahidooX :: PR: #6711
- Update transcribe_utils.py by @stevehuang52 :: PR: #6865
- Fix k2 build topo helper by @artbataev :: PR: #6887
- Fix transcribe_utils.py for hybrid models in partial transcribe mode by @stevehuang52 :: PR: #6899
- Add hybrid model support to transcribe_speech_parallel.py by @stevehuang52 :: PR: #6906
- Update Frame-VAD doc by @stevehuang52 :: PR: #6902
- Make sure asr_model.change_attention_model is run if either cfg.model_path or cfg.pretrained_name is specified by @erastorgueva-nv :: PR: #6908
- Update fvad doc by @stevehuang52 :: PR: #6920
- Online Code Switching Dataset for ASR by @trias702 :: PR: #6579
- Fix AN4 dataset links by @artbataev :: PR: #6926
- Fix confidence ensembles RNNT logprobs selection logic for exclude_blank scenario by @KunalDhawan :: PR: #6937
- Adding cache-aware streaming ASR checkpoints. by @VahidooX :: PR: #6940
- Remove from metrics by @titu1994 :: PR: #6979
- Hybrid conformer export by @borisfom :: PR: #6983
- Cache handling without input tensors mutation by @borisfom :: PR: #6980
- Fixing an issue with confidence ensembles by @Kipok :: PR: #6987
- Add ASR with TTS Tutorial. Fix enhancer usage. by @artbataev :: PR: #6955
- fix install_beamsearch_decoders.sh by @karpnv :: PR: #7019
- Add support for Numba FP16 RNNT Loss (#6991) by @titu1994 :: PR: #7038
- Fix typo and branch in tutorial by @artbataev :: PR: #7048
- Refined export_config by @borisfom :: PR: #7053
- Fix documentation for Numba by @titu1994 :: PR: #7065
- Adding docs and models for multiple lookahead cache-aware ASR by @VahidooX :: PR: #7067
- Add updated fc ctc and rnnt xxl models by @nithinraok :: PR: #7128
- Update notebook branch by @ericharper :: PR: #7135
- Fixed main and merging this to r1.20 by @tango4j :: PR: #7127
- Fix default context size by @nithinraok :: PR: #7141
- Fix incorrect embedding grads with distopt BF16 grad reductions by @timmoon10 :: PR: #6958
TTS
Changelog
- [TTS] Add callback for saving audio during FastPitch training by @rlangman :: PR: #6665
- [TTS] Add script for text preprocessing by @rlangman :: PR: #6541
- [TTS] Fix adapter duration issue by @hsiehjackson :: PR: #6697
- [TTS] Filter out silent audio files during preprocessing by @rlangman :: PR: #6716
- [TTS] fix inconsistent type hints for IpaG2p by @XuesongYang :: PR: #6733
- [TTS] relax hardcoded prefix for phonemes and tones and infer phoneme set through dict by @XuesongYang :: PR: #6735
- [TTS] corrected misleading deprecation warnings. by @XuesongYang :: PR: #6702
- Fix TTS adapter tutorial by @hsiehjackson :: PR: #6741
- [TTS][zh] refine hardcoded lowercase for ASCII letters. by @XuesongYang :: PR: #6781
- [TTS] Append pretrained FastPitch & SpectrogamEnhancer pair to available models by @racoiaws :: PR: #7012
NLP / NMT
Changelog
- minor fix for missing chat attr by @arendu :: PR: #6671
- eval fix by @arendu :: PR: #6685
- VP Fixes for converter + Config management by @titu1994 :: PR: #6698
- lora notebook by @arendu :: PR: #6765
- peft eval directly from ckpt by @arendu :: PR: #6785
- GPT inference long context by @ekmb :: PR: #6687
- Fix validation with drop_last=False by @mikolajblaz :: PR: #6704
- fix spellmapper tutorial, change branch to main by @bene-ges :: PR: #6803
- text_generation_utils memory reduction if no logprob needed by @yzhang123 :: PR: #6773
- Add optional index mapping dir in mmap text datasets by @gheinrich :: PR: #6683
- Add inference kv cache support for transformer TE path by @yen-shi :: PR: #6627
- add reference to our paper by @bene-ges :: PR: #6821
- added changes to ramp up bs by @dimapihtar :: PR: #6799
- t5 lora tuning by @arendu :: PR: #6612
- Added rouge monitoring support for T5 by @jubick1337 :: PR: #6737
- GPT extrapolatable position embedding (xpos/sandwich/alibi/kerple) and Flash Attention by @hsiehjackson :: PR: #6666
- Import Enum for chatbot component by @ericharper :: PR: #6877
- typo fix from #6666 by @arendu :: PR: #6882
- removed unnecessary print by @dimapihtar :: PR: #6884
- Fix destructor for delayed mmap dataset case by @mikolajblaz :: PR: #6703
- Make Gradio library optional by @yidong72 :: PR: #6904
- Fix fast-glu activation in change partitions by @hsiehjackson :: PR: #6909
- Documentation for ONNX export of Megatron Models by @asfiyab-nvidia :: PR: #6914
- FixTextMemMapDataset index file creation in multi-node setup by @gheinrich :: PR: #6768
- Fix flash-attention by @hsiehjackson :: PR: #6901
- ptuning oom fix by @arendu :: PR: #6916
- add rampup bs assertion by @dimapihtar :: PR: #6927
- Enable methods in bert-like models by @sararb :: PR: #6898
- support value attribution condition by @yidong72 :: PR: #6934
- Add missing save restore connector to eval scripts by @titu1994 :: PR: #6935
- Merge release r1.19.0 into main by @ericharper :: PR: #6948
- Stop at the stop token by @yidong72 :: PR: #6957
- fixes for spellmapper by @bene-ges :: PR: #6994
- Fix tabular data text generation by @yidong72 :: PR: #7022
- fix pos id - hf update by @ekmb :: PR: #7075
- fix syntax error introduced in PR-7079 by @bene-ges :: PR: #7102
NeMo Tools
Bugfixes
Changelog
- small Bugfix by @fayejf :: PR: #7079
- Fix caching bug in causal convolutions for cache-aware ASR models by @VahidooX :: PR: #7034
- Fix masking bug for TTS Aligner by @redoctopus :: PR: #6677
- [bugfix] avoid the random shuffle of phoneme and tone tokens. by @XuesongYang :: PR: #6855
- fix ptuning residuals bug by @arendu :: PR: #6866
- TE bug fix by @dimapihtar :: PR: #7027
- Update distopt API for coalesced NCCL calls by @timmoon10 :: PR: #6886
General Improvements
Changelog
- update batch size recommendation to min 32 for 43b by @Zhilin123 :: PR: #6675
- Make Note usage consistent in adapter_mixins.py by @BrianMcBrayer :: PR: #6678
- Update all invalid tree references to blobs for NeMo samples by @BrianMcBrayer :: PR: #6679
- Update README.rst about container by @fayejf :: PR: #6686
- karpnv/issues6690 by @karpnv :: PR: #6705
- Limit codeql scope by @titu1994 :: PR: #6710
- Not pinning Gradio version by @yidong72 :: PR: #6680
- preprocess squad in sft format by @arendu :: PR: #6727
- Fix Codeql config by @titu1994 :: PR: #6731
- Fix fastpitch test nightly by @hsiehjackson :: PR: #6730
- Lora/PEFT training script CI test by @arendu :: PR: #6664
- fixed decor to show messages only when the wrapped object is called. by @XuesongYang :: PR: #6793
- lora pp2 by @arendu :: PR: #6818
- Upperbound Numpy to < 1.24 by @titu1994 :: PR: #6829
- Fix typo in documentation by @Dounx :: PR: #6838
- NFA updates by @erastorgueva-nv :: PR: #6695
- Update container for import action by @ericharper :: PR: #6883
- removed some tests by @arendu :: PR: #6900
- Update contai...