Important
NeMo Lab is under active development
NeMo Lab is an example template for Generative AI with NVIDIA NeMo 2.0.
NVIDIA NeMo is an accelerated, end-to-end platform that is flexible and production ready. NeMo comprises several component frameworks that enable teams to build, customize, and deploy Generative AI solutions for:
- large language models
- vision language models
- video models
- speech models
NeMo Lab is inspired by NeMo tutorials and focuses on using NeMo to train, tune, and serve language models.
Data processing is task-dependent: pretraining and finetuning use different datasets. For pretraining, we will use Hugging Face's Cosmopedia dataset; for finetuning, we will use NeMo's default SquadDataModule, which wraps the Stanford Question Answering Dataset (SQuAD).
Note
Refer to the data processing tutorial for a detailed walk-through
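For a quick look at the pretraining data, the Cosmopedia corpus can be streamed directly from the Hugging Face Hub. The snippet below is a minimal sketch; the subset name ("stories") and the "text" field are assumptions for illustration and are not part of NeMo Lab's data pipeline.

```python
# Minimal sketch: preview the Cosmopedia pretraining corpus via Hugging Face datasets.
# The subset name ("stories") and the "text" field are assumptions for illustration.
from datasets import load_dataset

# Stream the dataset so the full corpus is not downloaded locally
stream = load_dataset("HuggingFaceTB/cosmopedia", "stories", split="train", streaming=True)

for i, sample in enumerate(stream):
    print(sample["text"][:200])  # print the first 200 characters of each document
    if i == 2:
        break
```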
We will use NeMo to train Nemotron 3 4B on the Cosmopedia dataset and tune a Llama variant on the SQuAD dataset.
Note
Refer to the model development tutorial for a detailed walk-through
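As a rough sketch of what the training entry point looks like, NeMo 2.0 exposes prebuilt recipes that can be launched with nemo_run. The recipe name, arguments, and executor below are assumptions that may differ across NeMo releases; see the quickstart scripts for the exact configuration used in NeMo Lab.

```python
# Hypothetical sketch: launch a Nemotron 3 4B pretraining recipe with NeMo 2.0 and nemo_run.
# Recipe and argument names are assumptions; consult the NeMo 2.0 docs for the exact API.
import nemo_run as run
from nemo.collections import llm

recipe = llm.nemotron3_4b.pretrain_recipe(
    name="nemotron3_4b_pretrain",  # experiment name
    num_nodes=1,
    num_gpus_per_node=1,
)

# Run locally on the attached GPU; swap the executor for SLURM or other clusters
run.run(recipe, executor=run.LocalExecutor())
```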
We will use NeMo interfaces to export models for inference with TensorRT-LLM and Triton Inference Server, or vLLM.
Note
Refer to the model deployment tutorial for a detailed walk-through
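The export path follows NeMo's TensorRT-LLM and Triton Inference Server interfaces. The sketch below assumes the nemo.export and nemo.deploy modules and uses placeholder paths; exact argument names vary across NeMo releases.

```python
# Hypothetical sketch: export a NeMo checkpoint to a TensorRT-LLM engine and serve it with Triton.
# Paths are placeholders and argument names may differ between NeMo releases.
from nemo.export.tensorrt_llm import TensorRTLLM
from nemo.deploy import DeployPyTriton

exporter = TensorRTLLM(model_dir="/tmp/trt_llm_engine")
exporter.export(
    nemo_checkpoint_path="/path/to/checkpoint.nemo",  # placeholder checkpoint path
    model_type="llama",
)

# Serve the exported engine behind Triton Inference Server
server = DeployPyTriton(model=exporter, triton_model_name="llama")
server.deploy()
server.serve()
```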
- Code profiling
- Logging training and tuning runs with Weights & Biases
- Model output control with NeMo Guardrails
- Agents as DAGs with LangGraph
- Agent traces with LangSmith
- Containerization with Docker
- System prompt design
The source code found in src/nemo_lab provides examples of implementing concepts from scratch with NeMo; for instance, how we might add a custom model or our own training recipe using the base interfaces and mixins found within the framework. A hypothetical sketch follows the note below.
Note
The current focus for the source code is implementing support for Llama 3.2 variants
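As an illustration of the from-scratch idea, a custom model can be described by subclassing a base config and handing it to the corresponding model class. The class and field names below follow the GPTConfig pattern in nemo.collections.llm, but they are assumptions rather than NeMo Lab's actual implementation.

```python
# Hypothetical sketch: define a small custom model config on top of NeMo 2.0 base interfaces.
# Class and field names are assumptions based on the GPTConfig pattern; not NeMo Lab's code.
from dataclasses import dataclass
from nemo.collections import llm

@dataclass
class TinyGPTConfig(llm.GPTConfig):
    num_layers: int = 4
    hidden_size: int = 512
    num_attention_heads: int = 8
    seq_length: int = 2048

# The config drives model construction
model = llm.GPTModel(TinyGPTConfig())
```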
We will use NVIDIA and Meta models including, but not limited to:
- NVIDIA Llama variants, Mistral variants, Megatron distillations, and Minitron
- NVIDIA embedding, reranking, and retrieval models
- NVIDIA Cosmos tokenizers
- NeMo compatible Meta Llama variants
Tip
See models/ for more on model families and types
- a CUDA-compatible OS and device (GPU) with at least 48GB of VRAM (e.g. an L40S).
- CUDA 12.1
- Python 3.10.10
- PyTorch 2.2.1
Tip
See hardware/ for more regarding VRAM requirements of particular models
- NVIDIA Developer Program
- NVIDIA NGC for NeMo and TensorRT-LLM containers
- build.nvidia.com for API calls to NVIDIA hosted endpoints
- Hugging Face Hub for model weights and datasets
- LangSmith for tracing
- Weights & Biases for experiment management during finetuning
To prepare a development environment, please run the following in terminal:
bash install_requirements.sh
Doing so will install nemo_lab along with nemo_run, megatron_core 0.10.0rc0, and the nvidia/apex PyTorch extension.
Note
megatron_core 0.10.0rc0 is required for compatibility with NeMo 2.0
Note
NVIDIA Apex is required for RoPE scaling in NeMo 2.0. Apex is built with CUDA and C++ extensions for performance and full functionality; please be aware that the build process may take several minutes
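After the install completes, a quick sanity check confirms that the core packages import and that a GPU is visible (a minimal sketch; the __version__ attributes are assumptions for recent releases):

```python
# Minimal environment sanity check after running install_requirements.sh
import torch
import nemo

print(f"NeMo version: {nemo.__version__}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
```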
Important
Running the images requires that the host machine has access to NVIDIA GPUs
Two Docker images have been created for the quickstart tutorials: one for pretraining and one for finetuning.
To run pretraining, do the following in terminal:
docker pull jxtngx/nemo-lab:pretrain
docker run --rm --gpus 1 -it jxtngx/nemo-lab:pretrain
python pretrain_nemotron3_4b.py
To run finetuning, do the following in terminal:
docker pull jxtngx/nemo-lab:finetune
docker run --rm --gpus 1 -it jxtngx/nemo-lab:finetune
# WAIT FOR CONTAINER TO START
huggingface-cli login
# ENTER HF KEY WHEN PROMPTED
python finetune_llama3_8b.py
Important
Finetuning requires a Hugging Face key and access to Llama 3 8B
For keys, see: https://huggingface.co/docs/hub/en/security-tokens
For Llama 3 8B access, see: https://huggingface.co/meta-llama/Meta-Llama-3-8B
| Quickstart | Docker Image | NVIDIA Launchable |
|---|---|---|
| Pretrain Nemotron 3 4B | | |
| Finetune Llama 3 8B | | |
Important
To run the script in the NVIDIA Launchable, use one of the following commands in terminal:
python /workspace/pretrain_nemotron3_4b.py
python /workspace/finetune_llama3_8b.py
- NeMo documentation
- NeMo tutorials
- NeMo Guardrails documentation
- Deploy on a SLURM cluster
- Mixed Precision Training
- CPU Offloading
- Communication Overlap
- NVIDIA NIM (LLM) documentation
- langchain-nvidia-ai-endpoints documentation
- LangGraph documentation
- DSPy
- W&B documentation
- vLLM documentation
- cuVS (GPU accelerated vector search by NVIDIA Rapids)
- Weaviate documentation
- Weaviate and LangChain (LangChain)
- vLLM and LangChain (LangChain)
- Generative AI Explained
- Deploying a Model for Inference at Production Scale
- Sizing LLM Inference Systems
- Building RAG Agents with LLMs
- Introduction to Deploying RAG Pipelines for Production at Scale
- Prompt Engineering with LLaMA-2
- Generative AI and LLMs
- Accelerated LLM Model Alignment and Deployment
- Beyond RAG Basics: Building Agents, Co-Pilots, Assistants, and More!
- Generative AI Essentials
- GTC 2024 - Latest in Generative AI
- What is Generative AI
- Prompt Engineering and P-Tuning
- Scaling LLMs with NVIDIA Triton and NVIDIA TensorRT-LLM
- Turbocharging Meta Llama 3 Performance with NVIDIA TensorRT-LLM and NVIDIA Triton Inference Server
- Post-Training Quantization of LLMs with NVIDIA NeMo and NVIDIA TensorRT Model Optimizer
- Getting Started with Large Language Models for Enterprise Solutions
- Unlocking the Power of Enterprise-Ready LLMs with NVIDIA NeMo
- TensorRT-LLM KV Cache Early Reuse
- An Introduction to Model Merging for LLMs
- A Neural Probabilistic Language Model
- Generative Adversarial Nets
- Sequence to Sequence Learning with Neural Networks
- Neural GPUs Learn Algorithms
- A Structured Self-attentive Sentence Embedding
- Neural Machine Translation by Jointly Learning to Align and Translate
- Attention is All You Need
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
- Improving Language Understanding by Generative Pre-Training
- Language Models are Few-Shot Learners
- Language Models are Unsupervised Multitask Learners
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- LLaMA: Open and Efficient Foundation Language Models
- The Llama 3 Herd of Models
- Compact Language Models via Pruning and Knowledge Distillation
- LLM Pruning and Distillation in Practice: The Minitron Approach
- 8-bit Optimizers via Block-wise Quantization
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
- FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
- LoRA: Low-Rank Adaptation of Large Language Models
- QLoRA: Efficient Finetuning of Quantized LLMs
- RoFormer: Enhanced Transformer with Rotary Position Embedding
- Efficient Memory Management for Large Language Model Serving with PagedAttention
- Efficiently Scaling Transformer Inference
- LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
- Efficient Tool Use with Chain-of-Abstraction Reasoning
- ReAct: Synergizing Reasoning and Acting in Language Models
- Adaptive Mixtures of Local Experts
- Convergence of artificial intelligence and high performance computing on NSF-supported cyberinfrastructure
- Artificial Intelligence: A Modern Approach (Russell, Norvig)
- Deep Learning (Goodfellow, Bengio, Courville)
- Reinforcement Learning (Sutton, Barto)
- Mathematics for Machine Learning (Deisenroth et al)
- Programming Massively Parallel Processors (Hwu et al)
- Deep Learning with PyTorch (Stevens, Antiga, Viehmann)
- Build a Large Language Model (Sebastian Raschka)
- Hands-On Generative AI with Transformers and Diffusion Models (Sanseviero et al)
- Hands-On Large Language Models (Alammar et al)
- StatQuest (Josh Starmer)
- Coding a ChatGPT Like Transformer From Scratch in PyTorch (Josh Starmer)
- Serrano Academy (Luis Serrano)
- Intro to Large Language Models (Andrej Karpathy)
- CS25: V2 I Introduction to Transformers (Stanford, Karpathy)
- Building LLMs from the Ground Up (Sebastian Raschka)
- Coding the Self-Attention Mechanism of LLMs (Sebastian Raschka)
- Neural networks (Grant Sanderson)
- Visualizing Transformers and Attention (Grant Sanderson)
- The Shift from Models to Compound AI Systems (Berkeley AI Research)
- What are Compound Systems (Databricks)
- Agentic Design Patterns (Deep Learning AI)
- Intro to LangGraph (LangChain)
- Prompt Engineering Guide
- Getting Beyond the Hype: A Guide to AI’s Potential (Stanford)