Important
NeMo Lab is under active development
NeMo Lab is an example template for Generative AI with NVIDIA NeMo 2.0.
NVIDIA NeMo is an accelerated, end-to-end platform that is flexible and production ready. NeMo comprises several component frameworks that enable teams to build, customize, and deploy Generative AI solutions for:
- large language models
- vision language models
- video models
- speech models
NeMo Lab is inspired by NeMo tutorials and focuses on using NeMo to train, tune, and serve language models.
Data processing is task-dependent: pretraining and finetuning use different datasets. For pretraining, we will use Hugging Face's Cosmopedia dataset; for finetuning, we will use NeMo's default SquadDataModule, which wraps the Stanford Question Answering Dataset (SQuAD).
Note
Refer to the data processing tutorial for a detailed walk-through
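For a quick look at the pretraining data, the Cosmopedia corpus can be streamed directly from the Hugging Face Hub. The snippet below is a minimal sketch; the subset name ("stories") and the "text" field are assumptions for illustration and are not part of NeMo Lab's data pipeline.

```python
# Minimal sketch: preview the Cosmopedia pretraining corpus via Hugging Face datasets.
# The subset name ("stories") and the "text" field are assumptions for illustration.
from datasets import load_dataset

# Stream the dataset so the full corpus is not downloaded locally
stream = load_dataset("HuggingFaceTB/cosmopedia", "stories", split="train", streaming=True)

for i, sample in enumerate(stream):
    print(sample["text"][:200])  # print the first 200 characters of each document
    if i == 2:
        break
```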
We will use NeMo to train Nemotron 3 4B on the Cosmopedia dataset and tune a Llama variant on the SQuAD dataset.
Note
Refer to the model development tutorial for a detailed walk-through
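As a rough sketch of what the training entry point looks like, NeMo 2.0 exposes prebuilt recipes that can be launched with nemo_run. The recipe name, arguments, and executor below are assumptions that may differ across NeMo releases; see the quickstart scripts for the exact configuration used in NeMo Lab.

```python
# Hypothetical sketch: launch a Nemotron 3 4B pretraining recipe with NeMo 2.0 and nemo_run.
# Recipe and argument names are assumptions; consult the NeMo 2.0 docs for the exact API.
import nemo_run as run
from nemo.collections import llm

recipe = llm.nemotron3_4b.pretrain_recipe(
    name="nemotron3_4b_pretrain",  # experiment name
    num_nodes=1,
    num_gpus_per_node=1,
)

# Run locally on the attached GPU; swap the executor for SLURM or other clusters
run.run(recipe, executor=run.LocalExecutor())
```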
We will use NeMo interfaces to export models for inference with TensorRT-LLM and Triton Inference Server, or vLLM.
Note
Refer to the model deployment tutorial for a detailed walk-through
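The export path follows NeMo's TensorRT-LLM and Triton Inference Server interfaces. The sketch below assumes the nemo.export and nemo.deploy modules and uses placeholder paths; exact argument names vary across NeMo releases.

```python
# Hypothetical sketch: export a NeMo checkpoint to a TensorRT-LLM engine and serve it with Triton.
# Paths are placeholders and argument names may differ between NeMo releases.
from nemo.export.tensorrt_llm import TensorRTLLM
from nemo.deploy import DeployPyTriton

exporter = TensorRTLLM(model_dir="/tmp/trt_llm_engine")
exporter.export(
    nemo_checkpoint_path="/path/to/checkpoint.nemo",  # placeholder checkpoint path
    model_type="llama",
)

# Serve the exported engine behind Triton Inference Server
server = DeployPyTriton(model=exporter, triton_model_name="llama")
server.deploy()
server.serve()
```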
- Code profiling
- Logging training and tuning runs with Weights & Biases
- Model output control with NeMo Guardrails
- Agents as DAGs with LangGraph
- Agent traces with LangSmith
- Containerization with Docker
- System prompt design
The source code found in src/nemo_lab provides examples of implementing concepts from scratch with NeMo; for instance, how we might add a custom model or our own training recipe using the base interfaces and mixins found within the framework. A hypothetical sketch follows the note below.
Note
The current focus for the source code is implementing support for Llama 3.2 variants
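As an illustration of the from-scratch idea, a custom model can be described by subclassing a base config and handing it to the corresponding model class. The class and field names below follow the GPTConfig pattern in nemo.collections.llm, but they are assumptions rather than NeMo Lab's actual implementation.

```python
# Hypothetical sketch: define a small custom model config on top of NeMo 2.0 base interfaces.
# Class and field names are assumptions based on the GPTConfig pattern; not NeMo Lab's code.
from dataclasses import dataclass
from nemo.collections import llm

@dataclass
class TinyGPTConfig(llm.GPTConfig):
    num_layers: int = 4
    hidden_size: int = 512
    num_attention_heads: int = 8
    seq_length: int = 2048

# The config drives model construction
model = llm.GPTModel(TinyGPTConfig())
```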
We will use NVIDIA and Meta models including, but not limited to:
- NVIDIA Llama variants, Mistral variants, Megatron distillations, and Minitron
- NVIDIA embedding, reranking, and retrieval models
- NVIDIA Cosmos tokenizers
- NeMo compatible Meta Llama variants
Tip
See models/ for more on model families and types
- a CUDA-compatible OS and device (GPU) with at least 48GB of VRAM (e.g. an L40S).
- CUDA 12.1
- Python 3.10.10
- PyTorch 2.2.1
Tip
See hardware/ for more regarding VRAM requirements of particular models
- NVIDIA Developer Program
- NVIDIA NGC for NeMo and TensorRT-LLM containers
- build.nvidia.com for API calls to NVIDIA hosted endpoints
- Hugging Face Hub for model weights and datasets
- LangSmith for tracing
- Weights & Biases for experiment management during finetuning
To prepare a development environment, please run the following in terminal:
bash install_requirements.sh
Doing so will install nemo_lab along with nemo_run, megatron_core 0.10.0rc0, and the nvidia/apex PyTorch extension.
Note
megatron_core 0.10.0rc0 is required for compatibility with NeMo 2.0
Note
NVIDIA Apex is required for RoPE scaling in NeMo 2.0. Apex is built with CUDA and C++ extensions for performance and full functionality; please be aware that the build process may take several minutes
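After the install completes, a quick sanity check confirms that the core packages import and that a GPU is visible (a minimal sketch; the __version__ attributes are assumptions for recent releases):

```python
# Minimal environment sanity check after running install_requirements.sh
import torch
import nemo

print(f"NeMo version: {nemo.__version__}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
```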
Important
Running the images requires that the host machine has access to NVIDIA GPUs
Two Docker images have been created for the quickstart tutorials: one for pretraining and one for finetuning.
To run pretraining, do the following in terminal:
docker pull jxtngx/nemo-lab:pretrain
docker run --rm --gpus 1 -it jxtngx/nemo-lab:pretrain
python pretrain_nemotron3_4b.py
To run finetuning, do the following in terminal:
docker pull jxtngx/nemo-lab:finetune
docker run --rm --gpus 1 -it jxtngx/nemo-lab:finetune
# WAIT FOR CONTAINER TO START
huggingface-cli login
# ENTER HF KEY WHEN PROMPTED
python finetune_llama3_8b.py
Important
Finetuning requires a Hugging Face key and access to Llama 3 8B
For keys, see: https://huggingface.co/docs/hub/en/security-tokens
For Llama 3 8B access, see: https://huggingface.co/meta-llama/Meta-Llama-3-8B
| Quickstart | Docker Image | NVIDIA Launchable |
|---|---|---|
| Pretrain Nemotron 3 4B | | |
| Finetune Llama 3 8B | | |
Important
To run the script in the NVIDIA Launchable, use one of the following commands in terminal:
python /workspace/pretrain_nemotron3_4b.py
python /workspace/finetune_llama3_8b.py
- NeMo documentation
- NeMo tutorials
- NeMo Guardrails documentation
- Deploy on a SLURM cluster
- Mixed Precision Training
- CPU Offloading
- Communication Overlap
- NVIDIA NIM (LLM) documentation
- langchain-nvidia-ai-endpoints documentation
- LangGraph documentation
- DSPy
- W&B documentation
- vLLM documentation
- cuVS (GPU accelerated vector search by NVIDIA Rapids)
- Weaviate documentation
- Weaviate and LangChain (LangChain)
- vLLM and LangChain (LangChain)
- Generative AI Explained
- Deploying a Model for Inference at Production Scale
- Sizing LLM Inference Systems
- Building RAG Agents with LLMs
- Introduction to Deploying RAG Pipelines for Production at Scale
- Prompt Engineering with LLaMA-2
- Generative AI and LLMs
- Accelerated LLM Model Alignment and Deployment
- Beyond RAG Basics: Building Agents, Co-Pilots, Assistants, and More!
- Generative AI Essentials
- GTC 2024 - Latest in Generative AI
- What is Generative AI
- Prompt Engineering and P-Tuning
- Scaling LLMs with NVIDIA Triton and NVIDIA TensorRT-LLM
- Turbocharging Meta Llama 3 Performance with NVIDIA TensorRT-LLM and NVIDIA Triton Inference Server
- Post-Training Quantization of LLMs with NVIDIA NeMo and NVIDIA TensorRT Model Optimizer
- Getting Started with Large Language Models for Enterprise Solutions
- Unlocking the Power of Enterprise-Ready LLMs with NVIDIA NeMo
- TensorRT-LLM KV Cache Early Reuse
- An Introduction to Model Merging for LLMs
- A Neural Probabilistic Language Model
- Generative Adversarial Nets
- Sequence to Sequence Learning with Neural Networks
- Neural GPUs Learn Algorithms
- A Structured Self-attentive Sentence Embedding
- Neural Machine Translation by Jointly Learning to Align and Translate
- Attention is All You Need
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
- Improving Language Understanding by Generative Pre-Training
- Language Models are Few-Shot Learners
- Language Models are Unsupervised Multitask Learners
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- LLaMA: Open and Efficient Foundation Language Models
- The Llama 3 Herd of Models
- Compact Language Models via Pruning and Knowledge Distillation
- LLM Pruning and Distillation in Practice: The Minitron Approach
- 8-bit Optimizers via Block-wise Quantization
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
- FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
- LoRA: Low-Rank Adaptation of Large Language Models
- QLoRA: Efficient Finetuning of Quantized LLMs
- RoFormer: Enhanced Transformer with Rotary Position Embedding
- Efficient Memory Management for Large Language Model Serving with PagedAttention
- Efficiently Scaling Transformer Inference
- LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
- Efficient Tool Use with Chain-of-Abstraction Reasoning
- ReAct: Synergizing Reasoning and Acting in Language Models
- Adaptive Mixtures of Local Experts
- Convergence of artificial intelligence and high performance computing on NSF-supported cyberinfrastructure
- Artificial Intelligence: A Modern Approach (Russell, Norvig)
- Deep Learning (Goodfellow, Bengio, Courville)
- Reinforcement Learning (Sutton, Barto)
- Mathematics for Machine Learning (Deisenroth et al)
- Programming Massively Parallel Processors (Hwu et al)
- Deep Learning with PyTorch (Stevens, Antiga, Viehmann)
- Build a Large Language Model (Sebastian Raschka)
- Hands-On Generative AI with Transformers and Diffusion Models (Sanseviero et al)
- Hands-On Large Language Models (Alammar et al)
- StatQuest (Josh Starmer)
- Coding a ChatGPT Like Transformer From Scratch in PyTorch (Josh Starmer)
- Serrano Academy (Luis Serrano)
- Intro to Large Language Models (Andrej Karpathy)
- CS25: V2 I Introduction to Transformers (Stanford, Karpathy)
- Building LLMs from the Ground Up (Sebastian Raschka)
- Coding the Self-Attention Mechanism of LLMs (Sebastian Raschka)
- Neural networks (Grant Sanderson)
- Visualizing Transformers and Attention (Grant Sanderson)
- The Shift from Models to Compound AI Systems (Berkeley AI Research)
- What are Compound Systems (Databricks)
- Agentic Design Patterns (Deep Learning AI)
- Intro to LangGraph (LangChain)
- Prompt Engineering Guide
- Getting Beyond the Hype: A Guide to AI’s Potential (Stanford)