Dev Notes

Profiling Kernels with Nsight Systems & Nsight Compute

Nsight Systems

Documentation: https://docs.nvidia.com/nsight-systems/UserGuide/#cli-profiling

Command:

PYTHONPATH=. nsys profile -t cuda,osrt,nvtx,cudnn,cublas -w true -o ./nvidia_nsight/nsys_mlstm_xlchunksize python scripts/run_training_kernel_benchmarks_with_profile.py
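The `nvtx` entry in the `-t` list only shows something on the timeline if the profiled script emits NVTX ranges. Whether the benchmark script already does so is not stated here, so the following is only a minimal sketch of how a region could be annotated. It assumes PyTorch is the framework in use; `nvtx_range` and the range name `mlstm_fw` are hypothetical names chosen for illustration.

```python
from contextlib import contextmanager

# Assumption: PyTorch is used by the benchmark. Fall back to a no-op otherwise.
try:
    import torch
    _HAS_CUDA = torch.cuda.is_available()
except ImportError:
    torch = None
    _HAS_CUDA = False


@contextmanager
def nvtx_range(name: str):
    """Mark a named region with NVTX (does nothing when CUDA is unavailable)."""
    if _HAS_CUDA:
        torch.cuda.nvtx.range_push(name)
    try:
        yield
    finally:
        if _HAS_CUDA:
            torch.cuda.nvtx.range_pop()


# Usage: wrap the code whose kernels should be grouped on the timeline.
with nvtx_range("mlstm_fw"):
    result = sum(range(10))  # stand-in for the actual kernel call
```

Ranges can be nested, which helps separate forward and backward passes inside one benchmark iteration.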

Nsight Compute

Documentation: https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html

Command:

PYTHONPATH=. ncu -o kernel_prof -f -c 1 -k mlstm_chunkwise__parallel_fw_Hintra_kernel --set=full python ./scripts/run_training_kernel_benchmarks_with_profile.py
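In the command above, `-o` names the report file, `-f` overwrites an existing report, `-c 1` stops after one profiled kernel launch, `-k` selects the kernel by name, and `--set=full` collects the full section set. When several kernels need to be profiled, it can help to generate the command line. The helper below is a hypothetical sketch (`build_ncu_command` is not part of this repo); it only builds the argument list and does not run `ncu`.

```python
import shlex


def build_ncu_command(
    kernel_name: str,
    output: str = "kernel_prof",
    launch_count: int = 1,
    script: str = "./scripts/run_training_kernel_benchmarks_with_profile.py",
) -> list[str]:
    """Build the ncu argument list for profiling a single kernel."""
    return [
        "ncu",
        "-o", output,             # report file name
        "-f",                     # overwrite an existing report
        "-c", str(launch_count),  # number of kernel launches to profile
        "-k", kernel_name,        # profile only kernels with this name
        "--set=full",             # collect the full section set
        "python", script,
    ]


cmd = build_ncu_command("mlstm_chunkwise__parallel_fw_Hintra_kernel")
print(shlex.join(cmd))
```

To execute it, the list could be passed to `subprocess.run` with `PYTHONPATH=.` added to the environment, mirroring the shell command.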

Running kernel benchmarks with baselines

To run the benchmarks with all baselines, install the following packages:

pip install mamba_ssm causal_conv1d fla

For FlashAttention 3, clone the original repo https://github.com/Dao-AILab/flash-attention and install it from source (the same steps also install FlashAttention 2):

# Clone FlashAttention next to this repo
cd ..
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention
# Apply CONDA ENV patch
git apply ../mlstm_kernels/flash_attention.patch
# Install FlashAttention 3
cd hopper
PYTHONPATH=. python3 setup.py install
cd ..
# Install regular FlashAttention 2
python3 -m pip install -e .
# Go back to this repo
cd ../mlstm_kernels
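Before starting a long benchmark run, it can be useful to check that the baseline packages are importable. The sketch below assumes the import names match the package names used above (`mamba_ssm`, `causal_conv1d`, `fla`) plus `flash_attn` for FlashAttention 2; these names and the helper `missing_baselines` are assumptions for illustration, not something the benchmark scripts are known to provide.

```python
from importlib.util import find_spec

# Assumed top-level import names of the baseline packages.
BASELINE_MODULES = ["mamba_ssm", "causal_conv1d", "fla", "flash_attn"]


def missing_baselines(modules=BASELINE_MODULES):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if find_spec(name) is None]


missing = missing_baselines()
if missing:
    print("Missing baseline packages:", ", ".join(missing))
else:
    print("All baseline packages found.")
```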
