Links: Website | SVG 1 Paper | SVG 2 Paper | SVG 1 Twitter/X | SVG 2 Twitter/X
- [2025/09] We release Flash k-Means, a batched K-Means clustering algorithm implemented with Triton that offers >10x speedup!
- [2025/09] Sparse VideoGen2 is open-sourced! HunyuanVideo, Wan 2.1, and Cosmos can be accelerated by 2×!
- [2025/09] Sparse VideoGen2 is accepted by NeurIPS 2025 as a spotlight!
- [2025/05] Sparse VideoGen is accepted by ICML 2025!
- [2025/04] Wan 2.1 is supported! Both T2V and I2V are accelerated.
- [2025/03] Sparse VideoGen is open-sourced! HunyuanVideo and CogVideoX v1.5 can be accelerated by 2×!
Sparse VideoGen 1 & 2 are training-free frameworks that leverage the inherent sparsity of 3D Full Attention to accelerate video generation.
Sparse VideoGen 1's core contributions:
- Identifying the spatial and temporal sparsity patterns in video diffusion models.
- Proposing an Online Profiling Strategy to dynamically identify these patterns (see the sketch after this list).
- Implementing an end-to-end generation framework through efficient algorithm-system co-design, with hardware-efficient layout transformation and customized kernels.
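To make the online profiling idea concrete, here is a minimal PyTorch sketch (our illustration, not the repository's implementation): a few query rows are sampled per head, the output under each candidate sparse mask is compared against dense attention, and the lower-error pattern wins. `profile_heads`, `spatial_mask`, and `temporal_mask` are hypothetical names standing in for the spatial/temporal patterns above.

```python
# Minimal sketch of online profiling (illustrative; not the actual SVG code).
import torch
import torch.nn.functional as F

def profile_heads(q, k, v, spatial_mask, temporal_mask, num_samples=32):
    """q, k, v: [heads, seq, dim]; masks: [seq, seq] boolean (True = keep)."""
    H, S, D = q.shape
    idx = torch.randint(0, S, (num_samples,), device=q.device)
    qs = q[:, idx]                                # sampled query rows
    scores = qs @ k.transpose(-1, -2) / D ** 0.5  # [heads, samples, seq]
    ref = F.softmax(scores, dim=-1) @ v           # dense reference output

    errors = []
    for mask in (spatial_mask, temporal_mask):
        masked = scores.masked_fill(~mask[idx], float("-inf"))
        out = F.softmax(masked, dim=-1) @ v
        # per-head error of this sparse pattern vs. the dense reference
        errors.append((out - ref).pow(2).mean(dim=(1, 2)))
    # 0 = spatial wins for that head, 1 = temporal wins
    return torch.stack(errors).argmin(dim=0)
```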
Sparse VideoGen 2's core contributions:
- Tackling inaccurate token identification and computation waste in video diffusion.
- Introducing semantic-aware sparse attention with efficient token permutation (see the sketch after this list).
- Providing an end-to-end system design with a dynamic attention kernel and a flash k-means kernel.
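And a minimal sketch of the semantic-aware permutation idea (again our illustration, with a plain-torch Lloyd's loop standing in for the bundled flash-kmeans kernel): tokens are clustered by their key vectors, permuted so each cluster is contiguous, attended, and restored to the original order.

```python
# Illustrative semantic-aware token permutation (not the actual SVG2 code).
import torch

def permuted_attention(q, k, v, num_clusters=8, iters=10):
    """q, k, v: [seq, dim]; returns attention output in the original order."""
    # k-means over key vectors; plain-torch stand-in for flash-kmeans
    c = k[torch.randperm(k.shape[0])[:num_clusters]].clone()
    for _ in range(iters):
        labels = torch.cdist(k, c).argmin(dim=-1)
        for j in range(num_clusters):
            members = k[labels == j]
            if members.numel():
                c[j] = members.mean(dim=0)
    perm = labels.argsort()   # same-cluster tokens become contiguous
    inv = perm.argsort()      # inverse permutation to undo the reorder
    qp, kp, vp = q[perm], k[perm], v[perm]
    # SVG2 would run a block-sparse kernel over cluster blocks here;
    # dense attention is used below purely as a placeholder.
    attn = torch.softmax(qp @ kp.T / q.shape[-1] ** 0.5, dim=-1)
    return (attn @ vp)[inv]
```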
Demo videos: Comp_A.mp4 | Comp_F.mp4 | Comp_L.mp4
Begin by cloning the repository:
GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/svg-project/Sparse-VideoGen.git # Skip the LFS demo assets; otherwise the clone is too large
cd Sparse-VideoGen
We recommend CUDA 12.4 or 12.8 with PyTorch 2.5.1 or 2.6.0.
# 1. Create and activate conda environment
conda create -n SVG python==3.12.9 # or 3.11.9 if the kernel installation fails
conda activate SVG
# 2. Install uv, then install other packages
pip install uv
uv pip install -e .
# 3. Install FlashAttention
pip install flash-attn --no-build-isolation
# 4. Install customized kernels. (You might need to upgrade your cmake and CUDA version.)
pip install -U setuptools # Requires at least version 77.0.0
git submodule update --init --recursive
cd svg/kernels
pip install -U cmake
bash setup.sh
cd 3rdparty/flashinfer
cp ../../../../assets/patches/modifications.patch ./
git apply modifications.patch
pip install --no-build-isolation --verbose --editable . # Block Sparse Attention with varied block sizes
pip install cuvs-cu12 --extra-index-url=https://pypi.nvidia.com # NVIDIA cuVS
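After these steps, an optional sanity check (our suggestion, not part of the official setup) confirms the core dependencies import cleanly:

```python
# Quick post-install sanity check (optional).
import torch
import flash_attn

print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("flash-attn:", flash_attn.__version__)
print("GPU available:", torch.cuda.is_available())
```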
You don't need to install flash-kmeans separately. A copy of flash-kmeans is included in Sparse VideoGen and is used by default.
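For reference, one assignment-plus-update step of batched k-means, written in plain PyTorch, shows what the fused Triton kernels accelerate. `batched_kmeans_step` is our illustrative name; the real flash-kmeans API may differ.

```python
# Plain-PyTorch reference for one batched k-means step; flash-kmeans fuses
# this assignment + update into Triton kernels (actual API may differ).
import torch

def batched_kmeans_step(x, centroids):
    """x: [batch, n, dim]; centroids: [batch, k, dim] -> updated centroids."""
    # assign every point to its nearest centroid, independently per batch
    labels = torch.cdist(x, centroids).argmin(dim=-1)           # [batch, n]
    one_hot = torch.nn.functional.one_hot(
        labels, centroids.shape[1]).to(x.dtype)                 # [batch, n, k]
    counts = one_hot.sum(dim=1).clamp(min=1)                    # [batch, k]
    sums = one_hot.transpose(1, 2) @ x                          # [batch, k, dim]
    return sums / counts.unsqueeze(-1)                          # new centroids
```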
We support Text-to-Video and Image-to-Video inference of the Wan 2.1 model. The running scripts are:
# Text-to-Video
# bash scripts/wan/wan_t2v_720p_svg.sh # SVG
bash scripts/wan/wan_t2v_720p_sap.sh # SVG2
# Image-to-Video
# bash scripts/wan/wan_i2v_720p_svg.sh # SVG
bash scripts/wan/wan_i2v_720p_sap.sh # SVG2
We support Text-to-Video inference of the HunyuanVideo model. The running scripts are:
# bash scripts/hyvideo/hyvideo_t2v_720p_svg.sh # SVG
bash scripts/hyvideo/hyvideo_t2v_720p_sap.sh # SVG2
We evaluate the performance of our customized kernels against the baseline implementations. The following tables show the memory bandwidth (GB/s) comparison for different batch sizes and hidden dimensions:
Batch Size | Hidden Dim | Diffusers (GB/s) | SVG Customized (GB/s) | Speedup |
---|---|---|---|---|
2,097,152 | 32 | 151.36 | 809.69 | 5.35× |
1,048,576 | 64 | 196.54 | 810.61 | 4.12× |
524,288 | 128 | 232.66 | 810.21 | 3.48× |
262,144 | 256 | 252.67 | 810.41 | 3.21× |
Batch Size | Hidden Dim | Diffusers (GB/s) | SVG Customized (GB/s) | Speedup |
---|---|---|---|---|
2,097,152 | 32 | 45.82 | 808.28 | 17.64× |
1,048,576 | 64 | 91.18 | 805.22 | 8.83× |
524,288 | 128 | 197.89 | 804.29 | 4.06× |
262,144 | 256 | 350.87 | 804.43 | 2.29× |
Our customized kernels achieve significantly higher memory bandwidth across all configurations, with speedups ranging from 2.29× to 17.64×. The performance improvement is particularly notable for smaller hidden dimensions and larger batch sizes.
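For context, bandwidth figures like these are typically obtained by timing the kernel with CUDA events and dividing the bytes read plus written by the elapsed time. A sketch of that methodology (our illustration, not the repository's benchmark script; LayerNorm is used as a stand-in workload):

```python
# Sketch of a memory-bandwidth measurement (illustrative methodology only).
import torch

def bandwidth_gbs(fn, *tensors, iters=100):
    out = fn(*tensors)                      # warm-up; output used for sizing
    # approximate bytes moved as input sizes read + output size written
    bytes_moved = sum(t.numel() * t.element_size() for t in (*tensors, out))
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*tensors)
    end.record()
    torch.cuda.synchronize()
    ms = start.elapsed_time(end) / iters    # average milliseconds per call
    return bytes_moved / (ms * 1e-3) / 1e9  # bytes per second -> GB/s

x = torch.randn(2_097_152, 32, device="cuda", dtype=torch.float16)
norm = torch.nn.LayerNorm(32, device="cuda", dtype=torch.float16)
print(f"LayerNorm bandwidth: {bandwidth_gbs(norm, x):.1f} GB/s")
```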
Batch Size | Num Heads | Seq Length | Head Dim | Diffusers (GB/s) | SVG Customized (GB/s) | Speedup |
---|---|---|---|---|---|---|
1 | 32 | 1024 | 64 | 17.25 | 158.81 | 9.21× |
1 | 32 | 4096 | 64 | 27.74 | 405.75 | 14.63× |
1 | 32 | 16384 | 64 | 30.86 | 605.89 | 19.63× |
4 | 32 | 1024 | 64 | 27.60 | 475.94 | 17.24× |
4 | 32 | 4096 | 64 | 30.93 | 614.11 | 19.85× |
4 | 32 | 16384 | 64 | 32.41 | 648.36 | 20.00× |
The RoPE implementation in SVG shows substantial performance improvements over the Diffusers baseline, with speedups ranging from 9.21× to 20.00×. The performance gain is particularly significant for longer sequence lengths and larger batch sizes, demonstrating excellent scaling characteristics.
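For readers unfamiliar with the operation being benchmarked, here is a reference rotary position embedding (RoPE) in plain PyTorch; it shows why the op is memory-bound (element-wise rotations over the whole tensor). The interleaved pair layout below is an assumption; the customized kernel's layout may differ.

```python
# Reference RoPE in plain PyTorch (illustrative; kernel layout may differ).
import torch

def apply_rope(x, base=10000.0):
    """x: [batch, heads, seq, head_dim] with even head_dim."""
    b, h, s, d = x.shape
    inv_freq = base ** (-torch.arange(0, d, 2, device=x.device) / d)
    positions = torch.arange(s, device=x.device).float()
    angles = torch.outer(positions, inv_freq)     # [seq, head_dim / 2]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]           # interleaved pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin          # rotate each 2-D pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```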
If you find Sparse VideoGen useful or interesting for your research and applications, please cite our work using BibTeX:
@article{xi2025sparse,
title={Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity},
author={Xi, Haocheng and Yang, Shuo and Zhao, Yilong and Xu, Chenfeng and Li, Muyang and Li, Xiuyu and Lin, Yujun and Cai, Han and Zhang, Jintao and Li, Dacheng and others},
journal={arXiv preprint arXiv:2502.01776},
year={2025}
}
@article{yang2025sparse,
title={Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation},
author={Yang, Shuo and Xi, Haocheng and Zhao, Yilong and Li, Muyang and Zhang, Jintao and Cai, Han and Lin, Yujun and Li, Xiuyu and Xu, Chenfeng and Peng, Kelly and others},
journal={arXiv preprint arXiv:2505.18875},
year={2025}
}