Sparse-VideoGen

Accelerate Video Generation with High Pixel-level Fidelity

| Website | SVG1 Paper | SVG2 Paper | SVG1 Twitter/X | SVG2 Twitter/X |

🔥News🔥

  • [2025/09] We release Flash k-Means, a batched k-means clustering algorithm implemented in Triton that offers a >10× speedup!
  • [2025/09] Sparse VideoGen2 is open-sourced! HunyuanVideo, Wan 2.1, and Cosmos can be accelerated by 2×.
  • [2025/09] Sparse VideoGen2 is accepted by NeurIPS 2025 as a spotlight!
  • [2025/05] Sparse VideoGen is accepted by ICML 2025!
  • [2025/04] Wan 2.1 is supported! Both T2V and I2V are accelerated.
  • [2025/03] Sparse VideoGen is open-sourced! HunyuanVideo and CogVideoX v1.5 can be accelerated by 2×.

📚 About

Sparse VideoGen 1 & 2 are training-free frameworks that leverage the inherent sparsity of 3D Full Attention to accelerate video generation.

Sparse VideoGen 1's core contributions:

  • Identifying the spatial and temporal sparsity patterns in video diffusion models.
  • Proposing an Online Profiling Strategy to dynamically identify these patterns (see the sketch after this list).
  • Implementing an end-to-end generation framework through efficient algorithm-system co-design, with hardware-efficient layout transformation and customized kernels.
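
To make the Online Profiling Strategy concrete, here is a minimal PyTorch sketch (illustrative only, not the repository's API; the real implementation uses wider attention windows and fused kernels): sample a few query rows, score each head under a spatial mask (attend within the same frame) and a temporal mask (attend to the same position across frames), and keep whichever mask tracks full attention more closely.

import torch

def profile_heads(q, k, v, num_frames, tokens_per_frame, n_samples=32):
    # q, k, v: (heads, seq, dim), with seq == num_frames * tokens_per_frame
    H, S, D = q.shape
    assert S == num_frames * tokens_per_frame
    idx = torch.randperm(S)[:n_samples]                      # sampled query rows
    frame = torch.arange(S) // tokens_per_frame              # frame id of each token
    pixel = torch.arange(S) % tokens_per_frame               # intra-frame position
    spatial = frame[idx, None] == frame[None, :]             # same-frame mask (n, S)
    temporal = pixel[idx, None] == pixel[None, :]            # same-position mask (n, S)
    def attn(mask):
        s = (q[:, idx] @ k.transpose(-1, -2)) / D**0.5       # (heads, n, S)
        return torch.softmax(s.masked_fill(~mask, float("-inf")), dim=-1) @ v
    full = attn(torch.ones(n_samples, S, dtype=torch.bool))  # reference: full attention
    err_spatial = (attn(spatial) - full).pow(2).mean(dim=(1, 2))
    err_temporal = (attn(temporal) - full).pow(2).mean(dim=(1, 2))
    return torch.where(err_spatial <= err_temporal, 0, 1)    # per head: 0=spatial, 1=temporal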

Sparse VideoGen 2's core contributions:

  • Tackling inaccurate token identification and computation waste in video diffusion.
  • Introducing semantic-aware sparse attention with efficient token permutation (see the sketch after this list).
  • Providing an end-to-end system design with a dynamic attention kernel and a flash k-means kernel.
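
The semantic-aware idea can be sketched similarly (a toy version, not SVG2's kernels): cluster tokens by content with k-means on the keys, then compute attention only within each cluster. Gathering each cluster's indices plays the role of the token permutation that the real system performs physically, so that every cluster forms a dense, hardware-friendly block.

import torch

def semantic_sparse_attention(q, k, v, n_clusters=8, iters=10):
    # q, k, v: (seq, dim); SVG2 operates per head with fused custom kernels
    S, D = k.shape
    centroids = k[torch.randperm(S)[:n_clusters]].clone()    # init centroids from keys
    for _ in range(iters):                                   # plain k-means on keys
        assign = torch.cdist(k, centroids).argmin(dim=1)
        for j in range(n_clusters):
            if (assign == j).any():
                centroids[j] = k[assign == j].mean(dim=0)
    out = torch.empty_like(v)
    for j in range(n_clusters):                              # attend within each cluster
        ids = (assign == j).nonzero(as_tuple=True)[0]        # this gather is the "permutation"
        if ids.numel() == 0:
            continue
        s = (q[ids] @ k[ids].T) / D**0.5
        out[ids] = torch.softmax(s, dim=-1) @ v[ids]
    return out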

🎥 Demo of SVG1

🎥 Demo of SVG2

(Demo videos from the original README: Comp_A.mp4, Comp_F.mp4, Comp_L.mp4.)

🛠️ Installation

Begin by cloning the repository:

GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/svg-project/Sparse-VideoGen.git # Skip the LFS demo videos; otherwise the clone is too large
cd Sparse-VideoGen

We recommend CUDA 12.4 / 12.8 with PyTorch 2.5.1 / 2.6.0.

# 1. Create and activate conda environment
conda create -n SVG python==3.12.9 # or 3.11.9 if kernel installation fails
conda activate SVG

# 2. Install uv, then install other packages
pip install uv
uv pip install -e .

# 3. Install FlashAttention
pip install flash-attn --no-build-isolation

# 4. Install customized kernels. (You might need to upgrade your cmake and CUDA versions.)
pip install -U setuptools # Requires at least version 77.0.0
git submodule update --init --recursive
cd svg/kernels
pip install -U cmake
bash setup.sh
cd 3rdparty/flashinfer
cp ../../../../assets/patches/modifications.patch ./
git apply modifications.patch
pip install --no-build-isolation --verbose --editable . # Block-sparse attention with varied block sizes
pip install cuvs-cu12 --extra-index-url=https://pypi.nvidia.com
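
After installing, a quick sanity check in Python (a minimal sketch; flash_attn and flashinfer are the upstream module names and may differ in your build):

import torch
import flash_attn, flashinfer  # assumed module names for the kernels installed above
print(torch.__version__, torch.version.cuda, torch.cuda.is_available())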

You don't need to install flash-kmeans separately. A copy of flash-kmeans is included in Sparse VideoGen and is used by default.
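
For reference, the operation flash-kmeans accelerates is batched k-means: many independent clustering problems solved in parallel on the GPU. A plain-PyTorch equivalent (illustrative only; the bundled Triton kernels are what deliver the >10× speedup) looks like:

import torch

def batched_kmeans(x, n_clusters, iters=20):
    # x: (batch, n_points, dim) -> centroids (batch, k, dim), assignments (batch, n_points)
    B, N, D = x.shape
    idx = torch.randint(0, N, (B, n_clusters))
    centroids = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, D))  # init from data
    for _ in range(iters):
        assign = torch.cdist(x, centroids).argmin(dim=2)                 # nearest centroid
        onehot = torch.nn.functional.one_hot(assign, n_clusters).to(x.dtype)
        counts = onehot.sum(dim=1).clamp(min=1).unsqueeze(-1)            # avoid divide-by-zero
        centroids = onehot.transpose(1, 2) @ x / counts                  # recompute means
    return centroids, assign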

🚀 Inference Examples

Wan 2.1

We support Text-to-Video and Image-to-Video inference for the Wan 2.1 model. The run scripts are:

# Text-to-Video
# bash scripts/wan/wan_t2v_720p_svg.sh # SVG
bash scripts/wan/wan_t2v_720p_sap.sh # SVG2

# Image-to-Video
# bash scripts/wan/wan_i2v_720p_svg.sh # SVG
bash scripts/wan/wan_i2v_720p_sap.sh # SVG2

HunyuanVideo

The running scripts are:

# bash scripts/hyvideo/hyvideo_t2v_720p_svg.sh # SVG
bash scripts/hyvideo/hyvideo_t2v_720p_sap.sh # SVG2

📑 Open-source Plan

Efficiency Benchmark

Customized Kernels Performance

We evaluate our customized kernels against the baseline implementations in Diffusers. The following tables compare memory bandwidth (GB/s) across batch sizes and hidden dimensions:

RMSNorm Performance

| Batch Size | Hidden Dim | Diffusers (GB/s) | SVG Customized (GB/s) | Speedup |
|------------|------------|------------------|-----------------------|---------|
| 2,097,152  | 32         | 151.36           | 809.69                | 5.35×   |
| 1,048,576  | 64         | 196.54           | 810.61                | 4.12×   |
| 524,288    | 128        | 232.66           | 810.21                | 3.48×   |
| 262,144    | 256        | 252.67           | 810.41                | 3.21×   |

LayerNorm Performance

| Batch Size | Hidden Dim | Diffusers (GB/s) | SVG Customized (GB/s) | Speedup |
|------------|------------|------------------|-----------------------|---------|
| 2,097,152  | 32         | 45.82            | 808.28                | 17.64×  |
| 1,048,576  | 64         | 91.18            | 805.22                | 8.83×   |
| 524,288    | 128        | 197.89           | 804.29                | 4.06×   |
| 262,144    | 256        | 350.87           | 804.43                | 2.29×   |

Our customized kernels achieve significantly higher memory bandwidth across all configurations, with speedups ranging from 2.29× to 17.64×. The performance improvement is particularly notable for smaller hidden dimensions and larger batch sizes.
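
For context on how such numbers are typically obtained (a generic sketch, not the repository's benchmark harness): achieved bandwidth is the bytes read plus the bytes written, divided by kernel time.

import torch

def measure_bandwidth_gbps(fn, x, warmup=10, iters=100):
    # Times fn(x) with CUDA events; assumes fn reads and writes x's size once each.
    for _ in range(warmup):
        fn(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(x)
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1e3 / iters          # elapsed_time is in ms
    bytes_moved = 2 * x.numel() * x.element_size()           # one read + one write per element
    return bytes_moved / seconds / 1e9

x = torch.randn(2_097_152, 32, device="cuda", dtype=torch.float16)
rmsnorm = lambda t: t * torch.rsqrt(t.float().pow(2).mean(-1, keepdim=True) + 1e-6).to(t.dtype)
print(f"{measure_bandwidth_gbps(rmsnorm, x):.2f} GB/s")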

RoPE (Rotary Position Embedding) Performance

| Batch Size | Num Heads | Seq Length | Head Dim | Diffusers (GB/s) | SVG Customized (GB/s) | Speedup |
|------------|-----------|------------|----------|------------------|-----------------------|---------|
| 1          | 32        | 1024       | 64       | 17.25            | 158.81                | 9.21×   |
| 1          | 32        | 4096       | 64       | 27.74            | 405.75                | 14.63×  |
| 1          | 32        | 16384      | 64       | 30.86            | 605.89                | 19.63×  |
| 4          | 32        | 1024       | 64       | 27.60            | 475.94                | 17.24×  |
| 4          | 32        | 4096       | 64       | 30.93            | 614.11                | 19.85×  |
| 4          | 32        | 16384      | 64       | 32.41            | 648.36                | 20.00×  |

The RoPE implementation in SVG shows substantial performance improvements over the Diffusers baseline, with speedups ranging from 9.21× to 20.00×. The performance gain is particularly significant for longer sequence lengths and larger batch sizes, demonstrating excellent scaling characteristics.
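
For context, the computation being benchmarked is the standard rotary embedding: each pair of channels is rotated by a position-dependent angle. A minimal PyTorch reference (interleaved-pair convention; layouts vary across libraries, and the SVG kernel fuses everything into one pass):

import torch

def apply_rope(x, base=10000.0):
    # x: (batch, heads, seq, head_dim), head_dim even
    *_, s, d = x.shape
    inv_freq = 1.0 / base ** (torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = torch.arange(s, dtype=torch.float32)[:, None] * inv_freq[None, :]  # (seq, d/2)
    cos = angles.cos().repeat_interleave(2, dim=-1)              # (seq, d)
    sin = angles.sin().repeat_interleave(2, dim=-1)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack((-x_odd, x_even), dim=-1).flatten(-2)  # rotate adjacent pairs
    return x * cos + rotated * sin

out = apply_rope(torch.randn(1, 32, 1024, 64))                   # matches the first table row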

🔗 BibTeX

If you find Sparse VideoGen useful or interesting for your research and applications, please cite our work using BibTeX:

@article{xi2025sparse,
  title={Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity},
  author={Xi, Haocheng and Yang, Shuo and Zhao, Yilong and Xu, Chenfeng and Li, Muyang and Li, Xiuyu and Lin, Yujun and Cai, Han and Zhang, Jintao and Li, Dacheng and others},
  journal={arXiv preprint arXiv:2502.01776},
  year={2025}
}

@article{yang2025sparse,
  title={Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation},
  author={Yang, Shuo and Xi, Haocheng and Zhao, Yilong and Li, Muyang and Zhang, Jintao and Cai, Han and Lin, Yujun and Li, Xiuyu and Xu, Chenfeng and Peng, Kelly and others},
  journal={arXiv preprint arXiv:2505.18875},
  year={2025}
}
