Requires Python 3.9. Install the dependencies with:
pip install -r requirements.txt
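If you prefer an isolated environment, a conda setup would look like this (the environment name ardiff is just a placeholder, not something the repo prescribes):
conda create -n ardiff python=3.9
conda activate ardiff
pip install -r requirements.txt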
Make sure all model checkpoints are placed in the experiments folder.
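One way to fetch the released checkpoints (see the Hugging Face link at the bottom of this page) is the huggingface-cli tool; this assumes the files in that repository are laid out the way the scripts expect, so double-check the resulting paths under experiments:
huggingface-cli download TrizZZZ/ar_diffusion --local-dir experiments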
- Testing the reconstruction performance of the VAE.
bash shell_scripts/stage1_vae_scripts/face_vae_infer.sh
- Testing the generation performance of the AR-Diffusion model.
DATA_FILE_NAME=facelatent
MODEL_FILE_NAME=diff_ardiff_vtattn_x0pred_nvae_midt
VALIDDATA_FILE=facevidflatten
TRAINDATA_FILE=$DATA_FILE_NAME
bash shell_scripts/stage2_ardiff_scripts/face_gen/infer_base_script.sh $TRAINDATA_FILE $VALIDDATA_FILE $MODEL_FILE_NAME 2.0 16 5
- Training the VAE on video frames.
NOTE: Please download the tokenizer_titok_l32.bin file from https://huggingface.co/TrizZZZ/ar_diffusion and place it in the root folder before training the VAE (an example download command is shown after this step).
bash shell_scripts/stage1_vae_scripts/sky_vae_train.sh
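For the note above, the tokenizer file can be pulled from the same Hugging Face repository, e.g. (assuming the file lives at the top level of that repo):
huggingface-cli download TrizZZZ/ar_diffusion tokenizer_titok_l32.bin --local-dir .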
- Finetuning the VAE on videos with temporal causal attention.
bash shell_scripts/stage1_vae_scripts/sky_vae_train_ftwt.sh
- Extracting video latents with the VAE to speed up training of the AR-Diffusion model.
Enable (uncomment) the line 'bash shell_scripts/base_vae/infer_savelatent_script.sh' in shell_scripts/stage1_vae_scripts/sky_vae_infer.sh and run that script (see the sketch below).
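Concretely, the edit plus launch would look like this (a sketch; we read the original instruction as uncommenting that line, and the rest of sky_vae_infer.sh is left untouched):
# inside shell_scripts/stage1_vae_scripts/sky_vae_infer.sh, make sure this line is active:
bash shell_scripts/base_vae/infer_savelatent_script.sh
# then launch latent extraction:
bash shell_scripts/stage1_vae_scripts/sky_vae_infer.sh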
- Training the AR-Diffusion model.
DATA_FILE_NAME=skyvidlatent
MODEL_FILE_NAME=diff_ardiff_vtattn_x0pred_nvae_midt
bash shell_scripts/stage2_ardiff_scripts/sky_gen/train_base_script.sh ${DATA_FILE_NAME} ${MODEL_FILE_NAME}
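Since training is long-running, one optional pattern (not part of the repo's scripts) is to detach the job and keep a log file:
nohup bash shell_scripts/stage2_ardiff_scripts/sky_gen/train_base_script.sh ${DATA_FILE_NAME} ${MODEL_FILE_NAME} > train_ardiff.log 2>&1 &
tail -f train_ardiff.log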
Checkpoints of the VAE and AR-Diffusion models on the Sky-Timelapse, TaiChi-HD, UCF101, and FaceForensics datasets have been uploaded to the Hugging Face Hub: https://huggingface.co/TrizZZZ/ar_diffusion
More video samples can be viewed at: https://anonymouss765.github.io/AR-Diffusion
TODO
@inproceedings{ardiff,
  title={AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion},
  author={Sun, Mingzhen and Wang, Weining and Li, Gen and Liu, Jiawei and Sun, Jiahui and Feng, Wanquan and Lao, Shanshan and Zhou, SiYu and He, Qian and Liu, Jing},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}