StreetCrafter: Street View Synthesis with Controllable Video Diffusion Models
Yunzhi Yan*, Zhen Xu*, Haotong Lin, Haian Jin, Haoyu Guo, Yida Wang, Kun Zhan, Xianpeng Lang, Hujun Bao, Xiaowei Zhou, Sida Peng
CVPR 2025
Demo video: street_crafter.mp4
git clone https://github.com/zju3dv/street_crafter.git --recursive
Our model is tested on a single A100/A800 80GB GPU.
conda create -n streetcrafter python=3.9
conda activate streetcrafter
# Install PyTorch (CUDA 12.1 build)
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
# Install requirements
pip install -r requirements.txt
# Install gsplat
pip install "git+https://github.com/dendenxu/gsplat.git"
# This issue might help when installation fails: https://github.com/nerfstudio-project/gsplat/issues/226
# Install submodules
pip install ./submodules/sdata
pip install ./submodules/simple-knn
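After installation, a quick sanity check like the sketch below (our own helper, not shipped with the repository) can confirm that the CUDA build of PyTorch and the compiled extensions import correctly; the module names gsplat and simple_knn are assumptions about what the packages above install.
# sanity_check.py -- hypothetical helper to verify the environment
import torch
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
# The compiled extensions are assumed to expose these module names.
import gsplat        # rasterization backend installed from the gsplat fork
import simple_knn    # k-nearest-neighbor extension from submodules/simple-knn
print("gsplat and simple_knn imported successfully")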
Please go to the data_processor directory and refer to its README.md for processing details.
We also provide some example scenes at this link. You can skip the processing steps and download the data to the data/waymo directory.
The pretrained model weights can be downloaded from this link and placed under the video_diffusion/ckpts directory. We also provide model weights trained using multiple Waymo cameras at this link.
Run inference with the video diffusion model
python render.py --config {config_path} mode diffusion
Alternatively, you can run inference by setting the path of the meta info file.
# Run this command under the video_diffusion directory
python sample_condition.py
We distill the video diffusion model into a dynamic 3D representation based on the codebase of Street Gaussians. Please refer to street_gaussian/config/config.py for parameter details.
Train street gaussian
python train.py --config {config_path}
Render input trajectory
python render.py --config {config_path} mode trajectory
Render novel trajectory
python render.py --config {config_path} mode novel_view
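The three steps above can also be chained in a small driver script such as the following sketch; it only shells out to the documented commands, and configs/example.yaml is a hypothetical placeholder for your actual scene config.
# run_scene.py -- hypothetical convenience wrapper around the documented commands
import subprocess

CONFIG = "configs/example.yaml"  # placeholder; replace with your scene config path

# Distill the video diffusion model into the dynamic 3D representation.
subprocess.run(["python", "train.py", "--config", CONFIG], check=True)
# Render the input trajectory.
subprocess.run(["python", "render.py", "--config", CONFIG, "mode", "trajectory"], check=True)
# Render a novel trajectory.
subprocess.run(["python", "render.py", "--config", CONFIG, "mode", "novel_view"], check=True)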
First download the model weights of Vista from this link to the video_diffusion/ckpts directory.
We finetune the video diffusion model based on the codebase of Vista. Please refer to their official documentation for environment setup and training details.
# Run this command under the video_diffusion directory
sh training.sh
Pipeline overview: (a) We process the LiDAR using calibrated images and object tracklets to obtain a colorized point cloud, which can be rendered to image space as pixel-level conditions. (b) Given observed images and a reference image embedding, the video diffusion model is finetuned with the rendered LiDAR conditions to synthesize controllable street views.
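To make the pixel-level conditioning in (a) concrete, the following is a minimal sketch of projecting a colorized point cloud into a camera image with a simple z-buffer; it assumes a pinhole camera model and NumPy arrays, and is our own simplification rather than the repository's LiDAR renderer.
import numpy as np

def render_point_cloud(points_world, colors, K, w2c, H, W):
    """Splat a colorized point cloud (N,3 xyz + N,3 rgb) into an HxW image.

    K   : 3x3 pinhole intrinsics (assumed).
    w2c : 4x4 world-to-camera extrinsics (assumed).
    A per-pixel z-buffer keeps the nearest point; empty pixels stay black.
    """
    # Transform points into the camera frame.
    pts_h = np.concatenate([points_world, np.ones((len(points_world), 1))], axis=1)
    pts_cam = (w2c @ pts_h.T).T[:, :3]

    # Keep points in front of the camera.
    valid = pts_cam[:, 2] > 1e-3
    pts_cam, colors = pts_cam[valid], colors[valid]

    # Perspective projection to pixel coordinates.
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    u, v, z = uv[:, 0].astype(int), uv[:, 1].astype(int), pts_cam[:, 2]

    # Discard points that fall outside the image.
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    u, v, z, colors = u[inside], v[inside], z[inside], colors[inside]

    image = np.zeros((H, W, 3), dtype=np.float32)
    depth = np.full((H, W), np.inf, dtype=np.float32)
    # Draw far points first so nearer points overwrite them (painter's algorithm).
    order = np.argsort(-z)
    image[v[order], u[order]] = colors[order]
    depth[v[order], u[order]] = z[order]
    return image, depth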
If you find this code useful for your research, please use the following BibTeX entry.
@inproceedings{yan2024streetcrafter,
title={StreetCrafter: Street View Synthesis with Controllable Video Diffusion Models},
author={Yan, Yunzhi and Xu, Zhen and Lin, Haotong and Jin, Haian and Guo, Haoyu and Wang, Yida and Zhan, Kun and Lang, Xianpeng and Bao, Hujun and Zhou, Xiaowei and Peng, Sida},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2025},
}