Towards End-to-End Generative Modeling of Long Videos with Memory-Efficient Bidirectional Transformers (CVPR 2023)
This repository is an official implementation of the paper:
Towards End-to-End Generative Modeling of Long Videos with Memory-Efficient Bidirectional Transformers (CVPR 2023)
Jaehoon Yoo, Semin Kim, Doyup Lee, Chiheon Kim, Seunghoon Hong
Project Page | Paper
We installed the packages specified in requirements.txt
on top of the following Docker image:
docker pull pytorch/pytorch:1.10.0-cuda11.3-cudnn8-devel
docker run -it --shm-size=24G pytorch/pytorch:1.10.0-cuda11.3-cudnn8-devel /bin/bash
git clone
cd MeBT
pip install -r requirements.txt
- Extract all frames in each video. The filename should be
[VIDEO_ID]_[FRAME_NUM].[png, jpg, ...]
- Create train.txt and test.txt, each listing the paths of all frames in the corresponding split. For example,
find $(pwd)/dataset/train -name "*.png" >> 'train.txt'
find $(pwd)/dataset/test -name "*.png" >> 'test.txt'
- The txt files should be located as
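The frame-listing step above can also be done in Python. The following is a minimal sketch, not part of this repository: the helper names split_frame_name and write_frame_list are hypothetical, and it assumes the frame number is the numeric part after the last underscore in the filename.

```python
from pathlib import Path


def split_frame_name(path):
    """Split '[VIDEO_ID]_[FRAME_NUM].ext' into (video_id, frame_num).

    Assumption: FRAME_NUM is the all-digit part after the LAST underscore,
    so the video ID itself may contain underscores.
    """
    stem = Path(path).stem
    video_id, sep, frame = stem.rpartition("_")
    if not sep or not video_id or not frame.isdigit():
        raise ValueError(f"unexpected frame filename: {path}")
    return video_id, int(frame)


def write_frame_list(frame_dir, list_path, pattern="*.png"):
    """Write the absolute path of every matching frame, one per line.

    Returns the number of frames written. Unlike `find ... >> file`,
    this overwrites the list file instead of appending to it.
    """
    frames = sorted(Path(frame_dir).resolve().rglob(pattern))
    for frame in frames:
        split_frame_name(frame)  # fail early on badly named files
    Path(list_path).write_text("".join(f"{frame}\n" for frame in frames))
    return len(frames)
```

For example, write_frame_list("dataset/train", "train.txt") would produce the same kind of list as the first find command. Because the find commands append with >>, rerunning them adds duplicate entries unless the txt files are deleted first.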
You can control the experiments with configuration files.
The default configuration files can be found in the configs folder.
Here is an example of the config file.
target: mebt.transformer.Net2NetTransformer
unconditional: True
vocab_size: 16384 # You should follow the vocab_size of 3d VQGAN.
first_stage_vocab_size: 16384
block_size: 1024 # total number of input tokens (output of 3d VQGAN.)
n_layer: 24 # number of layers for MeBT
n_head: 16 # number of attention heads
n_embd: 1024 # hidden dimension
n_unmasked: 0
embd_pdrop: 0.1 # Dropout ratio
resid_pdrop: 0.1 # Dropout ratio
attn_pdrop: 0.1 # Dropout ratio
sample_every_n_latent_frames: 0
first_stage_key: video # ignore
cond_stage_key: label # ignore
vtokens: False # ignore
vtokens_pos: False # ignore
vis_epoch: 100
sos_emb: 256 # Number of latent tokens.
avg_loss: True
mode: # You may stack different types of layers. The total number of entries must match n_layer.
- latent_enc
- latent_self
- latent_enc
- latent_self
- latent_enc
- latent_self
- latent_enc
- latent_self
- latent_enc
- latent_self
- latent_enc
- latent_self
- latent_enc
- latent_dec
- lt2l
- latent_dec
- lt2l
- latent_dec
- lt2l
- latent_dec
- lt2l
- latent_dec
- lt2l
- latent_dec
target: mebt.mask_sampler.MaskGen
iid: False
schedule: linear
max_token: 1024 # total number of input tokens (output of 3d VQGAN.)
method: 'mlm'
shape: [4, 16, 16] # shape of the output of 3d VQGAN. (T, H, W)
t_range: [0.0, 1.0]
budget: 1024 # total number of input tokens (output of 3d VQGAN.)
ckpt_path: 'ckpts/vqgan_sky_128_488_epoch=12-step=29999-train.ckpt' # Path to the 3d VQGAN checkpoint.
ignore_keys: ['loss']
data_path: 'datasets/vqgan_data/stl_128' # [DATA_PATH]
sequence_length: 16 # Length of the training video (in frames)
resolution: 128 # Resolution of the training video (in pixels)
batch_size: 6 # Batch_size per GPU
num_workers: 8
image_channels: 3
smap_cond: 0
smap_only: False
text_cond: False
vtokens: False
vtokens_pos: False
spatial_length: 0
sample_every_n_frames: 1
image_folder: True
stft_data: False
exact_lr: 1.08e-05 # learning rate
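Several values in the config above have to agree with each other: the mode list has 24 entries to match n_layer, and block_size, max_token and budget all equal 4 x 16 x 16 = 1024, the number of tokens in shape. The sketch below checks these relations on a plain dictionary. It is only an illustration: check_config and latent_shape are hypothetical helpers, not functions of this repository, and reading "488" in the checkpoint name as downsampling factors (4, 8, 8) is an assumption.

```python
EXAMPLE = {
    "n_layer": 24,
    "block_size": 1024,
    "max_token": 1024,
    "budget": 1024,
    "shape": [4, 16, 16],  # (T, H, W) of the 3d VQGAN output
    # Same stacking order as the example config: 24 entries in total.
    "mode": ["latent_enc", "latent_self"] * 6 + ["latent_enc"]
            + ["latent_dec", "lt2l"] * 5 + ["latent_dec"],
}


def latent_shape(sequence_length, resolution, downsample=(4, 8, 8)):
    """Token grid (T, H, W) for a square video clip.

    Assumption: the 3d VQGAN downsamples time by 4 and space by 8,
    which is how "488" in the checkpoint filename is read here.
    """
    dt, dh, dw = downsample
    return (sequence_length // dt, resolution // dh, resolution // dw)


def check_config(cfg):
    """Return a list of inconsistencies; an empty list means none found."""
    problems = []
    if len(cfg["mode"]) != cfg["n_layer"]:
        problems.append(
            f"mode has {len(cfg['mode'])} entries, n_layer is {cfg['n_layer']}"
        )
    t, h, w = cfg["shape"]
    n_tokens = t * h * w
    for key in ("block_size", "max_token", "budget"):
        if cfg[key] != n_tokens:
            problems.append(f"{key}={cfg[key]}, but shape holds {n_tokens} tokens")
    return problems
```

With sequence_length: 16 and resolution: 128, latent_shape(16, 128) gives (4, 16, 16), which matches the shape entry, and check_config(EXAMPLE) returns an empty list.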
The scripts for training can be found in the scripts
folder. You may execute a script as follows:
bash scripts/ [CONFIG_FILE] [LOG_DIR] [GPU_IDs]
- [GPU_IDs]:
- 0, : use GPU_ID 0 only.
- 0,1,2,3,4,5,6,7 : use 8 GPUs from 0 to 7.
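To make the [GPU_IDs] format concrete, here is a small sketch of how such a string can be interpreted. The helpers parse_gpu_ids and effective_batch_size are hypothetical and are not how the training script itself reads the argument. The batch-size calculation assumes plain data-parallel training with no gradient accumulation, where each GPU processes batch_size samples per step.

```python
def parse_gpu_ids(spec):
    """Parse a string such as '0,' or '0,1,2,3' into a list of GPU indices.

    A trailing comma is allowed, so '0,' means "GPU 0 only".
    """
    ids = [int(part) for part in spec.split(",") if part.strip()]
    if not ids:
        raise ValueError(f"no GPU id found in {spec!r}")
    if len(set(ids)) != len(ids):
        raise ValueError(f"duplicate GPU id in {spec!r}")
    return ids


def effective_batch_size(spec, per_gpu_batch_size):
    """Total samples per step, assuming one batch per listed GPU."""
    return len(parse_gpu_ids(spec)) * per_gpu_batch_size
```

With batch_size: 6 from the example config, effective_batch_size("0,1,2,3,4,5,6,7", 6) is 48, while effective_batch_size("0,", 6) is 6. Keep this in mind when changing the number of GPUs, since the learning rate in the config was chosen for a particular setup.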
The scripts for inference can be found in the scripts
folder. You may execute a script as follows:
bash scripts/valid_dnr_config_ckpt_exp_[stl, taichi, ucf]_[16f, 128f].sh [CONFIG_FILE] [CKPT_PATH] [SAVE_DIR]
- You should change the [DATA_PATH] in the script file to measure FVD and KVD.
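The bracketed parts of the inference script name stand for one dataset and one video length. The sketch below builds the command line for a given choice. The helper names are hypothetical, and whether a script exists for every dataset and length combination is an assumption, so check the scripts folder before relying on it.

```python
DATASETS = ("stl", "taichi", "ucf")
LENGTHS = ("16f", "128f")


def inference_script(dataset, length):
    """Return the script path for one dataset / video-length choice."""
    if dataset not in DATASETS:
        raise ValueError(f"dataset must be one of {DATASETS}, got {dataset!r}")
    if length not in LENGTHS:
        raise ValueError(f"length must be one of {LENGTHS}, got {length!r}")
    return f"scripts/valid_dnr_config_ckpt_exp_{dataset}_{length}.sh"


def inference_command(dataset, length, config_file, ckpt_path, save_dir):
    """Build the argument list: bash <script> [CONFIG_FILE] [CKPT_PATH] [SAVE_DIR]."""
    return ["bash", inference_script(dataset, length),
            config_file, ckpt_path, save_dir]
```

The returned list can be passed to subprocess.run(cmd, check=True) from the repository root. The config, checkpoint and output paths are placeholders that you supply yourself.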
- Our code is based on VQGAN and TATS.
- The development of this open-sourced code was supported in part by the National Research Foundation of Korea (NRF) (No. 2021R1A4A3032834).
title={Towards End-to-End Generative Modeling of Long Videos with Memory-Efficient Bidirectional Transformers},
author={Jaehoon Yoo and Semin Kim and Doyup Lee and Chiheon Kim and Seunghoon Hong},
journal={arXiv preprint arXiv:2303.11251},