Skip to content
/ MeBT Public

official implementation of the paper: Towards End-to-End Generative Modeling of Long Videos with Memory-Efficient Bidirectional Transformers (CVPR 2023)

Notifications You must be signed in to change notification settings

Ugness/MeBT

Repository files navigation

Towards End-to-End Generative Modeling of Long Videos with Memory-Efficient Bidirectional Transformers (CVPR 2023)

This repository is an official implementation of the paper:
Towards End-to-End Generative Modeling of Long Videos with Memory-Efficient Bidirectional Transformers (CVPR 2023)
Jaehoon Yoo, Semin Kim, Doyup Lee, Chiheon Kim, Seunghoon Hong
Project Page | Paper

Setup

We installed the packages specified in requirements.txt based on this docker image

docker pull pytorch/pytorch:1.10.0-cuda11.3-cudnn8-devel
docker run -it --shm-size=24G pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime /bin/bash
git clone https://github.com/Ugness/MeBT
mv MeBT
pip install requirements.txt

Datasets

Download

Data preprocessing & setup

  1. Extract all frames in each video. The filename should be [VIDEO_ID]_[FRAME_NUM].[png, jpg, ...]
  2. create train.txt and test.txt containing the directory of the entire frames. For example,
find $(pwd)/dataset/train -name "*.png" >> 'train.txt'
find $(pwd)/dataset/test -name "*.png" >> 'test.txt'
  1. The txt files should be located as [DATA_PATH]/train.txt and [DATA_PATH]/test.txt.

Checkpoints

  1. VQGAN: Checkpoints for the 3d VQGAN can be found here.
  2. MeBT: Checkpoints for MeBT can be found here

Configuration Files

You may control the experiments with a configuration files. The default configuration files can be found in the configs folder.

Here is an example of the config file.

model:
    target: mebt.transformer.Net2NetTransformer
    params:
        unconditional: True
        vocab_size: 16384 # You should follow the vocab_size of 3d VQGAN.
        first_stage_vocab_size: 16384
        block_size: 1024 # total number of input tokens (output of 3d VQGAN.)
        n_layer: 24 # number of layers for MeBT
        n_head: 16 # number of attention heads
        n_embd: 1024 # hidden dimension
        n_unmasked: 0
        embd_pdrop: 0.1 # Dropout ratio
        resid_pdrop: 0.1 # Dropout ratio
        attn_pdrop: 0.1 # Dropout ratio
        sample_every_n_latent_frames: 0
        first_stage_key: video # ignore
        cond_stage_key: label # ignore
        vtokens: False # ignore
        vtokens_pos: False # ignore
        vis_epoch: 100
        sos_emb: 256 # Number of latent tokens.
        avg_loss: True
        mode: # You may stack different type of layers. The total number of layers should be matched with n_layer
            - latent_enc
            - latent_self
            - latent_enc
            - latent_self
            - latent_enc
            - latent_self
            - latent_enc
            - latent_self
            - latent_enc
            - latent_self
            - latent_enc
            - latent_self
            - latent_enc
            - latent_dec
            - lt2l
            - latent_dec
            - lt2l
            - latent_dec
            - lt2l
            - latent_dec
            - lt2l
            - latent_dec
            - lt2l
            - latent_dec
    mask:
        target: mebt.mask_sampler.MaskGen
        params:
            iid: False
            schedule: linear
            max_token: 1024 # total number of input tokens (output of 3d VQGAN.)
            method: 'mlm'
            shape: [4, 16, 16] # shape of the output of 3d VQGAN. (T, H, W)
            t_range: [0.0, 1.0]
            budget: 1024 # total number of input tokens (output of 3d VQGAN.)

    vqvae:
        params:
            ckpt_path: 'ckpts/vqgan_sky_128_488_epoch=12-step=29999-train.ckpt' # Path to the 3d VQGAN checkpoint.
            ignore_keys: ['loss']

data:
    data_path: 'datasets/vqgan_data/stl_128' # [DATA_PATH]
    sequence_length: 16 # Length of the training video (in frames)
    resolution: 128 # Resolution of the training video (in pixels)
    batch_size: 6 # Batch_size per GPU
    num_workers: 8
    image_channels: 3
    smap_cond: 0
    smap_only: False
    text_cond: False
    vtokens: False
    vtokens_pos: False
    spatial_length: 0
    sample_every_n_frames: 1
    image_folder: True
    stft_data: False

exp:
    exact_lr: 1.08e-05 # learning rate

Training

The scripts for training can be found in scripts folder. You may excute the script as following:

bash scripts/train_config_log_gpus.sh [CONFIG_FILE] [LOG_DIR] [GPU_IDs]
  • [GPU_IDs]:
    • 0, : use GPU_ID 0 only.
    • 0,1,2,3,4,5,6,7 : use 8 GPUs from 0 to 7.

Inference

The scripts for inference can be found in scripts folder. You may excute the script as following:

bash scripts/valid_dnr_config_ckpt_exp_[stl, taichi, ucf]_[16f, 128f].sh [CONFIG_FILE] [CKPT_PATH] [SAVE_DIR]
  • You should change the [DATA_PATH] in the script file to measure FVD and KVD.

Acknowledgements

  • Our code is based on VQGAN and TATS.
  • The development of this open-sourced code was supported in part by the National Research Foundation of Korea (NRF) (No. 2021R1A4A3032834).

Citation

@article{yoo2023mebt,
         title={Towards End-to-End Generative Modeling of Long Videos with Memory-Efficient Bidirectional Transformers},
         author={Jaehoon Yoo, Semin Kim, Doyup Lee, Chiheon Kim, Seunghoon Hong},
         journal={arXiv preprint arXiv:2303.11251},
         year={2023}
}

About

official implementation of the paper: Towards End-to-End Generative Modeling of Long Videos with Memory-Efficient Bidirectional Transformers (CVPR 2023)

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published