Our training data mirrors that of Video-LLaVA, so follow their instructions here to download it.
After downloading, the data should be structured as follows:
```
data/download/videollava
├── valley_llavaimage.json
├── videochatgpt_llavaimage_tune.json
├── valley
│   ├── 000001_000050
│   └── ...
├── llava_image_tune
│   ├── coco
│   ├── gqa
│   ├── ocr_vqa
│   ├── textvqa
│   └── vg
└── videochatgpt_tune
    ├── v_---9CpRcKoU.mp4
    └── ...
```
Here is a bare-bones shell script you can use to reproduce our method:
ID="merv-run"
torchrun --standalone --nnodes 1 --nproc-per-node 8 scripts/pretrain_video.py \
--run_id $ID \
--model.model_id $ID \
--model.type "merv-base" \
--dataset.type "videollava" \
--stage finetune
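This launches a single-node run on 8 GPUs (`--standalone` and `--nnodes 1` are standard `torchrun` flags). If your machine has a different GPU count, change `--nproc-per-node` accordingly; you may also want to revisit the batch-size settings in the config so the effective batch size stays comparable.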
To modify some parameters, feel free to adjust any of the configs outlined in `merv/conf/models.py`.
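The `--model.*` overrides in these commands map onto fields of those config classes. The sketch below is illustrative only: the class name and defaults are placeholders, while the field names are the ones used as overrides in this README.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative placeholder; the real configs are defined in merv/conf/models.py
# and selected via --model.type.
@dataclass
class ExampleMERVModelConfig:
    model_id: str = "merv-base"

    # Visual encoders in the mixture, plus frames sampled per encoder.
    video_backbone_ids: List[str] = field(default_factory=list)
    num_frames: List[int] = field(default_factory=list)

    # Tokens per frame emitted by the projector, and the total visual sequence
    # length: visual_feature_length = projector_token_length x temporal
    # resolution after encoding (see the example below).
    projector_token_length: int = 64
    visual_feature_length: int = 1024  # assumes a temporal resolution of 16
```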
For example, to swap out the encoders and adjust the projector token length from 64 to 16, you can run:
ID="merv-novel"
# Visual_feature_length = Projector_token_length x temporal_resolution (i.e. # of frames after encoding)
torchrun --standalone --nnodes 1 --nproc-per-node 8 scripts/pretrain_video.py \
--run_id $ID \
--model.model_id $ID \
--model.type "merv-base" \
--model.video_backbone_ids ['languagebind-video-noclass','dinov2-video-all-tokens','hiera-base-plus-video','siglip-vit-b16-224px-all-no-cls'] \
--model.num_frames [16,16,32,16] \
--model.projector_token_length 16 \
--model.visual_feature_length 256 \
--dataset.type "videollava" \
--stage finetune
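To make the formula in the comment concrete for this example: with a projector token length of 16 and (assuming) 16 frames remaining after encoding, the visual feature length works out to the 256 passed above. A quick check, not part of the training code:

```python
projector_token_length = 16  # --model.projector_token_length
temporal_resolution = 16     # frames after encoding (assumed for this encoder set)
visual_feature_length = projector_token_length * temporal_resolution
assert visual_feature_length == 256  # matches --model.visual_feature_length
```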