Our training data mirrors that of Video-LLaVA, so follow their instructions here to download it.
After downloading, the data should be structured as follows:
```
data/download/videollava
├── valley_llavaimage.json
├── videochatgpt_llavaimage_tune.json
├── valley
│   ├── 000001_000050
│   └── ...
├── llava_image_tune
│   ├── coco
│   ├── gqa
│   ├── ocr_vqa
│   ├── textvqa
│   └── vg
└── videochatgpt_tune
    ├── v_---9CpRcKoU.mp4
    └── ...
```
Here is a bare-bones shell script you can use to reproduce our method:
ID="merv-run"
torchrun --standalone --nnodes 1 --nproc-per-node 8 scripts/pretrain_video.py \
--run_id $ID \
--model.model_id $ID \
--model.type "merv-base" \
--dataset.type "videollava" \
--stage finetune
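This launches a single-node run on 8 GPUs (`--standalone` and `--nnodes 1` are standard `torchrun` flags). If your machine has a different GPU count, change `--nproc-per-node` accordingly; you may also want to revisit the batch-size settings in the config so the effective batch size stays comparable.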
To modify some parameters, feel free to adjust any of the configs outlined in `merv/conf/models.py`.
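The `--model.*` overrides in these commands map onto fields of those config classes. The sketch below is illustrative only: the class name and defaults are placeholders, while the field names are the ones used as overrides in this README.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative placeholder; the real configs are defined in merv/conf/models.py
# and selected via --model.type.
@dataclass
class ExampleMERVModelConfig:
    model_id: str = "merv-base"

    # Visual encoders in the mixture, plus frames sampled per encoder.
    video_backbone_ids: List[str] = field(default_factory=list)
    num_frames: List[int] = field(default_factory=list)

    # Tokens per frame emitted by the projector, and the total visual sequence
    # length: visual_feature_length = projector_token_length x temporal
    # resolution after encoding (see the example below).
    projector_token_length: int = 64
    visual_feature_length: int = 1024  # assumes a temporal resolution of 16
```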
For example, to swap out the encoders and adjust the projector token length from 64 to 16, you can run:
ID="merv-novel"
# Visual_feature_length = Projector_token_length x temporal_resolution (i.e. # of frames after encoding)
torchrun --standalone --nnodes 1 --nproc-per-node 8 scripts/pretrain_video.py \
--run_id $ID \
--model.model_id $ID \
--model.type "merv-base" \
--model.video_backbone_ids ['languagebind-video-noclass','dinov2-video-all-tokens','hiera-base-plus-video','siglip-vit-b16-224px-all-no-cls'] \
--model.num_frames [16,16,32,16] \
--model.projector_token_length 16 \
--model.visual_feature_length 256 \
--dataset.type "videollava" \
--stage finetune
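To make the formula in the comment concrete for this example: with a projector token length of 16 and (assuming) 16 frames remaining after encoding, the visual feature length works out to the 256 passed above. A quick check, not part of the training code:

```python
projector_token_length = 16  # --model.projector_token_length
temporal_resolution = 16     # frames after encoding (assumed for this encoder set)
visual_feature_length = projector_token_length * temporal_resolution
assert visual_feature_length == 256  # matches --model.visual_feature_length
```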