A deep learning library built for video understanding tasks. It relies primarily on PyTorch Lightning and wandb, and takes inspiration from fast.ai.
Clone the repository, then run `cd video_understanding` and `pip install -e .`.
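As a copy-pasteable version of the same commands:

```bash
cd video_understanding
pip install -e .   # editable install, so local changes take effect immediately
```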
Use black to format the code: run `black .` in the main directory. If you do not have black installed, install it with `pip install black`.
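In shell form:

```bash
pip install black   # only needed if black is not installed yet
black .             # reformats the repository in place
```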
Before submitting changes, format with `black .` and run `pytest` in the parent directory.
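As a single pre-submission check, assuming both tools are installed:

```bash
black . && pytest   # format, then run the test suite
```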
Is it usable yet? In its current state, absolutely not. You can see all the TODOs at the bottom of this README.
- Add UCF101 dataset to start
- Create video frame loader
- Create video data visualizer (from dataset name)
- Add CLIP implementation (like ViFi-CLIP)
- Add initial version of trainer with wandb support in train.py (a minimal hypothetical sketch follows this list)
- Add general trainer code
- Add debug mode and lr_find flag for running train.py
- Get good classification performance on UCF101
- Make as a package (pip install -e .)
- Delete the temporary ckpt file that gets created
- Figure out why the learning rate and momentum are not logged to wandb
- Create good nested config system
- Fix config test
- Add hyperparameter sweeps with wandb
- Detect the number of classes for the dataset(s) automatically (requires setup)
- Finishing touches on wandb sweeps
- Configure model checkpoint locations
- Create evaluate.py (give config, create args for metrics, and give test/val.csv)
- Revamp the pytest test bed
- Add hmdb dataset
- Add Kinetics-400 dataset
- Add Kinetics-700 dataset
- Add WebVid10M dataset
- Add RewrittenWebVid dataset
- Revisit codebase to support video-text matching as a task.
- Allow for multiple datasets for training (how to weight them?)
- Modify code to allow for large CSV loading
- Add multiple dataset support + Kinetics-400
- Add NTP part to the codebase
- Add the RewrittenWebVid dataset to the codebase.
- Add VideoMAEv2 model to codebase? Maybe we just do a frame-level MAE model instead?
- Add LoRA fine-tuning capabilities (especially important for LLaMA models and maybe for video encoders too)
- Explore other optimizers?
- Create next TODOs
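For a flavor of where the trainer items above are headed, here is a minimal, hypothetical sketch of a PyTorch Lightning training entry point with a wandb logger. The `VideoClassifier` module and the random-tensor dataset are placeholders invented for illustration, not this repo's API; the `LearningRateMonitor` callback shown is one standard Lightning way to get the learning rate (and momentum) logged to wandb.

```python
# Hypothetical sketch only: `VideoClassifier` and the fake dataset are
# placeholders for illustration, not part of this repository's API.
import pytorch_lightning as pl
import torch
from pytorch_lightning.callbacks import LearningRateMonitor, ModelCheckpoint
from pytorch_lightning.loggers import WandbLogger
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


class VideoClassifier(pl.LightningModule):
    """Toy video classifier: mean-pool frame features, then a linear head."""

    def __init__(self, num_classes: int = 101, lr: float = 1e-3):
        super().__init__()
        self.save_hyperparameters()
        self.head = nn.Linear(512, num_classes)

    def forward(self, x):
        # x has shape (batch, frames, features); average over the frame axis.
        return self.head(x.mean(dim=1))

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.cross_entropy(self(x), y)
        self.log("train/loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.hparams.lr)


if __name__ == "__main__":
    # Random tensors stand in for pre-extracted video features and labels.
    dataset = TensorDataset(torch.randn(64, 8, 512), torch.randint(0, 101, (64,)))
    trainer = pl.Trainer(
        max_epochs=1,
        logger=WandbLogger(project="video_understanding"),
        callbacks=[
            # Logs the optimizer's learning rate (and momentum) each step.
            LearningRateMonitor(logging_interval="step", log_momentum=True),
            ModelCheckpoint(dirpath="checkpoints/"),
        ],
    )
    trainer.fit(VideoClassifier(), DataLoader(dataset, batch_size=16))
```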