We mainly follow VINDLU to prepare the environment.
```shell
# create the conda environment
conda env create -f vl.yml
# activate it
conda activate vl
```
To run UMT pretraining, you first need to prepare the weights of the CLIP visual encoder as described in extract.ipynb, and then set `MODEL_PATH` in clip.py to point to the extracted weights.
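For reference, here is a minimal sketch of what the extraction step looks like, assuming the OpenAI `clip` package and a ViT-B/16 backbone; the model name and output path below are illustrative assumptions, so follow extract.ipynb for the exact names used in this repo:

```python
# Sketch of the extraction step: load an OpenAI CLIP checkpoint and
# save only the visual-encoder weights. The model name ("ViT-B/16")
# and output path are illustrative; use the values from extract.ipynb.
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

# Load the full CLIP model on CPU (inference weights only)
model, _ = clip.load("ViT-B/16", device="cpu")

# Keep only the visual encoder's parameters, stripping the "visual." prefix
visual_state = {
    k[len("visual."):]: v
    for k, v in model.state_dict().items()
    if k.startswith("visual.")
}

# Save to the file that MODEL_PATH in clip.py should point to
torch.save(visual_state, "vit_b16.pth")
```

After saving, point `MODEL_PATH` in clip.py at the resulting file (e.g. `MODEL_PATH = "/path/to/vit_b16.pth"`, path hypothetical).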