This outline explains how you can use the repo for your own Vision & Language problem, be it Visual Question Answering, Visual Reasoning, Classification, etc.
For applying it to the Hateful Memes dataset, refer to SCORE_REPRO.md.
If anything comes up, feel free to drop an issue, send a PR, or email me at n.muennighoff@gmail.com.
Extracting important features from images before training is the current standard in VL, as it significantly speeds things up. If you don't have extracted features yet, you can use the subrepo `vilio/py-bottom-up-attention`. Place a folder named `img` with all your images into `vilio/py-bottom-up-attention/data`.
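Before running extraction, it can be worth sanity-checking that every file in `img` is a readable image, since a single corrupt file can abort a long extraction run. A minimal sketch (this helper is not part of the repo):

```python
import os
from PIL import Image  # pip install pillow

IMG_DIR = "vilio/py-bottom-up-attention/data/img"

# Try to open every image once; PIL's verify() catches truncated/corrupt files.
for fname in sorted(os.listdir(IMG_DIR)):
    path = os.path.join(IMG_DIR, fname)
    try:
        with Image.open(path) as im:
            im.verify()
    except Exception as e:
        print(f"Corrupt or unreadable image: {path} ({e})")
```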
- Clone the repo:
```bash
git clone https://github.com/Muennighoff/vilio.git
```
- Setup extraction:
```bash
cd vilio/py-bottom-up-attention
pip install -r requirements.txt
pip install 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
python setup.py build develop
```
- Then run feature extraction as follows:
```bash
cd vilio/py-bottom-up-attention
python detectron2_mscoco_proposal_maxnms.py --batchsize 4 --split img --weight vgattr --minboxes 36 --maxboxes 36
```
I recommend leaving the parameters as they are. Increasing the number of boxes (and hence the features extracted) sometimes helps marginally, but slows down extraction, training & inference significantly.
Refer to `vilio/py-bottom-up-attention/README.md` if you run into any problems.
The repo provides code for dealing with `.tsv` features (which are generated by the extraction above) or `.lmdb` features. Depending on your feature and text format, you probably want to go through the code under either `vilio/fts_lmdb/` or `vilio/fts_tsv/`. I recommend just copying the `hm_data.py` in either of those folders and adjusting the code to your file format & data columns. You can also adjust `hm_pretrain_data.py` if you plan to perform task-specific pretraining (refer to the table at the end of this .md to see which models support task-specific pretraining; note that all models are pretrained models, but performing additional pretraining (masking etc.) on your specific dataset sometimes helps).
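For orientation, the `.tsv` rows produced by the extraction above follow the LXMERT-style layout: one image per row, with boxes and features stored as base64-encoded float32 arrays. A minimal decoding sketch is below; the field names are taken from that layout, so verify them against the loader in `vilio/fts_tsv/` for your version of the repo:

```python
import base64
import csv
import sys

import numpy as np

csv.field_size_limit(sys.maxsize)  # rows contain large base64 blobs

# LXMERT-style field layout - verify against the repo's own loader.
FIELDNAMES = ["img_id", "img_h", "img_w", "objects_id", "objects_conf",
              "attrs_id", "attrs_conf", "num_boxes", "boxes", "features"]

def load_tsv(path):
    """Yield dicts with decoded (num_boxes, 4) boxes and (num_boxes, dim) features."""
    with open(path) as f:
        for row in csv.DictReader(f, FIELDNAMES, delimiter="\t"):
            n = int(row["num_boxes"])
            boxes = np.frombuffer(base64.b64decode(row["boxes"]),
                                  dtype=np.float32).reshape(n, 4)
            feats = np.frombuffer(base64.b64decode(row["features"]),
                                  dtype=np.float32).reshape(n, -1)
            yield {"img_id": row["img_id"], "boxes": boxes, "features": feats}
```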
Once your data is ready, I'd recommend making a copy of `vilio/hm.py` and, depending on your project, considering the following adjustments:
- The score metric (currently roc-auc & accuracy)
- Remove/adjust the `clean_data` call, which is specific to the hm dataset
- Adjust the result dumping (currently `dump_csv` for a csv file output with id, predicted label, predicted probability)
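For orientation, here is a minimal sketch of what those two adjustment points boil down to, using sklearn & pandas; the signatures and variable names (`ids`, `probs`, `threshold`) are illustrative rather than the exact ones in `hm.py`:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, roc_auc_score

def score(labels, probs, threshold=0.5):
    """The two metrics currently reported: roc-auc on probabilities, accuracy on hard labels."""
    preds = (np.asarray(probs) > threshold).astype(int)
    return roc_auc_score(labels, probs), accuracy_score(labels, preds)

def dump_csv(ids, probs, path, threshold=0.5):
    """Mirrors the described output: one row per id with predicted label & probability."""
    preds = (np.asarray(probs) > threshold).astype(int)
    pd.DataFrame({"id": ids, "label": preds, "proba": probs}).to_csv(path, index=False)
```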
If you choose to run one of the ERNIE models implemented in PaddlePaddle, I'd recommend making a copy of `vilio/ernie-vil/reader/hm_finetuning.py` and making the necessary adjustments as you go through the file, such as:
- Add a function in `vilio/ernie-vil/batching/finetune_batching.py`
- Data handling in `vilio/ernie-vil/reader/_tsv_reader.py`
- Copy the hm conf folder & adjust under `vilio/ernie-vil/conf/`
- Add a data folder for your project at `vilio/ernie-vil/data`
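To get started on the last item, a toy sketch of writing your annotations as a tab-separated file is below. The column layout (`id`, `label`, `text`) and the `myproject` folder name are purely illustrative assumptions; align them with whatever your adjusted `_tsv_reader.py` actually parses:

```python
import csv
import os

# Illustrative only: the column layout and folder name are assumptions -
# match them to whatever your adjusted _tsv_reader.py expects.
rows = [
    {"id": "10398", "label": 0, "text": "a sample caption"},
    {"id": "10472", "label": 1, "text": "another sample caption"},
]

out_dir = "vilio/ernie-vil/data/myproject"
os.makedirs(out_dir, exist_ok=True)

with open(os.path.join(out_dir, "train.tsv"), "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "label", "text"], delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)
```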
Finally, it is time to choose the model you want to run. Refer to the table below for a rough performance & implementation guide. When pre-trained models are available, you can download them by clicking on the respective language transformer.
Note that the performance ranking may differ considerably on datasets other than Hateful Memes.
| Model | Language Transformers (`--tr` in params.py) | Performance Rank for HM | Pre-trained model available | Task-specific pre-training enabled |
|---|---|---|---|---|
| E - ERNIE-VIL LARGE/SMALL | ERNIE | 1, 2 | LARGE / BASE | No (TODO) |
| D - DeVLBERT | bert-base-uncased | 8 | BASE | No |
| O - OSCAR LARGE/SMALL | bert-large-uncased / bert-base-uncased | 5, 6 | LARGE / BASE | Yes |
| U - UNITER LARGE/SMALL | bert-large-cased / bert-base-cased | 3, 4 | LARGE / BASE | Yes |
| U - UNITER LARGE/SMALL | roberta-large / roberta-small | 14 | No | No |
| V - VisualBERT | bert-large-uncased | 7 | LARGE | Yes |
| V - VisualBERT | roberta-large / roberta-small | 11 | No | Yes |
| V - VisualBERT | albert-base-v2 - albert-xxlarge-v2 | 10 (XXL V2) | No | Yes |
| X - LXMERT | bert-large-uncased / bert-base-uncased | 9 | LARGE | Yes |
| X - LXMERT | roberta-large / roberta-small | 13 | No | Yes |
| X - LXMERT | albert-base-v2 - albert-xxlarge-v2 | 12 (XXL V2) | No | Yes |
For most models, other language transformers might work as well, but they haven't been tested yet. Note that for VL tasks, having a pre-trained model makes a major difference. If you choose to use a pretrained model, make sure to place the weights file in `vilio/data`, or for E-models, the params folder in `vilio/ernie-vil/data/ernielarge/` / `vilio/ernie-vil/data/erniesmall/`.
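If you want to sanity-check a downloaded checkpoint for the PyTorch-based models before training, a quick inspection looks like this (the file name below is a placeholder; use the checkpoint you actually downloaded):

```python
import torch

# Placeholder file name - substitute the weights file you placed in vilio/data.
ckpt = torch.load("vilio/data/model.pth", map_location="cpu")

# Checkpoints are sometimes wrapped (e.g. under a "state_dict" or "model" key),
# so inspect the top-level keys before assuming a flat state dict.
print(type(ckpt))
print(list(ckpt.keys())[:5])
```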
Now just place your features & text data in the respective data folders & run the model.
Depending on which model & features you chose, refer to the bash files under either `vilio/bash/training` or `vilio/ernie-vil/bash/training` and adjust them to your needs.
The parameters are explained in `vilio/params.py`.