This outline explains how you can use the repo for your own Vision & Language problem, be it Visual Question Answering, Visual Reasoning, Classification, etc.
For applying it to the Hateful Memes dataset, refer to SCORE_REPRO.md.
If anything comes up, feel free to drop an issue, send a PR, or email me at n.muennighoff@gmail.com.
Extracting important features from images before training is the current standard in VL, as it significantly speeds things up. If you don't have extracted features yet, you can use the subrepo `vilio/py-bottom-up-attention`. Place a folder named `img` with all your images into `vilio/py-bottom-up-attention/data`.
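Before running extraction, it can be worth sanity-checking that every file in `img` is a readable image, since a single corrupt file can abort a long extraction run. A minimal sketch (this helper is not part of the repo):

```python
import os
from PIL import Image  # pip install pillow

IMG_DIR = "vilio/py-bottom-up-attention/data/img"

# Try to open every image once; PIL's verify() catches truncated/corrupt files.
for fname in sorted(os.listdir(IMG_DIR)):
    path = os.path.join(IMG_DIR, fname)
    try:
        with Image.open(path) as im:
            im.verify()
    except Exception as e:
        print(f"Corrupt or unreadable image: {path} ({e})")
```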
- Clone the repo:
```bash
git clone https://github.com/Muennighoff/vilio.git
```
- Setup extraction:
```bash
cd vilio/py-bottom-up-attention
pip install -r requirements.txt
pip install 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
python setup.py build develop
```
- Then run feature extraction as follows:
```bash
cd vilio/py-bottom-up-attention
python detectron2_mscoco_proposal_maxnms.py --batchsize 4 --split img --weight vgattr --minboxes 36 --maxboxes 36
```
I recommend leaving the parameters as they are. Increasing the number of boxes (and hence the features extracted) sometimes helps marginally, but slows down extraction, training & inference significantly.
Refer to `vilio/py-bottom-up-attention/README.md` if you run into any problems.
The repo provides code for dealing with `.tsv` features (which are generated by the extraction above) or `.lmdb` features. Depending on your feature and text format, you probably want to go through the code under either `vilio/fts_lmdb/` or `vilio/fts_tsv/`. I recommend just copying the `hm_data.py` in either of those folders and adjusting the code to your file format & data columns. You can also adjust `hm_pretrain_data.py` if you plan to perform task-specific pretraining (refer to the table at the end of this .md to see which models support task-specific pretraining; note that all models are pretrained models, but performing additional pretraining (masking etc.) on your specific dataset sometimes helps).
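For orientation, the `.tsv` rows produced by the extraction above follow the LXMERT-style layout: one image per row, with boxes and features stored as base64-encoded float32 arrays. A minimal decoding sketch is below; the field names are taken from that layout, so verify them against the loader in `vilio/fts_tsv/` for your version of the repo:

```python
import base64
import csv
import sys

import numpy as np

csv.field_size_limit(sys.maxsize)  # rows contain large base64 blobs

# LXMERT-style field layout - verify against the repo's own loader.
FIELDNAMES = ["img_id", "img_h", "img_w", "objects_id", "objects_conf",
              "attrs_id", "attrs_conf", "num_boxes", "boxes", "features"]

def load_tsv(path):
    """Yield dicts with decoded (num_boxes, 4) boxes and (num_boxes, dim) features."""
    with open(path) as f:
        for row in csv.DictReader(f, FIELDNAMES, delimiter="\t"):
            n = int(row["num_boxes"])
            boxes = np.frombuffer(base64.b64decode(row["boxes"]),
                                  dtype=np.float32).reshape(n, 4)
            feats = np.frombuffer(base64.b64decode(row["features"]),
                                  dtype=np.float32).reshape(n, -1)
            yield {"img_id": row["img_id"], "boxes": boxes, "features": feats}
```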
Once your data is ready, I'd recommend making a copy of `vilio/hm.py` and, depending on your project, considering the following adjustments:
- The score metric (currently roc-auc & accuracy)
- Remove/adjust the `clean_data` call, which is specific to the hm dataset
- Adjust the result dumping (currently `dump_csv` for a csv file output with id, predicted label, predicted probability)
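For orientation, here is a minimal sketch of what those two adjustment points boil down to, using sklearn & pandas; the signatures and variable names (`ids`, `probs`, `threshold`) are illustrative rather than the exact ones in `hm.py`:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, roc_auc_score

def score(labels, probs, threshold=0.5):
    """The two metrics currently reported: roc-auc on probabilities, accuracy on hard labels."""
    preds = (np.asarray(probs) > threshold).astype(int)
    return roc_auc_score(labels, probs), accuracy_score(labels, preds)

def dump_csv(ids, probs, path, threshold=0.5):
    """Mirrors the described output: one row per id with predicted label & probability."""
    preds = (np.asarray(probs) > threshold).astype(int)
    pd.DataFrame({"id": ids, "label": preds, "proba": probs}).to_csv(path, index=False)
```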
If you choose to run one of the ERNIE models implemented in PaddlePaddle, I'd recommend making a copy of `vilio/ernie-vil/reader/hm_finetuning.py` and making the necessary adjustments as you go through the file, such as:
- Add a function in `vilio/ernie-vil/batching/finetune_batching.py`
- Data handling in `vilio/ernie-vil/reader/_tsv_reader.py`
- Copy the hm conf folder & adjust under `vilio/ernie-vil/conf/`
- Add a data folder for your project at `vilio/ernie-vil/data`
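To get started on the last item, a toy sketch of writing your annotations as a tab-separated file is below. The column layout (`id`, `label`, `text`) and the `myproject` folder name are purely illustrative assumptions; align them with whatever your adjusted `_tsv_reader.py` actually parses:

```python
import csv
import os

# Illustrative only: the column layout and folder name are assumptions -
# match them to whatever your adjusted _tsv_reader.py expects.
rows = [
    {"id": "10398", "label": 0, "text": "a sample caption"},
    {"id": "10472", "label": 1, "text": "another sample caption"},
]

out_dir = "vilio/ernie-vil/data/myproject"
os.makedirs(out_dir, exist_ok=True)

with open(os.path.join(out_dir, "train.tsv"), "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "label", "text"], delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)
```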
Finally, it is time to choose the model you want to run. Refer to the table below for a rough performance & implementation guide. When pre-trained models are available, you can download them by clicking on the respective language transformer.
Note that the performance ranking may differ considerably on datasets other than Hateful Memes.
| Model | Language Transformers (`--tr` in params.py) | Performance Rank for HM | Pre-trained model available | Task-specific pre-training enabled |
|---|---|---|---|---|
| E - ERNIE-VIL LARGE/SMALL | ERNIE | 1, 2 | LARGE / BASE | No (TODO) |
| D - DeVLBERT | bert-base-uncased | 8 | BASE | No |
| O - OSCAR LARGE/SMALL | bert-large-uncased / bert-base-uncased | 5, 6 | LARGE / BASE | Yes |
| U - UNITER LARGE/SMALL | bert-large-cased / bert-base-cased | 3, 4 | LARGE / BASE | Yes |
| U - UNITER LARGE/SMALL | roberta-large / roberta-small | 14 | No | No |
| V - VisualBERT | bert-large-uncased | 7 | LARGE | Yes |
| V - VisualBERT | roberta-large / roberta-small | 11 | No | Yes |
| V - VisualBERT | albert-base-v2 - albert-xxlarge-v2 | 10 (XXL V2) | No | Yes |
| X - LXMERT | bert-large-uncased / bert-base-uncased | 9 | LARGE | Yes |
| X - LXMERT | roberta-large / roberta-small | 13 | No | Yes |
| X - LXMERT | albert-base-v2 - albert-xxlarge-v2 | 12 (XXL V2) | No | Yes |
For most models, other language transformers might work as well, but they haven't been tested yet. Note that for VL tasks, having a pre-trained model makes a major difference. If you choose to use a pretrained model, make sure to place the weights file in `vilio/data`, or for E-models, the params folder in `vilio/ernie-vil/data/ernielarge/` / `vilio/ernie-vil/data/erniesmall/`.
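If you want to sanity-check a downloaded checkpoint for the PyTorch-based models before training, a quick inspection looks like this (the file name below is a placeholder; use the checkpoint you actually downloaded):

```python
import torch

# Placeholder file name - substitute the weights file you placed in vilio/data.
ckpt = torch.load("vilio/data/model.pth", map_location="cpu")

# Checkpoints are sometimes wrapped (e.g. under a "state_dict" or "model" key),
# so inspect the top-level keys before assuming a flat state dict.
print(type(ckpt))
print(list(ckpt.keys())[:5])
```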
Now just place your features & text data in the respective data folders & run the model.
Depending on which model & features you chose, refer to the bash files under either `vilio/bash/training` or `vilio/ernie-vil/bash/training` and adjust them to your needs.
The parameters are explained in `vilio/params.py`.