This repo contains the dataset & code for exploring multimodal fusion-type transformer models (from Huggingface 🤗) for the task of visual question answering.
🗂️ Dataset Used: DAQUAR Dataset
Create a virtual environment & install the required packages using `pip install -r requirements.txt`.
In particular, we will require the following packages:
- `datasets==1.17.0`
- `nltk==3.5`
- `pandas==1.3.5`
- `Pillow==9.0.0`
- `scikit-learn==0.23.2`
- `torch==1.8.2+cu111`
- `transformers==4.15.0`
- `dvc==2.9.3` (for automating the training pipeline)
Note: It is best to have a GPU available for training the multimodal models (Google Colab can be used).
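Before launching training, a quick way to confirm that PyTorch can actually see a GPU is the following check (a minimal snippet, not part of the repo's scripts):

```python
import torch

# Verify that a CUDA-capable GPU is visible to PyTorch before training.
if torch.cuda.is_available():
    print(f"Training on GPU: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU found; training will fall back to CPU and be much slower.")
```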
📝 Notebook: `VisualQuestionAnsweringWithTransformers.ipynb`
The `src/` folder contains all the scripts necessary for data processing & model training. All the configs & hyperparameters are specified in the `params.yaml` file.
Following are the important scripts to experiment with the VQA models:
- `src/process_data.py`: Process the raw DAQUAR dataset available in the `dataset/` folder & split it into training & evaluation sets, along with the space of all possible answers.
- `src/main.py`: Train & evaluate the multimodal VQA model after loading the processed dataset (a minimal architecture sketch is shown after this list).
- `src/inference.py`: Use a trained multimodal VQA model from a checkpoint to answer a question, given a reference image.
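The actual model definition lives in the repo's scripts, but as an illustration of the fusion approach (treating VQA as classification over the answer space), a model along these lines could be sketched as follows. The class name, layer sizes, and constructor arguments here are illustrative, not the repo's actual API:

```python
import torch
import torch.nn as nn
from transformers import AutoModel


class MultimodalVQAModel(nn.Module):
    """Illustrative fusion model: text encoder + image encoder + answer classifier."""

    def __init__(self, num_answers, text_model="bert-base-uncased",
                 image_model="google/vit-base-patch16-224-in21k"):
        super().__init__()
        self.text_encoder = AutoModel.from_pretrained(text_model)
        self.image_encoder = AutoModel.from_pretrained(image_model)
        fused_dim = (self.text_encoder.config.hidden_size
                     + self.image_encoder.config.hidden_size)
        # Fuse the two pooled representations and classify over the answer space.
        self.classifier = nn.Sequential(
            nn.Linear(fused_dim, 512),
            nn.ReLU(),
            nn.Linear(512, num_answers),
        )

    def forward(self, input_ids, attention_mask, pixel_values):
        text_out = self.text_encoder(input_ids=input_ids,
                                     attention_mask=attention_mask)
        image_out = self.image_encoder(pixel_values=pixel_values)
        # Use the [CLS] token embedding of each encoder as its pooled representation.
        fused = torch.cat([text_out.last_hidden_state[:, 0, :],
                           image_out.last_hidden_state[:, 0, :]], dim=-1)
        return self.classifier(fused)
```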
After making the necessary changes to the `params.yaml` file, the pipeline can be automated by running `dvc repro`. This will run the data processing & model training (& evaluation) stages.
For inference, run `python src/inference.py --config=params.yaml --img_path=<path-to-image> --question=<question>`
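Under the hood, inference with a model like the sketch above roughly amounts to tokenizing the question, extracting image features, and taking the arg-max over the answer space. The snippet below is a hedged illustration rather than the repo's exact code; the checkpoint path and the `answer_space` list are placeholders (the real answer space is produced by `src/process_data.py`):

```python
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoFeatureExtractor

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")

# Placeholder answer space; in the repo this comes from src/process_data.py.
answer_space = ["table", "chair", "window"]

model = MultimodalVQAModel(num_answers=len(answer_space))          # class from the sketch above
model.load_state_dict(torch.load("checkpoint/model.pt",            # hypothetical checkpoint path
                                 map_location="cpu"))
model.eval()

question = "what is on the table?"
image = Image.open("path/to/image.png").convert("RGB")             # placeholder image path

text_inputs = tokenizer(question, return_tensors="pt")
image_inputs = feature_extractor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(input_ids=text_inputs["input_ids"],
                   attention_mask=text_inputs["attention_mask"],
                   pixel_values=image_inputs["pixel_values"])

print("Predicted answer:", answer_space[logits.argmax(dim=-1).item()])
```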
The following pretrained transformers (from the 🤗 Model Hub) were used to build the multimodal VQA models:

- Text Transformers (for encoding questions):
  - BERT (Bidirectional Encoder Representations from Transformers): `bert-base-uncased`
  - RoBERTa (Robustly Optimized BERT Pretraining Approach): `roberta-base`
  - ALBERT (A Lite BERT): `albert-base-v2`
- Image Transformers (for encoding images):
  - ViT (Vision Transformer): `google/vit-base-patch16-224-in21k`
  - DeiT (Data-Efficient Image Transformer): `facebook/deit-base-distilled-patch16-224`
  - BEiT (Bidirectional Encoder representation from Image Transformers): `microsoft/beit-base-patch16-224-pt22k-ft22k`
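Any of the checkpoints above can, in principle, be swapped in via the 🤗 Auto classes. For example (illustrative, reusing the `MultimodalVQAModel` sketch from earlier; the answer-space size is a placeholder):

```python
from transformers import AutoTokenizer, AutoFeatureExtractor

# Swap in any text/image checkpoint pair from the lists above.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
feature_extractor = AutoFeatureExtractor.from_pretrained(
    "microsoft/beit-base-patch16-224-pt22k-ft22k"
)

model = MultimodalVQAModel(
    num_answers=1000,  # placeholder; use the answer space built by src/process_data.py
    text_model="roberta-base",
    image_model="microsoft/beit-base-patch16-224-pt22k-ft22k",
)
```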
To learn about the Wu-Palmer Similarity Score (one of the evaluation metrics reported below), check out this video!
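As a rough illustration, the Wu-Palmer (WUP) similarity between a predicted answer and a ground-truth answer can be computed with NLTK's WordNet interface (which is why `nltk` is in the requirements). This is a minimal sketch; the repo's actual metric may differ, e.g. in how multi-word answers or a similarity threshold are handled:

```python
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)  # one-time download of the WordNet corpus


def wup_score(predicted: str, ground_truth: str) -> float:
    """Max Wu-Palmer similarity over all synset pairs of the two words."""
    pred_synsets = wordnet.synsets(predicted)
    true_synsets = wordnet.synsets(ground_truth)
    if not pred_synsets or not true_synsets:
        # Fall back to exact match when a word is not in WordNet.
        return float(predicted == ground_truth)
    return max(
        (p.wup_similarity(t) or 0.0)
        for p in pred_synsets
        for t in true_synsets
    )


print(wup_score("chair", "stool"))   # semantically close -> high score
print(wup_score("chair", "window"))  # unrelated -> low score
```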
| Text Transformer | Image Transformer | Wu & Palmer Score | Accuracy | F1 | No. of Trainable Parameters |
|---|---|---|---|---|---|
| BERT | ViT | 0.286 | 0.235 | 0.020 | 197M |
| BERT | DeiT | 0.297 | 0.246 | 0.027 | 197M |
| BERT | BEiT | 0.303 | 0.254 | 0.034 | 196M |
| RoBERTa | ViT | 0.294 | 0.246 | 0.025 | 212M |
| RoBERTa | DeiT | 0.291 | 0.242 | 0.028 | 212M |
| RoBERTa | BEiT | 0.308 | 0.261 | 0.033 | 211M |
| ALBERT | ViT | 0.265 | 0.215 | 0.018 | 99M |
| ALBERT | DeiT | 0.140 | 0.085 | 0.002 | 99M |
| ALBERT | BEiT | 0.220 | 0.162 | 0.017 | 98M |
Created with ❤️ by Tezan Sahu