This is the repository for the paper: Effect of Visual Extensions on Natural Language Understanding in Vision-and-Language Models (EMNLP 2021). The paper evaluates NLU in several V&L models pre-trained in the VOLTA framework, using the GLUE benchmark.
We publish the source code and some weights for our models and the GLUE evaluation.
In this README, we describe the outline of this repository.
- eval_vl_glue: directory for the eval_vl_glue Python package that contains extractor and transformers_volta.
- vl_models: directory for pre-trained and fine-tuned models.
- demo: Notebooks for demonstration.
- evaluation: directory for our evaluation experiments.
- download: directory for downloaded files.
We assume that:
- Python 3 is available from the python command. We used the venv of Python 3.
- pip is upgraded:
  pip install -U pip
- PyTorch and torchvision appropriate for your environment are installed. We used version 1.9.0 with CUDA 11.1 (a quick check is sketched after this list).
- The notebook packages are installed if you run notebooks in your environment:
  pip install notebook ipywidgets
  jupyter nbextension enable --py widgetsnbextension
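As a quick sanity check of the PyTorch assumption above, the snippet below prints the installed torch and torchvision versions and whether CUDA is visible. It uses only standard PyTorch calls and is not specific to this repository.

```python
# Quick environment check: versions and CUDA visibility.
import torch
import torchvision

print("torch:", torch.__version__)              # we used 1.9.0
print("torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA device:", torch.cuda.get_device_name(0))
```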
The evaluation procedure is as follows:
- Clone this repository.
- Install the eval_vl_glue package to use transformers_volta:
  pip install -e .
  transformers_volta provides an interface similar to Huggingface's Transformers for V&L models in the Volta framework (a usage sketch follows this list). See the transformers_volta section in eval_vl_glue for more detail.
- Prepare pretrained models for transformers_volta.
  We describe how to obtain those models in vl_models.
- Fine-tune the models with evaluation/glue_tasks/run_glue.py .
  run_glue.py fine-tunes a model on a GLUE task. It is a modified version of run_glue.py from Huggingface's transformers repository, and its usage is basically the same as the original. See the glue_tasks section in evaluation.
- Summarize the results.
  We used a notebook (get_glue_score.ipynb) to summarize the results. See the analysis section in evaluation.
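As an illustration of the transformers_volta interface, the sketch below loads a tokenizer and model in the style of Huggingface's Auto* classes. The import path, class names, and checkpoint path are assumptions made for illustration; consult the transformers_volta section in eval_vl_glue for the actual API and available checkpoints.

```python
# Hypothetical sketch of the transformers_volta interface, assuming it mirrors
# the Auto* classes of Huggingface's Transformers. All names and the checkpoint
# path below are illustrative, not the package's confirmed API.
from eval_vl_glue.transformers_volta import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "path/to/volta_pretrained_model"  # a checkpoint prepared as described in vl_models

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

inputs = tokenizer("The movie was surprisingly good.", return_tensors="pt")
outputs = model(**inputs)  # text-only forward pass; visual inputs depend on the model
print(outputs.logits)
```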
We checked our implementation and conversion briefly in the following ways:
- Training pre-trained models on V&L tasks to compare with the original Volta.
  See the vl_tasks section in evaluation for the results.
- Masked token prediction with image context.
  See demo/masked_lm_with_vision_demo.ipynb for the results.
The original Volta framework relies on an image detector pretrained in Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering (Github repository). This detector runs on a specific version of the Caffe framework (typically configured in a Docker environment).
To improve interoperability with PyTorch models, we converted the image feature extraction part of this detector, including the model definition, weights, and detection procedure, into a PyTorch model.
You can access the converted model through the extractor module of the eval_vl_glue package (a usage sketch follows below).
See the extractor section in eval_vl_glue for more detail.
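For illustration, the sketch below shows how the converted detector might be used to obtain region features for an image. The class and method names and the image path are assumptions; see demo/extractor_demo.ipynb and the extractor section in eval_vl_glue for the real interface.

```python
# Hypothetical sketch of extracting region features with the converted detector.
# The class name (BUTDDetector), its methods, and the image path are assumed for
# illustration; see demo/extractor_demo.ipynb for the actual API.
import torch
from PIL import Image
from eval_vl_glue.extractor import BUTDDetector  # assumed name

detector = BUTDDetector()  # assumed to load the converted Bottom-Up/Top-Down weights
image = Image.open("path/to/image.jpg").convert("RGB")

with torch.no_grad():
    regions = detector.detect(image)  # assumed to return region features and boxes

# The extracted region features would then serve as the visual input of a
# transformers_volta model.
```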
Notes:
- This extractor work was completed after our paper, so the results in the paper were based on the original detector.
- The outputs of the converted detector are similar to, but not fully identical to, those of the original model (see demo/extractor_demo.ipynb and demo/original_vs_ours.ipynb for a comparison).
- We have not conducted quantitative benchmarking.
This work is licensed under the Apache License 2.0.
See LICENSE for details.
Third-party software and data sets are subject to their respective licenses.
If you find our work useful in your research, please consider citing the paper:
CITATION
We created our code with reference to the following repositories:
- https://github.com/peteanderson80/bottom-up-attention
- https://github.com/airsplay/lxmert
- https://github.com/huggingface/transformers
- https://github.com/e-bug/volta
We also use the pre-trained weights available in the following repository:
We would like to thank them for making their resources available.