This is the official repository for our GCPR 2023 Oral paper on Decoding Vision features into Language (DeViL).
To set up the required environment, create it from `environment.yml` with this command:

```bash
conda env create -n <ENVNAME> --file environment.yml
```
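Then activate the environment before running any of the commands below:

```bash
conda activate <ENVNAME>
```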
Datasets are implemented in `src/datasets.py`, which contains the code for data loading and data collation. A paired image-text dataset for training returns a dictionary with the items `{'id': image id, 'image': image, 'text': captions}`. Implementations for the CC3M and MILANNOTATIONS datasets are provided in the same file.
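For reference, a minimal sketch of a dataset returning items in this format is shown below. It is illustrative only and not taken from `src/datasets.py`; the class name and the `samples` structure are hypothetical.

```python
from torch.utils.data import Dataset
from PIL import Image


class MyPairedDataset(Dataset):
    """Illustrative paired image-text dataset; not part of the DeViL codebase."""

    def __init__(self, samples, transform=None):
        # `samples` is assumed to be a list of (image_path, [caption, ...]) tuples.
        self.samples = samples
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        image_path, captions = self.samples[idx]
        image = Image.open(image_path).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        # Matches the dictionary format expected for training.
        return {"id": idx, "image": image, "text": captions}
```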
Once the dataset is prepared, you can run training with `src/main.py`. Don't forget to set the `data_root` and `logdir` arguments. For a detailed explanation of all arguments, see `src/main.py`.
Example command:

```bash
python src/main.py --data_root ./data --dataset cc3m --logdir ./results --language_model facebook/opt-125m \
    --vision_backbone timm_resnet50 --token_dropout 0.5 --feature_dropout 0.5 --vision_feat_layer -1 -2 -3 -4
```
To evaluate a trained model, run:

```bash
python src/main.py \
    --do_eval_nlp \
    --model_ckpt <path of the saved checkpoint parent folder> \
    --data_root <path to dataset>
```

If you wish to evaluate descriptions of a specific layer, set the `layer` argument.
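For instance, to evaluate descriptions of a single vision feature layer (a sketch; `-1` is only an illustrative value, see `src/main.py` for the values accepted by `layer`):

```bash
python src/main.py \
    --do_eval_nlp \
    --model_ckpt <path of the saved checkpoint parent folder> \
    --data_root <path to dataset> \
    --layer -1
```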
To generate textual descriptions for different layers and feature locations, run:

```bash
python src/main.py \
    --do_eval_qualitative \
    --model_ckpt <path of the saved checkpoint parent folder> \
    --data_root <path to dataset> \
    --loc_ids "-1: [[-1, -1]]"
```

For more details, see the `loc_ids`, `pool_locs`, and `kernel_size` arguments in `src/main.py`.
To generate open-vocabulary saliency maps, see `viz_notebook.ipynb`.
To obtain CC3M in webdataset format, you can use img2dataset.
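A command along these lines downloads CC3M into webdataset shards (a sketch: the TSV file name, column names, output folder, and image size are assumptions you should adapt to your copy of the CC3M metadata):

```bash
img2dataset --url_list cc3m_train.tsv --input_format "tsv" \
    --url_col "url" --caption_col "caption" \
    --output_format webdataset --output_folder ./data/cc3m \
    --processes_count 16 --thread_count 64 --image_size 256
```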
To train on the MILANNOTATIONS dataset, follow the instructions to download the dataset and change the `dataset` argument to one of the MILANNOTATIONS keys (e.g. "imagenet"). You can also activate the `by_unit` argument so that this dataset is processed by neuron (aka unit) instead of by image.
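For example, switching the earlier training command to a MILANNOTATIONS key might look like the following (a sketch: it assumes `by_unit` is a boolean flag and simply reuses the arguments from the CC3M example; check `src/main.py` for what MILANNOTATIONS actually requires):

```bash
python src/main.py --data_root ./data --dataset imagenet --logdir ./results \
    --language_model facebook/opt-125m --vision_backbone timm_resnet50 \
    --token_dropout 0.5 --feature_dropout 0.5 --vision_feat_layer -1 -2 -3 -4 --by_unit
```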