Introduction

This is Stacked Cross Attention Network, source code of Stacked Cross Attention for Image-Text Matching (project page) from Microsoft AI and Research. The paper will appear in ECCV 2018. It is built on top of the VSE++ in PyTorch.

Requirements and Installation

We recommended the following dependencies.

Python 2.7
PyTorch 0.3
NumPy (>1.12.1)
TensorBoard
Punkt Sentence Tokenizer:

import nltk
nltk.download()
> d punkt

Download data

Download the dataset files and pre-trained models. We use splits produced by Andrej Karpathy. The raw images can be downloaded from from their original sources here, here and here.

The precomputed image features of MS-COCO are from here. The precomputed image features of Flickr30K are extracted from the raw Flickr30K images using the bottom-up attention model from here. All the data needed for reproducing the experiments in the paper, including image features and vocabularies, can be downloaded from:

https://www.kaggle.com/datasets/kuanghueilee/scan-features

We refer to the path of extracted files for data.zip as $DATA_PATH and files for vocab.zip to ./vocab directory. Alternatively, you can also run vocab.py to produce vocabulary files. For example,

python vocab.py --data_path data --data_name f30k_precomp
python vocab.py --data_path data --data_name coco_precomp

Data pre-processing (Optional)

The image features of Flickr30K and MS-COCO are available in numpy array format, which can be used for training directly. However, if you wish to test on another dataset, you will need to start from scratch:

Use the bottom-up-attention/tools/generate_tsv.py and the bottom-up attention model to extract features of image regions. The output file format will be a tsv, where the columns are ['image_id', 'image_w', 'image_h', 'num_boxes', 'boxes', 'features'].
Use util/convert_data.py to convert the above output to a numpy array.

If downloading the whole data package containing bottom-up image features for Flickr30K and MS-COCO is too slow for you, you can download everything but image features from https://www.kaggle.com/datasets/kuanghueilee/scan-features and compute image features locally from raw images.

Training new models

Run train.py:

python train.py --data_path "$DATA_PATH" --data_name coco_precomp --vocab_path "$VOCAB_PATH" --logger_name runs/coco_scan/log --model_name runs/coco_scan/log --max_violation --bi_gru

Arguments used to train Flickr30K models:

Method	Arguments
SCAN t-i LSE	`--max_violation --bi_gru --agg_func=LogSumExp --cross_attn=t2i --lambda_lse=6 --lambda_softmax=9`
SCAN t-i AVG	`--max_violation --bi_gru --agg_func=Mean --cross_attn=t2i --lambda_softmax=9`
SCAN i-t LSE	`--max_violation --bi_gru --agg_func=LogSumExp --cross_attn=i2t --lambda_lse=5 --lambda_softmax=4`
SCAN i-t AVG	`--max_violation --bi_gru --agg_func=Mean --cross_attn=i2t --lambda_softmax=4`

Arguments used to train MS-COCO models:

Method	Arguments
SCAN t-i LSE	`--max_violation --bi_gru --agg_func=LogSumExp --cross_attn=t2i --lambda_lse=6 --lambda_softmax=9 --num_epochs=20 --lr_update=10 --learning_rate=.0005`
SCAN t-i AVG	`--max_violation --bi_gru --agg_func=Mean --cross_attn=t2i --lambda_softmax=9 --num_epochs=20 --lr_update=10 --learning_rate=.0005`
SCAN i-t LSE	`--max_violation --bi_gru --agg_func=LogSumExp --cross_attn=i2t --lambda_lse=20 --lambda_softmax=4 --num_epochs=20 --lr_update=10 --learning_rate=.0005`
SCAN i-t AVG	`--max_violation --bi_gru --agg_func=Mean --cross_attn=i2t --lambda_softmax=4 --num_epochs=20 --lr_update=10 --learning_rate=.0005`

Evaluate trained models

from vocab import Vocabulary
import evaluation
evaluation.evalrank("$RUN_PATH/coco_scan/model_best.pth.tar", data_path="$DATA_PATH", split="test")

To do cross-validation on MSCOCO, pass fold5=True with a model trained using --data_name coco_precomp.

Reference

If you found this code useful, please cite the following paper:

@inproceedings{lee2018stacked,
  title={Stacked cross attention for image-text matching},
  author={Lee, Kuang-Huei and Chen, Xi and Hua, Gang and Hu, Houdong and He, Xiaodong},
  booktitle={Proceedings of the European conference on computer vision (ECCV)},
  pages={201--216},
  year={2018}
}

License

Apache License 2.0

Acknowledgments

The authors would like to thank Po-Sen Huang and Yokesh Kumar for helping the manuscript. We also thank Li Huang, Arun Sacheti, and Bing Multimedia team for supporting this work.

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
bottom-up-attention @ 29167f3		bottom-up-attention @ 29167f3
util		util
.gitignore		.gitignore
.gitmodules		.gitmodules
LICENSE		LICENSE
README.md		README.md
data.py		data.py
evaluation.py		evaluation.py
model.py		model.py
train.py		train.py
vocab.py		vocab.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Introduction

Requirements and Installation

Download data

Data pre-processing (Optional)

Training new models

Evaluate trained models

Reference

License

Acknowledgments

About

Releases

Packages

Contributors 2

Languages

License

kuanghuei/SCAN

Folders and files

Latest commit

History

Repository files navigation

Introduction

Requirements and Installation

Download data

Data pre-processing (Optional)

Training new models

Evaluate trained models

Reference

License

Acknowledgments

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages