relative_captions_shoes.json
contains relative expressions describing fine-grained visual differences between 10,751 pairs of shoe images. The data is in the following format:
{
"ImageName": "img_womens_clogs_851.jpg",
"ReferenceImageName": "img_womens_clogs_512.jpg",
"RelativeCaption": "is more of a textured material"
},
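The annotation file is a standard JSON list of such entries, so it can be read with ordinary JSON tooling. A minimal sketch in Python (the file is assumed to sit in the working directory):

```python
import json

# Load the list of annotation entries (one dict per annotated image pair).
with open("relative_captions_shoes.json", "r") as f:
    annotations = json.load(f)

print(len(annotations))                 # number of annotated pairs
entry = annotations[0]
print(entry["ImageName"])               # target image
print(entry["ReferenceImageName"])      # reference image
print(entry["RelativeCaption"])         # relative expression describing the difference
```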
To obtain the image files, please download the zip file from the Attribute Discovery Dataset. After unzipping it, you can find the images by their names inside the womens_*
folders.
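As a rough sketch of how image names can be resolved to file paths after unzipping (the root folder name below is an assumption; adjust it to your local layout):

```python
import glob
import os

# Root of the unzipped Attribute Discovery Dataset (assumed name; change as needed).
DATASET_ROOT = "attribute_discovery_dataset"

# Build a lookup from image file name to its full path inside the womens_* folders.
image_paths = {}
for path in glob.glob(os.path.join(DATASET_ROOT, "womens_*", "**", "*.jpg"), recursive=True):
    image_paths[os.path.basename(path)] = path

# Example: locate the two images referenced by one annotation entry.
print(image_paths.get("img_womens_clogs_851.jpg"))
print(image_paths.get("img_womens_clogs_512.jpg"))
```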
The following figure shows the length distribution of the relative captions and a few examples from the dataset.
Most relative expressions contain composite phrases referring to more than one type of visual feature, and a significant portion of the data contains prepositional phrases that provide information about spatial or structural details.
We tested a few simple baseline methods for the task of relative image captioning. Using the concatenated features of the two images as input (ResNet101 pre-trained on ImageNet), the Show and Tell based model achieved a BLEU-1 score of 26.3, and the Show, Attend and Tell based model achieved a BLEU-1 score of 29.6.
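As a minimal sketch of the input representation used by these baselines (assuming a recent torchvision; the caption decoder itself follows the Show and Tell / Show, Attend and Tell papers and is not reproduced here), the pair feature can be formed by concatenating pooled ResNet101 features of the target and reference images:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# ImageNet-pretrained ResNet101 with the classification head removed,
# so a forward pass yields a 2048-d pooled feature per image.
resnet = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
encoder = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def encode(path):
    """Return a 2048-d ResNet101 feature for one image file."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return encoder(img).flatten(1)           # shape: (1, 2048)

# Concatenate target and reference features to form the captioner input
# (replace the file names with actual paths, e.g. from the lookup above).
target_feat = encode("img_womens_clogs_851.jpg")
reference_feat = encode("img_womens_clogs_512.jpg")
pair_feat = torch.cat([target_feat, reference_feat], dim=1)   # shape: (1, 4096)
```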
Please refer to the supplemental material of the paper for more details on the annotation interface (Supp. A), dataset visualization (Supp. B), and baseline performance of the relative image captioner (Supp. C).
In our experiments, we found that when the target image and the reference image are visually distinct, users often rely directly on the visual appearance of the target image, and the provided relative descriptions are similar to single-image captions. We therefore augmented the data used to train the user simulator with additional single-image captioning annotations. captions_shoes.json
contains captions on 3,600
shoe images. For the experiments in the paper, we paired each image in this set with five visually distinct reference images, as sketched below. The user simulator was trained using both the relative captions from relative_captions_shoes.json and this augmented set.
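The exact pairing procedure is not spelled out here; as a purely hypothetical illustration, visually distinct reference images could be chosen by sampling images that are far from the target in feature space (the distance criterion, pool size, and the `features` mapping below are assumptions):

```python
import random
import torch

def pick_distinct_references(features, num_refs=5, candidate_pool=100):
    """For each image, pick `num_refs` reference images that are far away in feature space.

    `features` is assumed to map image name -> 1-D feature tensor
    (e.g., a squeezed 2048-d ResNet101 feature from the encoder sketched above).
    """
    names = list(features.keys())
    feats = torch.stack([features[n] for n in names])             # (N, 2048)
    pairs = {}
    for i, name in enumerate(names):
        # Distances from this target image to every image in the set.
        dists = torch.cdist(feats[i:i + 1], feats).squeeze(0)     # (N,)
        # Keep the most distant candidates, then randomly sample a few of them.
        farthest = dists.topk(candidate_pool).indices.tolist()
        pairs[name] = [names[j] for j in random.sample(farthest, num_refs)]
    return pairs
```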
If you find this dataset useful, please cite the following paper:
@incollection{NIPS2018_7348,
title = {Dialog-based Interactive Image Retrieval},
author = {Guo, Xiaoxiao and Wu, Hui and Cheng, Yu and Rennie, Steven and Tesauro, Gerald and Feris, Rogerio},
booktitle = {Advances in Neural Information Processing Systems 31},
editor = {S. Bengio and H. Wallach and H. Larochelle and K. Grauman and N. Cesa-Bianchi and R. Garnett},
pages = {676--686},
year = {2018},
publisher = {Curran Associates, Inc.},
url = {http://papers.nips.cc/paper/7348-dialog-based-interactive-image-retrieval.pdf}
}
- Please cite the Attribute Discovery paper if you use the original image files.
- Please cite the WhittleSearch paper if you use relative attribute labels in your experiment.