Build a Cross-Modal Search System to Look for Images from Captions and Vice Versa

In this example, Jina is used to implement a cross-modal search system. It allows the user to search for images given a caption description, and to look for a caption given an image. We encode images and their captions (any descriptive text of the image) in separate indexes, which are later queried in a cross-modal fashion: the text index is queried using image embeddings, and the image index is queried using text embeddings.

Motivation behind Cross-Modal Retrieval

Cross-modal retrieval aims to effectively search a set of documents of one modality by querying with documents from a different modality.

Modality is an attribute assigned to a document in Jina's protobuf Document structure. Documents of the same MIME type may still come from different distributions and therefore have different modalities. For example, in an article or web page, the body text and the title share the same MIME type (text) but can be considered different modalities (distributions).

Different encoders map different modalities into a common embedding space. They need to extract semantic information from the documents.

In this embedding space, documents from different modalities that are semantically relevant to each other are expected to lie close to one another (metric learning).

In this example, we expect image embeddings to be close to the embeddings of their captions.
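
To make the idea concrete, the sketch below (plain NumPy with toy vectors, not code from this repository) shows what retrieval in a common embedding space boils down to: score the query embedding from one modality against the indexed embeddings of the other modality and return the closest match.

# A minimal, illustrative sketch: an image embedding queries a toy "index" of caption
# embeddings that live in the same 4-dimensional space.
import numpy as np

def cosine_similarity(query: np.ndarray, index: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and every row of an index matrix."""
    query = query / np.linalg.norm(query)
    index = index / np.linalg.norm(index, axis=1, keepdims=True)
    return index @ query

caption_embeddings = np.array([[0.9, 0.1, 0.0, 0.0],
                               [0.0, 1.0, 0.1, 0.0],
                               [0.1, 0.0, 0.9, 0.2]])
image_embedding = np.array([0.8, 0.2, 0.0, 0.1])

scores = cosine_similarity(image_embedding, caption_embeddings)
print('best matching caption index:', int(scores.argmax()))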

Research behind Cross-Modal Retrieval

The models used in this example come from the paper Improving Visual-Semantic Embeddings with Hard Negatives (https://arxiv.org/pdf/1707.05612.pdf).

To make this search system retrieve good results, we use the models trained in https://github.com/fartashf/vsepp. The model encodes text and images into a common embedding space, trying to bring the embeddings of images close to the embeddings of their corresponding captions.

We use one network per modality:

  • VGG19 for images, pretrained on ImageNet.
  • A Gated Recurrent Unit (GRU) for captions.

The last layers of these networks are removed so that they can be used as feature extractors. A fully connected layer is added on top of each one to map the extracted features into the common embedding space.

They are trained on the Flickr30k dataset with a contrastive loss, which pulls positive matches close together in the embedding space and pushes negative samples apart.
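
For reference, the sketch below is a hedged PyTorch rendition of that hinge-based contrastive loss as described in the VSE++ paper, not the training code of this repository. Positive image-caption pairs sit on the diagonal of the similarity matrix, every off-diagonal entry is a negative, and the max_violation flag switches to the hardest-negative variant the paper introduces.

# Illustrative only: a hinge-based contrastive loss over an (N, N) image-caption
# similarity matrix with the true pairs on the diagonal.
import torch

def contrastive_loss(sim: torch.Tensor, margin: float = 0.2, max_violation: bool = True) -> torch.Tensor:
    pos = sim.diag().view(-1, 1)                         # similarity of each positive pair
    cost_caption = (margin + sim - pos).clamp(min=0)     # hinge over negative captions per image
    cost_image = (margin + sim - pos.t()).clamp(min=0)   # hinge over negative images per caption

    mask = torch.eye(sim.size(0), dtype=torch.bool)      # zero out the positive pairs
    cost_caption = cost_caption.masked_fill(mask, 0)
    cost_image = cost_image.masked_fill(mask, 0)

    if max_violation:                                    # VSE++: only the hardest negative counts
        cost_caption = cost_caption.max(dim=1)[0]
        cost_image = cost_image.max(dim=0)[0]
    return cost_caption.sum() + cost_image.sum()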

VSE Encoders in Jina for Cross Modal Search

Two encoders have been created for this example, namely VSEImageEncoder and VSETextEncoder.

The process followed is:

  • Load the weights published as a result of the research paper.
  • Instantiate their VSE encoder and extract the branch relevant to the modality.

The two models are available on Jina Hub as VSEImageEncoder and VSETextEncoder.
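
Conceptually, each encoder then reduces to a thin wrapper around one branch of the pretrained model. The sketch below is only an illustration of that idea, assuming PyTorch and an already-loaded branch; the class and method names are ours, not the actual jinahub executors.

# Illustrative wrapper: expose one branch of the trained VSE++ model as an embedding function.
import numpy as np
import torch

class BranchEncoder:
    def __init__(self, branch: torch.nn.Module):
        self.branch = branch.eval()                   # image branch or text branch, weights loaded

    @torch.no_grad()
    def encode(self, batch: torch.Tensor) -> np.ndarray:
        emb = self.branch(batch)                      # map the inputs into the shared space
        emb = emb / emb.norm(dim=-1, keepdim=True)    # L2-normalize, as VSE++ does
        return emb.cpu().numpy()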

Prerequisites

This demo requires Python 3.7 and a Jina installation.

Prepare the data

Use Flickr8k

Although the model is trained on Flickr30k, you can test it on the Flickr8k dataset, a much smaller version of Flickr30k. This is the default dataset for this example.

To make this work, we need to get the image files from the Kaggle dataset. Once you have your Kaggle token on your system as described here, run:

kaggle datasets download adityajn105/flickr8k
unzip flickr8k.zip 
rm flickr8k.zip
mv Images data/f8k/images
mv captions.txt data/f8k/captions.txt

Make sure that your data folder contains:

data/f8k/images/*jpg
data/f8k/captions.txt

Use Flickr30k

The model has been trained on Flickr30k, so we recommend using this dataset to try the system. Still, it is a good exercise to see how well it works on other datasets or your own custom ones.

To do so, instead of downloading Flickr8k from Kaggle, take its 30k counterpart:

pip install kaggle
kaggle datasets download hsankesara/flickr-image-dataset
unzip flickr-image-dataset.zip
rm flickr-image-dataset.zip

Then we also need the captions data. To get it, run:

wget http://www.cs.toronto.edu/~faghri/vsepp/data.tar
tar -xvf data.tar
rm -rf data.tar
rm -rf data/coco*
rm -rf data/f8k*
rm -rf data/*precomp*
rm -rf data/f30k/images
mv flickr-image-dataset data/f30k/images

Once all the steps are completed, make sure that the cross-modal-search/data/f30k folder contains an images folder and a dataset_flickr30k.json file. The images folder should hold all the Flickr30k images, and dataset_flickr30k.json contains the captions and their links to the images.

Run the Flows

Index

Indexing is run with the following command, where request_size can be chosen by the user. Indexing processes both images and captions:

python app.py -t index -n $num_docs -s $request_size -d 'f8k'

Note that num_docs should be 8k or 30k depending on the Flickr dataset you use. If you decide to index the complete datasets, we recommend increasing the number of shards and the parallelization. The dataset is selected with the -d parameter, with valid options 8k and 30k. If you want to index your own dataset, check dataset.py to see how the data is provided and adapt it to your own data source.
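
For reference, a generator along the following lines is the kind of thing you would adapt. This is a rough sketch, assuming the Kaggle Flickr8k captions.txt is a CSV with an image,caption header; the real logic lives in dataset.py.

# Rough sketch: pair every caption with the path of the image it describes.
import csv
import os

def flickr8k_pairs(captions_path='data/f8k/captions.txt', image_dir='data/f8k/images', num_docs=None):
    with open(captions_path, newline='') as f:
        reader = csv.DictReader(f)                    # assumed columns: image, caption
        for i, row in enumerate(reader):
            if num_docs is not None and i >= num_docs:
                break
            yield os.path.join(image_dir, row['image']), row['caption']

# each pair is indexed twice: the image into the image index, the caption into the text index
for image_path, caption in flickr8k_pairs(num_docs=3):
    print(image_path, '->', caption)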

Jina normalizes the images as needed before passing them to the encoder. The QueryLanguageDriver is used to route (filter) documents based on their modality.

Query

python app.py -t query-restful

You can then query the system from jinabox using either images or text. The default port number is 45678.
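
If you prefer not to use jinabox, you can also hit the REST gateway directly. The endpoint and payload shape below mirror the jinabox defaults for this version of Jina and should be treated as assumptions if your setup differs; the image path is a hypothetical example file.

# Sketch of querying the REST gateway directly (endpoint/payload assumed from jinabox defaults).
import base64
import requests

# text query -> retrieves images
resp = requests.post('http://localhost:45678/api/search',
                     json={'top_k': 5, 'mode': 'search',
                           'data': ['A Boston terrier is running in the grass']})
print(resp.json())

# image query -> retrieves captions (the image is sent as a data URI)
with open('data/f8k/images/example.jpg', 'rb') as f:      # hypothetical example image
    data_uri = 'data:image/jpeg;base64,' + base64.b64encode(f.read()).decode()
resp = requests.post('http://localhost:45678/api/search',
                     json={'top_k': 5, 'mode': 'search', 'data': [data_uri]})
print(resp.json())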

Examples of captions in the dataset:

  • A man in an orange hat starring at something
  • A Boston terrier is running in the grass
  • A television with a picture of a girl on it

Note where the "cross" in cross-modal comes from: text queries retrieve images, and image queries retrieve captions.

Internally, the TextEncoder targets the ImageVectorIndexer and the ImageEncoder targets the TextVectorIndexer. The ImageVectorIndexer and TextVectorIndexer map into a common embedding space (for Jina this means having a common dimensionality).

Use the Docker image from Jina Hub

To make it easier for the user, we have built and published a Docker image with the documents already indexed. Be aware that the image weighs 11 GB, so make sure that your Docker setup allows allocating a sufficient amount of memory.

You can retrieve the docker image using:

docker pull jinahub/app.example.crossmodalsearch:0.0.2-0.9.20

Other available tags can be pulled in the same way.

To run the application with the pre-indexed documents, ready to be used from jinabox, run:

docker run -p 45678:45678 jinahub/app.example.crossmodalsearch:0.0.2-0.9.20

Build the docker image yourself

In order to build the Docker image, please first run ./get_data.sh or make sure that flickr8k.zip is downloaded.

Then simply run:

docker build -f Dockerfile -t {DOCKER_IMAGE_TAG} .

Results

Documentation

The best way to learn Jina in depth is to read our documentation, which is rebuilt on every push, merge, and release event of the master branch.

Community

  • Slack channel - a communication platform for developers to discuss Jina
  • Community newsletter - subscribe to the latest updates, releases and event news of Jina
  • LinkedIn - get to know Jina AI as a company and find job opportunities
  • Twitter - follow us and interact with us using hashtag #JinaSearch
  • Company - learn more about our company; we are fully committed to open source!

License

Copyright (c) 2020-2021 Jina AI Limited. All rights reserved.

Jina is licensed under the Apache License, Version 2.0. See LICENSE for the full license text.