In this example, Jina is used to implement a cross-modal search system. It allows the user to search for images given a caption description, and to look for a caption description given an image. Images and their captions (any descriptive text of the image) are encoded into separate indexes, which are later queried in a cross-modal fashion: the text index is queried using image embeddings, and the image index is queried using text embeddings.
Motive behind Cross Modal Retrieval
Cross-modal retrieval aims to effectively search for documents of a given modality by querying with documents from a different modality.
Modality is an attribute assigned to a Document in Jina's protobuf Document structure. Documents of the same MIME type can still have different modalities if they come from different distributions. For example, in an article or web page, the body text and the title share the same MIME type (text), but can be considered different modalities (distributions).
Different encoders map the different modalities into a common embedding space; each encoder needs to extract the semantic information from its documents. In this embedding space, documents from different modalities that are semantically relevant to each other are expected to lie close to one another (this is the idea behind metric learning). In this example, we expect image embeddings to be close to the embeddings of their captions.
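As a minimal illustration of what "close in a common embedding space" means, here is a small sketch (not part of the example code) that ranks a set of caption embeddings against an image embedding by cosine similarity; the vectors are random placeholders and the dimensionality of 1024 is only illustrative.

import numpy as np

def cosine_scores(query, candidates):
    # cosine similarity between one query vector and each row of `candidates`
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return candidates @ query

image_embedding = np.random.rand(1024)        # placeholder image embedding
caption_embeddings = np.random.rand(5, 1024)  # placeholder caption embeddings
scores = cosine_scores(image_embedding, caption_embeddings)
print('best matching caption index:', int(np.argmax(scores)))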
Research for Cross Modal Retrieval
The models used in this example come from the paper Improving Visual-Semantic Embeddings with Hard Negatives (https://arxiv.org/pdf/1707.05612.pdf).
To make this search system retrieve good results, we use the models trained in https://github.com/fartashf/vsepp. The model encodes text and images into a common embedding space, aiming to bring the embedding of an image close to the embeddings of its corresponding captions.
We use one network per modality:
- VGG19 for images, pretrained on ImageNet.
- A Gated Recurrent Unit (GRU) for captions.
The last layers of these networks are removed, so they act as feature extractors, and a fully connected layer is added on top of each one to map the extracted features into the new embedding space. They are trained on the Flickr30k dataset with a contrastive loss, which pulls positive matches close together in the embedding space and pushes negative samples apart.
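The key variation in the VSE++ paper is to emphasize the hardest negative in each batch. As a rough NumPy sketch (not the authors' code), the max-hinge loss over a batch of L2-normalized image and caption embeddings could look like this; treat the margin value as illustrative.

import numpy as np

def hard_negative_triplet_loss(im, cap, margin=0.2):
    # im, cap: (batch, dim) L2-normalized embeddings; row i of `im` matches row i of `cap`
    scores = im @ cap.T                                         # pairwise cosine similarities
    pos = np.diag(scores)                                       # similarities of the true pairs
    cost_cap = np.maximum(0, margin + scores - pos[:, None])    # hinge over caption negatives per image
    cost_im = np.maximum(0, margin + scores - pos[None, :])     # hinge over image negatives per caption
    np.fill_diagonal(cost_cap, 0)
    np.fill_diagonal(cost_im, 0)
    # keep only the hardest negative in each row/column, as in VSE++
    return cost_cap.max(axis=1).sum() + cost_im.max(axis=0).sum()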
VSE Encoders in Jina for Cross Modal Search
Two encoders have been created for this example: VSEImageEncoder and VSETextEncoder.
The process followed is:
- Load the weights published as a result of the research paper.
- Instantiate their VSE model and extract the branch relevant to the modality.
The two encoders are available on Jina Hub as VSEImageEncoder and VSETextEncoder.
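Purely for illustration, the following is a simplified, hypothetical sketch of the idea behind VSEImageEncoder: take a pretrained VGG19, drop its classifier head, and add a fully connected projection into the joint space. The layer sizes and the randomly initialized projection are placeholders, not the published checkpoint.

import torch
import torch.nn as nn
import torchvision.models as models

class ImageBranchSketch(nn.Module):
    # hypothetical, simplified stand-in for the image branch of the VSE model
    def __init__(self, embed_dim=1024):
        super().__init__()
        vgg = models.vgg19(pretrained=True)
        self.features = vgg.features                 # convolutional feature extractor
        self.pool = nn.AdaptiveAvgPool2d((7, 7))
        self.fc = nn.Linear(512 * 7 * 7, embed_dim)   # projection into the joint space

    def forward(self, x):
        x = self.pool(self.features(x)).flatten(1)
        x = self.fc(x)
        return x / x.norm(dim=1, keepdim=True)        # L2-normalize for cosine similarity

branch = ImageBranchSketch().eval()
with torch.no_grad():
    embedding = branch(torch.rand(1, 3, 224, 224))    # dummy batch of one image
print(embedding.shape)                                # torch.Size([1, 1024])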
Table of Contents
- Prerequisites
- Prepare the data
- Build the docker images
- Run the Flows
- Documentation
- Community
- License
This demo requires Python 3.7 and a Jina installation.
Although the model is trained on Flickr30k, you can test on the Flickr8k dataset, which is a much smaller version of Flickr30k and the default dataset for this example.
To make this work, we need to get the image files from the Kaggle dataset. Once you have your Kaggle token in your system as described here, run:
kaggle datasets download adityajn105/flickr8k
unzip flickr8k.zip
rm flickr8k.zip
mkdir -p data/f8k
mv Images data/f8k/images
mv captions.txt data/f8k/captions.txt
Make sure that your data folder has:
data/f8k/images/*jpg
data/f8k/captions.txt
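As an optional sanity check (not part of the repository), you can quickly verify that the images and captions are in place; this assumes the captions.txt from the Kaggle dataset is a CSV with image and caption columns.

import csv
from pathlib import Path

images = list(Path('data/f8k/images').glob('*.jpg'))
print(f'{len(images)} images found')

with open('data/f8k/captions.txt') as f:
    captions = list(csv.DictReader(f))   # assumed columns: image, caption
print(f'{len(captions)} captions found')
print(captions[0])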
The model used has been trained on Flickr30k, and therefore we recommend this dataset to try the system. But it is a good exercise to see whether it also works for other datasets, or for your own custom ones.
To use Flickr30k, instead of downloading flickr8k from Kaggle, take its 30k counterpart:
pip install kaggle
kaggle datasets download hsankesara/flickr-image-dataset
unzip flickr-image-dataset.zip
rm flickr-image-dataset.zip
Then we also need the captions data. To get it, run:
wget http://www.cs.toronto.edu/~faghri/vsepp/data.tar
tar -xvf data.tar
rm -rf data.tar
rm -rf data/coco*
rm -rf data/f8k*
rm -rf data/*precomp*
rm -rf data/f30k/images
mv flickr-image-dataset data/f30k/images
Once all the steps are completed, make sure that the cross-modal-search/data/f30k folder contains a folder images and a JSON file dataset_flickr30k.json. The images folder should contain all the images of Flickr30k, and dataset_flickr30k.json contains the captions and their linkage to the images.
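If you want to inspect how captions are linked to images, a short script like the one below can help; it assumes dataset_flickr30k.json follows the usual Karpathy-split layout, with an images list whose entries carry a filename and a list of sentences.

import json

with open('data/f30k/dataset_flickr30k.json') as f:
    dataset = json.load(f)

entry = dataset['images'][0]   # assumed layout: {'images': [{'filename': ..., 'sentences': [...]}]}
print('image file:', entry['filename'])
for sentence in entry['sentences'][:2]:
    print('caption:', sentence['raw'])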
Indexing is run with the following command, where request_size can be chosen by the user. Indexing will process both images and captions.
python app.py -t index -n $num_docs -s $request_size -d 'f8k'
Note that num_docs should be 8k or 30k depending on the Flickr dataset you use. If you decide to index the complete datasets, it is recommended to increase the number of shards and the parallelization. The dataset is selected with the -d parameter, with the valid options 30k and 8k. If you want to index your own dataset, check dataset.py to see how the data is provided and adapt it to your own data source.
Jina normalizes the images as needed before they enter the encoder. The QueryLanguageDriver is used to route (filter) documents based on their modality.
python app.py -t query-restful
You can then query the system from jinabox using either images or text. The default port number is 45678.
Examples of captions in the dataset:
A man in an orange hat starring at something, A Boston terrier is running in the grass, A television with a picture of a girl on it
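If you prefer to query without the UI, you can send requests to the REST gateway directly; the endpoint and payload below are assumptions based on the REST API of older Jina versions and may differ in your setup.

import requests

# assumption: the gateway exposes /api/search and accepts plain text
# (or data URIs for images) in the `data` field
resp = requests.post(
    'http://localhost:45678/api/search',
    json={'data': ['a Boston terrier is running in the grass'], 'top_k': 5},
)
print(resp.json())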
Note the 'cross' that cross-modal stands for: internally, the TextEncoder targets the ImageVectorIndexer and the ImageEncoder targets the TextVectorIndexer. ImageVectorIndexer and TextVectorIndexer map to a common embedding space (for Jina, this means having a common dimensionality).
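The flows in the repository are defined in YAML; as a rough conceptual sketch only (the pod names and the Flow API details here are assumptions, not the repository's configuration), the cross-routing could be expressed in Python roughly like this:

from jina.flow import Flow   # in newer Jina versions: from jina import Flow

# conceptual topology sketch: executors are omitted and the pod names are illustrative;
# the point is that each encoder's branch ends in the vector indexer of the *other* modality
f = (
    Flow()
    .add(name='text_encoder')
    .add(name='image_vector_indexer', needs='text_encoder')
    .add(name='image_encoder', needs='gateway')          # second branch starting from the gateway
    .add(name='text_vector_indexer', needs='image_encoder')
    .add(name='merge', needs=['image_vector_indexer', 'text_vector_indexer'])
)
# in the real example, the index and query flows are loaded from their YAML definitions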
To make it easier for the user, we have built and published the Docker image with the indexed documents. Just be aware that the image weighs 11 GB. Make sure that your Docker setup lets you allocate a sufficient amount of memory.
You can retrieve the docker image using:
docker pull jinahub/app.example.crossmodalsearch:0.0.2-0.9.20
You can also pull from its latest tags if newer ones are published.
To run the application with the pre-indexed documents, ready to be used from jina-box, run:
docker run -p 45678:45678 jinahub/app.example.crossmodalsearch:0.0.2-0.9.20
In order to build the Docker image, please first run ./get_data.sh or make sure that flickr8k.zip is downloaded.
Then simply run:
docker build -f Dockerfile -t {DOCKER_IMAGE_TAG} .

The best way to learn Jina in depth is to read our documentation. Documentation is built on every push, merge, and release event of the master branch. You can find more details about the following topics in our documentation.
- Jina command line interface arguments explained
- Jina Python API interface
- Jina YAML syntax for executor, driver and flow
- Jina Protobuf schema
- Environment variables used in Jina
- ... and more
- Slack channel - a communication platform for developers to discuss Jina
- Community newsletter - subscribe to the latest update, release and event news of Jina
- LinkedIn - get to know Jina AI as a company and find job opportunities
- Twitter - follow us and interact with us using the hashtag #JinaSearch
- Company - know more about our company, we are fully committed to open-source!
Copyright (c) 2020-2021 Jina AI Limited. All rights reserved.
Jina is licensed under the Apache License, Version 2.0. See LICENSE for the full license text.