In the modern era, enormous numbers of digital pictures, from personal photos to medical images, are produced and stored every day. It is increasingly common to have thousands of photos sitting on our smartphones; however, with the convenience of recording unforgettable moments comes the pain of searching for a specific picture or frame. How nice it would be to find the desired image just by typing a few words to describe it. In this context, automated caption-based image retrieval is becoming an increasingly attractive feature, comparable to text search.
In this project, we consider the task of content-based image retrieval and propose effective neural network-based solutions for it. Specifically, the input to our algorithm is a collection of raw images in which the user would like to search, together with a query sentence describing the desired image. The output is a list of top images that we consider relevant to the query sentence.
In particular, we obtain a representation of the sentence that is properly aligned with the corresponding image features in a shared high-dimensional space. Images are then found by nearest neighbor search in that shared space.
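As a minimal sketch of this retrieval step, assuming the image and sentence embeddings have already been computed and L2-normalized (the function name, dimensions, and random stand-in data below are purely illustrative, not the exact code in this repo):

```python
import numpy as np

def retrieve_top_k(query_embedding, image_embeddings, k=5):
    """Return the indices of the k images closest to the query in the shared space.

    query_embedding: (d,) L2-normalized sentence embedding.
    image_embeddings: (n, d) L2-normalized image embeddings.
    """
    # With unit-norm vectors, cosine similarity reduces to a dot product.
    similarities = image_embeddings @ query_embedding   # shape (n,)
    top_k = np.argsort(-similarities)[:k]                # highest similarity first
    return top_k, similarities[top_k]

# Usage with random stand-in embeddings: 1000 images in a 512-d shared space.
rng = np.random.default_rng(0)
images = rng.normal(size=(1000, 512))
images /= np.linalg.norm(images, axis=1, keepdims=True)
query = rng.normal(size=512)
query /= np.linalg.norm(query)
print(retrieve_top_k(query, images, k=5))
```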
- Better captioning using attention
  - The focus is on incorporating attention into the baseline model to see whether it yields any improvement.
- Usage of various word embedding models
  - The solution involves experimenting with various word embeddings and evaluating them on the caption dataset.
- Usage of various CNN models
  - The solution involves experimenting with multiple convolutional neural network models via transfer learning and evaluating which one gives the best results overall.
- Building on top of state-of-the-art base models
  - The base models used here are best in class and hence provide strong representations for both the images and the captions.
  - E.g. in most of our models, the images are represented with feature vectors derived from Google's Inception CNN model, whereas the captions are represented with feature vectors derived from Facebook's fastText word embedding model (a sketch of this feature extraction follows this list).
  - Both models have repeatedly been used to achieve state-of-the-art performance in computer vision and natural language processing, respectively.
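As a rough, hedged sketch of how these two feature extractors could be wired up, using Keras's InceptionV3 and gensim's fastText loader (the vector file name, image size, and mean-pooling of caption tokens are illustrative assumptions, not the exact pipeline in this repo):

```python
import numpy as np
import tensorflow as tf
from gensim.models.fasttext import load_facebook_vectors

# Image side: InceptionV3 pretrained on ImageNet, global-average-pooled to a 2048-d vector.
inception = tf.keras.applications.InceptionV3(weights="imagenet", include_top=False, pooling="avg")

def image_features(path):
    img = tf.keras.preprocessing.image.load_img(path, target_size=(299, 299))
    x = tf.keras.preprocessing.image.img_to_array(img)[np.newaxis, ...]
    x = tf.keras.applications.inception_v3.preprocess_input(x)
    return inception.predict(x)[0]                      # shape (2048,)

# Caption side: pretrained fastText vectors ("cc.en.300.bin" is a placeholder path).
ft = load_facebook_vectors("cc.en.300.bin")

def caption_features(caption):
    tokens = caption.lower().split()
    # Mean-pooled 300-d sentence vector, purely for illustration.
    return np.mean([ft[t] for t in tokens], axis=0)
```

In the actual models the per-token fastText vectors are fed into a recurrent (GRU/BiLSTM) encoder rather than mean-pooled, as described in the experiments below.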
As part of this project we carried out a multitude of experiments; each new model introduced some minor architectural differences that gave it a slight edge over the previous one, but the overarching encoder-decoder architecture remains the same (a minimal sketch of this shared architecture follows the list below).
Listed below are all the experiments we conducted so far:
- Baseline models
  - ResNet + GRU + fastText
  - Inception + GRU + fastText
- Baseline with bidirectional
  - Inception + Bidirectional(LSTM) + fastText
- Self-attention models
  - SelfAttention(Inception) + Bidirectional(LSTM) + fastText
  - Inception + SelfAttention(fastText)
  - SelfAttention(Inception) + SelfAttention(fastText)
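As a minimal sketch of what such a baseline two-branch model (Inception + GRU + fastText) could look like in Keras, with purely illustrative layer sizes and without the training objective:

```python
import tensorflow as tf
from tensorflow.keras import layers

EMBED_DIM, HIDDEN_DIM, SHARED_DIM, MAX_LEN = 300, 256, 512, 20   # illustrative sizes

l2_norm = layers.Lambda(lambda t: tf.math.l2_normalize(t, axis=-1))

# Image branch: precomputed 2048-d Inception features projected into the shared space.
image_in = layers.Input(shape=(2048,), name="inception_features")
image_emb = l2_norm(layers.Dense(SHARED_DIM)(image_in))

# Caption branch: a GRU over per-token fastText vectors, projected into the same space.
caption_in = layers.Input(shape=(MAX_LEN, EMBED_DIM), name="fasttext_tokens")
caption_emb = l2_norm(layers.Dense(SHARED_DIM)(layers.GRU(HIDDEN_DIM)(caption_in)))

# Cosine similarity between the two branches; at retrieval time only the embeddings are used.
similarity = layers.Dot(axes=-1)([image_emb, caption_emb])
model = tf.keras.Model(inputs=[image_in, caption_in], outputs=similarity)
model.summary()
```

At training time the two branches would typically be optimized with a ranking or contrastive objective so that matching image-caption pairs score higher than mismatched ones; the exact objective and hyperparameters used here are described in the report.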
In the models column, each model has been named in the following format:
MODEL_NAME [configuration]
configuration refers to the layers used in the model architecture.
The models labelled SA1, SA2 and DSA correspond to different experiments conducted using the "Self Attention" layer found in transformer networks (a rough sketch of this layer applied to image features follows the list below):
- SA1: Self-Attention Model 1, where the multi-head self-attention layer is used on the decoder side, i.e. on the images.
- SA2: Self-Attention Model 2, where the multi-head self-attention layer is used on the encoder side, i.e. on the captions.
- DSA: Double Self-Attention Model, where the multi-head self-attention layer is used on both the encoder and the decoder side of the model architecture.
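As a hedged sketch of what the SA1-style image-side self-attention could look like with Keras's built-in MultiHeadAttention layer (the head count, key dimension, and the 8x8 Inception feature grid are illustrative assumptions, not the exact hyperparameters from the notebooks):

```python
import tensorflow as tf
from tensorflow.keras import layers

# InceptionV3's final convolutional map is an 8x8 grid of 2048-d vectors; flatten the grid
# into 64 "tokens" and let every spatial position attend to all the others.
features_in = layers.Input(shape=(64, 2048), name="inception_feature_grid")
attended = layers.MultiHeadAttention(num_heads=4, key_dim=64)(
    query=features_in, value=features_in, key=features_in)      # self-attention: Q = K = V
attended = layers.LayerNormalization()(layers.Add()([attended, features_in]))  # residual
pooled = layers.GlobalAveragePooling1D()(attended)               # single image vector
image_encoder = tf.keras.Model(features_in, pooled)
image_encoder.summary()
```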
To start with, download the weights of the pretrained models from here, and save the folder in the same directory as the other files of this repo.
The base directory will then have the following subdirectories:
- DLSIR_demo: contains the code for the final GUI demo
- DLSIR_model_training_experiments: contains several Colab notebooks in which the different experiments were trained and the model weights were saved.
- DLSIR_model_weights: contains the saved weights for the various trained models.
- DLSIR_report: the project report, which covers all the architectural and training details.
To run the DLSIR_demo, follow these steps:
- Run download_data.py
- Run inception_features_saving_to_disk.py
- Run cache_image_embeddings.py
- Run predictions.py and server.py as two separate simultaneous processes
- Go to 0.0.0.0:5000 to see the application running