An encoder-decoder model that captions images, built with PyTorch and deployed with Streamlit. It uses InceptionV3 as the encoder and LSTM layers as the decoder, and is trained on the Flickr30k dataset.
Try it yourself here
Prediction: a man in wetsuit is surfing .
Prediction: a man in blue helmet is riding a dirt bike on a dirt track .
Prediction: a dog is running on the beach .
- python3
- python -m spacy download en (for tokenizing English sentences)
- pip install -r requirements.txt
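To show why a tokenizer is needed, here is a minimal sketch of turning captions into integer ids for the decoder. It is not taken from this repository: the `Vocabulary` class, the special token names, and `min_freq` are hypothetical, and the regex `tokenize` function is only a dependency-free stand-in for the spaCy English tokenizer.

```python
import re
from collections import Counter

# Hypothetical special tokens; the project may use different names.
SPECIAL_TOKENS = ["<PAD>", "<SOS>", "<EOS>", "<UNK>"]


def tokenize(text):
    # Stand-in for spaCy: lowercase, then split into words and punctuation.
    return re.findall(r"\w+|[^\w\s]", text.lower())


class Vocabulary:
    def __init__(self, min_freq=1):
        self.min_freq = min_freq
        self.itos = list(SPECIAL_TOKENS)                  # id -> token
        self.stoi = {t: i for i, t in enumerate(self.itos)}  # token -> id

    def build(self, captions):
        counts = Counter(tok for c in captions for tok in tokenize(c))
        # Sorted so that ids are deterministic across runs.
        for tok, n in sorted(counts.items()):
            if n >= self.min_freq and tok not in self.stoi:
                self.stoi[tok] = len(self.itos)
                self.itos.append(tok)

    def encode(self, caption):
        unk = self.stoi["<UNK>"]
        ids = [self.stoi.get(tok, unk) for tok in tokenize(caption)]
        return [self.stoi["<SOS>"]] + ids + [self.stoi["<EOS>"]]

    def decode(self, ids):
        skip = {"<PAD>", "<SOS>", "<EOS>"}
        return " ".join(self.itos[i] for i in ids if self.itos[i] not in skip)


vocab = Vocabulary(min_freq=1)
vocab.build(["A dog is running on the beach.", "A man in wetsuit is surfing."])
ids = vocab.encode("A dog is surfing.")
```

Words that were never seen during `build` map to `<UNK>`, and raising `min_freq` drops rare words from the vocabulary.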
- neuralnet/train.py: trains the model
- engine.py: performs inference
- ui.py: builds the Streamlit app
For more details, see these files for their script arguments and descriptions.
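How engine.py decodes a caption is not described here. One common approach is greedy decoding: pick the highest-scoring word at each step until an end token appears. The sketch below illustrates that idea only; `greedy_decode`, `toy_step`, and the tiny vocabulary are hypothetical, and `toy_step` stands in for a trained decoder.

```python
def greedy_decode(step, sos_id, eos_id, max_len=20):
    """Repeatedly take the highest-scoring next token until EOS or max_len.

    `step` is any callable that maps the tokens so far to a list of
    scores, one per vocabulary entry.
    """
    tokens = [sos_id]
    for _ in range(max_len):
        scores = step(tokens)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # drop the start token


# Toy vocabulary and scorer, used only to make the example runnable.
itos = ["<SOS>", "<EOS>", "a", "dog", "is", "running", "."]
target = [2, 3, 4, 5, 6, 1]  # "a dog is running ." followed by <EOS>


def toy_step(tokens):
    # Pretend model: always puts all the score on the next target token.
    idx = min(len(tokens) - 1, len(target) - 1)
    scores = [0.0] * len(itos)
    scores[target[idx]] = 1.0
    return scores


caption = " ".join(itos[i] for i in greedy_decode(toy_step, sos_id=0, eos_id=1))
```

With a real model, `step` would run the LSTM on the image feature plus the words generated so far and return the scores of the last time step.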
Dataset
i. Download the Flickr30k dataset
ii. Remove the duplicate images folder and CSV file
Training
Use train.py to train the model.