An encoder-decoder model for image captioning, built with PyTorch and deployed with Streamlit. It uses InceptionV3 as the encoder and LSTM layers as the decoder, and is trained on the Flickr30k dataset.
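The encoder-decoder layout described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class names `Encoder`, `Decoder`, and `CaptionModel` and all layer sizes are hypothetical, and a tiny convolutional stack stands in for InceptionV3 so the example runs without downloading pretrained weights.

```python
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """Maps an image batch to one feature vector per image.

    Assumption: a small CNN stands in for InceptionV3 here.
    """

    def __init__(self, embed_size):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global average pool -> (B, 16, 1, 1)
        )
        self.fc = nn.Linear(16, embed_size)

    def forward(self, images):
        x = self.features(images).flatten(1)  # (B, 16)
        return self.fc(x)  # (B, embed_size)


class Decoder(nn.Module):
    """LSTM that predicts caption tokens conditioned on the image feature."""

    def __init__(self, embed_size, hidden_size, vocab_size, num_layers=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        embeddings = self.embed(captions)  # (B, T, embed_size)
        # Feed the image feature as the first step of the input sequence.
        inputs = torch.cat([features.unsqueeze(1), embeddings], dim=1)
        hidden, _ = self.lstm(inputs)  # (B, T + 1, hidden_size)
        return self.fc(hidden)  # (B, T + 1, vocab_size)


class CaptionModel(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.encoder = Encoder(embed_size)
        self.decoder = Decoder(embed_size, hidden_size, vocab_size)

    def forward(self, images, captions):
        return self.decoder(self.encoder(images), captions)


if __name__ == "__main__":
    model = CaptionModel(embed_size=32, hidden_size=64, vocab_size=100)
    images = torch.randn(2, 3, 64, 64)          # two random RGB images
    captions = torch.randint(0, 100, (2, 5))    # two captions of 5 token ids
    print(model(images, captions).shape)
```

The output holds one score per vocabulary word at each time step; during training these scores would typically be compared against the reference caption with a cross-entropy loss.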
Try it yourself here
Prediction: a man in wetsuit is surfing .
Prediction: a man in blue helmet is riding a dirt bike on a dirt track .
Prediction: a dog is running on the beach .
- python3
python -m spacy download en
- for tokenizing English sentences
pip install -r requirements.txt
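To show why a tokenizer is needed, here is a minimal sketch of how captions might be turned into integer ids before they reach the decoder. It is an assumption-laden illustration rather than the project's code: a simple regex replaces the spaCy tokenizer so the example runs with the standard library alone, and the helper names (`tokenize`, `build_vocab`, `numericalize`) and the special tokens are hypothetical.

```python
import re
from collections import Counter

# Hypothetical special tokens: padding, start, end, and unknown word.
SPECIALS = ["<PAD>", "<SOS>", "<EOS>", "<UNK>"]


def tokenize(text):
    # Regex stand-in for the spaCy tokenizer: lowercase the text, then
    # split it into alphanumeric runs and single punctuation marks.
    return re.findall(r"[a-z0-9]+|[^\sa-z0-9]", text.lower())


def build_vocab(captions, min_freq=1):
    # Count every token and keep those seen at least min_freq times.
    counts = Counter(tok for caption in captions for tok in tokenize(caption))
    kept = sorted(tok for tok, n in counts.items() if n >= min_freq)
    return {tok: idx for idx, tok in enumerate(SPECIALS + kept)}


def numericalize(text, stoi):
    # Map tokens to ids, falling back to <UNK>, and wrap with <SOS>/<EOS>.
    unk = stoi["<UNK>"]
    ids = [stoi.get(tok, unk) for tok in tokenize(text)]
    return [stoi["<SOS>"]] + ids + [stoi["<EOS>"]]


if __name__ == "__main__":
    captions = [
        "a man in wetsuit is surfing .",
        "a dog is running on the beach .",
    ]
    stoi = build_vocab(captions)
    print(numericalize("a cat .", stoi))  # "cat" is unseen, so it maps to <UNK>
```

Raising `min_freq` drops rare words from the vocabulary, which keeps the decoder's output layer smaller at the cost of more `<UNK>` tokens.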
- is used to train the model
- is used to perform inference
- is used to build the streamlit app
For more details, see these files for their script arguments and descriptions.
i. Download the Flickr30k dataset
ii. Remove the duplicate images folder and csv file -
use to train the model