Image captioning is the task of generating a textual description of an image. It combines Natural Language Processing and Computer Vision: a common deep-learning approach pairs a CNN (to encode the image into a feature vector) with an LSTM (to generate the caption word by word). Popular datasets for this task include:
- Flickr 8k (containing 8k images),
- Flickr 30k (containing 30k images),
- MS COCO (containing 180k images), etc.
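To make the CNN-LSTM idea above concrete, here is a minimal sketch of the greedy decoding loop used at inference time: the encoded image features and the partial caption are fed to the decoder, which scores the next word until an end token is produced. The `stub_predict` function, the tiny vocabulary, and all names here are hypothetical stand-ins for a real trained Keras model, purely for illustration.

```python
# Toy vocabulary with start/end tokens, as used in caption generation.
vocab = ["<start>", "a", "dog", "runs", "<end>"]

def stub_predict(image_features, partial_caption):
    # Stand-in for model.predict(): returns one score per vocabulary word.
    # A real model would combine the CNN image features with the LSTM's
    # state over the partial caption; this stub just follows a fixed chain.
    order = {"<start>": "a", "a": "dog", "dog": "runs", "runs": "<end>"}
    next_word = order.get(partial_caption[-1], "<end>")
    return [1.0 if w == next_word else 0.0 for w in vocab]

def greedy_caption(image_features, max_len=10):
    # Greedy decoding: repeatedly pick the highest-scoring next word
    # until the end token appears or the length limit is reached.
    caption = ["<start>"]
    for _ in range(max_len):
        scores = stub_predict(image_features, caption)
        word = vocab[max(range(len(vocab)), key=scores.__getitem__)]
        if word == "<end>":
            break
        caption.append(word)
    return " ".join(caption[1:])  # drop the <start> token

print(greedy_caption(image_features=None))  # → "a dog runs"
```

In a real pipeline the stub would be replaced by a trained model, and beam search is often used instead of greedy decoding for better captions.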
Point to Note:
I have used the Flickr8k dataset here because of its modest computational requirements: it fits comfortably in 8 GB of RAM and trains in roughly 25 minutes per epoch on a CPU. Flickr30k and MS COCO may need about 32-64 GB of RAM depending on how they are processed. For the fastest results, consider an AWS EC2 instance (it's a paid service, though 😞!).
References:
- https://towardsdatascience.com/image-captioning-with-keras-teaching-computers-to-describe-pictures-c88a46a311b8
- https://towardsdatascience.com/image-captioning-in-deep-learning-9cd23fb4d8d2
- https://www.analyticsvidhya.com/blog/2018/04/solving-an-image-captioning-task-using-deep-learning/
- https://www.youtube.com/watch?v=NmoW_AYWkb4
- https://www.kaggle.com/shadabhussain/automated-image-captioning-flickr8