The objective of this project was to develop a Video Summarization and Captioning framework that automatically generates a concise summary and a descriptive caption for a video, reducing the effort needed to process long footage. We used two datasets: TVSum, a collection of 50 videos covering a broad range of content, and SumMe, a set of 25 videos of personal and event-focused content.

The pipeline has two stages. First, a video from one of these datasets is fed to DSNet, an anchor-based deep summarizer network. The anchor-based method generates temporal interest proposals, which are used to identify and localize the key content of the video, and DSNet is distinct for leveraging temporal consistency in this process. Its output is a condensed summary that captures the key moments. Second, this summary is passed to a TimeSformer GPT-2 model, which produces a caption describing the core narrative of the summarized video.
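The idea of anchor-based temporal proposals can be illustrated with a small, self-contained sketch. It is not DSNet's code: the function names, the anchor scales and the IoU threshold are all assumptions chosen for illustration. It places candidate segments of several lengths around every temporal position and marks those that overlap a known key segment strongly enough.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) along the time axis


def make_anchors(seq_len: int, scales: List[int]) -> List[Segment]:
    """Centre one candidate segment of each scale on every temporal position."""
    anchors = []
    for center in range(seq_len):
        for scale in scales:
            anchors.append((center - scale / 2, center + scale / 2))
    return anchors


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two 1-D segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def label_anchors(anchors: List[Segment], key_segments: List[Segment],
                  threshold: float = 0.6) -> List[bool]:
    """Mark an anchor positive if it overlaps any key segment by >= threshold.

    The 0.6 default is an illustrative assumption, not a documented setting.
    """
    return [
        any(temporal_iou(anchor, seg) >= threshold for seg in key_segments)
        for anchor in anchors
    ]
```

In a trained model, a network would score and refine such proposals rather than compare them against known segments; the sketch only shows how multi-scale candidates and overlap-based labels fit together.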
Create a virtual environment with Python 3.6 for the summarization stage (DSNet).
conda create --name VSC python=3.6
conda activate VSC
Install the Python dependencies.
pip install -r requirements.txt
Create a separate virtual environment with Python 3.11 for the captioning stage.
conda create --name Vid-Cap python=3.11
conda activate Vid-Cap
Install the Python dependencies.
pip install -r requirements-captioning.txt
Download the pre-processed datasets, including TVSum and SumMe, into the datasets/ folder.
mkdir -p datasets/ && cd datasets/
wget https://www.dropbox.com/s/tdknvkpz1jp6iuz/dsnet_datasets.zip
unzip dsnet_datasets.zip
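After unzipping, it can help to confirm that the dataset files are in place before training. The helper below is a minimal sketch: the function name is hypothetical, and the `.h5` suffix is an assumption about how the pre-processed datasets are stored, so adjust it if the archive contains a different format.

```python
from pathlib import Path
from typing import List


def list_dataset_files(root: str = "datasets", suffix: str = ".h5") -> List[str]:
    """Return sorted relative paths of files under `root` ending in `suffix`.

    The default suffix is an assumption, not something the archive guarantees.
    """
    base = Path(root)
    if not base.is_dir():
        return []
    return sorted(
        p.relative_to(base).as_posix()
        for p in base.rglob(f"*{suffix}")
        if p.is_file()
    )
```

Calling `list_dataset_files("datasets")` from the directory that contains datasets/ returns an empty list if the folder is missing or holds no matching files, which is a quick sign that the download or unzip step did not complete.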
To train the anchor-based attention model on the TVSum and SumMe datasets with canonical settings, run
python train.py anchor-based --model-dir ../models/ab_basic --splits ../splits/tvsum.yml ../splits/summe.yml
To evaluate the anchor-based models, run
python evaluate.py anchor-based --model-dir ../models/ab_basic/ --splits ../splits/tvsum.yml ../splits/summe.yml
To predict the summary of a raw video, use infer.py. For example, run
python infer.py anchor-based --ckpt-path ../models/custom/checkpoint/custom.yml.0.pt \
--source ../custom_data/videos/EE-bNr36nyA.mp4 --save-path ./output.mp4
To generate a caption for a summarized video, use caption.py. For example, run
python caption.py --video_path ./St-Maarten-Landing-output.mp4 --output_path St-Maarten-Landing-caption.txt
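Because summarization and captioning use different conda environments, the two steps can be chained from one script with `conda run -n <env>`. The following is a sketch under stated assumptions: the helper names are hypothetical, the environment names are the ones created above, and both infer.py and caption.py are assumed to be reachable from the working directory, which may not match the repository layout.

```python
import subprocess
from typing import List


def summarize_cmd(ckpt_path: str, source: str, save_path: str,
                  env: str = "VSC") -> List[str]:
    """Build the infer.py command, run inside the summarization environment."""
    return ["conda", "run", "-n", env, "python", "infer.py", "anchor-based",
            "--ckpt-path", ckpt_path, "--source", source,
            "--save-path", save_path]


def caption_cmd(video_path: str, output_path: str,
                env: str = "Vid-Cap") -> List[str]:
    """Build the caption.py command, run inside the captioning environment."""
    return ["conda", "run", "-n", env, "python", "caption.py",
            "--video_path", video_path, "--output_path", output_path]


def run_pipeline(ckpt_path: str, source: str,
                 summary_path: str, caption_path: str) -> None:
    """Summarize a raw video, then caption the resulting summary.

    Assumes both scripts are in the current working directory (hypothetical).
    """
    subprocess.run(summarize_cmd(ckpt_path, source, summary_path), check=True)
    subprocess.run(caption_cmd(summary_path, caption_path), check=True)
```

For example, `run_pipeline(ckpt, "input.mp4", "output.mp4", "caption.txt")` would write the summary to output.mp4 and then the caption to caption.txt; `check=True` stops the pipeline if the summarization step fails.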
- Anisha Bhatnagar (ab10945@nyu.edu)
- Tanya Ojha (to2141@nyu.edu)