Developed as part of our Semester 7 Major Project, this repository contains scripts and code to run and compare the performance of popular text summarization algorithms. The algorithms studied are:
- TextRank [Mihalcea & Tarau, 2004]
- LexRank [Erkan & Radev, 2004]
- LSA [Steinberger & Ježek, 2004]
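To illustrate the graph-based approach these methods share, here is a minimal, dependency-free sketch of TextRank: sentences become graph nodes, normalized word overlap (as in Mihalcea & Tarau, 2004) becomes edge weights, and a weighted PageRank-style power iteration ranks the sentences. The repository's actual implementations may differ (e.g. by using `sumy` or `networkx`).

```python
import math

def similarity(a, b):
    """Normalized word overlap between two token sets, as in the TextRank paper."""
    overlap = len(a & b)
    if not overlap:
        return 0.0
    return overlap / (math.log(len(a) + 1) + math.log(len(b) + 1))

def textrank(sentences, top_n=2, damping=0.85, iterations=50):
    """Rank sentences by a weighted PageRank over a similarity graph."""
    tokens = [set(s.lower().split()) for s in sentences]
    n = len(sentences)
    weights = [[similarity(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sums = [sum(row) or 1.0 for row in weights]  # guard isolated nodes
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [(1 - damping) / n + damping *
                  sum(weights[j][i] / out_sums[j] * scores[j]
                      for j in range(n))
                  for i in range(n)]
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(top)]  # keep document order
```

LexRank follows the same recipe with a cosine-over-TF-IDF similarity, while LSA ranks sentences via a singular value decomposition of the term-sentence matrix rather than a graph walk.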
For our experiments, the Opinosis dataset was used; it can be obtained here.

```bibtex
@inproceedings{ganesan2010opinosis,
  title={Opinosis: a graph-based approach to abstractive summarization of highly redundant opinions},
  author={Ganesan, Kavita and Zhai, ChengXiang and Han, Jiawei},
  booktitle={Proceedings of the 23rd International Conference on Computational Linguistics},
  pages={340--348},
  year={2010},
  organization={Association for Computational Linguistics}
}
```
To compare the relative performance of the algorithms, a simple Python implementation of the ROUGE-1 metric was used.
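ROUGE-1 measures clipped unigram overlap between a system summary and a reference. A minimal sketch of such a scorer is below; the function and variable names are illustrative, not taken from the repository's `rogue_one` script.

```python
from __future__ import division  # consistent behaviour on Python 2.7

from collections import Counter

def rouge_1(gold, test):
    """Return (precision, recall, f1) for clipped unigram overlap."""
    gold_counts = Counter(gold.lower().split())
    test_counts = Counter(test.lower().split())
    # Each unigram in the candidate counts at most as often as in the gold.
    overlap = sum(min(count, gold_counts[word])
                  for word, count in test_counts.items())
    precision = overlap / sum(test_counts.values()) if test_counts else 0.0
    recall = overlap / sum(gold_counts.values()) if gold_counts else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1
```

For example, scoring the candidate "battery life is great" against the reference "the battery life is great" gives precision 1.0 and recall 0.8.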
To reproduce the results of our project, one may do the following:

- Clone this repository and ensure that the Opinosis dataset is present. If not, download it from the link above and extract it into `data/`.
- Run the `run-project` script:

  ```shell
  $ sh +x run-project.sh
  ```

  This script will clean the dataset, extract keywords, run the algorithms on the dataset, and print their respective running times and ROUGE-1 scores.
- The individual performance of each algorithm can be computed by first running the `$algorithm/$algorithm.py` script, followed by running the `rogue_one` script with:

  ```shell
  $ python rogue_one.py --gold data/summaries_keywords --test $algorithm/results
  ```
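A plausible outline of how such a scoring script could be wired up is sketched below. The `--gold` and `--test` flag names match the command above, but the directory layout, file matching, and the simplified set-based recall are assumptions, not taken from the repository.

```python
import argparse
import os
import sys

def load_dir(path):
    """Map file name -> contents for every file in a directory (assumed layout)."""
    contents = {}
    for name in sorted(os.listdir(path)):
        with open(os.path.join(path, name)) as handle:
            contents[name] = handle.read()
    return contents

def main(argv=None):
    parser = argparse.ArgumentParser(description="Simplified ROUGE-1 scorer")
    parser.add_argument("--gold", required=True,
                        help="directory of reference summaries")
    parser.add_argument("--test", required=True,
                        help="directory of system summaries")
    args = parser.parse_args(argv)
    gold, test = load_dir(args.gold), load_dir(args.test)
    # Score each system summary against the reference with the same file name.
    for name in sorted(set(gold) & set(test)):
        gold_words = set(gold[name].lower().split())
        test_words = set(test[name].lower().split())
        recall = len(gold_words & test_words) / float(len(gold_words) or 1)
        print("%s\tROUGE-1 recall: %.4f" % (name, recall))

if __name__ == "__main__" and len(sys.argv) > 1:
    main()
```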
The following dependencies are required:

- python 2.7+
- nltk
- sumy
- networkx