Automatic Text Summarizer Using Wu-Palmer Measure and Topological Sentence Selection (V 0.1)

This code is based on an unfinished paper of mine that combines the Wu-Palmer similarity measure with a sentence selection step based on a version of topological sorting to summarize a corpus. I've written a short script to demonstrate the methodology used in the paper.
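For readers unfamiliar with the Wu-Palmer measure, here is a minimal sketch of the standard definition, 2 * depth(LCS) / (depth(a) + depth(b)), where LCS is the deepest concept that subsumes both inputs. The taxonomy and the function names below are hypothetical and exist only for illustration; they are not taken from v01.py. NLTK exposes the same measure on WordNet synsets as wup_similarity, though whether the script calls it that way is not stated here.

```python
# Toy taxonomy (hypothetical, for illustration only): child -> parent.
PARENT = {
    "entity": None,
    "animal": "entity",
    "mammal": "animal",
    "dog": "mammal",
    "cat": "mammal",
    "bird": "animal",
}


def path_to_root(node):
    """Return the chain [node, parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))


def lowest_common_subsumer(a, b):
    """Deepest node that is an ancestor of (or equal to) both a and b."""
    ancestors_of_a = set(path_to_root(a))
    # Walking up from b, the first shared node is the deepest common one.
    for node in path_to_root(b):
        if node in ancestors_of_a:
            return node
    raise ValueError("nodes do not share a root")


def wu_palmer(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    lcs = lowest_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


if __name__ == "__main__":
    print(wu_palmer("dog", "cat"))   # siblings under "mammal"
    print(wu_palmer("dog", "bird"))  # only share "animal"
```

Concepts that share a deep common ancestor score close to 1, while concepts that only meet near the root score lower.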

Paper Methodology

Coming Soon.

Pre-Requisites

You need the following Python libraries installed on your computer.

You can install NLTK via pip by executing: 
pip install nltk

Along with our summarizer script, I've attached Summa, a summarizer script that uses TextRank. We used it to compare the results generated by TextRank with those of our script.

You can install Summa via pip by executing:
pip install summa

Pre-Processing

Before running our summarizer script, you need to run the pre-processor script (which I haven't written yet). The pre-processor script performs the following tasks:

1. Remove stop words from the corpus using a standard list.
2. Strip punctuation and numeric values, as they do not affect the quality of sentence selection.
3. Remove symbolic short forms such as Mr., Ms., Dr., Rs., &, %, $ etc.
4. Expand textual short forms such as It's, That's, What's etc.
5. Form a list of sentences from the corpus once steps 1 to 4 are complete.
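The steps above can be sketched roughly as follows. This is not the pre-processor script itself, only an illustration under stated assumptions: the stop-word and contraction lists are tiny placeholders, the function name preprocess is hypothetical, and sentences are split before punctuation is stripped so that sentence boundaries survive (a reordering of the steps chosen for this sketch).

```python
import re

# Tiny illustrative stop-word list (an assumption); a fuller list such as
# NLTK's English stop words could be swapped in.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "of", "in", "on",
              "to", "and", "it", "that", "what"}

# Step 4: a few textual short forms and their expansions (not exhaustive).
CONTRACTIONS = {"it's": "it is", "that's": "that is", "what's": "what is"}

# Step 3: symbolic short forms to drop.
SHORT_FORMS = ["Mr.", "Ms.", "Dr.", "Rs.", "&", "%", "$"]


def preprocess(corpus):
    """Return a list of cleaned sentences (hypothetical helper)."""
    text = corpus
    # Step 3 runs first so the period in "Dr." is not read as a sentence end.
    for short_form in SHORT_FORMS:
        text = text.replace(short_form, " ")
    # Step 4: expand contractions, ignoring case.
    for short, full in CONTRACTIONS.items():
        text = re.sub(re.escape(short), full, text, flags=re.IGNORECASE)
    sentences = []
    # Step 5 (splitting part): naive split on ., ! and ?
    for raw in re.split(r"[.!?]+", text):
        # Step 2: keep letters and whitespace only.
        cleaned = re.sub(r"[^A-Za-z\s]", " ", raw)
        # Step 1: drop stop words.
        words = [w for w in cleaned.lower().split() if w not in STOP_WORDS]
        if words:
            sentences.append(" ".join(words))
    return sentences


if __name__ == "__main__":
    sample = "Dr. Smith paid $20 in 2019. It's the best & cheapest option!"
    print(preprocess(sample))
```

The naive sentence split will mishandle abbreviations that are not in the short-form list, so a real pre-processor would likely need a proper sentence tokenizer.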

Running the Script

You can run the script simply by typing the following in terminal:

python v01.py

Selecting the Corpus

I've already attached a few manually pre-processed text files in the 'Corpus-Collection' folder. However, you can use any passage of your choice as long as Python can read it into a string. To use a different file, edit the line:

file=open('Corpus-Collection/text4.txt','r')
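If you would rather not leave the file handle open, the same read can be wrapped in a small helper. This is only a sketch: the name load_corpus is hypothetical, and the UTF-8 encoding is an assumption about how the corpus files are saved.

```python
def load_corpus(path="Corpus-Collection/text4.txt"):
    """Read a whole corpus file into one string (hypothetical helper)."""
    # The with-block closes the file even if reading fails.
    # encoding="utf-8" is an assumption; adjust it to match your files.
    with open(path, "r", encoding="utf-8") as handle:
        return handle.read()
```

Calling load_corpus("my_passage.txt") would then return the passage as a single string.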

Changing the Percentage of Summarization

By default, percentage_summarization is set to 0.5, indicating 50% summarization. You can change the factor by editing the line:

percentage_summarization=0.5
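As a rough illustration of what the factor means, the sketch below assumes percentage_summarization is the fraction of sentences kept in the summary. That interpretation, the rounding rule, and the helper name sentences_to_keep are assumptions made for this example, not details confirmed by v01.py.

```python
import math


def sentences_to_keep(total_sentences, percentage_summarization=0.5):
    """Number of sentences a summary would retain (hypothetical helper)."""
    if not 0 < percentage_summarization <= 1:
        raise ValueError("percentage_summarization must be in (0, 1]")
    if total_sentences <= 0:
        return 0
    # Round up so that a non-empty corpus always yields at least one sentence.
    return max(1, math.ceil(total_sentences * percentage_summarization))


if __name__ == "__main__":
    print(sentences_to_keep(10))        # default factor of 0.5
    print(sentences_to_keep(10, 0.25))  # a shorter summary
```

Under this assumption, lowering the factor shortens the summary and raising it toward 1 keeps more of the original text.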