This repository implements and compares HeterSUMGraph and several variants: HeterSUMGraph with GATConv or GATv2Conv layers, and a combination of HeterSUMGraph and SummaRuNNer (HeterSUMGraph used as the sentence encoder).
The datasets are CNN-DailyMail and NYT50.
Paper: HeterSUMGraph (Wang et al., 2020, "Heterogeneous Graph Neural Networks for Extractive Document Summarization").
```bash
git clone https://github.com/Baragouine/HeterSUMGraph.git
cd HeterSUMGraph

conda create --name HeterSUMGraph python=3.9
conda activate HeterSUMGraph

pip install -r requirements.txt
```
To install the nltk data:
- Open a Python console.
- Run `import nltk; nltk.download()`.
- In the downloader, download all data.
- Close the Python console.
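Alternatively, the same step can be done non-interactively; a minimal sketch (it downloads everything, as above):

```python
# Non-interactive alternative: fetch all nltk data without the downloader GUI.
import nltk

nltk.download("all")
```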
NYT50: preprocessing here means cleaning, labeling, etc., not the preprocessing applied right before training.
- Download the raw NYT zip from https://catalog.ldc.upenn.edu/LDC2008T19 to `data/`.
- Run `00-00-convert_nyt_to_json.ipynb` (converts the zip to json).
- Run `00-01-nyt_filter_short_summaries.ipynb` (keeps only articles whose summaries have at least 50 distinct words).
- Run `00-02-compute_nyt_labels.ipynb` (computes the labels).
- Run `python scripts/compute_tfidf_dataset.py -input data/nyt_corpus_LDC2008T19_50.json -output data/nyt50_dataset_tfidf.json -docs_col_name docs` (computes TF-IDF over the whole dataset; see the sketch after this list).
- Run `python scripts/compute_tfidf_sent_dataset.py -input data/nyt_corpus_LDC2008T19_50.json -output data/compute_tfidf_sent_dataset.json -docs_col_name docs` (computes TF-IDF for each document).
- Run `00-03-split_NYT50.ipynb` (splits NYT50 into train, val and test sets).
Computing TF-IDF is only necessary for HeterSUMGraph-based models.
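For intuition only, here is a minimal corpus-level TF-IDF sketch with scikit-learn on toy data; it is not the repository's `compute_tfidf_dataset.py`, and the output format is purely illustrative:

```python
# Sketch: corpus-level TF-IDF, i.e. statistics computed over the whole dataset.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat .",
    "the dog barked at the cat .",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)    # shape: [num_docs, vocab_size]
vocab = vectorizer.get_feature_names_out()  # column index -> word
idf = vectorizer.idf_                       # idf values shared across the dataset
```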
CNN-DailyMail: preprocessing here means cleaning, labeling, etc., not the preprocessing applied right before training.
- Follow the CNN-DailyMail preprocessing instructions at https://github.com/Baragouine/SummaRuNNer/tree/master.
- Once the labels are computed, run `00-03-merge_cnn_dailymail.ipynb` to merge CNN-DailyMail into a single json file.
- Run `python scripts/compute_tfidf_dataset.py -input data/cnn_dailymail.json -output data/cnn_dailymail_dataset_tfidf.json -docs_col_name article` (computes TF-IDF over the whole dataset).
- Run `python scripts/compute_tfidf_sent_dataset.py -input data/cnn_dailymail.json -output data/cnn_dailymail_sent_tfidf.json -docs_col_name article` (computes TF-IDF for each document; see the sketch after this list).
Computing TF-IDF is only necessary for HeterSUMGraph-based models.
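As a rough illustration of the per-document variant: one plausible reading of `compute_tfidf_sent_dataset.py` is that each document's sentences are vectorized against that document alone, which is the kind of weight HeterSUMGraph places on word-sentence edges. Both the interpretation and the code below are assumptions, not the repository's script:

```python
# Sketch (assumption): TF-IDF restricted to one document, its sentences acting
# as the corpus, so the weights are specific to that document.
from sklearn.feature_extraction.text import TfidfVectorizer

article = [
    "police reported the incident on monday .",
    "the incident took place downtown .",
    "no injuries were reported .",
]

vectorizer = TfidfVectorizer()
sent_tfidf = vectorizer.fit_transform(article)  # shape: [num_sentences, doc_vocab]
doc_vocab = vectorizer.get_feature_names_out()

# sent_tfidf[i, j] could then serve as the weight of the edge between
# sentence i and word j in a HeterSUMGraph-style graph (assumption).
```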
For training, you must use the GloVe 300-dimensional embeddings; they must be located at `data/glove.6B/glove.6B.300d.txt`.
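Loading that file is straightforward; a minimal sketch (the notebooks' own loading code may differ):

```python
# Sketch: read glove.6B.300d.txt into a word -> 300-d vector dictionary.
import numpy as np

glove = {}
with open("data/glove.6B/glove.6B.300d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        glove[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

print(len(glove), glove["the"].shape)  # 400000 words, (300,) per vector
```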
For CNN-DailyMail the maximum document length is 100 sentences, not 50 as in the paper (the same maximum as SummaRuNNer, so the two can be compared). Run one of the notebooks below to train and evaluate the corresponding model (a short sketch of the GATConv/GATv2Conv layers follows the list):
- `01-train_HeterSUMGraph_CNN_DailyMail.ipynb`: paper model on CNN-DailyMail.
- `02-train_HeterSUMGraph_NYT50.ipynb`: paper model on NYT50.
- `03-train_HeterSUMGraph_CNN_DailyMail_TG_GATConv.ipynb`: HeterSUMGraph with the torch_geometric GATConv layer on CNN-DailyMail.
- `04-train_HeterSUMGraph_NYT50_TG_GATConv.ipynb`: HeterSUMGraph with the torch_geometric GATConv layer on NYT50.
- `05-train_HeterSUMGraph_CNN_DailyMail_TG_GATv2Conv.ipynb`: HeterSUMGraph with the torch_geometric GATv2Conv layer on CNN-DailyMail.
- `06-train_HeterSUMGraph_NYT50_TG_GATv2Conv.ipynb`: HeterSUMGraph with the torch_geometric GATv2Conv layer on NYT50.
- `07-train_HSGRNN_CNN_DailyMail_TG_GATv2Conv.ipynb`: HeterSUMGraph with the torch_geometric GATv2Conv layer + SummaRuNNer on CNN-DailyMail.
- `08-train_HSGRNN_NYT50_TG_GATv2Conv.ipynb`: HeterSUMGraph with the torch_geometric GATv2Conv layer + SummaRuNNer on NYT50.
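The GATConv and GATv2Conv variants differ only in which torch_geometric layer is plugged in; here is a minimal, self-contained sketch of the two layers on toy shapes, not the repository's model code:

```python
# Sketch: swapping GATConv for GATv2Conv in torch_geometric.
import torch
from torch_geometric.nn import GATConv, GATv2Conv

num_nodes, in_dim, out_dim = 5, 300, 64
x = torch.randn(num_nodes, in_dim)                         # toy node features
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]])    # toy edges (source, target)

gat = GATConv(in_dim, out_dim, heads=8, concat=False)      # original GAT attention
gatv2 = GATv2Conv(in_dim, out_dim, heads=8, concat=False)  # GATv2 "dynamic" attention

out_gat = gat(x, edge_index)      # -> [num_nodes, out_dim]
out_gatv2 = gatv2(x, edge_index)  # -> [num_nodes, out_dim]
```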
Results on NYT50:

model | ROUGE-1 | ROUGE-2 | ROUGE-L |
---|---|---|---|
HeterSUMGraph (Wang) | 46.89 | 26.26 | 42.58 |
HeterSUMGraph (ours) | 45.5 ± 0.0 | 24.2 ± 0.0 | 34.1 ± 0.0 |
HSG GATConv | 45.4 ± 0.0 | 24.2 ± 0.0 | 34.0 ± 0.0 |
HSG GATv2Conv | 47.2 ± 0.0 | 26.5 ± 0.0 | 35.5* ± 0.0 |
HSGRNN GATv2Conv | 46.9 ± 0.0 | 26.3 ± 0.0 | 35.3 ± 0.0 |
*: the ROUGE-L computation may have changed in the rouge library I use.
Results on CNN-DailyMail:

model | ROUGE-1 | ROUGE-2 | ROUGE-L |
---|---|---|---|
SummaRuNNer(Nallapati) | 39.6 ± 0.2 | 16.2 ± 0.2 | 35.3 ± 0.2 |
HeterSUMGraph (ours) | 38.2 ± 0.0 | 15.1 ± 0.0 | 24.1 ± 0.0 |
HSG GATConv | 39.8 ± 0.0 | 16.3 ± 0.0 | 24.6 ± 0.0 |
HSG GATv2Conv | 39.9 ± 0.0 | 16.4 ± 0.0 | 24.7* ± 0.0 |
HSGRNN GATv2Conv | 39.5 ± 0.0 | 16.2 ± 0.0 | 24.4 ± 0.0 |
*: the ROUGE-L computation may have changed in the rouge library I use.
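For reference, scores like those above can be computed with the `rouge` pip package; which exact ROUGE implementation and version produced the tables is not specified here (hence the footnote), so treat this as a sketch:

```python
# Sketch: ROUGE-1/2/L F1 with the `rouge` pip package.
from rouge import Rouge

hypothesis = "the cat sat on the mat"
reference = "a cat was sitting on the mat"

scores = Rouge().get_scores(hypothesis, reference)[0]
print(scores["rouge-1"]["f"], scores["rouge-2"]["f"], scores["rouge-l"]["f"])
```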