This repository implements and compares HeterSUMGraph and several variants: HeterSUMGraph with GATConv or GATv2Conv layers, and a combination of HeterSUMGraph and SummaRuNNer (HeterSUMGraph used as the sentence encoder).
The datasets are CNN-DailyMail and NYT50.
Paper: HeterSUMGraph (Wang et al., 2020, "Heterogeneous Graph Neural Networks for Extractive Document Summarization").
```bash
git clone https://github.com/Baragouine/HeterSUMGraph.git
cd HeterSUMGraph

conda create --name HeterSUMGraph python=3.9
conda activate HeterSUMGraph

pip install -r requirements.txt
```
To install the nltk data:
- Open a Python console.
- Run `import nltk; nltk.download()`.
- In the downloader, download all data.
- Close the Python console.
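Alternatively, the same step can be done non-interactively; a minimal sketch (it downloads everything, as above):

```python
# Non-interactive alternative: fetch all nltk data without the downloader GUI.
import nltk

nltk.download("all")
```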
NYT50: preprocessing here means cleaning, labeling, etc., not the preprocessing applied right before training.
- Download the raw NYT zip from https://catalog.ldc.upenn.edu/LDC2008T19 to `data/`.
- Run `00-00-convert_nyt_to_json.ipynb` (converts the zip to json).
- Run `00-01-nyt_filter_short_summaries.ipynb` (keeps only articles whose summaries have at least 50 distinct words).
- Run `00-02-compute_nyt_labels.ipynb` (computes the labels).
- Run `python scripts/compute_tfidf_dataset.py -input data/nyt_corpus_LDC2008T19_50.json -output data/nyt50_dataset_tfidf.json -docs_col_name docs` (computes TF-IDF over the whole dataset; see the sketch after this list).
- Run `python scripts/compute_tfidf_sent_dataset.py -input data/nyt_corpus_LDC2008T19_50.json -output data/compute_tfidf_sent_dataset.json -docs_col_name docs` (computes TF-IDF for each document).
- Run `00-03-split_NYT50.ipynb` (splits NYT50 into train, val and test sets).
Computing TF-IDF is only necessary for HeterSUMGraph-based models.
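For intuition only, here is a minimal corpus-level TF-IDF sketch with scikit-learn on toy data; it is not the repository's `compute_tfidf_dataset.py`, and the output format is purely illustrative:

```python
# Sketch: corpus-level TF-IDF, i.e. statistics computed over the whole dataset.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat .",
    "the dog barked at the cat .",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)    # shape: [num_docs, vocab_size]
vocab = vectorizer.get_feature_names_out()  # column index -> word
idf = vectorizer.idf_                       # idf values shared across the dataset
```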
CNN-DailyMail: preprocessing here means cleaning, labeling, etc., not the preprocessing applied right before training.
- Follow the CNN-DailyMail preprocessing instructions at https://github.com/Baragouine/SummaRuNNer/tree/master.
- Once the labels are computed, run `00-03-merge_cnn_dailymail.ipynb` to merge CNN-DailyMail into a single json file.
- Run `python scripts/compute_tfidf_dataset.py -input data/cnn_dailymail.json -output data/cnn_dailymail_dataset_tfidf.json -docs_col_name article` (computes TF-IDF over the whole dataset).
- Run `python scripts/compute_tfidf_sent_dataset.py -input data/cnn_dailymail.json -output data/cnn_dailymail_sent_tfidf.json -docs_col_name article` (computes TF-IDF for each document; see the sketch after this list).
Computing TF-IDF is only necessary for HeterSUMGraph-based models.
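As a rough illustration of the per-document variant: one plausible reading of `compute_tfidf_sent_dataset.py` is that each document's sentences are vectorized against that document alone, which is the kind of weight HeterSUMGraph places on word-sentence edges. Both the interpretation and the code below are assumptions, not the repository's script:

```python
# Sketch (assumption): TF-IDF restricted to one document, its sentences acting
# as the corpus, so the weights are specific to that document.
from sklearn.feature_extraction.text import TfidfVectorizer

article = [
    "police reported the incident on monday .",
    "the incident took place downtown .",
    "no injuries were reported .",
]

vectorizer = TfidfVectorizer()
sent_tfidf = vectorizer.fit_transform(article)  # shape: [num_sentences, doc_vocab]
doc_vocab = vectorizer.get_feature_names_out()

# sent_tfidf[i, j] could then serve as the weight of the edge between
# sentence i and word j in a HeterSUMGraph-style graph (assumption).
```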
For training, you must use the GloVe 300-dimensional embeddings; they must be located at `data/glove.6B/glove.6B.300d.txt`.
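Loading that file is straightforward; a minimal sketch (the notebooks' own loading code may differ):

```python
# Sketch: read glove.6B.300d.txt into a word -> 300-d vector dictionary.
import numpy as np

glove = {}
with open("data/glove.6B/glove.6B.300d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        glove[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

print(len(glove), glove["the"].shape)  # 400000 words, (300,) per vector
```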
For CNN-DailyMail the maximum document length is 100 sentences, not 50 as in the paper (the same maximum as SummaRuNNer, so the two can be compared). Run one of the notebooks below to train and evaluate the corresponding model (a short sketch of the GATConv/GATv2Conv layers follows the list):
- `01-train_HeterSUMGraph_CNN_DailyMail.ipynb`: paper model on CNN-DailyMail.
- `02-train_HeterSUMGraph_NYT50.ipynb`: paper model on NYT50.
- `03-train_HeterSUMGraph_CNN_DailyMail_TG_GATConv.ipynb`: HeterSUMGraph with the torch_geometric GATConv layer on CNN-DailyMail.
- `04-train_HeterSUMGraph_NYT50_TG_GATConv.ipynb`: HeterSUMGraph with the torch_geometric GATConv layer on NYT50.
- `05-train_HeterSUMGraph_CNN_DailyMail_TG_GATv2Conv.ipynb`: HeterSUMGraph with the torch_geometric GATv2Conv layer on CNN-DailyMail.
- `06-train_HeterSUMGraph_NYT50_TG_GATv2Conv.ipynb`: HeterSUMGraph with the torch_geometric GATv2Conv layer on NYT50.
- `07-train_HSGRNN_CNN_DailyMail_TG_GATv2Conv.ipynb`: HeterSUMGraph with the torch_geometric GATv2Conv layer + SummaRuNNer on CNN-DailyMail.
- `08-train_HSGRNN_NYT50_TG_GATv2Conv.ipynb`: HeterSUMGraph with the torch_geometric GATv2Conv layer + SummaRuNNer on NYT50.
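The GATConv and GATv2Conv variants differ only in which torch_geometric layer is plugged in; here is a minimal, self-contained sketch of the two layers on toy shapes, not the repository's model code:

```python
# Sketch: swapping GATConv for GATv2Conv in torch_geometric.
import torch
from torch_geometric.nn import GATConv, GATv2Conv

num_nodes, in_dim, out_dim = 5, 300, 64
x = torch.randn(num_nodes, in_dim)                         # toy node features
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]])    # toy edges (source, target)

gat = GATConv(in_dim, out_dim, heads=8, concat=False)      # original GAT attention
gatv2 = GATv2Conv(in_dim, out_dim, heads=8, concat=False)  # GATv2 "dynamic" attention

out_gat = gat(x, edge_index)      # -> [num_nodes, out_dim]
out_gatv2 = gatv2(x, edge_index)  # -> [num_nodes, out_dim]
```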
Results on NYT50:

model | ROUGE-1 | ROUGE-2 | ROUGE-L |
---|---|---|---|
HeterSUMGraph (Wang) | 46.89 | 26.26 | 42.58 |
HeterSUMGraph (ours) | 45.5 ± 0.0 | 24.2 ± 0.0 | 34.1 ± 0.0 |
HSG GATConv | 45.4 ± 0.0 | 24.2 ± 0.0 | 34.0 ± 0.0 |
HSG GATv2Conv | 47.2 ± 0.0 | 26.5 ± 0.0 | 35.5* ± 0.0 |
HSGRNN GATv2Conv | 46.9 ± 0.0 | 26.3 ± 0.0 | 35.3 ± 0.0 |
*: the ROUGE-L computation may have changed in the rouge library I use.
Results on CNN-DailyMail:

model | ROUGE-1 | ROUGE-2 | ROUGE-L |
---|---|---|---|
SummaRuNNer(Nallapati) | 39.6 ± 0.2 | 16.2 ± 0.2 | 35.3 ± 0.2 |
HeterSUMGraph (ours) | 38.2 ± 0.0 | 15.1 ± 0.0 | 24.1 ± 0.0 |
HSG GATConv | 39.8 ± 0.0 | 16.3 ± 0.0 | 24.6 ± 0.0 |
HSG GATv2Conv | 39.9 ± 0.0 | 16.4 ± 0.0 | 24.7* ± 0.0 |
HSGRNN GATv2Conv | 39.5 ± 0.0 | 16.2 ± 0.0 | 24.4 ± 0.0 |
*: the ROUGE-L computation may have changed in the rouge library I use.
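For reference, scores like those above can be computed with the `rouge` pip package; which exact ROUGE implementation and version produced the tables is not specified here (hence the footnote), so treat this as a sketch:

```python
# Sketch: ROUGE-1/2/L F1 with the `rouge` pip package.
from rouge import Rouge

hypothesis = "the cat sat on the mat"
reference = "a cat was sitting on the mat"

scores = Rouge().get_scores(hypothesis, reference)[0]
print(scores["rouge-1"]["f"], scores["rouge-2"]["f"], scores["rouge-l"]["f"])
```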