Ordering Sentences and Paragraphs with Pre-trained Encoder-Decoder Transformers and Pointer Ensembles
Code, models, and data for the paper *Ordering Sentences and Paragraphs with Pre-trained Encoder-Decoder Transformers and Pointer Ensembles*, under review at DocEng2021.
Create the environment, activate it, and install the requirements.
conda create -n ordering python=3.7
conda activate ordering
pip install -r requirements.txt
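As an optional sanity check (not part of the original instructions), the core libraries used throughout this README should be importable after installation; this assumes the requirements include `torch`, `transformers`, and `datasets`:

```python
# Optional sanity check: these imports should succeed after `pip install -r requirements.txt`.
import torch
import transformers
import datasets

print(transformers.__version__, datasets.__version__)
```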
ArXiv, VIST, ROCStory and Wikipedia are stored on Google Drive.
The datasets can be downloaded using `gdown`. For CNN-DailyMail, there is no need to download the data (see below).
pip install gdown==3.12.2
gdown https://drive.google.com/uc?id=0B-mnK8kniGAieXZtRmRzX2NSVDg
mkdir dataset/arxiv
tar -xf dataSet.tgz -C dataset/arxiv
rm dataSet.tgz
gdown https://drive.google.com/uc?id=1Arc5vnthfeg6qEHpKU_--y6MKZd5DM78
unzip vist.zip -d dataset/vist
rm vist.zip
gdown https://drive.google.com/uc?id=1xXuy_7XWzgiwS4tYdclKizvmg_MZ-LLX
unzip ROCStory.zip -d dataset/rocstory
rm ROCStory.zip
gdown https://drive.google.com/uc?id=1B05WiMNKYKjsi1TEweexu01GHElWwIDJ
unzip best_wikipedia.zip -d dataset/best_wikipedia
rm best_wikipedia.zip
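As a quick check (not part of the original instructions), the following sketch verifies that the archives were extracted into the directories created by the commands above:

```python
# Hedged sketch: confirm each dataset directory exists and is non-empty.
import os

for name in ["arxiv", "vist", "rocstory", "best_wikipedia"]:
    path = os.path.join("dataset", name)
    status = "ok" if os.path.isdir(path) and os.listdir(path) else "missing or empty"
    print(f"{path}: {status}")
```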
We use the `datasets` library from HuggingFace to load and access the datasets. The custom dataset loading scripts are in the `dataset/` folder. The loading scripts use the downloaded datasets, except for CNN-DailyMail, whose loading script downloads the dataset itself.
To load a dataset, run:
from datasets import load_dataset

# Point load_dataset to one of the loading scripts in the dataset/ folder.
dataset = load_dataset("path/to/dataset/python/file")
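Once loaded, the dataset behaves like any HuggingFace `datasets` object. The script path, split name, and indexing below are illustrative assumptions; the actual splits and columns depend on each loading script:

```python
# Hedged sketch: inspect a loaded dataset (path and split names are illustrative).
from datasets import load_dataset

dataset = load_dataset("dataset/arxiv/arxiv.py")  # hypothetical path to a loading script
print(dataset)              # lists the available splits and their sizes
print(dataset["train"][0])  # shows the fields of the first training example
```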
We present three models in the paper:
Model | Number of parameters |
---|---|
BART + Simple Pointer | 140601600 |
BART + Deep Pointer | 144145152 |
BART + Ensemble Pointer | 140601612 |
`multi` in the code or in a filename corresponds to the Ensemble Pointer in the paper.
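If the model classes follow the standard HuggingFace `PreTrainedModel` interface (an assumption; check the classes in `training/scripts/models/`), the parameter counts in the table can be reproduced with a short sketch:

```python
# Hedged sketch: count the parameters of the Ensemble (multi) pointer model,
# assuming the class exposes the usual from_pretrained interface.
from training.scripts.models.bart_multi import BartForSequenceOrderingWithMultiPointer

model = BartForSequenceOrderingWithMultiPointer.from_pretrained("facebook/bart-base")
print(sum(p.numel() for p in model.parameters()))  # should be close to the table above
```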
We build our models on top of the `transformers` library from HuggingFace and use its `Trainer` to train them.
The configuration files for the models presented in the paper are in the `training/args/` folder.
To retrain a model, run:
python run.py --model model --args_file path/to/json/file
`model` is the model to train (`default` for BART + Simple PtrNet, `deep` for BART + Deep PtrNet, `multi` for BART + Ensemble PtrNet, or `baseline` for our LSTM+Attention baseline) and `path/to/json/file` is the path to the configuration file to use (note that the configuration file should correspond to `model`).
To change the training parameters, you can directly edit the configuration file or create a new one.
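For example, a new configuration can be derived from an existing one by loading, editing, and re-saving the JSON. The file names and keys below are assumptions; the available keys depend on the `Trainer` arguments used by `run.py`:

```python
# Hedged sketch: derive a new training configuration from an existing one.
# File names and keys are illustrative; check training/args/ for real configs.
import json

with open("training/args/existing_config.json") as f:
    args = json.load(f)

args["num_train_epochs"] = 3          # assumed HuggingFace TrainingArguments key
args["output_dir"] = "models/my-run"  # assumed key

with open("training/args/my_config.json", "w") as f:
    json.dump(args, f, indent=2)
```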
To evaluate the models on a dataset, we provide configuration files in the `evaluation/args/` folder.
To run the evaluation, run:
from evaluation.benchmark import Benchmark

# Load the evaluation configuration and run the benchmark on the chosen dataset.
ben = Benchmark.from_json("path/to/json/file")
df = ben.run()
print(df)
>> *dataframe containing the results*
Results may vary slightly between runs because the input passages are shuffled randomly.
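If `run()` returns a pandas DataFrame (as the output above suggests), the results can be persisted for later comparison; the paths below are placeholders:

```python
# Hedged sketch: run a benchmark and save the results (paths are placeholders).
from evaluation.benchmark import Benchmark

ben = Benchmark.from_json("evaluation/args/example_config.json")
df = ben.run()
df.to_csv("evaluation_results.csv", index=False)
```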
Use the `OrderingModel` class from `use.py`.
For example, to use BART + multi PtrNet trained on the Wikipedia dataset:
from use import OrderingModel
from training.scripts.models.bart_multi import BartForSequenceOrderingWithMultiPointer

# OrderingModel takes the model class, the path to the fine-tuned checkpoint,
# and the pre-trained base model name.
model = OrderingModel(BartForSequenceOrderingWithMultiPointer, "models/bart-base-multi-best-wikipedia", "facebook/bart-base")

PASSAGES_TO_ORDER = ["p3", "p2", "p1", "p4"]
model.order(PASSAGES_TO_ORDER)
>> ["p1", "p2", "p3", "p4"]
We train each of our three models on a single Quadro RTX 6000 GPU; approximate training times per dataset:
- arXiv: 68 hours
- VIST: 4 hours
- ROCStory: 6 hours
- Wikipedia: 25 hours
- CNN-DailyMail: 14 hours