This repository contains the code to run pivot-based text simplification for the Dutch medical and municipal domains. The full pipeline consists of three models:
- 1st model (MNL→EN): Translates complex Dutch sentences to complex English sentences
- 2nd model (MC→S): Simplifies complex English sentences to simple English sentences
- 3rd model (MEN→NL): Translates simple English sentences to simple Dutch sentences
In addition to the training code, the repo contains code for evaluating the pipeline's quality using automatic evaluation metrics (BLEU, SARI, METEOR).
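As a rough illustration of how the three models chain together, the sketch below composes three stages into one callable. It is not the repository's code: `make_pivot_pipeline`, the `Stage` type and the tagging placeholder stages are hypothetical names, and in practice each stage would wrap a trained translation or simplification model.

```python
from typing import Callable, List

# A stage maps a batch of sentences to a batch of sentences.
# (Hypothetical type alias, used only for this illustration.)
Stage = Callable[[List[str]], List[str]]


def make_pivot_pipeline(nl_to_en: Stage, simplify_en: Stage, en_to_nl: Stage) -> Stage:
    """Chain the stages: Dutch -> English, complex English -> simple English, English -> Dutch."""

    def run(dutch_sentences: List[str]) -> List[str]:
        complex_english = nl_to_en(dutch_sentences)
        simple_english = simplify_en(complex_english)
        return en_to_nl(simple_english)

    return run


def tagging_stage(label: str) -> Stage:
    """Placeholder stage that only wraps its input in a label, to make the data flow visible."""
    return lambda batch: [f"{label}({sentence})" for sentence in batch]


if __name__ == "__main__":
    pipeline = make_pivot_pipeline(
        tagging_stage("nl2en"), tagging_stage("simplify"), tagging_stage("en2nl")
    )
    print(pipeline(["Een complexe zin."]))
```

Because the stages only share a list-of-strings interface, any one of them can be swapped (for example, a different simplification model) without touching the other two.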
There are the following folders in the structure:
- scripts: Folder with the scripts used to perform all experiments, including individual bash scripts for each of the pivot-based model pipelines and a Python script for the GPT-based experiment.
- src: Folder containing all supporting code, such as preprocessing and filtering scripts, tokenization, extraction of domain-specific subsets of the translation corpora, etc.
- config: Folder containing configuration files for the training of the models.
- examples: Folder containing examples of translations and simplifications of sentences for the different pipelines, as well as the manual numerical preservation reviews.
- NMT-Data: Folder where all data will be downloaded and models will be saved.
- media: Folder containing media files for demo purposes.
You can install this repo by following these steps:
- Clone this repository:
git clone https://github.com/Amsterdam-Internships/Text_Simplification
- Install all dependencies:
pip install -r requirements.txt
The scripts folder contains individual bash scripts for all of our experiments.
Each script is self-sufficient and covers the full setup and execution of the experiment:
- installation of requirements
- downloading corresponding data
- optionally extracting a domain-specific subset of the translation corpora
- preprocessing, filtering and tokenization of the data
- all steps required for the training of each of the translation models (using OpenNMT)
- inference and evaluation on the test dataset
To run the medical pipeline, the scripts expect evaluation data to be uploaded:
- Original sentences to NMT-Data/Eval_Medical_Dutch_C_Dutch_S/NL_test_org
- Simplified sentences to NMT-Data/Eval_Medical_Dutch_C_Dutch_S/NL_test_simp
The medical evaluation data was provided by the author of:
Evers, Marloes. Low-Resource Neural Machine Translation for Simplification of Dutch Medical Text. Diss. Tilburg University, 2021.
To run the municipal pipeline, the scripts expect evaluation data to be uploaded:
- Original sentences to NMT-Data/Eval_Municipal/complex
- Simplified sentences to NMT-Data/Eval_Municipal/simple
The municipal evaluation data and further details about it can be found in this repository.
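Before launching a pipeline script, it can help to confirm that the evaluation data is in place. The sketch below is not part of the repository: `check_eval_data` and `EVAL_FILES` are hypothetical names, and it assumes that each listed path is a plain UTF-8 text file with one sentence per line, so that the original and simplified files should have the same number of lines. Adjust it if your data is laid out differently.

```python
from pathlib import Path
from typing import Dict, List, Tuple

# Paths as listed in this README, relative to the repository root.
# Assumption: each path is a text file with one sentence per line.
EVAL_FILES: Dict[str, Tuple[str, str]] = {
    "medical": (
        "NMT-Data/Eval_Medical_Dutch_C_Dutch_S/NL_test_org",
        "NMT-Data/Eval_Medical_Dutch_C_Dutch_S/NL_test_simp",
    ),
    "municipal": (
        "NMT-Data/Eval_Municipal/complex",
        "NMT-Data/Eval_Municipal/simple",
    ),
}


def count_lines(path: Path) -> int:
    """Count lines without loading the whole file into memory."""
    with path.open(encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_eval_data(domain: str, root: Path = Path(".")) -> List[str]:
    """Return a list of human-readable problems; an empty list means the check passed."""
    original, simplified = (root / relative for relative in EVAL_FILES[domain])
    problems = [f"missing file: {p}" for p in (original, simplified) if not p.is_file()]
    if problems:
        return problems
    n_org, n_simp = count_lines(original), count_lines(simplified)
    if n_org != n_simp:
        problems.append(f"line count mismatch: {n_org} original vs {n_simp} simplified")
    return problems


if __name__ == "__main__":
    for domain in EVAL_FILES:
        issues = check_eval_data(domain)
        print(domain, "OK" if not issues else issues)
```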
In many of our experiments we use in-domain data, extracted from the OpenSubtitles corpus on the basis of similarity to a reference corpus. To generate this in-domain data, use the following script:
python src/extract_sentences.py
If you wish to create your own in-domain subset, you can substitute the reference_file and the output paths for the Dutch and English parts of the extracted subset, and tweak other arguments such as encoding_method and num_samples.
For full documentation, run python src/extract_sentences.py --help
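To make the idea of similarity-based extraction concrete, here is a minimal, self-contained sketch. It does not reproduce src/extract_sentences.py: the function names are hypothetical, a simple bag-of-words cosine similarity stands in for whatever the script's encoding_method selects, and it assumes the reference corpus is Dutch, so sentence pairs are scored on their Dutch side. Only the roles of the reference corpus and num_samples are taken from the description above.

```python
import math
from collections import Counter
from typing import List, Tuple


def bag_of_words(text: str) -> Counter:
    """Very rough sentence encoding: lowercase whitespace tokens with counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def select_in_domain(
    pairs: List[Tuple[str, str]],
    reference_sentences: List[str],
    num_samples: int,
) -> List[Tuple[str, str]]:
    """Keep the num_samples (Dutch, English) pairs whose Dutch side is closest to the reference corpus."""
    # Represent the whole reference corpus as one aggregated count vector.
    reference = Counter()
    for sentence in reference_sentences:
        reference.update(bag_of_words(sentence))
    ranked = sorted(
        pairs,
        key=lambda pair: cosine(bag_of_words(pair[0]), reference),
        reverse=True,
    )
    return ranked[:num_samples]


if __name__ == "__main__":
    corpus = [
        ("ik hou van films", "i love movies"),
        ("de patiënt krijgt medicatie", "the patient gets medication"),
    ]
    reference = ["de patiënt neemt medicatie in"]
    print(select_in_domain(corpus, reference, num_samples=1))
```

A dense sentence encoder would usually capture domain similarity better than word overlap; the ranking-and-truncation step stays the same either way.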
This repository was created in collaboration with Amsterdam Intelligence for the City of Amsterdam.
We thank the Communications Department of the City of Amsterdam for providing us with a set of simplified documents which has been used for the creation of the municipal evaluation dataset.
We thank Marloes Evers for providing us with the medical evaluation dataset.
Our code uses preprocessing scripts from MT-Preparation