Automatic Text Simplification for Low Resource Languages using a Pivot Approach

This repository contains the code to run pivot-based text simplification for the Dutch medical and municipal domains. The full pipeline consists of three models:

1st model (M^NL→EN): Translates complex Dutch sentences into complex English sentences
2nd model (M^C→S): Simplifies complex English sentences into simple English sentences
3rd model (M^EN→NL): Translates simple English sentences into simple Dutch sentences
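
The chaining of the three models can be sketched as plain function composition. This is only an illustration of the pivot idea: the function and parameter names below are hypothetical and not taken from the repository, and each model is assumed to be any callable that maps a list of sentences to a list of sentences.

```python
from typing import Callable, List

# Assumption: a "model" is any callable from sentences to sentences.
Model = Callable[[List[str]], List[str]]


def pivot_simplify(
    sentences: List[str],
    nl_to_en: Model,      # stands in for M^NL→EN
    simplify_en: Model,   # stands in for M^C→S
    en_to_nl: Model,      # stands in for M^EN→NL
) -> List[str]:
    """Run complex Dutch sentences through the three pivot stages in order."""
    complex_en = nl_to_en(sentences)      # complex Dutch -> complex English
    simple_en = simplify_en(complex_en)   # complex English -> simple English
    return en_to_nl(simple_en)            # simple English -> simple Dutch
```

Because each stage only sees the output of the previous one, translation errors in the first stage propagate through the whole pipeline, which is why the quality of each model is evaluated.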

In addition to training the models, the repository contains code for evaluating the pipeline's output quality with automatic evaluation metrics (BLEU, SARI, METEOR).
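
To give an intuition for what these metrics measure, the sketch below computes the clipped (modified) n-gram precision that forms the core of BLEU. It is a simplified illustration with hypothetical function names, assuming whitespace tokenization and a single reference; it is not the evaluation code used in this repository, and full BLEU additionally combines several n-gram orders and applies a brevity penalty.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis, reference, n=1):
    """Fraction of hypothesis n-grams that also occur in the reference.

    Each n-gram is credited at most as often as it appears in the
    reference, so repeating a correct word does not inflate the score.
    """
    hyp = ngrams(hypothesis.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total
```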

Figure 1. Pivot pipeline for text simplification

Project Folder Structure

The repository contains the following folders:

scripts: Folder with the scripts used to run all experiments, including an individual bash script for each pivot-based pipeline and a Python script for the GPT-based experiment.
src: Folder containing all supporting code, such as preprocessing and filtering scripts, tokenization, extraction of domain-specific subsets of the translation corpora, etc.
config: Folder containing configuration files for the training of the models
examples: Folder containing examples of translations and simplifications of sentences for the different pipelines, as well as the manual numerical preservation reviews.
NMT-Data: Folder where all data will be downloaded and models will be saved
media: Folder containing media files for demo purposes

Installation

You can install this repo by following these steps:

Clone this repository:

    git clone https://github.com/Amsterdam-Internships/Text_Simplification

Install all dependencies:
```
pip install -r requirements.txt
```

Usage

The scripts folder contains an individual bash script for each of our experiments. Each script is self-contained and covers the full setup and execution of the experiment:

installation of requirements
downloading corresponding data
optionally extracting a domain-specific subset of the translation corpora
preprocessing, filtering and tokenization of the data
all steps required for the training of each of the translation models (using OpenNMT)
inference and evaluation on the test dataset

Medical pipeline

To run the medical pipeline, the scripts expect evaluation data to be uploaded:

Original sentences to NMT-Data/Eval_Medical_Dutch_C_Dutch_S/NL_test_org
Simplified sentences to NMT-Data/Eval_Medical_Dutch_C_Dutch_S/NL_test_simp

The medical evaluation data has been provided by the authors of

Evers, Marloes. Low-Resource Neural Machine Translation for Simplification of Dutch Medical Text. Diss. Tilburg University, 2021.

Municipal pipeline

To run the municipal pipeline, the scripts expect evaluation data to be uploaded:

Original sentences to NMT-Data/Eval_Municipal/complex
Simplified sentences to NMT-Data/Eval_Municipal/simple
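
Before launching a pipeline, it can help to confirm that the evaluation files are in place. The following sketch uses a hypothetical helper that is not part of the repository, and it assumes the files are UTF-8 text with one sentence per line, aligned line by line between the original and simplified versions.

```python
from pathlib import Path


def count_lines(path):
    """Count the lines in a UTF-8 text file."""
    with Path(path).open(encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_parallel_files(original, simplified):
    """Return the shared line count, or raise if the pair is unusable.

    Assumption: both files hold one sentence per line and are aligned.
    """
    for path in (original, simplified):
        if not Path(path).is_file():
            raise FileNotFoundError(f"Missing evaluation file: {path}")
    n_orig, n_simp = count_lines(original), count_lines(simplified)
    if n_orig != n_simp:
        raise ValueError(
            f"Line counts differ: {n_orig} original vs {n_simp} simplified"
        )
    return n_orig


# Example calls with the paths named in this README:
# check_parallel_files("NMT-Data/Eval_Medical_Dutch_C_Dutch_S/NL_test_org",
#                      "NMT-Data/Eval_Medical_Dutch_C_Dutch_S/NL_test_simp")
# check_parallel_files("NMT-Data/Eval_Municipal/complex",
#                      "NMT-Data/Eval_Municipal/simple")
```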

The municipal evaluation data and further details about it can be found in this repository.

In-domain data extraction

In many of our experiments we use in-domain data, extracted from the OpenSubtitles corpus on the basis of similarity to a reference corpus. To generate this in-domain data, run the following script:

    python src/extract_sentences.py

If you wish to create your own in-domain subset, you can substitute the reference_file and the output paths for the Dutch and English parts of the extracted subset, and tweak other arguments such as encoding_method and num_samples. For full documentation, run python src/extract_sentences.py --help
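
The general idea of similarity-based selection can be sketched as follows. This is not the implementation in src/extract_sentences.py: it assumes a simple bag-of-words encoding and cosine similarity to the pooled reference corpus, whereas the actual script's encoding is controlled by encoding_method. All function names are hypothetical; only num_samples mirrors an argument named above. The sketch ranks one side of the corpus, and in a parallel corpus the aligned sentences on the other side would be kept alongside.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Encode a sentence as lowercase token counts (assumed encoding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def select_in_domain(candidates, reference_sentences, num_samples):
    """Keep the num_samples candidates most similar to the reference corpus."""
    reference = Counter()
    for sentence in reference_sentences:
        reference.update(bag_of_words(sentence))
    ranked = sorted(
        candidates,
        key=lambda s: cosine(bag_of_words(s), reference),
        reverse=True,
    )
    return ranked[:num_samples]
```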

Acknowledgements

This repository was created in collaboration with Amsterdam Intelligence for the City of Amsterdam.

We thank the Communications Department of the City of Amsterdam for providing us with a set of simplified documents which has been used for the creation of the municipal evaluation dataset.

We thank Marloes Evers for providing us with the medical evaluation dataset.

Our code uses preprocessing scripts from MT-Preparation.
