
Automatic Text Simplification for Low Resource Languages using a Pivot Approach

This repository contains the code to run pivot-based text simplification for the Dutch medical and municipal domains. The full pipeline consists of three models:

  • 1st model (M_NL→EN): translates complex Dutch sentences into complex English sentences
  • 2nd model (M_C→S): simplifies complex English sentences into simple English sentences
  • 3rd model (M_EN→NL): translates simple English sentences into simple Dutch sentences

In addition to code for training the models, the repository contains code for evaluating the pipeline's quality with a number of automatic evaluation metrics (BLEU, SARI, METEOR).
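
As an illustration only, BLEU can be computed from the command line with the sacrebleu tool (an assumption for this sketch; the repository's own evaluation is wired into the experiment scripts, and the file names below are placeholders). SARI additionally requires the original complex sentences, so it is typically computed with a dedicated toolkit such as EASSE.

    # Hypothetical example: score a system output against a reference file.
    # sacrebleu is not part of this repo's stated tooling; file names are placeholders.
    pip install sacrebleu
    sacrebleu NL_test_simp -i pipeline_output.txt -m bleu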


Figure 1. Pivot pipeline for text simplification

Project Folder Structure

The repository contains the following folders:

  1. scripts: Folder with the scripts used to run all experiments, including an individual bash script for each of the pivot-based model pipelines and a Python script for the GPT-based experiment.
  2. src: Folder containing all supporting code, such as preprocessing and filtering scripts, tokenization, extraction of domain-specific subsets of the translation corpora, etc.
  3. config: Folder containing configuration files for the training of the models.
  4. examples: Folder containing examples of translations and simplifications of sentences for the different pipelines, as well as the manual numerical preservation reviews.
  5. NMT-Data: Folder where all data will be downloaded and models will be saved.
  6. media: Folder containing media files for demo purposes.

Installation

You can set up this repository by following these steps:

  1. Clone this repository:

    git clone https://github.com/Amsterdam-Internships/Text_Simplification
  2. Change into the repository folder and install all dependencies:

    cd Text_Simplification
    pip install -r requirements.txt

Usage

The scripts folder contains an individual bash script for each of our experiments. Each script is self-contained and covers the full setup and execution of the experiment (see the example after this list):

  • installation of requirements
  • downloading the corresponding data
  • extracting a domain-specific subset of the translation corpora, where applicable
  • preprocessing, filtering and tokenization of the data
  • all steps required for the training of each of the translation models (using OpenNMT)
  • inference and evaluation on the test dataset
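
For example, a full experiment can be launched with a single command (the script name below is a placeholder; pick any script from the scripts folder):

    bash scripts/<experiment_name>.sh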

Medical pipeline

To run the medical pipeline, the scripts expect the evaluation data to be placed as follows (see the sketch after this list):

  • Original sentences to NMT-Data/Eval_Medical_Dutch_C_Dutch_S/NL_test_org
  • Simplified sentences to NMT-Data/Eval_Medical_Dutch_C_Dutch_S/NL_test_simp
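
A minimal sketch of how to put the files in place, assuming your evaluation files live elsewhere on disk (the source paths are placeholders; only the destination paths are prescribed by the scripts):

    mkdir -p NMT-Data/Eval_Medical_Dutch_C_Dutch_S
    cp /path/to/original_sentences NMT-Data/Eval_Medical_Dutch_C_Dutch_S/NL_test_org
    cp /path/to/simplified_sentences NMT-Data/Eval_Medical_Dutch_C_Dutch_S/NL_test_simp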

The medical evaluation data was provided by the author of:

Evers, Marloes. Low-Resource Neural Machine Translation for Simplification of Dutch Medical Text. Diss. Tilburg University, 2021.

Municipal pipeline

To run the municipal pipeline, the scripts expect the evaluation data to be placed as follows (see the sketch after this list):

  • Original sentences to NMT-Data/Eval_Municipal/complex
  • Simplified sentences to NMT-Data/Eval_Municipal/simple
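
The same placement sketch applies here (source paths are again placeholders):

    mkdir -p NMT-Data/Eval_Municipal
    cp /path/to/complex_sentences NMT-Data/Eval_Municipal/complex
    cp /path/to/simple_sentences NMT-Data/Eval_Municipal/simple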

The municipal evaluation data and further details about it can be found in this repository.

In-domain data extraction

In many of our experiments we use in-domain data, extracted from the OpenSubtitles corpus on the basis of similarity to a reference corpus. To generate this in-domain data, use the following script:

    python src/extract_sentences.py

If you wish to create your own in-domain subset, you can substitute the reference_file and the output paths for the Dutch and English parts of the extracted subset, as well as tweak other arguments such as encoding_method and num_samples. For full documentation, run python src/extract_sentences.py --help.
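
A hypothetical invocation (reference_file, encoding_method, and num_samples are named in the documentation above; the output-path flag names and all values are assumptions, so check --help for the actual interface):

    # Flag names for the output paths are guesses; values are placeholders.
    python src/extract_sentences.py \
        --reference_file data/my_domain_reference.txt \
        --output_nl NMT-Data/my_subset.nl \
        --output_en NMT-Data/my_subset.en \
        --encoding_method <method> \
        --num_samples 100000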


Acknowledgements

This repository was created in collaboration with Amsterdam Intelligence for the City of Amsterdam.

We thank the Communications Department of the City of Amsterdam for providing us with a set of simplified documents that was used to create the municipal evaluation dataset.

We thank Marloes Evers for providing us with the medical evaluation dataset.

Our code uses preprocessing scripts from MT-Preparation.
