Refactor/documentation (#2)
* NMT data for gitignore
* readme: add pipeline visual
* readme: intro
* readme: updated folder structure
* readme: usage and acknowledgements
* remove City of Amsterdam from acknowledgements
igornishka authored Oct 17, 2023
1 parent eb606d7 commit 7f4ad2c
Showing 3 changed files with 71 additions and 25 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -12,3 +12,6 @@ __pycache__/

# other
.idea/*

# repo-specific
NMT-Data/*
93 changes: 68 additions & 25 deletions README.md
@@ -1,14 +1,19 @@
# Automatic Text Simplification for Low Resource Languages using a Pivot Approach

This repository contains the code to run pivot-based text simplification for the Dutch medical and municipal domains.
The full pipeline consists of three models (chained as sketched below):
* 1st model (M<sup>NL&rarr;EN</sup>): translates complex Dutch sentences into complex English sentences
* 2nd model (M<sup>C&rarr;S</sup>): simplifies complex English sentences into simple English sentences
* 3rd model (M<sup>EN&rarr;NL</sup>): translates simple English sentences into simple Dutch sentences

On top of training the models, the repo contains code for evaluating the pipeline's quality using a number of automatic evaluation metrics (BLEU, SARI, METEOR).

[//]: # (![]&#40;./media/pivot_pipeline_TS.png&#41;)
<div align="center">
<img src="./media/pivot_pipeline_TS.png" width="600"/>
<br>
<em>Figure 1. Pivot pipeline for text simplification</em>
</div>
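
As a rough illustration of the chain, the three inference steps could be run with OpenNMT-py's `onmt_translate` along the following lines (model and file names here are hypothetical placeholders, and the subwording/desubwording steps between models are omitted):

```commandline
# Illustrative sketch only -- model and file names are placeholders.
# Step 1 (M NL->EN): complex Dutch -> complex English
onmt_translate -model nl2en.pt -src test.complex.nl -output test.complex.en
# Step 2 (M C->S): complex English -> simple English
onmt_translate -model en_simplify.pt -src test.complex.en -output test.simple.en
# Step 3 (M EN->NL): simple English -> simple Dutch
onmt_translate -model en2nl.pt -src test.simple.en -output test.simple.nl
```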

## Project Folder Structure

@@ -17,11 +22,17 @@ Explain briefly what's where so people can find their way around. For example:
The repository contains the following folders:


1) [`scripts`](./scripts): Folder with the scripts used to perform all experiments, including individual bash scripts for each of the pivot-based model pipelines and a Python script for the [gpt-based experiment](./scripts/chatgpt.py).
1) [`src`](./src): Folder containing all supporting code, such as preprocessing and filtering scripts, tokenization, and extraction of domain-specific subsets of the translation corpora.
1) [`config`](./config): Folder containing configuration files for training the models.
1) [`NMT-Data`](./NMT-Data): Folder where all data will be downloaded and models will be saved.
1) [`media`](./media): Folder containing media files for demo purposes.

[//]: # (1&#41; [`notebooks`]&#40;./notebooks&#41;: folder containing notebooks for running the pipeline as well as data-processing scripts for filtering, subwording, desubwording and splitting data)

## Installation
You can install this repo by following these steps:
@@ -38,26 +49,58 @@ You can install this repo by following these steps:
---

## Usage

The [`scripts`](./scripts) folder contains individual bash scripts for all of our experiments.
Each script is self-sufficient and covers the full setup and execution of the experiment:
- installation of requirements
- downloading the corresponding data
- optionally, extracting domain-specific subsets of the translation corpora
- preprocessing, filtering and tokenization of the data
- all steps required to train each of the translation models (using OpenNMT)
- inference and evaluation on the test dataset
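
As an illustration of the final step, the pipeline output can be scored against the reference simplifications with the third-party sacreBLEU tool (file names below are hypothetical; the actual evaluation commands live in the pipeline scripts):

```commandline
# Requires: pip install sacrebleu
sacrebleu NL_test_simp -i predictions.nl -m bleu
# SARI and METEOR are not covered by sacreBLEU; packages such as EASSE and NLTK provide them.
```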

### Medical pipeline
To run the medical pipeline, the scripts expect the evaluation data to be in place:
* Original sentences in `NMT-Data/Eval_Medical_Dutch_C_Dutch_S/NL_test_org`
* Simplified sentences in `NMT-Data/Eval_Medical_Dutch_C_Dutch_S/NL_test_simp`
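
For example, assuming your evaluation files are named `originals.txt` and `simplifications.txt` (hypothetical names; adjust to your data), they can be put in place with:

```commandline
mkdir -p NMT-Data/Eval_Medical_Dutch_C_Dutch_S
cp originals.txt NMT-Data/Eval_Medical_Dutch_C_Dutch_S/NL_test_org
cp simplifications.txt NMT-Data/Eval_Medical_Dutch_C_Dutch_S/NL_test_simp
```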

The medical evaluation data was provided by the author of
> Evers, Marloes. *Low-Resource Neural Machine Translation for Simplification of Dutch Medical Text*. Diss. Tilburg University, 2021.

### Municipal pipeline
To run the municipal pipeline, the scripts expect the evaluation data to be in place:
* Original sentences in `NMT-Data/Eval_Municipal/complex`
* Simplified sentences in `NMT-Data/Eval_Municipal/simple`

The municipal evaluation data is available on request for research purposes.

[//]: # (Please contact [Iva Gornishka]&#40;i.gornishka@amsterdam.nl&#41;.)
[//]: # (TODO: Uncomment after anonymity period.)

### In-domain data extraction
In many of our experiments, we use in-domain data extracted from the OpenSubtitles corpus on the basis of similarity to a reference corpus.
To generate this in-domain data, use the following script:

```commandline
python scripts/extract_sentences.py
```

If you wish to create your own in-domain subset, you can substitute the reference_file
and the output paths for the Dutch and English parts of the extracted subset,
as well as tweak other arguments such as encoding_method and num_samples.
For full documentation, run `python scripts/extract_sentences.py --help`.
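
For instance, a customized run might look like the following (the argument names are inferred from the description above and the values are purely illustrative — check `--help` for the actual interface):

```commandline
python scripts/extract_sentences.py \
    --reference_file my_domain_corpus.nl \
    --encoding_method tfidf \
    --num_samples 100000
```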


Different pipeline setups are available in the [`scripts`](./scripts) folder.
[//]: # (This repository was created in collaboration with [Amsterdam Intelligence]&#40;https://amsterdamintelligence.com/&#41; )
[//]: # (for the [City of Amsterdam]&#40;https://www.amsterdam.nl/innovation/&#41;.)

[//]: # (We thank the [Communications Department of the City of Amsterdam]&#40;https://www.amsterdam.nl/bestuur-organisatie/organisatie/bedrijfsvoering/directie-communicatie/&#41; )
[//]: # (for providing us with a set of simplified documents which has been used for the creation of the municipal evaluation dataset.)

[//]: # (TODO: Uncomment after anonymity period)

We thank [Marloes Evers](https://www.linkedin.com/in/marloes-evers-36675b134/) for providing us with the medical evaluation dataset.

---
## Acknowledgements
Our code uses preprocessing scripts from [MT-Preparation](https://github.com/ymoslem/MT-Preparation).
Binary file added media/pivot_pipeline_TS.png
