DEplain: A German Parallel Corpus with Intralingual Translations into Plain Language for Sentence and Document Simplification
To advance sentence simplification and document simplification in German, we present DEplain, a new dataset of parallel, professionally written and manually aligned simplifications in plain German ("plain DE" or in German: "Einfache Sprache").
More details can be found in our paper: Stodden, Momen, Kallmeyer (2023). "DEplain: A German Parallel Corpus with Intralingual Translations into Plain Language for Sentence and Document Simplification." In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, Canada. Association for Computational Linguistics.
Overall, our paper contains the following contributions. A more detailed description and the resources per contribution can be found in the corresponding/linked subdirectories:
- A web harvester to download and harvest parallel documents with standard German and plain German,
- Two document simplification datasets,
- Sentence-wise alignments (manual, using TS-ANNO, and automatic, using several alignment algorithms),
- A simplification plan per document based on the manual sentence-wise alignments,
- Four sentence simplification datasets,
- Some human annotations of the manually aligned sentence pairs,
- Automatic text simplification models for document simplification and sentence simplification.
The following figure shows how the contributions made in our paper are connected. The document-level corpora (B) and the sentence-level corpora (E) are used for training and evaluating the automatic text simplification models (F).
Metadata of the resulting subcorpora are shown in the table below:
| No. | Name | License | # Doc. Pairs (train/dev/test) | # Original Sents. | # Simple Sents. | Alignment | # Sent. Pairs (train/dev/test) | Corpus Name Doc. | Corpus Name Sent. |
|---|---|---|---|---|---|---|---|---|---|
| 1 | DEplain-apa | upon request | 483 (387/48/48) | 25,607 | 26,471 | manual | 13,122 (10,660/1,231/1,231) | DEplain-APA-doc | DEplain-APA-sent |
| 2 | DEplain-web | open | 147 (-/-/147) | 6,138 | 6,402 | manual | 1,846 (-/-/1,846) | DEplain-web-doc-manual-open | DEplain-web-sent-manual-open |
| 3 | DEplain-web | open | 249 (199/50/-) | 7,087 | 7,760 | auto | 652 (514/138/-) | DEplain-web-doc-auto-open | DEplain-web-sent-auto-open |
| 4 | DEplain-web | closed | 360 (288/72/-) | 12,847 | 18,068 | auto | 942 (767/175/-) | DEplain-web-doc-auto-closed | DEplain-web-sent-auto-closed |
| | In total | mixed | 1,239 (874/170/195) | 51,681 | 58,701 | mixed | 16,562 (11,941/1,544/3,077) | | |
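As a quick sanity check when working with the splits, the per-subcorpus train/dev/test sizes can be summed and compared with the "In total" row. The snippet below is only a sketch: the numbers are copied by hand from the table, the dictionary keys and the helper name `column_totals` are illustrative, and a "-" in the table is entered as 0.

```python
# Split sizes (train, dev, test) copied from the table; "-" is entered as 0.
# The dictionary keys are illustrative labels, not official corpus identifiers.
doc_pairs = {
    "DEplain-APA (manual)": (387, 48, 48),
    "DEplain-web (manual, open)": (0, 0, 147),
    "DEplain-web (auto, open)": (199, 50, 0),
    "DEplain-web (auto, closed)": (288, 72, 0),
}
sent_pairs = {
    "DEplain-APA (manual)": (10660, 1231, 1231),
    "DEplain-web (manual, open)": (0, 0, 1846),
    "DEplain-web (auto, open)": (514, 138, 0),
    "DEplain-web (auto, closed)": (767, 175, 0),
}


def column_totals(splits):
    """Sum the train, dev, and test sizes across all subcorpora."""
    return tuple(sum(column) for column in zip(*splits.values()))


doc_totals = column_totals(doc_pairs)
sent_totals = column_totals(sent_pairs)

print("document pairs:", doc_totals, "overall:", sum(doc_totals))
print("sentence pairs:", sent_totals, "overall:", sum(sent_totals))
```

Both results can be compared with the last row of the table (1,239 document pairs and 16,562 sentence pairs overall).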
Please check ./B__Document-level_Corpus for information on how to access our document simplification corpora (DEplain-APA-doc and DEplain-web-doc). For DEplain-APA, please request access via the DEplain-APA Zenodo repository. The documents of DEplain-web with open licenses are provided here; the documents with closed licenses can be downloaded using the web crawler.
Please check ./E__Sentence-level_Corpus for information on how to access our sentence simplification corpora (DEplain-APA-sent and DEplain-web-sent). For DEplain-APA, please request access via the DEplain-APA Zenodo repository. The manually aligned sentence pairs of DEplain-web and the automatically aligned sentence pairs with an open license can be downloaded directly from the repository. If you have downloaded the documents of DEplain-web with a closed license, you can align them automatically using one of the provided alignment algorithms.
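Once the sentence pairs are available locally, they can be read into (original, simplification) tuples for training or evaluation. The following is a minimal sketch under stated assumptions: it assumes a CSV export with the columns `original` and `simplification`, which may not match the actual file layout (see ./E__Sentence-level_Corpus for the real format). The helper `read_pairs` is hypothetical, and the two example sentences are invented for illustration and are not taken from the corpus.

```python
import csv
import io

# Invented example rows; the column names are an assumption, not the
# documented layout of the DEplain files.
SAMPLE = """original,simplification
"Der Ausschuss vertagte die Entscheidung auf unbestimmte Zeit.","Der Ausschuss hat noch nicht entschieden."
"Aufgrund des Unwetters fiel der Unterricht aus.","Es gab ein Unwetter. Deshalb fiel die Schule aus."
"""


def read_pairs(handle, src_col="original", tgt_col="simplification"):
    """Return (original, simplification) tuples, skipping incomplete rows."""
    pairs = []
    for row in csv.DictReader(handle):
        source = (row.get(src_col) or "").strip()
        target = (row.get(tgt_col) or "").strip()
        if source and target:
            pairs.append((source, target))
    return pairs


pairs = read_pairs(io.StringIO(SAMPLE))
for source, target in pairs:
    print(source, "->", target)
```

For a real file, the `io.StringIO(SAMPLE)` argument would be replaced by an opened file handle, for example `open(path, newline="", encoding="utf-8")`, with the column names adjusted to the actual header.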
For reproduction of our experiments regarding automatic sentence-wise alignment, please see ./C__Alignment Algorithms.
For reproduction of our experiments regarding automatic document simplification and sentence simplification, please see ./G__Automatic_Text_Simplification_Experiments.
Different parts of this work are released under different licenses. Please see the corresponding subdirectory for the license that applies to each contribution.
If you use part of this work, please cite our paper:
@inproceedings{stodden-etal-2023-deplain,
title = "{DE}plain: A {G}erman Parallel Corpus with Intralingual Translations into Plain Language for Sentence and Document Simplification",
author = "Stodden, Regina and Momen, Omar and Kallmeyer, Laura",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-long.908",
doi = "10.18653/v1/2023.acl-long.908",
pages = "16441--16463",
}
Feel free to contact Regina Stodden if you have any comments or problems with the provided materials.