From d4a4094d64dd7ecd5a37f40ea8051369bc924d23 Mon Sep 17 00:00:00 2001
From: Adriane Boyd
Date: Tue, 4 Sep 2018 09:18:15 +0200
Subject: [PATCH] Update MERLIN URL and add m2_to_parallel step

---
 README.md | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 6fd775b..b0f70ad 100644
--- a/README.md
+++ b/README.md
@@ -49,8 +49,8 @@ files for the experiment.
 
 The original corpora are available at:
 
-- Falko: https://www.linguistik.hu-berlin.de/de/institut/professuren/korpuslinguistik/forschung/falko/zugang
-- MERLIN: https://merlin-platform.eu
+- [Falko](https://www.linguistik.hu-berlin.de/de/institut/professuren/korpuslinguistik/forschung/falko/zugang)
+- [MERLIN](http://hdl.handle.net/20.500.12124/6)
 
 Tables linking the Falko/MERLIN sentence pairs to their text IDs from
 the original corpora are in `data/source/`. For both corpora, the `ctok`
@@ -187,3 +187,10 @@ data:
 ```
 python filter_m2.py -filt wiki-unfiltered.m2 -ref fm-train.m2 -out wiki-filtered.m2
 ```
+
+Convert the filtered wiki m2 back to a plaintext file of parallel
+sentences:
+
+```
+python m2_to_parallel.py -m2 wiki-filtered.m2 -out wiki-filtered.src-trg.txt
+```