Skip to content

Latest commit

 

History

History
21 lines (20 loc) · 978 Bytes

README.md

File metadata and controls

21 lines (20 loc) · 978 Bytes

Data Preprocessing for Syntax-based Machine Translation

  1. You can get the preprocessed data from this Google Drive (thanks to the original authors).
  2. Then you should change the data format to the same as example.txt by using get_data.sh (check it before use) (the current script assuming the download path is gtos/translator_data/data and the directores to save are gtos/translator_data/cs and gtos/translator_data/de)

The above steps will produce data in the following structure.

cs
├── dev.txt
├── newstest2015-encs-ref.tok.cs
├── newstest2016-encs-ref.tok.cs
├── test.txt
└── train.txt
de
├── dev.txt
├── newstest2015-ende-ref.tok.de
├── newstest2016-ende-ref.tok.de
├── test.txt
└── train.txt

where newstest2015* and newstest2016* store translation references for dev.txt and test.txt respectively (for evaluation).