- You can get the preprocessed data from this Google Drive (thanks to the original authors).
- Then convert the data to the same format as example.txt by using get_data.sh (check the script before use). The current script assumes that the download path is gtos/translator_data/data and that the output directories are gtos/translator_data/cs and gtos/translator_data/de.
The above steps will produce data in the following structure.
cs
├── dev.txt
├── newstest2015-encs-ref.tok.cs
├── newstest2016-encs-ref.tok.cs
├── test.txt
└── train.txt
de
├── dev.txt
├── newstest2015-ende-ref.tok.de
├── newstest2016-ende-ref.tok.de
├── test.txt
└── train.txt
where newstest2015* and newstest2016* store translation references for dev.txt and test.txt respectively (for evaluation).
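
To confirm that the conversion produced the layout listed above, a small check like the following may help. It is only a sketch: the helper name `missing_files` is hypothetical (it is not part of this repository), and the root directory `gtos/translator_data` is assumed from the paths mentioned in the steps above.

```python
from pathlib import Path

# Expected files per language directory, copied from the listing above.
EXPECTED = {
    "cs": [
        "dev.txt",
        "newstest2015-encs-ref.tok.cs",
        "newstest2016-encs-ref.tok.cs",
        "test.txt",
        "train.txt",
    ],
    "de": [
        "dev.txt",
        "newstest2015-ende-ref.tok.de",
        "newstest2016-ende-ref.tok.de",
        "test.txt",
        "train.txt",
    ],
}


def missing_files(root):
    """Return the expected files that are absent under `root`, as relative paths."""
    root = Path(root)
    return [
        f"{lang}/{name}"
        for lang, names in EXPECTED.items()
        for name in names
        if not (root / lang / name).is_file()
    ]


if __name__ == "__main__":
    # Assumed root; adjust it if get_data.sh was edited to write elsewhere.
    for path in missing_files("gtos/translator_data"):
        print("missing:", path)
```

If the script prints nothing, all ten files are in place. It only checks that the files exist, not that their contents match the format of example.txt.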