v1.2
- Added segmentation for all languages except: ben, bod, kat, kur
- Better publication date coverage
- Removed zero-width spaces from segmentation and tokenization output for Thai, Lao, Khmer (zero-width spaces are kept in the original paragraph text)
- Release as described in the camera-ready LREC 2022 paper
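
The zero-width space change for Thai, Lao, Khmer can be illustrated with a minimal sketch. It assumes the zero-width space is U+200B and that tokens are plain Python strings; the helper name `strip_zwsp` is hypothetical and is not taken from the release tooling, whose actual implementation is not described here.

```python
ZWSP = "\u200b"  # zero-width space (assumed to be the character meant above)

def strip_zwsp(tokens):
    """Remove U+200B from each token and drop tokens that become empty."""
    cleaned = (tok.replace(ZWSP, "") for tok in tokens)
    return [tok for tok in cleaned if tok]

# The original paragraph text is left untouched; only the derived output is cleaned.
paragraph = "สวัสดี\u200bครับ"
tokens = ["สวัสดี", "\u200b", "ครับ"]
clean_tokens = strip_zwsp(tokens)
```

Here `paragraph` still contains the zero-width space, while `clean_tokens` does not.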