Releases: bltlab/mot
V1.10
V1.9
- Scrape up to April 1, 2024
- Better filtering out of <!-- IMAGE --> and variants
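
As an illustration of the placeholder filtering noted above, here is a minimal sketch in Python. The release notes do not list the actual variants or show the project's implementation, so the pattern (case and whitespace differences only) and the function name `strip_image_placeholders` are assumptions.

```python
import re

# Hypothetical pattern: assumes the "variants" differ only in letter case
# and in whitespace inside the comment markers.
IMAGE_PLACEHOLDER = re.compile(r"<!--\s*IMAGE\s*-->", re.IGNORECASE)


def strip_image_placeholders(paragraphs):
    """Remove placeholder comments and drop paragraphs left empty."""
    cleaned = []
    for paragraph in paragraphs:
        text = IMAGE_PLACEHOLDER.sub("", paragraph).strip()
        if text:
            cleaned.append(text)
    return cleaned
```

A paragraph consisting only of the placeholder is dropped, while a placeholder embedded in real text is removed and the surrounding text is kept.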
V1.8
- Added scraping from April 2023 to November 15, 2023
1.7
- Additional data scraped from October 2022 to end of April 2023
v1.6
- Fix issue in French and Spanish sentence segmentation relating to candidate sites.
- Add tokenization for remaining languages
  - bod is tokenized with botok https://github.com/OpenPecha/Botok/tree/docs
  - Remaining languages are tokenized with utoken https://github.com/uhermjakob/utoken
v1.5
- Added segmentation for remaining languages
- Improvements to some of the existing segmentation models
  - Cases of both under-segmentation and over-segmentation were found and addressed
v1.4
- Updated scrape through July 1, 2022
- Fix missing yue documents
- Change yue to cmn and voacambodia from khm to eng
- Improved author extraction from metadata
- Improved extraction of paragraph splits
v1.3
Release 1.3 with updated scrapes through the end of May 2022.
v1.2
- Added segmentation for all languages except: ben, bod, kat, kur
- Better publication date coverage
- Remove zero-width space in segmentation and tokenization output for Thai, Lao, Khmer (zero-width space is kept in the original text in paragraphs)
- Release as described in camera-ready LREC 2022 paper
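
The zero-width space handling described above can be sketched as follows. This is an illustrative Python example, not the project's code; the helper name `clean_tokens` and the sample strings are assumptions. The character in question is U+200B.

```python
ZERO_WIDTH_SPACE = "\u200b"  # U+200B ZERO WIDTH SPACE


def clean_tokens(tokens):
    """Strip zero-width spaces from tokens and drop tokens left empty."""
    cleaned = (token.replace(ZERO_WIDTH_SPACE, "") for token in tokens)
    return [token for token in cleaned if token]


# The original paragraph text is left untouched; only the derived
# segmentation and tokenization output is cleaned.
paragraph = "ab\u200bcd"
tokens = clean_tokens(["ab", "\u200b", "cd\u200b"])
```

Cleaning only the derived output keeps the paragraph text faithful to the source while preventing invisible characters from appearing as tokens.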
v1.1
- Additional scraping from January 2022 to March 1, 2022.
- Fix for Cantonese segmentation
- Add segmentation for Portuguese and Urdu
- Added source code