The following table shows the data size, number of lines, and description for each data source we used in transformer-based Thai language model pre-training.
Dataset name | Data size | Number of lines | Description |
---|---|---|---|
wisesight-large | 51.44GB | 314M | a large dataset of social media posts provided by the social listening platform Wisesight for this study. The dataset contains posts from Twitter, Facebook, Pantip, Instagram, YouTube, and other websites sampled from 2019. |
pantip-large | 22.35GB | 95M | a collection of posts and answers from Thailand's largest online bulletin board Pantip.com from 2015 to 2019, provided by the audience analytics platform Chaos Theory. |
Thairath-222k | 1.48GB | 5M | a collection of articles published on newspaper website Thairath.com up to December 2019. (GitHub) |
prachathai-67k | 903.1MB | 2.7M | a collection of articles published on newspaper website Prachathai.com from August 24, 2004 to November 15, 2018. (GitHub) |
Thai Wikipedia | 515MB | 843k | the Wikipedia articles extracted using Giuseppe Attardi’s WikiExtractor in September 2020. All HTML tags, bullet points, and tables are removed. (GitHub) |
OpenSubtitles | 468.8MB | 5M | a collection of movie subtitles translated by crowdsourcing from OpenSubtitles.org [Lison and Tiedemann, 2016]. We use only the portions containing Thai texts. |
ThaiPBS-111k | 372.3MB | 858k | a collection of articles published on newspaper website ThaiPBS.or.th up to December 2019. (GitHub) |
Thai National Corpus (TNC) | 366MB | 797k | a 14-million-word corpus of Thai texts containing 75% non-fiction and 25% fiction works. Media source breakdown is 60% books, 25% magazines, and the rest from other publications and writings. Most of the texts are curated from 1998 to 2007 [Aroonmanakun et al., 2009]. |
scb-mt-en-th-2020 | 290.4MB | 947k | a parallel corpus of English-Thai sentence pairs curated from news, Wikipedia articles, SMS messages, task-based dialogs, web-crawled data, government documents, and machine-generated text [Lowphansirikul et al., 2021]. (GitHub) |
JW300 | 182.8MB | 727k | a parallel corpus of religious texts from jw.org that includes Thai texts. |
wongnai-corpus | 64MB | 101k | a collection of restaurant reviews and ratings (1 to 5 stars) published on Wongnai.com. (GitHub) |
QED | 42MB | 407k | a collection of transcripts for educational videos and lectures collaboratively created on the AMARA web-based platform [Abdelali et al., 2014]. |
bibleuedin | 2.18MB | 62k | a multilingual corpus of the Bible created by Christos Christodoulopoulos and Mark Steedman. |
wisesight-sentiment | 5.3MB | 22k | a collection of Twitter posts about consumer products and services from 2016 to early 2019, labeled as positive, negative, neutral, or question. (GitHub) |
tanzil | 2.4MB | 6k | a collection of Quran translations compiled by the Tanzil project [Tiedemann, 2012]. |
tatoeba | 1MB | 2k | a collection of translated sentences from the crowdsourced multilingual dataset Tatoeba [Tiedemann, 2012]. |
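
To get a feel for how the corpus is distributed across sources, the sizes in the table can be tallied with a few lines of Python. This is only an illustrative sketch, not part of the original pipeline: the sizes are copied from the table above, the conversion of 1 GB = 1000 MB is an assumption (the table does not say whether decimal or binary units are used), and the names `SIZES`, `to_mb`, `total_mb`, and `shares` are hypothetical.

```python
# Data sizes copied from the table above (dataset name -> size string).
SIZES = {
    "wisesight-large": "51.44GB",
    "pantip-large": "22.35GB",
    "Thairath-222k": "1.48GB",
    "prachathai-67k": "903.1MB",
    "Thai Wikipedia": "515MB",
    "OpenSubtitles": "468.8MB",
    "ThaiPBS-111k": "372.3MB",
    "Thai National Corpus (TNC)": "366MB",
    "scb-mt-en-th-2020": "290.4MB",
    "JW300": "182.8MB",
    "wongnai-corpus": "64MB",
    "QED": "42MB",
    "bibleuedin": "2.18MB",
    "wisesight-sentiment": "5.3MB",
    "tanzil": "2.4MB",
    "tatoeba": "1MB",
}

# Assumption: decimal units (1 GB = 1000 MB); the table does not specify this.
UNIT_TO_MB = {"GB": 1000.0, "MB": 1.0}


def to_mb(size: str) -> float:
    """Convert a size string such as '903.1MB' or '51.44GB' to megabytes."""
    for unit, factor in UNIT_TO_MB.items():
        if size.endswith(unit):
            return float(size[: -len(unit)]) * factor
    raise ValueError(f"unrecognized size string: {size!r}")


# Total corpus size and each dataset's share of it.
total_mb = sum(to_mb(size) for size in SIZES.values())
shares = {name: to_mb(size) / total_mb for name, size in SIZES.items()}

print(f"total: {total_mb / 1000:.2f} GB")
for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<28} {share:6.2%}")
```

Under these assumptions, the two social media and forum datasets (wisesight-large and pantip-large) account for the large majority of the raw text by size.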