TweetTaglish Dataset

Megan Herrera, Ankit Aich, Natalie Parde
Department of Computer Science, University of Illinois at Chicago
{mherre42, aaich2, parde}@uic.edu

Download our dataset from here directly. Refer to the instructions below for downloading the data.

If you use the data or benefit from the paper, please cite:

@inproceedings{herrera_aich_parde, title={TweetTaglish: A Dataset for Investigating Tagalog-English Code-Switching}, booktitle={Language Resources and Evaluation. LREC 2022}, author={Herrera, Megan and Aich, Ankit and Parde, Natalie}, year={2022} }

A large (20k+ instances) Tagalog-English code-switching dataset, harvested from Twitter.

tweets_split_id.csv - Contains tweet IDs and their Tagalog/English/Other split as (Tagalog, English, Other) tuples
embeddings.csv - Contains tweet embeddings and Tagalog/English/Other splits
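
As a rough illustration of how tweets_split_id.csv might be read, the sketch below parses each row's (Tagalog, English, Other) tuple and flags tweets that contain both Tagalog and English. The column names `tweet_id` and `split`, the textual tuple format `"(3, 5, 1)"`, and the sample rows are all assumptions made for this example and are not taken from the dataset; check the real file's header and adjust accordingly.

```python
import ast
import csv
import io
from typing import Iterator, NamedTuple


class TweetSplit(NamedTuple):
    tweet_id: str
    tagalog: float
    english: float
    other: float


def read_splits(handle, id_col="tweet_id", split_col="split") -> Iterator[TweetSplit]:
    """Yield one TweetSplit per CSV row.

    ASSUMPTION: the column names `tweet_id` and `split` are hypothetical;
    pass the names that actually appear in the file's header.
    """
    for row in csv.DictReader(handle):
        # ASSUMPTION: the split is stored as text such as "(3, 5, 1)".
        tagalog, english, other = ast.literal_eval(row[split_col])
        yield TweetSplit(row[id_col], float(tagalog), float(english), float(other))


def is_code_switched(item: TweetSplit) -> bool:
    """True when a tweet has both Tagalog and English content."""
    return item.tagalog > 0 and item.english > 0


# Tiny in-memory stand-in for tweets_split_id.csv (made-up IDs and values).
sample = io.StringIO('tweet_id,split\n111,"(3, 5, 1)"\n222,"(0, 7, 0)"\n')
splits = list(read_splits(sample))
mixed = [s.tweet_id for s in splits if is_code_switched(s)]
```

To run this against the real file, replace `sample` with `open("tweets_split_id.csv", newline="", encoding="utf-8")` and supply the actual column names.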