
Releases: IlyaGusev/tgcontest

Final submission

03 Dec 17:58
e5b5094


fastText language identification: https://fasttext.cc/docs/en/language-identification.html

Pretrained embedding models: fastText with dim=50, bucket=200000, epoch=10, minCount=50
Russian datasets: the first 2 contest archives and Lenta (https://github.com/yutkin/Lenta.Ru-News-Dataset)
English datasets: the first 2 contest archives, BBC (http://mlg.ucd.ie/datasets/bbc.html), News Category (https://www.kaggle.com/rmisra/news-category-dataset), and All The News (https://www.kaggle.com/snapcrack/all-the-news)
No preprocessing
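The embedding hyperparameters above map directly onto the fastText Python bindings. A minimal sketch, assuming the `fasttext` pip package and a hypothetical one-document-per-line `corpus.txt` (the import is deferred so the sketch loads even where the package is absent):

```python
def train_embeddings(corpus_path="corpus.txt", output_path="embeddings.bin"):
    """Train unsupervised fastText embeddings with the hyperparameters
    listed in the release notes. Paths are illustrative placeholders."""
    import fasttext  # deferred: assumes the `fasttext` pip package

    model = fasttext.train_unsupervised(
        corpus_path,
        model="skipgram",
        dim=50,          # embedding dimension
        bucket=200000,   # hash buckets for subword n-grams
        epoch=10,
        minCount=50,     # drop words seen fewer than 50 times
    )
    model.save_model(output_path)
    return model
```

A separate model would be trained per language on the corresponding Russian and English corpora.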

News detection: manual and Yandex.Toloka markup; fastText autotune to a 2M model size with pretrained embeddings

Categories: manual and Yandex.Toloka markup plus the Lenta, BBC, and News Category datasets; fastText autotune to a 4M model size with pretrained embeddings
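Both supervised classifiers can be reproduced with fastText's autotune support; a hedged sketch, assuming the `fasttext` pip package, hypothetical train/validation files in `__label__X` format, and pretrained vectors exported to `.vec` text format (the import is deferred so the sketch loads without the package):

```python
def train_classifier(train_path, valid_path,
                     vectors_path="embeddings.vec", model_size="2M"):
    """Sketch of the autotuned fastText classifiers: model_size="2M"
    for news detection, "4M" for category classification. All paths
    are illustrative placeholders."""
    import fasttext  # deferred: assumes the `fasttext` pip package

    model = fasttext.train_supervised(
        input=train_path,
        autotuneValidationFile=valid_path,  # hyperparameters tuned on this set
        autotuneModelSize=model_size,       # quantized size budget
        pretrainedVectors=vectors_path,     # dim must match the .vec file
    )
    return model
```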

PageRank was computed for every news agency using the links found in article texts.
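The agency graph is small, so plain power iteration suffices. A generic sketch (not the exact contest implementation), where `links` maps each agency to the agencies its articles link to:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over an agency link graph given as
    {source: [target, ...]}. Dangling agencies spread rank uniformly."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new = {node: (1.0 - damping) / n for node in nodes}
        for src in nodes:
            targets = links.get(src, [])
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # dangling node: distribute its rank over all nodes
                for node in nodes:
                    new[node] += damping * rank[src] / n
        rank = new
    return rank
```

Agencies that are linked to more often accumulate higher rank, which later feeds the title-choice and cluster-ranking factors.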

Embeddings for clustering: the element-wise average, max, and min of the fastText token embeddings are concatenated and multiplied by a learned matrix. The matrix training process is described in https://github.com/IlyaGusev/tgcontest/blob/master/scripts/Similarity.ipynb
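The pooling-plus-projection step can be sketched in a few lines of NumPy; the `projection` matrix here is a random stand-in for the one trained in `scripts/Similarity.ipynb`:

```python
import numpy as np

def document_embedding(token_vectors, projection):
    """Concatenate average/max/min pooling of the per-token fastText
    vectors (shape (n_tokens, dim)) and project with a learned matrix
    of shape (3 * dim, out_dim)."""
    tokens = np.asarray(token_vectors)
    pooled = np.concatenate([
        tokens.mean(axis=0),
        tokens.max(axis=0),
        tokens.min(axis=0),
    ])                              # shape (3 * dim,)
    return pooled @ projection      # shape (out_dim,)
```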

Clustering algorithm: SLINK (https://sites.cs.ucsb.edu/~veronika/MAE/summary_SLINK_Sibson72.pdf)

Clustering trick: all data was split into batches by time. Clustering was run per batch, with adjacent batches overlapping, and the per-batch results were merged afterwards, reducing the cost from O(n^2) to O(k * m^2) for k batches of size m.
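A sketch of the batching trick, assuming time-sorted document embeddings and using SciPy's single-linkage clustering as a stand-in for SLINK (both produce the same dendrogram); overlapping documents tie clusters from adjacent batches together via union-find:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def batched_single_linkage(embeddings, batch_size, overlap, threshold):
    """Cluster time-sorted embeddings per overlapping batch with
    single linkage (cosine distance), then merge batch clusterings
    through their shared documents. Returns one component id per doc."""
    assert 0 <= overlap < batch_size
    embeddings = np.asarray(embeddings)
    n = len(embeddings)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    step = batch_size - overlap
    for start in range(0, n, step):
        idx = list(range(start, min(start + batch_size, n)))
        if len(idx) < 2:
            continue
        Z = linkage(embeddings[idx], method="single", metric="cosine")
        labels = fcluster(Z, t=threshold, criterion="distance")
        first = {}  # first document seen for each batch-local cluster
        for doc, label in zip(idx, labels):
            if label in first:
                union(doc, first[label])  # same cluster -> same component
            else:
                first[label] = doc
        if start + batch_size >= n:
            break
    return [find(i) for i in range(n)]
```

Each batch costs O(m^2) instead of clustering all n documents at once, which is where the O(k * m^2) total comes from.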

Title choice: PageRank and time factors

Cluster ranking: time, cluster size, and weighted PageRank factors
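The notes do not give the exact formula, so the following is only a hypothetical combination of the three listed factors (exponential time decay, log-dampened size, averaged agency PageRank); the real weights and functional form may differ:

```python
import math
import time

def cluster_score(doc_times, agency_pageranks, now=None):
    """Illustrative cluster-ranking score: fresher, larger clusters
    from higher-PageRank agencies score higher. All constants here
    (hourly decay, log size) are assumptions, not the contest values."""
    now = time.time() if now is None else now
    freshness = math.exp(-(now - max(doc_times)) / 3600.0)  # decay per hour
    size = math.log1p(len(doc_times))                       # dampened size
    importance = sum(agency_pageranks) / len(agency_pageranks)
    return freshness * size * importance
```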